Talk:Mysterious power failure takes down Wikipedia, Wikinews
I'm curious about the term "mysterious" in the story title. It's clear from the story content that the event had aspects (two separate power sources going down simultaneously) that suggest something other than simple accident. I'd like to see some comment from the people who manage the servers, and/or the people who manage the colo facility, that addresses those issues. Specifically, what kind of investigation is underway regarding the causes of the outage; what is the most likely cause? Surely this kind of questioning is within the scope of an unbiased journalist's responsibility. How can Wikinews, as an institution, facilitate such investigative journalism?
Rblumberg 12:38, 24 Feb 2005 (UTC)
- I chose to use the word 'mysterious' in the article as (afaik) no-one knows why both circuit breakers tripped. That's a kinda "it can't happen" thing! However I seriously doubt 'foul play'; colos are very secure facilities. Human error in setting up the power supply system is a far more likely candidate - we just don't know exactly why what happened, happened. Dan100 (Talk) 21:21, 24 Feb 2005 (UTC)
Refactor comments
i'm disturbed by the amount of dependence on/references to GOOG, a very for-profit thing. If oracle software is not used for that purpose, can't anyone find a good search engine that adheres to the same principles this entire project is founded on?
another idea i'm having...has anyone noticed that the shutdown occurred just as people were starting to have interest in exchanging ideas about Hunter S Thompson???
http://www.bilderberg.org/bodies.htm
let's see what happens now
- Link spamming is not accepted on Wikinews. This comment, unrelated to the article it is attached to and apparently written solely to include a POV and the link, is a potential indicator of vandalism. - Amgine 16:53, 24 Feb 2005 (UTC)
excuuuuse me... i am NOT a vandal. but i can't help having a POV. if my input to this site is not desired by any powers-that-be, i'll take it to a more anarchist forum. i have tried to regulate myself here, and use discussion pages for the potentially POV inputs of mine. i have no desire to have people click that link, other than to encourage more people to see a pattern supportive of the way i see things. the "POV" about the shutdown being orchestrated by the government? that was in the story this page is supposed to discuss! I was not the one to put it there either. so, can someone clear up this obvious obscurity? am i relegated to the wikipedia page on conspiracy, until i start agreeing more with the mainstream? i felt i was contributing to the discussion initiated above, how can wikinewsians become investigative journalists... where did i go wrong, more specifically, please, than, "You are now a vandal. Go away"
- In what way are Google, Oracle, for-profit software, or Hunter S. Thompson related to the power outage in Florida? Provide specific and verifiable evidence of the relationship, or the comments will be regarded as "speculative" and are not likely to be welcomed. Making a spurious connection which is unrelated to the article on the article's discussion page, and adding a further unrelated link, is a common trait of "link spamming", which is an attempt to increase a website's ranking in search engine algorithms. It is very unwelcome, and is considered a banning offense. - Amgine 00:30, 25 Feb 2005 (UTC)
- Let me be perfectly clear why I edited this page at all in the first place. When I first read this story, there was a line, which I did not know was about to be removed, blaming the DHS for the failure. Then there was also a comment saying that people have said, if Oracle software had been used, this would not have happened, but Oracle is a for-profit company, and Wikiwhatever is committed to free things. There was also a comment saying, use Google to search archived Wikipedia when it is down. So all I did was put those things together and come up with what I thought were relevant comments. One was about whether there is a search engine that is less contradictory with the values here, to "promote" when it is necessary to "promote" one. The other was just an observation I did not know if anyone else had made, which was inspired by what turned out to be considered a vandal's insertion into the story. Is that specific and verifiable? The only reason for the link was as an experiment to see if the government (of the country, not the wiki) would do something to prevent people from making the connections that come so readily to my (unique?) mind. - HumanityAgent 68.8.80.whatever it is
- In statistics, a correlation may be discovered which has no actual relationship to the subject being examined; an example might be data recorded from a REG by people with green eyes showing a statistical difference when compared to data recorded by people with brown eyes. More commonly this would be called "coincidence". It isn't "fact", because all that can be said is that it happened, once. It also isn't relevant, unless a further study shows it is repeatable and has a causal relationship. Bringing up an idea without giving a basis or support for the idea is not considered helpful, unless you are doing the research to find the basis or support.
- Just for your information, a non-commercial internet search system might be DMOZ.org, which is also used by Google. However, Google maintains a cache of Wikipedia, DMOZ does not, which I expect is why Google was suggested in the article. - Amgine 17:41, 25 Feb 2005 (UTC)
Errors
The following statements in the story are inaccurate or misleading in some way:
- The main database and the four slaves had all ruined the data on their hard disk drives.
- They didn't ruin it, in the sense that word normally carries of having actively done something to damage it; the damage was a possible normal consequence of a failure. Nor was all of it ruined: it was a small number of pieces with inconsistent information in recently updated areas, sufficient to render the normal automatic correction unable to correct it.
- which had been out of active use for 31 hours.
- It was in active use at the time of the failure, applying updates at the maximum rate its hard drives are capable of. This system does not normally get requests from end users because it's used for backup and reporting. That works in two steps: the reporting runs with replication turned off, then replication is turned on again and it starts applying all of the updates which happened while it was generating the reports, and it continues applying new updates until the next time the reports are triggered (a sketch of this cycle follows this list). It was behind by more than a day because of the very high update rate last week, which had caused it to accumulate a significant backlog which it hadn't yet cleared. See the log at the end of this comment for more detail.
- as the last database download was made on February 9, all edits since then would have been lost if all the servers had been in active use.
- If it had been necessary, we'd have asked MySQL for assistance in recovering all but some of the records in the individual inconsistent 16k chunks. To that we would have added the logged updates from the master database server and/or the 31 hours' worth on the machine which was fine. Since the recently changed database chunks were the ones likely to have been damaged, this would probably have resulted in a more painful and slower recovery process, but near-complete recovery.
- It is understood that a failure in the self-protection systems of the servers led to the database corruption.
- The understanding is incorrect. It was a failure of one or more of the operating system, the disk controllers and/or the hard drives to flush the data (write it to the disk surface and not say it's written until it really is). There was enough of this to overcome the method the database server uses to recover from most faults. (The durability sketch after this list shows where the database's end of that chain is configured.)
- If the power to a database server is cut mid-write, the database will be corrupted and unreadable without many hours of repair work.
- MAY be corrupted. It's not certain. It's supposed to be fine - the operating system, disk controllers and drives are supposed to have written the data. In a previous incident last year, after a circuit was overloaded and power was cut, the automatic recovery was able to handle it.
- However a failure - possibly software or hardware - meant that the WAL system on all but one computer was reporting that it had finished logging edits, when in actual fact it had not. This led to all data on each server being corrupted.
- The WAL system does not report this. It tells the operating system to tell the controllers to tell the drives to write and not say they have written until the write is saved on the hard drive disk surface.
- This led to all data on each server being corrupted.
- See above. It was one or more 16k chunks in the most recently updated areas, not all of the data.
- Many databases other than MySQL do not corrupt their data
- The database didn't corrupt its data. The operating system, disk controllers, disk drives or some combination of them apparently did, by not writing what they had said they had written. It would have been nice if the database had been able to work around the other parts of the system failing in their jobs, of course. Some other databases are able to do that. One notable one, with advocates who tend to seek opportunities to criticise MySQL, is PostgreSQL. It's a feature which isn't yet in MySQL, so it's a handy criticism to use. On the other hand, PostgreSQL doesn't come with replication and fulltext search, while MySQL does, and the MediaWiki software used by Wikipedia uses both of those features.
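To make the stop-replication/report/restart cycle described above concrete, here is a minimal sketch in MySQL statements. It is illustrative only, not the actual job Wikimedia runs: the report query is a placeholder and the table name cur is just an example.

-- Minimal sketch of the report cycle described above (hypothetical report query).
STOP SLAVE;                -- pause replication; the log below shows the slave threads being killed when this happens
SELECT COUNT(*) FROM cur;  -- placeholder for the real reporting/backup work
START SLAVE;               -- resume replication; the slave then works through the accumulated backlog
SHOW SLAVE STATUS;         -- the relay log file and position show how far catch-up has progressed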
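On the flush failure described above, the database's own end of that chain is a configuration setting; the lines below are a hedged illustration, not a statement about how the Wikimedia servers were configured. With innodb_flush_log_at_trx_commit set to 1, InnoDB asks for its log to be flushed to disk at each commit, but the operating system, the controller, or a drive with a write-back cache can still acknowledge a write that never reached the platter, which is the failure mode being described.

-- Check how aggressively InnoDB asks for its log to be flushed (1 = flush at every commit).
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- Even at the strictest setting, the OS, controller and drive still have to honour the flush;
-- if any of them acknowledges data it has not written, recovery can find inconsistent pages.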
Here's the relevant portion of the error log of khaldun, the system we recovered from, with comments:
Comment: next lines are the normal stopping of replication at the start of a report job on the 18th (all times are UTC):
050218 0:00:02 Slave I/O thread killed while reading event
050218 0:00:02 Slave I/O thread exiting, read up to log 'ariel-bin.211', position 233347129
050218 0:00:03 Slave SQL thread exiting, replication stopped in log 'ariel-bin.200' at position 76297910
050218 0:36:13 /usr/local/mysql/libexec/mysqld: Sort aborted
Comment: next lines are the restarting of replication at the end of the report job. Took 36 minutes because I aborted it to give this computer more time to catch up on the backlog of updates the compression work had caused:
050218 0:36:20 Slave SQL thread initialized, starting replication in log 'ariel-bin.200' at position 76297910, relay log './khaldun-relay-bin.006' position: 329841899
050218 0:36:20 Slave I/O thread: connected to master 'repl@iariel:3306', replication started in log 'ariel-bin.211' at position 233347129
Comment: next lines are the normal stopping of replication at the start of a report job on the 19th
050219 0:00:02 Slave I/O thread killed while reading event
050219 0:00:02 Slave I/O thread exiting, read up to log 'ariel-bin.221', position 181275051
050219 0:00:02 Slave SQL thread exiting, replication stopped in log 'ariel-bin.209' at position 7548144
Comment: next lines are the normal restarting of replication at the end of the report job - I didn't kill the report job early this time so it ran for its normal duration:
050219 9:49:11 Slave I/O thread: connected to master 'repl@iariel:3306', replication started in log 'ariel-bin.221' at position 181275051
050219 9:49:11 Slave SQL thread initialized, starting replication in log 'ariel-bin.209' at position 7548144, relay log './khaldun-relay-bin.008' position: 530113822
Comment: next line is replication reporting that it was unable to connect to the master database server to get more data, cause not known:
050219 21:54:10 Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
050219 21:54:10 Slave I/O thread: Failed reading log event, reconnecting to retry, log 'ariel-bin.227' position 158061601
050219 21:54:40 Slave I/O thread: error reconnecting to master 'repl@iariel:3306': Error: 'Can't connect to MySQL server on 'iariel' (4)' errno: 2003 retry-time: 60 retries: 86400
Comment: next lines are restarting of replication:
050219 22:00:11 Slave: connected to master 'repl@iariel:3306',replication resumed in log 'ariel-bin.227' at position 158061601
050219 22:01:22 Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
050219 22:01:22 Slave I/O thread: Failed reading log event, reconnecting to retry, log 'ariel-bin.228' position 218656
Comment: next lines are restarting of replication:
050219 22:01:38 Slave: connected to master 'repl@iariel:3306',replication resumed in log 'ariel-bin.228' at position 218656
Comment: next lines are the normal stopping of replication at the start of a report job on the 20th:
050220 0:00:02 Slave I/O thread killed while reading event
050220 0:00:02 Slave I/O thread exiting, read up to log 'ariel-bin.229', position 185010751
050220 0:00:02 Slave SQL thread exiting, replication stopped in log 'ariel-bin.216' at position 33534449
Comment: next lines are the normal restarting of replication at the end of the report job:
050220 7:22:38 Slave SQL thread initialized, starting replication in log 'ariel-bin.216' at position 33534449, relay log './khaldun-relay-bin.010' position: 288706407
050220 7:22:38 Slave I/O thread: connected to master 'repl@iariel:3306', replication started in log 'ariel-bin.229' at position 185010751
Comment: next lines are the normal stopping of replication at the start of a report job on the 21st:
050221 0:00:03 Slave I/O thread killed while reading event
050221 0:00:03 Slave I/O thread exiting, read up to log 'ariel-bin.236', position 64857627
050221 0:00:03 Slave SQL thread exiting, replication stopped in log 'ariel-bin.226' at position 112974740
Comment: next lines are the normal restarting of replication at the end of the report job, then assorted transient errors and restarts of replication:
050221 9:53:30 Slave SQL thread initialized, starting replication in log 'ariel-bin.226' at position 112974740, relay log './khaldun-relay-bin.012' position: 905494861
050221 9:53:30 Slave I/O thread: connected to master 'repl@iariel:3306', replication started in log 'ariel-bin.236' at position 64857627
050221 10:43:42 Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
050221 10:43:42 Slave I/O thread: Failed reading log event, reconnecting to retry, log 'ariel-bin.238' position 252645704
050221 10:43:42 Slave I/O thread: error reconnecting to master 'repl@iariel:3306': Error: 'Lost connection to MySQL server during query' errno: 2013 retry-time: 60 retries: 86400
050221 10:45:12 Slave I/O thread: error reconnecting to master 'repl@iariel:3306': Error: 'Can't connect to MySQL server on 'iariel' (4)' errno: 2003 retry-time: 60 retries: 86400
050221 10:49:12 Slave: connected to master 'repl@iariel:3306',replication resumed in log 'ariel-bin.238' at position 252645704
Comment: previous line shows successful replication restart after an error. At this point replication is running. Note the position it was last started at: relay log './khaldun-relay-bin.012' position: 905494861
Comment: next line is the first sign that the server has had a power problem, the database server logging that it is restarting:
050222 02:40:39 mysqld started
050222 2:40:42 InnoDB: Database was not shut down normally.
InnoDB: Starting recovery from log files...
InnoDB: Starting log scan based on checkpoint at
InnoDB: log sequence number 301 2308281842
InnoDB: Doing recovery: scanned up to log sequence number 301 2313524224
InnoDB: Doing recovery: scanned up to log sequence number 301 2318767104
InnoDB: Doing recovery: scanned up to log sequence number 301 2324009984
InnoDB: Doing recovery: scanned up to log sequence number 301 2329252864
InnoDB: Doing recovery: scanned up to log sequence number 301 2334495744
InnoDB: Doing recovery: scanned up to log sequence number 301 2339738624
InnoDB: Doing recovery: scanned up to log sequence number 301 2344981504
InnoDB: Doing recovery: scanned up to log sequence number 301 2350224384
InnoDB: Doing recovery: scanned up to log sequence number 301 2355467264
InnoDB: Doing recovery: scanned up to log sequence number 301 2360710144
InnoDB: Doing recovery: scanned up to log sequence number 301 2365953024
InnoDB: Doing recovery: scanned up to log sequence number 301 2371195904
InnoDB: Doing recovery: scanned up to log sequence number 301 2376438784
InnoDB: Doing recovery: scanned up to log sequence number 301 2381681664
InnoDB: Doing recovery: scanned up to log sequence number 301 2386924544
InnoDB: Doing recovery: scanned up to log sequence number 301 2392167424
InnoDB: Doing recovery: scanned up to log sequence number 301 2397410304
InnoDB: Doing recovery: scanned up to log sequence number 301 2401056520
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 1 row operations to undo
InnoDB: Trx id counter is 0 2567935744
050222 2:40:45 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Starting rollback of uncommitted transactions
InnoDB: Rolling back trx with id 0 2567935438, 1 rows to undo
InnoDB: Rolling back of trx id 0 2567935438 completed
InnoDB: Rollback of uncommitted transactions completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 13343080, file name ariel-bin.235
InnoDB: Last MySQL binlog file position 0 644188245, file name ./ariel-bin.332
050222 2:50:04 InnoDB: Flushing modified pages from the buffer pool...
050222 2:51:30 InnoDB: Started
/usr/local/mysql/libexec/mysqld: ready for connections.
Version: '4.0.22-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution
Comment: next line shows the SQL part (the part which applies updates) starting. Observe that the last logged point was at './khaldun-relay-bin.012' position: 905494861 and it's now at './khaldun-relay-bin.014' position: 695946861. That is, it had been applying updates to get from the lower numbered position to the position it started at and was apparently doing so when the power failed.
050222 2:51:31 Slave SQL thread initialized, starting replication in log 'ariel-bin.235' at position 13282058, relay log './khaldun-relay-bin.014' position: 695946861
050222 2:51:31 Slave I/O thread: error connecting to master 'repl@iariel:3306': Error: 'Lost connection to MySQL server during query' errno: 2013 retry-time: 60 retries: 86400
Jamesday 03:19, 25 Feb 2005 (UTC)
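As a hedged aside on reading the positions quoted in the log above: on a running slave the same coordinates can be inspected with SHOW SLAVE STATUS. The column names below are approximately those printed by MySQL 4.0-era servers, and the values are simply the ones from the restart line above, not fresh output.

-- Inspect the replication coordinates the error log reports after the restart.
SHOW SLAVE STATUS;
-- Relevant columns (values taken from the restart line above):
--   Relay_Master_Log_File: ariel-bin.235          -- master binlog the SQL thread is executing
--   Exec_Master_Log_Pos:   13282058               -- position reached in that binlog
--   Relay_Log_File:        khaldun-relay-bin.014  -- local relay log being applied
--   Relay_Log_Pos:         695946861              -- position reached in the relay log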
Punctuation errors
[edit]Change "1.5 - 2 hours" to "1.5–2 hours" and "Regular back-ups of the database of Wikipedia projects are maintained - the encyclopedia in its entirety was not at risk." to "Regular back-ups of the database of Wikipedia projects are maintained – the encyclopedia in its entirety was not at risk.". (These are not appropriate uses of hyphens.) TTWIDEE (talk) 20:08, 25 October 2024 (UTC)
- Wow. Somebody needs to switch to de-caf.--Bddpaux (talk) 15:35, 6 December 2024 (UTC)
- But: corrected anyway.--Bddpaux (talk) 15:37, 6 December 2024 (UTC)