As we said in the military:
I'd sure be looking under the nearby rocks for some sign of that enemy.
Why is this categorized as CodeSOD?
A back-of-the-napkin calculation tells me that the odds of this happening (the drive failing on the same date in three separate years just by chance) are about 1 in 13 million. Something must be causing this failure each year.
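For anyone who wants to redo the napkin math, here is a minimal sketch. It assumes exactly one failure per year, a uniformly random failure date, independence between years, and a 365-day year; none of those assumptions come from the comment above, and the function names are made up for illustration. Different assumptions give different figures, so it will not necessarily reproduce the number quoted.

```python
from fractions import Fraction

DAYS_PER_YEAR = 365  # assumption: leap years ignored, every date equally likely


def same_date_probability(years: int, days: int = DAYS_PER_YEAR) -> Fraction:
    """Chance that one failure per year lands on the same date every year.

    The first year's date is free; each later year must match it.
    """
    return Fraction(1, days) ** (years - 1)


def specific_date_probability(years: int, days: int = DAYS_PER_YEAR) -> Fraction:
    """Chance that one failure per year lands on one date named in advance."""
    return Fraction(1, days) ** years


any_date = same_date_probability(3)
named_date = specific_date_probability(3)
print(f"same (unspecified) date three years running: 1 in {any_date.denominator:,}")
print(f"one pre-named date three years running:      1 in {named_date.denominator:,}")
```

Under these assumptions the two results are 1 in 133,225 and 1 in 48,627,125. Either way, pure chance is a poor explanation.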
My dad's version of that was "once is happenstance, twice is coincidence, three times is a conspiracy".
Ian Fleming's version, which he puts in the mouth of Auric Goldfinger, combines both: "Once is happenstance. Twice is coincidence. Three times is enemy action."
Maybe August 12 was always the day that the cleaning staff decided to come into the server closet and whack some drives with a vacuum cleaner or something.
"Farenheight"?
High temperatures, I suppose...
Apparently Qi Faren is a Chinese aerospace engineer. I don't know what his height is.
A better "a long time ago" storytelling technique is to replace that by "a private CVS server".
Ah, the memories... of struggle, mostly. Who didn't like spending nights fixing a corrupted CVS install by hand-editing RCS files, because it turned out the startup not only had no failover hardware but also had no backups (*) of the small company's entire source code?
(*) The thing with backups that I learned: if you create backups but never test them, then you do not have backups.
Fair in height, with a tall complexion
One of the things we noticed as a RAID company: if you fill your array with drives all from the same batch, the odds of two failing nearly at once are actually substantial. RAID is not always the panacea one might like it to be. ... And of course, there were always the customers who never bothered to replace the failing drive in a RAID-5 config until the second one died anyway.
I was thinking along the same lines, but more that one of their neighbors switched off an electric appliance, which could have caused a power surge.
Without doing any maths, my "something is off" sensor was pinging wildly at the statement that this happened three times. If I were the sysadmin there, I'd be thinking very carefully about what could be causing the issue, and mitigating as much as possible - as well as setting up a video recorder to see if anything untoward happens physically at that time. Surge protector and UPS, at a minimum.
IBM DTLA... "We don't need no stinkin' backups - we have enterprise RAID now!"
Having used RAID for personal use in NAS boxes, RAID 5 is probably the worst. Nothing is worse than watching your box rebuild the array while worrying about another drive failure that takes it all down. After all, if a drive is going to die, it's going to be when it's busiest, like during a rebuild. RAID 5 rebuilds are just too damn stressful.
At least with RAID 6, if a drive dies, you can have another drive die while rebuilding and still be OK.
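To put rough numbers on the rebuild worry, here is a small sketch. It assumes each surviving drive fails independently with the same probability during the rebuild window, which is optimistic for drives that share a batch or are stressed by the rebuild. The function name and the 2% figure are made up for illustration, not measured values.

```python
from math import comb


def p_array_loss_during_rebuild(surviving: int, p_fail: float, tolerable: int) -> float:
    """Probability that more than `tolerable` surviving drives fail mid-rebuild.

    surviving: drives still working after the first failure
    p_fail:    assumed per-drive chance of failing during the rebuild window
    tolerable: further failures the array can absorb
               (0 for RAID 5, 1 for RAID 6, after one drive is already gone)
    """
    p_survive = sum(
        comb(surviving, k) * p_fail**k * (1 - p_fail) ** (surviving - k)
        for k in range(tolerable + 1)
    )
    return 1 - p_survive


# Hypothetical example: three drives left, 2% chance each of dying mid-rebuild.
raid5 = p_array_loss_during_rebuild(surviving=3, p_fail=0.02, tolerable=0)
raid6 = p_array_loss_during_rebuild(surviving=3, p_fail=0.02, tolerable=1)
print(f"RAID 5 loss risk during rebuild: {raid5:.4%}")
print(f"RAID 6 loss risk during rebuild: {raid6:.4%}")
```

With these made-up inputs the RAID 6 figure comes out roughly fifty times smaller than the RAID 5 one, which is the whole appeal of the second parity drive.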
It's also highly likely that all the original drives in the array came from one manufacturing batch, so there's a higher-than-normal chance that a second will fail shortly after the first, even without the stress of a rebuild.
Correlation risk is a bitch.
Had a similar thing at a previous place I worked, back when everything was spinning disks.
The issue occurred between Christmas and New Year's every year, but we weren't allowed back in the building until the 2nd of January, by which point it was hard to figure out what had happened.
The first couple of years we didn't know until after it had happened, and had no hope of discovering the problem.
Then one year a few of us decided to check it out and stayed slightly later on Xmas Eve in the server room, setting up additional monitoring and backups. Then the air cooling system for the server room turned off.
It turns out the air cooling for the server room was inadvertently being turned off when the thermostat for the offices was turned off before the winter break. Xmas was the only time the office was ever shut for multiple days, so the janitor was turning it off on Xmas Eve, then on again on the 1st of January, so the office was fine again on the 2nd of January when everyone returned from the winter break.
The server room would get hot, massively increasing the failure rate for the drives, but by the time we came back days later the room would be back to a normal temperature.