• (nodebb)

    First you bug, then you debug. Writing bugs is inevitable. Shipping them isn't: that is not only-human, that is not learning, that is an organizational failure. It is what review is for.

  • Dave (unregistered)

    "The astoundingly profitable internet giant hailed the software as a triumph because it saved a single network administrator over eight hours of work each week."

    This bit is obviously complete nonsense. As the TDWTF article notes, it's better to automate routine stuff so that you don't f it up. Google was not automating 8 hours a week to save paying an administrator, but so that there'd be less scope for cocking things up.

    Ironic that they cocked up the implementation of that, but it's a perfectly reasonable plan. No need whatsoever for the alt-right conspiratorial nonsense about 'massively profitable' Google blindly and idiotically penny pinching.

  • (nodebb)

    Let this be a reminder to everybody: keep the internet decentralized. Do not put your eggs in one basket. Do not rely on Facebook or Google or whatever. If you want to offer Facebook authentication, fine, but always allow passwords as well. And this is on top of the morality aspect of dealing with Facebook; I'm sure at this point the whole world knows there are many reasons not to use Facebook at all. I don't, and was completely unaffected by this outage. I couldn't be happier about having made the choice, years ago, not to use any of these giant platforms.

  • ThndrPnts (unregistered)

    What outage?

  • Prime Mover (unregistered)

    What's Facebook?

  • Brian Boorman (google) in reply to ThndrPnts

    Exactly. Those of us that actually work for a living never knew of the outage until it hit the evening news.

  • I dunno LOL ¯\(°_o)/¯ (unregistered)

    A caller on the radio yesterday compared a day without Facebook to a day without Satan.

    And I have never signed up for Facebork. I only knew about this because it was being talked about.

  • Eric (unregistered)

    For what it's worth, they've provided more explanations as to the cause of the outage here: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

  • RLB (unregistered)

    Another thing to remember: you're going to eff up again, so always have a way to quickly un-eff back to the previous state. I don't blame Faceblerk for making a mistake, but I do blame them for having no alternative way to reach their servers, apparently not even physically.

  • Philip Storry (unregistered)

    Systems Administrator checking in here. I read TDWTF because I code as a hobby, and because I'm sure my job will end up being subsumed by some form of DevOps at some point. (Or SRE, but that's another discussion.)

    If I join a DevOps team, what I'll bring won't be the fastest code writing or the smartest code writing. My code is workmanlike, staid and dull. It's also conservative and reliable, because I hate being called out at 3AM due to a script failing.

    What I will bring is that conservative sysadmin attitude. 25 years of experience in risk assessment, testing, planning, scheduling, post-rollout testing, rollbacks, incident reviews and everything else that keeps the business running.

    There can be a lot of bad blood between sysadmins and developers. But the one thing that good developers and good sysadmins share is this simple understanding: You don't eff up in Production.

    You can eff up in testing. You can eff up on your own machine. And you should. That's what it's there for. That's what it was built to help you do - eff up in a safe place.

    But please, please try not to eff up in production.

    Sadly, it's inevitable. It will happen. It always does.

    So when it does, please don't skimp on the no-blame post incident review. It's one of the best tools you have to reduce the chances of it happening frequently.

  • TruePony (unregistered)

    Another key part is, after you screw up, to figure out what it will take so that mistake can't happen again. I remember reading that when examining the cause of plane crashes, the investigators only name human error as a last resort. They instead focus on the equipment and the processes in place. Humans are going to screw up, so what tools, processes, etc. do you need to find that out before your users do?

  • The Mole (unregistered)

    The problem is, if you get good at effing up then the mistakes are normally caught early before they have an impact. This means that the times you fail at being good and eff up badly you aren't experienced in resolving the problem.

    Pretty much by definition contingency planning for errors due to bugs is really hard. If you can anticipate the specific issue then you are unlikely to have a bug there in the first place. You can potentially anticipate direct consequences of some classes of failures (new build doesn't work, roll back). But even then you quickly hit issues (roll back didn't work as upgrade changed DB schema).

    For Facebook I imagine the backup 'plan' was if the automated BGP management fails they would grab the previous config from the archive and deploy that manually. With a bit more padding that could get through the change control committee.

    What it misses of course is the indirect consequences: no BGP means DNS is inaccessible, inaccessible DNS means you can't resolve the machine holding the backups, and even if you could you can't log in for the same reason, and even if you could you can't get into the buildings because the card validation server is also on DNS.

    It's like the stories of companies regularly testing failing over to their backup diesel generators, only to find out that in a real power cut it doesn't work because the starter motor was connected to the power grid.

  • Grzes (unregistered)

    The joke about exiting from Vim is, unfortunately, outdated. The author(s) basically spoiled the Vim, implementing "Type :qa and press <Enter> to exit Vim" message which appears when the user presses ^C.

  • Sole Purpose Of Visit (unregistered) in reply to Philip Storry

    " It's also conservative and reliable, because I hate being called out at 3AM due to a script failing."

    Well, yes, quite. (We've all been there. Usually on a Sunday morning.)

    But once your organisation gets big enough (and 100 people would do, let alone FaceBook), this attitude is both entirely professional and entirely meaningless.

    Automation code is usually off-shored, on-shored, contracted out, bought in from a "reputable" supplier, or otherwise disconnected from the person at the end of the phone at 3am on a Sunday morning. I'm sure there are management imperatives behind this. I'm equally sure that those management imperatives are, prima facie, insane.

  • my name is missing (unregistered)

    The problem that Facebook had was a bug in their audit system designed to catch bugs, which allowed a bug (misconfiguration actually) to take down every single connection between data centers which could only be fixed by physical data center access, which was hard because they are designed to be hard to get into, so they were hoisted by their own automation and physical security. You've heard of Big-O, well this is big-F effing up.

  • (nodebb)

    Facebook employs more than 60,000 people. If a change designed to save one of them a day a week has indeed taken the company offline for six or more hours, that's quite something.

    This statement is outright painful.

    It's like saying "why did you waste time banging the stone a hundred times, when it broke on a single strike". Yes, that one thing ended up going badly wrong, but if we stopped automating anything that saves anyone a bit of time because there's a risk of it going badly wrong, we wouldn't be using computers at all.

    Addendum 2021-10-06 09:05: Of course Facebook wouldn't have failed then. Because there would be no Facebook.

    Addendum 2021-10-06 09:07: There's also the issue of how often people doing it by hand would have eff'ed up as others pointed out...

  • jochem (unregistered)

    Exiting VIM? I generally just buy a new computer.

  • Carl Witthoft (google) in reply to TruePony

    OTOH, once it became mandatory for pilots (and other responsible folks) to go through a physical checklist before takeoff, the crash rate went way down. Later on, and with much bitching and whining from the "old guard," similar checklists to keep track of things like surgical tools, sponges, etc. in the course of surgical procedures greatly reduced the damage and error rates there as well.

  • Anon (unregistered)

    I've always thought that software development experience levels go like this:

    Junior developers just try to write code that works. Intermediate developers try to write code that doesn't fail. Expert developers try to write code that fails gracefully.

  • Kleyguerth (github) in reply to jochem

    Exiting VIM is easy, it tells you how to. The real challenge is exiting ed. "?"

  • matt (unregistered)

    That should be "jokes that need a break".

  • Bill Turner (google) in reply to RLB

    Oh my God yes! I had an incident recently (a few months back) where I didn't, and things were not pretty.

    Situation, upgrading code on a VM. VMware gives you snapshots to help with this - snapshot, try something, if it bombs then rollback.

    Until you accidentally commit the snapshot rather than rolling back. Then you get to see how good your backups are. Hint: Not as good as we thought.

  • Barry Margolin (github) in reply to Brian Boorman

    Many people who "work for a living" depend on Facebook services for their work. There are some third-world countries where WhatsApp is the primary method used for placing orders with online vendors. These people lost a day of business.

    Facebook isn't just for chatting with friends.

  • ooOOooGa (unregistered)

    So the eff up is that their eff up prevention tool effed up.

  • CdrJameson (unregistered)

    What's the problem? They moved fast and broke things. Business objective: achieved! Trebles all round!

  • Hey, look over there! (unregistered)

    Wow, no mention of the timing of it being coincident with the whistleblower's testimony. Nothing like the news being saturated with the tragedy of facebook's being temporarily gone for a few hours at the same time it's lightly peppered with "facebook is bad for kids" testimony.

  • (nodebb) in reply to Hey, look over there!

    Wow, no mention of the timing of it being coincident with the whistleblower's testimony.

    That's scarily credible...

  • (nodebb) in reply to RLB

    Sometimes rolling back isn't a good option. In my current job I was updating a system which is only used for special processing every 5 years. Except one group that uses it changed some of the values it handles to be too large to be handled as an unsigned integer, so they had to be handled as an unsigned long instead. This messed up a number of different things, which took weeks to track down. Rolling back wasn't an option, as doing so would just revert things back to the original problem.

  • xtal256 (unregistered) in reply to Grzes

    "The joke about exiting from Vim is, unfortunately, outdated. The author(s) basically spoiled the Vim, implementing "Type :qa and press <Enter> to exit Vim" message which appears when the user presses ^C."

    Right, but that in itself is a joke!

    Imagine a GUI app with a start up message that says in big letters "Click the red X button in the top right corner to exit". I'm sure some did that back in the early days of GUI apps, but these days you'd have to be an idiot not to know how to exit it.

  • Your Name (unregistered) in reply to xtal256

    The joke is rather that they know EXACTLY what the user wanted to do, and instead of doing the wanted action spite the user with a "do it differently because we don't care what you want".

    It's the same as the grandiously stupid message of old C compilers complaining that the last line doesn't have a line break. So what? You know what the problem is, you are a machine, shut up and help the user, that's your job.

  • löchlein deluxe (unregistered) in reply to xtal256

    Well, I used to happily press the Save/Print/whatever buttons at the bottom of the dialog until some update moved them elsewhere. In a weird way of streamlining, yes, the button to make the dialog go away is now always in the top right, but sometimes it'll cancel, sometimes not.

  • Officer Johnny Holzkopf (unregistered) in reply to Barry Margolin

    In Germany, many businesses rely on Facebook for customer communication, and require WhatsApp for their internal communication. Even though it is illegal (due to privacy regulations which prohibit, for example, transmitting patient data - unbeknownst to the patients and of course without their consent - using "non-approved" external services), that short service outage caused panic among many home care businesses because there was no communication between the head office and the mobile nurses. But: Probably for the first time, they were compliant with legal requirements! Thank you, responsible staff.

  • DQ (unregistered)

    If effing up is our job, then I know a few people who are very good at their job...

  • my name (unregistered)

    i would like to use this opportunity to make an announcement:

    "FUCK"

    is not a bad word. it actually was an artisanal term involved in the creation of swords.

    replacing it with "Eff" at every opportunity will only lead to "Eff" having the same stupid bad connotation and being replaced by something else. and the circle of false politeness repeats.

  • (nodebb)

    And hopefully, you do not let a bug slip into production that causes a chain of events leading to the main gun of a warship being slammed into the desk (I really hope the statute of limitations has run out)

  • (nodebb) in reply to Holywhippet

    "Sometimes rolling back isn't a good option" Rolling back (if the system has gone live) is often not an option. Consider a system where the update has removed one column from a DB and added a completely different one. Thousands of people have used the new version of the system, including adding information into the new column...

    Having a proper "disaster roll FORWARD" plan is a much better way to mitigate risk (you can never eliminate it - I have seen the backup tapes for a rollback all be bad)

  • RLB (unregistered) in reply to Holywhippet

    Sometimes rolling back isn't a good option.

    Certainly true, but a. it would've been in this case, and b. you still need to make sure any breakage you cause (in production, at least) doesn't also break your ability to repair the breakage. Which, in this case, it did for longer than it should have.

  • rosuav (unregistered) in reply to TheCPUWizard

    Desk or deck? Either way, I am curious.

  • (nodebb) in reply to Holywhippet

    Sometimes rolling back isn't a good option

    Spoilers for some Marvel Movies

    In Doctor Strange, the eponymous hero saved the day by continually rolling back the Universe until the Big Bad gets bored and gives in.

    In Avengers Endgame, rolling back is made dramatically unsatisfactory by giving Tony Stark a really cute daughter who would cease to exist if the victors did the obvious thing which would be to roll back.

  • Best Of 2021 (unregistered)

    There are two ways to try to do, well, anything really.

    One is that to fuck up is completely unacceptable because there is not going to be any scope to fix it later - in software, something like a spacecraft control system, but also real world things like building engineering or power distribution systems. These occupations have tightly controlled production processes, fixed bureaucratic codes (sometimes nonsensical because editing the code is very hard and circumstances change) and very little flexibility for the people doing the work, to try to reduce the chance of a fuck up as far as possible.

    The other is to accept that people fuck up, but that the creative output that comes with it is worth it. So we allow them to do so in a controlled environment, put processes in place to reduce the negative effects of doing so, and to make it easier to recover from them. This is what general purpose software creation (not really engineering or architecture, despite the names our job titles often have) is. Sport is also like that - we accept a striker taking and missing shots, midfielders trying passes that don't work and so on, because when they get it right, it's worth it; but we also expect other players to cover when they get out of position and risk a counter attack, and expect them to get back as quickly as they can afterwards.

    We should still be trying to reduce the easily avoidable fuck ups though, or at least catching them before they leave your desk - this is what things like TDD are trying to achieve.

  • guest, I guess (unregistered)

    "Effed up"? Are we children? If you want to say "fuck," say "fuck." But "messed up" and "screwed up" don't need to be amplified as "fucked up" and certainly don't need to be amplified, then muffled as "effed up." That's as childish as "va-jay-jay."

  • Diane B (unregistered)

    I'm sorry but I really don't see what role EFF played in all this.

  • Anonymous (unregistered)

    Super, duper, absurdly relevant:

    https://how.complexsystems.fail/#18

  • Frank Wilhoit (unregistered) in reply to Anonymous

    Boy, that is a crazy sloppy piece of writing and thinking.
    First of all, he assumes agreement on what complexity is. If you ask 100 people to write down definitions of complexity and simplicity, you will get at least 25 different definitions of complexity and at least 50 different definitions of simplicity. Few people understand the difference between "complex" and "complicated", or that complexity is neither good nor bad in itself but is something that must be managed. Then look at his point #2. He is talking about reactive defenses against failure: "learnings" from failure. Somehow I think the probability, as of today, of a second Chornobyl is much less than the probability, the day before it happened, of the first one. Call me a stiff-ass, but I don't think that's quite good enough. Cook is writing from a medical perspective. I'm sure he thinks the probability, as of today, of a second THERAC-25 is much less than, etc., etc. But the probability of fatal accidents in other industries due to exactly the same kinds of software defects as THERAC-25 is ~~1, and the reason is not the absence of reactive defenses, but the absence of proactive defenses. Without apology, I read no further. The author has already discredited himself.

Leave a comment on “Eff Up Like It's Your Job”
