• ray10k (unregistered)

    I don't quite understand how that update caused the issue. Did the file being read get bigger, leading to later logic missing some important info?

  • PlanB (unregistered)

    Why not (assuming it is C#):

                    var ms = new MemoryStream();
                    bufferedStream.CopyTo(ms);
                    byte[] buffer = ms.ToArray();
    
  • Martin Tilsted (unregistered)

    My guess is that what happened was that before the patch, the OS would always deliver all 8192 bytes at a time as a single operation. So the read always did manage to read 8192 bytes.

    But the OS patch somehow changed the buffering strategy, so the OS would now deliver fewer than 8192 bytes. My guess is that this was an optimization where the OS no longer wanted to buffer data, but instead delivered data as soon as it got any.
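
    Martin's scenario is exactly what a correct read loop has to tolerate. A minimal Java sketch (the method name readExactly is illustrative, not from the article):

```java
import java.io.IOException;
import java.io.InputStream;

class ReadLoop {
    /** Reads exactly buffer.length bytes, looping over short reads. */
    static void readExactly(InputStream in, byte[] buffer) throws IOException {
        int totalBytesRead = 0;
        while (totalBytesRead < buffer.length) {
            // read() may legally return fewer bytes than requested,
            // even when more data is still on the way.
            int bytesRead = in.read(buffer, totalBytesRead,
                                    buffer.length - totalBytesRead);
            if (bytesRead == -1) {
                throw new IOException(
                    "Unexpected EOF after " + totalBytesRead + " bytes");
            }
            totalBytesRead += bytesRead;
        }
    }
}
```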

  • LCrawford (unregistered)

    The word Socket was mentioned, which means it could be a network stream, where a message is not guaranteed to arrive in a single network read. It may have been on a LAN, thus working correctly for years until the OS patch split the message, requiring multiple reads or at least checking the result.

  • SG (unregistered)

    I did not see the clue. This is a "normal" coding error; it happens to nearly everyone at some point. The search is hard, but also not special. And the bugfix is daily work. Where is the WTF?

  • gumpy_gus (unregistered)

    I was parachuted into a large company that had exactly the same problem. All throughout their code they just assumed that the packets coming from the Iridium satellites would arrive one and only one at a time, and that one read would always read one and only one packet.

    After about an hour I figured that out. They didn't believe me. After three months I had a test environment running that would demonstrate the issue. When I showed this to mgmt, they of course did the only reasonable thing, fired me.

  • Brian (unregistered) in reply to SG

    The WTF is deploying an emergency patch at 5AM which presumably contained a lot of untested code changes written in the wee hours... sleepy people tend to make bad decisions, especially after a mind-numbing fourteen hours of combing through log files. Taking the next day off, while justified, also seems ripe for trouble - who knows how many new bugs might have been introduced during the marathon debugging session. If the thing "runs fine for a while" before throwing exceptions, seems like a reasonable short-term workaround would be to restart it every now and then while the team works on fixing it during normal hours.

  • I dunno LOL ¯\(°_o)/¯ (unregistered) in reply to Brian

    I see we have a volunteer to stay behind and make sure everything works! Good luck!

  • Sensible (unregistered) in reply to Brian

    Sounds like you've never been in an oh-s*** moment on the front line. Since you seemed to express your WTF in earnest, I'll respond in kind.

    Broken production costs money. Real money. I once worked at a small company (<100 employees), and even our downtime cost ~$1,000/hour in revenue during peak business hours. Emergencies happened.

    The code they targeted was already broken, the fix only targeted that particular code pattern, and it would get them back up to full steam. The judgement call was made, and they fixed it. They'll deal with the consequences afterward.

    If your tire goes flat on the highway, you're going to pop your spare on until you have enough time to address the reason the tire went flat. It's not ideal, but it gets you running again.

  • Brian (unregistered) in reply to I dunno LOL ¯\(°_o)/¯

    No, just advocating for sensible processes - nothing in this story indicated that anyone's life or livelihood was on the line, and apparently some initial troubleshooting had already been done on an apparently normal schedule, so why the sudden urgency of pulling an all-nighter? Dev-QA-Release cycles exist for good reason, and allowing them to be circumvented once makes it easier to do (and expect) the next time, and the next...

  • masonwheeler (github)

    What's a DR environment? Never heard of that one before...

  • (nodebb) in reply to masonwheeler

    DR ==> Disaster Recovery. (Or, rather, that's the meaning I've seen in this kind of context.)

  • Paul Neumann (unregistered)

    Yay, another self-aggrandizing Snoofle article. I'm quite certain the actual occasion was something along the lines of: "One time, I heard of a peer review where another developer pointed out that the original developer forgot, or didn't realize, they needed to account for the number of bytes read from a stream."

  • D-Coder (unregistered) in reply to Paul Neumann

    I love Snoofle's posts. They're well-written and interesting. Keep writing them!

    Although this one wasn't very WTF-y.

  • (nodebb)

    IntelliJ warns when the result of methods like that is unused.

  • LXE (unregistered)

    Is the stream never expected to terminate, or is it guaranteed to contain only chunks (datagrams) of the same size? If the latter, does it mean that the buffer size has semantic importance, and if it does, why isn't it mentioned anywhere? If you read 4096 or 8190 bytes, you will fail with an unexpected EOF. If that's not what you intended, sorry, you need to get back to work before anybody else notices.

  • Baboon (unregistered)

    Um since this is Java you could have just used new ObjectInputStream(bufferedInputStream).readFully(buffer) ... congrats you just reinvented the wheel ... ;)

  • Somebody Somewhere (unregistered)

    Maybe it's different in server-land, but in my neck of the woods (embedded) that read-check-length-read busy loop will absolutely murder performance if data doesn't show up in a timely manner, or at all. You're effectively starving every other task of CPU time while you wait as hard as you can for data to arrive.

  • snoofle (unregistered)

    re all: this was a critical financial system through which nearly a $billion was transacted daily. Every transaction that failed had to be processed manually, which caused all sorts of problems. It became critical when the clerks couldn't get the transactions processed in time for the overnight batch jobs, which made the firm miss regulatory deadlines, which resulted in huge daily fines, which forced the fix-it-and-deploy-it-NOW-at-any-cost mandate!

    re: Martin Tilsted: your assessment is correct; that's exactly what happened

    re: LCrawford : exactly

    re: Paul Neumann: hardly; I was the #3 guy on the team. The three of us were looking over each other's shoulders precisely because it was very late and we were all very tired.

    re: Somebody Somewhere: it wasn't a spin loop; it was a blocking read

    FWIW: that thankless night was the very last time I ever worked 24 straight hours for anyone for any reason.

  • Anon (unregistered) in reply to Steve_The_Cynic

    So, they had a Disaster Recovery environment that was working? According to Snoofle's followup, this sounds like a disaster scenario. So, why didn't they fail over to DR until they could roll back the OS patch?

    And why is hours of coding at 3 AM more feasible than an OS rollback?

    And what happened to the SA that decided to apply a patch to production without first applying to DEV, QA, and pre-prod?

  • Arthur (unregistered)

    " That's when the other managers in the department pointed out that there was work to do." It may be subtle here, but this sort of thing pisses me off so much.

  • (nodebb)

    Why isn't the OS or standard library responsible for repeatedly reading from the stream until 8192 bytes, EOF, or a stream error? I haven't studied OS internals, but I feel the loop belongs in the library.

  • snoofle (unregistered) in reply to Anon

    re: Anon re DR: patches to production were automatically/simultaneously applied to DR; fail over wouldn't have helped. The patch actually WAS applied in the "lower" environments, but the hardware and OS-versions were different in each one, so apples-and-oranges. Yeah, I know; we openly warned about that for months, but every upper manager insisted it "couldn't be a problem" (translation: "that would cost a lot of time and effort to fix"). Of course in the post mortem, they Harrumphed their way out of it so nobody was ever held responsible for the $millions that were lost due to the failed transactions on which they had to make good.

  • Mr. Anderson (unregistered)

    There are more WTFs:

    1. Are you not doing code reviews? I recently had a code snippet with roughly the same issue (expecting the buffer to have been completely filled by a single read) up for review, and although all tests passed, this was an obvious thing to fix before merging into the master branch.

    2. In such a critical application, this should cause a verbose Level.SEVERE log message and trigger an audiovisual alert in the office:

            } catch (Exception e) { // ... }

    Considering you had to dig through system calls and signals for 14 long hours, I suppose it did nothing of the sort. Hopefully the new solution is better.

    3. This one is at least bad practice: catching Exception (which includes all kinds of RuntimeException!) while expecting it to be a particular kind of Exception (IOException?):

            } catch (final Exception e) { // Handle read-error }

    4. This should be thrown at the place where it happens, which should be straightforward when catching only IOException in 3 and throwing some custom Exception in this case:

            if (totalBytesRead < COMMAND_BUF_SIZE) { // throw exception for unexpected EOF }
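
    A sketch combining points 2-4 (the class name, logger, and use of DataInputStream.readFully are my assumptions; COMMAND_BUF_SIZE comes from the comment above):

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

class CommandReader {
    private static final Logger LOG =
        Logger.getLogger(CommandReader.class.getName());
    private static final int COMMAND_BUF_SIZE = 8192;

    static byte[] readCommand(InputStream in) throws IOException {
        byte[] buffer = new byte[COMMAND_BUF_SIZE];
        try {
            // readFully loops internally; throws EOFException on short input
            new DataInputStream(in).readFully(buffer);
        } catch (EOFException e) {
            // point 4: fail loudly at the place where it happens
            LOG.log(Level.SEVERE, "Unexpected EOF while reading command", e);
            throw e;
        } catch (IOException e) {
            // point 3: catch IOException, not Exception
            LOG.log(Level.SEVERE, "Read error", e);
            throw e;
        }
        return buffer;
    }
}
```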

  • I Saw a Robot (unregistered) in reply to Somebody Somewhere

    Even in an embedded environment, a multitasking (RT)OS that can't let the other tasks run during a blocking read is broken.

    But yeah, it's definitely not a problem in "serverland".

  • Kalikur (unregistered)

    I haven't had my caffeine fix yet so I'm not fully functional, but it seems to me that the 'fix' doesn't properly handle the EOS flag. If it hits EOS before reading exactly 8192 bytes, it will blow up. Not exactly the right behaviour, is it? I'll go get my coffee now.

  • 🤷 (unregistered)

    @Arthur "" That's when the other managers in the department pointed out that there was work to do." It may be subtle here, but this sort of thing pisses me off so much."

    I'm currently (re)watching DS9 on Netflix. In one episode, Sisko asks O'Brien something like "How long will it take to get fixed?" and O'Brien says "Three weeks. I'm waiting for a part and it's going to take three weeks to be delivered". And Sisko is like "I want it fixed in 3 days!" and storms off. Of course, they somehow manage to get the part delivered earlier, so Sisko then asks "How long will it take?" - "8 hours" - "I want the ship ready in 2 hours" - "2 hours it is, then".

    Of course, this is played for laughs, but I somehow got the feeling that this is exactly the kind of shit where unreasonable expectations come from. Just put on an angry face and tell your employees that you want it done in much less time than the guys who actually know their stuff estimated, and it will get done in much less time. The "there is work to do" after the team pulled an all-nighter is exactly the same thing: unreasonable expectations.

  • Kanitatlan (unregistered)

    I must admit that I have become quite intolerant of the constant repetition of this same error by various programmers over the years: all writing socket software that assumes every read yields a whole message, and only a single whole message, on a TCP byte stream that delivers messages in succession. Even when I make them fix it, we end up with software that cannot cope with aggregation that splits messages between reads. It seems the dominant psychology at work is a belief that TCP is "record" based rather than a byte stream.
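
    Kanitatlan's point is why byte-stream protocols carry their own framing. A hypothetical length-prefixed framing sketch in Java (not the protocol from the article; names are illustrative):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class Framing {
    /** Writes one message with a 4-byte big-endian length prefix. */
    static void writeMessage(OutputStream out, byte[] message) throws IOException {
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeInt(message.length);
        dos.write(message);
        dos.flush();
    }

    /** Reads one message, regardless of how TCP split or merged the bytes. */
    static byte[] readMessage(InputStream in) throws IOException {
        DataInputStream dis = new DataInputStream(in);
        int length = dis.readInt();          // blocks until 4 bytes arrive
        byte[] message = new byte[length];
        dis.readFully(message);              // loops internally over short reads
        return message;
    }
}
```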

  • matste (unregistered)

    TRWTF is that they copied and pasted the loop into the source code multiple times, instead of using any of the thousands of available util methods for this.

    Also, detecting this error should be trivial for any static-analysis.

  • mihi (unregistered)

    And in case it is Java:

    new DataInputStream(bufferedInputStream).readFully(buffer);
    

    No need to reinvent the wheel every time :)

    It will throw EOFException in case it could not fill the buffer fully.

  • Anon (unregistered) in reply to snoofle

    [quote=snoofle]patches to production were automatically/simultaneously applied to DR; fail over wouldn't have helped[/quote] [quote=OP]We can not reproduce the issue in any of the other environments (DR, ...). [/quote]

    You should get your story straight.

  • Anon (unregistered) in reply to Anon

    Also, a comment preview button so one can see if the BBCode is working properly would be helpful.

  • (nodebb) in reply to Anon

    My understanding is the patch was to the operating system not the program.

  • siciac (unregistered) in reply to 🤷

    The TNG episode where Scotty is found stuck in a transporter buffer points out this problem, as Scotty plainly tells La Forge that he heavily pads all his estimates.

  • siciac (unregistered) in reply to Kanitatlan

    "It seems the dominant psychology at work is a belief that TCP is 'record' based rather than a byte stream."

    It's more a limitation of cognition. If you don't have any evidence that something is failing, you don't have a reason to look for a problem.

    And once you've learned an idiom, you typically stop thinking about it because you "know" how it works. (That's why continuing education is good, you can be exposed to people who are poking around and looking at problems in common idioms.)

    API designers ought to take that psychology into account. In this case, the default read should guarantee the bytes are read or an error is raised, and the current behavior should be labeled readAtMost.
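
    For what it's worth, Java's standard library eventually grew exactly this split: InputStream.read() keeps the "read at most" behavior, while readNBytes(byte[], int, int) (added in Java 9) loops internally until the requested count or end of stream. A small sketch using a wrapper stream that deliberately returns one byte per read (the dribble helper is my invention, simulating a slow socket):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadAtMostDemo {
    /** A stream that returns at most one byte per read(), like a slow socket. */
    static InputStream dribble(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public int read(byte[] b, int off, int len) throws IOException {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8];

        // read(): the "readAtMost" behavior -- returns after the first chunk
        int n = dribble(new byte[8]).read(buf);             // n == 1

        // readNBytes() (Java 9+): loops until the buffer is full or EOF
        int m = dribble(new byte[8]).readNBytes(buf, 0, 8); // m == 8

        System.out.println("read() -> " + n + ", readNBytes() -> " + m);
    }
}
```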

  • Wizard (unregistered) in reply to gumpy_gus

    Beware showing management that the people they thought were rockstars are incompetent. It just shows higher management that management doesn't have a clue either... the result: you will be fired, or your life made hell.

  • zaSmilingIdiot (unregistered)

    TRWTF in this entire article is actually the SA having applied an OS patch in a production environment without first applying it in the QA, DR, or even testing environments. As in, why was the problem not reproducible in any other environment?

  • N H (unregistered)

    My manager recently had the gall to ask me to stay in and work "the same hours as everyone else" after I worked overnight on a critical bug.

    However, he also had the good grace to apologize when I explained that I'd been working since 0900 the previous day, and I got ice cream the next day on top of overtime, so... Worth it?

  • (nodebb) in reply to gumpy_gus

    On POSIX, the recv* functions on a socket of type SOCK_DGRAM or SOCK_SEQPACKET will indeed only return one message[1] at a time. The recvfrom and recvmsg functions can also provide additional information such as (importantly on datagram sockets in particular) the remote address. It is just silly to use read() on a SOCK_DGRAM or SOCK_SEQPACKET socket, which may have happened in your example.

    [1] There is a recvmmsg function on Linux that can receive multiple messages at once, but which can also return complete metadata and is, I would guess, always used deliberately, so this should not be seen as a counterexample.

Leave a comment on “Assumptions are the Mother of all Bugs”
