The Daily WTF: Curious Perversions in Information Technology

2020-08-04 Reply Admin

Good story.

I think you can still use the previous iteration of the sockets API ('non-blocking') rather than the ones that return Tasks, and that is lower level and would let you write essentially the same logic. If the underlying networking is still going to hold onto resources for all those failures then that will be a problem, but you shouldn't kill the .Net GC at least.

This does seem like a bit of a WTF because spamming UDP packets and not caring whether they're delivered or not is a reasonable pattern for this kind of thing (non-critical display data on a LAN)

dkf · 2020-08-04 Reply Admin

It should be possible to just blat the UDP packets out to IP addresses without engaging any lookups of any sort. Since you'll be on your own network segment, you can control the environment closely enough that that's both highly practical and fast.

2020-08-04 Reply Admin

Such behavior makes me rage instantly, if a machine, lib or some tech in any form believes to know it better than you. There is no problem with some reasonable fallbacks as default but give me an option to disable them for fucks sake.

2020-08-04 Reply Admin

What is wrong with simple DMX/Artnet to run a fuck ton of LEDs. It's not like it is a lighting standard or anything.

2020-08-04 Reply Admin

Obviously I don't know the whole story but it seems to me that just removing the "await" would do the trick. It'd render the operation non-blocking and thus instead of "queuing" x calls for every failed call, they would just keep flowing and failing at a smooth rate.

Remy Porter · 2020-08-04 Reply Admin

I'm no expert on the whole OSI stack, but the IP address is very high up in that stack- to actually send data, you need a physical address, and I'm fairly certain that's the lookup that was failing.

We did try doing just a Send instead of a SendAsync, but the Send blocked for just as long; you'd think, "Oh, I handed it off to the OS, it should just send the packet" but that wasn't the behavior we were seeing. I think if I switched to the WinAPI socket instead of .NET's libraries, that would have gone away, but that was a whole "deal".

@No One: DMX is 250kb/s. That's not fast enough for the amount of data we want to ship. We're treating the LEDs not like lights, but like pixels. They're addressable SK6812s, and we're doing 24bpp. I don't recall how many LEDs we've got in this installation as an exact number, but as a ballpark, let's say 20,000 pixels. That's 480kb a frame.

Remy Porter · 2020-08-04 Reply Admin

The await was inside of an async function which itself wasn't being awaited. I don't recall exactly why I found I needed to do it that way, but just removing the await didn't work.

Steve_The_Cynic · 2020-08-04 Reply Admin

If the underlying networking is still going to hold onto resources for all those failures then that will be a problem

It won't, or not for any longer than it takes for the ARP/NDP(1) (IP=>MAC mapping) requests to time out. While an ARP/NDP request is outstanding, the network stack will hold any packets for that IP(2), and as soon as it completes...

==> If it succeeds, all the queued packets will be sent ==> If it fails, all the queued packets will be (silently(3)) discarded.

Subsequent packets will trigger new requests and held, but each batch will be sent/discarded as soon as the request gives back a result, positive or negative.

CAVEAT LECTOR: that's how FreeBSD does it. YMMV may vary on other operating systems, although I cannot imagine any of them doing anything other than that.

(1) ARP for IPv4, NDP for IPv6, but aside from the details of the mechanism, it's the same thing.

(2) The meaning of "for that IP" varies according to the true destination, but in general, for "remote" networks, "that IP" refers to the IP of the relevant gateway/router, and for "local" or "directly attached" networks, it refers to the destination itself.

(3) Counted in network stats, but not otherwise logged anywhere.

2020-08-04 Reply Admin

You could use static ARP entries to ensure that address resolution doesn't fail, given that you control the network.

Steve_The_Cynic · 2020-08-04 Reply Admin

I'm no expert on the whole OSI stack, but the IP address is very high up in that stack- to actually send data, you need a physical address, and I'm fairly certain that's the lookup that was failing.

No. IP addresses are the equivalent (in the IETF stack) of layer 3 in the 7-layer OSI stack, so fairly low.

Above that are TCP and UDP in layer 4, and protocols like SSL/TLS (session) and HTTP (presentation/application) are higher still.

Addendum 2020-08-04 07:53: I should also point out that you need a link layer address, not a physical address.

2020-08-04 Reply Admin

Came across another one of these, with Windows SSL (predates TLS I think). It assumed that if you were doing SSL you were doing webby things, and that domain names therefore mattered, so it tried to do a reverse DNS lookup of the explicit IP address I was using to access the embedded device (which didn't, of course, have anything registered in DNS). After the thirty second DNS timeout expired it carried on and did the right thing in the end (I'd turned off all the actual checking) but the thirty second wait was intolerable.

2020-08-04 Reply Admin

So the delay in Windows was caused by the delay in ARP resolution? So a 'perfect' Windows UDP API would just discard a packet if the previous packet is undergoing ARP resolution?

Or is the real Windows API shortcoming the failure to cache a negative ARP resolution for eventual rapid packet discard?

2020-08-04 Reply Admin

All I can think of is to use broadcast addresses. There would be no need to get ARP involved, it should just go out to the link-layer broadcast address. You could assign one IPv4 broadcast address per beagle to help them filter the right one. Single-destination broadcast addresses, yeah! The individual beagles would probably have to listen to every packet and actively drop the ones they don't want, rather than let the Ethernet hardware do it, but ARP would be out of the equation. Maybe some kind of router with a lot of ports could filter by broadcast address into different network segments if total input bandwidth on individual beagles was a problem.

2020-08-04 Reply Admin

"120 fps chandelier" is one of those phrases that one never really expects to hear.

But regarding the bug, there are ways to cancel a task. Instead of just awaiting it, you could use Task.Run with a CancellationToken to kill the task after a timeout or network error or whatever. Or if that's too complicated, you could use the plain Socket class rather than UdpClient which would be closer to your Linux implementation.

2020-08-04 Reply Admin

I mean, if your environment is abusively second-guessing you, it's reasonable for you to second-guess its second-guessing. Indeed, my whole 13-year career can be summed up as "processing/transmitting media UDP packets, with the signalling to manage these sessions", and to me your architecture is a good match for the business goal.

And to those who disagree with that assessment, I think we can agree there definitely is a WTF somewhere, regardless of who's at fault.

2020-08-04 Reply Admin

Last I checked async/await did not work in Unity which could be the cause of the leak. Calls would never complete. I specifically tried to use the .NET websocket server and could never get it to work; await calls never continued after awaiting.

I assume Unity (2019 at least, not sure about 2020) lacks an async dispatcher which MS provides for other environments (ASP.NET Coe, WinForms, WPF) in their respective message loops. Someone could probably build one easily enough.

What I would try is to avoid using async and use sync calls instead even just for testing to see if it fixes the issue. Using sync calls on a separate thread (you're not supposed to use them in Unity because Unity doesn't let you use Unity API calls from them, but pure. NET is fair game) in a thread pool could potentially resolve problems.

But it sounds to me like it gets stuck on name resolution, so the real fix is probably to do the name resolution explicitly ahead of time in its own step. So if it fails you just don't send the packets altogether. This also helps the app be more aware about nodes that are up or down so it can optimize what it is doing and report errors better. Sending packets to a raw IP should work like the author originally expected things to work I would think.

Remy Porter · 2020-08-04 Reply Admin

That was definitely an approach I tried. The trick is that I might want to give the tasks a few frames to finish, so some of them should get cancelled and some shouldn't so you need multiple tokens, and then it just gets messy. After a little poking at that, it quickly turned into "y'know, it's just easier to maintain a list of which nodes are up right now". That's one of those: we could get fancy, or we could ship it, tradeoffs.

2020-08-04 Reply Admin

I disagree. IMHO sending out UDP packets is exactly the right thing here (and I've done similar):

Normally all devices are up, so you're not wasting any bandwidth.
Even if you did, it's a dedicated link, so you're not stealing bandwidth from others.
You want to dimension the link for maximum load, so it's useful to always run and test with full capacity, even if not connected (yet or temporarily).
TCP error correction is actively harmful here: Not only are resent packets useless because they arrive too late, they prevent the receiver from seeing subsequent, correctly received packets in the meantime.
If a packet fails, there's nothing you can do about it anyway before a new packet is received. And the next packet received will recover the situation if it contains the full state (as opposed to only changes from the last packet) as is done here.

Only things I'd have added:

A counter to detect out-of-order packets. Though rather unlikely on a local link, but a simple 1 or 2 byte counter with wraparound will do.
A status reply -- basically the health monitoring, but integrated in the main protocol. If nothing else, an empty "ack" packet, better including the counter value to detect lags (and you might find other interesting status info to return). If this seems too heavy, just ack every n'th packet. (1/2)

2020-08-04 Reply Admin

So I think the design is fine and the leak is the WTF here. I've never been a big fan of async I/O. I think the few cases where it's useful are vastly outnumbered by those where it's used just because the API does it by default while not needed, and sometimes (like here) actively harmful.

But I don't really get it: You said there's 30 packets per second, for 3 seconds (until they are finally discarded, aren't they?), for 30 hosts, so a total of 2700 packets. That's not really a huge number. If each packet is on the order of a KB in size, we're talking about a few MB. Even with some administrative overhead, that should not stress any modern machine. Just how much overhead is there? (2/2)

But TRWTF today is the comment system: I noticed my comment was getting too long, so I trimmed it until the input field accepted it. After posting, I got a message "Comment too long". Failed length verification!

So I had to split it into 2 parts, thereby getting to do TFreCAPTCHA 3 times, plus a bonus round because it "couldn't connect to the CAPTCHA server" meanwhile.

Steve_The_Cynic · 2020-08-04 Reply Admin

So the delay in Windows was caused by the delay in ARP resolution? So a 'perfect' Windows UDP API would just discard a packet if the previous packet is undergoing ARP resolution?

I'd normally recommend holding them until the resolution completed (send the waiting ones) or failed (discard the waiting ones).

But equally, the API should push the packet into UDP=>IP=>interface and then return as soon as the interface has taken it. It's UDP, so if it fails to send, that's your hard luck.

ARP/NDP resolution are tied to the interface, not IP (point-to-point links, SLIP, PPP, GRE tunnel intefaces, junk like that don't even have IP=>MAC resolution, and RFC1577 IP-over-ATM has a different mechanism that is nevertheless similar in concept), so you dispatch the packet into the interface, and it calls the ARP/NDP resolver. (And the NDP resolver calls back up into ICMPv6, ffs.)

Addendum 2020-08-04 10:03: Should have noted that the initial recommendation here is for the UDP/IP stack author, not the application programmer.

2020-08-04 Reply Admin

Well, "resolve the IP address" confused me a bit too, but I don't think it's about name resolution. Doing that for each packet would be a WTF indeed, but then again, it wouldn't fail if a device is down because the DNS server is (hopefully! :) not on the Beaglebones, but on a server (the one that runs the control application or another one).

I think it's rather ARP/NDP as Steve_The_Cynic said. And as he explained, this really shouldn't be the application's concern.

cellocgw · 2020-08-04 Reply Admin

Windows was friendly, Windows was smart, and Windows wasn't going to let us down

So, in other words, Windows is Rickrolling you.

2020-08-04 Reply Admin

Last project I got involved with doing this kind of thing they drove it using raspberry pi and fadecandy. The only real problems we had were down to iffy connectors for the LED curtains. They were feeding it using bluetooth, which I didn't expect to be robust enough, but I was proved wrong on that! Might not quite have the frame rate you need though.

Mind you, the problem is not that end, so that's irrelevant anyway! Is this Windows or the API trying to wrap us in cotton wool? I never got that low-level with networking to fiddle about like this.

Regardless, missing a reference to the Aral sea, and possibly the future of the Blue Nile, is TRWTF here.

2020-08-04 Reply Admin

| Or is the real Windows API shortcoming the failure to cache a negative ARP resolution for eventual rapid packet discard?

And even if it does cache a negative ARP resolution, you'd still want it to continue to try at some rate as packets are sent to allow communications to resume as soon as a Beagleboard comes back online

2020-08-04 Reply Admin

It's like a Rickroll except every time he says "never gonna" the video starts from the beginning in a new tab. The Infinite Rickroll.

2020-08-04 Reply Admin

Isn't that a recursive Rickroll?

2020-08-04 Reply Admin

Did the await SendAsync took 3 seconds to return?

Remy Porter · 2020-08-04 Reply Admin

It did, but we weren't awaiting the await, so we never saw that. Basic outline was:

async void SendPacket(…) {
  …
  await client.SendAsync(…); 
  …
}

//elsewhere
  SendPacket(packet…); //spawns the task, but we don't grab it, so the CLR manages it

2020-08-04 Reply Admin

Race condition, right? 3 second wait. 1 second create. Serial expiration checking, so expiring packets adds time too and it grows to infinity.

What if you make the kiosk GUI only? Kiosk -> Linux/Python board -> Chandelier. Kiosk sends packets to new board per user interaction. New board sends the packets to chandelier.

2020-08-04 Reply Admin

Seems like the most robust implementation would be to keep track of the SendPacket task, and maintain a frame buffer. While you're waiting for a packet to send, each update goes into the buffer rather than sending immediately. If the buffer is already occupied when a new frame comes in, overwrite it with the new frame.

This implementation would be useful not only in cases where a segment is down, but also if your frame processing rate could ever outpace your network transfer rate.

2020-08-04 Reply Admin

ARP resolution is really only a problem when the devices are on the same LAN segment, because the Ethernet packets need the destination MAC address. That is the purpose of ARP, to keep track of MAC addresses for the local segment. If there was a router between the Windows and Beaglebone LAN segments, the router would be responsible for ARP to the Beaglebones, thus possibly avoiding the broken waits on the Windows side.

2020-08-04 Reply Admin

Such behavior makes me rage instantly, if a machine, lib or some tech in any form believes to know it better than you. There is no problem with some reasonable fallbacks as default but give me an option to disable them for fucks sake.

QFT! I feel like I end up screaming Stop helping me! every time I'm forced to do anything in Microsoft-land. If I tell you to do something, watch that one Shia LaBoeuf video and just do it!

2020-08-04 Reply Admin

In this case it is DotNet class UDPClient. Based on documentation it seems to be intended for relatively simple bidirectional communication.

2020-08-04 Reply Admin

Couldn't you...um...choose to handle the result of the async call?

Also, isn't it unwise to write software for platform A but never test it on platform A until you're rolling it out?

2020-08-04 Reply Admin

This is not a WTF. This is why I still love love working in my field. Strange things happens and I want to know exaxtly why. Some times I figure it out. Thanks for a good story. Remy!

Remy Porter · 2020-08-04 Reply Admin

It still takes 3 seconds to fail!

And yes, we absolutely should have been doing more testing on Windows earlier, but we also didn't have all (or at points, any) of the Beaglebones set up, so we likely wouldn't have put IP addresses for them into our sending table, so we wouldn't have been sending to down nodes, and thus wouldn't have detected it till about when we did anyway. "Put known bad IP addresses" will be part of our development process now, though!

2020-08-04 Reply Admin

I can't see how that timeout is a problem, if you run 90+ tasks concurrently then by the time you have 90 new tasks the first 90 will have finished already.

2020-08-04 Reply Admin

I can't see how that timeout is a problem, if you run 90+ tasks concurrently then by the time you have 90 new tasks the first 90 will have finished already.

The problem comes that new tasks are being created much faster than the previous tasks have a hope of finishing - it's a formula to endlessly add tasks until memory or other limitation occurs.

stoborrobots · 2020-08-04 Reply Admin

I haven't written UDP code in almost a decade, but taking a look at the documentation for UdpClient, I think I may have noticed something... (I'm sure you tried it already, but it occurred to me, so I thought I'd ask about it...)

The code sample you provided showed:

await client.SendAsync(packet, packet.Length, hostip, port);

Using SendAsync(Byte[], Int32, String, Int32) appears to be explicitly suggesting to the client that it has a different IP/port each time. This may have been prompting the ARP wait each time.

If instead you had an array of clients already Connected to an IP/port combination, then you could just SendAsync(Byte[], Int32) which should suggest to the client that it knows where to send the packet already. (I say should - I haven't tested this...)

Something like:

UdpClient[] clients = new UdpClient[30]

...

int i = 0;
foreach (string host in hostnames)
{
    clients[i++].Connect(host, port)
}
... 

await clients[j].SendAsync(packet, packet.Length);

```

(Longest markdown fragment I've typed on TDWTF - and it includes brackets inside links... Let's see how this goes!)

2020-08-04 Reply Admin

I would say this isn't a WTF either, it's the system bending over backwards to try and make things work for you. Look at the opposite of this, Firefox (and possibly other web browsers), which directly expose the results of network operations to users, handing forever if a single packet is lost, just displaying a spinner on the tab that'll run for hours, days, weeks unless you close the tab. This is why we have automated systems that don't mind mindlessly trying fallback options, they can deal with the faults for you rather than exposing them to you an expecting you to manage them.

For an older example of this, the Unix EINTR is another notorious example. "We couldn't be bothered dealing with this ourselves, let's force every single program that ever calls this function to build in its own error handling for it rather than doing it once inside the function itself".

2020-08-05 Reply Admin

I think I would have tried something like one process on a loop trying to resolve ARP and the main application sending only to the IP where the MAC is already known.

2020-08-05 Reply Admin

You missed my point, if you have enough concurrent tasks the rate of completion matches the rate of creation, and enough is 90 because the rate of completion is the number of concurrent tasks divided by average runtime (3 seconds), so this gives you 30 tasks/second.

This balances itself as well if you accumulate 100 concurrent tasks because of some slow down then for the next few seconds the rate of completion will be faster than the rate of addition. Those are I/O bound tasks so a concurrency of 90 should be fine, the TPL can handle that.

2020-08-05 Reply Admin

Looking at the UDP class I see the problem, the IP resolution within SendAsync is done synchronously so the only way to have 90 concurrent tasks would be creating 90 threads which is insane anyways, the actual solution would have been trying to resolve the IP somewhere else at a manageable rate instead of letting SendAsync do it.

2020-08-05 Reply Admin

That's fine if your framework affords that kind of low-level access to the network stack, but it only solves this specific problem. There are plenty of reasons that the frame processor subsystem might want to go faster than the networking subsystem. IMO the right fix is to decouple these two subsystems by placing a buffer between them.

2020-08-05 Reply Admin

No need to use broadcast addresses. IP has multicast addresses and many services already use them (e.g., mDNS - guess what the 'm' stands for!). You configure the device to listen to a multicast IP, and you blast your IP packets (UDP, of course, since TCP doesn't handle multicast) at those IPs.

The IP stack filters out broadcast packets and routes the packets to the right processes automatically. As far as the client application knows, i just opened a UDP socket and is getting specific UDP packets.

Physically on Ethernet it appears as a multicast packet as well but the OS and drivers properly set up the multicast MAC filters so they're not bombarded with packets for other hosts. Quite a neat system and you don't get bogged down with stuff like ARP.

2020-08-05 Reply Admin

Well now.

I was once in charge of a four week project at Major Card Issuing Company Redacted, and one of my fifteen minions claimed that the C compiler was broken in a particular case. I had no better task to put him on, and besides I had to deal with the other fourteen minions, so I just told him to spend four weeks writing up test cases for how the C compiler was broken.

And anybody who tells me that a decent GC language leaks memory ... I feel the same way about that, twenty years on.

The libraries might leak, owing to P-Invoke and marshalling and what-not (translate to Java if you wish). But the garbage collector won't.

Or maybe I'm wrong. Maybe I haven't seen it yet (ten or twelve years and ongoing).

What tends to happen is that idiots don't understand how GC works, and claim things like (in C#) "structs are on the stack and classes are on the heap." Well, no, and who cares? The C# compiler will put structs wherever the heck it want's to put them. And the point about the heap(s) is that we are no longer dealing with a mark-and-sweep GC -- we are dealing with a generational GC. (There's also built-in OS cases for large allocations, but that's a different thing and indeed can go badly wrong).

Generational GCs are based on the fact that 90+% of "referenced" memory dies very quickly. Sure, you can get garbage collection pressure at various points, but as with everything else these days, you'd better damn well measure and prove it. Typically, you just throw the last generation away.

Sorry: TL;DR. It's yet another of my little beary bugs.

2020-08-05 Reply Admin

"An error which, as I stated before, we were ignoring, because we didn't care."

Well, your flavor of ice-cream against my flavor of ice-cream, and I hate ice-cream anyway, because it makes my toe-nails hurt.

But seriously. If you want to claim to be a professional -- YOU SHOULD CARE. You're calling into a network stack, and you can't be bothered to deal with timers, events, state machines, and all that stuff?

Your choice, I think. But you might actually consider dealing with the error coming from down the stack in unmanaged code.

2020-08-05 Reply Admin

" The await keyword is syntactic sugar which lets .NET know that it can hand off control to another thread while we wait. While we await here, we don't actually await the results of the await, because again: we don't care about the results of the operation. Just send the packet, hope for the best."

I love this. "Syntactic sugar" === something that the compiler rewrites and I don't need to understand. "We don't actually await" === a bit of syntactic sugar that we might actually care about, if we were sugar-averse, but since we're not, we're gonna order a 40 ox cane sugar smoothie!

"Hope for the best." Well, yes, Remy, good luck with that. We've had, what, seventy odd years since Alan Turing and Alonzo Church began to explore how primitive computing could deal with failure, and "hoping" is all you've got?

Jeez. At least stop blaming everybody else but yourself.

2020-08-05 Reply Admin

Indeed, you should not be bothered to deal with timers, events, state machines, and all that stuff. That's the job of the NETWORK STACK. (See, I can use all-caps too, so I must be right.) That's why it's a network stack and not a driver for your ethernet adapter.

You have a simple need, send this packet to that device if possible. If not possible, tough luck. No amount of error checking etc. will change this fact. Of course, you do health checking, either separately like Remy did, or integrated with the packets as I suggested, but even then you need to wait for the response, so error checking the sending is still pointless. Normally, I'm a staunch advocate of error-checking, but this is really a case where it hurts more than in helps. (Indeed, my code for a similar case (though not involving a chandelier :) is one of the few cases where I justifiably ignore some exceptions, namely those raised by the UDP sender.)

But I must correct myself in one point: TRWTF this time are the comments (some of them): Reading all the suggested "solutions" makes me sick: Buffers (more latency, another source of bugs), buffer machine (more latency, source of bugs, source of hardware failures), broadcast or multicast (don't get me started). So much overengineering, and none of it will change the basic fact that you send the packet if possible and don't if not.

2020-08-06 Reply Admin

and none of it will change the basic fact that you send the packet if possible and don't if not.

Isn't the problem that it's being sent either way?

Whilst I understand the reason behind the fire and forget approach, it still seems odd that the hub isn't doing a periodic healthcheck to see the nodes are actually there and appear to be able to receive what is sent, and log an error if needed. Of course it wouldn't fix this, but I imagine it would have helped diagnose it quicker and provide a route to mitigating the problem.

A Massive Leak

Leave a comment on “A Massive Leak”