The Daily WTF: Curious Perversions in Information Technology

JBert · 2015-06-02 Reply Admin

PaulaBean:
Discussion for the article: http://thedailywtf.com/articles/sleeping-in

Minor comfort: the code might be woefully inaccurate / unreliable, but at least it doesn't have a concurrency bug...

Buddy · 2015-06-03 Reply Admin

Oh man. Knowing what the contract on sleep usually is, I can't wait to see the comments on this one.

loose · 2015-06-03 Reply Admin

The code is almost certainly a WTF. But TRWTF is it (apparently) survived for a "decade" as Production Code (either that or expecting us to believe it).

Whilst the code may be correctly identified as to the "reason", the analysis as to the "why" leaves something to be desired.

Assumptions (based on code and provenance):

sleep(2) = "sleep" for 2 seconds.
% = modulus.
times.now.t.tv_sec produces an integer that represents time in seconds.
times.now.t.tv_sec % FULL_SAMPLE_DELAY produces something (of integer value between 1 and 59) or nothing (0) and thus TRUE or FALSE. Resulting in NOT TRUE being FALSE (do Quick Log) and NOT False being TRUE (do Main Log) [What was that someone (recently) said about using a double negative boolean?].

Best Case: 50/50 Chance of the Main Log being produced because:

The test is made every 2 seconds.
There is a 50/50 chance that times.now.t.tv_sec will be even (relative to when the process was started and there is no indication that there is any control over said start time).

This requires these further assumptions:

The loop takes less than 2 (and a bit) seconds to complete.
The sleep command is watching the "second hand" and not the "millisecond hand" (or faster). The justification of which is that there is no mention of getting 29 instead of 30 Quick Logs etc. -There is no reason, or indication, that this process needs to wait for stats.save(mainlog,times) or quickstats.summarise(quicklog, quickstats_top_n). They could well be spawned / forked processes that would be almost instantaneous to execute

Worst Case:

A "creeping" time error that will cause the Main Log to be produced (and the Quick Log) at greater and greater intervals. But there is no indication of this.

There are variations and subtitles of the above, but I'll be damned if I am going to type them all out for you. Hopefully, I have given you two points in a range of possibilities that are sufficiently far apart to be considered "upper and lower limits".

Yes. I know. In a perfect world it would be 100% success or failure because times.now.t.tv_sec would always be even or odd (and let's face it. If it was permanently odd, would we be here discussing it?).

To reiterate, the code is possible a WTF and certainly it could be killed, or otherwise seriously destabilised, with anything above the power of a popgun. Another possible WTF (and thus allowing to survive undetected) was that was "test" once (if at all) and worked.

Assume that it is only a recent problem: I would suspect that after working fine for years, an external influence affected the physical logs. I.e. It took longer and longer to compile / produce them, so that they had to be queued. Something changed (or not) that resulted in some of the queued items being "chucked away". Because the Quick Logs are, by definition, smaller than the Main Logs. They would not have (or have not yet) been affected in the same manner.

Yawn.... bored... hmmmmm lunchtime.....

while (eternity) {
	count = 30
	while (count) { 
		sleep(2)
		doQuickie()
		count--
	}
	doBiggie()
}

dkf · 2015-06-03 Reply Admin

loose:
They could well be spawned / forked processes that would be almost instantaneous to execute

Those operations are definitely not always quick to execute. Any system call can take an appreciable amount of time, but thread/process spawning is one of the prime candidates for really doing so, especially when under load…

loose · 2015-06-03 Reply Admin

Pedantic Fail accepted :grinning:

I was trying to allude to a concept as simply and as broadly as I could. So, to :persevere: I will try some other allusion such as: "Fire and Forget" - whatever that might translate to in the coding heaven (or hell) of your choice.

PleegWat · 2015-06-03 Reply Admin

sleep() will typically sleep for at least 2 seconds. Might be more, depending on scheduling. Might be less if a signal arrives betweentime. Notably the duration of the loop is the 2 second sleep, plus however long the rest of the loop takes.

aliceif · 2015-06-03 Reply Admin

@remy, @pjh

PJH · 2015-06-03 Reply Admin

This topic is now sleepwalking. It might appear in the full log.

foxyshadis · 2015-06-03 Reply Admin

loose:
Assume that it is only a recent problem: I would suspect that after working fine for years, an external influence affected the physical logs

I just assume that no one cared enough, and everyone unable to debug because of the bug screamed impotently to the heavens (or Microsoft). For any problem, most people who want to fix it can't, and most people who can fix it won't, so your company probably has to land intern D.H. before anything ever actually works right.

loose · 2015-06-03 Reply Admin

I would totally agree with you, in principle, in that a fresh pair of eyes - meaning somebody new to the Code at whatever skill / pay level will see the blindingly obvious. The fun then becomes the game of determining the extent of the WTFery:

Is it the result of some (misguided) policy?
Is it a "tradition" and as such is sacrosanct and should not be changed?
Is of such low priority, it will never be changed?
Or is self aware and adaptable, thus being able to hide or mutate to avoid detection?

Some of the above would result in a CLM, if "pushed" to hard by "the new kid on the block".

However, in this case the Code was "Inherited" and, as we know, the provenance and veracity of these "incidents" result in an evaluation (===) of true. I don't think any Interns were harmed or mentally scared whilst experiencing it :smiley:

DCRoss · 2015-06-03 Reply Admin

This reminds me of a monitoring solution I once saw for sale at a newly-invented-wheel-dealership.

Someone was faced with the nigh-insurmountable task of watching a log file on a remote server. It was pretty complex -- The log file was encoded in something called "ASCII", and each line started with a timestamp containing only the year, month, day, hour, minute and second with a decimal part which only went down to microseconds. Worse yet, there was no way of knowing when that line ended, unless you happened to catch on to the presence of a so-called "NEWLINE" character at the end.

Fortunately, our heroic developer had access to the most sophisticated monitoring tool ever invented. It ran on Windows and had a name something like "SCUMM", although I doubt that it did a very good job of playing Maniac Mansion. Since there was absolutely no way that there could have been any kind of native support for something as arcane as reading log files, it was time to do Something Clever(tm).

So, one minute after the previous check, like clockwork, a secure shell connection was opened. It called sudo to get the root privileges required to then call su to another user ID, and then finally called ls to check the modification time of the globally readable log file in question.

The mtime was then compared with the current time, and if the difference was less than sixty seconds, the monitoring system would sign in again, run sudo again, then su again, and finally use "tail -1" to gather all of the new lines added to the file since the last time.

Well, as long as there was only one of them. And it had to have been added within the last 59 seconds, even though the check interval was guaranteed to be longer than that. And don't even get me started on the incredibly unlikely chance that new messages would be added while the check was running. Or how the monitoring account was hooked in to active directory, and could take several seconds to authenticate each time because AD is really hard to use.

The truly amazing thing is that nobody could figure out how this mess was somehow sleeping in through critical alerts.

mruhlin · 2015-06-03 Reply Admin

I personally wouldn't categorize this as a WTF. Bugs happen, and I can see it as a totally reasonable one to have.
It probably survived as long as it did because none of the log's consumers actually needed up to the minute information, so getting it once a day or whatever was probably fine. And as long as it doesn't lose data during the minutes it skips (presumably it just dumps all the happenings since the latest attempt), it was apparently "good enough" for their use case.

RaceProUK · 2015-06-03 Reply Admin

mruhlin:
I personally wouldn't categorize this as a WTF

Relying on `sleep()` for accurate timing is as :wtf: as it comes

loose · 2015-06-03 Reply Admin

Yeah. The sleep() directive could be replaced with if (time.now.milliseconds() >= (time.then.milliseconds() + 720000)) then... to make it more accurate. And you could even put some fancy code in to modify the 720000 to account for the "ticks" used to process the functionality.

But wouldn't this sort of elevate the CPU priority of process [and the priority of the functionality], seeing as it would need to monitor it in pseudo real-time (possibly tipping it into an uncontrollable spin and [to mix my metaphors a bit] cause it to hang [in mid-air - so to speak and really WTFify the metaphor] ? :smiley:

RaceProUK · 2015-06-03 Reply Admin

loose:
The sleep() directive could be replaced with if (time.now.milliseconds() >= (time.then.milliseconds() + 720000)) then... to make it more accurate. And you could even put some fancy code in to modify the 720000 to account for the "ticks" used to process the functionality.

Not really; you could still miss the right time due to process scheduling. Then of course, you have the excessive CPU load it'd generate.

The right way to do it is to ask the OS to call your callback at the right time; it's the only way to get anywhere close to accurate.

kupfernigk · 2015-06-03 Reply Admin

RaceProUK:
The right way to do it is to ask the OS to call your callback at the right time

...which really requires a real time OS to do it right. However, if you are logging something every minute, real time accuracy almost certainly isn't needed. A lot depends on how long the 1 minute logging process takes to execute compared to the 2 second process; if it is more than 2 seconds we're in trouble. I personally would run them in separate threads; that to me is the wtf.

loose · 2015-06-03 Reply Admin

Hmmmm, dial down the sarcasm a bit - I'm sorry, did I say that with my "outside voice" lol.

Where do I need to send your leg? Seeing as I was pulling it and it came off in my hand.

I think I was aiming for something along the lines of "...agreed, but everything has it time and place. Given the (apparent) role of this code, I don't think it, or anything connected to it, (via cause and effect or synchronicity - not a spelling WTF, a dictionary WTF -- if it shows as such) gives a damn if it takes 1.75 or 2.25 mins to complete.

Nprz · 2015-06-04 Reply Admin

It seems like these problems are so easy to solve (at a detail that is sufficient). Just keep track of when you did the full log save last and if it has been 1 minute, do it on the next iteration and store that time. And if saving takes a long time, you could decide whether you want a 1-minute gap between them or 1-minute interval (e.g. whether to store the "time since last saved" before or after save).

And I'd say keep a sleep statement in there for simplicity (although keep it relative, 2 seconds minus time it took to do the save, to keep a ~2 second resolution on quick log).

Don't overcomplicate things, but at least do it so it works.

Sleeping In

Leave a comment on “Sleeping In”