• (disco)

    Given this input array (Malcolm,Ermelinda,Annika,Jesusa,Honey,Romelia,Dorene,Alvaro,Charmaine,Georgann,Troy), how many String instances does this code construct and throw away for concatenation?

    Way, way too much. :facepalm:

  • (disco)

    Bonus WTF: How many times does it recalculate name.Length - 1 and name[i].Length - 1?

  • (disco)

    I'll additionally posit the author of that code was probably not capable of typing 10-finger blind, or he'd have put the string in alphabetical order.

  • (disco)

    I thought that to keep regular you just had to follow this guide?

  • (disco)

    A WTF which includes regular expressions but that the regular expression is not the WTF?!?!? What's next? Fixing kernel bugs with Visual Basic?

  • (disco) in reply to Fox
    Fox:
    How many times does it recalculate ```name.Length - 1``` and ```name[i].Length - 1```?

    And how expensive do you think those are? The concatenation problems are more of an issue. The repeated linear searching of a string is just extra fun, as nobody has ever heard of character properties and could hardly ever conceive of implementing them by an efficient lookup table…
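    Something like this, to make the point concrete (a minimal sketch; the class and method names are made up, and the allowed set is assumed to be the ASCII letters, digits and underscore):

    static class AllowedChars
    {
        // Build a 128-entry table once; a membership test becomes an array
        // index instead of a linear scan over a 63-character string.
        private static readonly bool[] Table = BuildTable();

        private static bool[] BuildTable()
        {
            var table = new bool[128];
            foreach (char c in "QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm0123456789_")
            {
                table[c] = true;
            }
            return table;
        }

        public static bool IsAllowed(char c)
        {
            return c < 128 && Table[c];
        }
    }

    With that in place, the inner test becomes AllowedChars.IsAllowed(stringIn[i][j]) instead of re-scanning the whitelist string for every character.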

  • (disco) in reply to dkf
    dkf:
    And how expensive do you think those are?

    Compared to everything else, not very, but .Length and -1 are both unnecessary calculations which further add to the performance hit caused by the primary :wtf:, and these unnecessary calculations could be easily avoided. That sort of thing is a pet peeve of mine.

  • (disco) in reply to Fox

    If the compiler knows the length never changes, surely it optimizes it to cache the result of the calculation?

  • (disco) in reply to LB_
    LB_:
    If the compiler knows the length never changes, surely it optimizes it to cache the result of the calculation?

    According to the link that I provided:

    Input data in benchmark: C#
    
    // Example string.
    string one = "one" + 1.ToString();
    
    // Cached length.
    int cl = one.Length;
    
    Code benchmarked in loops, 1000000000 iterations: C#
    
    // A.
    int len = one.Length;
    if (one[len - 1] == 'x')
    {
    }
    
    // B.
    if (one[cl - 1] == 'x')
    {
    }
    
    Results
    
    Access Length repeatedly: 1305 ms
    Hoist Length out of loop:  646 ms [faster]
    

    ...Though I now see a :wtf: in that benchmark, depending on how well C# handles int declarations versus just accessing one.Length directly.
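    For what it's worth, here's roughly how one could re-run that comparison with Stopwatch; the harness and the iteration count below are my own guesses, not the linked benchmark's:

    using System;
    using System.Diagnostics;

    class LengthBenchmark
    {
        static void Main()
        {
            string one = "one" + 1.ToString();
            int cl = one.Length;                 // hoisted/cached length
            const long iterations = 1000000000L;
            int hits = 0;                        // used so the comparison isn't dead code

            var sw = Stopwatch.StartNew();
            for (long i = 0; i < iterations; i++)
            {
                int len = one.Length;            // A: read Length every pass
                if (one[len - 1] == 'x') hits++;
            }
            Console.WriteLine("A (Length each pass): " + sw.ElapsedMilliseconds + " ms");

            sw.Restart();
            for (long i = 0; i < iterations; i++)
            {
                if (one[cl - 1] == 'x') hits++;  // B: use the pre-cached length
            }
            Console.WriteLine("B (cached length):    " + sw.ElapsedMilliseconds + " ms, hits=" + hits);
        }
    }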

  • (disco) in reply to LB_

    Does the C# language definition exclude all ways the string (and hence its length) might be modified while the loop is running?

  • (disco) in reply to PleegWat
    PleegWat:
    Does the C# language definition exclude all ways the string (and hence its length) might be modified

    yes. string is immutable.

  • (disco) in reply to LB_
    LB_:
    If the compiler knows the length never changes, surely it optimizes it to cache the result of the calculation?

    A “sufficiently good” compiler could do that — the real semantics would support it safely — but the tricky bit would be telling the compiler that the length really really is immutable. Also, I've no idea how much work has been put into hoisting such things in the Microsoft C# compiler; there are other, more important, fish to fry. A heck of a lot bigger win would be detecting the “build a string by appending without really doing so properly” case and optimising that. That would do an awesome amount to improve the speed of user code as it is depressingly common.

    It's definitely possible to do the analysis and determine that this sort of thing is applicable. I've seen code that does things of equivalent complexity when compiling other languages…

  • (disco)

    So:

    1. What does "[^\w]" do about "wędzony łosoś"?
    2. What does the code in question do about "wędzony łosoś"?

    Everybody will know the answer to (2). Not everybody will necessarily remember the answer to (1) offhand. And what if (2) is what was really wanted?

    Self-documentation is valuable even at the price of inefficiency.
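    A quick sketch of the difference, assuming .NET's default Unicode-aware \w; the expected outputs are in the comments and worth double-checking:

    using System;
    using System.Text.RegularExpressions;

    class WedzonyLosos
    {
        static void Main()
        {
            string input = "wędzony łosoś";

            // (1) \w in .NET is Unicode-aware by default, so the accented letters
            // survive and only the space goes.
            Console.WriteLine(Regex.Replace(input, @"[^\w]", ""));   // wędzonyłosoś

            // (2) The ASCII whitelist drops ę, ł, ś and the space as well.
            string allowed = "QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm0123456789_";
            string kept = "";
            foreach (char c in input)
            {
                if (allowed.IndexOf(c) >= 0)
                {
                    kept += c;
                }
            }
            Console.WriteLine(kept);                                 // wdzonyoso
        }
    }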

  • (disco)

    Also, let's not forget the whole mess with j <= names[i].length - 1, which is (at least logically, in a loop context) identical to j != names[i].length.

    At least this is the case in Perl (and I think in other C-derived languages as well):

    perl -e 'for ($x=0; $x != 5; $x++){print "$x\n"}'
    0
    1
    2
    3
    4

    perl -e 'for ($x=0; $x <= 4; $x++){print "$x\n"}'
    0
    1
    2
    3
    4

    What remains is if this makes any difference in runtime or not (and considering the create and throw away strings in this code, I guess not... :smile: ).

  • (disco) in reply to Unthinking

    Well then use a regex like "[^QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm0123456789_]" instead of "[^\w]" if you don't know what \w means. But PLEASE don't loop through every character and concatenate hundreds of immutable strings together, creating even more immutable strings.

  • (disco) in reply to Yazeran

    What if someone stupid places a j++ at the end of the loop block for whatever reason (e.g. to skip a character)? Suddenly j > names[i].length - 1, and != won't catch that but <= will.
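    A minimal sketch of that hazard in C# (the input and the backslash-skipping j++ are purely hypothetical):

    string name = "Alvaro\\";            // hypothetical input ending in a backslash

    // With != as the condition, an extra j++ can jump straight past the end:
    // when the '\' is the last character, the body bumps j to name.Length, the
    // for-increment makes it name.Length + 1, the != test is still true, and
    // name[j] throws IndexOutOfRangeException on the next pass.
    for (int j = 0; j != name.Length; j++)
    {
        if (name[j] == '\\')
        {
            j++;                         // "skip the next character"
        }
    }

    // With < (or equivalently <= name.Length - 1) the overshoot just ends the loop.
    for (int j = 0; j < name.Length; j++)
    {
        if (name[j] == '\\')
        {
            j++;
        }
    }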

  • (disco) in reply to TheDailyBread

    Or they could have used j < names[i].Length, which is usually a lot more common in C-style for loops.

  • (disco) in reply to Fox
    Fox:
    How many times does it recalculate ```name.Length - 1``` and ```name[i].Length - 1```?
    Once for the array and once for each string. So?

    https://dotnetfiddle.net/hOxdEC

  • (disco)

    Anyone who doesn't simply do i < length is suspect.

    I prefer to list the characters in regex (the range support keeps it small but readable) unless there is a really good reason to use something such as \w because of being anal about security. Specify exactly what you want. I rarely use something like that but it does come in useful and convenient with intl sometimes. Posix properties especially.

  • (disco) in reply to TheDailyBread

    Why not [^A-Za-z0-9_]-berg?

    Also:

    And how expensive do you think those are?

    It is harder to read, with more operations and more code. Also, even if it's a tiny micro-optimisation, it isn't one when you simply write it the fast, simple way by default. Things add up: if you're writing code that's twice as slow everywhere, your program ends up taking twice the cycles everywhere, and sooner or later those unnecessary slowdowns get multiplied. People who add extra complexity that slows things down, has no benefit, and isn't even a shorthand, I half suspect of creating small things to optimise later so they can keep their job. Write a loop like that all the time and you'll soon find one that iterates a billion times and benefits from this optimisation. I wouldn't go crazy with premature micro-optimisation, but things like this should be written appropriately in the first place.

    A “sufficiently good” compiler could do that

    But you shouldn't rely on the compiler when you shouldn't have to as a rule of thumb.

  • (disco) in reply to Fox
    Fox:
    According to the link that I provided:

    Synthetic benchmarks are bad, but any benchmarks with no environment information are absolutely worthless. There is no difference between the two on a current CLR.

    Either way it comes down to function call overhead (since Length is a property), because the length itself is never calculated on the fly (String has been immutable from the beginning). The subtraction part is really not even worth thinking about. Know your things etc.

  • (disco) in reply to Choonster

    Of course, but we're talking about :wtf:s here, not clean code.

  • (disco) in reply to isthisunique

    Again, :wtf:-ery, not clean regex. If he doesn't trust \w to include some (human) language-specific characters, what makes you think he trusts A-Z not to include ÄÖÜ or something like that?

  • (disco) in reply to TheDailyBread

    This reminds me of a good point actually. I test and train people. As a rule of thumb, validate your input, don't corrupt it. Sometimes you have to go the sanitisation route, but others do it out of a misguided sense of convenience or a belief that something must always work. Another rule is that if you must sanitise, try to do it only on display output, or consider whether you need to retain your source data in case sanitisation gets it wrong. A big focus is around preserving data integrity.

    The WTF sample is potentially bad because it might be doing sanitisation (lossy corruption) where validation would do. In one test I ask people not to allow special characters. If this WTF were validation, it would handle that in the way most likely to be right for any given scenario, but we don't know the context so we can't be sure. In validation it is usually safer to deny too much than not enough, especially when feedback is given, allowing the user to work around the problem or raise an issue. You can also offer a sanitised version as a suggestion if it is user input rather than an external data source.

    The funny thing about the validation test is that everyone thinks special characters are just a few, because the English language gives that impression (special = rare) and so does the keyboard, and they always create huge lists of special characters (all the punctuation on the keyboard, special accented language characters, normally no more, often in the wrong charset so they are just dumping arbitrary binary data into the regex). Very few realise they should instead only allow the characters that are not special.

    A-Z introducing characters not in the ASCII range would be weird. Some level of mistrust is good for writing more predictable, reliable and secure code. However, with too much of it, someone would not be able to write code at all; it would be too slow. It normally indicates a lack of knowledge or deeper understanding when you see overly cautious code. Normally you try to maintain enough control over your environment to not have crazy internal variability :D. So you wouldn't set your regex interpreter to anything other than ASCII or some form of Unicode, and you wouldn't have the interpreter compile your code as anything but ASCII or UTF-8. I have never knowingly used a regex implementation whose character handling isn't a superset of ASCII, and I doubt many exist. If you needed something else it would be converted by the program according to need.

  • (disco) in reply to isthisunique

    How many check with you to verify that the characters you regard as "special" are the same ones they regard as "special"?

  • (disco) in reply to isthisunique
    isthisunique:
    But you shouldn't rely on the compiler when you shouldn't have to as a rule of thumb.

    I'm having problems assigning a meaning to that sentence. It's almost meaningful, but not quite…

    In reference to what I was talking about, an optimisation which was pointed at making the concatenate-in-a-loop antipattern less horribly bad would improve a lot of user code. Yes, the compiler is not obligated to do such a thing, and there's a better solution already that better programmers use, but there's a lot of crap programmers out there and this would make a big difference to their code. It's also a really constrained case that it would be possible to actually do something about. (Yes, I do know what sort of things compilers can do. This is one of them.)

    You might be arguing that compilers shouldn't be that smart. (I'm guessing that that's what you're talking about above, but I could be wrong as your sentence managed to be a bit slippery in meaning.) I understand such a position, but I know the shit that gets hacked together day in day out by the grunts in this industry: what I propose would definitely help.

  • (disco) in reply to dkf

    I'm having problems assigning a meaning to that sentence. It's almost meaningful, but not quite…

    I mean that in this specific case (loop condition) the compiler shouldn't have to fix it. The programmer should not have done it like that to begin with.

  • (disco) in reply to isthisunique
    isthisunique:
    I mean that in this specific case the compiler shouldn't have to fix it. The programmer should not have done it like that to begin with.

    OK, that's what I thought most likely, and responded to. In short, yes, but no. Because incompetent programmers.

  • (disco) in reply to Watson

    How many check with you to verify that the characters you regard as "special" are the same ones they regard as "special"?

    Not many professionally as it usually means the requirements and specifications aren't being done properly.

    In training assignments that opportunity is not usually given, because it is useful to test people's independent thinking about how to address the problem based on the context. If you don't specify it exactly, so there is no single right answer, you give them a chance to produce a better scheme than your own.

    To be fair though, more people should ask. A lot of people will work with unclear requirements and try their best without considering that there might be a problem.

  • (disco)

    how many String instances does this code construct and throw away for concatenation?

    Trick question. The code as presented doesn't compile.

    String.Contains does not accept a char parameter.

    Change it to

    if ("QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm0123456789_".Contains(stringIn[i][j].ToString()))

    and it works.

  • (disco)

    Even assuming a programmer was forced to loop through this string character-by-character, that's the whole reason for StringBuilder.
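    A rough sketch of the difference, with char.IsLetterOrDigit as a made-up stand-in for whatever filter the original loop applies (it's Unicode-aware, unlike the article's ASCII list):

    using System.Text;

    static class Filters
    {
        // Antipattern: every += allocates a brand-new string and copies
        // everything accumulated so far.
        public static string FilterConcat(string name)
        {
            string result = "";
            foreach (char c in name)
            {
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    result += c;
                }
            }
            return result;
        }

        // With StringBuilder the kept characters go into one growable buffer
        // and a single string is materialised at the end.
        public static string FilterBuilder(string name)
        {
            var sb = new StringBuilder(name.Length);
            foreach (char c in name)
            {
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    sb.Append(c);
                }
            }
            return sb.ToString();
        }
    }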

  • (disco) in reply to TheDailyBread
    TheDailyBread:
    What if someone stupid places a j++ at the end of the loop block for whatever reason (e.g. to skip a character)? Suddenly j > names[i].length - 1 and != won't catch that but <= will.

    At the additional cost of an extra comparison to avoid out of range access to an array. Maybe unless you're 102% sure the string is \0-terminated and nobody will ever add another j++ to the loop body.

    (I love when a dialect considers all loop variables immutable within the loop.)

    Joe_D:
    String.Contains does not accept a char parameter.

    In C# it does.

    (The code compiles without change.)

  • (disco) in reply to PleegWat

    This was my favorite part. He added to the obscurity by managing to even go about typing out all the letters in a way that made it unclear if he skipped one or not.

  • (disco) in reply to accalia
    accalia:
    PleegWat:
    Does the C# language definition exclude all ways the string (and hence its length) might be modified

    yes. string is immutable.

    Sure, the string itself is immutable, but a string variable can be changed to be a different string, so stringVar.Length can change from iteration to iteration.

  • (disco) in reply to Steve_The_Cynic
    Steve_The_Cynic:
    Sure, the string itself is immutable, but a string variable can be changed to be a different string, so stringVar.Length can change from iteration to iteration.

    If only the compiler had some way to know whether a local variable was assigned to during its way round the loop. If only it were possible to tell whether that might have happened…

  • (disco) in reply to dkf
    dkf:
    Steve_The_Cynic:
    Sure, the string itself is immutable, but a string variable can be changed to be a different string, so stringVar.Length can change from iteration to iteration.

    If only the compiler had some way to know whether a local variable was assigned to during its way round the loop. If only it were possible to tell whether that might have happened…

    Yeah, that one's a tough problem. I've heard that there's some guys at MIT working on it, and they might have a solution in a couple of years.

    :scream:

  • (disco)

    Here’s a fun question. Given this input array (Malcolm,Ermelinda,Annika,Jesusa,Honey,Romelia,Dorene,Alvaro,Charmaine,Georgann,Troy), how many String instances does this code construct and throw away for concatenation?

    In C#, 62.

  • (disco)

    There's a worrying undercurrent of :wtf: here which I'm having difficulty accepting on a philosophical level. It goes: "It doesn't matter if my code is terminally appalling, because the optimiser will iron out all the kinks and make it as efficient as if I'd written it properly in the first place."

    Yes I know I've deliberately misrepresented the intent of some of the above, but it's instructive to see where we end up.

    If the compiler did not do any optimisation, then any bad code would soon be caught by the fact that it takes too long to run. But with optimisers to take the sting out of some of the unthinking sillinesses perpetrated by incompetent, naive and/or insouciant programmers, the accumulated codebase of the universe cannot help but be compromised.

    While the above may be viewed as a naive and unrealistic view of the industry, I still believe it needs to be borne in mind.

  • (disco)

    At first, I thought TR :wtf: was going to be that the author missed a letter in their blacklist string (but turns out they got all of [A-Za-z0-9_]).

    Unthinking:
    What does "[^\w]" do about "wędzony łosoś"? What does the code in question do about "wędzony łosoś"?

    Unicode is hard.

    isthisunique:
    The funny thing about the validation test is that everyone thinks special characters are just a few, because the English language gives that impression (special = rare) and so does the keyboard, and they always create huge lists of special characters (all the punctuation on the keyboard, special accented language characters, normally no more, often in the wrong charset so they are just dumping arbitrary binary data into the regex). Very few realise they should instead only allow the characters that are not special.

    Yup. Turns out, in a lot of contexts (especially security), whitelists are better than blacklists. Don't try to enumerate all of the bad things, you'll undoubtedly fail.
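    A hedged C# example of the whitelist-as-validation idea (the class, pattern and error message below are illustrative only):

    using System.Text.RegularExpressions;

    static class NameValidation
    {
        // Validate rather than sanitise: accept only strings made entirely of
        // characters we explicitly allow, and reject everything else instead
        // of silently rewriting the input.
        private static readonly Regex ValidName = new Regex("^[A-Za-z0-9_]+$");

        public static bool TryAcceptName(string candidate, out string error)
        {
            if (ValidName.IsMatch(candidate))
            {
                error = null;
                return true;
            }

            error = "Names may only contain letters, digits and underscores.";
            return false;
        }
    }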

  • (disco) in reply to Quite
    Quite:
    There's a worrying undercurrent of :wtf: here which I'm having difficulty accepting on a philosophical level. It goes: "It doesn't matter if my code is terminally appalling, because the optimiser will iron out all the kinks and make it as efficient as if I'd written it properly in the first place."

    If there's no performance difference then performance is not a relevant factor when comparing two approaches to implementing something. "Properly" is not absolutely defined.

    Quite:
    If the compiler did not do any optimisation, then any bad code would soon be caught by the fact that it takes too long to run. But with optimisers to take the sting out of some of the unthinking sillinesses perpetrated by incompetent, naive and/or insouciant programmers, the accumulated codebase of the universe cannot help but be compromised.

    This is a dumb argument. If a given implementation has an optimiser, then there is no point inventing a hypothetical situation in which it doesn't. And if it doesn't, then it likely already incurs enough overhead on its own to make micro-optimisation doubly useless, and/or it is not used in a context where shaving every possible microsecond is important.

    Quite:
    If the compiler did not do any optimisation, then any bad code would soon be caught by the fact that it takes too long to run.

    Except that's demonstrably untrue. We're not running 386s any more. And actual high-performance computation code does way more than optimisers can do automatically at the moment (and that's things like vectorisation).

    The fact is, there's not many applications that need optimisations in which the difference is measured in cycles. The code in OP is algorithmically bad and that's MAYBE the reason to optimise it, and not that oh no it calculates length - 1 more than once. And I say maybe because it might not be worth it. Especially if it's surrounded by I/O.

    Profile first, dammit.

    Quite:
    While the above may be viewed as a naive and unrealistic view of the industry,

    Not naive, just woefully outdated.

  • (disco) in reply to CatPlusPlus
    CatPlusPlus:
    Not naive, just woefully outdated.

    In other words, shut up grandad, we write shit and are proud of it.

  • (disco) in reply to Quite

    If you want to ignore all of the other words I typed, sure.

  • (disco) in reply to CatPlusPlus
    CatPlusPlus:
    If you want to ignore all of the other words I typed

    Welcome to TDWTF.

  • (disco) in reply to PWolff

    I didn't in my copy of VS 2013...

  • (disco) in reply to Quite
    Quite:
    If the compiler did not do any optimisation, then any bad code would soon be caught by the fact that it takes too long to run.

    Where "bad code" is at least partly defined as code written without respect to the compiler's implementation details.

    Most major performance problems IME have pretty near 0% to do with particular compiler optimizations.

  • (disco) in reply to Protoman

    Yup. Turns out, in a lot of contexts (especially security), whitelists are better than blacklists. Don't try to enumerate all of the bad things, you'll undoubtedly fail.

    Indeed. A bit of a trick question for a training assignment, one of my favourites. The wrong answer is nearly always deny, allow (blacklist) rather than allow, deny (whitelist). Another correct answer is to complain about the lack of a specific specification. The best way to explain it seems to be to tell people to think of it like a shopping list. In that scenario, would you create a list containing all of the items you don't want to buy in the shop?

    A real WTF would be the same code sample, 100% functionally correct and equivalent, but with a list of all the disallowed characters. I am sure there has to be one of those in the dailywtf archives. In training assignments, when you see the blacklist getting bigger than the whitelist would be, you start to laugh.

    The only real problem I see with the WTF code at a glance is that it will allow strange names, and that it might be slower than it could be because of a lack of what appears to be common knowledge: a bad loop condition, Contains is slow (O(n), could be O(1) with a LUT), and no StringBuilder. Regex patterns can be pre-compiled in most implementations as well, so the regex probably wouldn't raise performance concerns in most scenarios.
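    For instance, a minimal sketch of a pre-compiled pattern in C# (the class, field and method names are made up):

    using System.Text.RegularExpressions;

    static class NameCleaner
    {
        // Compile the pattern once (RegexOptions.Compiled emits IL for it), so
        // each call pays only for the match, not for re-parsing the pattern.
        private static readonly Regex NonWord = new Regex(@"[^\w]", RegexOptions.Compiled);

        public static string StripNonWord(string name)
        {
            return NonWord.Replace(name, "");
        }
    }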

  • (disco) in reply to isthisunique
    isthisunique:
    The wrong answer is nearly always deny, allow (blacklist) rather than allow, deny (whitelist).

    Wait, what? "Deny all characters except ABC" is a whitelist. "Allow all characters except DEF" is a blacklist.

  • (disco) in reply to Yamikuronue

    Wait, what? "Deny all characters except ABC" is a whitelist. "Allow all characters except DEF" is a blacklist.

    It's from Apache httpd, but it's also a basic principle for firewalls in general (the default action). It really just means allow [list], then deny everything else. Everything in security has to be obscure and confusing (misdirection is key), even the instruction manual. I don't know why. Have you seen the source code for some cryptographic libraries? When people go into security mode it leaks out everywhere and over everything. Ironically, it's probably one actual cause of the bugs that cause security holes: "secure-obscure code" (security by obfuscation) tends to hide the bugs from being fixed, rather than from hackers.

  • (disco) in reply to isthisunique
    isthisunique:
    It's from Apache httpd

    Yeah, I recognised the syntax.

    isthisunique:
    It just means really allow [list] then deny everything else.

    Something like:

    Order deny,allow
    Deny from all
    Allow from dev.example.com
    

    Right? Where you literally have the string deny,allow. It's a whitelist ;)

  • (disco) in reply to Yamikuronue

    I see, they have a third option which I don't use.

    Allow,Deny
    First, all Allow directives are evaluated; at least one must match, or the request is rejected. Next, all Deny directives are evaluated. If any matches, the request is rejected. Last, any requests which do not match an Allow or a Deny directive are denied by default.

    Deny,Allow
    First, all Deny directives are evaluated; if any match, the request is denied unless it also matches an Allow directive. Any requests which do not match any Allow or Deny directives are permitted.

    "First, all Allow directives are evaluated; at least one must match" -> "any requests which do not match an Allow or a Deny directive are denied by default."

    That's all I use. I suppose the middle denies are to apply a blacklist to the whitelist.
