The Daily WTF: Curious Perversions in Information Technology

2016-11-14 Reply Admin

Case L'Frist'

2016-11-14 Reply Admin

bool NeedEscape ( wchar_t c ) { return true; }

There, I fixed it

2016-11-14 Reply Admin

For that developer '/' needs to be escaped too...

2016-11-14 Reply Admin

Your documented API doesn't exist in Windows-in-general, only in desktop Windows. The function you have quoted automatically exists everywhere.

Your documented API comes with enough caveats and variations to sink a battleship. The function you have quoted is self-documenting.

The documentation of the API practically tells you not to use it:

"In Internet Explorer 4.0 and later, InternetCanonicalizeUrl always functions as if the ICU_BROWSER_MODE flag is set. Client applications that must canonicalize the entire URL should use either CoInternetParseUrl (with the action PARSE_CANONICALIZE and the flag URL_ESCAPE_UNSAFE) or UrlCanonicalize."

"InternetCanonicalizeUrl always encodes by default, even if the ICU_DECODE flag has been specified..."

...and so on.

2016-11-14 Reply Admin

If you have to roll your own, roll your own, but at least put em on separate lines (only one byte difference between a space and a CRLF). Alternatively, compare the Hex Values,

if (InChar >= Hex("0") and InChar <= Hex("9")) return False, if (InChar >= Hex("a") and InChar <= Hex("z")) return False, etc.

Rather than, for example, doing 66 compares before discovering that "Z" is OK.
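
For the C++ crowd, the range-compare idea could be sketched roughly as below. This assumes an ASCII-compatible wchar_t encoding (as on Windows) and assumes that only ASCII letters and digits are exempt from escaping; the name NeedEscapeByRange is made up for illustration and is not the function from the article.

```cpp
// Hypothetical range-based variant; assumes ASCII-compatible code points,
// where digits and each case of the alphabet are contiguous.
bool NeedEscapeByRange(wchar_t c) {
    if (c >= L'0' && c <= L'9') return false;  // digits
    if (c >= L'a' && c <= L'z') return false;  // lower-case ASCII letters
    if (c >= L'A' && c <= L'Z') return false;  // upper-case ASCII letters
    return true;                               // everything else gets escaped
}
```

With this shape, L'Z' is accepted after at most six comparisons instead of a long chain of equality tests.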

<sarcasm>Of course if you take the latter route, be sure to avoid any comments so that no-one can tell at a glance what/why you're doing..

<sarcasm>Or better yet, use a Regex

And I just realized it's C++ not C#, and googling says L("x") is already a Long Integer, and wchar_t is apparently something like nchar rather than char. So not being a C++ coder, it is truly one giant WTF to me. Does it even work?

2016-11-14 Reply Admin

switch/case is more efficient than if/else blocks, since even a simple compiler can optimize it down to two simple jumps by using a jump table or a perfect hash table.

2016-11-14 Reply Admin

"The function went character by character through the string, which was bad enough" any pattern matching function (regex also does by the way) goes character by character. Not being able to see it doesn't mean it does not happen.

Modern bytecode/IL/interpreted language guys, when are they gonna learn?

Steve_The_Cynic · 2016-11-14 Reply Admin

if (InChar >= Hex("a") and InChar <= Hex("z"))

This doesn't work on EBCDIC machines, since the letters are not contiguous.

2016-11-14 Reply Admin

TRWTF is EBCDIC.

2016-11-14 Reply Admin

@Steve_The_Cynic: Serious question; Are there any EBCDIC machines running Windows?

Steve_The_Cynic · 2016-11-14 Reply Admin

Not normally, but I didn't say "Windows machines that use EBCDIC." The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies. If someone gets into that habit, sooner or later he will write code that ends up being transported to an environment that uses EBCDIC, and people will wonder why '~' is classed as a lower-case letter, or that closing-brace and backslash are upper-case letters, but opening-brace is not.
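
To make that concrete, here is a small sketch that hard-codes a handful of EBCDIC code points (taken from code page 037, which is an assumption; other EBCDIC code pages move some punctuation around) and runs the naive range test against them. The code points are written as numbers so the example behaves the same on an ASCII machine, and the helper names are invented for illustration.

```cpp
// EBCDIC code points, assuming code page 037.
constexpr unsigned char kLowerA = 0x81, kLowerZ = 0xA9;  // 'a', 'z'
constexpr unsigned char kUpperA = 0xC1, kUpperZ = 0xE9;  // 'A', 'Z'
constexpr unsigned char kTilde      = 0xA1;  // tilde
constexpr unsigned char kOpenBrace  = 0xC0;  // opening brace
constexpr unsigned char kCloseBrace = 0xD0;  // closing brace
constexpr unsigned char kBackslash  = 0xE0;  // backslash

// The naive "is it between a and z" tests, applied to EBCDIC values.
constexpr bool NaiveIsLower(unsigned char c) { return c >= kLowerA && c <= kLowerZ; }
constexpr bool NaiveIsUpper(unsigned char c) { return c >= kUpperA && c <= kUpperZ; }

static_assert(NaiveIsLower(kTilde), "tilde sits in the gap between 'r' and 's'");
static_assert(NaiveIsUpper(kCloseBrace), "closing brace sits between 'I' and 'J'");
static_assert(NaiveIsUpper(kBackslash), "backslash sits between 'R' and 'S'");
static_assert(!NaiveIsUpper(kOpenBrace), "opening brace comes just before 'A'");
```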

2016-11-14 Reply Admin

EBCDIC machines have a gap between lower case SERIES and upper case SERIES, NOT within a-z or A-Z or 0-9.

and there would only be 3 compares, not 66: a-z, A-Z, and 0-9.

and for those in love with Case/Switch (which suck in ALL languages) just put the >="a" and <="z" in each case statement

and isn't it totally ass backwards? Why zero in on those that don't need escaping and escape all the others. Why not find ONLY those that NEED escaping presumably, ' " # & @ etc. depending on where you're going with them (SQL, URL, etc.)

2016-11-14 Reply Admin

Why not find ONLY those that NEED escaping presumably, ' " # & @ etc.

If it's a URL, you need to escape Unicode characters. It's easier to match those that don't need escaping.
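
A rough sketch of the "match what doesn't need escaping" approach, assuming the input is already UTF-8 and that the safe set is the "unreserved" characters of RFC 3986 (ALPHA, DIGIT, "-", ".", "_", "~"). The function names are invented here, and a real URL encoder would also have to decide per URL component which reserved characters to leave alone.

```cpp
#include <string>

// True for the RFC 3986 "unreserved" set. Assumes an ASCII-compatible
// execution character set.
bool IsUnreserved(unsigned char b) {
    return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') ||
           (b >= '0' && b <= '9') ||
           b == '-' || b == '.' || b == '_' || b == '~';
}

// Percent-encodes every byte of a UTF-8 string that is not unreserved.
// Non-ASCII characters are multi-byte in UTF-8, so each of their bytes
// is escaped individually.
std::string PercentEncode(const std::string& utf8) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : utf8) {
        if (IsUnreserved(b)) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += kHex[b >> 4];    // high nibble
            out += kHex[b & 0x0F];  // low nibble
        }
    }
    return out;
}
```

The allow-list stays at four ranges or constants no matter how many characters exist outside it, which is the point being made above.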

2016-11-14 Reply Admin

any pattern matching function (regex also does by the way) goes character by character.

And here we get to the official "Actually..." part of today's thread. Ones like this one, which have to test each character, do have to touch each one (though not necessarily in first-to-last order), but ones with larger patterns - such as matching two substrings - usually don't.

I expect you knew this, but, well, it's not very relevant here, unless you are a pedantic dickweed like me, or - and here's the punchline - you know both the character encoding and the maximum system word length ahead of time. In that case, depending on the encoding you might be able to cast (or union-mask) the character strings to a packed full word and use a bitmask. Not exactly an elegant solution, and not one I would bother with unless this were for some reason a bottleneck in the program (rather unlikely) which I couldn't break in some better fashion (even less likely, even given my meager skill), but it would be just the sort of thing that a Triggered C Purist would probably love.

That having been said, and taking into account what others have said about the inconsistencies of the Windows API, the likelihood or unlikelihood of knowing the encoding, and the optimization potential of switch() versus if(), I would say that TRWTF is that the dev didn't at least explain why they didn't use the API, which would, if nothing else, demonstrate that they were aware of the existing solution and had a reason not to use it. Awareness of the API is at the heart of this WTF, and the fact that nothing was said about it indicates that they weren't the experts they claimed to be. Shocking, I know, that anyone in IT might be so misleading about their expertise.

Addendum 2016-11-14 12:21: "didn't at least explain..." in a comment, that is. I assume that I was clear enough, but I am a pedantic dickweed and got pissed off with myself that I dropped that part.

Addendum 2016-11-14 12:23: Also, I meant "to an array of packed words".
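
One possible reading of the packed-word idea, as a sketch only: assume an 8-bit, ASCII-compatible encoding such as UTF-8 and a 64-bit word, copy eight characters at a time into a word, and test all eight top bits with a single mask. This only answers "is there anything non-ASCII in here?", not the full escaping question, and HasNonAscii is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true if any byte in the buffer has its top bit set, i.e. is not
// 7-bit ASCII. Assumes an 8-bit, ASCII-compatible encoding.
bool HasNonAscii(const char* data, std::size_t len) {
    const std::uint64_t kHighBits = 0x8080808080808080ULL;
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        // memcpy instead of a cast avoids alignment and aliasing problems.
        std::memcpy(&word, data + i, sizeof word);
        if (word & kHighBits) return true;  // some byte in this word is >= 0x80
    }
    for (; i < len; ++i) {                  // leftover tail, one byte at a time
        if (static_cast<unsigned char>(data[i]) & 0x80) return true;
    }
    return false;
}
```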

2016-11-14 Reply Admin

I would be willing to bet that 95% of programmers never have any of their code ported into an environment that uses EBCDIC. This is a case for YAGNI. Trying to write code that is perfect for the express purpose of, "It might get ported into some horrible place where even the ASCII standard breaks down..." is a ludicrous waste of time. I am completely certain that if this code gets ported into EBCDIC, then this function is going to be the least of their problems. I do not write my programs with the intent that they are easily ported into COBOL, and I do not write them to work with EBCDIC. ( 'a'<=i && i<='z' ) || ( 'A'<=i && i<='Z' ) is just fine. If I see a bug report come in that says "Does not work with EBCDIC" I will take that as my cue to quit my job and move to another position as fast as humanly possible.

2016-11-14 Reply Admin

I just fed the routine into my local friendly C compiler, and it does quite a bit of optimization. First (frist?) it uses a jump table for the cases, and Second, it makes the jump table as sparse as it can. So, it does execute in short order, and while it does take up some space, it isn't that bad in the grand scheme of things.

On the plus side, yes, it will work on EBCDIC machines (if they are still around), since for those there are gaps between I & J, R & S, and their lower case equivalents, but that is secondary. Unlike many of the routines here it doesn't execute in infinite time, or give wrong answers. As the saying goes: If it works, it is aesthetic, if you have a better solution, feel free to use it.

2016-11-14 Reply Admin

EBCDIC machines have a gap between lower case SERIES and upper case SERIES, NOT within a-z or A-Z or 0-9.

ASCII/ANSI/Unicode also has a gap between a-z, A-Z and 0-9. In fact, who hasn't heard of the bit-flip trick to go from capitals to lowercase (bit 5: set it for lowercase, clear it for uppercase)? Bit 5 adds 32 to the value.
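
For anyone who hasn't seen it, a minimal sketch of the bit-5 trick in ASCII, where 'A' is 0x41 and 'a' is 0x61, so setting the bit yields the lower-case letter and clearing it yields the upper-case one. It is only meaningful when the input is already an ASCII letter, and the function names are invented for this example.

```cpp
// ASCII only: bit 5 (value 0x20, decimal 32) is the sole difference between
// 'A'..'Z' and 'a'..'z'. Feeding in non-letters gives nonsense.
constexpr char ToLowerAscii(char c)  { return static_cast<char>(c | 0x20); }   // set bit 5
constexpr char ToUpperAscii(char c)  { return static_cast<char>(c & ~0x20); }  // clear bit 5
constexpr char FlipCaseAscii(char c) { return static_cast<char>(c ^ 0x20); }   // toggle bit 5
```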

2016-11-14 Reply Admin

Your documented API doesn't exist in Windows-in-general, only in desktop Windows.

That's an understatement. Some editions of Windows (and not just the original release of Windows 95) don't have Internet Explorer built in.

Your documented API comes with enough caveats and variations to sink a battleship.

Hmm, is that an understatement too, I'm not sure ^_^

2016-11-14 Reply Admin

The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies.

So why is it that letters between L'a' and L'z' and letters between L'A' and L'Z' don't need escaping but other letters do need escaping -- including some letters found in Latin-1 and some letters found on an English[*] keyboard?

[* and for another year or two, Scottish and Northern Irish]

2016-11-14 Reply Admin

EBCDIC machines have a gap between lower case SERIES and upper case SERIES, NOT within a-z or A-Z or 0-9.

and isn't it totally ass backwards?

Yes it is totally ass backwards.

EBCDIC code pages have gaps between 'I' and 'J', and between 'R' and 'S'. In fact the ordinary slash '/' comes just before 'S' in EBCDIC.

ASCII and its relatives have gaps between 'Z' and 'a' but not within ASCII a-z or ASCII A-Z.

But I can't imagine why anyone is Appalled by this idiot's postings. This is the WTF site after all.

2016-11-14 Reply Admin

If it's a URL, you need to escape Unicode characters.

In Windows, L'A' is Unicode and L"anything" is Unicode.

2016-11-14 Reply Admin

Trying to write code that is perfect for the express purpose of, "It might get ported into some horrible place where even the ASCII standard breaks down..." is a ludicrous waste of time.

Such as languages containing ñ or 表?

2016-11-15 Reply Admin

So why is it that letters between L'a' and L'z' and letters between L'A' and L'Z' don't need escaping but other letters do need escaping

Because the standard says so.

2016-11-15 Reply Admin

6 compares. not 3. Each end of your range is a compare.

And if you think a switch statement degenerates into an if/else if you really need to study what a compiler does with switch statements :-)

It's not unreasonable, when writing Windows code, to ignore the oddities of EBCDIC. If you insist, then I'd question why you don't make it work with the 6 bit characters of the old CDC Cyber series, where there were no lowercase alphabetics :-)

Steve_The_Cynic · 2016-11-15 Reply Admin

Ah, yes, CDC Display Code, where ':' == 0. I remember it with despair. I spent a small amount of time in the mid 80s working on a Cyber that was running in 60-bit mode. (One word of memory being 60 bits, 10 characters of six bits each, for the same type of definition of word that says that an Intel i7 has a word being 8 bits, one character of 8 bits each.)

And as for classifying characters, I'd normally (if it were C/C++) use "isalpha" to detect letters so that I don't have to worry about that sort of nonsense...
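
A minimal sketch of what that looks like in C++, using std::isalpha from <cctype> and std::iswalpha from <cwctype>. The wrapper name IsLetter is made up; both library calls consult the current C locale, so results for anything outside plain ASCII letters can vary by platform and locale.

```cpp
#include <cctype>
#include <cwctype>

// std::isalpha takes an int that must be representable as unsigned char
// (or be EOF), so a plain char is cast first to avoid undefined behaviour
// for negative values.
bool IsLetter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Wide-character counterpart; the answer depends on the current C locale.
bool IsLetter(wchar_t c) {
    return std::iswalpha(static_cast<std::wint_t>(c)) != 0;
}
```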

2016-11-15 Reply Admin

The advantage of switch/case is that the optimizer can choose which construct is more efficient. If you inspect the generated code, you can find that it may be implemented as an if/else-if chain, as a calculated value, or as a jump table, depending on the number of values and the gaps between them. Fall-through cases also influence the generated code.
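
A shortened, hypothetical example of the kind of switch being discussed (not the code from the article): stacked case labels fall through to one shared return, and the compiler is free to lower the whole thing to range checks, a jump table, or a bit test. Compiling it with something like -O2 -S on GCC or Clang is one way to see which form was picked.

```cpp
// Hypothetical classifier: digits plus a few punctuation characters.
// All listed labels fall through to the same return statement.
bool IsDigitOrSafePunctuation(wchar_t c) {
    switch (c) {
        case L'0': case L'1': case L'2': case L'3': case L'4':
        case L'5': case L'6': case L'7': case L'8': case L'9':
        case L'-': case L'.': case L'_': case L'~':
            return true;
        default:
            return false;
    }
}
```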

2016-11-15 Reply Admin

And 16-bit Unicode is the encoding, not Unicode built from 8-bit blocks. When details like that are built into the implementation, talking about portability just seems wrong.

2016-11-15 Reply Admin

The whole WinAPI angle seems to be a potshot by the editor. It bears no relevance to the CodeSOD.

InternetCanonicalizeUrl sounds just plain wrong. Give it an improper full URL, and it magically makes it right? What could possibly go wrong?

2016-11-16 Reply Admin

The number of people on here who firmly believe that 128 characters should be enough for anyone really blows me away. I thought TDWTF's audience was even slightly more up to date than that.

As for Remy's poor attempt at finding a substitute, obviously https://msdn.microsoft.com/en-us/library/windows/desktop/aa384099(v=vs.85).aspx would be better, as it combines canonicalization and open in one function, and works on everything from win2k to win10, without dependencies.

2016-11-16 Reply Admin

The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies.

So why is it that letters between L'a' and L'z' and letters between L'A' and L'Z' don't need escaping but other letters do need escaping -- including some letters found in Latin-1 and some letters found on an English[*] keyboard?

[* and for another year or two, Scottish and Northern Irish]

Because the standard says so.

Which standard? My rhetorical retort retorted to 'The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies.'

Someday I ought to test if isalpha('五') returns zero or not. 五 is numeric like 5 but also alphabetic like FIVE (but it's a single character not a string of four characters).

isupper and islower both have to return zero for that character.

Even in 8-bit universes, Greek letters are upper or lower case, Hebrew letters aren't, etc. What a mess.

2016-11-16 Reply Admin

obviously https://msdn.microsoft.com/en-us/library/windows/desktop/aa384099(v=vs.85).aspx would be better

Nope. WinHttpOpenRequest fails if it can't make an internet connection to a server. A function to canonicalize a URL can work even when no network connection is present.

2016-11-16 Reply Admin

Which standard?

Uh, maybe the standard that defines which characters can and can't appear in a URL without being escaped? Do try to keep up.

2016-11-16 Reply Admin

For the third time, I point out that I was retorting to this:

The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies.

Does the robot understand that I was not retorting to a standard about URLs?

Wait, why am I even asking. I know the answer to my question.

2016-12-18 Reply Admin

I fail to see the problem. Finally some code that is completely independent of the underlying machine representation or codeset used.

Just In Case
