Admin
Case L'Frist'
Admin
bool NeedEscape ( wchar_t c ) { return true; }
There, I fixed it
Admin
For that developer '/' needs to be escaped too...
Admin
Your documented API doesn't exist in Windows-in-general, only in desktop Windows. The function you have quoted automatically exists everywhere.
Your documented API comes with enough caveats and variations to sink a battleship. The function you have quoted is self-documenting.
The documentation of the API practically tells you not to use it:
"In Internet Explorer 4.0 and later, InternetCanonicalizeUrl always functions as if the ICU_BROWSER_MODE flag is set. Client applications that must canonicalize the entire URL should use either CoInternetParseUrl (with the action PARSE_CANONICALIZE and the flag URL_ESCAPE_UNSAFE) or UrlCanonicalize."
"InternetCanonicalizeUrl always encodes by default, even if the ICU_DECODE flag has been specified..."
...and so on.
Admin
If you have to roll your own, roll your own, but at least put em on separate lines (only one byte difference between a space and a CRLF). Alternatively, compare the Hex Values,
if (InChar >= Hex("0") and InChar <= Hex("9")) return False, if (InChar >= Hex("a") and InChar <= Hex("z")) return False, etc.
Rather than, for example, doing 66 compares before discovering that "Z" is OK.
<sarcasm>Of course, if you take the latter route, be sure to avoid any comments so that no one can tell at a glance what you're doing or why.</sarcasm>
<sarcasm>Or better yet, use a Regex.</sarcasm>
And I just realized it's C++ not C#, and googling says L("x") is already a Long Integer, and wchar_t is apparently something like nchar rather than char. So not being a C++ coder, it is truly one giant WTF to me. Does it even work?
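The range-compare idea in this comment can be sketched in C++ roughly as follows. This is a minimal illustration, not the code from the article: the name `NeedEscapeByRange` is hypothetical, the set of "safe" characters (ASCII letters and digits only) is an assumption, and the range tests are only valid on an ASCII-compatible character set.

```cpp
// Sketch only: assumes an ASCII-compatible execution character set, where
// '0'-'9', 'A'-'Z' and 'a'-'z' are each contiguous. The "safe" set used here
// (alphanumerics only) is an assumption, not the article's exact list.
bool NeedEscapeByRange(wchar_t c) {
    if (c >= L'0' && c <= L'9') return false;  // digits
    if (c >= L'A' && c <= L'Z') return false;  // upper-case letters
    if (c >= L'a' && c <= L'z') return false;  // lower-case letters
    return true;                               // everything else gets escaped
}
```

With this shape, `L'Z'` is accepted after at most four comparisons, and any character outside the three ranges falls through to `true`.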
Admin
switch/case is more efficient than if/else blocks, since even a simple compiler can optimize it down to two simple jumps by using a jump table or a perfect hash table.
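To make the table idea concrete, here is a hand-written lookup table that does by hand roughly what a compiler may do with a dense switch. It is a sketch, not actual compiler output: the names `BuildEscapeTable` and `NeedEscapeByTable` are hypothetical, and it assumes ASCII codes and an alphanumerics-only "safe" set.

```cpp
#include <array>
#include <cstddef>

// Builds a 128-entry table: index = ASCII code, value = "needs escaping".
// Assumption: only ASCII letters and digits are left unescaped.
static std::array<bool, 128> BuildEscapeTable() {
    std::array<bool, 128> table{};
    table.fill(true);
    for (int c = '0'; c <= '9'; ++c) table[static_cast<std::size_t>(c)] = false;
    for (int c = 'A'; c <= 'Z'; ++c) table[static_cast<std::size_t>(c)] = false;
    for (int c = 'a'; c <= 'z'; ++c) table[static_cast<std::size_t>(c)] = false;
    return table;
}

bool NeedEscapeByTable(wchar_t c) {
    static const std::array<bool, 128> table = BuildEscapeTable();
    // Casting to an unsigned type turns negative values (where wchar_t is
    // signed) into large ones, so one bounds check covers both ends.
    const unsigned long code = static_cast<unsigned long>(c);
    if (code >= table.size()) return true;  // non-ASCII: always escape
    return table[static_cast<std::size_t>(code)];
}
```

After the table is built, every call costs one bounds check and one indexed load, regardless of which character is being tested.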
Admin
"The function went character by character through the string, which was bad enough": any pattern-matching function (regex included, by the way) goes character by character. Not being able to see it doesn't mean it does not happen.
Modern bytecode/IL/interpreted-language guys, when are they gonna learn?
Admin
if (InChar >= Hex("a") and InChar <= Hex("z"))
This doesn't work on EBCDIC machines, since the letters are not contiguous.
Admin
TRWTF is EBCDIC.
Admin
@Steve_The_Cynic: Serious question; Are there any EBCDIC machines running Windows?
Admin
Not normally, but I didn't say "Windows machines that use EBCDIC." The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies. If someone gets into that habit, sooner or later he will write code that ends up being transported to an environment that uses EBCDIC, and people will wonder why '~' is classed as a lower-case letter, or that closing-brace and backslash are upper-case letters, but opening-brace is not.
Admin
EBCDIC machines have a gap between lower case SERIES and upper case SERIES, NOT within a-z or A-Z or 0-9.
and there would only be 3 compares, not 66: a-z, A-Z, and 0-9.
and for those in love with Case/Switch (which suck in ALL languages) just put the >="a" and <="z" in each case statement
and isn't it totally ass backwards? Why zero in on those that don't need escaping and escape all the others. Why not find ONLY those that NEED escaping presumably, ' " # & @ etc. depending on where you're going with them (SQL, URL, etc.)
Admin
If it's a URL, you need to escape Unicode characters. It's easier to match those that don't need escaping.
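A sketch of the "match what doesn't need escaping" approach for URLs, assuming the input is already UTF-8 and using the unreserved set from RFC 3986 (letters, digits, `-`, `.`, `_`, `~`). The function name `PercentEncodeUtf8` is hypothetical, and the range tests assume an ASCII-compatible character set.

```cpp
#include <string>

// Hypothetical helper: percent-encodes every byte of a UTF-8 string that is
// not an RFC 3986 "unreserved" character. Non-ASCII characters are encoded
// byte by byte, which is why matching the safe set is the simpler test.
std::string PercentEncodeUtf8(const std::string& utf8) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        const bool unreserved =
            (byte >= '0' && byte <= '9') || (byte >= 'A' && byte <= 'Z') ||
            (byte >= 'a' && byte <= 'z') || byte == '-' || byte == '.' ||
            byte == '_' || byte == '~';
        if (unreserved) {
            out += static_cast<char>(byte);
        } else {
            out += '%';
            out += kHex[byte >> 4];    // high nibble
            out += kHex[byte & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Whether reserved characters such as `/` should also pass through depends on which part of the URL is being encoded; this sketch escapes them all.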
Admin
And here we get to the official "Actually..." part of today's thread. Ones like this one, which have to test each character, do have to touch each one (though not necessarily in first-to-last order), but ones with larger patterns - such as matching two substrings - usually don't.
I expect you knew this, but, well, it's not very relevant here, unless you are a pedantic dickweed like me, or - and here's the punchline - you know both the character encoding and the maximum system word length ahead of time. In that case, depending on the encoding you might be able to cast (or union-mask) the character strings to a packed full word and use a bitmask. Not exactly an elegant solution, and not one I would bother with unless this were for some reason a bottleneck in the program (rather unlikely) which I couldn't break in some better fashion (even less likely, even given my meager skill), but it would be just the sort of thing that a Triggered C Purist would probably love.
That having been said, and taking into account what others have said about the inconsistencies of the Windows API, the likelihood or unlikelihood of knowing the encoding, and the optimization potential of switch() versus if(), I would say that TRWTF is that the dev didn't at least explain why they didn't use the API, which would, if nothing else, demonstrate that they were aware of the existing solution and had a reason not to use it. Awareness of the API is at the heart of this WTF, and the fact that nothing was said about it indicates that they weren't the experts they claimed to be. Shocking, I know, that anyone in IT might be so misleading about their expertise.
Addendum 2016-11-14 12:21: "didn't at least explain..." in a comment, that is. I assume that I was clear enough, but I am a pedantic dickweed and got pissed off with myself that I dropped that part.
Addendum 2016-11-14 12:23: Also, I meant "to an array of packed words".
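One possible reading of the packed-word trick, as a sketch under stated assumptions: an 8-bit, ASCII-compatible encoding such as UTF-8, a 64-bit word, and at least 8 readable bytes at the pointer. The function name `AnyNonAsciiIn8` is hypothetical. It only answers "does this block contain a byte outside 7-bit ASCII?", which could serve as a quick pre-check before a per-character pass.

```cpp
#include <cstdint>
#include <cstring>

// Returns true if any of the 8 bytes at `p` has its high bit set, i.e. is
// not a 7-bit ASCII character. Assumes at least 8 bytes are readable at `p`.
bool AnyNonAsciiIn8(const char* p) {
    std::uint64_t word;
    std::memcpy(&word, p, sizeof word);  // safer than a pointer cast or union
    return (word & 0x8080808080808080ULL) != 0;
}
```

Because the mask has the same bit set in every byte, the result does not depend on byte order.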
Admin
I would be willing to bet that 95% of programmers never have any of their code ported into an environment that uses EBCDIC. This is a case for YAGNI. Trying to write code that is perfect for the express purpose of, "It might get ported into some horrible place where even the ASCII standard breaks down..." is a ludicrous waste of time. I am completely certain that if this code gets ported into EBCDIC, then this function is going to be the least of their problems. I do not write my programs with the intent that they are easily ported into COBOL, and I do not write them to work with EBCDIC. ( 'a'<=i && i<='z' ) || ( 'A'<=i && i<='Z' ) is just fine. If I see a bug report come in that says "Does not work with EBCDIC" I will take that as my cue to quit my job and move to another position as fast as humanly possible.
Admin
I just fed the routine into my local friendly C compiler, and it does quite a bit of optimization. First (frist?) it uses a jump table for the cases, and Second, it makes the jump table as sparse as it can. So, it does execute in short order, and while it does take up some space, it isn't that bad in the grand scheme of things.
On the plus side, yes, it will work on EBCDIC machines (if they are still around), since for those there are gaps between I & J, R & S, and their lower case equivalents, but that is secondary. Unlike many of the routines here it doesn't execute in infinite time, or give wrong answers. As the saying goes: If it works, it is aesthetic, if you have a better solution, feel free to use it.
Admin
ASCII/ANSI/Unicode also has gaps between a-z, A-Z and 0-9. In fact, who hasn't heard of the bit-flip trick to go between capitals and lowercase (bit 5: set it for lowercase, clear it for uppercase)? Bit 5 adds 32 to the value.
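In ASCII, 'A' is 0x41 and 'a' is 0x61, so the two cases differ only in bit 5 (value 0x20, i.e. 32). A minimal sketch of the trick, with hypothetical function names, valid only when the input is already known to be an ASCII letter:

```cpp
// ASCII-only sketch: bit 5 (0x20) is set in 'a'..'z' and clear in 'A'..'Z'.
// Applying these to non-letters gives meaningless results.
char ToLowerAscii(char c) { return static_cast<char>(c | 0x20); }   // set bit 5
char ToUpperAscii(char c) { return static_cast<char>(c & ~0x20); }  // clear bit 5
char FlipCaseAscii(char c) { return static_cast<char>(c ^ 0x20); }  // toggle bit 5
```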
Admin
That's an understatement. Some editions of Windows (and not just the original release of Windows 95) don't have Internet Explorer built in.
Hmm, is that an understatement too, I'm not sure ^_^
Admin
So why is it that letters between L'a' and L'z' and letters between L'A' and L'Z' don't need escaping but other letters do need escaping -- including some letters found in Latin-1 and some letters found on an English[*] keyboard?
[* and for another year or two, Scottish and Northern Irish]
Admin
Yes it is totally ass backwards.
EBCDIC code pages have gaps between 'I' and 'J', and between 'R' and 'S'. In fact the ordinary slash '/' comes just before 'S' in EBCDIC.
ASCII and its relatives have gaps between 'Z' and 'a' but not within ASCII a-z or ASCII A-Z.
But I can't imagine why anyone is Appalled by this idiot's postings. This is the WTF site after all.
Admin
In Windows, L'A' is Unicode and L"anything" is Unicode.
Admin
Such as languages containing ñ or 表?
Admin
6 compares. not 3. Each end of your range is a compare.
And if you think a switch statement degenerates into an if/else if you really need to study what a compiler does with switch statements :-)
It's not unreasonable, when writing Windows code, to ignore the oddities of EBCDIC. If you insist, then I'd question why you don't make it work with the 6 bit characters of the old CDC Cyber series, where there were no lowercase alphabetics :-)
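On the compare count: a range test is two comparisons as written, but a common trick folds it into one by subtracting the lower bound in unsigned arithmetic. A sketch, assuming an ASCII-compatible character set; the names `InRange` and `IsAsciiAlnum` are hypothetical.

```cpp
// If c < lo, the unsigned subtraction wraps around to a very large value,
// so a single comparison covers both ends of the range.
static bool InRange(wchar_t c, wchar_t lo, wchar_t hi) {
    return static_cast<unsigned long>(c) - static_cast<unsigned long>(lo)
           <= static_cast<unsigned long>(hi) - static_cast<unsigned long>(lo);
}

// Three ranges, three comparisons (plus three subtractions).
bool IsAsciiAlnum(wchar_t c) {
    return InRange(c, L'0', L'9') || InRange(c, L'A', L'Z') ||
           InRange(c, L'a', L'z');
}
```

Optimizing compilers often apply this transformation on their own, so the hand-written form is mostly of interest for reading generated code.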
Admin
Ah, yes, CDC Display Code, where ':' == 0. I remember it with despair. I spent a small amount of time in the mid 80s working on a Cyber that was running in 60-bit mode. (One word of memory being 60 bits, 10 characters of six bits each, for the same type of definition of word that says that an Intel i7 has a word being 8 bits, one character of 8 bits each.)
And as for classifying characters, I'd normally (if it were C/C++) use "isalpha" to detect letters so that I don't have to worry about that sort of nonsense...
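A minimal sketch of the `isalpha` route for narrow characters. The wrapper name `IsLetter` is hypothetical; the result depends on the current C locale, and in the default "C" locale only the basic Latin letters count.

```cpp
#include <cctype>

// std::isalpha takes an int whose value must be representable as
// unsigned char (or be EOF), hence the cast before the call.
bool IsLetter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

This leaves the question of which code points are letters to the C library instead of hard-coding ranges, so it does not depend on the letters being contiguous.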
Admin
The advantage of switch/case is that the optimizer can choose which construct is more efficient. If you inspect the generated code, you will find that it may use an if/else-if chain, a calculated value, or a jump table, depending on the number of values and the gaps between them. Fall-through cases also influence the generated code.
Admin
And 16-bit Unicode is the encoding, not Unicode built from 8-bit blocks. When details like that are built into the implementation, talking about portability just seems wrong.
Admin
The whole WinAPI angle seems to be a potshot by the editor. It bears no relevance to the CodeSOD.
InternetCanonicalizeUrl sounds just plain wrong. Give it an improper full URL, and it magically makes it right? What could possibly go wrong?
Admin
The number of people on here who firmly believe that 128 characters should be enough for anyone really blows me away. I thought TDWTF's audience was even slightly more up to date than that.
As for Remy's poor attempt at finding a substitute, obviously https://msdn.microsoft.com/en-us/library/windows/desktop/aa384099(v=vs.85).aspx would be better, as it combines canonicalization and open in one function, and works on everything from win2k to win10, without dependencies.
Admin
Which standard? My rhetorical retort retorted to 'The idea of testing "is this a letter" just by seeing if the character is between 'a' and 'z' gives me the creepy-crawlies.'
Someday I ought to test if isalpha('五') returns zero or not. 五 is numeric like 5 but also alphabetic like FIVE (but it's a single character not a string of four characters).
isupper and islower both have to return zero for that character.
Even in 8-bit universes, Greek letters are upper or lower case, Hebrew letters aren't, etc. What a mess.
Admin
Nope. WinHttpOpenRequest fails if it can't make an internet connection to a server. A function to canonicalize a URL can work even when no network connection is present.
Admin
For the third time, I point out that I was retorting to this:
Does the robot understand that I was not retorting to a standard about URLs?
Wait, why am I even asking. I know the answer to my question.
Admin
I fail to see the problem. Finally some code that is completely independent of the underlying machine representation or codeset used.