Internet.toLowerCase

  • Steve 2013-02-18 06:09
    "Fr1st".toLowerCase()
  • Troll 2013-02-18 06:16
    Y U NO ESCAPE CHARS IN CODE? :P
  • Dat validation 2013-02-18 06:25
    Dat validation
  • Hmmmm 2013-02-18 06:29
    Troll:
    Y U NO ESCAPE CHARS IN CODE? :P

    Perhaps this was an ironic meta-WTF by the author given the second sentence, though I doubt it...
  • GettinSadda 2013-02-18 06:34
    For added Lulz, this post only makes sense if you read the HTML source!
  • biziclop 2013-02-18 06:36
    At least there will be no debate about what TRWTF is.
  • Anonymous 2013-02-18 06:39
    ESCAPE FROM THIS MADNESS
  • F***-it Fred 2013-02-18 06:51
    Time to go write some really WTFy code containing malicious Javascript and wait for it to be published.
  • foo 2013-02-18 07:19
    <div class="CommentBodyFeatured">
    F***-it Fred:
    Time to go write some really WTFy code containing malicious Javascript and wait for it to be published.
    </div>
  • phynol 2013-02-18 07:21
    if(pageHTML.toLowerCase().regionMatches(index, " tag } 


    What C family language supports arbitrarily closing a paren with a brace? Or are we just making WTFs up as we go along?
  • Preakness 2013-02-18 07:27
    But the PRE tag shuts off the HTML interpreter, doesn't it?

    Hint: They're trying to train us to view source on every article.
  • Remy Porter 2013-02-18 07:28
    Someone forgot to escape their "<". It's been fixed, and for good measure, run through a syntax highlighter so that you can see the WTFness in the code IN COLOR.
  • Remy Porter 2013-02-18 07:30
    There honestly should be such a tag. There should also be a tag that allows you to pass its contents to a different interpreter, thus making it easier to inline binary data.
  • faoileag 2013-02-18 07:31
    phynol:
    Erik Gern:
    if(pageHTML.toLowerCase().regionMatches(index, " tag }
    What C family language supports arbitrarily closing a paren with a brace? Or are we just making WTFs up as we go along?

    None.

    But sometimes people forget that "pre" in HTML does not allow you to use angle brackets directly (without encoding them as their corresponding HTML entities).
  • Remy Porter 2013-02-18 07:37
    And regarding the article, this isn't a WTF. Regular expressions are expensive and difficult to maintain!
  • faoileag 2013-02-18 07:42
    So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

    Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

    Profiler, anyone?
  • Raedwald 2013-02-18 07:44
    Parsing HTML with regular expressions? That never goes well.
  • fa2k 2013-02-18 07:54
    So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
    - Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
    - Search for the first non-letter character, and use the string up to that as a key into a hash table.
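    The second option can be sketched in a few lines of Java. This is only an illustration: the class name TagLookup, the helper names, and the short tag list are made up for the example, and it scans letters and digits (rather than letters only) so that names like h1 come out whole. It assumes index points at a '<' inside the string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TagLookup {
    // Illustrative subset only; a real validator would list every tag it knows.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(
            Arrays.asList("a", "abbr", "area", "div", "h1", "img", "p", "pre", "span"));

    /** Returns the lower-cased tag name following the '<' at index, or "" if there is none. */
    static String tagNameAt(String html, int index) {
        int start = index + 1;
        if (start < html.length() && html.charAt(start) == '/') {
            start++; // skip the slash of a closing tag
        }
        int end = start;
        while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
            end++;
        }
        // Only the few characters of the name are lower-cased, not the whole page.
        return html.substring(start, end).toLowerCase(Locale.ROOT);
    }

    /** One hash lookup replaces the chain of per-tag comparisons. */
    static boolean isKnownTag(String html, int index) {
        return KNOWN_TAGS.contains(tagNameAt(html, index));
    }

    public static void main(String[] args) {
        String html = "<P>Hello <IMG src=\"x.png\"> <blink>";
        for (int i = html.indexOf('<'); i >= 0; i = html.indexOf('<', i + 1)) {
            System.out.println(tagNameAt(html, i) + " known? " + isKnownTag(html, i));
        }
    }
}
```

    The set comes from the standard library, so there is no hash function or collision handling to write by hand.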
  • snoofle 2013-02-18 07:54
    article:
    ...it also lower cased the entire document multiple times...
    So it converted the entire 1+M document to lower case 70+ times for every tag in the file? That's a lot of cpu-grinding. This generates unnecessary heat.

    Forget carbon emissions; this is where global warming comes from people!

  • Noumenon 2013-02-18 07:56
    As a PHP newb, I'd be thankful if someone could name one of those "reliable libraries a developer could use to do the heavy lifting." A simple one, please.
  • ZoomST 2013-02-18 07:56
    Remy Porter:
    And regarding the article, this isn't a WTF. Regular expressions are expensive and difficult to maintain!

    Sure, and as The Guru told us, "the delay is a little price to pay as long as the code keeps its essence. Just put more CPU power and memory". And the boss just bent before those deep words, while we were hearing it with astonishing devotion.
    Not a WTF at all. Just as The Guru told us.
  • gnasher729 2013-02-18 07:57
    faoileag:
    So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

    Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

    Profiler, anyone?


    Once you figure out that your code crashes, or takes a day to process a large page, the optimization is not premature anymore.
  • faoileag 2013-02-18 07:58
    I actually like the first test in the sample given: it fires on all tags starting with "<a", not only the anchor tag.

    Ah well, the "Do a test involving the <a> tag"-Test will probably weed out applets, areas and the like.
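    A small Java sketch of that prefix problem, under the assumption that the check is a plain regionMatches on "<a" as in the article's sample. The class and method names are hypothetical. The exact() variant additionally looks at the character after the name; both use the ignoreCase overload of regionMatches, which sidesteps toLowerCase() altogether.

```java
final class PrefixMatch {
    /** Article-style test: true for any tag whose name merely starts with the given prefix. */
    static boolean naive(String html, int index, String tag) {
        return html.regionMatches(true, index, tag, 0, tag.length());
    }

    /** Same test, but the tag name must end right after the matched prefix. */
    static boolean exact(String html, int index, String tag) {
        if (!naive(html, index, tag)) {
            return false;
        }
        int next = index + tag.length();
        if (next >= html.length()) {
            return true; // input ends right after the name
        }
        char c = html.charAt(next);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        String[] samples = { "<a href=\"#\">", "<AREA shape=\"rect\">", "<abbr>" };
        for (String s : samples) {
            System.out.println(s + " naive=" + naive(s, 0, "<a") + " exact=" + exact(s, 0, "<a"));
        }
    }
}
```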
  • faoileag 2013-02-18 08:04
    gnasher729:
    faoileag:
    So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

    Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)
    Once you figure out that your code crashes, or takes a day to process a large page, the optimization is not premature anymore.

    Definitely not. And "Pedro the Profiler" rightfully comes to the rescue.

    But storing the result of toLowerCase() in a temp var and working on that variable would be :-)
  • faoileag 2013-02-18 08:07
    faoileag:
    But storing the result of toLowerCase() in a temp var and working on that variable would be :-)

    But storing the result of toLowerCase() in a temp var and working on that variable straightaway before the method has had a chance to choke on large pages would be.

    FTFM
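    For the record, the temp-var version is about this much work. A minimal sketch, assuming the loop walks from one '<' to the next as described in the article; LowerOnce and countImgTags are invented names, and only the "<img" test is shown.

```java
import java.util.Locale;

final class LowerOnce {
    /** Counts occurrences of "<img" (any case), lower-casing the page a single time. */
    static int countImgTags(String pageHTML) {
        // One pass over the document; every test below reuses this copy.
        String lower = pageHTML.toLowerCase(Locale.ROOT);
        int count = 0;
        for (int index = lower.indexOf('<'); index >= 0; index = lower.indexOf('<', index + 1)) {
            if (lower.regionMatches(index, "<img", 0, 4)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countImgTags("<IMG src=\"a.png\"><p><img src=\"b.png\"></p>"));
    }
}
```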
  • Black Bart 2013-02-18 08:09
    Slow yes, but who here thinks it would take 24 hours to process a single page?
  • snoofle 2013-02-18 08:30
    Black Bart:
    Slow yes, but who here thinks it would take 24 hours to process a single page?
    In fairness, have you seen some of the crap generated by Frontpage?
  • ZoomST 2013-02-18 08:57
    Black Bart:
    Slow yes, but who here thinks it would take 24 hours to process a single page?

    Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?
  • Bobby Tables 2013-02-18 09:12
    It's worse than that.

    Every time a tag is found the entire page is converted to lowercase.

  • Bobby Tables 2013-02-18 09:12
    ZoomST:
    Black Bart:
    Slow yes, but who here thinks it would take 24 hours to process a single page?

    Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?


    It's worse than that - every time a tag is found on the page the whole page is converted to lowercase. 70+ times.
  • Doctor_of_Ineptitude 2013-02-18 09:21
    ZoomST:
    Black Bart:
    Slow yes, but who here thinks it would take 24 hours to process a single page?

    Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?


    You must be a Russian.
  • faoileag 2013-02-18 09:23
    Bobby Tables :
    ZoomST:
    Black Bart:
    Slow yes, but who here thinks it would take 24 hours to process a single page?
    Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?
    It's worse than that - every time a tag is found on the page the whole page is converted to lowercase. 70+ times.

    It's worse than that - every time an opening angle bracket is found, the whole page is converted to lowercase 70+ times, because all if-clauses are executed every time, no matter how early the current tag appears in that list of if-clauses.

    That makes it N * 70+ lowercase calls, where N is the number of opening angle brackets in the page.
  • DaveK 2013-02-18 09:34
    fa2k:
    So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
    - Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
    - Search for the first non-letter character, and use the string up to that as a key into a hash table.
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.
  • Anon 2013-02-18 09:43
    DaveK:
    fa2k:
    So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
    - Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
    - Search for the first non-letter character, and use the string up to that as a key into a hash table.
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.


    Or the fiery wheel. Which is all kinds of awesome!
  • Joe tester 2013-02-18 09:45
    <div class="CommentBodyFeatured">

    Wait, does this actually work?

    Featured Comment Baby!

    </div>
  • gnasher729 2013-02-18 10:01
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation. Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.
  • foo 2013-02-18 10:49
    Joe tester:
    <div class="CommentBodyFeatured">

    Wait, does this actually work?

    Featured Comment Baby!

    </div>
    Works for me. Must be your fault. :)
  • dkf 2013-02-18 10:56
    gnasher729:
    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
    While strcmp is awesomely fast, the hashing might be a reasonable approach if the string is long (since if the data is large enough, you'll effectively flush the DCache and your performance will be back to that of main memory). Depending on exactly what sort of match is desired.
  • foo 2013-02-18 10:57
    gnasher729:
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation. Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.
    As always, it depends. With some techniques you can search for different things simultaneously (e.g. a lexer generator such as flex, which uses parallel regular expressions), so you could shave off a factor of 70 here. Specialized CPU instructions can hardly match that.

    Then again, if you get rid of the quadratic complexity (i.e. converting the whole string to lower-case and possibly anything else that traverses the whole string in each loop), you can shave off a factor on the order of a million for large files, so that's clearly the more important thing here. If that's done and it's still too slow (unlikely), you can care about a measly 70x speedup next.
  • Huck Finn 2013-02-18 11:40
    if(pageHTML.toLowerCase().regionMatches(index, "<img", 0, 4)){
        //Do a test involving the <img> tag
    }
    But what if your code needs to be international? Do you really want to rewrite this to parse the Finnish <img> tag?

    Plan ahead. Maybe you should include your list of tags expressed in every possible language, just to be sure.
  • Jazz 2013-02-18 12:09
    Bobby Tables:
    It's worse than that. Every time a tag is found the entire page is converted to lowercase.


    It's worse than that, he's dead, Jim.
  • Jazz 2013-02-18 12:19
    DaveK:
    fa2k:
    There are more efficient methods, but they are more complicated to implement. I can think of:
    - Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree.
    - Search for the first non-letter character, and use the string up to that as a key into a hash table.
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.


    Right out of college I worked for a giant global consulting firm with a one-word name that sounds like a sneeze. I wrote crap-tons of J2EE for lots of huge enterprise applications. At that firm, we would have been given bad marks on our review if we had implemented either of the solutions you suggest.

    Speed and efficiency weren't really what our project leads cared about; making the code maintainable by cheap commodity programmers later was more their concern. If performance testing showed that the application had a bottleneck, they would just tell the client they're going to need some more infrastructure to drive the finished product.

    More than once I brought a module to my lead for a code review, and in the module I had done fairly simple things, like caching the results of expensive methods, or adding a subclass so I could pass data around in logical, sensical ways, and I would be told that it was "too complicated" for future developers to understand, and would I please just code the simplest and most straightforward procedure that met the (barely coherent) specifications and not spend time thinking about how "best" to do it?

    Anyway, my bitterness aside, it's entirely plausible that this code was written this way not because the developer thought it was a good idea, but because management found the good idea to be too complicated for their poor little brains.
  • Jazz 2013-02-18 12:27
    gnasher729:
    Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.


    This is TRWTF. A general-purpose processor should not have application-specific instructions implemented in hardware.

    Sometimes I wish Intel would let their engineers design the chips, instead of having the marketing department do it. (Pentium 4, I'm looking at you.)
  • chubertdev 2013-02-18 12:30
    this

    Raedwald:
    Parsing HTML with regular expressions? That never goes well.
  • Rnd( 2013-02-18 12:58
    Thank some entity that my homework is only partially implementing the HTTP protocol... Why can't they have a nice strict spec on the web... Arbitrary white space and no enforcement of case.
  • Gary Olson 2013-02-18 14:36
    The Taginator -- destroying the web one page lookup at a time.
  • A. Nonymous 2013-02-18 14:52
    faoileag:
    So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

    Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

    Profiler, anyone?


    No, in this case you have an easy reply to Donald: "It is not optimizing, I am only following DRY!"
  • A. Nonymous 2013-02-18 15:07
    This shows that Donald's advice is still good: If you don't write sh*tty code, there is probably no need to optimize. And if you wrote sh*tty code, it won't get better if *you* try to optimize it. Either way rule one of optimization holds: Don't do it.
  • Joe 2013-02-18 16:32
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?

    I hope it isn't a bad word. But if it is, I'm safe. As long as I don't know what it means, your bad word won't make me think a bad thought.

    However if you've made some kind of error, that other people still understand, then they're still thinking bad thoughts despite your error.

    So that couldn't be it.

    Still confused.
  • A. Nonymous 2013-02-18 17:03
    Joe:
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?


    Probably just a typo, seems to mean shoddy.
  • pjt33 2013-02-18 17:43
    Remy Porter:
    There honestly should be such a tag. There should also be a tag that allows you to pass its contents to a different interpreter, thus making it easier to inline binary data.
    CDATA?
  • DaveK 2013-02-18 17:46
    gnasher729:
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
    Says who? There's nothing to stop you using SSE instructions to write your hash function, it hardly has to be crypto-strength; for you to just assume a hash function that's as bad compared to strcmp as you need it to be to make you correct is a circular argument. No matter how fast the strcmp, doing less of them is always going to be quicker than doing more of them, and doing less passes over memory is always better regardless of caching. In this particular case it should be trivial to generate a hash table that reduces 70+ strcmps to zero or one in all cases.

  • DaveK 2013-02-18 17:47
    Joe:
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?
    Think glob, not regex.
  • chubertdev 2013-02-18 17:54
    Load it into an XmlDocument object and use XSLT. :)
  • gnasher729 2013-02-18 21:42
    DaveK:
    gnasher729:
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
    Says who? There's nothing to stop you using SSE instructions to write your hash function, it hardly has to be crypto-strength; for you to just assume a hash function that's as bad compared to strcmp as you need it to be to make you correct is a circular argument. No matter how fast the strcmp, doing less of them is always going to be quicker than doing more of them, and doing less passes over memory is always better regardless of caching. In this particular case it should be trivial to generate a hash table that reduces 70+ strcmps to zero or one in all cases.



    There's nothing stopping you... except for time.
    strcmp is a standard C function. It is there. If you have a decent C library, you can expect a good implementation. You don't have to write it yourself. Comparing to a dozen strings using strcmp is trivial to implement and hard to get wrong.
    Hashing is difficult. You have to write it yourself. Writing it using SSE instructions makes it ten times harder. Then you need to build a hash table. Handle collisions. You still have to compare strings. Possibly multiple times. You have to test it. Carefully, because the code isn't trivial.
    "It should be trivial" is not the same as "it is trivial".

    And of course I said "a dozen strcmps". Not 70. On the other hand, to handle 70 compares, instead of hashing you can just switch on the first letter of the string, and then have 3 or 4 strcmps per letter. Still trivial to implement, and you try beating that with a hash table.
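    Translated from C to Java for illustration (String.equals standing in for strcmp), the switch-on-first-letter idea looks roughly like this. The class name and the tag subset are invented for the example, and the name is assumed to be lower-cased already.

```java
final class FirstLetterSwitch {
    /** Dispatches on the first character, then does a handful of direct comparisons. */
    static boolean isKnownTag(String name) {
        if (name.isEmpty()) {
            return false;
        }
        switch (name.charAt(0)) {
            case 'a':
                return name.equals("a") || name.equals("abbr") || name.equals("area");
            case 'd':
                return name.equals("div") || name.equals("dl");
            case 'i':
                return name.equals("img") || name.equals("input");
            case 'p':
                return name.equals("p") || name.equals("pre");
            default:
                return false; // first letter not used by any tag in this subset
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] { "img", "abbr", "blink" }) {
            System.out.println(name + " known? " + isKnownTag(name));
        }
    }
}
```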
  • Arancaytar 2013-02-19 04:22
    While crude, unsophisticated overkill, the motivation is understandable. Uppercase elements are an abomination on the face of the web, and there is a separate circle of hell for developers who use them.
  • Kasper 2013-02-19 05:19
    faoileag:
    sometimes people forget that "pre" in HTML does not allow you to use angle brackets directly (without encoding them as their corresponding HTML entities).
    Correct. If you want to do that, you use the <xmp> tag instead. The <xmp> tag was dropped though, and quite understandably. But it really was convenient to have. So back when the <xmp> tag was dropped from the newest HTML standard, I wrote a program to convert <xmp> tags to <pre> tags, such that I could still use them and convert them before putting the document on the webserver.

    This is what the converter looks like:
    %Start XMP
    
    %%
    "<"[Xx][Mm][Pp][^>]*">" { printf("<pre>"); BEGIN XMP;}
    <XMP>">" { printf("&gt;"); }
    <XMP>"</"[Xx][Mm][Pp][^>]*">" { printf("</pre>"); BEGIN INITIAL;}
    <XMP>"<" { printf("&lt;"); }
    <XMP>"&" { printf("&amp;"); }
  • Ironside 2013-02-19 05:23
    this is why websites should be 100% flash, that way they will parse faster. HTML is so old fashioned.
  • Jimshatt 2013-02-19 05:29
    Should have called pageHTML.toLowestCase(), incrementally lowering the case with toLowerCase never gives the desired result.

    And who is this Proro the Pedofiler guy?
  • Kasper 2013-02-19 05:38
    gnasher729:
    instead of hashing you can just switch on the first letter of the string, and then have 3 or 4 strcmps per letter.
    I'd take it one step further. Inside the switch on the first letter, you can have another switch on the second letter. If that second switch would distribute those 3 or 4 strcmps across separate cases, then it is totally worth it.

    Another approach is to have the outermost case switch on the length of the string you are looking up. The advantage of that is that the code inside doesn't have to worry about running off the end of the string. Then inside that you can have another level of switch statements on the first character.
  • no laughing matter 2013-02-19 05:53
    gnasher729:

    Hashing is difficult. You have to write it yourself. Writing it using SSE instructions makes it ten times harder. Then you need to build a hash table. Handle collisions. You still have to compare strings. Possibly multiple times. You have to test it. Carefully, because the code isn't trivial.
    "It should be trivial" is not the same as "it is trivial".
    Seventies are calling: They want their difficult hashing back!

    Seriously, have you stopped programming maybe twenty or thirty years ago?

    All relevant modern programming languages / standard libraries come with hashtables included. It is trivial, just use what is already implemented and tested.
  • Lordy 2013-02-19 06:28
    Don't several calls to toLowerCase() on an immutable object get optimised away anyway? I thought Java did a lot of background optimisation?
  • no laughing matter 2013-02-19 06:43
    Lordy:
    Don't several calls to toLowerCase() on an immutable object get optimised away anyway? I thought Java did a lot of background optimisation?

    For this to work toLowerCase() has to be transparent (free of side-effects / effects of the environment).

    It would work in a purely functional language like Haskell, but Java Bytecode doesn't give such guarantees.

    Imagine that toLowerCase() does some logging for debugging.
    Optimising it away would also optimise away the logging.
  • Andreas 2013-02-19 09:05
    Must
    Must add
    Must add one
    Must add one word
    Must add one word at
    Must add one word at a
    Must add one word at a time
    Must add one word at a time to
    Must add one word at a time to handle
    Must add one word at a time to handle Strings
    Must add one word at a time to handle Strings efficiently...
  • Purple Plastic Purse 2013-02-19 14:17
    Other problems: He used String concatenation in a loop instead of StringBuilder/StringBuffer. Also hard-coded newline instead of getting an environment variable. (In Java it'd be System.getProperty("line.separator");)
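    Something along these lines, as a sketch only; the original loop isn't quoted here, so the List of lines and the JoinLines name are stand-ins. System.lineSeparator() (Java 7 and later) returns the initial value of the line.separator property.

```java
import java.util.Arrays;
import java.util.List;

final class JoinLines {
    /** Joins the lines with the platform line separator, using a single StringBuilder. */
    static String join(List<String> lines) {
        String newline = System.lineSeparator();
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            // append() grows one buffer; += on a String would copy everything each iteration
            sb.append(line).append(newline);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(join(Arrays.asList("<html>", "<body>", "</body>", "</html>")));
    }
}
```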
  • Purple Plastic Purse 2013-02-19 14:25
    Word policing is kinda a sh*tty thing to do.
  • Purple Plastic Purse 2013-02-19 14:27
    Joe:
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?

    I hope it isn't a bad word. But if it is, I'm safe. As long as I don't know what it means, your bad word won't make me think a bad thought.

    However if you've made some kind of error, that other people still understand, then they're still thinking bad thoughts despite your error.

    So that couldn't be it.

    Still confused.


    Word policing is kinda a sh*tty thing to do.

    (Aside: why do they have a reply button if it doesn't quote the person you're replying to? Now I'm confused.)
  • Purple Plastic Purse 2013-02-19 14:34
    Purple Plastic Purse:
    Other problems: He used String concatenation in a loop instead of StringBuilder/StringBuffer. Also hard-coded newline instead of getting an environment variable. (In Java it'd be System.getProperty("line.separator");)


    Building the String was probably unnecessary altogether. He could have done his stupid parsing line-by-line.
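    A rough sketch of the line-by-line variant, with the caveat that a tag split across two lines would need extra handling that isn't shown. The per-line check here is just a placeholder (does the line contain a '<'), and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LineByLine {
    /** Processes the input one line at a time; the whole page is never held in one String. */
    static int countLinesWithTags(Reader source) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.indexOf('<') >= 0) { // placeholder for the real per-line checks
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "<p>one</p>\nplain text\n<IMG src=\"x.png\">\n";
        System.out.println(countLinesWithTags(new StringReader(page)));
    }
}
```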
  • Randy Snicker 2013-02-19 15:17
    heads of the project that Pedro worked on

    ???
  • pezpunk 2013-02-19 15:55
    Not only did the validator test each potential tag against every known tag name, it also lowercased the entire document multiple times.

    The library developer must have really wanted those tags to be lowercase.


    at the risk of stating the incredibly obvious ... strings are immutable. he wasn't lowercasing the document multiple times, he was calling a method that returned a lowercased document (the value of pageHTML wasn't modified). granted, the point is the same -- calling toLowerCase() on the same huge string repeatedly is stupid.
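    In code, the distinction is roughly this (a tiny illustration; the class and method names are invented):

```java
import java.util.Locale;

final class ImmutableDemo {
    /** Returns { original, lowered } to show that toLowerCase() leaves the receiver alone. */
    static String[] demo(String pageHTML) {
        String lowered = pageHTML.toLowerCase(Locale.ROOT); // a new String is returned
        return new String[] { pageHTML, lowered };          // pageHTML still has its old value
    }

    public static void main(String[] args) {
        String[] result = demo("<IMG SRC=\"x.png\">");
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```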
  • foxyshadis 2013-02-19 19:21
    faoileag:
    I actually like the first test in the sample given: it fires on all tags starting with "<a", not only the anchor tag.

    Ah well, the "Do a test involving the <a> tag"-Test will probably weed out applets, areas and the like.

    You could probably process a million random pages in the wild without running across a single applet, area, abbr, acronym, article, aside, or audio. Well, Wikipedia might have an audio if you hit the right page.
  • Xarthaneon the Unclear 2013-02-22 16:26
    For a brute-force evaluation, the continue keyword works wonders.

    CAPTCHA: venio - Veni veni venias, veni veni facias! (What does One Winged Angel have to do with HTML parsing? Nothing, the captcha just made me think of the only part that I can easily remember.)
  • nasch 2013-02-22 18:33
    Xarthaneon the Unclear:


    CAPTCHA: venio - Veni veni venias, veni veni facias!


    Hyrca, hyrce!