• (cs) in reply to Remy Porter
    Remy Porter:
    There honestly should be such a tag. There should also be a tag that allows you to pass its contents to a different interpreter, thus making it easier to inline binary data.
    CDATA?
  • (cs) in reply to gnasher729
    gnasher729:
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.
    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
    Says who? There's nothing to stop you using SSE instructions to write your hash function, it hardly has to be crypto-strength; for you to just assume a hash function that's as bad compared to strcmp as you need it to be to make you correct is a circular argument. No matter how fast the strcmp, doing less of them is always going to be quicker than doing more of them, and doing less passes over memory is always better regardless of caching. In this particular case it should be trivial to generate a hash table that reduces 70+ strcmps to zero or one in all cases.
  • (cs) in reply to Joe
    Joe:
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?
    Think glob, not regex.
  • (cs)

    Load it into an XmlDocument object and use XSLT. :)

  • gnasher729 (unregistered) in reply to DaveK
    DaveK:
    gnasher729:
    DaveK:
    If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.
    Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
    Says who? There's nothing to stop you using SSE instructions to write your hash function, it hardly has to be crypto-strength; for you to just assume a hash function that's as bad compared to strcmp as you need it to be to make you correct is a circular argument. No matter how fast the strcmp, doing less of them is always going to be quicker than doing more of them, and doing less passes over memory is always better regardless of caching. In this particular case it should be trivial to generate a hash table that reduces 70+ strcmps to zero or one in all cases.

    There's nothing stopping you... except time. strcmp is a standard C function. It is there. If you have a decent C library, you can expect a good implementation. You don't have to write it yourself. Comparing to a dozen strings using strcmp is trivial to implement and hard to get wrong. Hashing is difficult. You have to write it yourself. Writing it using SSE instructions makes it ten times harder. Then you need to build a hash table. Handle collisions. You still have to compare strings. Possibly multiple times. You have to test it. Carefully, because the code isn't trivial. "It should be trivial" is not the same as "it is trivial".

    And of course I said "a dozen strcmps". Not 70. On the other hand, to handle 70 compares, instead of hashing you can just switch on the first letter of the string, and then have 3 or 4 strcmps per letter. Still trivial to implement, and you try beating that with a hash table.

  • Arancaytar (unregistered)

    While crude, unsophisticated overkill, the motivation is understandable. Uppercase elements are an abomination on the face of the web, and there is a separate circle of hell for developers who use them.

  • Kasper (unregistered) in reply to faoileag
    faoileag:
    sometimes people forget that "pre" in HTML does not allow you to use angular brackets in HTML directly (without encoding them as their corresponding HTML-entities).
    Correct. If you want to do that, you use the <xmp> tag instead. The <xmp> tag was dropped though, and quite understandably. But it really was convenient to have. So back when the <xmp> tag was dropped from the newest HTML standard, I wrote a program to convert <xmp> tags to <pre> tags, so that I could still use them and convert them before putting the document on the webserver.

    This is what the converter looks like:

        %Start XMP
        %%
        "<"[Xx][Mm][Pp][^>]*">"        { printf("<pre>"); BEGIN XMP; }
        <XMP>">"                       { printf("&gt;"); }
        <XMP>"</"[Xx][Mm][Pp][^>]*">"  { printf("</pre>"); BEGIN INITIAL; }
        <XMP>"<"                       { printf("&lt;"); }
        <XMP>"&"                       { printf("&amp;"); }
  • Ironside (unregistered)

    this is why websites should be 100% flash, that way they will parse faster. HTML is so old fashioned.

  • Jimshatt (unregistered)

    Should have called pageHTML.toLowestCase(), incrementally lowering the case with toLowerCase never gives the desired result.

    And who is this Proro the Pedofiler guy?

  • Kasper (unregistered) in reply to gnasher729
    gnasher729:
    instead of hashing you can just switch on the first letter of the string, and then have 3 or 4 strcmps per letter.
    I'd take it one step further. Inside the switch on the first letter, you can have another switch on the second letter. If that second switch would distribute those 3 or 4 strcmps across separate cases, then it is totally worth it.

    Another approach is to have the outermost case switch on the length of the string you are looking up. The advantage of that is that the code inside doesn't have to worry about running off the end of the string. Then inside that you can have another level of switch statements on the first character.
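
    A rough sketch of that nesting, written in Java rather than C so that it matches the article's code. The class name TagSwitch and the handful of tag names are made up for illustration; they are not the list from the article. The outer switch is on the length, the inner one on the first character, and only then does a full comparison happen.

```java
// Hypothetical sketch: dispatch on length first, then on the first character,
// so each branch is left with at most a few full string comparisons.
// The tag names are a small illustrative subset, not the article's list.
final class TagSwitch {
    static boolean isKnownTag(String s) {
        switch (s.length()) {
            case 1:
                // Length is known to be 1, so the first character is the whole string.
                switch (s.charAt(0)) {
                    case 'a': case 'b': case 'i': case 'p': return true;
                    default: return false;
                }
            case 2:
                switch (s.charAt(0)) {
                    case 'b': return s.equals("br");
                    case 'h': return s.equals("h1") || s.equals("h2") || s.equals("hr");
                    case 't': return s.equals("td") || s.equals("th") || s.equals("tr");
                    default: return false;
                }
            case 3:
                switch (s.charAt(0)) {
                    case 'd': return s.equals("div");
                    case 'i': return s.equals("img");
                    case 'p': return s.equals("pre");
                    default: return false;
                }
            default:
                // Covers the empty string too, so charAt(0) is never called on it.
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isKnownTag("pre"));
        System.out.println(isKnownTag("xmp"));
    }
}
```

    Because the length is checked first, none of the inner branches has to guard against reading past the end of the string, which is the advantage described above.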

  • (cs) in reply to gnasher729
    gnasher729:
    Hashing is difficult. You have to write it yourself. Writing it using SSE instructions makes it ten times harder. Then you need to build a hash table. Handle collisions. You still have to compare strings. Possibly multiple times. You have to test it. Carefully, because the code isn't trivial. "It should be trivial" is not the same as "it is trivial".
    Seventies are calling: They want their difficult hashing back!

    Seriously, have you stopped programming maybe twenty or thirty years ago?

    All relevant modern programming languages / standard libraries come with hashtables included. It is trivial, just use what is already implemented and tested.
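
    For instance, in Java it might look roughly like this. The class name TagSet and the tag list are placeholders for illustration, not the validator's actual data; java.util.HashSet does the hashing and the collision handling.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: membership test against a fixed set of tag names
// using the standard library's hash set instead of sequential comparisons.
final class TagSet {
    // Illustrative subset of tag names; the real list would be longer.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "a", "b", "br", "div", "img", "p", "pre", "table", "td", "tr"));

    static boolean isKnownTag(String name) {
        // Lowercase only the short candidate name, not the whole document.
        return KNOWN_TAGS.contains(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isKnownTag("DIV"));
        System.out.println(isKnownTag("blink"));
    }
}
```

    Locale.ROOT is used so that the lowercasing does not depend on the machine's default locale.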

  • Lordy (unregistered)

    Don't several calls to toLowerCase() on an immutable object get optimised away anyway? I thought Java did a lot of background optimisation?

  • (cs) in reply to Lordy
    Lordy:
    Don't several calls to toLowerCase() on an immutable object get optimised away anyway? I thought Java did a lot of background optimisation?
    For this to work toLowerCase() has to be transparent (free of side-effects / effects of the environment).

    It would work in a purely functional language like Haskell, but Java Bytecode doesn't give such guarantees.

    Imagine that toLowerCase() does some logging for debugging. Optimising it away would also optimise away the logging.

  • Andreas (unregistered)

    Must Must add Must add one Must add one word Must add one word at Must add one word at a Must add one word at a time Must add one word at a time to Must add one word at a time to handle Must add one word at a time to handle Strings Must add one word at a time to handle Strings efficiently...

  • Purple Plastic Purse (unregistered)

    Other problems: He used String concatenation in a loop instead of StringBuilder/StringBuffer. Also hard-coded newline instead of getting an environment variable. (In Java it'd be System.getProperty("line.separator");)
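
    A minimal sketch of both points, assuming the lines are already in a list (the class and method names are invented for the example). System.lineSeparator() returns the same value as the line.separator property, which is strictly a system property rather than an environment variable.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: build one String from many lines with a StringBuilder
// and the platform line separator, instead of += in a loop with a literal "\n".
final class JoinLines {
    static String join(List<String> lines) {
        String newline = System.lineSeparator();
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            // append() grows one buffer; += would copy the whole string each time.
            sb.append(line).append(newline);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(join(Arrays.asList("<html>", "<body>", "</body>", "</html>")));
    }
}
```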

  • Purple Plastic Purse (unregistered) in reply to Joe

    Word policing is kinda a sh*tty thing to do.

  • Purple Plastic Purse (unregistered) in reply to Joe
    Joe:
    A. Nonymous:
    sh*tty
    I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?

    I hope it isn't a bad word. But if it is, I'm safe. As long as I don't know what it means, your bad word won't make me think a bad thought.

    However if you've made some kind of error, that other people still understand, then they're still thinking bad thoughts despite your error.

    So that couldn't be it.

    Still confused.

    Word policing is kinda a sh*tty thing to do.

    (Aside: why do they have a reply button if it doesn't quote the person you're replying to? Now I'm confused.)

  • Purple Plastic Purse (unregistered) in reply to Purple Plastic Purse
    Purple Plastic Purse:
    Other problems: He used String concatenation in a loop instead of StringBuilder/StringBuffer. Also hard-coded newline instead of getting an environment variable. (In Java it'd be System.getProperty("line.separator");)

    Building the String was probably unnecessary altogether. He could have done his stupid parsing line-by-line.
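
    Something along these lines, as a sketch only. The class name, the method and the crude "line contains the opening of this tag" check are invented for the example; it is not a real HTML parser and not the library's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical sketch: scan the input one line at a time instead of first
// concatenating everything into a single huge String.
final class LineByLine {
    static int countLinesWithTag(Reader source, String tag) {
        // Crude prefix match: "<div" also matches "<div class=...>".
        String needle = "<" + tag.toLowerCase(Locale.ROOT);
        int count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the current line is lowercased, never the whole page.
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "<DIV>\n<p>text</p>\n<div class='x'>\n";
        System.out.println(countLinesWithTag(new StringReader(page), "div"));
    }
}
```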

  • Randy Snicker (unregistered)
    heads of the project that Pedro worked on
    ???
  • pezpunk (unregistered)
    Not only did the validator test each potential tag against every known tag name, it also lowercased the entire document multiple times.

    The library developer must have really wanted those tags to be lowercase.

    at the risk of stating the incredibly obvious ... strings are immutable. he wasn't lowercasing the document multiple times, he was calling a method that returned a lowercased document (the value of pageHTML wasn't modified). granted, the point is the same -- calling toLowerCase() on the same huge string repeatedly is stupid.
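
    a small sketch of the difference, with made-up names (LowerOnce, countPresentTags) and a made-up page; only pageHTML and toLowerCase() come from the article. the lowercased copy is computed once and reused for every tag, and the original string stays exactly as it was.

```java
import java.util.Locale;

// Hypothetical sketch: String is immutable, so toLowerCase() returns a new
// String each time. Call it once and reuse the result for every lookup.
final class LowerOnce {
    static int countPresentTags(String pageHTML, String[] tagNames) {
        // One pass over the document, outside the loop.
        String lower = pageHTML.toLowerCase(Locale.ROOT);
        int present = 0;
        for (String tag : tagNames) {
            if (lower.contains("<" + tag + ">")) {
                present++;
            }
        }
        return present;
    }

    public static void main(String[] args) {
        String pageHTML = "<DIV><P>Hi</P></DIV>";
        String lower = pageHTML.toLowerCase(Locale.ROOT);
        System.out.println(pageHTML); // the original is not modified
        System.out.println(lower);
        System.out.println(countPresentTags(pageHTML, new String[] {"div", "p", "pre"}));
    }
}
```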

  • foxyshadis (unregistered) in reply to faoileag
  • Xarthaneon the Unclear (unregistered)

    For a brute-force evaluation, the continue keyword works wonders.

    CAPTCHA: venio - Veni veni venias, veni veni facias! (What does One Winged Angel have to do with HTML parsing? Nothing, the captcha just made me think of the only part that I can easily remember.)

  • nasch (unregistered) in reply to Xarthaneon the Unclear
    Xarthaneon the Unclear:

    CAPTCHA: venio - Veni veni venias, veni veni facias!

    Hyrca, hyrce!
