• drm (unregistered)

    I actually think this fellow is a covert switch-lover and tries to cover it up with weak excuses :P

  • Lothar (unregistered)

    This is a real WTF, especially with the reason why this was been done by the collegue:

    >He believes that directly comparing a string
    >variable (xmlTagName) to a string literal
    >(“sub_name“) will make your program run as
    >slow as cold molasses.

    XMLNEWSLETTERELEMENTS.get(cElement.getName))

    will lead to the following code to be executed (within the HashMap):

    int hashcode = key.hashCode();
    Object elem = internalArray[hashCode % internalArray.length];
    if (elem != null && elem.getKey().equals(key)){
    return elem.getValue();
    }
    else{
    return null;
    }

    (this is not the real code, there happens a little bit more). So instead of the "inefficient" String-comparison, we need a call of a method (hashCode), an access to an array, and a String-comparision. For sure this will be way more fast than a single String-comparison.


    Regards, Lothar

  • matt (unregistered)

    if the number of cases is large, this solution is actually not that bad. Think about it... a switch statement based on a number is a one-time lookup in the hash map, then a bunch of number comparisons. The alternative is a ton of if else if statements with a string comparison in each.

    this is certainly easier to look at.

  • AvonWyss (unregistered)

    Note that the current C# compiler does pretty much exactly this when you do a switch() on a string with some cases (I'm not sure about the number where it starts using the Hashtable).

    Maybe this guy was so impressed by an article describing this optimization done by the C# compiler that he wanted to do it in Java as well?

    Also note that, for XML data, the .NET framework often uses nametables, which are nothing but Hashtables containing references to the reference strings, so that one can do reference comparisons for equality on strings which are known to be in the same nametable. I guess this actually can speed things up if you're going to copare a hunge number of strings.

  • Lothar (unregistered)

    @matt: In Java one way to avoid a long if-elseif-cascade is using reflection.
    For handling XML there are better ways as well than this.

  • Alan Bellingham (unregistered)

    For a handful of strings, it's probably not worth it. For a large number (and the elided statements rather indicate that it is), then this is an optimisation.

    (Though I'd have used a binary search over an ordered string array, as probably having a slightly lower overhead.)

    So, how many strings were there? The breakeven point may be fewer than the five shown.

  • Saucepan (unregistered)

    Like others have said, there is nothing wrong with this technique if there are many strings to compare. Better still, since the tokens are apparently all known in advance and performance is apparently an issue, would be to use perfect hashing (http://sourceforge.net/projects/gperf/).

    But either way, it's hard to have an informed opinion about the tradeoff without seeing the profiler output.

  • WanFactory (unregistered)

    "Premature optimization is the root of all evil" - Donald Knuth

  • Guayo (unregistered)

    As others already said, this could be more an optimization if the number of strings is large. Each time you compare 2 strings you have to iterate over each array of characters
    For example with:
    if (s1.equals(s2)) {...}
    else if (s1.equals(s3)) {...}
    that means that you will have to iterate over s1 array of chars twice plus s2 and s3 iterations.

    Instead if you put s2 and s3 (and n more strings) in a hashtable then you iterated over each string char array just once to get the hashcode. Then to compare s1 to each of the strings in the hashtable you only have to get the s1 hashcode once and compare that int with the array of hashcodes. So even that getting the hashcode is slower than just iterating the array of chars to compare, you are saving the repeated iteration of s1 because that value is calculated on each time you call the get method of the hashtable, not even that but the hashcode of s2, s3,... etc. was calculated only when they were put in the hashtable. So I think it's easy to see how could this be an effective optimization depending of the number of strings to compare.
    I hope this time my english was at least barely understandable.

  • Tom (unregistered)

    matt: This is easier to look at than code which just uses simple string comparisons? Surely you jest?

  • Shawn Willden (unregistered)

    If the set of strings is large enough, and if the Java compiler is good, this is an excellent optimization technique.

    The "default" approach involves comparing the element name against each possible match. If there were, say, thousands, of possible names, this would be slow.

    This approach requires a hash function to be applied to the name, an hashtable lookup, and a very small number of string compares to search the hash bucket list. Then the switch has to find the right chunk of code for the resulting integer. Even with a stupid compiler that generates a long series of numerical comparisons, this will be an order of magnitude faster than doing a long series of string comparisons. But a proper compiler should implement a switch statement not as a series of 'if's, but as a hashtable lookup. The work of creating this hash table should be done at compile time, not runtime.

    This optimization reduces an O(n) problem to an O(1) problem.

    That said, I think Knuth's aphorism applies here. This sort of optimization should only be done after profiling shows that it does, indeed, matter. It's highly doubtful that the relevant XML scheme contains enough element types to make this worth the effort or the confusion factor.

  • Alex Beynenson (unregistered)

    Guayo: "Each time you compare 2 strings you have to iterate over each array of characters"

    I imagine that this is done only when the number of characters in both strings is the same, which looks like would happen very rarely in this example.

  • Guayo (unregistered)

    Yes that's obviously correct. If the length of those 2 strings is not the same you already know they aren't equal so you will return false from the equals method and skip the array iteration part. The five strings posted here have different lengths but a comment hinst that there are more that those. Even if I knew all the tag names I wouldn't be sure if this is an effective optimization. The definitive answer is profiling.
    But you definitely have a point, if the set of strings are basically of strings with different lengths then doing a lot of calls to the equals method would be faster than the hashtable lookup.
    My point is this. This could be a good optimization or not. By looking at that snippet I can't say it's a WTF or a JWD (job well done). And I think nobody can.

  • Guayo (unregistered)

    @ Shawn Willden
    The optimization goodness in this snippet does not depends only of the complexity of the XML Schema. (number of tags) but also of the frequency this code runs.

  • koala (unregistered)

    I can understand that someone would hate the zillions of if-else's that one would naturally prefer to write as switch-case.

    In a way, I see it as more readable...

  • JoeJoe (unregistered)

    If you're going to go overboard, why not go the way and use Java's reflection to map a method reference directly and completely get rid of the switch?


    XMLNEWSLETTERELEMENTS.put(
    "name",
    cmNewsObject.getMethod("setName", params)
    );

  • koala (unregistered)

    Well, reflection breaks IDE features like refactoring/intelligent searches/etc., so normally I would not use it (maybe in this case, I would, because it would be to call library functions, but...)

  • Jp (unregistered)

    Well, there is at least one WTF here, in that the guy defines integer constants that correspond to each of the XML tags, but then inexplicably doesn't use those constants when creating the hash table. WTF?

  • gerg (unregistered)

    Actually he's kind of on the right track. Hard coding all the tag names in the switch would make it harder to make the system more flexible later.

    The problem is he didn't go far enough. He still has every case hard coded in the switch. Instead of loading integers into the hashmap he should load a data structure that describes what to do to the cmNewsObject. The method name, and any conversion function to be called on the string.

    Then one day the hash table itself could be loaded from a DTD or used for other purposes in addition to this parsing function. This function could probably even be reused with other nodes in the parse tree.

  • Kent (unregistered)

    Weird how he didn't use the constants in the initialisation of the HashMap:

    XMLNEWSLETTERELEMENTS.put("name", new Integer(0));

    instead of:

    XMLNEWSLETTERELEMENTS.put("name", new Integer(NEWSLETTER_NAME));

    @WanFactory: How do you know this was a premature optimisation? He may have dutifully measured performance and found this (possibly critical) section of code to be slower than desirable.

    The fact is, we don't have enough context to judge this a WTF. How many more entries are there and are there many strings with the same length? How critical is this code path?

    The only WTF I see is not using the constants in the initialisation.

  • Chad (unregistered)

    But what if the xml was UNICODE!?

  • AndrewSeven (unregistered)

    Can that really be faster than:

    private static final string NEWSLETTER_NAME = "name";
    ...


    switch (cElement.getName())
    {
    case NEWSLETTER_NAME: {
    cmNewsObject.setName(cElement.getTextTrim());
    break;
    }


    ...

    }

  • Tim Smith (unregistered)

    For all you mind numbed robots who just repeat what Knuth said without understanding what he meant, the full quote is here (and Knuth didn't make it):

    "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Richard Hoare.

    To say that any premature optimization is bad is not only wrong but a poor way to design software. STL has the performance characteristics of all the containers well documented so you can use the right one for the right job. Selecting the right one from the start is a form of optimization and it would be stupid not to do so.

  • Kent (unregistered)

    To quote Jan Gray: "Premature optimization is the root of all evil. But so is careless inattention to efficiency."

  • foxyshadis (unregistered)

    Er, Tim, premature optimization refers to optimization done on unverified and often unresearched assumptions. The whole point of the STL docs are so that you can learn which best matches your algorithms, at which point it's no longer premature optimization, it's optimization done with an understanding of the complete system in which it runs.

    Premature optimization in that case would be changing your linked list to a vector and changing your entire codebase because you think it's traversing the list too slow, when the problem is that you have a method a few levels down that makes a temporary of each node each time it compares them in your braindead homemade seach code.

  • Jani (unregistered)

    Reflection is also quite expensive, much more expensive to invoke a method using reflection and a method name in a String run-time than to have a direct call of the method in compiled code. Therefore, given this is an XML parser and used for large input files, and the switch gets called very frequently and it really needs to be as fast as possible, I wouldn't use reflection.

  • Centaur (unregistered)

    By the amount of duplication in the switch, I see it wants to be a virtual function call, or maybe a visitor. Such as

    cmNewsObject.accept(makeElementVisitor(cElement.getName));

    Oh, and using a varying font size for code is a WTF.

  • DrPizza (unregistered)

    Why doesn't he just intern the input strings and then use pointer comparisons on them?

  • RichB (unregistered)

    cf. .Net's System.Xml.XmlNameTable

  • joe (unregistered)

    Doesn't the java string comparison stop at the first char that is not the same so..
    comparing "abcdefghijk" to "ghtjklmesns" will only compare the first char and return
    and comparing "abx" to "adx" will only check two before it returns.

  • WanFactory (unregistered)

    @Kent
    I will wager money that no profiler has been used on the app this code is running in. In all likelihood, this thing is running after parsing XML being sent over HTTP. That alone would be slow enough to make any hash vs comparison gains insignificant. No, I dont know for sure if this is premature but aren't 97% of optimizations premature?

    Though that being said, I'm not sure if a premature optimization counts as a WTF...

  • Sammy (unregistered)

    Everyone here missed the true WTF in this post. The guy is using JAVA. Who uses that anymore? IMO it's not even a real programming language. What a joke.

  • Fred (unregistered)

    I agree with Sammy. Java is so...1996. Do they even hire Java programmers anymore? Everything these days is written in C# or (gasp) VB.NET. The whole J# thing was only created to placate the masses. (It was little more than a publicity stunt.) Why are people still programming in this stuff? Like, eww.

  • Tad (unregistered)

    @Fred: Even though I don't agree with all of your comments, I will agree that Java is on its way out. There's really no point to it anymore. You can do everything you need to do in C#...easily. Java is unecessarily overly-complicated and the learning curve is huge. Not to mention that I've seen so many Java applications spontaneously crash it's not even funny. Why go through all that when you can drag and drop with C# in Visual Studio.NET? You can literally whip out a fully functional (stable) app in mere minutes. It's only a matter of time before Java goes the way of COBOL.

  • John (unregistered)

    Hey Fred, I tried building a Java app once. Listen to how bad the experience was. First I had to load this Java Runtime Environment which was really stupid. I mean, I already had the .NET Framework built-in to my OS and could write a .NET app no problem, so why should I go loading another "environment"? Anyways, I started to create these Java "classes" and whatnot, which was a real chore. You had to manage the class files and directories and all that crap. What a complete chore. Then you compile the thing and the "swing" interface (or whatever the thing is - the GUI part) looks so crappy. It looks like it's about to take down my entire machine. In any event, I gave up writing the app because honestly it was too hard. It was really complicated and I hated it. The funny thing is, I wrote an awesome application the next day in 5 minutes with .NET.

  • Doc B. (unregistered)

    I used to be a real Java advocate back in '95 when it first hit the scene. I believed at the time that only real manly men coded in Java. Looking back on it, it was a great language to code in, at that time. See, that's the key. It was good for what it was in the age it was created. But it's 2004 (almost 2005) and I think we all need to move on.

  • Guayo (unregistered)

    Zealots alert!

  • Tom (unregistered)

    Java is on the way out just because MS came along and made their own copy which you can now use instead? Oh dear; what a sad state the software industry is in.

  • Mike (unregistered)

    @Sammy: How can you say that about Java? It's a really good language to...you know what, nevermind. You're right. Java does suck.

  • Simon (unregistered)

    Look at all these people denouncing Java. Are they Microsoft employees, by any chance?

  • Guayo (unregistered)

    OK. Ignoring the zealotry...

    @Joe
    You are correct, however the performance gain is not because the equals function is slower than the hashcode function (It's obvious is the opposite), but because in a hashtable lookup you don't have to calculate the hashcode of each of the strings in your table (you already did that the first time those strings were put in the table) and the only hashcode you have to compute is of the string you want to find.
    Even more important is the fact that in a hashtable lookup after you get the hashcode of the string you want to check you don't need to iterate on each bucket to see if the string is found in the table, because knowing the hashcode means you know which bucket to search. Instead of an O(N) linear search you are using an O(1) lookup.
    Depending of the number of string involved (ant the strings itself) a linear search could be faster or slower than a hashtable lookup. However the way to really find out must be profiling your app. As someone said even if in this case the hashtable lookup is faster than a lot of if statements, the bottleneck could be another thing (most of the times: IO operations) so the performance gain will be negligible. Again, profiling is the way to find out what code paths needs to be optimized.

  • KoFFiE (unregistered)

    ignores the Java bashes

    Think the main thing has been said. For large lists of strings, the hash-table WILL be faster (a lot even). I did optimalisations like this in Java (after research offcourse) - and found out that hash-tables were a lot faster (2/3 times faster in fact) on the embedded platform that had to run the code... (which was in fact a 386em with 8mb ram btw) On pc somehow, there was little or no difference.

    But as also stated, this isn't worth doing when you're using XML... XML + performance don't go too well. (I don't like it - let's bash XML instead of java :p)

  • (cs)

    Defining some constants and not using them a few lines later, priceless:

    <font color="red">private</font> <font color="red">static</font> <font color="red">final</font> <font color="red">int</font> NEWSLETTER_NAME <font color="blue" size="+1">=</font> <font color="brown">0</font><font color="blue" size="+1">;</font>
    <font color="red">private</font> <font color="red">static</font> <font color="red">final</font> <font color="red">int</font> NEWSLETTER_SUB_NAME <font color="blue" size="+1">=</font> <font color="brown">1</font><font color="blue" size="+1">;</font>
    <font color="red"></font><font color="red">
    ...

    private
    </font> <font color="red">static</font> <font color="red">final</font> Map XMLNEWSLETTERELEMENTS <font color="blue" size="+1">=</font> <font color="red">new</font> HashMap<font color="blue" size="+1">(</font><font color="blue" size="+1">)</font><font color="blue" size="+1">;</font> <font color="red">static</font> <font color="blue" size="+1">{</font> XMLNEWSLETTERELEMENTS<font color="blue" size="+1">.</font>put<font color="blue" size="+1">(</font><font color="purple">"name"</font><font color="blue" size="+1">,</font> <font color="red">new</font> Integer<font color="blue" size="+1">(</font><font color="brown">0</font><font color="blue" size="+1">)</font><font color="blue" size="+1">)</font><font color="blue" size="+1">;</font> XMLNEWSLETTERELEMENTS<font color="blue" size="+1">.</font>put<font color="blue" size="+1">(</font><font color="purple">"sub_name"</font><font color="blue" size="+1">,</font> <font color="red">new</font> Integer<font color="blue" size="+1">(</font><font color="brown">1</font><font color="blue" size="+1">)</font><font color="blue" size="+1">)</font><font color="blue" size="+1">;
    ...
    </font>

  • (cs)

    I am left wondering why the hash value was an integer, as opposed to a function that did what the case does in the switch statement. That's the true wtf imho. If you really needed the integer values, you could have 2 tables... Not sure if Java has "delagates" or pointers to functions though, so I'm not sure if that's really possible.

  • (cs) in reply to GoatCheez
    GoatCheez:
    Not sure if Java has "delagates" or pointers to functions though, so I'm not sure if that's really possible.

    Java (mostly) uses anonymous inner classes where C# uses delegates, e.g.

    JButton button = new JButton("FooBar");
    button.addActionListener( new <font color="#006400">ActionListener</font>() {
      public void <font color="#008000">actionPerformed</font>(ActionEvent e) {
        someObject.doWhatEverIWant();
      }
    });


    This code creates an instance of an anonymous class that implements the ActionListener interface, which requires a actionPerformed method.

  • Scrub (unregistered) in reply to drm

    I'm a switch lover myself, I'm not sure how java would handle this internally and if it is more efficient, but I prefer it to and endless line of if (..) else if (...) else if (...)

     

    private enum NEWSLETTER {NEWSLETTER_NAME,
       NEWSLETTER_SUB_NAME,
       NEWSLETTER_SCREEN_NAME,
       NEWSLETTER_EXTERNAL_NAME,
       NEWSLETTER_ID};  

      switch (NEWSLETTER.valueOf(cElement<FONT color=blue size=+1>.</FONT>getName())) {
              case NEWSLETTER_NAME: 
                      cmNewsObject.setName(cElement.getTextTrim());
                      break;
              case NEWSLETTER_SUB_NAME: 
                      cmNewsObject.setSubName(cElement.getTextTrim());
                      break;
              case NEWSLETTER_SCREEN_NAME: 
                      cmNewsObject.setScreenName(cElement.getTextTrim());
                      break;
              case NEWSLETTER_EXTERNAL_NAME: 
                      cmNewsObject.setExternalName(cElement.getTextTrim());
                      break;
              case NEWSLETTER_ID: 
                      cmNewsObject.setId(Long.valueOf(cElement.getTextTrim());
                      break;
      }  

  • (cs) in reply to Scrub

    Why even bother with switch?

    enum Newsletter {
      NAME () {
        void apply (NewsObject newsobj, String value) {
          newsobj.setName (value);
        }
      },
      SUB_NAME () {
        void apply (NewsObject newsobj, String value) {
          newsobj.setSubName (value);
        }
      } /* ... */;



      // method name sucks, illustration only
      abstract void apply (NewsObject newsobj, String value);
    }



    /* ... */



    Newsletter.valueOf (cElement.getName ()).apply (cmNewsObject, cElement.getTextTrim ());

  • missing (unregistered)

    This is not so much of a wtf if you consider the usual alternative of if-else trees. Doing it for only 4 would be a wtf, but for maybe 100 or 200 it could certainly speed things up. I did something similar in C++ once for a big parser like thing with over 500 cases and the speedup was about 300%. Of course doing that kind of thing in all cases isnt good, as you should always obey the rule of making things work, profile then optimize. And there is something in between those two ways:

    <font face="Courier New">string s = ....; // some data
    switch( s[0] )
    {
    case 'a': if( s == "abort" ) { } else if ( s == "add" ) { } ...
    case 'b': if( s == "break" ) { } // I think you get the idea...
    }

    <font face="Arial">Which when applied at the right point, can be faster than both...</font>
    </font>

  • FORUM IS WTF! (unregistered)

    I'm not sure which is a bigger WTF... The original posting or the responses to this forum....

    Why not do the following:

    1) For each element type handler, create an object to handle it.  Make the object also contain a reference to a string of its element type.

    2) Sort the handlers, once.


    For all future lookups, just do a binary search.

    Log2(X) is a very small number even when X is huge.

  • JoelKatz (unregistered)

    The code is precisely what modern parsers, such as those build by bison/flex do. The code in your typical C compiler probably looks much like this. There is nothing wrong with 'tokenizing' input text into a numeric tag, then processing based on the numeric tag.

    As for string comparisons, the cost varies depending on how they're implement. For example, in C, sometimes the length of both strings is known, and the comparison can shortcut if they differ. Sometimes the lengths aren't known, so each comparison requires either going through all the matching characters or computing the length of one or more strings.

    My bet is, though, the guy who thought java string comparison was slow was probably doing it wrong. He had either specifically rigged it so the length was never known, had 'intern'ed each string so he could compare with ==, or something equally silly.

Leave a comment on “Java's Expensive String Comparison”

Log In or post as a guest

Replying to comment #:

« Return to Article