• Methinks (unregistered) in reply to fred
    fred:
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Do you understand how hashes work? It wouldn't be an O(n) operation...
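    For the record, a lookup in a hash set is amortized O(1), not O(n). A minimal Python sketch of the word-set idea (the function name is hypothetical):

```python
def text_contains_any_word(search_terms, text):
    # Build a hash set of the words in the text once: O(n) total.
    words = set(text.split())
    # Each membership test is then O(1) on average.
    return any(term in words for term in search_terms)

print(text_contains_any_word(["cat", "dog"], "there is a cat on the tree"))  # True
```

    Note this only matches whole space-separated words, which is exactly what the Split-based original does.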

    That said, as others have pointed out, we might be looking for matches on more than individual words....

    Why? There is nothing in the original implementation that suggests that. In fact, searchText.Split(" ") pretty much says that no, that is not what we are doing here.

    Why are people still pissing all over each other trying to solve the substring problem when this isn't one???

    Hats off to Mark Bowytz for this WTF. It's a 3 bagger.

    Bob's code is a WTF. Jed's solution is a bigger WTF (it fails basic regression tests). Most solutions posted here FAIL.

    WTF?

  • Dave (unregistered)

    Lol @ all this. Why doesn't anyone study data structures and algorithms anymore?

    Seriously. Want to sort and search strings and substrings persistently and incredibly fast? Ever heard of a trie? A Judy array, specifically. I mean, seriously! TRWTF is that this is (IMO) basic algorithmic stuff that 99% of TDWTF readers are failing at.
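    A bare-bones trie (not a Judy array, which is a far more engineered radix structure) might look like this in Python; all names are illustrative:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the next node
        self.is_word = False # True if a stored word ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # Walk/extend one node per character: O(len(word)).
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # Lookup cost depends only on len(word), not on how many
        # words are stored.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
t.insert("cat")
print(t.contains("cat"), t.contains("ca"))  # True False
```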

  • NPSF3000 (unregistered) in reply to foo
    foo:
    Yale:
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index /#{terms.collect{ |x| Regexp.escape x }.join("|")}/
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

    Yes, efficient code (in terms of O()) can be slower for small data where it doesn't matter, but it gets relatively faster (i.e., not so much slower) than the naive code when the data size grows.

    Again, that's exactly the point: Coders (such as you, apparently) who don't understand efficiency do irrelevant/misleading tests and come to the wrong conclusions. (Additionally, they usually stubbornly refuse to accept results from those who have studied the issues more intensively, because that's "academic BS".)

    But if you only believe what you see yourself, repeat your test with 10* the number of search items and 10* the data size, then with 100* both. The naive version will take roughly 100* and 10000* as long, the regex one (or any other efficient one) somewhat more than 10* and 100*.
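    A rough harness for that experiment, sketched in Python (sizes arbitrary; note that a backtracking regex engine may not actually deliver the linear-scan behavior assumed above, and the pattern is recompiled per call, which a fairer test would hoist out):

```python
import re
import timeit

def naive(terms, body):
    # Scans the body once per term.
    return any(t in body for t in terms)

def compiled(terms, body):
    # One scan over the body with an alternation of escaped literals.
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return pattern.search(body) is not None

terms = ["needle%d" % i for i in range(100)]
body = "hay " * 2000  # no term matches: worst case, everything gets scanned

for fn in (naive, compiled):
    secs = timeit.timeit(lambda: fn(terms, body), number=20)
    print(fn.__name__, round(secs, 3))
```

    Rerun it with the term count and body size scaled up to see how each curve grows.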

    I'm getting somewhat upset about this because I see this happen all too often in real applications: things that slow down much more than necessary when the data size grows because the programmers were apparently too lazy to use the right algorithm (because "it won't happen", yeah, everyone knows how those "no one will ever need more than ..." assumptions play out).

    The worst offender at the moment is my stupid TV recorder. Merely opening the list of recordings takes more than 20 seconds with ~200 entries. To be honest, I really don't know how they managed to achieve that; even the most stupid way of doing things I can imagine wouldn't take nearly as long. But apparently they only ever tested it with a few entries (where the delay is hardly noticeable, given the device's general sluggishness) and never bothered to either test with larger data or consider how their code scales based on known facts about efficiency.

    </rant>

    Each to their own.

    I personally get pissed off when some douche bag spends a full day implementing some special x, y, z followed by days of testing and modification to get it working 'just right' - when I in 15 mins created a simple n^2 loop, threaded it and moved on with only 5 lines of code to ever worry about should there be problems down the track.

    IMO you have to PROVE that the naive way is too slow with the given requirements before you even think about implementing a non-trivial alternative. I note you haven't.

  • Someone (unregistered)

    Ah, the Sunk Cost fallacy. The manager is a fail.

  • Quicksilver (unregistered)

    Is there no Regexp Pattern quoting in that language?

    In Java I would expect someone to at least call Pattern.quote(searchstring) before concatenating them all with pipes.

  • (cs) in reply to AB
    AB:
    It's so pointless. Why do sheeple do it? Can they not see the total redundancy of 5 lines of code? Do they really think it's "more readable"?

    Return matches.Count > 0

    +1
  • Nick (unregistered) in reply to AB

    By the same token:

    For i As Integer = 1 To paramTable.stackHeight() - 1 Step 1
      If searchTable.Contains(paramTable.Pop()) Then
        truthTable.Push(True)
      Else
        truthTable.Push(False)
      End If
    Next

    Should be

    For i As Integer = 1 To paramTable.stackHeight() - 1 Step 1
      truthTable.Push(searchTable.Contains(paramTable.Pop()))
    Next

    But on the whole I actually prefer the first version. TRWTF is VB. VB is a toy and has no place in professional programming.

  • Jeltz (unregistered) in reply to Yale
    Yale:
    I just implemented that method in Ruby as follows:
    body.index /#{terms.collect{ |x| Regexp.escape x }.join("|")}/

    Agree with your solution, but Ruby has a function in the standard library for creating a regexp from a list of strings.

    def search(terms, body)
      body =~ Regexp.union(terms)
    end
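    For comparison, Python has no direct Regexp.union equivalent, but re.escape gives the same metacharacter-safe behavior (a sketch; the function name is illustrative):

```python
import re

def search(terms, body):
    # re.escape neutralizes regex metacharacters in each term,
    # which is what Regexp.union does for you in Ruby.
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    m = pattern.search(body)
    return m.start() if m else None

print(search(["c++", "ruby"], "i write c++ sometimes"))  # 8
```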
    
  • ProgrammerRob (unregistered)

    The purpose of a code review is not to determine if the solution is elegant.

    A solution should work, should be maintainable, and should have reasonable performance.

    If the first solution worked, and if a novice programmer could read the code and understand it, and if the solution did not create a noticeable performance issue, then changing to a different solution is risky with no benefit.

  • TheJonB (unregistered) in reply to ProgrammerRob
    ProgrammerRob:
    The purpose of a code review is not to determine if the solution is elegant.

    It should probably flag up where it's shit though.

  • L. (unregistered) in reply to Yale
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    
    Captcha: sino (actually, Ruby is Japano)

    I'm glad you code in Ruby. More ignorant fools should code in Ruby just like you do, and then we'd only have Ruby code to ignore.

    Even if your Ruby was good code, which it is obviously not, it'd still be slower than a lot of alternatives.

    Seriously... people like you should just quit coding. Completely.

  • L. (unregistered) in reply to AP²
    AP²:
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    
    Captcha: sino (actually, Ruby is Japano)

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    

    50% of the lines.

    And yes, it short-circuits: the expression inside the any() is a generator which only processes each term as they're consumed by any.
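    That short-circuiting is easy to observe directly; a small Python sketch (the `checked` helper is hypothetical, added only to record which terms get evaluated):

```python
def checked(term, body, log):
    log.append(term)       # record which terms were actually tested
    return term in body

log = []
terms = ["ant", "cat", "dog", "emu"]
body = "there is a cat on the tree"
result = any(checked(t, body, log) for t in terms)
# any() stops pulling from the generator at the first True,
# so "dog" and "emu" are never evaluated.
print(result, log)  # True ['ant', 'cat']
```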

    AAAAnd you're fired too. Friggin' tourists posting code -- WTF.

  • L. (unregistered) in reply to Jay
    Jay:
    Oak:
    Steve The Cynic:
    The documented requirements:

    Write a function that returns true if at least one string in the /search terms/ array is found in /bigstring/, otherwise returns false.

    So, my version, written in pseudocode because I'm lazy:

    func containsAtLeastOneOf( array-of-strings searchTerms, string bigString ) returns truefalse
    
    for each searchTerm in searchTerms:
      if bigstring contains searchTerm:
        return true
    
    return false

    No regex, no stacks, no hash tables, nothing. Just enumerate the search terms and return true as soon as you find one that's in the big string. If you don't find one, return false.

    Jed is an idiot, as is Bob. So are most of you.

    If there are 80 search terms and the 77th appears inside bigstring, your code will run

    bigstring contains
    77 times. Each in O(n), and we assume n is very large here. I'm guessing
    contains
    itself is a character-by-character (short-circuiting) comparison. So we could say your solution runs in O(n (bigstring length) * m (number of search terms) * k (length of the search terms, assuming for simplicity they are all the same length)).

    A regex match, on the other hand, will only run on bigstring once. O(n). Not even O(n*k) since it will match characters against the regex automata as it goes. Granted, taking branches in the automata is more expensive than character comparison, but in general the regex solution does have a significant performance advantage here.

    Umm, wait ...

    We have to make assumptions about how the regex processor works, but let's suppose it does traverse the big string just once. It still has to compare a string beginning at each position in the big string to each of the search terms in turn, i.e. the number of comparisons is len(bigstring)*count(terms).

    The proposed "contains" search will pick each term and for each loop through the big string. Number of comparisons is count(terms)*len(bigstring).

    As multiplication is commutative, the number of comparisons is the same in either case. Well, assuming that none of the search terms are found and so you have to search all the way to the end.

    Well if we're going to make assumptions, It's safe to assume that if you're programming in a GOOD language (not VB), your regex processor has been written by people who know at least a good deal about it, and will thus be way more efficient than what you will come up with.

    Now stop being ridiculous, use the damn regex WHICH IS MUCH FASTER THAN LOOPED CONTAINS, and be done with it.

    Yes, many people get things wrong etc. but if you're not smart enough to see that contains sucks balls and regex is better unless implemented by someone like you (which is not the case luckily), stop thinking, stop discussing, and use the damn regex.

  • L. (unregistered) in reply to LinqMe
    LinqMe:
    In C#:
    static bool Contains(this string text, IEnumerable<string> words)
    {
       return words.Any(w => text.IndexOf(w, StringComparison.CurrentCultureIgnoreCase) >= 0);
    }
    or if the Split was really intended:
    static bool Contains1(this string text, IEnumerable<string> words)
    {
       return text.
          Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).
          SelectMany(t => words.Select(w => new { t, w })).
          Any(p => p.w.Equals(p.t, StringComparison.CurrentCultureIgnoreCase));
    }

    For large texts you could also throw an AsParallel in between to make it multithreaded.

    AAANd another loser. Seriously... how can you possibly post that crap after so many people have posted (somewhat) informed replies??

  • L. (unregistered) in reply to not the other thing
    not the other thing:
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    
    Captcha: sino (actually, Ruby is Japano)

    you're making it too complicated

    def doc_contains_at_least_one_term(terms,body)
      terms.any?{|term| body.match(term)}
    end
    

    And again... another topper-idiot...

  • L. (unregistered) in reply to NPSF3000
    NPSF3000:
    foo:
    AP²:
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    
    Captcha: sino (actually, Ruby is Japano)

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    

    50% of the lines.

    And yes, it short-circuits: the expression inside the any() is a generator which only processes each term as they're consumed by any.

    You know, if your languages allow you to write inefficient code in few lines, that's not exactly a reason to celebrate.

    But keep posting your ignorance. I'm sure it works in many other languages as well.

    Few lines of [clean, understandable] code IS efficient in developer time - usually one of the most important metrics there is.

    And morons like you would know... because?

    This code is not just clean and understandable, it's also an excellent example of retarded slow inefficient bullcrap - at least an order of magnitude slower.

    By writing it correctly once, you save much more developer time than by accepting a crappy version in the first place.

    The fact that devs who can write it correctly are few does imply some productivity limits, but on the other hand you won't spend 100x the dev time fixing bugs and other issues.

  • David (unregistered)

    Suddenly,

    a search term with a regular expression in it.

    "(x+x+)+y" anyone?

    http://www.codinghorror.com/blog/2006/01/regex-performance.html

  • L. (unregistered) in reply to Yale
    Yale:
    foo:
    You didn't get my point: This was a pissing contest about languages. If those languages allow for a short/readable (efficient in developer time) solution which is not efficient, that's not a good thing. (If you're bound to use this language, for whatever reason, this might still be the best way to go. But it's certainly no argument in favor of these languages.)

    A thing to be proud of is when a language allows short and readable code that is also efficient.

    What's inefficient about it? Or are you just a snob who still clings to the myth that "scripting languages are too slow"?

    You, sir, are an ignoramus.

    The very fact that you can ask such a question shows that you cling to the myth that you can "code".

    Please, do acknowledge your total inability to do such an activity and tell your brethren that they too should renounce coding forever.

  • L. (unregistered) in reply to Chris Judge
    Chris Judge:
    Jed and Bob's code is in VB.NET. .NET provides all kinds of terse syntax to handle this. For example (this one in c#, but the necessary extensions are available under VB.NET as well):

    public bool containsAtLeastOne (string source, string[] terms) { return terms.Any(t => source.Contains(t)); }

    Another one bites the dust? Seriously: don't code, don't post code, don't talk about code.

  • foo (unregistered) in reply to NPSF3000
    NPSF3000:
    I personally get pissed off when some douche bag spends a full day implementing some special x, y, z followed by days of testing and modification to get it working 'just right' - when I in 15 mins created a simple n^2 loop, threaded it and moved on with only 5 lines of code to ever worry about should there be problems down the track.
    What the hell are you talking about? I specifically said not to roll your own, but to use an existing library, which doesn't take more than 5 lines either. When I ran my tests (375536) I did just that and it took me like 5 minutes.
    NPSF3000:
    IMO you have to PROVE that the naive way is too slow with the given requirements before you even think about implementing a non-trivial alternative. I note you haven't.
    Many people have proved it (yes, the dreaded academia). Relying on their proofs is like code reuse. You don't need to roll your own every time. You just know that certain algorithms are O(n^2), and O(n^2) really gets slow unless you can prove that the data size remains bounded for the lifetime of the program (which is usually impossible to prove, and most such assumptions turn out wrong, time and again).

    Threading is no solution either, since the number of CPU cores will hardly grow as fast as typical data sizes grow in the foreseeable future.

  • (cs) in reply to PedanticCurmudgeon
    PedanticCurmudgeon:
    PiisAWheeL:
    Katy Gaga:
    Could we please show a little sensitivity here? I have a son who is lame and I can assure you it's no joking matter.

    Captcha: secundom = those dummies who try to post "frist" after someone else already has.

    That no laughing matter joke needs to take an arrow to the knee.

    This is at least partially my fault. I tried to kill the meme by making a ridiculous application of it. Fake Bob never went away, and the meme just multiplied. If you can think of anything that can kill it, you will have our undying gratitude.

    Please show some sensitivity. I had a son who made a ridiculous application once. It allowed hymns to be selected in real-time during a funeral service. And let me assure you: it was no laughing matter.

  • L. (unregistered) in reply to Yale
    Yale:
    foo:
    The inefficiency has been discussed at length in this thread, so why don't you just read it? Or are you a snob who thinks anytime someone mentions O-notation, that's academic BS with no relation to the real world?

    On my system, one million iterations of the above algorithm in a test case with three arguments where none of them matches takes 10.167 seconds, so roughly 10.167 microseconds per iteration. If the algorithm is called an average of 10 times per second (highly unlikely, since it's using a user-specified set of arguments), and you improve its performance by 50%, over the course of a year you'd save a grand total of 26.8 CPU-minutes. Divide that by the number of threads that can be processed concurrently to find your actual time savings.

    So yes, if you can theoretically improve the performance of an algorithm, but as a practical matter it produces at best a negligible time savings, then that's academic BS that has no relation to the real world.

    Except that today we work on virtualized everything, and those performance improvements are at least 10x, not just 2x - this means 10x higher server density or 10x longer-lasting batteries, so yes, it matters.

    note: the 10x factor was pulled out of my ass, and it's pretty much CONSERVATIVE compared to reality (I've often improved the speed of existing software by north of 100x, very rarely less than 2x) - also it does not take into account stupid slow languages like Ruby or node.js or other FOTM funky-looking nice-paradigm slow-as-shit stuff.

  • L. (unregistered) in reply to Yale
    Yale:
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index /#{terms.collect{ |x| Regexp.escape x }.join("|")}/
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

    1. This is Ruby, so your point does not stand. Ruby is far from optimized, fast, or even decent speed yet.
    2. A retarded test gives retarded results, always.
    3. I've found the standard library is REGEX, not some homemade crappy contains stunt. You are indeed wrong.
  • (cs) in reply to Grassfire
    Grassfire:
    Well for starters you could look at improving your aim. You'll never kill a joke by shooting it in the knee. A couple of arrows to the chest should do the job.

    ... and besides, if the joke is already lame, what good does it do rendering it lame?

  • (cs) in reply to foo
    foo:
    Yale:
    foo:
    Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, but most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)

    And my point is that you shouldn't make the knee-jerk assumption that a piece of clear, easy to understand code that uses a standard library function to accomplish exactly what you need must be slower than a "known efficient algorithm". You usually just end up reinventing the wheel, and poorly at that.

    The question is not whether it's a library function (I'm using one as well: regex library, I'm not reinventing the wheel), but whether it's efficient. And scanning the same string n times is not efficient, that's a well known fact, to anyone who doesn't close their eyes and thereby shift the burden to their successors and/or users who will have to deal with slow programs, perhaps a few years later when typical data sizes have increased another order of magnitude. -- No, you can't foresee every possible problem of the future, but you can guard against the known ones, like using O(n^2) when O(n) or O(n log n) is readily available.
    It's also worth bearing in mind that if PHB tells you to improve the performance of a particular algorithm, first check to make sure that this is actually the place where the performance bottleneck is located. Lots of manhours are wasted in shaving a few percent off algorithm A when in fact the program is really getting bogged down running algorithm B.

    And, like the man said, it's more than likely that a library function is by-and-large pretty damn efficient, and better brains than yours have spent considerable effort improving it over more time than you have. If you can demonstrate that a given popular library algorithm is really underperforming compared with your own implementation, then go to it - but I won't believe it till I see it.

  • Dilbertino (unregistered) in reply to Maurits
    Maurits:
    Larry:
    Dilbertino:
    A regex wtf, and nobody mentioned the perl oneliner yet? :)

    return grep { $searchText =~ $_ } @searchTerms;

    It continues to amaze me how verbose other languages are for the most basic operations...

    For those of you who are not blessed with a project written entirely in perl, check if your language supports a "system" call. Then go ahead and do your wimpy eye candy code in whatever language they make you use, and shell out to perl for the heavy lifting.

    This version returns the number of search terms, which means you keep looking after you've found one.

    for (@searchTerms) { return 1 if $searchText =~ $_ }
    return 0;
    

    This still runs into the "search terms are not regular expressions" problem.

    for (@searchTerms) { return 1 if index($searchText, $_) >= 0 } 
    return 0; 
    

    This still runs into the "you probably want a case insensitive match" problem. Text processing is hard.

    Not that hard: for (@searchTerms) { return 1 if $searchText =~ /\Q$_\E/i } (drop the i flag if you want case sensitive.)
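    The same escape-the-term-and-match-case-insensitively idiom translates to re.escape plus re.IGNORECASE in Python (a sketch, not the Perl original; the function name is illustrative):

```python
import re

def contains_any(search_text, terms, ignore_case=True):
    # re.escape plays the role of Perl's \Q...\E: terms are treated
    # as literal text, not as regex patterns.
    flags = re.IGNORECASE if ignore_case else 0
    return any(re.search(re.escape(t), search_text, flags) for t in terms)

print(contains_any("The Cat sat", ["cat"]))         # True
print(contains_any("The Cat sat", ["cat"], False))  # False
```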

  • (cs) in reply to QJo
    QJo:
    Grassfire:
    Well for starters you could look at improving your aim. You'll never kill a joke by shooting it in the knee. A couple of arrows to the chest should do the job.

    ... and besides, if the joke is already lame, what good does it do rendering it lame?

    I was using one tired overused joke to make a point about another tired lame overused joke... that one of them needs to retire, not be shot in the chest. In other words, the no-laughing-matter joke used to be funny, until it took an arrow to the knee. I assume you either A: don't play much Skyrim or B: have never heard of the arrow-to-the-knee meme. See Google.

  • TheJonB (unregistered) in reply to L.
    L.:
    Yale:
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index /#{terms.collect{ |x| Regexp.escape x }.join("|")}/
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

    1. This is Ruby, so your point does not stand. Ruby is far from optimized, fast, or even decent speed yet.
    2. A retarded test gives retarded results, always.
    3. I've found the standard library is REGEX, not some homemade crappy contains stunt. You are indeed wrong.

    You know L. one of the main problems with being a complete prick is that even if you're right no-one listens.

  • Kryptus (unregistered)

    Please stop enumerating the hundreds of possible much better solutions. They are just VB.NET developers, so praying for their souls is the best thing we can do.

  • LinqMe (unregistered) in reply to L.
    L.:
    AAANd another loser. Seriously... how can you possibly post that crap after so many people have posted (somewhat) informed replies??
    LOL! I posted it for the fun of it. Tell me why this is crap. It's short, readable (at least for me, but I admit I'm a Linq nerd) and does what the original code was trying to accomplish. I most likely wouldn't write it like this in production code, but even if I did, it would be by far better than the original code (including the "Regex fix").
  • cork soaker (unregistered)

    It seems our little post-whore L. loves to throw himself under every bridge he passes by. Even if they are only being tended by billy goats.

  • (cs)
       For i As Integer = 1 To paramTable.stackHeight() - 1 Step 1
         If searchTable.Contains(paramTable.Pop()) Then
           truthTable.Push(True)
         Else
           truthTable.Push(False)
         End If
       Next
    

    If truthTable.stackHeight() > 0 And truthTable.Contains(True) Then
      Return True
    Else
      Return False
    End If

    There's much wrong with this whole story (and the responses to it have made me chuckle to myself; Zunesis… you realize that nothing you propose is as bad as the real VHDL?) but the above section just makes me want to punch someone. It's so utterly moronic to do all that work once you've found enough information to know the answer. Only a (very stupid) computer programmer would keep looking for something after they've found it. Nobody else is that idiotic.

  • foo (unregistered) in reply to QJo
    QJo:
    foo:
    Yale:
    foo:
    Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, but most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)

    And my point is that you shouldn't make the knee-jerk assumption that a piece of clear, easy to understand code that uses a standard library function to accomplish exactly what you need must be slower than a "known efficient algorithm". You usually just end up reinventing the wheel, and poorly at that.

    The question is not whether it's a library function (I'm using one as well: regex library, I'm not reinventing the wheel), but whether it's efficient. And scanning the same string n times is not efficient, that's a well known fact, to anyone who doesn't close their eyes and thereby shift the burden to their successors and/or users who will have to deal with slow programs, perhaps a few years later when typical data sizes have increased another order of magnitude. -- No, you can't foresee every possible problem of the future, but you can guard against the known ones, like using O(n^2) when O(n) or O(n log n) is readily available.
    It's also worth bearing in mind that if PHB tells you to improve the performance of a particular algorithm, first check to make sure that this is actually the place where the performance bottleneck is located. Lots of manhours are wasted in shaving a few percent off algorithm A when in fact the program is really getting bogged down running algorithm B.

    And, like the man said, it's more than likely that a library function is by-and-large pretty damn efficient, and better brains than yours have spent considerable effort improving it over more time than you have. If you can demonstrate that a given popular library algorithm is really underperforming compared with your own implementation, then go to it - but I won't believe it till I see it.

    What is it? Is it intentionally-misunderstanding day? Again, and for the last time, I do not advocate rolling your own, but rather: when you have the choice between different applicable readily-available library algorithms (here: naive search and regex and probably more), and some of them are well known to be more efficient than others, it's smart to use the efficient one from the start, even if the difference may not show immediately. It's not more effort to use one vs. the others (as numerous demos in this thread have shown), and your code is future-proof (in this regard at least).

  • foo (unregistered) in reply to TheJonB
    TheJonB:
    L.:
    Yale:
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index /#{terms.collect{ |x| Regexp.escape x }.join("|")}/
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

    1. This is Ruby, so your point does not stand. Ruby is far from optimized or fast or even decent speed yet.
    2. A retarded test gives retarded results, always.
    3. I've found the standard library is REGEX and not some homemade crappy Contains stunt. You are indeed wrong.

    You know, L., one of the main problems with being a complete prick is that even if you're right, no-one listens.

    You say this as if it was a good thing. Must be terrible to be on a job where learning new things is actively discouraged. The best way to become the new dinosaurs 5-10 years from now.

  • daef (unregistered) in reply to Dilbertino

    might be why i like c#...

    bool ContainsAny(string[] needles, string search) {
       return needles.Any(p => search.Contains(p));
    }
    

    if i'd really have to use regex i'd use ex.IsMatch(...) instead of ex.Matches(...).Count>0 since it stops evaluating after finding the first match - no need to process the whole string at all :-)
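    daef's short-circuit point carries over directly to Python's `re` module; a sketch (illustrative names):

    ```python
    import re

    # re.search() returns as soon as it finds a match, while re.findall()
    # must scan the entire input to collect every match.
    pattern = re.compile("cat|dog")
    body = "there is a cat on the tree " + "x" * 1_000_000

    found = bool(pattern.search(body))   # can stop at the first match
    all_hits = pattern.findall(body)     # forced to scan the whole string

    print(found, len(all_hits))  # True 1
    ```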

  • daef (unregistered) in reply to daef

    or even as extension-method...

    public static bool ContainsAny(this string search, params string[] needles) {
       return needles.Any(p => search.Contains(p));
    }
    

    call would be like

    string s = "there is a cat on the tree";
    if(s.ContainsAny("cat", "dog")) {
       //do something  
    }
    
  • More from the CS department (unregistered)

    Oh, by the way, assuming n >> p, that hash/sort/binsearch solution winds up being O(n log n) anyway due to the sort step. How is that better than O(m + n) (i.e. linear time)? And all it'd take to make Aho-Corasick or Rabin-Karp match on word boundaries is to bracket each of the keywords being searched for with spaces. Case-insensitive searching can be implemented by case-normalizing the input in a preprocessing step (which also gives you a chance to do other things, such as tab expansion, Unicode normalization, ...). And you still haven't changed the big-O complexity of your algorithm with this; better yet, you don't have to modify the implementation of the algorithm itself, which allows you to use a known-good implementation (say, from a library, or just by invoking fgrep).

    CAPTCHA: feugiat
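    For the curious, the linear-time behaviour described above can be sketched in a few dozen lines. This is an illustrative Aho-Corasick toy, not a tuned library implementation:

    ```python
    from collections import deque

    def contains_any(keywords, text):
        """Illustrative Aho-Corasick sketch: True if any keyword occurs in text.

        Builds a trie over the keywords, wires up failure links with a BFS,
        then scans the text once, so the whole thing runs in time linear in
        len(text) plus the total length of the keywords.
        """
        children = [{}]   # child maps, one per trie node
        fail = [0]        # failure links
        output = [False]  # does some keyword end here (or at a suffix state)?

        # 1. Build the trie.
        for word in keywords:
            node = 0
            for ch in word:
                if ch not in children[node]:
                    children.append({})
                    fail.append(0)
                    output.append(False)
                    children[node][ch] = len(children) - 1
                node = children[node][ch]
            output[node] = True

        # 2. BFS to set failure links (the root's children fail to the root).
        queue = deque(children[0].values())
        while queue:
            node = queue.popleft()
            for ch, nxt in children[node].items():
                queue.append(nxt)
                f = fail[node]
                while f and ch not in children[f]:
                    f = fail[f]
                fail[nxt] = children[f].get(ch, 0)
                output[nxt] = output[nxt] or output[fail[nxt]]

        # 3. Scan the text once, following failure links on mismatches.
        state = 0
        for ch in text:
            while state and ch not in children[state]:
                state = fail[state]
            state = children[state].get(ch, 0)
            if output[state]:
                return True
        return False
    ```

    On the thread's example, `contains_any(["cat", "dog"], searchText)` walks the text once no matter how many keywords there are.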

  • geoffrey, MCP, PMP (unregistered) in reply to Kryptus
    Kryptus:
    Please stop enumerating the hundreds of possible much better solutions. They are just VB.NET developers, so praying for their souls is the best thing we can do.

    Ah, here come the VB insults. I'd be willing to bet that more business value has been delivered with VB than all other languages combined.

  • (cs) in reply to geoffrey, MCP, PMP
    geoffrey:
    Kryptus:
    Please stop enumerating the hundreds of possible much better solutions. They are just VB.NET developers, so praying for their souls is the best thing we can do.

    Ah, here come the VB insults. I'd be willing to bet that more business value has been delivered with VB than all other languages combined.

    I believe COBOL may trump it.

  • Gibbon1 (unregistered) in reply to jarfil
    jarfil:
    Whoa...

    For i As Integer = 0 To searchText.Split(" ").Length - 1 Step 1
        searchTable.Push(searchText.Split(" ")(i))
    Next

    Way to go! N*N complexity just for the sake of it, followed by M*N to top it off.

    No wonder they had a performance bottleneck, if everything else was written like that.

    Only comment I have is that it's possible that the compiler will optimize the complexity away. Maybe.

    Also might be that N is a small number. If that's the case, N*M may win.

    We have no idea what the previous solution was. Might be that it was even worse. Or a technically better algorithm but as I've found, sometimes the 100X boost you get using library routines in an interpreted language wins.

    Choosing regex over strstr is the sign of a programmer who has his priorities out of whack.

    Neither solution attempts to strip extraneous characters.

    The real WTF here is not forwarding a copy of the email to my boss with the comment,

    'You want me to waste any time on this guy? Because I sure don't.'

  • foo (unregistered) in reply to Matt Westwood
    Matt Westwood:
    geoffrey:
    Kryptus:
    Please stop enumerating the hundreds of possible much better solutions. They are just VB.NET developers, so praying for their souls is the best thing we can do.

    Ah, here come the VB insults. I'd be willing to bet that more business value has been delivered with VB than all other languages combined.

    I believe COBOL may trump it.

    Don't forget MUMPS.

  • Gunslinger (unregistered) in reply to AB
    AB:
    This really p!sses me off:

    If matches.Count > 0 Then
        Return True
    Else
        Return False
    End If

    It's so pointless. Why do sheeple do it? Can they not see the total redundancy of 5 lines of code? Do they really think it's "more readable"?

    Return matches.Count > 0

    Yes, it absolutely is more readable, and the compiler will make the end result the same, so what's your problem?

  • Brendan (unregistered) in reply to AP²
    AP²:
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any(term in body for term in terms)
    

    Pfft. Here's in C:

    int doc_contains_at_least_one_term(char *terms, char *body) { return some_function_someone_else_wrote(terms, body); }

    It's half as many lines as the python version.

  • anon (unregistered)

    If my team spent this much time pontificating over such a triviality I'd shoot them all

  • Tud (unregistered) in reply to Brendan
    Brendan:
    AP²:
    Yale:
    This kind of shit makes me glad I code in Ruby.
    def doc_contains_at_least_one_term( terms, body )
      terms.each{ |term| return true if body.match(term) }
      false
    end
    

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any(term in body for term in terms)
    

    Pfft. Here's in C:

    int doc_contains_at_least_one_term(char *terms, char *body) { return some_function_someone_else_wrote(terms, body); }

    It's half as many lines as the python version.

    You seem jealous that Python programmers don't have to spend most of their time rewriting functions that were probably invented in Charles Babbage's time.

    I personally think Python should have MOAR built-in functions, the more the merrier.

  • Brendan (unregistered) in reply to Tud
    Tud:
    Brendan:
    Pfft. Here's in C:
    int doc_contains_at_least_one_term(char *terms, char *body) { return some_function_someone_else_wrote(terms, body); }

    It's half as many lines as the python version.

    You seem jealous that Python programmers don't have to spend most of their time rewriting functions that were probably invented in Charles Babbage's time.

    Perhaps my point was too subtle.

    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    If you're going to suggest that a line of code ceases to be a line of code if someone else wrote it (e.g. it exists in a library or as a built-in), then I'm going to suggest that all problems (regardless of how hard) can be solved with 1 line of C (by finding someone to write it for you).

  • Yale (unregistered) in reply to Brendan
    Brendan:
    Perhaps my point was too subtle.

    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    If you're going to suggest that a line of code ceases to be a line of code if someone else wrote it (e.g. it exists in a library or as a built-in), then I'm going to suggest that all problems (regardless of how hard) can be solved with 1 line of C (by finding someone to write it for you).

    You sound like the kind of person who sits around pondering how you can optimize "Hello World" by eliminating the massive overhead in stdio.h.

    Besides, for your comparison to be really fair, you'd need to include the entire source code for the compiler, assembler, linker, and any included source files or linked libraries, not to mention the operating system and hardware drivers (and don't forget the shell you launched the compiler from!). After all, you didn't write any of those things—you're just taking advantage of other people's work in your "one line program".

  • Brendan (unregistered) in reply to Yale
    Yale:
    Brendan:
    Perhaps my point was too subtle.

    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    If you're going to suggest that a line of code ceases to be a line of code if someone else wrote it (e.g. it exists in a library or as a built-in), then I'm going to suggest that all problems (regardless of how hard) can be solved with 1 line of C (by finding someone to write it for you).

    You sound like the kind of person who sits around pondering how you can optimize "Hello World" by eliminating the massive overhead in stdio.h.

    Besides, for your comparison to be really fair, you'd need to include the entire source code for the compiler, assembler, linker, and any included source files or linked libraries, not to mention the operating system and hardware drivers (and don't forget the shell you launched the compiler from!). After all, you didn't write any of those things—you're just taking advantage of other people's work in your "one line program".

    I'm only pointing out the futility of comparing languages based on lines of code. You're helping to prove my point.

    You seem like the sort of person whose deductive reasoning has atrophied. Because it's not polite to poke fun at disadvantaged people, I won't suggest that this is caused by spending all your time gluing together other people's code (rather than actually writing code and needing to think).

  • sfs (unregistered) in reply to Pro Coder
    Pro Coder:
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n).

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Such bravado.

    You're completely neglecting prefactors, and have no knowledge of how big N is. If N is typically small, then your O(log N) solution with its (typically) large prefactor will be slower than a simple O(N) solution.

    If 99.9% of the searches are for small words in short list of words, then you've wasted valuable developer time for a negligible performance gain, or a possible performance loss.

    Get it working first. Optimize it later, when you know what the actual bottlenecks are. Critical thought and analysis is better done with real-world data.
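    sfs's "measure first" advice is easy to act on. Here is a hypothetical micro-benchmark sketch (absolute timings vary by machine and input, so no winner is claimed here):

    ```python
    import re
    import timeit

    # For tiny inputs, constant factors dominate, so the asymptotically
    # "worse" simple scan may well beat a fancier approach in practice.
    terms = ["cat", "dog", "bird"]
    body = "there is a cat on the tree"

    pattern = re.compile("|".join(re.escape(t) for t in terms))

    t_naive = timeit.timeit(lambda: any(t in body for t in terms), number=100_000)
    t_regex = timeit.timeit(lambda: bool(pattern.search(body)), number=100_000)

    # Print both and let the real numbers decide; don't guess from big-O alone.
    print(f"naive: {t_naive:.3f}s  regex: {t_regex:.3f}s")
    ```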

  • CS strikes again (unregistered) in reply to sfs
    sfs:
    Pro Coder:
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n).

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Such bravado.

    You're completely neglecting prefactors, and have no knowledge of how big N is. If N is typically small, then your O(log N) solution with its (typically) large prefactor will be slower than a simple O(N) solution.

    If 99.9% of the searches are for small words in short list of words, then you've wasted valuable developer time for a negligible performance gain, or a possible performance loss.

    Get it working first. Optimize it later, when you know what the actual bottlenecks are. Critical thought and analysis is better done with real-world data.

    Besides, his "O(log N)" solution is really an O(N log N) solution. (Blame the sort step.) So, even if this was a performance bottleneck in the production system, he's wasting his developer time trying to tinker with something he's pulled out of his hat while yielding a result that is still not properly optimized.

    TL;DR: He could have gotten an O(N) solution (yes, linear time for the entire problem) without jumping through all those hoops, considering that Aho-Corasick has been implemented quite a few times over already.

Leave a comment on “The Regex Code Review”
