• foo (unregistered) in reply to NPSF3000
    NPSF3000:
    Few lines of [clean, understandable] code IS efficient in developer time - usually one of the most important metrics there is.
    You didn't get my point: This was a pissing contest about languages. If those languages allow for a short/readable (efficient in developer time) solution which is not efficient, that's not a good thing. (If you're bound to use this language, for whatever reason, this might still be the best way to go. But it's certainly not an argument in favor of these languages.)

    A thing to be proud of is when a language allows short and readable code that is also efficient.

  • foo (unregistered) in reply to ekolis
    ekolis:
    Geez, both solutions are pretty crappy... for starters, they both continue checking for subsequent search terms, instead of returning true immediately upon finding one of them...
    You (and many others who wrote the same) are missing the forest for the trees.

    Returning after the first find will, on average, halve the effort when a term is found, and save nothing when none is. That's not bad, but it pales in comparison to reducing the complexity from O(n*m) to O(n) or so, i.e. traversing the big string just once, which usually means an FSM-based algorithm (Aho–Corasick or a regex).
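
    For illustration, here's a minimal Python sketch of that "traverse once" idea using one combined alternation (the function name is mine, and Python's re module is a backtracking engine rather than a true DFA, so it only approximates Aho–Corasick, but the body is scanned in a single pass):

    import re

    def doc_contains_at_least_one_term(terms, body):
        # Combine all terms into one alternation so the body is scanned once,
        # instead of once per term. re.escape keeps metacharacters literal.
        pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        return pattern.search(body) is not None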

  • Yale (unregistered) in reply to AP²
    AP²:
    Pfft. Here's in Python:
    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    
    50% of the lines.
    def doc_contains_at_least_one_term( terms, body ); terms.any?{ |term| body.match(term) } end
    

    One line, including the function definition. Granted, anyone who writes code like this should be drawn and quartered, but still... top THAT, Python.

    (Thanks to not the other thing who correctly pointed out that my previous code was bloated.)

  • Beldar the Phantom Replier (unregistered) in reply to Pro Coder
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Return False

    O(1), every time. To where do I send my CV?

  • Jed (unregistered)

    Dear Bob,

    I would request that you stop wasting my team's time by submitting buggy code that does not seem to meet our basic business requirements.

    Your "solution" failed on the very first 2 tests our team ran. I have attached the test cases for your enlightenment.

    Regards, Jed

    Test Cases:

            Dim fail As Boolean
            Dim terms1 As String() = {"foo", "bar"}
            Dim terms2 As String() = {"SEXY", "BARTENDER"}
            Dim text As String = "I met a sexy bartender at the football game."
            fail = DocContainsAtLeastOneTermNew(terms1, text) <> DocContainsAtLeastOneTerm(terms1, text)
            Debug.WriteIf(fail, "FAIL")
            fail = DocContainsAtLeastOneTermNew(terms2, text) <> DocContainsAtLeastOneTerm(terms2, text)
            Debug.WriteIf(fail, "FAIL")
  • Yale (unregistered) in reply to foo
    foo:
    You didn't get my point: This was a pissing contest about languages. If those languages allow for a short/readable (efficient in developer time) solution which is not efficient, that's not a good thing. (If you're bound to use this language, for whatever reason, this might still be the best way to go. But it's certainly not an argument in favor of these languages.)

    A thing to be proud of is when a language allows short and readable code that is also efficient.

    What's inefficient about it? Or are you just a snob who still clings to the myth that "scripting languages are too slow"?

  • PRMan (unregistered) in reply to NN
    NN:
    AB:
    This really p!sses me off:

    If matches.Count > 0 Then
        Return True
    Else
        Return False
    End If

    It's so pointless. Why do sheeple do it? Can they not see the total redundancy of 5 lines of code? Do they really think it's "more readable"?

    Return matches.Count > 0

    better:

    Return (New Regex(regexString)).Matches(searchText).Count > 0

    Regex.IsMatch is better, because it will return as soon as anything matches...

  • geoffrey, MCP, PMP (unregistered) in reply to Pro Coder
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    No, it just short circuits time. In your perfect world, you could spend all day writing your own boilerplate code that has already been written and tested in a framework. Business value be damned.

    Get off your high horse. Professional life is about getting the job DONE. Not hand-wringing over whether this piece of code is the right O().

  • AP² (unregistered) in reply to Yale
    Yale:
    AP²:
    Pfft. Here's in Python:
    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    
    50% of the lines.
    def doc_contains_at_least_one_term( terms, body ); terms.any?{ |term| body.match(term) } end
    

    One line, including the function definition. Granted, anyone who writes code like this should be drawn and quartered, but still... top THAT, Python.

    (Thanks to not the other thing who correctly pointed out that my previous code was bloated.)

    Uh, you can just remove the line break from my code and you get a one liner too. I was just too disgusted to do it.

    There's also:

    doc_contains_at_least_one_term = lambda terms, body: any((term in body for term in terms))
    
  • Zunesis, In the Flesh! (Your mom's!) (unregistered) in reply to geoffrey, MCP, PMP
    Those who live in glass houses...:
    In your perfect world, you could spend all day writing your own boilerplate code that has already been written and tested in a framework. Business value be damned.

    Get off your high horse. Professional life is about getting the job DONE. Not hand-wringing over whether this piece of code is the right O().

    Get off your high horse

    Get off your high horse

    Get off your high horse

    Get off your high horse

    Get off your high horse

    Bum Bum Bum!!!

  • nobody (unregistered)

    Most of the solutions proposed are silly because they just try to re-implement the function. Think about it: where is this list of words coming from? If it's the same list of words every time, why would anyone ever reconstruct a regex or hash table over and over? Just build the hash table once and make the function take a hash table (or Set in Java) instead of a list.

    public boolean docContainsAtLeastOneTerm(Set<String> searchTerms, String text){
        for(String word : text.split( "\\W+" )){
            if(searchTerms.contains( word )){
                return true;
            }
        }
        return false;
    }
    

    Fast, efficient, and easy to understand (for those scared of regexes: "\W+" matches one or more characters that are not A-Z, a-z, 0-9, or an underscore).

  • AP² (unregistered) in reply to foo
    foo:
    AP²:

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    

    50% of the lines.

    And yes, it short-circuits: the expression inside the any() is a generator which only processes each term as they're consumed by any.

    You know, if your languages allow you to write inefficient code in few lines, that's not exactly reason to celebrate.

    But keep posting your ignorance. I'm sure it works in many other languages as well.

    I was comparing the two languages, dolt. How do you propose I do that with different algorithmic implementations? Maybe you should go back to school and learn the definition of "context" before calling people ignorant.

  • Chris Judge (unregistered)

    Jed and Bob's code is in VB.NET. .NET provides all kinds of terse syntax to handle this. For example (this one in c#, but the necessary extensions are available under VB.NET as well):

    public bool containsAtLeastOne (string source, string[] terms) { return terms.Any(t => source.Contains(t)); }

  • AP² (unregistered) in reply to Yale
    Yale:

    What's inefficient about it? Or are you just a snob who still clings to the myth that "scripting languages are too slow"?

    I'm not foo, but the code I wrote (and what he was complaining about) is inefficient for multiple reasons: the first is that it has to run through the whole haystack for each term being searched (until one is found), when one could search for them all at the same time. Secondly, techniques such as hashing - used for example in the Rabin–Karp algorithm - can reduce each string comparison (which is O(n), n = len(string)) to a rolling hash update (very cheap) plus an integer comparison (a single instruction on most processors).

    In any case, I was striving for efficiency but for algorithm compatibility with the Ruby version that had been posted.

  • AP² (unregistered) in reply to AP²

    Above, it's s/was striving/wasn't striving/.

    Damn my sudden but inevitable lack of attention.

  • (cs) in reply to kit

    Yes, it's a WTF. That has nothing to do with the appropriateness of regular expressions, which are in fact ideally suited to the problem of finding whether any of a (perhaps large) set of strings occurs in a particular large string.

    It's nothing to do with regular expressions; it's to do with ignorance or slovenliness on the coder's part.

    The limp joke has about the same value as Bobby Tables and the same lesson to be learned. If you use a powerful tool, you need to learn how it should be used and how it can go wrong if misused. Pretending that regular expressions are somehow inherently dangerous is one of the more depressing recurring themes on this site.

  • foo (unregistered) in reply to AP²
    AP²:
    foo:
    AP²:

    Pfft. Here's in Python:

    def doc_contains_at_least_one_term( terms, body ):
        return any((term in body for term in terms))
    

    50% of the lines.

    And yes, it short-circuits: the expression inside the any() is a generator which only processes each term as they're consumed by any.

    You know, if your languages allow you to write inefficient code in few lines, that's not exactly reason to celebrate.

    But keep posting your ignorance. I'm sure it works in many other languages as well.

    I was comparing the two languages, dolt. How do you propose I do that with different algorithmic implementations? Maybe you should go back to school and learn the definition of "context" before calling people ignorant.

    Comparing languages based on how few lines they require to implement the same bad algorithm is as useful as comparing cars based on how well they can run over dogs.

    A better response would have been: "In $LANGUAGE I can do $GOOD_ALGORITHM in 2 lines." Feel free to substitute matching values; that way we might actually learn something useful.

  • big picture thinker (unregistered)

    Why don't they just use a grep library? It is very efficient at multiple string searches: http://www.tgries.de/agrep/

  • (cs) in reply to Tasty
    Tasty:
    Not all languages have true booleans. The full if...then syntax guarantees a particular return value. This is not a sheeple thought process.

    C evaluates a < b to either 0 or 1.

    C’mon, even C has had real booleans for 13 years now; they’re even compatible with the comparison operators in using 1 for true.

    But that’s not actually important here—the language in the OP, VB, has real booleans.

    Furthermore, a device driver might not like 0x0001. It may need 0xffff as its true input.
    Still, there are more elegant ways than if-statements for converting your host language’s notion of a boolean to an external representation (two if-statements, if you also need to convert the other way). Unfortunately, most languages don’t support them. Sob.
    type Bool_for_your_device is new Boolean;
    for Bool_for_your_device use (False => 0, True => 16#ffff#);
    Device_register: Bool_for_your_device;
    for Device_register'Size use 16;
    for Device_register'Address use To_Address(16#abcde#);
    …
    Device_register := Bool_for_your_device(Count(matches) > 0);
  • foo (unregistered) in reply to Yale
    Yale:
    foo:
    You didn't get my point: This was a pissing contest about languages. If those languages allow for a short/readable (efficient in developer time) solution which is not efficient, that's not a good thing. (If you're bound to use this language, for whatever reason, this might still be the best way to go. But it's certainly not an argument in favor of these languages.)

    A thing to be proud of is when a language allows short and readable code that is also efficient.

    What's inefficient about it? Or are you just a snob who still clings to the myth that "scripting languages are too slow"?

    Not at all. If the interpreter of scripting languages is a serious bottleneck in this problem, you're doing something wrong, anyway.

    The inefficiency has been discussed at length in this thread, so why don't you just read it? Or are you a snob who thinks anytime someone mentions O-notation, that's academic BS with no relation to the real world?

  • big picture thinker (unregistered) in reply to Pro Coder
    Pro Coder:
    If you didn't immediately think that this should be done with a hash, a sort, and a binary search, kill yourself now and save us all future aggravation.

    Not sure why the sort would be necessary, but you might be able to do something like this:

    First hash all the Needles. Then OR (bitwise operator) all the hash values together into a "composite needle".

    Then use a string-searching algorithm like Rabin-Karp that utilizes a "rolling hash" (a very efficient, recursive non-cryptographic hash function) to progress through the "Haystack", each time AND'ing (bitwise operator) the OR'd "composite needle".

    If the AND results in true, that should tell you if there is a match to any one of the needles in O(n) time, without actually iterating each needle in another for loop O(n^2).

    Because checking OR'd values would probably result in collisions (false positives), you'd have to track the progression through the Haystack with a counter integer and then do an actual byte-for-byte check of that particular location where a positive match might have been found.

    Just a disclaimer, I'm just throwing down thoughts. I haven't even tried to implement that in code, and I have no desire to do so. So fair warning: It's just a thought and it might not work.
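
    For what it's worth, here's a rough Python sketch of the closely related textbook approach: hash a fixed-length prefix of every needle into a lookup table, roll one hash over the haystack, and verify candidates byte-for-byte to weed out collisions. The names, base, and modulus are arbitrary choices of mine; treat it as a sketch of the idea, not a vetted implementation.

    def contains_any_rabin_karp(needles, haystack, base=256, mod=(1 << 61) - 1):
        # Multi-pattern Rabin-Karp sketch: hash the first w characters of each
        # needle (w = length of the shortest needle), slide a w-wide rolling hash
        # over the haystack, and confirm any hash hit with a real comparison.
        if not needles:
            return False
        w = min(len(n) for n in needles)
        if w == 0:
            return True                      # an empty needle matches anything
        if w > len(haystack):
            return False
        def h(s):
            v = 0
            for ch in s:
                v = (v * base + ord(ch)) % mod
            return v
        candidates = {}                      # prefix hash -> needles sharing it
        for n in needles:
            candidates.setdefault(h(n[:w]), []).append(n)
        top = pow(base, w - 1, mod)          # weight of the character leaving the window
        cur = h(haystack[:w])
        for i in range(len(haystack) - w + 1):
            for n in candidates.get(cur, ()):
                if haystack.startswith(n, i):   # rule out hash collisions
                    return True
            if i + w < len(haystack):           # roll the window one character
                cur = ((cur - ord(haystack[i]) * top) * base + ord(haystack[i + w])) % mod
        return False

    The byte-for-byte check at the end is what handles the false positives you mention.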

  • Yale (unregistered) in reply to AP²
    AP²:
    Uh, you can just remove the line break from my code and you get a one liner too. I was just too disgusted to do it.

    There's also:

    doc_contains_at_least_one_term = lambda terms, body: any((term in body for term in terms))
    

    Removing any unnecessary characters/whitespace:

    Ruby:
    def f(t,b)t.any?{|x|b.match x}end
    33 characters
    

    Python:
    x=lambda t,b:any(x in b for x in t)
    35 characters

    I think it's obvious to even the slowest of Pythonistas (and that's truly saying something) which is the clearly superior language.

    You can even take that to the bean counters: "If we use Python on this project instead of Ruby, our keyboards will wear out at least 6% faster."

  • Tom (unregistered)

    Return Regex.IsMatch(searchText, "\b(?:" + String.Join("|", searchTerms.Select(Function(x) Regex.Escape(x))) + ")\b")

    One line in VB. O(n). Respects word boundaries. Escapes regex characters.

  • big picture thinker (unregistered) in reply to Pro Coder
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Most average-case sort algorithms are, by themselves, O(n * log(n)), + a rolling hash function is usually O(n) or more, + multiplication itself is non-linear. How deep do you want to go?

  • foo (unregistered) in reply to Yale
    Yale:
    AP²:
    Uh, you can just remove the line break from my code and you get a one liner too. I was just too disgusted to do it.

    There's also:

    doc_contains_at_least_one_term = lambda terms, body: any((term in body for term in terms))
    

    Removing any unnecessary characters/whitespace:

    Ruby:
    def f(t,b)t.any?{|x|b.match x}end
    33 characters
    

    Python:
    x=lambda t,b:any(x in b for x in t)
    35 characters

    I think it's obvious to even the slowest of Pythonistas (and that's truly saying something) which is the clearly superior language.

    You can even take that to the bean counters: "If we use Python on this project instead of Ruby, our keyboards will wear out at least 6% faster."

    Where can I buy your keyboard with those (, ), ?, {, } or | keys?

    For a more relevant economic comparison, I suggest comparing the 1-bits in the strings, since everyone knows how expensive those are.

  • Yale (unregistered) in reply to foo
    foo:
    The inefficiency has been discussed at length in this thread, so why don't you just read it? Or are you a snob who thinks anytime someone mentions O-notation, that's academic BS with no relation to the real world?

    On my system, one million iterations of the above algorithm in a test case with three arguments where none of them matches takes 10.167 seconds, so roughly 10.167 microseconds per iteration. If the algorithm is called an average of 10 times per second (highly unlikely, since it's using a user-specified set of arguments), and you improve its performance by 50%, over the course of a year you'd save a grand total of 26.8 CPU-minutes. Divide that by the number of threads that can be processed concurrently to find your actual time savings.

    So yes, if you can theoretically improve the performance of an algorithm, but as a practical matter it produces at best a negligible time savings, then that's academic BS that has no relation with the real world.
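
    A quick back-of-envelope check of that figure (assuming a 365-day year):

    # Sketch of the arithmetic above: 10.167 us/call, 10 calls/s, 50% improvement.
    per_call = 10.167 / 1_000_000                 # seconds per call
    calls_per_year = 10 * 60 * 60 * 24 * 365      # 10 calls/second for a year
    print(0.5 * per_call * calls_per_year / 60)   # roughly 27 CPU-minutes saved per year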

  • Jed (unregistered) in reply to Tom
    Tom:
    Return Regex.IsMatch(searchText, "\b(?:" + String.Join("|", searchTerms.Select(Function(x) Regex.Escape(x))) + ")\b")

    One line in VB. O(n). Respects word boundaries. Escapes regex characters.

    Just add RegexOptions.IgnoreCase at the end of that, and I think we have a winner.

  • Yale (unregistered) in reply to foo
    foo:
    Where can I buy your keyboard with those (, ), ?, {, } or | keys?

    For a more relevant economic comparison, I suggest comparing the 1-bits in the strings, since everyone knows how expensive those are.

    I had one imported from Tibet, where monks high atop Mount Kailash carve each individual key out of mammoth ivory, and set them all into an enclosure forged from Damascus steel. I think the electronics come from China, though. I'd be happy to sell you mine for the bargain basement price of $600!

    Besides, presses of the shift key don't count. There are two of them, not to mention caps lock, so you're still covered in the event of a failure.

  • Mikey (unregistered)

    TRWTF is that both versions suck.

  • foo (unregistered) in reply to Yale
    Yale:
    foo:
    The inefficiency has been discussed at length in this thread, so why don't you just read it? Or are you a snob who thinks anytime someone mentions O-notation, that's academic BS with no relation to the real world?

    On my system, one million iterations of the above algorithm in a test case with three arguments where none of them matches takes 10.167 seconds, so roughly 10.167 microseconds per iteration. If the algorithm is called an average of 10 times per second (highly unlikely, since it's using a user-specified set of arguments), and you improve its performance by 50%, over the course of a year you'd save a grand total of 26.8 CPU-minutes. Divide that by the number of threads that can be processed concurrently to find your actual time savings.

    So yes, if you can theoretically improve the performance of an algorithm, but as a practical matter it produces at best a negligible time savings, then that's academic BS that has no relation with the real world.

    Thanks for proving my point because that's exactly what I see happening all too often:

    Someone does a single test with a non-representative data set (3 entries, not really much, and the size of the string wasn't even specified) and, when it runs fast, concludes performance doesn't matter. Maybe so, maybe not. From a single test you can't tell. O-notation can help you judge how the result from a single test scales to bigger amounts of data.

    Also, the relevant factor is often not the total amount of CPU time but latency. And note that the original article does mention a performance bottleneck.

    I'm not suggesting you write an academic paper about every trivial problem you encounter, just that you use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.). And now please don't complain that the code is too complex or something. If you're working in this language and encounter this kind of problem, you should be able to write such code (or read it, if written by your predecessor), or you've got the wrong job.

  • Yale (unregistered) in reply to foo
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index(/#{terms.collect{ |x| Regexp.escape x }.join("|")}/)
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

  • Methinks (unregistered) in reply to Mikey
    Mikey:
    TRWTF is that both versions suck.

    Methinks that was done on purpose. A double WTF. Replacing bad code that works (well, almost), with "better" faster code that...doesn't work. Happens all the time.

    No, I would say TRWTF is all the proposed solutions (Jed's included) that do a substring match instead of a word match. Finding a word in a text and finding a substring are 2 different problems. The requirement was to find a word. All the substring solutions are a FAIL.

  • foo (unregistered) in reply to Yale
    Yale:
    foo:
    I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right etc.).

    I just implemented that method in Ruby as follows:

    body.index(/#{terms.collect{ |x| Regexp.escape x }.join("|")}/)
    That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.

    I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.

    Yes, efficient code (in terms of O()) can be slower for small data where it doesn't matter, but it gets relatively faster (i.e., not so much slower) than the naive code when the data size grows.

    Again, that's exactly the point: coders (such as you, apparently) who don't understand efficiency do irrelevant/misleading tests and come to the wrong conclusions. (Additionally, they usually stubbornly refuse to accept results from those who have studied the issues more intensively, because that's "academic BS".)

    But if you only believe what you see yourself, repeat your test with 10* the number of search items and 10* the data size, then with 100* both. The naive version will take roughly 100* and 10000* as long, the regex one (or any other efficient one) somewhat more than 10* and 100*.
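
    Something like this (Python, untuned; the sizes, the random lowercase data, and the never-matching terms are my own assumptions) is enough to see the trend for yourself:

    import random, re, string, time

    def naive(terms, body):
        return any(t in body for t in terms)       # one scan of body per term

    def scaling_test(n_terms, body_len, reps=100):
        rng = random.Random(1)
        # terms end in an uppercase letter so they never occur in the lowercase body
        terms = ["".join(rng.choices(string.ascii_lowercase, k=4)) + "Q" for _ in range(n_terms)]
        body = "".join(rng.choices(string.ascii_lowercase, k=body_len))
        pattern = re.compile("|".join(map(re.escape, terms)))   # one combined scan
        for name, call in (("naive", lambda: naive(terms, body)),
                           ("regex", lambda: pattern.search(body) is not None)):
            start = time.perf_counter()
            for _ in range(reps):
                call()
            print(f"{name} terms={n_terms} len={body_len}: {time.perf_counter() - start:.4f}s")

    for scale in (1, 10, 100):
        scaling_test(3 * scale, 1000 * scale)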

    I'm getting somewhat upset about this because I see this happen all too often in real applications: things slow down much more than necessary when the data size grows, because the programmers were apparently too lazy to use the right algorithm (because "it won't happen", yeah, everyone knows how those "no one will ever need more than ..." assumptions play out).

    The worst offender at the moment is my stupid TV recorder. Merely opening the list of recordings takes more than 20 seconds with ~200 entries. To be honest, I really don't know how they managed to achieve that; even the most stupid way of doing things I can imagine wouldn't take nearly as long. But apparently they only ever tested it with a few entries (where the delay is hardly noticeable, given the device's general sluggishness) and never bothered to either test with larger data or consider how their code scales based on known facts about efficiency.

    </rant>
  • Meep (unregistered) in reply to Philipp
    Philipp:
    characters which have a special meaning

    Metacharacters is the word you're looking for.

  • Yale (unregistered) in reply to foo
    foo:
    But if you only believe what you see yourself, repeat your test with 10* the number of search items and 10* the data size, then with 100* both. The naive version will take roughly 100* and 10000* as long, the regex one (or any other efficient one) somewhat more than 10* and 100*.

    With 100 times the search term list size and data size, the naive algorithm takes 5.166 milliseconds, and the regexp based one takes 25.440 milliseconds (based on 1000 iterations of each). So the regexp one is a whopping 500% slower.

    Do you just like being wrong or something?

  • Meep (unregistered) in reply to Hoffmann
    Hoffmann:
    minuses for Jed: - is looking case sensitive

    Don't know about VB, but most regex engines can toggle case sensitivity, with a simple flag. And the engine is usually locale sensitive.

    - regEx is expensive

    A regex engine can do a lot of optimization for you and match complex patterns in near linear time. It's incredibly easy using repeated string manipulation to run in polynomial time. And usually the repeated string manipulation solutions only barely solve the problem.

    - not stable for regEx-search-tearms

    I take it you mean he didn't escape the metacharacters. That's true; his implementation is horribly broken.

  • (cs) in reply to Larry
    Larry:
    Dilbertino:
    A regex wtf, and nobody mentioned the perl oneliner yet? :)

    return grep { $searchText =~ $_ } @searchTerms;

    It continues to amaze me how verbose other languages are for the most basic operations...

    For those of you who are not blessed with a project written entirely in perl, check if your language supports a "system" call. Then go ahead and do your wimpy eye candy code in whatever language they make you use, and shell out to perl for the heavy lifting.

    This version returns the number of search terms, which means you keep looking after you've found one.

    for (@searchTerms) { return 1 if $searchText =~ $_ }
    return 0;
    

    This still runs into the "search terms are not regular expressions" problem.

    for (@searchTerms) { return 1 if index($searchText, $_) >= 0 }
    return 0;
    

    This still runs into the "you probably want a case insensitive match" problem.

    Text processing is hard.

  • Grassfire (unregistered) in reply to PedanticCurmudgeon

    Well for starters you could look at improving your aim. You'll never kill a joke by shooting it in the knee. A couple of arrows to the chest should do the job.

  • Grassfire (unregistered) in reply to PedanticCurmudgeon
    PedanticCurmudgeon:
    PiisAWheeL:
    Katy Gaga:
    Could we please show a little sensitivity here? I have a son who is lame and I can assure you it's no joking matter.

    Captcha: secundom = those dummies who try to post "frist" after someone else already has.

    That no laughing matter joke needs to take an arrow to the knee.

    This is at least partially my fault. I tried to kill the meme by making a ridiculous application of it. Fake Bob never went away, and the meme just multiplied. If you can think of anything that can kill it, you will have our undying gratitude.

    Well for starters you need to improve your aim. Try a couple of arrows to the chest instead of the knee.

    (Yes I know I double posted, reply, quote apparently not the same thing)

  • foo (unregistered) in reply to Yale
    Yale:
    foo:
    But if you only believe what you see yourself, repeat your test with 10* the number of search items and 10* the data size, then with 100* both. The naive version will take roughly 100* and 10000* as long, the regex one (or any other efficient one) somewhat more than 10* and 100*.

    With 100 times the search term list size and data size, the naive algorithm takes 5.166 milliseconds, and the regexp based one takes 25.440 milliseconds (based on 1000 iterations of each). So the regexp one is a whopping 500% slower.

    Do you just like being wrong or something?

    Well, this may be the effect of early termination on matches. Got bitten by this myself when I just ran a few tests (C++ with regex):

     #SI  length    T1    T2   T3
       3    1000    10    10    4
      30    1000    32    32    5
     300    1000    63    63   14
    3000    1000   190   190  130
       3  100000   750   750  200
      30  100000  3500  3400  200
     300  100000  5000  5000  210
    3000  100000  2600  6300  320
    

    T1 is the time (in microseconds) with random text and search items. Note that the time actually goes down in the last row, because the chance of an early match increases.

    T2 is with search items that never match, containing one upper-case letter in the middle, rest lower-case, text all lower-case. (But be careful: T3 is with the search text all lower-case and search-items all upper-case. Here the engine can recognize non-matches very quickly by checking impossible characters.)

    T2 shows clearly how the time is roughly linear with text length, but increases much less with the number of search terms.

    The naive algorithm grows linearly in both, so at some point it will be slower. I actually suspect that point is below the sizes you tested before, if you exclude early matches (but since you didn't state the size or content of your test data, I can only guess).

    Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, and most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)

  • Yale (unregistered) in reply to foo
    foo:
    Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, and most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)

    And my point is that you shouldn't make the knee-jerk assumption that a piece of clear, easy to understand code that uses a standard library function to accomplish exactly what you need must be slower than a "known efficient algorithm". You usually just end up reinventing the wheel, and poorly at that.

  • foo (unregistered) in reply to Yale
    Yale:
    foo:
    Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, and most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)

    And my point is that you shouldn't make the knee-jerk assumption that a piece of clear, easy to understand code that uses a standard library function to accomplish exactly what you need must be slower than a "known efficient algorithm". You usually just end up reinventing the wheel, and poorly at that.

    The question is not whether it's a library function (I'm using one as well: a regex library, so I'm not reinventing the wheel), but whether it's efficient. And scanning the same string n times is not efficient; that's a well-known fact to anyone who doesn't close their eyes and thereby shift the burden to their successors and/or users, who will have to deal with slow programs, perhaps a few years later when typical data sizes have increased another order of magnitude. -- No, you can't foresee every possible problem of the future, but you can guard against the known ones, like using O(n^2) when O(n) or O(n log n) is readily available.

  • AP² (unregistered) in reply to foo
    foo:
    Comparing languages based on how few lines they require to implement the same bad algorithm is as useful as comparing cars based on how well they can run over dogs.

    A better response would have been: "In $LANGUAGE I can do $GOOD_ALGORITHM in 2 lines." Feel free to substitute matching values; that way we might actually learn something useful.

    I never claimed it was useful. If I wanted to share useful code or insight on languages I'd be on Lambda the Ultimate or StackOverflow, not TDWTF. It was a simple pissing contest. Now can you please fuck off? Thank you!

  • (cs) in reply to qbolec
    qbolec:
    Ryan O'Hara (minitech):
    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(Function(x) document.Contains(x))
    End Function
    How about eta-reduction?
    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(document.Contains)
    End Function
    

    I'm not sure if it would work in VB, nor does Contains take word boundaries into account...

    Oops, even better. In VB.NET, all it needs is an AddressOf:

    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(AddressOf document.Contains)
    End Function
  • foo (unregistered) in reply to AP²
    AP²:
    I never claimed it was useful. If I wanted to share useful code or insight on languages I'd be on Lambda the Ultimate or StackOverflow, not TDWTF. It was a simple pissing contest. Now can you please fuck off? Thank you!
    Had a bad day? Are you Bob?
  • Methinks (unregistered) in reply to minitech
    minitech:
    qbolec:
    Ryan O'Hara (minitech):
    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(Function(x) document.Contains(x))
    End Function
    How about eta-reduction?
    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(document.Contains)
    End Function
    

    I'm not sure if it would work in VB, nor does Contains take word boundaries into account...

    Oops, even better. In VB.NET, all it needs is an AddressOf:

    Public Function ContainsAny(ByVal searchTerms() As String, ByVal document As String) As Boolean
        Return searchTerms.Any(AddressOf document.Contains)
    End Function

    What do any of these solutions have to do with the original problem?

    They all do case-sensitive substring searches on a string.

    The original problem required a case-insensitive search of space-delimited words in a string.

    Why is so much brain power going into solving the wrong problem here???

    WTF?
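
    For reference, a minimal Python sketch of a case-insensitive whole-word match (the \w+ tokenization is an assumption about what counts as a word; adjust to taste):

    import re

    def doc_contains_at_least_one_word(terms, body):
        # Lowercase both sides and compare whole words, not substrings.
        wanted = {t.lower() for t in terms}
        return any(word.lower() in wanted for word in re.findall(r"\w+", body))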

  • geoffrey, MCP, PMP (unregistered) in reply to big picture thinker
    big picture thinker:
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Most average-case sort algorithms are, by themselves, O(n * log(n)), + a rolling hash function is usually O(n) or more, + multiplication itself is non-linear. How deep do you want to go?

    TRWTF is that while you were embroiled in a pedantic academic discussion, your competitor just passed you with a simple VB application.

  • foo (unregistered) in reply to geoffrey, MCP, PMP
    geoffrey:
    big picture thinker:
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Most average-case sort algorithms are, by themselves, O(n * log(n)), + a rolling hash function is usually O(n) or more, + multiplication itself is non-linear. How deep do you want to go?

    TRWTF is that while you were embroiled in a pedantic academic discussion, your competitor just passed you with a simple VB application.

    No discussion. Can't you read?

  • fred (unregistered) in reply to Pro Coder
    Pro Coder:
    Melnorme:
    Why would you need a sort and binary search? Just put all the words in the text into a hash set and do a contains (or whatever the VB equivalent is) for each word.
    Because Contains() is O(n) and a binary search is O(log n). If you work for me, I demand mandatory suicide from all team members who can't find a solution that is at least as good as O(log n). \m/

    This illustrates the point as to why using built-in library functions is considered harmful. It short-circuits critical thought and analysis.

    Do you understand how hashes work? It wouldn't be an O(n) operation...

    That said, as others have pointed out, we might be looking for matches on more than individual words....

  • shebang (unregistered) in reply to Oak
    Oak:
    Steve The Cynic:
    The documented requirements:

    Write a function that returns true if at least one string in the /search terms/ array is found in /bigstring/, otherwise returns false.

    So, my version, written in pseudocode because I'm lazy:

    func containsAtLeastOneOf( array-of-strings searchTerms, string bigString ) returns truefalse
    
    for each searchTerm in searchTerms:
      if bigstring contains searchTerm:
        return true
    
    return false

    No regex, no stacks, no hash tables, nothing. Just enumerate the search terms and return true as soon as you find one that's in the big string. If you don't find one, return false.

    Jed is an idiot, as is Bob. So are most of you.

    If there are 80 search terms and the 77th appears inside bigstring, your code will run "bigstring contains searchTerm" 77 times, each in O(n), and we assume n is very large here. I'm guessing "contains" itself is a character-by-character (short-circuiting) comparison. So we could say your solution runs in O(n * m * k), where n is the bigstring length, m the number of search terms, and k the length of the search terms (assuming for simplicity they are all the same length).

    A regex match, on the other hand, will only run over bigstring once. O(n). Not even O(n*k), since it matches characters against the regex automaton as it goes. Granted, taking branches in the automaton is more expensive than a character comparison, but in general the regex solution does have a significant performance advantage here.

    Well... firstly, contains is a bit cleverer than that, so we don't actually need to do so many comparisons. The length is definitely irrelevant, because at an absolute maximum we would have to check (almost) every character once for each search term (but google "searching for substrings", just because it might interest you; it will explain why I say 'almost' above).

    The regex itself might only run once, but the number of comparisons is still similar. In both cases, the worst case occurs when there is no match, and in those circumstances both approaches need to check almost every character in the search string against almost every character in the big string (m*n).
    If a term is found, where it appears in both the search data and the big data affects the efficiency of either approach (with regex being worse if the match is early in the search data but late in the long data).

    Think of it this way: without regex, we iterate through the long data (n) once per search term (m times), whereas with regex we iterate through the search data (m) once per position in the big data (n times).

    Both are n*m; they just do it in a different order... (assuming, of course, an efficient implementation of Contains - but that's a reasonably safe bet here).
