• Yale (unregistered) in reply to Brendan
    Brendan:
    I'm only pointing out the futility of comparing languages based on lines of code. You're helping to prove my point.

    If you only need to write an average of one line of code in language A to accomplish the same task as ten lines of code in language B, that is an advantage for language A. It's irrelevant how many actual lines of code are behind that one line, because the programmer doesn't have to write those lines of code to complete the task at hand.

    Brendan:
    You seem like the sort of person whose deductive reasoning has atrophied. Because it's not polite to poke fun at disadvantaged people, I won't suggest that this is caused by spending all your time gluing together other people's code (rather than actually writing code and needing to think).

    Sp: "Reinventing the wheel". Talk about chronic NIH syndrome...

  • Brendan (unregistered) in reply to Yale
    Yale:
    Brendan:
    I'm only pointing out the futility of comparing languages based on lines of code. You're helping to prove my point.

    If you only need to write an average of one line of code in language A to accomplish the same task as ten lines of code in language B, that is an advantage for language A. It's irrelevant how many actual lines of code are behind that one line, because the programmer doesn't have to write those lines of code to complete the task at hand.

    So you're saying that C is better, because (if someone has already written the code you need) you only need 1 line to use it rather than 2?

    Brendan:
    You seem like the sort of person whose deductive reasoning has atrophied. Because it's not polite to poke fun at disadvantaged people, I won't suggest that this is caused by spending all your time gluing together other people's code (rather than actually writing code and needing to think).

    Sp: "Reinventing the wheel". Talk about chronic NIH syndrome...

    It's not fair to blame me for reinventing Perl and calling it Python.

  • Tud (unregistered) in reply to Brendan
    Brendan:
    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    But the whole point is that those lines don't count. They are already in your computer before you even start coding. They are included in the language, so you don't have to write them. Otherwise it would only make sense to measure it in number of transistors used.

    If you told me there is a good, easy-to-use C library that searches for words in a list of strings, then you might have a good point for C.
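    For reference, the Python version being debated presumably looks something like the sketch below (variable names are illustrative, not from the article): any() short-circuits as soon as one search term occurs as a substring of the text, which is the "one line" the whole thread is arguing about.

```python
# Minimal sketch of the Python one-liner under discussion. Names like
# contains_any / search_terms are hypothetical, not from the article.
def contains_any(text, search_terms):
    # any() stops at the first term found as a substring of text.
    return any(term in text for term in search_terms)

print(contains_any("the quick brown fox", ["lazy", "brown"]))  # True
```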

  • L. (unregistered) in reply to More from the CS department
    More from the CS department:
    Oh, by the way, assuming n >> p, that hash/sort/binsearch solution winds up being O(n log n) anyway due to the sort step. How is that better than O(m + n) (i.e. linear time)? Also, all it would take to make Aho-Corasick or Rabin-Karp match on word boundaries is to bracket each of the keywords being searched for with spaces. Case-insensitive searching can be implemented by case-normalizing the input in a preprocessing step (which also gives you a chance to do other things, such as tab expansion, Unicode normalization, ...). And that still doesn't change the big-O complexity of the algorithm; nor do you have to modify the implementation of the algorithm itself, which lets you use a known-good implementation (say from a library, or just by invoking fgrep).

    CAPTCHA: feugiat

    Not just spaces: you need start and end of string as well. And considering we speak a language (not true for everyone) where correct sentences end with a dot, you're in for a whole lot of trouble if you think word separation is easy. At the very least you will need a character class including all punctuation except the dash, plus the space (only the normal space), with start of string on one side and end of string on the other.

    Oversimplifying makes everything look like an easy one-liner that doesn't need the proper tools...

    But, seeing how this is set against the hash/sort/binsearch approach, the same arguments will have to be taken into account for the hash when splitting the string.
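    The word-boundary concern above can be sketched without hand-rolling a punctuation character class, by leaning on the regex engine's \b anchor and escaping each keyword (a sketch only; function and keyword names are hypothetical, and \b uses the regex engine's notion of a word character rather than the custom punctuation class described above):

```python
import re

# Build one case-insensitive pattern that matches any keyword only as a
# whole word. re.escape guards against metacharacters in the keywords.
def word_matcher(keywords):
    escaped = (re.escape(k) for k in keywords)
    return re.compile(r"\b(?:%s)\b" % "|".join(escaped), re.IGNORECASE)

matcher = word_matcher(["cat", "dog"])
print(bool(matcher.search("The CAT sat.")))      # True: adjacent punctuation is fine
print(bool(matcher.search("concatenate dogs")))  # False: only whole words match
```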

  • L. (unregistered) in reply to Yale
    Yale:
    Brendan:
    Perhaps my point was too subtle.

    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    If you're going to suggest that a line of code ceases to be a line of code if someone else wrote it (e.g. it exists in a library or as a built-in), then I'm going to suggest that all problems (regardless of how hard) can be solved with 1 line of C (by finding someone to write it for you).

    You sound like the kind of person who sits around pondering how you can optimize "Hello World" by eliminating the massive overhead in stdio.h.

    Besides, for your comparison to be really fair, you'd need to include the entire source code for the compiler, assembler, linker, and any included source files or linked libraries, not to mention the operating system and hardware drivers (and don't forget the shell you launched the compiler from!). After all, you didn't write any of those things—you're just taking advantage of other people's work in your "one line program".

    Well... Yale, you're a good troll, I'll give you that.

    But to make the comparison really, really fair, we should include all the platform-related code that enables that C to be processed on virtually any CPU, and then the few versions of Python that run your code, plus all the platform-related code that enables running it on a few CPUs on a few OSes.

    Tell you what: I'm pretty sure you'll end up with fewer LOC on the Python side, since it runs on virtually nothing compared to C.

    However, the main fact remains: C is faster, the actual code behind the work IS much shorter, ANYTHING you implement in C will not be slower because you're not using the base library function, and it can be as smart as you want it to be because there are no limits.

    Still, good trolling, Yale; it's impressively borderline.

  • L. (unregistered) in reply to Yale
    Yale:
    Brendan:
    I'm only pointing out the futility of comparing languages based on lines of code. You're helping to prove my point.

    If you only need to write an average of one line of code in language A to accomplish the same task as ten lines of code in language B, that is an advantage for language A. It's irrelevant how many actual lines of code are behind that one line, because the programmer doesn't have to write those lines of code to complete the task at hand.

    Brendan:
    You seem like the sort of person whose deductive reasoning has atrophied. Because it's not polite to poke fun at disadvantaged people, I won't suggest that this is caused by spending all your time gluing together other people's code (rather than actually writing code and needing to think).

    Sp: "Reinventing the wheel". Talk about chronic NIH syndrome...

    Good trolling again, but everyone knows MLOC is not the answer.

    No decent programmer spends most of his time typing, instead favoring thinking and reducing the required coding by orders of magnitude in the process.

    Reinventing the wheel sounds funny, but if the wheel was as fucked up as most of the code available from the work of others, I'm pretty sure cars would still be under 1 MPG.

    Java gluers may be all the rage today, but they're still not coders.

    Gluing together half a dozen botched libraries to make bloatware does not count as programming, sorry.

  • L. (unregistered) in reply to Tud
    Tud:
    Brendan:
    I'll show you all of the code that the C version uses, including "some_function_someone_else_wrote()" and any other functions that end up (directly or indirectly) called by it; if you show me all of the code that the Python version uses, including the code for "any()" and any other functions that end up (directly or indirectly) called by it.

    Then we'd have a fair comparison. I'd probably be posting about 50 lines of C, and you'd probably be posting several thousand lines of C (and a little Python).

    But the whole point is that those lines don't count. They are already in your computer before you even start coding. They are included in the language, so you don't have to write them. Otherwise it would only make sense to measure it in number of transistors used.

    If you told me there is a good, easy-to-use C library that searches for words in a list of strings, then you might have a good point for C.

    Yes, it's called Perl.

  • someone (unregistered) in reply to Dilbertino
    Dilbertino:
    A regex wtf, and nobody mentioned the perl oneliner yet? :)

    return grep { $searchText =~ $_ } @searchTerms;

    It continues to amaze me how verbose other languages are for the most basic operations...

    Your oneliner seems to have at least the following issues, most of which have already been mentioned in the context of the OP:

    • missing \Q (your version dies if @searchTerms contains invalid regexen such as '*')
    • behaves differently in list and scalar context
    • using a regex in the first place (now you have two problems...)

    Why not just use index?

    return grep(index($searchText, $_) >= 0, @searchTerms) ? 1 : 0;

    Obviously, replace grep with List::MoreUtils::firstidx (and check for >= 0) for efficiency if you're expecting more than a handful of search terms, since it stops at the first match.
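    For comparison with the Python side of the thread, the index()-based check above translates roughly as follows (a sketch; names are illustrative). str.find returns -1 when the substring is absent, mirroring Perl's index(), and a short-circuiting any() plays the role of the first-match optimization:

```python
# List-returning form, like Perl's grep in list context.
def matches(search_text, search_terms):
    return [t for t in search_terms if search_text.find(t) >= 0]

# Boolean form that stops at the first hit, like firstidx.
def matches_any(search_text, search_terms):
    return any(search_text.find(t) >= 0 for t in search_terms)

print(matches("foo bar baz", ["bar", "qux"]))  # ['bar']
```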

  • Dilbertino (unregistered) in reply to someone
    someone:
    Dilbertino:
    A regex wtf, and nobody mentioned the perl oneliner yet? :)

    return grep { $searchText =~ $_ } @searchTerms;

    It continues to amaze me how verbose other languages are for the most basic operations...

    Your oneliner seems to have at least the following issues, most of which have already been mentioned in the context of the OP:

    • missing \Q (your version dies if @searchTerms contains invalid regexen such as '*')
    Has been addressed, see comment page 4...
    someone:
    * behaves differently in list and scalar context
    That's the idea! : )
    someone:
    * using a regex in the first place (now you have two problems...)
    Tired cliche that only applies if you don't understand regexes very well... has also been addressed in previous comments : )
  • More from the CS department (unregistered) in reply to L.
    L.:
    More from the CS department:
    Oh, by the way, assuming n >> p, that hash/sort/binsearch solution winds up being O(n log n) anyway due to the sort step. How is that better than O(m + n) (i.e. linear time)? Also, all it would take to make Aho-Corasick or Rabin-Karp match on word boundaries is to bracket each of the keywords being searched for with spaces. Case-insensitive searching can be implemented by case-normalizing the input in a preprocessing step (which also gives you a chance to do other things, such as tab expansion, Unicode normalization, ...). And that still doesn't change the big-O complexity of the algorithm; nor do you have to modify the implementation of the algorithm itself, which lets you use a known-good implementation (say from a library, or just by invoking fgrep).

    CAPTCHA: feugiat

    Not just spaces: you need start and end of string as well. And considering we speak a language (not true for everyone) where correct sentences end with a dot, you're in for a whole lot of trouble if you think word separation is easy. At the very least you will need a character class including all punctuation except the dash, plus the space (only the normal space), with start of string on one side and end of string on the other.

    Oversimplifying makes everything look like an easy one-liner that doesn't need the proper tools...

    But, seeing how this is set against the hash/sort/binsearch approach, the same arguments will have to be taken into account for the hash when splitting the string.

    Good point, and one that'd sway it slightly towards a regex-based solution actually, as it'd be easier to take things like this into account (although I wouldn't be surprised if it was possible to do a DFA-merging equivalent of Aho-Corasick, which would make this whole debate moot).

    CAPTCHA: uxor...U XOR ME?

  • (cs) in reply to Pro Coder

    Mandatory Suicide, massacre on the front line. Dun Dun Dun Duh!

  • (cs) in reply to More from the CS department
    More from the CS department:
    Good point, and one that'd sway it slightly towards a regex-based solution actually, as it'd be easier to take things like this into account (although I wouldn't be surprised if it was possible to do a DFA-merging equivalent of Aho-Corasick, which would make this whole debate moot).
    More important is the consideration of whether you can cache the build of search terms (whether this is to a DFA or something else) so that you don't pay that cost every time you search. It's not clear from the original article whether this is possible, but if it is — and that's pretty commonly true so it's at least worth considering — then you can consider building extremely sophisticated matchers because their construction cost will be amortized over many searches. Similarly, if you're searching against a large corpus of largely static text then constructing some kind of index against it makes a lot of sense (this is how internet search engines work on the query processing side).

    If both are static, the whole article's a WTF from start to finish. Every last word. If neither is, testing on real data and real search terms is the only sane approach. (Complex matchers can slow things down by causing problems with data locality; this is counter-intuitive by comparison with normal algorithmic analysis.)
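    The amortization point above can be sketched as follows (names are hypothetical; this uses a plain compiled regex as the "sophisticated matcher" stand-in): the construction cost is paid once, then every subsequent query reuses the prebuilt matcher.

```python
import re

# Sketch of caching matcher construction so its cost is amortized
# over many searches. KeywordMatcher is an illustrative name.
class KeywordMatcher:
    def __init__(self, keywords):
        # Construction cost, paid once up front.
        pattern = "|".join(map(re.escape, keywords))
        self._regex = re.compile(pattern)

    def search(self, text):
        # Per-query cost only; the compiled pattern is reused.
        return self._regex.search(text) is not None

m = KeywordMatcher(["alpha", "beta"])
print(m.search("the beta build"))  # True
print(m.search("no match here"))   # False
```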

  • tundog (unregistered) in reply to jes
    jes:
    If you really think using regular expressions invariably creates problems, then you are ignorant, incompetent, or both.

    Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

  • Anone (unregistered) in reply to tundog

    Some people, when confronted with regular expressions, think "I know, I'll use that quote". Now they have two head wounds.

  • the loose couple (unregistered)

    TruthTable... that's sexy. We have one in our basement. It's all shiny, with hinges and straps and everything.

  • lollan (unregistered)

    Jed is brave; I wouldn't even have bothered replying to the guy who sent an email to everyone on such a matter.

    Good luck Jed !

  • (cs) in reply to Gunslinger

    It's only more readable if you're an idiot.

Leave a comment on “The Regex Code Review”
