Admin
A thing to be proud of is when a language allows short, readable code that is also efficient.
Admin
Cutting out after the first find will, on average, halve the effort when something is found, and change nothing when nothing is found. That's not bad, but it pales in comparison to reducing the complexity from O(n*m) to O(n) or so, i.e. traversing the big string just once, which usually means an FSM-based algorithm (Aho–Corasick or a regex).
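As a rough sketch of that single-pass idea in Python (the function name here is mine, purely for illustration):

    import re

    def contains_any(haystack, needles):
        # Build one alternation pattern so the text is scanned once for all
        # needles, instead of once per needle.
        pattern = "|".join(re.escape(n) for n in needles)
        return re.search(pattern, haystack) is not None

Caveat: a backtracking engine like CPython's re may still try each alternative at each position, so for a true single pass you'd want a DFA-based engine (e.g. RE2) or Aho–Corasick proper.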
Admin
One line, including the function definition. Granted, anyone who writes code like this should be drawn and quartered, but still... top THAT, Python.
(Thanks to not the other thing who correctly pointed out that my previous code was bloated.)
Admin
O(1), every time. To where do I send my CV?
Admin
Dear Bob,
I would request that you stop wasting my team's time by submitting buggy code that does not seem to meet our basic business requirements.
Your "solution" failed on the very first 2 tests our team ran. I have attached the test cases for your enlightenment.
Regards, Jed
Test Cases:
Admin
What's inefficient about it? Or are you just a snob who still clings to the myth that "scripting languages are too slow"?
Admin
Regex.IsMatch is better, because it will return as soon as anything matches...
Admin
No, it just short circuits time. In your perfect world, you could spend all day writing your own boilerplate code that has already been written and tested in a framework. Business value be damned.
Get off your high horse. Professional life is about getting the job DONE. Not hand-wringing over whether this piece of code is the right O().
Admin
Uh, you can just remove the line break from my code and you get a one liner too. I was just too disgusted to do it.
There's also:
Admin
Most of the solutions proposed are silly because they're just trying to re-implement the function. Think about it: where is this list of words coming from? If it's the same list of words every time, why would anyone reconstruct a regex or hash table over and over? Just build the hash table once and make the function take a hash table (or a Set in Java) instead of a list.
Fast, efficient, and easy to understand (for those scared of regexes: "\W+" matches anything that's not a letter, a digit, or an underscore).
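A minimal Python sketch of that precomputed-set approach (these names are made up for illustration; the set is built once and reused):

    import re

    def build_word_set(words):
        # Do this once for the fixed word list.
        return {w.lower() for w in words}

    def contains_any_word(text, word_set):
        # Split the text on non-word characters ("\W+"); each token is then
        # a single O(1) set lookup.
        return any(token.lower() in word_set
                   for token in re.split(r"\W+", text) if token)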
Admin
I was comparing the two languages, dolt. How do you propose I do that with different algorithmic implementations? Maybe you should go back to school and learn the definition of "context" before calling people ignorant.
Admin
Jed and Bob's code is in VB.NET. .NET provides all kinds of terse syntax to handle this. For example (this one is in C#, but the necessary extensions are available in VB.NET as well):
public bool containsAtLeastOne (string source, string[] terms) { return terms.Any(t => source.Contains(t)); }
Admin
I'm not foo, but the code I wrote (and what he was complaining about) is inefficient for multiple reasons. The first is that it has to run through the whole haystack for each term being searched (until one is found), when one could search for them all at the same time. Secondly, techniques such as hashing, used for example in the Rabin–Karp algorithm, can reduce each string comparison (which is O(n), n = len(string)) to a rolling hash (very cheap) plus an integer comparison (a single instruction on most processors).
In any case, I was striving for efficiency but for algorithm compatibility with the Ruby version that had been posted.
Admin
Above, it's s/was striving/wasn't striving/.
Damn my sudden but inevitable lack of attention.
Admin
Yes, it's a WTF. That has nothing to do with the appropriateness of regular expressions, which are in fact ideally suited to the problem of finding whether any of a (perhaps large) set of strings occurs in a particular large string.
It's nothing to do with regular expressions; it's to do with ignorance or slovenliness on the coder's part.
The limp joke has about the same value as Bobby Tables, and the same lesson to be learned: if you use a powerful tool, you need to learn how it should be used and how it can go wrong if misused. Pretending that regular expressions are somehow inherently dangerous is one of the more depressing recurring themes on this site.
Admin
A better response would have been: "In $LANGUAGE I can do $GOOD_ALGORITHM in 2 lines." Feel free to substitute matching values; that way we might actually learn something useful.
Admin
Why don't they just use a grep library? It is very efficient at multiple string searches: http://www.tgries.de/agrep/
Admin
But that’s not actually important here—the language in the OP, VB, has real booleans.
Still, there are more elegant ways than if-statements to convert your host language's notion of a datatype to an external representation (two if-statements, if you also need to convert the other way). Unfortunately most languages don't support one. Sob.
Admin
The inefficiency has been discussed at length in this thread, so why don't you just read it? Or are you a snob who thinks anytime someone mentions O-notation, that's academic BS with no relation to the real world?
Admin
Not sure why the sort would be necessary, but you might be able to do something like this:
First hash all the needles. Then OR (bitwise) all the hash values together into a "composite needle".
Then use a string-searching algorithm like Rabin-Karp that uses a "rolling hash" (a very efficient, recursive, non-cryptographic hash function) to progress through the haystack, each time AND'ing (bitwise) the current window's hash against the OR'd "composite needle".
If the AND comes out non-zero, that should tell you there may be a match to one of the needles in O(n) time, without iterating over each needle in another for loop (O(n^2)).
Because checking OR'd values would probably produce collisions (false positives), you'd have to track the progression through the haystack with a counter and then do an actual byte-for-byte check of the particular location where a positive match might have been found.
Just a disclaimer: I'm throwing down thoughts here. I haven't even tried to implement this in code, and I have no desire to do so. So fair warning: it's just a thought and it might not work.
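For what it's worth, here's a rough Python sketch of a multi-needle rolling-hash (Rabin-Karp style) scan. It uses a set of needle hashes instead of the OR'd "composite needle" described above, and it verifies hits byte-for-byte as suggested; all names and constants are made up, and it's only a sketch of the idea:

    def poly_hash(s, base, mod):
        # Plain polynomial hash of a whole string.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    def rabin_karp_any(haystack, needles, base=256, mod=1_000_003):
        # Group needles by length, since a rolling hash works over fixed-size windows.
        by_len = {}
        for n in needles:
            if n and len(n) <= len(haystack):
                by_len.setdefault(len(n), set()).add(n)
        for length, group in by_len.items():
            targets = {poly_hash(n, base, mod) for n in group}
            high = pow(base, length - 1, mod)  # weight of the outgoing character
            h = poly_hash(haystack[:length], base, mod)
            for i in range(len(haystack) - length + 1):
                # On a hash hit, confirm byte-for-byte to rule out collisions.
                if h in targets and haystack[i:i + length] in group:
                    return True
                if i + length < len(haystack):
                    # Roll the window: drop haystack[i], add haystack[i + length].
                    h = ((h - ord(haystack[i]) * high) * base
                         + ord(haystack[i + length])) % mod
        return False

E.g. rabin_karp_any("the quick brown fox", ["brown", "dog"]) returns True. Whether this beats the regex or plain Contains approaches in practice depends entirely on the sizes involved.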
Admin
Removing any unnecessary characters/whitespace:
I think it's obvious to even the slowest of Pythonistas (and that's truly saying something) which is the clearly superior language.
You can even take that to the bean counters: "If we use Python on this project instead of Ruby, our keyboards will wear out at least 6% faster."
Admin
Return Regex.IsMatch(searchText, "\b(?:" + String.Join("|", searchTerms.Select(Function(x) Regex.Escape(x))) + ")\b")
One line in VB. O(n). Respects word boundaries. Escapes regex characters.
Admin
Most average-case sort algorithms are, by themselves, O(n log n); a rolling hash function is usually O(n) or more; and multiplication itself is non-linear. How deep do you want to go?
Admin
For a more relevant economic comparison, I suggest comparing the 1-bits in the strings, since everyone knows how expensive those are.
Admin
On my system, one million iterations of the above algorithm in a test case with three arguments where none of them matches takes 10.167 seconds, so roughly 10.167 microseconds per iteration. If the algorithm is called an average of 10 times per second (highly unlikely, since it's using a user-specified set of arguments), and you improve its performance by 50%, over the course of a year you'd save a grand total of 26.8 CPU-minutes. Divide that by the number of threads that can be processed concurrently to find your actual time savings.
So yes, if you can theoretically improve the performance of an algorithm, but as a practical matter it produces at best a negligible time savings, then that's academic BS with no relation to the real world.
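A quick back-of-the-envelope check of those figures in Python (the 10 calls per second and the 50% improvement are the assumptions stated above):

    per_call = 10.167 / 1_000_000             # ~10.167 microseconds per call
    calls_per_year = 10 * 60 * 60 * 24 * 365  # 10 calls/second, all year long
    saved = calls_per_year * per_call * 0.5   # a 50% speed-up
    print(saved / 60)                         # roughly 26.8 CPU-minutes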
Admin
Just add RegexOptions.IgnoreCase at the end of that, and I think we have a winner.
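For comparison, a rough Python rendering of that same approach (escaped terms, word boundaries, case-insensitive flag; the function name is made up):

    import re

    def contains_any_term(search_text, search_terms):
        # One alternation of escaped terms, anchored on word boundaries.
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in search_terms) + r")\b"
        return re.search(pattern, search_text, re.IGNORECASE) is not None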
Admin
I had one imported from Tibet, where monks high atop Mount Kailash carve each individual key out of mammoth ivory, and set them all into an enclosure forged from Damascus steel. I think the electronics come from China, though. I'd be happy to sell you mine for the bargain basement price of $600!
Besides, presses of the shift key don't count. There are two of them, not to mention caps lock, so you're still covered in the event of a failure.
Admin
TRWTF is that both versions suck.
Admin
Someone does a single test with a non-representative data set (3 entries, which is not much, and the size of the string wasn't even specified), and when it runs fast they conclude performance doesn't matter. Maybe so, maybe not. From a single test you can't tell. O-notation helps you judge how the result of a single test scales to larger amounts of data.
Also, the total amount of CPU time is often not the relevant factor; latency is. And note that the original article does mention a performance bottleneck.
I'm not suggesting you write an academic paper about every trivial problem you encounter, but just use known efficient algorithms. If they can be implemented with (much) less code, all the better. E.g. Tom's code in 375519 (I don't know VB, so I just assume the syntax is right, etc.). And now please don't complain that the code is too complex or something. If you're working in this language and encounter this kind of problem, you should be able to write such code (or read it, if written by your predecessor), or you've got the wrong job.
Admin
I just implemented that method in Ruby as follows:
That may not have precisely the same effect, but it's close enough to draw a comparison. It took 18.270 seconds to run one million iterations. So your "efficient" algorithm is about 80% SLOWER.
I've found it's generally a mistake to think you can out-do the standard library in terms of performance for basic stuff like this. Not only does your code become a lot messier very quickly, but you're usually wrong.
Admin
Methinks that was done on purpose. A double WTF. Replacing bad code that works (well, almost), with "better" faster code that...doesn't work. Happens all the time.
No, I would say TRWTF is all the proposed solutions (Jed's included) that do a substring match instead of a word match. Finding a word in a text and finding a substring are two different problems. The requirement was to find a word. All the substring solutions are a FAIL.
Admin
Again, that's exactly the point: coders (such as you, apparently) who don't understand efficiency run irrelevant or misleading tests and come to the wrong conclusions. (Additionally, they usually stubbornly refuse to accept results from those who have studied the issues more closely, because that's "academic BS".)
But if you only believe what you see yourself, repeat your test with 10x the number of search items and 10x the data size, then with 100x both. The native version will take roughly 100x and 10000x as long; the regex one (or any other efficient one) somewhat more than 10x and 100x.
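If it helps, here is one possible shape for that 10x/100x experiment as a Python harness (all the names and data sizes are invented, and whether the regex side actually pulls ahead depends heavily on the engine, so treat it as a template rather than a prediction):

    import random
    import re
    import string
    import timeit

    def make_data(scale):
        # 'scale' multiplies both the number of search terms and the text length.
        rng = random.Random(scale)
        terms = ["".join(rng.choices(string.ascii_lowercase, k=8))
                 for _ in range(3 * scale)]
        text = "".join(rng.choices(string.ascii_lowercase + " ", k=1000 * scale))
        return text, terms

    def naive(text, terms):
        # One full scan of the text per term.
        return any(t in text for t in terms)

    def regex_based(text, terms):
        # One combined alternation over all terms.
        return re.search("|".join(re.escape(t) for t in terms), text) is not None

    for scale in (1, 10, 100):
        text, terms = make_data(scale)
        for fn in (naive, regex_based):
            elapsed = timeit.timeit(lambda: fn(text, terms), number=100)
            print(f"scale={scale:3d}  {fn.__name__:12s}  {elapsed:.3f}s")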
I'm getting somewhat upset about this because I see this happen all too often in real applications: things that slow down much more than necessary when the data size grows, because the programmers were apparently too lazy to use the right algorithm (because "it won't happen", yeah, everyone knows how those "no one will ever need more than ..." assumptions play out).
The worst offender at the moment is my stupid TV recorder. Merely opening the list of recordings takes more than 20 seconds with ~200 entries. To be honest, I really don't know how they managed to achieve that; even the most stupid way of doing things I can imagine wouldn't take nearly as long. But apparently they only ever tested it with a few entries (where the delay is hardly noticeable, given the device's general sluggishness) and never bothered to either test with larger data or consider how their code scales based on known facts about efficiency.
</rant>
Admin
Metacharacters is the word you're looking for.
Admin
With 100 times the search term list size and data size, the native algorithm takes 5.166 milliseconds, and the regexp based one takes 25.440 milliseconds (based on 1000 iterations of each). So the regexp one is a whopping 500% slower.
Do you just like being wrong or something?
Admin
Don't know about VB, but most regex engines can toggle case sensitivity, with a simple flag. And the engine is usually locale sensitive.
A regex engine can do a lot of optimization for you and match complex patterns in near linear time. It's incredibly easy using repeated string manipulation to run in polynomial time. And usually the repeated string manipulation solutions only barely solve the problem.
I take it you mean he didn't escape the metacharacters. That's true; his implementation is horribly broken.
Admin
This version returns the number of search terms, which means you keep looking after you've found one.
This still runs into the "search terms are not regular expressions" problem.
This still runs into the "you probably want a case insensitive match" problem.
Text processing is hard.
Admin
Well for starters you could look at improving your aim. You'll never kill a joke by shooting it in the knee. A couple of arrows to the chest should do the job.
Admin
Well for starters you need to improve your aim. Try a couple of arrows to the chest instead of the knee.
(Yes, I know I double-posted; reply and quote are apparently not the same thing.)
Admin
T1 is the time (in microseconds) with random text and search items. Note that the time actually goes down in the last row, because the chance of an early match increases.
T2 is with search items that never match, containing one upper-case letter in the middle, rest lower-case, text all lower-case. (But be careful: T3 is with the search text all lower-case and search-items all upper-case. Here the engine can recognize non-matches very quickly by checking impossible characters.)
T2 shows clearly how the time is roughly linear with text length, but increases much less with the number of search terms.
The naive algorithm grows linearly in both, so at some point it will be slower. I actually suspect that point is below what you tested before, if you exclude early matches (but since you didn't state the size or content of your test data, I can only guess).
Again: I'm not suggesting you run such extensive tests every time you encounter such a problem. My point is rather to show how even well-intended tests can be misleading (as was my T1). So rather than concluding performance doesn't matter, err on the safe side and use well-known efficient algorithms (when available, but most of the time they are). And again, my real-world experience has shown me far too many cases where performance really did matter and could have been better if the programmers had just used known efficient algorithms. (Don't be clever and invent your own -- most of the time it's not worth the effort, and even more often it goes horribly wrong, i.e. the result is either wrong or less efficient or both.)
Admin
And my point is that you shouldn't make the knee-jerk assumption that a piece of clear, easy to understand code that uses a standard library function to accomplish exactly what you need must be slower than a "known efficient algorithm". You usually just end up reinventing the wheel, and poorly at that.
Admin
I never claimed it was useful. If I wanted to share useful code or insight on languages I'd be on Lambda the Ultimate or StackOverflow, not TDWTF. It was a simple pissing contest. Now can you please fuck off? Thank you!
Admin
Oops, even better. In VB.NET, all it needs is an AddressOf:
Admin
What do any of these solutions have to do with the original problem?
They all do case sensitive substring searches on a string.
The original problem required a case-insensitive search of space delimited words in a string.
Why is so much brain power going into solving the wrong problem here???
WTF?
Admin
TRWTF is that while you were embroiled in a pedantic academic discussion, your competitor just passed you with a simple VB application.
Admin
That said, as others have pointed out, we might be looking for matches on more than individual words....
Admin
The regex itself might only run once, but the number of comparisons is still similar. In both cases, the worst case occurs when there is no match, and in those circumstances both approaches need to check almost every character in the search string against almost every character in the big string (m*n).
If a term is found, where it appears in both the search data and the big data affects the efficiency of either approach (with regex being worse if the match is early in the search data but late in the long data).
Think of it this way: without regex, we iterate through the long data (n) once per search term (m), whereas with regex we iterate through the search data (m) once per position in the big data (n).
Both are n*m; they just do the work in a different order... (assuming, of course, an efficient implementation of Contains, but that's a reasonably safe bet here).
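To make that loop-order point concrete, here is a Python-flavoured sketch of the two worst cases when nothing matches (purely illustrative helpers, not anyone's actual code):

    def naive_contains_any(haystack, needles):
        # Outer loop over needles, inner scan over the haystack:
        # m scans, each of length roughly n.
        return any(needle in haystack for needle in needles)

    def alternation_contains_any(haystack, needles):
        # Outer walk over the haystack, inner attempt of each needle at that
        # position: n positions, up to m candidates tried at each. This is what
        # a backtracking alternation effectively does without further optimisation.
        return any(
            haystack.startswith(needle, i)
            for i in range(len(haystack))
            for needle in needles
        )

Either way, the no-match case does on the order of m*n character comparisons; an FSM-based matcher (Aho-Corasick or a DFA regex engine) is what actually changes that bound.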