Comment On Internet.toLowerCase

Parsing HTML is no walk in the park. With the possibility of unclosed tags and mismatched quotation marks on any given page, it’s a veritable minefield of horrible hypertext. However, there are dozens of reliable libraries that a developer could use to do the heavy lifting.

Re: Internet.toLowerCase

2013-02-18 06:09 • by Steve (unregistered)
"Fr1st".toLowerCase()

Internet.toLowerCase

2013-02-18 06:16 • by Troll (unregistered)
Y U NO ESCAPE CHARS IN CODE? :P

Re: Internet.toLowerCase

2013-02-18 06:25 • by Dat validation (unregistered)
Dat validation

Re: Internet.toLowerCase

2013-02-18 06:29 • by Hmmmm
401433 in reply to 401431
Troll:
Y U NO ESCAPE CHARS IN CODE? :P

Perhaps this was an ironic meta-WTF by the author given the second sentence, though I doubt it...

Re: Internet.toLowerCase

2013-02-18 06:34 • by GettinSadda
For added Lulz, this post only makes sense if you read the HTML source!

Re: Internet.toLowerCase

2013-02-18 06:36 • by biziclop
At least there will be no debate about what TRWTF is.

Re: Internet.toLowerCase

2013-02-18 06:39 • by Anonymous (unregistered)
ESCAPE FROM THIS MADNESS

Re: Internet.toLowerCase

2013-02-18 06:51 • by F***-it Fred (unregistered)
Time to go write some really WTFy code containing malicious Javascript and wait for it to be published.

Re: Internet.toLowerCase

2013-02-18 07:19 • by foo (unregistered)
401439 in reply to 401438
<div class="CommentBodyFeatured">
F***-it Fred:
Time to go write some really WTFy code containing malicious Javascript and wait for it to be published.
</div>

Re: Internet.toLowerCase

2013-02-18 07:21 • by phynol
if(pageHTML.toLowerCase().regionMatches(index, " tag } 


What C family language supports arbitrarily closing a paren with a brace? Or are we just making WTFs up as we go along?

Re: Internet.toLowerCase

2013-02-18 07:27 • by Preakness (unregistered)
But the PRE tag shuts off the HTML interpreter, doesn't it?

Hint: They're trying to train us to view source on every article.

Re: Internet.toLowerCase

2013-02-18 07:28 • by Remy Porter
401443 in reply to 401440
Someone forgot to escape their "<". It's been fixed, and for good measure, run through a syntax highlighter so that you can see the WTFness in the code IN COLOR.

Re: Internet.toLowerCase

2013-02-18 07:30 • by Remy Porter
401444 in reply to 401442
There honestly should be such a tag. There should also be a tag that allows you to pass its contents to a different interpreter, thus making it easier to inline binary data.

Re: Internet.toLowerCase

2013-02-18 07:31 • by faoileag (unregistered)
401446 in reply to 401440
phynol:
Erik Gern:
if(pageHTML.toLowerCase().regionMatches(index, " tag }
What C family language supports arbitrarily closing a paren with a brace? Or are we just making WTFs up as we go along?

None.

But sometimes people forget that "pre" in HTML does not allow you to use angle brackets directly (without encoding them as their corresponding HTML entities).
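
For what it's worth, here is a minimal sketch of the escaping step that would have kept the snippet intact. The class and method names (`HtmlEscape`, `escape`) are made up for illustration and are not whatever the site actually runs. Inside "pre" only "&" and "<" strictly need encoding, as far as I know; escaping ">" and the double quote as well is harmless.

```java
final class HtmlEscape {
    // Replace the characters HTML treats specially with their entities.
    // Hypothetical helper, not the site's real publishing code.
    static String escape(String code) {
        StringBuilder out = new StringBuilder(code.length() + 16);
        for (int i = 0; i < code.length(); i++) {
            char c = code.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The line quoted in this thread, wrapped for display inside <pre>.
        String snippet = "if(pageHTML.toLowerCase().regionMatches(index, \"<img\", 0, 4)){";
        System.out.println("<pre>" + escape(snippet) + "</pre>");
    }
}
```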

Re: Internet.toLowerCase

2013-02-18 07:37 • by Remy Porter
And regarding the article, this isn't a WTF. Regular expressions are expensive and difficult to maintain!

Multiple Chained Method Calls

2013-02-18 07:42 • by faoileag (unregistered)
So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

Profiler, anyone?

Re: Internet.toLowerCase

2013-02-18 07:44 • by Raedwald
Parsing HTML with regular expressions? That never goes well.

Re: Internet.toLowerCase

2013-02-18 07:54 • by fa2k (unregistered)
So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
- Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
- Search for the first non-letter character, and use the string up to that as a key into a hash table.
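
A rough sketch of the second idea (hash table keyed by tag name), assuming `index` points at the "<". Everything named here (`TagLookup`, `HANDLERS`, the handler strings) is invented for illustration; the real checker's 70+ tags and tests are not shown in the article. It stops at the first character that is neither a letter nor a digit rather than the first non-letter, so that names like "h1" survive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class TagLookup {
    // Hypothetical tag-to-handler table; only three entries for brevity.
    private static final Map<String, String> HANDLERS = new HashMap<>();
    static {
        HANDLERS.put("a", "anchor");
        HANDLERS.put("applet", "applet");
        HANDLERS.put("img", "image");
    }

    // Returns the lower-cased tag name that follows the '<' at index,
    // or "" when no name follows. Only the name itself is lower-cased.
    static String tagNameAt(String html, int index) {
        int start = index + 1;
        int end = start;
        while (end < html.length() && Character.isLetterOrDigit(html.charAt(end))) {
            end++;
        }
        return html.substring(start, end).toLowerCase(Locale.ROOT);
    }

    // One hash lookup instead of one comparison per known tag.
    static String handlerFor(String html, int index) {
        return HANDLERS.getOrDefault(tagNameAt(html, index), "unknown");
    }
}
```

Usage would be something like `TagLookup.handlerFor(pageHTML, index)` for each "<" found, with the returned key dispatching to the matching test.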

Re: Internet.toLowerCase

2013-02-18 07:54 • by snoofle
401452 in reply to 401430
article:
...it also lower cased the entire document multiple times...
So it converted the entire 1+M document to lower case 70+ times for every tag in the file? That's a lot of cpu-grinding. This generates unnecessary heat.

Forget carbon emissions; this is where global warming comes from people!

Re: Internet.toLowerCase

2013-02-18 07:56 • by Noumenon (unregistered)
As a PHP newb, I'd be thankful if someone could name one of those "reliable libraries a developer could use to do the heavy lifting." A simple one, please.

Re: Internet.toLowerCase

2013-02-18 07:56 • by ZoomST (unregistered)
401454 in reply to 401447
Remy Porter:
And regarding the article, this isn't a WTF. Regular expressions are expensive and difficult to maintain!

Sure, and as The Guru told us, "the delay is a little price to pay as long as the code keeps its essence. Just put more CPU power and memory". And the boss just bent before those deep words, while we were hearing it with astonishing devotion.
Not a WTF at all. Just as The Guru told us.

Re: Multiple Chained Method Calls

2013-02-18 07:57 • by gnasher729 (unregistered)
401455 in reply to 401449
faoileag:
So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

Profiler, anyone?


Once you figure out that your code crashes, or takes a day to process a large page, the optimization is not premature anymore.

Testing for the "a"-Tag

2013-02-18 07:58 • by faoileag (unregistered)
I actually like the first test in the sample given: it fires on all tags starting with "<a", not only the anchor tag.

Ah well, the "Do a test involving the <a> tag"-Test will probably weed out applets, areas and the like.
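
To illustrate, assuming the anchor check has the same shape as the "<img" check in the article (a two-character `regionMatches` on "<a"), which is a guess on my part. `PrefixPitfall`, `looksLikeAnchor` and `isAnchor` are made-up names.

```java
final class PrefixPitfall {
    // Assumed shape of the original test: only "<a" is compared,
    // so it also fires on <applet>, <area>, <abbr> and friends.
    static boolean looksLikeAnchor(String lowerHtml, int index) {
        return lowerHtml.regionMatches(index, "<a", 0, 2);
    }

    // Stricter variant: the character after "<a" has to end the tag name.
    static boolean isAnchor(String lowerHtml, int index) {
        if (!lowerHtml.regionMatches(index, "<a", 0, 2)) {
            return false;
        }
        int next = index + 2;
        if (next >= lowerHtml.length()) {
            return false;
        }
        char c = lowerHtml.charAt(next);
        return Character.isWhitespace(c) || c == '>' || c == '/';
    }
}
```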

Re: Multiple Chained Method Calls

2013-02-18 08:04 • by faoileag (unregistered)
401457 in reply to 401455
gnasher729:
faoileag:
So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)
Once you figure out that your code crashes, or takes a day to process a large page, the optimization is not premature anymore.

Definitely not. And "Pedro the Profiler" rightfully comes to the rescue.

But storing the result of toLowerCase() in a temp var and working on that variable would be :-)
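
A minimal sketch of the temp-var version, for the record. The variable name `lower` and the counting logic are mine; only the "<img" check mirrors the line from the article, and the "<a " check is a stand-in for the other 70-odd tests.

```java
import java.util.Locale;

final class LowerOnce {
    // Counts "<img" and "<a " openings while lower-casing the page once.
    static int countKnownTags(String pageHTML) {
        // One O(n) copy up front instead of one per tag check.
        String lower = pageHTML.toLowerCase(Locale.ROOT);
        int count = 0;
        for (int index = lower.indexOf('<'); index >= 0;
                 index = lower.indexOf('<', index + 1)) {
            if (lower.regionMatches(index, "<img", 0, 4)) {
                count++;
            } else if (lower.regionMatches(index, "<a ", 0, 3)) {
                count++;
            }
        }
        return count;
    }
}
```

All indexes here refer to `lower`, not to the original string, which matters because lower-casing can change the length of a string for a few unusual characters.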

Re: Multiple Chained Method Calls

2013-02-18 08:07 • by faoileag (unregistered)
401458 in reply to 401457
faoileag:
But storing the result of toLowerCase() in a temp var and working on that variable would be :-)

But storing the result of toLowerCase() in a temp var and working on that variable straightaway before the method has had a chance to choke on large pages would be.

FTFM

Re: Internet.toLowerCase

2013-02-18 08:09 • by Black Bart (unregistered)
Slow yes, but who here thinks it would take 24 hours to process a single page?

Re: Internet.toLowerCase

2013-02-18 08:30 • by snoofle
401460 in reply to 401459
Black Bart:
Slow yes, but who here thinks it would take 24 hours to process a single page?
In fairness, have you seen some of the crap generated by Frontpage?

Re: Internet.toLowerCase

2013-02-18 08:57 • by ZoomST (unregistered)
401462 in reply to 401459
Black Bart:
Slow yes, but who here thinks it would take 24 hours to process a single page?

Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?

Re: Internet.toLowerCase

2013-02-18 09:12 • by Bobby Tables (unregistered)
401464 in reply to 401462
It's worse than that.

Every time a tag is found the entire page is converted to lowercase.

Re: Internet.toLowerCase

2013-02-18 09:12 • by Bobby Tables (unregistered)
401465 in reply to 401462
ZoomST:
Black Bart:
Slow yes, but who here thinks it would take 24 hours to process a single page?

Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?


It's worse than that - every time a tag is found on the page the whole page is converted to lowercase. 70+ times.

Re: Internet.toLowerCase

2013-02-18 09:21 • by Doctor_of_Ineptitude (unregistered)
401467 in reply to 401462
ZoomST:
Black Bart:
Slow yes, but who here thinks it would take 24 hours to process a single page?

Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?


You must be a Russian.

Re: Internet.toLowerCase

2013-02-18 09:23 • by faoileag (unregistered)
401468 in reply to 401465
Bobby Tables:
ZoomST:
Black Bart:
Slow yes, but who here thinks it would take 24 hours to process a single page?
Methinks. Do you imagine how painful should be to lowercase Finnish text? And more than 70 times?
It's worse than that - every time a tag is found on the page the whole page is converted to lowercase. 70+ times.

It's worse than that - every time an opening angle bracket is found, the whole page is converted to lowercase 70+ times, because all if-clauses are executed every time, no matter how early the current tag appears in that list of if-clauses.

That makes it N * 70+ lowercase calls, where N is the number of opening angle brackets in the page.
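
Back-of-the-envelope version of that count, as a sketch. The 1,000,000-character page and the 70 checks come from the figures thrown around in this thread; the 10,000 opening brackets is a number I made up purely for illustration.

```java
final class LowerCaseCost {
    // Characters touched by toLowerCase() if every tag check
    // lower-cases the whole page again.
    static long charactersLowerCased(long pageLength, long openBrackets, long tagChecks) {
        return Math.multiplyExact(Math.multiplyExact(pageLength, openBrackets), tagChecks);
    }

    public static void main(String[] args) {
        // 1,000,000 chars * 10,000 brackets * 70 checks
        System.out.println(charactersLowerCased(1_000_000L, 10_000L, 70L));
    }
}
```

With those inputs the product is 700,000,000,000 character conversions, against 1,000,000 if the page is lower-cased once.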

Re: Internet.toLowerCase

2013-02-18 09:34 • by DaveK
401469 in reply to 401451
fa2k:
So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
- Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
- Search for the first non-letter character, and use the string up to that as a key into a hash table.
If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

Re: Internet.toLowerCase

2013-02-18 09:43 • by Anon (unregistered)
401470 in reply to 401469
DaveK:
fa2k:
So the toLowerCase is clearly a WTF. Comparing the text to every known tag is at best a borderline WTF. There are more efficient methods, but they are more complicated to implement. I can think of:
- Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree. (or do this implicitly, with switch, but that could be even uglier and more WTF-y)
- Search for the first non-letter character, and use the string up to that as a key into a hash table.
If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.


Or the fiery wheel. Which is all kinds of awesome!

Re: Internet.toLowerCase

2013-02-18 09:45 • by Joe tester (unregistered)
<div class="CommentBodyFeatured">

Wait, does this actually work?

Featured Comment Baby!

</div>

Re: Internet.toLowerCase

2013-02-18 10:01 • by gnasher729 (unregistered)
401472 in reply to 401469
DaveK:
If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation. Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.

Re: Internet.toLowerCase

2013-02-18 10:49 • by foo (unregistered)
401473 in reply to 401471
Joe tester:
<div class="CommentBodyFeatured">

Wait, does this actually work?

Featured Comment Baby!

</div>
Works for me. Must be your fault. :)

Re: Internet.toLowerCase

2013-02-18 10:56 • by dkf
401474 in reply to 401472
gnasher729:
Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation.
While strcmp is awesomely fast, the hashing might be a reasonable approach if the string is long (since if the data is large enough, you'll effectively flush the DCache and your performance will be back to that of main memory). Depending on exactly what sort of match is desired.

Re: Internet.toLowerCase

2013-02-18 10:57 • by foo (unregistered)
401475 in reply to 401472
gnasher729:
DaveK:
If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.

Actually, with a good strcmp implementation, a dozen calls to strcmp will likely be faster than your homegrown hash implementation. Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.
As always, it depends. With some techniques you can search for different things simultaneously (e.g. a lexer generator such as flex, which uses parallel regular expressions), so you could shave off a factor of 70 here. Specialized CPU instructions can hardly match that.

Then again, if you get rid of the quadratic complexity (i.e. converting the whole string to lower-case and possibly anything else that traverses the whole string in each loop), you can shave off a factor on the order of a million for large files, so that's clearly the more important thing here. If that's done and it's still too slow (unlikely), you can care about a measly 70x speedup next.
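
One way to drop the whole-string conversion entirely, sketched under the assumption that the code is Java as it appears to be: `String.regionMatches` has an overload with an `ignoreCase` flag, so each check only looks at the few characters it compares. `NoCopyMatch` and `tagAt` are hypothetical names.

```java
final class NoCopyMatch {
    // Case-insensitive comparison of tag against pageHTML at index,
    // without building a lower-cased copy of the page.
    static boolean tagAt(String pageHTML, int index, String tag) {
        return pageHTML.regionMatches(true, index, tag, 0, tag.length());
    }

    public static void main(String[] args) {
        String page = "<P>Hello <IMG SRC=cat.png></P>";
        int index = page.indexOf('<', 1); // second '<', the IMG tag
        System.out.println(tagAt(page, index, "<img"));
    }
}
```

That still leaves the 70 sequential comparisons per bracket, but each one is now proportional to the tag length instead of the page length.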

Re: Internet.toLowerCase

2013-02-18 11:40 • by Huck Finn (unregistered)
if(pageHTML.toLowerCase().regionMatches(index, "<img", 0, 4)){ //Do a test involving the <img> tag }
But what if your code needs to be international? Do you really want to rewrite this to parse the Finnish <img> tag?

Plan ahead. Maybe you should include your list of tags expressed in every possible language, just to be sure.

Re: Internet.toLowerCase

2013-02-18 12:09 • by Jazz (unregistered)
401477 in reply to 401464
Bobby Tables:
It's worse than that. Every time a tag is found the entire page is converted to lowercase.


It's worse than that, he's dead, Jim.

Re: Internet.toLowerCase

2013-02-18 12:19 • by Jazz (unregistered)
401479 in reply to 401469
DaveK:
fa2k:
There are more efficient methods, but they are more complicated to implement. I can think of:
- Construct a tree-structure before processing, containing all known tags, where each node is a character. Then read each tag one character at a time while navigating the tree.
- Search for the first non-letter character, and use the string up to that as a key into a hash table.
If you really think that using a hash table to do string lookups is "complicated" and that sequential strcmps against every possible match is only a "borderline WTF", you should not be programming. Hash tables are about as basic as fire or the wheel.


Right out of college I worked for a giant global consulting firm with a one-word name that sounds like a sneeze. I wrote crap-tons of J2EE for lots of huge enterprise applications. At that firm, we would have been given bad marks on our review if we had implemented either of the solutions you suggest.

Speed and efficiency weren't really what our project leads cared about; making the code maintainable by cheap commodity programmers later was more their concern. If performance testing showed that the application had a bottleneck, they would just tell the client they're going to need some more infrastructure to drive the finished product.

More than once I brought a module to my lead for a code review, and in the module I had done fairly simple things, like caching the results of expensive methods, or adding a subclass so I could pass data around in logical, sensical ways, and I would be told that it was "too complicated" for future developers to understand, and would I please just code the simplest and most straightforward procedure that met the (barely coherent) specifications and not spend time thinking about how "best" to do it?

Anyway, my bitterness aside, it's entirely plausible that this code was written this way not because the developer thought it was a good idea, but because management found the good idea to be too complicated for their poor little brains.

Re: Internet.toLowerCase

2013-02-18 12:27 • by Jazz (unregistered)
401480 in reply to 401472
gnasher729:
Have a look at the instruction set of a newer Intel processor. There are additions to the instruction set that were specifically made because processing of XML etc. takes significant percentages of total CPU time.


This is TRWTF. A general-purpose processor should not have application-specific instructions implemented in hardware.

Sometimes I wish Intel would let their engineers design the chips, instead of having the marketing department do it. (Pentium 4, I'm looking at you.)

Re: Internet.toLowerCase

2013-02-18 12:30 • by chubertdev
this

Raedwald:
Parsing HTML with regular expressions? That never goes well.

Re: Internet.toLowerCase

2013-02-18 12:58 • by Rnd( (unregistered)
Thank some entity that my homework is only partially implementing the HTTP protocol... Why can't they have a nice strict spec on the web... Arbitrary whitespace and no case enforcement.

Re: Internet.toLowerCase

2013-02-18 14:36 • by Gary Olson (unregistered)
The Taginator -- destroying the web one page lookup at a time.

Re: Multiple Chained Method Calls

2013-02-18 14:52 • by A. Nonymous (unregistered)
401484 in reply to 401449
faoileag:
So you look at your code, and you think: "hmmm... maybe I shouldn't call toLowerCase() more than once on the same string".

Bang, along comes Donald Knuth and says "premature optimization is the root of all evil!" ;-)

Profiler, anyone?


No, in this case you have an easy reply to Donald: "It is not optimizing, I am only following DRY!"

Re: Multiple Chained Method Calls

2013-02-18 15:07 • by A. Nonymous (unregistered)
401485 in reply to 401484
This shows that Donald's advice is still good: If you don't write sh*tty code, there is probably no need to optimize. And if you wrote sh*tty code, it won't get better if *you* try to optimize it. Either way rule one of optimization holds: Don't do it.

Re: Multiple Chained Method Calls

2013-02-18 16:32 • by Joe (unregistered)
401488 in reply to 401485
A. Nonymous:
sh*tty
I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?

I hope it isn't a bad word. But if it is, I'm safe. As long as I don't know what it means, your bad word won't make me think a bad thought.

However if you've made some kind of error, that other people still understand, then they're still thinking bad thoughts despite your error.

So that couldn't be it.

Still confused.

Re: Multiple Chained Method Calls

2013-02-18 17:03 • by A. Nonymous (unregistered)
401489 in reply to 401488
Joe:
A. Nonymous:
sh*tty
I don't recognize that word. It isn't in my dictionary. Can someone tell me what it means?


Probably just a typo, seems to mean shoddy.