Admin
They weren't covered by my school.
Admin
I went to a community college so naturally we didn't cover any advanced topics. :)
Admin
Can't believe this quote hasn't appeared yet:
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." --Jamie Zawinski
Admin
We didn't cover them as such; a description of a very simple regular expression language turned up in a maths exam, once, under formal languages, but that was it.
Mind you, my course didn't really address 'practical' stuff; it let people figure that out if they felt like it. I must say, I think there's a lot to be said for this approach; if I had had lectures on how to make web applications I would probably have gone mad.
Admin
is never valid... the / is invalid in HTML, the capitals invalid in XHTML... but that probably won't stop most browsers from accepting it.
And I vote for s/<br\b[^>]*>/\r\n/gsi;... This'll work in any valid HTML/XHTML... but it won't be the same as the usual tag-soup parsers, and will choke on
which is invalid HTML but many browsers will accept.
If VB doesn't like \b then s/<br([^A-Za-z0-9-._:>][^>]*)?>/\r\n/gsi; is a working, but less readable, replacement... with the same caveats as above.
Also, why are people putting <\s*br in their expressions? You can't put spaces in there... it's illegal, and neither IE nor Fx will accept it.
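For anyone who wants to try the `<br\b[^>]*>` suggestion outside Perl, here is a minimal Python sketch. The helper name `br_to_crlf` is made up for illustration, and Python's `re` module stands in for the `s///gsi` form used above.

```python
import re

# Same pattern as s/<br\b[^>]*>/\r\n/gsi. IGNORECASE plays the role of /i.
# /s only changes what "." matches, and this pattern has no ".", so DOTALL is not needed.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def br_to_crlf(html: str) -> str:
    """Replace every <br ...> start tag with CRLF (hypothetical helper)."""
    return BR_TAG.sub("\r\n", html)


if __name__ == "__main__":
    # \b keeps the pattern from touching tags that merely start with "br".
    print(repr(br_to_crlf('a<br>b<BR/>c<br class="x" />d<brack>e')))
```

The `\b` is what keeps a tag such as `<brack>` untouched, since there is no word boundary between the `r` and the `a`.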
Admin
Can we implement a system in these threads to vote posts down?
Admin
. Nevertheless, it's (incorrectly) treated as
by most (all?) visual user agents. Actually,
is equivalent to
> This is due to the support for the null end-tags minimization. In short, <ELEMENT/stuff/ is a shortcut for <ELEMENT>stuff</ELEMENT> when the OMITTAG option is specified in the SGML declaration of SGML applications such as HTML.
Hello
/
is equivalent to
> which is equivalent to
>
Actually, XML inherits the /> syntax from SGML. The SGML declaration of XML specifies a NESTC (net-enabling start-tag close) delimiter equal to / and a NET delimiter equal to >.
So, in XML, instead of using the syntax <element/some data/, one must use the syntax <element/some data>. Moreover, XML adds a constraint (violating the SGML specification, but that's beside the point here): there must be no data at all between the NESTC and the NET, so the form can only be used for empty elements. That way, an SGML parser fed with the SGML declaration of XML will be able to parse an XML document.
Admin
I would welcome suggestions, as I'm not a regex master, but it seems to do the job quite well so far.
Admin
[quote=Phlip] And I vote for s/<br\b[^>]*>/\r\n/gsi;... This'll work in any valid HTML/XHTML... but it won't be the same as the usual tag-soup parsers, and will choke on
which is invalid HTML but many browsers will accept. [/quote]
BR supports the core attributes: id, class, style and title. id is restricted to name characters, but the style attribute may contain the > character in a valid, conforming HTML document, as in:
You can probably imagine more with the title attribute.
It's also possible (in valid, conforming, but unsupported code) to write <br
which should be interpreted as
because of the unclosed tags feature of HTML (not to be confused with the end-tag omission feature).
[quote=Phlip] Also, why are people putting <\s*br in their expressions? You can't put spaces in there... it's illegal, and neither IE nor Fx will accept it. [/quote] It isn't illegal, but it must be interpreted as < followed by spaces and the two br letters. Yeah < br> is never equivalent to
Admin
I gotta say, given the example the RegEx is way simpler than the gaggle of Replaces... But uhmmm doesn't it fail for 2/3's of what the Replace statements work on?
Admin
F*** there are some dickheads around!!
Admin
with or without a
is also acceptable.
Using all caps is also fairly reasonable, and I know some basic books on HTML recommended that (I think this was to make life easier when using an editor without highlighting, although I'm not sure).
Admin
Mine certainly didn't - although granted it was back in the day before world+dog decided to jump on the programming bandwagon.
There was no .NET, Java only just came out and there were none of these fancy fad languages that seemed to proliferate in the last decade. Heck, even html was new, where H1, H2 and H3 were the height of page formatting.
I did do the Compiler elective and the Finite Automata one as well, so it was a heck of a lot of theory - but not really delving into any specific implementations like regex.
You young-uns have it easy these days </old man's rant>
Admin
isn't valid... what you're after is
.
[edit] My mistake... turns out only less-than signs are verboten in attributes, not greater-than signs.
Damn, that makes this a lot more unnecessarily complicated.
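A rough sketch of what the "more complicated" version might look like, assuming the goal is only to skip over quoted attribute values so that a `>` inside `style` or `title` does not end the match early. The pattern and the name `br_to_crlf_quoted` are illustrative assumptions, not something posted in this thread, and Python is used only for convenience.

```python
import re

# After "<br" and a word boundary, consume either a whole quoted value
# ("..." or '...') or any single character that is not a quote or ">".
BR_WITH_QUOTES = re.compile(
    r"""<br\b(?:"[^"]*"|'[^']*'|[^>"'])*>""",
    re.IGNORECASE,
)


def br_to_crlf_quoted(html: str) -> str:
    """Replace <br ...> with CRLF, tolerating ">" inside quoted attributes."""
    return BR_WITH_QUOTES.sub("\r\n", html)


if __name__ == "__main__":
    print(repr(br_to_crlf_quoted('x<br title="a > b">y')))
```

The three alternatives start with different characters, so the regex engine never has two ways to consume the same input, which keeps backtracking cheap.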
Admin
Just goes to show how easy things are in PHP...
<? $str = "string where tags should be replaced, maybe read from a file"; $replace_arr = array("<br/>","","
"); for($x=0;$x<count($replace_arr);$x++){ $str = str_replace($replace_arr[$x],"\n",$str); } echo $str; ?>
that is bound to get them out.
Admin
You're on to something there... not to mention the well-known East Asian tags <brack> and <brue>
Admin
Then I will start using it to mess with your mind!
Admin
is invalid HTML. Moreover, it doesn't behave consistently among browsers. Opera 9 and IE 6 interpret
as
, while FF 1.5 interprets it as
. The W3C recommendation uses all caps for element names. Yes, this style is fairly reasonable.
Admin
Trade school graduates have got to stop trying to pass themselves off as college graduates. Being a Gamma-minus machine minder is nothing to be ashamed of. You have no idea, the troubles which Alphas and Betas need to deal with.
Admin
Some people just try to be morons with every post...
Sure, maybe the original developer didn't learn about regular expressions in Comp Sci. They did, however, learn about LCase(), didn't they? So they at least could have come up with one better way of doing what they did.
Or can you just not figure that out yourself?
Admin
Let's see how my attempt holds up
<([0-9a-zA-Z]+:)?br((\s+(((title|style)=".")|((id|class)="[0-9a-zA-Z]")\s+)(((title|style)=".")|((id|class)="[0-9a-zA-Z]"))?)|(\s/?))>
from the rest of the comments, this should cover all the bases for valid x?html and might even still be readable to somebody other than me.
Admin
Yeah. Because b, /, and > are totally whitespace characters. His only problem was failing to take into account the possibility that there might be something other than /, or nothing at all (except whitespace), between br and > - neither of which had anything to do with greedy mode.
Hint: Even with greedy mode, the * operator won’t eat anything that’s not matched by what it’s attached to. All your question marks are unnecessary, and they introduce something that has to be changed for different regex flavors (in vim, you use \{-} for a “non-greedy star” - POSIX basic regexes don't support it at all.)
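The point about greediness can be checked directly. In this small Python sketch, the two patterns are assumptions made up to illustrate the argument (the regex being criticized is not quoted in this thread); they differ only in the extra question marks.

```python
import re

# Hypothetical patterns: identical except for the lazy quantifiers.
GREEDY = re.compile(r"<br\s*/?>", re.IGNORECASE)
LAZY = re.compile(r"<br\s*?/??>", re.IGNORECASE)


def matches(pattern: re.Pattern, text: str) -> list:
    """Return every substring the pattern matches, in order."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "one<br>two<br />three<BR   />four<brack>"

# A greedy \s* stops at the first non-whitespace character, so both
# patterns find exactly the same tags.
assert matches(GREEDY, text) == matches(LAZY, text)

# Greedy \s* consumes the two spaces and nothing more.
assert re.match(r"<br\s*", "<br  />").group(0) == "<br  "
```

Laziness only matters when what follows the quantifier could also be matched by it, which is not the case for `\s*` followed by `/` or `>`.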
Admin
In the real world, a lot of programming is done by people without computer science degrees. And why not? For most business application programming, domain knowledge is as important as anything taught in CS classes.
Admin
This thread is so very funny. You all blathering about schools/degrees.
Many programmers do not have MISs or CSs. I have a business degree, am I certified MCDBA, and write VB6 and C#. I know about regexes and eschew their use. They are difficult to write for any type of complicated task and worse to maintain.
There are two concepts in coding that often are orthogonal, efficiency and maintainability. What is more important? In my world, it is maintainability.
Flame away all you CS degree holders...
Admin
I defy you to show me any instance where I have to choose between writing efficient code and writing maintainable code. If you are in the habit of sacrificing one for the other, then you should not be coding; you are not a programmer, you are a monkey with a keyboard. "I have a business degree, am I certified MCDBA..." I don't know, are you? It doesn't matter, an MCDBA is no substitute for a brain.
Admin
UVM will teach you finite state machines. Regex is more dubious, since I didn't see that in any of the required or optional classes I took. I'd kinda like to learn Regex, but I honestly think it looks like it's pretty limited in usefulness. I think a better use of time might be more comprehensive coverage of design patterns.
Admin
Well said.
Admin
I guess where you obtained your degree from determines the size of your ePenis. I have never stepped foot in a postsecondary classroom and probably never will. It took me 13 years of experience to get where I am now, but I am completely happy with my ridiculous salary and real world education.
...and all the CS grads that work for me usually bring the bad habits of their professors along with them.
Admin
[quote=Phlip]
My mistake... turns out only less-than signs are verboten in attributes, not greater-than signs. [/quote]
Both less-than and greater-than signs are allowed in attributes in HTML.
However, ampersand is interpreted as the start of an entity reference, so ampersands must be encoded as &amp;.
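As an illustration of the ampersand rule, here is a short Python sketch using the standard library's `html.escape`. The helper name `attr_value` is hypothetical. Note that `html.escape` also encodes `<` and `>`, which is more conservative than HTML strictly requires inside attributes but harmless.

```python
import html


def attr_value(text: str) -> str:
    """Encode text for use inside a double-quoted attribute (hypothetical helper)."""
    # "&" is replaced first, so entities produced for the other
    # characters are not double-encoded.
    return html.escape(text, quote=True)


if __name__ == "__main__":
    print(attr_value("R&D says 1 < 2"))  # R&amp;D says 1 &lt; 2
```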
Admin
Very true. Many clients would be happier with buggy, dodgy software that looks like a dream than software that executes flawlessly but looks like crap. It's all about perception.
Admin
Honestly, if you have trouble understanding the regex that was supplied, then get a new job. I don't care whether or not you have a MIS, CS, or business school degree. None of them, including a MCDBA, makes you a good programmer.
(In case anyone else doesn't know, MCDBA is short for, "I have a very small penis.")
Admin
I'm not even gonna try too hard, but if you know the code is valid HTML 4.01, even /<br(\s[^>]*|/)?>/i should cover most of the possible valid situations. But even then; what to do with
within comments? Or within a javascript string within a <script> block?
Some WTFs are actually no WTF if you know that certain situations won't occur. Don't overdo it. If you did so, you would really have to parse the entire HTML document, and then serialize it back to HTML. Certainly a simple regular expression based replacement would cover most circumstances. And for all other circumstances: fix the tool/person that supplied the HTML code in the first place.
Admin
Ha! Try this:
<!-- <br --><!-- <br> --><!-- </br> -->
I think it even does validate (with warnings).
Lesson for you kids: Regular expressions can't reliably parse HTML (at least not those which are humanly comprehensible).
Use DOM and getElementsByTagName() (or avoid crappy "web-oriented" environments that don't have methods for processing HTML/XML).
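A sketch of the parser route, with one substitution stated plainly: instead of a DOM and `getElementsByTagName()`, this uses the event-based `html.parser.HTMLParser` from Python's standard library. The class name `BrToText` and the function `html_to_text` are made up for illustration.

```python
from html.parser import HTMLParser


class BrToText(HTMLParser):
    """Collect text, turning <br> into CRLF and skipping script/style bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skipping = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> is routed here as well.
        if tag == "br":
            self.parts.append("\r\n")
        elif tag in self.SKIPPED:
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

    # Comments are ignored because handle_comment is not overridden.


def html_to_text(html: str) -> str:
    parser = BrToText()
    parser.feed(html)
    parser.close()  # flush any text still buffered
    return "".join(parser.parts)
```

With this approach a `<br>` inside a comment or a script string never reaches `handle_starttag`, so it is not turned into a line break.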
Admin
correctly
It does get confused by the comments - but assuming that the only useful purpose for replacing <br...> with CRLF is to convert to plain text, the routine should strip out comments/scripts/styles/etc first anyway, so the comments wouldn't be a problem.
Yes, parsing HTML properly needs a proper parser - but converting HTML to text can be done reasonably well using a set of regexps.
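The "set of regexps" idea might look roughly like the following Python sketch: strip comments, scripts and styles first, then replace the line breaks. The patterns and the name `rough_text` are assumptions for illustration, and the result is deliberately approximate rather than a real HTML parser.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# \1 requires the closing tag to match the opening name (script or style).
SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def rough_text(html: str) -> str:
    """Drop comments and script/style blocks, then turn <br> into CRLF."""
    html = COMMENT.sub("", html)
    html = SCRIPT_STYLE.sub("", html)
    return BR_TAG.sub("\r\n", html)


if __name__ == "__main__":
    print(repr(rough_text('a<!-- <br> -->b<br>c<script>x="<br>"</script>d')))
```

The order matters: removing comments and scripts before the `<br>` pass is what keeps their contents from producing stray line breaks.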
Admin
hahahahahahahah, I love it, I work with CS grads with equal years of experience as me and I am paid more. I have to laugh at all you CS grads that can't even write the most simple SQL.
Admin
It was not efficient: it was efficient to author (not having to crack open the regex help file to start typing the line noise), or are you speaking of execution efficiency? What if this only ran once a day?
Hard to extend: within the context of adding other
permutations, it would be easy to add a couple more lines. How to extend the regex adds more complexity to a simple problem.
Unclear what exactly it was trying to accomplish: huh? it was replacing various instances of
with vbCrLF. The code comment says 'Replace
with vbCrLf. The code lists all of the ways
could be spelled to get the vbCrLf, with no surprises.
Hard to make sure that all of the relevant cases were considered: Bingo. But it does list the cases it does consider.
Heck, I'm not even against the use of a regex to simplify that replacement. But to make a wtf out of it... If this snippet is considered a serious wtf, the programmers these days are getting to be really good!
Admin
Hey look! Broad stroke brush in the works! What's your next act? Women can't be programmers? All Blacks are criminals? French are surrender monkeys? C'mon... you can do better!
-- Seejay
Admin
I'm sure I could come up with a few more for your entertainment. I really don't appreciate being called a monkey in front of a keyboard because I don't have a CS degree.
I believe regexes suck, not because of the problems I have using them, but because of supporting other people's code, often written by CS degree holders. Regexes are often the cause of my rant on maintainability vs efficiency. Take that for what you will.
Here are some for your entertainment:
How do you tell if a CS degree holder is an introvert or an extrovert?
Whether they look at their shoes or yours when they talk to you.
How do you keep a CS degree holder in the shower all day?
Give them a shampoo bottle that says rinse, lather, repeat.
Admin
With the same notion of "extensible", the following program is an extensible multiplicator:
int multiply(int x, int y)
{
    if (x == 1 && y == 1) return 1;
    else if (x == 2 && y == 1) return 2;
    else if (x == 1 && y == 2) return 2;
    else if (x == 2 && y == 2) return 4;
    /* easily extensible, just add new conditions! */
}
Admin
That would be "Cheese-eating Surrender Monkeys," Seejay, and I'm ashamed at you not remembering that. Particularly because it's hysterically funny, and I like French cheese. But mostly only in France. Unless you can get Raclette your way, which I definitely recommend. Oh, and that thing with the ash in the middle and the morning cheese on top and the evening cheese below. Or is that Raclette? I forget. I certainly wouldn't recommend anything calling itself Camembert or Brie in the US, because it's either rancid or a lie.
Anyway, isn't it up to the monkeys to surrender?
Admin
Unfortunately, I noticed that worthlessFred is certified.
Now, I'm a politically-correct sort of guy. All kinds of creeds, colours, sexes, religions, and spoons are grist to my metrosexual mill.
What, precisely, might a "certified MCDBA" be? And why should the nation (any nation) care?
Unfortunately, I have to disagree with you. This isn't a choice. This is a hierarchy.
(1) Write documented and (repeatably) testable code. (2) Write maintainable code (3) Write efficient code.
(3) is generally regarded as an 80/20 rule. (2) is, interestingly enough, also regarded as an 80/20 rule.
To transition from (3) to (2), (1) almost certainly gives you a benefit better than 80/20.
Well, I don't have an MDBARGHCEX-ComeInPluto, and I think my brain is gently frying right now, but I'd be careful of too much "efficiency" at the expense of "maintainability" if I were you. If only because I do maintenance, I'm much bigger than you, and I know the dark alleys round where you live. (isn't Google Earth a wonderful tool?)
Believe me. We'll all be happier if it's maintainable, but not particularly efficient.
Mind you, if you could make it scalable and testable, then you've got my vote, and MCDBAs be damned.
Admin
I'd say that in almost every corner of our code we push toward maintainability while keeping efficiency, unless you're not part of the "we" I mentioned.
But then, we can't be perfect all the time. Sometimes you must choose between costly pants made of the best material and cheap pants made of middling material.
Sure, we got both maintainable and efficient code when we wrote this single line:
Regex.Replace(html, "<br ?/?>", vbCrLf, RegexOptions.IgnoreCase)
In some cases we do face a fork: which way should I go, maintainable or efficient? Even then, we can minimize the cost of sacrificing one of them.
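To see what the single-line `<br ?/?>` pattern does and does not cover, here is a small Python check. The helper `covered` is hypothetical, and Python's `re.IGNORECASE` stands in for `RegexOptions.IgnoreCase`.

```python
import re

# Same pattern as in the Regex.Replace call above.
SIMPLE_BR = re.compile(r"<br ?/?>", re.IGNORECASE)


def covered(tag: str) -> bool:
    """True if the whole tag is matched by the simple pattern."""
    return SIMPLE_BR.fullmatch(tag) is not None


# Common spellings are handled.
assert all(covered(t) for t in ["<br>", "<br/>", "<br />", "<BR>", "<Br />"])

# Extra whitespace or attributes are not.
assert not covered("<br  />")
assert not covered('<br class="x">')
```

Whether the uncovered cases matter depends on where the HTML comes from, which is the trade-off this thread keeps circling.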
Admin
And the colleague got promoted because he wrote so many lines of code. How productive of him!
Admin
LOL =)) nice to have a colleague like him.
Admin
You're likely right about it being a junior developer, but I also wouldn't be surprised if it were someone with years of experience. I've seen some people go to extremes to avoid learning regular expressions, or really anything new for that matter.