The Daily WTF: Curious Perversions in Information Technology

FredSaw · 2007-11-26

Welbog:

public void get(ThrowableObject key) {
   /* ... */
   if (key.tostring == "item") {
     throw item;
   }
   /* ... */
}

Why don't you throw key instead? Then it'll compile.

2007-11-26

Keith Hackney:
There is a language outside of computers called english that some of us like to use.

Haven't seen anyone using this 'english' you speak of in this forum.

Jackal von ÖRF · 2007-11-26

Here are two functions (in Java) which do the same thing as today's WTF code.

First with regexps. (It would be faster to use a pre-compiled pattern, though.)

    public static String removeSpecialCharsV2(String s) {
        return s.replaceAll("`", "'").replaceAll("[^A-Za-z\\Q\"&'()-./\\E\\x00-\\x0E\\xC0-\\xFF]", "");
    }
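
A minimal sketch of the pre-compiled variant mentioned above, assuming the same character set as removeSpecialCharsV2. The class and field names (PrecompiledFilter, DISALLOWED) are made up for illustration.

```java
import java.util.regex.Pattern;

final class PrecompiledFilter {
    // Compiled once and reused for every call; Pattern instances are immutable.
    // Same character set as removeSpecialCharsV2, with the hyphen escaped
    // instead of wrapping the punctuation in \Q...\E.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^A-Za-z\"&'()\\-./\\x00-\\x0E\\xC0-\\xFF]");

    static String removeSpecialChars(String s) {
        // String.replace(char, char) is a literal replacement, no regex involved.
        String withoutGraves = s.replace('`', '\'');
        return DISALLOWED.matcher(withoutGraves).replaceAll("");
    }

    public static void main(String[] args) {
        // Spaces, commas and digits are not in the allowed set, so they vanish too.
        System.out.println(removeSpecialChars("O`Brien & Sons, Ltd. (est. 1999)"));
        // prints: O'Brien&SonsLtd.(est.)
    }
}
```

Whether this is measurably faster than String.replaceAll depends on how often the method is called; the sketch only shows where the compilation moves to.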

Then with an imperative approach.

    public static String removeSpecialCharsV3(String s) {
        StringBuilder result = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '`') {
                c = '\'';
            }
            if (isAsciiAlphabet(c) || isHarmlessSpecialChar(c) || isHarmlessControlChar(c)) {
                result.append(c);
            }
        }
        return result.toString();
    }

    private static boolean isAsciiAlphabet(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= 'À' && c <= 'ÿ');
    }

    private static boolean isHarmlessSpecialChar(char c) {
        String excluded = "\"&'()-./";
        return excluded.contains(Character.toString(c));
    }

    private static boolean isHarmlessControlChar(char c) {
        return (c >= 0 && c <= 14);
    }

Addendum (2007-11-26 14:04): The WTF appears to be that the code DOES remove the comma (,) even though the method name claims the opposite, twice. Also, the handling of chars 0-14 is questionable. Maybe the original intent was to keep only the tab (9), newline (10) and carriage return (13) chars instead of the whole 0-14 range.

Addendum (2007-11-26 14:11): (In my code isAsciiAlphabet should actually be called isLatin1Letter.)

2007-11-26

Cloak:
etc:
Cloak:
But WTF do they do with chars between 15 and 31? They aren't even printable.

They remove them. WTF, did you not read the code?

Ahmmm, I'm sorry? Who did not read it?

At this point I'm not sure reading it would help you. I think you mentioned "15" because you saw it in the line "if ((chrCode < 15)". The little open-mouth-shaped thingy between "chrCode" and "15" is called a less-than sign. It lets chars through if they're less than 15. And that's the first of a long list of conditions under which the character will be included in the output. Everything else, such as characters 15 thru 31, or even 16 thru 30 ("between 15 and 31"), will not be included -- that is, will be removed.

Apology accepted.

Noam Samuel · 2007-11-26

Just asking: what would be the best way to write the function? As far as I can see, this is slightly different than a search/replace/escape because it whitelists rather than blacklisting. Using (for example) an allowed chars array would be slower because searching a char array for a char is more costly than making a few integer comparisons. I don't know much Java, but he could probably make the function a bit clearer by leaving in character literals rather than putting the integers there (in C, at least, character literals would work just as well as integer literals). As for regular expressions... do the words "now he has two problems" ring a bell?

2007-11-26

This will like totally fail for ebcdic!

snoofle · 2007-11-26

Zemm:
titrat:
>Regular Expressions would seldom be easier to understand

s/[^-&'(),./0-9A-Za-z]//;

How's that hard to understand? ;-)

I've been working with RE's for nearly 25 years now, and am quite comfortable with them, but there's a guy in my office who stores reg ex's in Excel, then dumps to xml, then uses xslt to formulaically merge them into combined reg-ex's - walking his code is truly a zen experience!

brazzy · 2007-11-26

Noam Samuel:
Just asking: what would be the best way to write the function? As far as I can see, this is slightly different than a search/replace/escape because it whitelists rather than blacklisting. Using (for example) an allowed chars array would be slower because searching a char array for a char is more costly than making a few integer comparisons. I don't know much Java, but he could probably make the function a bit clearer by leaving in character literals rather than putting the integers there (in C, at least, character literals would work just as well as integer literals). As for regular expressions... do the words "now he has two problems" ring a bell?

It's C#, not Java. For maximum performance, a boolean array where you use the char's code as an index to look up whether to include it or not would probably be the best idea.
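
A rough sketch of that lookup-table idea, written in Java to match the other examples in this thread. The allowed set is taken from the code posted above (chars 0-14, A-Z, a-z, À-ÿ and "&'()-./); the names LookupTableFilter and KEEP are made up for illustration.

```java
final class LookupTableFilter {
    // One flag per Latin-1 code point; true means "keep this character".
    private static final boolean[] KEEP = new boolean[256];

    static {
        for (char c = 0; c <= 14; c++) KEEP[c] = true;        // control chars 0-14
        for (char c = 'A'; c <= 'Z'; c++) KEEP[c] = true;
        for (char c = 'a'; c <= 'z'; c++) KEEP[c] = true;
        for (char c = 0xC0; c <= 0xFF; c++) KEEP[c] = true;   // À through ÿ
        for (char c : "\"&'()-./".toCharArray()) KEEP[c] = true;
    }

    static String removeSpecialChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '`') {
                c = '\'';                 // grave accent becomes apostrophe
            }
            // Anything at or above 256 falls outside the table and is dropped.
            if (c < KEEP.length && KEEP[c]) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The table costs 256 booleans and turns the per-character decision into a single array read. Whether that beats a handful of range comparisons in practice would have to be measured.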

Iago · 2007-11-26

brazzy:
For maximum performance, a boolean array where you use the char's code as an index to look up whether to include it or not would probably be the best idea.

And this is exactly what a sufficiently smart compiler would convert the appropriate regex to.

(And people, please stop mindlessly quoting that "two problems" thing. It does not mean that a regex is never the right solution.)

2007-11-26

Lisp-style languages can represent the name gracefully:

(defun |REMOVE-SPECIAL-CHARS-EXCEPT-"&'[],-./| (s) ...)

2007-11-26

Sounds like the "longest serving developer" should lose his job.

Either that or you should find a new one.

chrismcb · 2007-11-26

Noam Samuel:
Just asking: what would be the best way to write the function? As far as I can see, this is slightly different than a search/replace/escape because it whitelists rather than blacklisting. Using (for example) an allowed chars array would be slower because searching a char array for a char is more costly than making a few integer comparisons. I don't know much Java, but he could probably make the function a bit clearer by leaving in character literals rather than putting the integers there (in C, at least, character literals would work just as well as integer literals). As for regular expressions... do the words "now he has two problems" ring a bell?

Best way? Depends... Personally I would forgo regexp, as you can see by the plethora of regexp displayed here, they are easy to get wrong.

Since you only care about the unicode chars < 256, a lookup table would be quick and simple.

But if you are going to write code like this... don't bother to convert it to an int.

don't do things like : if (x = this or that) add x to new else if x = 39 convert to 39 then add the new.

Just add 39 to new (much like this did) There is no reason to convert 39 to a 39.

I also would do: c = s[i]; if (c = 30 or ....) sb.append(c); //don't use sb.append(s[i]);

Doing that forces the reader to go back and figure out what s[i] is, and why s[i] is being added when the if is checking the value of c. Not to mention forcing the computer to do more work to get the same value you already have.

VGR · 2007-11-26

Wow. What an obscenity. Goggles indeed.

This is what you get from anyone who claims self-documenting code is all the documentation that's needed.

Self-documenting code helps... but it's not enough. It's not even close to enough. And if this WTF doesn't drive that point home, I don't know what will.

2007-11-26

Right-Click the function name, Refactor -> Rename. Simple in Eclipse!

Captcha: paint Ugh.

2007-11-26

Actually the code is Java, albeit very badly written Java ;-)

2007-11-26

I'll give you harder to understand if you're not very familiar with regexes, but slower??? They're not slow. I take it you have little experience using them, because they're NOT slow and I've written many of them.

They usually get optimized into some sort of state machine. Even a mind-bogglingly ugly regex (take a look at that one someone made to match an email address, per the RFCs) runs very quickly. You have to do some ridiculously insensible pattern like aaaaa* to get regexes to be slow. The only bad ones I've seen were crafted as examples. If you actually understand what the regex is doing, you only get a slow one if you're doing something really stupid. You could say the same about SQL, because you're doing something very like Cartesian joins at that point.

Particularly, when all you're doing is a bit of character filtering, it's bound to be MUCH faster than all those object method calls. Unless they get optimized the hell out of the way, that is.

Seriously, a function like this is one line of Perl. You choose the range to allow, tr/your chars here// the bugger, and do this stupid thing once in the function you get input from, not in its own function. But that's why this is on the daily WTF, of course.

2007-11-26

Might Convert.ToInt32(p_string[i]) return an int > 255? If not, why do we have the test "... && chrCode < 256)"?

I would expect a regular expression to be faster, because it may operate on the whole string.

But the class of discriminated characters looks suspicious. (chr)0-15 allowed, 0-9 not, (chr)191 allowed again...

But we don't know how often it is called, how often the configuration file changes, and who has access to it.

2007-11-26

What's the problem? The name of the function adequately reflects the poor standard of the code which it contains, thereby warning future programmers against using said function. Granted it could be a little clearer, such as "ThisIsAReallyReallyCrappyFunctionDoNotCallItEverOkay", but it works as it is.

2007-11-27

Bulletmagnet:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. —Jamie Zawinski, in comp.lang.emacs

'Some people, when confronted with a problem, think "I know, I'll use <whatever it is you want to slag off>." Now they have two problems' - Verity Stob's superior, adaptable, data-driven template version of an original quote by Jamie Zawinski

2007-11-27

Using regular expressions is slower, about 10 times slower, I tested it. Also the original code can be optimized quite a bit by removing unnecessary substring operations and conversions. This is what I came up with:

public static string RemoveSpecialCharacters(string p_string)
{
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < p_string.Length; i++)
    {
        char c = p_string[i];

        if (c == '`')
        {
            c = '\'';
        }

        switch (c)
        {
            case '"':
            case '&':
            case '\'':
            case '(':
            case ')':
            case '-':
            case '.':
            case '/':
                sb.Append(c);
                break;
            default:
                if ((c < 15)
                    || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= 'À' && c <= 'ÿ'))
                {
                    sb.Append(c);
                }
                break;
        }
    }

    return sb.ToString();
}

And the regex code:

private static readonly Regex remover = new Regex(@"[^A-Za-zÀ-ÿ""'&\/()\-.]", RegexOptions.Compiled);
private static readonly Regex graveRemover = new Regex(@"`", RegexOptions.Compiled);

public static string RemoveSpecialCharactersByRegex(string p_string)
{
    string removedGrave = graveRemover.Replace(p_string, "'");
    return remover.Replace(removedGrave, "");
}

The test string was a random 5 MB string (Having characters 0..255). The machine was a The testing times are:

RemoveSpecialCharsExceptQuote... Time: 00:00:00.5000096
RemoveSpecialCharacters Time: 00:00:00.2812554
RemoveSpecialCharactersByRegex Time: 00:00:01.4844035

2007-11-27

Eh, your regex version is just wrong. It removes graves rather than replacing them. And nobody would use a regex for a single-character replacement; just use String.Replace() for that.

2007-11-27

Never mind, I'm not awake. Despite the name, you do actually use it to replace graves. My point about not using a regex for that still stands, though.

2007-11-27

Slower?:
I'll give you harder to understand if you're not very familiar with regexes, but slower??? They're not slow. I take it you have little experience using them, because they're NOT slow and I've written many of them.

I couldn't agree more! For all those nitwits here that claim that regexes are slow, read the following article!

http://swtch.com/~rsc/regexp/regexp1.html

I don't know the current state of regex in Java, Perl, Python, ... or what C# and co use but there are good and fast implementations of regex.

And if you want to make sure, roll your own version and treasure it. All the info is there in the article!

</rant> Sorry, had this regexes are slow discussion too many times.

2007-11-27

JM:
Eh, your regex version is just wrong. It removes graves rather than replacing them. And nobody would use a regex for a single-character replacement; just use String.Replace() for that.

I did some more testing. It's true that String.Replace() is faster than a normal regex replace, however if you use a compiled regex, the difference is very small (I didn't see any).

Don't get me wrong, I'm not saying that regex is slow as hell, I'm just saying that it's a little slower than optimized code. In this case I would actually prefer the regex version to the other ones, because it turned a crappy 30-line monstrosity into two-line clean and nice function.

It would only matter if you need to process strings that are larger than a megabyte and you need to save a couple hundred milliseconds.

2007-11-27

Jackal von ÖRF:
Can somebody tell me what the special chars are, except quote, ampersand, apostrophe, open bracket, close bracket, comma, hyphen, full stop, comma and forward slash? (I suppose comma is mentioned twice in the list just to make sure.) My brain blew up when I was trying to find out the characters which the code removes. That code could be written with one regexp replace command.

I think those commas are for "," i.e. the function name is taken from the requirement document that says:

[enterprise_requirement 34909422]Remove special characters except quote ampersand apostrophe open bracket close bracket, full stop, and forward slash.

afaict the code doesn't remove commas. It does remove special characters like 0 1 2 3 4 5 6 7 8 9 though.

2007-11-27

s.:
If you absolutely, positively have to use 'magic numbers', COMMENT!

What do you think this line from near the top is?

//34 | 38 | 39 | 40 | 41 44| 45 | 46 | 47

OK, it doesn't explain what the numbers mean, but it does save you scrolling down ten or so lines to see the code. Actually it doesn't - you have to look at the code anyway because the numbers aren't the same as in the comment. So which one is right? Who knows - maybe it's some kind of redundancy mechanism for safety.

But it is a comment.

2007-11-27

SlyEcho:
The test string was a random 5 MB string (Having characters 0..255). The machine was a The testing times are:
RemoveSpecialCharsExceptQuote... Time: 00:00:00.5000096
RemoveSpecialCharacters Time: 00:00:00.2812554
RemoveSpecialCharactersByRegex Time: 00:00:01.4844035

Does the regex time include the time required to compile the regex?

2007-11-27

dunbar:
Actually the code is Java, albeit very badly written Java ;-)

Really appalling bad Java, considering Java has no "string" class, and Strings have no "Length" field. In fact, it's almost as if... no, surely it couldn't be C#.

2007-11-27

function isEmailValid($email){
    //TODO: regex here
    return true;
}

2007-11-27

Soviut:
What happens if (or rather, WHEN) new exception characters need to be added? Do they refactor all their code so their function name remains accurate?

Does manually doing a find/replace count as refactoring?

In any case, that would be stupid. You'd just copy the existing one to <name>_old before changing it. Obviously you'd document that it doesn't actually do what its name says with a post-it note stuck to the file cabinet. After a week or two, it will curl up and fall off.

Grovesy · 2007-11-27

Just to add a little bit to this...

The long argument (which had kind of been going on for a week or two) was about the whole idea of these formatters. From what we could work out, there was no business need that made any sense for these formatting functions (and there were a lot of them)... It was just how it was 'done'... People argued, people cried, but the Business Analyst won the argument and forced the development team to comply... The legacy was put in place, and it became an unwritten rule that we had to strip random chars from just about every piece of data we handled... fortunately things have changed recently.

Instead of validating fields that did require validation and throwing errors, we simply stripped out the data... very bad user experience in my book...

This was for an Address line, so in my book I would have called the function 'FormatAdressLine', and I would have probably moved it, for what it was worth into the Address class, or called this function from the property setter in the Address class. I would probably have written it in a very different way

The WTF was several fold…

There was a Formatter class which contained hundreds of these functions. Many of them were duplicated, badly named and had no comments on their intention of what they actually did (without having to read through the code). Often the function was cut short, so they were actually doing more than what it said on the tin... (Hence I rather obtusely named this function thus..)

Because these formatters were called through some evil mapping layer (each application had its own xml file), it was very hard to work out from where these methods were called.

To give you an idea of some of the pain.

RemoveSpecialCharsExceptHyphen (actually removes double quotes and underscores)
RemoveSpecialCharsExceptHyphenApostrophe (leaves double quotes alone, and strips * & and %)
RemoveSpecialCharsExceptFullstopCommaHyphenApostropheSpace (my function, for some reason has been renamed.. so it does actually do more than the method name suggests...)

2007-11-27

This is what happens when you hire "software engineers" instead of code hackers. I do most of my programming using scripting languages. If I need to sort some strings I'll just write a little script to do it. I won't bother with a user interface if it is something that only needs to be done once. If I need to reformat some text I might just use search and replace in Ultra-Edit, which supports regular expressions. Then I won't even need to write any code.

A "software engineer" would create some generic method that makes no assumptions about the data type. Then he would try to make it extensible to cover every conceivable situation.

Give a software engineer the simple task of sorting some strings and it will become a major engineering feat that can sort any kind of object in any conceivable order. But first you'll need to figure out the cryptic parameters for sort order, sort depth, sort range, and the mathematical precision of the sort. None of this will be documented.

Welbog · 2007-11-27

Robert S. Robbins:
This is what happens when you hire "software engineers" instead of code hackers.

I can't tell whether you are an extraordinarily talented troll or just crazy.

Just in case you are being serious, though, I think it's important to point out that the problem isn't "software engineers" but in fact an idiot whose title is "software engineer".

2007-11-27

OK, it's ugly coding. One could use a regex, indeed, or even store the chars in an array and compare this way. OR WHATEVER.

But down the road the true questions are:

Does it do the job well? (Rule #1)
Is it prone to glitching? (Rule #2)
Is it easy to analyze and maintain? (Rule #3)

If the construction industry was acting the way you are, they'd build one house instead of ten.

Of course, it would be very nicely built, all the nails would be driven in the middle of every stud and at regular intervals, meaning no nail in the middle of nowhere or splitting the side of a stud, all the beams and studs cut perfectly at the right dimension, all the blocks would be perfectly aligned, all the cabinets would be placed at exactly the right level, with no variation in degree, all the roofing and flooring tiles perfectly straight and even. Even the stucco would have a nice tiling effect of the same structure to please the eye. Probably the nicest house built in the whole development. There would even be a gnome coming out of a trap door to spray air freshener when needed.

But you'd go out of business.

Bottomline is: if the code behind follows the 3 rules, do you think the end user gives a crap?

m0ffx · 2007-11-27

When looking at this problem, my first thought was to use a regexp. Now I have two problems.

2007-11-27

Nicolas Verhaeghe:
OK, it's ugly coding. One could use a regex, indeed, or even store the chars in an array and compare this way. OR WHATEVER.
But down the road the true questions are:

Does it do the job well? (Rule #1)
Is it prone to glitching? (Rule #2)
Is it easy to analyze and maintain? (Rule #3)

Bottomline is: if the code behind follows the 3 rules, do you think the end user gives a crap?

Looks to me like we're breaking Rule #3. Which means we can't answer Rule #1 and makes me inclined to answer "yes" to Rule #2.

2007-11-27

savar:
Zemm:
titrat:
>Regular Expressions would seldom be easier to understand

s/[^-&'(),./0-9A-Za-z]//;

How's that hard to understand? ;-)

I guess you're being a bit sarcastic, but I find that pretty easy to read. If you know the purpose of a regex, deciphering it becomes much easier because there are really only a few sensible ways to write it.

Only thing I would change is to add the global flag:

s/[^-&'(),./0-9A-Za-z]//g

...and digit char (save 1 char) + the ignore case flag (save two chars!)

s/[^-&'(),.\/\da-z]//gi

Too bad they didn't want to also strip the underline (_) or we could get rid of even more (seven char savings, baby!!!):

s/[^-&'(),.\/\w]//g

\ --CAPTCHA: "dubya" - the biggest WTF ever. //

java.lang.Chris; · 2007-11-28

wtf:
titrat:
>Regular Expressions would seldom be easier to understand, and they are an order of magnitude slower.
You do need CS 101 refresh... This is bullshit, (compiled) regex execution is as efficient as it gets. Unless an idiot implement it, I guess.

Are compiled regexes threadsafe in .Net when used for matching? If not then the overhead of synchronisation could make it slower than this boondoggle. If I was to write this method in Java - which is a very big if - then apart from giving it a sensible name, I would at least construct the buffer with the length of the input string as an argument to prevent excessive preallocation of space or reallocations if the space is too small ...

java.lang.Chris; · 2007-11-28

dunbar:
Actually the code is Java, albeit very badly written Java ;-)

No it's C#. The string class in Java is String, in C#, as in the WTF, it's string. The method to get a string representation of an object is toString in Java, in C# and again in the WTF, it's ToString. Note the case, which is often the only thing that differs in the core Java and .Net APIs.

java.lang.Chris; · 2007-11-28

Bulletmagnet:
Jackal von ÖRF:
That code could be written with one regexp replace command.
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. —Jamie Zawinski, in comp.lang.emacs

Some people, when confronted with their refusal to learn regular expressions, think "I know, I'll quote Jamie Zawinski". Now they have two problems.

Just consider how fscked up Netscape code was (by Zawinski's own admission), and then consider whether his opinions are worth much.

2007-11-28

etc:
Cloak:
etc:
Cloak:
But WTF do they do with chars between 15 and 31? They aren't even printable.

They remove them. WTF, did you not read the code?

Ahmmm, I'm sorry? Who did not read it?

At this point I'm not sure reading it would help you. I think you mentioned "15" because you saw it in the line "if ((chrCode < 15)". The little open-mouth-shaped thingy between "chrCode" and "15" is called a less-than sign. It lets chars through if they're less than 15. And that's the first of a long list of conditions under which the character will be included in the output. Everything else, such as characters 15 thru 31, or even 16 thru 30 ("between 15 and 31"), will not be included -- that is, will be removed.

Apology accepted.

Yes, you're right. But then what do they keep 0 to 14 for? Well, 0 may be OK but still...

Grovesy · 2007-11-28

Also the use of StringBuilder instead of StringBuffer is a giveaway.

Anyhow, sarcasm aside, yes this is c#.. I know, I wrote it

2007-11-28

One place I worked, the Java developer who headed the group made us call a method that would convert all "'" (single quotes) to "`" (back-ticks) before they were used in an Informix SQL insert or update via a regular Statement. On retrieval, we were supposed to convert "`" back to "'".

Apparently, either he had never heard of parameters in prepared statements (i.e. "PreparedStatement.setString(i, String)"), or he was too lazy to double up single quotes when building the SQL string for a Statement obtained from Connection.createStatement(). Doing either of those would have preserved the integrity of the input data. The first choice is also the more portable of the two, leaving all DB-specific pre-massaging of the data to the JDBC driver itself.
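
For anyone who hasn't seen the two options side by side, here is a minimal sketch. The table and column names (customers, name) and the class name QuoteHandling are hypothetical; only the JDBC calls themselves are standard.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class QuoteHandling {
    // Option 1: bind the value as a parameter and let the JDBC driver deal
    // with quoting. The stored value is exactly what the user typed.
    static void insertName(Connection conn, String name) throws SQLException {
        String sql = "INSERT INTO customers (name) VALUES (?)"; // hypothetical table
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.executeUpdate();
        }
    }

    // Option 2: if the SQL text really must be built by hand, double up the
    // single quotes so the literal round-trips unchanged.
    static String escapeSingleQuotes(String value) {
        return value.replace("'", "''");
    }

    public static void main(String[] args) {
        System.out.println(escapeSingleQuotes("O'Brien")); // prints: O''Brien
    }
}
```

Option 2 only covers the single-quote case discussed here. It is not a general defence against malformed or hostile input, which is one more reason to prefer the parameter binding.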

This was the same guy whose code ate all Exceptions, creating more "user-friendly" messages which were fetched from an exhausting array of strings - as if he knew every condition that the program would fail at every point. He didn't. Those "friendlier" errors were passed back in a public field, accessed publicly - not with a getter. He also didn't know what the meaning of "finally" was, and also thought that creating a new Class, or extending Objects in the language added overhead. Some of his classes were 10000 lines long and had over 200 methods all of which pretty much did the same thing. They almost could have been generated by a code generator, but they weren't. He also seemed to be more comfortable with ASP or VB code. All of his naming conventions looked like that language, versus using the Java naming standards. I could go on.

I asked him why he was misusing the language in such a way, and he replied that "this is a kind of army-style programming - we are merely using Java classes as container for procedures, nothing more".

He called the shots and at the time answered to a manager whose focus was elsewhere, and whose expertise at the time was more in C than in Java. In my case, as a new employee, I was coming into a shop and trying to figure out the lay of the land, and expecting that the language would be used appropriately. I was simply not in a position to ring any major alarm bells, for fear of getting fired myself. Such is life when you have a mortgage to pay. I did voice my opinions to my peers, though.

Patience prevailed though, and approximately a year later, and finally with the ear of his manager, and a few other people's input, his flawed use of the language was revealed. He was actually given the opportunity to change his approach, but failing to do so, ultimately moved on to a different job.

IMO, stories like this really bolster the argument for a team-based approach, with code reviews, where no one developer is ever permitted to call the shots. The structure needs to be in place so that it simply never is allowed to get that bad in the first place.

2007-11-28

SlyEcho:
Using regular expressions is slower, about 10 times slower, I tested it.
(...)

RemoveSpecialCharacters Time: 00:00:00.2812554 RemoveSpecialCharactersByRegex Time: 00:00:01.4844035

Interesting. What number system are you using? In the decimal system, 1.5 vs. 0.3 seconds is not a "10 times" difference. Moreover, single-run results are potentially meaningless for at least two reasons: First, the time it takes to compile the regex (as was already mentioned), if it was included for any reason, will account for an undue proportion of the time. Second, unpredictable scheduling behavior can greatly influence the real time required for such short runs, especially if the system is under heavy load.

I repeated your test with 100 runs in a tight loop for both your RemoveSpecialCharacters and RemoveSpecialCharactersByRegex functions (regex compilation excluded) and got these results:

First test: RemoveSpecialCharacters: 00:00:24.1562500 RemoveSpecialCharactersByRegex: 00:00:24.0468750

Second test: RemoveSpecialCharacters: 00:00:24.4531250 RemoveSpecialCharactersByRegex: 00:00:24.3593750

I will assume that the slight speed advantage of the regex version measured in these tests is statistically insignificant, but even then the RemoveSpecialCharacters version does not offer even the slightest performance boost in exchange for its ridiculous verbosity.

By the way, even if your hand-crafted version was any faster than .NET's regex implementation, that would still be no indication that regexes are slow in general.
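
For anyone who wants to repeat this kind of measurement, here is a rough sketch of a repeated-run harness. It is written in Java rather than C#, so its timings say nothing about .NET. The input size, run count and all names (FilterBenchmark, byLoop, byRegex, timeMillis) are arbitrary choices for illustration, not the setup used for the numbers above.

```java
import java.util.Random;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class FilterBenchmark {
    // Compiled once, outside the timed region.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^A-Za-z\"&'()\\-./\\x00-\\x0E\\xC0-\\xFF]+");

    // Accumulates result lengths so the JIT cannot discard the work.
    static long sink;

    static String byRegex(String s) {
        return DISALLOWED.matcher(s.replace('`', '\'')).replaceAll("");
    }

    static String byLoop(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '`') c = '\'';
            boolean keep = c < 15
                    || (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= 0xC0 && c <= 0xFF)
                    || "\"&'()-./".indexOf(c) >= 0;
            if (keep) sb.append(c);
        }
        return sb.toString();
    }

    // Total wall-clock time for `runs` calls, in milliseconds.
    static long timeMillis(UnaryOperator<String> f, String input, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += f.apply(input).length();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Random string over code points 0..255, reproducible via the seed.
    static String randomLatin1(int length, long seed) {
        Random rnd = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append((char) rnd.nextInt(256));
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = randomLatin1(1_000_000, 42L);
        // Warm-up runs so JIT compilation is not attributed to either candidate.
        timeMillis(FilterBenchmark::byLoop, input, 5);
        timeMillis(FilterBenchmark::byRegex, input, 5);
        System.out.println("loop:  " + timeMillis(FilterBenchmark::byLoop, input, 20) + " ms");
        System.out.println("regex: " + timeMillis(FilterBenchmark::byRegex, input, 20) + " ms");
    }
}
```

Keeping the pattern compilation outside the timed loop and running each candidate many times are the two things that matter here; the absolute numbers will differ from machine to machine.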

HTH. HAND.

2007-11-28

That's what you get for posting late at night - the times I posted previously are wrong. The .NET regex version was much slower after all, however by simply adding a plus to the end of the remover regex, the difference shrinks down to:

RemoveSpecialCharacters: 00:00:23.7656250 RemoveSpecialCharactersByRegex: 00:00:43.2031250

i.e. the regex version takes about 82% longer, not 900%.

The probable reason for the speed improvement: Making the matches longer reduces the string concatenation overhead incurred by Regex.Replace(string, string). IOW, the actual regex is pretty fast while Regex.Replace is not.

It's also noteworthy that your regex version uses two passes which isn't necessary at all, and is easy to fix in many regex implementations other than .NET's.

In .NET, you have to create a MatchEvaluator or iterate through Match objects yourself, in both cases there will be a lot of string allocations, and all the new string instances that litter the heap will then have to be GC'ed again...

In more traditional (read: efficient) regex implementations, there is no such overhead and I expect a function like this to be blisteringly fast.
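
As an illustration of the single-pass idea (in Java, not one of the "traditional" engines meant above), a sketch using Matcher.appendReplacement. The class name SinglePassFilter is made up, and the sketch makes no claim about avoiding the allocation overhead described for .NET.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SinglePassFilter {
    // Matches either a single grave accent, or a run of characters that are
    // neither allowed nor a grave (so graves are never swallowed by a run).
    private static final Pattern GRAVE_OR_DISALLOWED =
            Pattern.compile("`|[^`A-Za-z\"&'()\\-./\\x00-\\x0E\\xC0-\\xFF]+");

    static String removeSpecialChars(String s) {
        Matcher m = GRAVE_OR_DISALLOWED.matcher(s);
        StringBuffer out = new StringBuffer(s.length());
        while (m.find()) {
            // A grave becomes an apostrophe; any other match is deleted.
            String replacement = m.group().equals("`") ? "'" : "";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeSpecialChars("x ` y")); // prints: x'y
    }
}
```

The plus on the negated class is the same trick as described above: longer matches mean fewer replacement steps.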

2007-11-29

Not Dorothy:
Despite the long function name this is not entirely what this function does, consider the else.
else if ((chrCode == 39)|| (chrCode == 96))
        {
            sb.Append(Convert.ToChar(39));
        }
This is converting a 96 to a 39. The function name is wrong.

That's what refactoring is for....

2007-11-30

This looks to me like a typical case of why a good programmer (say, senior developer) in one programming language (here: C) is not necessarily a good programmer in another one (here: Java). In C, a routine very similar to this piece of code would probably be the most efficient implementation, given that it would run on a char array (char *). Attempting such a low-level implementation in an environment where you have to juggle string objects for each character is of course a terrible idea. I remember having a course in Prolog given by a guy who had a Fortran background - his examples and homework assignments were also quite Fortran-ish and totally unsuitable for Prolog, if anyone in this forum is old enough to remember that programming language.

2007-12-15

wtf:
titrat:
>Regular Expressions would seldom be easier to understand, and they are an order of magnitude slower.
You do need CS 101 refresh... This is bullshit, (compiled) regex execution is as efficient as it gets. Unless an idiot implement it, I guess.

I love it when programmers talk about how inefficient something is v. something else. Almost never have they actually benchmarked it, either on its own or in a production environment.

Some code written by fresh-out-of-CS-MS-program coworkers of mine is littered with comments like "// todo: make this more efficient" in places where it doesn't matter one bit (e.g. string normalization or very fast, cached, single-record database lookups using the PK where the network overhead in getting the result back to the client is orders of magnitude greater than any penalty in the lookup or normalization), while other sections of their code initiate multiple database handles and run a bunch of single database queries in loops to build rendered GUI tables on the fly rather than looking up the data in one chunk, freeing the db resources sooner, and rendering it in another method. Talk to the hand.

2007-12-15

The real WTF is Java. No matter whether it is a good idea to have a function like this (or named like this, though it's not that far from the habits in Javaland) ... do you really feel OK writing so much for something as simple as this:

sub RemoveSpecialCharsExceptQuoteBlahBlah {
  my ($string)=@_;
  $string =~ tr/\x00-\x0E\x41-\x5A\x22\x26\x27\x28\x29\x2d\x2e\x2f\x61-\x7A\xc0-\xff//cd;
   # transform all except the specified characters (the /c flag)
   # and delete the ones that do not have a matching character
   # in the replacement, that is all (the /d flag)
  $string =~ tr/\x60/\x27/;
   # transform all \x60 characters to \x27
  return $string;
}

Short, to the point and efficient.

Captcha: bene

2007-12-16

Yes, but people don't choose their language and platform based upon how easy it will be to write small utility functions in the most concise manner. Usually, at least in realistic projects, people choose their language based upon what the project actually has to accomplish.

Overruled by RemoveSpecialCharsExceptQuoteAmpersand...
