why the capturing parens in the regexp? Why append a tab character to the end?

2007-05-04 Reply Admin

Split is a very slow function implemented on top of regex pattern matching. Using StringTokenizer is far more efficient. This probably isn't an issue in most applications, but some applications that require performance cannot use String.split().

Jake is probably a nice guy, but it looks like he doesn't know anything about Java development...

2007-05-04 Reply Admin

What idiot wrote that Perl code?

It should be:

sub split_at_tabs {
   my ($line) = @_;
   my @record = split(/\t/,$line);
   return \@record;
}

ParkinT · 2007-05-04 Reply Admin

(He counts four things wrong with that function. Can you find them all?)

The first thing wrong is that he is using PERL!!! [But, better than using VB, I suppose]

Veinor · 2007-05-04 Reply Admin

Anonymous!:
What idiot wrote that Perl code?
It should be:
sub split_at_tabs {
   my ($line) = @_;
   my @record = split(/\t/,$line);
   return \@record;
}

Why are you returning a reference?

return @record;

would work just as well, and let you say

@records = split_at_tabs($line);

bstorer · 2007-05-04 Reply Admin

JD:
Split is a very slow function implemented on top of regex pattern matching. Using StringTokenizer is far more efficient. This probably isn't an issue in most applications, but some applications that require performance cannot use String.split().
Jake is probably a nice guy, but it looks like he doesn't know anything about Java development...

The current API regards StringTokenizer as a legacy class which should not be used in new code, instead opting for split. Plus, split would have allowed the acts of tokenizing and then trimming to be implemented in one step. I'm not saying it's a performance improvement to do so, because I haven't bothered to test it. Still, even with StringTokenizer the code is a travesty. It puts the data into a StringTokenizer, into a Vector, into an Enumeration, and finally into an array. Let's not even go into the fact that StringTokenizer implements Enumeration.

XIU · 2007-05-04 Reply Admin

It's not even used :s nothing is returned or am I missing something...

2007-05-04 Reply Admin

Why was that perl code even a function? It is a one-liner in perl, and a really short one at that, AND a really common one anyone who uses perl should instantly recognize.

split /\t/,$line

Done. What does a function gain you?

2007-05-04 Reply Admin

Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

viraptor · 2007-05-04 Reply Admin

Nomikos:
Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

Nah... one X was not anonymized: setBureauxLocations() :)

2007-05-04 Reply Admin

It's amusing that in a function (built in or otherwise) that's designed to throw away all the tabs, so much effort goes into keeping them:

my @_record = split(/(\t)/,$line."\t");

why the capturing parens in the regexp? Why append a tab character to the end?

my @record = map { $_ eq "\t" ? () : $_ } @_record; #now that you've split it on tabs, why would there be any tabs in the array? Why create a whole new array @record when you could just reassign it to @_record?

Anyway, that's four, plus the fact that you should just use the built-in split function in the first place. Well, as the Perl mongers say, "there's more than one way to do it." I'd never really considered that some of those ways are not good.

bstorer · 2007-05-04 Reply Admin

Nomikos:
Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

Don't be silly. Expecting the name of the function to reflect the code inside is crazy, given the incompetency of the rest of the code. For all we know, the name of the function was setInitech, and was anonymized to keep us from knowing the function was from Initech, even though we all knew that anyway.

akatherder · 2007-05-04 Reply Admin

Too bad java isn't object oriented or anything. You have to store all data in a string and parse it out like that every single time.

/jerk

I suppose this "could" be the start of loading everything into an object, but I'm not that naive.

2007-05-04 Reply Admin

Lrep:
split /\t/,$line

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

2007-05-04 Reply Admin

Robert Hanson:
When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.
For a very small price in performance (calling a function) you get a lot in readability and maintainability.

Hey, I'm all for maintaining readability, but if split /\t/,$line is that hard for you to read, code readability isn't your problem.

2007-05-04 Reply Admin

Robert Hanson:

Lrep:
split /\t/,$line

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

then never code in perl. You are not suited to it :)

2007-05-04 Reply Admin

The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.

As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");
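A minimal sketch of that replacement, for anyone who wants to try it. The class and method names are made up for illustration, and the backslashes are doubled because a Java string literal needs that to pass \s through to the regex engine. Note that split() on its own still returns the empty fields in the middle of the string, so this sketch filters them out in a loop; that filter is an addition, not part of the one-liner above.

```java
import java.util.ArrayList;
import java.util.List;

class SemicolonSplit {
    // Hypothetical helper: split on ';', trim each field, drop empty fields.
    static List<String> splitFields(String str) {
        List<String> fields = new ArrayList<String>();
        // The regex swallows whitespace around each ';', so fields arrive trimmed.
        for (String part : str.trim().split("\\s*;\\s*")) {
            if (part.length() > 0) { // split() alone keeps interior empty fields
                fields.add(part);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [ooh, me, arse, grapes itch]
        System.out.println(splitFields("ooh ;   me;  ;;arse;grapes itch  "));
    }
}
```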

2007-05-04 Reply Admin

Robert Hanson:

Lrep:
split /\t/,$line

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

The name split_at_tabs, IMO, is worthless when a one-liner would do, especially one as common in idiomatic perl as split. Better to name the function in a way that gives a purpose and isn't wrong if they change from tab-separated to comma-separated, eg. extract_bureau_locations().

2007-05-04 Reply Admin

First of all, the perl-code and the java-code are not equivalent (the java-version has error-checking, trims each part and removes empty elements).

Second, the java-version can be made simpler (java has a library function converting Vector to array which can replace the last 6 lines, has been in java since 1.2)
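To illustrate the second point, here is a rough sketch. The method below is only a stand-in for the article's function (the name and the exact trimming logic are assumptions); the part that matters is the last line, where Collection.toArray(T[]) does the Vector-to-array copy in one call.

```java
import java.util.Arrays;
import java.util.StringTokenizer;
import java.util.Vector;

class VectorToArray {
    // Hypothetical stand-in for the original method: tokenize on ';', trim, skip empties.
    static String[] toFields(String str) {
        Vector<String> v = new Vector<String>();
        StringTokenizer st = new StringTokenizer(str, ";");
        while (st.hasMoreTokens()) {
            String s = st.nextToken().trim();
            if (s.length() > 0) {
                v.add(s);
            }
        }
        // toArray copies the elements into a correctly typed array; no manual loop needed.
        return v.toArray(new String[v.size()]);
    }

    public static void main(String[] args) {
        // prints [a, b, c]
        System.out.println(Arrays.toString(toFields("a; b;;c ")));
    }
}
```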

Third, if you want functionality like the perl-version, you can get it down to 6 lines as well:

	public List setC(String str) {
		StringTokenizer st = new StringTokenizer(str, ";");
		Vector v = new Vector(st.countTokens());
		while(st.hasMoreElements()) { v.add(st.nextElement()); }
		return v;
	}

Captcha: there is nothing like a flame-war at the end of the week...

2007-05-04 Reply Admin

split() takes either a regex or a string as its delimiter. I don't know whether Perl is smart enough to optimize away the invocation of the regex engine in this case, but why risk it?

2007-05-04 Reply Admin

Veinor:
Anonymous!:
What idiot wrote that Perl code?
It should be:
sub split_at_tabs {
   my ($line) = @_;
   my @record = split(/\t/,$line);
   return \@record;
}
Why are you returning a reference?
return @record;
would work just as well, and let you say
@records = split_at_tabs($line);

I think the 2nd WTF here is that it takes several attempts to get the Perl right.

skztr · 2007-05-04 Reply Admin

This hits home, as I am currently splitting a short string. Do let me point out some WTF: "A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always non-empty strings."

so: "foo,bar,,baz" and "foo,bar,baz" and "foo,,,,,,,,,,,,,,,,,,,,,bar,baz"

are equivalent.

No errors, no NULL returned, just silent "pretending it's okay". Gee, thanks!

So I'm using strchr().

2007-05-04 Reply Admin

Even prior to Java 1.4 this could be accomplished much more succinctly:

    public static List split(String str, String sep) {
        List l = new ArrayList();
        StringTokenizer st = new StringTokenizer(str, sep);
        while (st.hasMoreTokens())
            l.add(st.nextToken().trim());
        return l;
    }

2007-05-04 Reply Admin

ParkinT:
(He counts four things wrong with that function. Can you find them all?)
The first thing wrong is that he is using PERL!!! [But, better than using VB, I suppose]

Are you sure you'd rather spend your life Perling than VB'ing? Based on my experience in both, I'd rather not. Although I do concede that, as long as you don't need a GUI, a Perl program is generally shorter than the same in VB.

2007-05-04 Reply Admin

sub split_at_tabs {
   my ($line) = @_;
   my @_record = split(/(\t)/,$line."\t");
   my @record = map { $_ eq "\t" ? () : $_ } @_record;
   return \@record;
}

So, the four things wrong:

1. Uses grouping parens in the split regex where not needed
2. Uses a regex in the split where it's not needed
3. No reason to add the trailing tab to the split input string
4. Use of 'map' instead of 'grep' results in empty fields
5. No obvious reason to return an array ref rather than the array (though there may be design considerations extrinsic to the example)

And I guess that's five things. :)

The whole thing could be written shorter and better as

sub split_at_tabs2 {
   return grep { $_ eq "\t" ? () : $_ } split "\t", shift;
}

Or, if you're into readability and actually do want to return the array reference:

sub split_at_tabs3 {
   my $input = shift;
   my @array = split "\t", $input;
   @array = grep { $_ eq "\t" ? () : $_ } @array;
   return \@array;
}

2007-05-04 Reply Admin

Larry Rubinow:
sub split_at_tabs {
   my ($line) = @_;
   my @_record = split(/(\t)/,$line."\t");
   my @record = map { $_ eq "\t" ? () : $_ } @_record;
   return \@record;
}
So, the four things wrong:
1. Uses grouping parens in the split regex where not needed
2. Uses a regex in the split where it's not needed
3. No reason to add the trailing tab to the split input string
4. Use of 'map' instead of 'grep' results in empty fields
5. No obvious reason to return an array ref rather than the array (though there may be design considerations extrinsic to the example)
And I guess that's five things. :)

The whole thing could be written shorter and better as
sub split_at_tabs2 {
   return grep { $_ eq "\t" ? () : $_ } split "\t", shift;
}
Or, if you're into readability and actually do want to return the array reference:
sub split_at_tabs3 {
   my $input = shift;
   my @array = split "\t", $input;
   @array = grep { $_ eq "\t" ? () : $_ } @array;
   return \@array;
}

Okay, I'm an idiot; add another thing wrong. The grep should more simply be

grep { length $_ }

2007-05-04 Reply Admin

Lrep:
Why was that perl code even a function? It is a one-liner in perl, and a really short one at that, AND a really common one anyone who uses perl should instantly recognize.
split /\t/,$line

Done. What does a function gain you?

I was just going to say that. However, you might want to just remove the regex part of it since we don't need it:

my @record = split("\t", $line);

Done.

And if anyone can't read that easily, then please stop programming. I know Perl can be really hard to read but that line right there couldn't be more self explanatory.

2007-05-04 Reply Admin

Bollocks, omitted the check for a zero length string:

import java.util.*;

public class StringUtils {
    public static void main(String[] args) {
        List strs = split("ooh ;   me;  ;;arse;grapes itch  ", ";");
        for (Iterator i = strs.iterator(); i.hasNext();)
            System.out.println("'" + i.next() + "'");
    }

    public static List split(String str, String sep) {
        List l = new ArrayList();
        for (StringTokenizer st = new StringTokenizer(str, sep); st.hasMoreTokens();) {
            String s = st.nextToken().trim();
            if (s.length() > 0)
                l.add(s);
        }
        return l;
    }
}

2007-05-04 Reply Admin

Veinor:
Anonymous!:
What idiot wrote that Perl code?
It should be:
sub split_at_tabs {
   my ($line) = @_;
   my @record = split(/\t/,$line);
   return \@record;
}
Why are you returning a reference?
return @record;
would work just as well, and let you say
@records = split_at_tabs($line);

Better question: Why are you writing readable perl?

2007-05-04 Reply Admin

Robert Hanson:

Lrep:
split /\t/,$line

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

I always thought that readability was an anathema to Perl.

2007-05-04 Reply Admin

If you want to retain the empty strings between contiguous delimiters, then use split(), so "foo,bar,,baz" becomes "foo","bar","","baz" and "foo,bar,baz" becomes "foo","bar","baz".
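In Java terms the same contrast can be sketched like this, assuming the question is StringTokenizer versus String.split (the class and method names are invented for the demo). StringTokenizer, like strtok(), never hands back an empty token, while split() keeps the empty field between two adjacent commas. The -1 limit is there because split() with no limit silently drops trailing empty strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class EmptyFieldDemo {
    // StringTokenizer treats a run of delimiters as one, so no empty tokens come back.
    static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(s, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // split() keeps empty fields; limit -1 also keeps trailing ones.
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(",", -1));
    }

    public static void main(String[] args) {
        System.out.println(viaTokenizer("foo,bar,,baz").size()); // prints 3
        System.out.println(viaSplit("foo,bar,,baz").size());     // prints 4
    }
}
```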

2007-05-04 Reply Admin

Robert Hanson:

Lrep:
split /\t/,$line

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

Then you're not a Perl programmer. And Perl will inline the function anyway. HTH. HAND.

2007-05-04 Reply Admin

sub split_at_tabs {
my ($line) = @;
my @record = split(/(\t)/,$line."\t");
my @record = map { $ eq "\t" ? () : $ } @_record;
return @record;
}

(He counts four things wrong with that function. Can you find them all?)

Yes, I can.

my ($line) = @_;
my @_record = split(/(\t)/,$line."\t");
my @record = map { $_ eq "\t" ? () : $_ } @_record;
return @record;

2007-05-04 Reply Admin

Robert Hanson:
When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

What you just said would be like someone saying of C code that "increment_by_one(&x);" is more clear than "x++;" (ugh, it looks like line noise!)

2007-05-04 Reply Admin

I'm not a java programmer, but I thought that language didn't have any pointers? If so, what's up with all the NULL stuff?

2007-05-04 Reply Admin

Larry Rubinow:
Okay, I'm an idiot; add another thing wrong. The grep should more simply be
grep { length $_ }

No it should not. There should be no grep at all.

The rwtf is the suggested "fixes" for this godawful function.

2007-05-04 Reply Admin

Sadly, there are lots of app server installations in the Java world that are stuck at Java 1.3. Personally, I have never been able to use 1.4, and I think this is true of many people who work with IBM frameworks that are built on top of WebSphere. So there might be good reasons for rolling your own split(), or there may be no compelling reason to go back and swap in the real split after an upgrade to 1.4.

Of course, that doesn't explain using Vector...

bstorer · 2007-05-04 Reply Admin

Chris:

Even prior to Java 1.4 this could be accomplished much more succinctly:

    public static List split(String str, String sep) {
        List l = new ArrayList();
        StringTokenizer st = new StringTokenizer(str, sep);
        while (st.hasMoreTokens())
            l.add(st.nextToken().trim());
        return l;
    }

Except that in your example, should a trimmed string be empty, it is added. In the original, it's discarded.

Marjo:
I'm not a java programmer, but I thought that language didn't have any pointers? If so, what's up with all the NULL stuff?

Everything in Java (except native types like int) is actually a reference to the object. Thus, the reference can be to null. For example:

Foo obj;
Foo obj2 = new Foo ();

In this example, obj is simply created as a reference to null. No object is created by default. On the other hand, obj2 references a new object using the zero-parameter constructor. The difference between these references and pointers is that you can't just point to random places in memory, but only reference either null or objects already created.

2007-05-04 Reply Admin

Um, in Java EVERYTHING is a pointer.

2007-05-04 Reply Admin

But Java has references, and those can point to nothing.

2007-05-04 Reply Admin

Java does have pointers (hence the existence of the famous NullPointerException). It doesn't have pointer arithmetic though.

In fact, in Java, all the variables that don't contain a basic type (char, int, float, etc.) are pointers to objects. And these pointers are passed by value.
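A small sketch of what "passed by value" means here; the class and method names are made up for the demo. Reassigning the parameter inside a method only changes the method's own copy of the reference, whereas mutating the object through that copy is visible to the caller.

```java
class PassByValueDemo {
    // Reassigning the parameter changes only this method's local copy of the reference.
    static void reassign(StringBuilder sb) {
        sb = new StringBuilder("replaced");
    }

    // Mutating through the copy affects the one object both references point to.
    static void mutate(StringBuilder sb) {
        sb.append(" world");
    }

    public static void main(String[] args) {
        StringBuilder s = new StringBuilder("hello");
        reassign(s);
        System.out.println(s); // prints hello
        mutate(s);
        System.out.println(s); // prints hello world
    }
}
```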

bstorer · 2007-05-04 Reply Admin

Dave:
Um, in Java EVERYTHING is a pointer.

I hate it when people say this. First, native types are obviously not pointers. Second, neither is anything else. You can't access arbitrary memory, you can't do pointer arithmetic, and you don't have to concern yourself with the deallocation of the object. But most importantly -- and pay close attention because this is a subtle, but vital, difference -- Java's references are not first-class objects. You can't create a reference to a reference and you can't manipulate them directly (except to change what they reference). In some sense, they are merely a syntactic convenience to bridge the way the computer will behave with the way we want to think about it.

2007-05-04 Reply Admin

What this thread really shows is that Java and Perl both suck

>>> print 'foo, bar, baz'.split(', ')
['foo', 'bar', 'baz']

2007-05-04 Reply Admin

C#: str.Split(';');

bstorer · 2007-05-04 Reply Admin

infidel:
What this thread really shows is that Java and Perl both suck
PythonWin 2.5 (r25:51908, Mar 9 2007, 17:40:28) [MSC v.1310 32 bit (Intel)] on win32. Portions Copyright 1994-2006 Mark Hammond - see 'Help/About PythonWin' for further copyright information.

>>> print 'foo, bar, baz'.split(', ')
['foo', 'bar', 'baz']

Sounds good to me.

irb(main):001:0> 'foo, bar, baz'.split(',')
=> ["foo", " bar", " baz"]

But how did we prove Java and Perl suck? You can do the exact same thing in Java.

nilp · 2007-05-04 Reply Admin

random guy:
The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.
As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");

I've just done some micro benchmarks (it's a slow morning here) that show the string split method is 50% slower than the original StringTokenizer method. Chris's more efficient StringTokenizer method is 3x faster than using split. And the regex split doesn't even omit zero-length strings like the original.

Considering how slow string manipulation is in Java, I would go for the longer StringTokenizer method every time.

bstorer · 2007-05-04 Reply Admin

JB:
Java does have pointers (hence the existence of the famous NullPointerException). It doesn't have pointer arithmetic though.
In fact, in Java, all the variables that don't contain a basic type (char, int, float, etc.) are pointers to objects. And these pointers are passed by value.

NullPointerException is a horrible misnomer. It represents what happened in the underlying architecture of the JVM (which obviously uses pointers), and not what happened in the programming language. The Java programming language does not have pointers. It has references, which are different, but have many of the same features. There are tons of programmers out there who "know" that pointer == reference. But that don't make it true.

bstorer · 2007-05-04 Reply Admin

nilp:
random guy:
The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.
As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");

I've just done some micro benchmarks (it's a slow morning here) that show the string split method is 50% slower than the original StringTokenizer method. Chris's more efficient StringTokenizer method is 3x faster than using split.

Considering how slow string manipulation is in Java, I would go for the longer StringTokenizer method every time.

Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.

nilp · 2007-05-04 Reply Admin

Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.

Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.

And none of that changes the fact that the original inefficient code is still faster than the string split.

Addendum (2007-05-04 11:54): Sorry, I shouldn't have said "your regex" ... I didn't check who posted what.
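For anyone who wants to repeat the comparison, the equivalence check described above might look roughly like this. It is a sketch, not the code that was actually benchmarked: the class and method names are invented, the sample input is borrowed from Chris's post, and no timing is attempted, only the check that both approaches return the same fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class SplitComparison {
    // StringTokenizer approach: tokenize on ';', trim, skip empty results.
    static List<String> withTokenizer(String str) {
        List<String> out = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(str, ";");
        while (st.hasMoreTokens()) {
            String s = st.nextToken().trim();
            if (s.length() > 0) {
                out.add(s);
            }
        }
        return out;
    }

    // Regex approach: split() trims via the pattern, then empties are removed by hand.
    static List<String> withRegexSplit(String str) {
        List<String> out = new ArrayList<String>();
        for (String s : str.trim().split("\\s*;\\s*")) {
            if (s.length() > 0) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "ooh ;   me;  ;;arse;grapes itch  ";
        // prints true if both approaches agree on this input
        System.out.println(withTokenizer(sample).equals(withRegexSplit(sample)));
    }
}
```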

2007-05-04 Reply Admin

nilp:
Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.

Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.

Yup, both those posts reference the same Chris object. In fact I like to behave much like a Singleton, but it causes exceptions from a particular Girlfriend object whenever I try to interact with other instances of the HotBabe class.

bstorer · 2007-05-04 Reply Admin

nilp:
Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.

Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.

And none of that changes the fact that the original inefficient code is still faster than the string split.

Addendum (2007-05-04 11:54): Sorry, I shouldn't have said "your regex" ... I didn't check who posted what.

I don't question that StringTokenizer would be faster. I merely wanted to make sure we compare like functions. I had missed Chris's second post, which does take care of the issue. For the record, had I been given this problem, I would likely have used StringTokenizer; splitting on a static regex seems too heavy-weight to me.

Splitting Headache
