Comment On Splitting Headache

When it comes to string manipulation, it is not uncommon to want to split a single string into multiple strings based on a delimiter. Many languages provide split functionality outright. Even in C, it's fairly easy to roll your own --- assuming you don't like strtok_r --- with functions like strchr or strpbrk.

Re: Splitting Headache

2007-05-04 09:05 • by JD (unregistered)
Split is a very slow function built on regex pattern matching. Using StringTokenizer is far more efficient. This probably isn't an issue in most applications, but applications that need performance may not be able to use String.split().

Jake is probably a nice guy, but it looks like he doesn't know anything about Java development...

Re: Splitting Headache

2007-05-04 09:09 • by Anonymous! (unregistered)
What idiot wrote that Perl code?

It should be:

sub split_at_tabs {
my ($line) = @_;
my @record = split(/\t/,$line);
return \@record;
}

Re: Splitting Headache

2007-05-04 09:12 • by ParkinT
(He counts four things wrong with that function. Can you find them all?)

The first thing wrong is that he is using PERL!!!
[But, better than using VB, I suppose]

Re: Splitting Headache

2007-05-04 09:18 • by Veinor
135036 in reply to 135033
Anonymous!:
What idiot wrote that Perl code?

It should be:

sub split_at_tabs {
my ($line) = @_;
my @record = split(/\t/,$line);
return \@record;
}



Why are you returning a reference?
return @record;

would work just as well, and let you say

@records = split_at_tabs($line);

Re: Splitting Headache

2007-05-04 09:22 • by bstorer
135037 in reply to 135032
JD:
Split is a very slow function built on regex pattern matching. Using StringTokenizer is far more efficient. This probably isn't an issue in most applications, but applications that need performance may not be able to use String.split().

Jake is probably a nice guy, but it looks like he doesn't know anything about Java development...

The current API regards StringTokenizer as a legacy class which should not be used in new code, instead opting for split. Plus, split would have allowed the acts of tokenizing and then trimming to be implemented in one step. I'm not saying it's a performance improvement to do so, because I haven't bothered to test it.
Still, even with StringTokenizer the code is a travesty. It puts the data into a StringTokenizer, into a Vector, into an Enumeration, and finally into an array. Let's not even go into the fact that StringTokenizer implements Enumeration.
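
To illustrate the last point: because StringTokenizer implements Enumeration, the tokens can go straight into a list without the Vector-then-Enumeration detour. This is only a sketch, not the article's code; the class and method names are made up for the example, and it uses generics, which postdate the code under discussion.

```java
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerAsEnumeration {
    // StringTokenizer implements Enumeration<Object>, so Collections.list
    // can drain it directly into a list.
    public static List<Object> tokens(String str, String delimiters) {
        return Collections.list(new StringTokenizer(str, delimiters));
    }

    public static void main(String[] args) {
        System.out.println(tokens("a;b;c", ";")); // prints [a, b, c]
    }
}
```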

Re: Splitting Headache

2007-05-04 09:27 • by XIU
It's not even used :s nothing is returned or am I missing something...

Re: Splitting Headache

2007-05-04 09:30 • by Lrep (unregistered)
Why was that perl code even a function? It is a one-liner in perl, and a really short one at that, AND a really common one anyone who uses perl should instantly recognize.

split /\t/,$line

Done. What does a function gain you?

Re: Splitting Headache

2007-05-04 09:39 • by Nomikos (unregistered)
Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

Re: Splitting Headache

2007-05-04 09:44 • by viraptor
135041 in reply to 135040
Nomikos:
Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

Nah... one X was not anonymized:
setBureauxLocations() :)

Re: Splitting Headache

2007-05-04 09:44 • by Tom (unregistered)
It's amusing that in a function (built in or otherwise) that's designed to throw away all the tabs, so much effort goes into keeping them:

my @_record = split(/(\t)/,$line."\t");
# why the capturing parens in the regexp? Why append a tab character to the end?

my @record = map { $_ eq "\t" ? () : $_ } @_record;
#now that you've split it on tabs, why would there be any tabs in the array? Why create a whole new array @record when you could just reassign it to @_record?

Anyway, that's four, plus the fact that you should just use the built-in split function in the first place. Well, as the Perl mongers say, "there's more than one way to do it." I'd never really considered that some of those ways are not good.

Re: Splitting Headache

2007-05-04 09:45 • by bstorer
135043 in reply to 135040
Nomikos:
Is the name of that function being setXXXXXX the result of anonymization? Because it's pretty obvious that it used to be setBureauLocations().

Don't be silly. Expecting the name of the function to reflect the code inside is crazy, given the incompetency of the rest of the code. For all we know, the name of the function was setInitech, and was anonymized to keep us from knowing the function was from Initech, even though we all knew that anyway.

Re: Splitting Headache

2007-05-04 09:47 • by akatherder
Too bad java isn't object oriented or anything. You have to store all data in a string and parse it out like that every single time.

/jerk

I suppose this "could" be the start of loading everything into an object, but I'm not that naive.

Re: Splitting Headache

2007-05-04 09:51 • by Robert Hanson (unregistered)
135045 in reply to 135039

Lrep:

split /\t/,$line


When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

Re: Splitting Headache

2007-05-04 09:57 • by sammy (unregistered)
135046 in reply to 135045
Robert Hanson:

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.


Hey, I'm all for maintaining readability, but if split /\t/,$line is that hard for you to read, code readability isn't your problem.

Re: Splitting Headache

2007-05-04 09:57 • by Bart B (unregistered)
135047 in reply to 135045
Robert Hanson:

Lrep:

split /\t/,$line


When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.


then never code in perl. You are not suited to it :)

Re: Splitting Headache

2007-05-04 10:00 • by random guy (unregistered)
The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.

As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");
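
A quick way to see what that one-liner returns is to run it on input with blank fields. The sketch below wraps the same expression in a hypothetical helper and adds a second, assumed pass that drops empty strings, since split itself only discards trailing empties and keeps the ones in the middle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegexSplitDemo {
    // The expression from the comment above, unchanged.
    public static String[] regexSplit(String str) {
        return str.trim().split("\\s*;\\s*");
    }

    // Assumed extra pass: drop zero-length fields left in the middle.
    public static List<String> withoutEmpties(String[] parts) {
        List<String> kept = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                kept.add(part);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] parts = regexSplit(" a ; ;b;; ");
        System.out.println(Arrays.toString(parts));  // prints [a, , b]
        System.out.println(withoutEmpties(parts));   // prints [a, b]
    }
}
```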

Re: Splitting Headache

2007-05-04 10:01 • by aaaaaaaaaaaaaaaaaaa (unregistered)
135049 in reply to 135045
Robert Hanson:

Lrep:

split /\t/,$line


When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.


The name split_at_tabs, IMO, is worthless when a one-liner would do, especially one as common in idiomatic perl as split. Better to name the function in a way that gives a purpose and isn't wrong if they change from tab-separated to comma-separated, eg. extract_bureau_locations().

Re: Splitting Headache

2007-05-04 10:01 • by JavaHead (unregistered)
First of all, the perl-code and the java-code are not equivalent (the java-version has error-checking, trims each part and removes empty elements).

Second, the java-version can be made simpler (java has a library function converting Vector to array which can replace the last 6 lines, has been in java since 1.2)

Third, if you want functionality like the perl-version, you can get it down to 6 lines as well:


public List setC(String str) {
StringTokenizer st = new StringTokenizer(str, ";");
Vector v = new Vector(st.countTokens());
while(st.hasMoreElements()) { v.add(st.nextElement()); }
return v;
}
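
The library function from the second point is not named; presumably it is toArray, which Vector picked up from the Collection interface in 1.2. A minimal sketch under that assumption (the class and method names are invented for the example, and the generics are a later addition to the language):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class VectorToArray {
    // Collection.toArray(T[]) copies the elements into a typed array,
    // replacing a hand-written Enumeration loop.
    public static String[] toStringArray(List<String> values) {
        return values.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Vector<String> v = new Vector<>();
        v.add("a");
        v.add("b");
        System.out.println(Arrays.toString(toStringArray(v))); // prints [a, b]
    }
}
```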


Captcha: there is nothing like a flame-war at the end of the week...

Re: Splitting Headache

2007-05-04 10:10 • by Larry Rubinow (unregistered)
135051 in reply to 135039
split() takes either a regex or a string as its delimiter. I don't know whether Perl is smart enough to optimize away the invocation of the regex engine in this case, but why risk it?

Re: Splitting Headache

2007-05-04 10:11 • by David (unregistered)
135052 in reply to 135036
Veinor:
Anonymous!:
What idiot wrote that Perl code?

It should be:

sub split_at_tabs {
my ($line) = @_;
my @record = split(/\t/,$line);
return \@record;
}



Why are you returning a reference?
return @record;

would work just as well, and let you say

@records = split_at_tabs($line);



I think the 2nd WTF here is that it takes several attempts to get the Perl right.

Re: Splitting Headache

2007-05-04 10:17 • by skztr
This hits home, as I am currently splitting a short string.
Do let me point out some WTF:
"A sequence of two or more contiguous delimiter characters in the parsed string is considered to be a single delimiter. Delimiter characters at the start or end of the string are ignored. Put another way: the tokens returned by strtok() are always non-empty strings."

so:
"foo,bar,,baz"
and
"foo,bar,baz"
and
"foo,,,,,,,,,,,,,,,,,,,,,bar,baz"

are equivalent.

No errors, no NULL returned, just silent "pretending it's okay".
Gee, thanks!

So I'm using strchr().
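
For comparison on the Java side of this thread: StringTokenizer collapses runs of delimiters much as strtok() does, while String.split keeps the empty field in the middle. The sketch below is an illustration only; the helper names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldsDemo {
    // StringTokenizer never returns an empty token.
    public static List<String> viaTokenizer(String str) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(str, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // String.split keeps interior empty fields (trailing ones are dropped
    // unless a negative limit is passed, e.g. str.split(",", -1)).
    public static List<String> viaSplit(String str) {
        return Arrays.asList(str.split(","));
    }

    public static void main(String[] args) {
        System.out.println(viaTokenizer("foo,bar,,baz").size()); // prints 3
        System.out.println(viaSplit("foo,bar,,baz").size());     // prints 4
    }
}
```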

Re: Splitting Headache

2007-05-04 10:20 • by Chris (unregistered)
Even prior to Java 1.4 this could be accomplished much more succinctly:


public static List split(String str, String sep) {
List l = new ArrayList();
StringTokenizer st = new StringTokenizer(str, sep);
while (st.hasMoreTokens())
l.add(st.nextToken().trim());
return l;
}

Re: Splitting Headache

2007-05-04 10:23 • by Shinobu (unregistered)
135055 in reply to 135035
ParkinT:
(He counts four things wrong with that function. Can you find them all?)

The first thing wrong is that he is using PERL!!!
[But, better than using VB, I suppose]

Are you sure you'd rather spend your life Perling than VB'ing? Based on my experience in both, I'd rather not. Although I do concede that, as long as you don't need a GUI, a Perl program is generally shorter than the same in VB.

Re: Splitting Headache

2007-05-04 10:24 • by Larry Rubinow (unregistered)

sub split_at_tabs {
my ($line) = @_;
my @_record = split(/(\t)/,$line."\t");
my @record = map { $_ eq "\t" ? () : $_ } @_record;
return \@record;
}

So, the four things wrong:
1. Uses grouping parens in the split regex where not needed
2. Uses a regex in the split where it's not needed
3. No reason to add the trailing tab to the split input string
4. Use of 'map' instead of 'grep' results in empty fields
5. No obvious reason to return an array ref rather than the array (though there may be design considerations extrinsic to the example)

And I guess that's five things. :)

The whole thing could be written shorter and better as

sub split_at_tabs2 {
return grep { $_ eq "\t" ? () : $_ } split "\t", shift;
}

Or, if you're into readability and actually do want to return the array reference:

sub split_at_tabs3 {
my $input = shift;
my @array = split "\t", $input;
@array = grep { $_ eq "\t" ? () : $_ } @array;
return \@array;
}

Re: Splitting Headache

2007-05-04 10:28 • by Larry Rubinow (unregistered)
135057 in reply to 135056
Larry Rubinow:

sub split_at_tabs {
my ($line) = @_;
my @_record = split(/(\t)/,$line."\t");
my @record = map { $_ eq "\t" ? () : $_ } @_record;
return \@record;
}

So, the four things wrong:
1. Uses grouping parens in the split regex where not needed
2. Uses a regex in the split where it's not needed
3. No reason to add the trailing tab to the split input string
4. Use of 'map' instead of 'grep' results in empty fields
5. No obvious reason to return an array ref rather than the array (though there may be design considerations extrinsic to the example)

And I guess that's five things. :)

The whole thing could be written shorter and better as

sub split_at_tabs2 {
return grep { $_ eq "\t" ? () : $_ } split "\t", shift;
}

Or, if you're into readability and actually do want to return the array reference:

sub split_at_tabs3 {
my $input = shift;
my @array = split "\t", $input;
@array = grep { $_ eq "\t" ? () : $_ } @array;
return \@array;
}


Okay, I'm an idiot; add another thing wrong. The grep should more simply be

grep { length $_ }

Re: Splitting Headache

2007-05-04 10:33 • by SomeCoder (unregistered)
135058 in reply to 135039
Lrep:
Why was that perl code even a function? It is a one-liner in perl, and a really short one at that, AND a really common one anyone who uses perl should instantly recognize.

split /\t/,$line

Done. What does a function gain you?



I was just going to say that. However, you might want to just remove the regex part of it since we don't need it:

my @record = split("\t", $line);

Done.

And if anyone can't read that easily, then please stop programming. I know Perl can be really hard to read but that line right there couldn't be more self explanatory.

Re: Splitting Headache

2007-05-04 10:33 • by Chris (unregistered)
Bollocks, omitted the check for a zero length string:


import java.util.*;

public class StringUtils {
public static void main(String[] args) {
List strs = split("ooh ; me; ;;arse;grapes itch ", ";");
for (Iterator i = strs.iterator(); i.hasNext();)
System.out.println("'" + i.next() + "'");
}

public static List split(String str, String sep) {
List l = new ArrayList();
for (StringTokenizer st = new StringTokenizer(str, sep); st.hasMoreTokens();) {
String s = st.nextToken().trim();
if (s.length() > 0)
l.add(s);
}
return l;
}
}

Re: Splitting Headache

2007-05-04 10:34 • by anon (unregistered)
135060 in reply to 135036
Veinor:
Anonymous!:
What idiot wrote that Perl code?

It should be:

sub split_at_tabs {
my ($line) = @_;
my @record = split(/\t/,$line);
return \@record;
}



Why are you returning a reference?
return @record;

would work just as well, and let you say

@records = split_at_tabs($line);


Better question:
Why are you writing readable perl?

Re: Splitting Headache

2007-05-04 10:34 • by Anon (unregistered)
135061 in reply to 135045
Robert Hanson:

Lrep:

split /\t/,$line


When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.


I always thought that readability was an anathema to Perl.

Re: Splitting Headache

2007-05-04 10:44 • by burned (unregistered)
135062 in reply to 135053
If you want to retain the empty strings between contiguous delimiters then use split().
so
"foo,bar,,baz" = "foo","bar","","baz"
and
"foo,bar,baz" = "foo","bar","baz"

Re: Splitting Headache

2007-05-04 10:47 • by Steve (unregistered)
135063 in reply to 135045
Robert Hanson:

Lrep:

split /\t/,$line


When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.

For a very small price in performance (calling a function) you get a lot in readability and maintainability.

Then you're not a Perl programmer. And Perl will inline the function anyway. HTH. HAND.

Re: Splitting Headache

2007-05-04 10:50 • by TylerK (unregistered)

sub split_at_tabs {
my ($line) = @_;
my @_record = split(/(\t)/,$line."\t");
my @record = map { $_ eq "\t" ? () : $_ } @_record;
return \@record;
}

(He counts four things wrong with that function. Can you find them all?)


Yes, I can.
1. my ($line) = @_;
2. my @_record = split(/(\t)/,$line."\t");
3. my @record = map { $_ eq "\t" ? () : $_ } @_record;
4. return \@record;

Re: Splitting Headache

2007-05-04 10:52 • by Mike (unregistered)
135065 in reply to 135045
Robert Hanson:

When reading code, I understand split_at_tabs() a lot quicker than split /\t/,$line.


What you just said would be like someone saying of C code that "increment_by_one(&x);" is more clear than "x++;" (ugh, it looks like line noise!)

Re: Splitting Headache

2007-05-04 10:52 • by Marjo (unregistered)
I'm not a java programmer, but I thought that language didn't have any pointers? If so, what's up with all the NULL stuff?

Re: Splitting Headache

2007-05-04 10:56 • by TylerK (unregistered)
135068 in reply to 135057
Larry Rubinow:

Okay, I'm an idiot; add another thing wrong. The grep should more simply be

grep { length $_ }


No it should not. There should be no grep at all.

The rwtf is the suggested "fixes" for this godawful function.

Re: Splitting Headache

2007-05-04 11:01 • by anonymous (unregistered)
Sadly, there are lots of app server installations in the Java world that are stuck at Java 1.3. Personally, I have never been able to use 1.4, and I think this is true of many people who work with IBM frameworks that are built on top of WebSphere. So there might be good reasons for rolling your own split(), or there may be no compelling reason to go back and swap in the real split after an upgrade to 1.4.

Of course, that doesn't explain using Vector...
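
For what it's worth, a hand-rolled split on a pre-1.4 runtime need not involve Vector or StringTokenizer at all. The sketch below uses only indexOf and substring, which predate String.split; the class name is hypothetical, and the generics and diamond syntax are there for readability and would have to become raw types on a 1.3 VM. Unlike the original function, it keeps empty fields.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualSplit {
    // Splits on a single delimiter character, keeping empty fields.
    public static List<String> split(String str, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = str.indexOf(delimiter, start)) >= 0) {
            fields.add(str.substring(start, end));
            start = end + 1;
        }
        fields.add(str.substring(start)); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("foo;bar;;baz", ';')); // prints [foo, bar, , baz]
    }
}
```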

Re: Splitting Headache

2007-05-04 11:01 • by bstorer
135071 in reply to 135054
Chris:
Even prior to Java 1.4 this could be accomplished much more succinctly:


public static List split(String str, String sep) {
List l = new ArrayList();
StringTokenizer st = new StringTokenizer(str, sep);
while (st.hasMoreTokens())
l.add(st.nextToken().trim());
return l;
}


Except that in your example, should a trimmed string be empty, it is added. In the original, it's discarded.
Marjo:

I'm not a java programmer, but I thought that language didn't have any pointers? If so, what's up with all the NULL stuff?

Everything in Java (except native types like int) is actually a reference to the object. Thus, the reference can be to null. For example:
 Foo obj;

Foo obj2 = new Foo ();
In this example, obj is simply created as a reference to null. No object is created by default. On the other hand, obj2 references a new object using the zero-parameter constructor. The difference between these references and pointers is that you can't just point to random places in memory, but only reference either null or objects already created.
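
A small runnable version of the same idea, with Foo as a stand-in class. One caveat worth hedging: the "reference to null" default applies to fields; a local variable declared as Foo obj; has no value at all until it is assigned, and the compiler rejects any read before that.

```java
public class NullReferenceDemo {
    // Stand-in class for the Foo in the comment above.
    public static class Foo {
        public String name() {
            return "foo";
        }
    }

    // A field of reference type defaults to null; no Foo is created.
    public static Foo unassigned;
    // This field refers to an object built by the zero-parameter constructor.
    public static Foo assigned = new Foo();

    // Returns true if calling a method through the reference fails.
    public static boolean callFails(Foo ref) {
        try {
            ref.name();
            return false;
        } catch (NullPointerException e) {
            return true; // dereferencing null throws at run time
        }
    }

    public static void main(String[] args) {
        System.out.println(callFails(unassigned)); // prints true
        System.out.println(callFails(assigned));   // prints false
    }
}
```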

Re: Splitting Headache

2007-05-04 11:03 • by Dave (unregistered)
135072 in reply to 135066
Um, in Java EVERYTHING is a pointer.

Re: Splitting Headache

2007-05-04 11:05 • by nuller (unregistered)
135073 in reply to 135066
But Java has references, and those can point to nothing.

Re: Splitting Headache

2007-05-04 11:10 • by JB (unregistered)
135074 in reply to 135066
Java does have pointers (hence the existence of the famous NullPointerException). It doesn't have pointer arithmetic though.

In fact, in Java, all the variables that don't contain a basic type (char, int, float, etc.) are pointers to objects. And these pointers are passed by value.
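
"Passed by value" can be checked with a few lines. In this sketch (names invented for the example), reassigning the parameter inside a method leaves the caller's variable alone, while mutating the object it refers to is visible to the caller.

```java
public class PassByValueDemo {
    // Reassigning the parameter only changes the method's local copy.
    public static void reassign(StringBuilder sb) {
        sb = new StringBuilder("replaced");
    }

    // Mutating through the copy changes the one shared object.
    public static void mutate(StringBuilder sb) {
        sb.append("!");
    }

    public static String demo() {
        StringBuilder sb = new StringBuilder("original");
        reassign(sb); // caller still sees "original"
        mutate(sb);   // caller now sees "original!"
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints original!
    }
}
```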

Re: Splitting Headache

2007-05-04 11:16 • by bstorer
135077 in reply to 135072
Dave:
Um, in Java EVERYTHING is a pointer.

I hate it when people say this. First, native types are obviously not pointers. Second, neither is anything else. You can't access arbitrary memory, you can't do pointer arithmetic, and you don't have to concern yourself with the deallocation of the object. But most importantly -- and pay close attention because this is a subtle, but vital, difference -- Java's references are not first-class objects. You can't create a reference to a reference and you can't manipulate them directly (except to change what they reference). In some sense, they are merely a syntactic convenience to bridge the way the computer will behave with the way we want to think about it.

Re: Splitting Headache

2007-05-04 11:26 • by infidel (unregistered)
What this thread really shows is that Java and Perl both suck

PythonWin 2.5 (r25:51908, Mar 9 2007, 17:40:28) [MSC v.1310 32 bit (Intel)] on win32.
Portions Copyright 1994-2006 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> print 'foo, bar, baz'.split(', ')
['foo', 'bar', 'baz']

Re: Splitting Headache

2007-05-04 11:32 • by c# (unregistered)
C#:
str.Split(';');

Re: Splitting Headache

2007-05-04 11:34 • by bstorer
135083 in reply to 135080
infidel:
What this thread really shows is that Java and Perl both suck

PythonWin 2.5 (r25:51908, Mar 9 2007, 17:40:28) [MSC v.1310 32 bit (Intel)] on win32.
Portions Copyright 1994-2006 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> print 'foo, bar, baz'.split(', ')
['foo', 'bar', 'baz']

Sounds good to me.

irb(main):001:0> 'foo, bar, baz'.split(',')
=> ["foo", " bar", " baz"]

But how did we prove Java and Perl suck? You can do the exact same thing in Java.
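
For the record, a Java rendering of the Python snippet might look like the sketch below (Java 1.4 or later for String.split; the class name is made up):

```java
import java.util.Arrays;

public class SplitOneLiner {
    // Same split as the Python example: the delimiter is comma plus space.
    public static String[] parts(String str) {
        return str.split(", ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("foo, bar, baz"))); // prints [foo, bar, baz]
    }
}
```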

Re: Splitting Headache

2007-05-04 11:35 • by nilp
135084 in reply to 135048
random guy:
The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.

As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");


I've just done some micro benchmarks (it's a slow morning here) that show the string split method is 50% slower than the original StringTokenizer method. Chris's more efficient StringTokenizer method is 3x faster than using split. And the regex split doesn't even omit zero length strings like the original.

Considering how slow string manipulation is in Java, I would go for the longer StringTokenizer method every time.

Re: Splitting Headache

2007-05-04 11:36 • by bstorer
135085 in reply to 135074
JB:
Java does have pointers (hence the existence of the famous NullPointerException). It doesn't have pointer arithmetic though.

In fact, in Java, all the variables that don't contain a basic type (char, int, float, etc.) are pointers to objects. And these pointers are passed by value.

NullPointerException is a horrible misnomer. It represents what happened in the underlying architecture of the JVM (which obviously uses pointers), and not what happened in the programming language. The Java programming language does not have pointers. It has references, which are different, but have many of the same features. There are tons of programmers out there who "know" that pointer == reference. But that don't make it true.

Re: Splitting Headache

2007-05-04 11:39 • by bstorer
135088 in reply to 135084
nilp:
random guy:
The original function trims the individual fields and does not count empty fields. It also uses a semicolon as the field delimiter.

As of Java 1.4, it can be replaced with:

String[] parts = str.trim().split("\\s*;\\s*");


I've just done some micro benchmarks (it's a slow morning here) that show the string split method is 50% slower than the original StringTokenizer method. Chris's more efficient StringTokenizer method is 3x faster than using split.

Considering how slow string manipulation is in Java, I would go for the longer StringTokenizer method every time.


Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.

Re: Splitting Headache

2007-05-04 11:48 • by nilp
135090 in reply to 135088
Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.


Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.

And none of that changes the fact that the original inefficient code is still faster than the string split.

Addendum (2007-05-04 11:54):
Sorry, I shouldn't have said "your regex" ... I didn't check who posted what.
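
The benchmark code itself was not posted. A rough sketch of the equivalence check described above might look like the following, with the semicolon delimiter hard-coded and all names hypothetical. It checks only that the two approaches agree on their output and makes no attempt to reproduce the timings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class SplitComparison {
    // StringTokenizer version: trim each token, drop the empty ones.
    public static List<String> tokenizerSplit(String str) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(str, ";");
        while (st.hasMoreTokens()) {
            String s = st.nextToken().trim();
            if (s.length() > 0) {
                out.add(s);
            }
        }
        return out;
    }

    // Regex version: split, then a second pass to drop empty strings.
    public static List<String> regexSplit(String str) {
        List<String> out = new ArrayList<>();
        for (String s : str.trim().split("\\s*;\\s*")) {
            if (s.length() > 0) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "ooh ; me; ;;arse;grapes itch ";
        System.out.println(tokenizerSplit(input));
        System.out.println(regexSplit(input));
        System.out.println(tokenizerSplit(input).equals(regexSplit(input)));
    }
}
```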

Re: Splitting Headache

2007-05-04 11:56 • by Chris (unregistered)
135094 in reply to 135090
nilp:
Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.


Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.


Yup, both those posts reference the same Chris object. In fact I like to behave much like a Singleton, but it causes exceptions from a particular Girlfriend object whenever I try to interact with other instances of the HotBabe class.

Re: Splitting Headache

2007-05-04 11:58 • by bstorer
135095 in reply to 135090
nilp:
Chris's method, as I mentioned above, doesn't discard empty strings. It's something akin to saying, "Brute-forcing 1024-bit key RSA is a much slower method than simply adding 2 + 2." Well, sure, but they don't accomplish the same thing, do they? Apples to oranges, chief.


Neither does your regex.

Anyway, I used Chris's amended method - look for his second post (assuming it's the same Chris). And I added code to iterate over the array returned from the regex split to remove empty strings, and double checked that the two methods did indeed return the same arrays.

How about trying it yourself, mate.

And none of that changes the fact that the original inefficient code is still faster than the string split.

Addendum (2007-05-04 11:54):
Sorry, I shouldn't have said "your regex" ... I didn't check who posted what.

I don't question that StringTokenizer would be faster. I merely wanted to make sure we compare like functions. I had missed Chris's second post, which does take care of the issue. For the record, had I been given this problem, I would likely have used StringTokenizer; splitting on a static regex seems too heavy-weight to me.