Don’t always trust javadocs
I’ve been looking for ways to improve performance in some of my code, or at least to grab the “low-hanging” performance fruit, where changing a couple of lines improves performance dramatically. With profilers like NetBeans Profiler, it’s almost too easy and doesn’t take much time.
I’ve noticed that we use StringTokenizers here and there, mostly to parse quite big text files, while the javadoc for the StringTokenizer states:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
OK, I thought, let’s switch to regular expressions and see; they are indeed more powerful and more convenient to use. Well, yes, except that the new code typically runs 4-5 times slower than the original. Come to think of it, the regexp-based version really should be much slower than a very simple and straightforward tokenizer.
So, the lesson for me here is to use String.split() for operations that are not performance-sensitive and to keep using good old StringTokenizer when performance is critical. Oh, and not to trust everything I read in javadocs.
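For reference, here is a minimal sketch of the two approaches side by side, assuming a space-delimited line. The class name TokenizeDemo and the sample input are made up for illustration, and this is not the code from the project described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizeDemo {
    // Tokenizes on a single space with StringTokenizer (empty tokens are skipped).
    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits on a single space with String.split(), which treats its argument as a regex.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split(" "));
    }

    public static void main(String[] args) {
        String line = "alpha beta gamma";
        System.out.println(withTokenizer(line)); // prints [alpha, beta, gamma]
        System.out.println(withSplit(line));     // prints [alpha, beta, gamma]
    }
}
```

The two are not drop-in replacements for each other: for input with consecutive delimiters, such as "a  b" with two spaces, StringTokenizer returns two tokens, while split(" ") also returns an empty string between them.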
April 4th, 2007 at 1:34 pm
Seeing how the Javadocs are generated directly from source code comments, I’d say it’s pretty safe to trust them. The javadocs for StringTokenizer make no guarantees about performance. So your point about what the javadocs say is pretty much irrelevant to the performance debate.
Should you use code that has been explicitly stated as being legacy code just because it performs better? I’m not so sure. I can almost promise you that in another version or two, StringTokenizer will become deprecated. Will you still advocate its use then?
April 5th, 2007 at 2:17 am
Joe, sure, having worked in Java SE/ME conformance teams for years, I *do* trust the javadocs a great deal.
That’s why when I saw a note that StringTokenizer should not be used, I tried to make the suggested change, only to figure out that the new code is a bit too slow for my purposes.
As for the future, within the next two years, the hardware will become faster, and by then the performance of StringTokenizer will probably not matter much…
April 5th, 2007 at 8:13 am
vvs:
I suppose that makes sense. Fair enough!
April 5th, 2007 at 3:51 pm
You should try the CSV Parser from Ostermiller
http://ostermiller.org/utils/CSV.html
April 5th, 2007 at 3:57 pm
Thanks, K. Will take a look. Luckily for us, we have a very simple file format, and most probably StringTokenizer is enough. For anything more complex, I’d imagine more powerful libraries will be quite handy!
April 7th, 2007 at 10:50 am
In the end you can just grab StringTokenizer.java from JDK’s src.zip and go with it forever, even after it’s removed from JDK.
May 4th, 2007 at 4:59 am
Hi,
Even StringTokenizer is not very efficient if you don’t need all the features it supports.
If you only need to split by one char, you can write a routine that is several times faster and allocates far fewer objects than StringTokenizer does.
Regards,
Markus
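A routine of the kind Markus describes might look like the sketch below. This is not his code: the class name CharSplitter and the choice to skip empty tokens (to mimic StringTokenizer) are assumptions, and the speed claim is not measured here.

```java
import java.util.ArrayList;
import java.util.List;

class CharSplitter {
    // Splits s on a single delimiter character using indexOf/substring only.
    // Empty tokens are skipped, mirroring StringTokenizer's default behavior.
    static List<String> split(String s, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = s.indexOf(delim, start)) >= 0) {
            if (end > start) {
                out.add(s.substring(start, end));
            }
            start = end + 1;
        }
        // Trailing token after the last delimiter, if any.
        if (start < s.length()) {
            out.add(s.substring(start));
        }
        return out;
    }
}
```

For example, CharSplitter.split("a,b,,c", ',') yields the three tokens a, b and c. Whether it actually beats StringTokenizer on a given workload is something to check with a profiler.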
July 9th, 2007 at 9:28 am
Just curious… How many times were you calling String.split()? Was it just one time and it still was 4-5 times slower than the tokenizer? Or was it many times? If it was many times, I would expect it to be slower, since internally String.split() compiles a regex Pattern, which is an expensive operation. If you need to split many String objects, all using the same regex, you will get much better performance by using Pattern.split(), which compiles the Pattern only once. The split() methods in String are only considered “convenience” methods.
Regards,
Dan
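Dan's suggestion can be sketched as follows, assuming a comma-delimited line. The class name PrecompiledSplit and the delimiter are made up for illustration; Pattern.compile() and Pattern.split() are the standard java.util.regex calls.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once and reused for every line, instead of once per split() call.
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits one line with the precompiled pattern.
    static String[] fields(String line) {
        return COMMA.split(line);
    }
}
```

Calling PrecompiledSplit.fields(line) in a loop over many lines reuses the same Pattern instance. How much of the gap to StringTokenizer this closes is not measured here.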
December 3rd, 2007 at 4:26 am
For speed-sensitive regular expression work, it is recommended to use the Pattern / Matcher classes in Java 5. Also, while it is not an issue for simple tokenizing, a more complicated regex can cause a performance hit if you write it incorrectly.