Admin
No, it's because most of us are competent engineers who understand that the people who wrote the compilers are probably better at those kinds of optimizations than we are. In many cases, it's not even something that the high-level-language programmer has any control over.
Admin
Every so often I get something (like today's "Job Description Project", where they want me to describe what I've been doing for the past decade+) that leaves me wondering: what exactly is (the new) management really trying to do?
I then consider how I liked the little bit of programming I did years ago. Wouldn't it be cool to write programs to make the computer do your bidding?
Then I read articles like today's TDWTF and realize (like a cold shower) I could very well end up troubleshooting things like that for a living. I feel for you lot.
Or, worse yet, I could end up supporting something like :disco::horse:
:doing_it_wrong:
ERROR_RUNS_AWAY_SCREAMING.WAV
Admin
It isn't only that: with high-level languages the optimisers can respond to circumstances dynamically, sometimes using data we can't see or anticipate. They don't always get it right, though. I find this most often with SQL, where I occasionally have to contort queries or do little things to give the query optimiser hints. Perhaps because it's a declarative language, it stands out more when you have to do that. It's not ideal for portability, or for future scenarios where the optimiser might be improved.
In fact it's not that at all. You have no idea whether the optimiser is going to do something better or not, but you give it a chance first because that takes five minutes and you need a point of reference for making improvements anyway. Things such as CPU specifics, the software stack, the nature of the I/O, etc. can be extremely hard to anticipate, even for a compiler. There is also the old wisdom: don't fix a problem until you actually know you're going to have it, and don't fix something when you have no idea whether it's broken or not.
That still isn't an excuse for writing code to a poor minimum standard. Often there is no difference between A and B in terms of productivity: both take the same amount of time to produce, and sometimes A is even quicker to write. Yet A can be better, sometimes in all respects: easier to read, faster, quicker to write. The only difference between A and B then is whether the developer knows only B, or also knows A and is in the habit of using it. The annoying thing is that if a developer in these situations reaches for B too often, you'll have Bs all over the place and eventually one of them will sting someone.
Admin
Got curious and ran some tests, measuring memory use and execution time for different ways of doing this, starting with the code from the article as a "control", then RegEx (with the horrible concatenation), RegEx with a StringBuilder, then RegEx.Replace(), and finally just using LINQ.
For testing, I created 100 random strings, each 20 characters in length.
A couple of things really surprised me. First off, doing a RegEx search with a StringBuilder actually took longer (10 whole ms) AND used more memory than the code from the article. This implies that the .NET compiler is doing something to optimize all that string creation. I was also surprised to see that LINQ was by far the slowest, yet caused less heap allocation. This is most likely because of the dynamic IL code generation that LINQ uses; maybe that memory is being allocated under a different domain? Honestly, I think LINQ would have compared better with a much larger dataset.
Code can be found here: https://dotnetfiddle.net/gaIGYx
Control (Article Code) Memory Start: 173864 Memory End: 383376 Memory diff: 209512 Control time: 00:00:00.0010000
String-Happy RegEx Memory Start: 176800 Memory End: 640952 Memory diff: 464152 RegEx Time: 00:00:00.0030000
RegEx w/StringBuilder (string concat properly) Memory Start: 202008 Memory End: 587032 Memory diff: 385024 RegEx w/StringBuilder Time: 00:00:00.0020000
RegEx.Replace() Memory Start: 202064 Memory End: 390480 Memory diff: 188416 RegEx Replace Time: 00:00:00.0010000
LINQ Memory Start: 203464 Memory End: 367304 Memory diff: 163840 Linq Time: 00:00:00.0210000
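For reference, here's roughly what the compared approaches might look like; this is a sketch with made-up method names that assumes the job is simply to keep the letters, not the exact code from the fiddle link above:

```csharp
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class LetterFilters
{
    // "Control": character loop with string concatenation, as in the article.
    public static string WithConcat(string input)
    {
        string result = "";
        foreach (char c in input)
        {
            if (char.IsLetter(c))
                result += c; // allocates a brand-new string on every append
        }
        return result;
    }

    // Same loop, but accumulating into a StringBuilder.
    public static string WithStringBuilder(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (char.IsLetter(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    // One call to the regex engine.
    public static string WithRegexReplace(string input) =>
        Regex.Replace(input, "[^a-zA-Z]", "");

    // LINQ over the characters.
    public static string WithLinq(string input) =>
        new string(input.Where(char.IsLetter).ToArray());
}
```

For example, LetterFilters.WithRegexReplace("R2-D2, etc.") would come back as "RDetc".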
Admin
I just realized that LINQ's low heap size report may be simply due to the fact that it's using so much memory it's triggering garbage collection before I'm checking the value again.
When I upped the initial string to 1000 sets of 200 characters, my benchmark makes it appear that the "string-happy" RegEx method is suddenly using the least amount of memory (which I know is not likely :smile:)
Control Memory diff: 4,290,048 Time: 00:00:00.0810081
RegEx (string-happy) Memory diff: 1,276,864 Time: 00:00:00.0630063
RegEx (w/StringBuilder) Memory diff: 3,374,472 Time: 00:00:00.0440044
RegEx.Replace() Memory diff: 1,833,568 Time: 00:00:00.0210021
LINQ Memory diff: 1,342,512 Time: 00:00:00.0890089
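One way to reduce that skew (a sketch of my own, not something the fiddle does) is to quiesce the GC and read GC.GetTotalMemory around each run:

```csharp
using System;

static class MemoryProbe
{
    // Made-up helper: force a collection before measuring so a GC that happens
    // mid-run doesn't make one method look cheaper than it really is.
    public static long MeasureAllocatedBytes(Action work)
    {
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        long before = GC.GetTotalMemory(forceFullCollection: true);
        work();
        long after = GC.GetTotalMemory(forceFullCollection: false);
        return after - before;
    }
}
```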
Admin
If you're saying we should optimize our code (for performance) ourselves rather than relying on the compiler to do it for us, I disagree. Optimize your code for clarity, and don't worry about performance unless there is actually a performance problem. If the code takes 10 times as long to run but is a little easier to understand, I would say in most situations that's a good trade off. If it's 50-100% slower, it's probably very rare that you would be better off trading clarity for performance.
Admin
To be honest, \w often captures more than is wanted, as a word character ≠ a letter character. A better (although less well supported) match would be \p{L}
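A quick way to see the difference in .NET regex (the sample string is made up):

```csharp
using System;
using System.Text.RegularExpressions;

class WordVsLetter
{
    static void Main()
    {
        string input = "Ünïcode_name123";

        // \w keeps letters, digits, and underscores (and a bit more besides).
        Console.WriteLine(Regex.Replace(input, @"[^\w]", ""));    // Ünïcode_name123
        // \p{L} keeps only characters in the Unicode Letter categories.
        Console.WriteLine(Regex.Replace(input, @"[^\p{L}]", "")); // Ünïcodename
    }
}
```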
Admin
Admin
Add support after unicode and I'll agree with you.
Admin
When performance testing something that's basically a short microbenchmark, you are best off doing many iterations. Otherwise you hit all sorts of weird one-off problems. It's probably better to look into how to trigger the environment to use its most expensive JIT; if stuff is happening just once, it might be using a bytecode engine instead as that's typically a win for a one-off. (Compilation is itself expensive.)
In other words, your figures are probably so beset by noise that we can't draw any safe conclusions from them. Sorry. :smile:
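For instance, a minimal harness that warms the JIT and averages over many iterations might look like this (a sketch; the helper name is invented):

```csharp
using System;
using System.Diagnostics;

static class MicroBench
{
    // Run the action a few times so the JIT has warmed up, then time many
    // iterations and report the average, amortising one-off costs away.
    public static TimeSpan Time(Action action, int warmups = 10, int iterations = 10_000)
    {
        for (int i = 0; i < warmups; i++)
            action();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();

        return TimeSpan.FromTicks(sw.Elapsed.Ticks / iterations);
    }
}
```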
Admin
What I'm saying is that "bad code" should not be excused away by assuming (rightly or wrongly) that "it doesn't matter -- the compiler's optimiser will take care of it." Yes, it's a philosophical position, along much the same lines as "It doesn't matter how, selfishly, antisocially and dishonestly I live my life, as long as I say sorry on my deathbed God will forgive me", but still and all -- writing bad code is appallingly bad manners.
(and by "bad code" I mean things like: reinvention of techniques that you can use a library function for, writing loops in an untidy way, e.g. "while" with an incrementor when a "for" should be used instead, nesting loops unnecessarily, etc.) In particular, there is no excusing any of the code that OP presented on the grounds that some of it doesn't matter because it's optimised away.
As for optimising your own code for efficiency: unless you are shaving microseconds off a real-time hardware controller and are prepared to work down at machine-code level, the best technique for manual optimisation is to analyse the code properly and write it neatly and well. And, as is de rigueur, approach optimisation only after you have identified exactly where the bottlenecks are.
Admin
And my position is that the bad concatenation approach used in the original code is disappointingly common, and so worth doing something special to improve. Yes, it's bad code, but it's bad code that is written very often in the wild (:confounded:) and causes a lot of trouble. It is also a case that a compiler can improve relatively easily; it's entirely capable of seeing what is going on in enough detail, and can do it all before issuing any intermediate code without deep changes to the back end. I understand your philosophical position, but I think that grubby reality will always trump philosophical purity.
The difference would be observable with debugging tools, but only with them. (Stopping the code in the middle of the loop and examining the variables would allow you to see some synthetic string builder. Right now, that'll be how += is implemented anyway, so it's almost just adjusting some object lifetimes.)
Admin
Trick answer. String.Contains doesn't. System.Linq.Contains<T>, on the other hand...
Admin
For this particular case, I even find it a bit unfair to lay the blame entirely on the coder. It is the most obvious way to build a string, and thus IMHO the language violates the principle of least surprises if this method does not have acceptable performance. A good interface should be easy to use correctly and difficult to use incorrectly, which clearly is not the case here.
But then, what do I know, I do C++ for a living, which is probably one of the worst offenders in this regard.
Admin
And I know that it's possible for compilers to figure this case out. How? I've got code (written by a friend for a joint project of ours on compiling a scripting language) that does exactly the analysis that would be required. Given that, I can assert that this case is 100% for sure trivial to analyse and optimise for.
Or failing that, IDEs ought to provide at least some sort of warning that something dumbass is being done.
Admin
It's often hard to say what "properly" means, and there are many schools of thought. However, there are times when you can definitely say that something is improper. For example...
One of the problems here is pointless, redundant operations that shouldn't be written in the first place, so that there would be nothing to prune.
Admin
I expected as much, at least for relatively simple cases such as the OP. Still, it's often pointed out as a :wtf:. My point is that, if it is indeed a :wtf:, it's more one of the language / engine / compiler than of the code.
Admin
Can't argue with that: Java string handling is suboptimal, and the "obvious" way to do it is the "wrong" way to do it. But then the "right" way to concatenate strings (i.e. using StringBuffer.append, oh sorry, should I be using StringBuilder.append? God knows) is ugly and unwieldy. In such circumstances it can be (IMHO) appropriate to allow the optimiser to do its thing and construct the optimised code for you. But wow, wouldn't it be nice for the IDE to give the newbie a heads-up ...
Admin
I usually find that when people start throwing around phrases like “suboptimal” in relation to this sort of thing, they're operating from a position of some ignorance. Java and C# use an immutable class for strings, which means that they are values that are safe to use in a whole bunch of security-sensitive situations without copying. There's a lot of gotchas from (e.g.) C++ which simply don't apply because you simply cannot observe any changes; the possibility of something pulling the rug out from under the feet of your code does not exist at all.
But when you do that, you need to introduce a mechanism for constructing strings out of pieces or you have crippling performance problems. That's what StringBuilder is for; it's effectively a mutable string (that internally implements an efficient buffer management strategy), and comparatively few APIs in Java and C# accept them. The += and + operators (over strings) are implemented by transformation into a sequence of operations on StringBuilders. This gives safe and efficient operation when used well.
Java and C# (the latter to a slightly lesser extent) focus on having very clear and easily predictable semantics over the greater syntactic and semantic looseness of C++. In many ways, the designers of those languages looked at C++ and went "No Thank You!" Because of this, things that are abhorrent to someone used to C++ are nothing like as bad from the perspective of C#. The flip side is also true.
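To make that concrete (a small sketch; the repeat count is arbitrary): strings never change in place, so concatenation produces a fresh string each pass, while a StringBuilder mutates one internal buffer.

```csharp
using System;
using System.Text;

class ImmutableVsBuilder
{
    static void Main()
    {
        // += never mutates the existing string; each pass produces a brand-new
        // one, so other code holding the old reference can't be surprised.
        string s = "";
        for (int i = 0; i < 1000; i++)
            s += "x";                      // ~1000 intermediate string objects

        // StringBuilder appends into one growable internal buffer instead.
        var sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
            sb.Append('x');
        string t = sb.ToString();          // one final string

        Console.WriteLine(s.Length == t.Length); // True
    }
}
```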
Admin
I think the main problem is that the code is not testable or reusable at all, and it leaves plenty of room for errors. See this approach:
See full example: http://rextester.com/AVTMMP67746 Of course regexp would help to have the business requirements more readable.
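For illustration only, a testable shape for this kind of check might look roughly like the sketch below; the class name, the xunit test, and the keep-only-letters rule are my assumptions, not necessarily what the rextester example does:

```csharp
using System.Linq;
using Xunit;

public static class NameSanitizer
{
    // Pure function with no I/O, so the business rule can be unit-tested directly.
    public static string KeepLetters(string input) =>
        input == null ? string.Empty : new string(input.Where(char.IsLetter).ToArray());
}

public class NameSanitizerTests
{
    [Fact]
    public void StripsDigitsAndPunctuation() =>
        Assert.Equal("OMalley", NameSanitizer.KeepLetters("O'Malley!"));
}
```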
Admin
Combining characters are always fun. Like I said, Unicode is hard.
Admin
I understand the danger in micro benchmarks, which is why I started generating the larger sample set rather than using the name list from the article. I definitely saw a lot of "noise" with the smaller sample set. With 100 sets of 20 characters, the results were consistent after running 4-5 times.
It was as much an attempt to see how the compiler handled the bad code as anything else.
It's worth noting that the times don't get calculated when running that code in DotNetFiddle, for some reason. It works fine when run as an actual console app on my machine.
Admin
I really want to go look at the source code for .NET's StringBuilder class, now. I was under the impression that using += on strings in C# was just as bad as it was in C++, forcing allocation of memory to create entirely new string objects each time.
Looking at the IL, there is definitely a difference between the two methods, although I am not certain what the jitter is doing with string::Concat. While that could be some kind of wrapper, it would require the

string += string:
IL_0056: call string [mscorlib]System.String::Concat(string, string)

StringBuilder.Append:
IL_0056: callvirt instance class [mscorlib]System.Text.StringBuilder [mscorlib]System.Text.StringBuilder::Append(string)
Looking in detail at the surrounding IL code, you might well be correct. I am not "fluent" in reading IL code, though :smile:
EDIT: You are exactly correct. I found the source code for .NET's String.Concat function (which is what the compiler appears to be calling) and it is creating an internal StringBuilder object. When doing this in a loop, it's still not nearly as efficient as a single StringBuilder being declared outside the loop, but not automatically as bad as I had assumed.
http://referencesource.microsoft.com/#mscorlib/system/string.cs,3091
Admin
Some do (I don't know about C#, but there are IDEs that will warn about this same mistake in Java).
Admin
I agree, but perhaps for different reasons. Doing those things is bad, but not because it's slow, so "the optimizer will take care of it" is not only not good enough, it doesn't even make sense as an excuse, because the badness of the code is at the source code level.
Perhaps I misunderstand, but that doesn't seem right to me. Optimize only after identifying bottlenecks - I agree. But "the correct optimization technique is well-written, tidy code"... so only write tidy, neat, well-written code after you identify a bottleneck? I would say always strive to write clean code. This will often have the side effect of reducing the need for performance tuning, but IMO that isn't the primary benefit. If optimization is required, in my experience it's almost always a matter of reducing I/O. Cleaning up some loop variables or the like is unlikely to make a big difference, unless it's eliminating a nested loop that runs millions of times or something like that.
Admin
QFT
It depends on the code. Sometimes you need to reduce I/O (or, more generally, system calls), sometimes memory allocations, sometimes plain old copies or adds. Measure, don't guess. ;-)
Admin
Not with categories.
I don't see anyone having mentioned those.
Admin
And how is someone with, e.g., a Dvorak keyboard going to quickly verify that every letter appears in the string once and only once?
If some such developer should happen upon that code at some later date, would it be acceptable for them to find out who originally wrote it (from the version control system) and then drive to their home after work and whack them on the behind repeatedly with said keyboard?
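For what it's worth, whichever layout is in front of you, the check itself is quick to script (a throwaway sketch):

```csharp
using System;
using System.Linq;

class LetterCheck
{
    static void Main()
    {
        // Hypothetical layout string to verify: the QWERTY rows.
        string candidate = "qwertyuiopasdfghjklzxcvbnm";

        var letters = candidate.ToLowerInvariant()
                               .Where(c => c >= 'a' && c <= 'z')
                               .ToList();

        bool allPresent   = letters.Distinct().Count() == 26;            // every letter appears
        bool noDuplicates = letters.Count == letters.Distinct().Count(); // none repeats

        Console.WriteLine(allPresent && noDuplicates); // True
    }
}
```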