Admin
Not only off by 1, but no consideration for the terminating NUL so off by 2.
Admin
I really want to know how digits_count() is used. I fully expect it to be used in a manner that doesn't account for the trailing 0 byte :)
Admin
The real C way is to just use a fixed size buffer that's definitely large enough. A 12 byte buffer will do just fine and has happy alignment on most platforms.
Admin
A 12 byte buffer will do just fine until the day when someone invents a machine with 64-bit int...
Admin
An optimization would be to test for smaller numbers first. How did my mind go there?
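A minimal sketch of what "smaller numbers first" could look like, assuming a 32-bit unsigned input. The name `digits_count_small_first` is made up for illustration; the article's original function is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical sketch: count decimal digits of a 32-bit unsigned value,
   testing the smallest ranges first so small inputs return early. */
static int digits_count_small_first(uint32_t n) {
    if (n < 10u) return 1;
    if (n < 100u) return 2;
    if (n < 1000u) return 3;
    if (n < 10000u) return 4;
    if (n < 100000u) return 5;
    if (n < 1000000u) return 6;
    if (n < 10000000u) return 7;
    if (n < 100000000u) return 8;
    if (n < 1000000000u) return 9;
    return 10; /* 1,000,000,000 through 4,294,967,295 */
}
```

Whether this ordering actually helps depends on small values being the common case, which is an assumption about the inputs.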
Admin
A 12 byte buffer will also be just fine until someone invents a string encoding system where a character and a byte are not guaranteed to be the same length. Someday.
Oh, yeah. Wait. Damn. There's that newfangled thing just catching on in some weird furrin countries. I think it's called Unitrode; something like that anyway. It's probably a passing fad we can ignore. ASCII is good enough for anyone.
Admin
Aren't digit characters guaranteed to be single byte in Unicode?
Admin
I don't trust "UniCode" (is that how it's spelled?). They claim to be a "universal" character consortium, but they can't make up their minds on how many bytes each CodePoint takes. It's supposed to be two, but then they decided we should make it variable length!? They call it "ootf eight" or something, and each character can be anywhere from one to four bytes! /s
Addendum 2023-05-15 08:02: In seriousness, this code (if fixed for billions) would actually be the most efficient way to compute the value it's supposed to. Printing to a string is a lot more expensive than the bunch of ifs here, and logarithms would be even worse. However, I'd probably write this as a while loop that divides by 10; no cases can be missed when done that way. Performance would be slightly worse than the ifs here, but it would still be better than sprintf+strlen.
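For illustration, the divide-by-10 loop described in that addendum might look roughly like this. The function name `digits_count_loop` and the choice of `unsigned int` are assumptions, not taken from the article.

```c
/* Hypothetical sketch: count decimal digits by repeated division.
   Works for any width of unsigned int, so no range can be forgotten. */
static int digits_count_loop(unsigned int n) {
    int digits = 1;          /* 0 through 9 all have one digit */
    while (n >= 10u) {
        n /= 10u;
        ++digits;
    }
    return digits;
}
```

The loop runs once per digit beyond the first, so the cost grows with the size of the number rather than being a fixed chain of comparisons.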
Admin
Seen this so often it's a cliche. Even seen it in code left lying around by a HPC at one place I worked.
In fact, I think I've even seen it on TDWTF.
Admin
colejohnson66 - Since being defined in Sept 1992 [I was involved, indirectly], it has ALWAYS been a variable-length encoding of codepoints. The first byte uses a variable number of leading bits to determine the length of the sequence; the remaining bytes are all 10xxxxxx, for validation against over/under run....
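As a rough sketch of the lead-byte rule described above, limited to the one-to-four-byte forms that current Unicode uses (the function name `utf8_sequence_length` is invented for this example):

```c
/* Hypothetical sketch: given the first byte of a UTF-8 sequence, return how
   many bytes the sequence occupies, or 0 if the byte cannot start one. */
static int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII, incl. '0'..'9' */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx continuation, or invalid */
}
```

This only inspects the lead byte; a real decoder would also check that each following byte matches 10xxxxxx and reject overlong forms.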
Admin
Unicode may have up to 4 bytes per character, but we're talking about numbers here. And unless you're talking about Roman numerals or other more exotic ways to represent numbers, Arabic numerals all exist in the first 128 code points of Unicode space, which means UTF-8 encoding still only takes one byte per digit.
Not that this fact makes this code snippet in any way sane. I'll just point out that besides lacking the 10 digit length of 32-bit ints, and testing the largest numbers first, and the possibility of 64 bit integers, it's also stupid because there's no point in worrying about such minor ways of saving a few bytes here and there. Unless you're on a microprocessor (which is possible; you're writing in C, right?), in which case you probably don't have a log10 function, and just wasted who knows how many bytes of code space (but at least 4 for each number you stored for comparison) on writing each conditional instead of using a loop.
And still failed to include the extra byte for null termination on the end of each string.
Admin
Printing to an already allocated string is not a lot more expensive, basically a mod-div operation and an addition per digit. Your while loop would roughly double the cost (which is still no problem in most cases).
In regards to other comments: UTF-8 is no problem because all digits are in the ASCII range and need only 1 byte. (Unless you're formatting for other number systems, but then your length calculation may be off anyway.)
Using a fixed-sized buffer is still the best approach. To handle different integer sizes you can use "3 * sizeof (x)" (or something more fancy if you need to save every byte possible), +1 if signed and (depending on how it's used) another "+1" for the NUL terminator.
Admin
Umm, as opposed to... a macroprocessor?
I'm like 99.99% sure everybody visiting this site uses one of those. micros, that is
Admin
Taking log10 of a number doesn't directly tell you how many digits are in the number because the result isn't an integer.
If you take the ceiling, you are almost correct, except for the exact powers of 10. You can get the number of digits if you take the floor of log10 and then add 1.
Admin
Weeeeeell acshually, the Arabic digits (the digits as used by the language Arabic as opposed to the Western digits based on them) occupy U+0660 to U+0669 and each requires two bytes in UTF-8.
٠١٢٣٤٥٦٧٨٩
Admin
Given the name of the function, the extra byte for null termination shouldn't be part of the result though.
Admin
No. 3*sizeof(x) is wrong. It's possible that sizeof(long) == sizeof(int) == sizeof(short) == 1 if a char is 32 bits. And I have used such a system. A correct way to handle it is to use a fixed size buffer (e.g. 12 bytes) and #ifdef to verify that this is sufficient. Then when someone makes unsigned int be 128 bits or whatever, your code will fail gracefully by giving a compile-time error and maintenance will consist of merely increasing the buffer size and updating the check.
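One way the fixed-buffer-plus-compile-time-check idea could be sketched, using `#if` on the `<limits.h>` constants. `INT_BUF_SIZE` and `format_int` are hypothetical names for this example.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical fixed size: sign + 10 digits + NUL, enough for a 32-bit int. */
#define INT_BUF_SIZE 12

/* Refuse to compile if int can hold values that need more than 12 bytes. */
#if INT_MAX > 2147483647 || INT_MIN < (-2147483647 - 1)
#error "INT_BUF_SIZE is too small for this platform's int; enlarge it and update this check"
#endif

/* Formats value into the caller's buffer and returns that buffer. */
static const char *format_int(int value, char buf[INT_BUF_SIZE]) {
    snprintf(buf, INT_BUF_SIZE, "%d", value);
    return buf;
}
```

On a platform with a wider `int`, the build stops at the `#error` line instead of producing code that silently truncates.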
Admin
Okay, obviously using Log10 would be an anti-pattern - sure it's more readable, but AFAIR math.h converts the int to a float/double, does a very costly log approximation and then you have to cast the result to an integer again. So to just get the number of digits, a chain of ifs makes sense, but personally I would start from the bottom and work up to 10.
However, everyone coding C actually knows that base 16 aligned head segments are the way to go (for all 32 bit platforms or higher), so you have a char[16] buffer, which is ironically better than using a char[3] or char[10]. So for allocation purposes it is completely pointless anyway.
Admin
log10, if you manage to do proper rounding of the result, will give you the number of digits.
Until someone feeds it the number 0.
Admin
No. UTF-8 may have up to 4 bytes per codepoint, and it's not the only possible encoding for Unicode codepoints.
Admin
This function isn't wrong, just incomplete. It needs one more line at the top. And I rather suspect the Log10 approach is more expensive (time) than this, I'm sure it was long ago. This isn't ideal but doesn't rise to WTF territory for me. In many cases simply printing it to a guaranteed-big-enough buffer would be the optimum approach but maybe it's not being printed yet. (Deciding how to format things.)
Admin
The number of bytes required, and how many bytes per character are needed...
Admin
That old 386SX in the corner is a 32-bit platform, and so is the 386DX next to it. Neither of them gains any benefit from aligning stuff more tightly than on 4-byte boundaries. (And the benefit from 4-byte alignment versus 2-byte alignment on the 386SX is limited.)
Admin
Maybe you meant "microcontroller"... (And even there, there's no causal link from "using C" back to "must be a microcontroller". At work, we use C on a range of 64-bit x86-64 CPUs which are most definitely not microcontrollers.)
Admin
What's sort of interesting to me is how many of us, myself included, assumed on no concrete evidence that the purpose of the digits_count() function was to determine a buffer length for a string conversion. As opposed to some utterly non-stringly use.
Admin
Actually up to 6 bytes, though current versions of the Unicode standard only define codepoints that will fit in four bytes or less.
Admin
The code then also needs to be recompiled, and part of the porting process is to look for those changes. But yeah, it could be a problem.
Admin
Nowhere does this say this is used to calculate the final size of a string buffer so all the comments about "+1" are WTF.
It might be used like this:
size_t buffer_size = strlen(label) + digits_count(value) + 1;
char *buffer = (char *)malloc(buffer_size);
sprintf(buffer, "%s%d", label, value);
Of course it would be much easier to print it into a temporary buffer and use strdup:
char temp[128];
snprintf(temp, sizeof temp, "%s%d", label, value);
char *buffer = strdup(temp);
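Another option, not mentioned in the comment above, is to let `snprintf` itself report the required length: since C99, calling it with a null buffer and size 0 returns the number of characters that would have been written. A minimal sketch, with `make_label` as a made-up helper name:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a freshly malloc'd "<label><value>" string,
   or NULL on failure. The caller is responsible for free(). */
static char *make_label(const char *label, int value) {
    /* First pass: measure only, nothing is written. */
    int needed = snprintf(NULL, 0, "%s%d", label, value);
    if (needed < 0) return NULL;

    char *buffer = malloc((size_t)needed + 1);  /* +1 for the NUL */
    if (buffer == NULL) return NULL;

    /* Second pass: actually format into the exactly sized buffer. */
    snprintf(buffer, (size_t)needed + 1, "%s%d", label, value);
    return buffer;
}
```

This formats twice, so it trades some speed for not needing either a digit counter or a guessed temporary size.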
Admin
16 bytes, not 16 bits. Sorry, I thought this was clear by my examples given ;-)
Admin
For those wondering about reason why 16 byte alignment is the fastest:
Back during the development of the i386 Intel learnt the very hard lesson that implementing a memory mode that nobody wants is not really the killer feature they hoped for. So they focused heavily on supporting DOS and spoiler: The ~~crappy~~ suboptimal memory management "API" allowed allocations in paragraphs, or in other words in 16 byte steps aligned at 16 byte offset addresses (segments are basically 16 byte offsets anyway, so maybe that was the thought process? Who knows with DOS...). And like most historical stuff it kinda stuck and is actually a nice compromise for modern cache alignment and not wasting too much valuable cache due to over-fetching.
Addendum 2023-05-17 02:00: Aww, the strike out doesn't work... whatta shame.
Admin
Log10 likely is slower, as it requires expensive floating point operations. In particular as you still need to ceil, convert, and handle 0 anyway, which does not have minus infinity digits. There probably is a bit fiddling hack that computes the highest set bit (i.e., log2) then does the logarithm transform, but with a value that is slightly off as to round exactly to the desired value. But: assuming small numbers are more frequent, begin with small values first. Assuming uniform distribution, a binary search would seem appropriate instead. Or as a compromise to shock WTF readers with nested ternary operations and no line breaks: return x<100?x<10?1:2:x<10000?x<1000?3:4:x<1000000?x<100000?5:6:...
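A sketch of the "highest set bit, then adjust" idea for 32-bit unsigned values. It uses the well-known approximation log10(2) ≈ 1233/4096 and a table of powers of ten to correct the estimate; the function name `digits_count_log2` and the table name are assumptions, and the bit length is found with a plain loop to stay portable.

```c
#include <stdint.h>

/* Hypothetical sketch: digit count via bit length and one table lookup. */
static int digits_count_log2(uint32_t x) {
    static const uint32_t powers_of_10[10] = {
        1u, 10u, 100u, 1000u, 10000u, 100000u,
        1000000u, 10000000u, 100000000u, 1000000000u
    };
    if (x == 0) return 1;               /* log of 0 is undefined; 0 has 1 digit */

    int bits = 0;                        /* bit length = floor(log2(x)) + 1 */
    for (uint32_t v = x; v != 0; v >>= 1) ++bits;

    int t = (bits * 1233) >> 12;         /* estimate of floor(log10(x)), 0..9 */
    /* The estimate is either the digit count or one less; compare against
       the power of ten to decide which. */
    return t + (x >= powers_of_10[t]);
}
```

On compilers with a count-leading-zeros intrinsic the loop could be replaced by a single instruction, but that would be compiler-specific.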