  • LCrawford (unregistered)

    in ugly ways

    Not only off by 1, but no consideration for the terminating NUL, so off by 2.

  • (nodebb)

    I really want to know how digits_count() is used. I full-well expect it to be used in a manner that doesn't account for the trailing 0 byte :)

  • (nodebb)

    The real C way is to just use a fixed size buffer that's definitely large enough. A 12 byte buffer will do just fine and has happy alignment on most platforms.
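
    For illustration, a minimal sketch of that fixed-buffer approach (the function name and "value" are made up here; the 12 bytes cover a 32-bit int, "-2147483648" plus the terminating NUL):

        #include <stdio.h>

        int formatted_length(int value)
        {
            /* 12 bytes: "-2147483648" plus the terminating NUL,
               assuming int is at most 32 bits. */
            char buf[12];
            return snprintf(buf, sizeof buf, "%d", value);
        }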

  • (nodebb) in reply to dkf

    A 12 byte buffer will do just fine until the day when someone invents a machine with 64-bit int...

  • (nodebb)

    An optimization would be to test for smaller numbers first. How did my mind go there?

  • WTFGuy (unregistered)

    A 12 byte buffer will also be just fine until someone invents a string encoding system where a character and a byte are not guaranteed to be the same length. Someday.

    Oh, yeah. Wait. Damn. There's that newfangled thing just catching on in some weird furrin countries. I think it's called Unitrode; something like that anyway. It's probably a passing fad we can ignore. ASCII is good enough for anyone.

  • Alex (unregistered) in reply to WTFGuy

    Aren't digit characters guaranteed to be single byte in Unicode?

  • (nodebb) in reply to WTFGuy

    I don't trust "UniCode" (is that how it's spelled?). They claim to be a "universal" character consortium, but they can't make up their minds on how many bytes each CodePoint takes. It's supposed to be two, but then they decided we should make it variable length!? They call it "ootf eight" or something, and each character can be anywhere from one to four bytes! /s

    Addendum 2023-05-15 08:02: In seriousness, this code (if fixed for billions) would actually be the most efficient way to compute the value it's supposed to. Printing to a string is a lot more expensive than the bunch of ifs here, and logarithms would be even worse. However, I'd probably write this as a while loop that divides by 10; no missing cases when done that way. Performance would be slightly worse than the ifs here, but it would still be better than sprintf+strlen.
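
    The while-loop version would only be a handful of lines; a minimal sketch (assuming a non-negative value):

        unsigned int digits_count(unsigned int n)
        {
            unsigned int count = 1;   /* 0 still has one digit */
            while (n >= 10) {
                n /= 10;
                ++count;
            }
            return count;
        }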

  • Prime Mover (unregistered)

    Seen this so often it's a cliche. Even seen it in code left lying around by a HPC at one place I worked.

    In fact, I think I've even seen it on TDWTF.

  • TheCPUWizard (unregistered)

    colejohnson66 - Since being defined in Sept 1992 [I was involved, indirectly], UTF-8 has ALWAYS encoded codepoints with a variable length. The first byte uses a variable number of bits to indicate the length of the sequence, and the remaining bytes are all 10xxxxxx for validation against over/underrun....
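
    As a sketch, reading the sequence length off the lead byte looks roughly like this (the function name is made up):

        /* Length of a UTF-8 sequence, determined from its lead byte. */
        int utf8_sequence_length(unsigned char lead)
        {
            if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
            if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx        */
            if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx        */
            if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx        */
            return -1;  /* 10xxxxxx is a continuation byte, not a lead byte */
        }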

  • John (unregistered)

    Unicode may have up to 4 bytes per character, but we're talking about numbers here. And unless you're talking about Roman numerals or other more exotic ways to represent numbers, Arabic numerals all exist in the first 128 code points of Unicode, which means UTF-8 encoding still only takes one byte per digit.

    Not that this fact makes this code snippet in any way sane. I'll just point out that besides lacking the 10-digit length of 32-bit ints, testing the largest numbers first, and ignoring the possibility of 64-bit integers, it's also stupid because there's no point in worrying about such minor ways of saving a few bytes here and there. Unless you're on a microprocessor (which is possible. You're writing in C, right?) In which case you probably don't have a log10 function, and just wasted who knows how many bytes of code space (but at least 4 for each number you stored for comparison) on writing each conditional instead of using a loop.

    And still failed to include the extra byte for null termination on the end of each string.

  • Foo AKA Fooo (unregistered) in reply to colejohnson66

    Printing to an already allocated string is not a lot more expensive, basically a mod-div operation and an addition per digit. Your while loop would roughly double the cost (which is still no problem in most cases).

    In regards to other comments: UTF-8 is no problem because all digits are in the ASCII range and need only 1 byte. (Unless you're formatting for other number systems, but then your length calculation may be off anyway.)

    Using a fixed-sized buffer is still the best approach. To handle different integer sizes you can use "3 * sizeof (x)" (or something more fancy if you need to save every byte possible), +1 if signed and (depending on how it's used) another "+1" for the NUL terminator.
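
    As a sketch, that sizing rule could be a macro like this (the names are made up; it unconditionally adds both the sign byte and the NUL):

        /* 3 chars per byte over-estimates the decimal digits of any
           unsigned integer type (each 8-bit byte adds at most ~2.41 digits);
           +1 for a possible sign, +1 for the NUL terminator. */
        #define DECIMAL_BUF_SIZE(x) (3 * sizeof (x) + 2)

        char buf[DECIMAL_BUF_SIZE(long long)];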

  • holy shit now I'm actually commenting here... (unregistered) in reply to John

    Unless you're on a microprocessor [...]

    Umm, as opposed to... a macroprocessor?

    I'm like 99.99% sure everybody visiting this site uses one of those. micros, that is

  • Puzzling (unregistered)

    Taking log10 of a number doesn't directly tell you how many digits are in the number because the result isn't an integer.

    If you take the ceiling, you are almost correct, except for the exact powers of 10. You can get the number of digits if you take the floor of log10 and then add 1.
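
    A minimal sketch of that formula (assumes n >= 1, and ignores floating-point rounding worries for very large values):

        #include <math.h>

        int digits_count(unsigned long n)
        {
            /* floor(log10(n)) + 1 digits, valid for n >= 1 */
            return (int)floor(log10((double)n)) + 1;
        }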

  • (nodebb) in reply to John

    Arabic numerals all exist in the first 128 code points of Unicode, which means UTF-8 encoding still only takes one byte per digit.

    Weeeeeell acshually, the Arabic digits (the digits as used by the Arabic language, as opposed to the Western digits based on them) occupy U+0660 to U+0669, and each requires two bytes in UTF-8.

    ٠١٢٣٤٥٦٧٨٩

  • (nodebb) in reply to John

    And still failed to include the extra byte for null termination on the end of each string.

    Given the name of the function, the extra byte for null termination shouldn't be part of the result though.

  • Charles (unregistered) in reply to Foo AKA Fooo

    No. 3*sizeof(x) is wrong. It's possible that sizeof(long) == sizeof(int) == sizeof(short) == 1 if a char is 32 bits. And I have used such a system. A correct way to handle it is to use a fixed-size buffer (e.g. 12 bytes) and an #if to verify that this is sufficient. Then, when someone makes unsigned int 128 bits or whatever, your code will fail gracefully by giving a compile-time error, and maintenance will consist of merely increasing the buffer size and updating the check.
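
    A minimal sketch of that compile-time check (names made up; assumes int is the type being formatted):

        #include <limits.h>

        #define INT_BUF_SIZE 12   /* "-2147483648" plus the NUL terminator */

        #if INT_MAX > 2147483647
        #error "int is wider than 32 bits: increase INT_BUF_SIZE"
        #endif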

  • (nodebb)

    Okay, obviously using log10 would be an anti-pattern - sure, it's more readable, but AFAIR math.h converts the int to a float/double, does a very costly log approximation, and then you have to cast the result back to an integer. So to just get the number of digits, a chain of ifs makes sense, but personally I would start from the bottom and go up to 10.

    However, everyone coding C actually knows that 16-byte-aligned heap segments are the way to go (for all 32-bit platforms or higher), so you have a char[16] buffer, which is ironically better than using a char[3] or char[10]. So for allocation purposes it is completely pointless anyway.
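
    Bottom-up, and with the missing 10-digit case included, such a chain might look like this (a sketch, assuming a 32-bit unsigned value):

        #include <stdint.h>

        int digits_count(uint32_t n)
        {
            if (n < 10u)          return 1;
            if (n < 100u)         return 2;
            if (n < 1000u)        return 3;
            if (n < 10000u)       return 4;
            if (n < 100000u)      return 5;
            if (n < 1000000u)     return 6;
            if (n < 10000000u)    return 7;
            if (n < 100000000u)   return 8;
            if (n < 1000000000u)  return 9;
            return 10;            /* covers values up to 4294967295 */
        }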

  • Abigail (unregistered)

    log10, if you manage to do proper rounding of the result, will give you the number of digits.

    Until someone feeds it the number 0.

  • MRAB (unregistered) in reply to John

    Unicode may have up to 4 bytes per character

    No. UTF-8 may have up to 4 bytes per codepoint, and it's not the only possible encoding for Unicode codepoints.

  • (nodebb)

    This function isn't wrong, just incomplete. It needs one more line at the top. And I rather suspect the log10 approach is more expensive (time-wise) than this; I'm sure it was long ago. This isn't ideal, but it doesn't rise to WTF territory for me. In many cases simply printing it to a guaranteed-big-enough buffer would be the optimum approach, but maybe it's not being printed yet. (Deciding how to format things.)

  • a cow (not a robot) (unregistered)

    The number of bytes required, and how many bytes per character are needed...

    1. are not dependent on Unicode (they depend on how the Unicode "codepoint" is stored, not on Unicode itself; all the "it requires X bytes" claims above are wrong, or rather nonsense, unless you specify the encoding)
    2. are irrelevant, at least according to the function name: it counts the digits, not the bytes. P.S. the terminating NUL is not a digit.

  • (nodebb) in reply to MaxiTB

    However, everyone coding C actually knows that 16-byte-aligned heap segments are the way to go (for all 32-bit platforms or higher),

    That old 386SX in the corner is a 32-bit platform, and so is the 386DX next to it. Neither of them gains any benefit from aligning stuff more tightly than on 4-byte boundaries. (And the benefit from 4-byte alignment versus 2-byte alignment on the 386SX is limited.)

  • (nodebb) in reply to John

    Unless you're on a microprocessor (which is possible. You're writing in C, right?)

    1. The i9-10980XE in my PC that I'm writing this on is a microprocessor.
    2. Nothing stops you from programming that IBM zSystem mainframe (most definitely not a microprocessor in there) in C.

    Maybe you meant "microcontroller"... (And even there, there's no causal link from "using C" back to "must be a microcontroller". At work, we use C on a range of 64-bit x86-64 CPUs which are most definitely not microcontrollers.)

  • WTFGuy (unregistered)

    What's sort of interesting to me is how many of us, myself included, assumed on no concrete evidence that the purpose of the digits_count() function was to determine a buffer length for a string conversion. As opposed to some utterly non-stringly use.

  • Xanni (unregistered) in reply to MRAB

    Actually up to 6 bytes, though current versions of the Unicode standard only define codepoints that will fit in four bytes or less.

  • ismo (unregistered) in reply to Steve_The_Cynic

    The code would then also need to be recompiled, and part of the porting process is to look for those changes. But yeah, it could be a problem.

  • FTB (unregistered) in reply to WTFGuy

    Nowhere does this say this is used to calculate the final size of a string buffer so all the comments about "+1" are WTF.

    It might be used like this:

        size_t buffer_size = strlen(label) + digits_count(value) + 1;
        char *buffer = (char *)malloc(buffer_size);
        sprintf(buffer, "%s%d", label, value);

    Of course it would be much easier to print it into a temporary buffer and use strdup:

        char temp[128];
        snprintf(temp, 128, "%s%d", label, value);
        char *buffer = strdup(temp);

  • TheCPUWizard (unregistered)
    1. Unicode includes 22 different sets of graphemes for the decimal digits, and also various decimal points, thousands separators, negative signs, etc. <wink>
  • MaxiTB (unregistered) in reply to Steve_The_Cynic

    16 bytes, not 16 bits. Sorry, I thought this was clear by my examples given ;-)

  • (nodebb)

    For those wondering about the reason why 16-byte alignment is the fastest:

    Back during the development of the i386, Intel learnt the very hard lesson that implementing a memory mode that nobody wants is not really the killer feature they hoped for. So they focused heavily on supporting DOS and, spoiler: the ~~crappy~~ suboptimal memory management "API" allowed allocations in paragraphs, or in other words in 16-byte steps aligned at 16-byte offset addresses (segments are basically 16-byte offsets anyway, so maybe that was the thought process? Who knows with DOS...). And like most historical stuff it kinda stuck, and is actually a nice compromise for modern cache alignment and not wasting too much valuable cache due to over-fetching.

    Addendum 2023-05-17 02:00: Aww, the strike out doesn't work... whatta shame.
