I think it’s fair to say that C, as a language, has never had a particularly great story for working with text. Individual characters are okay, but strings are a nightmare. The need to support unicode has only made that story a little more fraught, especially as older code now suddenly needs to support extended characters. And by “older” I mean, “wchar
was added in 1995, which is practically yesterday in C time”.
Lexie inherited some older code. It was not designed to support unicode, which is certainly a problem in 2019, and it’s the problem Lexie was tasked with fixing. But it had an… interesting approach to deciding if a character was alphanumeric.
Now, if we limit ourselves to ASCII, there are a variety of ways we could do this check. We could convert it to a number and do a simple check- characters 48–57 are numeric, 65–90 and 97–122 cover the alphabetic characters. But that’s a conditional expression- six comparison operations! So maybe we should be more clever. There is a built-in library function, isalnum
, which might be more optimized, and is available on Lexie’s platform. But we’re dedicated to really doing some serious premature optimization, so there has to be a better way.
bool isalnumCache[256] =
{false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false,
false, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false,
false, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false};
This is a lookup table. Convert your character to an integer, and then use it to index the array. This is fast. It’s also error prone, and this block does incorrectly identify a non-alphanumeric as an alphanumeric. It also 100% fails if you are dealing with wchar_t
, which is how Lexie ended up looking at this block in the first place.