• (nodebb)

I don't see a problem here. Just extend that lookup table to cover all 2^16 = 65,536 characters in all languages and be done with it.

  • P (unregistered)

    Using true/false in C? Now that's TRWTF there.

    I expect them to be 1/0 instead.

  • bvs23bkv33 (unregistered)

    converting character to int in C? you don't

  • Sheriff Fatman (unregistered) in reply to P

    Using true/false in C? Now that's TRWTF there.

    I expect them to be 1/0 instead.

    #define true 0

  • Legacy Warrior (unregistered)

    This code is so optimized! The obvious best solution is to create a 256-bit wide flag mask that you perform bitwise operations on to determine the isalnumCache result for any given value! That way you don't even do any frivolous array lookups!

  • Legacy Warrior (unregistered)

    And by my calculations, that bit mask would be of type "unsigned long long long long long long long long long long long long long long long long long long long long long long long long long long long long long long long char"

  • Foo AKA Fooo (unregistered) in reply to Mr. TA

Welcome to the 21st century! Unicode has >1 million valid code points. (Yes, I know, Emoji aren't alphanumeric anyway, but there are some letters outside the BMP.) And it's certainly much fun having to adjust your table any time Unicode adds new characters.

    Which is also why wchar_t is problematic -- it can be 16-bit or 32-bit, depending on the platform. char32_t was added more recently, in C11.

    (And FWIW, isalnum for ASCII can be done with two comparisons rather than six, using some dirty low-level tricks.)

  • Pedro (unregistered) in reply to Mr. TA

Actually, Unicode now has exactly 1,114,112 code points/characters. UTF-16 represents them using 2 or 4 bytes.

  • Pedro (unregistered)

    The correct enterprimentation(tm) is to call a web service that queries a database to find out if a given character is alphanumeric or not. Since UTF-16 codepoints can have up to 4 bytes, the database table only needs to have 4 billion rows, each with a true/false isAlphanumeric column.

  • Pedro (unregistered) in reply to Mr. TA

Actually, Unicode has exactly 1,114,112 codepoints/characters. UTF-16 represents them using 2 or 4 bytes, so we need a 2^32 byte lookup table ;)

  • (nodebb)

    How about using the isascii() function in <ctype.h>? Yet another problem in roll-your-own code that is solved simply using the documentation and a provided standard library. If you want wide (Unicode) characters, use iswascii().

  • (nodebb) in reply to Pedro

    I knew somebody would write that! Aren't most characters after 65,000 Chinese words? I thought there were no symbols in that upper range. Which is why my proposal only goes up to 65,000 as opposed to 4,000,000,000 like you're suggesting - this clearly proves that I'm way more reasonable than you.

  • (nodebb) in reply to Legacy Warrior

Funny you mention it, I actually think that the best way for a standard library (in any language) to implement this would be using a bitmask. Now I'm curious - I want to go decompile .NET to see how they solved it.

    Addendum 2019-09-11 09:42: Yes! .NET does it using unsafe bitmask math. Great minds think alike.

  • (nodebb) in reply to Mr. TA

    Aren't most characters after 65,000 Chinese words?

    That or emoji. You just know you need to use 💩😇 in your input data…

  • Pedro (unregistered) in reply to Mr. TA

Chinese? Not at all: https://en.wikipedia.org/wiki/Plane_(Unicode) "Future-proof" is the management buzzword of the day, so 2^32 it is! ;-)

  • Brian Boorman (google) in reply to Legacy Warrior

You mock, but lookup tables are common in low-memory embedded applications. If you had to do this function (assume plain ASCII chars) on a 16-bit micro with 2KB of memory, damn straight you'd use a 256-bit (32-byte) lookup table. Shift the char right 3 bits, index the array, then shift the result right by the lower 3 bits of the original char and bitwise mask with 0x01.

    Let's see - Copy Char, Shift Copied Char, Load Indexed, Copy Char, Mask Char, Shift Loaded Byte, And Immediate - That's 7 16bit instructions (excluding function prologue/epilogue) + 32 bytes = 46 bytes.

    Show me an implementation that does better.

  • Pedro (unregistered) in reply to Brian Boorman

    For 8-bit isAlpha() check the smallest footprint is simply to check the valid ranges, as this same article suggests: just check if the char is in the ranges 48–57, 65–90 and 97–122. That's much shorter than 46 bytes.

  • Officer Johnny Holzkopf (unregistered)

    As long as you stay within the "7 bit US-ASCII" range, isalnum() will probably work. But - and this is a big but! - as soon as you have umlauts like Ö or ligatures like ß, things start to get interesting. And hardcoding numeric values for characters and expecting them to be universal is a belief that you're going to question when you port your program to Linux on /z and have to discover that EBCDIC is quite different.

The proper enterprisery way which is good for the company of course is to use a web service that contains an instance of an IsSomeCharacterOrSymbolAlphabeticOrNumericFactoryFactory that queries a database (different web service!) where the results for the isalnum() question are stored in a column of type VARCHAR which contains a random selection of true, TRUE, True, false, FALSE, False, yes, YES, Yes, no, NO, No, null, NULL, the empty string, nil, 0, 1, O, I, l, jawohl, niente, tak, Nyet, and of course FILE_NOT_FOUND for proper globalization and international business communication.

  • gnasher729 (unregistered)

    In my experience wchar_t is just asking for trouble. Not portable, problems with endianness that you could do without. Use standard strings holding UTF-8 bytes. Then learn that number of bytes ≠ number of unicode code points ≠ number of characters :-)

  • gnasher729 (unregistered)

    Pedro: You don't need a 2^32 byte lookup table. First, you don't need it anyway because there are just above 2^20 possible code points. Second, because anyone who isn't completely insane will use some library, which will have nicely compressed tables that are a lot smaller than 2^20 bytes. Because most pages have the same values in most lookup tables, so you do one lookup for the page, then use a per-page result, or another lookup within the page. All of course hidden away in some library.

  • (nodebb)

    The funny part is, most implementations of isalnum() I know of (which means, hum... only Microsoft's) already use a lookup table (at least when working on ASCII).

    #define __fast_ch_check(a,b)       (__initiallocinfo.pctype[(a)] & (b))
    
    extern __inline int (__cdecl isalnum) (
            int c
            )
    {
        if (__locale_changed == 0)
        {
            return __fast_ch_check(c, _ALPHA|_DIGIT);
        }
        else
        {
            return (_isalnum_l)(c, NULL);
        }
    }
    
    
  • Pedro (unregistered) in reply to gnasher729

I can't believe you thought I was serious...

  • (nodebb)

    Not using 1s and 0s will disguise the fact that this is the preliminary setup for a particular X-Files episode (https://www.imdb.com/title/tt0751094/mediaviewer/rm1515662080).

    In my very brief self-taught-c period, I used the gpp compiler, just because of a missing boolean type...

  • I dunno LOL ¯\(°_o)/¯ (unregistered) in reply to Nutster

    The best part is that the standard way to implement the (8-bit char) <ctype.h> library functions is with an array of 256 ints, each with a bit mask of ALL attributes like numeric, alpha, etc. Only it's a single array for all of the attributes, and someone has bothered to NOT make a mistake in the array, decades ago. Apparently some implementations even make it an array of 384 to safely handle signed vs unsigned.

So not only is it re-implementing the wheel, it's re-implementing the wheel with a flattened tire.

  • X (unregistered)

    Aren't you supposed to use unichar_t and not wchar_t...?

  • sizer99 (google) in reply to Medinoc

    A lookup table is the way to go in general because instead of flattening a 256-bit 'is it alphanumeric?' down to 32 bits, they keep a byte for each character and have various flags on it like 'is alpha?' 'is numeric?' 'is punctuation?' 'is whitespace?' 'is printable?' Assuming nobody botches the table it's very clean and you can do easy combinations like

    #define isalphanumeric(c) ( chartable[c] & ( IS_ALPHA | IS_NUMERIC ) )

    What this guy did is... not the best.

  • Lőrinczy, Zsigmond (github)

'wchar_t' isn't Unicode. It is defined as 'platform-dependent something'; it can even be the same as char.

  • (nodebb)

A lookup table for the whole Unicode space wouldn't actually be unreasonable. The key is to chop the problem up! We don't want anything like a table with 2^32 entries. Instead, let's break it up into a whole bunch of 256-entry tables. These come in two flavors: branches (in which each entry is a 16-bit number saying what table to look at next) and leaves (in which each entry is a bitmap, thus 32 bytes). 16 bits is actually overkill; without building the table I don't know how many you actually need. I wouldn't be shocked to learn that 8 is enough.

    Break your character into 4 bytes. Byte #1 is looked up in table #1. Byte #2 is looked up in the entry pointed to by byte #1. Byte #3 is looked up in the entry pointed to by byte #2. Byte #4 is looked up in the leaf table pointed to by byte #3.

    The key here is that characters of interest are pretty uncommon and thus you can point most of the entries to the same place.

    Don't even think of generating the table without computer assistance, though!!

  • (nodebb)

    If you're going to do a WTF, at least try to make it look like you're doing something right. The true/false table should be programmatically generated using the range values indicated. If runtime performance is that critical, put the table generation into the build.

    In case you didn't realize it, this is the start of SkyNet, having programs program themselves!

  • löchlein deluxe (unregistered) in reply to Legacy Warrior

    <quote>unsigned long long long long long long long long long long long long long long long long long long long long long long long long long long long long long long long char</quote>

    Hey, I've seen those at Burger King, haven't I?

  • (nodebb)

    Sure, this lookup table is a WTF... but the method signature for this library method must also be a WTF if it's accepting random other types being passed to it such as wchar_t.

    Or did Lexie commit another WTF by changing the method signature before looking at the actual implementation of the method?

  • (nodebb) in reply to Loren Pechtel

    With Unicode being only 20.1 bits rather than 32, one could do this with only three levels. Or, to follow the same organization as Unicode (seventeen 16-bit planes), two levels: One for (byte)(c>>16) and one for (c & 0x00FFFF). And to avoid both a test and memory overuse, all values other than 0-16 would point to the same 64k of "no to all" values.

    Total space: 64kib * (17+1) + 256*sizeof(pointer) = 1152kib + 2kib ≈ 1.257Mib on a 64-bit system.

Addendum 2019-09-12 05:21: I mean 1.127Mib. Dunno how I could mistype that.

  • Little Bobby Tables (unregistered) in reply to I dunno LOL ¯\(°_o)/¯

    Everybody knows that it's the corners on a square wheel that give it a bumpy ride. So the answer is to reduce the number of corners. Here, try my triangular wheel.

  • Anonymous') OR 1=1; DROP TABLE wtf; -- (unregistered) in reply to Loren Pechtel

    That's basically how I implemented Unicode-aware versions of toupper() and tolower() some years ago in a multi-platform C++ code base where some platforms didn't have any real native Unicode support and we couldn't use any (L)GPL libraries such as libiconv.

    We were already using UTF-8 internally, so naturally I used a lookup table that was up to 4 levels deep where each level was indexed by one byte in the UTF-8 representation of the character being looked up, without any leading 1 bits. I wrote a Python script to parse the Unicode Character Database files and spit out the tables.

It worked great and was fairly compact in terms of memory size, but it was way overkill for what we were trying to do. We only really needed to support EFIGS languages, so for all practical purposes, a solution that was correct for all Latin-1 characters would have sufficed.

  • jgh (unregistered)

A lookup table is the standard way classical C did isalpha() and ispunct() type of stuff, usually as macros. Usually a 257-byte table so EOF matched at offset -1.

  • doubting_poster (unregistered)

I'm more interested in learning if the optimisation was premature or not. Was the code profiled? Was this originally coded this way, or was there a production requirement that led to it?

    It was probably just dumb, but still, don't make assumptions.

  • (nodebb)

    regarding the comments... uppercase and lowercase are 0x20 apart, it looks better in a table than being contiguous

  • Decius (unregistered)

You can always check for something in that range in four comparisons, and most in just two (>=65, <=90, <=122, >=97 for everything in the 90-122 range in the example) if you optimize rather than use redundant checks:

    if at or above 65:
        if at or below 90: return true
        if at or below 122:
            if at or above 97: return true
        return false
    if at or above 48:
        if at or below 57: return true
    return false

    If you're going to do something that should never be done, do it right!

  • Bruce (unregistered)

    I think this ridiculing ignores history a bit. In pre-standard C (K&R), everything was ascii, and character maps were exactly how you implemented character class testing. As a previous poster said, using bit fields generally. This was particularly relevant in those days with small address spaces.

    In those days most of the standard libraries were tiny, and depending upon platform or compiler were very minimal, in fact it was uncommon for character class functions or macros to be included. Most of us to have roll our own libraries [and they might vary from system to system depending upon word/character size].

    This is possibly old code either bought into a project from someone's toolkit, or inherited from an older seed project that should have been refactored, but got the attention that refactoring always gets: not until it's way past critical.

  • asdfse (unregistered) in reply to Mr. TA

    Then some annoying boss will want it to support emojis too.

  • asdfse (unregistered) in reply to Legacy Warrior

    But bitwise operations are operations! They must be slow because CPUs still do one operation after the other, don't they? Array lookups are obviously practically free being just a single operation! Cache and memory speed be damned.

Leave a comment on “ImAlNumb?”
