• (nodebb)

    If only there was an approach to software development that put the technical decisions in the hands of the teams that were doing the work... Oh wait, there have been multiple ones for over 20 years! And so many organizations claim to use one of them, but it is clearly in name only...

  • Tree (unregistered)

    This sounds like it should have been a SQLite database.

  • Peter (unregistered)

    Some file formats allow readable headers and put the binary data behind the End-Of-File character, PNG for instance. So the idea is not completely wrong.

  • Industrial Automation Engineer (unregistered)

    XML and CDATA. Problem solved.

  • Industrial Automation Engineer (unregistered)

    To clarify my previous post: I was being sarcastic...

  • (nodebb)

    Ever looked at the "HTML fragment" clipboard data format? It contains several size/offset fields in text. Thankfully, it allows padding them with zeroes.

  • Steve (unregistered) in reply to Tree

    If SQLite is your answer, you're asking the wrong question

  • Sole Purpose Of Visit (unregistered) in reply to Industrial Automation Engineer

    That's a relief. XML, the comms equivalent of Regexps ...

    Although looking at this problem in the abstract, XML was my first thought. Enterprisey enough for the PHCTO, and would probably work with a suitable schema.

  • (nodebb) in reply to Steve

    SQLite's really good at replacing custom binary formats and whacking great directory trees full of lots of small files. It doesn't do everything a server-based DB does… but it's a complete doddle to deploy.
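
    A minimal sketch of that idea, using Python's standard `sqlite3` module. The `messages` table, its column names, and the sample sources are made up for illustration and are not from the article: each message is stored as a BLOB next to its timestamp, so splitting the messages back out is a single query.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a message store; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " received_at TEXT NOT NULL,"   # ISO-8601 timestamp
        " source TEXT NOT NULL,"        # which organization sent it
        " payload BLOB NOT NULL)"       # raw bytes, binary or plaintext
    )
    return conn

def add_message(conn, received_at, source, payload):
    """Append one message with its timestamp and origin."""
    conn.execute(
        "INSERT INTO messages (received_at, source, payload) VALUES (?, ?, ?)",
        (received_at, source, payload),
    )

def messages_in_order(conn):
    """Return (received_at, source, payload) tuples in insertion order."""
    return conn.execute(
        "SELECT received_at, source, payload FROM messages ORDER BY id"
    ).fetchall()

conn = open_store()
add_message(conn, "2020-01-01T09:30:00Z", "exchange-a", b"\x00\x01\xffbinary")
add_message(conn, "2020-01-01T09:30:01Z", "exchange-b", b"plain text message")
conn.commit()
rows = messages_in_order(conn)
```

    The result is not cat-able, of course, which is the part the CTO objected to.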

  • LCrawford (unregistered)

    Ok, I'm an XML horror newbie. What would be wrong with XML in this case?

  • (nodebb) in reply to dkf

    Even with context, I have no idea whether a doddle is a good thing or a bad thing.

    ...To the inter--! uh.

    To a different part of the internet!

  • (nodebb) in reply to kilroo

    doddle noun

    1. a very easy task

    Example: "this printer's a doddle to set up and use"

    Now I just need 1/4 of a red laser and 1/4 of a blue laser

  • Sole Purpose Of Visit (unregistered) in reply to Sole Purpose Of Visit

    I've spotted the terrible flaw in my argument. CDATA doesn't cat very well...

    Of course, any CTO who insists on using cat for comparisons these days is the real WTF.

    (Dogs don't work any better.)

  • Ollie Jones (unregistered)

    There's this really cool way to stash binary data in text files. It's called base64. This CTO's prejudice in favor of text files was probably valid.

    There's this really cool way to handle binary-data endianness. It's called network order. (htonl(), ntohl()).

    Everything in the file is text. Records formatted Tag: value separated by cr/lf (just like http headers, eh?) Binary data is converted to network order and base 64 encoded to store it.
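
    A rough sketch of that layout, in Python rather than C, so `struct.pack("!I", ...)` stands in for `htonl()`. The tag names `Length` and `Payload`, the 32-bit field width, and the blank line ending the record are assumptions for illustration only.

```python
import base64
import struct

def encode_record(fields):
    """Render a dict of tag -> value as 'Tag: value' lines ending in a blank line.

    int values are packed as 32-bit unsigned network-order (big-endian)
    integers and base64 encoded; bytes values are base64 encoded as-is.
    """
    lines = []
    for tag, value in fields.items():
        if isinstance(value, int):
            value = struct.pack("!I", value)  # '!' means network byte order
        lines.append(f"{tag}: {base64.b64encode(value).decode('ascii')}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

def decode_record(data):
    """Parse the text produced by encode_record back into tag -> raw bytes."""
    header, _, _rest = data.partition(b"\r\n\r\n")
    fields = {}
    for line in header.decode("ascii").split("\r\n"):
        tag, _, value = line.partition(": ")
        fields[tag] = base64.b64decode(value)
    return fields

record = encode_record({"Length": 80, "Payload": b"\x00\xff\x10binary"})
decoded = decode_record(record)
length = struct.unpack("!I", decoded["Length"])[0]
```

    Everything in `record` is printable ASCII, at the cost of base64's roughly one-third size overhead.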

    The real WTF? CTOs who don't know how to explain what they want in a convincing way. They have one job: explaining stuff. Many CTOs can't do that.

  • ZZartin (unregistered) in reply to LCrawford

    Ok, I'm an XML horror newbie. What would be wrong with XML in this case?

    For one thing, XML doesn't like raw binary data. Sure, you can base64 everything, but that's another level of unreadable.

  • Duston (unregistered)

    "you can base64 everything, but that's another level of unreadable." And then you can encrypt it with ROT13 just to make it secure too. /snark.

  • (nodebb)

    All of this created some technical issues. The key one was that the header length, stored as text, could change the length of the header. This wasn't itself a deal-breaker, but other little flags created problems. If they represented byte-order as BIGENDIAN=Y, would that create confusion for their users? Would users make mistakes about what architecture they were on, or expect to use LITTLEENDIAN=Y instead?

    Tim C's team is TRWTF. No, the cat-fanatic CTO doesn't help, but the above paragraph shows that the developers are in a whole different class of WTF.

    The header length should, indeed, be stored in a way that doesn't change the length of the header, but that's not a hard problem. Either pack it with leading spaces or zeroes or something, or don't actually store the length as such. Take a lesson from HTTP, and represent the header as a sequence of lines of text, ended with a blank line. Duh.
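
    Both options fit in a few lines. The sketch below is illustrative only: the field names `HeaderLength`, `ByteOrder`, and `Messages` and the 8-digit width are assumptions, not the format from the article. The length is written zero-padded to a fixed width, so filling in the real value cannot change the size, and the header also ends with a blank line, so a reader can ignore the length entirely.

```python
def build_header(fields, width=8):
    """Build a text header whose own length is a fixed-width, zero-padded field.

    Because 'HeaderLength' always occupies `width` digits, writing the real
    value cannot change the size of the header.
    """
    def render(length):
        lines = [f"HeaderLength: {length:0{width}d}"]
        lines += [f"{tag}: {value}" for tag, value in fields.items()]
        return ("\n".join(lines) + "\n\n").encode("ascii")

    placeholder = render(0)            # measure with a dummy value
    header = render(len(placeholder))  # same size, real value
    assert len(header) == len(placeholder)
    return header

def split_header(data):
    """Split a file into (header fields, body) using the blank-line terminator."""
    head, _, body = data.partition(b"\n\n")
    fields = dict(line.split(": ", 1) for line in head.decode("ascii").split("\n"))
    return fields, body

header = build_header({"ByteOrder": "big", "Messages": "3"})
fields, body = split_header(header + b"\x00\x01binary payload")
```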

    For the endianness, you pick one representation and impose it on the idiot users. If they get it wrong, it's their problem. Or you discuss it carefully with the users and agree how it should be represented. And, of course, just in case the users are PDP-11 holdouts (that happens, even now) or other forms of idiocy, make sure to allow for flavours of middle-endian...

  • ooOOooGa (unregistered)

    It was solving a hard problem: they were collecting messages from a variety of organizations, in a mix of binary and plaintext, and dumping them into a flat file. Each file might contain many messages, and they needed to be able to split those messages and timestamp them correctly.

    And the problem with normalizing the incoming data and storing it in a consistent format is...

  • (nodebb) in reply to Ollie Jones

    The real WTF? CTOs who don't know how to explain what they want in a convincing way. They have one job: explaining stuff. Many CTOs can't do that.

    "I am good at dealing with people, can't you understand that?! What the hell is wrong with you people?!"

  • Zygo (unregistered)

    It was solving a hard problem: they were collecting messages from a variety of organizations, in a mix of binary and plaintext, and dumping them into a flat file. Each file might contain many messages, and they needed to be able to split those messages and timestamp them correctly.

    I see you are reinventing MIME. Carry on.

  • MaxiTB (unregistered)

    I'm confused - I thought the clear way to go these days for this type of problem is to create a manifest file in a standardized format (like XML or JSON), put the binary data either in separate files or inline, and then put everything, including the original files, in a container format like zip? I think I'm missing something here %-)

  • LordOfThePigs (unregistered) in reply to MaxiTB

    Exactly what I was thinking. And if you need it to be cat-able, make that container a .tar file.
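
    A sketch of the manifest-plus-container idea with Python's standard `tarfile` module. The file names (`manifest.json`, `msg000000.bin`) and the manifest fields are hypothetical, chosen only for the example.

```python
import io
import json
import tarfile

def build_archive(messages):
    """Pack messages into an uncompressed tar: manifest.json first, then payloads.

    `messages` is a list of (timestamp, payload bytes) pairs.
    """
    manifest = [
        {"name": f"msg{i:06d}.bin", "timestamp": ts, "size": len(payload)}
        for i, (ts, payload) in enumerate(messages)
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        def add(name, data):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        add("manifest.json", json.dumps(manifest, indent=2).encode("utf-8"))
        for entry, (_, payload) in zip(manifest, messages):
            add(entry["name"], payload)
    return buffer.getvalue()

def read_archive(data):
    """Return (manifest, {name: payload}) from bytes made by build_archive."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
        payloads = {e["name"]: tar.extractfile(e["name"]).read() for e in manifest}
    return manifest, payloads

archive = build_archive([("2020-01-01T09:30:00Z", b"\x00\x01\x02"),
                         ("2020-01-01T09:30:01Z", b"plain text")])
manifest, payloads = read_archive(archive)
```

    Because the tar is written uncompressed, the JSON manifest sits in the file as readable text. Whether one archive member per message is sensible at the scale the author mentions later in the thread is a separate question.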

  • Anne on a Mouse (unregistered) in reply to ZZartin

    That’s going to be a problem for ANY readable scheme, which the CTO is insisting on, so why would XML be worse than any other option, given that limitation? (Not that I’m particularly pleased with the idea of XML, but if the CTO insists on a readable header that already breaks a lot of things.) For that matter, the “header length might change the length of the header” problem ceases to be an issue if you use a format like XML where you just read until you reach the marker for the end of the section, which would resolve the problem by eliminating the need to write out the length in the first place.

    On another note, though: if you’re going to support different endian-ness in the file format, rather than requiring all numbers to be recorded one way, then surely you need to write out more than just “1”, in case the individual bytes within an int have different endian-ness, which IIRC was the case on some obscure systems and you can’t be sure that some random client isn’t still using one. You should make them write 19088743, which is 0x01234567, so you can track each byte — even if they’re stored backwards you can count the number of bits in each byte to see which order they come in. (Or, for a 64-bit integer, 81985529216486895, which is 0x0123456789ABCDEF.)
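
    One way to sketch the marker idea: have the writer store 0x01234567 as a raw 4-byte integer and compare the bytes found in the file against the known layouts. The function name and the layout table are illustrative; the middle-endian entry assumes the PDP-11 convention, where the most significant 16-bit word comes first and the bytes within each word are swapped.

```python
import struct

MARKER = 0x01234567  # 19088743, the value suggested above

def detect_byte_order(raw):
    """Identify how a 4-byte MARKER was laid out by the writer."""
    layouts = {
        bytes([0x01, 0x23, 0x45, 0x67]): "big-endian",
        bytes([0x67, 0x45, 0x23, 0x01]): "little-endian",
        bytes([0x23, 0x01, 0x67, 0x45]): "middle-endian (PDP-11 style)",
    }
    return layouts.get(bytes(raw), "unknown")

big = struct.pack(">I", MARKER)     # explicit big-endian layout
little = struct.pack("<I", MARKER)  # explicit little-endian layout
```

    Since all four bytes of the marker differ, a direct lookup is enough to tell the layouts apart.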

  • Colin (unregistered)

    Wouldn't the big-endian form of the 0x00000001 read as 0x01000000, not 0x10000000?

  • (nodebb)

    I'm wondering whether it would have been easier to find a friendly sysadmin to set up the CTO with replacement versions of cat and head that check the file type - if it's their new format then it displays some randomly-generated data that looks almost, but not quite, like XML. If it's anything else, it silently invokes the real head/cat.

    Probably only needs to check the first file argument. I'm willing to bet good money that the CTO never actually uses cat to catenate files.

    If that's too hard, then create a data file that includes a binary sequence that logs out (or crashes) a variety of terminal types, and name it 1_BiggestSpendingClient_TestFile.xmimelite, then make sure everyone except the CTO knows not to use that file. Surely that would be less work than reengineering the whole file format, and maybe the CTO will learn something, if only not to make such a big show about knowing about cat.

  • (nodebb) in reply to Colin

    Are you using a Cedrus Stimtracker too? If yes, I have some tips for you (not sarcasm).

    Otherwise, AFAIK - no, Big-Endian is not supposed to shift bits besides reversing the order. At least, all the devices I worked with just reversed (with one exception, see above, and it turned out the cable was wrong).

  • löchlein deluxe (unregistered)

    "Having the header length be an integer and not text also meant that recording the length wouldn't impact the length."

    Errr, no. ("But 64k should be big enough for everyone"?)

    Also, re the Base64 crowd here, nooooo, what you clearly want is a variant of Base85 with a permutation of the characters so it's incompatible with your competitor's tools.

  • Szzzzt (unregistered) in reply to Andre Alexin

    Colin is right - endian-ness (at least with regards to memory and file storage) reverses the byte order, not the bit order. 0x00000001 stored in a file on a little-endian system is written as "0x01 0x00 0x00 0x00". There is no bit-shifting involved. The bytes are simply stored/written with the least significant byte at the lowest address. I believe that sometimes some comms systems transmit byte data in reversed bit order, but that's apparently not what we're talking about here.
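
    This is easy to check in Python (used here purely as a calculator): the same value is serialized in both orders, and the little-endian bytes are then misread as big-endian.

```python
value = 0x00000001

big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

def hex_bytes(data):
    """Format bytes the way they are written out in the comment above."""
    return " ".join(f"0x{b:02x}" for b in data)

big_text = hex_bytes(big)        # "0x00 0x00 0x00 0x01"
little_text = hex_bytes(little)  # "0x01 0x00 0x00 0x00"

# Reading the little-endian bytes as if they were big-endian:
misread = int.from_bytes(little, "big")  # 0x01000000, not 0x10000000
```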

  • Gnasher729 (unregistered)

    Is there anything that you can’t do with JSON and all binary data base64 encoded? Or base-85 for added fun?

  • Gnasher729 (unregistered)

    Bigendian vs littleendian: Ages ago when I wrote binary files, we used something similar to utf-8, which was compact and independent of byte order. JSON obviously means you send all numbers in decimal, also no problem with endianness.
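
    The comment doesn't say which encoding was used, so the sketch below is an assumption: a LEB128-style variable-length integer (the scheme Protocol Buffers uses for varints), which, like UTF-8, is defined as a sequence of bytes and so has no byte-order question at all.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, high bit = 'more follows'."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte, high bit clear
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos
```

    Small values take one byte, and the decoder returns the next position, so values can be read back to back from a stream.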

  • chris (unregistered)

    A CTO that knows commandline tools? I'd take it every day. All CTOs I've seen know no tool beyond PPT and possibly Excel.

  • Mark (unregistered) in reply to ZZartin

    Nonsense, it's easy to represent binary data in XML: <binaryData> <bit>1</bit> <bit>1</bit> <bit>0</bit> <bit>1</bit> <bit>0</bit> ... </binaryData>

  • (nodebb) in reply to Colin

    You're right. Endianness reverses bytes, not nibbles.

  • Duke of New York (unregistered)

    Tim presented a file format that contradicted a prior agreement, without having given notice, and the CTO checked him. I'm afraid the advantage is with the CTO.

    For the endianness, you pick one representation and impose it on the idiot users. If they get it wrong, it's their problem. Or you discuss it carefully with the users and agree how it should be represented. And, of course, just in case the users are PDP-11 holdouts (that happens, even now) or other forms of idiocy, make sure to allow for flavours of middle-endian...

    The idiot users pay the bills, call the tune, and don't want to be bothered with unnecessary minutiae. "Aggregate the messages in a file so we can split them out again." "OK, which byte order should we conv--" "AGGREGATE. THE. MESSAGES."

  • Fernando (unregistered) in reply to Gnasher729

    "JSON obviously means you send all numbers in decimal"

    which can be bad if the number is floating-point binary and you don't want to lose bits converting to/from decimal. I hacked up a protocol once that had to send such numbers in JSON. I formatted them as strings of hexadecimal.

  • Gnasher729 (unregistered) in reply to Fernando

    Huh? Converting floating-point numbers between decimal and binary floating-point is a solved problem. It requires a lot of attention to detail, but your C Standard library should have code handling it, and your JSON parser should use that code. Now if you go above 64 bit double, that gets tricky.
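
    Both routes from this exchange can be tried in Python (an assumption; neither comment names a language). For 64-bit doubles, the decimal text written by `json.dumps` round-trips exactly, and the hex-string approach via `float.hex()` does too.

```python
import json

x = 0.1 + 0.2  # a double whose shortest decimal form is 0.30000000000000004

# Decimal route: json.dumps writes the float's repr, which round-trips.
via_decimal = json.loads(json.dumps({"value": x}))["value"]

# Hex-string route: float.hex() spells out the exact bits of the double.
via_hex = float.fromhex(json.loads(json.dumps({"value": x.hex()}))["value"])
```

    This only shows that a correct writer and parser pair preserves the value. A JSON library on the other end that formats or parses doubles carelessly could still lose bits, which is the case the hex-string workaround guards against.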

  • jo (unregistered)

    In our company, we say: if you can't store it in a protocol buffer, it doesn't exist.

  • I dunno LOL ¯\(°_o)/¯ (unregistered) in reply to Steve_The_Cynic

    I actually used a NUXI system once. It was a 68000 CPU with an LSI-11 backplane running a Unix clone (Regulus), inside of a VT-100 terminal shell. Any data written out through SCSI was in NUXI order. I just wish I had kept a binary of the original non-Apple MACSBUG that it used as a boot rom.

    Oh the number of WTFs in that description. Did I mention that it was used for a government contract?

  • Shoal (unregistered)

    Someone should tell the cat guy about strings

  • (nodebb) in reply to LCrawford

    I'm the author. We needed to store millions or billions of tiny messages, e.g. 80 bytes each, in each file. The domain was stock exchange trading engines.

  • (nodebb) in reply to Zygo

    I'm the author. We needed to store millions or billions of tiny messages, e.g. 80 bytes each, in each file. This was for stock exchange trading engine data.

  • (nodebb)

    "The cat guy" LOL

  • If only there was a solution to this already (unregistered)

    https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html


Leave a comment on “A Binary Choice”
