Copy Serialization

Twenty years ago, Stefano Z was a lowly junior developer, working with a set of senior developers, who had rules. A lowly junior developer, for example, couldn't be trusted to do something risky and dangerous, like serialize data to a buffer. Not without a safe API to keep them from foot-gunning themselves.

The API went thus:

void TheLibraryName_Client_SerializeInt(char** serializedBufferPosition, int value);
void TheLibraryName_Client_SerializeDouble(char** serializedBufferPosition, double value);
void TheLibraryName_Client_SerializeString(char** serializedBufferPosition, const char* value, size_t size);

The first parameter was a pointer to a pointer: the address of the address where the next piece of data will be written. The functions all had a side effect: they'd write data to the destination address, and then increment that address by the size of the data written.
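If the pointer-to-a-pointer dance is hard to picture, here is a minimal sketch of the same contract. Note that put_byte and demo_cursor_advance are hypothetical names made up for illustration and are not part of the library; the only point is to show how passing the address of a cursor lets the callee move it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the library: writes one byte, then
   advances the caller's cursor through the pointer-to-pointer. */
static void put_byte(char **position, char byte)
{
    **position = byte;  /* write at the address the cursor currently holds */
    (*position)++;      /* move the caller's cursor forward by one byte */
}

/* Hypothetical demo: writes two bytes into a buffer of at least two bytes
   and returns how far the cursor moved. */
static size_t demo_cursor_advance(char *buffer)
{
    char *cursor = buffer;       /* the cursor starts at the beginning */
    put_byte(&cursor, 'A');      /* pass the address of the cursor itself */
    put_byte(&cursor, 'B');
    assert(cursor == buffer + 2);
    return (size_t)(cursor - buffer);
}
```

The caller never touches the cursor between calls; every write leaves it pointing at the next free byte.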

Already, from the API alone, I don't like it. This is maybe just my preference, but I hate side effects, and I hate methods which do two things. I'd much rather break it out into two steps: writeData(dest, data); incrementAddress(dest, sizeof(data)).

Let's see what the senior engineers' implementation looked like.

void TheLibraryName_Client_SerializeInt(char** serializedBufferPosition, int value)
{
	*(int*)(*serializedBufferPosition) = value;
	(*serializedBufferPosition) += sizeof(int);
}

void TheLibraryName_Client_SerializeDouble(char** serializedBufferPosition, double value)
{
	*(double*)(*serializedBufferPosition) = value;
	(*serializedBufferPosition) += sizeof(double);
}

Ah, C pointer casting. I "love" it. First, we dereference serializedBufferPosition, which is our pointer to the pointer. That gives us just a pointer. But it's a pointer to char (or uint8_t or whatever alias you prefer), so we then need to cast it to an int* or a double*. Then we dereference it again, which gets us to the actual data at that address, and write our value in there.
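To make those steps easier to follow, here is the int version unpacked so that each step gets its own variable. This is only a sketch: SerializeIntStepByStep and demo_round_trip are hypothetical names, and like the cast-based original it assumes the cursor is suitably aligned for an int (the demo uses malloc'd memory for that reason).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name: same behavior as the one-line cast version, one step
   per line. Assumes the cursor is suitably aligned for int. */
static void SerializeIntStepByStep(char **serializedBufferPosition, int value)
{
    char *bytePosition = *serializedBufferPosition; /* first dereference: the cursor itself */
    int *intPosition = (int *)bytePosition;         /* treat that address as pointing to an int */
    *intPosition = value;                           /* second dereference: store the value there */
    *serializedBufferPosition = bytePosition + sizeof(int); /* advance the caller's cursor */
}

/* Hypothetical demo: serializes one int into malloc'd memory and reads it back. */
static int demo_round_trip(int value)
{
    char *buffer = malloc(sizeof(int));
    assert(buffer != NULL);
    char *cursor = buffer;
    SerializeIntStepByStep(&cursor, value);
    assert(cursor == buffer + sizeof(int));
    int readBack = *(int *)buffer; /* read the stored int back out */
    free(buffer);
    return readBack;
}
```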

I don't really love any of this. It's a lot of casting to get to a very simple result. I wouldn't call it a WTF, I just don't like it. But to see it taken to its absurdity, you need to look at the string serializer:

void TheLibraryName_Client_SerializeMemory(char** serializedBufferPosition, const char* value, size_t size)
{
	for (size_t i = 0; i < size; i++)
	{
		**serializedBufferPosition = *value;
		(*serializedBufferPosition)++;
		value++;
	}
}

They follow the same logic, one character at a time. Byte by byte, they manually copy the results around, and increment by 1 each time. If only, if only there were an easier way to copy memory from one address to another. Some sort of memory copy or memcpy function.

Wait, there is such a function. So, on a whim, Stefano implemented a version using memcpy, like so:

void MyOwnSerializeMemory(char** serializedBufferPosition, const char* value, size_t size)
{
	memcpy(*serializedBufferPosition, value, size);
	(*serializedBufferPosition) += size;
}

Again, I don't love moving the address pointer in the function, but in the scheme of things, it's a minor problem, and it fits the API as already defined. Stefano ran some tests and benchmarked the two versions. The memcpy version was significantly faster even for small blocks of memory, and the benefit was massive for larger blocks. For their software, 100MB was a common size to copy.
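We don't have Stefano's actual benchmark, so what follows is only a rough sketch of how such a comparison could be set up. LoopSerializeMemory and MemcpySerializeMemory mirror the two versions shown above; the names time_serialize and outputs_match, and the use of clock(), are assumptions made for illustration. Any numbers you get will depend heavily on the compiler, optimization flags, and hardware.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*SerializeFn)(char **, const char *, size_t);

/* Byte-by-byte copy, mirroring the library version. */
static void LoopSerializeMemory(char **pos, const char *value, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        **pos = *value;
        (*pos)++;
        value++;
    }
}

/* memcpy-based copy, mirroring Stefano's version. */
static void MemcpySerializeMemory(char **pos, const char *value, size_t size)
{
    memcpy(*pos, value, size);
    (*pos) += size;
}

/* Hypothetical helper: runs fn once over `size` bytes and returns the
   elapsed CPU time in seconds, as measured by clock(). */
static double time_serialize(SerializeFn fn, char *dest, const char *src, size_t size)
{
    char *cursor = dest;
    clock_t start = clock();
    fn(&cursor, src, size);
    clock_t end = clock();
    assert(cursor == dest + size); /* both versions must advance by `size` */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical check: returns 1 if both versions produce identical output.
   `size` must be nonzero. */
static int outputs_match(size_t size)
{
    char *src = malloc(size);
    char *viaLoop = malloc(size);
    char *viaMemcpy = malloc(size);
    assert(src && viaLoop && viaMemcpy);
    for (size_t i = 0; i < size; i++)
        src[i] = (char)(i % 127); /* arbitrary, repeatable test pattern */
    (void)time_serialize(LoopSerializeMemory, viaLoop, src, size);
    (void)time_serialize(MemcpySerializeMemory, viaMemcpy, src, size);
    int same = memcmp(viaLoop, viaMemcpy, size) == 0;
    free(src);
    free(viaLoop);
    free(viaMemcpy);
    return same;
}
```

Checking that the outputs match before comparing timings matters: a faster function that writes different bytes isn't an optimization. For a real measurement, you'd want to call time_serialize many times per size and compare the totals, since a single run over a small block is well below what clock() can resolve.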

This isn't surprising: memcpy is highly optimized and can leverage CPU-level operations to make copying memory extremely fast.

The other thing that isn't surprising is what happened after Stefano showed the seniors his results. He demonstrated speedups on actual workloads on the order of ten times. He pointed out that memcpy was a standard function that they definitely had and could use on their target platforms.

And the seniors patted Stefano on his head, said, "Use the library we built, don't try and be clever," and then sent him on his way.

Stefano didn't use their library. He continued to use his memcpy implementation. Performance crept up as he made changes, and no one complained. That program is still in production somewhere, all these years later.