The Keys to Cloud Storage

When you want to store data in Amazon’s S3 cloud-based storage, you have to assign that data a key. In practice, this looks and behaves like a filename, but the underlying APIs treat it like a key/value store, where the value can be a large data object.

S3 is flexible and cost-effective enough that Melinda’s company decided to use it for logging HTTP requests to their application. These requests often contained large data files for upload, and those files might need to be referenced in the future, so a persistent and reliable storage was important.

Each of these incoming HTTP requests had a request_id field, so a naive implementation of logging would be to write the body of the request to an S3 key following a pattern like requests/c418b58b-164d-4e1f-970b-ed00dea855b6. For a number of reasons, however, clients might send multiple requests using the same request_id. Since a logging system that overwrites old logs would be pretty terrible, they decided that each log file also needed an ID, so they could write them out with keys like requests/c418b58b-164d-4e1f-970b-ed00dea855b6/${x}, where ${x} was the ID of the log file.

The developer responsible for implementing this decided that ${x} should be an auto-incremented number. This presented a problem, though: how on earth could they keep that index in sync across all of their API nodes?

function findFreeKey(bucket, key, append, startNum, callback) {
        var testPath = key;
        if (typeof startNum != 'number' || startNum < 0)
                startNum = 0;
        else if (startNum > 0)
                testPath += (append ? append : '') + startNum;
        get(bucket, testPath, function(err) {
                if (err) {
                        if (err == 404)
                                callback(null, testPath);
                        else
                                callback(err);
                }
                else
                        findFreeKey(bucket, key, append, startNum + 1, callback);
        });
}

function get(bucket, key, callback) {
        var client = getClient(bucket);
        var req = client.get(key);
        req.on('response', function(res) {
                if (res.statusCode >= 200 && res.statusCode < 300) {
                        res.setEncoding(CHARSET);
                        var str = '';
                        res.on('data', function(chunk) {
                                str += chunk;
                        });
                        res.on('end', function() {
                                callback(null, res, str);
                        });
                }
                else
                        callback(res.statusCode, res, null);
        });
        req.on('error', function(err) {
                callback(err);
        });
        req.end();
}

The core idea of this code is that instead of trying to keep the autoincremented index in sync, instead just start at zero, and fetch requests/c418b58b-164d-4e1f-970b-ed00dea855b6/0. If you get a 404, great! Use that key to write the file. If there’s actually data at that key, try fetching requests/c418b58b-164d-4e1f-970b-ed00dea855b6/1. And so on.

This, of course, does nothing to defend against race conditions. There was no requirement that ${x} be sequential, there was never a need to order these log files that way, so the developer could have used a UUID for each log file, and there would have been no problems. That’s not the actual problem with this code, though.

Note the line var req = client.get(key). This uses the Amazon S3 API to get the object located at that key- the entire object, including the data. These requests could contain large data files, and the entire body would be downloaded. It should be noted that there is a perfectly good listObjects function which can simply return a list of used keys with a single request.

So, each time a request_id was reused, the logging of that request took longer and longer, as every single previous request with that request_id needed to be re-downloaded in its entirety before the system could finish logging. It should also be noted that S3 does charge you based on both the content stored there, and how much bandwidth you use.

Melinda noticed this atrocity, and thought her trendy, self-organizing, and democratic team might want to tackle it. Each week, everyone is allowed to nominate a piece of ugly technical debt, and then the team votes for what they want to tackle first. Over the past two years, they’ve replaced their terrible test-fixtures with merely bad ones, they’ve swapped out the ORM tool that no one liked with an ORM tool that only the technical lead likes, and they’ve cycled through every JavaScript build system out there before deciding that they’re better off with an in-house solution.

In those two years, no matter how many times Melinda nominated this particular block of code, it’s remained their lowest-priority piece of technical debt.

[Advertisement] BuildMaster allows you to create a self-service release management platform that allows different teams to manage their applications. Explore how!