One of Raquel's junior developers was having some troubles. They had a Lua script that needed to call out to a Redis store to fetch data. The poor developer was getting useless "failed to communicate to redis," message and needed help figuring out what was going on.
Of course, the Lua script wasn't the problem. The company was building a full on Inner-Platform. The core application stack was actually in Java, but it was extensible with Lua scripts. The Java code was meant to handle all the plumbing and interacting with the data store, and the Lua code could then just be where developers put their business logic.
The Lua code was fairly straightforward, at least at a glance:
local keys = redis.call('ZRANGE', some_very_big_sorted_set, min_index, max_index)
local values = redis.call('HMGET', some_very_big_hash, unpack(keys))
The Java code, on the other hand, was more obviously problematic:
private static final Predicate<Throwable> somePredicate = throwable -> {
return throwable instanceof RedisScriptException;
};
private static RuntimeException convertException(Throwable throwable) {
if (somePredicate.test(throwable) && throwable.getCause().getMessage().contains("-OOM")) {
return new ResourceUnavailableException("resource unavailable", throwable);
} else if (somePredicate.test(throwable) && throwable.getCause().getMessage().contains("CLUSTERDOWN")) {
return new GenericStorageException("cluster unavailable", throwable);
} else if (throwable.getCause().getMessage().contains("KEY_DOES_NOT_EXIST")) {
return new KeyDoesNotExistException("key does not exist", throwable);
} else if (throwable instanceof SomeInternalException) {
return new SomeInternalException(throwable.getMessage(), throwable);
} else if (throwable instanceof RedisCommandTimeoutException || throwable instanceof QueryTimeoutException) {
return new GenericTimeoutException("command timeout", throwable);
} else if (throwable.getCause().getMessage().contains("Redis is loading the dataset in memory")) {
return new RedisLoadingException("Redis is loading the dataset in memory", throwable);
} else if (throwable.getCause().getMessage().contains("stack size exceeded")) {
return new ResourceExhaustedException("stack size exceeded", throwable);
}
return new GenericStorageException("failed to communicate to storage", throwable);
}
Now, there's nothing wrong with trying to convert Redis exceptions into your own application specific exceptions, but given the mechanics of try/catch, there are better ways. What's specifically problematic here are the cases where it's actually checking the error message itself and matching strings, instead of checking types. But it's also unnecessary, since the Redis driver in question supports registering callbacks for any given kind of error, so you don't need this block.
But the block also gets weird- why is there a case where SomeInternalException
gets wrapped in a new instance of SomeInternalException
with the same message?
It was the "stack size exceeded" one which got Raquel's attention. That didn't seem like an error that should be coming from any of the Java layers, and in fact, it was not. The Lua runtime has a method call lua_checkstack
which checks if a Lua call has enough available stack space to execute. The stack has a configurable depth, but a few thousand elements is a reasonable expectation.
Which brings us back to the Lua code:
local values = redis.call('HMGET', some_very_big_hash, unpack(keys))
The HMGET
operation is variadic, taking an arbitrary number of keys as parameters (instead of a table of keys as parameters). Each of those keys needs to live on the stack in this scenario, which means if you have more than a few thousand keys, you've blown out your stack space.
This is a situation where several WTFs piled together. The interaction between the Java layer and the Lua scripts was ill thought out. The Java layer exception handling was built to be fragile, and elsewhere in the chain, it just eradicated the error message anyway, leaving the junior developer scratching their heads, with no idea that the "fix" was just to narrow their search range into something that the Lua stack could handle.
And finally, as Raquel writes:
The truly depressing part was that this abomination wasn't just checked in whole - several different engineers contributed to it, and all the code was reviewed and signed off on.
As the saying goes: no one of us can be as dumb as all of us.