
When people think about government, they usually think about a President or Prime Minister, Senators, MPs, or what have you. But government isn't just a handful of people at the top of the food chain: there's government all the way down to the city level, quietly making the country run. Driver's licenses have to be issued, as do pet licenses. Buildings have to be inspected and certified. All those elevator certificates get printed up somewhere. Increasingly, these small functions are being computerized—in bits and pieces, in incompatible systems—and hooked up to the Internet.

Lisa was the lead engineer for one of these public websites. At its core, it took in personally identifying details and spat out some sort of official document. This meant her team had to deal with the PII issues that come with taking people's information: encrypting and salting the data, securing the database backend, et cetera.

One of the pieces in this chain was a separation of data: until the user had paid for the document, proving their identity (or at least their possession of the credit card for the person they claimed to be), their data sat in a frontend database accessible to the Internet. After payment was taken, the data was sent to a more secure database in the backend and removed from the potentially hackable frontend. The frontend ran in a VM that could not open connections to the backend: it could receive incoming connections and respond, but not initiate them. Basic security for this type of system.

There was one issue, however, that Lisa struggled to track down. It seemed that a small percentage of users, fewer than 1%, were getting an error page immediately after payment. Their application was fine; payment was received, and their document was sent to them along with a confirmation. But they saw an error page suggesting they hadn't completed their transaction.

When Lisa managed to catch the issue in the act, she was able to reconstruct the sequence from the logs:

  • The user entered their card info
  • The payment processor accepted the payment
  • The application marked the record as paid
  • The application responded with a redirect to the confirmation page
  • The transfer service kicked in and moved the data
  • The user landed on the confirmation page—and the record was no longer there to display.

In other words, a timing issue. So far, this was just a run-of-the-mill everyday problem. Race conditions happen all the time, after all. The problem was, Lisa had already fixed this race condition: the logic indicated that the system would wait at least 5 minutes before moving the data from the frontend to the backend, to allow for the confirmation page to be generated.
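The delay Lisa had added can be sketched roughly as follows. This is a minimal Python illustration, not the site's actual code: the names `TRANSFER_DELAY`, `is_eligible_for_transfer`, and the `paid_at` timestamp are assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical constant: the minimum wait described in the story.
TRANSFER_DELAY = timedelta(minutes=5)

def is_eligible_for_transfer(paid_at: datetime, now: datetime) -> bool:
    """Return True once at least TRANSFER_DELAY has passed since payment."""
    return now - paid_at >= TRANSFER_DELAY

# Illustrative timestamps, not taken from any real log.
paid_at = datetime(2024, 1, 1, 12, 0, 0)
ten_seconds_later = paid_at + timedelta(seconds=10)
five_minutes_later = paid_at + timedelta(minutes=5)

print(is_eligible_for_transfer(paid_at, ten_seconds_later))   # False
print(is_eligible_for_transfer(paid_at, five_minutes_later))  # True
```

On paper, a check like this leaves the confirmation page plenty of time to render before the record disappears.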

So, what gives? Lisa wondered.

The logic was still in place. In fact, the logs showed that the data hadn't been moved until 5 minutes after it was marked to be moved. But the confirmation page had generated in mere seconds. How could this possibly have occurred?

"It just doesn't make sense," she complained to her coworker.

"What time is it on the server right now?" he asked with a frown.

And that was it: the transfer service's clock was 6 minutes off from the frontend box's. To the backend box, a freshly paid record already looked more than 5 minutes old, so it would move the data as soon as the record was marked eligible for transfer. If the periodic service ran right when the row was generated, the user would get an error.

With a groan, Lisa put in a ticket to have both boxes sync to a time server.
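Why a skewed clock defeats a 5-minute wait can be shown in a few lines of Python. This is only a sketch under stated assumptions: the backend clock is taken to run 6 minutes ahead of the frontend (the story says only that it was "off", so the direction is inferred from the symptoms), and `BACKEND_SKEW`, `eligible`, and the timestamps are hypothetical names and values.

```python
from datetime import datetime, timedelta

TRANSFER_DELAY = timedelta(minutes=5)  # the wait described in the story
BACKEND_SKEW = timedelta(minutes=6)    # assumption: backend clock runs ahead

def eligible(paid_at: datetime, now: datetime) -> bool:
    """Return True once at least TRANSFER_DELAY separates the two readings."""
    return now - paid_at >= TRANSFER_DELAY

# One real instant, read from two different clocks.
frontend_now = datetime(2024, 1, 1, 12, 0, 0)
backend_now = frontend_now + BACKEND_SKEW

paid_at = frontend_now  # the frontend stamps the record at payment time

# The backend compares its own clock against the frontend's stamp.
skewed = eligible(paid_at, backend_now)
# Both readings taken from the same clock.
same_clock = eligible(paid_at, frontend_now)

print(skewed, same_clock)  # True False
```

With the two clocks in agreement, as the time-server ticket would ensure, the check behaves as intended. Taking both readings from a single clock would be another way to sidestep the skew.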
