Steam. AT&T. Marks and Spencer. Bebo. What do they all have in common? One morning, as customer support tickets rolled in at her online retail firm, it became Belle's job to find out.
Belle had inherited the codebase, which had originally been made overseas in India. It seemed as solid as any offshored website could be—which is to say, it had a million different quirks and the source code made her want to vomit, but it had been tested until the BA got tired of filing bugs and just decided to put it in prod.
So Belle expected it to get basic functionality like ordering a product right. Alas, that assumption was unfounded. While the BA's testing had been, for all intents and purposes, single-threaded, no sooner had the doors been opened to the public than reports came in of users being greeted by the wrong name, and with the wrong list of stored credit cards. Once the Ops guys rolled it back, Belle girded her loins and dove into the code head-first.
The first thing she found while tracing the retrieval of user credentials was this gem:
public static UserInfo GetUser()
{
UserInfo user = null;
if (HttpContext.Current.Session == null)
user = (UserInfo)Thread.GetData(Thread.GetNamedDataSlot("USER"));
else
{
try
{
user = (UserInfo)HttpContext.Current.Session["USER"];
}
catch (Exception ex)
{
}
}
return user;
}
Nobody was quite sure why it used thread-local storage. At first glance, it just seemed like a belt-and-suspenders approach. After all, HttpContext.Current
did something very similar conceptually, but bounded to a single request instead of a thread.
And therein lay the problem. ASP.NET used a thread pool, meaning data stored for one user in the thread-local storage would be reused for the next session-less user.
Compounding the issue was the following snippet:
[OutputCache(Duration = 1)]
public class ControllerBase : Controller
{
//Boring stuff here
}
This would normally be fine, if a little strange; the cache duration being 1 second would raise some eyebrows, but there were worse caching strategies. However, because the user information was fetched from thread-local storage instead of being part of the controller's responsibility, there was nothing to break the cache for different users. They all seemed identical to the controller. So after User 1's thread was reused by User 2, everyone who hit the site within one second of User 2 also saw User 1's data, session or no session.
There's a reason cache invalidation is one of the Two Hard Things in Computer Science.
Fixing the cache timeouts was easy, but corporate policies got in the way of changing the bizarre session problems. That was "logic," and changing programming logic belonged to the team that wrote the code: offshore. Belle escalated up to management with an explanation of the thread-local problem. Management escalated to offshore, they debated whether this was rework or a bug (for the purposes of deciding who would pay for it), and eventually, someone fixed it ... by making sure to spin up a new thread before calling GetUser
.