The Spider of Doom

Josh Breckman worked for a company that landed a contract to develop a content management system for a fairly large government website. Much of the project involved developing a content management system so that employees would be able to build and maintain the ever-changing content for their site.

Because they already had an existing website with a lot of content, the customer wanted to take the opportunity to reorganize and upload all the content into the new site before it went live. As you might imagine, this was a fairly time consuming process. But after a few months, they had finally put all the content into the system and opened it up to the Internet.

Things went pretty well for a few days after going live. But, on day six, things went not-so-well: all of the content on the website had completely vanished and all pages led to the default "please enter content" page. Whoops.

Josh was called in to investigate and noticed that one particularly troublesome external IP had gone in and deleted *all* of the content on the system. The IP didn't belong to some overseas hacker bent on destroying helpful government information. It resolved to googlebot.com, Google's very own web crawling spider. Whoops.

After quite a bit of research (and scrambling around to find a non-corrupt backup), Josh found the problem. A user copied and pasted some content from one page to another, including an "edit" hyperlink to edit the content on the page. Normally, this wouldn't be an issue, since an outside user would need to enter a name and password. But, the CMS authentication subsystem didn't take into account the sophisticated hacking techniques of Google's spider. Whoops.

As it turns out, Google's spider doesn't use cookies, which means that it can easily bypass a check for the "isLoggedOn" cookie to be "false". It also doesn't pay attention to Javascript, which would normally prompt and redirect users who are not logged on. It does, however, follow every hyperlink on every page it finds, including those with "Delete Page" in the title. Whoops.

After all was said and done, Josh was able to restore a fairly older version of the site from backups. He brought up the root cause -- that security could be beaten by disabiling cookies and javascript -- but management didn't quite see what was wrong with that. Instead, they told the client to NEVER copy paste content from other pages.

[Advertisement] BuildMaster allows you to create a self-service release management platform that allows different teams to manage their applications. Explore how!