Rebecca inherited some code that’s responsible for gathering statistical data from a network interface. It was originally written a decade ago by one of those developers who wanted to make certain their influence was felt long after they left the company.

The code was supposed to write to one of two log files: a “quick log”, with 2-second resolution (but only for the last minute’s data), and a “full log”, with 1-minute resolution.

Unfortunately, it would often fail to do anything with the full log. Frustrated that this code, which had lived in a shipping product for over a decade, was so unreliable, Rebecca dug in to see what the problem was.

#define FULL_SAMPLE_DELAY 60

void dostats(FILE *quicklog, FILE* mainlog) {
        //[code omitted for brevity]
        while(!done) {
                sleep(2);
                if(!(times.now.t.tv_sec % FULL_SAMPLE_DELAY)) {
                        // main samples
                        stats.save(mainlog, times);
                }
                else {
                        // quick samples 
                        quickstats.summarise(quicklog, quickstats_top_n);
                }
        }
}

Rebecca’s predecessor had the good sense to use sleep() to keep the loop from spinning the CPU, but made one major error: he assumed that calling quickstats.summarise took no time. Even if the loop started at exactly the right time, the time spent executing quickstats.summarise guaranteed that the loop would eventually drift, the current time would stop lining up with an even minute, and the full log would become unreliable.
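One way to make a check like this tolerate drift is to ask whether the clock has moved into a new minute since the last full sample, rather than whether the loop happened to wake up exactly on the boundary. The sketch below is only an illustration of that idea, not the fix Rebecca applied: every name in it (kFullSampleDelay, interval_index, exact_boundary, full_sample_due, replay) is hypothetical, and it works on plain second counts instead of the original times structure.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names throughout; none of this comes from the original code base.
constexpr std::int64_t kFullSampleDelay = 60;  // seconds, mirrors FULL_SAMPLE_DELAY

// Index of the 60-second interval that contains now_sec.
constexpr std::int64_t interval_index(std::int64_t now_sec) {
    return now_sec / kFullSampleDelay;
}

// Original-style check: true only when the loop wakes exactly on a boundary.
constexpr bool exact_boundary(std::int64_t now_sec) {
    return now_sec % kFullSampleDelay == 0;
}

// Drift-tolerant check: true once now_sec has moved into a later interval
// than the last one that received a full sample.
constexpr bool full_sample_due(std::int64_t now_sec, std::int64_t last_interval) {
    return interval_index(now_sec) > last_interval;
}

// Number of full samples each rule would take over a list of wake-up times.
struct Counts {
    int exact;
    int tolerant;
};

// Replays a sequence of wake-up times (in seconds) against both rules.
inline Counts replay(const std::vector<std::int64_t>& wakeups) {
    Counts c{0, 0};
    if (wakeups.empty()) return c;
    // Treat the interval of the first wake-up as already sampled.
    std::int64_t last = interval_index(wakeups.front());
    for (std::int64_t t : wakeups) {
        if (exact_boundary(t)) ++c.exact;
        if (full_sample_due(t, last)) {
            ++c.tolerant;
            last = interval_index(t);
        }
    }
    return c;
}
```

Replaying wake-ups at seconds 1, 3, 5, ... up to 601 shows the difference: an odd second count is never a multiple of 60, so the exact-boundary rule takes no full samples at all, while the interval rule takes one per minute crossed. In a real loop the caller would refresh the clock after each sleep(2) and keep last_interval between iterations.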


