Managing datasets is always a challenging task. So when Penny's co-worker needed to pull a pile of latitude/longitude positions out of one dataset and prepare them for processing in a C++ program, she turned to the tools she knew best: Python and C++.

Now, you or I might have dumped this data to a CSV file. But this co-worker is more… performance-minded than the rest of us. So the Python script didn't generate a CSV file. Or a JSON document. Or any standard data file. No, that Python script generated a C++ file.

// scraped using record_data.py
const std::vector<GpsPt> route_1 = {
    { 35.6983464357, -80.4201474895},
    { 35.6983464403, -80.4201474842},
    // several hundred more lines like this
};

const std::vector<GpsPt> route_2 = {
    { 35.8693464357, -80.1420474895},
    { 35.8693464392, -80.1420474821},
    // another thousand lines
};

// more routes like this

Now, there are clear advantages to compiling in thousands of data points instead of reading them from a data file. First, no one can easily change the data points once you've built your code, which means no one can corrupt your data or hand you an invalid file. Second, the runtime performance is going to be significantly better, and your compilation will be much slower, which encourages developers to think more carefully about their code before they hit that compile button.
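For contrast, here's a minimal sketch of the boring runtime approach: the Python script dumps one "lat,lon" pair per line to a CSV, and the C++ program parses it at startup. Both the two-double layout of GpsPt and the file name are assumptions; neither appears in the original snippet.

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Assumed definition: the original code never shows what GpsPt contains.
struct GpsPt {
    double lat;
    double lon;
};

// Reads one "lat,lon" pair per line, e.g. "35.6983464357,-80.4201474895".
std::vector<GpsPt> load_route(const std::string& path) {
    std::vector<GpsPt> route;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        GpsPt pt{};
        char comma = 0;
        if (fields >> pt.lat >> comma >> pt.lon) {
            route.push_back(pt);
        }
    }
    return route;
}

int main() {
    // "route_1.csv" is a hypothetical file name, used here for illustration.
    const std::vector<GpsPt> route_1 = load_route("route_1.csv");
    return route_1.empty() ? 1 : 0;
}

A few dozen lines of parsing, and the data can change without anyone touching the compiler. But where's the fun in that?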

I think this is the future of high-performance computing, right here. No longer will we pay the steep price of parsing data, or let it change out from under us without a recompile. Burn that data into your code.