Many years ago, Yazeran needed to work with some floor-plan data. Now, fortunately for Yazeran, a vendor had already collected this information and packed it up behind a SOAP-based web service, so it would in theory be easy to get. This was long enough ago that SOAP was all the rage, and computers with multiple gigabytes of RAM were still on the higher end of things.
In fact, there was an end-point called GetFullDataExtract which promised to, as the name implies, get him all the data. Yazeran didn't need all the data, but the other end-point, GetGMLBuildings, returned only the subset of data Yazeran didn't need. So Yazeran simply had to request too much.
Yazeran fired up a wget session with all the appropriate credentials and waited. And waited. And waited…
When the 20MB XML response finally downloaded, minutes later, it looked like this:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://tempuri.org/"><DataExtract>
<Location ID="13" MasterID="13" Name="Nunich " IsMunicipality="true" ParentMasterID="" DisplayName="Munich ">
<Estate ID="31" MasterID="10" LocationID="13" LocationMasterID="13" Name="Munich Campus" Description="" DisplayName="Munich Campus">
<Building ID="6462" MasterID="323" EstateID="31" EstateMasterID="10" Name="Munich" Description="" Address="Kings Road" AlternativeName="" Number="15" City="Munich" ZipCode="6542" BasementGrossArea="17474.495491066667" GrossArea="56913.5818959713" NetArea="47924.600000000079" DisplayName="Munich">
<GIS>
<OuterPolygon>
<Coordinate x="45.73155848600081" y="11.395289797465072" />
<Coordinate x="45.73133750252103" y="11.396345176530383" />
<Coordinate x="45.73158863712057" y="11.396510410132915" />
</OuterPolygon>
</GIS>
<Drawing ID="8587" MasterID="1010" BuildingID="6462" BuildingMasterID="198" EstateID="31" EstateMasterID="8" DrawingTypeID="3">
<Floor ID="8640" MasterID="1010" FloorNumber="0" FloorType="1" DrawingID="8587" GrossArea="131.99712099999974" DrawingMasterID="1010">
<Room ID="364742" MasterID="41627" DrawingMasterID="1010" FloorMasterID="1010" NetArea="132" EstimatedGrossArea="132">
<LayerData ID="231389" LayerFieldID="24" Name="ROOM NR." Value="g01" />
<LayerData ID="231390" LayerFieldID="27" Name="TYPE" Value="ROOM" />
<LayerData ID="231391" LayerFieldID="28" Name="CATEGORY" Value="STORRAGE" />
<LayerData ID="231392" LayerFieldID="30" Name="BRUGER" Value="PHYSICS" />
<LayerData ID="231393" LayerFieldID="34" Name="x_UID" Value="B712D571AB564658999E75D65543" />
</Room>
</Floor>
</Drawing>
</Building>
</Estate>
</Location>
<Location ID="14"....
<!-- snip some 200k lines... -->
</Building>
</Estate>
</Location>
</DataExtract></string>
They produced an XML document with one element, <string>, and within that element put an escaped version of another XML document.
XML is large and bureaucratic and complicated, but that complexity comes with benefits: namespaces, schemas, validation, and so on. None of that is possible when you've just mashed your XML into text and then wrapped the text in XML again.
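To make that concrete, here is a minimal Python sketch. The payload is a made-up, heavily shortened stand-in for the vendor's response, not the real data. It shows that a parser sees nothing but a single text node inside the outer document, and that a second parse is needed before any real element is reachable.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the vendor response: the inner
# document is escaped and stored as the text of a single <string> element.
response = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://tempuri.org/">'
    b'&lt;DataExtract&gt;&lt;Location ID="13" /&gt;&lt;/DataExtract&gt;'
    b'</string>'
)

# First parse: the outer document has one element and no child elements.
outer = ET.fromstring(response)
print(len(outer))    # number of child elements under <string>
print(outer.text)    # the inner document, un-escaped by the parser, as plain text

# Second parse: only now do the real elements and attributes exist.
inner = ET.fromstring(outer.text)
print(inner.tag, inner[0].tag, inner[0].get("ID"))
```

Any schema or validation tooling pointed at the outer document would stop at that string, since as far as XML is concerned there is nothing else in there.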
There are other problems with this, besides the obvious. XML parsers are notoriously memory intensive. Your core options are a DOM-based parser, which loads the entire document into memory, and thus is very expensive, or a SAX-based parser, which streams the document and emits events as it encounters nodes in the document. SAX is more difficult to use, but is potentially faster and definitely more memory efficient.
Unless, of course, your XML document just contains one giant text node. In that case, SAX has to load the entire document into memory anyway, and all those efficiency gains vanish. In fact, when Yazeran tried to parse this as-is, it took 40+ minutes and involved a lot of paging.
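A rough way to see the difference, sketched in Python with the standard xml.sax module. The Counter handler and the miniature documents are invented for illustration. On the properly nested document, the handler gets one small event per element. On the wrapped version, it gets a single element and then the whole inner document as character data. The parser may deliver that text in several chunks, but you still have to reassemble all of it before you can do anything useful.

```python
import xml.sax
from xml.sax.saxutils import escape


class Counter(xml.sax.ContentHandler):
    """Counts element starts and the total amount of character data seen."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.text_chars = 0

    def startElement(self, name, attrs):
        self.elements += 1

    def characters(self, content):
        # May be called several times for one long text node.
        self.text_chars += len(content)


def scan(document: str) -> Counter:
    handler = Counter()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler


# Hypothetical miniature of the real structure: 1000 rooms in one extract.
inner = "<DataExtract>" + '<Room ID="1" />' * 1000 + "</DataExtract>"
# The same data, escaped and stuffed into a single wrapper element.
wrapped = "<string>" + escape(inner) + "</string>"

plain = scan(inner)
boxed = scan(wrapped)
print(plain.elements, plain.text_chars)  # many elements, no text
print(boxed.elements, boxed.text_chars)  # one element, everything else is text
```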
Now Yazeran was using Perl to do this work, and fixed this with the obvious tool at hand: regexes. No, not parsing XML with regexes, but "un-escaping" the internal text. Since Yazeran was already munging some data, it was easy to add a few more regexes which stripped out the unnecessary data in the document, like the large blocks of Coordinate elements.
That shrank the document down to a reasonable size that could be parsed using a DOM parser in under 5 minutes.
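Yazeran's Perl isn't shown, so what follows is only a Python sketch of the same idea, not the original code. The function names and the sample payload are hypothetical, and it assumes the wrapper only uses the five predefined XML entities (no numeric character references). It peels off the wrapper and un-escapes the text with regexes, drops the Coordinate elements, and hands what is left to an ordinary tree parser.

```python
import re
import xml.etree.ElementTree as ET

# The five predefined XML entities; numeric references are not handled here.
ENTITIES = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&apos;": "'", "&amp;": "&"}


def unwrap(raw: str) -> str:
    """Return the inner document, un-escaped by exactly one level."""
    match = re.search(r"<string[^>]*>(.*)</string>", raw, re.DOTALL)
    if match is None:
        raise ValueError("no <string> wrapper found")
    # A single pass, so "&amp;lt;" correctly becomes "&lt;" rather than "<".
    return re.sub(
        r"&(?:lt|gt|quot|apos|amp);",
        lambda m: ENTITIES[m.group(0)],
        match.group(1),
    )


def strip_coordinates(document: str) -> str:
    """Remove self-closing <Coordinate ... /> elements and trailing whitespace."""
    return re.sub(r"<Coordinate\b[^>]*/>\s*", "", document)


# Hypothetical, shortened stand-in for the vendor response.
raw = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<string xmlns="http://tempuri.org/">'
    '&lt;DataExtract&gt;\n'
    '&lt;Building ID="6462" Name="Munich"&gt;\n'
    '&lt;GIS&gt;&lt;OuterPolygon&gt;\n'
    '&lt;Coordinate x="45.7" y="11.3" /&gt;\n'
    '&lt;Coordinate x="45.8" y="11.4" /&gt;\n'
    '&lt;/OuterPolygon&gt;&lt;/GIS&gt;\n'
    '&lt;Room ID="364742" NetArea="132" /&gt;\n'
    '&lt;/Building&gt;\n'
    '&lt;/DataExtract&gt;'
    '</string>'
)

slimmed = strip_coordinates(unwrap(raw))
root = ET.fromstring(slimmed)  # a plain tree parse now works on the smaller document
rooms = [room.get("ID") for room in root.iter("Room")]
print(root.tag, rooms)
```

The regexes never try to understand the XML structure. They only undo the escaping and delete elements whose shape is completely predictable, which is about as far as regexes should be trusted with XML.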
Yazeran sums it up thus:
It corresponds to writing a letter, putting it in an envelope, adding postage, and then instead of sending the letter, putting it into yet another envelope and adding postage before sending… (in effect paying twice for the same service)