Admin
It seems there's a market for companies specializing in sanitizing this kind of data. Only, who would voluntarily do such a thing full-time? ...I think I just invented the perfect prison sentence for cyber criminals :-D
Admin
Politics data is always so horrible. My master's thesis needed to aggregate data from the World Bank, the UN, the CIA Factbook and a few others. Have fun merging by country name when every data source calls a country something different: USA, United States (of America), etc. Gaaah!
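For what it's worth, the usual workaround for the country-name mess is an alias table keyed on a normalised spelling. A minimal sketch, assuming a hand-maintained table (`ALIASES` and `canonical_country` are made-up names for illustration, and a real table would need far more entries):

```python
# Hypothetical alias table: normalised spelling -> canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def canonical_country(name: str) -> str:
    """Map a source-specific country spelling to one canonical name."""
    # Lower-case, drop parentheses, and collapse runs of whitespace.
    key = " ".join(name.lower().replace("(", " ").replace(")", " ").split())
    # Unknown names pass through unchanged so they can be reviewed by hand.
    return ALIASES.get(key, name.strip())


print(canonical_country("United States (of America)"))
```

Anything that falls through the table unchanged is a candidate for a new alias, which is roughly how these tables grow in practice.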
Admin
The real WTF here is the rant about the majority field. I don't know how it is done over there, but in the UK it is standard practice to talk about a candidate winning with a majority of n votes, that is, the number of votes the person won by compared to the second-place candidate. So the field appears to be perfectly and correctly named to me. In a first-past-the-post system it isn't important whether a person got more than 50% of the vote, just that they got 1 more than the second-place person.
Admin
Funny, as soon as I saw the sentence "There's a column labeled "Majority".", I thought, "you mean, by how much the candidate won over the next candidate?"
It's also how the word is used in French, eh. And by the newscasters and newspaper headline writers in Canada.
It's almost as if Canada is a different country, and almost as if Canadian English uses more of the British meanings and spellings, especially while they're on their chesterfields.
Admin
Ding ding ding! Exactly. If all you know is a two party system, it makes no sense. If you live and work in Canada, and it confuses you, you need to go back to school. If you work for Elections Canada and it confuses you, well you have no excuse.
Same deal with the encoding. It makes sense when you have ridings that are 100% French, they submit in the format they work in, UTF-8. If you are submitting from the vast majority of the rest of the country where it is 100% English, 8859. If you are in a mixed riding, you use whatever supports both. Also the reason for the crazy name fields. Quebec has a different basis for their legal system (civil vs common), and it seems most women take both surnames, hyphenated. Not sure if they must, but a lot do, certainly a lot more than the rest of the country. Also, culturally, many French first names are double names "Jean-Guy, Jean-Paul, Jean-Gabriel, Marie-Claire".
Doesn't make it easy, but it is pretty much normal for Government IT work in Canada. These are systemic WTF's, not ignorant ones, and are not unique when dealing with Federal data in Canada.
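If the files really are a mix of UTF-8 and ISO-8859-1, one common way to cope is to try the strict encoding first and fall back to the permissive one. This is only a sketch under that assumption; `decode_bytes` is a made-up helper name, and it cannot tell ISO-8859-1 apart from other single-byte encodings:

```python
def decode_bytes(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else as ISO-8859-1."""
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return raw.decode("iso-8859-1")


print(decode_bytes("Québec".encode("utf-8")))
print(decode_bytes("Québec".encode("iso-8859-1")))
```

The order matters: accented ISO-8859-1 text is almost never valid UTF-8, whereas any byte sequence is valid ISO-8859-1, so trying it first would silently produce mojibake.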
Admin
My mom, who works at a scientific journal, once asked me to make (from scratch) a database of all articles in previous publications, and their authors.
Her .xls files were somewhat close to this "table-12".
Admin
Yeah, does this perhaps mean the task of dealing with the data has been outsourced to someone not familiar with ward-based parliamentary systems? That would be a pretty hefty WTF.
And that problem with all the descriptors, delimiters, order changes ... that's SOP for government departments where the objective is just to get the data out and published, with no proper metric to check the structure, as long as the numbers are right. That's been my life every April for a while. Yeah, it's a PITA, but you end up just writing a bit of an inner-platformy parser and kicking it until it consumes everything. It's a WTF, but once I'd sorted mine it just took a couple of hours once a year to fiddle with it for the latest abomination.
Accents, if not the entire French language, is the real WTF
Admin
You're right. TRWTF is UK democracy.
Admin
Designed? This kind of real world data table isn't designed. Canadians are lucky it exists. Organizing world data is hard. And it's worth doing. If Vicki does a good job of this, maybe they'll let her put a nice tablet in that beautiful library in Ottawa in the Parliament building, so people can look up the results.
Admin
My favorite csv files that I've had to write a parser for have decimal numbers that are formatted according to the conventions of different locales; the locales are not specified, and they occasionally change. For example, in the string '1.000' the period is either the thousands separator or the decimal separator, and both are used in different places of the same file. The same holds for '1,000'. And of course, since the numbers originate from SAP, all integral quantities are given with three decimal places. The parser basically reads the numbers using different locales, and then guesses which value looks most reasonable. For example, when the field is a quantity of really expensive machines that have been ordered, then we can be quite certain that we have a decimal separator before the three zeroes no matter what the thingy in there actually is. Until the day that someone actually buys 1000 of them at the same time, the day after someone finally decided to remove the extra zeroes from integers.
Oh, and I lied when I said they are formatted for different locales. The last few months some of the numbers have been coming with an ordinary space as the thousands separator. I'm pretty certain no actual locale does something that stupid.
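The "read it every way and keep the plausible one" approach described above can be sketched roughly like this. It is not the actual parser; `candidate_values`, `guess_quantity` and the plausibility bounds are assumptions made up for illustration:

```python
from decimal import Decimal, InvalidOperation


def candidate_values(text):
    """Return every value the string could mean under '.'/',' conventions."""
    # Spaces of any kind are treated as grouping characters and dropped.
    cleaned = text.replace(" ", "").replace("\u00a0", "").replace("\u202f", "")
    candidates = set()
    for decimal_sep, group_sep in ((".", ","), (",", ".")):
        # Drop the grouping character, then normalise the decimal mark to '.'.
        normalised = cleaned.replace(group_sep, "").replace(decimal_sep, ".")
        try:
            candidates.add(Decimal(normalised))
        except InvalidOperation:
            pass  # not a valid number under this convention
    return candidates


def guess_quantity(text, low, high):
    """Pick the single reading inside [low, high]; None if still ambiguous."""
    plausible = [v for v in candidate_values(text) if low <= v <= high]
    return plausible[0] if len(plausible) == 1 else None


# Nobody orders a thousand expensive machines, so '1.000' must mean one.
print(guess_quantity("1.000", 0, 50))
# With a wide plausible range the guess fails, as the comment predicts.
print(guess_quantity("1.000", 0, 5000))
```

Returning `None` instead of picking a winner keeps the failure visible, which matters on the day someone really does buy 1000 of them.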
Admin
Nice to see a WTF from my home country! Has anyone talked to you folks yet about a little thing called the Phoenix pay system?
Admin
But CSVs can totally have a mix of quoted and unquoted strings, it's in the RFC: https://tools.ietf.org/html/rfc4180
It's certainly a bit of a WTF that column layout is different every year, but I would want to see these CSV files myself before deciding who is really doing the WTF in terms of the CSV format here. If these files were saved from inside Microsoft Excel, they are probably 100% valid CSV files, and the developer doesn't actually fully understand all the nuances of the format.
With my current job we do daily imports from several different providers, so I have experienced this quite a bit myself. The problem I often find with CSV files that we receive from other systems is similar to a common problem we encounter with XML: the developer clearly wasn't using a proper tool to generate it. They were naively doing hand-rolled string concatenation, which generates output that looks like XML or a CSV file but doesn't cover edge cases, like how to handle characters which need escaping.
I am not so naive and as a C# developer, I either use .NET's built in XML serialization libraries or for CSV, use the spectacular CsvHelper NuGet package. Both of these handle all edge cases per spec and are quite configurable.
As an example, with one provider which sends us a daily "CSV" file, our import works almost every day, except occasionally when a field has a separator character in it. Fortunately I know they do this, so the import detects if a row has an unexpected number of columns and warns the user. I have also had experiences with XML where they haven't escaped things like angle brackets. My favourite with XML are providers which use SOAP/Web Services which return a single string, which is XML which they hand-rolled and which doesn't properly cover edge cases.
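To illustrate both points outside of C#, here is a rough sketch using Python's standard `csv` module instead of CsvHelper: a proper reader accepts the mix of quoted and unquoted fields that RFC 4180 allows, and rows with an unexpected column count get set aside for a warning. The function name `read_rows` and the sample data are invented for the example:

```python
import csv
import io


def read_rows(text, expected_columns):
    """Split parsed CSV rows into good ones and ones with a wrong column count."""
    good, suspect = [], []
    reader = csv.reader(io.StringIO(text, newline=""))
    for record_number, row in enumerate(reader, start=1):
        if len(row) == expected_columns:
            good.append(row)
        else:
            # Most likely an unescaped separator inside a field.
            suspect.append((record_number, row))
    return good, suspect


sample = (
    "id,name,votes\n"
    '1,"Smith, John",1200\n'   # quoted field containing the separator: fine
    "2,Jane Doe,900\n"         # unquoted field: also fine
    "3,Broken, Row,50\n"       # unescaped separator: one column too many
)
good, suspect = read_rows(sample, expected_columns=3)
for record_number, row in suspect:
    print(f"warning: record {record_number} has {len(row)} columns: {row}")
```

The column-count check only catches the rows where the stray separator changes the count; it cannot repair them, only flag them for a human.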
Admin
Pretty sure I read somewhere that South Africa uses space as thousands separator.
Admin
"Oh, and I lied when I said they are formatted for different locales. The last few months some of the numbers have been coming with an ordinary space as the thousands separator. I'm pretty certain no actual locale does something that stupid."
A lot of Europe does that. France (maybe others as well) combines the space AND the comma instead of a full stop decimals separator: "1 000 000,27" instead of "1,000,000.27"
Could be worse, though. In India, they use comma and full stop, just like the Brits and Americans. Well, no. Not just like them.
"10,00,000.27" or, worse, "1,00,00,00,234.27". Only the last group is three digits. All the others are two.
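The Indian grouping rule (last group of three digits, every group before it of two) is easy enough to write down. A small sketch for non-negative and negative integers only; `indian_grouping` is a made-up helper, not a library function:

```python
def indian_grouping(n: int) -> str:
    """Format an integer with Indian digit grouping, e.g. 10,00,000."""
    digits = str(abs(n))
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right-hand end of everything before
    # the final three digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    groups.append(tail)
    sign = "-" if n < 0 else ""
    return sign + ",".join(groups)


print(indian_grouping(1000000))     # 10,00,000
print(indian_grouping(1000000234))  # 1,00,00,00,234
```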
Admin
True, we use spaces as separators, and a comma to indicate decimal place. Weird, but true.
Admin
Officially maybe, but I have never in my entire career seen anyone actually use a comma as a decimal separator. Everyone uses a full stop.
Admin
I just worked on a government system this month where thousands had to be separated by spaces in currency values. Sure, I used the non-breaking variety, and the exported CSV files certainly don't have them. While the locale is different, it's still an officially mandated way of writing some numbers, not all. What's stupid about spaces, apart from you being used to a different convention?
Admin
In Belgium we use a space and a comma.
Of course as an IT'er I use periods and no thousand separators. Makes for a lot of fun when transferring data from and to colleagues.
Admin
This sounds like chapter 8 or 9 of "the book." Who-ever gets the reverence. I tip my hat to you ;)
Admin
I would like to welcome you to Europe, where several countries have dots as commas and commas as dots. Even spaces don't space out evenly where commas or dots would be.
Worst of the worst is the automatic switching Windows does: program A has dots for decimal separation and suddenly Excel only accepts commas.
Admin
Fun fact: The regional setting in Windows was always a full stop as the decimal separator for South Africa. Up until Windows 8, I think, when they changed it to the comma, which is the official separator but which no one actually uses. You can imagine the hilarity that ensued when all imports suddenly started failing with the error "invalid number format". So now you have to manually change the decimal separator in Windows on each new installation after setting the locale to South Africa. Each and every time.
Admin
TRWTF is the author assuming that every country in the world uses American conventions and definitions
Admin
Umm, spaces for thousands separators are used in most (all) countries that use the metric system. So that is every country, except the US and sometimes the UK. Want a real wrench thrown into your assumptions? The French (Quebec) use the comma as the separator for cents. So a thousand dollars and 10 cents is 1 000,10. Parse that.
Admin
The average American has never heard of things like 'internationalisation' and thus thinks 'their way' is the only way. They think different ways are weird simply because they have never been exposed to them. Kind of like the $5.00 vs 5,00$ vs 5$00 thing. I'm sure most of you have rarely seen those 3 ways of writing monetary values, but all 3 are standard for some place in the world. What looks weird to you might look normal to them, and vice versa.
If you are from the US and you are reading this post, please look up formats for: time, date, paper, number, monetary, telephone number, temperature, measurement, etc.
You would make our life easier and avoid miscommunication.
(Yes, I'm still bitter about the time I got laughed at for using 'military time'.)
Admin
I don't doubt it. This "database" sounds very much like it was exported from one or more spreadsheets. A real database would tend to be more strict and consistent, whereas you can throw almost anything into a spreadsheet.
Admin
As a Canadian programmer, I can say the TDWTF is the Canadian Government. My tax dollars well spent! /sarcasm Well, at least it makes my current WTF programming job not seem so bad to me now.
Admin
Here in the UK we often have half a dozen or more candidates, often with at least 3 and often more getting a substantial number of votes. The majority is always the difference between the winner and second place and is a very important piece of information. Sometimes, the person in third gets almost as many votes as the person in second. That makes no difference. According to Wikipedia (not the most reliable source, I know), the Canadian electoral system is not that different to ours. The article there doesn't explain a different meaning of majority. I think a lot of readers are puzzled as to what it means in this context.
Admin
I also live in South Africa and you're doing it wrong.
You've essentially decided to fight with default Windows settings for the rest of your life. Every time you need to change it for a new install, it will annoy you, meanwhile Windows installer is completely impassive and will just continue doing it this way every time, forever, unrelentingly with no emotion, as machines tend to do with any task assigned to them.
The better solution as a software developer is to make your code robust enough to handle the (super annoying) change, I manage it fine on the systems we build and we had a mix of Windows 7 and Windows 8 users/developers for a while.
Admin
You are 100% correct, the Canadian system is very similar to the UK system, everything we do is similar to you, except we decided to drive on the correct side of the road ;). Majority has various meanings, in common discussion it is more than 50%, in elections it means what you say it does, for the same reasons. Multiple candidates, from more than 2 parties with substantial votes for each. In this context, being sourced to Elections Canada, it is the diff between 1st and 2nd.
TRWTF is someone who works for Elections Canada, not understanding what it means. If you are working on EC data, you must be a Canadian Citizen, not some outsource shop in Iowa or Mumbai. The only thing in that whole article that was a WTF was the non standard, ever changing layout, everything else was what we would call Thursday in a GoC IT environment. That or GoC IT is TRWTF, and I am open to that possibility too.
Admin
"Quebec has a different basis for their legal system (civil vs common), and it seems most women take both surnames, hyphenated. "
Actually this is not permitted. We got rid of this practice 36 years ago. You see many people with two hyphenated surnames because children can have that, and it was all the rage for some time. Now the trend seems to have abated and we see more children with only one surname.
The Quebec civil code states that both spouses keep their own name, legally. What they do socially is their own business, but any legal or financial document may only bear the woman's maiden name. The principle of name stability is paramount.
Exceptions are few as the woman must demonstrate that her maiden name causes her serious prejudice, e.g. her maiden name is Douchebag or Trump or something similar.
Some women have won a reversal in Superior Court, for cultural or religious reasons, but that's quite a fight.
Admin
Nothing wrong with that.
A crore is 100 lakh. A lakh is 100,000 ... er ... anything, really. Presumably rupees in this case, although back in the day the bit after the decimal point wasn't how you'd decompose rupees into pies, but whatever.
So then. A crore is 1,00,00,000 rupees. This is perfectly intuitive, if you happen to be using that particular locale.
Admin
Nice Stephenson reference.
Admin
You've got it backwards. You should be laughing at those who don't use military time.
Admin
At least the table is (somewhat) parsable. It would be a giant WTF if it were in XML or some such, grown by orders of magnitude in the process with inconsistent data names and the like.
Boy am I glad I don't work with such stuff.
Admin
Thousands separators are about display. CSV files are about saving data in a portable format. Why would you store the thousands separators in the CSV? The application that opens the file should present the data according to the default locale.
Admin
That would be the point of the comment. "Those idiots wouldn't know a time format if it slapped them upside the head!"
Admin
For best results you could use Unicode Character 'NARROW NO-BREAK SPACE' (U+202F) so that the gap isn't massive
Admin
The real WTF is COMMA separated values. When I have to flatten a list into a line, I use the pipe character. Funny how Indians can't ever seem to maintain compatibility when you need it (interfaces, for example) but always manage to hold to broken stuff like this.
Admin
I think some people are missing the point in how stupidly this system has been implemented. A "name" field that contains the full name, plus the party name, plus possibly an asterisk to indicate they are the incumbent representative, all separated using delimiters that frequently occur within the text they are delimiting? It would have been very simple to make the first name, last name, party name (or foreign key / identifier), and incumbent flag as separate fields. It should also have been easy to make the column names and usage consistent in all ridings and elections.
+1 for not saving thousands separators in the csv text. If there are decimal points used, it should be output in a standard format, not the local format. That's why there are things like "Invariant Culture" in .NET. But if you must save it in the local format, record what format was used. And all text should be in the one standard format, although that shouldn't be too big an issue to deal with.
They have taken a relatively simple thing and made it as difficult for themselves as they could.
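As a rough illustration of what a saner export could look like: separate columns for party and incumbency, and numbers written in an invariant form with no thousands separators and '.' as the decimal mark. The column names and the `write_results` helper are invented for this sketch, and the name stays a single opaque string:

```python
import csv
import io
from decimal import Decimal


def write_results(rows):
    """Write (name, party, incumbent, votes, vote_share) tuples as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "party", "incumbent", "votes", "vote_share"])
    for name, party, incumbent, votes, vote_share in rows:
        writer.writerow([
            name,
            party,
            "1" if incumbent else "0",   # explicit flag instead of an asterisk
            str(int(votes)),             # plain digits, no grouping
            format(vote_share, "f"),     # fixed-point, '.' as decimal mark
        ])
    return buffer.getvalue()


print(write_results([
    ("Joe Mackenzie", "Pizza Party/Parti Pizza", True, 12345, Decimal("45.23")),
]))
```

Because the writer quotes fields only when needed, a party name containing a comma would still round-trip through any RFC 4180 reader.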
Admin
You mean the same Excel that doesn't recognize tsv files? (Tab..., got its own proper canonical extension and Mimetype and everything.)
Admin
What if someone has no surname? What if their given name comes after the surname? What about patronymics, matronymics, middle names, initials, prefixes, suffixes? You should always treat names as opaque strings.
Admin
I use Libre Office because its data import system is much easier than Excel.
Admin
Me too..
And it handles tab-delimited data as it should, as opposed to Excel, where you have to manually import and set it to use 'tab' as well as comma or whatnot.
I personally prefer to use tab as field delimiter when exporting to 'csv' (ok I know it likely should be called tsv), as viewing files exported that way with less or cat will still display the data somewhat sensible.
Admin
Submitter here. This is mostly as originally written, so Snoofle isn't to blame. Except for nerd-sniping me with the database at the beginning; and that probably just means I subconsciously love what I do.
"Majority" isn't used in that sense here (maybe it is in other parts of the country, though); we know that statistic as "margin of victory". If you told a local "Alice won with a majority of 12,345 votes", he would interpret that as either (a) Alice received 12,345 votes, which represents more than half of all votes cast, or (b) 12,345 votes were cast and Alice won more than half of them. You have to remember 90% of the population lives within 100 miles of the border, so American meanings predominate over British ones.
The CSV files were syntactically valid as far as I recall, but just badly designed. The name field was something like "Joe Mackenzie * Pizza Party/Parti Pizza" in the simplest case, if the candidate had only a first and last name and was the incumbent. But it could be as complex as "Jean-François Benoit David la Fayette String Cheese/Dairy Party/Parti Fromage à Effilocher/Latier" if the candidate had two first names, a middle name, and a compound surname, and his party had been merged. There was no other field that had any of them separately or a list of all parties; all we had to work with was this stupid bastard field. So isolating the candidate's last name "la Fayette" was just unreasonably difficult.
This was some time ago, so I don't remember what happened with the encodings except that it manifested itself in some weird way. We would see crashes with unrelated error messages on sort operations or something like that.
And no, it won't get placed in the Library; the project was aborted. The government decided to abandon its promise to reform the electoral system and the people who originally wanted the data had no more use for it.
Admin
Sorry to disappoint you, but there are lots of locales that use a space as the thousands separator, except it's not thousands that are being separated. I've seen this in some data: 1234 5678 90.1
Admin
The most horrible data parsing I had to do was for a client that needed to merge their supplier's catalog into their own. The supplier was from a different continent and considered turning over the data digitally to be equivalent to corporate spying. So all that was available was a pdf, that had to be OCR-ed for each page.
Anyone that's ever worked with OCR-ed data knows it's horribly unreliable. So data that might be in place A on page 1, might be in place K on the next, even though it's on the same relative spot on the page. I gave it my best shot and got about 90% of the 5000+ pages working correctly, with the remaining 10% being detectable as 'have to do by hand'. Unfortunately they're still finding errors in the catalogue even today, due to the 'garbage in, garbage out' paradigm.
That was one job I'll say no to if it ever comes up again, even if the pay is outrageous.
Admin
Using a non-breaking space for thousands separator is not stupid.
Using an ordinary breaking space is bloody stupid, because then there are cases where you can't know whether you have one number with a thousands separator or two numbers without it.
I freely admit that I haven't checked all locale definitions in all programming languages, but those that I've checked all use a non-breaking space. I have had the luck of not having to delve into internals of SAP, but unless its locales are done in the most boneheaded way possible, then the guy who wrote the code that creates the csvs went through the trouble of inventing a square wheel.
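The ambiguity is easy to demonstrate. In this sketch (the `parse_numbers` helper is made up for illustration) non-breaking spaces are treated as digit grouping and ordinary spaces as separators between numbers, which is the only reading a parser can safely take:

```python
# NO-BREAK SPACE and NARROW NO-BREAK SPACE are treated as digit grouping.
NON_BREAKING = ("\u00a0", "\u202f")


def parse_numbers(line):
    """Parse whitespace-separated integers, allowing non-breaking grouping."""
    for ch in NON_BREAKING:
        line = line.replace(ch, "")
    # Whatever whitespace is left separates one number from the next.
    return [int(token) for token in line.split()]


# Two numbers, grouped with non-breaking spaces: unambiguous.
print(parse_numbers("1\u00a0000 2\u202f500"))
# The same text with ordinary spaces: four tokens, meaning lost.
print(parse_numbers("1 000 2 500"))
```

The grouping characters have to be removed before splitting, because Python's `str.split()` counts U+00A0 and U+202F as whitespace too.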
Admin
Some of this is very misleading.
The column headers have been consistent (both in format and in order) since the third time this file was generated, in 2006. The format has been ISO-8859 for all but one and UTF-8 for the odd one out.
The claims on the content appear to be correct though.
It would be nice to see the story getting 2 minutes of verification effort before posting.
Admin
Have you looked at a WooCommerce database? This is their standard way of storing data. A giant mess.
Admin
"Majority" being the difference between the winner and the highest-placed loser IS what I would expect it to be. eg "Fred Smith, 12,345 votes, elected with a majority of 152"