Format Identified

Many nations have some form of national identification number, especially around taxes. Argentina is no exception.

Their "CUIT" (Clave Única de Identificación Tributaria) and "CUIL" (Código Único de Identificación Laboral) are formatted as "##-########-#".

Now, as datasets often don't store things in their canonical representation, Nick's co-worker was given a task: "given a list of numbers, reformat them to look like CUIT/CUIL. That co-worker went off for five days, and produced this Java function.

public String normalizarCuitCuil(String cuitCuilOrigen){
	String valorNormalizado = new String();
	
	if (cuitCuilOrigen == null || "".equals(cuitCuilOrigen) || cuitCuilOrigen.length() < MINIMA_CANTIDAD_ACEPTADA_DE_CARACTERES_PARA_NORMALIZAR){
		valorNormalizado = "";
	}else{
		StringBuilder numerosDelCuitCuil = new StringBuilder(13);
		cuitCuilOrigen = cuitCuilOrigen.trim();
		
		// Se obtienen solo los números:
		Matcher buscadorDePatron =  patternNumeros.matcher(cuitCuilOrigen);
		while (buscadorDePatron.find()){
			numerosDelCuitCuil.append(buscadorDePatron.group());
		}
		
		// Se le agregan los guiones:
		valorNormalizado = numerosDelCuitCuil.toString().substring(0,2) 
							+ "-"
							+ numerosDelCuitCuil.toString().substring(2,numerosDelCuitCuil.toString().length()-1) 
							+ "-"
							+ numerosDelCuitCuil.toString().substring(numerosDelCuitCuil.toString().length()-1, numerosDelCuitCuil.toString().length());
		
	}
	return valorNormalizado;
}

We start with a basic sanity check that the string exists and is long enough. If it isn't, we return an empty string, which already annoys me, because an empty result is not a good way to communicate "I failed to parse".

But assuming we have data, we construct a string builder and trim whitespace. And already we have a problem: we already validated that the string was long enough, but if the string contained more trailing whitespace than a newline, we're looking at a problem. Now, maybe we can assume the data is good, but the next line implies that we can't rely on that- they create a regex matcher to identify numeric values, and for each numeric value they find, they append it to our StringBuilder. This implies that the string may contain non-numeric values which need to be rejected, which means our length validation was still wrong.

So either the data is clean and we're overvalidating, or the data is dirty and we're validating in the wrong order.

But all of that's a preamble to a terrible abuse of string builders, where they discard all the advantages of using a StringBuilder by calling toString again and again and again. Now, maybe the function caches results or the compiler can optimize it, but the result is a particularly unreadable blob of slicing code.

Now, this is ugly, but at least it works, assuming the input data is good. It definitely should never pass a code review, but it's not the kind of bad code that leaves one waking up in the middle of the night in a cold sweat.

No, what gets me about this is that it took five days to write. And according to Nick, the responsible developer wasn't just slacking off or going to meetings the whole time, they were at their desk poking at their Java IDE and looking confused for all five days.

And of course, because it took so long to write the feature, management didn't want to waste more time on kicking it back via a code review. So voila: it got forced through and released to production since it passed testing.

[Advertisement] Keep all your packages and Docker containers in one place, scan for vulnerabilities, and control who can access different feeds. ProGet installs in minutes and has a powerful free version with a lot of great features that you can upgrade when ready.Learn more.