Up the Garden Path

Sam worked on an application which needed to parse files out of a directory. There was a ticket to change the search path for those files to search an additional directory. That didn't turn out to be terribly hard, but the existing code raised some serious eyebrows.

def get_info_for_identifier(base_name: str, identifier: str) -> dict:
    ...
    all_ids = os.listdir(base_name)
    ...
    results = {}
    for i in all_ids:
        results.update(parse_identifier(base_name, i))
    ...
    return results[identifier]

This is the first method we're going to look at. You can see that it starts by listing a directory, finding every file in it. Then, we iterate across all of those files, parsing them and adding them to a dictionary. We then return… just one key in that dictionary. The one for the one file we actually cared about when we called this method. So it parses every file, to return the results for one file.

And this is the first method we're looking at, because that's just the framing device for our WTF. Note how it calls that parse_identifier method. Let's take a look at that one.

def parse_identifier(base_path: str, identifier: str) -> dict:
    main_file = Path(os.path.join(Path(base_path), identifier + ".details"))
    ...
    files = list(base_path.glob(main_file.name))
    if len(files) > 1:
        raise InternalError("this shouldn't happen")
    return parse_files(files)

So, Python has two ways of managing filesystem paths. The "old" way, and the outdated way, is to use the os.path library, and methods like join. I wouldn't say it's wrong to use os.path, but it's definitely better to use the newer Pathlib, which has the Path object in it. You know, the Path object that you see them using in this method.

That first line should be: main_file = Path(base_path) / (identifier + ".details". Or os.path.join(base_path, identifier + ".details"). This mixing and matching is definitely wrong.

But, the end result of this operation is that we have a path to a single file. So then we use the Path.glob method of base_path which… is a string. So it doesn't have that, except in Python type annotations aren't enforced at runtime, so this code's type annotations lie and it wouldn't pass a type-checker.

But we glob for main_file.name, but main_file is also a Path object, so we don't need to do any of this, and also why are we globbing when we know we're only pointing at one file at most (or maybe no files, if the file doesn't exist).

And then, and then, we do a check- if we found more than one file, we throw an exception: "This shouldn't happen".

And that's true. This absolutely shouldn't happen. None of this should happen. None of this should have happened.