If you go by the popular press articles on the tech field, anyone who's anyone has migrated into the "machine learning" space. That's where all the "smart" programmers go.
Of course, ML is really just statistics and automated model fitting. It's a powerful tool, and it does require some highly specialized skills- skills that might involve programming, but maybe don't quite overlap with programming.
Noreen inherited some code from a highly-paid contractor who specializes in doing machine learning. The HPC put together an ML system capable of generating all sorts of useful output, output which the users at Noreen's company wanted compiled into nice PDFs. The HPC, being a very busy ML consultant who billed $300/hr didn't want to build this, so they subcontracted out to different HPC, who also was in the ML space, but billed at $200/hr.
The result is about 500 lines of Python script, in one large Python file. There's no real organization, no clear entry point, and all sorts of… interesting choices. Let's start with the very first method:
def run_report(file):
'''parser = argparse.ArgumentParser()
parser.add_argument("-c", nargs=1)
opts = parser.parse_args()
file = opts.c'''
base = os.getcwd() + "/project_name/"
inputs = np.genfromtxt(file, delimiter=",", dtype=None, encoding="utf_8_sig") # this is just a CSV, by the way
inp_list = inputs.tolist()
…
This is the first thing you see after the imports
section. The section enclosed in '''
should be the method's "docstring" (block comment), but in this case is really just being used to comment out code that the developer couldn't get working. Which hey, I sympathize, I find the argparse
documentation to be cumbersome and hard to interpret. But, ah, maybe one should remove the dead code?
A few lines later, this appears:
merger = PdfFileMerger()
There's nothing interesting about that line, in and of itself, except for one thing. That variable isn't referenced for another 200 lines or so, and the next time it's referenced, you see:
merger = PdfFileMerger()
merger = PdfFileMerger()
I'm just trying to give you the sense that… there's a lot here- enough for multiple articles, so we'll probably be picking on this giant pile for a bit. For now, I just want to focus in on one of many WTFs in the whole thing.
Among the many things this script does, it calls out to a shell script which runs a Makefile, which calls some other Python code to prep the reports. There's just one problem: those scripts need some input when they're invoked. As you might gather from the commented-out argparse
code above, our developer didn't have the greatest understanding of how to pass arguments between programs, so this is what they came up with:
if (inp_list[i][1] not in unable_to):
dest = "ReportGenerator/index" + str(i) + ".html"
with in_place.InPlace(base+'makePDF.sh') as file:
for line in file:
line = line.replace("baseDir/", base)
line = line.replace("ReportGenerator/index1.html", dest)
print("In makePDF, replaced with: " + line)
file.write(line)
dest2 = "index" + str(i) + ".html" + " " + "report" + str(i) + ".pdf"
with in_place.InPlace(base+'ReportGenerator/Makefile') as file:
for line in file:
line = line.replace("final.html report.pdf", dest2)
print("In Makefile, replaced with: " + line)
file.write(line)
in_place
, as you might gather from the name, allows the editing of a file in place. Instead of passing arguments to the script, we just edit the script. Or at least, we're editing the script if this entry didn't end up in the unable_to
bucket because of previous failures.
I bring up that last bit, because this block continues. Perhaps what follows is just an indenting error, because even if this entry is on our unable_to
list, we're going to able at it anyway:
try:
name = inp_list[i][0]
domain = inp_list[i][1]
print(i, name, domain)
filename = base+"text_files/" + domain.split(".")[0]+ ".txt"
print(filename)
Lines = f.readlines()
f.close()
risk = None
for line in Lines:
risk = line
risk = float(risk)
clt.get_report(name, filename) # why? why is this called again?
# I should note that get_report is really damn slow because of really shoddy code
# inside of it that's loading and reloading needlessly large files off disk ad
# nauseum. So nice we did it twice.
# Speaking of so nice we did it twice, why are we doing *this* twice?
os.system(base+"./makePDF.sh")
os.system(base+"./makePDF.sh")
dest2 = "index" + str(i) + ".html" + " " + "report" + str(i) + ".pdf"
with in_place.InPlace(base+'ReportGenerator/Makefile') as file:
for line in file:
line = line.replace(dest2, "final.html report.pdf")
print("In Makefile, returned to: " + line)
file.write(line)
dest = "ReportGenerator/index" + str(i) + ".html"
dest2 = "cat /home/actualuseraccount/abbas/ReportGenerator/index" + str(i) + ".html" # why
with in_place.InPlace(base+'makePDF.sh') as file:
for line in file:
line = line.replace(base, "baseDir/")
line = line.replace(dest, "ReportGenerator/index1.html")
print("In replacePDF, returned to: " + line)
file.write(line)
pdfs.append([base+"ReportGenerator/report" + str(i) + ".pdf", name, risk]) # why
print("=====================================================================================================")
time.sleep(5) # WHY
except Exception as e:
print(e)
dest2 = "index" + str(i) + ".html" + " " + "report" + str(i) + ".pdf"
with in_place.InPlace(base + '/ReportGenerator/Makefile') as file:
for line in file:
line = line.replace(dest2, "final.html report.pdf")
file.write(line)
dest = "ReportGenerator/index" + str(i) + ".html"
dest2 = "cat /home/actualuseraccount/abbas/ReportGenerator/index" + str(i) + ".html" # wh y ?
with in_place.InPlace(base+'makePDF.sh') as file:
for line in file:
line = line.replace(base, "baseDir/")
line = line.replace(dest, "ReportGenerator/index1.html")
file.write(line)
print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++") # W H Y ?
Comments supplied by Noreen, and no, I'm not repeating blocks you've already seen. The repetition is in the code. And you'll note what the in_place
edits are doing this time- reverting the scripts back to their original state. Which we need to do if we succeed, and we also need to do if we fail, and our developer apparently never learned about the finally
block.
And in the middle of the whole thing, for no reason, we have a random time.sleep(5)
to pause the program for 5 seconds. I mean, ML is computationally intensive, so you don't want your script to run too fast, people might think you're not really doing work, I guess? That isn't the only random time.sleep
scattered in the code. We might come back to a few of those, because there are more WTFs.
And we also haven't touched upon the ReportCreator
helper, developed by the original highly-paid contractor (before they subcontracted out). But, to quote Noreen's comments at the end of the run_report
method:
# God help me.
# I'm out of words, and I don't want to look at this anymore.