Angela's team hired someone who was "good" at SQL. When this person started, the team had some regular jobs which ran in the mornings. The jobs were fairly time consuming, and did a lot of database IO. When their current database person left for another job, they hired someone who had a "good grasp" on SQL. We'll call him Barry.
Barry started out by checking the morning jobs every day. And over time, the morning jobs started getting slower and slower. That was a concern, but Barry swore he had it under control. Barry did not share that a handful of slow queries- queries which took three or so minutes to run- had suddenly started taking 75+ minutes to run. Barry didn't think about the fact that a little time with the query planner and some indexes could have probably gotten performance back to where it should have been. Barry saw this problem and decided: "I'll write a Python script".
import time
from datetime import datetime, timedelta
import pytz # for time zone
current_date = datetime.now()
day_number = current_date.weekday() # integer value: 0 is Monday
hub_1_ready = False
hub_2_ready = False
hub_1_results = []
hub_2_results = []
job_ran_later = False # If this job is manually run later in the day, avoid sending a "both hubs failed" email
# Monday (day_number 0) runs later than the other 6 days
if day_number == 0:
end_time = datetime.strptime("08:30", "%H:%M")
end_time = end_time.time() # get just the time portion
else:
end_time = datetime.strptime("07:30", "%H:%M")
end_time = end_time.time() # get just the time portion
# If this job is run later in the day than the normaolly scheduled time
if datetime.now(pytz.timezone('US/Central')).time() > end_time:
job_ran_later = True
# Starting when Morning jobs are scheduled to kick off, check for completion of both hubs every 3 minutes until end_time. If both hubs are not a Success by end_time, an email is sent
while datetime.now(pytz.timezone('US/Central')).time() < end_time:
h1 = session.sql("SELECT LOG_STATUS FROM PROD_CTRL.CTRL.DRB_EXECUTION_LOG WHERE LOG_PROJECT = 'SRC_PROD_1' AND date(log_start_date) = current_date AND date(LOG_END_DATE) = current_date").take(1)
hub_1_results = []
hub_1_results.append(h1)
if str(hub_1_results[0]) == "[Row(LOG_STATUS='SUCCESS')]":
hub_1_ready = True
h2 = session.sql("SELECT LOG_STATUS FROM PROD_CTRL.CTRL.SRC_EXECUTION_LOG WHERE LOG_PROJECT = 'SRC_PROD_2' AND date(log_start_date) = current_date AND date(LOG_END_DATE) = current_date").take(1)
hub_2_results = []
hub_2_results.append(h2)
if str(hub_2_results[0]) == "[Row(LOG_STATUS='SUCCESS')]":
hub_2_ready = True
# If both hubs are Success, then break out of while loop, even if it's not end_time yet
if hub_1_ready == True and hub_2_ready == True:
break
time.sleep(180) # Sleep for 3 minutes before trying again
if not hub_1_ready and not hub_2_ready and job_ran_later == False:
message = "Neither Hub_1 nor Hub_2 finished in time for Morning jobs."
context.updateVariable('METL_MESSAGE', message)
raise ValueError("send email: "+message)
elif hub_1_ready == False and hub_2_ready == True:
message = "Hub_1 did not finish in time for Morning jobs."
context.updateVariable('METL_MESSAGE', message)
raise ValueError("send email: "+message)
elif hub_1_ready == True and hub_2_ready == False:
message = "Hub_2 did not finish in time for Morning jobs"
context.updateVariable('METL_MESSAGE', message)
raise ValueError("send email: "+message)
elif job_ran_later == True:
message = "This job was run manually later in the day. Check that both Source hubs have completed. If you did not run this job, you can probably ignore this email."
context.updateVariable('METL_MESSAGE', message)
raise ValueError("send email: "+message)
I don't particularly like any of this. Some of it is just little ugliness, like the fact that job_ran_later and the closing if statements could be written to be much more clear. Or the way that, after our main while loop, which we'll come back to, we compare boolean variables against boolean literals.
The core of it is the while loop, which checks the current time, and while it's before the target end time, it runs a pair of queries. For each query it runs, it empties an array, then append the results (which we know is only one value, because they take(1)) to the array. Then they check the first element of the array against an expected string.
Why the arrays? Who knows. Perhaps at one point they thought they'd keep the results from multiple iterations, then decided against it. Why do the check against the string in the Python code and not the query? No idea, but maybe I don't have a "good grasp" of SQL. That said, with my bad grasp, I'm pretty sure I could figure out how to do all that in one single query and not two that are almost identical.
In any case, if we don't see what we want in the database, we sleep for three minutes, then try again.
At the end of the process, we check what happened and output messages and raise exceptions based on what we did see in the database.
It's also worth noting that Angela's team used a pretty reasonable job management system. All of their other scripts doing similar jobs didn't include retry logic inside themselves- they just failed. That let the job runner decide whether or not to retry, and that allowed all sorts of valuable configuration options that are more fine grained than "sleep for 3 minutes".