The Parameters

As we frequently observe, code in scientific applications tends not to be of the best quality. It's not hard to understand why: a researcher is usually trying to take a holistic understanding of a complicated problem domain and turn it into code. Software engineering, on the other hand, teaches us to break that holistic picture up into little parts that we glue back together. The latter approach produces better code, but it requires a lot of thinking about the structure of the code, which is a whole layer on top of that holistic understanding.

Which brings us to Plink, a C program Oliver R has had to work with. It analyzes genetic variance data. It also has a function with this signature:

int32_t plink(char* outname, char* outname_end, char* bedname, 
    char* bimname, char* famname, char* cm_map_fname, 
    char* cm_map_chrname, char* phenoname, char* extractname, 
    char* excludename, char* keepname, char* removename, 
    char* keepfamname, char* removefamname, char* filtername, 
    char* freqname, char* distance_wts_fname, char* read_dists_fname, 
    char* read_dists_id_fname, char* evecname, char* mergename1, 
    char* mergename2, char* mergename3, char* missing_mid_template, 
    char* missing_marker_id_match, char* makepheno_str, 
    char* phenoname_str, Two_col_params* a1alleles, 
    Two_col_params* a2alleles, char* recode_allele_name, 
    char* covar_fname, char* update_alleles_fname, 
    char* read_genome_fname, Two_col_params* qual_filter, 
    Two_col_params* update_chr, Two_col_params* update_cm, 
    Two_col_params* update_map, Two_col_params* update_name, 
    char* update_ids_fname, char* update_parents_fname, 
    char* update_sex_fname, char* loop_assoc_fname, 
    char* flip_fname, char* flip_subset_fname, char* sample_sort_fname,
    char* filtervals_flattened, char* condition_mname, 
    char* condition_fname, char* filter_attrib_fname, 
    char* filter_attrib_liststr, char* filter_attrib_sample_fname, 
    char* filter_attrib_sample_liststr, char* rplugin_fname, 
    char* rplugin_host_or_socket, int32_t rplugin_port, 
    double qual_min_thresh, double qual_max_thresh, double thin_keep_prob,
    double thin_keep_sample_prob, uint32_t new_id_max_allele_len, 
    uint32_t thin_keep_ct, uint32_t thin_keep_sample_ct, 
    uint32_t min_bp_space, uint32_t mfilter_col, uint32_t fam_cols, 
    int32_t missing_pheno, char* output_missing_pheno, 
    uint32_t mpheno_col, uint32_t pheno_modifier, 
    Chrom_info* chrom_info_ptr, Oblig_missing_info* om_ip, 
    Family_info* fam_ip, double check_sex_fthresh, 
    double check_sex_mthresh, uint32_t check_sex_f_yobs, 
    uint32_t check_sex_m_yobs, double distance_exp, double min_maf, 
    double max_maf, double geno_thresh, double mind_thresh, 
    double hwe_thresh, double tail_bottom, double tail_top, 
    uint64_t misc_flags, uint64_t filter_flags, uint64_t calculation_type, 
    uint32_t dist_calc_type, uintptr_t groupdist_iters, 
    uint32_t groupdist_d, uintptr_t regress_iters, uint32_t regress_d, 
    uint32_t parallel_idx, uint32_t parallel_tot, uint32_t splitx_bound1, 
    uint32_t splitx_bound2, uint32_t ppc_gap, uint32_t sex_missing_pheno, 
    uint32_t update_sex_col, uint32_t hwe_modifier, uint32_t min_ac, 
    uint32_t max_ac, uint32_t genome_modifier, double genome_min_pi_hat, 
    double genome_max_pi_hat, Homozyg_info* homozyg_ptr, 
    Cluster_info* cluster_ptr, uint32_t neighbor_n1, uint32_t neighbor_n2, 
    Set_info* sip, Ld_info* ldip, Epi_info* epi_ip, Clump_info* clump_ip, 
    Rel_info* relip, Score_info* sc_ip, uint32_t recode_modifier, 
    uint32_t allelexxxx, uint32_t merge_type, uint32_t sample_sort, 
    int32_t marker_pos_start, int32_t marker_pos_end, 
    int32_t snp_window_size, char* markername_from, char* markername_to, 
    char* markername_snp, Range_list* snps_range_list_ptr, 
    uint32_t write_var_range_ct, uint32_t covar_modifier, 
    Range_list* covar_range_list_ptr, uint32_t write_covar_modifier, 
    uint32_t write_covar_dummy_max_categories, uint32_t dupvar_modifier, 
    uint32_t mwithin_col, uint32_t model_modifier, uint32_t model_cell_ct, 
    uint32_t model_mperm_val, uint32_t glm_modifier, 
    double glm_vif_thresh, uint32_t glm_xchr_model, 
    uint32_t glm_mperm_val, Range_list* parameters_range_list_ptr, 
    Range_list* tests_range_list_ptr, double ci_size, double pfilter, 
    double output_min_p, uint32_t mtest_adjust, double adjust_lambda, 
    uint32_t gxe_mcovar, Aperm_info* apip, uint32_t mperm_save, 
    uintptr_t ibs_test_perms, uint32_t perm_batch_size, double lasso_h2, 
    uint32_t lasso_lambda_iters, double lasso_minlambda, 
    Range_list* lasso_select_covars_range_list_ptr, 
    uint32_t testmiss_modifier, uint32_t testmiss_mperm_val, 
    uint32_t permphe_ct, int32_t known_procs, 
    Ll_str** file_delete_list_ptr)

Whitespace added for readability.

That's well over a hundred parameters for one function call, with a mix of plain values and pointers being passed in. At the call site, some of the arguments are already pointers, while others are turned into pointers by taking an address with the & operator, which makes it extra hard to understand where the data actually lives in the program:

retval = plink(outname, outname_end, pedname, mapname, famname, cm_map_fname, cm_map_chrname, phenoname, extractname, excludename, keepname, removename, keepfamname, removefamname, filtername, freqname, distance_wts_fname, read_dists_fname, read_dists_id_fname, evecname, mergename1, mergename2, mergename3, missing_mid_template, missing_marker_id_match, makepheno_str, phenoname_str, a1alleles, a2alleles, recode_allele_name, covar_fname, update_alleles_fname, read_genome_fname, qual_filter, update_chr, update_cm, update_map, update_name, update_ids_fname, update_parents_fname, update_sex_fname, loop_assoc_fname, flip_fname, flip_subset_fname, sample_sort_fname, filtervals_flattened, condition_mname, condition_fname, filter_attrib_fname, filter_attrib_liststr, filter_attrib_sample_fname, filter_attrib_sample_liststr, rplugin_fname, rplugin_host_or_socket, rplugin_port, qual_min_thresh, qual_max_thresh, thin_keep_prob, thin_keep_sample_prob, new_id_max_allele_len, thin_keep_ct, thin_keep_sample_ct, min_bp_space, mfilter_col, fam_cols, missing_pheno, output_missing_pheno, mpheno_col, pheno_modifier, &chrom_info, &oblig_missing_info, &family_info, check_sex_fthresh, check_sex_mthresh, check_sex_f_yobs, check_sex_m_yobs, distance_exp, min_maf, max_maf, geno_thresh, mind_thresh, hwe_thresh, tail_bottom, tail_top, misc_flags, filter_flags, calculation_type, dist_calc_type, groupdist_iters, groupdist_d, regress_iters, regress_d, parallel_idx, parallel_tot, splitx_bound1, splitx_bound2, ppc_gap, sex_missing_pheno, update_sex_col, hwe_modifier, min_ac, max_ac, genome_modifier, genome_min_pi_hat, genome_max_pi_hat, &homozyg, &cluster, neighbor_n1, neighbor_n2, &set_info, &ld_info, &epi_info, &clump_info, &rel_info, &score_info, recode_modifier, allelexxxx, merge_type, sample_sort, marker_pos_start, marker_pos_end, snp_window_size, markername_from, markername_to, markername_snp, &snps_range_list, write_var_range_ct, covar_modifier, &covar_range_list, write_covar_modifier, 
write_covar_dummy_max_categories, dupvar_modifier, mwithin_col, model_modifier, (uint32_t)model_cell_ct, model_mperm_val, glm_modifier, glm_vif_thresh, glm_xchr_model, glm_mperm_val, &parameters_range_list, &tests_range_list, ci_size, pfilter, output_min_p, mtest_adjust, adjust_lambda, gxe_mcovar, &aperm, mperm_save, ibs_test_perms, perm_batch_size, lasso_h2, lasso_lambda_iters, lasso_minlambda, &lasso_select_covars_range_list, testmiss_modifier, testmiss_mperm_val, permphe_ct, known_procs, &file_delete_list);

Whitespace not added.

Again, this was written with a different philosophy than a software engineer would bring to it. It's not a mistake that the code is this way; the developer was clearly smart and was optimizing for writing code that represents their understanding of the problem as quickly as possible. But that doesn't mean I would ever want to support it.
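For contrast, here is a minimal sketch of the "little parts glued back together" approach: related parameters are grouped into structs, and the function takes a single pointer to the result. Everything new here is hypothetical. The struct names (Input_files, Filter_thresholds, Plink_config) and the functions plink_run and run_example do not exist in Plink; only the field names are borrowed from the signature above, and the validation logic is a stand-in rather than anything Plink actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical grouping of the input file names (not a real Plink type). */
typedef struct {
    const char* bedname;
    const char* bimname;
    const char* famname;
} Input_files;

/* Hypothetical grouping of some of the numeric filter thresholds. */
typedef struct {
    double min_maf;
    double max_maf;
    double geno_thresh;
    double mind_thresh;
    double hwe_thresh;
} Filter_thresholds;

/* One top-level configuration object instead of a flat parameter list. */
typedef struct {
    Input_files inputs;
    Filter_thresholds filters;
    uint64_t misc_flags;
} Plink_config;

/* Stand-in for the real entry point: it only sanity-checks the config.
 * Returns 0 on success, -1 for a missing config or input file name,
 * and -2 for an inverted allele frequency range. */
int32_t plink_run(const Plink_config* cfg) {
    if (cfg == NULL || cfg->inputs.bedname == NULL) {
        return -1;
    }
    if (cfg->filters.min_maf > cfg->filters.max_maf) {
        return -2;
    }
    return 0;
}

/* Example call site: every field is named where it is set, so argument
 * order no longer matters, and only one address is taken. The file names
 * and threshold values are made up for illustration. */
int32_t run_example(void) {
    Plink_config cfg = {
        .inputs = {
            .bedname = "data.bed",
            .bimname = "data.bim",
            .famname = "data.fam",
        },
        .filters = {
            .min_maf = 0.01,
            .max_maf = 0.5,
            .geno_thresh = 0.1,
            .mind_thresh = 0.1,
            .hwe_thresh = 0.001,
        },
        .misc_flags = 0,
    };
    return plink_run(&cfg);
}
```

The designated initializers (C99 and later) are what make the call site readable: a mismatched argument can no longer hide in position 117 of a comma-separated list. The cost is exactly the extra layer of structural thinking described above, since someone has to decide which parameters belong together.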

Featured Comments

Bananafish (nodebb) 2022-11-14

"Whitespace added for readability." <<< THERE's your WTF!

I have worked in IT supporting research for over 25 years. This is "normal", and usually these programs are maintained by the folks who write them. Genetics programs are as complicated as the topic they are meant to understand. Plink is extra-complicated in that it does a lot of things that incorporate the functionality of several other program suites. Most of these have "setup scripts" to get the parameters correct in a file, and few of these types of programs take more than a few command line parameters (one or two files, maybe 3, a database, a few keywords, etc.) so the user never has to worry about 100 parameters being in the correct order.

That said, this is pretty WTFy because A LOT of those parameters are flags and filenames -- and I'm fairly certain that the output from this function would not change if filenames were referenced by pointers rather than passed as parameters. (I've never used Plink, but based on my experience with science-y stuffs, you probably can't change the name of the output file, so why is it a parameter here??) This was definitely written with a lot of addon modules to expand what the program does. That's how Science works.

So, I guess what I'm saying is, Science is TRWTF?