latent_phenotype_project

NOTE: File names are listed and described in the order that they are supposed to be run.

NOTE: Any file name containing "step0" is not meant to be run directly. Rather, some other file in the sequence calls that file to run when necessary.

NOTE: Descriptions for files that are called by other files but do not have "step0" in the name start with "DO NOT RUN DIRECTLY"

NOTE: Descriptions starting with "IMPORTANT; run in sections, not all at once" have sections that need to be completed manually in the code. You may run the code in subsections that do not overlap the sections to be completed manually.

NOTE: Directories are ordered from top to bottom as the sequence in which they should be run.

Directory: step1_get_phenotypes_complex

step1a_get_UKB_phenotypes.py: Imports and processes UK biobank (UKB) data fields. Handles multiple measurements, modifies smoking pack years, calculates annual alcohol consumption, and binarizes categorical variables.
step1a_library.py: Functions used by step1a_get_UKB_phenotypes.py.

Directory: step1_get_phenotypes_simple

step1a_import_HF_ICD_codes_and_covariates.py: Imports additional UKB data fields related to heart failure; binarizes and renames as needed.
step1b_process_HF_ICD_codes_and_covariates.py: Binarizes ICD codes, creates the all cause heart failure binary phenotype ('AHF' in text, 'any_HF' in code), and removes ICD codes not significantly correlated to AHF.

Directory: step2_get_UKB_samples

step2a_make_UKB_sample_getters.py: Generates bash scripts for importing individual' non-imputed genotype data.
step2b_get_UKB_samples.sh: Executes bash scripts from the previous step to collect genotype samples.

Directory: step3_merge_chr_and_remove_quitters

step3a_get_people_who_quit.py: generates a list of individuals who opted out of UKB data use (from other manually typed lists).
step3b_merge_datasets.sh: Merges plink files from Step 2; removes opt-out individuals.
step3c_find_SNP_cutoffs.py: Plots SNPs' minor allele frequencies, missingness, and HWE p-values (exploratory only).
step3d_remove_bad_SNPs.sh: Removes SNPs exceeding thresholds for aforementioned quantities.
step3e_find_sample_cutoffs.py: Plots individuals' heterozygosity, average SNP missingness, and X chromosome heterozygosity scores (exploratory only).
step3f_remove_bad_samples.sh: Removes individuals exceeding thresholds for aforementioned quantities.

Directory: step4_remove_relatives

step4a_divide_eids_into_subsets_prep.py: Splits remaining eids from Step 3 into 10 equal subsets.
step4b_divide_eids_into_subsets.sh: Partitions plink files based on the 10 eid subsets; removes remainder eids.
step4c_make_king_subjobs.py: Generates a bash script that runs KING for each SNP subset and each pair of SNP subsets.
step4d_run_king_subjobs.sh: Runs KING through the previously generated bash scripts. King outputs all pairs of related individuals with 3rd degree relatedness or more.
step4e_get_unrelated_eids.py: Prunes dataset to keep only 4th degree or less related individuals. The number of all-cause heart failure cases is prioritized over the total sample size. Both are maximized with that in mind.
step4f_remove_relatives.sh: Excludes relatives based on prior steps. From these unrelated individuals, creates a pruned SNP dataset such that all SNPs are in low LD for genetic principle component computation. Also keeps the non-pruned dataset.

Directory: step5_verify_sample_unrelatedness

step5a_divide_eids_into_subsets_prep.py: Splits eids from Step 3 into 10 equal subsets.
step5b_divide_eids_into_subsets.sh: Partitions plink files into 10 subsets corresponding to the eid subsets.
step5c_make_king_subjobs.py: Generates a bash script to run KING for each SNP subset and each pair of SNP subsets.
step5d_run_king_subjobs.sh: Executes the KING subjob scripts.
step5e_confirm_unrelatedness.py: Confirms that no related individuals remain post-Step 4. This validates that KING was used correctly.

Directory: step6_PCA

check_LD.sh: Computes LD between SNP pairs in pruned SNP set from Step 4.
check_LD.py: Validates that plink was used correctly, so no SNP pairs exceed the LD R^2 threshold from Step 4.
step0_run_PCA.par: Specifies parameters for Eigensoft's PCA computation.
step6a_make_PCA_documents.py: Prepares input files for Eigensoft's PCA.
step6b_run_PCA.sh: Executes PCA computation using Eigensoft.

Directory: step7_adjust_HF_for_covariates_logistic_PCA

step7a_get_HF_ICD_codes_unrelated.py: Imports phenotypes and relevant UK Biobank fields for unrelated individuals from Step 4.
step7b_logPCA_transform_the_data.py: DO NOT RUN DIRECTLY. Applies logistic PCA to 311 ICD10 codes and heart failure. Creates latent phenotypes (k specified via argparse) and adjusts them using PCs from Step 6.
step7b_logPCA_transform_the_data.sh: Executes the above for k = 1 to 20; paper uses k=15.
step7c_impute_missing_values.py: DO NOT RUN DIRECTLY. Applies MICE imputation to environmental factors with missing values. Correlation threshold for feature selection specified via "nn".
step7c_impute_missing_values.sh: Runs the above for various nn values; paper uses nn=0.05.
step7d_get_imputed_values_and_trasformed_variables.py: Similar to above, but sets nn at 0.05 and does not simulate missingness. Outputs imputed environmental factors.
step7d_get_imputed_values_and_trasformed_variables.sh: Executes the above, recommended for job submission due to long runtime.
step7e_get_PCs_effs.py: Outputs logistic regression beta coefficients for genetic PCs vs AHF, used to correct the standard logistic regression GWAS against AHF.

Directory: step7_adjust_HF_for_covariates_NN

step7a_create_AE_phenotypes.py: DO NOT RUN DIRECTLY. Computes autoencoder test accuracy based on layer nodes, cross-validation folds, and dropout rate. Generates latent phenotypes.
step7a_create_AE_phenotypes.sh: Runs the above for specified nodes, folds, and dropout rates; paper uses 0.3 dropout.
step7b_create_best_phenotypes_normal_AE_0.3dop.py: Trains final autoencoder model twice, using the first run's weights as a starting point for the second.
step7b_create_best_phenotypes_normal_AE_0.3dop.sh: Executes the above model training.
step7c_impute_missing_values.py: DO NOT RUN DIRECTLY. Uses MICE to impute missing environmental data, selecting features based on correlation "nn".
step7c_impute_missing_values.sh: Executes imputation for various "nn"; paper uses nn=0.05.
step7d_get_imputed_values_and_transformed_variables.py: Similar to above, fixes "nn" at 0.05 and outputs imputed factors.
step7d_get_imputed_values_and_transformed_variables.sh: Executes the above, recommended for job submission.
step7e_compute_network_shapley_values.py: Calculates Shapley values for each latent phenotype.
step7e_compute_network_shapley_values.sh: Applies the above to all latent phenotypes.
step7f_analyze_shapley_values.py: Produces dendrograms based on correlations of high-impact or high-correlation Shapley values.

Directory: step7_adjust_HF_for_covariates_PCA

step0_compute_SV_p_values.py: Identifies key ICD10 codes affecting SNP-latent phenotype correlations. Used by step7g_sub_phenotype_analysis.sh.
step7a_get_HF_ICD_codes_unrelated.py: Fetches unrelated individuals' phenotypes and UK Biobank fields from Step 4.
step7b_PCA_transform_the_data.py: Performs PCA on 311 ICD10 codes and all-cause heart failure, adjusting latent phenotypes with PCs from Step 6.
step7c_impute_missing_values.py: Applies MICE imputation to environmental factors, using a fixed "nn" of 0.05. Confirmed to outperform mean imputation.
step7d_get_imputed_values_and_transformed_variables.py: Similar to step7c, but with "nn" fixed at 0.05 and no missingness simulation. Outputs imputed factors.
step7e_compute_network_shapley_values.py: Calculates Shapley values for ICD10 codes and heart failure in PCA-based latent phenotypes.
step7f_analyze_shapley_values.py: Generates dendrograms based on Shapley value correlations.
step7g_sub_phenotype_analysis.sh: Executes step0_compute_SV_p_values.py for varying numbers of top-contributing ICD10 codes.

Directory: step8_get_imputed_ukb_samples

step8.1_get_imputed_data_setup.py: Generates shell scripts for importing GWAS-targeted, imputed SNPs per chromosome.
step8.2_get_rsID_positions.py: Lists SNPs to remove per chromosome based on low MAF or insufficient information.
step8.3_get_eids.py: copies a file identifying which unrelated individuals to retain in the analysis.
step8.4_get_imputed_data.sh: Executes shell scripts from step8.1.
step8.5_make_bims_tab_spaced.py: Converts BIM files to tab-spaced format.

Directory: step9_regress_phenotypes_against_SNPs_logistic_PCA

step0_binary_HF_QTL_getter.py: Performs logistic regression on a subset of SNPs against all-cause heart failure, corrected by genetic PCs.
step0_QQ_plot_getter.py: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
step0_significant_QTL_getter.py: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
step9a_get_genotype_metadata.sh: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
step9b_get_significant_QTLs_setup.py: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in step0_binary_HF_QTL_getter.py to change the environmental factor.
step9c_get_significant_QTLs.sh: Executes scripts generated by the previous step.
step9d_get_QQ_plots.sh: Applies step0_QQ_plot_getter.py to all latent phenotypes.
step9e_get_binary_HF_QTLs.sh: creates bash scripts to run step0_binary_HF_QTL_getter.py on all SNP subsets in parallel.
step9f_get_QQ_plots_normal_GWAS.sh: applies step0_QQ_plot_getter.py to all cause heart failure.

Directory: step9_regress_phenotypes_against_SNPs_NN

step0_QQ_plot_getter.py: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
step0_significant_QTL_getter.py: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
step9a_get_genotype_metadata.sh: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
step9b_get_significant_QTLs_setup.py: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in step0_binary_HF_QTL_getter.py to change the environmental factor.
step9c_get_significant_QTLs.sh: Executes scripts generated by the previous step.
step9d_get_QQ_plots.sh: Applies step0_QQ_plot_getter.py to all latent phenotypes.

Directory: step9_regress_phenotypes_against_SNPs_PCA

step0_QQ_plot_getter.py: Generates a QQ plot for all SNPs' real vs expected p values with respect to one latent phenotype.
step0_significant_QTL_getter.py: Regresses SNPs against a genetic PC-corrected latent phenotype using linear regression, EDGE, and target encoding. Manually set "env_name" on line 87 for GxE effects. Examined Env_name values include 'pack-years', 'annual-consumption', ['874-average', '894-average', '914-average'] (a python list), and '22001-0.0'
step9a_get_genotype_metadata.sh: gets SNPs' minor allele frequencies. Having these pre-computed makes the GWAS faster.
step9b_get_significant_QTLs_setup.py: Creates bash scripts for each (latent phenotype, chromosome) pair. Remember to update line 87 in step0_binary_HF_QTL_getter.py to change the environmental factor.
step9c_get_significant_QTLs.sh: Executes scripts generated by the previous step.
step9d_get_QQ_plots.sh: Applies step0_QQ_plot_getter.py to all latent phenotypes.

Directory: step10_get_significant_SNPs_logistic_PCA

step0_compute_GxE_p_values.py: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
step0_filter_significant_SNPs_and_get_GxE_effects.py: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run step0_compute_GxE_p_values.py
step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
step10a_get_significant_rsIDs.py: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes)
step10b_get_significant_SNPs.sh: Retreives plink files for all SNPs from the previous step
step10c_filter_significant_SNPs_and_get_GxE_effects.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects.py to all chromosomes and environmental factors
step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py to all chromosomes and environmental factors
step10d_get_significant_GxE_p_values.sh: applies step0_compute_GxE_p_values.py to all independently nominally significant rsIDs.
step10e_access_common_SNPs.py: Generates lists of main and interaction effects for various environmental factors with logistic PCA latent phenotypes, counts of each effect, and data for table 1a (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
step10f_get_CV_folds.py: Generates 30 outer training/validation index sets for nested cross-validation.
step10g_get_CV_testing_accuracy.py: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
step10g_get_CV_testing_accuracy.sh: applies step10g_get_CV_testing_accuracy.py to all 30 outer training/validation index sets.

Directory: step10_get_significant_SNPs_NN

step0_compute_GxE_p_values.py: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
step0_filter_significant_SNPs_and_get_GxE_effects.py: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run step0_compute_GxE_p_values.py
step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
step10a_get_significant_rsIDs.py: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes)
step10b_get_significant_SNPs.sh: Retreives plink files for all SNPs from the previous step
step10c_filter_significant_SNPs_and_get_GxE_effects.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects.py to all chromosomes and environmental factors
step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py to all chromosomes and environmental factors
step10d_get_significant_GxE_p_values.sh: applies step0_compute_GxE_p_values.py to all independently nominally significant rsIDs.
step10e_access_common_SNPs.py: Generates lists of main and interaction effects for various environmental factors with NN latent phenotypes, counts of each effect, and data for table 1a (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
step10f_get_CV_folds.py: Generates 30 outer training/validation index sets for nested cross-validation.
step10g_get_CV_testing_accuracy.py: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
step10g_get_CV_testing_accuracy.sh: applies step10g_get_CV_testing_accuracy.py to all 30 outer training/validation index sets.

Directory: step10_get_significant_SNPs_PCA

step0_compute_GxE_p_values.py: given an rsID, chromosome, latent phenotype, and environmental factor as input, computes a permutation test p value for the pure GxE effect.
step0_filter_significant_SNPs_and_get_GxE_effects.py: For each chromosome and environmental factor, for each latent phenotype, segments SNP hits into intervals, and selects independently nominally significant SNPs. For each independent SNP hit, prepares a bash file to run step0_compute_GxE_p_values.py
step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py: If independent SNP hits files are available, prepares bash files to run `step0_compute_GxE_p_values.py as described previously.
step10a_get_significant_rsIDs.py: Retreives all rsIDS corresponding to SNP hits with a nominal TRACE p value < 5E-8/16 (16 is a bonferroni correction for the number of latent phenotypes).
step10b_get_significant_SNPs.sh: Retreives plink files for all SNPs from the previous step
step10c_filter_significant_SNPs_and_get_GxE_effects.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects.py to all chromosomes and environmental factors
step10c_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.sh: applies step0_filter_significant_SNPs_and_get_GxE_effects_LAST_PART_ONLY.py to all chromosomes and environmental factors
step10d_get_significant_GxE_p_values.sh: applies step0_compute_GxE_p_values.py to all independently nominally significant rsIDs.
step10e_access_common_SNPs.py: Generates lists of main and interaction effects for various environmental factors with PCA latent phenotypes, counts of each effect, and data for table 1 (step10e_p_val_analysis.txt). Machine learning is deferred to step10g.
step10f_get_CV_folds.py: Generates 30 outer training/validation index sets for nested cross-validation.
step10g_get_CV_testing_accuracy.py: For 1 of the 30 outer index sets, conducts 10-fold cross-validation on the training set. Reports optimal model parameters and validation accuracy.
step10g_get_CV_testing_accuracy.sh: applies step10g_get_CV_testing_accuracy.py to all 30 outer training/validation index sets.

Directory: step11_analyze_complete_dataset

Code Files

step11a_merge_rsID_output.py: Generates Table 1b and effect-specific rsID file pairs (rsIDs_effects.txt and rsIDs_effects_pvals.txt, where * is the effect type) for FUMA input.
Important Note on FUMA details: For each rsID file pair, Inputs both files into FUMA with a placeholder total sample size of 380000. It is merely a required input that does not effect the distances between SNP hits and genes, thereby making it irrelevant. Initially uses a 500KB gene-SNP distance, later refined to 300KB for precision with small gene count reduction. Skips gene by exercise interactions due to limited hits. Renames FUMA output files as annov_*.txt. for clarity, where * = (main, smoking, gender, alcohol).
step11b_get_enrichment_all.R: IMPORTANT; run in sections, not all at once. Intermediate manual steps generate miEAA.csv, which lists KEGG-enriched miRNA pathways. Uses FUMA outputs and rsIDs_*_effects.txt as input. Generates Tables 2 and S6.
step11c_get_chr_seperated_lists.py: makes one file per SNP hit for input into the LDlink tool.
step11d_ldlink.sh: Sends files from previous step to LDlink, fetching data on prior GWAS hits in LD with each SNP hit.
step11e_make_SNP_tables_LD_pruning.py: Creates Table 1a, counting independent GWAS SNP hits that output from step11d_ldlink.sh finds are related to AHF. The term list "possible_AHF_terms" was manually curated and validated through substring matching of terms related to cardiovascular dysfunction in the output from step11d_ldlink.sh.
step11f_miRNA_enrichment_analysis.R: IMPORTANT; run in sections, not all at once. Computes p-values for enrichment of genic SNP hits in genes that enrich miRNA. Produces Tables 3 and S7. Manual steps are clearly outlined for gene-disease associations. Also uses data from Genevestigator ("see genevestigator_hits_methods folder").
step11g_make_heritability_figure.py: Generates Table S5b and Figure 3a using outputs from step10g_get_CV_testing_accuracy.sh across PCA, logistic_PCA, and NN directories.
step11h_confirm_model_consistency.py: Generates Tables 1c and 1d as per manuscript methods.
step11i_get_independent_normal_GWAS_SNPs.py: Identifies SNP hits from standard logistic GWAS that are independent of six default AHF-defining SNPs.
step11j_make_figure3b.sh: Prepares files for Figure 3b, including six default SNPs and one additional independent SNP (rs73188900) found in the previous step.
step11k_make_table1.R: Creates Figure 3b and Table 1 using the output from the previous step. Also generates figure 1.
step11l_make_supp_figs.py: Produces Tables S1-S4 and Figures S1, S2, and partial input for Figure S3a.
step11m_finish_fig_S3a.R: Completes Figure S3a.

Intermediate Files

annov_alcohol.txt: relevant FUMA output when using rsIDs_GxAlcohol_effects.txt and rsIDs_GxAlcohol_effects_pvals.txt as input. Originally named "annov.txt".
annov_gender.txt: relevant FUMA output when using rsIDs_GxGender_effects.txt and rsIDs_GxGender_effects_pvals.txt as input. Originally named "annov.txt".
annov_main.txt: relevant FUMA output when using rsIDs_main_effects.txt and rsIDs_main_effects_pvals.txt as input. Originally named "annov.txt".
annov_smoking.txt: relevant FUMA output when using rsIDs_GxSmoking_effects.txt and rsIDs_GxSmoking_effects_pvals.txt as input. Originally named "annov.txt".
disease_gene_associations_logistic_PCA_smoking.tsv: Output from entering "CASZ1::AKR7A3::PPIE::HIVEP3::SSBP3::ST6GALNAC3::DPYD::NOS1AP::DDR2::SLC9A2::THSD7B::RAPGEF4::GPR155::ALS2::CHL1::IL5RA::THRB::CACNA2D3::CADPS::PRICKLE2::ROBO1::GAP43::LSAMP::AGTR1::MED12L::SERPINI1::MECOM::PARL::ST6GAL1::MUC4::EVC2::NPNT::HPGD::TENM3::LPCAT1::DROSHA::PRLR::GHR::EDIL3::CHSY3::FAM53C::EXOC2::ATXN1::BTBD9::TFEB::RUNX2::ESR1::TMEM242::PDE10A::COX19::ICA1::CREB5::PKD1L1::CALN1::CALCR::RINT1::ATXN7L1::DGKI::DPP6::TNFRSF10B::ADRA1A::ASPH::HNF4G::UBR5::TBC1D31::TMEM65::SH3GL2::FBP1::PALM2-AKAP2::GARNL3::FAM107B::CXCL12::MYPN::CDH23::CCDC147::SORCS3::RBM20::GRK5::ATE1::SYT9::SBF2::SLC22A8::TENM4::MAML2::NTM::TMTC1::ANO4::ATP12A::LRCH1::HTR2A::KLHL1::GPC6::FNTB::GALNT16::SMOC1::MAP3K9::KCNK13::C14orf159::ITPK1::UNC79::EML1::RYR3::WDR72::UNC13C::ARNT2::KIAA1199::PDE8A::SLCO3A1::CDIP1::NXN::PIK3R5::CA10::GAA::PIEZO2::BCL2::CLEC4M::ZNF627::SLC1A6::KDELR1::MACROD2::PCSK2::STX16-NPEPL1::CDH4::RIPK4::POF1B" into https://www.disgenet.org/search. Output file was originally named "54897__22977__10450__59269__23648__256435__1806__9722__4921__6549__80731__11069__151556__57679__10752__3568__7068__55799__8618__166336__6091__2596__4045__185__116931__5274__2122__55486__6480__4585__132884__2557.tsv" and was downloaded on 9/12/2023.
disease_gene_associations_NN_smoking.tsv: Output from entering "EPHA8::PGM1::SCCPDH::CNTNAP5::PLEKHM3::CNTN3::MAP1B::GRK6::GLP1R::COL19A1::SCARA5::ANP32B::NAV2::KSR2::TMEM132B::XYLT1::AP2B1::DMD" into https://www.disgenet.org/search. Output file was originally named "2046__5236__129684__389072__5067__4131__2870__2740__1310__286133__10541__89797__283455__114795__64131__163__1756_gene_gda_summary.tsv" and was downloaded on 9/12/2023.
disease_gene_associations_PCA_smoking.tsv: Output from entering "CAMTA1::ECE1::HSPG2::CSMD2::NFIA::LPAR3::CHIA::KCNN3::PAPPA2::SOX13::PTPN14::TP53BP2::DNAH14::PCNXL2::GPR137B::RYR2::OR2L13::GRHL1::ALK::NDUFAF7::GALM::SLC8A1::PRKCE::STON1-GTF2A1L::GPR75-ASB3::EML6::ARHGAP25::LRRTM4::CTNNA2::ACVR1::STK39::CCDC173::PDE11A::CCDC141::PAX3::SRGAP3::DYNC1LI1::ROBO2::GUCA1C::IGSF11::ADCY5::NEK11::MLF1::PHC3::TNIK::KCNMB2::EIF2B5::EIF2B5::LPP::DLG1::CLNK::KCNIP4::GPR125::GABRB1::FIP1L1::EPHA5::DCLK2::KIAA0922::RAPGEF2::TRIO::MYO10::EGFLAM::PDE4D::ARHGEF28::ARSB::EFNA5::GRAMD3::UBE2D2::STK32A::COL23A1::F13A1::DEK::CDKAL1::OPN5::EYS::EPHA7::PREP::CLVS2::TRDN::PERP::HIVEP2::SYNE1::SMOC2::GET4::TNRC18::HDAC9::JAZF1::AMPH::GLI3::WBSCR17::GTPBP10::COL26A1::TBXAS1::CNTNAP2::CSMD1::SGCZ::CSGALNACT1::PIWIL2::DPYSL2::ELP3::ZMAT4::XKR4::CNGB3::SDC2::NCALD::MTSS1::ASAP1::KCNQ3::SLA::ZFAT::KANK1::AK3::OSTF1::PCSK5::GNAQ::TLE1::STX17::ZNF462::SUSD1::COL27A1::AKNA::BRINP1::DENND1A::CAMK1D::C1QL3::RSU1::PARD3::TSPAN14::HPS1::CTBP2::C10orf90::TMEM41B::INSC::NELL1::SLC17A6::SHANK2::DLG2::RAB38::OPCML::RAD52::SLC2A14::GRIN2B::TMCC3::ANKS1B::NUP37::BRAP::DTX1::CCDC92::TMEM132D::CRYL1::ATP8A2::FREM2::FAM124A::PCDH9::ABCC4::NALCN::COL4A1::DCUN1D2::NPAS3::EGLN3::RGS6::CDC42BPB::ATP10A::GABRB3::GABRG3::MEIS2::FBN1::CGNL1::RORA::VPS13C::ZNF609::KIAA1024::PDE8A::MFGE8::ABHD2::CERS3::ADAMTS18::WWOX::CDH13::STX8::PRKCA::SLC39A11::RIT2::ST8SIA5::DCC::MBP::ATP9B::ATCAY::TICAM1::C19orf47::PSG8::PLAUR::ZNF227::VASP::SLC17A7::ZNF175::ZBTB45::PTPRT::CDH22::RUNX1::YBEY::CECR2::CABIN1::SGSM1::LARGE::MPPED1::PJA1" into https://www.disgenet.org/search. Output file was originally named "23261__1889__3339__114784__4774__23566__27159__3782__60676__9580__5784__7159__127602__7107__6262__284521__29841__238__55471__130589__6546__5581__286749__100302652__400954__9938__80059__1496__90__27347__50940__2.tsv" and was downloaded on 9/12/2023.
miEAA.csv: Output from entering MIRS_smoking.txtinto https://ccb-compute2.cs.uni-saarland.de/mieaa2/user_input/ by following manual steps 1 and 2 in step11b_get_enrichment_all.R. Was originally named "miEAA - miRNA Enrichment and Annotation -- Analysis results.csv" and was downloaded on 5/29/2023.
rsIDs_GxAlcohol_effects.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_GxAlcohol_effects_pvals.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_GxExercise_effects.txt: Not used as FUMA input due to there being only one SNP hit.
rsIDs_GxExercise_effects_pvals.txt: Not used as FUMA input due to there being only one SNP hit.
rsIDs_GxGender_effects.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_GxGender_effects_pvals.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_GxSmoking_effects.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_GxSmoking_effects_pvals.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_main_effects.txt: FUMA input. Refer to Important Note on FUMA details.
rsIDs_main_effects_pvals.txt: FUMA input. Refer to Important Note on FUMA details.
SNP_MAFs_rsIDs_GxAlcohol_effects.txt: for possible future use
SNP_MAFs_rsIDs_GxExercise_effects.txt: for possible future use
SNP_MAFs_rsIDs_GxGender_effects.txt: for possible future use
SNP_MAFs_rsIDs_GxSmoking_effects.txt: used by step11f_miRNA_enrichment_analysis.R and for possible future use
SNP_MAFs_rsIDs_main_effects.txt: for possible future use
step11e_logistic_PCA_GxAlcohol_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_logistic_PCA_GxAlcohol_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_logistic_PCA_GxGender_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_logistic_PCA_GxGender_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_logistic_PCA_GxSmoking_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_logistic_PCA_GxSmoking_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_logistic_PCA_main_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_logistic_PCA_main_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_GxAlcohol_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_GxAlcohol_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_GxGender_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_GxGender_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_GxSmoking_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_NN_GxSmoking_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_NN_main_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_NN_main_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_GxAlcohol_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_GxAlcohol_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_GxGender_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_GxGender_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_GxSmoking_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_PCA_GxSmoking_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a. Used by step11f_miRNA_enrichment_analysis.R to make input for https://www.disgenet.org/search.
step11e_PCA_main_rsIDs_known.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.
step11e_PCA_main_rsIDs_novel.txt: Shows SNP hit subsets used by step10e_access_common_SNPs.py to make table 1a.

Output Files

step11f_logistic_PCA_miRNA_associated_genic_SNP_enrichment.txt: counts and p-value for enrichment of logistic PCA SNP hits inside of genes linked to miRNA
step11f_NN_miRNA_associated_genic_SNP_enrichment.txt: counts and p-value for enrichment of NN SNP hits inside of genes linked to miRNA
step11f_PCA_miRNA_associated_genic_SNP_enrichment.txt: counts and p-value for enrichment of PCA SNP hits inside of genes linked to miRNA
table1a.txt: refer to manuscript
table1b.txt: refer to manuscript
table1c.txt: refer to manuscript
table1d.txt: refer to manuscript
table2a.txt: refer to manuscript
table2b.txt: refer to manuscript
table2c.txt: refer to manuscript
table3a.txt: refer to manuscript
table3b.txt: refer to manuscript
table3c.txt: refer to manuscript
table3d.txt: refer to manuscript

Resource Folders

all_sig_rsIDs_logistic_PCA: all statistically significant SNP hits for logistic PCA latent phenotypes
all_sig_rsIDs_NN: all statistically significant SNP hits for NN latent phenotypes
all_sig_rsIDs_PCA: all statistically significant SNP hits for PCA latent phenotypes
genevestigator_hits: all studies with differentially expressed genes
genevestigator_hits_methods: contains methodological details regarding genevestigator_hits

epistasislab / latent_phenotype_project Goto Github PK