GWAS Quality Control Pipeline Project

Most steps can be run in PLINK and results plotted with Python modules like matplotlib. We have previously used R in my lab to make plots, but I would like you to use Python instead.
Build a Python script with system calls to other tools to run the pipeline.
Consider using a Jupyter notebook to make testing the plotting functions easier.
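
As a starting point, here is a minimal sketch of what "a Python script with system calls" could look like. It assumes PLINK 1.9 is on the PATH as `plink`; the helper names (`plink_cmd`, `run`, `build_steps`), the output prefixes, and the LD-pruning window values (50 5 0.3) are placeholders for illustration, not settled choices. A real pipeline would also filter between steps and feed each filtered file set into the next step.

```python
import shlex
import subprocess

PLINK = "plink"  # assumption: PLINK 1.9 is installed and on the PATH


def plink_cmd(bfile, out, *args):
    """Build a PLINK command as an argument list (no shell involved)."""
    return [PLINK, "--bfile", bfile, *args, "--out", out]


def run(cmd, dry_run=False):
    """Print the command, then execute it unless dry_run is set."""
    print(shlex.join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if PLINK exits with an error
        subprocess.run(cmd, check=True)


def build_steps(bfile, prefix):
    """Return the statistics-generating steps, in order, as {name: command}."""
    pruned = f"{prefix}_prune.prune.in"  # file written by the LD-pruning step
    return {
        "sexcheck": plink_cmd(bfile, f"{prefix}_sexcheck", "--check-sex"),
        "missing": plink_cmd(bfile, f"{prefix}_missing", "--missing"),
        "hardy": plink_cmd(bfile, f"{prefix}_hardy", "--hardy"),
        # placeholder window size, step, and r^2 threshold
        "prune": plink_cmd(bfile, f"{prefix}_prune",
                           "--indep-pairwise", "50", "5", "0.3"),
        "ibd": plink_cmd(bfile, f"{prefix}_ibd", "--extract", pruned, "--genome"),
        "het": plink_cmd(bfile, f"{prefix}_het", "--extract", pruned, "--het"),
    }


if __name__ == "__main__":
    for name, cmd in build_steps("mydata", "qc").items():
        run(cmd, dry_run=True)  # set dry_run=False to actually call PLINK
```

Passing the command as a list rather than a single shell string avoids quoting problems with file paths, and the dry-run mode makes it easy to check the commands in a notebook before anything is executed.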

Sex check
SNP call rate: plot histogram & filter
Person call rate: plot histogram & filter
Calculate Hardy-Weinberg statistics & filter SNPs
LD prune for relationship check & heterozygosity calculation
Relationship check: plot IBD stats & filter related individuals (see table here for IBD info)
Heterozygosity check: plot & filter outliers
Principal component analysis (PCA) to determine genetic ancestry
- check genome build (NCBI36/hg18 or GRCh37/hg19 or newer?)
- automate the merge with HapMap3 genotypes (/home/wheelerlab2/Data/HAPMAP3_hg1*/)
- run smartpca to get principal components (see documentation in /home/wheelerlab2/EIG-6.1.4/EIGENSTRAT/README)
- plot and choose threshold for filtering people (probably can’t automate)
- rerun smartpca with filtered set (no HapMap3)
Plate effects analysis (if data is available)
Prepare for imputation
- make QC’d .bed, .bim, .fam files
- use HRC or 1000G Imputation prep tool
- make .vcf.gz files for upload to Michigan Imputation Server
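
The plot-and-filter steps in the list above (call rates, heterozygosity) share one pattern: read a PLINK output table, plot a histogram, and write out the IDs that fail a cutoff. A rough sketch for the heterozygosity check follows. It assumes the input is a PLINK 1.9 `.het` file with its usual `FID IID O(HOM) E(HOM) N(NM) F` header; the function names and the cutoff of 3 standard deviations from the mean F are assumptions to be revisited after looking at the plot.

```python
import statistics


def read_plink_table(lines):
    """Parse whitespace-delimited PLINK output (header row first) into dicts."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def flag_het_outliers(rows, n_sd=3.0):
    """Return (FID, IID) pairs whose F is more than n_sd SDs from the mean.

    n_sd=3.0 is an assumed default, not a recommended threshold.
    """
    f_values = [float(r["F"]) for r in rows]
    mean = statistics.mean(f_values)
    sd = statistics.stdev(f_values)
    return [
        (r["FID"], r["IID"])
        for r, f in zip(rows, f_values)
        if abs(f - mean) > n_sd * sd
    ]


def plot_histogram(values, xlabel, png_path, cutoffs=()):
    """Save a histogram of values, with optional vertical cutoff lines."""
    import matplotlib.pyplot as plt  # imported here so filtering works without it

    fig, ax = plt.subplots()
    ax.hist(values, bins=50)
    for cutoff in cutoffs:
        ax.axvline(cutoff, color="red", linestyle="--")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Count")
    fig.savefig(png_path, dpi=150)
    plt.close(fig)
```

With a real file, `read_plink_table(open("qc_het.het"))` would give the rows (the file name is hypothetical), and the flagged pairs can be written one `FID IID` per line to a file for PLINK's `--remove` option. The same reader should work for `.imiss` and `.lmiss` files when plotting the call rates.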