Input:
- PLINK .bed, .bim, .fam format files (examples in /home/wheelerlab2/comp383/plink_test_data/)
Output:
- Quality control plots/reports
- Filtered PLINK .bed, .bim, .fam format files ready for GWAS
- Filtered .vcf files ready for imputation
Goal:
- Build a quality control pipeline that prepares genome-wide genotype data for downstream GWAS and other applications
- Document the process with step-by-step instructions
Pipeline Outline
- Most steps can be run in PLINK and the results plotted with Python modules like matplotlib. We have previously used R in my lab to make plots; I would like you to use Python instead.
- Build a Python script with system calls to other tools to run the pipeline.
- Consider using a Jupyter notebook to make it easier to test plotting functions
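As a starting point for the "Python script with system calls" idea, here is a minimal sketch of a wrapper around PLINK using the standard `subprocess` module. The helper names `plink_command` and `run_step` and the paths `data/study` and `qc/step1` are hypothetical; `--bfile`, `--missing`, and `--out` are standard PLINK 1.9 flags, but check them against the PLINK version installed on the server.

```python
import shutil
import subprocess


def plink_command(bfile, out, *flags, plink="plink"):
    """Build the argument list for one PLINK call (no shell involved).

    bfile and out are file prefixes, i.e. paths without .bed/.bim/.fam.
    """
    return [plink, "--bfile", bfile, *[str(f) for f in flags], "--out", out]


def run_step(cmd):
    """Run one pipeline step and stop the pipeline if the tool fails."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(cmd, check=True)


# Hypothetical prefixes: build the call that writes missingness reports.
cmd = plink_command("data/study", "qc/step1", "--missing")
print(" ".join(cmd))
```

Passing the command as a list rather than a single shell string avoids quoting problems with file paths, and separating "build the command" from "run the command" makes each step easy to inspect in a notebook before it is executed with `run_step(cmd)`.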
- Sex check
- SNP call rate: plot histogram & filter
- Person call rate: plot histogram & filter
- Calculate Hardy-Weinberg statistics & filter SNPs
- LD prune for relationship check & heterozygosity calculation
- Relationship check: plot IBD stats & filter related individuals (See table here for IBD info)
- Heterozygosity check: plot & filter outliers
- Principal component analysis (PCA) to determine genetic ancestry
  - check genome build (NCBI36/hg18 or GRCh37/hg19 or newer?)
  - automate the merge with HapMap3 genotypes (/home/wheelerlab2/Data/HAPMAP3_hg1*/)
  - run smartpca to get principal components (see documentation in /home/wheelerlab2/EIG-6.1.4/EIGENSTRAT/README)
  - plot and choose threshold for filtering people (probably can’t automate)
  - rerun smartpca with filtered set (no HapMap3)
- Plate effects analysis (if data is available)
- Prepare for imputation
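The call-rate steps in the outline above ("plot histogram & filter") might be sketched as follows. This assumes the missingness reports come from PLINK's `--missing` option, whose `.lmiss` (per-SNP) and `.imiss` (per-person) files include an `F_MISS` column. The function names, the toy rows, and the 0.95 threshold are illustrative assumptions, not values prescribed by this project.

```python
def read_call_rates(lines):
    """Return call rates (1 - F_MISS) from the lines of a .lmiss or .imiss report."""
    header = lines[0].split()
    col = header.index("F_MISS")  # locate the column by name, not position
    return [1.0 - float(row.split()[col]) for row in lines[1:] if row.strip()]


def count_below(call_rates, threshold):
    """Count how many SNPs (or people) would be removed at this threshold."""
    return sum(1 for rate in call_rates if rate < threshold)


def plot_call_rates(call_rates, threshold, png_path, label="SNP call rate"):
    """Save a histogram of call rates with the candidate threshold marked."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed on a server
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(call_rates, bins=50)
    ax.axvline(threshold, color="red", linestyle="--")
    ax.set_xlabel(label)
    ax.set_ylabel("Count")
    fig.savefig(png_path, dpi=150)
    plt.close(fig)


# Toy rows in .lmiss layout (made-up values, for illustration only).
toy_lmiss = [
    " CHR  SNP  N_MISS  N_GENO  F_MISS",
    "   1  rs1       0     100       0",
    "   1  rs2       2     100    0.02",
    "   1  rs3      10     100     0.1",
]
rates = read_call_rates(toy_lmiss)
n_removed = count_below(rates, 0.95)
```

With real data, the lines would come from something like `open("qc/step1.lmiss").read().splitlines()` (a hypothetical path), and `plot_call_rates(rates, 0.95, "snp_call_rate.png")` would write the plot. Once a threshold is chosen from the histogram, the filter itself can be applied in PLINK with `--geno` (SNPs) or `--mind` (people), both of which take the maximum allowed missing fraction, i.e. one minus the call-rate threshold.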