For better or worse, new file formats are developed frequently in computational biology, often every time a new method or software is released. Thus, as budding computational biologists, you will often be tasked with parsing, filtering, or converting data in a particular way for a particular application. Tools or scripts will not be available for most applications. Thus, it is often quicker/safer/easier to write your own script. For this assignment, you are given the “real life” task of parsing and filtering imputed genotype data into a specified format.
Imputation in genetics refers to the statistical inference of unobserved genotypes. Because single nucleotide polymorphisms (SNPs) are linked throughout the genome, we can genotype a subset of known SNPs (~1 million) and then use reference populations like the Haplotype Reference Consortium to estimate the unmeasured genotypes (~40 million). Write a script in Python 3 that does the following:
Input: VCF (variant call format) file of imputed genotypes (example: chr22.vcf.gz
). VCF is a text file format stored in a compressed manner. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
SNP reference file with chromosome, position, and rsID information (example: chr22_SNPref.tab.gz
)
Example files are available in /homes/hwheeler/PythonParserHW
on compbio.cs.luc.edu
. Review these slides for ways to connect.
Output: A filtered file where filt_
preceeds the input VCF file name (example: filt_chr22.vcf.gz
)
Your output should meet these criteria:
#CHROM
line (example: line 12 in chr22.vcf.gz
)0_10219
on) should only contain “Estimated Alternate Allele Dosage” DS valuesID
column should be replaced with the rsID number when possible, i.e. the SNP has an rsID in the reference file. If there is no rsID, leave as is.Your script should meet these criteria:
argparse
, i.e. no file paths should be hardcoded in your script.compbio.cs.luc.edu
.Upload your commented Python script (*.py
file) to Sakai by 2:30PM on the due date. Don’t upload any input/output files. Your code must be commented in your own words. Code should be commented sufficiently so someone learning Python can understand it and get the script to run properly. Again, you may be asked to explain how your script works and your comments will help you.
##
to figure out where the dosages are.zcat chr22.vcf.gz | head -n 1000 | gzip > test.vcf.gz
on the command line will generate a text file of the first 1000 lines called test.vcf.gz
.gzip
module is useful for reading and writing compressed *.gz
files.lambda
and map
functions (useful for parsing out MAF, R2, and DS values).order of SNPs has been changed from chr22.vcf.gz
to increase SNP diversity
##fileformat=VCFv4.1
##filedate=2016.11.7
##source=Minimac3
##contig=<ID=22>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0_10219 0_9913
22 16050822 22:16050822 G A . PASS AF=0.14142;MAF=0.14142;R2=0.02982 GT:DS:GP 0|0:0.272:0.746,0.236,0.018 0|0:0.250:0.766,0.219,0.016
22 16051477 22:16051477 C A . PASS AF=0.00267;MAF=0.00267;R2=0.00953 GT:DS:GP 0|0:0.006:0.994,0.006,0.000 0|0:0.015:0.985,0.015,0.000
22 16052870 22:16052870 G A . PASS AF=0.00057;MAF=0.00057;R2=0.00038 GT:DS:GP 0|0:0.001:0.999,0.001,0.000 0|0:0.001:0.999,0.001,0.000
22 17071513 22:17071513 G A . PASS AF=0.05169;MAF=0.05169;R2=0.83684 GT:DS:GP 0|0:0.005:0.995,0.005,0.000 0|0:0.001:0.999,0.001,0.000
22 17072483 22:17072483 A G . PASS;GENOTYPED AF=0.92830;MAF=0.07170;R2=0.99386;ER2=0.86772 GT:DS:GP 0|1:1.000:0.000,1.000,0.000 0|0:0.056:0.944,0.056,0.000
22 17075353 22:17075353 C A . PASS;GENOTYPED AF=0.81872;MAF=0.18128;R2=0.99875;ER2=0.87432 GT:DS:GP 0|1:1.000:0.000,1.000,0.000 1|0:0.999:0.001,0.999,0.000
tabs were replaced with one space for viewing
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 0_10219 0_9913
22 17071513 rs11089253 G A . PASS AF=0.05169;MAF=0.05169;R2=0.83684 DS 0.005 0.001
22 17072483 rs2236639 A G . PASS;GENOTYPED AF=0.92830;MAF=0.07170;R2=0.99386;ER2=0.86772 DS 1.000 0.056
22 17075353 rs5747999 C A . PASS;GENOTYPED AF=0.81872;MAF=0.18128;R2=0.99875;ER2=0.87432 DS 1.000 0.999
filt_chr22.vcf.gz
should be 48165 lines long and zcat filt_chr22.vcf.gz | grep -c rs
should return 48117
.