For better or worse, new file formats are developed frequently in computational biology, often every time a new method or software is released. Thus, as budding computational biologists, you will often be tasked with parsing, filtering, or converting data in a particular way for a particular application. Tools or scripts will not be available for most applications. Thus, it is often quicker/safer/easier to write your own script. For this assignment, you are given the “real life” task of parsing and filtering imputed genotype data into a specified format.

Imputation in genetics refers to the statistical inference of unobserved genotypes. Because single nucleotide polymorphisms (SNPs) are linked throughout the genome, we can genotype a subset of known SNPs (~1 million) and then use reference populations like the Haplotype Reference Consortium to estimate the unmeasured genotypes (~40 million). Write a script in Python 3 that does the following:

Input: VCF (variant call format) file of imputed genotypes (example: chr22.vcf.gz). VCF is a text file format stored in a compressed manner. It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.

SNP reference file with chromosome, position, and rsID information (example: chr22_SNPref.tab.gz)

Example files are available in /homes/hwheeler/PythonParserHW on compbio.cs.luc.edu. Review these slides for ways to connect.

Output: A filtered file where filt_ preceeds the input VCF file name (example: filt_chr22.vcf.gz)

Your output should meet these criteria:

Your script should meet these criteria:

Upload your commented Python script (*.py file) to Sakai by 2:30PM on the due date. Don’t upload any input/output files. Your code must be commented in your own words. Code should be commented sufficiently so someone learning Python can understand it and get the script to run properly. Again, you may be asked to explain how your script works and your comments will help you.

TIPS and HINTS

Example INPUT

Example OUTPUT

When you use your script with the example input, your output file filt_chr22.vcf.gz should be 48165 lines long and zcat filt_chr22.vcf.gz | grep -c rs should return 48117.