Bioinformatics Armory (Num 24 - 30)

Each problem will mention some outside software that could solve it under the Problem heading. However, you should use Biopython to write your own script to solve each problem. Be sure to use modules mentioned in the Programming Shortcut sections at the bottom of the page to make everything easier.

The first 2 are warm-up problems, introducing you to some Biopython tools that may come in handy for your projects and/or future research.

24 INI: Install Biopython or use the compbio.cs.luc.edu server. Once installed, the solution is easy (see Programming Shortcut). The easiest way to install is from the command line. Try this first:

conda install -c anaconda biopython

If that doesn’t work, make sure your computer can find the correct conda. Because I have both python2 and python3 installed, I had to first find the path to python2 from the terminal (Mac/Linux):

which python2

And then:

[path output from above]/anaconda/bin/conda install -c anaconda biopython

In the Python shell, try:

>>> import Bio

If there is no error, you’re set! If Bio can’t be found, make sure the IDE you’re using includes the paths to anaconda. To check, try this from the IDE shell:

>>> import sys
>>> sys.path

If anaconda paths are missing:

>>> sys.path.append('/path/to/anaconda/')

If that doesn’t work, see http://biopython.org/wiki/Download for other ways to install.
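
To confirm which interpreter your IDE is running and whether it can see Biopython, a short check like the sketch below may help. The helper name biopython_version is made up for illustration; it only relies on a normal import and the Bio.__version__ attribute.

```python
import sys


def biopython_version():
    """Return the installed Biopython version string, or None if Bio cannot be imported."""
    try:
        import Bio
    except ImportError:
        return None
    return Bio.__version__


# The interpreter path should point into your anaconda directory.
print("Interpreter:", sys.executable)
print("Biopython version:", biopython_version())
```

If the version prints as None, the interpreter shown on the first line is not the one conda installed Biopython into.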

25 GBK: OMG, automated retrieval from NCBI? Awesome!

Again, see the Programming Shortcut. Use the Bio.Entrez and Bio.SeqIO modules. Entrez requires your email address, a precaution against excessive usage. See the Biopython tutorial for details. You can always try your search on the NCBI Nucleotide database to figure out how to structure your query.
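
As a rough sketch of how an Entrez search can be structured, the example below builds a query string and asks NCBI how many records match it. The helper names build_term and count_records, the example organism and dates, and the placeholder email are assumptions for illustration; the [Organism] and [Publication Date] field tags are the ones the NCBI Nucleotide search page accepts, so test your own query there first.

```python
from Bio import Entrez


def build_term(organism, start_date, end_date):
    """Build a Nucleotide query limited to an organism and a publication date range."""
    return (
        f'"{organism}"[Organism] AND '
        f'("{start_date}"[Publication Date] : "{end_date}"[Publication Date])'
    )


def count_records(term, email, db="nucleotide"):
    """Return the number of records matching term (this contacts NCBI)."""
    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db=db, term=term)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])


# Building the query needs no network access, so it is safe to inspect first.
print(build_term("Anthoxanthum", "2003/7/25", "2005/12/27"))
```

To run the search itself, call count_records(build_term(...), "your.name@example.edu") with your own email address in place of the placeholder.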

The last 5 problems deal with the quality control of next-generation sequencing reads. These are best done in order.

26 TFSQ: Review the FASTQ format under How to Handle Quality. Don’t use any of the links described under the Problem heading. Instead, write your own one-liner using the Bio.SeqIO module (see Programming Shortcut).
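
The one-liner in question is most likely a call to SeqIO.convert. The sketch below runs it on a small made-up FASTQ record held in memory so that it works without any files; with real data you would pass file names or open file handles instead.

```python
from io import StringIO

from Bio import SeqIO

# A single made-up FASTQ record: header, sequence, separator, quality string.
FASTQ_TEXT = """@SEQ_1
GATTTGGGGTTCAAAGCAGT
+
!''*((((***+))%%%++)
"""

in_handle = StringIO(FASTQ_TEXT)
out_handle = StringIO()

# convert() reads one format, writes another, and returns the number of records.
n_records = SeqIO.convert(in_handle, "fastq", out_handle, "fasta")

fasta_text = out_handle.getvalue()
print(n_records)
print(fasta_text)
```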

27 PHRE: Although you would probably use FastQC software in your research, the goal here is to get some more practice with Biopython, so use the Programming Shortcut.
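
For reference, Biopython stores the phred scores of each FASTQ record in letter_annotations["phred_quality"]. The sketch below averages them for two made-up reads; the function name mean_qualities and the sample data are assumptions for illustration, and comparing each mean against the threshold is left to you.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up reads: "I" encodes phred 40 and "!" encodes phred 0 (Sanger offset 33).
FASTQ_TEXT = """@read_1
ACGT
+
IIII
@read_2
ACGT
+
!!!!
"""


def mean_qualities(handle):
    """Return the mean phred score of every record in a FASTQ handle."""
    means = []
    for record in SeqIO.parse(handle, "fastq"):
        phred = record.letter_annotations["phred_quality"]  # list of ints
        means.append(sum(phred) / len(phred))
    return means


print(mean_qualities(StringIO(FASTQ_TEXT)))
```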

28 FILT: Use the same Programming Shortcut from number 27 (PHRE) for obtaining phred scores. Be sure your conditional statements include reads at or above (>=) the thresholds q and p.
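
The "at or above" condition is easy to get wrong by one, so here is a minimal sketch of the test for a single read. It assumes q is the quality threshold, p is the minimum percentage of bases that must reach it, and phred is the list of scores you already extracted; the function name passes_filter is made up.

```python
def passes_filter(phred, q, p):
    """Return True if at least p percent of the scores in phred are >= q."""
    if not phred:
        return False  # an empty read cannot satisfy the filter
    good = sum(1 for score in phred if score >= q)
    # Integer comparison avoids floating-point rounding at the boundary.
    return 100 * good >= p * len(phred)


# Three of four bases reach q = 20, which is exactly 75 percent.
print(passes_filter([20, 20, 10, 30], q=20, p=75))
# Only two of four bases reach q = 20.
print(passes_filter([20, 10, 10, 30], q=20, p=75))
```

Both comparisons use >=, so a base with quality exactly q counts as good and a read with exactly p percent good bases is kept.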

29 BPHR: Get the phred scores like you’ve done previously. Consider making a list of lists or a matrix using the numpy module.
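
One way to use the matrix idea is sketched below: each row holds the phred scores of one read, so averaging down the columns gives the mean quality at each position. It assumes all reads have the same length, and the function name count_low_quality_positions and the sample scores are made up.

```python
import numpy as np


def count_low_quality_positions(phred_rows, q):
    """Count positions whose mean phred score across all reads is below q."""
    matrix = np.array(phred_rows)          # shape: (reads, positions)
    position_means = matrix.mean(axis=0)   # average down each column
    return int((position_means < q).sum())


# Two made-up reads of length 3; the position means are 30, 10 and 30.
scores = [
    [40, 10, 30],
    [20, 10, 30],
]
print(count_low_quality_positions(scores, q=20))
```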

30 BFIL: Likely the most challenging problem. Again, get the phred scores like you’ve done previously. Consider building a list of indices of where to trim the left end of sequences and a list of indices of where to trim the right end of sequences. Replace lines 2 and 4 of each record (the sequence and the quality string) with their trimmed versions, while keeping lines 1 and 3 stored somewhere so that you can print out the trimmed fastq file.
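
The trimming indices for one read can be found with two short scans, as in the sketch below. It assumes bases are removed from each end until the first base with quality at or above q is reached; the function name trim_bounds and the sample read are made up for illustration.

```python
def trim_bounds(phred, q):
    """Return (start, end) so that phred[start:end] drops low-quality ends."""
    start = 0
    while start < len(phred) and phred[start] < q:
        start += 1  # skip low-quality bases on the left
    end = len(phred)
    while end > start and phred[end - 1] < q:
        end -= 1    # skip low-quality bases on the right
    return start, end


# A made-up read: sequence, quality string, and the matching phred scores.
seq = "ACGTACG"
qual = "&+?)D-$"
phred = [5, 10, 30, 8, 35, 12, 3]

start, end = trim_bounds(phred, q=20)
# The same slice is applied to line 2 (sequence) and line 4 (quality string).
print(seq[start:end])
print(qual[start:end])
```

Note that the low-quality base in the middle of the read is kept, because only the two ends are trimmed.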