Tips for ROSALIND Homework

Bioinformatics Stronghold 2 (Num 14-23)

Problem 14 ORF (0.5 pt)

Make sure every ORF you translate starts with a start codon and ends with a stop codon, with no other stop codons in between. Avoid while loops; you don’t need them for this one. Biopython functions for handling sequences may be useful, or you can reuse what you did in Problem 9 PROT.
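If you would rather not use Biopython, here is one possible sketch that uses only for loops. The names CODON_TABLE, reverse_complement, and find_orfs are made up for illustration, and the codon table assumes the standard genetic code written in TCAG order. Treat it as a starting point, not the required approach.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# "*" marks a stop codon.
BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINOS)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna):
    """Return the set of distinct proteins translated from ORFs on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue  # an ORF must begin with the start codon
            protein = []
            for i in range(start, len(strand) - 2, 3):
                amino = CODON_TABLE[strand[i:i + 3]]
                if amino == "*":
                    # first stop codon reached, so this ORF is complete
                    proteins.add("".join(protein))
                    break
                protein.append(amino)
            # if the inner loop ends without a stop codon, nothing is saved
    return proteins
```

Using a set means a protein that arises from more than one ORF is only reported once.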

Problem 15 SPLC (0.5 pt)

Straightforward. Remove the given introns, concatenate the exons, and translate (just like your cells do!)
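The splicing step might look something like the sketch below. The function name splice is hypothetical, and it assumes each intron appears in the sequence exactly as given. Translation is left out here; you can reuse your code from PROT.

```python
def splice(dna, introns):
    """Remove each intron from dna once, leaving the concatenated exons."""
    for intron in introns:
        # count=1 removes only the first occurrence of this intron
        dna = dna.replace(intron, "", 1)
    return dna
```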

Problem 16 TRAN (0.5 pt)

Straightforward once you review the difference between transitions and transversions. One approach: if you have a list of transitions called ts, you can test whether a given mutation called mut is in that list with

>>> mut in ts #tests whether mut is in ts and returns True or False
True
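Putting that membership test to work, one way to count both kinds of mutation is sketched below. The function name ts_tv_ratio is made up, and the sketch assumes the two strings have equal length and differ by at least one transversion (otherwise the division fails).

```python
def ts_tv_ratio(s1, s2):
    """Return the transition/transversion ratio between two equal-length DNA strings."""
    ts = ["AG", "GA", "CT", "TC"]  # purine<->purine and pyrimidine<->pyrimidine
    n_ts = 0
    n_tv = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue  # not a mutation
        mut = a + b
        if mut in ts:
            n_ts += 1
        else:
            n_tv += 1
    return n_ts / n_tv
```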

Problem 17 KMER (0.5 pt)

Try initializing a dictionary with a key for every possible 4-mer (itertools.product may be helpful). Then iterate over the DNA sequence and increment the count in your dictionary for each 4-mer you encounter. Remember that a string of length n can be seen as composed of n−k+1 overlapping k-mers.
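A minimal sketch of that idea follows. The function name kmer_composition and its k parameter are made up for illustration; the problem itself uses k = 4.

```python
from itertools import product


def kmer_composition(dna, k=4):
    """Return the counts of every possible k-mer, in alphabetical order of k-mer."""
    # one key per possible k-mer, each starting at a count of zero
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    # a string of length n has n - k + 1 overlapping k-mers
    for i in range(len(dna) - k + 1):
        counts[dna[i:i + k]] += 1
    return [counts[kmer] for kmer in sorted(counts)]
```

Starting every key at zero means k-mers that never occur still show up in the output.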

Problem 18 SETO (1 pt)

Write separate functions to calculate the union, intersection, and set difference of two sets (or use Python’s built-in set operations). Then build the required sets from the input and call your functions.
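If you go the built-in route, the operators |, &, and - do most of the work. The sketch below uses a made-up function name, set_operations, and assumes the universe is {1, ..., n} as in the problem statement.

```python
def set_operations(n, a, b):
    """Return union, intersection, both differences, and both complements."""
    universe = set(range(1, n + 1))
    return [
        a | b,         # union
        a & b,         # intersection
        a - b,         # in a but not in b
        b - a,         # in b but not in a
        universe - a,  # complement of a
        universe - b,  # complement of b
    ]
```

Parsing the input sets and printing them in the required format is still up to you.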

Problem 19 DBRU (1 pt)

Once you understand how sets work in Python and reuse your reverse complement function from previous problems (or Biopython’s function), printing the adjacency list should be fairly straightforward.
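As a rough sketch, the set takes care of duplicates and each (k+1)-mer contributes one edge from its prefix to its suffix. The names reverse_complement and de_bruijn_edges are made up for illustration.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def de_bruijn_edges(kmers):
    """Return sorted (prefix, suffix) edges for the k-mers and their reverse complements."""
    # the set drops any string that appears more than once
    all_kmers = set(kmers) | {reverse_complement(k) for k in kmers}
    # each string gives an edge from its first k-1 letters to its last k-1 letters
    return sorted((k[:-1], k[1:]) for k in all_kmers)
```

You would still need to print each edge in the format the problem asks for.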

Problem 20 EDIT (1 pt)

Build a matrix and fill it in just like when you do it by hand. See Resources on Sakai for slides reviewing pairwise alignment with dynamic programming. Write out the Sample by hand to verify your matrix is filled correctly.

#one way to build a matrix is a list of lists
#try this in your shell:
M = []
for i in range(5):      #make a 5 row x 10 column matrix filled with zeroes
    M.append([0] * 10)
M                       #view matrix M
M[0][1] = 1             #change the 0 at row 0, column 1 to a 1
M                       #view matrix M
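Building on the list-of-lists matrix above, one possible way to fill it in is sketched here. The function name edit_distance is made up, and the sketch assumes every substitution, insertion, and deletion costs 1.

```python
def edit_distance(s, t):
    """Return the edit distance between strings s and t."""
    rows = len(s) + 1
    cols = len(t) + 1
    M = []
    for i in range(rows):
        M.append([0] * cols)
    # first column and first row: distance from an empty prefix
    for i in range(rows):
        M[i][0] = i
    for j in range(cols):
        M[0][j] = j
    # fill the rest row by row, just like by hand
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            M[i][j] = min(
                M[i - 1][j] + 1,         # gap in t (move down)
                M[i][j - 1] + 1,         # gap in s (move right)
                M[i - 1][j - 1] + cost,  # match or substitution (diagonal)
            )
    return M[-1][-1]
```

The answer is the value in the bottom-right cell, which is what M[-1][-1] picks out.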

Problem 21 EDTA (1 pt)

Builds on Problem 20 EDIT. You’ll need a second matrix to keep track of “arrows”, so you can perform a traceback to align the sequences.
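Here is one hedged sketch of the two-matrix idea. The names edit_alignment, score, and arrow are made up, and when several moves tie the sketch simply prefers the diagonal, so your alignment may differ from the Sample Output while still being optimal.

```python
def edit_alignment(s, t):
    """Return (edit distance, aligned s, aligned t) using a traceback matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    arrow = [[""] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i
        arrow[i][0] = "up"
    for j in range(1, cols):
        score[0][j] = j
        arrow[0][j] = "left"
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            up = score[i - 1][j] + 1
            left = score[i][j - 1] + 1
            best = min(diag, up, left)
            score[i][j] = best
            # remember which neighboring cell the best score came from
            if best == diag:
                arrow[i][j] = "diag"
            elif best == up:
                arrow[i][j] = "up"
            else:
                arrow[i][j] = "left"
    # traceback: follow the arrows from the bottom-right cell to the top-left
    i, j = len(s), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if arrow[i][j] == "diag":
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i -= 1
            j -= 1
        elif arrow[i][j] == "up":
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    # the strings were built back to front, so reverse them
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

A quick sanity check: the number of columns where the two aligned strings differ should equal the edit distance.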

On the due date, you may be asked in class how your code works, and your comments will help you explain it. Inability to explain how your code works will result in a reduced grade.

BONUS PROBLEMS

Note: You will only get credit for these if you have completed 14-21 correctly, so focus on those first. These two are much more challenging.

Problem 22 LONG (0.5 pt BONUS)

Builds on Problem 11 GRPH. (Review “Overlap Graphs” videos under Resources on Sakai for inspiration.)

Problem 23 GASM (0.5 pt BONUS)

Builds on Problem 19 DBRU. (Review “de Bruijn Graphs” videos under Resources on Sakai for inspiration.)

Optional challenges to help you prepare for your projects

Connect to compbio.cs.luc.edu and test running your scripts from the command line.
- Check out the Unix scp command
Avoid hardcoding your input/output file paths in your scripts. Try passing file paths into your script via the command line.
- Check out the Python modules sys and argparse
- See example problem and solution on the compbio server: /homes/hwheeler/python_examples/DNA.py
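
As a rough idea of what passing file paths with argparse can look like, here is a small sketch. The function name parse_args, the flag names -i/--input and -o/--output, and the description text are all made up for illustration and are not taken from DNA.py.

```python
import argparse


def parse_args(argv=None):
    """Parse input/output file paths from the command line."""
    parser = argparse.ArgumentParser(description="Example ROSALIND script")
    parser.add_argument("-i", "--input", required=True, help="path to the input file")
    parser.add_argument("-o", "--output", required=True, help="path to the output file")
    # argv=None makes argparse read the real command line (sys.argv[1:])
    return parser.parse_args(argv)


# In a script you might then write:
#   args = parse_args()
#   with open(args.input) as infile: ...
# and run it as (hypothetical script name):
#   python my_script.py -i rosalind_orf.txt -o orf_output.txt
```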