Input: A string BWT(Text), followed by a collection of Patterns. Output: A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.
Extra Dataset
Sample Input: TCCTCTATGAGATCCTATTCTATGAAACCTTCA$GACCAAAATTCTCCGGC CCT CAC GAG CAG ATC Sample Output: 2 1 1 0 1
Input: A string BWT(Text) followed by a collection of strings Patterns. Output: A list of integers, where the i-th integer corresponds to the number of substring matches of the i-th member of Patterns in Text.
Extra Dataset
Sample Input: GGCGCCGC$TAGTCACACACGCCGTA ACC CCG CAG Sample Output: 1 2 1
Time Limit: 5 mins
Input: A string Text followed by a collection of strings Patterns. Output: All starting positions in Text where a string from Patterns appears as a substring.
Extra Dataset
Sample Input: AATCGGGTTCAATCGGGGT ATCG GGGT Sample Output: 1 4 11 15
Time Limit: 5 mins
You should now be ready to design your own approach to solve the Multiple Approximate Pattern Matching Problem and use this solution to map real sequencing reads.
CODE CHALLENGE: Solve the Multiple Approximate Pattern Matching Problem. Input: A string Text, followed by a collection of strings Patterns, and an integer d. Output: All positions where one of the strings in Patterns appears as a substring of Text with at most d mismatches.
Extra Dataset
Sample Input: ACATGCTACTTT ATT GCC GCTA TATT 1 Sample Output: 2 4 4 6 7 8 9
Time Limit: 5 mins
smnpbnnaaaaa$a ana
Sample Input: 4 0->4:11 1->4:2 2->5:6 3->5:7 4->0:11 4->1:2 4->5:4 5->4:4 5->3:7 5->2:6 Sample Output: 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0
Sample Input: 4 1 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0 Sample Output: 2
Sample Input: 4 0 13 21 22 13 0 12 13 21 12 0 13 22 13 13 0 Sample Output: 0->4:11 1->4:2 2->5:6 3->5:7 4->0:11 4->1:2 4->5:4 5->4:4 5->3:7 5->2:6
Sample Input: 4 0 20 17 11 20 0 20 13 17 20 0 10 11 13 10 0 Sample Output: 0->5:7.000 1->6:8.833 2->4:5.000 3->4:5.000 4->2:5.000 4->3:5.000 4->5:2.000 5->0:7.000 5->4:2.000 5->6:1.833 6->5:1.833 6->1:8.833
Sample Input: 4 0 23 27 20 23 0 30 28 27 30 0 30 20 28 30 0 Sample Output: 0->4:8.000 1->5:13.500 2->5:16.500 3->4:12.000 4->5:2.000 4->0:8.000 4->3:12.000 5->1:13.500 5->2:16.500 5->4:2.000
CODE CHALLENGE: Implement SmallParsimony to solve the Small Parsimony Problem. Input: An integer n followed by an adjacency list for a rooted binary tree with n leaves labeled by DNA strings. Output: The minimum parsimony score of this tree, followed by the adjacency list of the tree corresponding to labeling internal nodes by DNA strings in order to minimize the parsimony score of the tree.
Note: Remember to run SmallParsimony on each individual index of the strings at the leaves of the tree.
Extra Dataset
Sample Input: 4 4->CAAATCCC 4->ATTGCGAC 5->CTGCGCTG 5->ATGGACGA 6->4 6->5 Sample Output: 16 ATTGCGAC->ATAGCCAC:2 ATAGACAA->ATAGCCAC:2 ATAGACAA->ATGGACTA:2 ATGGACGA->ATGGACTA:1 CTGCGCTG->ATGGACTA:4 ATGGACTA->CTGCGCTG:4 ATGGACTA->ATGGACGA:1 ATGGACTA->ATAGACAA:2 ATAGCCAC->CAAATCCC:5 ATAGCCAC->ATTGCGAC:2 ATAGCCAC->ATAGACAA:2 CAAATCCC->ATAGCCAC:5
Input: An integer n followed by an adjacency list for an unrooted binary tree with n leaves labeled by DNA strings. Output: The minimum parsimony score of this tree, followed by the adjacency list of the tree corresponding to labeling internal nodes by DNA strings in order to minimize the parsimony score of the tree.
Extra Dataset
Sample Input: 4 TCGGCCAA->4 4->TCGGCCAA CCTGGCTG->4 4->CCTGGCTG CACAGGAT->5 5->CACAGGAT TGAGTACC->5 5->TGAGTACC 4->5 5->4 Sample Output: 17 TCGGCCAA->CCAGGCAC:4 CCTGGCTG->CCAGGCAC:3 TGAGTACC->CAAGGAAC:4 CCAGGCAC->CCTGGCTG:3 CCAGGCAC->CAAGGAAC:2 CCAGGCAC->TCGGCCAA:4 CACAGGAT->CAAGGAAC:4 CAAGGAAC->CACAGGAT:4 CAAGGAAC->TGAGTACC:4 CAAGGAAC->CCAGGCAC:2
Time Limit: 5 mins
5 4 0->4 4->0 1->4 4->1 2->5 5->2 3->5 5->3 4->5 5->4 Sample Output: 1->4 0->5 3->4 2->5 5->2 5->4 5->0 4->1 4->5 4->3
1->5 0->4 3->4 2->5 5->2 5->4 5->1 4->0 4->5 4->3
Sample Input: 4 CGAAGATTCTAA->4 ATGCCGGGCTCG->4 CTTTTAGAAGCG->5 AACTCATGATAT->5 5->AACTCATGATAT 5->CTTTTAGAAGCG 5->4 4->ATGCCGGGCTCG 4->CGAAGATTCTAA 4->5 Sample Output: 22 AACTCATGATAT->CTATCAGGATCG:6 CTTTTAGAAGCG->CTATCAGGATCG:4 CGAAGATTCTAA->CTAACAGGCTCG:6 CTATCAGGATCG->CTTTTAGAAGCG:4 CTATCAGGATCG->CTAACAGGCTCG:2 CTATCAGGATCG->AACTCATGATAT:6 CTAACAGGCTCG->ATGCCGGGCTCG:4 CTAACAGGCTCG->CGAAGATTCTAA:6 CTAACAGGCTCG->CTATCAGGATCG:2 ATGCCGGGCTCG->CTAACAGGCTCG:4
21 AACTCATGATAT->CTATCATGCTAA:5 CTTTTAGAAGCG->CTTTCAGGCTCG:4 CGAAGATTCTAA->CTATCATGCTAA:4 CTATCATGCTAA->CTTTCAGGCTCG:4 CTATCATGCTAA->CGAAGATTCTAA:4 CTATCATGCTAA->AACTCATGATAT:5 CTTTCAGGCTCG->ATGCCGGGCTCG:4 CTTTCAGGCTCG->CTTTTAGAAGCG:4 CTTTCAGGCTCG->CTATCATGCTAA:4 ATGCCGGGCTCG->CTTTCAGGCTCG:4
Time Limit: 5 mins
Input: Integers k and m followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying FarthestFirstTraversal(Data, k), where the first point from Data is chosen as the first center to initialize the algorithm.
Extra Dataset
Sample Input: 3 2 0.0 0.0 5.0 5.0 0.0 5.0 1.0 1.0 2.0 2.0 3.0 3.0 1.0 2.0 Sample Output: 0.0 0.0 5.0 5.0 0.0 5.0
Squared Error Distortion Problem: Compute the squared error distortion of a set of data points with respect to a set of centers.
Input: A set of points Data and a set of centers Centers. Output: The squared error distortion Distortion(Data, Centers).
CODE CHALLENGE: Solve the Squared Error Distortion Problem. Input: Integers k and m, followed by a set of centers Centers and a set of points Data. Output: The squared error distortion Distortion(Data, Centers).
Extra Dataset
Sample Input: 2 2 2.31 4.55 5.96 9.08
3.42 6.03 6.23 8.25 4.76 1.64 4.47 4.33 3.95 7.61 8.93 2.97 9.74 4.03 1.73 1.28 9.72 5.01 7.27 3.77 Sample Output: 18.246
Input: Integers k and m followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying the Lloyd algorithm to Data and Centers, where the first k points from Data are selected as the first k centers.
Extra Dataset
Sample Input: 2 2 1.3 1.1 1.3 0.2 0.6 2.8 3.0 3.2 1.2 0.7 1.4 1.6 1.2 1.0 1.2 1.1 0.6 1.5 1.8 2.6 1.2 1.3 1.2 1.0 0.0 1.9 Sample Output: 1.800 2.867 1.060 1.140
Time Limit: 5 mins
Input: Integers k and m, followed by a stiffness parameter β, followed by a set of points Data in m-dimensional space. Output: A set Centers consisting of k points (centers) resulting from applying the expectation maximization algorithm for soft k-means clustering. Select the first k points from Data as the first centers for the algorithm and run the algorithm for 100 E-steps and 100 M-steps. Results should be accurate up to three decimal places.
Extra Dataset
Sample Input: 2 2 2.7 1.3 1.1 1.3 0.2 0.6 2.8 3.0 3.2 1.2 0.7 1.4 1.6 1.2 1.0 1.2 1.1 0.6 1.5 1.8 2.6 1.2 1.3 1.2 1.0 0.0 1.9 Sample Output: 1.662 2.623 1.075 1.148
Time Limit: 5 mins
Input: An integer n, followed by an n x n distance matrix. Output: The result of applying HierarchicalClustering to this distance matrix (using Davg), with each newly created cluster listed on each line.
Extra Dataset
Sample Input: 7 0.00 0.74 0.85 0.54 0.83 0.92 0.89 0.74 0.00 1.59 1.35 1.20 1.48 1.55 0.85 1.59 0.00 0.63 1.13 0.69 0.73 0.54 1.35 0.63 0.00 0.66 0.43 0.88 0.83 1.20 1.13 0.66 0.00 0.72 0.55 0.92 1.48 0.69 0.43 0.72 0.00 0.80 0.89 1.55 0.73 0.88 0.55 0.80 0.00 Sample Output: 4 6 5 7 3 4 6 1 2 5 7 3 4 6 1 2 5 7 3 4 6
Time Limit: 5 mins
CODE CHALLENGE: Solve the Probability of a Hidden Path Problem.
Given: A hidden path π followed by the states States and transition matrix Transition of an HMM (Σ, States, Transition, Emission). Return: The probability of this path, Pr(π).
Note: You may assume that transitions from the initial state occur with equal probability.
Extra Dataset
Sample Input: ABABBBAAAA
A B
A B A 0.377 0.623 B 0.26 0.74 Sample Output: 0.000384928691755
Time Limit: 5 mins
CODE CHALLENGE: Solve the Probability of an Outcome Given a Hidden Path Problem.
Input: A string x, followed by the alphabet from which x was constructed, followed by a hidden path π, followed by the states States and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: The conditional probability Pr(x|π) that x will be emitted given that the HMM follows the hidden path π.
Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset
Sample Input: zzzyxyyzzx
x y z
BAAAAAAAAA
A B
x y z A 0.176 0.596 0.228 B 0.225 0.572 0.203 Sample Output: 3.59748954746e-06
Time Limit: 5 mins
Input: A string x, followed by the alphabet from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: A path that maximizes the (unconditional) probability Pr(x, π) over all possible paths π.
Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset
Sample Input: xyxzzxyxyy
x y z
A B
A B A 0.641 0.359 B 0.729 0.271
x y z A 0.117 0.691 0.192 B 0.097 0.42 0.483 Sample Output: AAABBAAAAA
Time Limit: 5 mins
Input: A string x, followed by the alphabet from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: The probability Pr(x) that the HMM emits x.
Note: You may assume that transitions from the initial state occur with equal probability. Extra Dataset
Sample Input: xzyyzzyzyy
x y z
A B
A B A 0.303 0.697 B 0.831 0.169
x y z A 0.533 0.065 0.402 B 0.342 0.334 0.324 Sample Output: 1.1005510319694847e-06
Time Limit: 5 mins
Input: A threshold θ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: The transition matrix followed by the emission matrix of HMM(Alignment, θ).
Note: Your matrices can be either space-separated or tab-separated.
Extra Dataset
Sample Input: 0.289
A B C D E
EBA E-D EB- EED EBD EBE E-D E-D Sample Output: S I0 M1 D1 I1 M2 D2 I2 E S 0 0 1.0 0 0 0 0 0 0 I0 0 0 0 0 0 0 0 0 0 M1 0 0 0 0 0.625 0.375 0 0 0 D1 0 0 0 0 0 0 0 0 0 I1 0 0 0 0 0 0.8 0.2 0 0 M2 0 0 0 0 0 0 0 0 1.0 D2 0 0 0 0 0 0 0 0 1.0 I2 0 0 0 0 0 0 0 0 0 E 0 0 0 0 0 0 0 0 0
A B C D E S 0 0 0 0 0 I0 0 0 0 0 0 M1 0 0 0 0 1.0 D1 0 0 0 0 0 I1 0 0.8 0 0 0.2 M2 0.143 0 0 0.714 0.143 D2 0 0 0 0 0 I2 0 0 0 0 0 E 0 0 0 0 0
Time Limit: 5 mins
Input: A threshold θ and a pseudocount σ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: The transition and emission matrices of HMM(Alignment, θ, σ).
Note: Your matrices can be either space-separated or tab-separated.
Extra Dataset
Sample Input: 0.358 0.01
A B C D E
A-A ADA ACA A-C -EA D-A Sample Output: S I0 M1 D1 I1 M2 D2 I2 E S 0 0.01 0.819 0.172 0 0 0 0 0 I0 0 0.333 0.333 0.333 0 0 0 0 0 M1 0 0 0 0 0.398 0.592 0.01 0 0 D1 0 0 0 0 0.981 0.01 0.01 0 0 I1 0 0 0 0 0.01 0.981 0.01 0 0 M2 0 0 0 0 0 0 0 0.01 0.99 D2 0 0 0 0 0 0 0 0.5 0.5 I2 0 0 0 0 0 0 0 0.5 0.5 E 0 0 0 0 0 0 0 0 0
A B C D E S 0 0 0 0 0 I0 0.2 0.2 0.2 0.2 0.2 M1 0.771 0.01 0.01 0.2 0.01 D1 0 0 0 0 0 I1 0.01 0.01 0.327 0.327 0.327 M2 0.803 0.01 0.168 0.01 0.01 D2 0 0 0 0 0 I2 0.2 0.2 0.2 0.2 0.2 E 0 0 0 0 0
Time Limit: 5 mins
Input: A string x followed by a threshold θ and a pseudocount σ, followed by an alphabet Σ, followed by a multiple alignment Alignment whose strings are formed from Σ. Output: An optimal hidden path emitting x in HMM(Alignment, θ, σ).
Extra Dataset
Sample Input: AEFDFDC
0.4 0.01
A B C D E F
ACDEFACADF AFDA—CCF A–EFD-FDC ACAEF–A-C ADDEFAAADF Sample Output: M1 D2 D3 M4 M5 I5 M6 M7 M8
Time Limit: 5 mins
Input: A string x of symbols emitted from an HMM, followed by the HMM’s alphabet Σ, followed by a path π, followed by the collection of states of the HMM. Output: A transition matrix Transition followed by an emission matrix Emission that maximize Pr(x, π) over all possible transition and emission matrices.
Extra Dataset
Sample Input: yzzzyxzxxx
x y z
BBABABABAB
A B C Sample Output: A B C A 0.0 1.0 0.0 B 0.8 0.2 0.0 C 0.333 0.333 0.333
x y z A 0.25 0.25 0.5 B 0.5 0.167 0.333 C 0.333 0.333 0.333
Time Limit: 5 mins
Input: A number of iterations j, followed by a string x of symbols emitted by an HMM, followed by the HMM’s alphabet Σ, followed by the HMM’s states, followed by initial transition and emission matrices for the HMM. Output: Emission and transition matrices resulting from applying Viterbi learning for j iterations.
Extra Dataset
Sample Input: 100
zyzxzxxxzz
x y z
A B
A B A 0.599 0.401 B 0.294 0.706
x y z A 0.424 0.367 0.209 B 0.262 0.449 0.289 Sample Output: A B A 0.5 0.5 B 0.0 1.0
x y z A 0.333 0.333 0.333 B 0.4 0.1 0.5
Time Limit: 5 mins
Input: A string x, followed by the alphabet Σ from which x was constructed, followed by the states States, transition matrix Transition, and emission matrix Emission of an HMM (Σ, States, Transition, Emission). Output: An |x| x |States| matrix whose (i, k)-th element holds the conditional probability Pr(πi = k|x).
Extra Dataset
Sample Input: zyxxxxyxzz
x y z
A B
A B A 0.911 0.089 B 0.228 0.772
x y z A 0.356 0.191 0.453 B 0.040 0.467 0.493 Sample Output: A B 0.5438 0.4562 0.6492 0.3508 0.9647 0.0353 0.9936 0.0064 0.9957 0.0043 0.9891 0.0109 0.9154 0.0846 0.964 0.036 0.8737 0.1263 0.8167 0.1833
Time Limit: 5 mins
Given: A space-delimited list of integers Spectrum. Return: Graph(Spectrum).
Note: Throughout this chapter, all dataset problems implicitly use the standard integer-valued mass table for the regular twenty amino acids. Examples sometimes use the toy amino acid alphabet {X, Y} whose masses are 4 and 5, respectively.
Extra Dataset
Sample Input: 57 71 154 185 301 332 415 429 486 Sample Output: 0->57:G 0->71:A 57->154:P 57->185:K 71->185:N 154->301:F 185->332:F 301->415:N 301->429:K 332->429:P 415->486:A 429->486:G
Time Limit: 5 mins
Given: A space-delimited list of integers Spectrum. Return: An amino acid string that explains Spectrum.
Extra Dataset
Sample Input: 57 71 154 185 301 332 415 429 486 Sample Output: GPFNA
Time Limit: 5 mins
Given: An amino acid string P. Return: The peptide vector of P (in the form of space-separated integers).
Extra Dataset
Sample Input: XZZXX Sample Output: 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1
Time Limit: 5 mins
Given: A space-delimited binary vector P Return: An amino acid string whose binary peptide vector matches P. For masses with more than one amino acid, any choice may be used.
Extra Dataset
Sample Input: 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 Sample Output: XZZXX
Time Limit: 5 mins
Given: A space-delimited spectral vector Spectrum’. Return: An amino acid string with maximum score against Spectrum’. For masses with more than one amino acid, any choice may be used.
Extra Dataset
Sample Input: 0 0 0 0 4 -2 -3 -1 -7 6 5 3 2 1 9 3 -8 0 3 1 2 1 8 Sample Output: XZZXX
Time Limit: 5 mins
Given: A space-delimited spectral vector Spectrum’ and an amino acid string Proteome. Return: A substring of Proteome with maximum score against Spectrum’.
Extra Dataset
Sample Input: 0 0 0 0 4 -2 -3 -1 -7 6 5 3 2 1 9 3 -8 0 3 1 2 1 8 XZZXZXXXZXZZXZXXZ Sample Output: ZXZXX
Time Limit: 5 mins
Given: A set of space-delimited spectral vectors SpectralVectors, an amino acid string Proteome, and an integer threshold. Return: The set PSMthreshold(Proteome, SpectralVectors).
Extra Dataset
Sample Input: 0 -1 5 -4 5 3 -1 -4 5 -1 0 0 4 -1 0 1 4 4 4 1 -4 2 -2 -4 4 -5 -1 4 -1 2 5 -3 -1 3 2 -3 XXXZXZXXZXZXXXZXXZX 5 Sample Output: XZXZ
Time Limit: 5 mins
Given: A spectral vector Spectrum’, an integer threshold, and an integer max_score. Return: The size of the dictionary Dictionarythreshold(Spectrum’).
Note: Use the provided max_score for the height of your table. Your answer should be the number of peptides whose score is at least T and at most max_score.
Extra Dataset
Sample Input: 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 1 5 Sample Output: 3
Time Limit: 5 mins
Given: A spectral vector Spectrum’, an integer threshold, and an integer max_score. Return: The probability of the dictionary Dictionarythreshold(Spectrum’).
Note: Use the provided max_score for the height of your table.
Extra Dataset
Sample Input: 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 1 5 Sample Output: 0.375
Time Limit: 5 mins
Given: A peptide Peptide, a spectral vector Spectrum’, and an integer k. Return: A peptide Peptide’ related to Peptide by up to k modifications with maximal score against Spectrum’ out of all possibilities.
Extra Dataset
Sample Input: XXZ 0 4 -3 -2 3 3 -4 5 -3 -1 -1 3 4 1 -1 2 Sample Output: XX(-1)Z(+2)
Time Limit: 5 mins