THE UNIVERSITY OF HONG KONG
Department of Biochemistry
Bachelor of Science in Bioinformatics: Final Examination (2007-2008)
BIOC1805 Elements of Bioinformatics

Date: 6th May, 2008 (Tuesday)
Time: 9:30 am - 11:30 am

Candidates may use any calculator which fulfils the following criteria: (a) it should be self-contained, silent, battery-operated and pocket-sized; and (b) it should have numeral-display facilities only and be used only for the purpose of calculation. It is the candidate's responsibility to ensure that the calculator operates satisfactorily and the candidate must record the name and type of the calculator on the front page of the examination scripts. Lists of permitted/prohibited calculators will not be made available to candidates for reference. The onus will be on the candidate to ensure that the calculator used will not be in violation of the criteria listed above.

Answer all questions. Each question carries the marks indicated.

Q 1. (15 marks)
a) You wrote a program that
   1. read a sequence file obtained from the NCBI in GenPept format;
   2. called readseq to convert that file to FastA format;
   3. subsequently used readseq again to convert the FastA format file to GenPept format.
What is(are) the difference(s) between the original and final GenPept format files?
b) A gene sequence was downloaded, in default format, from the EMBL databank of the EBI and from GenBank of the NCBI. What difference(s) do you expect between the two files?
c) A colleague wants to store his multiple sequence alignment in a GenPept format file. Explain why this is possible. Is this the best format to store a multiple sequence alignment?
Give reasons for your answer.

Q 2. (15 marks)
You are using a dot-plot to compare two homologous nucleic acid sequences that code for a protein. You know that, although the corresponding amino acid sequences are very similar, there are many synonymous substitutions between the sequences.
a) If you use the k-tuple method, what consequences do you expect for k-tuple sizes of 2, 4 and 8?
b) Explain whether a window and threshold/stringency approach is better.
c) What window size and threshold/stringency do you expect to give the best results for that
method? Why?

Q 3. (10 marks)
a) Clustal uses either a k-tuple based alignment or a dynamic programming based method for the initial all-pair-wise alignment phase. What is(are) the main difference(s) between these two methods for pair-wise sequence alignment?
b) Explain the differences among global, semi-global and local pair-wise alignments.

Q 4. (10 marks)
a) Give 3 methods that are commonly used to represent the information in a multiple sequence alignment. Explain their main advantages and disadvantages for representing the alignment.
b) What are the main differences between the “block” and “gap” approaches to multiple
sequence alignment?

Q 5. (10 marks)
a) Explain the meaning of the positive, zero and negative values in the PAM and BLOSUM series of substitution matrices.
b) What are the main differences in the way that the PAM and BLOSUM series of matrices are derived?

Q 6. (10 marks)
A distance matrix determined from the nucleotide sequences of 5 taxa is given, in lower-triangular form, to the left.
[Distance matrix for taxa A, B, C, D and E in lower-triangular form; the layout did not survive extraction. Entries as they appear, in reading order: B: 15; C: 14, 16; rows D and E: 16, 14, 16, 6, 16, 16, 10.]
a) Calculate and draw the UPGMA tree relating these sequences.
b) Give the Newick format version of this tree, including the branch lengths.

Q 7. (10 marks)
a) What are the main uses of each of the three nucleotide BLAST programs (Megablast, Discontiguous Megablast and BlastN)?
b) Why is the expect score “E” used to assess the quality of a match between a query
sequence and a sequence in a large databank (such as GenBank)?
c) What do E values of 10 and 0.001 mean in terms of a BLAST sequence database search?

Q 8. (10 marks)
You have found an open reading frame in a piece of genomic sequence. How would you test computationally to see if it was likely to belong to a protein coding gene if it was from
a) a prokaryotic genome
b) a eukaryotic genome?

Q 9. (10 marks)
a) Give the principle behind the three main methods for protein structure modelling. Explain the level of sequence similarity between target and template appropriate for each method.
b) How does the apparently limited number of natural protein folds assist modelling projects?

------ End of Paper ------