# CS2220 Introduction to Computational Biology Assignment #3

```
CS2220 Introduction to Computational Biology
Assignment #3
To be submitted electronically before 2pm on Thurs 25 March 2010
This assignment contributes 10% to the final course grade
The purposes of this assignment are (a) to reinforce lessons on sequence comparison, and (b) to
give practice at identifying problems that can be posed in terms of optimization. Please provide
PROBLEM ONE: Dynamic Programming and Edit Distance between Strings.
[Associated Reading: The Practical Bioinformatician, chapter 10, and Algorithms on Strings, Trees
and Sequenc (...), chapter 11]
Since the trace-back paths in a dynamic programming table correspond one-to-one with the
optimal alignments, the number of distinct co-optimal alignments can be obtained by computing
the number of distinct trace-back paths.
(1a) [4 marks] Specify an algorithm (e.g., in pseudo-code) to compute the number of co-optimal
alignments, performed using dynamic programming or another method of your choice.
(1b) [1 mark] What is the asymptotic run time of your algorithm?
PROBLEM TWO: [1 mark] Global and Local Alignments
(2a) If using the same score/penalty function for all types of alignments, with higher scores
indicating greater similarity, then how will the local score for aligning sequences S and T compare
with the global score?
(i) always less
(ii) always less than or equal
(iii) always equal
(iv) always greater than or equal
(v) always greater than
(vi) none of the above
(2b) How will the local alignment for S and T compare with the global alignment?
(i) The positions aligned in the local alignment are always a strict subset of the positions
aligned in the global alignment
(ii) always a subset
(iii) always the same set
(iv) always a superset
(v) always a strict superset
(vi) none of the above
(2c) [Optional/ExtraCredit] Can you think of a new, atypical score function that would change either
or both of the answers you have given above? Explain very briefly.
PROBLEM THREE: Applications of Optimization.
[Associated Reading: BMC Systems Biology 2008, 2:47 doi:10.1186/1752-0509-2-47 This
article is available from: http://www.biomedcentral.com/1752-0509/2/47]
In this exercise you must venture into reading unfamiliar texts from unfamiliar fields with
insufficient background and incomplete information, but still be able to “smell” which aspects of a
problem contain optimization issues. You must be familiar with PCR and evolution, but otherwise
you do not need to do further reading or background study before answering these questions. For
this exercise, you should use terms like optimal and optimize according to their mathematical
meanings. Non-mathematical and non-computational optimization, such as choosing the "optimal"
color scheme to complement your personality, is absolutely not considered a type of optimization
in this exercise. Please state yes/no whether optimization is a good way to approach each of the
problems listed below. If yes, please describe how optimization could be set up to address the
problem, by giving brief English text for:
i. what the objective is,
ii. what the decision variables are (and their domains or constraints, if relevant), and
iii. what part of the goals would be achieved by performing this optimization.
SITUATION 2A: [2 marks] DRUG DEVELOPMENT
“Our goal is to improve the drug-like qualities of chemical compounds ("leads") that show initial
positive results during primary screening, so as to create modified compounds with efficacy and
low toxicity that are likely to succeed in pre-clinical and clinical trials. Because of the improved efficiency in
primary screening, refinement of lead compounds is now the major bottleneck in the drug
discovery process. Traditionally, the process of refining leads would utilize manual, benchtop
biological research that includes secondary screens, studies of the relationship between the
structure and activity of compounds and cellular toxicity measurements. This slow and expensive
process usually employs single measurements of biological activity without capturing both time
and space data from cells. Time and space data, which we call high content data, is important to
the understanding of complex cell functions. Without cellular analysis systems which provide high
content data on cell functions, the pharmaceutical and biotechnology industries have historically
been focused on a narrow range of targets, primarily the receptors on the surface of cells.
However, cell functions involve not only the number and distribution of specific receptors localized
on the surface of cells, but also the distribution and activity of other molecules on and within the
cells. For example, the cycle of internalization of receptors to the inside of cells and back to the
surface that regulates the responsiveness of many cells, involves numerous proteins in different
locations within cells and exhibits different activities. The ability to measure the time and space
activities of these proteins in relationship to specific cell functions, such as receptor-based
stimulation, is an important challenge for lead optimization.”
[Text adapted from the "Business Description" of the IPO statement of Cellomics Inc.
according to ipoportal.edgar-online.com, and from www.aapspharmaceutica.com]
SITUATION 2B: [2 marks] IMAGE ALIGNMENT
Image alignment is the process of matching one image, called the template, with another image.
It is one of the most widely used techniques in computer vision and is applicable to many goals,
including analysis of microscopy data in cell biology.
To automate image alignment, we must first determine the appropriate mathematical model
relating pixel positions in one image to pixel positions in another. Next, we must somehow
estimate the correct alignments relating pairs of images. The mathematical relationships that map
pixel coordinates from one image to another are typically chosen from a variety of parametric
motion models, including simple 2D transforms, planar perspective models, 3D camera rotations,
lens distortions, and mappings to non-planar surfaces. To facilitate working with images at
different resolutions, we use normalized device coordinates. For a typical (rectilinear) image or
video frame, we let the pixel coordinates range over [−1, 1] along the longer axis and [−x, x] along
the shorter, where x is the inverse of the aspect ratio. Once we have chosen a suitable pixel
representation system and a motion model to describe the range of possible alignments, we need
to devise some method to estimate the parameter values for the motion model alignment. One
approach is to shift or warp the images relative to each other and to look at how much the pixels
agree. Approaches that use pixel-to-pixel matching are often called direct methods, as opposed to
feature-based methods.
SITUATION 2C: [2 marks] PHYLOGENETIC TREES
A phylogenetic tree is a graphical representation of the evolutionary relationships among
entities that share a common ancestor. Those entities can be species, genes, genomes, or any
other operational taxonomic unit (OTU). More specifically, a phylogenetic tree, with its pattern of
branching, represents the descent from a common ancestor into distinct lineages. It is critical to
understand that the branching patterns and branch lengths that make up a phylogenetic tree can
rarely be observed directly, but rather they must be inferred from other information.
The principle underlying phylogenetic inference is quite simple: Analysis of the similarities
and differences among biological entities can be used to infer the evolutionary history of those
entities. However, in practice, taking the end points of evolution and inferring their history is not
straightforward. … The concept of descent with modification tells us that organisms sharing a recent
common ancestor should, on average, be more similar to each other than organisms whose last
common ancestor was more ancient. Therefore, it should be possible to infer evolutionary
relationships from the patterns of similarity among organisms. This is the principle that underlies
the various distance methods of phylogenetic reconstruction, all of which follow the same general
outline. First, a distance matrix (i.e., a table of “evolutionary distances” between each pair of taxa)
is generated. In the simplest case, the distances represent the dissimilarity between each pair of
taxa (mathematically, they are 1 – S, where S is the similarity). The resultant matrix is then used to
generate a phylogenetic tree. [Quoted from http://evolution-textbook.org]
SITUATION 2D: [2 marks] DETECTING GENOMIC LESIONS
Recent work has developed a new experimental protocol for detecting structural variation
in genomes with high efficiency. It is particularly suited to cancer genomes where the precise
breakpoints of alterations such as deletions or translocations vary between patients. The problem
of designing PCR primers is challenging because a large number of primer pairs are required to
detect alterations in the hundreds of kilobases range that can occur in cancer. Good primer pairs
must achieve high coverage of the region of interest, while avoiding primers that can dimerize with
each other, and while satisfying the traditional physico-chemical constraints of good PCR primers.
Established experimental techniques for detecting structural genomic changes include
array-CGH (Pinkel and Albertson, 2005), FISH (Perry et al., 1997) and End-sequence Profiling
(ESP) (Volik et al., 2006), but array-CGH will detect only copy number changes, FISH is
labor-intensive, and ESP is costly. PCR provides one possible solution to this problem because
appropriately designed primer pairs within 1 kb of the fusing breakpoints will amplify only in the
presence of the mutated DNA, and can amplify even with a small population of cells. Such
PCR-based screening has been useful in isolating deletion mutants in Caenorhabditis elegans
(Jansen et al., 1997).
We seek a PCR method with multiple simultaneous primers whose PCR products cover a
region in which breakpoints may occur. Every primer upstream of one breakpoint is in the same
orientation, opposite to the primers downstream of the second breakpoint. A primer pair can form a
PCR product only if a genomic lesion places the pair in close proximity. If the primer pairs are
spatially distinct, then any lesion will cause the amplification of exactly one primer pair. The
selected primers must adequately cover the entire region, must each be chosen from a unique
region of the genome, and must not dimerize with each other. Finally, each selected
primer must satisfy physico-chemical characteristics that allow it to prime the polymerase reaction.
The dimerization and physico-chemical characteristics of good primers are well-studied problems
with pre-existing methods available.
SITUATION 2E: [2 marks] NETWORK RECONSTRUCTION
In many complex systems found across disciplines, such as biological cells and organisms,
social networks, economic systems, and the Internet, individual elements interact with each other,
thereby forming large networks whose structure is often not known. In these complex networks,
local events can easily propagate, resulting in diverse spatio-temporal activity cascades, or
avalanches. Examples of such cascading activity are the propagation of diseases in social
networks, cascades of chemical reactions inside a cell, the propagation of neuronal activity in the
brain, and e-mail forwarding on the Internet.
Exploring all possible topological configurations for a complete network with N nodes is a
daunting task, since that number is on the order of 2^(N*N). Using a variety of methods and
assumptions, correlations in the dynamics between nodes have been successfully used to identify
functional links in relatively large networks such as obtained from MEG or fMRI recordings of brain
activity. A pure correlation approach, however, is prone to inducing false connectivity. For example,
it will introduce a link between two unconnected nodes if their activities are driven by common
inputs. More elaborate approaches such as Granger Causality, partial Granger Causality, partial
directed coherence, and transfer entropy partially cope with the problem of common input;
however, these methods require extensive data manipulations and data transformations and have
been mainly employed for small networks.
```
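Situation 2B describes normalized device coordinates: pixel coordinates span [−1, 1] along the longer image axis and [−x, x] along the shorter one, where x is the inverse of the aspect ratio. The sketch below is one possible reading of that convention, not part of the assignment. The function name `to_normalized` and the assumption that pixel (0, 0) and pixel (width, height) are opposite corners of the image are illustrative choices.

```python
def to_normalized(px, py, width, height):
    """Map pixel coordinates to normalized device coordinates.

    Assumption (not stated in the handout): (0, 0) and (width, height)
    are opposite corners, and the origin of the normalized system is
    the image centre.
    """
    longer = max(width, height)
    # Both axes are divided by the same factor, so pixels stay square:
    # the longer axis covers [-1, 1], the shorter covers [-x, x] with
    # x = shorter / longer (the inverse of the aspect ratio).
    nx = (2.0 * px - width) / longer
    ny = (2.0 * py - height) / longer
    return nx, ny


# A 640 x 480 frame: x = 480 / 640 = 0.75 on the shorter (vertical) axis.
corner = to_normalized(640, 480, 640, 480)
centre = to_normalized(320, 240, 640, 480)
```

Because the same scale factor is applied to both axes, a motion model estimated in these coordinates does not depend on the resolution at which the image was sampled.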
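Situation 2C mentions that, in the simplest case, the entries of a distance matrix are 1 − S, where S is the similarity between a pair of taxa. The sketch below illustrates only that first step. It assumes, purely for illustration, that S is the fraction of identical positions between two pre-aligned sequences of equal length; the handout does not fix a similarity measure, and the names `similarity`, `distance_matrix`, and the toy sequences are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two aligned sequences agree.

    Assumption: the sequences are already aligned and non-empty.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def distance_matrix(seqs):
    """Return a nested dict D with D[p][q] = 1 - S(p, q) for every pair of taxa."""
    names = list(seqs)
    return {
        p: {q: 1.0 - similarity(seqs[p], seqs[q]) for q in names}
        for p in names
    }


# Toy data, invented for this example only.
taxa = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTTCGA"}
D = distance_matrix(taxa)
```

The resulting matrix is symmetric with zeros on the diagonal. How a tree is then generated from it is left open here, as it is in the quoted text.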