UNIQUIMER 3D, A software system for struc

UNIQUIMER 3D, A software system for structural DNA nanotechnology design, analysis and
evaluation
Manual for version 1.0.
1
User’s Guide to the UNIQUIMER 3D
Version 1.0
http://ihome.ust.hk/
Jinhao Zhu, Yongli Mi
Email: csnickle@gmail.com, keymix@ust.hk
December 30, 2008
2
Contents
I
II
Introduction
4
Getting Started
4
III
Building a Motif
4
IV
Geometrical Optimization
7
V
VI
Sequence Generation
10
Summary
12
3
Part I
Introduction
A user-friendly software system, UNIQUIMER 3D, was developed for structural DNA nanotechnology that
consists of 3D visualization, internal energy minimization, sequence generation, and construction of motif array
simulations (2D tiles & 3D lattices) functionalities. The system can be used to check the structural deformation
and design errors under scaled-up conditions. UNIQUIMER 3D aims to facilitate the design of novel DNA motifs
for the DNA nanotechnology and has been tested for the design of both existing motifs (holiday junction, 4x4 tile,
DX, DNA tetrahedron, DNA cube, etc.) and not-existing motif (soccer ball). A de novel sequence generation
algorithm is integrated into the system. UNIQUIMER 3D was developed for the Windows environment and is
provided free of charge to the non-profit research institutions.
In this user manual documentation, we start with an explanation of how to install UNIQUIMER 3D. Next,
a short tutorial is given that explains most of the features that you’ll need in designing SDN. Following the
tutorial you will find a “possible workflow” for creating a DNA motif. Following this workflow may help you
avoid problems later on. After that, we will describe how to make use of the energy minimization functionality
to optimize the geometrical shape of a designed structure. Then we will demonstrate how to apply sequence
generation algorithm to your design. At the end of the documentation you will find a summary of our work.
Part II
Getting Started
This part helps you getting started. It starts with an explanation of how to install UNIQUIMER 3D. UNIQUIMER
3D can be accessed from http://ihome.ust.hk/ for free. The first step is to grab a copy of the UNIQUIMER
3D (zip archive), then uncompress the file to your local hard drive. The folder “Structures” contains a collection
of some default existed motifs, such as holiday junction, 4x4 tile, cube, tetrahedron, etc. The folder “Report”
holds a collection of all the HTML reports you generate for a designed structure. You will also find the executable file, namely “UNIMQUIMER3D.exe” in the unzipped parent directory. The major user interface of
UNIQUIMER 3D is shown in Fig 1.
There is a “Item List” tree control on the left docking window. All the SDN information are maintained
in this control as a tree structure. A “Information Window” is docked on the left bottom side that displays
the detailed information of the user selections. Users can insert, translate, rotate items by making use of the
“Control Panel” docked on the right bottom side of the window. The center window is the 3D display canvas.
Part III
Building a Motif
In UNIQUIMER 3D, DNA strands can be manipulated on a 3D canvas (addition, deletion, translation, rotation,
etc.) Basic DNA components can be joined together at crossover points to build certain motifs.
This part contains an explanation of all the steps that are necessary to create a DNA motif. The basic
building block is the DNA double helical domain.
The most basic element and the lowest level of our system hierarchy is the DNA base pair. Each DNA base
pair consists of a pair of nucleotide residues (DNA nodes hereinafter) connected by hydrogen bonds, and DNA
nodes are rendered as spheres. Phosphodiester linkages and hydrogen bonds are rendered as lines of different
colors in the 3D space. Individual nucleotide residues (DNA customized nodes hereinafter) are also modeled
and rendered as spheres in our system.
The double helical domain contains a parent node rendered as box, a green strand and a blue strand as
shown in Fig 3. The polarity of the double helical domain is defined. Starting from the parent node, the green
strand is pointing upwards all the way to the topmost base and the blue strand then pointing downwards till
reaching the parent node. The parent node determines the position of the double helix. The direction of double
4
Figure 1:
System User Interface.
Figure 2:
System Hierarchy.
5
Figure 3:
DNA Double Helical Domain.
helix is defined to be a ray casting from the parent node to the center position of the topmost base pair along
the center pole.
A double helical domain can be added to the 3D canvas by clicking on the “DNA Strand” button in the
“Control Panel” as shown in Fig 4.
Figure 4:
DNA Double Helical Domain Insertion.
The top textbox allows the user to input the number of base pairs of the DNA double helix. The middle
textbox is for inputting the initial direction of the DNA double helix. The bottom textbox is for inputting the
initial position of the DNA double helix.
A grouping operation is embedded in our system so that multiple DNA double helices with DNA customized
nodes can be regrouped for further manipulation or management. The grouped components are maintained in
a tree structure (parent-children structure) that is consistent with the system hierarchy.
Two more operations are defined for DNA constructions: (1) the opening operation that breaks the con-
6
secutive DNA nodes; (2) the closing operation that connects two open nodes. The two operations redefine
connectivity between DNA nodes. Invalid polarities are automatically detected by the system and will not be
allowed (as illustrated in Fig 5).
Invalid
Valid
Figure 5:
Valid and invalid closing operations.
Part IV
Geometrical Optimization
offers ways to change the geometrical appearance of a designed structure. Given the user-defined structure
(hierarchy and connectivity), a state is defined to be the entire geometric information of all of its components.
An energy function is designed to assign a nonnegative real number representing its stability to each state.
Energy function is introduced to eliminate the structural defects and design errors, which might result in
constructional failure, thus to get a better evaluation whether or not a specific DNA structure is able to form
in a stable way. For random configurations of DNA strands, the energy function may be very complicated.
However, the regular and predictable double helical DNA structure makes the energy minimization relatively
simple. Accordingly, the distance between two nucleotide residues and the smoothness of the double helix are
taken into consideration. So the energy function is defined as
E = (1 − λ)Edistance + λEsmoothness ,
(1)
where Edistance and Esmoothness are two terms that are consistent with our motivation, and λ is a weight.
Since motif arrays are assembled by motifs, and motifs are further composed of DNA double helices, sticky
ends and DNA customized nodes, on the basic level, only geometric information for DNA double helices, sticky
ends, DNA customized nodes and the corresponding connectivity is necessary to model our energy function.
Given a set of n DNA double helical domains and their corresponding rotation and translation parameters,
Θi s and Ti s, and m DNA customized nodes with translation vectors, Lj s, a state can be defined as (Θ, Γ), where
Θ = (Θ1 , Θ2 , . . . , Θn ), Γ = (T1 , T2 , . . . , Tn , L1 , L2 , . . . , Lm ).
E is a function mapping from (Θ, Γ) to nonnegative real numbers.
Edistance is defined as the sum of squared differences of the distance between each pair of connected nodes
and a constant, d. d is chosen to be the distance between any two consecutive DNA nodes on the same helix,
which is 63.05. Therefore, with a user-defined connectivity map, C = {< σ1 , ρ1 >, < σ2 , ρ2 >, . . . , < σk , ρk >},
7
Figure 6:
Illustration of the smoothness term.
where σs and ρs (s = {1, 2, . . . , k}) denote user connected nodes and < · > denotes connectivity.
Edistance =
k
X
(kP (σs ) − P (ρs )k − d)2 ,
(2)
s=1
where k · k is the Euclidean norm and P (·) is a function that denotes the position of a node. It should be noted
that σs (ρs ) is either a node on a helix or a DNA customized node. If the node is on a helix, it will have position
R(Θi ) · h∗ (t) + Ti , where Θi , t, Ti and h∗ = {h+ , h− } depend on the working structure. If the node is a DNA
customized node, j, its position is simply Lj , depending on the structure.
The second term, Esmoothness , in our energy function refines the smoothness of the user connection of DNA
double-helical domains.
In an ideal case of DNA strand connection, the B-form DNA’s helical structures are preserved. As shown in
Fig 6, two DNA double helices are joined together by connecting nodes, σs and ρs . An optimal angle between
the two vectors formed by any three neighboring DNA nodes on the same helix, σs , σs0 , σs00 , is defined to be
θoptimal = arccos (
−
−−
→
σs σs0
−
−−
→ ·
kσs σs0 k
−−
−→
σs0 σs00
−−
−→ ).
kσs0 σs00 k
θoptimal is a constant that equals to 24.53. It is desirable to have a connection
−−→0
−−
→
−
→
0
ρ−
s σs that forms the same angle, θoptimal , with both σs σs and ρs ρs . Therefore, the smoothness term is defined
as
−−→0
−
→
X
σs σ
σ−
s ρs
Esmoothness =
((arccos ( −−→s · −−→ ) − θoptimal )2
kσs σs0 k kσs ρs k
σs ,ρs
−−→0
−
→
ρs ρ
σ−
s ρs
+ (arccos ( −−→s · −−→ ) − θoptimal )2 ).
(3)
kρs ρ0s k kσs ρs k
The summation is over all connections between nodes σs and ρs on the helices. σs0 denotes the node next to
σs , which is not ρs ; σs00 denotes the node next to σs0 , which is not σs ; ρ0s denotes the node next to ρs , which is
not σs . This smoothness term penalizes angular discrepancies from θoptimal of the connected structure.
Our energy function takes (Θ, Γ) as variables. Given an initial user-defined state, (Θ, Γ), we want to improve
it using an energy minimizing technique. Currently the properties are controlled using simple geometry as
defined in Edistance and Esmoothness , which is a convex function. As a result, gradient-based local minimization
algorithm is considered in this version. Powell’s method, which is an iterative optimization method that finds a
local optimizer, is implemented in our system for this purpose.
8
Since the result from running Powell’s method greatly depends on its initial state, it is very important to
supply a relatively stable state with low energy as the input to Powell’s method. We first coarsely scan through
states that uniformly cover the solution space (Θ, Γ) to find a state with the lowest energy. Since E is a smooth
function, it is reasonable to assume that a global minimum exists somewhere near this state. Therefore, this
selected state is the starting point of the Powell’s method.
One should note that there is no guarantee that the final state, which is a local minimum, is a global one.
However, if the initial state is close enough to the global minimum, it can be found using Powell’s method.
As shown in Fig 7(a) and Fig 7(c), certain DNA double-helical domains are set to be distorted. After
applying energy minimization to the structures, the geometrical shapes of these double-helical domains are
refined as shown in Fig 7(b) and Fig 7(d).
(b)
(a)
(c)
(d)
Figure 7:
Geometrical Optimization. (a) Two distorted DNA double-helical domains before energy minimization. (b) Two refined DNA
double-helical domains after energy minimization. (c) Two distorted DNA double-helical domains with crossover before energy minimization.
(d) Two refined DNA double-helical domains with crossover after energy minimization.
After the user has selected the portion needed to be optimized, a energy minimization dialog is shown by
selecting the “Optimization” in the menubar. The dialog contains three textbox inputs, allowing the user to
specify the number of iterations the optimization algorithm will do, the translation and rotation offset in the
3D space, and five selection boxes, which allows the user to disable/enable translation, global/local rotation or
collision detection. The dialog is shown in Fig 8.
UNIQUIMER 3D will generate a detailed report in HTML format of the refined structure including information about its hierarchy and several showcase images of the structure from different viewing angles. Besides,
the process of energy minimization is illustrated in this report with a chart showing each iteration in this process
9
Figure 8:
Geometrical Optimization Dialog.
and the corresponding energy value.
Part V
Sequence Generation
A sequence generation algorithm is designed and integrated it to UNIQUIMER 3D. The algorithm of generating
sequence sacrifices storage to gains speed. The idea is to compute all the possible combinations of the specific
maximum length of repetitive segment starting with {A, T, G, C}. The combination term is an unordered
collection of distinct elements, usually of a prescribed size and taken from a given set. This approach guarantees
each of the segment is distinct in terms of sequence so that the length of repetitive segment is controlled. The
combinations can be pre-computed and stored to a local file that can be loaded for recycling usage.
The same basic rules for SDN sequence generation are taken into consideration for UNIQUIMER 3D. The
first one to follow is the pairing up rule of {A=T}, {G≡C} (i.e. certain segments should be complementary
respectively as shown in Fig 9).
In order to avoid segment mismatching cases as much as possible, there is the second rule, to limit the length
of repetitive segments. It is illustrated in Fig 10. If the requirement is set to have no repetitive segments of 4
base pairs, the sequence does not meet the requirement. However, if the requirement is set to have no repetitive
segments of 5 base pairs, the sequence will pass. As the main restriction used for sequence generation, the
maximum length of repetitive segment should be set as short as possible to prevent mismatching cases. If the
value is set to be 3, there will be no repetitive segment with a length of 4 or more bases. Suppose we want to
do sequence generation for a structure of a DNA double helical domain with 100 base pairs. The total number
of combinations for segments with a length of 3 bases is only 43 = 64, so if the maximum length of repetitive
segment is set to be 3, there would be not enough candidates to fill in the 98 (100-3+1) blanks of segments
with a length of 3 bases. Therefore, no solution could be found in this case. However, if the value is set to be
4 instead, the possible combinations increase to 44 = 256. There are enough candidates available in this case
and the generator will find a solution. In general, the maximum length of repetitive segment is relatively bigger
10
Figure 9:
Base pair matching. Sequences shown in the box are subjected to complementary region.
for complicated structures compared with simpler structures like DX or TX. The value ranges from 4 to 6 for
most of the SDN structures. However, it could be as big as 7 or more for extremely complicated structures. The
maximum length of the repetitive segment in the sticky ends is set to be 3 base pairs (no repetitive segment of
4 base pairs) no matter what the global rule of repetitive segment length minimization is.
Figure 10:
Mismatching prevention. Sequnces shown in the box are two repetitive segments of 4 base pairs.
UNIQUIMER 3D is able calculate the number of potential mismatching cases that are hindrances of formation of the desired structure. Given the length of potential mismatching segment, a scoring system of a generated
sequence for a structure is formulated as following:
P air is defined as
P air(a, b) =
1
0
[a, b] ∈ {[A,T], [T,A], [G,C], [C,G]}
.
otherwise
Comp is defined as
Comp(a, b) =
1
0
a=b
.
otherwise
D = {D1 , D2 , . . . , Dn } is the set of DNA double helical domains, Di ∈ D is a particular double helical domain
|D |
|D |
with |Di | base pairs {< Ni 1 , Ni 1 >, < Ni 2 , Ni 2 >, . . . , < Ni i , Ni i >}. given the digit d that represents for
the length of potential mismatching segment, a score function is defined as
J=
|D| |Di |−d
X
X
i=0
j=0
Comp[
d
X
P air(Ni j+k , Ni j+k ), d]
k=0
11
The score function J is a numerical representation of the expected formation of the structure consisting of
a set of DNA double helical domains D. In real self-assembly process, potential mismatching cases will result
in secondary structure formation. S = {S1 , S2 , . . . , Sn } is the set of DNA strands, Si ∈ S is a particular
|S |
strand with |Si | bases {Ni 1 , Ni 2 , . . . , Ni i }. We then define another score function Jˆ taking into account all
potential mismatching cases as a comparison with the value we get from J. Since J is the most optimal score, Jˆ
is always greater than or equal to J. The difference between these Jˆ and J indicates how likely the strands S
with generated sequences will self-assemble into the expected structure. The less difference between the values
we get from Jˆ and J, the more likely S will self-assemble into the expected structure formation.
Jˆ =
|S | |Si |−d |S | |Sj |−d
X
X X X
Comp[
i=0 p=0 j=0 q=0
d
X
P air(Nip+k , Njq+k ), d]
k=0
Jˆ is designed to find all potential mismatching cases, which is equivalent to string matching in terms of
conjugating one of the segment to search for all matchings of this conjugation from a given set of segments. The
conjugate operation is based on DNA complementarity {A=T}, {G≡C}, {T=A}, {C≡G}. Consider segment
[Nip , Nip+1 , . . . , Nip+k ] as the template, it is equivalent to essentially searching on strand j for the total number
ˆ
of conjugate segments. As a result, Boyer-Moore string search algorithm is adopted for J.
Our score functions take a structure with sequence and the length of potential mismatching segment d
as input. A set of sequence generated for n times of a structure is denoted as Φ = {Φ1 , Φ2 , . . . , Φn }. The
corresponding numbers of potential mismatching cases are denoted as Ω = {Jˆ1 − J1 , Jˆ2 − J2 , . . . , Jˆn − Jn }. A
tuple ∆d = [Ω, Φ] is associated. Ωi = min(Ω), ∆di = [Ωi , Φi ] is selected for a specific d.
The sequence generation dialog is shown is Fig 11. The user has options to specify the maximum length
of repetitive segment and a exclusion set. The result of the sequence generation is saved to a text file in the
parent directory namely “strandSequence All.txt”. It contains all the sequences of each individual strand of the
structure.
Figure 11:
Sequence Generation Dialog.
Part VI
Summary
With all these built-in features, users can easily design, analyze and evaluate different structures. Especially,
the energy minimization function can help users to obtain a relatively stable state of the working structure and
the result can be very helpful for SDN prediction. DNA structures without satisfactory optimized state can be
screened out for wet-lab experiments.
The main contributions of this work can be summarized as follows:
• Users can visualize DNA motifs and motif assemblies in a 3D environment.
12
• Users can design DNA structures in a convenient and efficient way.
• An energy function is designed for measuring the stability of structures. Our system can relax this energy function to predict a relatively stable structure, which can validate and/or predict SDN wet-lab
experiments.
• Each DNA node in a structure can be automatically assigned a tag from {A, T, G, C} using a built-in
sequence-generating algorithm, and the generated sequence can be analyzed by our scoring system.
• A detailed HTML report is generated after the energy minimization, which contains hierarchical information on the refined structure, showcases images of it from different viewing angles and gives information
on the energy minimization.
As for future development, a systematic analysis of our current energy minimization will be carried out down
to the molecular level of DNA backbone structures. The current single-level algorithm makes the efficiency of
the minimization process low. Multi-level of optimization which minimize all factors simultaneously at each
iteration will be utilized to address this issue. We will consider the simulated annealing method that can sample
a wider range of conformations compared with other gradient based local minimization methods. We will also
work on the modeling down to the molecular level with precise atomic positional control since rotations of each
bond in the backbone structure will result in different conformations of the entire design and the construction
using different forms of DNA or even RNA. This will be updated into our next release and will enable controls of
the base and sugar. It will be much more accurate compared with the current version that models only simplified
DNA. PDB format of exportation will also be supported by then.
13