Hierarchical visualization of the chemical space

Jakub Velkoborský
Department of Software Engineering
Supervisor of the master thesis: RNDr. David Hoksza, Ph.D.
Study programme: Computer Science
Study branch: Discrete Models and Algorithms
Prague 2016
I declare that I carried out this master thesis independently, and only with the
cited sources, literature and other professional sources.
I understand that my work relates to the rights and obligations under the Act
No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the
Charles University has the right to conclude a license agreement on the use of
this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.
In ........ date ............
signature of the author
Title: Hierarchical visualization of the chemical space
Author: Jakub Velkoborský
Department: Department of Software Engineering
Supervisor: RNDr. David Hoksza, Ph.D., Department of Software Engineering
Abstract: The purpose of this thesis was to design and implement a hierarchical
approach to visualization of the chemical space. Such visualization is a challenging yet important topic used in diverse fields ranging from material engineering
to drug design. Especially in drug design, modern methods of high-throughput
screening generate large amounts of data that would benefit from hierarchical
analysis. One possible approach to hierarchical classification of molecules is a
structure-based classification built on molecular scaffolds. Scaffolds are
widely used by medicinal chemists to group molecules of similar properties. A few
scaffold-based hierarchical visualization methods have been proposed. However,
to the best of our knowledge, there exists no tool that would provide a scaffold-based
hierarchical visualization of molecular data sets on the background of known
chemical space. In this thesis, such a tool was created. First, a scaffold tree hierarchy based on ring topologies was designed. Next, this hierarchy was used to
analyze the frequency of scaffolds extracted from molecules in the PubChem Compound
database. Subsequently, the PubChem Compound scaffold frequency data was
used as a background for visualization of molecular data sets. The visualization
is performed by a client-server application implemented as a part of this thesis.
It provides an interactive zoomable tree map based visualization of data sets of up
to hundreds of thousands of molecules. The application is free to use and has
been published under an open source license.
Keywords: cheminformatics, chemical space, molecular visualization
1 Introduction
1.1 Motivation
1.2 Applications
1.2.1 Structural classification of compound data set
1.2.2 Associating scaffolds with biological activities
1.2.3 Ligand based virtual screening and scaffold hopping
1.3 Aim, Scope and Contribution
2 Background and Terminology
2.1 Chemical Space and Databases
2.2 Scaffolds
2.3 Visualization Approaches for Chemical Space
2.3.1 Direct visualization
2.3.2 Hierarchical visualization
2.4 Cheminformatics Toolkits
2.5 Computer Representation of Molecules
3 Design Choices
3.1 The Scaffold Hierarchy
3.2 Cheminformatics Toolkit
3.3 Representation of Molecules and Scaffolds
4 Data Extraction and Processing
4.1 Source of Data
4.1.1 Choosing the data source
4.1.2 Accessing the data
4.2 Initial Analysis
4.3 Clustering and Similarity
4.4 Visualization
5 Implementation
5.1 Architecture
5.1.1 Client-server model
5.1.2 Preprocessing the hierarchy
5.2 Technologies Used
5.3 Implementation overview
5.3.1 Project structure and build process
5.3.2 Generator
5.3.3 Server
5.3.4 Client
5.3.5 Shared
6 Results
6.1 The Application
6.2 The Generator
6.2.1 Generating the background hierarchy
6.2.2 The Scaffold Hierarchy
The Visualizer
7 Conclusion
A Installation and Generator Tasks
A.1 Hardware Requirements
A.2 Software prerequisites
A.3 Downloading Scaffold Visualizer
A.4 Downloading Libraries
A.5 Running Scaffold Visualizer
A.6 Generating the background hierarchy
A.7 Running the server in development mode
A.8 Running the server in production mode
1. Introduction
The aim of this thesis was to design and implement an approach to hierarchical
visualization of the chemical space. Since it might not be immediately obvious,
especially to a reader with no chemistry background, how such a visualization might be useful, or even what the chemical space is, let us start with a short
Motivation
The chemical space is a space populated by all possible chemical molecules
and compounds¹. The number of possible molecules is limited because not all
structurally imaginable molecules are energetically stable. That holds when we take into
account only simple molecules, i.e. excluding polymers. The number of possible
unique polymer molecules is virtually unlimited: for example, the largest human
chromosome has about 250 million base pairs and there exist 4 different DNA
bases, giving us 4^250,000,000 such unique DNA molecules. But even when we constrain ourselves to the chemical space of small molecules, the chemical space is
vast. If we imagined the molecules as the stars in the universe, the size of the
chemical space would surpass the size of the cosmological
space by many orders of magnitude. It has been estimated that the number of possible small organic molecules
is about 10^60 [5] while the number of stars in the observable universe might be
around 10^21 [34]. There are also only about 10^47 carbon atoms on Earth², which
means that from our earthly perspective the majority of the molecules are only imaginary. Still, the fact that these molecules do not exist does not mean that they
cannot be synthesized.
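As a rough check of these magnitudes, the carbon estimate can be recomputed from the figures given in the accompanying footnote (mass of Earth, weight fraction of carbon, 12 g per mole, Avogadro constant). The following is only an illustrative sketch; the constant and function names are our own. With these inputs the result is about 1.3·10^47 atoms, still many orders of magnitude below the estimated 10^60 small organic molecules.

```python
import math

# Figures quoted in the text and its footnote; the names are illustrative.
EARTH_MASS_G = 5.98e27            # mass of Earth in grams
CARBON_WEIGHT_FRACTION = 4.46e-4  # weight fraction of carbon
CARBON_MOLAR_MASS_G = 12.0        # grams per mole of carbon atoms
AVOGADRO = 6.022e23               # atoms per mole


def carbon_atoms_on_earth() -> float:
    """Estimate the number of carbon atoms on Earth."""
    carbon_mass_g = EARTH_MASS_G * CARBON_WEIGHT_FRACTION
    moles = carbon_mass_g / CARBON_MOLAR_MASS_G
    return moles * AVOGADRO


atoms = carbon_atoms_on_earth()
carbon_order = math.floor(math.log10(atoms))

# Decimal order of magnitude of 4^250,000,000 (number of DNA sequences of
# that length), computed via logarithms to avoid building the huge integer.
dna_log10 = 250_000_000 * math.log10(4)

print(f"carbon atoms on Earth: about {atoms:.2e} (order 10^{carbon_order})")
print(f"4^250,000,000 is about 10^{dna_log10:.3e}")
```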
We will explore the chemical space mainly through the lens of medicinal chemistry. Medicinal chemists are explorers of the chemical space who are in
pursuit of novel drugs. Their area of interest is the drug-like chemical space, the
space of drug-like molecules that mostly coincides with the space of small organic
molecules above³. Estimates of the size of the drug-like space range widely, between 10^23
and 10^60 [43, 16, 5]. From its sheer size it is obvious that navigating the chemical space is no easy task. To make the data more manageable, chemists group
molecules by similarity, or better yet, build hierarchies upon them. There is no
single best way to build such a hierarchy. Existing approaches are mostly based
either on hierarchical clustering by an appropriately chosen similarity measure
or on classification of molecules by molecular scaffolds. Having experimented
with both approaches, this work uses the latter one – scaffold based hierarchical
¹ A compound is a molecule containing at least two different elements. Every compound is
a molecule but not vice versa.
² Estimation based on: Mass of Earth is 5.98·10^27 g, weight fraction of carbon is 4.46·10^-4 [35],
there is 1 mol of carbon atoms in 12 grams and Avogadro constant N_A = 6.022·10^23.
³ This does not mean that all known drugs are small organic molecules. There are examples
of drugs which are not small molecules as well as examples of drugs which are not organic.
Examples of macromolecular drugs (i.e. non-small) are innumerable biopharmaceuticals, e.g.
molecular antibodies, hormones and many more. From inorganic compounds we can name the
nitrous oxide or many modern chemotherapeutic agents.
A molecular scaffold might be intuitively understood as a central part of
a chemical compound. It holds molecule’s substituents in their positions and
from the scaffold the molecule derives its basic shape and some basic properties
such as flexibility or rigidity. A key property of a scaffold is also its reactivity which then determines its toxicity and metabolic stability. Based on that,
it is hardly surprising that molecular scaffolds are one of the key concepts of
medicinal chemistry. But there is another way to look at scaffolds. If we represent a molecule as an undirected graph with vertices corresponding to the atoms
and edges corresponding to the bonds, then we can obtain representations of the
molecule’s scaffold by a sequence of simplifying operations on this graph. Manipulating graphs and building hierarchies is where cheminformatics
comes to help. We shall explore this in great detail later.
But the hierarchical classification is only half of the work needed. Having the
molecules classified and the hierarchy built, we need to find a way to present the
data back to the chemists. And since a picture is often worth a thousand words,
this work presents scaffold hierarchies in the form of an interactive visualization.
We feel that providing unique new views on chemical data is a worthy thing to do.
Drug design still relies heavily on human researchers’ intuition. And the better
information and tools we provide to medicinal chemists, the better the results we
may expect from them.
Applications
Despite having already provided some intuition behind molecular scaffolds, let
us now have a brief look at how exactly scaffolds are used by medicinal chemists.
Molecular scaffolds are one of the central themes of this thesis, so
this excursion into the real-world usage of scaffolds should give us a better
understanding of some design decisions later, as well as an idea of
who might benefit from this work.
The rationale behind most of the applications is the so-called similarity principle, which states that structurally similar molecules exhibit
similar biological activity⁴ [26]. Various methods, however, employ the similarity principle in different ways. Some are directly based on the principle, while others,
rather surprisingly, try to find their way around it. Remember that we are in the
field of medicinal chemistry so the ultimate goal in both cases is to find novel
compounds with a specified biological activity – i.e. new candidates for drugs.
The natural approach, following the similarity principle, is to first identify which
scaffolds are linked to the specified bioactivity and then search for new drug candidates within this high yield subspace. The other approach, battling against
the similarity principle, is starting with a set of known molecules with desired
activity and trying to find a new compound with similar activity but based on a
different molecular scaffold – a process known under a colloquially sounding term
“scaffold hopping”. A comprehensive general overview of scaffold based techniques
employed in medicinal chemistry is available in a recent article “Computational
Exploration of Molecular Scaffolds in Medicinal Chemistry” by Ye Hu et al. [23].
We shall now describe some of the aforementioned methods in more detail.
⁴ Although it is not always the case.
Structural classification of compound data set
The basic use case for scaffolds is the structural classification of compounds in
a data set, grouping similar molecules together. The main question being asked
is how many different classes of molecules does a data set contain and what is
their relative frequency [51]. If we represent classes of molecules by scaffolds, this
question becomes one of the scaffold diversity of the data set.
Scaffold diversity can be used to describe a single data set: using a scaffold-based diversity metric [33], the diversity of a data set can be described by a single
number. Even more interesting is to compare the diversity of two data sets against
each other. Often a set of molecules yielded from a novel screening method is
compared to a reference library; then the number of scaffolds which are present
in the new data set, but not contained in the reference, can be used to estimate
the capability of the new screening method to find novel active compounds.
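The two questions above (relative frequency of scaffold classes, and scaffolds present in a new data set but absent from a reference) can be sketched in a few lines. This is a minimal illustration, assuming each molecule has already been reduced to a scaffold label (plain strings here); the function names and the toy data are hypothetical, and the simple counting shown is not the diversity metric of [33].

```python
from collections import Counter


def scaffold_frequencies(scaffolds):
    """Relative frequency of each scaffold class in a data set."""
    counts = Counter(scaffolds)
    total = sum(counts.values())
    return {scaffold: count / total for scaffold, count in counts.items()}


def novel_scaffolds(new_set, reference):
    """Scaffolds that occur in the new data set but not in the reference."""
    return set(new_set) - set(reference)


# Toy data: one scaffold label per molecule (labels are made up).
screening_hits = ["benzene", "indole", "indole", "purine"]
reference_library = ["benzene", "indole", "pyridine"]

freqs = scaffold_frequencies(screening_hits)
novel = novel_scaffolds(screening_hits, reference_library)

print(freqs)
print(novel)
```

The size of `novel` is the count the text refers to when estimating how well a new screening method finds compounds outside the reference library.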
Besides assessing diversities of data sets, scaffold-based classification can also
be used to organize molecules in databases or, of course, to visualize data sets –
as will be demonstrated in this work.
A scaffold, as a core structure, also characterizes the molecule in which it is
contained. Molecules sharing a common scaffold are likely to also share their
metabolic pathways – influencing stability of a drug in vivo and many other
pharmaceutical properties.
Associating scaffolds with biological activities
When functional data are added to the structural information, the association between
a compound’s scaffold and the compound’s biological activity, also known as
the structure-activity relationship, is often analyzed [22]. If a suitable scaffold
definition is used, one that allows a scaffold hierarchy to be built, the analysis
of the structure to activity association can be extended from the scaffolds existing
in the data set to virtual scaffolds at higher levels of the hierarchy. This
generalized activity information is then used in activity prediction – estimation
of the activity of compounds whose activity has not yet been measured. Based
on this information compounds can be prioritized for in vitro activity testing.
Or taking this idea even further, synthesis of new compounds might be pursued
based on the specified virtual scaffold type.
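One simple way to picture this generalization is to sum the measured activity counts of each scaffold into all of its ancestors in the hierarchy, so that virtual parent scaffolds receive an aggregated hit rate. The sketch below assumes the hierarchy is a tree given as a child-to-parent mapping; the function name, the data layout and the numbers are hypothetical and are not taken from any particular published method.

```python
from collections import defaultdict


def aggregate_activity(parent, measured):
    """Propagate activity counts from scaffolds to their ancestors.

    parent:   scaffold -> parent scaffold (None or missing for the root)
    measured: scaffold -> (number of active compounds, number tested)
    Returns the fraction of active compounds for every scaffold that has
    at least one tested compound in its subtree.
    """
    totals = defaultdict(lambda: [0, 0])
    for scaffold, (active, tested) in measured.items():
        node = scaffold
        while node is not None:          # walk up to the root
            totals[node][0] += active
            totals[node][1] += tested
            node = parent.get(node)
    return {node: a / t for node, (a, t) in totals.items() if t > 0}


# Toy hierarchy: virtual scaffold "A" has two measured children.
parent = {"A1": "A", "A2": "A", "A": None}
measured = {"A1": (8, 10), "A2": (1, 10)}

rates = aggregate_activity(parent, measured)
print(rates)
```

In this toy example the virtual scaffold "A" ends up with 9 actives out of 20 tested compounds, which could then be used to rank untested compounds sharing that parent.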
Scaffold activity analysis sometimes also leads to the identification of so-called
privileged structures [17, 58] or masterkeys [36]. These terms denote the situation
in which the scaffold itself interacts non-specifically with multiple different receptors⁵. Such scaffolds form an ideal basis for production or even mass-production
of substances to undergo further examination, most commonly in the form of
high-throughput screening [64]. This often leads to the construction of targeted high-affinity ligands, which are in turn pharmaceutically useful.
⁵ This is in contrast to the ordinary situation, in which the scaffold serves as a molecular core for substituents and only together do they form a suitable key to one specific target receptor. In other
words, usually we start with an inactive scaffold and we add substituents to it in order to gain
a molecule interacting with at least one receptor. In the case of masterkeys the scaffold itself
interacts with a wide variety of receptors and we substitute it in order to make its activity
better targeted.
Ligand based virtual screening and scaffold hopping
Virtual screening is a technique employed to computationally analyze vast libraries of compounds in order to select the most promising candidates for chemical
screening (chemical screening being orders of magnitude more expensive). Ligand based virtual screening assesses the candidate compounds based on common
features of existing compounds with the desired biological activity. The other possibility is structure based virtual screening, where candidate molecules are
tested directly against the target molecule⁶.
Scaffold hopping is a technique used to derive new active compounds from
the existing ones. In that respect it is strongly connected to the ligand-based
virtual screening and might be considered to be its specific application. However,
the aim of scaffold hopping is, starting with a set of known molecules exhibiting
desired biological activity, to find a new compound with the same bioactivity but
with a different molecular scaffold. Motivations for such replacement are aplenty.
Examples are given by Böhm et al. [10]: we may replace a lipophilic⁷ scaffold
by a more polar one and increase the solubility of the compound; a replacement of
a metabolically labile scaffold with a more stable one may improve the pharmacokinetics of a drug; we might exchange the scaffold for a less toxic one; rigidity,
stability and/or other properties of the scaffold may affect its binding ability to
the target protein receptor⁸; and last but not least, a significant change in the
structure might lead to a new patentable drug.
The only detail that might need clarification is how scaffold hopping
goes together with the similarity principle. The similarity principle claims that
a molecule’s biological activity is strongly related to its structure. So, at first sight,
trying to change the structure significantly (the aim of scaffold hopping) might
seem like a hopeless idea. Fortunately, it is more complicated than that.
Even though in most of this work we tend to simplify a molecule to its two
dimensional structural graph, the biological activity of a molecule is in reality
determined by its three dimensional shape or, more precisely, by local properties
of its surface, such as hydrophobicity⁹ or polarity¹⁰. This becomes intuitive
once we know that hydrophobic and polar interactions of
the molecule’s surface with a target protein are the chemical basis of drug action¹¹.
And even though a scaffold gives a molecule its basic shape, two different looking
scaffolds may lead to two molecules which have the same shape, or which are at
least similar in the part of the surface that is crucial for the biological activity.
So scaffold hopping does not conflict with the similarity principle. But how do
we recognize suitable scaffolds for substitution? We are looking for biological
activity in scaffolds where it has not yet been discovered, which rules out the
method of activity prediction from the preceding section. Nevertheless, cheminformatics offers a range of methods to identify suitable candidates (and
scaffolds have little to do with them).
⁶ In theory, structure based screening should give better results, being based on a more complete input, but in practice it is prohibitively computationally expensive to model the interaction
precisely and significant simplifications have to be used.
⁷ Soluble in lipids (e.g. fats) and non-polar solvents (e.g. hexane, benzene).
⁸ Which in turn determines its pharmaceutical properties.
⁹ A hydrophobic substance is seemingly repelled from water. It is usually non-polar and
prefers non-polar solvents, water being polar.
¹⁰ Polar/hydrophile is the opposite of non-polar/hydrophobe. A polar molecule has uneven
distribution of electrons creating an electrical dipole. Polar molecules interact together by
dipole-dipole intermolecular forces.
¹¹ Together with hydrogen bonds and other forces.
One possibility, when we have a set of active compounds, is to create their
3D models and superimpose them to identify the features they have in common
(and which are likely responsible for their biological action). This process is
called creating a pharmacophore. Having guessed the correct pharmacophore,
we can use it to search for suitable novel structures (and be
successful in that). However, determining the correct pharmacophore is not easy
and the search can easily be performed only on databases of known molecules,
which represent only a small fraction of the chemical space. So we often need to
accompany the search by methods of generating new 3D structures – for example
by replacement of fragments of existing molecules or even by de novo design [56].
Here, scaffolds may again come into play, acting as basic building blocks – but
that is definitely out of scope of this text.
To avoid the inherent complexity of matching molecules to pharmacophores in three dimensions, similar methods have been developed to predict the biological similarity of molecules that are considerably easier to
implement and compute; many of them make do with the graph description of
the molecule only. Due to the higher level of abstraction, these methods are less reliable, but
on the other hand they are fast, easy to use and always available. One such popular
approach is molecular fingerprints (for example Daylight fingerprints), which
are binary strings whose bits denote the presence or absence of small substructures. Calculating similarity based on fingerprints is almost trivial: it suffices to compare how
many of the bits have the same values and how many differ¹². Some fingerprints
go beyond the two dimensional structure, for example PharmPrint [31], which
is based on pharmacophore features, or Spectrophores [55], which describe the
3D structure and molecular surface. Even though these fingerprints are based on
the three dimensional structure, they are themselves again plain binary strings
that are easy to work with. Yet another approach is based on comparing feature
trees – in which a node represents chemical features of a part of a molecule and
the tree encodes how the parts are connected together [45]. Both the molecular
fingerprints and the feature trees can be used for database searching and for de
novo molecule design, usually by fragment joining.
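The fingerprint comparison mentioned above can be illustrated with the Tanimoto (Jaccard) coefficient, which divides the number of bits set in both fingerprints by the number of bits set in at least one of them. The sketch below stores each fingerprint as a Python integer used as a bit string; this representation, the function name and the treatment of two empty fingerprints as identical are our own assumptions, and the example bit patterns are arbitrary rather than real Daylight fingerprints.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    union = bin(fp_a | fp_b).count("1")    # bits set in at least one
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return common / union


# Arbitrary 6-bit example fingerprints.
fp_1 = 0b110110
fp_2 = 0b100111

similarity = tanimoto(fp_1, fp_2)
print(similarity)  # 3 shared bits out of 5 set bits in total
```

Because only bitwise operations and bit counts are involved, comparisons of this kind remain cheap even across very large compound libraries.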
The overview of ligand based virtual screening methods would easily fill a
book. But by now the applications of scaffolds should be very clear. And we can
move forward to the core content of this thesis.
Aim, Scope and Contribution
The aim of this thesis was to design and implement a hierarchical visualization
of the chemical space. That encompasses two largely independent goals. The first
goal was to find a suitable hierarchical representation of the chemical space.
The second goal was to design a suitable visualization. Both the hierarchy and the
visualization were then implemented.
¹² More precisely known as the Tanimoto or Jaccard similarity.
The first step was to design a hierarchy on the chemical space. A scaffold
based approach has been chosen because of the wide applications that scaffolds
have in medicinal chemistry – as described above. A combined scaffold/clustering
approach has also been evaluated – the hierarchical classification would be based
on scaffolds but on some levels the scaffolds would be clustered. This idea has
been discarded and a pure scaffold approach is used.
Some scaffold based hierarchies are already available, such as HierS [61] and
Scaffold Tree [52], both discussed in the following section. Instead of reusing an
existing hierarchy definition, this work uses a newly designed scaffold hierarchy
based on scaffold ring topologies [44]. We have not found this approach implemented in any existing scaffold visualization tool, and we hope to provide
the medicinal chemists with a new look at large data sets using this novel scaffold
Nevertheless, our hierarchy is defined in a single source file of our implemented
solution and is easily modifiable (or even replaceable) based on practical experience with the visualization tool.
The second step was to design a suitable visualization of molecular data sets
based on the scaffold hierarchy. The data sets are displayed relative to a predefined background formed by a large database of known compounds – in this case
the PubChem Compound database [25], which is the largest existing database of
known compounds and has been shown to have the highest scaffold diversity [59].
The presence of this PubChem background allows a data set to be compared
with a given reference, which puts the information gained into context and helps
to better understand how a particular data set differs from a generic subset
of known molecules.
Presenting the molecules on a fixed background also makes it easy to compare
two data sets and gain insight into the differences between them.
To our knowledge, visualization on the background of such a large compound
database (virtually all known chemical space) is novel and was one of the
primary aims of this thesis. Again, we hope that it provides a fresh look at
the data that medicinal chemists work with and hopefully helps them with their
As was the case with the hierarchy, PubChem is not hard-coded into
the visualization solution and should be easily replaceable by the user if a
specific situation calls for a specific background.
The last part of this work is the visualization tool itself. It has been designed as
a free open-source interactive web application. The web application format has
been chosen for maximum accessibility. A demo installation is available online for
scientists to evaluate how it fits their needs. Regular users are expected to run the
application on their own servers, which is easy enough as it only requires Java™
and several libraries. A free open-source solution is used because we cannot hope
to provide a perfect solution and we would like everybody to be able to customize
it for their own uses. An interactive presentation of the hierarchy is used because
we believe that the closer the interaction with the data, the better intuition can
be gained.
Although there exist several web based data set visualization tools, some of
them even scaffold based, we believe that the implemented solution is unique in
that it presents the data on such a large background. One of the key successes of
the work is that even though the visualization application runs mostly in the browser,
it is able to open data sets with hundreds of thousands of molecules. As far as
we have found, all the other solutions are only able to work with datasets orders
of magnitude smaller. Ability to work with datasets of this size is an important
practical advantage when working with results of high throughput screening which
is able to produce vast amounts of data rapidly.
To summarize the contribution of this thesis, we have created a scaffold based
hierarchical visualization of the chemical space, which is unique in several respects. Among the most important, our work brings a newly designed ring
topology based scaffold hierarchy, providing a fresh look at existing data. The
visualization is presented on a background of almost all known chemical space,
providing context to analyzed data sets. And the visualization comes in a form
of a performant open-source web application which can be used to analyze the
ever-growing data sets resulting from the high-throughput screening.
2. Background and Terminology
Chemical Space and Databases
We have previously introduced the chemical space as a space containing all possible chemical molecules. However, for the chemical space to really be a space, a
structure has to be defined on it. But there is no natural way to map molecules to
coordinates. It is not even obvious how exactly to decide whether two molecules
are similar or not, i.e. how close they should be to each other.
Despite the perfect solution not being available, there are still ways to define
a structure that is practically useful. The usual approach starts with choosing
a set of molecular properties that are relevant in our area of interest. Based on
these properties we create a high dimensional property space, the coordinates of the
molecules corresponding to their values of the selected properties. Finally, principal
component analysis (PCA) is used to reduce the number of dimensions. Such a
structure is application specific. But there have also been attempts to define
a global coordinate system for molecules, at least within the drug like subspace.
Two examples would be the Molecular Quantum Numbers (MQN) [37] and the Chemical Global Positioning System (ChemGPS) [42]. Both use the aforementioned
approach of defining a property space and performing a PCA. The MQN uses
integer properties such as counts of atoms of different types, bonds of different types, hydrogen bond donors and acceptors, charges and small topological
features. The ChemGPS uses molecule descriptors based on size, lipophilicity,
polarizability, charge, flexibility, rigidity, and hydrogen bond capacity.
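The property space plus PCA recipe described above can be sketched without any external library. The code below centers a small table of descriptor values, builds the covariance matrix and finds the first principal component by power iteration. It is a minimal illustration only: the function name is hypothetical, the two-column toy data are not real MQN or ChemGPS descriptors, and a practical implementation would extract several components with a numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows: list of equally long lists of descriptor values, one per molecule.
    Assumes the data are not constant and that the start vector is not
    orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered descriptors.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):  # power iteration towards the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Coordinates of each molecule along the principal direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy property table: two perfectly correlated made-up descriptors.
molecules = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(molecules)
print(direction)
print(scores)
```

For this toy table the two descriptors are collinear, so a single coordinate per molecule captures all of the variance, which is the dimension reduction the text describes.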
A global coordinate system helps to explore the chemical space in more systematic ways. Still, the space is too large to explore it all, even computationally.
As mentioned in the previous chapter, the space of small organic molecules is estimated to be about 10^60 molecules large. Even the drug-like subspace is expected
to contain at least 10^23 molecules, using a conservative estimate. Compare that to
the size of the largest existing database of known molecules, PubChem Compound [25], which has 91.4 million molecules (as of July 2016¹), close to 10^8. This means that fewer than 1 in every 10^15 possible drug-like molecules has
been discovered, let alone tested.
Even smaller is the database of commercially available compounds, ZINC [24], offering
about 22.7 million purchasable molecules (ZINC12, July 2016). On the one
hand, the limited number of commercially available molecules helps to choose which
one to explore first. On the other hand, exploring only the known subspace might
prevent us from finding a truly novel compound.
¹ But the PubChem Compound database grows rapidly. Compared to the 91 million molecules
in July 2016, it only had 54 million entries in September 2014.
Offering some middle ground between the unmanageably large theoretical
drug like subspace and the comparatively small subspace of known molecules is
the GDB-17 [49] database. Continuing the tradition of GDB-11 and GDB-13,
GDB-17 enumerates drug like molecules of up to 17 atoms of carbon,
nitrogen, oxygen, sulfur and halogens. The process of generating GDB-17
relies on graph theory: it starts with enumerating graphs on 17 vertices, yielding about 1.1·10^11 such graphs, which are then filtered by geometrical criteria
(removing strained and small rings), leaving only around 5·10^6 hydrocarbon
molecules. Then graph edges are replaced by different types of bonds (single,
double, triple), yielding 1.3·10^9 “skeletons”; finally, heteroatoms are added
and the molecules are again filtered to remove potentially unstable and otherwise problematic molecules, resulting in the final count of 1.66·10^11 molecules. It is worth
mentioning that the filtering rules are quite strict because, according to the analysis done by the authors, only 57% of PubChem compounds up to 17 atoms satisfy
the filtration rules.
These 166 billion molecules bring us back to our original question of how to
explore such a large space. It can be done at random, but we might also employ a
more systematic way, starting from islands of bioactivity and using the mentioned
MQN, ChemGPS or local property spaces as means of navigation. But, as was
implied in the introduction of this thesis, there may be another useful guide in the
exploration: the scaffold concept. We shall not reiterate the techniques in which
scaffolds are employed. Instead, we share some numbers showing the significance of
scaffolds. In 2006 Ertl et al. published an analysis [15] comparing a
sample of 150 000 bioactive molecules to a collection of generic natural products.
The analysis states that 75.6% of the bioactive molecules contained an aromatic
ring, while such a ring occurred in only 37.9% of the natural
products. An aromatic ring might be in this case considered to be a scaffold, albeit
a very simple one. The analysis continues with an exploration of simple aromatic
scaffolds (created from up to three fused rings of size 5 or 6). Of 580 165 such virtual
scaffolds, only 780 were present in the set of analyzed bioactive molecules, showing
that a molecule’s scaffold is a very significant predictor of its biological activity.
Scaffolds
Despite being so heavily used, the term scaffold does not seem to possess a
single rigorous definition [8]. A detailed discussion on the meaning and history
of scaffold is given by Nathan Brown in his quite recent article “Identifying and
Representing Scaffolds” [7]. In the text Brown traces the origins of the scaffold
concept to Markush structures - named after Eugene A. Markush who in his 1924
patent on pyrazolone dyes [29] used a generic structure definition which allowed
his patent to protect a whole class of molecules. Markush structures are
still relevant to this day - often used in patents - denoting structures that have
placeholders for certain substituents [20]. The term scaffold itself is traced by
Brown to an article from 1969 [47] which describes a ring system that “can act
as a scaffold for placing functional groups in set geometric relationships to one
another”, which is intuitive but does not make the definition any clearer.
Precise in its definitions is an oft-cited article on molecular frameworks by Bemis and Murcko [4]. The article decomposes a drug molecule into a framework
(often referred to as a Murcko framework) and side chains. A framework is
defined as the union of ring systems (i.e. cycles sharing an edge) and linker atoms
connecting the ring systems together. Side chains are the rest of the molecule - non-ring, non-linker atoms. The article also defines graph frameworks, which only
consider (heavy) atom connectivity and disregard atom type, hybridization and
bond order. These graph frameworks are sometimes called Murcko scaffolds.
We shall emphasize that such scaffolds no longer represent a chemical compound²
per se but are instead abstract graphs.
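The decomposition into framework and side chains can be illustrated on a plain graph. The following is a minimal sketch, assuming the molecule is given as a list of bonds between heavy atoms identified by integers; the function names `build_adjacency` and `murcko_framework` are ours and do not come from any toolkit. Side chains are exactly the atoms that can be peeled off one by one as terminal atoms, so what remains are the ring systems and the linkers between them.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (atom, atom) bonds."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def murcko_framework(adjacency):
    """Return the graph left after repeatedly stripping terminal atoms."""
    # Work on a copy so the caller's graph is left untouched.
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    # Atoms with at most one neighbour cannot lie on a ring or on a linker.
    leaves = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in graph:
            continue  # already removed earlier
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                leaves.append(nbr)  # the neighbour became terminal
    return graph


# Two three-membered rings joined through linker atom 3,
# which also carries the two-atom side chain 7-8.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4),
         (4, 5), (5, 6), (6, 4), (3, 7), (7, 8)]
framework = murcko_framework(build_adjacency(edges))
print(sorted(framework))
```

An acyclic molecule has no framework in this sense, so the function returns an empty graph for it.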
Nevertheless, Murcko frameworks (and even Murcko scaffolds) can still be very
diverse and so methods of further simplification are sought after. Many approaches
to the problem have been devised, each progressing in a slightly different direction.
We will briefly describe three of them - HierS³ (2005) [61], The Scaffold Tree
(2007) [16] and Scaffold topologies (2008) [44].
HierS. Described by Wilkens et al. [61], HierS is an algorithm that starts
from a molecular framework (called a superscaffold in that context), takes
each of its ring systems one by one, removes it (together with the corresponding linkers)
and continues recursively on the rest of the molecule. This yields scaffolds which
are all possible ring system combinations of the original framework.
A nice feature of this approach is that each such scaffold can be interpreted
as a chemical structure and that this structure is contained as a substructure in
the original molecule. What might be considered a disadvantage is that for each
molecular framework having more than one ring system we get multiple scaffolds⁴
and for a framework with only one ring system, we get the framework itself. So
no obvious tree hierarchy is formed.
The authors of the HierS algorithm order all the scaffolds (generated from a
set of molecules/frameworks) by inclusion and form a hierarchy from that, with
the minimal scaffolds being at the top and each next level having one additional
ring system. Unfortunately, such a hierarchy is not a tree (nor a forest).
Also worth mentioning is that the algorithm never breaks the fused rings in
a complex ring system.
The Scaffold Tree. An algorithm proposed by Schuffenhauer et al. [52, 16]
uses a similar basic idea of iterative ring removal but applies it in a very different
way. A less important modification is that the algorithm only removes one ring
at a time, not a whole ring system. But the key difference is that these rings
are removed deterministically, by a list of priorities⁵, and the algorithm never
backtracks. So for each molecular framework, we get a sequence of scaffolds
linearly ordered by inclusion - each following scaffold having one less ring and
being contained in the previous one. Such scaffolds are sometimes referred to as
Schuffenhauer scaffolds [54].
The clear advantage over HierS is that scaffolds obtained in this way form
a tree hierarchy. The obvious disadvantage is that a ring system that is more
important for the molecule’s function might be removed first and the molecule
might be on higher levels represented by a ring system which is not functionally
² Not even a part of it as in the case of a molecular framework or a Markush structure.
³ Hierarchical Scaffold Clustering.
⁴ Their count is in fact exponential in the number of ring systems in the framework.
⁵ The list of priorities is non-trivial and based on chemical insights and one could imagine
using the same basic approach with a different list.
Scaffold Topologies. A very different approach was described by Pollock
et al. [44]. They use a concept of a scaffold topology which further abstracts
on the idea of a graph framework. It is defined as a “connected graph with the
minimal number of nodes and corresponding edges required to fully describe its
ring structure”. We can obtain such a topology from a graph framework by iterative
replacement of vertices of degree two by a single edge - i.e. the procedure of
edge merging, a process inverse to edge subdivision⁶. Such topologies are sometimes
referred to as Oprea scaffolds.
Oprea scaffolds are another example of scaffolds that cannot be interpreted as
a chemical compound. That is a property that they share with Murcko scaffolds
(graph frameworks), but unlike Murcko scaffolds, Oprea scaffolds do not usually
even resemble a molecule in shape. This might be considered a disadvantage
compared to HierS scaffolds and Schuffenhauer scaffolds which do have a chemical
interpretation. One practical limitation of a scaffold not representing a chemical
compound is that traditional chemical fingerprinting methods cannot be used on
it and so no measure of similarity between such scaffolds is readily available - more on that later.
On the other hand there is one unique Oprea scaffold for each molecule and
so Oprea scaffolds can easily be used for classification. Moreover, together with
molecule frameworks and Murcko scaffolds, the Oprea scaffolds form a clearly
defined tree hierarchy corresponding to topological configuration of ring systems
in a molecule, with different degree of abstraction on each level. This corresponds
to how some medicinal chemists intuitively perceive molecules. We extend this
hierarchy with further levels as a part of this work.
All approaches based on molecular frameworks share one common disadvantage - adding a ring to a molecule changes the resulting scaffold while the functional properties might stay very similar. This is especially pronounced in scaffold
topologies, molecule graphs and molecule frameworks themselves, but present to
some extent even in the case of Schuffenhauer scaffolds⁷ and HierS⁸.
Maximum common substructures. Less prone to the problem of ring
addition is classification of molecules by their maximum common substructure
(MCS). Again, that has some advantages - such as that the MCS can be interpreted
as a chemical compound - and also some disadvantages. First, no polynomial-time algorithm for computing the MCS is known, as the problem is NP-hard (but effective heuristics do exist [46, 19]). Second,
in contrast to the approaches above, it only makes sense to consider the MCS for a set
of molecules, not for a single molecule. That makes all molecule classification
methods based on MCS dataset dependent and non-universal.
Also, it is not obvious how to form a hierarchy using the MCS. There are
approaches based on iterative clustering [57, 38] - cluster the molecules by a
method independent of MCS (fingerprints, neural networks, ...), then calculate
the MCS for the clusters and iterate to form a hierarchy. An interesting method
based on Schuffenhauer scaffolds has also been described [11].
⁶ From that we can observe that a molecular graph belongs to a particular topology if and
only if it can be obtained as its subdivision.
⁷ The higher the priority of the new ring, the bigger the change.
⁸ If the new ring is a part of an existing ring system.
Visualization Approaches for Chemical Space
When visualizing the chemical space, two distinct approaches can be taken. Either
the molecules are depicted individually, embedded into a plane or a three
dimensional space, based on selected properties and employing techniques of dimensionality reduction or multidimensional scaling; or a hierarchical structure
is defined on the chemical space and the molecules are then visualized in
groups at various levels of the hierarchy.
As stated before, this work follows the latter approach. Nevertheless, for the
sake of completeness, we provide a short description of both approaches.
Direct visualization
In a direct visualization approach, molecules are drawn as individual entities. As
explained in section 2.1, there is no universal way to map molecules to coordinates
in a Euclidean space (of preferably at most three dimensions) and it is a nontrivial task to devise a reasonable and practically useful mapping.
Starting with a set of molecules to be visualized, the process of calculating
their coordinates usually has two phases - first a property space is created, then it is
projected to a two- or three-dimensional Euclidean space.
The property space is obtained by choosing a set of molecule properties that we
want to use in our visualization. From these properties we either get coordinates
in a high dimensional Euclidean space (e.g. choosing numerical properties and
using the properties’ values as the coordinates) or we at least define a metric
based on similarity of the molecules (e.g. molecular fingerprints and Tanimoto
distance⁹) yielding a metric space.
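To make the fingerprint based metric concrete, here is a minimal sketch of the Tanimoto similarity and the derived distance. It assumes that a fingerprint is represented simply as the set of indices of its "on" bits; the function names are ours, and returning similarity 1 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def tanimoto_distance(bits_a, bits_b):
    """Distance derived from the similarity; 0 means identical fingerprints."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Two fingerprints sharing two of their four distinct on-bits.
print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))
```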
Once a property space is defined, a suitable projection method is used to
calculate the desired coordinates. Amongst the used methods are:
• Principal component analysis (PCA) - dimensionality reduction using principal components.
• Multidimensional scaling (MDS) - embedding based on preserving pairwise
distance/similarity [2].
• Force-directed graph drawing - for example using Kamada-Kawai algorithm
with target distance based on chemical similarity [21].
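As an illustration of the projection step, the following sketch embeds points into the plane so that their pairwise Euclidean distances approximate a given distance matrix. It is a deliberately simple variant of metric MDS that minimizes the raw stress by plain gradient descent; it is not the algorithm of any particular toolkit, and the function names, step count, learning rate and seed are arbitrary assumptions.

```python
import math
import random


def stress(points, target):
    """Sum of squared differences between embedded and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total


def mds_embed(target, dim=2, steps=500, rate=0.05, seed=0):
    """Place len(target) points in `dim` dimensions by gradient descent."""
    rng = random.Random(seed)
    n = len(target)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(points[i], points[j]) or 1e-9  # avoid division by zero
                scale = 2.0 * (d - target[i][j]) / d
                for k in range(dim):
                    grads[i][k] += scale * (points[i][k] - points[j][k])
        for i in range(n):
            for k in range(dim):
                points[i][k] -= rate * grads[i][k]
    return points


# Target distances of the four corners of a unit square.
r2 = math.sqrt(2)
square = [[0, 1, r2, 1], [1, 0, 1, r2], [r2, 1, 0, 1], [1, r2, 1, 0]]
layout = mds_embed(square)
print(round(stress(layout, square), 6))
```

Calling `mds_embed` with `steps=0` returns the random starting layout, which makes it easy to compare the stress before and after the optimization.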
Direct visualization, despite being commonly used, has some notable disadvantages. First, it is not suitable for large data sets. Second, the mapping of
molecules to coordinates is often itself dependent on the original input. That
means that the same molecule pictured in two different datasets might be assigned different coordinates. It also means that if we add new molecules to a
dataset, the position of the original molecules might change. This instability
makes the direct visualization impractical for visual data set comparison. However, this latter problem is being targeted by researchers attempting to create
more global, data set independent mappings, such as the ChemGPS and MQN
(Molecular Quantum Numbers) referenced in section 2.1.
An example of the direct visualization approach is the recently published software CheS-Mapper [18].
⁹ Tanimoto distance does satisfy the triangle inequality [28].
Hierarchical visualization
The other approach to visualization of the chemical space is the hierarchical
visualization. One way of obtaining a suitable hierarchy is hierarchical clustering
- using for example again the molecular fingerprint based similarity as described in
the previous section. The other way is creating a structure based hierarchy - such
as the Scaffold Tree [52] described in section 2.2 or the scaffold hierarchy used in
this work, defined in section 3.1. Yet another scaffold based approach is the HierS
algorithm [61], again described in section 2.2. There is also a mixed approach, that
is the hierarchical clustering based on maximum common substructure (MCS),
once again described in section 2.2.
The difference between the hierarchical clustering and the scaffold approaches
is that the scaffolds integrate medicinal chemists’ insights into the process - stemming from the observation that molecule’s scaffold is a significant predictor of its
biological activity - as described in previous sections.
Since this whole thesis is devoted to scaffold based hierarchical visualization,
describing the principles and applications in detail elsewhere, we do not further
elaborate on these topics in this section. Instead, we give examples of existing
software for hierarchical visualization.
One example is Scaffold Hunter [60], which provides visualization based on
the Scaffold Tree as well as on hierarchical clustering.
Another example is Scaffold Explorer [1] - described as “an interactive tool for
organizing and mining structure-activity data spanning multiple chemotypes”.
The scaffolds in Scaffold Explorer are defined interactively by the user at the
time of performing the analysis.
Cheminformatics Toolkits
Cheminformatics toolkits are software packages which provide various functionality related to manipulation of molecules and molecule databases. Such functionality includes:
• reading and writing various chemical file formats;
• molecule manipulation - adding and removing atoms and bonds, converting between different representations - with respect to aromaticity, implicit
hydrogens, ...;
• structural analysis - finding molecule rings (SSSR, CSSR, ...), substructure
matching (SMARTS, ...), finding the maximum common substructure in a set
of molecules;
• calculating molecule fingerprints and similarity;
• modeling of two dimensional and three dimensional molecule arrangement;
• visualization - generating molecule depictions in raster and vector graphics;
• modeling of chemical reactions;
• molecule property prediction (solubility, pKa, ...);
• and a lot more.
At least a dozen toolkits are available, each specializing in a different subset
of functionality and suitable for different tasks. Besides the functionality, the
toolkits also differ in other practical areas such as price, programming languages
supported, availability of the source code and documentation and so on.
The cheminformatics toolkits with the longest tradition are the Daylight
Toolkit¹⁰ - with a history reaching back to the 1980s - and the OpenEye toolkits¹¹ - whose code
is based on a molecule format conversion tool from the 1990s called Babel. Both
toolkits are commercial, expensive and often perceived to be of high quality. The
Daylight Toolkit sets the industry standards in many areas (e.g. Daylight
fingerprints, Daylight SMILES and SMARTS¹²). The OpenEye toolkits are used
for example at the core of the PubChem Compound database [6], which we chose as
the source of our data¹³.
No less important is the family of open-source cheminformatics toolkits. Open
Babel has its roots in the OpenEye Babel tool, same as the OpenEye toolkits. It
does not come as a surprise that Open Babel supports virtually all existing
molecule file formats. Two other notable open-source toolkits are the Chemistry
Development Kit (CDK) and, a young addition, RDKit.
The last example to be listed is ChemAxon JChem toolkit, which stands in
between the two categories as it is proprietary and closed source but free to use
for non-commercial purposes. It is significant in the way that we chose it as the
toolkit to be used in our application (having first experimented with Open Babel
and RDKit); the reasons that led to our choice of JChem are explained in detail
in section 3.2.
But the list of existing toolkits is by no means limited to the examples above
- there are other important projects that simply did not fit into this short introduction.
Computer Representation of Molecules
A non-trivial issue that needs to be considered when computationally processing
molecules is the computer representation of the molecules that is used. A suitable
representation of molecules is needed on the input, another representation is
used for processing and yet another representation is used to store the results
and intermediate results. In this section we give a brief overview of the formats
generally available. In the next chapter, section 3.3, we follow up with discussion
of which formats are used in our application and why.
¹² More information on SMILES and SMARTS is available in section 2.5.
¹³ SMILES strings in PubChem compounds are even stored under the key PUBCHEM_OPENEYE_ISO_SMILES - OpenEye SMILES being a modified implementation of
the original Daylight SMILES algorithm.
The formats of molecule computer representation can be divided into two
broad groups - 1) in-memory working formats used by various cheminformatics
toolkits and 2) serialized molecule formats used for data exchange. The usual
workflow is as follows. Molecules are acquired in a serialized file format. Before
any calculation can be performed on the molecules, they have to be imported to a
chosen cheminformatics toolkit, which effectively means that they are converted
to its in-memory format. This conversion might take a significant time. Then a
series of computations can be performed, intermediate results being stored in the
in-memory format. In the end, the resulting molecules have to be exported, i.e.
serialized to a suitable file format. This export might again be computationally
expensive.
It does not come as a surprise that each of the cheminformatics toolkits comes
with its own proprietary in-memory format. These in-memory formats are mutually incompatible and there is no direct way to convert between them.
As a result, it is not easily possible to combine multiple cheminformatics toolkits in one workflow. Cinfony - a project trying to integrate various
cheminformatics toolkits together - resorts to using export and import functionality to pass molecules between the toolkits [39]. On the other hand this means
that the choice of in-memory representation is tightly tied to the choice of the cheminformatics toolkit used and once the toolkit is chosen, there is no choice (of in-memory
format) to be made at all. As a result, there is no use for us in describing
or comparing various in-memory representations.
In the area of the serialized file formats the compatibility amongst toolkits
is higher but the situation is more complicated too. There is no single standard
format, various toolkits support different subsets of existing formats and implementation details of the conversions vary amongst the toolkits leading to minor
incompatibilities and inaccuracies.
The formats also differ widely in their features. To give an example, some
might completely discard the stereo information¹⁴, some might encode it structurally (e.g. cis/trans configuration and chirality) and some others might encode
it in the form of 3D coordinates of atoms and bonds. But that is just the tip of
the iceberg - the formats differ in all other imaginable aspects too - they differ in
how aromaticity is encoded, in their support for radicals and implicit hydrogen
control, support for isotopes, ability to include additional meta-information and
so on. In the light of that we will refrain from trying to make a comprehensive
comparison of the formats and we will only list and describe some of the most
significant ones.
¹⁴ Stereo information refers to the relative spatial arrangement of atoms and bonds in three
dimensions. Two molecules might seem identical when we only take into account the connectivity
of their atoms but simultaneously be very different in their three dimensional shape. This
difference might also lead to their different biological activity.
Structure Data Format (SDF). SDF is probably the most widely supported
format. It is a very old format, which was created in the 1980s by a company
called Molecular Design Limited [12]. It is a de facto standard, but there is no official specification and the format even exists in two incompatible versions (V2000
and V3000). That altogether leads to incompatibilities between its implementations. Moreover, the format lacks support for some of the advanced molecular
properties beyond connectivity, diminishing its usefulness as an archival format.
On the other hand the format is very simple and straightforward. An SDF
file is composed of multiple so-called Molfile documents delimited by “$$$$”. A
single Molfile starts with a header containing the molecule’s name and some other
optional information (such as a description). The header is followed by a line stating
the number of atoms and the number of bonds. Then atoms are specified, one
per line, including their 3D coordinates. Afterwards the bonds are listed, again
one per line, specifying the two endpoints and some other properties. After the
bonds, arbitrary properties follow.
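The layout just described is simple enough to be read with a few lines of code. The sketch below is a minimal reader for the V2000 variant only, assuming the usual fixed-width columns (three header lines, a counts line with two three-character fields, atom lines with three ten-character coordinates followed by the element symbol). It ignores the property block and all optional fields, and the function names and the sample record are ours.

```python
def parse_molfile(lines):
    """Parse one V2000 Molfile given as a list of text lines."""
    name = lines[0].strip()
    counts = lines[3]  # the header occupies lines 0-2
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        atoms.append((line[31:34].strip(), (x, y, z)))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Atom indices in the file are 1-based; the third field is the bond order.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return {"name": name, "atoms": atoms, "bonds": bonds}


def parse_sdf(text):
    """Split an SDF document on the '$$$$' delimiter and parse each record."""
    molecules, record = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                molecules.append(parse_molfile(record))
            record = []
        else:
            record.append(line)
    return molecules


# Hand-written sample: the two heavy atoms of methanol.
SAMPLE_SDF = "\n".join([
    "methanol",
    "  hand-written sample",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "$$$$",
])

for molecule in parse_sdf(SAMPLE_SDF):
    print(molecule["name"], molecule["atoms"], molecule["bonds"])
```

A production reader would also have to handle the V3000 variant and the data items that may follow the `M  END` line, which this sketch skips.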
The SDF format as it was described is also quite verbose and SDF files tend
to be quite large.
To recapitulate, the format is widely supported but the compatibility of particular implementations is limited. It is a simple format but lacks support for
advanced features and is not space efficient.
SMILES. The “Simplified molecular-input line-entry system”, almost exclusively
called just “SMILES”, is another traditional format with its history reaching back to the
1980s [3], created by Daylight Chemical Information Systems, Inc. There is an
initiative to make it an open standard called “OpenSMILES”. The format is very
different from the SDF described above. It describes a molecule in the form of a one-line
ASCII string, which is even human readable for simple molecules.
Basically, the SMILES representation of a molecule is obtained by a depth-first
traversal of the molecule’s chemical graph. Such a representation is not
unique, but there exists a very useful concept of canonical SMILES - which, using
a canonicalization algorithm, allows each molecule to be associated with exactly one
unique SMILES string. The canonicalization can be computationally expensive
and the canonical representations differ amongst various implementations. Nevertheless, canonical SMILES are immensely useful for comparison of molecules - i.e. answering the question whether two molecules, obtained for example as a result
of some computation, are in fact one and the same molecule or whether
they differ. Such a question is not an easy one - on closer look it is related to the
notoriously hard graph isomorphism problem, which also explains why the export
to the canonical SMILES format is so computationally intensive - once we have the
canonical representation, we can answer the equality question immediately.
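The non-uniqueness of plain SMILES is easy to demonstrate. The following toy writer is a sketch only: it handles acyclic molecules with single bonds, so it needs no ring closure digits or bond symbols, and it performs no canonicalization. The function name and the input representation (a list of element symbols plus a list of bonds) are our assumptions.

```python
def tree_smiles(atoms, bonds, start):
    """Write a SMILES string for an acyclic, single-bonded molecule.

    The string is produced by a depth-first traversal from atom `start`;
    every branch except the last one is wrapped in parentheses.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom, parent):
        children = [n for n in neighbours[atom] if n != parent]
        text = atoms[atom]
        for position, child in enumerate(children):
            branch = visit(child, atom)
            is_last = position == len(children) - 1
            text += branch if is_last else "(" + branch + ")"
        return text

    return visit(start, None)


# Heavy atoms of propan-2-ol: the central carbon (index 1) carries
# two methyl groups and the oxygen.
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]
print(tree_smiles(atoms, bonds, start=0))
print(tree_smiles(atoms, bonds, start=3))
```

Starting the traversal at a methyl carbon or at the oxygen gives two different strings for the same molecule, which is exactly the ambiguity that a canonicalization algorithm has to remove.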
Based on SMILES is the SMARTS language, which allows structural
patterns to be described and is used in substructure searching. Any valid SMILES string is also a
valid SMARTS expression, but the semantics are different (SMILES represents a
molecule, SMARTS represents a pattern¹⁵).
¹⁵ Surprisingly, this SMARTS pattern might not even match the “corresponding” SMILES
molecule. For example the string “C1=CC=CC=C1”, when seen as SMILES, can be interpreted
as an aromatic benzene molecule and not be matched by the cyclohexatriene (C1=CC=CC=C1)
SMARTS pattern.
IUPAC International Chemical Identifier (InChI). The InChI is a recent
format, created in the 2000s to serve as a new open standard for computer representation of molecules [32]. It is backed by the International Union of Pure and
Applied Chemistry (IUPAC), which notably stands behind the “IUPAC nomenclature of organic chemistry” - a current standard for assigning names to organic
compounds. Same as the canonical SMILES, InChI represents a molecule as a
unique ASCII string. The InChI string, unfortunately, is not human readable.
The format, being two decades newer than SDF and SMILES, has been created
with support for many of the advanced molecule properties in mind. The support
for the InChI format itself, however, is not yet widespread.
Based on InChI is the InChIKey, which is a 27-character long key obtained
by hashing a full InChI representation. It has been designed for searches in
chemical databases.
Other formats. There exist innumerable other formats such as the Chemical
Markup Language (CML) - an open XML based standard for molecular and
other chemical data - or the Abstract Syntax Notation 1 (ASN.1) format which
is notable for being PubChem’s native internal data format, and many
more. The Open Babel cheminformatics toolkit is notable for supporting
an immense number (over a hundred) of different formats [40] and so it can
serve as a suitable starting point for anybody who would be interested in further
3. Design Choices
In this chapter, we build upon the foundations established in the previous chapter.
First, we introduce the scaffold hierarchy we use and provide a glimpse into its
origins and into the design process – which is then described in more detail in
chapter 4. After that, we revisit the cheminformatics toolkits and data formats
and explain which of them are used in our application and the rationale behind
The Scaffold Hierarchy
Our scaffold hierarchy forms a rooted tree with a single virtual hierarchy root
(level 0), eight levels of scaffolds (numbered 1-8), and molecules as leaves (level
9). That means that each molecule is mapped to a sequence of exactly eight
scaffolds, one for each level. Each molecule and scaffold also has one uniquely
defined parent.
The decision to use a tree hierarchy in itself had a large impact on the scaffold definitions used. Not all common scaffold hierarchies form trees, a notable
non-tree example being the hierarchy obtained using the HierS algorithm [61]
described earlier. But using a tree hierarchy has notable advantages. Not only
can each molecule be classified by a unique representative at every level, as
was already mentioned; the hierarchy is also largely independent of the
molecules - knowing a level 8 scaffold, we can compute scaffolds on all higher levels, without a need for the original molecule. That also implies that the hierarchy
can be largely precomputed, using a sufficiently large set of representative and
diverse molecules.
For the purpose of visualization, it would be ideal if the tree had a homogeneous branching factor – i.e. if all nodes had a similar number of children.
This property proved to be extremely difficult to enforce using structure based
hierarchization; nevertheless, it played an important role in our decisions about
the hierarchy levels used.
Even if we do not manage to obtain a homogeneous branching factor, we
would like the number of children to at least be limited. That is again for
visualization purposes – when displaying a single point (node) in the hierarchy,
the number of children that can be reasonably visualized as distinct entities seems
to be about 100. This again proved difficult to enforce using structural definitions
only. However, it would seem trivial to achieve using some clustering method - we shall describe our experiments on that later.
One check that has to be done is whether 9 levels are enough – i.e. whether
a 9 level hierarchy with a branching factor limited by 100 can cover our entire
desired chemical space. We use PubChem Compound as the reference chemical
space (more on that in section 4.1), so the target is about 10^8 molecules (leaves).
A tree of height 9 and constant branching factor 100 has 100^9 = 10^18 leaves, which
is indeed sufficient. In fact, for a tree with 10^8 leaves and 9 levels the average
branching factor is only 10^(8/9) ≈ 7.7. So the height of the hierarchy should not be
a limiting factor.
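The capacity estimate can be reproduced with a few lines of arithmetic. This is just a worked check of the numbers quoted above; the helper names are ours.

```python
def leaf_capacity(branching, height):
    """Number of leaves of a tree with a constant branching factor."""
    return branching ** height


def average_branching(leaves, height):
    """Constant branching factor needed to reach `leaves` leaves at `height`."""
    return leaves ** (1 / height)


print(leaf_capacity(100, 9) == 10 ** 18)
print(round(average_branching(10 ** 8, 9), 1))
```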
At the bottom of the hierarchy, the scaffolds are inspired by the original work
on molecular frameworks by Bemis and Murcko [4]. The middle of the hierarchy is defined
based on ring topologies as described by Pollock et al. [44]. The top levels are
then our original design, further abstracting the idea of ring topologies.
Now let us describe the particular levels. For easier understanding of the transformations, four existing drugs together with their scaffolds on every level can be
seen in figure 3.2. The molecules pictured are four known drugs – ibuprofen, sulfamethoxazole, diazepam, and hydrocortisone; the molecule data have
been obtained from DrugBank [62], record ids DB01050, DB01015, DB00829,
and DB00741 respectively.
Level 8: Rings with Linkers Stereo. The bottom level contains all ring
systems and linker atoms - i.e. it corresponds to molecule frameworks as defined
above; however, at this level we conserve all the chemical stereo information that
is left after the removal of side chains. To restate it another way, level 8 scaffolds
are obtained from a molecule by the process of deleting sidechains, performing
standard aromatization, neutralizing charge, removing explicit hydrogens and
radicals, and discarding element isotope information.
Level 7: Rings with Linkers. The seventh level is identical to the bottom
level except that all the stereo information has been discarded. We have
decided to remove the stereo information here because in the next step we are
discarding information about chemical elements and in such reduced model the
stereo information would be absolutely out of context.
Level 6: Murcko Rings with Linkers. This level corresponds to the Murcko
scaffolds described earlier - i.e. it is a skeleton of the original molecule with all
of the chemical information discarded, atom connectivity being the only thing
left. Therefore, to convert a level 7 scaffold to a level 6 scaffold we replace all
elements by carbon and we make all the bonds single. The number of nodes of the
resulting graph is still equal to the number of (non-hydrogen) atoms in the base
molecule framework.
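The step from level 7 to level 6 is a pure relabeling, which the following sketch makes explicit. It assumes a scaffold is stored as a list of element symbols plus a list of `(atom, atom, order)` bonds; this representation and the function name are illustrative assumptions and not the in-memory format of any toolkit.

```python
def to_graph_framework(atoms, bonds):
    """Discard element types and bond orders, keeping only the connectivity."""
    carbon_atoms = ["C"] * len(atoms)
    single_bonds = [(a, b, 1) for a, b, _order in bonds]
    return carbon_atoms, single_bonds


# A three-atom fragment with a double bond and two heteroatoms.
atoms = ["C", "N", "O"]
bonds = [(0, 1, 2), (1, 2, 1)]
print(to_graph_framework(atoms, bonds))
```

The number of atoms and the set of connected atom pairs are unchanged, which is why the node count still matches the framework at this level.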
Level 5: Murcko Rings. Another level is obtained by removal of superfluous
linker atoms - superfluous meaning atoms of degree 2. That means that we only
leave branching linker atoms and replace all linker paths by a single edge. In
this step, for the first time, the size of the graph might be smaller than the size of the
original molecule framework.
Level 4: Oprea. The Oprea scaffolds are obtained by performing a similar contraction on ring atoms. In this process, we remove all remaining vertices of degree
2 by performing the edge merging operation on them. The only exception is when
both of the vertex’s neighbors are connected (i.e. we have a triangle), in which case edge merging would lead to a loss of the cycle. We gain a minimum cycle topological
representation of the original molecule.
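The edge merging rule, including the triangle exception, can be sketched on an adjacency map as follows. This is our reading of the rule stated above and not the code of the actual implementation; the function name is hypothetical. Because it merges every degree 2 vertex regardless of whether it is a linker or a ring atom, applying it to a level 6 graph performs the level 5 and level 4 contractions in one go.

```python
def contract_degree_two(adjacency):
    """Merge away degree-2 vertices unless their two neighbours are adjacent."""
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for v in list(graph):
            if len(graph[v]) != 2:
                continue
            a, b = graph[v]
            if b in graph[a]:
                continue  # triangle: merging would destroy the cycle
            # Replace the path a - v - b by the single edge a - b.
            del graph[v]
            graph[a].remove(v)
            graph[b].remove(v)
            graph[a].add(b)
            graph[b].add(a)
            changed = True
    return graph


def ring(vertices):
    """Adjacency map of a simple cycle through the given vertices."""
    n = len(vertices)
    return {vertices[i]: {vertices[i - 1], vertices[(i + 1) % n]} for i in range(n)}


# A six-membered ring collapses to a triangle.
print(sorted(contract_degree_two(ring([0, 1, 2, 3, 4, 5]))))
```

For two six-membered rings fused along one bond the same procedure leaves four vertices and five edges, i.e. two triangles sharing an edge.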
Figure 3.1: The red ring bond is excluded when calculating a ring connectivity
scaffold; including the edge would form a triangle instead of a linear path
Level 3: Ring Connectivity Extended. Up to now, every vertex of a scaffold
graph corresponded to an atom in the original molecule. However, on levels 2 and
3 the vertices do not correspond to atoms but to entire rings. Level 3 scaffolds
are created from Oprea scaffolds in two steps.
First, the molecule is decomposed to rings. This by itself is a surprisingly
difficult task. Two standard algorithms exist – SSSR (Smallest Set of Smallest
Rings) and CSSR (Complete Set of Smallest Rings) – both giving very unexpected
results in some corner cases. We opted for the CSSR variant as it seems to be
gradually replacing SSSR as the method of choice in cheminformatics frameworks.
Second, having the set of small rings – the vertices of the new graph – we
connect the rings that were connected in the original graph. This again is a lot
more complicated than it seems. We distinguish two types of connectivity:
1. Strong connectivity – when the two rings share an edge in the original graph.
2. Weak connectivity – when the two rings share a vertex in the original graph
or when they are connected by an edge which is not a part of any ring (a
“linker edge”) or when they are connected by a path formed by linker edges.
The rules for strong and weak connectivity try to reflect whether the connection
of the original rings is rigid or flexible. Rings sharing an edge – fused rings – are
considered to be rigid. Other connections are considered flexible/weak.
The restriction to linker edges is to prevent unexpected superfluous edges in
the resulting graph. An illustration of why ring edges
have to be excluded is provided in figure 3.1.
In the scaffold molecule representation, we model strong connectivity as bonds
of order two (double bonds) and weak connectivity as bonds of order one (single
bonds). In a visual representation, we show strong connectivity as bold bonds
and weak connectivity as standard bonds.
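The connectivity rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the thesis: rings are assumed to be already available as ordered cycles of atom indices (for example from a CSSR decomposition), a linker path is assumed to run through non-ring atoms only, and all names (`ring_connectivity`, `cycle_edges`, `STRONG`, `WEAK`) are hypothetical.

```python
from itertools import combinations

STRONG, WEAK = 2, 1  # bond orders used to encode the two connectivity types


def cycle_edges(ring):
    """Edges of a ring given as an ordered cycle (list) of atom indices."""
    return {frozenset((a, b)) for a, b in zip(ring, ring[1:] + ring[:1])}


def ring_connectivity(rings, bonds):
    """Return {(i, j): order} for every connected pair of rings, i < j."""
    edges_of = [cycle_edges(r) for r in rings]
    atoms_of = [set(r) for r in rings]
    ring_edges = set().union(*edges_of)
    # Linker edges are the bonds that are not part of any ring
    linker_edges = [tuple(e) for e in map(frozenset, bonds) if e not in ring_edges]

    rings_at = {}  # atom -> indices of the rings that contain it
    for i, atoms in enumerate(atoms_of):
        for a in atoms:
            rings_at.setdefault(a, set()).add(i)

    parent = {}  # union-find over non-ring atoms joined by linker edges

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for u, v in linker_edges:
        if u not in rings_at and v not in rings_at:
            parent[find(u)] = find(v)

    # Rings attached to the same linker chain, or joined by a single linker edge
    groups = {}
    for u, v in linker_edges:
        if u in rings_at and v in rings_at:
            groups[(u, v)] = rings_at[u] | rings_at[v]
        elif u in rings_at:
            groups.setdefault(find(v), set()).update(rings_at[u])
        elif v in rings_at:
            groups.setdefault(find(u), set()).update(rings_at[v])

    result = {}
    for i, j in combinations(range(len(rings)), 2):
        if edges_of[i] & edges_of[j]:            # fused rings share an edge
            result[(i, j)] = STRONG
        elif atoms_of[i] & atoms_of[j] or any({i, j} <= g for g in groups.values()):
            result[(i, j)] = WEAK                # shared vertex or linker path
    return result


# Two fused four-membered rings, plus a three-membered ring attached via linker atom 6
rings = [[0, 1, 2, 3], [2, 3, 4, 5], [7, 8, 9]]
bonds = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (4, 5), (5, 2),
         (5, 6), (6, 7), (7, 8), (8, 9), (9, 7)]
connectivity = ring_connectivity(rings, bonds)
```

In this toy input the first two rings end up strongly connected and the last two weakly connected, while the first and the third ring stay unconnected because the only path between them runs over ring edges.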
Level 2: Ring Connectivity A simplified variant of the ring connectivity
scaffold removes the distinction between the strong and weak connectivity, making
all bonds single/standard, and distinguishing only between connected and not
connected rings, with the same exceptions as on the previous level.
Level 1: Ring Count On the top level we represent all molecules simply by
their ring count. More precisely, that is defined as the number of vertices of
the ring connectivity scaffolds at levels 2 and 3, which is in turn equal to the
number of rings in the CSSR decomposition of the Murcko scaffold or the original
Level 0: Root A single node at level 0 serves as the root of the scaffold hierarchy.
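To make the relation between the top levels concrete, the following minimal sketch derives the level 2 and level 1 scaffolds from a level 3 scaffold. The representation of a scaffold as a pair (number of rings, dictionary of ring pairs to bond order) and the function names are assumptions made for this illustration only.

```python
def to_level2(level3):
    """Drop the strong/weak distinction: every connection becomes a single bond."""
    ring_count, edges = level3
    return ring_count, {pair: 1 for pair in edges}


def to_level1(scaffold):
    """The ring count is simply the number of vertices of the connectivity graph."""
    ring_count, _ = scaffold
    return ring_count


# Three rings: rings 0 and 1 fused (order 2), rings 1 and 2 weakly connected (order 1)
level3 = (3, {(0, 1): 2, (1, 2): 1})
level2 = to_level2(level3)
level1 = to_level1(level2)
```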
Cheminformatics Toolkit
We provided a brief overview of existing cheminformatics toolkits and the functionality they offer in the previous chapter, section 2.4. In this section we describe
our requirements on a cheminformatics toolkit, our experience with a few selected
toolkits and our choice of the toolkit for the final implementation, including the
rationale behind it.
In the previous chapter we gave a long list of functionality provided by chemical toolkits. Our application only needs a limited subset of the aforementioned
functionality – a subset that is usually available in all the toolkits. First, the
toolkit needs to provide the ability to read and write chemical files; this functionality is needed in order to process user input, data from chemical databases and
to store results (more on chemical formats and their usage in the next section).
Second, the toolkit has to be capable of 2D-depiction of molecules – in order to
create images which are then shown to the users of the interactive visualization
web application. Last, the toolkit should preferably offer comfortable molecule
manipulation capabilities and basic structural analysis – namely the ring system
analysis - which helps to easily calculate the scaffolds.
With respect to features outside the domain-specific functionality, we
would have preferred a framework that is free to use or, better yet, open source.
We had no strict preference regarding the programming language; support
for Python, the JVM or .NET was enough for us. A well-documented and easily
usable framework would obviously be preferable.
Having first prototyped the hierarchy calculations using two open-source toolkits – RDKit1 and Open Babel2 , via their Python wrappers – we finally settled
on using the ChemAxon JChem suite3 . The JChem suite is free to use for noncommercial purposes (although it is unfortunately not open source). It is implemented in Java and requires no installation – compared to RDKit and Open
Babel, which are written in C++ and their installation is a non-trivial task, especially on some platforms. We also found the JChem API quite well documented
– in the case of RDKit and Open Babel, the situation was complicated by the
existence of two different APIs - the C++ API and the Python wrapper API,
with methods often behaving differently in each API and documentation being
scattered over multiple places.
Figure 3.2: Examples of a few well-known drugs and the scaffolds representing
them in the scaffold hierarchy
From the functionality standpoint, all three evaluated frameworks were capable of performing all molecule manipulations and structural analysis that we
needed. All the frameworks were also capable of reading and writing a sufficient
number of file formats (with Open Babel having probably the best support for
chemical file formats of all frameworks, format conversion being its original mission). However, there were some practical differences amongst the frameworks.
Since we have processed not only “ordinary” molecules but also topological scaffolds encoded in the form of molecules, and because these scaffolds do not follow the traditional rules of chemistry (such as the carbon atom having at most four
bonds), we have stumbled upon many corner cases. As a result, some of the scaffolds could not be exported and some others could not be opened
from their serialized forms. In this regard the JChem toolkit seemed to be the
most benevolent while RDKit was the most restrictive4 . The same problem manifested itself in the area of visualization - some of the frameworks were not able
to calculate a 2D structure of the scaffolds and in effect were not able to depict them. Again, JChem was the most benevolent and so the most suitable for
our use case. Last but not least, we also found the JChem generated graphics
aesthetically pleasing, which can be considered an advantage for an application
targeted at visualization of chemical structures.
Another reason to choose JChem was the availability of extension modules that appeared potentially useful for our use case.
Amongst them were the ChemAxon Standardizer, which assists in the conversion of structures to their canonical forms (useful to eliminate duplicates and to assess equality of molecules); the ChemAxon JKlustor, which provides both hierarchical and nonhierarchical clustering of compounds (which aligned with the aim of this thesis); and tools for storage of molecules in databases (which might have been useful
for our precomputed background hierarchy). In the end, none of these modules
was used. Either their practical functionality was not suitable for us, or it was
suitable but there was little benefit to be gained compared to the disadvantages
of using a paid module (such was the case with Standardizer).
Worth mentioning is also the Chemistry Development Kit (CDK) - a free and
open source toolkit written in Java. It has not been evaluated by us mainly
because we were already satisfied with JChem. It is however open source and
free to use even for commercial purposes. Being JVM-based, it might serve as
a straightforward replacement of JChem in our application, had there been a
suitable use case.
Representation of Molecules and Scaffolds
Based on the information in the preceding chapter, section 2.5, let us discuss
which formats are used by our application. First of all, the in-memory format. It
has already been discussed that the JChem toolkit is used for all processing, and
together with it comes its proprietary in-memory representation of molecules. For
a user of our application this is completely invisible and for a programmer it is
only relevant in the few files of the source where the calculations are defined.
This cannot be considered a shortcoming of the framework in general because the scaffolds
are indeed nonsensical when interpreted as molecules and the framework only performs a correct
structural sanity check. However, it would have been useful if such checks could have been
For the user input, we would ideally like to support as many formats
as possible. This goal has been achieved by using JChem to parse user input
files together with automatic format detection. JChem, and thus our application,
supports almost all of the formats described in the previous section (except for
ASN.1) and many more 5 .
The format in which data from PubChem Compound are imported
also had to be chosen. The best option would seem to be PubChem’s native format,
ASN.1. Unfortunately, the JChem toolkit does not support ASN.1 input,
so we had to choose a different format. We tried SDF/Molfile as well as SMILES
provided by PubChem. Interestingly, many of the molecules differed slightly when
imported from the two different formats. Empirically it seemed that when there
is a conflict, the version imported from SMILES matches the depiction on the
PubChem site. Based on that we decided to use the SMILES data as the source6 .
The most interesting choice was that of the internal serialized format.
In our application, we store not only molecules but, more importantly, also
scaffolds. The scaffolds are not true molecules and on some levels of our hierarchy
they have no direct chemical interpretation at all. Still, all the scaffolds (except
for ring counts) have been designed in such a way that they form a molecule-resembling graph. So we are indeed able to save them in a molecule format, even
though it might not make sense semantically. But does saving them as molecules
have any advantages over saving them as graphs? As it turns out, it
does, and surprisingly many of them.
Some formats can solve the canonicalization problem for us - using a canonical serialized form we can compare two scaffolds and easily determine whether
they are equal (without having to solve a difficult graph isomorphism problem
ourselves). An ability to perform such equality comparisons on the scaffolds is
absolutely critical. As will be discussed later, we store a preprocessed scaffold hierarchy based on the PubChem Compound database to be used as a background
for the hierarchical visualization. During the preprocessing, using canonical forms
of scaffolds prevents duplicates from being introduced into the database. Later,
the canonical forms of scaffolds extracted from user data sets are compared to
the records in the database to determine whether the scaffold is present in the
database and where in the hierarchy it belongs (this would have been almost
impossible using graph-isomorphism based equality checks).
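The role of canonical forms in deduplication and lookup can be illustrated with a minimal sketch. It assumes that the canonical strings have already been produced by a cheminformatics toolkit; the class `ScaffoldIndex` and the example SMILES strings are hypothetical and serve only to show that equality of scaffolds reduces to equality of strings.

```python
class ScaffoldIndex:
    """Maps canonical scaffold strings to numeric IDs; equal strings share one ID."""

    def __init__(self):
        self._ids = {}

    def add(self, canonical):
        """Return the ID of `canonical`, registering it on first sight."""
        return self._ids.setdefault(canonical, len(self._ids))

    def find(self, canonical):
        """Return the ID of a known scaffold, or None if it is not in the background."""
        return self._ids.get(canonical)

    def __len__(self):
        return len(self._ids)


# Preprocessing: the third entry is a duplicate and does not create a new record
background = ScaffoldIndex()
for canonical in ["C1CCCCC1", "C1CCC2CCCCC2C1", "C1CCCCC1"]:
    background.add(canonical)

# Later: scaffolds extracted from a user data set are looked up by their canonical form
known = background.find("C1CCCCC1")
unknown = background.find("C1CC1")
```

Both operations are plain dictionary lookups, so no graph isomorphism test is needed at query time.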
Another advantage of using a standard molecule format for scaffolds is that
import/export functionality is readily available in the toolkits, so we do not
have to implement the serialization logic ourselves. Moreover, the scaffolds at
the bottom two levels of the hierarchy are very close to molecules and so using
molecule serialization format is suitable and needed for them. And using the same
format on all levels of hierarchy helps to simplify the implementation; having all
the scaffolds available in a common chemical format allows for easy generation of
their images – used in the visualization; and so on.
See https://docs.chemaxon.com/display/docs/Molecule+Formats.
A somewhat peculiar detail is that PubChem does not offer the Compound
database for download in the SMILES format directly, and so we need to extract the
SMILES strings from metadata in SDF files, effectively using both formats.
Based on these requirements, a suitable serialization format has to be picked
for both the scaffolds and the molecules – preferably a common one. Absence of
canonicalization rules out the SDF, which would not have been suitable anyway,
due to its size inefficiency. But there is no clear favorite between the canonical
SMILES and the InChI format. In the end, the canonical SMILES format has
been chosen, mostly on the basis of being more commonly used and more widely
supported by other tools that the target users may use. Also, compared to InChI,
the SMILES format seems to be a little more space efficient, which is an advantage
too, with our background database being quite large. Nevertheless, the InChI
format would probably have been a comparably suitable choice.
Finally, the choice of the output format was quite obvious. As SMILES is
used everywhere internally, it also makes sense as an output format. No conversion is needed and thus no data is lost in translation. Also, the export functionality can be implemented without using a cheminformatics toolkit (which is useful
in a browser environment). And the SMILES format is also widely supported
which is crucial for further processing.
4. Data Extraction and Processing
In this chapter, we describe some of the work that has been done before the
implementation of the visualization tool has even begun. It shall shine some
additional light on the design choices described in the previous chapter, as well as
on the choices yet to be described. The description of the process is significantly
simplified relative to what has really been done, because trying to explain all the
problems that we encountered and solved would come at the expense of clarity.
Therefore, we only mention the key problems and design choices.
Source of Data
The aim of the thesis was clear - visualize the user input relative to a reference
background - so that molecules are anchored to some fixed points, avoiding the
instability of some visualization methods that was described in section 2.3. What
a suitable source of this background would be, however, was not clear.
Choosing the data source
Using the whole drug-like chemical space, size uncertain but well above 10^20,
was not an option. So it has been decided to use one of the existing databases
of chemical structures. There is a rich choice of such databases; some notable
examples of the databases that were examined can be seen in table 4.1.
An important criterion was for the database to be a free access one (ruling out
ChemNavigator) and downloadable (excluding ChemSpider). Another factor to
consider was the context information that the chosen database would provide - as
we compare frequency of a scaffold in a user dataset to the frequency of the same
scaffold in the background database, a generic database of compounds (such as
PubChem) would give a different picture than a database of known drugs (e.g.
DrugBank)1 . But the decisive difference was the number of distinct scaffolds
contained - the more scaffolds the background would contain, the larger portion
of the chemical space we would visualize. The winner, in this respect, is the
PubChem database - it is a large one and according to Wester et al. [59] it also
has the highest scaffold diversity. So we opted for PubChem.
However, PubChem also exists in two varieties - PubChem Substance and
PubChem Compound - the latter being a cleaned and standardized form of the
former one. It was not quite obvious, which one to choose - with the Substance
database providing higher diversity and Compound database offering cleaned
data. In the end, we processed both and, based on the results, we opted for
the Compound database. That was mostly due to mixtures being present in the
Substance database, which led to many disconnected scaffolds representing two
or more compounds simultaneously, carrying no useful information and acting as
noise in the data. But before any such analysis could be performed, we
had to gain access to the PubChem data.
The ideal solution would probably be to allow comparison to multiple background databases
simultaneously. That idea has not occurred to us originally and is not implemented in the
current version, but it would be a possible extension, reasonably straightforward to implement.
Size          Structures contained and comments
91 million    Pure and characterized chemical
221 million   Mixtures, extracts, complexes and uncharacterized substances.
1.6 million   Drug-like molecules, including biological
57 million    Integrates structures from over 500 databases; does not allow downloads.
23 million    Commercially available compounds.
8.2 thousand  Drug molecules, detailed data.
102 million   Commercially accessible screening compounds; commercial access only.
Table 4.1: Examples of chemical structure databases; sizes as of July 23, 2016
Accessing the data
Accessing the PubChem data was quite manageable. PubChem offers FTP access
to the databases, with the molecule data available in three formats - ASN.1, SDF
and (non-standardized) XML. The ASN.1 format is the internal representation
used by PubChem, but unfortunately, it is not readable by common cheminformatics toolkits. Of the three formats, only SDF is commonly supported, so that
was our choice. Surprisingly, the Compound SDF data contained each molecule
twice - once in the native Molfile format (commonly used in SDF) and once in
the form of OpenEye SMILES, included as a Molfile property. When imported
by ChemAxon JChem toolkit, the molecules obtained from the Molfiles were not
equal to the molecules obtained from the SMILES strings. The differences were in
aromaticity and stereochemistry. We checked a number of the differing molecules
by hand against their graphical representation on the PubChem website and, based
on that, chose the SMILES variant, which provided results more consistent with
the reference PubChem depiction.
The size of PubChem Compound in compressed SDF format is 61.55 GB; the
Substance database is a bit smaller at 56.06 GB2 , which is surprising, considering that it contains two and a half times as many molecules. The difference is
due to the fact that the Compound database provides a richer set of molecule properties (such as the aforementioned OpenEye SMILES strings). Even though the
amount of data is not small, it still fits a common hard drive comfortably, so we
mirrored the data for further analysis.
Still, even when stored locally, the process of reading all the molecules in
such a database is a lengthy one, consuming many hours of CPU time 3 . Because
of that, we extracted the only information we use - the SMILES strings - and
stored it in a custom compressed database. In this form the database is not only
40× smaller (1.55 GB for the Compound database) but it is also approximately
2.3× faster to load. In the end, that saved us days of processing and allowed
faster experimenting. This importing procedure made it to the final application
and exists as the ImportPubChem generator task (more on generator tasks in the
following chapters).
Both sizes current at the date of last update - July 23, 2016.
Using the ChemAxon JChem, on a contemporary 4 GHz CPU and sufficiently fast storage.
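The extraction step can be sketched as follows. This is a simplified illustration, not the ImportPubChem task itself: the data item name `PUBCHEM_OPENEYE_ISO_SMILES` is an assumption about the tag used in the PubChem SDF files of that time, the storage format (gzip-compressed, one SMILES string per line) is merely one plausible choice for a "custom compressed database", and the function names are hypothetical.

```python
import gzip

SMILES_TAG = "PUBCHEM_OPENEYE_ISO_SMILES"  # assumed name of the SDF data item


def extract_property(sdf_text, tag):
    """Return the first-line value of data item `tag` from each record of an SDF text."""
    values = []
    for record in sdf_text.split("$$$$"):          # "$$$$" terminates an SDF record
        lines = record.splitlines()
        for i, line in enumerate(lines[:-1]):
            if line.startswith(">") and f"<{tag}>" in line:
                values.append(lines[i + 1].strip())
                break
    return values


def pack(smiles):
    """Store the strings one per line and compress them."""
    return gzip.compress("\n".join(smiles).encode("utf-8"))


def unpack(blob):
    """Inverse of pack()."""
    text = gzip.decompress(blob).decode("utf-8")
    return text.split("\n") if text else []


# Two heavily shortened records; the Molfile blocks are empty placeholders
SAMPLE_SDF = """1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <PUBCHEM_COMPOUND_CID>
1

> <PUBCHEM_OPENEYE_ISO_SMILES>
CCO

$$$$
2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <PUBCHEM_OPENEYE_ISO_SMILES>
c1ccccc1

$$$$
"""

smiles = extract_property(SAMPLE_SDF, SMILES_TAG)
restored = unpack(pack(smiles))
```

Because only the SMILES strings are kept, subsequent runs never have to parse the Molfile blocks or the remaining properties again.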
Initial Analysis
With the data readily available, the next step was to design the hierarchy. We
decided to construct the hierarchy using the ring topology based scaffolds, for
reasons stated before. However, the exact levels to be used were not obvious at
the beginning.
As stated in section 3.1, we had several requirements on how the hierarchy
should look. One of the requirements was that the branching factor should not
be too big, so that children of every node in the hierarchy could reasonably be
displayed on a single screen. This should be satisfied on every level, including
the root level - implying that the number of scaffolds at the top level (i.e. the
children of the root) should be small enough. The highest level ring topology
based scaffolds previously known were the Oprea scaffolds (as described in section
2.2). But it was uncertain how many such distinct scaffolds would be present in the
Compound database - would they be remotely suitable as the top level scaffolds?
That was the first thing to be determined.
As it turns out, the number of Oprea scaffolds in Compound database is 138
thousand, making them very unsuitable candidates for the top level scaffolds.
This led us to design additional levels above the Oprea scaffolds, which in the
end form level 4 of the scaffold hierarchy (described in section 3.1).
The first attempt to abstract from Oprea scaffolds was to represent every ring
by just one vertex and include the information about which rings are connected and how.
Here, we distinguish two types of connectivity - a strong connectivity, when two
rings share a common bond (graph edge), and a weak connectivity otherwise
(more details in section 3.1). This abstraction only brought down the count to
119 thousand, still infeasible for the top level.
In the next step we disregarded the connectivity type, only distinguishing connected and not connected rings. The count was brought down more significantly,
to 50 thousand, which is unfortunately still too high.
So in the last step, the ring connectivity information was removed completely,
distinguishing the ring topologies only by ring count. This yielded an almost
optimal number of 102 top level scaffolds.
Altogether, the described scaffold types form the top four levels of the final scaffold hierarchy. Having the top levels ready, we designed the bottom four levels,
this time taking most of the inspiration from the original article on molecular
frameworks by Bemis et al. [4]. Again, in order to keep the number of children
small, we aimed for steps between levels that were as small as possible; however, we did not
subdivide steps where doing so seemed chemically nonsensical. A notable example where we opted not to subdivide is the step between level 7 (rings with
linkers) and level 6 (Murcko rings), which, as described in section 3.1, discards bond
multiplicity and heteroatoms (and other less important properties) all at once.
That is a huge leap in the information contained (leading to some level 6 scaffolds
having an excessive number of children); however, we could not see how a smaller
step could be made. For example, discarding heteroatoms and keeping bond multiplicity,
or the other way around, did not seem chemically justifiable, and we did
not find any examples in the literature where such partially simplified scaffolds
would be used4 .
Number of scaffolds by branching factor
1 274
7 476
5 756
1 602
Table 4.2: Scaffolds by number of children, divided into four bins; the table shows
how many scaffolds there are in each bin
Clustering and Similarity
Having the initial hierarchy designed, we analyzed how well it satisfies our design
goals. Most importantly, we wanted to know how many scaffolds have an excessive
number of children and how many children they have.
Even from some of the counts mentioned in the previous section, namely that
there exist 102 distinct ring counts (level 1) and 50 thousand ring connectivity
scaffolds (level 2), i.e. the average number of children for each ring count is
almost 500, it is obvious that the ideal number of children is exceeded - either
often or somewhere significantly.
We divided the scaffolds into four groups according to the number of children they
have: optimal (1-100), good (101-400), large (401-1600), excessive (>1600)5 . Table 4.2 shows the absolute and relative count of scaffolds in each group by level.
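The binning itself is straightforward; a minimal sketch follows. The function names are hypothetical, and the input is assumed to be a list of child counts of scaffolds that have at least one child.

```python
from collections import Counter

# Upper bounds of the bins; anything above the last bound is "excessive"
BINS = [(100, "optimal"), (400, "good"), (1600, "large")]


def branching_bin(children):
    """Name of the bin for a scaffold with the given number of children (>= 1)."""
    for upper, name in BINS:
        if children <= upper:
            return name
    return "excessive"


def bin_counts(child_counts):
    """Number of scaffolds in each bin."""
    return Counter(branching_bin(c) for c in child_counts)


example = bin_counts([3, 250, 90, 2000])
```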
Although the vast majority of scaffolds is in the “optimal” group, in absolute numbers there is still a significant number of scaffolds which have an
excessive number of children. We decided to try to break up the children of these scaffolds into
groups by clustering them based on their similarity. That naturally led us to
the question of how to define scaffold similarity.
It has been described above that in cheminformatics, molecular fingerprint-based similarity calculations are the standard method of assessing molecule similarity. However, the scaffolds are not molecules - especially the scaffolds from level
6 up are devoid of their chemical features - and so calculating their fingerprints
would be misleading (although, in principle, possible). Instead, considering that
these scaffolds are plain simple graphs, we decided to use a similarity measure
based on their graph edit distance.
But it would most likely be an interesting research question for medicinal chemists whether
there exists a reasonable class of scaffolds between rings with linkers and Murcko scaffolds that would have a higher predictive value than Murcko scaffolds themselves.
The limits - 100, 400, 1600 - come from an idea of organizing the scaffolds into a square
grid - 100 scaffolds fitting into a 10×10 grid, 400 fitting into 20×20, and 1600 into 40×40.
We ended up implementing a procedure that, for each scaffold with a large
amount of children, calculated pairwise similarity of its children and then, based
on this similarity measure, grouped the children into a fixed number of clusters.
To calculate the graph edit distance, we used Graph Matching Toolkit [48], which
was kindly provided to us by one of the authors - Kaspar Riesen. Graph Matching
Toolkit implements several fast heuristics for the graph edit distance problem,
which is quite significant as the problem is known to be NP-hard. When clustering
the scaffolds, we came across the problem that the most commonly used clustering
algorithms (and libraries) expect points in Euclidean space, whereas we only had
pairwise distances but no coordinates. We opted for the k-medoids algorithm,
which is similar to the well-known k-means, but only relies on distances and
does not use coordinates. To perform the calculations, we used ELKI6 [50],
which implements the k-medoids algorithm.
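To show why k-medoids fits this setting, here is a minimal sketch of its simple alternating variant that works on a precomputed distance matrix only. It is not the ELKI implementation and makes no claim about which variant ELKI uses; the naive initialisation (the first k items) and the function names are assumptions of this illustration.

```python
def assign(dist, medoids):
    """Label every item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100):
    """Cluster items given only a symmetric matrix of pairwise distances."""
    medoids = list(range(k))  # naive deterministic initialisation
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:                 # keep the medoid of an empty cluster
                updated.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to the others
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:              # converged
            break
        medoids = updated
    return medoids, assign(dist, medoids)


# Toy distances between six items lying at positions 0, 1, 2, 10, 11, 12 on a line
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = k_medoids(dist, 2)
```

The algorithm only ever reads entries of `dist`, so any distance measure, including a graph edit distance, can be plugged in without defining coordinates.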
The results looked promising - scaffolds in the resulting clusters were indeed
very similar to each other. But upon closer look, they were also similar across
the clusters. Indeed, as they were all children of a single parent scaffold, they
mostly shared the same basic substructure and differed from it only slightly.
We realized that the graph edit distance is not a reliable measure of scaffold
similarity and, even worse, that we cannot tell which measure would be. However,
as in the case of further structural subdivision of the levels, the question of defining
scaffold similarity is probably a research problem better left to qualified medicinal
chemists rather than computer scientists. Regardless, with no suitable similarity
metric, the attempts to cluster the scaffolds reached a dead end.
Another area where scaffold similarity came into play was the visualization itself.
Our idea was to create a hierarchical visualization akin to a zoomable geographical
map - a concept most computer users are familiar with. In order to do that, we
would need to be able to embed the children of every scaffold into a plane, starting
with the children of the root - i.e. the top level scaffolds. Embedding a set
of scaffolds into a plane is a problem similar to traditional (non-hierarchical)
visualization of a set of molecules, as described in section 2.3. Same as in the
case of molecules, there is no obvious method to map scaffolds to coordinates
in a two-dimensional plane. But if we have a similarity measure on scaffolds
available, we can use the previously described methods - principal component
analysis, multidimensional scaling or force directed layouts - to solve the problem.
Unfortunately, as we described, obtaining a scaffold similarity measure is hard.
Not only was our measure based on graph edit distance unreliable, it was
also very expensive to compute. Graph edit distance is an NP-hard problem and
even the heuristics are quite slow, or at least far from instantaneous on graphs of
the size of the scaffolds. On top of that, the number of computations needed to calculate a distance matrix
is quadratic in the number of displayed scaffolds. Some rare
scaffolds in our hierarchy have over 16 000 children, which means that over 128
million distance computations would be needed for such a scaffold to calculate its
children’s pairwise similarity. This is beyond the reach of a common computer - with
one such calculation taking about 100 ms, it would take months to calculate the
distance matrix even for one such set of children. And even if we could overcome
this difficulty by optimizing the computation a little more and performing it on
a computing grid, the sample layouts that we obtained from using graph edit
distance as the similarity metric looked almost random. So we discarded graph edit
distance even for this use case.
Environment for Developing KDD-Applications Supported by Index-Structures.
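The estimate can be checked with a short calculation. The sketch below assumes a symmetric distance, so that each unordered pair is computed once, and uses the figures quoted in the text (16 000 children, roughly 100 ms per distance); the variable names are illustrative.

```python
def pairwise_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


n_children = 16_000
seconds_per_distance = 0.1                       # about 100 ms per computation

pairs = pairwise_count(n_children)               # roughly 128 million pairs
days = pairs * seconds_per_distance / 86_400     # single-threaded running time in days
```

With these inputs the running time comes out at almost five months for a single set of children.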
Without a reasonable plane embedding being available, the idea arose to display the scaffolds in the form of a tree map [53], using each scaffold's frequency in the
background hierarchy to determine its size (significance). Having reviewed some papers on tree maps, we chose the Squarified TreeMap
algorithm [9], which aims at generating layouts in which the rectangles are approximately square. The tree map layout subjectively felt both sufficiently clear
and aesthetically pleasing, so it became the final form of the visualization. Moreover, had we managed to find a suitable metric, it would be possible to integrate
the similarity information into the tree map layout using the Nmap algorithm [14],
which creates the layout based on given anchor point coordinates, keeping the
resulting rectangles as close to their original anchor points as possible. Therefore
we believe that the approach of using tree maps to visualize scaffold hierarchy
members is a solid one; and once the problem with scaffold similarity is solved,
the ideal form would use a distance-similarity preserving layout.
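The idea behind the squarified layout can be sketched compactly: rectangles are placed strip by strip along the shorter side of the remaining free area, and a strip is extended only as long as its worst aspect ratio does not get worse. The code below is an illustrative sketch of that greedy scheme, not the implementation used in the application; it assumes positive weights sorted in descending order (for example scaffold frequencies), and all function names are hypothetical.

```python
def _strip(areas, x, y, dx, dy):
    """Lay `areas` out as one strip along the shorter side of the free rectangle."""
    total = sum(areas)
    rects = []
    if dx >= dy:                      # strip along the left edge, stacked vertically
        width = total / dy
        for a in areas:
            rects.append((x, y, width, a / width))
            y += a / width
    else:                             # strip along the top edge, placed side by side
        height = total / dx
        for a in areas:
            rects.append((x, y, a / height, height))
            x += a / height
    return rects


def _rest(areas, x, y, dx, dy):
    """Free rectangle that remains after the strip for `areas` has been placed."""
    total = sum(areas)
    if dx >= dy:
        width = total / dy
        return x + width, y, dx - width, dy
    height = total / dx
    return x, y + height, dx, dy - height


def _worst(areas, x, y, dx, dy):
    """Worst (largest) aspect ratio among the rectangles of a candidate strip."""
    return max(max(w / h, h / w) for _, _, w, h in _strip(areas, x, y, dx, dy))


def _layout(areas, x, y, dx, dy):
    if not areas:
        return []
    if len(areas) == 1:
        return _strip(areas, x, y, dx, dy)
    i = 1
    # Grow the strip while adding the next item does not worsen the aspect ratio
    while i < len(areas) and _worst(areas[:i], x, y, dx, dy) >= _worst(areas[:i + 1], x, y, dx, dy):
        i += 1
    head, tail = areas[:i], areas[i:]
    return _strip(head, x, y, dx, dy) + _layout(tail, *_rest(head, x, y, dx, dy))


def squarify(weights, x, y, dx, dy):
    """Return (x, y, width, height) rectangles for positive weights sorted descending."""
    if not weights:
        return []
    scale = dx * dy / sum(weights)    # turn weights into areas that fill the rectangle
    return _layout([w * scale for w in weights], x, y, dx, dy)


rects = squarify([6, 6, 4, 3, 2, 2, 1], 0, 0, 6, 4)
```

In this example the two largest items are stacked in a strip of width 3 on the left, each becoming a 3 × 2 rectangle, before the layout continues in the remaining area.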
5. Implementation
In this chapter we describe the architecture and selected implementation
details of our application. We discuss the reasons why the particular
architecture has been chosen. We follow up with a list of selected technologies
being used. And then we describe the implementation itself, especially with respect
to modifiability.
Our application is divided into three distinct parts - a client, a server and a
generator. The client application is a single-page web application offering the
visualization functionality of our solution. The server provides support to the
web application - mostly in the form of API to access the precomputed background hierarchy as well as the functionality of the chemical framework. The
generator project is a simple command line application which serves the purpose
of computing the background hierarchy.
This architecture is not arbitrary but follows from the underlying problem
that is being solved. The motivation to provide the visualization in the
form of a web application stems from usability reasons, such as a low access barrier,
cross-platform availability or even the potential for a later third-party integration into a
larger chemical web project. So the obvious solution would be to implement the
whole solution as a standalone client-side application. And we would have done
so, had it been viable.
Client-server model
There were two separate reasons why a standalone client-side web application
was infeasible and a server component was needed. The first reason was the
prohibitive size of the background hierarchy used, the second reason was the
difficulty of performing chemical calculations on the client side.
First, to the background hierarchy size. The background hierarchy, calculated
up to its specification described in section 3.1 and based on the PubChem Compound database, contains over 14 million unique scaffolds (as of July 2016). For
each of the scaffolds we need to store at least its ID, ID of its parent and the
scaffold itself - in our case we used the canonical SMILES representation, which
is very space efficient, but it was still 45 characters long on average. Additionally,
we would like to at least reference children from the parents, which is in total
another ID per record. Although it is a very crude estimate (in reality we store
more information and have secondary indices), it is 14 million records, each 57
bytes on average (3 × 4 bytes for IDs and 45 bytes for SMILES), totaling 800
megabytes of data. That is not only infeasible to bundle with the application
(even when compressed), but it is also on the verge of the memory limit of modern
browsers1 . There might be ways to work around these problems but there was
no sufficient motivation to attempt such an extreme feat.
For example the Google Chrome version 51.0 (32 bit) has memory limited to 756.3 MB
per page. This information is based on the window.performance.memory.jsHeapSizeLimit
property accessible from browser’s console.
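The crude size estimate can be reproduced with a few lines of arithmetic. The sketch uses only the figures given in the text (14 million records, three 4-byte IDs and 45 bytes of SMILES per record); the variable names are illustrative.

```python
records = 14_000_000          # unique scaffolds in the background hierarchy
id_bytes = 4                  # size of one numeric ID
ids_per_record = 3            # own ID, parent ID, one child reference on average
smiles_bytes = 45             # average length of the canonical SMILES string

record_bytes = ids_per_record * id_bytes + smiles_bytes
total_mb = records * record_bytes / 1_000_000
```

The result, just under 800 MB, already exceeds the 756.3 MB per-page limit quoted for Chrome, before any secondary indices are counted.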
In our design, we store the hierarchy on server and the client application
accesses the required data through an API.
Second, to the topic of performing chemical calculations in the browser. To our
knowledge, there is no JavaScript cheminformatics framework available. There
are some cheminformatics browser components by ChemAxon (such as Marvin
JS2 ), which was another reason why we closely explored the ChemAxon product portfolio
(which in the end led to the choice of ChemAxon JChem), but none of these
components fulfilled our needs (described in section 3.2).
We also experimented with using RDKit or Open Babel, transpiled from C++
to JavaScript using Emscripten [63], a compiler plug-in for LLVM [27], following
a helpful series of blog articles on this topic written by Noel O’Boyle [41]. On
the one hand, we managed to create a functional prototype; on the other hand,
the added complexity compared to using these frameworks directly was very
significant, and the potential advantages (saving server CPU time) were not worth
such a significant complication of the application design.
Finally, we settled for the ChemAxon JChem toolkit and implemented all the
chemical calculations on the server side. The implementation is very straightforward and is shared between the server and the generator. The client application
accesses the needed chemical calculations through a high-level API.
Still, all logic that could reasonably be implemented on the client side is implemented there, resulting in a thick client - thin server architecture.
Preprocessing the hierarchy
The PubChem Compound database in compressed SDF format is 61.5 gigabytes
large as of the time of writing (July 2016). It is virtually impossible to
use this form directly for any kind of online scaffold analysis. Thus, the data
have to be preprocessed. This processing of a source background database into a
form suitable for online analysis is the primary purpose of the generator project.
For each molecule, scaffolds on all levels are computed and assigned unique numerical
identifiers, and these identifiers are stored in a tree allowing for instant retrieval
of a scaffold’s parent and children. During the development, we also used the
generator project to run all kinds of ad-hoc analytical queries on the generated
database - a practice that we can only recommend.
Technologies Used
Having described the three main blocks of the application, let us offer a list of
some technologies used in their implementation as well as our experience with
them. This list of technologies is by no means exhaustive and the technologies
are listed in no particular order.
Scala All parts of the application - the generator, the server and the client - are
created using a single programming language, Scala [30]. Scala is natively a JVM
language, that is, a language that runs on
the Java Virtual Machine (JVM). Scala was created by Martin Odersky,
who is also the author of the current reference Java compiler and who directly
influenced the addition of generic types to Java3. Scala is designed to be
compatible with Java: it is fully interoperable with Java (which in particular implies
that all existing Java libraries can be used) and it can almost serve as a drop-in
replacement for Java, as it is possible to write very Java-like code in Scala. But
Scala also has numerous advantages compared to Java - it has been designed as a
functional language, it has a great standard library, it is very extensible and has
a great type system, which altogether allows for very succinct code to be written;
and shorter code arguably leads to better maintainability and fewer bugs - at least
such has been the author’s experience with the Scala language.
We were bound to choose a JVM language due to our choice of ChemAxon
JChem as our cheminformatics toolkit. We chose Scala over Java partially because of the advantages listed above and partially because Scala can also be used
as a language for browser development, allowing us to use only one language all
across the application, simplifying the development and increasing the maintainability significantly.
Scala.js Scala.js [13] is a transpiler from Scala to JavaScript. It allows browser
applications to be written in Scala. Due to the resemblance between Scala and
JavaScript - both being object-oriented and functional C-family languages - the Scala
code is very close to the equivalent JavaScript code; that specifically allows most
of the documentation and resources available for JavaScript programming to be
used. Scala also has a significant advantage over JavaScript in the form of static
type checking.
We implemented a prototype client application in both Scala.js and JavaScript.
Comparing the two implementations, we chose to use Scala.js for our project.
The advantages were many. Using Scala on both the server and the client
significantly simplified interoperability between them, allowing, for example, the
same data structures to be used on both sides - again reducing code duplication
and improving maintainability. The static type checking that Scala has over
JavaScript also prevented countless errors, simplifying the development. Finally,
it was also very comfortable to use only one language, reducing mental
context switching. The most significant disadvantage
we came upon was that the transpiled code is very hard to debug in the browser.
Another slight disadvantage could be that wrappers have to be written around
native JavaScript libraries, but it was not a problem in our case as wrappers for
all the libraries we used already existed.
Overall, our experience with Scala.js was great and we would recommend it
for any comparable project.
React React4 is a JavaScript library for building user interfaces, which we used
in the client application. React breaks the application into distinct composable
components. Each component gets a set of properties and can also have an inner
state (which, however, is discouraged). A component also has a render function
that, based on the properties and state, renders the component into a virtual DOM
(an equivalent of the standard browser DOM, which is, however, a lot faster to update).
React then automatically manages rendering components to the virtual DOM (when
properties or state are changed) and also synchronizes the virtual DOM to the browser
DOM.
3 Odersky created a language called Generic Java, which was a superset of the Java language
with added support for generic types. The design has been used by Sun Microsystems, Inc. as
the basis for generics support in Java.
We used React through the scalajs-react5 wrapper by David Barri, which provides
a type-safe API to access React. We were satisfied with both React and scalajs-react.
Diode Diode6 is a Scala.js library by Otto Chrons, used for managing application
state. The state is a collection of immutable objects. Updates of the state
are performed exclusively by a designated handler, based on a stream of actions.
Diode facilitates building an application based on the so-called unidirectional data
flow - a paradigm commonly used with React; in simple terms that means that
the application has a central state (a single source of truth), this central state is
passed to React components (in form of properties) and based on the state the
components are rendered. The components may dispatch actions, for example on
user interaction. These actions are then processed by the central handler, leading
to the state being updated - and the cycle is closed.
The unidirectional data flow architecture plays well with a functional language, such as Scala, with emphasis on immutable data collections. The Diode
library worked well for us and can be recommended.
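The cycle described above can be illustrated in plain Scala. The following is a minimal sketch of the unidirectional data flow idea only; it deliberately does not use the Diode API, and the state fields, action names and the render function are hypothetical.

```scala
// Unidirectional data flow in miniature: state -> view -> action -> new state.
// Plain Scala only; none of these names come from Diode or from the application.
object UnidirectionalFlowSketch {
  // Immutable central state (a single source of truth).
  final case class AppState(selected: Set[Int], sortAscending: Boolean)

  // Actions describe what should change, not how.
  sealed trait Action
  final case class SelectMolecule(id: Int) extends Action
  final case class DeselectMolecule(id: Int) extends Action
  case object ToggleSortOrder extends Action

  // The central handler is a pure function from (state, action) to a new state.
  def handle(state: AppState, action: Action): AppState = action match {
    case SelectMolecule(id)   => state.copy(selected = state.selected + id)
    case DeselectMolecule(id) => state.copy(selected = state.selected - id)
    case ToggleSortOrder      => state.copy(sortAscending = !state.sortAscending)
  }

  // A stand-in for a component: the view is derived from the state alone.
  def render(state: AppState): String =
    s"selected=${state.selected.toList.sorted.mkString(",")} ascending=${state.sortAscending}"

  // Processing a stream of actions one by one closes the cycle.
  def run(initial: AppState, actions: Seq[Action]): AppState =
    actions.foldLeft(initial)(handle)
}
```

Because the state is never mutated in place, a component can decide whether it needs to re-render simply by comparing the old and the new state it was given.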
MapDB MapDB7 is a simple embedded Java database engine by Jan Kotek. It
provides implementation of off-heap and on-disk data structures such as B-trees
and hash maps, without the added complexity of using a fully-featured database
system. We use MapDB to store intermediate results during processing of the
background hierarchy as well as for storage of the final result. The server also queries
MapDB when processing user requests and uses an in-memory MapDB database as
an image cache.
Overall, our experience with MapDB was good. Unfortunately, during development
we managed to get the database into an inconsistent state several times
during a forced JVM shutdown - despite using transactions and a write-ahead
log. So we are not convinced by MapDB’s fault tolerance. Also, the performance
is sometimes lower than we would expect, especially for the purposes of the
server component. Compared to the on-heap collections in the Scala standard
library, MapDB has been more than 100 times slower for the queries used by
the server - even for repeated queries with a supposedly warm disk cache. These
queries, however, are not performance limiting for the server and so we kept using
MapDB to keep RAM requirements low. At any rate, after our application was
finished, a new major version of MapDB was released that is a total
rewrite from scratch; therefore, our experience is probably not applicable to
the new version.
Play Framework A lightweight web framework used as a basis for our server
component. Play Framework is based on Akka8, is stateless and asynchronous, and has a
very clean design and many other features. For us, however, the most important
feature was the convenience that Play brings into the development process - it is
trivial to create a new project, it is obvious how to extend it and the documentation
is very good. Moreover, Play Framework comes with a bundled JBoss Netty web
server, which can be used in both development and production. Running the
code in development mode comes with built-in hot-reloading and detailed and
useful error messages. The production mode brings increased speed and security.
Play Framework can also automatically generate a distributable package in the
form of a portable multiplatform ZIP file; alternatively, native packages can be
generated, such as Linux packages (DEB, RPM), Microsoft Installer (MSI) or OS
X disk images. These distributable packages contain all required dependencies,
including the Netty web server, excluding only the Java Virtual Machine.
Overall, the Play Framework is both easy to use and scalable, making it
suitable for all stages of an application - from an early prototype to a final
product. Web applications based on Play Framework are also easy to distribute
and run for the end users, which was also an important concern for us.
Implementation overview
In this section we give an overview of how the application code is structured. We
also describe some implementation details, where they are important or interesting, but opt to skip them elsewhere to keep the size of this section within limits.
We also pinpoint parts of the code, which are suitable for customization.
Project structure and build process
The application is divided into multiple projects, which are defined and brought
together using a single sbt9 build definition.
Each of the basic components of the application - server, client and generator - resides in its own, likewise named project. All three projects can be built
simultaneously using sbt. The generator and server projects are compiled into
JVM bytecode, while the client project is transpiled into optimized JavaScript
(JS) code, which is then passed to the server as a static resource to be served to
the clients.
The code that is shared between the client and the server (e.g. the API
interface) or between all three projects (such as basic data structure definitions)
resides in a separate project called shared; this project is then both compiled to
JVM bytecode and transpiled into JS. The code which is shared only between the
server and the generator resides simply in the generator project - fragmenting the
code into another shared project seemed to be more confusing than helpful. The
generator project is simple enough and uses all the shared code (in fact by definition),
while the amount of generator-specific code is quite limited (mostly the
processing scripts). Last but not least, the generator forms a meaningful standalone
project - a large amount of the analysis behind this thesis was done with
the generator alone, before the server project was even started.
8 Akka is “a toolkit and runtime for building highly concurrent, distributed, and resilient
message-driven application on the JVM”, as described by Akka’s creators. Akka enables
actor-based concurrency, an approach inspired by the Erlang programming language’s library.
9 The name “sbt” stands for Scala Build Tool (http://www.scala-sbt.org/). It is an open
source build tool similar to Maven, Ant or remotely to Automake. Sbt is the tool of choice for
Scala projects and is sometimes used even for Java projects - especially those built on the Play
Framework.
The generator project has three major functions - integration with the
cheminformatics toolkit, processing of the PubChem Compound database into the
background hierarchy, and providing access to the processed hierarchy (for the
server or for user-defined tasks). The functionality is divided into the following
packages:
chemistry The chemistry package implements all the functionality based on the
ChemAxon JChem toolkit. The toolkit is not directly referenced anywhere
else in the whole project, so it should ideally be possible
to replace the JChem toolkit with any other just by reimplementing the
methods in this package. A major part of the implemented functionality consists of the
scaffold transformations for our scaffold hierarchy. Besides that, there are
also thin wrappers around the chemical format conversion functionality (parsing
an input stream, converting an individual molecule from and to SMILES),
the imaging functionality (converting a molecule to its SVG representation)
and some other minor functions.
hierarchy Next is the hierarchy package, where our scaffold hierarchy is defined
together with a generic interface to access a generated hierarchy. Notable is
the HierarchyTransformations object10, where the chemical transformations
used to obtain scaffolds from molecules are defined. Should there be a
need to use a different hierarchy, this is the file to be modified. Another file
that would likely have to be adjusted is the HierarchyLevels object, which
defines the names of the hierarchy levels and which resides in the shared
project, as it is also used by the client application.
tasks The tasks package contains definitions of all the processing tasks performed
by the generator - most importantly the tasks used to obtain the background
hierarchy from the PubChem Compound database. This is also a suitable
place for user defined ad-hoc analytic tasks on the processed hierarchy - the
HierarchyStatistics task may serve as an example.
processing The processing package contains some helpers used by the aforementioned tasks. The ProcessingHelper object provides a generic method of
processing one key-value map into another providing parallelization, crash
recovery (if the processing is interrupted unexpectedly, the progress is not
lost), computation timeouts (when the cheminformatics toolkit hangs on
one of the tens of millions of molecules), error logging, and so forth. This
package is not expected to be modified.
10 It indeed is a singleton object, not a class. Scala provides support for easy definition of
singletons as one of its numerous advantages over Java.
stores The stores package encapsulates all the functionality related to the
underlying MapDB database, which is used to store the final generated
background hierarchy as well as the intermediate results. Replacing
MapDB with another database, if so desired, should be quite straightforward:
reimplement this package and adapt the few places in the code
that directly use the org.mapdb.BTreeMap interface.
configuration The last one is the configuration package, whose sole purpose
is to provide paths to the database files for the stores package and a path
to the PubChem Compound files to be processed - used by one of the tasks.
Last, there is the Application object, which allows the tasks to be called from the
command line.
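To illustrate the idea behind the processing package described above, the following is a minimal sketch of resumable map-to-map processing with error logging. It is not the ProcessingHelper implementation: the names are hypothetical, in-memory mutable maps stand in for the persistent MapDB stores, and parallelization and computation timeouts are left out for brevity.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical sketch; not the ProcessingHelper from the generator project.
object ResumableProcessingSketch {
  /** Applies `f` to every entry of `input` that has no result in `output` yet.
    * Entries already present in `output` are skipped, so an interrupted run
    * can be resumed without repeating finished work. Non-fatal failures are
    * recorded in `errors` instead of aborting the whole run.
    */
  def process[K, A, B](
      input: Map[K, A],
      output: mutable.Map[K, B],
      errors: mutable.Map[K, String]
  )(f: A => B): Unit =
    for ((key, value) <- input if !output.contains(key)) {
      Try(f(value)) match {
        case Success(result) => output.update(key, result)
        case Failure(e)      => errors.update(key, e.toString)
      }
    }
}
```

If the output map is backed by a persistent store, simply calling the function again after a crash continues where the previous run stopped.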
Second of the three main projects is the server. The most important functionality
of the server lies in the API that it provides to the client application. Through
the API, the client application accesses the background hierarchy data as well as
the functionality of the cheminformatics toolkit.
The server is based on the Play Framework, already described above. Besides
processing the API calls, the Play Framework is used for some additional minor
functionality such as, but not limited to:
• serving the client application to the client, using a simple template page;
• compilation of the SCSS stylesheet and providing the compiled CSS to the
client, including the Bootstrap CSS library we use;
• serving other static resources such as fonts;
• routing, dependency injection, configuration;
• and of course the functionality described in the section 5.2.
The functionality is again divided into a few packages:
components The first package, components, contains two components -
ScaffoldHierarchyComponent and SvgComponent. The scaffold hierarchy
component provides, as its name suggests, access to the background hierarchy.
However, it is also able to calculate scaffolds for a given molecule and return
its position in the hierarchy - i.e. it also contains a part of the mentioned
functionality based on the cheminformatics toolkit. The SVG component
provides SVG pictures of molecules and scaffolds; the scaffold pictures are
also cached using an LRU11 cache based on an in-memory MapDB
hash table. The cache size can be specified in the configuration file.
controllers The controllers package mostly provides wiring. The most
interesting pieces of technology used here are probably Autowire and uPickle. Both
are libraries by Li Haoyi12, an influential member of the Scala.js community.
The uPickle library provides conversion between in-memory data objects
(defined in the shared project) and their JSON representation. Autowire
provides routing of Ajax/RPC calls to the API methods. What is really
interesting is that the same two libraries are used in the client application.
Consequently, the serialization and deserialization is almost transparent
(yielding equal objects), and the RPC calls behave almost like standard
method calls, including type safety. The main difference is that the result
of such a call is wrapped in a future13, which is, however, quite convenient.
11 Least Recently Used - discards the least recently used items when the cache is full.
services Last is the services package containing the implementation of the API
methods, based on the two components from the components package. The
API interface itself is defined in the shared project, as it is also used by the
client application. The functionality provided by the API is the following:
1) read access to the background hierarchy (for a given scaffold, fetch its
children or ancestors); 2) providing SVG pictures (given a scaffold or a
molecule, return its SVG depiction); and 3) processing user input (read
molecules from an uploaded file, calculate their scaffolds and return the
position in the background hierarchy). User input is processed in parallel,
even for one user, as the processing is computationally quite expensive (more
on that later) and the number of users using the application concurrently
is expected to be low.
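The LRU eviction policy used for the scaffold picture cache can be sketched in a few lines. Note that this sketch swaps in a different technique than the one used in the server: it relies on the access-ordered java.util.LinkedHashMap from the Java standard library rather than on an in-memory MapDB hash table, and the class name is hypothetical.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Hypothetical sketch of an LRU cache; the server uses a MapDB-based cache instead.
object LruCacheSketch {
  /** A map that keeps at most `capacity` entries and evicts the least
    * recently used one. The third constructor argument (`true`) switches
    * LinkedHashMap from insertion order to access order.
    */
  class LruCache[K, V](capacity: Int) extends JLinkedHashMap[K, V](16, 0.75f, true) {
    // Called by LinkedHashMap after each insertion; returning true evicts `eldest`.
    override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > capacity
  }
}
```

A cache of this kind is not thread-safe on its own, so a server handling concurrent requests would have to synchronize access to it.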
The client project is a web browser application which provides the visualization
functionality - it takes a set of molecules as input from the user and displays these
molecules in the context of the background scaffold hierarchy. The molecules are
shown in the form of a pageable list. The scaffolds are, by default, shown as a tree
map, in which the size of every rectangle corresponds to the frequency of the scaffold
in the PubChem Compound database and the color of a rectangle corresponds to
the count of molecules with this scaffold in the user data. Both are configurable,
so, for example, the size can be made dependent on the dataset and the color on
PubChem. Alternatively, scaffolds can also be shown as a sortable paged list.
The molecules are also selectable and basic search functionality is included.
The whole client application is built around the React library, which is a UI
library, described in short in section 5.2. As the code is written in Scala, the
scalajs-react14 wrapper around React is used, which provides type-safe access to the
React library. Moreover, scalajs-react also provides a routing component, which
is used to handle the application URLs (the URL always points to the currently
explored scaffold and is stable - i.e. can be bookmarked or shared).
Accompanying scalajs-react is the Diode library, also described in section 5.2.
Diode is used to store the state of the application, which is composed of the
loaded dataset, selected molecules, a cached part of the background hierarchy,
images of scaffolds and molecules which were already computed, settings of the
scaffold components (e.g. sort order, function used to calculate colors in the tree
map, ...), and some other minor details. Almost all of the application state is
stored centrally (using Diode), which is an approach recommended for React. The
components themselves are then mostly stateless and straightforwardly transform
input properties to the resulting DOM.
13 A future is a read-only placeholder for a future value, used to avoid blocking. A future’s value
can be accessed by providing a callback. Consult
http://docs.scala-lang.org/overviews/core/futures.html for detailed information.
Another two libraries employed are uPickle (serialization and deserialization)
and Autowire (Ajax/RPC), both described in subsection 5.3.3. The client also
uses another serialization library - boopickle - which is by the same author as
Diode and almost the same as uPickle but it uses a binary serialization format
instead of JSON. In the application, boopickle is used to save the currently open set
of molecules, including the information about scaffolds, which saves significant time
on reopening.
Besides that, the client application also uses the Bootstrap CSS library, more
precisely its Bootstrap-Sass port. Bootstrap is used mostly as a basic CSS stylesheet
(which is then heavily customized). Besides the stylesheet, Bootstrap also
provides functionality to show modal dialogs, where it is accompanied by jQuery
(this is, however, the only place where jQuery is used). A set of icons -
Glyphicons - which is included with Bootstrap, is also used. The scala-css library is also
used, providing CSS mixins to React components - in our case mostly just strongly
typed access to CSS class names, preventing typos.
As in the previous projects, the code in the client project is also divided into packages:
components In the components package, all the React components are contained - i.e. everything that is shown is implemented here. Accompanying
the components, contained in subpackage common, are helper classes providing common functionality shared by the components. React components
are composable and this is heavily used, breaking code apart into smaller
pieces. As there are seventeen components in total, we shall not describe
them all; instead we describe functionality of one of the components to give
an idea of what a React component is. An overview of all the components
and their relations is shown in figure 5.1.
A nice example of a component is the ListBox component, that is used
to show the list of molecules and the list of scaffolds (alternative to the
tree map). Besides rendering the lists to the DOM (including formatting,
event handlers,. . . ), ListBox provides the pagination functionality. The list
of items to be shown is evaluated lazily - the items are provided as a lazy
sequence, and only the items to be shown are fully evaluated. If a single
item in the list is changed (for example a scaffold is highlighted), only
that item is re-rendered - this functionality is provided by the ListBoxItem
child component, which checks whether its properties have changed before
re-rendering itself. That altogether means that rendering of the ListBox
is instantaneous even for an unfiltered list with hundreds of thousands of
items. An example of a helper class is the SvgProvider class,
which handles asynchronous loading of SVG representations of molecules
or scaffolds from the server - tracking which images need to be loaded,
marking those images as pending (so that they are not requested multiple times in
rapid succession), and storing the images to the central cache when they are
ready15. The SvgProvider is used, for example, by the described ListBox
component, to only request images for the items that are being displayed
(and only when they are first being displayed).
Figure 5.1: React components in the client project
services A lot smaller package is the services package. It contains two helpers
for the remote procedure calls to the server APIs. That means that they
take the input parameters for the call to be made, perform the call and
return a future of the call result, ensuring that the future is completed once
the response is received. One of the helpers is for the calls that use Autowire
and uPickle, which is the absolute majority. The only exception is made
for sending user input files to the server - as those files can easily be over
a hundred megabytes large, we avoid any processing on the client side and
send them directly to the server (using the other helper). That is not only
faster but in fact the only feasible alternative, as serializing such a file using
uPickle would simply crash the browser (and also multiply the size of the
data several times).
15 As the image cache is part of the centralized state of the application handled by the Diode
library, only a part of the functionality is implemented in the SvgProvider itself, the rest being
implemented in the corresponding Diode handler.
store Another package of significant size is the store package. It contains the
Store singleton object, which holds the application state. Together with it, the package
contains the actions, which specify the changes that are to be performed on the
state. Next are the handlers, which are the functions that change the state
according to the actions. The handlers are registered in the Store object,
which then accepts a stream of actions and processes them one by one in
a serialized manner. Next is the subpackage model, which contains data
structures (classes) that describe the central state. Last is the subpackage
serializer which provides serialization and deserialization of the application
state for the purpose of saving to a file or loading from it.
util Last is a small package of non-specific functionality to simplify interactions
with the browser. One example might be reading of a file, i.e. converting
a JavaScript File into a future of byte array, which is used when loading
the application state from a file. Another example is a helper to create
dataURIs16 and objectURLs17 , used for displaying images and when saving
the application state or exporting the molecules or scaffolds.
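The pagination behaviour of the ListBox component described above, where only the items of the current page are fully evaluated, can be approximated as follows. This is a sketch under simplifying assumptions rather than the component itself: instead of a lazy sequence it takes an indexed sequence and selects the page before applying the (possibly expensive) rendering function; all names are hypothetical.

```scala
// Hypothetical sketch; not the ListBox implementation from the client project.
object PaginationSketch {
  /** Renders only the items of the requested page (pages are 0-based).
    * Slicing happens before `render` is applied, so `render` is called
    * at most `pageSize` times regardless of the total number of items.
    */
  def renderPage[A, B](items: IndexedSeq[A], page: Int, pageSize: Int)(render: A => B): Vector[B] = {
    val from = page * pageSize
    items.slice(from, from + pageSize).map(render).toVector
  }

  /** Number of pages needed to show `itemCount` items. */
  def pageCount(itemCount: Int, pageSize: Int): Int =
    (itemCount + pageSize - 1) / pageSize
}
```

The order of operations matters here: mapping first and slicing afterwards would render the skipped items as well, even when done through an iterator.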
The last project - shared - is a small one. The code it contains is then available in
all the three main projects. This is absolutely needed for the interfaces and data
structures shared by the client and server side of the application - specifically
the interface of the server API (used by Autowire on both ends) and the data
structures describing Molecules and Scaffolds (used almost universally).
The rest of the code is generic functionality that can be useful in multiple
parts of the application. One example is the Squarified TreeMap [9] layout,
which is used by the scaffold map on the client, but was originally used in the
generator when prototyping the hierarchy. Another piece of interesting code is
the ReusableComputation helper, which wraps an ordinary function and adds
caching capabilities to it - the last input and result are remembered and if the
function is called again with the same parameters, the cached result is returned.
That is immensely useful, especially in the context of the React components in
the client application - not only does it save time by not performing the computation
again, but it also keeps the result referentially equal to the previous result, which
then enables very fast equality checks in subcomponents, which in turn makes it possible to
prevent unnecessary re-rendering, ultimately making the application noticeably
more fluid.
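The caching behaviour described for the ReusableComputation helper can be sketched as a wrapper that remembers only the last input and result. The class below is a hypothetical illustration, not the helper from the shared project; in particular, it compares inputs with == and is not thread-safe, which is assumed to be acceptable in single-threaded browser code.

```scala
// Hypothetical sketch; not the ReusableComputation helper from the shared project.
object ReusableComputationSketch {
  /** Wraps `f` and remembers only the most recent argument and result.
    * A repeated call with an equal argument returns the very same result
    * instance, so callers can compare results by reference.
    */
  final class LastResultCache[A, B](f: A => B) extends (A => B) {
    private var last: Option[(A, B)] = None

    def apply(input: A): B = last match {
      case Some((cachedInput, cachedResult)) if cachedInput == input =>
        cachedResult
      case _ =>
        val result = f(input)
        last = Some((input, result))
        result
    }
  }
}
```

Keeping a single entry is enough for a render loop, where the same input typically arrives many times in a row and older inputs are rarely seen again.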
This concludes our overview of the implementation details. For whoever might
be interested in more detailed information, the code itself should be the ultimate
reference. We attempted to keep the code concise, structured, following
the DRY18 principle and free of surprises (other than where documented).
Hopefully, that should make the code comprehensible and even customizable and
maintainable. At the same time, the preceding subsections should serve as a
practical guide for anybody set on exploring the code for themselves.
18 Don’t repeat yourself.
6. Results
In this chapter we describe the final visualization application, its availability and
basic usage. Together with the application, we also describe the resulting scaffold
hierarchy that has been created and that serves as a background for the
hierarchical visualization.
The Application
The final application, which we called the Scaffold Visualizer, has been published
under the GPLv3 license and is freely available at:
The Scaffold Visualizer can either be installed to a dedicated server or run
locally. The installation process is explained in the appendix, together with the
requirements, which are minimal. The installation guide is also available on-line.
Another possibility to evaluate the application is to visit a demo instance at:
The application itself consists of two functional parts. First is the generator,
which is used to generate the background hierarchy from the PubChem database.
Second is the visualizer - a client-server application, which allows exploring
user-provided data sets on the background of the generated hierarchy. In the rest of
this chapter we describe the two parts separately.
The Generator
The generator is a command line application which allows running predefined tasks.
The main purpose of the generator is to generate the background hierarchy from
the PubChem Compound database1 .
The generator can also be used to run arbitrary other analytical tasks over the
scaffold hierarchy, allowing for customized and targeted exploration of chemical
space. Such customized tasks have been used during the process of designing the
hierarchy and the final visualization, described in chapter 4. Also all the data
presented in this chapter and in chapter 4 have been obtained using the generator.
The task used to obtain these data has been left in the final version to serve as
an example of how to define such custom user tasks.
1 With only minor modifications, any custom background database should be usable.
Generating the background hierarchy
As described earlier, the generator operates with so-called tasks. The background
hierarchy can be obtained by a sequence of three such tasks:
ImportPubChem The first task was already explained in section 4.1.2. It
converts the PubChem Compound database from the form of compressed
SDF files, tens of gigabytes large, to a simplified custom database. In
this database, each compound is stored as a record consisting of a unique
identifier (PubChem Compound ID) and the compound’s representation in
SMILES format. The database is compressed, 40× smaller than the original
form and faster to access. More importantly, it serves as a standardized
input format for the following task. This implies that any other database
can be used as a source for the background data just by implementing a
custom task akin to ImportPubChem - i.e. such that it provides a unique
identifier and a SMILES string for each molecule in the source database - which
should be easy enough.
GenerateScaffolds The second task simply processes the input molecules one
by one and calculates their scaffolds. The scaffolds are calculated by level,
bottom up. This way for each unique scaffold the appropriate transformation is performed only once - i.e. if two input compounds share a scaffold
on some level, this shared scaffold’s parent is computed only once and not
once for each of the molecules. The task creates a raw processing hierarchy.
In this hierarchy, the parent scaffold is stored for each scaffold. As additional
information, the number of compounds that correspond to each scaffold
is calculated. All the scaffolds are represented by SMILES strings, unique
per level.
GenerateHierarchy The final task converts the processing hierarchy into the
final form, which is more compact and suitable to be used by the visualizer.
The task consists of two steps. In the first step, each scaffold is assigned a
unique numerical id (a primary key) and is stored under this id. In the
second step, two maps are created on the ids: a children map and a parent map.
The children map contains for each scaffold a list of ids of its children; the
parent map contains for each scaffold the unique id of its parent scaffold.
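The last two tasks can be illustrated with a minimal Python sketch. It is not the generator's actual code: parent_of is a hypothetical stand-in for the real chemical transformation (here a scaffold is an invented dot-separated string and its parent is the string without the last token), and all function names are ours. The sketch only shows the two ideas described above: each unique scaffold's parent is derived once, and the resulting processing hierarchy is converted to integer ids with a parent map and a children map.

```python
from collections import defaultdict


def parent_of(scaffold):
    """Hypothetical stand-in for one scaffold derivation step.

    A scaffold is a dot-separated string; its parent is the string
    without the last token. None marks the root.
    """
    head, _, _ = scaffold.rpartition(".")
    return head if head else None


def generate_scaffolds(records):
    """Build the raw processing hierarchy from (compound_id, scaffold) records.

    Returns (parents, counts, derivations): parents maps each scaffold to
    its parent, counts maps each scaffold to the number of compounds under
    it, derivations is how many times parent_of was actually called.
    """
    parents, counts = {}, defaultdict(int)
    derivations = 0
    for _compound_id, scaffold in records:
        while scaffold is not None:
            counts[scaffold] += 1
            if scaffold not in parents:  # derive each parent only once
                parents[scaffold] = parent_of(scaffold)
                derivations += 1
            scaffold = parents[scaffold]
    return parents, dict(counts), derivations


def generate_hierarchy(parents):
    """Assign integer ids and build the parent map and the children map."""
    ids = {scaffold: i for i, scaffold in enumerate(sorted(parents))}
    parent_map = {ids[s]: ids[p] for s, p in parents.items() if p is not None}
    children_map = defaultdict(list)
    for child, parent in parent_map.items():
        children_map[parent].append(child)
    return ids, parent_map, {k: sorted(v) for k, v in children_map.items()}


if __name__ == "__main__":
    records = [(1, "r.a.b"), (2, "r.a.c"), (3, "r.d.e")]
    parents, counts, derivations = generate_scaffolds(records)
    ids, parent_map, children_map = generate_hierarchy(parents)
    print(derivations, "derivations for", len(parents), "unique scaffolds")
    print(children_map[ids["r"]])
```

In this toy input, three compounds with three levels each would need nine derivations without the lookup; with it, the number of derivations equals the number of unique scaffolds.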
The three tasks can be executed one by one or all at once, as described in the
installation guide.
The calculations are performed in parallel, using all available CPU cores. As
a result, on a common 4-core 4 GHz processor, using PubChem Compound as the
source, the ImportPubChem task takes approximately 4 hours and 15 minutes to
complete, the GenerateScaffolds task takes about 5 hours 35 minutes and the
GenerateHierarchy task can be completed in under 25 minutes.
The Scaffold Hierarchy
The constructed scaffold hierarchy is one of the key results of this work. It has
already been described qualitatively; in this subsection we add the quantitative
view. That should give an idea about the size and shape of the hierarchy,
and it also highlights areas of possible improvement and future research.
Moreover, the hierarchy statistics can be interesting from the chemical standpoint,
providing insights into the scaffold diversity of the PubChem Compound database.
To make this more informative, we show the most common scaffolds and
their frequencies.
Level  Scaffold type                          Number of scaffolds
1      Ring Count                                             102
2      Ring Connectivity                                   49 809
3      Ring Connectivity Extended                         118 902
4      Oprea                                              137 633
5      Murcko Rings                                       596 058
6      Murcko Rings with Linkers                        1 281 788
7      Rings with Linkers                               7 476 787
8      Rings with Linkers Stereo                        9 867 182
       Total (including the root scaffold)             19 528 262
Table 6.1: Number of scaffolds per hierarchy level
All the following data are based on the PubChem Compound database as of July
23, 2016, containing 91.4 million molecules. All the numbers have been computed
using a generator task which is included in the application under the name
HierarchyStatistics, allowing future readers to recalculate the statistics using
more recent data or to modify the task to extract additional statistics they
are interested in.
The total number of scaffolds in the hierarchy is 19.5 million. A breakdown
by levels is provided in table 6.1.
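As a consistency check on table 6.1, the per-level counts can be summed and divided level by level. The short Python sketch below makes two assumptions that should be read as such: the level assignment follows the text (level 0 is the single root, level 1 the 102 ring counts, level 4 the Oprea scaffolds), and every scaffold has exactly one parent on the level directly above, so that the mean number of children on a level equals the ratio of neighbouring level sizes. The function names are ours.

```python
# Scaffold counts per level, level 0 first. Levels 2-8 are the values of
# table 6.1; levels 0 and 1 come from the text (one root scaffold with
# 102 ring count children).
LEVEL_SIZES = [1, 102, 49_809, 118_902, 137_633, 596_058,
               1_281_788, 7_476_787, 9_867_182]


def total_scaffolds(sizes):
    """Total number of scaffolds in the hierarchy, root included."""
    return sum(sizes)


def mean_children(sizes):
    """Mean number of children per level: next level size / this level size."""
    return [nxt / cur for cur, nxt in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    print("total:", total_scaffolds(LEVEL_SIZES))
    for level, mean in enumerate(mean_children(LEVEL_SIZES)):
        print(f"level {level}: {mean:.2f} children on average")
```

Under these assumptions the sum reproduces the total of 19 528 262 scaffolds, and the level 4 ratio reproduces the average of 4.33 children quoted in the caption of table 6.2.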
Another interesting statistic is the branching factor of scaffolds in the tree.
We have already analyzed the number of children that scaffolds have in section
4.3: we divided the scaffolds into four bins and calculated the number of scaffolds
in each bin, with the results shown in table 4.2. Another view of the same
distribution can be gained through percentiles. The percentile data are provided
in table 6.2.
Number of children per scaffold, by level: percentiles and average
Level 0: 102 at every percentile, average 102.00
Table 6.2: Number of children per scaffold - distributions by level; the table shows
how many children the scaffolds on each level have. For example, the scaffolds on
level 4 have on average 4.33 children each, and at least 90 % of level 4 scaffolds
have at most 5 children
It can be immediately noticed in the table that all numbers for level 0 are identical.
That is due to level 0 being the root level, which contains only a single scaffold,
the root scaffold, that has exactly 102 children (distinct ring count scaffolds).
Therefore, the average, the minimum (P0), and the maximum (P100) number of
children, as well as all other percentiles, are identical and equal to 102.
Otherwise, it can be seen that the distributions are very skewed: there are
many scaffolds with a small number of children and a small number of scaffolds
with a high number of children. Ideally, all scaffolds would have a moderate number
of children, neither too high (making the visualization cluttered) nor too low
(a tree map of one scaffold being quite useless). Unfortunately, using a
structure-based hierarchy, fixing one goes against the other: finer structural
steps increase the number of scaffolds with only one child; coarser structural
steps increase the number of scaffolds with an excessive number of children. As
described in chapter 4, it seemed to us that scaffolds with an excessive number of
children pose a bigger problem than scaffolds with a low number of children, so we
decided to make the steps as small as possible.
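The percentile view of the branching factor can be reproduced from a list of child counts with a few lines of Python. The sketch below uses the nearest-rank definition of a percentile, which is an assumption (the text does not say which definition the HierarchyStatistics task uses), and a small invented list of child counts rather than real data. It shows how a skewed distribution gives a median far below the mean.

```python
import math


def nearest_rank_percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, p given in percent."""
    if not sorted_values:
        raise ValueError("no data")
    if p <= 0:
        return sorted_values[0]
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


def branching_summary(child_counts, percentiles=(0, 25, 50, 75, 90, 100)):
    """Percentiles and mean of the number of children per scaffold."""
    values = sorted(child_counts)
    summary = {f"P{p}": nearest_rank_percentile(values, p) for p in percentiles}
    summary["mean"] = sum(values) / len(values)
    return summary


if __name__ == "__main__":
    # Invented, skewed example: most scaffolds have one child, one has many.
    counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 87]
    print(branching_summary(counts))
```

In the invented example the median is 1 while the mean is 10, the same kind of gap between typical and average branching that the text describes.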
Another interesting family of scaffolds consists of the scaffolds representing the
highest number of compounds in the PubChem database. Let us stress that those are
not the same scaffolds as the scaffolds with the highest number of children; on
the contrary, at the higher levels (1-3), the top scaffolds tend to have a moderate
number of children. On the other hand, at levels 4 and 6 (which are the levels with
a high average branching factor), the top scaffolds belong to the top percentile
by branching factor, the only exception being the scaffolds representing the
acyclic molecules.
Let us have a closer look at this anomaly, which is in fact present on all the
hierarchy levels. Most or possibly all scaffold derivations begin with the molecular
frameworks by Bemis and Murcko [4], which essentially describe the ring systems
in a molecule (rings being a key predictor of biological activity, as explained
earlier). But acyclic molecules by definition have no rings. So from
the lowest level of the hierarchy, they are reduced to an “empty” scaffold. This
scaffold then shows up amongst the top scaffolds on all levels, with a constant
frequency of 3.12 %, corresponding to the fraction of acyclic molecules in
PubChem Compound. On each level, this empty scaffold has only a single child,
the empty scaffold below; therefore these scaffolds also form an exception to the
rule of high-branching top scaffolds at levels 4 and 6.
Ring count   Frequency in PubChem   Number of children
0                  3.12 %
1                 14.01 %
2                 30.74 %
3                 25.68 %
4                 16.03 %
5                  6.59 %
6                  2.10 %
7                  0.69 %
8                  0.37 %                 1 044
9                  0.21 %                 1 948
Total             99.52 %
Table 6.3: Top ten most frequent ring counts in PubChem Compound
Figure 6.1: Most common ring connectivity scaffolds (level 2) and their frequency
in PubChem Compound database: 30.64 %, 23.64 %, 14.01 %, 10.98 %, 3.12 %,
3.05 %, 2.89 %, 2.11 %, 1.97 %, 1.69 %
Now to the ring counts, the level 1 scaffolds in the hierarchy. The ring counts
are best described by table 6.3. The molecules with a small number of rings
are by far the most frequent. The corresponding scaffolds have a small number of
children, which is expected, as there are only a few ways to connect a small
number of rings. For higher ring counts, the number of children grows rapidly.
In table 6.3, another anomaly can be observed. Namely, the number
of ring connectivity scaffolds on two rings is higher than could be expected: there
is only one way to connect two rings, but there are two scaffolds. The second
topology consists of two disconnected rings. It is quite rare (less than 0.1 % of
PubChem compounds), but still shows that disconnected compounds exist in the
PubChem Compound database.
Next are the ring connectivity scaffolds, level 2 of the hierarchy. Again, the top 10
scaffolds cover a large fraction of the Compound database, as much as 94.10 %. The
most common ring connectivity scaffolds are depicted in figure 6.1, accompanied
by their frequency. The branching factor of these level 2 scaffolds is low, all of
them having at most 12 children.
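The coverage figure quoted above is simply the sum of the ten frequencies listed with figure 6.1. A minimal Python sketch of that cumulative coverage follows; the list of frequencies is copied from the figure, the function name is ours.

```python
# Frequencies (in percent) of the ten most common level 2 scaffolds,
# as listed with figure 6.1.
TOP_LEVEL2 = [30.64, 23.64, 14.01, 10.98, 3.12, 3.05, 2.89, 2.11, 1.97, 1.69]


def cumulative_coverage(frequencies):
    """Running coverage when scaffolds are taken from most to least frequent."""
    running, out = 0.0, []
    for frequency in sorted(frequencies, reverse=True):
        running += frequency
        out.append(round(running, 2))
    return out


if __name__ == "__main__":
    coverage = cumulative_coverage(TOP_LEVEL2)
    for k, value in enumerate(coverage, start=1):
        print(f"top {k:2d}: {value:.2f} %")
```

The same helper can be applied to the frequency lists of the other levels to compare how quickly coverage saturates.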
An interesting phenomenon can be observed on levels 3 and 4 (extended ring
connectivity and Oprea). On these levels, the most common scaffolds cover very
similar portions of the PubChem compounds: 81.35 % in the case of level 3 and
80.12 % on level 4. But not only that, the ten most common scaffolds correspond
to each other one to one, as can be seen in figure 6.2. Not only is the order of
corresponding scaffolds the same on both levels, even the fractions of molecules
they represent are very similar. So although most of the depicted level 3 scaffolds
have more than one child, in every case one of the children is disproportionately
more frequent than the others, and belongs to the top 10 scaffolds on level 4. This
might lead to a hypothesis that level 3 and level 4 scaffolds, although looking very
different, convey very similar information.
Going further down to the Murcko graph frameworks, the top 10 level 5 scaffolds, shown in figure 6.3, still cover almost half of the PubChem compounds
(47.98 %); that falls rapidly on level 6, where the top 10 scaffolds, figure 6.4,
cover only 31.51 %.
Finally, on molecular framework levels 7 and 8 the top scaffolds are identical.
Rank   Level 3 - Extended Ring Connectivity   Level 4 - Oprea
1      26.56 %                                26.41 %
2      14.01 %                                14.01 %
3      12.81 %                                12.44 %
4      10.04 %                                 9.94 %
5       4.07 %                                 4.07 %
6       3.64 %                                 3.52 %
7       3.62 %                                 3.43 %
8       3.12 %                                 3.12 %
9       1.89 %                                 1.62 %
10      1.59 %                                 1.56 %
Figure 6.2: Most frequent scaffolds on levels 3 and 4 correspond to each other
Figure 6.3: Most common Murcko Rings (level 5) scaffolds – graph frameworks
with contracted linkers: 13.16 %, 10.26 %, 9.08 %, 3.12 %, 3.02 %, 2.27 %,
2.20 %, 1.94 %, 1.48 %, 1.46 %
Figure 6.4: Most frequent Murcko Rings with Linkers (level 6) scaffolds – graph
frameworks with linkers preserved: 10.26 %, 3.12 %, 3.02 %, 2.91 %, 2.73 %,
2.12 %, 1.94 %, 1.94 %, 1.75 %, 1.73 %
Figure 6.5: Most common Rings with Linkers (level 7) scaffolds and their frequency
in PubChem Compound database, in descending order: 6.83 %, 3.12 %, 0.87 %,
0.55 %, 0.54 %, 0.51 %, 0.40 %, 0.37 %, 0.35 %, 0.31 %
The most common level 7 scaffolds can be seen in figure 6.5. The frequencies of the
top scaffolds on both levels are almost identical too, with 13.85 % in total for
level 7 and 13.82 % for level 8. Although under 14 % might seem like a low number
compared with the fractions before, there are 7.5 million scaffolds on level 7 and
almost 10 million scaffolds on level 8; in that context, a single scaffold covering
around 0.5 % is still very significant. The picture also shows a noticeable
predominance of aromatic rings over non-aromatic ones.
We conclude the description of the hierarchy with a recapitulation of key
information from the preceding chapters. The scaffold hierarchy spans the known
chemical space, using the largest chemical database available as its source. The
hierarchy can be calculated on commodity hardware in the course of a few hours.
The hierarchy provides a stable background for visualization. Each molecule is
classified by exactly one scaffold on each level of the hierarchy. The scaffold
definitions are based on existing concepts of molecular frameworks and ring
topologies; however, the ring connectivity scaffolds are a new type of scaffold
introduced by this work.
The Visualizer
The visualizer is a client-server application which enables interactive exploration
of molecule data sets based on the scaffold hierarchy. The visualizer allows users to:
• browse the scaffold hierarchy as a zoomable tree map - explore molecular
scaffolds in the underlying PubChem Compound database and their frequency;
• import an existing data set in all common cheminformatics file formats
(SDF/Molfile, SMILES, InChI, CML, and more);
• display imported compounds in the tree map;
• colorize the tree map according to scaffold frequency in the user data set, in
PubChem, or in the user data set relative to PubChem;
• change the used color gradient;
• base tree map element sizes either on frequency in the data set or on frequency
in PubChem;
• apply logarithmic transformation to the sizes or the color source values;
• select molecules by scaffold or directly;
• search for molecules;
• filter molecules based on the selection, the search filter and the current
subtree;
• browse the filtered molecules in a pageable list;
• display scaffolds as a list instead of a tree map;
• sort the scaffold list by frequency in data set, in PubChem, or lexicographically;
• export the molecules based on the selection, search filter and current subtree;
• export scaffolds on the current level;
• save current data set including the computed scaffolds and selection;
• bookmark or link the current position in the scaffold hierarchy.
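The three colour sources from the list above (frequency in the user data set, frequency in PubChem, and the ratio of the two) can be sketched in Python as follows. This is an illustration under our own assumptions, not the visualizer's code: the function names, the two-colour gradient and the linear RGB interpolation are all hypothetical choices.

```python
def color_value(user_count, user_total, bg_count, bg_total, source="relative"):
    """Value that drives the colour of one scaffold rectangle."""
    user_freq = user_count / user_total if user_total else 0.0
    bg_freq = bg_count / bg_total if bg_total else 0.0
    if source == "user":
        return user_freq
    if source == "background":
        return bg_freq
    if source == "relative":  # frequency in the data set relative to the background
        return user_freq / bg_freq if bg_freq else 0.0
    raise ValueError(f"unknown colour source: {source}")


def normalise(values):
    """Scale non-negative values to the interval [0, 1]."""
    top = max(values, default=0.0)
    return [v / top if top else 0.0 for v in values]


def gradient(t, start=(255, 255, 255), end=(200, 30, 30)):
    """Linear interpolation between two RGB colours, returned as #rrggbb."""
    rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


if __name__ == "__main__":
    # Invented counts: (count in user data set, count in background).
    scaffolds = {"A": (40, 4000), "B": (10, 100), "C": (0, 900)}
    values = [color_value(u, 50, b, 5000) for u, b in scaffolds.values()]
    for name, t in zip(scaffolds, normalise(values)):
        print(name, gradient(t))
```

With the relative source, scaffold B in the invented data is coloured most strongly although scaffold A has more molecules in the user data set, because B is far rarer in the background.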
The user interface of the visualizer is shown in figure 6.6. The interface consists
of several basic components: in the center of the image, from left to right, the
breadcrumb navigation, the scaffold box and the molecule box. Above and below
them are the header and the footer. The tooltip is shown in the lower right corner.
Not shown in the picture are three forms: the Load data set form, the Export data
form and the Settings form. We now describe the components one by one:
Figure 6.6: Main screen of the Scaffold Visualizer
Header The header toolbar at the top of the screen provides buttons opening the
dialogues for import or export of data. Next to it are two buttons which
switch between a tree map visualization of scaffolds and a scaffold list.
Accompanying them is a third button which provides additional settings
for the scaffold box. At the right of the header are three buttons enabling
and disabling three filters to be applied to molecules in the molecule box.
Last is the input box to enter search queries for the molecule box.
Breadcrumb At the left of the screen is the breadcrumb navigation bar which
displays the current position in the hierarchy. The bottommost, and
slightly darker, scaffold in the breadcrumb is the currently selected
scaffold, i.e. the scaffold whose children are shown in the scaffold box next
to the breadcrumb. The unique id of this current scaffold is shown in the
browser’s address and can be used to revisit it later. The current scaffold
is also used to filter the molecules in the molecule box (as long as
the “show only subtree” filter is active). Above the current scaffold are all
its ancestors, up to the hierarchy root. Clicking one of them selects it as the
current scaffold. Hovering the mouse above any of the scaffolds shows the
tooltip.
Scaffold Box The scaffold box shows the children of the current parent scaffold.
These child scaffolds can be shown either as the tree map (the default)
or as the scaffold list. Both can be used to navigate the hierarchy up and down
and to show information about the scaffolds and the current data set. The
form is different in each case, so they are described separately:
Tree Map The scaffold tree map is the hierarchical visualization that we have
been designing throughout this whole work. It displays a set of scaffolds,
each scaffold represented by a rectangle. In the rectangle is shown a picture
of the scaffold. The rectangle’s area is by default based on the relative
frequency of the scaffold in the background hierarchy; the color is based
on the frequency of the scaffold in the user data set. The sources of the area
and color can be changed in the Settings form. Optionally, a logarithmic
transformation can be applied, and the color gradient can be replaced
by a different one. Clicking a scaffold selects or deselects the corresponding
molecules. Scrolling the mouse wheel zooms the tree map in (navigating
to the scaffold under the mouse pointer) or out (navigating to the
current scaffold’s parent). Moreover, hovering the mouse above a
scaffold shows the tooltip, which provides detailed information.
Scaffold List The functionality of the scaffold list is analogous to that of the
scaffold tree map. The form is, however, different. The scaffolds are shown
as a list, sorted lexicographically or by their frequency in the user data set
or in the background hierarchy. Each item in the list contains a small image
of the scaffold, its name, its level in the hierarchy and the number of molecules
in the user data set and in the background hierarchy that correspond to
the particular scaffold. A button for “zooming in” on the scaffold is present,
and clicking the scaffold selects the corresponding molecules, the same as in
the tree map.
Tooltip An important part of the scaffold tree map is the tooltip. It provides
the same information as the scaffold list, together with a larger picture. A
picture-only tooltip is also available in the scaffold list and the molecule
list.
Molecule List The molecule list is empty until a user data set is loaded. After
that, it shows the list of loaded molecules. By default, the list is restricted
to the molecules in the current subtree - i.e. the molecules that correspond
to the current scaffold. The list can also be restricted to show only the
selected molecules or to only show search results - as long as a search query
is entered. All three filters can be toggled using the buttons at the right
side of the header. The search query can be entered in the same place. For
each molecule a picture is shown, together with its SMILES string and its
name and comment, as long as they were present in the input. Clicking
a molecule selects it. The molecules are shown by pages consisting of 50
items. The total number of pages is not precalculated to avoid evaluation
of the filters on huge data sets - i.e. only the next page and the preceding
pages are available in the pager.
Footer The footer shows the size of the loaded data set and the size of the
current subtree. If any molecules are selected, it also shows the number of
selected molecules in both - the data set and the current subtree.
Export Form The Export form is accessible via the “Export” button in the
toolbar. It allows exporting molecules or scaffolds in standard SMILES
format or exporting the whole dataset in a custom binary format.
Load Dataset Form The Load Dataset form is accessible by clicking the “Open”
button in the toolbar. It allows selecting a file containing user molecules
to be loaded; the file is then submitted to the server. The file can be in
any format supported by the ChemAxon JChem cheminformatics toolkit, or it can be
the binary dataset obtained through the Export form, which is considerably
faster to load (as the scaffolds are already precomputed).
Settings Form The Settings form, also accessible via the header toolbar,
offers a range of settings for the Scaffold Box.
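The effect of the optional logarithmic transformation on the tree map areas, mentioned in the Tree Map entry above, can be illustrated with a short Python sketch. It only computes the share of the total area each rectangle would receive; the actual layout algorithm is not reproduced here. The use of log(1 + x) is our assumption (the text does not state which logarithm is applied), and the counts are invented.

```python
import math


def area_shares(frequencies, logarithmic=False):
    """Fraction of the tree map area given to each scaffold."""
    weights = [math.log1p(f) if logarithmic else float(f) for f in frequencies]
    total = sum(weights)
    if total == 0:
        return [0.0 for _ in weights]
    return [w / total for w in weights]


if __name__ == "__main__":
    # Invented compound counts for four sibling scaffolds, strongly skewed.
    counts = [1_000_000, 10_000, 100, 1]
    linear = area_shares(counts)
    logged = area_shares(counts, logarithmic=True)
    for count, a, b in zip(counts, linear, logged):
        print(f"{count:>9}  linear {a:8.4%}  log {b:8.4%}")
```

With linear weights the smallest scaffold receives a vanishing share of the area; after the transformation it stays visible while the ordering of the rectangles by size is preserved.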
That concludes the overview of the functionality. The Scaffold Visualizer is implemented using standard HTML5/JavaScript/SVG and should work using any
modern browser. It has been confirmed to work under Google Chrome (version
51) and Microsoft Internet Explorer (version 11). Most of the development has
been done using Google Chrome, so that is the preferable option which should
provide the best experience. On top of that, the 64-bit version of Google Chrome
has a higher memory limit than the 32-bit version and allows for larger data sets
to be loaded.
Performance-wise, a lot of attention has been devoted to making the application
able to handle larger data sets, which is especially important considering the
size of data sets used in the context of high-throughput screening methods. The
exact size of a data set that can be loaded depends on the browser used and even
on the data itself. The most significant limitation is the amount of memory available
to the browser’s JavaScript interpreter. Using the aforementioned Google Chrome
browser, version 51, 64-bit, a data set of 500 thousand drug-like molecules from
the ZINC database has been opened and the application performed well. We consider
that to be an important success.
On the other hand, the data sets are subjectively quite slow to load. That
is determined by the speed of the initial processing at the server. Again,
on our reference 4-core 4 GHz CPU, the server is able to process about 1000
molecules per second, meaning that the data set of 500 thousand molecules
takes almost 10 minutes to load. The processing time is linear in the size of the
data set.
It appears that the speed of processing cannot be significantly improved. Of
the processing time, about 25% is spent loading the molecules from the input
format and calculating their canonical SMILES inner representation; the other
75% is spent calculating the scaffolds and searching for them in the
scaffold database. The search time, which is significant, could be reduced by
orders of magnitude by preloading the database into memory, which we opted
not to do in order to keep memory requirements minimal. Even if we reduced the
search time, we could not eliminate the 25% of work which is spent on reading
the molecules and calculating their canonical representation, meaning that the
speed of loading could be increased at most by a factor of 4. However, it would be
a possible optimization, and an easy one to implement, if the Scaffold Visualizer
gained significant usage.
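The bound of 4 is an instance of Amdahl's law: if a fraction of the work cannot be accelerated, the overall speedup is limited by the inverse of that fraction. The Python sketch below evaluates the law for the 25 % / 75 % split given in the text and also estimates the load time from the quoted throughput. The function names are ours, and treating the whole 75 % as accelerable is an optimistic simplification, since only the search part of it would benefit from preloading.

```python
def overall_speedup(fixed_fraction, accelerated_speedup):
    """Amdahl's law: the fixed part stays, the rest runs faster by the given factor."""
    if not 0.0 <= fixed_fraction <= 1.0:
        raise ValueError("fixed_fraction must be in [0, 1]")
    if accelerated_speedup <= 0:
        raise ValueError("accelerated_speedup must be positive")
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / accelerated_speedup)


def load_minutes(molecules, molecules_per_second):
    """Estimated load time under the linear model described in the text."""
    return molecules / molecules_per_second / 60.0


if __name__ == "__main__":
    for factor in (2, 10, 100, float("inf")):
        print(factor, round(overall_speedup(0.25, factor), 2))
    print(round(load_minutes(500_000, 1000), 1), "minutes")
```

Even an infinitely fast scaffold search leaves the overall speedup at 4, which is why the fixed 25 % dominates once the search has been optimized.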
Besides that, two other approaches have been employed to reduce the data set
loading time. One is asynchronous loading of molecule images: the
molecules are only rendered when they are to be displayed. As the Molecule
List component only displays 50 molecules at a time, only a small number of
images is required at any moment. Compared to the original implementation, where
the images had been generated together with the scaffolds, this has increased the
speed of data set loading more than 10×. The other is the binary export
format: the molecules can be saved together with their generated scaffolds, and
such a data set can be opened without even being sent to the server, reducing the
loading time from minutes to seconds for the largest data sets. However, this
approach only helps when a data set is used repeatedly; the first processing
always takes the full time.
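The idea of showing 50 molecules at a time without evaluating the filters on the whole data set can be sketched with Python generators. This is an illustration only, not the visualizer's implementation: the function names are ours, and molecules are represented by plain integers. The filters are applied lazily, and one extra item is fetched to learn whether a next page exists, so the total number of matches is never computed.

```python
from itertools import islice

PAGE_SIZE = 50  # page length stated in the text


def filtered(molecules, predicates):
    """Lazily yield the molecules that pass every active filter."""
    for molecule in molecules:
        if all(predicate(molecule) for predicate in predicates):
            yield molecule


def page(molecules, predicates, page_index, page_size=PAGE_SIZE):
    """Return (items, has_next) for one page without counting all matches."""
    start = page_index * page_size
    matches = filtered(molecules, predicates)
    # Fetch one item more than needed to detect whether another page follows.
    window = list(islice(matches, start, start + page_size + 1))
    return window[:page_size], len(window) > page_size


if __name__ == "__main__":
    molecules = range(1000)            # stand-in for loaded molecules
    selected = lambda m: m % 2 == 0    # stand-in for an active filter
    items, has_next = page(molecules, [selected], 0)
    print(len(items), items[0], items[-1], has_next)
```

In this sketch the filters are still evaluated on the molecules that precede the requested page, but evaluation stops right after it, which keeps the first pages of a large data set cheap to display.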
7. Conclusion
In this thesis, a new approach for scaffold-based hierarchical visualization of the
chemical space has been designed and implemented. This approach is based on
two key components: a ring topology based scaffold hierarchy and a tree map
scaffold layout, both designed as a part of this thesis. An important contribution
of this thesis is that the proposed approach allows data sets to be visualized on
top of a background chemical library, providing context and reference for the data
and also securing stability of the visualization with respect to changes in the
data set. To the best of our knowledge, no such approach has been implemented or
proposed before.
Another contribution is the Scaffold Visualizer, a client-server visualization
tool implementing the proposed visualization approach. The Scaffold Visualizer
allows user data sets to be explored interactively, in the form of a zoomable tree
map. This visualization is provided by a web browser application, which is
accompanied by a server counterpart that provides the background data and the
chemical functionality. An important feature of the Scaffold Visualizer is that it
allows large data sets to be loaded (hundreds of thousands of molecules in size),
which is orders of magnitude larger than what is allowed by other on-line
visualization tools.
Our reference implementation is based on the PubChem Compound chemical
database as the background chemical library. PubChem Compound is the largest
free access database of chemical compounds, which also provides the highest
scaffold diversity; in effect, the Scaffold Visualizer provides a hierarchical visualization spanning virtually the whole known chemical space – which was one of
the goals of this thesis. On the other hand, with a little modification the Scaffold
Visualizer is prepared to use any other background library – for the cases where
using a domain specific library would be advantageous.
While processing the PubChem Compound database, we also calculated statistics describing the size and shape of the resulting background hierarchy and the
most frequently occurring scaffolds at each hierarchy level. Besides providing
information about the background hierarchy itself, the statistics also offer an interesting view of the original PubChem Compound database. We found out that
94 % of the compounds in PubChem can be described by one of the top ten most
frequent ring connectivity scaffolds, introduced in this work. Many more such
results can be found in section 6.2.2.
The scaffold hierarchy also appears to be very skewed – on most levels a large
number of scaffolds have only one child and then a small number of scaffolds have
a lot of children. Navigating a tree where some nodes have an excessive amount of
children can be difficult and so we attempted to divide these large sets of children
into smaller clusters. We tried both structure-based methods and cluster analysis.
Using the structural approach, we eventually reached a point where we were not
able to further subdivide the scaffold derivation steps in the hierarchy while
still keeping the steps reasonable from the chemical standpoint. For the cluster
analysis we could not find a suitable scaffold similarity measure. We created a
custom measure based on the graph edit distance but it was not reliable and was
very expensive to calculate. More details on that are provided in chapter 4.
Next, to the visualization itself. At first, we intended to visualize the scaffolds
based on their similarity – drawing similar scaffolds near to each other. But this
has again not been possible due to no suitable scaffold similarity measure being
available. As we have encountered this same problem twice, it seems that finding
a suitable scaffold similarity metric would be an interesting research topic with
wide applications.
Nevertheless, with similarity information being unavailable, a visualization
of scaffold sets based on a tree map layout has been designed and
implemented. The tree map layout seems to be a suitable one: it is synoptic
and allows multiple pieces of information to be presented at once, encoded by the
area and the color of the rectangles. Besides that, scaffold images can be
included, leading
to immediate identification. Moreover, when a scaffold similarity measure is
eventually found, the tree map layout can still reflect this information in the
positioning of the rectangles – for example via the Nmap algorithm mentioned
earlier. One complication of a tree map based layout can be that when the sizes
of the rectangles displayed are very disproportionate, the smaller rectangles can
get lost. We have attempted to fix this by allowing a logarithmic transformation
to be applied to the data prior to displaying.
That concludes the overview of the results. The Scaffold Visualizer has been
published under an open source license and is free to use. It has been created to
be easily modifiable to more specific use cases. Modifications and improvements
are encouraged and welcome.
[1] Dimitris K. Agrafiotis and John J. M. Wiener. Scaffold explorer: An interactive tool for organizing and mining structure-activity data spanning multiple
chemotypes. Journal of Medicinal Chemistry, 53(13):5002–5011, jul 2010.
doi: 10.1021/jm1004495. URL http://dx.doi.org/10.1021/jm1004495.
[2] Dimitris K. Agrafiotis, Dmitrii N. Rassokhin, and Victor S. Lobanov.
Multidimensional scaling and visualization of large molecular similarity tables.
Journal of Computational Chemistry, 22(5):488–500, 2001.
doi: 10.1002/1096-987x(20010415)22:5<488::aid-jcc1020>3.0.co;2-4.
URL http://dx.doi.org/10.1002/1096-987X(20010415)22:
[3] E Anderson, GD Veith, and D Weininger. Smiles: A line notation and
computerized interpreter for chemical structures, us epa. Environmental
Research Laboratory-Duluth. Report No. EPA/600/M-87/021. Duluth, MN,
[4] Guy W. Bemis and Mark A. Murcko. The properties of known drugs. 1.
molecular frameworks. Journal of Medicinal Chemistry, 39(15):2887–2893,
jan 1996. doi: 10.1021/jm9602928. URL http://dx.doi.org/10.1021/
[5] Regine S. Bohacek, Colin McMartin, and Wayne C. Guida. The
art and practice of structure-based drug design: A molecular modeling perspective.
Medicinal Research Reviews, 16(1):3–50, jan 1996.
doi: 10.1002/(sici)1098-1128(199601)16:1<3::aid-med1>3.0.co;2-6.
URL http://dx.doi.org/10.1002/(SICI)1098-1128(199601)16:1<3::
[6] Evan E. Bolton, Yanli Wang, Paul A. Thiessen, and Stephen H. Bryant.
PubChem: Integrated platform of small molecules and biological activities.
In Annual Reports in Computational Chemistry, pages 217–241. Elsevier BV,
2008. doi: 10.1016/s1574-1400(08)00012-1. URL http://dx.doi.org/10.
[7] Nathan Brown. Identifying and representing scaffolds. Scaffold Hopping in
Medicinal Chemistry, pages 1–14, nov 2013. doi: 10.1002/9783527665143.
ch01. URL http://dx.doi.org/10.1002/9783527665143.ch01.
[8] Nathan Brown and Edgar Jacoby. On scaffolds and hopping in medicinal chemistry. Mini Reviews in Medicinal Chemistry, 6(11):1217–1229, Nov
2006. doi: 10.2174/138955706778742768. URL http://dx.doi.org/10.
[9] Mark Bruls, Kees Huizing, and Jarke J. van Wijk. Squarified treemaps.
In Eurographics, pages 33–42. Springer Science+Business Media, 2000. doi:
10.1007/978-3-7091-6783-0_4. URL https://www.win.tue.nl/~vanwijk/
[10] Hans-Joachim Böhm, Alexander Flohr, and Martin Stahl. Scaffold hopping.
Drug Discovery Today: Technologies, 1(3):217–224, dec 2004. doi: 10.1016/j.
ddtec.2004.10.009. URL http://dx.doi.org/10.1016/j.ddtec.2004.10.
[11] Alex M. Clark and Paul Labute. Detection and assignment of common scaffolds in project databases of lead molecules. Journal of Medicinal Chemistry, 52(2):469–483, jan 2009. doi: 10.1021/jm801098a. URL
[12] Arthur Dalby, James G. Nourse, W. Douglas Hounshell, Ann K. I. Gushurst,
David L. Grier, Burton A. Leland, and John Laufer. Description of several
chemical structure file formats used by computer programs developed at
molecular design limited. Journal of Chemical Information and Modeling,
32(3):244–255, may 1992. doi: 10.1021/ci00007a012. URL http://dx.doi.
[13] Sébastien Doeraene. Scala.js: Type-Directed Interoperability with Dynamically Typed Languages. Technical report, EPFL, 2013. URL https:
[14] Felipe S. L. G. Duarte, Fabio Sikansi, Francisco M. Fatore, Samuel G. Fadel,
and Fernando V. Paulovich. Nmap: A novel neighborhood preservation
space-filling algorithm. IEEE Transactions on Visualization and Computer
Graphics, 20(12):2063–2071, dec 2014. doi: 10.1109/tvcg.2014.2346276. URL
[15] Peter Ertl, Stephen Jelfs, Jörg Mühlbacher, Ansgar Schuffenhauer, and Paul
Selzer. Quest for the rings. in silico exploration of ring universe to identify
novel bioactive heteroaromatic scaffolds. Journal of Medicinal Chemistry, 49
(15):4568–4573, jul 2006. doi: 10.1021/jm060217p. URL http://dx.doi.
[16] Peter Ertl, Ansgar Schuffenhauer, and Steffen Renner. The scaffold
tree: An efficient navigation in the scaffold universe. In Chemoinformatics and Computational Chemical Biology, volume 672 of Methods
in Molecular Biology, chapter 10, pages 245–260. Humana Press, 2010.
doi: 10.1007/978-1-60761-839-3_10. URL http://dx.doi.org/10.1007/
[17] B. E. Evans, K. E. Rittle, M. G. Bock, R. M. DiPardo, R. M. Freidinger,
W. L. Whitter, G. F. Lundell, D. F. Veber, and P. S. Anderson. Methods
for drug discovery: development of potent, selective, orally effective cholecystokinin antagonists. Journal of Medicinal Chemistry, 31(12):2235–2246,
dec 1988. doi: 10.1021/jm00120a002. URL http://dx.doi.org/10.1021/
[18] Martin Gütlein, Andreas Karwath, and Stefan Kramer. CheS-mapper chemical space mapping and visualization in 3D. Journal of Cheminformatics, 4(1):7, 2012. doi: 10.1186/1758-2946-4-7. URL http://dx.doi.
[19] Ramesh Hariharan, Anand Janakiraman, Ramaswamy Nilakantan, Bhupender Singh, Sajith Varghese, Gregory Landrum, and Ansgar Schuffenhauer. Multimcs: a fast algorithm for the maximum common substructure problem on multiple molecules. Journal of Chemical Information
and Modeling, 51(4):788–806, Apr 2011. doi: 10.1021/ci100297y. URL
[20] Robert Hauser. Chemical information in patents: What are
markush structures and who needs them? Online; accessed
2016-07-12, nov 2015. URL https://www.linkedin.com/pulse/
[21] David Hoksza, Petr Škoda, Milan Voršilák, and Daniel Svozil. Molpher: a
software framework for systematic chemical space exploration. Journal of
Cheminformatics, 6(1):7, 2014. doi: 10.1186/1758-2946-6-7. URL http:
[22] Ye Hu, Anne Mai Wassermann, Eugen Lounkine, and Jürgen Bajorath. Systematic analysis of public domain compound potency data identifies selective molecular scaffolds across druggable target families. Journal of Medicinal Chemistry, 53(2):752–758, jan 2010. doi: 10.1021/jm9014229. URL
[23] Ye Hu, Dagmar Stumpfe, and Jürgen Bajorath. Computational exploration
of molecular scaffolds in medicinal chemistry. Journal of Medicinal Chemistry, 59(9):4062–4076, may 2016. doi: 10.1021/acs.jmedchem.5b01746. URL
[24] John J. Irwin, Teague Sterling, Michael M. Mysinger, Erin S. Bolstad, and
Ryan G. Coleman. ZINC: A free tool to discover chemistry for biology.
Journal of Chemical Information and Modeling, 52(7):1757–1768, jul 2012.
doi: 10.1021/ci3001277. URL http://dx.doi.org/10.1021/ci3001277.
[25] Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta
Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, Jiyao
Wang, Bo Yu, Jian Zhang, and Stephen H Bryant. Pubchem Substance and
Compound databases. Nucleic Acids Research, 44(D1):D1202–D1213, Jan
2016. doi: 10.1093/nar/gkv951. URL http://dx.doi.org/10.1093/nar/
[26] Hugo Kubinyi. Similarity and dissimilarity: A medicinal chemist’s
view. Perspectives in Drug Discovery and Design, 9/11(0):225–252, 1998.
doi: 10.1023/a:1027221424359. URL http://dx.doi.org/10.1023/A:
[27] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on
Code Generation and Optimization, 2004. CGO 2004., pages 75–86. Institute of Electrical & Electronics Engineers (IEEE), 2004. doi: 10.1109/cgo.
2004.1281665. URL http://dx.doi.org/10.1109/cgo.2004.1281665.
[28] Alan H. Lipkus. A proof of the triangle inequality for the tanimoto distance.
Journal of Mathematical Chemistry, 26(1/3):263–265, 1999. doi: 10.1023/a:
1019154432472. URL http://dx.doi.org/10.1023/A:1019154432472.
[29] Eugene A Markush. Pyrazolone dye and process of making the same, 1924.
US Patent 1,506,316.
[30] Martin Odersky, Lex Spoon, and Bill Venners. Programming in Scala. Artima
Inc, 2016. ISBN 0981531687. URL http://www.ebook.de/de/product/
[31] Malcolm J. McGregor and Steven M. Muskal. Pharmacophore fingerprinting.
1. application to QSAR and focused library design. Journal of Chemical
Information and Computer Sciences, 39(3):569–574, may 1999. doi: 10.
1021/ci980159j. URL http://dx.doi.org/10.1021/ci980159j.
[32] Alan McNaught. The IUPAC international chemical identifier. Chemistry international, pages 12–14, 2006. URL http://www.iupac.org/
[33] Jose L. Medina-Franco, Karina Martínez-Mayorga, Andreas Bender, and
Thomas Scior. Scaffold diversity analysis of compound data sets using an
entropy-based measure. QSAR & Combinatorial Science, 28(11-12):1551–
1560, dec 2009. doi: 10.1002/qsar.200960069. URL http://dx.doi.org/
[34] Eberhard Moebius. How many stars in the universe?, jan 2005. URL http:
//helios.gsfc.nasa.gov/qa_star.html#howmany. Online; accessed 2016-07-15.
[35] JW Morgan and E Anders. Chemical composition of earth, venus, and
mercury. Proceedings of the National academy of Sciences of the United
States of America, 77(12):6973–6977, Dec 1980.
[36] Gerhard Müller. Medicinal chemistry of target family-directed masterkeys.
Drug Discovery Today, 8(15):681–691, Aug 2003. URL http://dx.doi.org/10.1016/
[37] Kong T Nguyen, Lorenz C Blum, Ruud van Deursen, and Jean-Louis Reymond. Classification of organic molecules by molecular quantum numbers.
ChemMedChem, 4(11):1803–1805, Nov 2009. doi: 10.1002/cmdc.200900317.
URL http://dx.doi.org/10.1002/cmdc.200900317.
[38] C. A. Nicolaou, S. Y. Tamura, B. P. Kelley, S. I. Bassett, and R. F. Nutt.
Analysis of large screening data sets via adaptively grown phylogenetic-like
trees. Journal of Chemical Information and Computer Sciences, 42(5):1069–
1079, sep 2002. doi: 10.1021/ci010244i. URL http://dx.doi.org/10.1021/
[39] Noel M O’Boyle and Geoffrey R Hutchison. Cinfony - combining open source
cheminformatics toolkits behind a common interface. Chemistry Central
Journal, 2(1):24, 2008. doi: 10.1186/1752-153x-2-24. URL http://dx.doi.
[40] Noel M O’Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open babel: An open chemical toolbox.
Journal of Cheminformatics, 3(1):33, 2011. doi: 10.1186/1758-2946-3-33.
URL http://dx.doi.org/10.1186/1758-2946-3-33.
[41] Noel M. O’Boyle. cheminformatics.js: Preamble. Online; accessed
2016-07-19, feb 2015. URL http://baoilleach.blogspot.cz/2015/02/
[42] TI Oprea and J Gottfries. Chemography: the art of navigating in chemical
space. Journal of Combinatorial Chemistry, 3(2):157–166, 2001.
[43] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of
drug-like chemical space based on GDB-17 data. Journal of Computer-Aided
Molecular Design, 27(8):675–679, aug 2013. doi: 10.1007/s10822-013-9672-4.
URL http://dx.doi.org/10.1007/s10822-013-9672-4.
[44] Sara N. Pollock, Evangelos A. Coutsias, Michael J. Wester, and Tudor I.
Oprea. Scaffold topologies. 1. exhaustive enumeration up to eight rings.
Journal of Chemical Information and Modeling, 48(7):1304–1310, jul 2008.
doi: 10.1021/ci7003412. URL http://dx.doi.org/10.1021/ci7003412.
[45] M Rarey and JS Dixon. Feature trees: a new molecular similarity measure
based on tree matching. Journal of Computer-Aided Molecular Design, 12
(5):471–490, Sep 1998.
[46] J. W. Raymond. RASCAL: Calculation of graph similarity using maximum
common edge subgraphs. The Computer Journal, 45(6):631–644, jun 2002.
doi: 10.1093/comjnl/45.6.631. URL http://dx.doi.org/10.1093/comjnl/
[47] Hans J Reich and Donald J Cram. Macro rings. xxxvii. multiple electrophilic
substitution reactions of [2.2] paracyclophanes and interconversions of polysubstituted derivatives. Journal of the American Chemical Society, 91(13):
3527–3533, 1969.
[48] Kaspar Riesen, Sandro Emmenegger, and Horst Bunke. A novel software
toolkit for graph edit distance computation. In International Workshop
on Graph-Based Representations in Pattern Recognition, pages 142–151.
Springer, 2013.
[49] Lars Ruddigkeit, Ruud van Deursen, Lorenz C. Blum, and Jean-Louis Reymond. Enumeration of 166 Billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, nov 2012. doi: 10.1021/ci300415d. URL http:
[50] Erich Schubert, Alexander Koos, Tobias Emrich, Andreas Züfle,
Klaus Arthur Schmid, and Arthur Zimek. A framework for clustering uncertain data. Proceedings of the VLDB Endowment, 8(12):1976–1979, 2015.
URL http://www.vldb.org/pvldb/vol8/p1976-schubert.pdf.
[51] Ansgar Schuffenhauer and Thibault Varin. Rule-based classification of chemical structures by scaffold. Molecular Informatics, aug 2011. doi: 10.1002/
minf.201100078. URL http://dx.doi.org/10.1002/minf.201100078.
[52] Ansgar Schuffenhauer, Peter Ertl, Silvio Roggo, Stefan Wetzel, Marcus A.
Koch, and Herbert Waldmann. The scaffold tree - visualization of the scaffold
universe by hierarchical scaffold classification. Journal of Chemical Information and Modeling, 47(1):47–58, jan 2007. doi: 10.1021/ci600338x. URL
[53] Ben Shneiderman. Tree visualization with tree-maps: 2-d space-filling approach. TOG, 11(1):92–99, jan 1992. doi: 10.1145/102377.115768. URL
[54] Silicos-it. Strip-it™. Online; accessed 2016-07-12. URL http://silicos-it.
[55] G Thijs, W Langenaeker, and H De Winter. Application of spectrophores™
to map vendor chemical space using self-organising maps. Journal of Cheminformatics, 3(Suppl 1):P7, 2011. doi: 10.1186/1758-2946-3-s1-p7. URL
[56] NP Todorov and PM Dean. Evaluation of a method for controlling molecular scaffold diversity in de novo ligand design. Journal of Computer-Aided
Molecular Design, 11(2):175–192, Mar 1997.
[57] Miklos Vargyas, Judit Papp, Ferenc Csizmadia, Szabolcs Csepregi, Alex
Allardyce, and Peter Vadasz. Maximum common substructure based
hierarchical clustering. American Chemical Society Fall meeting, sep
2006. URL https://www.chemaxon.com/conf/MCS_Based_Hierarchical_
[58] Matthew E Welsch, Scott A Snyder, and Brent R Stockwell. Privileged
scaffolds for library design and drug discovery. Current Opinion in Chemical
Biology, 14(3):347–361, Jun 2010. doi: 10.1016/j.cbpa.2010.02.018. URL
[59] Michael J Wester, Sara N Pollock, Evangelos A Coutsias, Tharun Kumar
Allu, Sorel Muresan, and Tudor I Oprea. Scaffold topologies. 2. analysis of
chemical databases. Journal of Chemical Information and Modeling, 48(7):
1311–1324, Jul 2008. doi: 10.1021/ci700342h. URL http://dx.doi.org/
[60] Stefan Wetzel, Karsten Klein, Steffen Renner, Daniel Rauh, Tudor I Oprea,
Petra Mutzel, and Herbert Waldmann. Interactive exploration of chemical space with scaffold hunter. Nature Chemical Biology, 5(8):581–583,
Aug 2009. doi: 10.1038/nchembio.187. URL http://dx.doi.org/10.1038/
[61] Steven J. Wilkens, Jeff Janes, and Andrew I. Su. HierS: hierarchical
scaffold clustering using topological chemical graphs. Journal of Medicinal Chemistry, 48(9):3182–3193, may 2005. doi: 10.1021/jm049032d. URL
[62] David S Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza
Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey. Drugbank: a
comprehensive resource for in silico drug discovery and exploration. Nucleic
Acids Research, 34(Database issue):D668–D672, Jan 2006. doi: 10.1093/
nar/gkj067. URL http://dx.doi.org/10.1093/nar/gkj067.
[63] Alon Zakai. Emscripten. In Proceedings of the ACM international conference companion on Object oriented programming systems languages and
applications companion - SPLASH ’11. Association for Computing Machinery (ACM), 2011. doi: 10.1145/2048147.2048224. URL http://dx.doi.
[64] Hongyu Zhao and Justin Dietrich. Privileged scaffolds in lead generation. Expert Opinion on Drug Discovery, 10(7):781–790, Jul 2015.
doi: 10.1517/17460441.2015.1041496. URL http://dx.doi.org/10.1517/
A. Installation and Generator Tasks
In this section, we describe how to install Scaffold Visualizer from the source
code. The build system automates most of the work, so the process is simple.
We do not provide a binary distribution because we do not have a license to
redistribute some of the required libraries. Running Scaffold Visualizer from
the source also makes it easier to experiment with and modify the application,
which we want to encourage. Anyone interested in only a quick preview of
Scaffold Visualizer may want to visit the demo instance before diving into the
installation process. The demo instance is available at:
Hardware Requirements
Scaffold Visualizer consists of three parts: the client, the server and the generator.
We list the requirements for each of them separately. The only exception is the
CPU requirement: Scaffold Visualizer performs many CPU-intensive computations
in all its parts, so the faster the CPU, the better. There is no strict
minimum requirement for the CPU, but at least a mid-range CPU no older
than 5 years is recommended. The generator and the server are able to utilize
multiple cores.
Client Apart from a decent CPU, the client requires a sufficient amount of RAM:
at least 1 GB of RAM should be available to the browser to open larger data
sets. Besides that, the client requires no installation and has no requirements for
disk space.
Server At least 2 GB of RAM is recommended to be available for the server.
The server also needs some disk space for the installation – about 60 MB –
and then enough space to store the background hierarchy, which depends on the
background database used. About 1.1 GB of space is required for the hierarchy
based on the PubChem Compound database as of July 23, 2016; this number is
expected to grow as the number of molecules in the Compound database grows.
Generator The hardware requirements of the generator are based on the background chemical library used. For the PubChem Compound database to be
processed, at least 8 GB of RAM is recommended. Also about 100 GB of disk
space is temporarily needed to download and store the database.
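The disk space requirement can be checked before starting a long download. The following is only a minimal sketch: the helper name has_free_space is hypothetical, the check relies on the POSIX df utility, and the figure of 100 GB is interpreted here as 100 × 1024 × 1024 KB.

```shell
#!/bin/sh
# has_free_space DIR REQUIRED_KB
# Succeeds when the file system holding DIR reports at least REQUIRED_KB
# kilobytes of available space (hypothetical helper, not part of scaffvis).
has_free_space() {
    # df -P prints one header line and one data line; column 4 is "Available".
    hfs_avail=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ -n "$hfs_avail" ] && [ "$hfs_avail" -ge "$2" ]
}

# Example: check for roughly 100 GB in the current directory.
#   has_free_space . 104857600 || echo "not enough disk space" >&2
```

The check is only advisory; the actual space needed depends on the current size of the PubChem Compound database.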
Software prerequisites
As a browser application, the client has no special requirements. The Google
Chrome desktop browser is recommended, version 51.0 or later, preferably 64-bit.
The server and the generator are Java applications. They should be OS-independent,
and they have been verified to run on GNU/Linux and Microsoft Windows.
A Java SE Development Kit (JDK), containing the Java Runtime Environment and development tools, is required. It is recommended to use Oracle JDK
version 8 or later.
On top of that, SBT (http://www.scala-sbt.org/) must be installed.
Finally, the bin folders of the JDK and the SBT installations have to be on the
system path, so that the sbt and javac commands can be run directly from a
command line.
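Whether the required commands are reachable can be verified from a shell before the first build. This is a minimal sketch; the function name check_tools is an assumption introduced for illustration and is not part of Scaffold Visualizer.

```shell
#!/bin/sh
# check_tools NAME...
# Reports which of the given commands can be found on the PATH.
# Returns 0 when all are present and 1 when at least one is missing.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if command -v "$ct_tool" >/dev/null 2>&1; then
            echo "found:   $ct_tool -> $(command -v "$ct_tool")"
        else
            echo "MISSING: $ct_tool (not on PATH)" >&2
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Typical use for Scaffold Visualizer:
#   check_tools java javac sbt
```

On Microsoft Windows the same check can be done manually by running javac and sbt from a command prompt.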
Downloading Scaffold Visualizer
The Scaffold Visualizer can be obtained from GitHub, either as a ZIP
package or by using a Git command:
git clone https://github.com/velkoborsky/scaffvis.git
In either case, after cloning from Git or extracting the ZIP package, one should
have a scaffvis folder containing the build file (build.sbt) and the project
structure (folders client, generator, server, ...).
The folder does not yet contain the required libraries. Most of the libraries
are downloaded from the Internet automatically; some have to be downloaded
manually, as described in the following subsection.
Downloading Libraries
The Scaffold Visualizer requires the ChemAxon JChem cheminformatics toolkit
to run. The toolkit does not have to be installed; instead, a few selected
libraries have to be copied from the distribution ZIP file.
Step 1: Downloading the JChem toolkit
At the time of writing, the ChemAxon JChem Suite can be downloaded from:
It is recommended to select the platform independent ZIP file option - a “Cross
platform package without installer”. The Scaffold Visualizer has been developed
and tested with JChem Suite version (from Jan 5, 2016), however, it
might be possible to use a later version. As of July 2016, the version is
available from Archives at the bottom of JChem Suite download page, or directly
In any case, at the end of this step you should have a JChem package file such
as jchem- or an equivalent later version.
Figure A.1: The list of required JChem libraries
Step 2: Extracting the libraries
When extracted, the JChem package file should contain binary libraries in the
jchem/lib/ subdirectory. From all the libraries contained there, only a small
subset is required. The exact list of the required libraries is presented in figure A.1.
The libraries need to be copied to the generator/lib/ directory in Scaffold Visualizer. This directory already exists and contains a single file named
REQUIRED_LIBRARIES.txt which contains the same list of libraries as presented
in figure A.1.
It is important to copy only the libraries prescribed. Simplifying the task by
copying all the libraries that are provided with ChemAxon JChem could lead
to classpath conflicts with the libraries used by the Play Framework; in that
case, the generator would still work fine but the server might not work at all.
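The selective copy can be scripted so that no extra library slips in. The sketch below rests on an assumption that is not confirmed here: that REQUIRED_LIBRARIES.txt lists one file name per line. The function name copy_required_libs is hypothetical.

```shell
#!/bin/sh
# copy_required_libs JCHEM_LIB_DIR LIST_FILE DEST_DIR
# Copies only the files named in LIST_FILE from JCHEM_LIB_DIR to DEST_DIR.
# Assumed list format: one file name per line; blank lines and lines
# starting with '#' are skipped. Returns 1 if any listed file is missing.
copy_required_libs() {
    crl_src=$1; crl_list=$2; crl_dest=$3; crl_status=0
    while IFS= read -r crl_name || [ -n "$crl_name" ]; do
        case $crl_name in ''|'#'*) continue ;; esac
        if [ -f "$crl_src/$crl_name" ]; then
            cp "$crl_src/$crl_name" "$crl_dest/"
        else
            echo "missing in $crl_src: $crl_name" >&2
            crl_status=1
        fi
    done < "$crl_list"
    return "$crl_status"
}

# Example, run from the scaffvis folder after extracting the JChem package:
#   copy_required_libs /path/to/jchem/lib \
#       generator/lib/REQUIRED_LIBRARIES.txt generator/lib
```

If the list file uses a different layout, the loop has to be adapted; copying the files by hand according to figure A.1 gives the same result.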
Running Scaffold Visualizer
Now, everything should be ready to run the Scaffold Visualizer. In a command
line, navigate to the folder where the Scaffold Visualizer is installed (i.e. the folder
containing the build.sbt build file). You should be able to start an interactive
SBT console by executing the sbt command.
You can start by executing a sample generator task with the command:
> generator/run CheckPaths
The task will check for various paths (such as the PubChem compound source
directory and the database files) and report their existence and locations.
Now the generator can be used to generate the background hierarchy. This step
can be skipped when a sample hierarchy is included (see the CheckPaths task output)
or when a scaffold hierarchy (file scaffoldHierarchy.mapdb) generated elsewhere is provided.
Generating the background hierarchy
As described in section 6.2.1, the background scaffold hierarchy is generated from
PubChem files in three distinct steps. They can be run all at once, using a
supplied SBT command:
> generate
Alternatively, the tasks can be run one by one. That can be done the same way
as with the CheckPaths task, or by switching to the generator project:
> project generator
Then, tasks can be run using the run command only, for example:
> run ImportPubChem
As described earlier, the ImportPubChem task converts the PubChem Compound
SDF representation to a custom format. Before running the ImportPubChem
task, download the PubChem Compound database (see pubchem/README.txt)
and check that it is correctly detected using the CheckPaths task. Running the
ImportPubChem task creates a PubChem database in a custom format in
hierarchy/pubchem.mapdb (see also hierarchy/README.txt for details on how
to change the default path).
The second task, GenerateScaffolds, uses the PubChem database file generated in the first step and calculates a raw processing hierarchy, creating a file
This file is then used in the third step to create the final hierarchy, by default
stored in hierarchy/scaffoldHierarchy.mapdb.
Except for the scaffoldHierarchy.mapdb file, the other files may now be deleted.
Also, please make sure that the target database files do not exist prior to
calling the processing task creating them, unless resuming an interrupted computation.
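Leftover target files from an earlier run can be detected before a fresh computation is started. The following is a minimal sketch; the function name list_existing_targets is hypothetical, and only the two default file names mentioned in the text are used in the example, so the intermediate file would have to be added by hand.

```shell
#!/bin/sh
# list_existing_targets DIR FILE...
# Prints every FILE that already exists under DIR.
# Returns 0 when none exists and 1 when at least one does.
list_existing_targets() {
    tg_dir=$1; shift
    tg_found=0
    for tg_file in "$@"; do
        if [ -e "$tg_dir/$tg_file" ]; then
            echo "already exists: $tg_dir/$tg_file"
            tg_found=1
        fi
    done
    return "$tg_found"
}

# Example, before a fresh (not resumed) run from the scaffvis folder.
# SBT accepts commands as arguments, so the supplied generate command
# can be started without entering the interactive console:
#   list_existing_targets hierarchy pubchem.mapdb scaffoldHierarchy.mapdb \
#       && sbt generate
```

When resuming an interrupted computation, the check should be skipped, since the existing files are then expected.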
Running the server in development mode
Having a hierarchy file ready, we can run the server. That can be done either in
development mode, which is easy, or in production mode, which is a bit more
involved and is described in the next section.
Running the server in development mode consists of calling a single SBT command:
> server/run
The output should immediately tell us how the application can be accessed. By
default, the server listens at http://localhost:9000/.
Running the server might fail if the background scaffold hierarchy is not
available (see the previous section on how to generate it, and use the CheckPaths
generator task to verify that it is correctly detected). The development server
also fails with a hard-to-debug error when the javac Java compiler is not on
the system path.
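Both failure causes can be ruled out with a short pre-flight check. This is only a sketch: the function name preflight_dev_server is hypothetical, and it looks only at the default hierarchy location, so it reports a false alarm when an alternative path has been configured as described in hierarchy/README.txt.

```shell
#!/bin/sh
# preflight_dev_server PROJECT_DIR [COMPILER]
# Checks the two preconditions named in the text: the hierarchy file in its
# default location and the Java compiler (javac unless overridden) on the PATH.
# Returns 0 when both hold and 1 otherwise.
preflight_dev_server() {
    pf_dir=$1; pf_compiler=${2:-javac}; pf_status=0
    if [ ! -f "$pf_dir/hierarchy/scaffoldHierarchy.mapdb" ]; then
        echo "hierarchy file not found under $pf_dir/hierarchy" >&2
        pf_status=1
    fi
    if ! command -v "$pf_compiler" >/dev/null 2>&1; then
        echo "$pf_compiler is not on the PATH" >&2
        pf_status=1
    fi
    return "$pf_status"
}

# Example, run from the scaffvis folder:
#   preflight_dev_server . && sbt server/run
```

The CheckPaths generator task remains the authoritative check, because it uses the paths the application itself resolves.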
The development mode can be used to run the server for the first time – to
verify that everything works as expected. On top of that, the development mode
is useful when customizing the application. The changes made to the source code
are immediately applied to the running server, which allows for rapid iterations.
Running the server in production mode
Creating a distribution package
To run the server in production mode, a binary distribution package needs to be
created. To do that, switch to the server project:
> project server
Then, if creating a distribution package for the first time, generate a new cryptographic secret¹:
> playUpdateSecret
Next, you can generate a binary distribution package using the supplied release command:
> release
You will be informed where to find the resulting package. By default, it
resides in the file server-1.0.0.zip in the folder server\target\universal.
Running the server
The package server-1.0.0.zip then needs to be extracted to a suitable location
and run from a command line using the included launchers in the bin folder
(bin/server for Linux and bin/server.bat for Windows).
The server can either be run from the computer where it has been compiled
or copied to a dedicated server and run from there. The package is platform
independent and only needs a server Java Runtime Environment (it needs neither
the JDK nor SBT).
The package also contains all the required libraries. It does not, however,
contain the scaffold hierarchy. This hierarchy either has to be copied to the server
(to the path hierarchy/scaffoldHierarchy.mapdb) or an alternative location
has to be specified (as described in hierarchy/README.txt).
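The deployment steps above can be sketched as a pair of small helpers. Several points are assumptions: the function names install_hierarchy and launcher_for are hypothetical, the extracted directory is assumed to be named server-1.0.0 after the ZIP file, and the hierarchy path is assumed to be resolved relative to the directory from which the launcher is started.

```shell
#!/bin/sh
# install_hierarchy HIERARCHY_FILE SERVER_DIR
# Copies the hierarchy file to its default location inside SERVER_DIR.
install_hierarchy() {
    ih_src=$1; ih_server=$2
    [ -f "$ih_src" ] || { echo "no such file: $ih_src" >&2; return 1; }
    mkdir -p "$ih_server/hierarchy" &&
        cp "$ih_src" "$ih_server/hierarchy/scaffoldHierarchy.mapdb"
}

# launcher_for OS_NAME
# Prints the launcher path (relative to the server directory) for the
# given operating system name, e.g. the output of `uname -s`.
launcher_for() {
    case $1 in
        Windows*|MINGW*|MSYS*|CYGWIN*) echo "bin/server.bat" ;;
        *) echo "bin/server" ;;
    esac
}

# Example on the target machine:
#   unzip server-1.0.0.zip
#   install_hierarchy hierarchy/scaffoldHierarchy.mapdb server-1.0.0
#   (cd server-1.0.0 && "./$(launcher_for "$(uname -s)")")
```

If the hierarchy is kept elsewhere, the copy step is replaced by the configuration described in hierarchy/README.txt.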
¹ Consult the Play Framework documentation for more details, specifically: https://www.
More resources
Finally, the server is a standard Play Framework application, and detailed information about preparing distribution packages and running Play applications in
production can be found in the Play Framework documentation as well as in the
documentation of the SBT Native Packager.