Reverse engineering of genetic networks with time delayed recurrent neural networks and clustering techniques

Dissertation submitted to the Combined Faculties for the Natural Sciences and for Mathematics of the Ruperto-Carola University of Heidelberg, Germany for the degree of Doctor of Natural Sciences

presented by M. Sc. David Camacho Trujillo
born in México City, México

Oral examination: ................................................

Referees:
Prof. Dr. Ursula Kummer
P.D. Dr. Ursula Klingmüller

Dedicated to: Sarah & Tere & Arturito

INDEX

Summary .......... 9
Zusammenfassung .......... 10
Personal Words .......... 11
List of abbreviations .......... 13
General Motivation .......... 17
1. Biological context .......... 19
1.1 Gene regulation .......... 19
1.2 Basal transcription apparatus .......... 19
1.3 Transcription factors .......... 21
1.4 Enhancers-Insulators .......... 22
1.5 Post-transcriptional regulation of the mRNA .......... 23
1.5.1 Alternative splicing .......... 23
1.5.2 RNA interference .......... 24
1.5.3 Dimensional in-homogeneities .......... 26
2. Reverse engineering and modelling of genetic network modules .......... 29
2.1 Related work .......... 29
2.2 General concepts .......... 30
2.3 Dimensionality reduction by data selection .......... 32
2.4 Theoretical works .......... 36
2.4.1 Boolean networks .......... 36
2.4.2 Differential equation systems .......... 38
2.4.3 Stochastic models .......... 44
2.4.4 Bayesian networks .......... 45
3. Methods .......... 50
3.1 Workflow .......... 50
3.2 Data pre-processing, quality control .......... 51
3.3 Data normalization .......... 53
3.4 Dimensionality problem: the use of interpolation approaches .......... 55
3.5 Data fitting .......... 57
3.6 Models .......... 62
3.6.1 The CTRNN model .......... 62
3.6.2 The TDRNN model .......... 66
3.6.3 Robust parameter determination .......... 67
3.6.4 Graph generation and error distance measurements .......... 68
3.6.5 Clustering of results .......... 68
3.6.6 Dynamic Bayesian network .......... 71
4. Results .......... 73
4.1 Synthetic benchmark: the Repressilator .......... 74
4.1.1 Parameter space selection .......... 75
4.1.2 Required data length .......... 86
4.1.3 Robustness against noise .......... 92
4.1.4 Robustness against incomplete information: clustering improves the standard reverse engineering task, quantitatively and qualitatively .......... 97
4.2 The yeast cell cycle .......... 103
4.2.1 TDRNN shows inference and predictive power superior to previous models on experimental data .......... 104
4.2.2 Bootstrapping validation .......... 106
4.2.3 Clustering improves the RE process with real data .......... 107
4.3 Reverse engineering of keratinocyte-fibroblast communication .......... 109
5. Discussion .......... 127
5.1 Model choice and data driven experiments .......... 128
5.2 Data selection .......... 129
5.3 Data interpolation, implications .......... 130
5.4 Data fitting and inference power relationship .......... 131
5.5 Reverse engineering framework, improving the robust parameter selection .......... 135
6. Conclusions .......... 137
7. Bibliography .......... 139

Summary

In the iterative process of experimentally probing biological networks and computationally inferring models for them, fast, accurate and flexible computational frameworks are needed for modeling and reverse engineering biological networks. In this dissertation, I propose a novel model to simulate gene regulatory networks using a specific type of time delayed recurrent neural network.
Also, I introduce a parameter clustering method to select, from the simulations, groups of parameter sets that represent biologically reasonable networks. Additionally, a general purpose adaptive function is used here to decrease and study the connectivity of small gene regulatory network modules. In this dissertation, the performance of this novel model is demonstrated by simulating the dynamics and inferring the topology of gene regulatory networks derived from synthetic and experimental time series gene expression data. I assess the quality of the inferred networks by means of graph edit distance measurements against the synthetic and experimental benchmarks. Additionally, I compare the edit costs of the networks inferred with the time delayed recurrent networks with those of other previously described reverse engineering methods based on continuous time recurrent neural networks and dynamic Bayesian networks. Furthermore, I address questions of network connectivity and of the correlation between data fitting and inference power by simulating common experimental limitations of the reverse engineering process, such as incomplete and highly noisy data. The novel time delayed recurrent neural network model, in combination with parameter clustering, substantially improves the inference power of reverse engineered networks. Additionally, some suggestions for future improvements are discussed, particularly from the data driven perspective as the solution for modeling complex biological systems.

Zusammenfassung

For the iterative process of experimentally exploring biological networks and computationally deriving models for these networks, fast, accurate and flexible computational frameworks are needed in order to model and to reconstruct biological networks. In this work I present a novel model that represents gene regulatory networks using time delayed recurrent neural networks. In addition, I introduce a parameter clustering method that selects, from the simulations, groups of parameter sets that represent biologically meaningful solutions. Additionally, a general purpose adaptive function is employed here to reduce and to investigate the connectivity of small gene regulatory networks. This dissertation demonstrates the capability of this novel model to simulate the dynamics of gene regulatory networks from synthetic and experimental time series gene expression data sets and to infer their topology. I determine the quality of the inferred networks by means of graph edit distance measurements compared against the synthetic and experimental references. Furthermore, I compare the edit costs of the networks derived with the time delayed recurrent networks with those of other previously described reconstruction methods based on continuous time recurrent and dynamic Bayesian networks. Beyond that, I address questions of network connectivity and of the correlation between data fitting and the statistical power of the inference by simulating known experimental limitations of the reconstruction process, such as incomplete or highly noise-contaminated data sets. This novel, specific time delayed recurrent neural network, together with parameter clustering, substantially improves the inference power of the reconstructed networks. Finally, some suggestions for future improvements are discussed, in particular from the data driven perspective as the solution strategy for modeling complex biological systems.
Personal Words

In this dissertation I would like to acknowledge those persons and institutions that made this work possible. The support of Professor Ursula Kummer is of special importance due to the circumstances of this work; therefore, I would like to sincerely thank her for the opportunity to conclude all these years of work in a nice way. I want to thank Professor Randall Beer, who, thanks to the internet, provided me with precise but decisive directions to understand and develop my theoretical model. Analogous are the contributions of Professor Paul Johnson as well as the SWARM community, who always helped me to implement my model under the SWARM philosophy. Additionally, I would like to thank Dr. J.J. Merello for his GA code and directions on how to use it. Undoubtedly, this work would not have been possible without the support of Sarah. In many ways she has supported me and was at my side in difficult moments; hence, I want to set this in words. Many thanks for all your support, Saritah. Special thanks to my family because, although they are not geographically close to me, they were always present and supported me with their understanding. Special thanks to my sisters Rebekita and Lendy, for their tenderness and swift responses when needed. I acknowledge the opportunity to develop this work at the iBIOS group of the Deutsches Krebsforschungszentrum (DKFZ) and the Viroquant group at Bioquant of the University of Heidelberg. This work was made possible by the support of the DAAD, which I would like to acknowledge because it is more than a building: its people have always been a human institution. In this sense, I would like to acknowledge my former university, UNAM, because despite being massive, it has a quality similar to that of the University of Heidelberg and gave me my education for free. Once I heard that such a massive university imprints on us some sort of nationality, and I believe there is a little truth in that. I would like to thank all my friends that have been at my side, in one way or another. Finally, I would like to thank my own "dickopfness" because, in many senses, it was not planned for me to be at this point.

List of abbreviations

Activating Protein 1 (AP-1)
Adaptive time-delay neural network (ATNN)
ATP: protein phosphotransferase, cAMP-dependent (PKA)
Boolean network (BN)
Carcinoembryonic antigen-related cell adhesion molecule 1 (CEACAM1)
Catalytic subunit of the main cell cycle cyclin-dependent kinase, yeast (cdc28)
Cell division control protein 15, yeast (cdc15)
Cell division cycle 14 protein, yeast (Cdc14)
Cell division cycle 20 protein, yeast (Cdc20)
Cell division cycle 20-like protein 1, yeast (Cdh1)
Central nervous system (CNS)
Connectivity (K)
Continuous time recurrent neural network (CTRNN)
Cyclin-dependent kinase, yeast (Mcm1)
Deoxyribonucleic acid (DNA)
Dynamic Bayesian network (DBN)
Early growth response protein 1 (EGR1)
Epidermal growth factor receptor (EGF-R)
Epidermal growth factor (EGF)
Escherichia coli (E. coli)
FBJ murine osteosarcoma viral oncogene homolog (FOS)
Fetal calf serum (FCS)
G1/S-specific cyclin, yeast (Cln3)
G1/S-specific cyclins, yeast (Cln1/2)
G2/mitotic-specific cyclin 1/2, yeast (Clb1/2)
Gene regulatory network (GRN)
Gene Set Enrichment Analysis (GSEA)
Genetic algorithm (GA)
Glyceraldehyde-3-phosphate dehydrogenase, NADP+ (GAPDH)
Granulocyte-macrophage colony-stimulating factor (GM-CSF)
Graph edit distance (GED)
Gray, absorbed radiation unit (Gy)
Hepatocyte growth factor (HGF)
Human skin dermal fibroblasts (HDF)
Immortalized human keratinocytes (HaCaT)
Integrin beta 6 (ITGB6)
Integrin alpha V (ITGAV)
Irradiated human skin dermal fibroblasts (HDFi)
Keratinocyte growth factor (FGF-7)
Laminin alpha 3 (LAMA3)
Laminin gamma 2 (LAMC2)
Mean square error (MSE)
Messenger ribonucleic acid (mRNA)
Micro ribonucleic acid (miRNA)
Mismatch (MM)
Multiple recurrent neural networks (MRNN)
Multiprotein bridging factor, yeast (MBF)
Nondeterministic polynomial-time hard (NP-hard)
Not pruning (NP)
Open reading frame (ORF)
Ordinary differential equations (ODE)
Perfect match (PM)
Plasminogen activator, urokinase (PLAU)
Plasminogen activator, urokinase receptor (PLAUR)
Prostaglandin-endoperoxide synthase 2 (PTGS-2)
Pruning (P)
Random Boolean networks (RBN)
Regulatory protein SWI4, yeast (SBF)
Regulatory protein SWI5, yeast (Swi5)
Reverse engineering (RE)
Reverse transcriptase-polymerase chain reaction (RT-PCR)
Ribonucleic acid (RNA)
Ribonucleic acid polymerase subunit II (RNA pol II)
Simulated annealing (SA)
S-phase entry cyclin 5/6, yeast (Clb5/6)
Stromal derived factor-1 (SDF-1)
Substrate and inhibitor of the cyclin-dependent protein kinase CDC28, yeast (Sic1)
Thymine-adenine promoter consensus sequence (TATA box)
Time delayed neural networks (TDNN)
Time delayed recurrent neural network (TDRNN)
Transcription factors (TF)
Transcription factor Jun B (JUNB)
Variance stabilization normalization (VSN)
v-ets erythroblastosis virus E26 oncogene homolog 1 (ETS1)

General Motivation

A group of technologies, like microarrays, CGH or mass spectrometry, has become part of standard laboratory experiments all around the world.
These technologies allow us to measure thousands of genes, hundreds of proteins or other cellular components such as mRNAs at the same time. All this information is usually stored in databases first, but its analysis is generally far from finished. One reason for this situation is that the traditional one-to-one "cause-effect" correlation, typically used in biology, is not applicable to these large data sets. To handle this information, a new kind of approach has been developed during the last years: approaches able to store, analyze and develop models for a large number of variables. One area that is deeply influenced by experimental high throughput data generation technologies is the analysis of gene regulation in mammalian cells. Because of its complexity and its implications in different areas like evolution or drug target generation, gene regulation is widely studied in theoretical works. The holistic integration of gene regulation dynamics has just begun, and it is clear that only the iterative interplay between data driven lab experiments and theoretical work will be able to generate a new paradigm in the area. The present work is circumscribed within this multidisciplinary context. To understand the goals, achievements and limitations of this work, some basic topics of gene regulation complexity will be described (Chapter 1), together with the most relevant related theoretical work developed until now (Chapter 2). In Chapter 3, the methodology used in this thesis is described. A comparative study with similar approaches is carried out in the results section (Chapter 4), together with the presentation of results obtained by applying the approach described in the present work to experimental data. Lastly, the analysis of the results as well as collateral topics is given in the discussion section (Chapter 5), and some final words and an outlook are given in the conclusion section (Chapter 6).
1. Biological context

1.1 Gene regulation

Gene regulation is a complex and not well-understood process. It involves several mechanisms of control which act at different levels of time scale, cellular space and molecular mechanism. Here, some of these mechanisms will be described to highlight the implications and restrictions they impose on the theoretical models intended to capture gene regulation behavior.

1.2 Basal transcription apparatus

The basal transcription apparatus is not part of the gene regulation mechanism by itself. However, it is important to remark that its presence is a necessary condition for transcribing any gene. Therefore, it is important to know some structural aspects of it, which play a role in the design of kinetic models related to gene regulation (Mjolsness and Sharp, 1991). While the enzymatic behavior of transcription is due to the basal transcription apparatus bound to the multi-subunit RNA polymerase II, the substrates are the relatively freely diffusible nucleotide bases and the highly conserved thymine-adenine promoter sequences (TATA boxes, from here on) on the DNA. Finally, the mRNA is the obtained product. It should be clear that the TATA boxes are not part of the same chemical liquid phase as the other substrates. TATA boxes are part of an extremely long polymer associated with thousands of other proteins, forming a dynamic semi-solid phase system, the so-called chromatin. Therefore, traditional kinetic models (such as Michaelis-Menten and others) should not be applicable here, because they presuppose freely diffusive Brownian motion of all substrates in a homogeneous liquid phase medium, meaning homogeneous concentrations.
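For reference, the classical Michaelis-Menten rate law that the argument above rules out for transcription kinetics can be written down as a minimal sketch; the parameter values below are purely illustrative, not measured constants.

```python
def michaelis_menten_rate(s, vmax, km):
    """Michaelis-Menten rate law v = vmax * s / (km + s); valid only
    when substrate s diffuses freely in a well-mixed, homogeneous
    liquid phase, which is exactly what chromatin-bound TATA boxes
    do not satisfy."""
    return vmax * s / (km + s)

# Illustrative values only: at s == km the rate is exactly vmax / 2.
print(michaelis_menten_rate(s=2.0, vmax=10.0, km=2.0))  # 5.0
```

The hyperbolic saturation of this law (the rate never exceeds vmax) is precisely the behavior that presupposes homogeneous substrate concentrations, which motivates the text's caution against applying it to transcription.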
The basal transcription apparatus plus the multi-subunit RNA pol II is a complex of more than 20 proteins that is sequentially assembled in situ. This multimeric complex requires additional proteins to initiate its formation; the so-called Mediator (Lewis and Reinberg, 2003) is one of those required additional proteins. The Mediator is about 20 proteins in size, and this large multimeric protein complex requires further proteins to initiate transcription, the so-called transcription factors (TF from here on). These TF are the triggering initial step in gene transcription activation. The basal transcription apparatus plus Mediator and additional TF form a large multimeric complex, which can comprise more than 60 proteins. Hence, it is clear that gene transcription initiation requires some structural conditions on the DNA superstructure, such as accessibility, to avoid steric impediments. This accessibility is controlled by other regulatory mechanisms that promote chromatin relaxation (Cremer and Cremer, 2001). Additional aspects like DNA malleability (3D curvature of the DNA) or stochastic fluctuations in access generated by chromatin "breathing" are a matter of discussion. These processes certainly also play their role, but the problem is to know when and how strongly every regulatory mechanism contributes to the global behavior. It should be clear that chromatin accessibility is a three-dimensional level of inhomogeneity and a source of gene activity regulation. Models oriented towards representing gene reaction-diffusion kinetics should take this into account. Concerning the product (mRNA) kinetics: since the mRNA is a polymer, it follows a multiple-step synthesis.
Additionally, as this mRNA product is not a monotonic polymer, its step-by-step formation implies a more complex process (among other processes like translocation, strand separation etc.) known as nucleotide selection. This means that the rate of mRNA production is not diffusion limited, and it has slight variations depending on the template sequence and other context dependent proteins. On the other hand, once the basal transcription machinery is induced, it activates the RNA polymerase, which in turn moves along the gene performing the transcription. The consequence of this RNA polymerase displacement is that another RNA polymerase is able to bind the initiation complex and simultaneously initiate a new transcription process. Therefore, even though there is only one functional copy of every gene, it is difficult to quantify the maximum number of transcription complexes at a certain time. Owing to these last two considerations, it is practically impossible to define a transcription rate constant for any gene, because the rate of transcription for every gene is a continuous time-and-context dependent process.

1.3 Transcription factors

Transcription factors belong to a large but limited set of protein families that share functionality; in humans there are about 300 of them (Itzkovitz, et al., 2006). TF can recognize (the mechanism is family specific) (Itzkovitz, et al., 2006) pattern sequences in the DNA along the so-called promoter region of every gene. Once the TF are bound to the promoter region of a given gene, they can interact with proteins from the basal transcription apparatus and together recruit the RNA polymerase II, initiating the transcription of that gene. The activation strength is a function of the concentration of, and the physical interactions among, the TF that activate a given gene at a certain time.
It has been shown (van Nimwegen, 2003) that the total number of TF (N) of any species scales with its genome size (G) as a power law (N ~ G^1.9 in prokaryotes, N ~ G^1.3 in eukaryotes). However, not all of them are present at the same time, nor at the concentration required to activate their target genes. Instead, TF need to interact with some other proteins in order to activate a gene. Usually, they form dimers (homo- and heterodimers) and in turn form quaternary complexes at the promoter regions (Pilpel and Sudarsanam, 2001), where typically more than one complex regulates a gene's activity. Furthermore, it has been proposed that the total number of TF per family correlates with the number of degrees of freedom (the number of base pairs recognized by the family, ranging from 4 to 96 in humans) in the binding mechanism. However, an overlap of sequence recognition occurs among different TF, probably to make the system more robust through redundancy. Regarding the concentration of the TF in the nucleus, the general idea is that the local concentrations of TF are responsible for the activation of their target genes. The local variation of TF concentrations is influenced by their transport (internalization into the nucleus from the cytoplasm), which in turn is often regulated by their activation (usually by phosphorylation) and negatively regulated by their deactivation and/or degradation. However, once the TF are in the nucleus, local variations of TF concentrations occur due to the interaction with other proteins already localized at (attached to) certain DNA regions, the so-called enhancers. Interactions between TF and their promoter targets on the DNA are not covalent; the real picture is a dynamic stochastic process in which TF bind to and unbind from the DNA permanently.
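The TF-count scaling law quoted earlier in this section can be illustrated numerically. Since the proportionality constant of N ~ G^alpha is not given in the text, the sketch below only compares relative growth, using the cited exponents.

```python
def relative_tf_growth(genome_ratio, alpha):
    """Relative increase in transcription factor count N when genome
    size G grows by the factor genome_ratio, under the power law
    N ~ G**alpha (van Nimwegen, 2003: alpha ~ 1.9 for prokaryotes,
    alpha ~ 1.3 for eukaryotes)."""
    return genome_ratio ** alpha

# Doubling the genome size multiplies the TF count by ~3.7 for
# prokaryotes but only ~2.5 for eukaryotes.
print(relative_tf_growth(2.0, 1.9))  # ~3.73
print(relative_tf_growth(2.0, 1.3))  # ~2.46
```

The super-linear exponents mean the regulatory apparatus grows faster than the genome itself, which is one reason small genomes can make do with comparatively few TF families.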
In this sense, the TF activation state often plays a central role in their activity, because activated (usually phosphorylated) transcription factors often exhibit a higher affinity for the DNA recognition site; but when the concentration of inactive TF is high enough, the non-activated TF can displace the already bound active TF. Therefore, at this level of TF regulation, gene activity is again a continuous-time combinatorial process, a function of TF identities, their transport, their local concentrations and often their activation state (phosphorylated or not).

1.4 Enhancers-Insulators

These are short regions on the DNA which can facilitate the transcription of (cis) genes at a relatively long distance. They are defined as distant-acting cis-regulatory elements (Blackwood and Kadonaga, 1998), but their precise mechanism of action is still not clear. There is evidence that enhancers increase the probability that genes in their surroundings are transcribed. Enhancers probably do their task by increasing the local concentration (known as nuclear localization) of TF, but there are some other proposed mechanisms, such as chromatin or nucleosome remodeling, superhelical tension (to facilitate chromatin accessibility) and direct interaction with associated proteins and the basal transcription machinery. Enhancers also have their functional counterpart in the so-called insulators (West, et al., 2002), which probably directly inhibit the functioning of enhancers, but may also promote gene repression by other mechanisms like chromatin condensation. However, until now there is not enough information on the final balance between enhancers and insulators. Therefore, this could be modeled as a non-specific bias of the gene activation process.
1.5 Post-transcriptional regulation of the mRNA

At this point, two mechanisms are the most relevant to take into account for modeling gene regulatory networks: alternative splicing and RNA interference. A good description of both can be found elsewhere; therefore, the focus here is only on the implications for the modeling of gene regulatory networks.

1.5.1 Alternative splicing

The mRNA is transcribed as a precursor containing intervening sequences (introns). These sequences are subsequently removed such that the flanking regions (exons) are spliced together to form mature mRNA. Alternative splicing pathways generate different mRNAs encoding distinct protein products, thus increasing the coding capacity of genes. The resulting proteins may exhibit different and sometimes antagonistic (Cremer and Cremer, 2001) functional and structural properties, such as binding affinity, intracellular localization, enzymatic activity, stability and posttranslational modifications, and may inhabit the same cell, with the resulting phenotype being the balance between their expression levels. Alternative splicing can also act as an on-off gene expression switch by the introduction of premature stop codons (Feyzi, et al., 2007). The alternative splicing mechanism is carried out by a ribonucleoprotein structure called the spliceosome, but most of the splicing regulation that is not part of the basal spliceosome is known to be undertaken by families of splicing regulatory proteins. These splicing factors bind to signals in the vicinity of the exon and promote the exon's inclusion or exclusion by activating or inhibiting the function of the splice site. The number of classes and the characteristics of these regulatory proteins and their RNA binding sites are relatively little known and are currently under active investigation (Irimia and Roy, 2008; Pettigrew and Brown, 2008; Solis, et al., 2008).
These properties of alternative splicing should change the vision of gene regulatory networks as being defined by a fixed topology and a set of interaction rules towards a more dynamic concept. Especially now that it is accepted that alternative splicing is not the exception but the rule for about 40-60% of the human genes (Downes, 2004; Stamm, 2002). However, until now only little information concerning alternative splicing regulation is available, and therefore it could not be included in gene regulatory network models. It should be mentioned that alternative splicing is not a problem for the reverse engineering of gene regulatory networks by itself, but rather for the prediction of gene network behavior (see the difference between inference power and prediction power in Chapter 2). That is, at the cellular level all signaling mechanisms are context dependent, and what applies for a given cell at a certain development stage is not applicable to another cell line, or to the same cell line at another cell cycle stage or under different environmental conditions. Hence, one should carefully extrapolate what is encountered in one cell line to another cell line, or to the same cell under different conditions. The distinction between these two situations will be further explained in Chapter 2.

1.5.2 RNA interference

Less than 2% of the human genome is translated into proteins (He, 2004), yet more than 40% of the genome is thought to be transcribed into RNA (Ben-Dov, et al., 2008). The vast fraction of untranslated RNAs includes several kinds of functional non-coding RNA, like snRNA (spliceosomal and U7), snoRNA, telomerase RNA, SRP RNA (protein trafficking), tRNA, TSK RNA (transcription elongation), and a group of RNAs that interfere with the expression of genes: siRNA, miRNA and piRNA.
The members of this last family act through different mechanisms, but share the characteristic of silencing the expression of genes. siRNA is oriented to silencing exogenous genes like those present in viruses (Juliano, et al., 2008), piRNA is utilized in mammalian germ cells (Paddison, 2008), and finally miRNA is utilized by plant and animal cells to selectively silence genes during development, differentiation or proliferation (Boutros and Ahringer, 2008), as part of another level of gene regulation. miRNA regulation is similar to that of gene expression: miRNAs have promoters and enhancers, but they can also be transcribed as part of a gene and later spliced out during mRNA maturation. They are initially transcribed as pri-miRNA, processed by an enzyme (Drosha) in the nucleus, and exported to the cytoplasm as pre-miRNA, where they are further processed by another enzyme (Dicer), generating mature miRNA. It is there, in the cytoplasm, where a miRNA finally meets its target mRNA, essentially through Watson-Crick complementarity. Once miRNAs meet their target mRNA, they decrease its function in different possible ways, like direct cleavage (mostly in plants), mRNA deadenylation affecting the stability of the target mRNA, and inhibition of translation (mainly in animals). Currently there are 328 miRNAs annotated in the human genome (Chen and Rajewsky, 2007), but it is thought that there are more than 1000. Interestingly, even though miRNAs are only about 22 bases long, they exhibit relatively highly conserved sequences, and it has been shown that every class of miRNA can affect several hundreds of different genes (He, 2004). It is thought that more than 30% of human genes are also regulated by miRNAs. Therefore this mechanism should be included, explicitly or implicitly, in any model of gene regulatory networks.
Nevertheless, the problem with including this mechanism is that in animals the main mechanism of gene repression through miRNAs is the repression of protein translation. That is, it does not affect the synthesis of mRNA and therefore is not reflected at the transcriptome level, and consequently is also not captured by mRNA microarray technology. Ideally, to create a dynamical model of gene regulatory networks there should be information regarding the stability of mRNA, the relationship between mRNA and translated protein, protein half-lives, the activity state of the proteins involved in gene regulation, the localization of these proteins, complex formation, etc. The case of miRNA regulation is of particular relevance for the reverse engineering of gene regulatory networks because it acts as a negative regulation of genes. Once again, the distinction between both areas will be further explained in Chapter 2.

1.5.3 Dimensional in-homogeneities

As has been mentioned before, the accessibility of genes within the chromatin is a general mechanism of gene expression repression (Cremer and Cremer, 2001). However, chromatin structure is a field under investigation that has shown that chromatin has many levels of control (Lanctôt, et al., 2007). A starting point to exemplify the complexity of chromatin is its super-coiling structure, with its seven levels of packaging that, according to the above, function as a basal repressive steric barrier. Other gene expression regulatory mechanisms at the chromatin level are chromosome territories, acetylation and methylation of histones, interaction with the nuclear (actin) matrix, translocation of genes, etc. In general, all these processes promote or inhibit gene activity by favoring or inhibiting diffusion of the reactants (Misteli, 2001).
In addition to these three-dimensional in-homogeneities, there is another level of gene regulation that is far beyond the scope of current models. This additional complexity comes from the fact that some of the processes previously described, such as the activation of TFs and the methylation and acetylation of histones, are precisely regulated by cytoplasmic events called signal transduction. Cell signal transduction is a large and very important area of study under intense research that is beyond the scope of this work. However, the emergent aspect to notice here is that the previously described inhomogeneous diffusive processes and the cell signal transduction processes communicate with each other through a physical barrier: the nuclear envelope (Auboeuf, et al., 2007). This physical barrier controls with high precision the fluxes of molecules between localized processes and signal transduction by means of nuclear pores, which often use active transport (against the concentration gradient, an energy-dependent process). Hence, multi-compartment modeling would be applicable if simultaneous data on gene activity and signal transduction were available. This could be modeled in two ways: with reaction-diffusion models or with ordinary differential equations (ODEs, from here on) with time delays. Reaction-diffusion models are usually based on partial differential equations (although some other approaches, like cellular automata, have been used) and have proved to be very useful when adequate data are available. However, multi-compartment fluxes coupled with reaction-diffusion are, in general, difficult to model. On the other hand, time-delayed models are an alternative for modeling the net effect of complex processes such as transport and localization of metabolites.
The good performance of time-delayed models comes at the price of not representing the precise mechanics of the original system (kinetic constants); instead they represent its global behavior and structure through a semi-parametric model (they have been called phenomenological models). However, depending on the goals of a research project, this can be very useful.

2. Reverse engineering and modelling of genetic network modules

2.1 Related work

The task of recovering the wiring between the elements of a given system and the rules governing their interaction, using data on its dynamical behavior, is known as reverse engineering. In the context of functional genomics, it means finding out which genes regulate which others, how, and under which circumstances. In other words, it is the task of finding the topology of the gene regulatory network (GRN) of a particular cell line. Several experimental works and theoretical approaches have been developed in this field. Since there are good reviews on the reverse engineering of gene regulatory networks, in the present chapter I will only briefly describe some of these works, starting with some common agreements in the field, followed by a description of the most important theoretical approaches.

2.2 General concepts

Within a given cellular system, gene activity influences directly or indirectly the activity of other genes. This system can be unicellular or pluricellular; in the latter case gene activity in one cell influences gene activity in another cell, as occurs in a tissue.
These interactions can also span the cell life cycle, relaying their influence to the offspring cells. Moreover, this influence can be direct (e.g. through miRNA) or indirect, through the activity of the proteic sub-products of genes such as TFs or protein kinases. However, most often the activity of one gene influences the activity of another via the activity of a third gene. The reason for this behavior is that cellular events occur sequentially, creating orchestrated cascades of gene activation, as occurs in differentiation processes. Additionally, activation of genes does not occur only sequentially but also in parallel, meaning that often one gene influences more than one other gene. Some genes behave as hubs, where one gene influences several target genes. The last part of this entangled process is its control, where the activity of the subsequently activated genes in turn feeds back to influence the activity of the firstly activated genes. When this feedback is positive, an amplification occurs, which is often used by the cellular system as a bi-stability control switch (Smolen and Baxter, 2000). When the feedback is negative, a dampening occurs, which is often used by cells for oscillatory or periodic behaviors such as the cell cycle or circadian rhythms (Smolen and Baxter, 2000). In this way a network of precise regulations emerges, called a gene regulatory network (GRN). In the systems biology community, the set of interactions (wiring and rules) governing the behavior of a GRN is known as the network topology. The task of finding this network topology from data on the dynamics of a given GRN is the goal of reverse engineering. Usually, in order to perform the reverse engineering (RE, from here on) task, a formalism to represent the original system dynamics is needed; this is called a model. The modeling of a given dynamic cellular system is very often not clearly distinguished from the task of reverse engineering that system.
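The two feedback behaviors just described can be illustrated with a minimal sketch: a hypothetical two-gene loop in which gene A activates gene B and gene B represses gene A, integrated with a forward Euler scheme. The Hill-type rate laws and all parameter values here are illustrative assumptions, not taken from the cited works.

```python
def simulate(steps=8000, dt=0.01):
    """Euler-integrate a hypothetical two-gene negative feedback loop:
    gene A activates gene B, while gene B represses gene A.
    Hill-type rate laws; all parameter values are illustrative."""
    a, b = 2.0, 0.0          # initial expression levels
    trajectory = []
    for _ in range(steps):
        da = 1.0 / (1.0 + b ** 4) - 0.1 * a      # B represses A; A decays
        db = a ** 4 / (1.0 + a ** 4) - 0.1 * b   # A activates B; B decays
        a += dt * da
        b += dt * db
        trajectory.append((a, b))
    return trajectory
```

Because the loop is negative, the two levels spiral into a single fixed point through damped oscillations; rewriting the first rate law so that B activates A (a positive loop) would typically yield two stable states, the bi-stability switch mentioned above.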
A model is a representation of a system that helps us to understand its complexity. If the model is validated, e.g. by a cross-validation technique, then it can be used to make predictions about the original system. This is known as the predictive power (van Someren, et al., 2002) of the model. However, to model a system one needs to know the wiring and rules of its topology. Since we usually do not have complete information about this topology, if there is information about the dynamics of the system one can use the model to infer the missing topology. If the model is a good representation of the original system, it is more likely to make correct inferences. This capability is known as the inference power (van Someren, et al., 2002) of the model. To perform the reverse engineering task based on dynamical data, series of gene expression data over time (microarrays or RT-PCR) are used to represent the dynamics of a GRN, which could be as large as the entire genome of a cell line. In this context every gene of the GRN is known as a node (n) of the network. The number of elements wired to a particular node is known as the connectivity (K) of the node. The level of transcription of a given gene is known as its activity state. The set of all (N) activity states measured at a given time point represents one transcriptional state of the system. In turn, a set of system states measured at successive time points represents a trajectory of the cellular system. Work started by Kauffman in the late 60s (Iguchi, et al., 2007; Kauffman, 1969; Kauffman, 2004; Kauffman, et al., 2004; Socolar, 2003) proposed to view periodic behaviors, such as the cell cycle of a given cell line, as an attractor of the entire organism's genome.
A phenomenological description of an attractor is the set of system states in which a system tends to exist. That is, independently of its initial state, a system will tend to move towards the closest attractor. This implies that a system can have more than one attractor. The vicinity of an attractor, in which it exerts its influence (like the mouth of a funnel), is known as its basin of attraction. In this view, a multicellular organism is the entire system, and its different cell lines represent different attractors1 of that organism.

1 There are different kinds of attractors: the fixed-point attractor (the simplest), limit cycles, toroids and strange attractors. Here, the analogy is between the cell cycle and limit-cycle attractors.

When a system is observed evolving from a given initial state until it reaches an attractor, the states along the trajectory are called transition states. From this perspective, the ultimate goal of the RE of GRNs is to acquire the knowledge necessary to perturb a given cellular system so as to move it away from its original attractor towards another, desired one. For instance, the goal could be to selectively drive cells belonging to a cancer attractor into a desired attractor, such as apoptosis (programmed cell death).

2.3 Dimensionality reduction by data selection

When using microarray time series, the system is highly underdetermined. There is a large lack of data in different dimensions, such as the observational time window, the granularity of measurements, the diversity of conditions (stimulus-response curves for different stimuli and/or conditions), repetitions, etc. Therefore, in order to reduce the dimensionality of the system to be analyzed, several works circumscribe it to the set of genes responding to a given stimulus.
The criteria used to select those responding genes have evolved over the last years, and since this is a crucial step for the reverse engineering of gene regulatory networks, some of them, like those tested in this work, will be discussed below.

Threshold data selection from complete data sets

A common practice among biologists to reduce the gene data to be analyzed is to take into account only those genes whose expression has changed after a specific cellular stimulus by at least two-fold, or to delete genes with incomplete data or those whose standard deviation does not exceed an arbitrary threshold. Besides the fact that the choice of a threshold will always be arbitrary, this approach faces two main drawbacks: (i) after correcting for multiple hypothesis testing, no individual gene may meet the threshold for statistical significance, because the relevant biological differences are modest relative to the noise inherent in microarray technology; (ii) alternatively, one may be left with a long list of statistically significant genes without any unifying biological theme, whose interpretation can be daunting and ad hoc, being dependent on the biologist's area of expertise.

Functional modules

Due to these two problems of threshold filtering, I will discuss here different approaches oriented towards reducing the dimensionality of the data. Three of them are based on circumscribing the reverse engineering and modeling process to a smaller functional module or cellular function. In this case the first task is to isolate the smallest number of state variables able to describe the given cellular module without missing important information. This module should therefore be functionally self-sufficient, and the available data should reflect the orthogonality (Lipan, 2005) of this module.
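Before turning to these approaches, the simple threshold selection described at the beginning of this section can be sketched in a few lines. The data layout (gene name mapped to a list of expression values) and the default cutoffs below are hypothetical.

```python
import math

def threshold_filter(expr, min_fold=2.0, min_std=0.25):
    """Keep genes whose maximal fold change over the time course is at
    least `min_fold` and whose standard deviation is at least `min_std`.
    `expr` maps gene name -> list of (positive) expression values.
    Genes with missing values (None) are dropped, mirroring the common
    practice described in the text."""
    kept = {}
    for gene, values in expr.items():
        if any(v is None for v in values):
            continue                                  # incomplete data: discard
        lo, hi = min(values), max(values)
        fold = hi / lo if lo > 0 else float("inf")
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        if fold >= min_fold and std >= min_std:
            kept[gene] = values
    return kept
```

Both drawbacks discussed above apply regardless of implementation: the cutoffs remain arbitrary, and the surviving list carries no unifying biological theme.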
GSEA algorithm

The gene set enrichment analysis (GSEA) algorithm was proposed (Subramanian, et al., 2005) as a solution to the complex task of isolating the important set of genes related to a particular stimulus-response study. The rationale is that single-gene analysis may miss important effects on pathways, because cellular processes often affect sets of genes acting in concert. The authors point out that an increase of 20% in all genes encoding members of a metabolic pathway may dramatically alter the flux through the pathway, and may be more important than a 20-fold increase in a single gene. GSEA evaluates microarray data at the level of gene sets. The gene sets are defined based on prior biological knowledge, e.g. published information about biochemical pathways or co-expression in previous experiments. The goal of GSEA is to determine whether the members of a gene set tend to occur towards the top (or bottom) of a ranked gene list, in which case the gene set is correlated with the phenotypic class distinction.

Experimental isolation of modules through periodic stimulus

An experimental approach to the problem of isolating functional genetic modules has been proposed (Lipan, 2005), using oscillatory inputs to analyze the response of a cellular system. Lipan and Wong propose that an oscillatory input has many advantages: (i) the measurements can be extended over many periods, so the signal-to-noise ratio can be dramatically improved; (ii) the measurement can start after transient effects subside, so that the data become easier to incorporate into a coherent physical model; and (iii) an oscillatory stimulus has more parameters (period, intensity, slopes of the increasing and decreasing regimes of the stimulus) than a step stimulus. As a consequence, the measured response contains much more quantitative information.
The genes that interact with the driven gene will be modulated by the input frequency. The rest of the genes will have different expression profiles dictated by the internal parameters of the biological system. This point of view is supported by their findings. Lipan and Wong note that the measured data can be expressed as a sum of exponentially decaying functions, e^(-λt), if a step stimulus is used, while for a periodic input the response contains only exponentials with imaginary argument, e^(iωt). Mathematically, the main difference between exponentials with real arguments and those with imaginary arguments is that the former cannot form an orthogonal basis of functions, whereas such a basis can be formed with the latter. Therefore they propose that, in general, the response of the network to a step input will be a sum of components that are not orthogonal to each other. The time dependence of these non-orthogonal components can be more complex than an exponential function; they can contain polynomials in time or decaying oscillations, depending on the position in the complex plane of the eigenvalues of a transfer matrix H. In contrast, the permanent response obtained from a periodic input is a sum of Fourier components that form an orthogonal set. Orthogonal components are much easier to separate than non-orthogonal ones. This mathematical difference explains the advantage of using oscillatory inputs.

Clustering

Cluster analysis is an exploratory data analysis technique which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
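The grouping criterion just described can be sketched with a plain k-means procedure over expression profiles. The deterministic initialization and the squared Euclidean distance are illustrative simplifications, not the choices of any particular cited work.

```python
def kmeans(profiles, k, iterations=20):
    """Plain k-means over expression profiles (equal-length lists of
    floats): repeatedly assign each profile to the nearest centroid
    (squared Euclidean distance) and recompute each centroid as the
    per-time-point mean of its members."""
    centroids = [list(p) for p in profiles[:k]]   # simple deterministic init
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        for idx, p in enumerate(profiles):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((u - v) ** 2 for u, v in zip(p, centroids[c])),
            )
        for c in range(k):
            members = [p for i, p in enumerate(profiles) if assignment[i] == c]
            if members:   # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

The returned centroids are the per-time-point means that cluster-based dimensionality reduction uses as representatives of each group of genes.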
Some works, like that of Wahde et al. (Wahde, 2000), have followed the logical step of capturing the global behavior of the entire cellular system by representing it through clusters instead of particular genes, thereby reducing the dimensionality of the system to be reverse engineered. However, in their work Wahde et al. assume that genes belonging to the same cluster share the same function. Their approach therefore uses the centroid of each cluster (the per-time-point mean of the gene expression profiles in the cluster) to represent the function related to that group of genes. In this way the entire genome is represented by the biological functions that respond to the given stimuli. However, this approach faces two issues. The first is related to cluster interpretation: genes belonging to a particular cluster may or may not share the same function. More likely, genes belonging to a given cluster share the same regulation, but they may belong to different biological functions. The second issue is that, in the end, it is not possible to map the centroid to any particular gene. Therefore the practical use of centroids to represent the entire genome is limited.

2.4 Theoretical works

2.4.1 Boolean Networks

Almost 40 years ago, Kauffman (1969) developed this model to represent genetic networks. It is based on the assumption that genes exist in basically two possible states: active (ON) or inactive (OFF). The activity state of each gene at a given time is determined by a Boolean function (built from AND, OR and NOT) of its inputs (wired nodes) at the previous time step. It is assumed that each gene is controlled by K other genes in the network. For these models, the connectivity K is a very important parameter determining the network dynamics (with large K, the dynamics tends to be more chaotic).
In Random Boolean Network models, these K inputs, and a K-input Boolean function, are chosen at random for each gene. The Boolean variables (ON/OFF states of the genes) at time t+1 are determined by the state of the network at time t through the K inputs, as well as by the logical function assigned to each gene. An excellent tool for calculating and visualizing the dynamics of these networks is the DDLAB software (Wuensche, 1998). Under this approach the total number of expression patterns is finite; therefore the system will eventually return to an expression pattern that it has visited earlier. Since the system is deterministic, it will then keep following exactly the same cycle of expression patterns. This periodic state cycle is the previously defined attractor of the network. Another assumption made by Kauffman is that gene regulatory networks could have a structure similar to random Boolean networks (RBN, from here on) and should therefore share some common features. For instance, the number of distinct attractors of a Random Boolean Network tends to grow as the square root of the number of nodes (genes) (Kauffman, et al., 2004). If we equate the attractors of the network with individual cell types, as Kauffman suggests, this explains why a large genome of a few billion base pairs is capable of producing only a few hundred stable cell types. This convergent behavior implies immense complexity reduction, convergence and stabilization in networks of constrained architecture. In this way, with this model it is possible to correlate the size and number of attractors of an RBN with the number of cell lines of an organism and the size of its genome in a predictive fashion. These correlations are possible for a variety of species, according to the scaling coefficients found in this work.
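The finite-state argument above can be made concrete with a small sketch: build a random Boolean network (each of N nodes gets K random inputs and a random K-input truth table) and iterate it from an initial state until a state repeats; the segment between the first and second visits is the attractor cycle. The network size and seed below are arbitrary.

```python
import random

def random_boolean_network(n, k, seed=0):
    """Build a Random Boolean Network: each node gets K random inputs
    and a random K-input Boolean function encoded as a truth table.
    Returns the synchronous update function on state tuples."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]

    def step(state):
        return tuple(
            tables[i][sum(state[j] << b for b, j in enumerate(inputs[i]))]
            for i in range(n)
        )
    return step

def find_attractor(step, state):
    """Iterate until a previously visited state recurs; return the
    periodic state cycle (the attractor)."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state)
    return trajectory[seen[state]:]
```

A fixed-point attractor corresponds to a returned cycle of length one; longer cycles correspond to the periodic expression patterns described above.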
Another insight into the behavior of large regulatory networks is that, given analogous Boolean functions (Kauffman, 1971) and similar connectivity, RBNs as well as biological systems tend to exhibit either a maximum or a minimum of organization. As microarray technology recently began to generate genome-scale data, this model has been re-utilized to analyze the dynamical properties of large GRNs. Some of these works utilized Boolean Networks (BN) to infer GRNs from real data. Liang and Somogyi (Liang, et al., 1998) developed the REVEAL algorithm, which uses the Shannon entropy of state transitions between nodes to compute the mutual information between them; in this way, using a full search approach, they could define the transition rule table that reconstructs the original network topology. The main drawbacks of this algorithm are the general criticism of BNs, namely that gene expression is represented by only two states (ON or OFF), and the assumed low connectivity K. Akutsu (Akutsu, et al., 2000) developed a series of different algorithms based on BNs to analyze the sample complexity of several network variants, including noisy Boolean networks. He proved, using a conceptually simpler approach, that O(log2 N) random measurements are sufficient to identify a network of N genes with bounded connectivity K. This means that for a data set with 1000 genes and a connectivity K = 2, on the order of only 10 independent measurements were sufficient for his algorithm to infer the network topology. However, this approach has some strong drawbacks. The first is that its exhaustive search engine would require O(10^10) units of time. The second is that the connectivity of real GRNs is usually larger. And the last drawback is a general one for any Boolean network model: gene expression exists in more than two ON/OFF states, and very often the information processing of a GRN depends on continuous levels of expression of its genes.
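The information-theoretic quantities REVEAL builds on can be sketched directly from their definitions. This is only the Shannon entropy and mutual information computation over discretized (ON/OFF) gene states, not the REVEAL search itself.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    counts = Counter(xs)
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) over paired observations, the
    quantity REVEAL uses to relate a gene's state to its inputs."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
```

A gene whose next state is fully determined by another gene's current state yields maximal mutual information between the two binary sequences; statistically unrelated sequences yield values near zero.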
2.4.2 Differential equation systems

To overcome some of the limitations of BNs, other works have modeled GRNs on a continuous gene expression basis, using ordinary differential equations (ODEs). A set of ODEs, one for each gene, describes gene regulation as a function of other genes:

dx_i/dt = f_i(x_1, ..., x_N, u, θ_i)    (2.1)

where x_i(t) is the concentration of transcript i measured at time t, θ_i is a vector of parameters describing interactions among genes (the edges of the graph), i = 1 ... N, N is the number of genes and u is an external perturbation to the system. As ODEs are deterministic, the interactions among genes represent causal interactions, and not statistical dependencies as in other methods. To reverse-engineer a network using ODEs means to choose a functional form for f_i and then to estimate the unknown parameters θ_i for each i from the gene expression data D using some optimization technique. With an ODE-based approach, signed directed graphs are obtained, and the approach can be applied to both steady-state and time-series expression data. Another advantage of ODE approaches is that once the parameters θ_i are known for all i, equation (2.1) can be used to predict the behavior of the network under different conditions (i.e. gene knockout, treatment with an external agent, etc.), which was referred to before as prediction power. There are many different approaches that fall under the differential equation umbrella; I will just briefly describe the major categories into which particular models can be assigned and mention some examples: generalized additive models, recurrent neural networks, S-systems and pair-wise equations. Historically, systems of differential equations have long since proved their validity in modeling simple gene regulation systems. An example is the work of Mjolsness et al.
(1991), which used a hybrid homogeneous and 2D spatial reaction-diffusion approach to model a small number of genes involved in pattern formation during the blastoderm stage of development in Drosophila (Reinitz, 1995). In this work, the change in expression levels at each time point depended on a weighted sum of inputs from other genes, and on diffusion from neighboring "cells". Synchronized cell divisions along a longitudinal axis (under the control of a maternal clock) were alternated with updates of the gene expression levels. This model was able to successfully reproduce the pattern of eve stripes in Drosophila, as well as some mutant patterns on which the model was not explicitly trained. After this model was introduced, several other models using a formalism similar to the so-called connectionist model (Mjolsness et al., 1991) were developed. Some examples are the linear model (D'Haeseleer, et al., 1999), the linear transcription model (Chen, et al., 1999), the weight matrix model (Weaver, et al., 1999), etc. It has therefore been proposed (D'Haeseleer, 2000) to unify all of them by their common additive nature and classify them as so-called generalized additive models. Here I will briefly describe some of them.

Linear system

Ultimately, it is possible to interpret these models as a multiple regression process:

dx_i/dt = Σ_j w_ji x_j(t) + b_i    (2.2)

where x_i is the level of expression of gene i at time t, b_i is a bias term indicating the basal expression of gene i when the summation of regulatory inputs is zero, and the weight w_ji indicates the strength of the influence of gene j on the regulation of gene i. Therefore, given an equidistant time series of expression levels (or an equidistant interpolation of a non-equidistant time series), it is possible to use linear algebra to find the least-squares fit to the data. Chen et al.
(1999) presented a number of linear differential equation models, which included both mRNA and protein levels. They showed how such models can be solved using linear algebra and Fourier transforms. Interestingly, they found that mRNA concentrations alone are not sufficient to solve their model without at least the initial protein levels. Conversely, their model can be solved given only a time series of protein concentrations. D'Haeseleer (D'Haeseleer, et al., 2000) showed that even a simple linear model can be used to infer biologically relevant regulatory relationships from real data sets. He applied a linear model to the rat central nervous system differentiation data. It is one of the few works developed on RT-PCR data, which are considerably more quantitative than microarray data. In his work, he merged two data sets to obtain a new data set with 65 genes and 28 time points. However, the dimensionality problem still appeared, with several models able to fit the data equally well. To alleviate this problem he used a spline interpolation scheme to obtain more data points. Van Someren et al. (van Someren, et al., 2000) combined a linear model with clustering techniques to propose a solution to the dimensionality problem. They applied their model to the yeast cell cycle data from Spellman (Spellman, et al., 1998) and showed that, by working with the centroids of the clustered data, it was possible to reconstruct a global network of the yeast cell cycle. However, as previously mentioned, this approach has the drawback of not being able to correlate a particular gene with a given cellular function.

Recurrent neural networks

As has been said, one of the pioneering works in this area using differential equations is that of Mjolsness et al. (1991). This work, as well as a series of other developments, can be seen as a recurrent neural network.
Some advantages of seeing these works from this perspective are that the artificial intelligence community has developed a series of good optimization methods for this kind of recurrent neural network, such as back-propagation through time and global optimization with genetic algorithms (GA). Moreover, some works on continuous time recurrent neural networks (CTRNN, from here on) have proved that, given enough data, there is just one model that best fits the data. This uniqueness property (Albertini and Sontag, 1993) is highly desirable and, in the RE of GRNs, has been called consistency. Weaver et al. (1999) showed how a non-linear transfer function can be incorporated into a linear model, and demonstrated that some randomly generated networks can be accurately reconstructed using this modeling technique. To handle the dimensionality problem, Weaver proposed the use of the Moore-Penrose pseudo-inverse. This special matrix inverse produces a solution for underdetermined problems that minimizes the sum of the squared weights while still perfectly fitting the data. To impose a limited connectivity, he proposed a greedy backward search that iteratively sets the smallest weight to zero and then recomputes the pseudo-inverse on the now slightly less underdetermined problem. However, this last technique is extremely sensitive to noisy data. Wahde and Hertz (Wahde, 2000), inspired by the work of Mjolsness, utilized a continuous time recurrent neural network model of the form:

dY_i/dt = (1/τ_i) [σ(Σ_j W_ij Y_j + θ_i) - Y_i],  i = 1, ..., N    (2.3)

to represent artificial as well as clustered GRNs, where the 1/τ_i are rate constants, Y_i are the gene expression levels, dY_i/dt their time derivatives, θ_i the basal expression levels of the genes, N the number of genes modeled, W_ij an N×N weighting matrix, and σ the logistic sigmoid transfer function σ(x) = (1 + e^(-x))^(-1). Wahde and Hertz utilized a genetic algorithm as the global optimization technique for the parameters and represented the same CNS rat data from Wen by only four clusters. Additionally, using artificial data, they showed that for the RE task it is better to have multiple shorter time series than one long series. Again, the major drawback of this work is that just four cellular functions, without any specific genes, were taken into account. Recently, Hu (Hu, 2005) introduced a time delay term into the Wahde formulation in order to account for additional processes, such as translation or diffusion, that could delay the response of a gene's activity and its influence upon its target genes. They utilized the back-propagation through time method to globally optimize their parameters. In their work they reproduced the kinetics of six genes of the SOS DNA repair system of E. coli, using 50 time points. In this way their model could recover seven of nine experimentally reported regulations, as well as suggest six additional ones to be tested. A drawback of this approach is the interpretation of the RNA decay, which is misinterpreted as the 1/τ term in a formalism similar to equation 2.3.

S-system

S-systems (synergistic and saturable systems) have long been used (Savageau, 1969) as models of biochemical pathways, genetic networks and immune networks (Akutsu, et al., 2000). S-systems are a class of non-linear ordinary differential equations and have the form:

dX_i/dt = α_i Π_{j=1..n} X_j^(g_ij) - β_i Π_{j=1..n} X_j^(h_ij)    (2.4)
4 j =1 where n is the number of state variables or reactants Xi (X expressed in concentration), and i,j (1 ≤ i,j ≤ n) are suffixes of state variables. The terms gij and hij are the interactive effect of Xj to Xj. The first and second terms represent all influences that increase and decrease Xi, respectively. The constants, gij, and hij are exponential parameters referred to as kinetic orders. S–Systems have unique mathematical properties allowing large realistic phenomena to be investigated and can be derived from general mass balance equations by aggregating inputs and outputs approximated by the products of power–law functions. Each dimension of the S–System model represents the dynamics of a single variable represented as the difference of two products of power–law functions, one describing the influxes and the other describing the effluxes. The major disadvantage of the Ssystem is the large number of parameters to be estimated: 2X(X+1). Reverse engineering of genetic networks with time delayed recurrent neural networks and clustering techniques 43 Pair-wise equations Another way to overcome the so-called dimensionality problem is to restrict the complexity of the model by only considering pair-wise relationships. Apparently, Arkin (McAdams and Arkin, 1997) was the first to suggest the use of time-shifted pair-wise correlations to model biochemical pathways. Initially, the position and magnitude at which the maximal time-shifted cross-correlation occurs is computed in a pair-wise fashion. Then a distance measure is constructed to perform hierarchical clustering obtaining a linked tree of associated genes. Finally the model is completed with information about directionality and time lags, and in turn it can provide information about the dynamics of the system. A further development comes with the work from Chen, who utilize a similar approach but instead of using the correlation he performed the matching of peaks in the expression profiles of genes. 
His algorithm performed threshold filtering followed by a clustering step, obtaining profiles of sets of expression peaks. Then peaks in the profiles are compared in a pair-wise fashion to determine causal activation scores. From these scores a putative regulation network is constructed by optimizing it with a simulated annealing approach. Another related approach, which combines the logical rules of Boolean network models with some of the advantages of differential equation methods, are "Glass networks". Glass networks have been proposed as a simplified model of genetic networks (Edwards and Glass, 2000) as well as an underlying model for the reverse–engineering of regulatory networks (Perkins, et al., 2006). The main drawback of these models is the low connectivity K they are limited to: basically single and, in some advanced cases, two pair interactions. As a general criticism, it could be said that differential equations presuppose that the concentrations of chemical species change continuously and deterministically, both of which assumptions may be questionable in the case of gene regulation (Gibson and Mjolsness, 2001; Gillespie, 1977; McAdams and Arkin, 1999; Szallasi, 1999). Against the first assumption is the fact that some of the components of GRN act in small numbers of molecules, such as the transcription factors in the cell nucleus and the single DNA molecule carrying the gene, compromising the continuity assumption. Second, the deterministic change presupposed by the use of the differential operator dy/dt may be questionable due to fluctuations in the timing of cellular events, such as the delay between start and finish of transcription. As a consequence, two regulatory systems having the same initial conditions may enter into different states.
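The time-shifted pair-wise correlation idea described above can be sketched as follows; the two expression profiles and the two-step delay are invented for illustration:

```python
# Sketch of time-shifted pair-wise correlation: for each candidate lag,
# compute the Pearson correlation between one profile and a shifted copy
# of the other, and keep the lag with the largest absolute correlation.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def best_lag(x, y, max_lag):
    """Return (lag, correlation) maximizing |corr(x[t], y[t + lag])|."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        r = pearson(x[:len(x) - lag], y[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# gene y follows gene x with a delay of two time points (toy data)
x = [0, 1, 4, 9, 16, 9, 4, 1, 0, 0]
y = [5, 5] + [v + 5 for v in x[:-2]]
lag, r = best_lag(x, y, 4)
```

The recovered lag and its sign then feed the clustering and directionality steps described above.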
2.4.3 Stochastic Models
Stochastic models of gene regulatory processes claim to remedy many of the drawbacks of deterministic (mainly differential equation based) approaches. One such shortcoming is the assumption of a continuous rate of mRNA production. Typically, transcription factors exist at very low concentrations in a cellular system, and this is not well represented by continuous models such as differential equations. In fact, as discussed in the introduction, mRNA as well as proteins are not produced at a continuous rate, but rather in short bursts (McAdams and Arkin, 1997). In addition, some mechanisms of transcriptional regulation are known to amplify noise, creating heterogeneity within a population. With the addition of noise in gene transcription, individual cells may take different regulatory paths despite having the same regulatory input (Guet, et al., 2002). It is very likely that evolution has selected networks which can produce deterministic behaviors from stochastic inputs in a noisy environment. In fact, certain network topologies can attenuate the effects of noise (such as the mentioned control loops) (Rao, et al., 2007), and noise can indeed act as a stabilizer itself in other systems (Hasty, et al., 2000). There are generally two methods for modeling stochastic gene regulation. The first is stochastic differential equations:

dY_i/dt = f_i(Y_i) + ν_i(t)        (2.5)

This equation gives the form of a stochastic differential equation that explicitly models noise in the system through the term ν_i(t). It is often referred to as the Langevin equation and in general is not analytically tractable. Typically, solutions to Langevin equations are obtained through the use of Monte–Carlo algorithms.
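A common way to simulate such Langevin-type equations numerically is the Euler–Maruyama scheme, sketched here for a single gene with an assumed linear decay term f(Y) = −kY and Gaussian white noise; all parameter values are illustrative:

```python
# Euler-Maruyama sketch of the Langevin-type equation (2.5) for one
# gene: dY = -k*Y dt + sigma dW. Not a model from the text; the decay
# term and the rates are invented for illustration.

import random

def euler_maruyama(k, sigma, y0, dt, steps, seed=0):
    """Monte-Carlo simulation of dY = -k*Y dt + sigma dW."""
    rng = random.Random(seed)
    y = y0
    path = [y]
    for _ in range(steps):
        dw = rng.gauss(0.0, dt ** 0.5)   # Wiener increment ~ N(0, dt)
        y = y + (-k * y) * dt + sigma * dw
        path.append(y)
    return path

path = euler_maruyama(k=0.5, sigma=0.1, y0=1.0, dt=0.01, steps=1000)
```

With sigma = 0 the scheme reduces to deterministic forward Euler and decays towards zero; with noise, repeated runs with different seeds give the Monte–Carlo ensemble from which statistics are computed.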
The conditions under which the approximation is valid may not always be possible to satisfy in the case of genetic regulatory systems (de Jong, 2002). The second approach is to characterize the transitions of a molecule using probability functions. During each individual time step, a molecule is given a certain probability of transitioning to a different discrete state. From this, a probability density function for the behavior of the system can be obtained. Such a description is referred to as the "master equation". It has been proposed to disregard the so-called master equation altogether and to directly simulate the time evolution of the regulatory system. This idea underlies the stochastic simulation approach developed by Gillespie (Gillespie, 1977; Gillespie, 1992). Although stochastic models are often more realistic than their deterministic counterparts, they are expensive to simulate. In fact, for many realistically sized systems, stochastic approaches are impractical (Swain, et al., 2005). However, stochastic models of gene regulation have been successfully used by Keasling (Keasling, et al., 1995), Arkin (Arkin, et al., 1998) and Kastner (Kastner, et al., 2002), just to mention a few examples. Recently, significant efforts have been made to reduce the computational cost of the Gillespie algorithm (Cao and Gillespie, 2006; Slepoy, et al., 2008). However, their use in the RE of GRN area has not been assessed.
2.4.4 Bayesian networks
Bayesian networks are probabilistic models. They model the conditional independence structure between the genes in the network. Edges in a Bayesian network correspond to probabilistic dependence relations between nodes, described by conditional probability distributions.
The distributions used can be discrete or continuous, and Bayesian networks can be used to compute likely successor states for a system in a known state. A formal definition of Bayesian networks is: a Bayesian network is a directed, acyclic graph G = (X, A), together with a set of local probability distributions P. The vertices X = {X1, ..., Xn} correspond to variables, and the directed edges A represent probabilistic dependence relations between the variables. If there is an arc from variable Xi to Xj, then Xj depends probabilistically on Xi. In this case, Xi is called a parent of Xj. A node with no parents is unconditional. P contains the local probability distributions of each node Xi conditioned on its parents, p(Xi | parents(Xi)) (Radde and Kaderali, 2007). In the formalism of Bayesian networks (Friedman, et al., 2000), the structure of a genetic regulatory system is modeled by a directed acyclic graph G = (V, E). The vertices i ∈ V, 1 ≤ i ≤ n, represent genes or other elements and correspond to random variables Xi. If i is a gene, then Xi describes the expression level of i. For each Xi, a conditional distribution p(Xi | parents(Xi)) is defined, where parents(Xi) denotes the variables corresponding to the direct regulators of i in G. In this approach, the conditional independency i(Xi; Y | Z) expresses the fact that Xi is independent of Y given Z, where Y and Z denote sets of variables. The graph encodes the Markov assumption, stating that for every gene i in G, i(Xi; nondescendants(Xi) | parents(Xi)). By means of the Markov assumption, the joint probability distribution can be decomposed into:

p(X) = Π_{i=1}^{n} p(Xi | parents(Xi))        (2.6)

The resulting graph of a Bayesian network implies additional conditional independencies.
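The factorization of equation 2.6 can be illustrated on a toy three-node network; the structure and the conditional probability tables below are invented for the example:

```python
# Sketch of the factorization p(X) = prod_i p(X_i | parents(X_i))
# (equation 2.6) for a toy network X1 -> X2, X1 -> X3 with binary
# variables. All probabilities are invented for illustration.

parents = {"X1": [], "X2": ["X1"], "X3": ["X1"]}

# p(node = 1 | parent values), keyed by the tuple of parent values
cpt = {
    "X1": {(): 0.6},
    "X2": {(0,): 0.2, (1,): 0.9},
    "X3": {(0,): 0.5, (1,): 0.3},
}

def joint(assign):
    """Joint probability of a full assignment (node -> 0/1) via eq. 2.6."""
    p = 1.0
    for node in parents:
        key = tuple(assign[pa] for pa in parents[node])
        p1 = cpt[node][key]
        p *= p1 if assign[node] == 1 else 1.0 - p1
    return p

# the factorized joint sums to one over all 2^3 assignments
total = sum(joint({"X1": a, "X2": b, "X3": c})
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Structure learning, as discussed below, amounts to searching over the `parents` structure (and its conditional tables) for the decomposition that best explains the observed data.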
Two graphs, and hence two Bayesian networks, are said to be equivalent if they imply the same set of independencies. The graphs in an equivalence class cannot be distinguished by observation on X. Equivalent graphs can be formally characterized as having the same underlying undirected graph, but may disagree on the direction of some of the edges; see Friedman (Friedman, et al., 2000) for details and references. Given a set of expression data D in the form of a set of independent values for X, learning techniques for Bayesian networks have allowed several works to infer the network, or rather the equivalence class of networks, that best matches D. These learning techniques rely on a matching score to evaluate the networks with respect to the data and search for the network with the optimal score. As this optimization problem is known to be NP-hard, heuristic search methods have to be used, which are not guaranteed to lead to a globally optimal solution. An additional problem, however, is that currently available expression data underdetermine the network, because just a few dozen experiments provide information on the transcription level of thousands of genes. Friedman and colleagues (Friedman et al., 2000) proposed a heuristic algorithm for the inference of Bayesian networks from expression data that is able to deal with this so-called dimensionality problem. Instead of looking for a single network, or a single equivalence class of networks, they focus on features that are common to high-scoring networks. In particular, they look at Markov relations and order relations between pairs of variables Xi and Xj. A Markov relation exists if Xi is part of the minimal set of variables that shields Xj from the rest of the variables, while an order relation exists if Xi is a parent of Xj in all of the graphs in an equivalence class.
An order relation between two variables may point at a causal relationship between the corresponding genes. Statistical criteria to assess the confidence in the features have been developed. A recent extension of the method (Pe'er, et al., 2001) is able to deal with genetic mutations and considers additional features, like activation, inhibition, and mediation relations between variables. Markov relations and order relations have been studied in an application of the algorithm to the cell cycle data set of Spellman and colleagues (Spellman, et al., 1998) (see Pe'er et al. [2001] for another application). This data set contains 76 measurements of the mRNA expression level of 6,177 S. cerevisiae ORFs in time series obtained under different cell cycle synchronization methods. The Bayesian induction algorithm has been applied to the 800 genes whose expression level varied over the cell cycle. By inspecting the high-confidence order relations in the data, Friedman and colleagues found that only a few genes dominated the order, which indicates that they are potential regulators of the cell cycle process. Many of these genes are known to be involved in cell-cycle control and initiation. Of the high-confidence Markov relations, most pairs are functionally related. Some of these relations were not revealed by the cluster analysis of Spellman and colleagues. A Bayesian network approach towards modeling regulatory networks is attractive because of its solid basis in statistics, which enables it to deal with the stochastic aspects of gene expression and noisy measurements in a natural way. Moreover, Bayesian networks can be used when only incomplete knowledge about the system is available.
Although Bayesian networks and graph models are intuitive representations of genetic regulatory networks, their disadvantage is that they leave the dynamical aspects of gene regulation implicit. To some extent, this can be overcome through generalizations like dynamic Bayesian networks, which allow feedback relations between genes to be modeled (Murphy, 1999). Since dynamic Bayesian networks (DBN) are among the more promising approaches in the RE of GRN area, in this work I compare the performance of the here introduced TDRNN model with the performance of DBN and the previously explained CTRNN.

3. Methods

3.1 Workflow
In a few steps, the working scheme can be described as:
a) Time series data acquisition.
b) Data quality control and normalization.
c) Data selection. To reduce the number of state variables, here I propose to focus on a small functional module or cellular function.
d) Data interpolation. This has been applied already in order to reduce the solution space.
e) Data fitting. The parameters of the model are globally optimized by the use of a genetic algorithm (GA) to approximate the dynamics of the selected module.
f) Robust parameter identification. Statistical analysis is performed to define the more likely parameters and consequently the network connectivity.
g) Summarization. The last step is the proposal of a network topology to describe the dynamics of the original functional module.
h) Error calculation. In the case of test data, this error is calculated between the resultant network and the goal benchmark network.
Figure 3.1 Workflow of the reverse engineering of gene regulatory networks.
From left to right: time series data acquisition, normalization and interpolation of the data, multiple regressions to fit the data using a global parameter optimization approach, parameter significance identification, and finally summarization of the results in a network topology graph.

3.2 Data pre-processing, Quality control
Given the many sources of error in microarray technology, a prerequisite for working with experimental microarray data is to check their quality. This includes checking for experimental outliers, discarding them, and checking the statistical uncertainty of the results. In this thesis, quality control of every set of microarray data has been performed with the package simpleaffy (Wilson and Miller, 2005) from Bioconductor (Gentleman, et al., 2004), which is able to detect different sources of error during the different steps of the pipeline of microarray data production. For every set of time series, four different sources of error are checked: a) average background, b) 3'–5' relationship, c) percentage of positive hybridization, and d) scale factor.
a) Average background: should be similar across all chips. There are several reasons for a significant variation in the average background, but generally it is due to experimental problems, like having different concentrations of mRNA in the hybridization cocktails, or a more efficient hybridization in some of the chips with respect to the others.
b) 3'–5' relationship, or early degradation of the mRNA: detected by an abnormal signal from some control probe sets present on the chips. The chips contain probes for some specific genes, which have a particularly long sequence and well-defined degradation kinetics.
A change in these kinetics indicates that an abnormal degradation of the sample has occurred and that measurements on the rest of the probe sets could be lower than they should be. Most cell types ubiquitously express β-actin and GAPDH. These are relatively long genes, and the majority of Affymetrix chips contain separate probesets targeting the 5', mid and 3' regions of their transcripts. By comparing the amount of signal from the 3' probeset to either the mid or 5' probesets, it is possible to obtain a measure of the quality of the RNA hybridized to the chip. If the ratios are high, then this indicates the presence of truncated transcripts. This may occur if the in vitro transcription step has not performed well or if there is general degradation of the RNA. Hence, the ratio of the 3' and 5' signals gives a measure of RNA quality.
c) Percentage of positive hybridization: an indication of how many probes present on the chip exhibit positive hybridization. For a given experiment, one should expect that the global number of positive hybridizations is similar. However, this is not necessarily the case, because usually one is interested in finding the differences between experiments. Therefore, problems with this evaluation should be treated carefully. These percentages of responding genes (Present/Marginal/Absent calls) are generated by looking at the difference between perfect match (PM) and mismatch (MM) values for each probe pair in a probeset. Probesets are flagged Marginal or Absent when the PM values for that probeset are not considered to be significantly above the MM probes. As with scale factors, large differences between the numbers of genes called Present on different arrays can occur when varying amounts of labeled RNA have been successfully hybridized to the chips.
This can occur for similar reasons (differences in array processing pipelines, variations in the amount of starting material, etc.). The '% Present' call simply represents the percentage of probesets called Present on an array. As with scale factors, significant variations in the % Present call across the arrays in a study should be treated with caution. Note that the absolute value is generally not a good metric, because some cells naturally express more genes than others.
d) Scale factor: some normalization packages (e.g. MAS 5.0) adjust the mean value of expression between different chips to the same value. If the scale factors between arrays are large, it is an indication of issues when trying to compare between chips. The default normalization used by MAS 5.0 (and many other algorithms) makes the assumption that gene expression does not change significantly for the vast majority of transcripts in an experiment. (Note that this assumption is also explicit in any analysis that looks for a relatively small number of changing genes within a transcript population containing many thousands; for example, looking for ~200 differentially expressed probesets from the ~54,000 found on the U133 plus 2 array.) One consequence of this is that the trimmed mean intensity for each array should be constant, and by default, MAS 5.0 scales the intensity for every sample so that each array has the same mean. The amount of scaling applied is represented by the 'scale factor', which therefore provides a measure of the overall expression level for an array and (assuming all else remains constant) a reflection of how much labelled RNA is hybridized to the chip. Large variations in scale factors signal cases where the normalization assumptions are likely to fail due to issues with sample quality or amount of starting material.
Alternatively, they might occur if there have been significant issues with RNA extraction, labeling, scanning or array manufacture. In order to successfully compare data produced using different chips, Affymetrix recommends that the scale factors should be within 3-fold of one another.

3.3 Data normalization
The goal of normalization is to make different microarray chips comparable. In general this is achieved by adjusting an average value of every experimental array to equal that of a baseline array; in the case of time series data, the reference array is the array (or those, in case of repetitions) without the external stimulus.
VSN data normalization
In the case of the microarray chips (Affymetrix) used, their preprocessing involves the following steps: a) combining the perfect match (PM) and mismatch (MM) intensities into one number per probe, b) calibrating, c) transforming, and d) summarizing the data. The algorithm vsn (Huber, et al., 2002), from the Bioconductor platform, addresses the calibration and transformation steps. This algorithm is considered to be the first choice for these tasks. The goal of this algorithm is to provide robustness and avoid overfitting by first calibrating and then performing a transformation of the data. Calibration is performed as follows: let y_ki be the matrix of uncalibrated data, with k indexing the rows and i the columns; then the calibrated data y'_ki are obtained by scaling with a factor λ_si and shifting the data by the factor o_si (Draper and Smith, 1998):

y'_ki = (y_ki − o_si) / λ_si        (3.1)

where s is the so-called stratum (a classification of regions of the chips according to their background signal) assigned to probe k.
The transformation to a scale where the variance of the data is approximately independent of the mean is performed by the use of the function:

h_ki = arcsinh(a_0 + b_0·y'_ki) = log( a_0 + b_0·y'_ki + √((a_0 + b_0·y'_ki)² + 1) )        (3.2)

where a_0 and b_0 are constants of proportionality calculated at the beginning of the algorithm (not shown here). Both steps are applied simultaneously to the data in order to obtain an almost constant transformed variance for every spot, independently of the transformed mean for that spot. In this way, it is possible to work simultaneously with high and low expression values, while at the same time avoiding any bias.

3.4 Dimensionality problem. The use of interpolation approaches
As explained in section 2.3, the system is highly underdetermined due to the lack of data in different dimensions, like the time window, granularity, diversity of conditions (stimulus response curves for different stimuli and/or conditions), repetitions, etc. Therefore, in order to increase the amount of data, it is assumed here that changes in gene expression between one measurement and the next follow a smooth function. Hence, interpolations in time are performed for every gene to obtain a continuous set of data and to impose some constraints on the system.
Linear interpolation
Interpolation is a process for estimating values that lie between known data points. The simplest way to perform interpolation consists of using a linear function to produce continuous data points along the gap between two points using the shortest trajectory. This kind of interpolation is named linear interpolation and assumes a normal distribution of the measured data with respect to the unknown real data.
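A minimal sketch of the linear interpolation step, densifying a gene's time series by evaluating the piecewise-linear function between measurements; the times and expression values are invented, and this is a pure-Python stand-in for library routines such as numpy.interp:

```python
# Piecewise-linear interpolation of a single gene's time series.

def linear_interp(times, values, t):
    """Piecewise-linear estimate of the expression value at time t."""
    for (t0, v0), (t1, v1) in zip(zip(times, values),
                                  zip(times[1:], values[1:])):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside the measured time window")

times = [0.0, 10.0, 20.0]    # measurement times (minutes), illustrative
values = [1.0, 3.0, 2.0]     # expression levels, illustrative
dense = [linear_interp(times, values, t) for t in range(0, 21, 5)]
```

The densified series `dense` is what the fitting step later consumes in place of the sparse measurements.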
If one has repetitions of the experimental measurements, then it is possible to estimate the dispersion of the data in the form of the variance or standard deviation of gene expression. However, this variability is the sum of two different processes: the biological variability and the error involved in the measurement process. The former fluctuations are usually hard to estimate, but the available information points to a normal distribution of this kind of variability. Regarding the error associated with the experimental measurements, the assumption is that there is no systematic error; in any case, the quality control procedure previously described should help to detect such a problem.
Cubic spline interpolation
Another possibility used in this work to increase the amount of data is cubic spline interpolation. This algorithm also guarantees continuity of the interpolated data. Therefore, the generated data are differentiable everywhere, a characteristic often exploited in other contexts by optimization methods like gradient descent. However, in any case, interpolation with cubic splines implies smoothing of the original data. There are two assumptions that underlie smoothing: a) the relationship between the original data and the predicted data is smooth; b) the smoothing process results in a smoothed value which is a better estimate of the original value because the noise has been reduced. However, one should not fit data with a parametric model after smoothing, because the act of smoothing invalidates the assumption that errors are normally distributed (Draper and Smith, 1998). Nevertheless, I also tested and compared the performance of this interpolation technique.
The Ziv Bar-Joseph et al. algorithm for gene expression representation
It occurs very often that experimental data are incomplete, very noisy and not uniformly sampled. To overcome the missing data point problem, this algorithm (Bar-Joseph, 2004) splits the entire data set into clusters of genes showing similar behavior and calculates the intra-cluster noise. With this information, the algorithm calculates the most likely values for the missing data. In this way, it provides entire vectors of gene expression data points to a B-spline smoothing algorithm that also takes the noise in the data into account, obtaining a better representation of gene expression behavior over time. This algorithm was tested while using experimental data.

3.5 Data fitting
As explained in the workflow section, the next step in reverse engineering basically consists of performing multiple regressions to find the set of parameters of a given model that best fits the available data. This presupposes the existence of an already chosen model to perform the regressions. Since the system is an ill-posed problem (Radde and Kaderali, 2007), due to the so-called dimensionality problem, there are several models that could fit the data and one has to distinguish between them (often named the system identification or inverse problem). Here, two different groups of related models were tested: the CTRNN and the new specific TDRNN introduced here. The data to be fitted are obtained from the interpolation of the time series gene expression as explained in the previous section. The fitting is achieved by mapping the behavior of every node of these models to a specific gene from the original data set.
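As an illustration of how such a model is simulated during fitting, here is a minimal forward-Euler sketch of the CTRNN of equation 2.3; the weights, thresholds and time constants are invented for a two-gene toy system, and this is not the thesis implementation:

```python
# Forward-Euler simulation of the CTRNN of equation 2.3:
#   dY_i/dt = (1/tau_i) * (sigma(sum_j W_ij*Y_j + theta_i) - Y_i)

import math

def sigma(x):
    """Logistic sigmoid transfer function, sigma(x) = 1/(1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def ctrnn_step(y, W, theta, tau, dt):
    """One forward-Euler step of the CTRNN dynamics."""
    n = len(y)
    return [y[i] + dt / tau[i] *
            (sigma(sum(W[i][j] * y[j] for j in range(n)) + theta[i]) - y[i])
            for i in range(n)]

# invented parameters: gene 2 activates gene 1, gene 1 represses gene 2
W = [[0.0, 2.0], [-3.0, 0.0]]
theta = [-1.0, 1.0]
tau = [1.0, 2.0]

y = [0.1, 0.9]
for _ in range(500):
    y = ctrnn_step(y, W, theta, tau, 0.05)
```

During fitting, the GA proposes candidate (W, theta, tau) sets, the model is simulated this way, and the simulated node outputs are compared with the interpolated gene data.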
Finally, since for every model there are several sets of parameters that fit the data equally well, the next step is to discriminate between those sets of parameters by performing statistical analysis to conclude with a consistent solution to the reverse engineering problem; this will be covered in sections 3.6.3 and 4.14.
Error measuring, mean square error
The fitting of the data by the models is measured by the mean square error between the output of every node and the data over time:

MSE = (1/(N·L)) Σ_{i=1}^{N} Σ_{l=1}^{L} (d_i(t_l) − o_i(t_l))²        (3.3)

where (d_i − o_i) is the error between the desired output and the obtained output of a node at a particular time point, N represents the total number of nodes studied and L the number of measured simulation time points.
Parameter optimization
This work uses a canonical genetic algorithm (Whitley, 1993) to globally optimize the parameters of the models, using a forward Euler² integration scheme to simulate the original system.
Genetic algorithms
A genetic algorithm (GA) is basically a search technique inspired by the nature of evolution. It is often used to find an exact or approximate solution to optimization or search problems. Genetic algorithms are categorized as a global, heuristic optimization and search technique. A typical genetic algorithm requires two things to be defined: 1. a genetic representation of the solution domain (Blanco and Delgado, 2001), and 2. a fitness function to evaluate the solution domain. The fitness function is defined over the genetic representation and measures the quality of the represented solution. The fitness function is always problem dependent.
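The error measure of equation 3.3 and the bounded fitness used later (equation 3.4) can be sketched as follows; the desired and obtained series are invented toy values:

```python
# Mean square error over N nodes and L time points (equation 3.3) and
# the bounded fitness Fitness = 1/(MSE + 1) (equation 3.4).

def mse(desired, obtained):
    """MSE between desired and obtained node outputs over time."""
    n = len(desired)          # number of nodes
    l = len(desired[0])       # number of time points
    total = sum((d - o) ** 2
                for d_row, o_row in zip(desired, obtained)
                for d, o in zip(d_row, o_row))
    return total / (n * l)

def fitness(desired, obtained):
    """Bounded fitness in (0, 1]; a perfect fit gives exactly 1."""
    return 1.0 / (mse(desired, obtained) + 1.0)

# two nodes, three time points; one value is off by 0.5 (illustrative)
desired = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
obtained = [[0.0, 0.5, 1.0], [1.0, 0.0, 0.0]]
```

Bounding the fitness in (0, 1] keeps the selection pressure well behaved even for very poor candidate solutions.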
Once the genetic representation and the fitness function have been defined, the GA proceeds to initialize a population of solutions randomly; then repeated adjustments of the mutation, crossover, inversion and selection operators are applied, as described in table 3.1, until a general increase of the fitness is achieved. Initially, many individual solutions are randomly generated to form an initial population. The population size depends on the nature of the problem, but typically contains several hundred or thousands of possible solutions. Traditionally, the population is generated randomly, covering the entire range of possible solutions, the so-called search space. Occasionally, the solutions may be "seeded" in areas where optimal solutions are likely to be found.
² In general it is a bad idea to perform serious calculations with plain Euler integration; the caution taken here follows the general recommendation in the computer science area of choosing an integration step at least ten times smaller than the smallest time constant.
The next step consists of the evaluation of every individual of the initial population according to the selected fitness function. Analogous to an evolutionary process, this initial population is the first generation, which will evolve through generations towards a better solution to the problem in question. During each successive generation, a proportion of the existing population is selected to breed a new generation, called the intermediary population. To generate the intermediary population, individual solutions from the current generation are selected through a fitness-based process, where fitter solutions are preferentially selected. Usually the selection methods (threshold ranked selection, elitist strategy) rank the fitness of each solution and preferentially select the best solutions.
Other selection methods rate only a random sample of the population, since evaluating the entire population can be very time-consuming; these also use a fitness-based function. Most fitness-based functions are stochastic and designed so that a small proportion of less fit solutions is also selected. This helps to keep the diversity of the population large, preventing premature convergence to a poor final solution. Popular and well-studied selection methods include roulette wheel selection and tournament selection.

The next step is to generate a new generation, or population, of solutions from those selected, by applying the genetic operators: crossing-over (also called recombination) and/or mutation. Many crossing-over techniques exist, depending on the data structures used to store the individuals. In a single-point crossover scheme, a crossover point on both parents' strings is selected, and all data beyond that point is swapped between the two parent organisms; the resulting organisms are the children. The classic example of a mutation operator involves a probability that an arbitrary bit in a genetic sequence will be flipped from its original state. A common implementation generates a random variable for each bit in a sequence, which decides whether or not that particular bit will be modified.

For each new solution to be produced, a pair of "parent" solutions is selected from the intermediary population to breed a new individual. By producing a "child" solution with the above-mentioned crossover and mutation operators, a new solution is created which typically shares many of the characteristics of its "parents". New parents are selected for each child, and the process continues until a new population of solutions of appropriate size has been generated.
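The single-point crossover and bit-flip mutation operators described above can be sketched on toy bit strings as follows (an illustrative sketch; names and the bit-string representation are assumptions, not the original code):

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Pick a random cut point and swap all bits beyond it between the parents."""
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]

rng = random.Random(42)
child1, child2 = single_point_crossover([0] * 8, [1] * 8, rng)
```

Note that single-point crossover conserves the total bit content of the two parents, so the two children together still carry eight ones here.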
These processes ultimately result in a next-generation population of chromosomes which is different from the initial generation. Generally, the average fitness of every new generation increases, since only the best organisms of the previous generation are selected for breeding, along with a small proportion of less fit solutions, for the reasons mentioned above. This generational process is repeated until a termination condition is reached. Common terminating conditions are:
• a solution is found that satisfies minimum criteria
• a fixed number of generations is reached
• the highest ranking solution's fitness has reached a plateau, such that successive iterations no longer produce better results
• manual inspection
• combinations of the above.

Here, binary encoding was used with four bytes per parameter. The population size was fixed at 1000 in all cases. Fitness rank-based selection was performed, and the stopping criterion was in all cases the number of generations (N = 1000). The data were pre-screened to conveniently adjust the mutation and crossing-over rates as shown in table 3.1 for the different data sets.

Table 3.1 Optimization parameter adjustment

               Repressilator                           Yeast
        Initial  Adjusted at gen:        Factor   Initial  Adjusted at gen:     Factor
Mut:    0.33     50, 150, 300, 750, 850  x 0.33   0.5      100, 250, 500, 750   x 0.2
C.O.:   0.4                              + 0.8    0.5                           + 0.9

Mut = mutation rate; C.O. = crossing over; gen = generation

Two optimization functions, derived from the model outputs and the interpolated data along the optimized period of time, were used by the GA as fitness functions. The first optimization function,

Fitness = (MSE + 1)^(−1)    3.4

was used to obtain a bounded fitness space, while the second function,

Fitness = MSE^(−1) + λ·e^(−β·MSE)    3.5

additionally incorporates a fitness-adaptive weight pruning function (Bebis, 1996), with β controlling the onset of the pruning starting point. λ is a function of the gene interaction weights W_ij according to the following parabolic function:

λ = Σ_{i,j=1}^{N} (a·W_ij² + b·W_ij + c) / (d·W_ij + e)    3.6

where N is the number of nodes and a, b, c, d and e (50, −170, 100, 5 and 1, respectively) are parameters controlling the shape of the parabolic function.

Figure 3.2 Shape of the lambda function that evaluates every weight to be pruned. Weights between 1 and 3 are penalized, while zero and values higher than 3 are promoted.

This function penalizes weights in the range [1, 3] and promotes values close to 0 and larger than 3, in order to favor a sparse interaction matrix without affecting the MSE-derived fitness. Changing the shape of the parabolic function or the range of permissible interaction weights W_ij did not change the results qualitatively. Fifty optimization runs were performed for each of the synthetic and experimental time series, randomly initializing the model parameters and the GA. The parameters to be optimized were [τ, ϑ, θ, W] and, for the TDRNN, additionally the time delay [δ].

3.6 Models

3.6.1 The CTRNN model

In the continuous time recurrent neural network (CTRNN, from here on) approach, it is assumed that gene activity is reflected at the level of mRNA expression while monitoring a cell's stimulus-response or any other normal dynamical cellular process. The observable changes in mRNA expression are the balance of all those processes described in the introduction, but projected at the mRNA expression level. The GRN inference power of this model relies on the analogy between the continuous activity level of its nodes and the continuously regulated gene expression of a given GRN.
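Looking back at the fitness functions of eqs. 3.4 to 3.6, a minimal sketch follows (an illustrative reimplementation, not the original code; with a, b, c, d, e as in the text, the parabola is negative roughly on [1, 3] and positive near 0 and above 3, matching figure 3.2):

```python
import math

# Shape parameters of the pruning function (eq. 3.6), as given in the text
A, B, C, D, E = 50.0, -170.0, 100.0, 5.0, 1.0

def fitness_bounded(mse):
    """First fitness function (eq. 3.4): bounded in (0, 1]."""
    return 1.0 / (mse + 1.0)

def lam(weights):
    """Weight-dependent pruning term lambda (eq. 3.6), summed over all W_ij."""
    return sum((A * w ** 2 + B * w + C) / (D * w + E) for w in weights)

def fitness_pruned(mse, weights, beta):
    """Second fitness function (eq. 3.5) with fitness-adaptive weight pruning."""
    return 1.0 / mse + lam(weights) * math.exp(-beta * mse)
```

Evaluating `lam` at a single weight of 2 gives a negative contribution (penalized), while weights near 0 or above 3 contribute positively (promoted), which is the sparsity pressure described in the text.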
Regarding the internal structure of this model (its nodes are predictors, from a multiple regression point of view), it can be classified among the generalized linear models (Bay, et al., 2002). The model can be derived from the gene activity analogy as follows. For a given gene (Y_i), the change of mRNA expression over time is the balance of its synthesis (S) and degradation (ϑ):

dY_i/dt = S(Y_i) − ϑ(Y_i)    3.7

Since very sparse information is available with respect to mRNA degradation, this model assumes first order degradation kinetics for the gene expressions:

ϑ_i(Y_i) = K_i·Y_i    3.8

In turn, the synthesis rate is the balance of direct or indirect interactions with other genes (Y_j) that activate or repress the gene in question:

S_i(Y_i) = Σ_{j=1}^{N} Y_j    3.9

An additional term, a constant bias (θ_j), is added to take into account a basal level of activity of every gene:

dY_i/dt = Σ_{j=1}^{N} (Y_j − θ_j) − ϑ_i(Y_i)    3.10

However, every interaction between two genes is a complex process following a non-linear behavior. In gene regulatory networks this means a sensitive, switch-like behavior, as described by Hill kinetics (Hofmeyr, 1997; Setty, et al., 2003; Zaslaver, et al., 2004). Therefore, in this family of generalized additive models, every node-to-node interaction is passed through a logistic sigmoid (σ) function:

σ(x) = 1 / (1 + e^(−x))    3.11

giving us:

dY_i/dt = Σ_{j=1}^{N} σ(Y_j − θ_j) − ϑ_i(Y_i)    3.12

The model described here represents the relative changes of gene expression rather than being a model of mRNA concentration kinetics. The reason is that, as explained in the normalization section, experimental data from microarray technology are far from being quantitative mRNA concentrations.
Microarray time series data are rather strong indications of fold changes with respect to a predefined state of activity, usually with respect to time zero, defined as the time before any stimulus is supplied to the cellular system.

The application of the sigmoid function to every single interaction, and not to the entire summation of them, increases the precision of the calculation with respect to the Wahde and D'Haeseleer CTRNN models. At the same time, another consequence of departing from the original Beer (Beer, 1995) CTRNN approach is that a gene induction or repression change scales proportionally with the number of genes. Since:

lim_{(Y_j − θ_j)→∞} σ(Y_j − θ_j) = 1    3.13

and assuming that all interactions in the summation of a given node are activations (positive) at their respective maximum value:

Y_i(max) = lim_{(Y_j − θ_j)→∞} Σ_{j=1}^{N} σ(Y_j − θ_j) = N    3.14

Figure 3.3 Sigmoid function comparison. Comparison between the output space after applying the sigmoid function to the summation of all interacting terms (a) and the output space after separate application of the sigmoid function to every interaction (b). For simplicity, b) depicts only the second term of N.

with N = number of total genes modeled by the CTRNN. Thus, the balance between genes activating and inactivating each other acquires more relevance. This kind of recurrent neural network is fully connected and explores every possible interaction between nodes; in order to fit the data, there are two possible tendencies for the final topology:
a) a mutual cancellation of interactions, or
b) the elimination of unnecessary interactions, giving a sparsely connected network.
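The scaling argument of eqs. 3.13 and 3.14 can be checked numerically: applying the sigmoid per interaction bounds each term by 1, so a node's maximal drive approaches N (a small illustrative sketch; function names are not from the original code):

```python
import math

def sigmoid(x):
    """Logistic function of eq. 3.11."""
    return 1.0 / (1.0 + math.exp(-x))

def max_drive(activities, thresholds):
    """Summed per-interaction sigmoids: the synthesis term of eq. 3.12."""
    return sum(sigmoid(y - th) for y, th in zip(activities, thresholds))

# With strongly active inputs every sigmoid saturates near 1, so the
# total drive approaches N (here N = 5), as stated by eq. 3.14.
drive = max_drive([10.0] * 5, [0.0] * 5)
```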
To current knowledge, the latter is the kind of network topology present in real genetic networks, and it has to be favored by any model. Therefore, the second optimization function described previously was used. Additionally, a weighting factor (W) is applied to every interaction pair (i, j), in order to accentuate the importance of that interaction. This weight also assigns a final sign to the interaction:

dY_i/dt = Σ_{j=1}^{N} W_ij·σ(Y_j − θ_j) − ϑ_i·Y_i    3.15

In order to integrate possible external information (e.g. for classification) or stimuli, an independent weighted term can be added:

dY_i/dt = Σ_{j=1}^{N} W_ij·σ(Y_j − θ_j) − ϑ_i·Y_i + w·I    3.16

Finally, a general transcription rate parameter τ is added to describe the differences in response time of different classes of genes:

τ_i·dY_i/dt = Σ_{j=1}^{N} W_ij·σ(Y_j − θ_j) − ϑ_i·Y_i + w·I    3.17

This model is exactly the formalism known as a continuous time recurrent neural network (Beer). At the same time, it is a system of coupled non-linear differential equations. An advantage of seeing this model as a CTRNN is that the work of Funahashi et al. (Funahashi, 1989) showed that CTRNN are universal approximators of smooth functions. Additionally, Albertini and Sontag (Albertini and Sontag, 1993) showed that, given enough data, there is just one network topology that best fits the data.

3.6.2 The TDRNN model

However, given the amount and quality of mRNA expression data required by the CTRNN model, there will never be enough data. Another drawback of this model is that it does not take into account that in eukaryotes there is an important time delay between the activation of a gene and the moment its product can interact with other genes. Here, a modification of eq.
3.17 was introduced by adding a delay (δ_i) to the gene interaction term:

τ_i·dY_i/dt = Σ_{j=1}^{N} W_ij·σ(Y_j(t − δ_i) − θ_j) − ϑ_i·Y_i + w·I    3.18

This time delayed recurrent neural network (TDRNN) model increases the non-linearity with respect to the CTRNN, whose non-linear parameter-space behavior has been analyzed in other works (Beer, 2006; Mathayomchan, 2002). More importantly, the differential time delay of every node moves this model away from a Markov chain process. Therefore no analytical solution is feasible, and this work will follow a statistical approach to validate it in section 4.1.1. Moreover, although this abstraction is still far from integrating the full complexity of gene regulation, in this thesis I will demonstrate the utility of this novel model to represent gene regulatory networks and to reverse engineer them.

This TDRNN model differs from previously developed ones (Hu, et al., 2005; Kim, 1998; Liao and Wang, 2003; Ma, 2004) in several respects. The model of Kim demonstrates that TDRNN are superior to previous time delayed neural networks (TDNN), adaptive time-delay neural networks (ATNN) and multiple recurrent neural networks (MRNN) for the tasks of temporal signal recognition, prediction and identification. However, even though the general principle is similar, the delayed information processing of his TDRNN is achieved through the architecture of the network, which has input delay and output layers. Even though this architecture is suitable for those purposes, it makes it impossible to map one node to one gene in a multiple regression model such as the one introduced here. A similar situation is present in the works of Ma and of Liao and Wang, which are artificial neural models developed for more general purposes.
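A minimal forward-Euler sketch of the TDRNN of eq. 3.18 above (the CTRNN of eq. 3.17 corresponds to zero delay; parameter names mirror the equations, but this is an illustrative reimplementation, not the original code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_tdrnn(W, theta, tau, decay, delay, y0, steps, dt, ext=0.0):
    """Forward-Euler integration of
    tau_i dY_i/dt = sum_j W_ij * sigmoid(Y_j(t - delta_i) - theta_j)
                    - decay_i * Y_i + ext   (eq. 3.18),
    with `delay` given per node in integration steps. Returns the trajectory."""
    n = len(y0)
    history = [list(y0)]  # state at every step, used to look up delayed values
    for t in range(steps):
        y = history[-1]
        new = []
        for i in range(n):
            delayed = history[max(0, t - delay[i])]  # Y(t - delta_i)
            drive = sum(W[i][j] * sigmoid(delayed[j] - theta[j])
                        for j in range(n))
            dy = (drive - decay[i] * y[i] + ext) / tau[i]
            new.append(y[i] + dt * dy)
        history.append(new)
    return history
```

With all weights zero, a single node simply decays exponentially, which is a convenient sanity check of the integrator.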
However, Ma demonstrates that TDRNN are also capable of developing a certain memory for spatio-temporal sequences. On the other hand, this model differs from the model of Hu et al. in three aspects:
a) This model explicitly models the decay rate associated with the produced mRNA.
b) The considered delays are constant for every gene, instead of being particular for every interaction as proposed by Hu et al. One advantage of this is having fewer parameters to estimate. More importantly, however, I consider this the more realistic situation, because the associated delay is due to translation and diffusion of gene products and is approximately constant among genes.
c) The non-linearity in my model is applied to every interaction instead of to the entire summation. This marks the same difference as with previous CTRNN models.

In this work, I compared the performance of the TDRNN in inferring GRN with respect to the modified CTRNN version presented here, the original one used by Wahde and Hertz, and an available dynamic Bayesian network (DBN) implementation by Wu (Wu, 2004) in the GeneNetwork package.

3.6.3 Robust parameter determination

After performing the optimization runs, the weight matrix parameters were evaluated by their z-score (D'Haeseleer et al., 2000):

z_ij = W̄_ij / σ_ij    3.19

where W̄_ij and σ_ij denote the mean and the standard deviation of every matrix element W_ij, calculated over the 50 independent optimization runs. For the synthetic and experimental GRN data, robust parameters were defined as those having z-scores greater than 2 and 1.5, respectively, corresponding to statistical significance values of ≈ 0.05 and ≈ 0.13375.
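The z-score filter of eq. 3.19, together with the ternary discretization used in the next subsection to build the adjacency matrix, can be sketched as follows (an illustrative reimplementation; names and the handling of zero variance are assumptions):

```python
import math

def robust_weights(runs, z_threshold):
    """Keep weight entries whose z-score |mean/std| across the optimization
    runs exceeds z_threshold (eq. 3.19); zero the rest.

    `runs` is a list of flat weight vectors, one vector per run.
    """
    n_runs = len(runs)
    robust = []
    for values in zip(*runs):
        mean = sum(values) / n_runs
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n_runs)
        z = abs(mean) / std if std > 0 else float("inf")
        robust.append(mean if z > z_threshold else 0.0)
    return robust

def discretize(weight):
    """Ternary mapping of a robust weight, sketching the rule of section
    3.6.4: values in [-1, 1] -> 0, above 1 -> 1, below -1 -> -1."""
    if -1.0 <= weight <= 1.0:
        return 0
    return 1 if weight > 1.0 else -1
```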
3.6.4 Graph generation and error distance measurements

A graph was generated from the robust parameters determined in the previous step by discretizing every robust parameter to a ternary representation: values between −1 and 1 → 0; between 1 and 5 → 1; between −5 and −1 → −1. This generates what is from here on called the adjacency matrix. Then, the Cytoscape (Shannon, et al., 2003) facilities were used to generate a directed graph from every adjacency matrix. With the adjacency matrix, the dissimilarity between it and the desired benchmark network (the repressilator or the yeast cell cycle) was calculated with a directed-weighted version of the graph edit distance (GED) algorithm developed by Robles-Kelly et al. (Hancock, 2005; Robles-Kelly and Hancock, 2005). To calculate the transformation cost between every resulting graph and the target network graph, this algorithm takes into account the existence of shortcuts (deletions-insertions) arising from the semantics of directed weighted graphs. This is especially suitable for medium- and large-size networks, and is the most objective way to compare between them.

3.6.5 Clustering of results

Cluster analysis simply discovers structures in data without explaining why they exist. Therefore, cluster analysis methods are mostly used when we do not have any a priori hypotheses and are still in the exploratory phase of research. Statistical significance testing is then not really appropriate, even in cases where p-levels are reported, as in k-means clustering. In a sense, cluster analysis finds the "most significant solution possible".

In the RE case, several dissimilar network topologies can fit the data equally well. To distinguish between these solutions, clustering over the vector representation of the weight matrix of each experiment was performed.
Using the Genesis platform (Sturn, et al., 2002), hierarchical and self-organizing map clustering were used to define the best partition number for a standardized k-means splitting procedure. Then the splitting k-means algorithm was applied with the same parameters to all the experiments to be compared. Robust parameters were identified using the z-score method. To distinguish between these networks, the ratio between the size and the mean fitness of every cluster was calculated.

An important prerequisite for clustering is the need to measure the similarity or dissimilarity between the elements of the sample to be grouped. There are several different measures of dissimilarity, known as "distances": Euclidean, Manhattan, squared Euclidean, Chebyshev, power distance, etc. In this work, the Euclidean distance was used in all the cases where clustering is referred to. It is simply the geometric distance in the multidimensional space and is computed as:

dissimilarity(x, y) = ( Σ_i (x_i − y_i)² )^(1/2)    3.20

where x and y are the two elements of the sample whose dissimilarity is to be measured, and the index i runs over the measured dimensions. Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which they are computed.

Once the distances between the elements of a given data sample are calculated, the clustering algorithm groups or splits the sample into subgroups by mainly two different strategies:
a) Agglomerative techniques: starting with every element in its own isolated group, those with the lowest distance are grouped together until ending with one unifying cluster.
This technique is typical for hierarchical clustering.
b) Splitting techniques: the initial set of data is split according to the similarity of its elements until a desired number of clusters is reached or a certain rule is fulfilled. This technique is typical for k-means clustering.

Hierarchical Clustering

Hierarchical methods return a hierarchy of nested clusters, where each cluster typically consists of the union of two or more smaller clusters. Hierarchical methods can be further distinguished into agglomerative and divisive methods, depending on whether they start with single-object clusters and recursively merge them into larger clusters, or start with the cluster containing all objects and recursively divide it into smaller clusters.

K-means Partitioning

The k-means algorithm (MacQueen, 1967) can be used to partition N genes into K clusters, where K is pre-determined by the user. The pairwise distances among genes are calculated, and, starting from the lowest distances, every gene is assigned to the cluster with the nearest mean, called the centroid. Next, the centroid of each cluster is recalculated as the average expression pattern of all genes belonging to the cluster, and genes are reassigned to the closest centroid. Cluster membership and cluster centroids are updated iteratively until no more changes occur, or until the amount of change falls below a predefined threshold. K-means clustering minimizes the sum of the squared distances to the centroids, which tends to result in round clusters. Different random initial seeds can be tried to assess the robustness of the clustering results.
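The k-means procedure described above, together with the Euclidean distance of eq. 3.20, can be sketched as follows (a plain illustrative reimplementation, not the Genesis code):

```python
import math

def euclidean(x, y):
    """Euclidean distance of eq. 3.20."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, centroids, iterations=100):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and repeat until the assignments stop changing."""
    assignment = None
    for _ in range(iterations):
        new_assignment = [min(range(len(centroids)),
                              key=lambda k: euclidean(p, centroids[k]))
                          for p in points]
        if new_assignment == assignment:
            break  # converged: no gene changed cluster
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return assignment, centroids
```

On two well-separated pairs of points, the algorithm converges in two iterations to the obvious partition.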
Self-Organizing Maps clustering

The self-organizing map (SOM) method is closely related to k-means. However, the method is more structured than k-means, in that the cluster centers are located on a grid. In each iteration, a randomly selected gene expression pattern attracts the nearest cluster center, plus some of its neighbors in the grid. Over time, fewer cluster centers are updated at each iteration, until finally only the nearest cluster center is drawn towards each gene, placing the cluster centers in the center of gravity of the surrounding expression patterns. Drawbacks of this method are that the user has to specify a priori the number of clusters (as for k-means), as well as the grid topology, including the dimensions of the grid and the number of clusters in each dimension (e.g. 8 clusters could be mapped to a 2x4 2D grid or a 2x2x2 3D cube). The artificial grid structure makes it very easy to visualize the results, but may have residual effects on the final clustering.

3.6.6 Dynamic Bayesian Network

To compare the performance of the TDRNN with an established modeling framework, the dynamic Bayesian network (DBN) approach implemented in the GeneNetwork package (Wu et al., 2004) was used. The performance of the TDRNN and DBN networks was compared on the same experimental data set (see section 4.2.1) under the same conditions: the networks were inferred from 100 linearly interpolated data points. The GA implementation of the GeneNetwork package was set to use a population size of 1000 individuals, running for 1000 generations with a mutation rate and a crossing-over rate of 0.05 and 0.5, respectively. For the TDRNN, the optimization scheme described in table 3.1 was used.
4. Results

This section is divided into three main parts. The first part contains the results obtained with synthetic data of a system analogous to the so-called repressilator synthetic system. Additionally, in this part, the introduced Time Delayed Recurrent Neural Network (TDRNN) model is analyzed through a parallel study with a Continuous Time Recurrent Neural Network. This analysis addresses some open questions in the reverse engineering of gene regulatory networks, such as network sparsity, over-fitting and the amount of information required by the model introduced here. In the second part, the results obtained with the TDRNN model on real data of the yeast (Saccharomyces cerevisiae) cell cycle are compared with other approaches, namely the previously used CTRNN model and a dynamic Bayesian network. In the third part, results related to the keratinocyte-fibroblast communication system are presented.

4.1 Synthetic benchmark: the repressilator

To assess the inference power of the TDRNN and CTRNN models, I tested both models on synthetic data generated for the so-called repressilator system. The repressilator system was chosen because it is among the simplest experimental synthetic systems showing realistic characteristics of GRN, such as cyclic behavior. Cycles occur often in biochemical networks, and some of the more promising models for the reverse engineering of GRN, the Bayesian networks (de Jong, 2002), cannot infer cyclic networks. Furthermore, the repressilator work has become a good benchmark for different models in different areas (Elowitz, 2000).
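The sine-wave surrogate used below for the repressilator dynamics (eq. 4.1: three sines of amplitude 5 and period 150, mutually phase-shifted by 2π/3) can be generated with a short script (an illustrative sketch; the sampling grid and function names are assumptions):

```python
import math

def repressilator_series(n_points, amplitude=5.0, period=150.0):
    """Three sine waves phase-shifted by 2*pi/3, mimicking the mutually
    repressing genes of the repressilator (eq. 4.1)."""
    series = []
    for g in range(3):
        phase = g * 2.0 * math.pi / 3.0
        series.append([amplitude * math.sin(x * 2.0 * math.pi / period + phase)
                       for x in range(n_points)])
    return series

data = repressilator_series(600)
```

A useful sanity check: three equal-amplitude sines 120 degrees apart sum to zero at every time point.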
The original repressilator is a synthetic GRN engineered in the bacterium E. coli, and constitutes a network of three mutually repressing genes capable of undergoing limit cycle oscillations. In this work, the repressilator dynamics is represented by three sine waves derived from eq. 4.1, with a phase shift of 2π/3 between each of them:

y = 5 · sin( x·2π/150 ± (2/3)·π )    4.1

The sine curves have an amplitude of 5 units and a period of 150 units, to be on the period time scale of the original work and on the expression amplitude scale of microarray data, as depicted in figure 4.1a.

Figure 4.1 Repressilator scheme. a) shows the oscillatory dynamics of the three mutually repressing genes of the repressilator b), a synthetically engineered system in E. coli. The shaded time lapse in a) is assumed to be a relaxation period and is therefore not included in the optimization process.

With this synthetic data, the GA optimization runs of the next sections were performed. In all cases, the first 200 simulation time units were discarded (see figure 4.1a) to allow the model to reach an oscillatory steady state. To represent the expression induction of every gene in the original experiment, the recurrent self-connection of every node is the only possibility in our working scheme. Hence, the goal network consists of three nodes and six edges: auto-regulation of each node and mutual repression in a cyclic way, as depicted in figure 4.1b. This has the advantage of keeping the model from being driven by an arbitrary input function.

4.1.1 Parameter space selection

As has been stated, the biological systems to be reverse engineered are highly underdetermined, and therefore an infinite number of parameter combinations exists that fit the data equally well.
Additionally, the models tested here are semi-parametric and in principle have no bounds on the ranges their parameters can take. These models are intended to perform the RE task at the level of network organization rather than to infer kinetic constants. Therefore, choosing the right parameter space is not a trivial task; in fact, the chosen parameter space is a compromise between different restrictions and objectives. On the one hand, a large parameter space is desired in order to assure the convergence between the model and the data. Additionally, some aspects of the parameter space can be of particular relevance over a broader parameter range. In this case, it is the range of the τ parameter, because the systems to be modeled have response time scales as different as the fast (on the order of minutes, and probably seconds) signal transduction processes and the slower (on the order of hours) GRN. The range of the τ parameter was chosen to be as broad as 3 ≤ τ ≤ 66 in the rest of this section. The reason for this choice is that the results should also be valid in the next sections, with experimental data sets involving both kinds of processes. Additionally, it is highly desirable to work in a narrowed parameter space to speed up the search. More important, however, is that some parameters should not correlate with the RE task. Ideally, only the weight parameters should be correlated with a given network topology. Therefore, it is highly desirable that the rest of the parameters exert their influence mainly on fitting the data; here, this is the bias parameter. In the absence of any external stimuli, the individual biases are the only sources of dynamic behavior. Actually, the effects of the biases and of the external inputs on the locations of the regions known as nullclines in the synaptic input space are identical (Beer 1995).
Therefore, considering these aspects, the chosen bias parameter space has the short range −1 ≤ θ ≤ 1. Another parameter that should have a small effect on the RE task, but plays an important role in fitting the data in a biologically inspired way, is the decay (ϑ) parameter. However, since the present generalized additive models do not impose upper and lower limits on the activity space of every node, as the generalized linear models do, the maximum activity of every node scales with the network size according to equation 3.14:

Y_i(max) = lim_{(Y_j − θ_j)→∞} Σ_{j=1}^{N} σ(Y_j − θ_j) = N    4.2

Therefore, the decay parameter ϑ that balances the global activity of every node in the TDRNN should also vary proportionally. Hence, log10(Y_i(max)) was used, giving a 0 ≤ ϑ ≤ 0.5 range for the repressilator data set and 0 ≤ ϑ ≤ 3.5 for the experimental data sets presented in the next sections.

Parameter screening

Finally, for the delay parameter δ, I performed a fast screening of the ranges that this parameter, together with the τ parameter and the β pruning-control factor, could take, and chose the range where the TDRNN model performed best with respect to the MSE and to the RE task. To evaluate this, the resulting weights of every optimization run of the screening were discretized to a ternary representation (values between −1 and 1 → 0; between 1 and 5 → 1; between −5 and −1 → −1) and compared with the adjacency matrix of the goal network of figure 4.1b. The screening was performed with a population size of 100 individuals. The results of this screening are shown as scatter plots in figure 4.2.
Figure 4.2 Scatter plots of the parameter space screening, showing fitness, MSE (log) and errors against Beta (log), delay (intervals) and taus (intervals). Panels a, b and c show the influence of pruning on the fitness and inference power of the model. The middle panels d, e and f show the influence of the delay parameter of the TDRNN model on the same performance indicators, fitness and errors. Panels g, h and i show analogous information for the influence of the τ parameter. Dashed lines represent the asymptotic or oscillatory tendencies that some parameters follow.

In figures 4.2 a, b and c it is possible to see the functioning of the pruning function. In 4.2a, the values of the fitness function clearly decrease for larger values of the pruning-control factor β of equation 3.5. Notice that for β = 10², β plays almost no role, and for β = 10³ it plays no role at all. Panels 4.2b and c show that the MSE and the inference power (expressed as the number of errors) are not affected by the pruning process (β ≤ 10²), although the fitness varies with the β factor. In 4.2b, two distributions are visible: in the upper part are runs with a good fit (MSE on a logarithmic scale) and in the lower part are runs with a poor fit of the data. However, this bimodal distribution is present for all β values. In 4.2c, the scatter plot suggests that runs without pruning (β ≥ 10²) are only slightly more sensitive concerning the inference power, showing no runs with one error. In the scatter plots of the delay (δ) parameter ranges (figures 4.2
e and f), it is possible to observe the same bimodal distribution previously described. Here, it is possible to see that the inference power of the model for a delay range of six units is in a maximum performance zone. This apparently oscillatory behavior of the parameter ranges is clearly related to the high symmetry of the repressilator system. As demonstrated by the works of Beer (1995, 2006), the parameter space of the CTRNN for such a dynamical system is divided into regions of topologically equivalent dynamics by bifurcation manifolds. Hence, for the reasons previously exposed, it was chosen to work with the shortest delay interval of six units showing the best performance. An analogous situation to the delay parameter can be observed in figures 4.2 g, h and i for the taus (τ) parameter. However, for the reasons previously exposed, here it was chosen to work with a broader parameter space in order to observe the implications of this cyclic behavior in just one of the parameters. Ideally, this parameter should work on a broader space to cover diverse biological systems. However, since the other parameters are constrained to a smaller space, the manifolds should not appear. Under these circumstances, problems related to the solution of stiff systems of differential equations could arise; see the next sections.

Parameter correlations and inference power

To corroborate that the inference power of the TDRNN and CTRNN models is mostly insensitive to the chosen parameter ranges (3 ≤ τ ≤ 66, −1 ≤ θ ≤ 1, −1 ≤ υ ≤ 1, −5 ≤ W ≤ 5), I calculated all correlations between parameters, as well as to the MSE and to the inference power (as mean errors). For this purpose, 150 optimization runs under the previous standardized conditions were performed with the TDRNN model using a 0 ≤ δ ≤ 6 range and a population size of 100 individuals.
To discard spurious correlations due to non-fitting problems (as observed in the bimodal MSE distributions), an MSE = 0.4 was used as stopping criterion (see figure 4.3a). This is a normal optimization procedure when most of the runs share a similar MSE. To avoid interference, the pruning function was not used. In figure 4.3 the global distributions of the parameters are depicted by their histograms (figure 4.3 c, d, e, f and g).

Figure 4.3 Parameter histograms: a) fitness (MSE), b) errors, c) |weights| with the Lambda function superposed, d) taus, e) delay, f) bias, g) decay.

The distributions depicted in figures 4.3 e, f and g suggest a normal distribution for the global decay, delay and bias parameters; this was corroborated by a Kolmogorov-Smirnov test of normality (p ≤ 0.05). Interestingly, figure 4.3.c shows a bimodal distribution of the absolute values of the weights. One of the processes appears to be close to zero while the other is centered on interactions around 3 units. These are the expected distributions for a network using 2/3 of its full connectivity, as occurs in the repressilator topology. Superposed in figure 4.3.c are the values of the Lambda function (black bars using the second y axis scale at the right side) for the same weight values.
Notice that the shape of the λ function penalizes exactly the weights lying between the two distributions, while at the same time promoting the movement of the two distributions towards zero and towards values larger than 3 units, respectively. This pruning distribution is desired because it is easy to probe correlations with large interactions, and obviously no interaction at all is the most desired feature of the pruning function.

Table 4.1 Pearson correlations among parameters, fitness (MSE) and inference power (ERS). ERS = number of errors; T = taus; B = bias; C = decay; D = delay.

Figure 4.3.d clearly shows that the τ parameter has a lognormal-like³ distribution skewed to the right, with a median towards short values (the peak is around 7 units). This shows that the model does not have any problems integrating stiff equation systems; instead, it automatically chose those values that promoted its stability according to the period size of the approximated sine functions. Since the size of the sample was bigger than 100 elements, deviations from normality in the previous distributions are less important (Hill and Lewicki, 2006); therefore, Pearson correlations (p-value ≤ 0.05) were calculated between every parameter pair as well as to the MSE and to the inference power (expressed in errors). The resultant correlations are shown in table 4.1.

³ The lognormal distribution can be used to model the time required to perform some task when "large" values sometimes occur. It is always skewed to the right and it has a longer right tail than the gamma or Weibull distributions. The lognormal distribution is closely related to the classical normal distribution.

In the 2nd and 3rd columns of table 4.1, the two most important correlation series are shaded.
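The per-pair Pearson analysis behind table 4.1 can be sketched with SciPy as follows; the two parameter vectors are synthetic stand-ins (a decay parameter and its node's auto-feedback weight, deliberately constructed to correlate), and `pearsonr` returns the coefficient together with its two-sided p-value:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_runs = 150  # as in the 150 optimization runs described above

# Hypothetical per-run samples: a node's decay and its self-weight,
# built here so that they correlate strongly (illustration only)
decay = rng.uniform(0.0, 1.0, n_runs)
w_self = 4.0 * decay + rng.normal(0.0, 0.3, n_runs)

r, p = pearsonr(decay, w_self)
if p <= 0.05:
    print(f"significant correlation: r = {r:.2f}")
```

In the actual analysis this computation would be repeated for every pair of columns (parameters, MSE, errors) to fill a table like 4.1.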
The first is the correlation between the nodes' individual parameters and the fitness (MSE, 3rd column), and the second between the same individual parameters and the inference power represented by the individual-errors (ERS, 2nd column). Here, it is demonstrated that no parameter other than the weights exerts a strong influence on the inference power of the model. Particularly, in the ERS column it is quite obvious that the most important weights concerning the inference power are those which should be eliminated (W13, W21 and W32) to obtain the repressilator topology. By contrast, the bias (rows B1 and B3 of table 4.1) is clearly the parameter that correlates most strongly to the fitness (MSE column) of the model. These results corroborate the assumption that only the weights correlate to the inference power, while the rest of the parameters mostly correlate to the fitting of the data. This result is of high importance because in sections 4.1, 4.2 and 4.3 the analyses will be focused on the square matrices of the weights (Wij). One can argue that the weights should necessarily correlate strongly to the inference power⁴, since it is calculated exactly on the weights. However, this is not the case, since the errors are calculated on the ternary discretization of the weights. This decreases the variance of the errors with respect to the weights; therefore, this is not an autocorrelation measurement. Moreover, it is important to notice that some weights hardly correlate to the errors at all, such as W13, W21 and W32 - which are exactly those of the mutual inhibitions - despite being strongly related to the desired topology. Notice that in general there are mostly low correlations between the parameters and the fitness and the inference power, as can be seen in the first two columns of table 4.1.
Again, one can argue that this is the case because the distribution of the fitness is monotonic with a low variance, as the MSE was prefixed as stopping criterion. Nevertheless, this is not the case for the inference power. Figure 4.3.b shows the histogram of the error distribution, and it clearly shows that the errors have a geometric distribution⁵, as opposed to the MSE. This means that one should actually expect stronger correlations to the inference power than to the fitness in any other column, but only the weights show this characteristic. Instead of strong correlations between parameters and fitness or inference power, table 4.1 shows strong correlations (grey scale background) between the parameters and the weights (depicted in the 3rd quadrant of table 4.1). In particular, the strongest interactions are between the decay parameter and all the edges of a given node (depicted in the respective node's boxes in table 4.1). In decreasing order of correlation strength, those with the strongest correlations, r ≥ 0.9 (cells with black background and white numbers), are the direct correlations between the positive auto-feedback loop of every node (W11, W22 and W33; see the weights scheme in table 4.1) and its respective decay parameter (see figure 4.4 a, b and c). These correlations are expected for this model in order to stabilize its outputs: the larger the positive auto-feedback loop, the larger the stabilizing decay parameter needs to be.

⁴ The inference power here refers to the inverse of the errors.
⁵ The geometric distribution (discrete) with probability p can be thought of as the distribution of the number of failures before the first success in a sequence of independent Bernoulli trials, where success occurs on each trial with probability p and failure with probability 1 − p.
Figure 4.4 Principal interactions scatterplots: a-c) auto-feedback weights (W11, W22, W33) versus the respective decay parameter; d-f) mutual repression weights (W12, W23, W31) versus decay; g-i) W13, W21 and W32 versus decay.

The next correlation strength level (0.5 ≤ r ≤ 0.6) for interactions between the decay parameter and the weights corresponds to the weights of the mutual repression edges (W12, W23 and W31). In particular, the correlations are between the decay parameter of a given node and the incoming repression edge from another node. Notice that since those weights are negative, the correlation is direct and not an anti-correlation, as one would expect for two processes that act together to decrease the node activity. This last process is clarified in the scatterplots of figures 4.4 d, e and f, where the slope of the correlation trend is negative, shown in the figure with dashed lines. Since no important conclusions should be drawn on the basis of Pearson correlations alone, a visual examination of the scatterplots was performed for all interaction pairs between parameters to check for any possible omission or spurious correlation: a nonlinear correlation could occur, or spurious correlations due to bimodal distributions could take place, and both cases would be mishandled by a linear (Pearson) correlation. In addition to the upper-panel scatterplots of figure 4.4 already mentioned, the remaining more important scatterplots are shown in the middle panels (d, e and f) and lower panels (g, h and i) of figure 4.4.
In the middle panels of figure 4.4, only linear correlations (dashed lines) or anti-correlations were found in all cases, showing that even though the time delay increases the nonlinear behavior of every node's activity, the global behavior is far from unpredictable. Finally, with respect to the decay parameter, and with an anti-correlation (−0.5 ≤ r ≤ −0.4), lie those correlations between the decay parameter of a given node and the weight of the incoming edge from another node. Notice that there are just two incoming edges per node: the repressive one and the other one, the latter mostly working as an activator. In this case, the reference is to the W13, W21 and W32 weights, which should not exist or should be only slightly positive. Proceeding with an analogous study for another parameter, I found in table 4.1 that the highest level of correlation strength (0.6 ≤ r ≤ 0.9) corresponds to the correlation between the taus (τ) parameter of a given node and the incoming activation from another node to the first one (see the figure in the first quadrant of table 4.1). Notice that this is the edge that should not exist. Accordingly, the stronger the incoming activator weight, the larger the τ parameter, meaning a smaller expression rate of the node in question. The same situation occurs between the incoming negative repression edges (W12, W23 and W31) and the τ parameter of every node, but at lower correlation strength (0.2 ≤ r ≤ 0.7). These two correlations are logically needed to stabilize the so-called node's reactivity or reaction rate.
The larger the weight strength, the larger the τ parameter needs to be in equation (4.3) to stabilize the global activity of the node:

dY_i/dt = (1/τ_i) ∑_{j=1}^{N} W_ij σ(Y_j(t − δ_i) − θ_j) − ϑ_i Y_i + wI    (4.3)

However, for the interactions between τ and the auto-regulatory edge weights (W11, W22 and W33), the weak range (−0.5 ≤ r ≤ −0.3) of anti-correlation is partially explained by the fact that, in order to produce an oscillatory behavior, this kind of dynamical system requires amplification. In this sense, the shorter the taus parameter, the larger the auto-regulatory edge weight needed to keep the system oscillating. Notice that the last two parameters analyzed are those which are not passed through the sigmoid function. This is logical because their range of influence over the entire node activity is of the same order as that of the weights. Therefore, a similarly strong influence is expected from the external input term I of equation 4.3 (or 3.18). From here on, one has to be very cautious while analyzing this model with an external input, because then it might not be possible to separate its influence from that of the matrix of weights (Wij), as is proved in this study. Without pretending to create an unnecessary statistical model of the TDRNN model topology, the next analysis looked at the scatterplots of every pair of interactions among weights in order to corroborate the absence of nonlinear interactions among them. The relevant correlation results are plotted in figure 4.5, where the strongest correlations encountered in the 4th quadrant of table 4.1 were those without a normal distribution of the weights. Notice that all but two correlations (W32-W33 and W31-W33) show a linear correlation (dashed trend lines) among them.
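Equation (4.3) can be integrated numerically with a simple history buffer for the delayed terms. The following is a minimal explicit-Euler sketch (no external input, I = 0); all parameter values are illustrative, and the evolutionary optimization of these parameters is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_tdrnn(W, tau, theta, decay, delay_steps, y0, dt=0.1, steps=500):
    """Explicit-Euler sketch of equation (4.3) with per-node delays
    expressed in integration steps; external input I is set to zero."""
    n = len(y0)
    # Constant history: assume the state was y0 for all t <= 0
    hist = np.tile(np.asarray(y0, float), (max(delay_steps) + 1, 1))
    Y = [hist[-1].copy()]
    for _ in range(steps):
        y = Y[-1]
        dy = np.empty(n)
        for i in range(n):
            y_del = hist[-1 - delay_steps[i]]  # Y_j(t - delta_i)
            dy[i] = (W[i] @ sigmoid(y_del - theta)) / tau[i] - decay[i] * y[i]
        y_new = y + dt * dy
        hist = np.vstack([hist[1:], y_new])  # slide the history window
        Y.append(y_new)
    return np.array(Y)

# Illustrative run with hypothetical repressilator-like parameters
W = np.array([[ 3.0, -3.0,  0.0],
              [ 0.0,  3.0, -3.0],
              [-3.0,  0.0,  3.0]])
Y = simulate_tdrnn(W, tau=np.full(3, 5.0), theta=np.zeros(3),
                   decay=np.full(3, 0.5), delay_steps=[6, 6, 6],
                   y0=[0.1, 0.2, 0.3])
```

A fixed-step scheme is used here only because the delays are most naturally expressed in integration steps; a stiff solver would require interpolating the history.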
However, the two correlations that are better represented by a second-order trend line are those with sign transitions in the weights, while those showing all the data distributed in one quadrant show a linear correlation.

Figure 4.5 Scatter-plot of the principal weight interactions.

All these results are consistent with the expected behavior according to the works of Beer (1995, 2006). More importantly, these results corroborate that the inference power of the TDRNN and CTRNN models is mostly insensitive to the chosen parameter ranges (3 ≤ τ ≤ 66, −1 ≤ θ ≤ 1, −1 ≤ υ ≤ 1, −5 ≤ W ≤ 5) of the former results, restricting in this way the further analysis to the Wij matrices of weights as described in the methods section.

4.1.2 Required data length

To determine the data required to reverse engineer the cyclic repressilator system, the model was simulated for 0.2, 0.33, 0.5, 0.6, 0.75, 1.0, 1.5, 2 and 3 oscillation periods after an initial transient of T = 200 units. The 50 optimization runs were filtered for outliers (MSE mean ± 2 standard deviations) and the robust parameters and adjacency matrices were identified for each of the 9 simulation intervals as described in the methods. To compare the performance of the networks, the errors were defined as the number of mismatches between the adjacency matrices of the inferred and the goal network shown in figure 3.1b. The results are plotted in figure 4.6 as a comparison between the errors of the TDRNN and the CTRNN models for different simulation runs using weight pruning (P) or unbiased parameter selection (no pruning, NP).
Summing over all errors from all simulations of the nine time intervals, the TDRNN-NP model performed best, having only three errors. In comparison, the reverse engineering using the CTRNN-P, CTRNN-NP and TDRNN-P gave a total of 11, 7 and 10 errors, respectively.

Figure 4.6 Inference power comparison along nine different optimized period intervals between the TDRNN and CTRNN models with and without parameter pruning (P, NP). The comparison is expressed in terms of errors, defined as mismatches between the robust parameters calculated by any model and the topology of the goal network. For 0.5 periods, all models can infer the topology without errors, but below and above 1.5 periods the NP and TDRNN variants perform best.

In order to interpret the previous result in the light of the data fitting by every model, a comparison of the data-fitting performance of the four different model-reverse engineering combinations (in the following called models for simplicity) is shown in the boxplots of figure 4.7. These boxplots produce a box-and-whisker plot for each of the simulated periods of every model. Each box has lines at the lower quartile, median and upper quartile values. The whiskers extend from each end of the box to the adjacent values in the data of the fittings, i.e. the most extreme values within 1.5 times the interquartile range from the ends of the box. In these boxplots, outliers are those fittings with values beyond the ends of the whiskers; they are displayed with a "+" sign. Notches display the variability of the median between samples. The width of a notch is computed so that boxplots whose notches do not overlap have different medians at the 5% significance level.
The significance level is based on a normal distribution assumption, but comparisons of medians are reasonably robust for other distributions. Comparing boxplot medians is like a visual hypothesis test, analogous to the t test used for means.

Figure 4.7 Data fit comparison between the TDRNN and CTRNN models along the different optimized period intervals, using (P) or not (NP) the adaptive pruning function. For every group, an easy-to-fit zone is identified by the dashed-line box. Outliers of every group distribution are represented by a "+" sign.

The results in figure 4.7 clearly show 3 zones (separated by vertical dashed lines) of fitness values for all the models. In all cases, the central region, comprising the 0.5, 0.6 and 0.75 period intervals, is the region with the best median fit of the data. These regions are henceforth called easy-to-fit regions. However, notice that in all cases the lower whiskers (extreme data) clearly decrease with the quantity (expressed in period fractions) of fitted data. In this way, for all models the extreme lower fitting is close to 0. This last result is logical and expected; by contrast, the easy-to-fit regions were an unexpected result. In addition to the boxplots, a normality test was performed (Lilliefors test, p < 0.05), and since the distribution of the MSE deviated from normality in all models, a parameter-free test (Kruskal-Wallis test, p-value < 0.05) was performed to compare the MSEs. Here I found that the TDRNN-NP model fitted the data significantly better in 7 out of the 9 intervals.
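The rank-based group comparison just described can be sketched with SciPy's Kruskal-Wallis test (SciPy itself has no Lilliefors test, so only the comparison step is shown here; the MSE samples below are synthetic stand-ins for the four model variants at one period interval):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Hypothetical MSE distributions per model variant (50 runs each);
# TDRNN-NP is deliberately shifted lower to mimic better fits
mse = {
    "CTRNN-NP": rng.lognormal(-2.0, 0.5, 50),
    "CTRNN-P":  rng.lognormal(-2.0, 0.5, 50),
    "TDRNN-NP": rng.lognormal(-3.0, 0.5, 50),
    "TDRNN-P":  rng.lognormal(-2.2, 0.5, 50),
}
h, p = kruskal(*mse.values())
print(f"H = {h:.1f}, p = {p:.2g}")  # p < 0.05 -> at least one group differs
```

A significant omnibus result would then be followed by pairwise comparisons to identify which model fits better, as done interval by interval in the text.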
Figure 4.8 shows the respective boxplots, with the same result in favour of the TDRNN model.

Figure 4.8 Comparison of the fitting by the two models, using (p) or not (np) the pruning function, at every optimized time window. CT = continuous time recurrent neural network, TD = time delayed recurrent neural network.

This result suggests that fitness would be a well-suited criterion to determine the topology inference power of the models. However, this is a very strong statement that has to be confirmed or rejected. Therefore, to corroborate this result in the light of possible parameter overfitting, the individual-errors of every optimization run were calculated by discretizing every resultant weight matrix of the optimization runs to a ternary representation according to the following mapping: [−5,−1] → −1, [−1,1] → 0, [1,5] → 1. It is important to notice that even with this discretization, the goal network is just one of 3^(N²) = 3⁹ = 19 683 possible solutions. In this way, information was obtained to correlate individual fitness (MSE) with an individual quantification of errors (individual-errors, from here on). Since the MSE and individual-error distributions were not normal, a parameter-free test (Spearman) was used to correlate these two variables within every model group. In figure 4.9, the bars depict the Spearman correlation between the fitness and the individual-errors, with their respective p-values (shown by the lines with the second y axis), for each simulation interval and model respectively, after filtering at 2 standard deviations.
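The filtered Spearman correlation between per-run fitness and per-run errors can be sketched as follows; the run data are synthetic (errors constructed to loosely increase with MSE), and only the filtering-plus-correlation step of the analysis is shown:

```python
import numpy as np
from scipy.stats import spearmanr

def filtered_spearman(mse, errors, k=2.0):
    """Spearman correlation of MSE vs. per-run errors after removing runs
    whose MSE lies outside mean +/- k standard deviations."""
    mse, errors = np.asarray(mse, float), np.asarray(errors, float)
    keep = np.abs(mse - mse.mean()) <= k * mse.std()
    return spearmanr(mse[keep], errors[keep])

# Hypothetical runs: per-run MSE and an error count that grows with it
rng = np.random.default_rng(2)
mse = rng.lognormal(-2.0, 0.6, 50)
errors = np.clip(np.round(20 * mse + rng.normal(0, 0.8, 50)), 0, 9)
rho, p = filtered_spearman(mse, errors)
```

Spearman is used because it only assumes monotonicity, which suits the discrete, non-normal error counts described in the text.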
Figure 4.9 Spearman correlation between individual-run errors and their respective fitness (MSE) for the 4 models along the 9 optimized time intervals. The corresponding p-values are shown as thin red lines.

The correlation between individual errors and fitness is weak and varies with the data structure and the model equation used to represent the GRN. For the simulation time interval 0.5 < T < 0.75, where T denotes the time in multiples of the oscillation period, the MSE median was significantly lower (in the so-called easy-to-fit region) for all models, with a weak yet significant correlation between fitness and errors. Such a correlation was strongest (Spearman ≈ 0.5, p < 0.05) in the region 0.2 < T < 0.3 for all models. Optimizing for one oscillation period (T = 1), an anti-correlation between fitness and errors is found in all models. Notice in figure 4.9 the strong change in correlation between 0.75 < T < 1 for the CTRNN-NP model. In contrast to the TDRNN model, the CTRNN model again exhibits a strong correlation for the remainder of the intervals (2 < T < 3). This result rejects the suggestion that fitness could be a good indicator of inference power. To explain the existence of these 3 MSE zones, the dynamics of every optimization run was analyzed for a longer period of time (T = 7.2).
In the upper part of figure 4.10, this is exemplified by plotting the dynamics of three different period fractions of the TDRNN-NP model representing these 3 regions. To measure, to some extent, the stability of every solution, the instantaneous-fitness was calculated along this time interval for every run. The lower part of figure 4.10 depicts this so-called instantaneous-fitness along the dynamics of every run shown in the upper part. Here, the mean of the instantaneous-fitness of every model is represented by a thicker black line and the standard deviation by a dashed red line. The optimized interval is marked by a vertical black dashed line.

Figure 4.10 Long-term simulation dynamics of the TDRNN-NP model inferred from three different time intervals (from 0 to the vertical dashed line). In the left panels, over-fitting is present. In the middle panels, de-synchronization occurs for the majority of the models after the optimized interval. In the right panels, a stable long-term fitting is observed for the majority of the models; however, some failures to fit the amount of data exist.

While all solutions fit the data for 0.2 < T < 0.33 (figure 4.10, left panels), only a few solutions show the desired oscillatory behavior for long times. This can be corroborated with the instantaneous fitness in the lower left panel of figure 4.10. Simulating and fitting the system for 0.5 < T < 0.75 (figure 4.10, middle panels), all solutions show stable oscillatory behavior, yet with different oscillation periods, as they start to desynchronize at long times.
Again, this is easy to corroborate with the respective (lower panel) graph of the instantaneous-fitness, where after the optimized period a broad spectrum of similar frequencies is depicted. Finally, for T > 1 (figure 4.10, right panels), two solution groups were found: one group of solutions showed almost perfect agreement between model simulation and synthetic data, with a stable long-term instantaneous-fitness, while the second group showed significant discrepancies in the inferred amplitude and frequency of the oscillations, also showing a very poor instantaneous-fitness. Taken together, all these results demonstrate that the TDRNN model could infer the goal topology with just one error from as little information as one third of an oscillation period of the cyclic repressilator model, i.e. approximately 90% accuracy of correctly predicted weighted and directed edges, outperforming the CTRNN model in terms of congruence with the goal network.

4.1.3 Robustness against noise

To determine the influence of measurement errors on the reverse engineering process, I added Gaussian distributed noise to the time series data. I equidistantly sampled the repressilator sine functions at ten time points over a time interval of two cycles and added noise with a standard deviation σ = sI, where I and s denote the amplitude of the sine wave and a proportionality factor, respectively. The latter is used to define the noise strength in subsequent experiments. To assess the performance of different interpolation approaches under the influence of noise, I interpolated the sampled data using linear and cubic spline interpolation for s set to 20%, 30%, 40% and 50%, respectively.
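The sampling, noise and interpolation setup just described can be sketched as follows; a unit amplitude and the 30% noise level are assumed for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline, interp1d

rng = np.random.default_rng(3)
amplitude, s = 1.0, 0.30               # amplitude I and noise strength s (30 %)
t_sample = np.linspace(0, 2, 10)       # ten equidistant points over two cycles
clean = amplitude * np.sin(2 * np.pi * t_sample)
noisy = clean + rng.normal(0.0, s * amplitude, t_sample.size)  # sigma = s*I

# Densely resample the noisy points with both interpolation schemes
t_dense = np.linspace(0, 2, 200)
linear = interp1d(t_sample, noisy)(t_dense)
spline = CubicSpline(t_sample, noisy)(t_dense)
```

The two dense series would then each be fed to the reverse engineering procedure, which is how the eight model/interpolation groups of the noise experiments arise.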
The new datasets were then reverse engineered to investigate the relationships between measurement noise, data interpolation, the model functions and the use of the parameter pruning, resulting in 8 models under four noise conditions in total. Figure 4.11 shows the boxplots of the fitness distributions of these 8 groups.

Figure 4.11 Data fit comparison among the two models (TDRNN and CTRNN), pruning (P) or not pruning (NP), and using linear (L) or spline (S) interpolation, giving a total of eight groups. Distribution outliers are represented by a "+" sign.

The boxplots of the fitness from the optimization runs in figure 4.11 show that the TDRNN models have fewer outliers ("+" signs) than the CTRNN models, meaning that the TDRNN models fail less often to fit the data. Comparing the inter-group medians, I found in all 8 groups that the more the noise, the worse (higher MSE) the fitting of the data. On the other hand, comparing corresponding intra-group medians among models, the TDRNN always has a better fitness than the CTRNN model. Moreover, without the outliers, the fitness (MSE) distributions of all the TDRNN model groups follow a normal distribution while the respective fitness distributions of the CTRNN do not. Therefore, since the CTRNN shows too many problems fitting the data, in order to compare the means of both models and interpolation approaches, a double filtering was applied.
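The double filtering described here (remove clear fitting failures, then remove ±2 standard-deviation outliers) can be sketched as follows; the MSE values are hypothetical:

```python
import numpy as np

def double_filter(mse, k=2.0, fail_threshold=1.0):
    """Drop clear fitting failures (MSE > fail_threshold), then drop runs
    outside mean +/- k standard deviations of the remaining MSEs."""
    mse = np.asarray(mse, float)
    ok = mse[mse <= fail_threshold]
    keep = np.abs(ok - ok.mean()) <= k * ok.std()
    return ok[keep]

# Hypothetical per-run MSEs: two clear failures (> 1) and one milder outlier
runs = np.array([0.02, 0.04, 0.03, 0.05, 1.8, 0.03, 2.5, 0.04, 0.30])
filtered = double_filter(runs)
print(filtered.mean())
```

Filtering the failures first matters: a few MSE values above 1 would otherwise inflate the mean and standard deviation enough to mask the milder outliers.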
Figure 4.12 Fitness (MSE) comparison among the eight groups after double filtering. Linear interpolation performs better than spline interpolation for higher noise quantities.

The first filter was to remove the clear fitting failures with MSE > 1, and the second was the previously used ± 2 standard deviations. The resulting means of this double filtering are shown in figure 4.12. Figure 4.12 clearly shows that, regardless of the model utilized, in all cases the models fit the data better when using linear interpolation instead of cubic spline interpolation. This is particularly true for noise quantities ≥ 30%. This result is of significance, since in the reverse engineering area this is an open issue (Bar-Joseph, 2004; Bar-Joseph, et al., 2004). In the normal reverse engineering workflow introduced in this thesis, the most important result is the calculation of the robust parameters that determine the network topology. In this case, the comparison of these robust parameters to the goal network is plotted in figure 4.13 by means of errors, i.e. mismatches between the inferred topologies and the goal network.

Figure 4.13 Inference power comparison among the two models (TDRNN and CTRNN), pruning (P) or not pruning (NP), and using linear (L) or spline (S) interpolation, giving a total of eight groups, under different strengths of Gaussian noise.

The direct result of the robust parameter calculation depicted in figure 4.13 shows that the TDRNN is by far more robust against noise than the CTRNN model.
Indeed, the TDRNN models using linear (L) interpolation present no errors, regardless of the quantity of noise in the data and of whether the pruning function is used (P) or not (NP). The same TDRNN models (NP and P) using spline interpolation showed one error each for noise levels >= 20%. Interestingly, the CTRNN model using spline interpolation without the pruning function showed no errors regardless of the presence of noise. By contrast, the remaining CTRNN models present several errors at all noise levels.

Finally, analogous to the analysis performed in the length-of-data section, individual errors per run were calculated in order to correlate the fitness (MSE) of the runs with their respective individual errors. Again, this was achieved through the ternary discretization of each individual run's matrix of weights and its comparison to the goal network. After filtering out runs whose MSE deviated by more than 2 standard deviations from the intra-group mean, the MSE distributions of the CTRNN model were still not normal. Therefore, Spearman's (non-parametric) correlation was computed between the filtered (to avoid spurious correlations) MSEs of the models and the individual errors of every run. The results are plotted in figure 4.14.
Figure 4.14 Spearman correlation between the individual errors and the individual fitness, compared among the eight model groups under different strengths of noise; p-values and correlation trends (dashed lines) are shown per panel.

After filtering for outliers, figure 4.14 clearly shows that the models with no or only a weak correlation between their MSE distribution and their individual errors are exactly those for which no errors were found by the robust parameter calculation technique (TDRNN-NP-L, TDRNN-P-L and CTRNN-NP-S) depicted in figure 4.13. This result may seem counterintuitive at first glance, but bear in mind that the errors are of a discrete nature while the fitness (MSE) is continuous; small variations of the fitness therefore need not correlate with the errors, because the errors may not vary at all (zero-error zones). Consequently, only strong correlations with low p-values are significant: positive ones are associated with fitting problems, negative ones with overfitting problems. However, in practice one usually does not know the goal topology in advance, and these results are mainly useful to show that fitness alone is a bad indicator of the inference power of a model.

4.1.4 Robustness against incomplete information: Clustering improves the standard reverse engineering task, quantitatively and qualitatively

In this section, the model is evaluated for the common situation in which the network under consideration is actually larger than the number of genes selected for reverse engineering (see Methods, data selection).
To elucidate the effect of incomplete information on the modelling procedure, optimizations using three-node CTRNN and TDRNN models were performed while the fitness was measured in only one or two of the three nodes. The distributions of the fitness (MSE) and of the individual errors for this experimental set-up are plotted as boxplots in figure 4.15. As expected (figure 4.15, lower panels), an increase in the number of individual errors was found for both models as the information was reduced. For the one-node case, all four models (TDRNN and CTRNN, each with and without pruning) were unable to properly infer the goal network, showing more than four out of nine possible errors. Consistent with this observation, a reduction in the number of nodes to be optimized improved the computed fitness (lower MSE), as shown in the upper panels of figure 4.15.

Figure 4.15 Model fit comparison between the TDRNN and CTRNN using (P) or not using (NP) the pruning function, for different proportions of incomplete information represented by the number of nodes optimized: 1 = two nodes not optimized, 2 = one node not optimized, 3 = all nodes optimized. Upper panels: MSE; lower panels: errors.

Concerning the calculation of the number of errors through the robust parameter identification approach, a similar tendency was found, depicted in figure 4.16. Even though the TDRNN models produce significantly fewer errors for the case of two optimized nodes, in general all models failed when just one out of three nodes was optimized. As mentioned, this case of incomplete information occurs very often in the reverse engineering of gene regulatory networks.
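The individual-error count used throughout these comparisons — ternary discretization of a run's weight matrix followed by entry-wise comparison with the goal network's signed adjacency matrix — can be sketched as follows. The discretization threshold `eps` and the example matrices are illustrative assumptions, not the thesis' actual values.

```python
import numpy as np

def ternary(W, eps=0.1):
    """Discretize a weight matrix to {-1, 0, +1}; entries with
    |w| < eps are treated as absent interactions (eps is illustrative)."""
    T = np.zeros_like(W, dtype=int)
    T[W >= eps] = 1
    T[W <= -eps] = -1
    return T

def individual_errors(W, goal, eps=0.1):
    """Number of entries where the discretized inferred matrix
    disagrees with the goal network's signed adjacency matrix."""
    return int(np.sum(ternary(W, eps) != goal))

# repressilator-like goal: cyclic repression 1 -| 2 -| 3 -| 1
goal = np.array([[ 0, -1,  0],
                 [ 0,  0, -1],
                 [-1,  0,  0]])
W = np.array([[ 0.02, -0.9,  0.05],
              [ 0.00,  0.01, -0.7],
              [-0.80,  0.03,  0.00]])
errors = individual_errors(W, goal)   # 0 mismatches for this example
```

Since the error count is discrete while the MSE is continuous, two runs with identical error counts can still differ substantially in fitness, which is why fitness and errors need not correlate.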
Therefore, it would be of great interest for the community to increase the inference power of any model in such a situation. Hence, I performed an additional analysis with the objective of improving the inference power of my model. In the two previous sections, it was shown that different network topologies are inferred along the reverse engineering process (figure 4.10). To systematically analyze these distributions, the total sets of different parameter solutions were clustered. For this purpose, the matrix of weights of every optimization run was represented as a vector.

Figure 4.16 Inference power for different quantities of missing information, represented by the number of optimized nodes (3 = no missing information), for the CTRNN and TDRNN models with (P) and without (NP) pruning.

The next step was to identify the ideal number of clusters into which to split the entire set of solutions. Hierarchical and SOM clustering (see chapter 3, Methods) were therefore applied to the sets of 50 vectors per model, and five was determined to be the number of naturally formed clusters. I then used the k-means algorithm to split the 50 optimization runs of each of the four models into five clusters. Finally, I recalculated the robust parameters to obtain the adjacency matrix for each cluster.

The first and most important result of this section appears in the analysis of the clusters of the different models when only one node was optimized; the following results refer only to that case. Additionally, since five different possible solutions per model were obtained, I needed a way to differentiate among them. Therefore, the inference power of every cluster was calculated as the ratio between its size and its mean MSE fitness.
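The clustering step above — flattening each run's weight matrix into a vector, partitioning the runs with k-means, and ranking the clusters by the size-to-mean-MSE ratio — can be sketched as follows. This is a minimal sketch with a hand-written Lloyd's k-means and a toy k = 2 (the thesis used hierarchical and SOM clustering to choose k = 5); the example data are purely illustrative.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means over flattened weight matrices."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each run to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def cluster_index(labels, mse):
    """Cluster index = cluster size / mean MSE of its runs."""
    mse = np.asarray(mse, float)
    return {int(j): int((labels == j).sum()) / mse[labels == j].mean()
            for j in np.unique(labels)}

# toy example: ten 9-dimensional weight vectors forming two clear groups
X = np.vstack([np.zeros((5, 9)), np.full((5, 9), 10.0)])
labels = kmeans(X, k=2)
index = cluster_index(labels, mse=np.full(10, 0.5))
```

Large clusters with a low mean MSE receive a high index, which is exactly the criterion used below to single out the repressilator-like solutions.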
The upper panels (a-e) of figure 4.17 depict the five clusters of the TDRNN-NP model, the mean (over-cluster row, in colour scale) and the standard deviation (above the mean row, in grey scale) of every cluster, and their inference index. The respective weighted directed graph, obtained from the robust parameter calculation over these last two quantities (see chapter 3, Methods), is placed below its cluster. The lower panels (f-j) of figure 4.17 depict the analogous results for the CTRNN-NP model. The colour scale of every cluster and of their means is given in figure 4.17k.

Figure 4.17 At the top of a, b, c, d and e are clusters representing groups of weight parameter solutions between the three nodes of the TDRNN-NP model. Every row is a vector representation of the matrix of weights (W_ij) of one of the 50 optimization runs of this model. At the top of every cluster are the column heads mapping the positions (W_ij) of these vectors into the matrices of weights; below these column heads are the standard deviation (grey scale) and the mean value (colour scale) of every column. In k) the weight colour scale is given, and below it, in l), a scheme with the map of the weight positions representing the repressilator (blue dashed lines denote edges that do not exist and are shown only for mapping purposes). From the mean and standard deviation the z-score is calculated, and the robust parameters (z-score > 2) determine the edges of the lower graphs. Networks b) and e) are closely related. Those networks and network a) are topologically equivalent solutions to the repressilator, the rotation direction being the only difference between them.
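The robust parameter rule stated in the caption — compute the z-score of each weight position across the runs of a cluster and keep only those with z-score > 2 as edges — can be sketched as follows. The synthetic cluster of runs is an illustrative assumption; only the z > 2 criterion comes from the text.

```python
import numpy as np

def robust_adjacency(weight_vectors, shape=(3, 3), z_thresh=2.0):
    """For each weight position, compute z = |mean| / std across the runs
    of a cluster; only parameters with z > z_thresh are considered robust
    and become signed edges of the inferred adjacency matrix."""
    W = np.asarray(weight_vectors, float)
    mean, std = W.mean(axis=0), W.std(axis=0)
    z = np.abs(mean) / (std + 1e-12)   # small epsilon avoids division by zero
    return np.where(z > z_thresh, np.sign(mean), 0.0).reshape(shape)

# illustrative cluster: 50 runs scattered around a repressilator topology
rng = np.random.default_rng(1)
goal = np.array([0, -1, 0, 0, 0, -1, -1, 0, 0], float)  # flattened 3x3
runs = goal * 0.8 + rng.normal(0.0, 0.05, size=(50, 9))
A = robust_adjacency(runs)
```

Weights that fluctuate around zero across runs have a low z-score and are pruned, while consistently non-zero weights survive as signed edges.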
To differentiate between the four architectures encountered (a, b + e, c and d), the ratio between the number of elements and the mean fitness was calculated per cluster and named the cluster index. The clusters with the highest index are the two repressilator-like architectures. Panels f, g, h, i and j show the analogous clusters of the CTRNN model and, below them, their respective inferred networks; here i and j are related solutions. The network in f is the counter-clockwise repressilator solution, although its cyclic behavior is incomplete.

While performing a visual inspection of the formed clusters, an unexpected feature emerged. For the entire repressilator study, a counter-clockwise architecture was considered as the goal network (see figure 4.1a). However, when only one of the three nodes is optimized this restriction vanishes, because only the sinusoidal dynamics of that node matters. In this way, the equivalent cyclic repressilator architecture functioning clockwise is also a valid topology, which is depicted in figure 4.17l. Taking this second architecture into account, it is easier to explain the fact that the three clusters with the largest index (> 1000 units) of the TDRNN-NP model (clusters a, b and e in the upper panel of figure 4.17) have an equivalent repressilator graph among them: cluster-graph a is the counter-clockwise repressilator architecture, while cluster-graphs b and e represent the second valid, clockwise repressilator architecture depicted in figure 4.17l. Performing an analogous analysis for the CTRNN-NP model leads to graphs f-j of figure 4.17. Notice, however, that here there are only two clusters with an index > 1000 units and that the model failed to find the cyclic topology of the counter-clockwise repressilator (figure 4.17f). For the other two models, as well as for the previous ones, the results are summarized in table 4.2.
In table 4.2, I additionally compared further properties of the directed graphs derived from the clusters, namely the number of false positives, the number of false negatives, and whether a cyclic topology of either repressilator was found. As can be seen, in all four models the cluster-graphs with the highest indices are those related to one of the two repressilator architectures. Taking only these solutions into account, all models decrease their number of errors dramatically.

Table 4.2 Comparison among the different clusters and networks from the two models, TDRNN and CTRNN, using (P) and not using (NP) the pruning function.

Model      Cluster   Errors   F. pos.   F. neg.   Size   MSE mean   Index     CYCLE
CTRNN_NP   1         2        0         2         20     0.00       4163.59   NO
CTRNN_NP   2         2        1         1          2     0.00        774.29   NO
CTRNN_NP   3         5        1         4          9     0.01       1010.11   NO
CTRNN_NP   4         2        2         0          2     0.00        687.99   YES
CTRNN_NP   5         1        0         1         17     0.00       3710.31   YES
CTRNN_P    1         3        0         3         14     0.01       1828.78   YES
CTRNN_P    2         4        1         3         10     0.05        199.85   NO
CTRNN_P    3         4        1         3          7     0.05        129.00   NO
CTRNN_P    4         8        3         5          3     0.04         78.17   YES
CTRNN_P    5         3        0         3         16     0.01       2373.65   YES
TDRNN_NP   1         1        0         1         25     0.01       4726.47   YES
TDRNN_NP   2         2        1         1          8     0.00       3732.65   YES
TDRNN_NP   3         7        3         4          4     0.01        268.00   YES
TDRNN_NP   4         5        1         4          2     0.01        134.39   NO
TDRNN_NP   5         2        0         2         11     0.01       1472.27   YES
TDRNN_P    1         4        1         3          5     0.01        639.78   YES
TDRNN_P    2         2        0         2         15     0.01       2451.19   YES
TDRNN_P    3         5        1         4         12     0.05        246.57   NO
TDRNN_P    4         5        1         4          9     0.05        181.61   NO
TDRNN_P    5         2        0         2          9     0.01       1137.29   YES

F. pos. = false positives; F. neg. = false negatives; CYCLE = cyclic inferred network

Summarizing, applying the cluster analysis, a decrease in the error incidence from 4 to 1.5 was found for both the CTRNN-NP and TDRNN-NP models. However, only for the TDRNN-NP model were three different network topology solutions with oscillatory behavior found.
These clusters with the highest inference power constitute the topologically equivalent repressilator architectures with clockwise or counter-clockwise cyclic repression (see figure 4.17 b, e and a, respectively). Clustering the solutions of the CTRNN-P and TDRNN-P models decreased the error incidence from 6 and 5 (see figure 4.16) to 3, respectively (see table 4.2); both models were able to find the two repressilator architectures. Repeating the same cluster analysis for the reverse engineering with two nodes, all models succeeded in inferring the desired repressilator topology without errors. This demonstrates clearly that the parameter clustering approach improves a model's inference power.

4.2 The yeast cell cycle

To assess the predictive power of the TDRNN as a gene regulatory network model and the inference power of the clustering-extended reverse engineering workflow on biological systems, in this section I compare the performance of the TDRNN with that of a CTRNN and a Dynamic Bayesian Network in inferring, from experimental data, the well studied (Chu, et al., 1998; Futcher, 2002; Gavin, et al., 2006; Guelzim, et al., 2002; Harrison, et al., 2007; Ihmels, et al., 2002; Krogan, et al., 2006; Murray and Beckmann, 2007; Tsai and Lu, 2005) transcription-signal transduction cell cycle network of Saccharomyces cerevisiae.

The goal network

In their work, Li et al. (Li, et al., 2004) proposed a cell cycle network model of Saccharomyces cerevisiae with 11 nodes comprising 18 genes, proteins and black boxes. The functional network elements are divided into four groups: cyclins (Cln1, 2 and 3); inhibitors, degraders and competitors of the cyclin-dependent kinase Cdc28 (Sic1, Cdh1, Cdc20/14); transcription factors (SBF, MBF, Mcm1/SFF, Swi5); and checkpoints and self-repressions (cell size, DNA replication); see figure 4.18a.
There are 34 interactions, of which 19 are positive and 15 negative. Note that five negative interactions are of unknown nature and were added in order to make the network functional in logical terms.

The data source, selection and quality control

In this section, the gene expression kinetics of the yeast cell cycle from the alpha-factor data set was used (Spellman et al., 1998). Using four different techniques to synchronize the yeast colonies, the authors generated four data sets, named cdc28, cdc15, alpha factor and elutriation. Unfortunately, other work (Fellenberg, et al., 2001) has shown that desynchronization occurs in the cdc15 and elutriation data sets; these were therefore not considered. After analyzing the remaining two data sets, the cdc28 data set was discarded because of the high frequency of missing data points for the selected genes, leaving only the alpha-factor data set. After performing the quality control described in Methods, no particular issue was detected for this data set.

However, inferring the cell cycle network proposed by Li et al. from this transcriptomic data is challenging, as only four of the eleven network nodes are transcription factors, and the network includes two dimers (SBF, MBF) and five nodes that are each represented by two proteins: Clb1/2, Clb5/6, Mcm1/SFF, Cdc14/20 and Cln1/2. Fortunately, the genes encoding the dimers and the redundant proteins exhibit similar expression kinetics, so that I represented these nodes by the mean expression of their respective genes.

Data interpolation

I tested two different interpolation approaches for this data set: linear interpolation and a B-spline interpolation method suggested by Bar-Joseph (see Methods) for the continuous representation of gene expression.
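The two interpolation schemes can be contrasted on a toy expression profile as follows. This sketch uses SciPy's generic `interp1d` with a cubic spline as a stand-in for the B-spline method of Bar-Joseph; the sampling times and the sinusoidal "gene" are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import interp1d

t = np.array([0, 7, 14, 21, 28, 35], dtype=float)  # sampling times (min), illustrative
y = np.sin(2 * np.pi * t / 35.0)                   # one gene's expression, illustrative
t_fine = np.linspace(0, 35, 71)                    # dense grid for the continuous model

linear = interp1d(t, y, kind='linear')(t_fine)
spline = interp1d(t, y, kind='cubic')(t_fine)      # cubic spline as B-spline stand-in
```

Both interpolants pass exactly through the measured points; they differ only between measurements, which is precisely where noise amplification by the spline can hurt the fit at high noise levels.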
Models

The data was fitted using the TDRNN and CTRNN models, choosing the parameter ranges for [τ, θ, ϑ] as [10-55, 0-1, 0-3.5]. For the TDRNN, an optimization time-delay range equivalent to 10 minutes was chosen as a biological compromise between the fast signal-transduction and the slow transcriptional responses.

4.2.1 TDRNN shows superior inference and predictive power compared to previous models on experimental data

Here, I compared the TDRNN performance on the cell cycle data with the CTRNN and with Dynamic Bayesian Networks (DBN), as described in chapter 3. The quantitative comparison of the different solutions was performed by computing the adjacency matrix for each group of model solutions and calculating the cost of transforming the resulting directed graph of every model group into the goal network. As an objective transformation cost function, a directed-weighted version of the graph edit distance (GED) algorithm was used. This algorithm calculates the cost of converting one graph into another by changing edge weights and/or directions; additionally, it accounts for shortcuts (deletions and insertions of nodes) following the semantics of directed, weighted graphs. The results for the GED, together with other graph measures, are given in table 4.3.

Table 4.3 Mean square errors (MSE) and graph edit distance of the networks inferred using the CTRNN, Dynamic Bayesian Network and TDRNN models. The MSE is averaged over 50 optimization runs. Columns two and three give the number of nodes and edges for which robust interactions were inferred. Columns four and five show the number of correctly and falsely predicted interactions, while column six shows the GED cost. The rows labelled Cluster 1-4 show the MSE and the GED for the individually clustered solutions from 50 independent optimization runs of the TDRNN-NP model.
The bootstrapping results for the model are given in the bottom row.

Model                      MSE    Nodes   Edges   Positives   FP    GED   Index
CTRNN                      0.98     3       2         0        2    134
Dynamic Bayesian network    NA     11      34        10       24    119
TDRNN B-Spline             0.71     9       9         2        7    107
TDRNN-P                    0.36    10      18         8       10     80
TDRNN-NP                   0.37    11      19         8       11     68
Cluster 1                  0.36    11      21         9       12     62     50
Cluster 2                  0.36    11      26        15       11     41     28
Cluster 3                  0.37    11      24         9       15     61     22
Cluster 4                  0.39    11      25        11       14     70     20
Bootstrap                  GED mean = 99, GED standard deviation = 147

Nodes = number of nodes with at least one edge; Positives = number of correctly inferred correlations between two nodes; FP = false positive edges; GED = graph edit distance cost

According to this analysis, the TDRNN-NP model using linear interpolation has the best inference power of all models, having the lowest GED cost; this model also shows the best predictive power, exemplified in figure 4.18. The results depend crucially on the data interpolation and, to a lesser extent, on the parameter pruning, which change the GED cost by 64% and 23%, respectively.

[Figure 4.18: eleven panels of expression versus time (min) for CLN3, SBF, MBF, CLN-1,2, SIC-1, CLB-5,6, CDH-1, CDC-14,20, CLB-1,2, SWI-5 and MCM1.]

Figure 4.18 Data fitting example from one of the TDRNN runs.
The straight lines show the original linearly interpolated data; the smooth lines show the approximation by one model of the expression of every node. The global fitness of this average run was MSE = 0.36.

Notice in figure 4.18 that the approximation of most of the gene dynamics is acceptable. Only for the MCM1 gene does the approximation show some discrepancy with the data (straight line). The reason for this behavior is that the data for this gene contains missing data points, so its pattern appears to change almost randomly between active and inactive. This is a limitation of the data set, and it was therefore not considered necessary to improve this fit.

4.2.2 Bootstrapping validation

To validate the TDRNN-NP results for the yeast data, a bootstrapping test was performed by randomly shuffling the order of the microarray time series data 50 times and repeating the entire workflow. The GED cost of each of the 50 resulting adjacency matrices was then calculated. The resulting GED costs were normally distributed, as expected (see figure 4.19); additionally, the Lillie test of normality was performed. The mean and standard deviation of the bootstrapped optimization runs are shown in the bottom row of table 4.3. Notice that the mean GED cost of this bootstrap deviates by more than 2 standard deviations from that of the TDRNN-NP model. This confirms that the results of our TDRNN-NP model were not obtained by chance, which further validates the correlation between the inferred network topology and the proposed cell cycle network.

Figure 4.19 Bootstrap data histogram (frequency versus GED cost). The distribution of the cost of transforming networks inferred from the randomized original data into the goal network is normal, with mean = 99 edit units.
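The bootstrapping scheme — shuffle the time-point order of the expression matrix, rerun the analysis, and collect the resulting score — can be sketched as follows. Since rerunning the full reverse-engineering plus GED pipeline is out of scope here, a simple temporal-smoothness score stands in for it; the score function, data shapes and names are illustrative assumptions.

```python
import numpy as np

def bootstrap_scores(data, score_fn, n=50, seed=0):
    """Shuffle the time-point order of the expression matrix n times and
    collect the score of each shuffled series (score_fn stands in for the
    full reverse-engineering + GED pipeline)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n):
        perm = rng.permutation(data.shape[1])   # shuffle columns = time points
        scores.append(score_fn(data[:, perm]))
    return np.array(scores)

def smoothness(X):
    """Stand-in score: mean absolute change between consecutive time points,
    a temporal structure that shuffling should destroy."""
    return float(np.abs(np.diff(X, axis=1)).mean())

# 11 "genes" sharing a periodic profile over 18 time points, illustrative
genes = np.sin(np.linspace(0, 2 * np.pi, 18))[None, :] * np.ones((11, 1))
boot = bootstrap_scores(genes, smoothness)
```

If the original, correctly ordered series scores well outside the bootstrap distribution (as the GED of the TDRNN-NP network does), the result is unlikely to be a chance artifact of the data values alone.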
4.2.3 Clustering improves the RE process with real data

After applying the clustering approach described in section 3.1.4, I found that the two clusters with the highest index (clusters 1 and 2) decreased the GED cost from 68 to 62 and 41, respectively (see table 4.3 and figure 4.20b). The GED cost of cluster 2 deviates by more than four standard deviations from the mean of the bootstrapping test, i.e. the result is unlikely to have been obtained by chance.

Figure 4.20 On the left side, the original yeast cell cycle network from (Li et al., 2004), where blue and red arrows denote activation and inhibition, respectively. The indispensable edges according to the work of Stoll et al. are drawn as thicker lines. On the right side is the network topology inferred by our TDRNN model (cluster 2 of the yeast results in table 4.3). The red and blue thicker lines (8) are correctly inferred directed weighted edges; the green lines (7) denote matches with reported correlations between two nodes of the goal network (direction and/or weight not properly inferred for these edges); the grey dashed lines are false positives (11). Seven of the 8 properly inferred weighted directed edges belong to the 13 indispensable ones reported by Stoll, while none of the misdirected/misweighted edges falls into this category.

Comparing the inferred network from cluster 2 (figure 4.20b) with the cell cycle goal network (figure 4.20a), I found that 7 out of 8 weighted and directed edges were predicted correctly according to the 13 edges defined by Stoll (Stoll and Rougemont, 2006) as being the only necessary interactions for this yeast cell cycle circuit. The low graph editing cost of the best inferred network, i.e.
cluster 2, demonstrates that the TDRNN model, together with the clustering of reverse-engineered solutions, is able to extract biologically meaningful information from a combined protein signalling and gene regulation network using just one experimental time series (44% [15/34] of the total correlations, or 54% [7/13] if considering only the indispensable weighted directed edges).

4.3 Reverse engineering of keratinocyte-fibroblast communication

In this section, I present a case study of an unknown network. This scenario reflects a real situation in which there is little information about the structure or function of the network whose dynamics one wants to understand through reverse engineering. Along this case study, the assumptions, limitations and improvement opportunities offered by such an experimental scenario will be explained. Probably the most important gain from this case lies in the experience of how to direct a data-driven experimental set-up, and of how to avoid certain problems in the iterative experiment-modelling process.

The unknown goal network

The idea behind this experimental set-up was to study the communication between two cell types, keratinocytes and fibroblasts. This interaction is important for processes like skin wound healing, and some authors (Birchmeier, et al., 2003) have even suggested, by means of certain analogies, that it is related to cancer. In particular, the process studied here is related to cell migration and its cancer analogue, metastasis. To study this interaction, DNA microarray experiments at several time points upon hepatocyte growth factor (HGF) stimulation were performed by a cooperating partner group (Axel Szabowski) to obtain the gene expression kinetics of heterogeneous co-cultures containing primary human keratinocytes and murine c-jun-deficient fibroblasts.
The latter were chosen to allow discrimination between human and murine mRNA based on species-specific sequences, and to provide an HGF-free background. This is a strong assumption; therefore, a detailed quality control analysis will be presented for this data set. In the global experimental set-up, keratinocytes were stimulated with HGF, which induces both proliferation and migration (Birchmeier et al., 2003). Three additional experiments using keratinocyte growth factor (FGF-7), granulocyte-macrophage colony-stimulating factor (GM-CSF) and stromal derived factor-1 (SDF-1) as stimuli were conducted, all of them inducing cell proliferation but not migration (Florin, et al., 2004). The particular experimental details are described below in order to explain the analysis and the obtained results, but the experiments were performed by the collaborating partner group.

Cell culture

Normal human skin keratinocytes and dermal fibroblasts (HDF) were derived from adult skin (Stark, et al., 1999). HDF obtained from the outgrowth of explant cultures were grown in Dulbecco's modified Eagle's medium (DMEM; Bio Whittaker) supplemented with 10% fetal calf serum (FCS), and cells from passages 4 to 8 were used. Mouse wild-type and c-jun-/- fibroblasts were isolated from mouse embryos, immortalized according to the 3T3 protocol (Schreiber, et al., 1999) and used together with HDF as feeder cells. Normal human skin keratinocytes were plated on X-irradiated feeder cells (HDFi, 70 Gy [8]; MEFi, 20 Gy) in FAD medium (DMEM/Hams F12 3:1) with 100 U/ml penicillin, 50 mg/ml streptomycin and supplemented with 5% FCS, 5 mg/ml insulin, 0.1 ng/ml recombinant human EGF, 0.1 nM cholera toxin, 0.1 nM adenine and 0.4 mg/ml hydrocortisone (Sigma).
For expression profiling, total RNA of co-cultured cells was isolated 1, 2, 3, 4, 6 and 8 h after stimulation with recombinant human cytokines (10 ng/ml HGF, 10 ng/ml GM-CSF, 10 ng/ml FGF-7 or 10 ng/ml SDF-1; all obtained from R&D Systems).

Migration assay

Immortalized human keratinocytes (HaCaT cells) were cultured in monoculture with DMEM (10% FCS, 100 U/ml penicillin, 100 mg/ml streptomycin). Subsequently, the cell monolayer was damaged with a 'scratch' using a pipette tip, and the cells were treated with 5 mg/ml mitomycin C (Sigma-Aldrich) 3 h before stimulation. The cells were stimulated at the indicated time points and periods with cytokines and/or inhibitors: 10 ng/ml HGF (R&D Systems), 1 ng/ml EGF (R&D Systems), 150 nM tyrphostin AG1473 (Biomol; EGFR inhibitor) or 1 mM GW2974 (Sigma; EGFR inhibitor), 50 mM meloxicam (Biomol; PTGS-2 inhibitor), 0.5 mM H-89 (Calbiochem; PKA inhibitor) or 10 mM myristoylated PKI (14-22) amide, a cell-permeable PKA inhibitor (Biomol), or 200 mM 8-(4-chlorophenylthio) adenosine 3',5'-cyclic monophosphate sodium salt (Sigma; PKA activator), and incubated for a further 30 h. The relative migratory activity was determined by measuring the migration distance during the culture using standard protocols.

[8] In dosimetry, a scientific subspecialty of health physics and medical physics focused on the calculation of internal and external doses from ionizing radiation, the absorbed dose is reported in gray (Gy) for matter or sieverts (Sv) for biological tissue, where 1 Gy or 1 Sv equals one joule per kilogram.

In general, the experimental set-up has three serious issues. The first is the fact that the cells were not synchronized before the measurement of their transcriptomic response after stimulation.
This is crucial since the measurements are made over populations of cells and should represent the average behavior of the population. The other two issues are described in the next paragraphs.

Microarray data acquisition and analysis

Microarray measurements were recorded for four different stimuli of the co-cultures, namely HGF, FGF-7, SDF-1 and GM-CSF. For each stimulus, six probes were taken within one experiment (time points of 1, 2, 3, 4, 6 and 8 h after the initial stimulation of the system) and further analyzed. Total RNA was isolated, labeled and hybridized to HG-U133-2plus arrays (Affymetrix) according to the manufacturer's protocol. Raw microarray data were processed using the R environment (R Development Core Team, 2007) and the Bioconductor toolbox (Gentleman, et al., 2004). Probe annotation was handled with the Bioconductor package hgu133plus2cdf (Bioconductor Project). Normalization was performed using the variance stabilization available in the Bioconductor package vsn (Huber, et al., 2002). The gene fold expression was calculated relative to the mean gene expression of two control measurements of the uninduced system at 0 and 8 h.

The second important issue of the experimental set-up appears here: the microarray measurements were performed on the mixture of co-cultured cells, i.e. immortalized human keratinocytes, irradiated human fibroblasts and irradiated mouse epidermal fibroblasts. Hence, cross-hybridization occurs among the different mRNAs expressed by these cells; moreover, there is no information on how strong this cross-hybridization could be, or which genes it affects. Finally, the third important drawback of the experimental set-up is the lack of measurements of the errors and/or noise associated with the previous two drawbacks.
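The fold-expression step described above — each gene's expression relative to the mean of the two uninduced control measurements — can be sketched as follows. Note that the actual pipeline worked on vsn-normalized (generalized-log) values; this sketch assumes plain intensities on a log2 scale, and the control column indices are an illustrative assumption.

```python
import numpy as np

def log_fold_change(expr, control_idx=(0, 1)):
    """Fold expression of each gene (row) relative to the mean of the
    control measurements (columns in control_idx, standing in for the
    uninduced system at 0 h and 8 h), on a log2 scale."""
    expr = np.asarray(expr, float)
    baseline = expr[:, list(control_idx)].mean(axis=1, keepdims=True)
    return np.log2(expr / baseline)

# two illustrative genes: controls in columns 0-1, stimulated samples after
expr = np.array([[100.0, 100.0, 200.0, 400.0],
                 [ 50.0, 150.0, 100.0,  25.0]])
fc = log_fold_change(expr)
```

A value of +1 then means a twofold induction over the uninduced baseline, and -1 a twofold repression.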
In other words, there are no replications of the experiments, because they are not reproducible (Szabowski, personal communication). As mentioned, although repetitions may be expensive, tedious and of low information content from an experimentalist's point of view, they are of central relevance for developing models of biological systems. Although these three drawbacks invalidate the data for any scientifically grounded result, it will be shown here, as an example, what could be done with such data, together with some suggestions on how to improve future work facing similar problems.

Quality control

As described in Methods, the first step of the reverse engineering is to assess the quality of the data and to identify possible issues (Wilson and Miller, 2005). Figure 4.21 depicts the results of the quality control performed on the keratinocyte arrays as described in Methods. The interpretation in this case is based on the following explanations; however, since these microarrays and quality control standards are intended for experiments on a single homogeneous cell line, biochemical interpretation will be added when considered pertinent. The figure is plotted from the bottom up, with the first chip at the base of the diagram and the last chip at the top. This corresponds to the order of the samples shown in table 4.4.

Table 4.4 Microarray time series sampling and input strength [µg/µl]

Chip number   Sample    Hours   HGF
26 (b)        Control   8 h     –
1 (a)         Control   1 h     –
2             HGF       1 h     0.5
3             HGF       2 h     0.5
4             HGF       3 h     0.5
5             HGF       4 h     0.5
6             HGF       6 h     0.5
7             HGF       8 h     0.5

Figure 4.21 Quality control of the microarray data. Dotted horizontal lines separate the plot into rows, one for each chip. Dotted vertical lines provide a scale from -3 to 3 fold expressions.
Each row shows the percentage of responding gene-probes relative to the total gene-probes on the chip, the average background, the scale factors and the GAPDH / β-actin ratios for an individual chip. The first indicator of the quality control pipeline is the average background, at the left of figure 4.21. The variation among the different chips is bigger than 10%; therefore they are colored red. This parameter is usually related to experimental problems, such as differences in labeling concentration or in the efficiency of the hybridization cocktails. Here, the strongest variation is found at chip number 5, which has a strong average background (67 units, compared to the average of 43 units of the two reference chips, 1 and 26). Usually this parameter has to be analyzed together with the rest of the quality control to avoid false conclusions; however, one has to bear in mind the particularities associated with the mentioned issues. Therefore, the strong background of chip 5 will be analyzed later. In figure 4.21, GAPDH 3':5' values are plotted as circles. According to Affymetrix, they should be about 1. The obtained values suggest an imbalance of the different transcript sections (see methods section). However, no issues could be concluded here. β-actin 3':5' ratios are plotted as triangles. Because this is a longer gene, the recommendation is for the 3':5' ratios to be below 3; values below 3 are colored blue, those above would be red. This result indicates no early degradation issues of the probes. The percentage of present gene-probes is listed at the left side of the figure. The variation among the different chips is less than 10%; therefore, they are colored blue. However, it is important to notice that, in general, the values are very low. This could have two explanations.
The first is that this is a normal result, because not all the genes are expected to respond to the stimuli, but just a small fraction (around 33% of the total gene-probes). The second is that the low percentage of responsive probes is very likely an indicator of the mixture of cells. Since this parameter is designed for measurements on homogeneous cell populations, the normal interpretation does not apply here. Instead, since probesets are flagged marginal or absent when their PM values are not considered significantly above the MM probes, the number of mismatches could vary due to cross-hybridization of the mixture with genes from the different cell lines. Finally, the blue stripe in the graph represents the range where scale factors are within 3-fold of the mean over all chips. Scale factors are plotted as a line from the centerline of the image. A line to the left corresponds to a downscaling, one to the right to an upscaling. If any scale factor fell outside this '3-fold region', it would be colored red; otherwise it is blue. As shown in figure 4.21, there are no issues related to large scaling factors.

Dimensionality reduction: Data selection and interpolation

As explained previously, data selection and interpolation are applied to reduce the dimensionality of the data. The first step, data selection, is crucial to find the real response of the biological system, and no bias should be applied here towards a particular set of genes. At the same time, ideally, no genes balancing the overall system should be missed. It is a difficult compromise. For this data set, it was better to rely on the expertise of the experimental designer, because the lack of controls could to some extent be compensated by assumptions from the area of interest.
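Returning to the quality-control flags above, a minimal sketch (with hypothetical chip values) of the two numeric checks, average background within 10% of the chips' mean and scale factors within 3-fold of the mean, could read:

```python
import numpy as np

# Hypothetical per-chip quality metrics for eight chips; chip 5 (index 4)
# carries the strong background mentioned in the text.
avg_background = np.array([43.0, 44.0, 42.0, 45.0, 67.0, 43.0, 44.0, 43.0])
scale_factors  = np.array([1.1, 0.9, 1.0, 1.2, 1.3, 1.0, 0.8, 1.0])

# Flag ("red") chips whose background deviates more than 10% from the mean.
bg_mean = avg_background.mean()
bg_red = np.abs(avg_background - bg_mean) / bg_mean > 0.10

# Flag scale factors outside the 3-fold region around the mean.
sf_mean = scale_factors.mean()
sf_red = (scale_factors > 3 * sf_mean) | (scale_factors < sf_mean / 3)

print("background flags:", bg_red)
print("scale factor flags:", sf_red)
```

With these illustrative values only chip 5 is flagged for background, and no scale factor leaves the 3-fold region, matching the qualitative picture described for figure 4.21.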
In this sense, migration is the desired phenotypic response to be associated with a selected set of genes. Therefore, genes were selected by an expert experimentalist, based on the knowledge of their association with the migration process. Additionally, genes were selected according to their expression profile. The idea was to select genes of three different kinds: early genes, possible target genes of the previous ones, and late-response genes that could be targets of the first two groups. In this way, the selected genes were PTGS2, CEACAM1, ETS1, EGR1, JUNB, FOS, PLAU, ITGAV, ITGB6, SERPINE, LAMA3 and LAMC2; additionally, I added PLAUR in order to check its possible role in an autocrine feedback loop. The expression profile of the selected genes is depicted in figure 4.22.

[Figure 4.22: per-gene expression profiles (EGR1, JUNB, FOS, ETS1, PTGS2, PLAUR, SERPINE1, ITGB6, LAMA3, ITGAV, PLAU, CEACAM1, LAMC2); expression (fold) vs. time (h), 1-8 h.]

Figure 4.22 Selected data expression profile. Error bars do not represent experimental repetitions; instead, they represent the standard deviation among the probe sets of every gene at every sampling time point.
The standard deviation represents the confidence in the data of every selected gene. The genes in figure 4.22 are ordered by their expression profile: from top to bottom, they run from the early-responding to the late-responding genes. Here, it is important to notice that the standard deviation bars do not represent repetitions; instead, they represent the variation of signal intensity for those genes which have more than one probe-set. Hence, this standard deviation can be used only to measure the specificity of every probe-set, but not to demonstrate the reproducibility of the experimental setup. In this sense, genes without standard deviation bars, such as JUNB and FOS, are represented by just one probe on the microarray chip. Even though these two genes belong to an important family with several homologous genes, the AP-1 system, there is just one probe for each member of the family. Therefore, one has to take special care, because the possibility of cross-hybridization with the respective genes from the other cell lines increases. Genes like PTGS2 and PLAU exhibit a very small standard deviation among their different probes; this shows that some genes have no cross-hybridization among the different cell lines. Instead, their probe-sets are specific for them. By contrast, the rest of the genes exhibit a large standard deviation among their probe-sets. This situation is very likely due to cross-hybridization. The putative functionality of the selected genes is depicted in Figure 4.23. Notice that the early-responding genes EGR1, JUNB, FOS and ETS1 from figure 4.22 are all transcription factors (TF). Additionally, the apparently early-responding genes from the upper panels of figure 4.22 (ITGB6 and LAMA3) encode a cell receptor and an extracellular laminin protein, respectively.
[Figure 4.23: scheme of the available information on the thirteen selected genes: stimuli; early genes (inflammation, inhibition of anoikis); integrins (complex formation, migration); laminins (adhesion, migration); cell adhesion; cell signalling; plasminogen inhibition.]

Figure 4.23 Thirteen selected genes, available information scheme.

The genes whose maximum response happens after 2 h (the middle panels of Figure 4.22) are PTGS2, SERPINE1, ITGAV and PLAUR. The first two, PTGS2 and SERPINE1, encode inhibitory proteins associated with cell signaling events, while the last two encode cell receptors associated with the migration process, as depicted in Figure 4.23. Lastly, there are genes whose expression profile, in the lower panels of Figure 4.22, shows an increasing late response: CEACAM1, LAMC2 and PLAU. Their cellular functionality varies: CEACAM1 encodes a protein receptor associated with processes such as cell adhesion and signaling; LAMC2 encodes another laminin protein associated with cell-to-cell adhesion and the migration process; and PLAU encodes the excreted protein PLAU, which could take part in an autocrine feedback loop for this process. Concerning the interpolation process, since there is no information about the associated noise, linear interpolation was used. However, two important aspects need to be taken into account to configure the final data set to be engineered. The first is related to the quality control of chip 5 and the strong background signal at 4 hours. A careful visual inspection of the data depicted in figure 4.22 confirms that this data point very likely shows an artificial inflexion in almost all the selected genes.
Inflexions in the data are the means by which any correlation or multi-regression model can associate putative interactions among genes; therefore, a spurious inflexion could play a strong artificial role in the final result. By contrast, notice that the FOS gene shows a second wave at the 3 h point that could not be attributed to an experimental issue, as demonstrated by the quality control, and it was therefore kept. Hence, the decision was taken to discard the data of the 4 h point for the reverse engineering process, as depicted in figure 4.24.

[Figure 4.24: linearly interpolated expression profiles of the thirteen selected genes (ETS1, JUNB, FOS, EGR1, CEACAM1, PLAU, PTGS2, LAMA3, LAMC2, ITGAV, ITGB6, SERPINE1, PLAUR); expression vs. time (h), 0-8 h.]

Figure 4.24 Linear interpolation of the expression profile of the selected gene data. Due to quality control issues, the 4 h data were discarded. The interpolated data of the protein-encoding PTGS2 gene appear to react first to the stimuli (strongest slope). The FOS transcription-factor-encoding gene presents a plateau between 2 and 3 h.

The second issue that could arise from interpolating the selected data is that an unexpectedly strong, early signal comes from some genes that do not belong to the early genes, such as PTGS2. Notice in figure 4.24 that the strong slope of this gene could lead the model to associate it as an initiator of the response. Hence the importance of adequate sampling: an intermediate 30-minute point might easily show an even more abrupt response for this gene, but at a later time than for the early genes.

Data fitting

As can be seen in figure 4.24, the profile of the data to be fitted is relatively simple compared to the previous two experiments of sections 4.1 and 4.2. Therefore, the initial prescreening of the parameters showed that a much smaller population of only 100 individuals could be used to fit the data.
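A minimal sketch of the interpolation scheme just described (linear, on a minute grid, with the 4 h sample discarded), using illustrative values rather than the actual measurements:

```python
import numpy as np

# One gene's fold-expression samples; the 4 h point has been discarded for
# quality reasons, so the remaining samples are interpolated across the gap.
hours = np.array([1.0, 2.0, 3.0, 6.0, 8.0])
expr  = np.array([0.1, 1.8, 2.2, 1.5, 1.2])

grid_min = np.arange(60, 481)                  # 1 h .. 8 h in minutes
interp = np.interp(grid_min, hours * 60, expr) # piecewise-linear interpolation

print(interp[0], interp[-1])
```

Note that `np.interp` reproduces the measured points exactly and draws straight segments between them, which is the behavior argued for in the noisy-data discussion of the Methods.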
Additionally, since the human cellular life span and the response of the selected set of genes are longer than those of the yeast or E. coli systems of the two previous theoretical experiments, a larger time-delay parameter has to be used here. However, as can be seen from the results of section 4.1.1, even a short increment in the time-delay range can have undesired results (see figure 4.2). Therefore, a maximum of 45 units, equivalent to minutes in the data set of figure 4.24, is used to fit these data. The selected range is of extreme importance, especially considering that this experimental setup includes an external input: the fitting could easily be driven just by the external input, and when this is withdrawn, the selected functional module would cease its behavior, neglecting that even without strong external signals genes interact with each other at a basal level. Actually, this is the reason for the bias term, which is the only source of activity for the repressilator modeled in section 4.1. As mentioned before, this experimental setup includes the addition of HGF as a firing signal. Therefore, equation (3.18) was used here:

    τ dY_i/dt = Σ_{j=1..N} W_ij σ( Y_j(t − δ_i) − θ_j ) − ϑ_i Y_i + w_i I        (4.4)

where the last term on the right-hand side includes an external input I. However, the w_i parameter is ≠ 0 only for those genes (represented by nodes) biologically able to receive the signal transduced from the external input; in other words, only the nodes representing the TF genes receive the external signal. It makes no biological sense for all the nodes to receive an input from the very beginning of the simulation runs, because this can easily impose a serious bias on the resulting network.
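A minimal Euler-integration sketch of equation (4.4), with hypothetical parameters for a three-node module: the delayed states Y_j(t − δ_i) are read back from the stored trajectory, and the external input I(t) (an assumed exponential decay over a 0.15 basal level, fed only to node 0) mimics the setup described above.

```python
import numpy as np

def sigma(x):
    # Logistic activation function.
    return 1.0 / (1.0 + np.exp(-x))

N, T, dt, tau = 3, 480, 1.0, 10.0            # nodes, minutes, step, time scale
W     = np.array([[0.0, 0.8, -0.5],
                  [0.6, 0.0,  0.0],
                  [0.0, 0.7,  0.0]])         # hypothetical weight matrix W_ij
theta = np.zeros(N)                          # sigmoid thresholds theta_j
decay = 0.1 * np.ones(N)                     # degradation rates vartheta_i
delay = np.array([5, 20, 45])                # per-node delays delta_i (<= 45 min)
w_in  = np.array([1.0, 0.0, 0.0])            # only node 0 receives the input

def I(t):
    # Hypothetical input: exponential decay over a 0.15 basal level.
    return 0.15 + 1.85 * np.exp(-t / 60.0)

steps = int(T / dt)
Y = np.zeros((steps + 1, N))                 # trajectory; Y[0] = initial state
for k in range(steps):
    t = k * dt
    dY = np.empty(N)
    for i in range(N):
        past = Y[max(k - int(delay[i] / dt), 0)]  # Y_j(t - delta_i), clamped
        drive = W[i] @ sigma(past - theta)        # sum_j W_ij * sigma(...)
        dY[i] = (drive - decay[i] * Y[k, i] + w_in[i] * I(t)) / tau
    Y[k + 1] = Y[k] + dt * dY

print(Y[-1])
```

This is a sketch of the model structure only; in the thesis workflow the parameters are not hand-set but optimized by the genetic algorithm against the interpolated expression data.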
Moreover, the input generated here has to be the same for the four selected input nodes (JUNB, FOS, ETS1 and EGR1), because no additional information exists. The only other possibility I consider is to change the intensity of the input that each of the so-called input nodes receives. Therefore, the w_i parameter ranges over [-1, 1], meaning that some genes are activated with different strengths by the input and some others could even be inhibited by it (which is not necessarily the case). Changing the shape of every input is again a strong assumption that easily drives the results towards a desired bias. As mentioned before, this is especially true if a larger time delay is chosen, for instance 2 h, because then the initial dynamics of the nodes are controlled mostly by the input. Therefore, the general input is modeled here by an exponential function with the dynamics depicted in figure 4.25, and, as mentioned, the w_i parameter was optimized to be ≠ 0 only for the ETS1, JUNB, FOS and EGR1 genes. Notice that there is a basal input along the entire simulation runs.

[Figure 4.25: simulated external input; signal strength vs. time (min), 1-240 min shown.]

Figure 4.25 Fraction of the dynamics of the simulated external input. This simulated input represents a constant basal stimulus of 0.15 arbitrary units until the end of the optimization lapse of 480 units, equivalent to 8 hours.

However, control data fittings were performed with no input at all, and with the same input without a basal signal. The rest of the parameters were chosen to be the same as those for the yeast and repressilator data sets. The MSE results of the 50 data-fitting runs per experiment, with and without input, are depicted in the histograms of Figure 4.26 a and b, respectively. Two important observations need to be mentioned from these histograms.
The first is that no significantly different distributions were found between the two (t-test at p = 0.05). The second is that the fitting reached very low MSE (high fitness) ranges in all cases (no outliers). These results indicate that the data are only weakly restrictive and that fitting them was achieved very easily. Unfortunately, this also suggests that the number of different solutions fitting the data equally well could be very broad.

[Figure 4.26: histograms of the MSE of the fitting runs; frequency vs. MSE bins (approx. 0.12-0.38 and more), panels a and b.]

Figure 4.26 Histograms of the fitting runs of the TDRNN model: a) using an external input and b) without external input.

Clustering and robust parameters identification

After clustering and performing the robust parameter identification as described in Methods, only poorly connected networks were found for both optimization groups, with and without input. Therefore, the z-score threshold was lowered to 1 unit in order to impose a less restrictive criterion for the identification of robust parameters in the clusters. The resulting five networks encountered for the input-bounded group are depicted in Figure 4.27.

[Figure 4.27: five inferred network diagrams, panels a-e.]

Figure 4.27 Resulting inferred networks of the five clusters of the fitting runs: a, b and e are discarded because of lack of biological meaning; networks c and d represent the potential solutions of the RE of the selected data.

The results depicted in figure 4.27 show that, although the data were fitted very well, no common pattern could be found, unlike in the previous work (see yeast in section 4.2).
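The relaxed robust-parameter criterion mentioned above (z-score threshold lowered to 1) might be sketched as follows, with hypothetical data: each clustered solution is assumed to be stored as a weight matrix, and a parameter counts as robust when the magnitude of its cluster mean exceeds one cluster standard deviation.

```python
import numpy as np

np.random.seed(0)
# Hypothetical cluster of 20 fitted solutions, each a 4x4 weight matrix.
# One connection (node 0 -> node 1) is consistently strong across solutions;
# the remaining weights fluctuate around zero.
solutions = np.random.normal(0.0, 0.5, size=(20, 4, 4))
solutions[:, 1, 0] += 2.0

mean = solutions.mean(axis=0)
std = solutions.std(axis=0)
z = np.abs(mean) / np.where(std > 0, std, 1e-12)  # per-parameter z-score

robust = z >= 1.0                                 # relaxed threshold of 1 unit
print(np.argwhere(robust))
```

Only connections whose sign and magnitude are consistent across the cluster pass the test; lowering the threshold from a stricter value to 1 admits more, but less reliable, edges.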
Notice that the indices were not included, since no significant difference among them was found (0.22, 0.20, 0.25, 0.23 and 0.21 for networks a, b, c, d and e, respectively). A similar situation was obtained for the no-input optimized group. Hence, it is not possible to draw serious conclusions about the different networks obtained. However, networks c and d exhibit a more realistic, biologically founded scenario. Usually, these two networks would be the only ones considered for further analysis, discarding the other three. Hence, these two networks would be an intermediate step in the iterative process of RE of GRN: one would extract connectivity hypotheses from them and test them with simpler and faster experiments in the wet laboratory. On the other hand, additional efforts to improve the identification of a common network pattern or robust parameters were made. Analogously to the repressilator case of section 4.1.2, the further dynamics of every optimization was simulated for every model for the equivalent of three times the optimized interval, i.e. 24 h (1440 min). As expected, all the models of the no-input runs arrived at a steady state defined by their last activity states. Therefore, all the encountered models remain activated. This could be argued to be the desired long-term activity behavior found in migration. However, due to the optimization scheme, I consider this result to be just an artifact of the indeterminacy of the system. Moreover, the models optimized with an external input interestingly showed a robust, similar behavior, even though the prolonged simulations do not include any input. Both cases are exemplified by one randomly selected model of each group, depicted in figure 4.28.

[Figure 4.28: stability analysis; expression vs. time (min), 1-1441 min, panels a and b.]

Figure 4.
28 Dynamics stability analysis of two solutions of the simulated data: a) simulation of the model using the external input until 480 min; b) simulation of the model without the external input.

Notice that after the input is withdrawn at 8 h (480 min), the model depicted in Figure 4.28 presents only a small perturbation, but it recovers by itself into the same steady state. This stable steady state was confirmed for intervals longer than 50 h. Therefore, in this case, the fitting of the data drives the model to a stable steady state. However, although these robust behaviors are logical and expected, it was still not possible to find a differential cluster in the solution space of each group. The modeling and reverse engineering of GRN is definitely possible and robust, as previously shown in sections 4.1 and 4.2, but it also has its limitations. As previously mentioned, the most important gain from this section is to notice the improvement opportunities for future work. For instance, for the second issue mentioned, measuring over the mixture of cells, an experimental improvement could be the use of flow cytometry (Gray, et al., 2007). Quantitative information could then be obtained about the proportion of every cell type in the mixture, and a numerical signal deconvolution algorithm could be applied (Fellenberg, et al., 2001) to split the measured mRNA intensity signal according to the proportional contribution of every cell line. Additionally, extending the sampling of the cell system dynamics would be another easy improvement for any data set like the one analyzed here. As previously mentioned, having just one measurement one hour after the supplied stimulus opens the possibility of behaviors like the one shown by the PTGS2 gene.
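The proportion-based deconvolution suggested above can be sketched as a least-squares unmixing, under the hypothetical assumption that several mixture samples with known (flow-cytometry-derived), linearly independent cell-line proportions are available; all values below are synthetic.

```python
import numpy as np

# Three cell lines, each with an (unknown) per-gene expression signature;
# four mixture samples with known proportions. Per gene, the signatures are
# recovered by least squares from: mixture = proportions @ signatures.
rng = np.random.default_rng(1)

n_genes, n_lines = 5, 3
signatures = rng.uniform(0.0, 10.0, size=(n_lines, n_genes))   # ground truth
proportions = np.array([[0.6, 0.3, 0.1],
                        [0.4, 0.4, 0.2],
                        [0.2, 0.5, 0.3],
                        [0.5, 0.2, 0.3]])                      # per sample

mixtures = proportions @ signatures                            # measured signal

recovered, *_ = np.linalg.lstsq(proportions, mixtures, rcond=None)
print(np.max(np.abs(recovered - signatures)))
```

In the noiseless, well-conditioned case shown here the per-line signatures are recovered exactly; with real, noisy mixtures the same least-squares step yields estimates whose quality depends on how different the proportion vectors are across samples.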
Notice that in the network depicted in figure 4.27e this protein-encoding gene indeed appears at the top of an activation cascade. However, this simple improvement can only be achieved if the experimental setup is of a reproducible nature.

5. Discussion

In general, the reverse engineering of gene regulatory networks serves two purposes. It is concerned with the development of models with high predictive power to perform in silico experiments, for instance theoretical knock-outs. However, the main task of RE of GRN is to use such models in a broader pipeline to establish a framework with the inference power to generate new knowledge about the biological networks under study. Therefore, this is a hot topic in modern biological sciences, and new models and frameworks have been intensively developed over the last ten years. However, since this is a multidisciplinary area, it requires deep knowledge of different fields such as molecular biology, mathematics, computer science, biochemistry, etc. This situation has been a source of limitations and even of several misconceptions in the development of such models and frameworks in the area. In this section, I will summarize the more important results obtained in this thesis, pointing out the critical steps for the area, and finally suggest some areas of improvement according to the experience gained along this work.
5.1 Model choice and data driven experiments

As explained in the Biological context, traditional enzymatic models have no chance of dealing with the complexity of gene regulatory networks, because the available data are by far not enough, in quantity or in quality, to perform such calculations. Instead, pragmatic models, whether black or grey box, can be useful, depending on the particular research objectives. In the introduction and throughout this thesis it was demonstrated that the new, technologically generated data play a double role in the area: on the one hand they are the motivation for such integrative theoretical work, but at the same time they are the bottleneck to obtaining more accurate models and inference frameworks. Therefore, in the last years the international community has started to deal with this situation and to perform so-called data-driven experiments. Data-driven experiments are tedious and expensive from an experimental biologist's point of view. They require, for instance, tightly sampled time series for every stimulus-response experiment, with their respective (statistically significant) replicates. However, these data-driven experiments are absolutely necessary for any serious effort in modern Systems Biology. Sooner or later the data limitation will be solved, and large amounts of data will need to be integrated into a newly nascent biological epistemology. Then one comes back to the problem of choosing the adequate model according to every particular research goal. In this sense, there are good and very interesting efforts to systematically avoid the misconceptions and misunderstandings that the bias of different professional backgrounds imposes on the choice of a model or a framework.
In this context, I propose with this work a semiautomatic (globally optimized), fast generation of models to cover, at a topological description level, two biological networks, signal transduction and gene regulatory networks, through a general time-delayed recurrent neural network. Additionally, I propose the innovative technique of clustering the parameters of the model solutions, to improve the actual inference power of different frameworks. However, there are still many issues to address in order to improve current reverse engineering frameworks. These issues are further discussed in the next sections of this general discussion chapter.

5.2 Data selection

Following the order of the proposed workflow (see figure 2.1), the first critical step in the RE area is the data selection performed to reduce the dimensionality of the system to be engineered. Besides the different techniques described in the related-work and methods sections, a central problem needs to be solved: the so-called data orthogonality. Ideally, the data of the selected genes or proteins (nodes, from a modeling perspective) should be autonomous (or orthogonal) for the cellular function under study with respect to the rest of the cellular components. These selected nodes should carry sufficient information that none of the represented genes or proteins lacks the information needed to explain its activation or inhibition, as occurs in the reverse-engineered network proposed by Li et al., where five auto-inhibitory feedback loops need to be added in order to make the module functional. The gene regulatory goal network of the yeast cell cycle used in this study is a conceptual approximation based on experimental evidence. Here, I used just a single gene expression time series, with an unknown amount of noise, for reverse engineering.
Hence, it would have been rather unlikely that all inferred connections were correct. In this light, I consider finding 44% of the total correlations (15/34), or 54% (7/13) of the indispensable weighted directed edges, a very good result. Obviously, I do not suggest testing experimentally for the false positives I encountered. Instead, I suggest reading them in the GED context, taking meaningful shortcuts into account. In the repressilator case study of section 4.1.4, given the incompleteness of the information with respect to the number of optimized nodes, I included one or two "blind" nodes in the model to stand in for the missing nodes. While this is valid for a benchmark analyzing the inference power of the models, it cannot be the case for a normal pipeline such as the keratinocyte case study analyzed here. Usually, one has no precise information on the percentage of missing nodes. The solution in this case was to use my bottom-up proposal: starting from a core of nodes selected using experts' knowledge and letting it grow whenever more nodes need to be added in the light of biological interest. However, the approach used in a related keratinocyte case study (Busch, et al., 2008) differs in this respect by using an expression-strength ranking of the nodes (in this case, genes). Since the expression range of the nodes added next is considerably smaller, they will certainly not change the initially engineered topology; this solution imposes a strong bias on the originally selected core, based on the function one is interested in finding. I rather propose to work under a bottom-up approach, starting with the minimum node information of biological interest, selected without bias, for instance via the GeneSet enrichment platform (see Methods).
Additionally, once an initial node core is selected and reverse engineered, one should let it grow by adding "blind nodes" in a range of < 25% of the total initial node core, and re-engineer it until one can define a common topological pattern. Then one can fix the encountered robust parameters, add the data of new nodes and restart the reverse engineering process. This is certainly a lot of work to perform, but it imposes no bias towards a particular result. Therefore, the time consumed could be compensated by the knowledge one can obtain in this supervised iterative process.

5.3 Data interpolation, implications

Surprisingly, the TDRNN model is very robust against noise when performing the RE task. However, the different interpolation techniques play an important role in representing noisy data. My results are in agreement with: "You should not fit data with a parametric model after smoothing, because the act of smoothing invalidates the assumption that the errors are normally distributed" (Draper and Smith, 1998). Therefore, for higher amounts of noise (> 20%), linear interpolation performs better on the synthetic data set. Consistently, the results from the experiments in section 3.2 suggest that interpolation plays an important role in facilitating the approximation of the data. Here, the MSE of the TDRNN using linear interpolation is half as large as when the same model uses other interpolation approaches. Therefore, the relationship between these MSE and the GED results suggests that this large difference plays a significant role in the entire RE process. Hence, the results suggest the use of linear interpolation unless the percentage of noise is low.
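The quoted caveat about smoothing can be illustrated in a few lines: linear interpolation passes exactly through the measured (noisy) points, while a low-order polynomial smoother alters them, changing the error structure that a subsequent parametric fit would assume. All values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.array([1.0, 2.0, 3.0, 6.0, 8.0])
y = np.sin(t / 2.0) + rng.normal(0.0, 0.2, size=t.size)   # noisy samples

linear_at_samples = np.interp(t, t, y)         # exact reproduction of the data
coeffs = np.polyfit(t, y, deg=2)               # quadratic smoother
smooth_at_samples = np.polyval(coeffs, t)      # altered values at the samples

print(np.max(np.abs(linear_at_samples - y)))   # zero: data preserved
print(np.max(np.abs(smooth_at_samples - y)))   # nonzero: data altered
```

Fitting a parametric model to the smoothed curve would therefore fit residuals that are no longer the original (assumed normally distributed) measurement errors.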
The superior reverse engineering performance of my TDRNN model compared to the CTRNN model on the synthetic network was obtained under varying circumstances, such as incomplete or noisy data. This was achieved even though this data set has no differing time delays between genes and is highly symmetric, which should favor the CTRNN model because its node responses are already synchronized.

5.4 Data fitting and inference power relationship

The next critical step in any RE framework is to define the optimal data fitting scheme for each particular problem. Note that this includes any additionally considered information, such as the sparsity-mixed optimization function introduced here. Since the MSE is not a normalized error measure (Battaglia, 1996) but depends on the range of the data, it is not possible to define an MSE threshold across different data sets that distinguishes good from bad data fitting. Accordingly, for the shortest periods [0.2, 0.33] of the repressilator case study in section 4.1.2, the less information available to measure the error, the lower the MSE inter-group limit obtained (see the lower whiskers of the boxplots in figure 4.7). However, even though the optimizations with the lowest MSE correlate directly with the errors (see figure 4.9), the available information is still too scarce to restrict the solution space, and several unrelated solutions are even found (see figure 4.10, upper panels). In other words, under these low-information conditions the GA gets trapped in local maxima with fitness levels as high as that of the correct solution. Moreover, I have found that fitness alone is a poor indicator, with variable correlation to the inference power (RE task) of a model.
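The scale dependence of the MSE can be made concrete with a toy calculation (an illustration of the point from Battaglia (1996), not code from this work): rescaling a data set and its fit by a factor of 10 leaves the quality of the fit unchanged, yet multiplies the MSE by 100, whereas an RMSE normalized by the data range is invariant.

```python
import numpy as np

def mse(y, yhat):
    return float(np.mean((y - yhat) ** 2))

def nrmse(y, yhat):
    # RMSE normalized by the data range, comparable across scales.
    return float(np.sqrt(mse(y, yhat)) / (y.max() - y.min()))

t = np.linspace(0, 2 * np.pi, 50)
y = np.sin(t)
yhat = np.sin(t + 0.1)          # a slightly phase-shifted "fit"

# The same relative fit, on data rescaled by 10:
assert abs(mse(10 * y, 10 * yhat) / mse(y, yhat) - 100.0) < 1e-9
assert abs(nrmse(10 * y, 10 * yhat) - nrmse(y, yhat)) < 1e-12
```

This is why no single MSE threshold can separate good from bad fits across data sets with different ranges, while a range-normalized measure at least removes the scale dependence.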
This is especially true for the high-fitness (low MSE) [0.5, 0.75]-period regions in section 4.1.2, and for what should be seen as sufficient data, the [1, 3]-period regions of the same experiment. Additionally, note that the easy-to-fit zone does not coincide with the anti-correlation zone between 1 and 1.5 periods. Indeed, the correlation depends on both the data structure and the relative fitness values (as explained above, the MSE is not an absolute error measure). This is complex at first glance, but easier to understand through some examples from the 4.1.2 experiments. Figure 5.1 shows scatterplots of the individual errors vs. the fitness, expressed as the logarithm of the MSE, for the smallest optimized interval (0.2 periods) of the four models.

Figure 5.1: Scatterplots of the relationship between data fitting (log MSE, x-axis) and inference power (individual errors, y-axis) for the four models (CTRNN-NP, CTRNN-P, TDRNN-NP, TDRNN-P); for these fitness values, the relationship follows a quadratic function.

As can be seen in figure 5.1, the correlation in all these cases does not follow a linear tendency but rather a quadratic regression function. This is what is expected for this insufficient-information data structure. The interpretation is that in regions with poor fitting values (right of the vertex of each parabola, MSE ≥ 1) the models have problems fitting the data; there are therefore many individual errors, and one observes a strong positive correlation. For imperfect data (as with only 0.2 periods), fitting this imperfect data structure with high accuracy (left side of the parabolas, negative log(MSE)) is associated with a wrong network topology, a bad generalization analogous to over-fitting.
This occurs because the optimization process easily finds solutions other than the expected one that better fit that imperfect data structure. The number of individual errors therefore increases towards the lowest mean value of the quadratic regression at the vertex of the parabolas. This parabolic behavior is exactly what is expected when over-fitting arises. One may then ask how it is possible that in figure 4.9 all the correlations for this period length show a clear positive correlation. The answer lies in the boxplots of figure 4.7: these data structures (0.2 periods) have the widest MSE distributions, ranging from very small MSE with high fitness (lower whiskers) to large MSE with a few outliers, i.e., covering everything from over-fitting regions to regions of poor fitting. However, despite having few outliers, the distributions are not normal, and the Spearman correlation was therefore used. In contrast to the Pearson correlation, which is based on linear correspondence, the Spearman correlation is based on ranked data, so a relatively strong correlation can still be observed in these data regions. As previously mentioned, one should not base important conclusions on correlations alone, without analyzing the scatterplots; here, however, there is no contradiction, but an explanation for the complex behavior of the correlation. In principle, one should expect the vertices of these parabolas to move towards the origin of the graph (as depicted by the dashed arrow in figure 5.2) with increasing information (longer optimized period regions). Then there should be only a positive correlation, or no correlation when the fitness is around the vertex (as depicted by the box around the vertex in figure 5.2). For the easy-to-fit regions of figure 4.7, all fitting runs have a good fitness (low MSE) without over-fitting and few fitting failures.
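The point about rank-based correlation can be checked with a small sketch (generic statistics, not this work's analysis code): for a monotone but strongly non-linear relation, the Spearman coefficient, computed here as the Pearson correlation of the ranks, stays at 1, while the Pearson coefficient on the raw values drops.

```python
import numpy as np

def pearson(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return float(np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2)))

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks (no ties assumed here).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

x = np.arange(1.0, 11.0)
y = np.exp(x)      # monotone but far from linear, with extreme values

print(round(pearson(x, y), 3))   # well below 1: linearity is violated
print(round(spearman(x, y), 3))  # 1.0: the relation is perfectly monotone
```

This robustness to non-normal, outlier-heavy distributions is exactly why the Spearman correlation still shows a relatively strong signal for the wide MSE distributions discussed above.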
Therefore, the correlation lies around the vertex of this parabolic regression, i.e., close to zero: no Spearman correlation.

Figure 5.2: Proposed parabolic behavior of the inference power and fitness relationship.

Note that a second issue for this kind of correlation is that the individual errors on the y-axis are measured on a discrete scale, while the fitness (MSE) has a continuous scale. Hence, for small variations of the MSE around the vertex there are no variations in the individual errors, which means no correlation. This explains why a quadratic regression would give the same results; it is more important to take into account the fitness (MSE) region in which the optimization runs lie. These last two arguments partially explain the third fitness (MSE) optimization region. Since fitness alone is a poor indicator for the RE task, there should be other criteria for optimizing the models, in addition to fitting the data. For instance, one should have an idea of the noise or incompleteness of the data in order to use an “early stopping” criterion against some reference. Such a reference is usually not available, but could, and should, be part of a routine experimental setup; it could be established using any prior knowledge of the network to be engineered. Another possibility is to perform the optimizations while letting the model explore different possible solutions (or fitness regions), then to use the clustering technique to identify these different solutions and to evaluate them with the index suggested here. Regarding network sparsity, which was conceived as another criterion for optimizing the models, I found that the adaptive pruning function introduced here decreases the connectivity of the resulting networks without considerably altering their fitness.
While the pruning function helps to separate the different network solutions, and consequently the clusters, I found no evidence that pruning alone improves the RE task. It is now clear that network sparsity is not a decisive criterion for the RE task. Instead, connectivity characteristics such as small-world (Potapov, et al., 2005) or scale-free (Balaji, et al., 2006; Chen, et al., 2008; Iguchi, et al., 2007; Kauffman, 2004; Wildenhain and Crampin, 2006; Zhou, 2005) topologies could be taken into account. In my case, however, the size of the studied networks was not suitable for such an analysis.

5.5 Reverse engineering framework, improving the robust parameter selection

An effective improvement to the reverse engineering is the clustering of different model solutions based on the identification of robust parameters. This clustering approach is innovative in that it is applied directly to the parameter space and not to the dynamics of the models. The difference is exemplified in section 4.1.4: using Lyapunov exponents to restrict solutions to those with stable and similar dynamics could not recover the two valid repressilator architectures, but clustering applied to the parameter solution space does. Additionally, the same clustering approach increased the number of correct edges (see section 4.2.3) while not affecting the number of false positives. Moreover, I found the cluster index useful for distinguishing between possible solutions. Instead of yielding a unique solution, this strategy will usually narrow the putative solutions down to a few possibilities to be validated experimentally.
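Clustering solutions in parameter space can be sketched as follows (a minimal k-means in the spirit of MacQueen (1967), not the exact clustering procedure of this work; the solution vectors are synthetic): each optimization run is flattened into one weight vector, and runs that converged to the same network architecture form one cluster, whose centroid exposes the robust parameters.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means with a deterministic farthest-point initialization.
    Returns (labels, centroids)."""
    centroids = [X[0]]
    for _ in range(k - 1):
        # Next centroid: the point farthest from all chosen centroids.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign every solution vector to its nearest centroid, then update.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic "solutions": 20 runs near architecture A, 20 near architecture B.
rng = np.random.default_rng(1)
sol_a = rng.normal(loc=0.0, scale=0.3, size=(20, 9))  # flattened 3x3 weights
sol_b = rng.normal(loc=5.0, scale=0.3, size=(20, 9))
X = np.vstack([sol_a, sol_b])

labels, centroids = kmeans(X, k=2)
# All runs belonging to one architecture end up in one cluster.
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
assert labels[0] != labels[20]
```

The cluster centroids then play the role of the robust parameter sets: coordinates with small within-cluster variance are the parameters the optimization consistently agrees on.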
6. Conclusions

Reverse engineering of gene regulatory networks is an iterative process between experiments and modeling. In this process, the correct representation of the original system is a critical step. In this sense, the TDRNN has been shown to constitute an improved approach compared to existing CTRNN or DBN models. Additionally, the clustering of the reverse engineering solutions provides a novel method to identify robust parameters within dynamic recurrent neural networks. Altogether, I presented a supervised learning framework that helps to provide novel insight into the dynamic systems properties of genetic regulatory networks, both from a biological and a theoretical point of view. If the only data source is of a transcriptional nature, such as RT-PCR or microarray data, only transcriptional networks can be inferred. Cellular information processing, however, is a feedback-entangled process between protein signaling and gene regulation, for which combined transcriptomic and proteomic data are needed. I consider the TDRNN the ideal model for incorporating both types of data: owing to its time delays, which can range from seconds to hours, it can naturally accommodate both the fast responses occurring in signal transduction and the slow responses of the genetic regulatory network, as demonstrated for the yeast cell cycle data set.

7. Bibliography

Akutsu, T., Miyano, S. and Kuhara, S. (2000) Algorithms for identifying Boolean networks and related biological networks based on matrix multiplication and fingerprint function, J Comput Biol, 7, 331-343.
Albertini, F. and Sontag, E. (1993) Uniqueness of weights for recurrent nets, Proceedings of the International Symposium Math. Theory of Networks Syst., II, 599-602.
Arkin, A., Ross, J. and McAdams, H.H. (1998) Stochastic kinetic analysis of developmental pathway bifurcation in phage lambda-infected Escherichia coli cells, Genetics, 149, 1633-1648.
Auboeuf, D., Batsche, E., Dutertre, M., Muchardt, C. and O'Malley, B.W. (2007) Coregulators: transducing signal from transcription to alternative splicing, Trends Endocrinol Metab, 18, 122-129.
Bar-Joseph, Z. (2004) Analyzing time series gene expression data, Bioinformatics, 20, 2493-2503.
Bar-Joseph, Z., Farkash, S., Gifford, D.K., Simon, I. and Rosenfeld, R. (2004) Deconvolving cell cycle expression data with complementary information, Bioinformatics, 20 Suppl 1, I23-I30.
Battaglia, G.J. (1996) Mean square error, AMP Journal of Technology, 5, 31-36.
Bay, S.D., Shrager, J., Pohorille, A. and Langley, P. (2002) Revising regulatory networks: from expression data to linear causal models, J Biomed Inform, 35, 289-297.
Bebis, G., Georgiopoulos, M. and Kasparis, T. (1996) Coupling weight elimination and genetic algorithms. In: IEEE International Conference on Neural Networks, Washington, DC, USA, 1115-1120.
Beer, R.D. (1995) On the dynamics of small continuous-time recurrent neural networks, Adaptive Behavior, 3(4), 469-509.
Beer, R.D. (2006) Parameter space structure of continuous-time recurrent neural networks, Neural Comput, 18, 3009-3051.
Ben-Dov, C., Hartmann, B., Lundgren, J. and Valcarcel, J. (2008) Genome-wide analysis of alternative pre-mRNA splicing, J Biol Chem, 283, 1229-1233.
Blackwood, E.M. and Kadonaga, J.T. (1998) Going the distance: a current view of enhancer action, Science, 281, 60-63.
Blanco, A., Delgado, M. and Pegalajar, M.C. (2001) A real-coded genetic algorithm for training recurrent neural networks, Neural Netw, 14, 93-105.
Boutros, M. and Ahringer, J. (2008) The art and design of genetic screens: RNA interference, Nat Rev Genet, 9, 554-566.
Busch, H., Camacho-Trullio, D., Rogon, Z., Breuhahn, K., Angel, P., Eils, R. and Szabowski, A. (2008) Gene network dynamics controlling keratinocyte migration, Mol Syst Biol, 4, 199.
Cao, Y., Gillespie, D.T. and Petzold, L.R. (2006) Efficient step size selection for the tau-leaping simulation method, J Chem Phys, 124, 44109.
Cremer, T. and Cremer, C. (2001) Chromosome territories, nuclear architecture and gene regulation in mammalian cells, Nat Rev Genet, 2, 292-301.
Chen, K. and Rajewsky, N. (2007) The evolution of gene regulation by transcription factors and microRNAs, Nat Rev Genet, 8, 93-103.
Chen, T., He, H.L. and Church, G.M. (1999) Modeling gene expression with differential equations, Pac Symp Biocomput, 29-40.
D'Haeseleer, P., Liang, S. and Somogyi, R. (2000) Genetic network inference: from co-expression clustering to reverse engineering, Bioinformatics, 16, 707-726.
D'Haeseleer, P., Wen, X., Fuhrman, S. and Somogyi, R. (1999) Linear modeling of mRNA expression levels during CNS development and injury, Pac Symp Biocomput, 41-52.
D'Haeseleer, P. (2000) Reconstructing gene networks from large scale gene expression data, Ph.D. Thesis.
de Jong, H. (2002) Modeling and simulation of genetic regulatory systems: a literature review, J Comput Biol, 9, 67-103.
Downes, S.M. (2004) Alternative splicing, the gene concept, and evolution, Hist Philos Life Sci, 26, 91-104.
Draper, N.R. and Smith, H. (1998) Applied Regression Analysis. John Wiley & Sons, New York.
Edwards, R. and Glass, L. (2000) Combinatorial explosion in model gene networks, Chaos, 10, 691-704.
Elowitz, M.B. and Leibler, S. (2000) A synthetic oscillatory network of transcriptional regulators, Nature, 403, 335-338.
Fellenberg, K., Hauser, N.C., Brors, B., Neutzner, A., Hoheisel, J.D. and Vingron, M.
(2001) Correspondence analysis applied to microarray data, Proc Natl Acad Sci U S A, 98, 10781-10786.
Feyzi, E., Sundheim, O., Westbye, M.P., Aas, P.A., Vagbo, C.B., Otterlei, M., Slupphaug, G. and Krokan, H.E. (2007) RNA base damage and repair, Curr Pharm Biotechnol, 8, 326-331.
Florin, L., Hummerich, L., Dittrich, B.T., Kokocinski, F., Wrobel, G., Gack, S., Schorpp-Kistner, M., Werner, S., Hahn, M., Lichter, P., Szabowski, A. and Angel, P. (2004) Identification of novel AP-1 target genes in fibroblasts regulated during cutaneous wound healing, Oncogene, 23, 7005-7017.
Friedman, N., Linial, M., Nachman, I. and Pe'er, D. (2000) Using Bayesian networks to analyze expression data, J Comput Biol, 7, 601-620.
Funahashi, K. (1989) On the approximate realization of continuous mappings by neural networks, Neural Networks, 2, 183-192.
Gillespie, D.T. (1977) Exact stochastic simulation of coupled chemical reactions, J. Phys. Chem., 81, 2340-2361.
Gillespie, D.T. (1992) A rigorous derivation of the chemical master equation, Physica A, 188, 404-425.
Gray, A.C., McLeod, J.D. and Clothier, R.H. (2007) A review of in vitro modelling approaches to the identification and modulation of squamous metaplasia in the human tracheobronchial epithelium, Altern Lab Anim, 35, 493-504.
Guet, C.C., Elowitz, M.B., Hsing, W. and Leibler, S. (2002) Combinatorial synthesis of genetic networks, Science, 296, 1466-1470.
Hasty, J., Pradines, J., Dolnik, M. and Collins, J.J. (2000) Noise-based switches and amplifiers for gene expression, Proc Natl Acad Sci U S A, 97, 2075-2080.
He, L. and Hannon, G.J. (2004) MicroRNAs: small RNAs with a big role in gene regulation, Nat Rev Genet, 5, 522-531.
Hill, T. and Lewicki, P.
(2006) Statistics: Methods and Applications. A Comprehensive Reference for Science, Industry, and Data Mining. StatSoft, Tulsa, Oklahoma.
Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P. and Kummer, U. (2006) COPASI--a COmplex PAthway SImulator, Bioinformatics, 22, 3067-3074.
Hu, X., Maglia, A. and Wunsch, D. (2005) A general recurrent neural network approach to model genetic regulatory networks, Conf Proc IEEE Eng Med Biol Soc, 5, 4735-4738.
Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A. and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics, 18 Suppl 1, S96-104.
Itzkovitz, S., Tlusty, T. and Alon, U. (2006) Coding limits on the number of transcription factors, BMC Genomics, 7, 239.
Juliano, R., Alam, M.R., Dixit, V. and Kang, H. (2008) Mechanisms and strategies for effective delivery of antisense and siRNA oligonucleotides, Nucleic Acids Res, 36, 4158-4171.
Kastner, J., Solomon, J. and Fraser, S. (2002) Modeling a hox gene network in silico using a stochastic simulation algorithm, Dev Biol, 246, 122-131.
Kauffman, S. (1971) Gene regulation networks: a theory for their global structure and behaviors, Curr Top Dev Biol, 6, 145-182.
Kauffman, S., Peterson, C., Samuelsson, B. and Troein, C. (2004) Genetic networks with canalyzing Boolean rules are always stable, Proc Natl Acad Sci U S A, 101, 17102-17107.
Keasling, J.D., Kuo, H. and Vahanian, G.
(1995) A Monte Carlo simulation of the Escherichia coli cell cycle, J Theor Biol, 176, 411-430.
Kim, S.-S. (1998) Time-delay recurrent neural network for temporal correlations and prediction, Neurocomputing, 20, 253-263.
Lanctôt, C., Cheutin, T., Cremer, M., Cavalli, G. and Cremer, T. (2007) Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions, Nat Rev Genet, 8, 104-115.
Lewis, B.A. and Reinberg, D. (2003) The mediator coactivator complex: functional and physical roles in transcriptional regulation, J Cell Sci, 116, 3667-3675.
Li, F., Long, T., Lu, Y., Ouyang, Q. and Tang, C. (2004) The yeast cell-cycle network is robustly designed, Proc Natl Acad Sci U S A, 101, 4781-4786.
Liang, S., Fuhrman, S. and Somogyi, R. (1998) Reveal, a general reverse engineering algorithm for inference of genetic network architectures, Pac Symp Biocomput, 18-29.
Liao, X. and Wang, J. (2003) Global dissipativity of continuous-time recurrent neural networks with time delay, Phys Rev E Stat Nonlin Soft Matter Phys, 68, 016118.
Lipan, O. and Wong, W.H. (2005) The use of oscillatory signals in the study of genetic networks, Proc Natl Acad Sci U S A, 102, 7063-7068.
Ma, J. (2004) The capacity of time-delay recurrent neural network for storing spatio-temporal sequences, Neurocomputing, 62, 19-37.
MacQueen, J.B. (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1, 281-297.
Mathayomchan, B. and Beer, R.D. (2002) Center-crossing recurrent neural networks for the evolution of rhythmic behavior, Neural Comput, 14, 2043-2051.
McAdams, H.H. and Arkin, A. (1997) Stochastic mechanisms in gene expression, Proc Natl Acad Sci U S A, 94, 814-819.
Misteli, T.
(2001) Protein dynamics: implications for nuclear architecture and gene expression, Science, 291, 843-847.
Mjolsness, E., Sharp, D.H. and Reinitz, J. (1991) A connectionist model of development, J Theor Biol, 152, 429-453.
Murphy, K. and Mian, S. (1999) Modelling gene expression data using dynamic Bayesian networks. Technical report, Computer Science Division, University of California, Berkeley, and Life Sciences Division, Lawrence Berkeley National Laboratory.
Paddison, P.J. (2008) RNA interference in mammalian cell systems, Curr Top Microbiol Immunol, 320, 1-19.
Pe'er, D., Regev, A., Elidan, G. and Friedman, N. (2001) Inferring subnetworks from perturbed expression profiles, Bioinformatics, 17 Suppl 1, S215-S224.
Perkins, T.J., Jaeger, J., Reinitz, J. and Glass, L. (2006) Reverse engineering the gap gene network of Drosophila melanogaster, PLoS Comput Biol, 2, e51.
Pilpel, Y., Sudarsanam, P. and Church, G.M. (2001) Identifying regulatory networks by combinatorial analysis of promoter elements, Nat Genet, 29, 153-159.
Potapov, A.P., Voss, N., Sasse, N. and Wingender, E. (2005) Topology of mammalian transcription networks, Genome Inform, 16, 270-278.
Radde, N. and Kaderali, L. (2007) Bayesian inference of gene regulatory networks using gene expression time series data. Springer, Berlin/Heidelberg, 1-15.
Rao, A., Hero, A.O., III, States, D.J. and Engel, J.D. (2007) Inferring time-varying network topologies from gene expression data, EURASIP J Bioinform Syst Biol, 51947.
Reinitz, J. and Sharp, D.H. (1995) Mechanism of eve stripe formation, Mech Dev, 49, 133-158.
Robles-Kelly, A. and Hancock, E.R. (2005) Graph edit distance from spectral seriation, IEEE Trans. Pattern Anal. Mach. Intell., 27, 365-378.
Savageau, M.A. (1969) Biochemical systems analysis. II. The steady-state solutions for an n-pool system using a power-law approximation, J Theor Biol, 25, 370-379.
Schreiber, M., Kolbus, A., Piu, F., Szabowski, A., Mohle-Steinlein, U., Tian, J., Karin, M., Angel, P. and Wagner, E.F. (1999) Control of cell cycle progression by c-Jun is p53 dependent, Genes Dev, 13, 607-619.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks, Genome Res, 13, 2498-2504.
Slepoy, A., Thompson, A.P. and Plimpton, S.J. (2008) A constant-time kinetic Monte Carlo algorithm for simulation of large biochemical reaction networks, J Chem Phys, 128, 205101.
Smolen, P., Baxter, D.A. and Byrne, J.H. (2000) Mathematical modeling of gene networks, Neuron, 26, 567-580.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol Biol Cell, 9, 3273-3297.
Stamm, S. (2002) Signals and their transduction pathways regulating alternative splicing: a new dimension of the human genome, Hum Mol Genet, 11, 2409-2416.
Stoll, G., Rougemont, J. and Naef, F. (2006) Few crucial links assure checkpoint efficiency in the yeast cell-cycle network, Bioinformatics, 22, 2539-2546.
Sturn, A., Quackenbush, J. and Trajanoski, Z. (2002) Genesis: cluster analysis of microarray data, Bioinformatics, 18, 207-208.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R.,
Lander, E.S. and Mesirov, J.P. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proc Natl Acad Sci U S A, 102, 15545-15550.
Swain, M., Hunniford, T., Dubitzky, W., Mandel, J. and Palfreyman, N. (2005) Reverse-engineering gene-regulatory networks using evolutionary algorithms and grid computing, J Clin Monit Comput, 19, 329-337.
van Nimwegen, E. (2003) Scaling laws in the functional content of genomes, Trends Genet, 19, 479-484.
van Someren, E.P., Wessels, L.F. and Reinders, M.J. (2000) Linear modeling of genetic networks from experimental data, Proc Int Conf Intell Syst Mol Biol, 8, 355-366.
van Someren, E.P., Wessels, L.F.A., Backer, E. and Reinders, M.J.T. (2002) Genetic network modeling, Pharmacogenomics, 3, 507-525.
Wahde, M. and Hertz, J. (2000) Coarse-grained reverse engineering of genetic regulatory networks, Biosystems, 55, 129-136.
Weaver, D.C., Workman, C.T. and Stormo, G.D. (1999) Modeling regulatory networks with weight matrices, Pac Symp Biocomput, 112-123.
West, A.G., Gaszner, M. and Felsenfeld, G. (2002) Insulators: many functions, many mechanisms, Genes Dev, 16, 271-288.
Whitley, D. (1993) A genetic algorithm tutorial. Tech. Rep. CS-93-103, Department of Computer Science, Colorado State University, Fort Collins, CO.
Wilson, C.L. and Miller, C.J. (2005) Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis, Bioinformatics, 21, 3683-3685.
Wu, C.C., Huang, H.C., Juan, H.F. and Chen, S.T. (2004) GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data, Bioinformatics, 20, 3691-3693.
Wuensche, A. (1998) Genomic regulation modeled as a network with basins of attraction, Pac Symp Biocomput, 89-102.
Erklärung (Declaration)

I, David Camacho Trujillo, declare that I wrote this dissertation myself and used no sources or aids other than those I have expressly indicated. Experimental data and materials that were not collected or produced by myself are specifically marked as such. I have not applied for an examination procedure at any other institution, nor have I used this dissertation, in this or any other form, as an examination paper elsewhere or submitted it as a dissertation to another faculty.

Parts of the present work were published in advance by agreement:
• David Camacho-Trujillo, Hauke Busch, and Roland Eils (2008): Reverse Engineering Gene Regulatory Networks with Time Delay Recurrent Neural Networks. Accepted for publication at Bioinformatics
• Busch, H., Camacho-Trullio, D., Rogon, Z., Breuhahn, K., Angel, P., Eils, R. and Szabowski, A. (2008) Gene network dynamics controlling keratinocyte migration, Mol Syst Biol, 4, 199.
• Schnickmann, S., Camacho-Trujillo, D., Bissinger, M., Eils, R., Angel, P., Schirmacher, P., Szabowski, A., Breuhahn, K. (2008): AP-1 controlled hepatocyte growth factor (HGF) activation promotes keratinocyte migration via CEACAM1 and uPA/uPAR. Accepted at Journal of Investigative Dermatology
• Camacho, D., Busch, H., Eils, R., Angel, P., Szabowski, A. (2008) Means and methods for diagnosing metastasizing potentials of tumor cells. Patent pending.

Heidelberg, _________________________________
David Camacho Trujillo
