Graph clustering for RNA secondary structure analysis

*Joint work with H. Du, C. Heitsch, F. Hurley, C.V. Mennicke, B.D. Sullivan and B. Xu*

Unlike DNA, RNA is single stranded and folds via bonds between pairs of complementary nucleotides while it is still being synthesized from DNA. One central problem in molecular biology is understanding the specific shape into which an RNA molecule folds, as its shape encodes functional information. We will focus on RNA secondary structure in this talk, that is the 2D arrangement of the final RNA configuration. For an RNA sequence, we will consider a representative set of secondary structures and discuss combinatorial methods to mine the structural information from the given ensemble. In particular, we will present a graph algorithms approach based on dissimilarity scores and community detection.

Combinatorial insights into biomolecular interactions

In this talk, we look at combinatorial methods to describe biomolecular processes, namely DNA self-assembly and DNA:RNA interactions. In the first part, we focus on self-assembling graph-like structures using branched junction DNA molecules which are star-shaped molecules that join together through adhesion sites at the end of their arms. We describe a combinatorial representation of these molecules as vertices with labeled half-edges, and consider the problem of optimally building a target graph under different laboratory settings. We show how this question give rises to new graph invariants and how these invariants can be studied using edge-colorings and graph decompositions. In the remainder of the talk, we use formal grammars to describe R-loops, three-stranded structures formed by a DNA:RNA hybrid and a single strand of DNA. A formal grammar is a system to generate words; it consists of a set of symbols, terminals and non-terminals, and a set of production rules. The production rules specify how to rewrite non-terminal symbols to obtain words formed by only terminals. We introduce a “braid grammar” to model R-loops as strings of terminal symbols representing different states of the braided structure and discuss approaches to add probabilities to the production rules, thereby yielding a stochastic grammar and a probabilistic model for R-loop prediction.

Combinatorial questions arising from biomolecular processes

We will look at combinatorial problems that describe biomolecular processes, namely DNA self-assembly and DNA:RNA interactions. In the first part, we show how DNA self-assembly processes give rise to new graph invariants and how these invariants are related to known chromatic parameters. We also discuss how these invariants help in building molecules that can optimally assemble into a target structure. In the second part of the talk, we use formal grammars to model DNA:RNA interactions and the formation of R-loops. By identifying patterns specific to R-loops using experimental data, we can expand the current grammar into a probabilistic model to predict locations favorable for R-loops in a given DNA sequence.

Insertions on double occurrence words motivated by DNA rearrangement

*Joint work with D. A. Cruz, N. Jonoska, L. Nabergall and M. Saito*

A double occurrence word (DOW) is a word in which every symbol appears exactly twice; two DOWs are equivalent if one is a symbol-to-symbol image of the other. We consider the so-called repeat pattern (αα) and the return pattern (αα^{R}), with gaps allowed between the α’s. DOWs and repeat/return words have been used in studies of DNA rearrangement in certain species of ciliates. Motivated by the genome architecture of the ciliate *Oxytricha trifallax*, we introduce a notion of inserting repeat/return words into DOWs and study how two distinct insertions into the same word can produce equivalent DOWs. Given a DOW *w*, we characterize the structure of *w* which allows two distinct insertions to yield equivalent DOWs. This characterization depends on the locations of the insertions and on the length of the inserted repeat/return words and implies that when one inserted word is a repeat word and the other is a return word, then both words must be trivial (i.e., have only one symbol). The characterization also introduces a method to generate families of words recursively.

Insertions on double occurrence words motivated by DNA rearrangement

*Joint work with D. A. Cruz, N. Jonoska, L. Nabergall and M. Saito*

A double occurrence word (DOW) is a word in which every symbol appears exactly twice; two DOWs are equivalent if one is a symbol-to-symbol image of the other. We consider the so called repeat pattern (αα) and the return pattern (αα^{R}), with gaps allowed between the α’s. These patterns generalize square and palindromic factors of DOWs, respectively. DOWs and repeat/return words have been used in studies of DNA rearrangements in certain species of ciliates. Motivated by the genome architecture of the ciliate *Oxytricha trifallax*, we introduce a notion of inserting repeat/return words into DOWs and study how two distinct insertions into the same word can produce equivalent DOWs. Given a DOW *w*, we characterize the structure of *w* which allows two distinct insertions to yield equivalent DOWs. This characterization depends on the locations of the insertions and on the length of the inserted repeat/return words and implies that when one inserted word is a repeat word and the other is a return word, then both words must be trivial (i.e., have only one symbol). The characterization also introduces a method to generate families of words recursively.

Graph-theoretical questions arising from biomolecular processes

We will look at graph-theoretical problems that describe biomolecular self-assembly processes and DNA:RNA interactions. In the first part, we show how DNA self-assembly processes give rise to new graph invariants and how these invariants are related to known chromatic parameters. We also discuss how these invariants help in building molecules that can assemble into a target structure. In the second part of the talk, we use directed graphs to analyze sequence variants that appear during RNA-guided DNA repair processes. Understanding the structure of these graphs requires the development of new graph-theoretical approaches, including topological methods.

Homology of directed graphs with application to DNA recombination

*Joint work with L. Fajardo Gómez, N. Jonoska and M. Saito*

A double-occurence word (DOW) is a word in which every symbol appears exactly twice. We consider the so called repeat patterns (αα) and return patterns (αα^{R}), with gaps allowed between the α’s; these patterns generalize square and palindromic factors of DOWs, respectively. In the context of genomics, pattern deletions on DOWs have been used to study DNA recombination in certain species of ciliates. We model these reduction processes with a directed graph where vertices are DOWs, and an edge from *w* to *w*‘ exists if *w*‘ is obtained from *w* through a pattern deletion. On this graph, we consider the cell complex consisting of products of directed simplices and define a new boundary operator. This allows the computation of homology groups, which will help in identifying rearrangement pathways that may be of interest.

Modelling RNA-DNA interaction

*Joint work with Y. Jeon, N. Jonoska and F. Storici*

In recent years, it has become evident that non-coding RNA can transfer information to DNA and impact gene expression. For instance, the Storici Lab showed that antisense transcript RNA can serve as a template for double-strand break (DSB) repair in yeast. The actual process of DSB repair is unclear, while experimental data indicate that repair must occur during transcription. We conjecture that DSB repair requires the formation of an R-loop, and we propose a testable mechanism where the antisense transcript RNA engages with the ssDNA of the R-loop formed by the sense transcript, and repairs the DNA. Moreover, we present mathematical approaches to model an R-loop, which in turn will help in analyzing RNA-DNA hybrids in the context of DSB repair.

Mathematical models for describing molecular self-assembly

*Joint work with N. Jonoska*

We present several mathematical models for describing molecular building blocks, called rigid tiles, that assemble in larger nanostructures. Rigid tiles can be seen as *k*-arm branch junction structures that join together by annealing to each other through the affinity of their arm-ends. Such a *k*-arm rigid tile is described with a set of *k* vectors joined at the origin that can be translated or rotated during the assembly. Besides the geometric shape of the building blocks, the models can take into account the geometry of the arm-ends joining together. We show distinctions between four models by characterizing types of structures that can be assembled from rigid tiles.

Insertions yielding equivalent double occurrence words

*Joint work with D. A. Cruz, N. Jonoska, L. Nabergall and M. Saito*

A double occurrence word (DOW) is a word in which every symbol appears exactly twice; two DOWs are equivalent if one is a symbol-to-symbol image of the other. We consider subwords which appear twice (repeat word) or which appear once along with their reverse (return word). Such subwords generalize square and palindromic factors of DOWs, respectively. Given a DOW *w*, we characterize the structure of *w* which allows two distinct insertions of repeat/return words to yield equivalent DOWs. This characterization depends on the locations of the insertions and on the length of the inserted repeat/return words. When one inserted word is a repeat word and the other is a return word, then both inserted words must be trivial (i.e., have only one symbol). The characterization also gives a method to generate families of such words recursively.

Insertions yielding equivalent double occurrence words

*Joint work with D. A. Cruz, N. Jonoska, L. Nabergall and M. Saito*

A double occurrence word (DOW) is a word in which every symbol appears exactly twice; two DOWs are equivalent if one is a symbol-to-symbol image of the other. We consider the so called repeat pattern (αα) and the return pattern (αα^{R} ), with gaps allowed between the α’s. These patterns generalize square and palindromic factors of DOWs, respectively. We introduce a notion of inserting repeat/return words into DOWs and study how two distinct insertions into the same word can produce equivalent DOWs. Given a DOW *w*, we characterize the structure of *w* which allows two distinct insertions to yield equivalent DOWs. This characterization depends on the locations of the insertions and on the length of the inserted repeat/return words and implies that when one inserted word is a repeat word and the other is a return word, then both words must be trivial (i.e., have only one symbol). The characterization also introduces a method to generate families of words recursively.

Modelli matematici per nanostrutture di DNA

Nuove tecniche di laboratorio utilizzano le proprietà dei filamenti di DNA per costruire nanostrutture, in particolare oggetti geometrici che possono essere modellizzati tramite grafi. In ognuna di queste procedure, è necessario descrivere accuratamente le componenti molecolari da utilizzare. In questa conferenza presenteremo alcune di queste tecniche ed il modello matematico sviluppato per supportare tali applicazioni.

Mathematical models for describing molecular self-assembly

*Joint work with N. Jonoska*

We present several mathematical models for describing molecular building blocks, called rigid tiles, that assemble in larger nanostructures. Rigid tiles can be seen as *k*-arm branch junction structures that join together by annealing to each other through the affinity of their arm-ends. Such a *k*-arm rigid tile is described with *k* vectors joined at the origin that can be translated or rotated during the assembly. Besides the geometric shape of the building blocks, the models can take into account the geometry of the arm-ends joining together. We show distinctions between four models by characterizing types of structures that can be assembled and we outline an algebraic approach to characterize nanostructures built by a set of rigid tiles.

Design strategies for DNA tile assembly

*Joint work with A. Cook, A. Houlihan, R. Rouleau, N. C. Seeman, G. Pangborn and J. Ellis-Monaghan*

New laboratory techniques have been developed using the Watson-Crick complementarity properties of DNA strands to achieve the self-assembly of nanostructures, particularly polyhedra. For all of these methods, an essential step is designing the component molecular building blocks.

We present a mathematical model that captures the geometric constraints of rigid tiles, that are star-shaped molecules used as building blocks for tile-based DNA self-assembly. We illustrate the functionality of the model by providing DNA self-assembly strategies to efficiently construct Platonic and Archimedean 3-regular polyhedral skeletons.

Mathematical models for describing molecular self-assembly

*Joint work with N. Jonoska*

We present several mathematical models for describing molecular building blocks, called rigid tiles, that assemble in larger nanostructures. Rigid tiles can be seen as *k*-arm branch junction structures that join together by annealing to each other through the affinity of their arm-ends. Such a *k*-arm rigid tile is described with a set of *k* vectors joined at the origin that can be translated or rotated during the assembly [1]. Besides the geometric shape of the building blocks, the models can take into account the geometry of the arm-ends joining together. We show distinctions between four models by characterizing types of structures that can be assembled and we outline an algebraic approach to characterize nanostructures built by a set of rigid tiles, as initially proposed in [2]. The research is partially supported by the National Science Foundation with grant CCF-1526485 and by the National Institute of Health with grant R01 GM109459.

[1] N. Jonoska, G.L. McColm. Flexible versus rigid tile assembly, Unconventional Computation, Lecture Notes in Computer Science **4135, **139–151 (2006)

[2] N. Jonoska, G.L. McColm. Describing self-assembly of nanostructures, SOFSEM 2008, Lecture Notes in Computer Science **4910, **66–73 (2008)

Formalism and design strategies for DNA tile assembly

*Joint work with A. Cook, A. Houlihan, R. Rouleau, N. C. Seeman, G. Pangborn and J. Ellis-Monaghan*

New laboratory techniques have been developed using the Watson- Crick complementarity properties of DNA strands to achieve the self-assembly of nanostructures, particularly polyhedra. For all of these methods, an essential step is designing the component molecular building blocks. Combinatorics is a natural tool to address these new design strategy problems.

We present a mathematical model that captures the geometric constraints of rigid tiles, which are the branched junction molecule building blocks for tile-based DNA self-assembly. We also provide DNA self-assembly strategies to efficiently construct Platonic and Archimedean 3-regular polyhedral skeletons using this model.

Minimal edge colorings of class 2 graphs and double graphs

*Joint work with N. Zagaglia Salvi*

A proper edge coloring of a class 2 graph *G* is minimal if it contains a color class of cardinality equal to the resistance* r(G)* of *G*, which is the minimum number of edges that have to be removed from *G* to obtain a graph which is *∆(G)*-edge colorable, where *∆(G)* is the maximum degree of G. In this paper using some properties of minimal edge colorings of a class 2 graph and the notion of reflective edge colorings of the direct product of two graphs, we are able to prove that the double graph of a class 2 graph is of class 1. This result, recently conjectured, is moreover extended to some generalized double graphs.

On the partition graph of a positive integer

*Joint work with N. Zagaglia Salvi*

Let *n* be a positive integer. Denote *V _{n}* the set of all ordered partitions of

The partition graph of

In this paper we investigate some structural properties of this graph. A particular attention is given, in the case of

Abstract

It is known that some sequences of four elements, that are the nitrogenous bases A, C, G, T (namely adenine, cytosine, guanine and thymine) from which DNA is assembled, are repeated in the human genome many times. Because of the structure of DNA, different sequences that are equivalent up to cyclic rotation and reverse complementation, determine the same repeated pattern. Bell et al. [1] were able to derive a formula for the number of equivalence classes of primitive patterns of DNA sequences of length *n* using the bijective correspondence between words on two letters of length *2n* and words of length *n* on four letters. A further derivation of the formula for the classification of DNA sequences of length *n* is provided in [2] by using properties of cycle structures related to fixed points of the hypercube *Q _{n}*.

In this thesis a cyclic binary string of length

The second part of this work deals with techniques that use the Watson-Crick complementarity properties of DNA strands to self-assembly nanostructures, particularly polyhedra. For all of these methods, an essential step is designing the component molecular building blocks; thus there is an increased necessity for a mathematical approach to these design strategy problems.

We provide a mathematical model that captures the geometric constraints of rigid tiles, which are the branched junction molecule building blocks for tile-based DNA self-assembly. We also describe graph theoretical formalism for the DNA origami method. Using the aforementioned techniques we provide DNA self-assembly strategies to efficiently construct Platonic and Archimedean 3-regular polyhedral skeletons.

[1] G. I. Bell, R. L. Bivins, J. D. Louck , N. Metropolis, M. L. Stein, Properties of words on four letters from those on two letters with an application to DNA sequences. Advances in Applied Mathematics **14**(3), 348–367 (1993)

[2] W. Y. C. Chen, J. D. Louck, Necklaces, MSS sequences, and DNA sequences. Advances in Applied Mathematics **18**(1), 18–32 (1997)

Abstract

È noto che alcune sequenze di quattro elementi, che sono le basi azotate A, C, G, T (cioè adenina, citosina, guanina e timina) da cui è formato il DNA, si ripetono nel genoma umano molte volte. A causa della struttura del DNA, differenti sequenze che sono equivalenti a meno di rotazione ciclica e complementazione inversa, determinano lo stesso schema ripetuto nella sequenza del DNA. Bell et al. [1] determinano una formula per il numero di classi di equivalenza di schemi primitivi di sequenze di DNA di lunghezza *n* utilizzando la corrispondenza biunivoca tra le parole su due lettere di lunghezza *2n* e le parole di lunghezza *n* in quattro lettere. Un’ulteriore derivazione della formula di classificazione delle sequenze di DNA di lunghezza *n* è presentata in [2] dove si utilizzano proprietà di strutture cicliche relative ai punti fissi dell’ipercubo *Q _{n}* .

In questa tesi una stringa binaria ciclica di lunghezza

Nella seconda parte di questo lavoro vengono analizzate alcune tecniche che utilizzano le proprietà di complementarità dei filamenti di DNA per costruire nanostrutture, in particolare poliedri. Per tutti questi metodi, un passo essenziale è la descrizione accurata delle componenti molecolari che costituiranno i blocchi da utilizzare durante il processo di assemblaggio; vi è quindi una crescente necessità di approcci matematici a questi problemi di progettazione.

In questa tesi forniamo un modello matematico in grado di catturare i vincoli geometrici dei tasselli rigidi (“rigid tile”), i quali rappresentano “branched junction molecules”. Descriviamo anche un formalismo teorico per il metodo DNA-origami. Utilizzando queste tecniche mostriamo come costruire in modo efficiente i solidi platonici ed archimedei 3-regolari.

[1] G.I. Bell, R.L. Bivins, J.D. Louck , N. Metropolis, M.L. Stein, Properties of words on four letters from those on two letters with an application to DNA sequences. Advances in Applied Mathematics **14**(3), 348–367 (1993)

[2] W. Y. C. Chen, J. D. Louck, Necklaces, MSS sequences, and DNA sequences. Advances in Applied Mathematics **18**(1), 18–32 (1997)

Abstract

Tra gli argomenti considerati ormai classici nell’ambito della teoria dei grafi rientrano senz’altro i problemi di colorazione e i problemi di decomposizione.

Per quanto riguarda i primi si parla addirittura di teoria cromatica dei grafi, distinguendo in genere se si considerano colorazioni sui vertici o sugli spigoli. Nei problemi di decomposizione si studia la possibilità di spezzare un grafo assegnato in “pezzi” dotati di particolari proprietà. Di fatto si può pensare che le due tipologie di problemi siano facce diverse della stessa medaglia. In altre parole per una assegnata colorazione di un grafo si può andare ad esaminare la struttura costituita dai sottografi determinati dagli oggetti (vertici o spigoli) che hanno ricevuto lo stesso colore. Viceversa, data una decomposizione di un grafo, si può pensare che i sottografi della decomposizione in esame siano oggetti di colore assegnato in una colorazione di qualche tipo (sui vertici o sugli spigoli).

Per quanto riguarda la colorazione sugli spigoli, un risultato fondamentale si deve a Vizing [4] il quale dimostrò una relazione che lega il numero di colori necessari per determinare una colorazione con la valenza massima dei vertici del grafo in esame. Questo teorema consente di suddividere i grafi in due categorie: i grafi di Classe 1 e di Classe 2. I primi sono tali per cui il numero di colori necessari è pari alla valenza massima dei vertici, i secondi invece necessitano di un colore ulteriore rispetto al valore del grado massimo dei vertici del grafo.

Di notevole importanza è il concetto di 1-fattorizzazione, con il quale si intende la possibilità di decomporre il grafo dato in sottografi ricoprenti regolari di grado 1 tali che, a due a due, siano privi di archi comuni e la loro unione sia il grafo in esame. Il legame tra 1-fattorizzazione e colorazione sugli spigoli emerge in particolare quando si analizzano i grafi regolari; infatti per questi ultimi essere di Classe 1 equivale ad essere 1-fattorizzabili. Affinchè un grafo sia 1-fattorizzabile è necessario, come verrà meglio spiegato nel Cap. 3, che sia regolare e abbia un numero pari di vertici.

Non tutti i grafi regolari però risultano 1-fattorizzabili, come si vede analizzando, ad esempio, il grafo di Petersen. Nasce quindi il problema di determinare una condizione sufficiente che indichi se il grafo è 1-fattorizzabile (e quindi di Classe 1). In particolare è stata sviluppata l’idea che se un grafo regolare ha valenza “abbastanza” alta, allora è 1-fattorizzabile.

Qual’è un valore adeguato per la soglia minima della valenza? È in questo contesto che viene formulata la cosiddetta “congettura della 1-fattorizzazione”, che a tutt’oggi non ha ancora un controesempio:

se *G* è un grafo regolare di grado *d* su *2n* vertici tale che

• *d ≥ n – 1*, se *n* è pari

• *d ≥ n,*, se *n* è dispari

allora *G* è 1-fattorizzabile. Nel tempo si sono dimostrati vari risultati che tendono ad abbassare la soglia, avvicinandosi sempre più a quella della congettura. In questa tesi verranno riportati i risultati relativi ai grafi regolari di grado *d = 2n – 1, d = 2n – 2, d = 2n – 3, d = 2n – 4* e *d = 2n – 5* presenti in [3] e in [1].

Verranno presentati anche teoremi dovuti a Chetwynd ed Hilton nei quali la congettura in esame è stata dimostrata per *d* ≥ 6/7 |V (G)|∼0.857 |V (G)| in [1] e d ≥ 1/2(√7 – 1)|V (G)|∼ 0.823 |V (G)| in [2]. Infine verranno elencati alcuni risultati recenti che migliorano la soglia minima che deve soddisfare la valenza del grafo affinchè quest’ultimo sia 1-fattorizzabile.

[1] A. G. Chetwynd, A. J. W. Hilton, Regular graphs of high degree are 1-fatorizable. Proc. London Math. Soc. **3**(2), 193-206 (1985)

[2] A. G. Chetwynd, A. J. W. Hilton, 1-factorizing regular graphs of high degree-an improved buond. Discrete Mathematics **75**, 103-112 (1989)

[3] W. D. Wallis, One-Factorizations (Kluwer Academic Publishers, Dordrecht, 1997)

[4] V. G. Vizing, On an estimate of the chromatic class of *p*-graph [Russian]. Discretyni Analiz **3**, 25-30 (1964)

Abstract

This work deals with the concept of Cayley graphs and related properties. A Cayley graph is a graph whose vertex set is a group *G* and two vertices *u* and *v* are adjacent if and only if *uv ^{−}*

The first one is a special class of these graphs: complete rotational Cayley graphs; which are Cayley graphs admitting a particular automorphism ω such that, for some ordering of

The second topic deals with what is generally known as the maximum clique problem, which asks for a clique of maximum cardinality; where a clique is a subset of the vertex set such that the induced subgraph is complete. We have distinguished between maximal clique (i.e. a clique that is not a proper subset of any other clique) and maximum clique (i.e. a maximal clique that has the maximum cardinality) and we have analyzed some results that construct a maximal clique in the Cayley graph of square order. Moreover, we have studied some bounds for the clique number (the size of the maximum clique in the graph), the independent number (the cardinality of the largest coclique in the graph) and the chromatic number (the minimum number of colors required to obtain a proper coloring of the graph) obtained with the eigenvalues of the graph (i.e. the eigenvalues of the adjacency matrix of the graph).

This is the reason why we have introduced, as the last main topic, the spectrum of Cayley graphs. In order to obtain information about the spectrum of a Cayley graph, we have to investigate the structure of the group

In the appendices there are definitions and main properties about presentation of finite groups and representation theory, two subjects that are foundamental for studying the concepts presented in this work.

Discrete models for understanding mechanisms of RNA-mediated DNA break repair

*Joint work with Y. Jeon, N. Jonoska and F. Storici*

Precise DNA repair is required to maintain genome integrity. Particularly, repair of DNA double-strand breaks (DSBs) needs to be efficient and accurate not only to avoid mutations in DNA, but also to prevent gross chromosome rearrangements. RNA had been considered as an intermediate molecule between DNA and proteins, and a variety of new functions of RNA have been uncovered in the last decades. Recently, we found that RNA can be used as a template for DNA DSB repair directly in yeast. We are developing and optimizing a system to study RNA-mediated DNA repair in mammalian cells. Our biology systems are built on plasmid DNA molecules and are combined with the CRISPR/Cas tools of genome editing to induce DSBs efficiently and on specific locations of the plasmid DNAs. This experimental design makes our system modular. We show sequencing results of RNA-mediated DNA repair after DSB, and that transcript RNA may help to recover functional properties of a gene after a break. We use graph-theoretic properties to study the sequence variants that appear during the DNA repair process. Relationships among sequences are displayed in a graph where vertices are sequences and an edge between two vertices exists if the two sequences differ by one nucleotide mutation. By studying these graphs, we can better understand the molecular process that play a role during DNA break repair.

Request thesis

Richiedi tesi