Team:Heidelberg/Project Software
From 2013.igem.org
Revision as of 15:47, 3 October 2013
NRPSDesigner. Design your own NRP.
Highlights
- Transfer of the whole delftibactin NRPS pathway from D. acidovorans into E. coli
- Novel approach for transferring a whole NRPS pathway more than 50 kb in size from one bacterial species into another
- Optimization of the Gibson Cloning Strategy for the creation of large plasmids (over 30 kb in size) with high GC content
- Precipitation of pure gold from electronic waste using delftibactin
Abstract
Non-ribosomal peptide (NRP) synthesis is a biochemical process of remarkable hierarchical organization. Vertically, it can be described stepwise, starting from the coding DNA sequence that is expressed as a giant enzyme, which in turn catalyzes the actual NRP assembly. Horizontally, its complexity arises from a modular order of functional protein units. Given this systematic composition, a bioinformatic approach appears most suitable for the automated design of fully synthetic NRPs.
Here, we introduce a comprehensive software tool, the NRPSDesigner, which facilitates the prediction and synthesis of non-ribosomal peptide synthetases (NRPS) that catalyze customized NRP assembly. The predictive power of the NRPSDesigner is based on a curated database storing information on about 200 NRPS modules, their DNA coding sequences and substrate specificities. It is used to calculate the optimal domain sequence according to the weighted phylogenetic distance between domain origins. Additionally, an integrated domain recognition algorithm allows for curated expansion of the database. To accelerate the process from in silico NRPS design towards experimental validation, we embedded Gibthon, the iGEM software tool of Cambridge 2010, for Gibson primer construction. With this framework we want to suggest a new standard for the fast and accurate computer-aided design of customized short peptides.
Introduction
Even though biological processes can be characterized by their physicochemical properties, they can also be translated into an abstract model of interconnected functional entities. This is exemplified by the central dogma of biology, which describes the information flow from DNA to RNA and finally to proteins. Synthetic biology has always tried to intervene at these levels of organization with the goal of systematically controlling the projected outcome. (A reference would be good for the paragraph above as well…)
NRPS carry this principle to extremes by adding yet another hierarchical level:
The modular protein complex sequentially synthesizes its own short non-ribosomal peptide (NRP). These peptides in turn are not limited to the standard set of proteinogenic amino acids; instead, D-isoforms and diverse modifications can be utilized (reference). Nature has made great use of this system by creating versatile natural products such as antibiotics, metallophores or dyes. (reference)
Although little is known about the actual dynamic properties of the synthesis process, most of its logical rules are understood: An NRP synthetase consists of a number of modules, each of which is responsible for adding one amino acid to the nascent peptide. A module can in turn be subdivided into domains, each with a distinct function.
This hierarchical organization demonstrates the large potential for the synthetic biology community: The exchange or combination of modules and domains from different organisms or different proteins has been repeatedly shown to produce fully functional NRPS (reference). Not surprisingly, several bioinformatic approaches have put great effort into meticulously categorizing NRPS and their functionality. For example, databases such as NRPS-PKS and Clustermine360 describe the domain organization of diverse NRPS, while the Norine database includes information about non-ribosomal peptides and their sequence of monomers. Also, many tools are capable of predicting domain sequences, substrate specificity and hence the putative product of a particular NRPS (references). For example, NRPS-PKS and the PKS-NRPS analysis tool (hereafter referred to as the Maryland tool) work in this direction. While antiSMASH provides similar prediction capabilities, its scope is broader and covers many different secondary metabolite pathways.
However, as the understanding of the underlying biological processes and methods for assembly of diverse DNA constructs has improved, many novel software tools aim at the computer aided design (CAD) of DNA sequences (reference). Such tools have been particularly valuable to the iGEM community, as they stimulate the design of more complicated, yet less error-prone biological devices. Two examples originating from the iGEM community are: Clotho, a framework enabling the automated and computer-assisted design of synthetic biology constructs introduced by the Berkeley iGEM team of 2008 (link) and Gibthon created as a web app by the Cambridge iGEM Team of 2010, which suggests primers in order to assemble a set of predefined DNA fragments using Gibson cloning.
Influenced by this development, we introduce here the NRPSDesigner, an integrated CAD software implemented to facilitate the design of customized synthetic NRPs. In particular, the NRPSDesigner includes the following features: Based on the NRPS-PKS database, we built a manually curated database that captures the biological complexity of NRPS and stores information on about 200 NRPS modules, their coding sequences and substrate specificities. The database can be easily extended with curated content using automated domain prediction based on Hidden Markov Models. Using this information, the NRPSDesigner can calculate an optimal sequence of domains based on simple evolutionary assumptions. To accelerate the process of testing this synthetic construct and eventually producing a customized peptide, we added further assisting software to the framework. We offer to incorporate the domains necessary for combining the nascent peptide with an Indigoidine tag (link). Furthermore, the embedded Gibthon software automates the suggestion of primers necessary for the assembly of the predicted domains by Gibson cloning.
Results
NRPSDesigner database structure
The NRPSDesigner is a knowledge-based software tool that uses stored information about NRPS pathways to predict the optimal domain sequence able to produce a user-defined NRP. Because the designer depends on a comprehensive description of the biological and biochemical properties of NRPSs, the organization of this storage is of great importance. For this purpose we built a hierarchical database that comprises three layers of complexity (see Figure XX): i) the DNA level, represented by all DNA coding sequences; ii) the NRPS domains encoded by these sequences, to which they are directly linked; and iii) detailed information about the substrate of the corresponding domain and its potential modification.
Beyond the tight links between these layers, each layer also points to additional database entries that complete the information needed by the design algorithm. For example, a DNA coding sequence is linked not only to its product, the translated domain, but also to its origin (organism, plasmid etc.). Additionally, a coding sequence can be connected to another coding sequence. This ‘parent’ sequence is the predecessor of a stored sequence that has already undergone biosynthetic modification. On the domain level there is an upstream link to the coding sequence and also a link to the specific type of domain (e.g. thioesterase or condensation domain). Some domains, such as adenylation domains, also point to monomers, based on their substrate specificity. For these substrates we store their chirality, their modification and whether they are proteinogenic or not (e.g. glutamine and ornithine in Figure ?). To enable the NRPSDesigner to use information from outside the database, entries are equipped with global identifiers: for organisms we store the NCBI taxon ID, and for BioBricks the unique identifier in the Parts Registry. To integrate the content with other databases, we created for every layer a linkout entry that consists of a type and a specific identifier. The linkout type includes a description of the corresponding resource as well as a URL, which in combination with the specific identifier enables the cross-linking of each database entry to other resources. The most common linkout types are Norine and PubChem IDs for the substrates, Pfam IDs for the domain types and GenBank identifiers for the coding sequences.
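The three layers and their links can be sketched as plain Python data classes. This is only an illustration of the relationships described above, not the actual NRPSDesigner schema: all class and field names, the URL template and the example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Linkout:
    """Cross-reference to an external resource: linkout type plus identifier."""
    resource: str        # e.g. "PubChem", "Pfam", "GenBank"
    url_template: str    # hypothetical template containing an {id} placeholder
    identifier: str

    def url(self) -> str:
        return self.url_template.format(id=self.identifier)


@dataclass
class Substrate:
    """Layer iii: monomer recognised by a domain."""
    name: str
    chirality: str                      # "L" or "D"
    proteinogenic: bool
    modification: Optional[str] = None
    linkouts: List[Linkout] = field(default_factory=list)


@dataclass
class CodingSequence:
    """Layer i: DNA coding sequence with its origin and optional parent."""
    dna: str
    origin_taxon_id: int                        # NCBI taxon id of the source organism
    parent: Optional["CodingSequence"] = None   # predecessor before modification
    linkouts: List[Linkout] = field(default_factory=list)


@dataclass
class Domain:
    """Layer ii: NRPS domain encoded by a coding sequence."""
    domain_type: str                    # e.g. "adenylation", "thioesterase"
    cds: CodingSequence
    substrates: List[Substrate] = field(default_factory=list)
    linkouts: List[Linkout] = field(default_factory=list)


# Placeholder example: 562 is the NCBI taxon id of Escherichia coli,
# the DNA string is a dummy sequence.
ornithine = Substrate("ornithine", chirality="L", proteinogenic=False)
cds = CodingSequence(dna="ATGGCT", origin_taxon_id=562)
a_domain = Domain("adenylation", cds, substrates=[ornithine])
```

Keeping the linkout as a (type, identifier) pair means a new external resource only needs a new URL template, not a change to the domain or substrate records.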
For visualization, we added a JSON representation of each domain, based on the Pfam Graphics library (link), and store the chemical structure of each substrate in the SDF (structure data file) format, which is rendered by Open Babel.
NRPSDesigner database content
The information currently stored in the NRPS database was mainly retrieved from already published data (Link) and extended or changed according to our own experimental results. Because the NRPS-PKS library already provides the domain organization and the related substrate specificities, we mainly filled our database with information from this library and added the missing coding sequences. Accordingly, the positions of domain boundaries and linkers had to be converted from their protein-specific coordinates to DNA coordinates. In some cases manual curation was necessary, because for example…: In table XX all changes to the original NRPS-PKS are listed and explained. In collaboration with the iGEM team Edinburgh the database was extended by NRPS sequences the team has worked with during their project… In total our database contains curated information about…
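The conversion from protein to DNA coordinates mentioned above is simple arithmetic. A minimal sketch, assuming 1-based inclusive residue positions, an uninterrupted forward-strand coding sequence and an optional offset of the coding sequence within a larger record; the function name and the coordinate conventions are our assumptions, not taken from the NRPS-PKS library:

```python
def protein_to_dna_coords(aa_start: int, aa_end: int, cds_offset: int = 0):
    """Map a 1-based inclusive residue range to a 0-based half-open DNA range.

    cds_offset is the 0-based position of the first coding nucleotide
    within the surrounding DNA record.
    """
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_end")
    dna_start = cds_offset + 3 * (aa_start - 1)   # first base of first codon
    dna_end = cds_offset + 3 * aa_end             # one past last base of last codon
    return dna_start, dna_end


# Residues 2-3 of a three-codon sequence correspond to the last two codons.
dna = "ATGGCTAAA"
start, end = protein_to_dna_coords(2, 3)
domain_dna = dna[start:end]
```

The half-open DNA range can be used directly for Python slicing, which avoids off-by-one errors at domain and linker boundaries.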
Extension and validation of the database
One of the core requirements for our software is the ability to detect NRPS domains. This feature enables the automatic and standardized definition of domain boundaries and also facilitates the addition of new entries into the NRPSDesigner database by the community.
Based on the antiSMASH program we established our own pipeline to maximize domain recognition specificity. antiSMASH is implemented in (Bio)Python and uses HMMER3, and could thus be easily integrated into our framework. To improve domain recognition, especially for the Adenylation/Oxidation/Adenylation (AOxA) domain of IndC, the Adenylation domain HMM of antiSMASH (from Pfam, id PF00501.21) was replaced by an HMM from the Maryland tool, constructed from the seed alignment for the A domain (special thanks to Prof. Jacques Ravel for providing us with the seed alignment). In addition, an HMM profile of the AOxA domain appearing in diverse indigoidine synthetases was constructed and added to the pipeline. Furthermore, Thiolation domains have been split into two categories, depending on whether an epimerisation domain follows or not, as experimental evidence has shown that their functionality differs (link).
To make the NRPSDesigner database easy to extend, this domain recognition pipeline was integrated into a user-friendly interface. The user can add a description for each domain, change the specificity of an adenylation domain if it has not been predicted correctly, and define custom domain boundaries with the help of an integrated multiple sequence alignment against the domains of the same type already present in the database.
Guidance for cloning of NRPS constructs with Gibthon
Beyond offering in silico predicted sequences of NRPS domains that produce a particular NRP, the NRPSDesigner implements a cloning procedure based on Gibson assembly. This should make NRPS more accessible to the synthetic biology community. One of the most popular tools for computer-aided primer design is Gibthon. Created as a web app by Bill Collins of the Cambridge iGEM Team of 2010, Gibthon suggests primers for the assembly of predefined DNA fragments (entered by the user or imported from the Parts Registry) using Gibson cloning. Since Gibthon was such a successful iGEM software project and is also written in Django, we chose it as our tool for automated primer design.
Gibthon was integrated into the core NRPSDesigner GUI using a modular interface. Particular care was taken to keep Gibthon and the rest of the NRPSDesigner clearly separated, so that other primer design software such as J5 can be used in the future. The integration strategy is the following: For each of the domains returned by the in silico prediction, the DNA sequence is extracted from the NRPSDesigner database. The resulting sequences are returned in a ring structure, which ensures a minimal number of Gibson fragments to be assembled. These sequences, together with metadata such as references and descriptions, are converted to the Gibthon database format and then copied into the Gibthon gene fragment table. The user then has access to the standard Gibthon interface to get an overview of the suggested primers. Like Gibthon, the NRPSDesigner is tightly linked with the Registry of Standard Biological Parts. Users can add parts of their choice using the automated Parts Registry import tool. However, some additional restrictions have been placed in order to ensure the integrity of the designed NRPS sequence: The user cannot insert a new fragment or part between two NRPS domains; it can only be placed after the Thioesterase and before the initiation adenylation domain.
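The insertion restriction on the circular construct can be illustrated with a small check. This is a sketch of the rule as stated above, not code from Gibthon or the NRPSDesigner; it assumes uniquely named fragments and a ring listed starting at the initiation adenylation domain, and all names are hypothetical.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    name: str              # assumed to be unique within the construct
    is_nrps_domain: bool


def can_insert_at(ring: List[Fragment], index: int) -> bool:
    """Return True if a user fragment may be inserted before ring[index].

    The construct is circular, so index 0 and index len(ring) both denote
    the junction between the last and the first fragment.
    """
    if not ring:
        return True
    left = ring[(index - 1) % len(ring)]
    right = ring[index % len(ring)]
    if not (left.is_nrps_domain and right.is_nrps_domain):
        return True   # at least one neighbour is a user-defined part
    # Both neighbours are NRPS domains: only the wrap-around junction
    # (last domain, e.g. thioesterase -> initiation domain) is allowed.
    nrps = [f for f in ring if f.is_nrps_domain]
    return left == nrps[-1] and right == nrps[0]


ring = [
    Fragment("A1", True),
    Fragment("T1", True),
    Fragment("TE", True),
    Fragment("promoter", False),
]
allowed_positions = [i for i in range(len(ring) + 1) if can_insert_at(ring, i)]
```

In this example only the junctions inside the A1-T1-TE block are rejected; every position touching the user-defined promoter part remains available.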
In silico design of the optimal NRPS sequence
The quickly growing number of known NRPS pathways increases the pool of modules and domains catalysing the incorporation of a specific peptide monomer. Especially on the domain level, there is often more than one NRPS protein specific for the same substrate or catalytic reaction. To determine the optimal selection, we assume that a combination of domains exists which provides the best possible performance when combined into an NRPS, and that this performance can be objectively quantified relative to other domain combinations. Finding the optimal domain combination then amounts to determining the shortest path from a start to an end point on a graph-based representation of the problem. The NRPSDesigner constructs a directed acyclic graph (DAG) representing all possible domain combinations capable of synthesizing the desired peptide. The DAG is organized in layers, corresponding to positions in the final NRPS. Every layer comprises a discrete number of nodes representing NRPS domains, and every node of a layer is connected to every node of the next layer. No edges exist within a layer or to previous layers. The first and last nodes, being the start and end points of the shortest path problem, are dummy nodes without an associated domain; edges pointing to or from them have a weight of zero.
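The layered graph described above can be sketched in a few lines of Python. The actual implementation is written in C++; this is only an illustration of the structure, with hypothetical domain identifiers that are assumed to be unique across layers.

```python
from typing import Dict, List

START, END = "START", "END"   # dummy nodes without an associated domain


def build_layered_dag(layers: List[List[str]]) -> Dict[str, List[str]]:
    """Build adjacency lists for a layered DAG.

    layers[i] holds the candidate domains for position i of the NRPS.
    Every node is connected to every node of the following layer only.
    """
    graph: Dict[str, List[str]] = {START: [], END: []}
    previous = [START]
    for layer in layers:
        for node in layer:
            graph[node] = []
        for src in previous:
            graph[src].extend(layer)   # full connection to the next layer
        previous = layer
    for src in previous:
        graph[src].append(END)
    return graph


# Two candidate adenylation domains, one thiolation domain, two thioesterases.
dag = build_layered_dag([["A1", "A2"], ["T1"], ["TE1", "TE2"]])
```

Because edges only point from one layer to the next, the graph is acyclic by construction, and each start-to-end path picks exactly one domain per position.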
So far, very little is known about the potential functionality of synthetic NRPS including possible assembly rules for modules and domains. Therefore we assume a simple scoring function for the edge weights between the user-selected domains, where combinations of domains from closely related organisms have a higher probability to form a functional NRPS than domain combinations involving organisms which are far apart on the taxonomic tree. The taxonomic distance is currently determined from lineage information available in the NCBI taxonomy database [Pmid22139910], but could be easily extended to include sequence-based information[Pmid23778980] or other, possibly better fitting, criteria as soon as more experimental evidence becomes available. In addition, domains which are directly consecutive in their native organism are preferred over those from different modules, if possible. The well known shortest path algorithm by E.W. Dijkstra[PAM doi:10.1007/BF01386390] is then employed to find the optimal domain combination, using the dummy nodes as start and stop nodes.
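The scoring and path search can be illustrated as follows. This is a simplified Python sketch and not the C++ implementation: the distance counts the ranks separating two lineages below their last common ancestor, which is one plausible reading of "taxonomic distance", and the preference for natively consecutive domains is omitted. The function names and the truncated toy lineages are assumptions.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def lineage_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the ranks separating two lineages below their common ancestor."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def best_path(layers: List[List[str]],
              lineages: Dict[str, Sequence[str]]) -> Tuple[int, List[str]]:
    """Dijkstra's algorithm on the layered DAG; nodes are (layer, domain id)."""
    start, end = (-1, ""), (len(layers), "")   # dummy start and stop nodes

    def successors(node):
        idx = node[0] + 1
        return [end] if idx == len(layers) else [(idx, d) for d in layers[idx]]

    def weight(u, v):
        if u == start or v == end:
            return 0                            # dummy edges cost nothing
        return lineage_distance(lineages[u[1]], lineages[v[1]])

    dist, prev, heap = {start: 0}, {}, [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == end:
            break
        if d > dist[u]:
            continue                            # stale heap entry
        for v in successors(u):
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if end not in dist:
        raise ValueError("no domain combination available")
    path, node = [], prev[end]
    while node != start:
        path.append(node[1])
        node = prev[node]
    return dist[end], path[::-1]


delftia = ["Bacteria", "Proteobacteria", "Delftia"]
bacillus = ["Bacteria", "Firmicutes", "Bacillus"]
lineages = {"A_x": delftia, "T_x": delftia,
            "A_y": bacillus, "T_y": bacillus, "TE_y": bacillus}
score, domains = best_path([["A_x", "A_y"], ["T_x", "T_y"], ["TE_y"]], lineages)
```

In this toy example the path through the three domains sharing one lineage has total weight zero and is therefore selected over any path that mixes the two organisms.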
Implementation
Following the example of major bioinformatics applications like BLAST[Pmid23609542], the NRPSDesigner was split into a command-line executable comprising the core algorithm, and a web interface providing easy access to the functionality enriched with some extra features. This design enables potential users to set up their own NRPSDesigner instance according to their needs using the compiled executable and a database dump.
The command-line executable
We chose C++ as the implementation language for the command-line executable to avoid dependencies on run-time interpreters and to maximise speed. Additionally, the use of C++ features like smart pointers, lambda functions, or hash tables from the new C++11 standard allows for terser, faster, less error-prone code. To facilitate re-use of the code in other contexts, all of the functionality is contained within a (currently static) library, with the executable being a thin wrapper around it. In the future, even tighter integration with the web interface can be imagined. In this case a Python wrapper would be implemented around the library using a software package like SIP or SWIG.
In case the Parts Registry or some other publicly accessible NRPS database offers the necessary functionality at some point in the future, we decoupled the algorithm from the database and allowed for multiple database backends. We designed an abstract base class defining the interface required by other parts of our code, and a static factory function responsible for constructing an object of the appropriate backend class. The factory employs the singleton pattern, which allows only one instance of the database backend to exist within the entire program. Currently, only one database backend exists, which retrieves data from our MySQL database using the MySQL C++ connector library. Custom exception classes complete the abstraction: all SQL exceptions are caught within the MySQL backend, and a DatabaseError containing information about the specific error is thrown instead.
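The backend abstraction can be sketched in Python for illustration; the real code is C++ and talks to MySQL. Apart from DatabaseError, which is named above, every class, method and value below is hypothetical, and an in-memory table stands in for the MySQL backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DatabaseError(Exception):
    """Backend-independent error raised instead of driver-specific exceptions."""


class DatabaseBackend(ABC):
    """Interface the rest of the program relies on."""

    @abstractmethod
    def domains_for_substrate(self, substrate: str) -> List[str]:
        ...


class InMemoryBackend(DatabaseBackend):
    """Stand-in for the MySQL backend, backed by a plain dictionary."""

    def __init__(self, table: Dict[str, List[str]]):
        self._table = table

    def domains_for_substrate(self, substrate: str) -> List[str]:
        try:
            return list(self._table[substrate])
        except KeyError as exc:
            # Translate the storage-specific error into the common type.
            raise DatabaseError(f"no domain for substrate {substrate!r}") from exc


_instance: Optional[DatabaseBackend] = None


def get_backend() -> DatabaseBackend:
    """Factory with singleton behaviour: build the backend once, then reuse it."""
    global _instance
    if _instance is None:
        _instance = InMemoryBackend({"Gln": ["A_gln_1"]})
    return _instance
```

Callers only ever see DatabaseBackend and DatabaseError, so swapping the storage layer does not touch the algorithm code.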
Taxonomy information is retrieved on-demand from the NCBI server via the NCBI XML API[Entrez xml api] using the CURL network library and further parsed using the LibXML library. In case of network failure, the NRPSDesigner can fall back to a local copy of the taxonomy database, which can be retrieved from the NCBI FTP server in advance. In order to reduce network traffic and memory consumption when the final graph is built, as little data as possible is loaded in advance. The missing taxonomy information is retrieved lazily in parallel to the shortest path computation and cached afterwards. DNA sequences and metadata not required for the pathway construction are fetched after the pathway has been constructed only for the domains the pathway is comprised of. The final pathway is returned in form of an XML file, built with the LibXML library, containing information on the domains and their origin.
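The lazy retrieval with caching and local fallback could look roughly like this. It is a Python sketch of the behaviour described above, not the C++ code: the remote fetch is passed in as a function so that no real NCBI request is made, and all names are assumptions.

```python
from typing import Callable, Dict, List


class TaxonomyCache:
    """Fetch lineages on demand, cache them, and fall back to a local copy."""

    def __init__(self,
                 fetch_remote: Callable[[int], List[str]],
                 local_copy: Dict[int, List[str]]):
        self._fetch_remote = fetch_remote   # e.g. a wrapper around a web request
        self._local = local_copy            # pre-downloaded taxonomy data
        self._cache: Dict[int, List[str]] = {}

    def lineage(self, taxon_id: int) -> List[str]:
        if taxon_id in self._cache:
            return self._cache[taxon_id]    # already retrieved earlier
        try:
            result = self._fetch_remote(taxon_id)
        except OSError:
            # Network failure: use the local copy instead.
            result = self._local[taxon_id]
        self._cache[taxon_id] = result
        return result
```

Only taxa that actually occur on an evaluated edge are ever requested, which keeps both network traffic and memory use low.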
For command-line option parsing, the Boost.program_options library was chosen, as it promotes modularity by allowing every component to specify its own set of options. This enables switching between different database backends or changing the existing one without the rest of the program knowing about it. Although this approach introduces an extra dependency to the NRPSDesigner library itself and not just the executable, Boost.program_options allows querying of options, their associated data type and other metadata. Therefore e.g. a graphical user interface written in C++ could extract this information and dynamically build the required input fields, without knowing anything about the underlying algorithm or backends.
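The idea of letting each component declare its own options, including metadata that a GUI could query, can be mimicked in Python with argparse. This is an analogy to the Boost.program_options design rather than a port of it; the option names and defaults are invented for this sketch.

```python
import argparse
from typing import Callable, List, NamedTuple


class Option(NamedTuple):
    flag: str
    type: type
    default: object
    help: str


def database_options() -> List[Option]:
    """Options owned by the (hypothetical) database backend component."""
    return [
        Option("--db-host", str, "localhost", "database server host"),
        Option("--db-port", int, 3306, "database server port"),
    ]


def search_options() -> List[Option]:
    """Options owned by the (hypothetical) pathway search component."""
    return [Option("--monomers", str, "", "comma-separated monomer list")]


def build_parser(components: List[Callable[[], List[Option]]]) -> argparse.ArgumentParser:
    """Assemble one parser from the options every component declares."""
    parser = argparse.ArgumentParser(prog="nrps-sketch")
    for component in components:
        for opt in component():
            parser.add_argument(opt.flag, type=opt.type,
                                default=opt.default, help=opt.help)
    return parser


parser = build_parser([database_options, search_options])
args = parser.parse_args(["--db-port", "3307", "--monomers", "Gln,Orn"])
# A GUI could build input fields from the same declarations:
fields = [(o.flag, o.type.__name__) for o in database_options()]
```

The main program never names an individual option, so replacing the database component changes the accepted flags without any edits elsewhere.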
The web interface
In order to allow tight integration with the Gibthon software package of iGEM 2010, we chose the Django framework, written in Python, as the back-end of the web interface. The front-end was implemented using the Bootstrap 2, jQuery and jQueryUI frameworks. The interface allows the user to select the desired peptide monomers and specify their configuration/modification. Subsequently, a 2D structure preview of the resulting molecule is generated on-the-fly using the OpenBabel library[Pmid21982300] via its Python bindings. The structures of the individual amino acids are retrieved from the database; a SMARTS pattern search locates the NH2-CR-COOH group in each, and peptide bonds are constructed between consecutive monomers. The resulting peptide structure is projected to 2D coordinates, exported as SVG and then sent directly to the browser. To reduce server load, structures are cached using Django's view caching functionality.
Heavy computations like pathway construction or domain prediction are performed asynchronously outside of the web server to counter long computation times and request timeouts. The Celery Python package combined with RabbitMQ is used for managing and scheduling the tasks. Additionally, log messages are collected and retrieved by the browser via AJAX polling. After the pathway has been constructed, a visualization of the resulting NRPS is generated using the Pfam graphics JavaScript library, and a Gibson construct is created. To this end, the Gibthon software package was integrated and slightly modified, such that no fragments can be inserted between consecutive NRPS domains, automatically generated fragments are hidden from the user, and user-defined fragments are sorted by their type. Additionally, a bug that prevented the import of BioBricks from the Parts Registry was fixed.
Users are able to enter new domains and pathways into the database by providing the coding sequence of their NRPS. After simple validity checks using the BioPython package, a modified version of the antiSMASH pipeline[Pmid23737449], optimized for NRPS domain prediction, is run on the sequence to determine domain types, their positions, and substrate specificity if there is any. Results of the prediction are presented to the user, who is able to manually verify their accuracy and correct possible mistakes. To further improve domain linker definition, a multiple sequence alignment of a domain against all other domains of the same type in the database can be performed using the Clustal Omega software package[Pmid21988835]. The alignment is presented in an interactive dialog via JavaScript extracted from the Kalignvu software package, where the user can inspect the conserved residues and set the domain linker positions appropriately.
Discussion
NRPSDesigner database compared to other NRPS databases
While exploring the most commonly used databases for NRPS (Norine, NRPS-PKS, Clustermine360), we realized that most tools were not designed for the needs of the biosynthetic community. The organization of a gene cluster, the relative arrangement of individual genes, possible native promoters or other genomic elements, as available in Clustermine360, can be neglected for the purpose of NRPS design. In contrast, the approach of NRPS-PKS, which includes only protein sequences, would not be sufficient, as synthetic modification starts at the DNA level. Consequently, we decided to establish our own database focusing mainly on DNA coding sequences, domain types and substrate specificities. Additional information needed for the customization of the final construct (regulatory promoters, restriction sites etc.) can be included directly via an interface with the Registry of Standard Biological Parts.
To maintain a high curation standard we implemented a semi-automated domain recognition algorithm for the NRPSDesigner framework based on Hidden Markov Models (HMM). Again, we chose to adapt already existing approaches in agreement with our own experimental results (see section). Accordingly, we added an HMM specific for Adenylation/Oxidation/Adenylation domains, making our tool the only one available that can detect this combined domain commonly found in Indigoidine synthetases. Another HMM, which distinguishes thiolation domains that are followed by an epimerisation domain from those that are not, further extends the predictive power of the NRPSDesigner.
Despite the advantages of automated domain recognition we strongly emphasize the importance of manual curation with the help of peer-reviewed sources. In this way the user is allowed to change database entries based on experimental evidence.
Scoring Function
Initially, the internal algorithm calculating the optimal sequence of NRPS domains was based on the simple assumption that phylogenetically close modules also work best if assembled together. This principle was translated into a scoring function that assigns linear weights to the edges between neighboring domains, proportional to their phylogenetic distance taken from taxonomic data. The shortest path over all domains is calculated using Dijkstra's algorithm. As it turns out, the underlying scoring function can be further improved by incorporating specific experimental evidence. For example, [Marahiel2009] has shown that the success of domain shuffling depends tremendously on the linker between the new neighboring domains. In particular, it was suggested that the C-A linker should be kept intact. Likewise, domains that originate from the same organism get an additional preference in the scoring function, which reduces the number of fragments that have to be amplified by PCR and subsequently assembled. Although it already yields useful results, this scoring function will be improved in the next update of the NRPSDesigner. For example, the distance between two domains can be further specified using homology information or the differences in the tertiary structure of both domains.
Connecting with NRPSDesigner
One of the key features of the NRPSDesigner is its ability to combine scientific rigor with a user-friendly interface, making both NRPS and synthetic biology more approachable to the broader community. This principle is implemented with the help of several features and interfaces facilitating the process of NRP in silico design and experimental validation. For example, to help develop a cloning strategy for the desired construct, we added a tool for automated primer design. Although J5 is a very rigorous software tool with complex primer design algorithms using Primer3, we decided on Gibthon, which is well known to the iGEM community. It offers a very user-friendly interface with a direct connection to the Parts Registry. In contrast to J5, Gibthon uses mFold and issues warnings in case of mispriming or self-priming. Should J5 or an entirely different tool prove to be more long-lasting, the modular implementation within the Django framework allows another primer design software to be incorporated. So far, however, we have focused solely on improving the Gibthon software by adding a GenBank output, which saves primer positions and the descriptions of the assembled fragments. With this version the Gibthon construct can be exported in SBOL, which is also used internally in the NRPSDesigner when C++ communicates with the Python code.
Beyond providing a user-friendly way of designing new NRPs, the NRPSDesigner also incorporates our novel experimental findings showing that NRPs can be tagged with Indigoidine. The results of the Gibson assembly can be easily screened using a very simple blue-colored readout. This fits very well with the general goal of the NRPSDesigner, which tries to make NRPS cloning accessible to a wider scientific audience: the Indigoidine tag is an inexpensive alternative to costly screening procedures such as mass spectrometry.