Team:Heidelberg/Project Software

From 2013.igem.org

Revision as of 00:45, 5 October 2013 by Hetitus


NRPSDesigner. Design your own NRP.

Highlights

  • Computer-aided design of fully synthetic NRPS designed to produce a user-defined short peptide
  • Optimal domain assembly based on evolutionary distance
  • Curated database storing information of XXX Domains and XX modules with XX possible monomer specificities
  • Automated domain recognition for newly entered NRPS sequences
  • Integration of Gibthon to facilitate implementation of cloning strategy
  • Parts registry interface and SBOL output format

Abstract

Non-ribosomal peptide (NRP) synthesis is a biochemical process of remarkable hierarchical organization. Vertically, it can be described stepwise, starting from the coding DNA sequence, which is translated into a giant enzyme that in turn catalyzes the actual NRP assembly. Horizontally, its complexity arises from a modular arrangement of functional protein units. Given this systematic composition, a bioinformatic approach appears most suitable for the automated design of fully synthetic NRPs.
Here, we introduce a comprehensive software tool, the NRPSDesigner, which facilitates the prediction and synthesis of non-ribosomal peptide synthetases (NRPS) that catalyze customized NRP assembly. The predictive power of the NRPSDesigner rests on a curated database storing information on about 200 NRPS modules, their DNA coding sequences and substrate specificities. This information is used to calculate the optimal domain sequence according to the weighted phylogenetic distance between domain origins. Additionally, an integrated domain recognition algorithm allows for curated expansion of the database. To accelerate the process from in silico NRPS design to experimental validation, we embedded Gibthon, the iGEM software tool of Cambridge 2010, for Gibson primer construction. With this framework we want to suggest a new standard for the fast and accurate computer-aided design of customized short peptides.

Introduction

Even though biological processes can be characterized by their physicochemical properties, they can also be translated into an abstract model of interconnected functional entities. This is exemplified by the central dogma of molecular biology, which describes the flow of information from DNA to RNA and finally to proteins. Synthetic biology has always tried to intervene at these levels of organization with the goal of systematically controlling the projected outcome [1].

NRPS carry this principle to the extreme by adding yet another hierarchical level:

The modular complex sequentially synthesizes its own short non-ribosomal peptide (NRP). These peptides in turn are not limited to the standard set of proteinogenic amino acids; instead, D-isoforms and diverse modifications can be utilized [2]. Nature has made great use of this system by creating versatile natural products such as antibiotics, metallophores or dyes [2] [3].

Although little is known about the actual dynamic properties of the synthesis process, most of its logical rules are well understood: An NRP synthetase consists of a number of modules, each of which is responsible for adding one amino acid to the nascent peptide. A module can in turn be subdivided into domains, each with a distinct functionality [4].

This hierarchical organization holds great potential for the synthetic biology community: The exchange or combination of modules and domains from different organisms or different proteins has repeatedly been shown to produce fully functional NRPS [5] [6]. Not surprisingly, several bioinformatics approaches have put great effort into meticulously categorizing NRPS and their functionality. For example, databases such as NRPS-PKS [7] or Clustermine360 describe the domain organization of diverse NRPS, while the Norine [8] database includes information about non-ribosomal peptides and their sequence of monomers. In addition, many tools are capable of predicting domain sequences, substrate specificity and hence the putative product of a particular NRPS. Examples are the NRPS-PKS [7] tool and the PKS-NRPS [9] analysis tool (hereafter referred to as the Maryland tool). While antiSMASH [10] [11] provides similar prediction capabilities, its scope is broader and covers many different secondary metabolite pathways.

At the same time, as the understanding of the underlying biological processes and of methods for assembling diverse DNA constructs has improved, many novel software tools have been aimed at the computer-aided design (CAD) of DNA sequences. Examples are J5; Clotho, a framework enabling the automated and computer-assisted design of synthetic biology constructs, introduced by the Berkeley iGEM team of 2008; and Gibthon, a web app created by the Cambridge iGEM team of 2010, which suggests primers for assembling a set of predefined DNA fragments by Gibson cloning.

Influenced by this development, we introduce here the NRPSDesigner, an integrated CAD software implemented to facilitate the design of customized synthetic NRPs. In particular, the NRPSDesigner includes the following features: Based on the NRPS-PKS database, we built a manually curated database that captures the biological complexity of NRPS and stores information on about 200 NRPS modules, their coding sequences and substrate specificities. The database can easily be extended with curated content using automated domain prediction based on Hidden Markov Models. Using this information, the NRPSDesigner can calculate an optimal sequence of domains based on simple evolutionary assumptions. To accelerate the process of testing this synthetic construct and eventually producing a customized peptide, we added further assisting software to the framework. We offer to incorporate the domains necessary for combining the nascent peptide with an Indigoidine tag. Furthermore, the embedded Gibthon software automates the suggestion of primers necessary for assembling the predicted domains by Gibson cloning.

Results

NRPSDesigner database structure

The NRPSDesigner is a knowledge-based software tool that uses stored information about NRPS pathways to predict the optimal domain sequence for producing a user-defined NRP. Because the designer depends on a comprehensive description of the biological and biochemical properties of NRPSs, the organization of this storage is of great importance for its functionality. For this purpose we built a hierarchical database (Fig. 1) that comprises three layers of complexity: i) the DNA level, represented by all DNA coding sequences; ii) the NRPS domains encoded by these sequences, to which they are directly linked; and iii) detailed information about the substrate of the corresponding domain and its potential modifications.

Besides the tight links between these layers, each layer also points to additional database entries that complete the information needed by the design algorithm. For example, a DNA coding sequence is linked not only to its product, the translated domain, but also to its origin (organism, plasmid etc.). Additionally, a coding sequence can be connected to another coding sequence: this ‘parent’ sequence is the predecessor of a stored sequence that has already undergone biosynthetic modification. On the domain level there is an upstream link to the coding sequence, but also a link to the specific type of domain (e.g. thioesterase or condensation domain). Some domains, such as adenylation domains, also point to monomers, based on their substrate specificity. For these substrates we store their chirality, their modification and whether they are proteinogenic or not, e.g. histidine or ornithine (Fig. 2). To enable the NRPSDesigner to use information from outside the database, entries are equipped with global identifiers: for organisms we store the NCBI taxon id, and for BioBricks the unique identifier in the Parts Registry. To integrate the content with other databases, we created for every layer a linkout entry that consists of a type and a specific identifier. The linkout type includes a description of the corresponding resource as well as a URL, which in combination with the specific identifier enables the cross-linking of each database entry to other resources. The most common linkout types are Norine and PubChem IDs for the substrates, Pfam IDs for the domain types and GenBank identifiers for the coding sequences.
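
The layered organization described above can be illustrated with a minimal Python sketch. It is not the actual NRPSDesigner schema: all class and field names, the URL template and the identifier are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LinkoutType:
    """An external resource: a description plus a URL template (hypothetical layout)."""
    description: str
    url_template: str  # assumed to contain one "{id}" placeholder


@dataclass
class Linkout:
    type: LinkoutType
    identifier: str

    def url(self) -> str:
        # The type's URL combined with the specific identifier gives the cross-link.
        return self.type.url_template.format(id=self.identifier)


@dataclass
class Substrate:
    name: str
    chirality: str                      # "L" or "D"
    proteinogenic: bool
    modification: Optional[str] = None
    linkouts: List[Linkout] = field(default_factory=list)


@dataclass
class CodingSequence:
    dna: str
    ncbi_taxon_id: Optional[int] = None          # origin organism
    parent: Optional["CodingSequence"] = None    # predecessor sequence, if any
    linkouts: List[Linkout] = field(default_factory=list)


@dataclass
class Domain:
    domain_type: str                    # e.g. "adenylation" or "thioesterase"
    cds: CodingSequence                 # upstream link to the DNA level
    substrates: List[Substrate] = field(default_factory=list)


# Usage with placeholder values (the URL and the identifier are made up).
compound_db = LinkoutType("compound database", "https://example.org/compound/{id}")
ornithine = Substrate("ornithine", "L", False, linkouts=[Linkout(compound_db, "12345")])
a_domain = Domain("adenylation", CodingSequence("ATGGCC"), [ornithine])
print(ornithine.linkouts[0].url())
```

Keeping the URL on the linkout type and only the identifier on each entry means that a changed resource address has to be updated in one place only.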

To visualize the NRPS domains, we added a JSON representation of each domain based on the Pfam graphics library. To visualize the chemical structures of the substrates with Open Babel [12], we store each substrate in the SDF (structure data file) format.

NRPSDesigner database content

The information currently stored in the NRPS database was mainly retrieved from already published data [7] and extended or changed according to our own experimental results. Because it already provides the domain organization and the related substrate specificities, we mainly filled our database with information from the NRPS-PKS library, which stores the protein sequences of diverse NRPS, and added the missing coding sequences. Accordingly, the positions of domain boundaries and linkers had to be converted from protein coordinates to DNA coordinates. In some cases manual curation was necessary, because the substrate specificity stored in the NRPS-PKS database did not agree with the published experimental results. Furthermore, several pathways seem to contain special domain types with unclear functionality. In such cases we used antiSMASH combined with in-depth literature research to clarify the correct domain functionality. The current progress of the database curation is shown in Table 1. In total our database contains curated information about…
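
The conversion from protein coordinates to DNA coordinates mentioned above follows from the codon length of three nucleotides. The following is a small sketch under the assumption of 1-based, inclusive coordinates; the function name and the `cds_offset` parameter are hypothetical and not taken from the NRPSDesigner code.

```python
from typing import Tuple


def protein_to_dna_coords(aa_start: int, aa_stop: int, cds_offset: int = 0) -> Tuple[int, int]:
    """Convert a 1-based, inclusive amino-acid range into a 1-based,
    inclusive nucleotide range.

    cds_offset is the number of nucleotides that precede the first codon
    in the stored DNA sequence (0 if the sequence starts at the start codon).
    """
    if aa_start < 1 or aa_stop < aa_start:
        raise ValueError("expected 1 <= aa_start <= aa_stop")
    dna_start = cds_offset + 3 * (aa_start - 1) + 1   # first base of the first codon
    dna_stop = cds_offset + 3 * aa_stop               # last base of the last codon
    return dna_start, dna_stop


# A domain spanning residues 10 to 20 covers nucleotides 28 to 60.
print(protein_to_dna_coords(10, 20))
```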

Extension and validation of the database

One of the core requirements for our software is the ability to detect NRPS domains. This feature enables the automatic and standardized definition of domain boundaries and also facilitates the addition of new entries into the NRPSDesigner database by the community.

Based on the antiSMASH [10] [11] program we established our own pipeline to maximize domain recognition specificity. antiSMASH is implemented in (Bio)Python and uses HMMER3 [13], and could thus be easily integrated into our framework. To improve domain recognition, especially for the Adenylation/Oxidation/Adenylation (AOxA) domain of IndC, an HMM profile of the AOxA domain appearing in diverse indigoidine synthetases was constructed and added to the pipeline. Also, the Adenylation domain HMM of antiSMASH (from Pfam, id PF00501.21) was replaced by the HMM of the Maryland tool (special thanks to Prof. Jacques Ravel for providing us with the seed alignment). Note that the start boundary of the A domain varied considerably depending on the prediction tool used (Fig. 3). We chose the Maryland HMM because it agreed best with our Tyrocidine experiments as well as with previously published results [14]. Furthermore, Thiolation domains have been split into two categories, depending on whether an epimerisation domain follows or not, as experimental evidence has shown that their functionality differs [15].

To make extending the NRPSDesigner database easy, this domain recognition pipeline was integrated into a user-friendly interface. The user can add a description for each domain, change the specificity of an adenylation domain if it has not been predicted correctly, and define their own domain boundaries with the help of an integrated multiple sequence alignment against the domains of the same type already present in the database.

Guidance for cloning of NRPS constructs with Gibthon

The NRPSDesigner goes further than offering in silico predicted sequences of NRPS domains that produce a particular NRP: it also implements a cloning procedure based on Gibson assembly. This should make NRPS more accessible to the synthetic biology community. One of the most popular tools for computer-aided primer design is Gibthon. Created as a web app by Bill Collins of the Cambridge iGEM Team of 2010, Gibthon suggests primers for the assembly of predefined DNA fragments (entered by the user or imported from the Parts Registry) using Gibson cloning. Since Gibthon was such a successful iGEM software project and is also written in Django, we decided to use it as our tool of choice for automated primer design.

Gibthon was integrated into the core NRPSDesigner GUI using a modular interface. Particular care was taken to keep Gibthon and the rest of the NRPSDesigner clearly separated, to enable the use of other primer design software such as J5 [16] in the future. The integration strategy is the following: For each of the domains returned by the in silico prediction, the DNA sequence is extracted from the NRPSDesigner database. The resulting sequences are arranged in a plasmid structure, while ensuring that only a minimal number of Gibson fragments has to be assembled. These sequences, together with metadata such as references and descriptions, are converted to the Gibthon database format and then copied into the Gibthon gene fragment table. The user then has access to the standard Gibthon interface to get an overview of the suggested primers. Like Gibthon, the NRPSDesigner is tightly linked with the Registry of Standard Biological Parts. Users can add their parts of choice using the automated Parts Registry import tool. However, some additional restrictions have been put in place to ensure the integrity of the designed NRPS sequence: The user cannot enter a new fragment/part between two of the NRPS domains; instead, it can only be placed after the Thioesterase domain and before the initiation adenylation domain.
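
One way to keep the number of Gibson fragments small is to merge consecutive domains that are already neighbours on the same coding sequence into a single fragment. The sketch below illustrates this idea only; it is not the NRPSDesigner implementation, and the names `DomainHit` and `merge_into_fragments`, the `max_gap` threshold and all coordinates are assumptions.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    cds_id: str   # coding sequence the domain is taken from
    start: int    # 1-based, inclusive DNA coordinates on that sequence
    stop: int


def merge_into_fragments(domains: List[DomainHit], max_gap: int = 150) -> List[DomainHit]:
    """Merge consecutive domains into one fragment if they lie on the same
    coding sequence and are separated by at most max_gap nucleotides
    (an assumed bound on the native linker length)."""
    fragments: List[DomainHit] = []
    for dom in domains:
        if fragments:
            last = fragments[-1]
            gap = dom.start - last.stop - 1
            if dom.cds_id == last.cds_id and 0 <= gap <= max_gap:
                # Extend the previous fragment across the native linker.
                fragments[-1] = DomainHit(last.cds_id, last.start, dom.stop)
                continue
        fragments.append(dom)
    return fragments


# Made-up example: two neighbouring domains from cdsA and one from cdsB.
design = [DomainHit("cdsA", 1, 1500), DomainHit("cdsA", 1561, 1800), DomainHit("cdsB", 400, 1900)]
fragments = merge_into_fragments(design)
print(fragments)
```

Merged fragments also keep the native linker between the domains intact, so fewer synthetic junctions have to be introduced.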

In silico design of the optimal NRPS sequence

The quickly growing number of known NRPS pathways increases the pool of modules and domains that catalyse the incorporation of a specific peptide monomer. Especially on the domain level there is often more than one NRPS protein specific for the same substrate or catalytic reaction. To determine the optimal selection, we assume that there is a combination of domains that provides the best possible performance when combined into an NRPS, and that this performance can be objectively quantified relative to other domain combinations. Finding the optimal domain combination then amounts to finding the shortest path from a start point to an end point on a graph-based representation of the problem. The NRPSDesigner constructs a directed acyclic graph (DAG) representing all possible domain combinations capable of synthesizing the desired peptide (Fig. 4). The DAG is organized in layers, corresponding to positions in the final NRPS. Every layer comprises a discrete number of nodes representing NRPS domains, and every node of a layer is connected to every node of the next layer. No edges exist within a layer or to previous layers. The first and last nodes, being the start and end points of the shortest path problem, are dummy nodes without an associated domain; edges pointing to or from them have a weight of zero.
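
The layered graph and the shortest path search can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the C++ implementation of the NRPSDesigner: the function name `best_domain_path`, the string labels for domains and the toy weight function are hypothetical.

```python
import heapq
from typing import Callable, List, Sequence, Tuple


def best_domain_path(layers: Sequence[Sequence[str]],
                     weight: Callable[[str, str], float]) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm on a layered DAG.

    layers[i] lists the candidate domains for position i. Nodes are
    (layer, index) pairs; (-1, 0) and (len(layers), 0) are the dummy
    start and end nodes, whose edges have weight zero.
    """
    n = len(layers)
    start, end = (-1, 0), (n, 0)

    def successors(node):
        layer, idx = node
        nxt = layer + 1
        if nxt == n:
            yield end, 0.0
        elif nxt < n:
            for j, cand in enumerate(layers[nxt]):
                w = 0.0 if layer == -1 else weight(layers[layer][idx], cand)
                yield (nxt, j), w

    dist = {start: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == end:
            break
        for succ, w in successors(node):
            if d + w < dist.get(succ, float("inf")):
                dist[succ] = d + w
                prev[succ] = node
                heapq.heappush(heap, (d + w, succ))
    if end not in dist:
        raise ValueError("no domain available for at least one position")

    # Walk back from the end node, skipping both dummy nodes.
    path, node = [], prev[end]
    while node != start:
        path.append(layers[node[0]][node[1]])
        node = prev[node]
    return dist[end], path[::-1]


# Toy example: labels are "<domain>_<organism>"; same organism costs 0, otherwise 1.
toy_layers = [["A_org1", "A_org2"], ["T_org1", "T_org2"], ["TE_org2"]]
toy_weight = lambda a, b: 0.0 if a.split("_")[1] == b.split("_")[1] else 1.0
cost, path = best_domain_path(toy_layers, toy_weight)
print(cost, path)
```

Because edges only run from one layer to the next, a simple layer-by-layer dynamic program would give the same result; Dijkstra's algorithm is used here because it is the method named in the text.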

So far, very little is known about the potential functionality of synthetic NRPS, including possible assembly rules for modules and domains. We therefore assume a simple scoring function for the edge weights between the user-selected domains, in which combinations of domains from closely related organisms are more likely to form a functional NRPS than combinations involving organisms that are far apart on the taxonomic tree. The taxonomic distance is currently determined from lineage information available in the NCBI taxonomy database [17], but could easily be extended to include sequence-based information [18] or other, possibly better fitting, criteria as soon as more experimental evidence becomes available. In addition, domains which are directly consecutive in their native organism are preferred over those from different modules, if possible. The well-known shortest path algorithm by E.W. Dijkstra [19] is then employed to find the optimal domain combination, using the dummy nodes as start and stop nodes.
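
A distance of this kind can be computed from two lineages by counting the ranks that separate each organism from their last common ancestor. The sketch below is one plausible reading of the scoring described above, not the exact function used by the NRPSDesigner; the dictionary keys, the constant offset of 1.0 and the simplified lineages are assumptions.

```python
from typing import Dict, Sequence


def taxonomic_distance(lineage_a: Sequence[str], lineage_b: Sequence[str]) -> int:
    """Number of ranks separating two lineages from their last common ancestor.
    Lineages are ordered from the root (e.g. "Bacteria") to the genus."""
    shared = 0
    for rank_a, rank_b in zip(lineage_a, lineage_b):
        if rank_a != rank_b:
            break
        shared += 1
    return (len(lineage_a) - shared) + (len(lineage_b) - shared)


def edge_weight(a: Dict, b: Dict) -> float:
    """Edge weight between two consecutive domains.

    Domains that directly follow each other on the same coding sequence
    get weight 0; all other pairs cost 1 plus their taxonomic distance,
    so that native neighbours are always preferred (assumed offset)."""
    if a["cds_id"] == b["cds_id"] and b["position"] == a["position"] + 1:
        return 0.0
    return 1.0 + taxonomic_distance(a["lineage"], b["lineage"])


# Simplified, illustrative lineages.
bacillus = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus"]
brevibacillus = ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Paenibacillaceae", "Brevibacillus"]

dom1 = {"cds_id": "cdsA", "position": 1, "lineage": bacillus}
dom2 = {"cds_id": "cdsA", "position": 2, "lineage": bacillus}
dom3 = {"cds_id": "cdsB", "position": 1, "lineage": brevibacillus}
print(edge_weight(dom1, dom2), edge_weight(dom1, dom3))
```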

Implementation

Following the example of major bioinformatics applications like BLAST [20], the NRPSDesigner was split into a command-line executable comprising the core algorithm, and a web interface providing easy access to the functionality enriched with some extra features. This design enables potential users to set up their own NRPSDesigner instance according to their needs using the compiled executable and a database dump.

The command-line executable

We chose C++ as the implementation language for the command-line executable to avoid dependencies on run-time interpreters and to maximise speed. Additionally, the use of features from the new C++11 standard, such as smart pointers, lambda functions and hash tables, allows for terser, faster and less error-prone code. To facilitate re-use of the code in other contexts, all of the functionality is contained within a (currently static) library, with the executable being a thin wrapper around it. In the future, even tighter integration with the web interface can be imagined. In this case a Python wrapper would be implemented around the library using a software package like SIP or SWIG.

To be prepared for the case that the Parts Registry or some other publicly accessible NRPS database offers the necessary functionality at some point in the future, we decoupled the algorithm from the database and allowed for multiple database backends. We designed an abstract base class defining the interface required by other parts of our code, and a static factory function responsible for constructing an object of the appropriate backend class. The factory employs the singleton pattern, which allows only one instance of the database backend to exist within the entire program. Currently, only one database backend exists; it retrieves data from our MySQL database using the MySQL C++ connector library. Custom exception classes complete the abstraction: all SQL exceptions are caught within the MySQL backend, and a DatabaseError containing information about the specific error is thrown instead.
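
The combination of an abstract interface, a factory with a single shared instance and a backend-independent exception can be sketched as follows. The actual implementation is in C++; this Python version only illustrates the pattern. Apart from DatabaseError, which is named in the text, all identifiers (including the in-memory backend) are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DatabaseError(Exception):
    """Backend-independent error, raised instead of backend-specific exceptions."""


class DatabaseBackend(ABC):
    """Interface the rest of the program relies on."""

    @abstractmethod
    def domains_for_substrate(self, substrate: str) -> List[str]:
        ...


class InMemoryBackend(DatabaseBackend):
    """Stand-in backend for this sketch; a MySQL backend would implement the same interface."""

    def __init__(self, table: Optional[Dict[str, List[str]]] = None):
        self._table = table or {}

    def domains_for_substrate(self, substrate: str) -> List[str]:
        try:
            return list(self._table[substrate])
        except KeyError as exc:
            # Translate the backend-specific error into the common one.
            raise DatabaseError(f"no domain found for substrate {substrate!r}") from exc


_BACKENDS = {"memory": InMemoryBackend}
_instance: Optional[DatabaseBackend] = None


def get_backend(name: str = "memory", **options) -> DatabaseBackend:
    """Factory that creates the backend on first use and returns the
    same instance on every later call (singleton)."""
    global _instance
    if _instance is None:
        if name not in _BACKENDS:
            raise DatabaseError(f"unknown database backend {name!r}")
        _instance = _BACKENDS[name](**options)
    return _instance


backend = get_backend("memory", table={"Orn": ["A_Orn_1"]})
print(backend.domains_for_substrate("Orn"))
```

Code that calls `get_backend()` never sees which backend class is in use, so a further backend only needs a new entry in the registry.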

Taxonomy information is retrieved on-demand from the NCBI server via the NCBI XML API [21] using the CURL network library and further parsed using the LibXML library. In case of network failure, the NRPSDesigner can fall back to a local copy of the taxonomy database, which can be retrieved from the NCBI FTP server in advance. In order to reduce network traffic and memory consumption when the final graph is built, as little data as possible is loaded in advance. The missing taxonomy information is retrieved lazily in parallel to the shortest path computation and cached afterwards. DNA sequences and metadata not required for the pathway construction are fetched after the pathway has been constructed only for the domains the pathway is comprised of. The final pathway is returned in form of an XML file, built with the LibXML library, containing information on the domains and their origin.
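
The on-demand retrieval with caching and local fallback can be sketched as follows. This is a simplified, sequential illustration (the parallel retrieval is omitted); the class name, the injected fetch function and the taxon ids are assumptions, and no real NCBI API call is made.

```python
from typing import Callable, Dict, List, Optional


class LazyTaxonomy:
    """Fetch lineages only when first requested and cache them afterwards."""

    def __init__(self,
                 fetch_remote: Callable[[int], List[str]],
                 local_copy: Optional[Dict[int, List[str]]] = None):
        self._fetch_remote = fetch_remote      # e.g. a function querying a web service
        self._local = local_copy or {}         # pre-downloaded fallback data
        self._cache: Dict[int, List[str]] = {}

    def lineage(self, taxon_id: int) -> List[str]:
        if taxon_id not in self._cache:
            try:
                self._cache[taxon_id] = self._fetch_remote(taxon_id)
            except OSError:
                # Network failure: fall back to the local copy.
                self._cache[taxon_id] = self._local[taxon_id]
        return self._cache[taxon_id]


# Fake remote source for demonstration; the ids 101 and 202 are made up.
calls = []

def fake_fetch(taxon_id: int) -> List[str]:
    calls.append(taxon_id)
    if taxon_id == 202:
        raise ConnectionError("simulated network failure")
    return ["Bacteria", "Firmicutes"]

taxonomy = LazyTaxonomy(fake_fetch, local_copy={202: ["Bacteria", "Actinobacteria"]})
taxonomy.lineage(101)
taxonomy.lineage(101)          # served from the cache, no second fetch
print(taxonomy.lineage(202))   # served from the local copy
```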

For command-line option parsing, the Boost.program_options library was chosen, as it promotes modularity by allowing every component to specify its own set of options. This enables switching between different database backends or changing the existing one without the rest of the program knowing about it. Although this approach introduces an extra dependency to the NRPSDesigner library itself and not just the executable, Boost.program_options allows querying of options, their associated data type and other metadata. Therefore e.g. a graphical user interface written in C++ could extract this information and dynamically build the required input fields, without knowing anything about the underlying algorithm or backends.

The web interface

In order to allow tight integration with the Gibthon software package of iGEM 2010, we chose the Django framework, written in Python, as the back-end of the web interface. The front-end was implemented using the Bootstrap 2, jQuery and jQueryUI frameworks. The interface allows the user to select the desired peptide monomers and specify their configuration/modification. Subsequently, a 2D structure preview of the resulting molecule is generated on the fly using the OpenBabel library [12] via its Python bindings. The structures of the individual amino acids are retrieved from the database, and a SMARTS pattern search for the NH2-CR-COOH group is used to construct peptide bonds between consecutive monomers. The resulting peptide structure is projected to 2D coordinates, exported as SVG and then sent directly to the browser. To reduce server load, structures are cached using Django’s view caching functionality.
Heavy computations like pathway construction or domain prediction are performed asynchronously outside of the web server to cope with long computation times and to avoid request timeouts. The Celery Python package combined with RabbitMQ is used for managing and scheduling the tasks. Additionally, log messages are collected and retrieved by the browser via AJAX polling. After the pathway has been constructed, a visualization of the resulting NRPS is generated using the Pfam graphics JavaScript library, and a Gibson construct is created. To this end, the Gibthon software package was integrated and slightly modified, such that no fragments can be inserted between consecutive NRPS domains, automatically generated fragments are hidden from the user, and user-defined fragments are sorted by their type. Additionally, a bug that prevented the import of BioBricks from the Parts Registry was fixed.

Users are able to enter new domains and pathways into the database by providing the coding sequence of their NRPS. After simple validity checks using the BioPython package, a modified version of the antiSMASH pipeline [10] [11], optimized for NRPS domain prediction, is run on the sequence to determine domain types, their positions, and their substrate specificity, if any. The results of the prediction are presented to the user, who can manually verify their accuracy and correct possible mistakes. To further improve domain linker definition, a multiple sequence alignment of a domain against all other domains of the same type in the database can be performed using the Clustal Omega software package [22]. The alignment is presented in an interactive dialog via JavaScript extracted from the Kalignvu [23] software package, where the user can inspect the conserved residues and set the domain linker positions appropriately.

Discussion

NRPSDesigner database compared to other NRPS databases

While exploring the most commonly used databases for NRPS (Norine [8], NRPS-PKS [7], Clustermine360), we realized that most tools were not designed for the needs of the biosynthetic community. The organization of a gene cluster, the relative arrangement of individual genes, possible native promoters or other genomic elements, as available in Clustermine360, can be neglected for the purpose of NRPS design. Conversely, the approach of NRPS-PKS to include only protein sequences would not be sufficient, as synthetic modification starts at the DNA level. Consequently, we decided to establish our own database focusing mainly on DNA coding sequences, domain types and substrate specificities. Additional information needed for the customization of the final construct (regulative promoters, restriction sites etc.) can be included directly via an interface with the Registry of Standard Biological Parts.

To maintain a high curation standard we implemented a semi-automated domain recognition algorithm for the NRPSDesigner framework based on Hidden Markov Models (HMM). Again, we chose to adapt already existing approaches in agreement with our own experimental results (see section). Accordingly, we added an HMM specific for Adenylation/Oxidation/Adenylation domains, making our tool the only one available that can detect this combined domain commonly found in Indigoidine synthetases. A further distinction between thiolation domains that are followed by an epimerisation domain and those that are not extends the predictive power of the NRPSDesigner.

Despite the advantages of automated domain recognition, we strongly emphasize the importance of manual curation with the help of peer-reviewed sources. For this reason the user is allowed to change database entries based on experimental evidence. The current progress of the database curation is shown in Table 1.

Scoring Function

Initially, the internal algorithm calculating the optimal sequence of NRPS domains was based on the simple assumption that phylogenetically close modules also work best when assembled together. This principle was translated into a scoring function that assigns linear weights to the edges between neighboring domains, proportional to their phylogenetic distance as taken from taxonomic data. The shortest path over all domains is calculated by Dijkstra’s algorithm. As it turns out, the underlying scoring function can be further improved by incorporating specific experimental evidence. For example, [24] has shown that the success of domain shuffling depends strongly on the linker between the new neighboring domains; in particular, it was suggested that the C-A linker should be kept intact. Likewise, domains that originate from the same organism receive an additional preference in the scoring function, which reduces the number of fragments that have to be amplified by PCR and subsequently assembled. Although it already yields useful results, this scoring function will be improved in the next update of the NRPSDesigner. For example, the distance between two domains can be further specified using homology information or differences in the tertiary structure of both domains.

Interaction with the wet lab

One of the core aspects of synthetic biology is its interaction with systems biology [25] and computational modeling in general, which in turn depend heavily on thorough knowledge exchange with experimentalists. Consequently, this exchange was also one of the main goals throughout the development of the NRPSDesigner.

Experimental evidence had a variety of implications for the design of our software. For one, experiments showed at an early stage that the assumptions of modularity that are essential for the NRPSDesigner do indeed hold. They also allowed us to further refine the scoring function with regard to the importance of linkers between different domains. The Indigoidine project showed that the linker between T-TE domains is more important than the one between A-T, a result not previously published (to the best of our knowledge). In addition, the exact definition of domain boundaries and the use of appropriate HMMs was guided by the actual experimental results (e.g. C-A domain boundary based on Tyrocidine results). Finally, the Indigoidine tag was included in the software only once its functionality had been proven in the wet lab.

In the opposite direction, the NRPSDesigner can also influence experimentalists: The most obvious way is the cloning and expression of novel NRPS constructs in vivo, as suggested by the Designer. But even the prediction pipeline that is part of the Designer can have valuable implications for experiments: Domain shuffling could be facilitated using the domain boundaries predicted by the pipeline. Our tool is especially suited to the recognition of domains in previously uncharacterized NRPS pathways.

Connecting with the NRPSDesigner

One of the key features of the NRPSDesigner is its ability to combine scientific rigor with a user-friendly interface, making both NRPS and synthetic biology more approachable to the broader community. This principle is implemented with the help of several features and interfaces facilitating the in silico design and experimental validation of NRPs. For example, to help develop a cloning strategy for the desired construct, we added a tool for automated primer design. Although J5 [16] is a very rigorous software tool with complex primer design algorithms using Primer3, we decided on Gibthon, which is well known to the iGEM community. It has a very user-friendly interface with a direct connection to the Parts Registry. In contrast to J5, Gibthon uses mFold [26] and issues warnings in case of mispriming or self-priming. Should J5 or an entirely different tool prove to be more long-lived, the modular implementation within the Django framework allows another primer design software to be incorporated. So far, however, we have focused solely on improving the Gibthon software by adding a GenBank output, which saves primer positions and the description of the assembled fragments. With this version the Gibthon construct can be exported in SBOL, which is also used internally in the NRPSDesigner when the C++ code communicates with the Python code.

Beyond providing a user-friendly way of designing new NRPs, the NRPSDesigner also incorporates our novel experimental finding that NRPs can be tagged with Indigoidine. The results of the Gibson assembly can thus be screened easily using a simple blue-colored readout. This fits well with the general goal of the NRPSDesigner, which is to make NRPS cloning accessible to a wider scientific audience: the Indigoidine tag is a cheaper alternative to expensive screening procedures such as mass spectrometry.

Outlook

With the NRPSDesigner, we have managed to provide a powerful tool that introduces NRPS to the synthetic biology community, making it easier to create novel NRPs or improve existing antibiotics, pigments, etc. As the community gets involved, the NRPSDesigner will continuously improve, as more domains will be entered into the database and thus more exotic amino acids will become available for inclusion into the final NRP. Also, as more teams follow our paradigm of submitting NRPS domains to the Parts Registry, it will become possible to clone NRPS constructs for diverse peptides by just using parts available in the Registry, rather than trying to access the original strains. For the foreseeable future we have several important changes planned, some of which will be implemented within the next month, while others are long-term goals:

NRPSDesigner 1.1:

  • Database access: Currently the wealth of information available in the NRPSDesigner database can only be accessed by loading the provided MySQL dump. We intend to make access a lot easier for everyone, by introducing a user-friendly interface for database exploration.
  • Improved Multiple Sequence Alignment view: The multiple sequence alignment (MSA) is supposed to simplify the setting of domain boundaries following the automated boundary predictions, giving extra information for the improvement of the HMMs. This view will get more user-friendly, as the predicted boundaries will be visualized on the MSA and the selection of new boundaries will become the matter of a few clicks.
  • Modifications: You will be able to add N-methylation modifications to a wide range of amino acids.
  • User Interface based on Bootstrap 3: To simplify the coding architecture for the GUI and stay up to date at the same time, we plan to change the coding environment from Bootstrap 2/jQuery UI to a Bootstrap 3 based application.

NRPSDesigner 2.0:

  • REST API: Programmatically access the wealth of knowledge available in the NRPSDesigner database, design domains for new peptides or get suggestions for DNA assembly of the former.
  • Even more modifications: Add cyclizations, heterocyclizations, reductions and oxidations to your final NRP.
  • More safety: Categorize species according to their safety level and issue appropriate warnings for non-S1 organisms. Automatically disable design of NRPS for known or predicted toxic peptides.
  • Structure pattern matching: Rather than having to enter the sequence of monomers for the final NRP, the user will be able to enter the structure of a molecule of choice. The NRPSDesigner will then use structure matching algorithms to check whether this molecule is an NRP and, if so, suggest a cloning strategy.
  • Multiplasmid cloning strategies: For very large NRPS, nature splits the domains across multiple coding sequences, and the translated proteins are then combined by N-terminal and C-terminal communication domains. Thus, rather than having to assemble huge constructs on single plasmids, the NRPSDesigner will be able to suggest constructs split across multiple plasmids.
  • Improved Gibthon: More state-of-the-art algorithms for automated primer suggestion will allow short DNA sequences, such as an RBS, to be synthesized as part of the 5’ end of the primers, and may support other cloning techniques, such as Golden Gate assembly.
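
The multiplasmid idea can be illustrated with a small sketch: consecutive modules are grouped greedily so that no plasmid exceeds a maximum insert size, and splits happen only at module boundaries. This is an assumption-laden illustration, not the planned NRPSDesigner algorithm. The function name `split_into_plasmids`, the module lengths and the 8000 bp limit are all hypothetical, and the sketch ignores the extra length of the communication domains that would have to be added at each split point.

```python
def split_into_plasmids(module_lengths, max_insert_bp):
    """Greedily group consecutive modules so that each group's total
    length stays within max_insert_bp. Module order is preserved."""
    plasmids, current, current_len = [], [], 0
    for name, length in module_lengths:
        if length > max_insert_bp:
            raise ValueError(f"module {name} alone exceeds the insert limit")
        # Start a new plasmid if adding this module would exceed the limit.
        if current and current_len + length > max_insert_bp:
            plasmids.append(current)
            current, current_len = [], 0
        current.append(name)
        current_len += length
    if current:
        plasmids.append(current)
    return plasmids


# Hypothetical module names and coding-sequence lengths in base pairs.
modules = [("M1", 3300), ("M2", 3100), ("M3", 4500), ("M4", 3200)]
plan = split_into_plasmids(modules, max_insert_bp=8000)
print(plan)
```

Because the modules must stay in order, this greedy grouping already yields the smallest possible number of plasmids for a given size limit; a real implementation would additionally have to choose compatible communication domain pairs for each junction.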

1. Endy D (2005) Foundations for engineering biology. Nature 438: 449–453.

2. Hur GH, Vickery CR, Burkart MD (2012) Explorations of catalytic domains in non-ribosomal peptide synthetase enzymology. Nat Prod Rep 29: 1074–1098.

3. Reverchon S, Rouanet C, Expert D, Nasser W (2002) Characterization of indigoidine biosynthetic genes in Erwinia chrysanthemi and role of this blue pigment in pathogenicity. J Bacteriol 184: 654–665.

4. Finking R, Marahiel MA (2004) Biosynthesis of nonribosomal peptides. Annual review of microbiology 58: 453–488.

5. Duerfahrt T, Doekel S, Sonke T, Quaedflieg PJLM, Marahiel MA (2003) Construction of hybrid peptide synthetases for the production of alpha-l-aspartyl-l-phenylalanine, a precursor for the high-intensity sweetener aspartame. Eur J Biochem 270: 4555–4563.

6. Duerfahrt T, Eppelmann K, Müller R, Marahiel MA (2004) Rational design of a bimodular model system for the investigation of heterocyclization in nonribosomal peptide biosynthesis. Chem Biol 11: 261–271.

7. Ansari MZ, Yadav G, Gokhale RS, Mohanty D (2004) NRPS-PKS: a knowledge-based resource for analysis of NRPS/PKS megasynthases. Nucleic acids research 32: W405–13.

8. Caboche S, Pupin M, Leclère V, Fontaine A, Jacques P, et al. (2008) NORINE: a database of nonribosomal peptides. Nucleic acids research 36: D326–31.

9. Bachmann BO, Ravel J (2009) Chapter 8. Methods for in silico prediction of microbial polyketide and nonribosomal peptide biosynthetic pathways from DNA sequence data. 1st ed. Elsevier Inc.

10. Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, et al. (2011) antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic acids research 39: W339–W346.

11. Blin K, Medema MH, Kazempour D, Fischbach MA, Breitling R, et al. (2013) antiSMASH 2.0: a versatile platform for genome mining of secondary metabolite producers. Nucleic acids research 41: W204–12.

12. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, et al. (2011) Open Babel: An open chemical toolbox. J Cheminform 3: 33.

13. Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic acids research 39: W29–W37.

14. Linne U, Marahiel MA (2000) Control of directionality in nonribosomal peptide synthesis: role of the condensation domain in preventing misinitiation and timing of epimerization. Biochemistry 39: 10439–10447.

15. Linne U, Doekel S, Marahiel MA (2001) Portability of epimerization domain and role of peptidyl carrier protein on epimerization activity in nonribosomal peptide synthetases. Biochemistry 40: 15824–15834.

16. Hillson NJ, Rosengarten RD, Keasling JD (2012) j5 DNA assembly design automation software. ACS synthetic biology 1: 14–21.

17. Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Res 40: D136–D143.

18. Fang H, Oates ME, Pethica RB, Greenwood JM, Sardar AJ, et al. (2013) A daily-updated tree of (sequenced) life as a reference for genome research. Sci Rep 3: 2015.

19. Dijkstra EW (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1: 269.

20. Boratyn GM, Camacho C, Cooper PS, Coulouris G, Fong A, et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res.

21. Sayers E (2008) E-utilities Quick Start. National Center for Biotechnology Information.

22. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7: 539.

23. Lassmann T, Sonnhammer EL (2006) Kalign, Kalignvu and Mumsa: web servers for multiple sequence alignment. Nucleic acids research 34: W596–W599.

24. Marahiel MA (2009) Working outside the protein-synthesis rules: insights into non-ribosomal peptide synthesis. Journal of peptide science 15: 799–807.

25. Smolke CD, Silver PA (2011) Informing biological design by integration of systems and synthetic biology. Cell 144: 855–859.

26. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic acids research 31: 3406–3415.

Thanks to