Team:USTC-Software/Project/Method

From 2013.igem.org

(Difference between revisions)
Line 65: Line 65:
<div id="jobs_container">
<div id="jobs_container">
        <div class="jobs_trigger"><strong>Fetch Database Abstract</strong></div>
        <div class="jobs_trigger"><strong>Fetch Database Abstract</strong></div>
-
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an objects’ array to store the complete information of each gene. It contains regulation relationships between genes, sequences of genes, sequences of promoters and so on. However, it’s hard to find an appropriate database online containing all information we need in a simple file. RegulonDB has downloadable files about the regulation between transcription factors (TF) and genes. Files about genetic information, transcription unit information and promoter information can also be downloaded from the RegulonDB. All those files have been put into file “source data” in the root directory of our software. They contain all information the simulation needs and we use fetching module to achieve data extraction and integration. There are four steps: fetch regulation relationships, fetch gene information, fetch promoter information and integrate information above.
+
<div class="jobs_item" style="display: none;"><p class="bodytext"></p><p align="justify">To simulate and analyze a genetic regulatory network (GRN), we need to build an array of objects that stores the complete information for each gene: the regulatory relationships between genes, the gene sequences, the promoter sequences and so on. However, it is hard to find an online database that provides all of this information in a single file. RegulonDB offers downloadable files describing the regulation between transcription factors (TFs) and genes, as well as files with gene information, transcription unit information and promoter information. All of these files have been placed under “source data” in the root directory of our software. Together they contain everything the simulation needs, and our fetching module extracts and integrates the data in four steps: fetch the regulation relationships, fetch the gene information, fetch the promoter information, and integrate the results.
</p>
</p>
                 </div>
                 </div>
Line 74: Line 74:
First, the software reads the TF regulation relationships. It skips the RegulonDB documentation at the head of each file and then reads, one record at a time, the names of the regulating TF and the regulated TF (a TF shares its name with its gene). At the same time, the software numbers the genes and stores their names in the array of gene data objects. </br>
The format of regulation database:</br>
-
TF_name   TF_name    +/-/+-</br>
+
TF_name &nbsp&nbsp&nbspTF_name &nbsp&nbsp&nbsp+/-/+-</br></br>
The TF-to-TF regulations are stored in a square matrix whose rows are the regulators and whose columns are the regulated units. To make our GRN as complete as possible, the regulations between TFs and genes are added to the same matrix. Because these interactions are one-way (a TF regulates a gene, not the reverse), the software must read all TFs first to fix the set of regulators, and then reads the TF-to-gene regulations in the same way as the TF-to-TF regulations. The format of the TF-to-gene regulation database:</br>
-
TF_name   Gene_name    +/-/+-</br>
+
TF_name &nbsp&nbsp&nbspGene_name &nbsp&nbsp&nbsp+/-/+-</br></br>
Finally, a regulatory matrix whose rows represent the regulating genes (TFs) and whose columns represent the regulated genes (TFs and genes) is written to a file called “old_GRN” in the root directory. Each value in the GRN matrix encodes a regulation: “1” means activation, “-1” means repression and “0” means no relationship. Some regulations have been identified as both positive and negative, with the sign depending on the experimental environment. Our software therefore picks out these uncertain regulations and stores them in a file named “uncertain_database”.</br>
The format of uncertain database:</br>
-
?   Gene_name->Gene_name</br>
+
? &nbsp&nbsp&nbspGene_name->Gene_name</br></br>
The question mark stands for the unknown regulation between the regulator and the regulated gene whose names follow it. Users can replace the question mark with a sign known from past experiments (“+” for positive, “-” for negative), and our software will then update the corresponding values in the matrix automatically. If an entry is not rewritten, our software treats that regulation as unknown.
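The parsing step described above can be sketched in Python as follows. This is only an illustration, not the software's own code: the function name `build_grn`, the `#` prefix for header documentation lines, and the choice to leave dual (“+-”) entries at 0 until the user resolves them are all assumptions.

```python
def build_grn(lines):
    """Build a signed regulatory matrix from 'regulator target sign' records."""
    sign_value = {"+": 1, "-": -1}
    records = []
    for line in lines:
        line = line.strip()
        # Assumption: header documentation lines start with '#'.
        if not line or line.startswith("#"):
            continue
        regulator, target, sign = line.split()[:3]
        records.append((regulator, target, sign))

    # Rows are regulators (TFs); columns list the TFs first, then other genes.
    regulators = list(dict.fromkeys(r for r, _, _ in records))
    columns = list(dict.fromkeys(regulators + [t for _, t, _ in records]))
    row_of = {name: i for i, name in enumerate(regulators)}
    col_of = {name: j for j, name in enumerate(columns)}

    matrix = [[0] * len(columns) for _ in regulators]
    uncertain = []
    for regulator, target, sign in records:
        if sign in sign_value:
            matrix[row_of[regulator]][col_of[target]] = sign_value[sign]
        else:
            # Assumption: a '+-' (dual) regulation stays 0 and is reported
            # separately in the uncertain_database format.
            uncertain.append(f"?\t{regulator}->{target}")
    return regulators, columns, matrix, uncertain
```

Calling `build_grn` on the lines of the TF-to-TF file followed by the lines of the TF-to-gene file yields the row names, the column names, the matrix and the list of uncertain entries in one pass.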
Line 91: Line 91:
All gene information has been deposited in a file named gene_info, which can be downloaded here[]. To pick out the genes in the GRN as quickly as possible, all gene information is stored in a “map”. A “map” works like a dictionary in which the words are gene names and the definitions are the gene information. Because the map is searched as a binary tree, looking up a “word” in the “dictionary” is very fast: in our test, lookup with the built-in “map” was 720 times faster than traversing the records one by one.</br>
The format of Gene Info database:</br>
-
ID_assigned_by_RegulonDB   Gene_name    Left_end_position    Right_end_position    DNA_strand    Product_type    Product_name    Start_codon_sequence    Stop_codon_sequence   Gene_sequence</br>
+
ID_assigned_by_RegulonDB &nbsp&nbsp&nbspGene_name &nbsp&nbsp&nbspLeft_end_position &nbsp&nbsp&nbspRight_end_position &nbsp&nbsp&nbspDNA_strand &nbsp&nbsp&nbspProduct_type &nbsp&nbsp&nbspProduct_name &nbsp&nbsp&nbspStart_codon_sequence&nbsp&nbsp&nbsp  Stop_codon_sequence &nbsp&nbsp&nbspGene_sequence</br></br>
The key of the map is the gene name, and the names looked up are those read into the regulation matrix earlier. The binary tree search quickly finds the information for a given gene, which is then stored in that gene's object. This information includes the gene ID, left position, right position, gene description and gene sequence. The gene ID is used to link to the gene's detail page in RegulonDB; the left position is used to find the gene's transcription unit; the right position is used to work out the number of bases; the description is used to distinguish RNA products from proteins; and the sequence is used to predict regulation by alignment.
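A rough Python sketch of this lookup is given below. Python has no built-in tree map, so the sketch swaps in a sorted list searched with `bisect`, which gives the same logarithmic lookup as a binary tree. The names `FIELDS`, `parse_gene_line` and `GeneIndex` are hypothetical, and the tab-separated layout is assumed from the Gene Info format line above.

```python
import bisect

# Column order assumed from the Gene Info format line.
FIELDS = ("gene_id", "name", "left", "right", "strand",
          "product_type", "product_name", "start_codon", "stop_codon",
          "sequence")


def parse_gene_line(line):
    """Turn one tab-separated Gene Info record into a dictionary."""
    record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
    record["left"] = int(record["left"])
    record["right"] = int(record["right"])
    return record


class GeneIndex:
    """Gene records sorted by name, searched by binary search."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda rec: rec["name"])
        self._names = [rec["name"] for rec in self._records]

    def find(self, name):
        """Return the record for `name`, or None if it is absent."""
        pos = bisect.bisect_left(self._names, name)
        if pos < len(self._names) and self._names[pos] == name:
            return self._records[pos]
        return None
```

Each lookup compares against about log2(n) names instead of scanning all n records, which is where the speed-up over a plain traversal comes from.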
Line 103: Line 103:
        <div class="jobs_item" style="display: none;"><p align="justify">All promoter information has been deposited in a file named promoter_info, which can be downloaded here[]. We also need transcription unit (TU) information, because the promoter files do not list the names of all the genes downstream of each promoter. The “TU Info” file, which can be downloaded here[], contains the starting position of each TU and the name of its promoter. Our software reads the starting positions into an integer array. Using the left position taken from the gene info, it finds the unit each gene belongs to by binary search (the dichotomy method) and then stores the promoter name in the corresponding object.</br>
The format of TU info database:</br>
-
Operon_name   Unit_name    promoter_name    Transcription_start_site ......</br>
+
Operon_name &nbsp&nbsp&nbspUnit_name &nbsp&nbsp&nbsppromoter_name &nbsp&nbsp&nbspTranscription_start_site ......</br></br>
Promoter information is fetched on the same principle as gene information. Our software stores the contents of the “promoter_info” file in a “map”, from which the promoter sequence can be retrieved by searching for the promoter name with the binary tree method.</br>
The format of Promoter Info database:</br>
-
Promoter_ID_assigned_by_RegulonDB   Promoter_name</br>
+
Promoter_ID_assigned_by_RegulonDB &nbsp&nbsp&nbspPromoter_name</br></br>
The promoter sequence is used by the alignment method in the next module, which predicts the regulation pattern of exogenous genes.
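The dichotomy (binary search) step that assigns a gene to its transcription unit could look like the Python sketch below. The function name `find_promoter` is hypothetical, and the rule that a gene belongs to the unit with the greatest starting position not exceeding the gene's left position is an assumption. The text only states that a binary search on the starting positions is used.

```python
import bisect


def find_promoter(tu_starts, tu_promoters, gene_left):
    """Return the promoter of the TU assumed to contain the gene.

    tu_starts    -- TU starting positions, sorted in ascending order
    tu_promoters -- promoter names, aligned with tu_starts
    gene_left    -- left end position of the gene
    """
    # Index of the last TU whose start is <= gene_left.
    pos = bisect.bisect_right(tu_starts, gene_left) - 1
    if pos < 0:
        return None  # the gene lies before the first TU
    return tu_promoters[pos]
```

The promoter name returned here is the key that is then looked up in the promoter “map” to obtain the promoter sequence.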
Line 118: Line 118:
Our software integrates all the gene information it has picked out and writes it to a file named “all_info” (all information about genes), which the graphical output interface reads. Meanwhile, the array of objects containing all of this information is kept in memory, which greatly improves the computing speed of our software.
The format of all_info database:</br>
-
No.   promoter_sequence    gene_sequence    gene_name    ID    left_position    right_position    promoter_name    description</br>
+
No. &nbsp&nbsp&nbsppromoter_sequence &nbsp&nbsp&nbspgene_sequence &nbsp&nbsp&nbspgene_name &nbsp&nbsp&nbspID &nbsp&nbsp&nbspleft_position &nbsp&nbsp&nbspright_position &nbsp&nbsp&nbsppromoter_name &nbsp&nbsp&nbspdescription</br>
The fetching module generates three files: old_GRN, all_info and uncertain_database.</br>

Revision as of 02:31, 26 September 2013

Slide

Take a gNAP before wearing your gloves! Genetic Network Analyze and Predict
The sketch and final GUI of gNAP!
We compare the result of our software with gene expression profile in literature.
We are USTC-Software!

Methodologies

To simulate how the GRN works and to analyze how it changes after an exogenous gene is imported, the software employs several advanced algorithms and classical methods: the binary tree method, the Needleman-Wunsch algorithm, the decision tree method, the Hill equation and the Particle Swarm Optimization (PSO) algorithm.
The methodology has five parts: Fetch Database, Alignment Analyze, New Network Construction, Network Model and Predict.

Fetch Database

Fetch Database Abstract
Fetching Regulation
Fetching Gene Info
Fetching Promoter Info
Integration

Our software integrates all the gene information it has picked out and writes it to a file named “all_info” (all information about genes), which the graphical output interface reads. Meanwhile, the array of objects containing all of this information is kept in memory, which greatly improves the computing speed of our software. The format of the all_info database:
No.    promoter_sequence    gene_sequence    gene_name    ID    left_position    right_position    promoter_name    description
The fetching module generates three files: old_GRN, all_info and uncertain_database.

Alignment Analyze

An example
Models
Prediction Model
Mathematical Description of The Network
Sequence similarity

New Network Construction

Filter
Construct A New Regulated Vector
Construct A New Regulating Vector
A Supplementary Game: Test of The Model

The behavior similarity of two units can be described by the dot product of their regulated vectors or of their regulating vectors. A more intuitive measure is the angle between the two vectors. However, the gene regulatory network contains some zero vectors, which usually means that the unit acts only as a target or only as a regulator; the angle is undefined for such vectors.
[Pic. 4 GRN matrix, target vector, regulator vector and their dot product]
We tested the hypothesis by analyzing all 1748 regulation units of Escherichia coli K-12 recorded in RegulonDB. Pairwise comparison of these units yielded about 1.6 million sets of data, each consisting of a promoter sequence similarity, a protein coding sequence similarity and a behavior similarity. We hoped to find structure in the data that supports our hypothesis, and the data do show a trend relating sequence similarity to behavior similarity (Pic. 2).
[Pic. 2 Sequence similarity and behavior similarity]
Sequence similarity is plotted on the x axis and behavior similarity on the y axis. Sequence similarity is continuous-valued (from 0 to 1), whereas behavior similarity is discrete-valued: for vectors of dimension N it lies between -N and N. In the results, promoter sequence similarity falls mainly between 0.4 and 0.6, protein coding sequence similarity mainly between 0 and 0.7, and behavior similarity mainly between -3 and 5. As shown in Picture 4, high behavior similarity tends to occur with high sequence similarity. The peak behavior similarity, 17, appears where the sequence similarity is 0.537. When behavior similarity is held fixed, for example at 8, the dots become denser as the sequence similarity increases.
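The two behavior similarity measures discussed in this test can be written down in a few lines of Python. This is a minimal sketch with hypothetical function names. Vectors are assumed to hold the values 1, -1 and 0 from the GRN matrix, and returning `None` for a zero vector is one possible convention, not necessarily the one used by the software.

```python
import math


def behavior_similarity(u, v):
    """Dot product of two regulated (or two regulating) vectors."""
    return sum(a * b for a, b in zip(u, v))


def vector_angle(u, v):
    """Angle in radians between two vectors, or None for a zero vector."""
    norm_u = math.sqrt(behavior_similarity(u, u))
    norm_v = math.sqrt(behavior_similarity(v, v))
    if norm_u == 0 or norm_v == 0:
        return None  # the angle is undefined for a zero vector
    cosine = behavior_similarity(u, v) / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Because every entry is 1, -1 or 0, the dot product of two vectors of dimension N is an integer between -N and N, which is why the behavior similarity axis is discrete.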

Network Model

Network Model Abstract

Network analysis consists of finding the stable condition of the network, adding a new gene, finding the new stable condition, and tracking the changes from the original condition to the new one. We describe the network condition by the densities of its materials. If all material densities are time-invariant, the network condition is stable.
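The stability criterion above (all densities time-invariant) can be illustrated with a small Python sketch. Everything here is assumed for illustration: the forward Euler integration, the tolerance, and the two-species toy system in `toy_deriv`. Only the Hill term itself is the standard form x^n / (K^n + x^n). The page does not give the software's actual rate equations or solver.

```python
def hill(x, k, n):
    """Standard Hill activation term x^n / (k^n + x^n)."""
    return x ** n / (k ** n + x ** n)


def find_stable(deriv, state, dt=0.1, tol=1e-6, max_steps=100000):
    """Integrate with forward Euler until every rate of change is below tol.

    Returns (state, converged).
    """
    state = list(state)
    for _ in range(max_steps):
        rates = deriv(state)
        if max(abs(r) for r in rates) < tol:
            return state, True  # densities are (numerically) time-invariant
        state = [s + dt * r for s, r in zip(state, rates)]
    return state, False


def toy_deriv(state, beta=2.0, k=1.0, n=2, gamma=0.5):
    """Hypothetical system: a constant regulator activates one target."""
    regulator, target = state
    production = beta * hill(regulator, k, n)
    return [0.0, production - gamma * target]
```

For the toy system, the target density settles where production equals degradation, that is at beta * hill(regulator) / gamma.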

Hill Equations
Find Stable Network Condition
Find Changes From Original Stable Condition to New Condition

Predict

Predict Abstract

In some cases, an exogenous gene is imported in order to enhance or suppress the expression of specific genes in the engineered bacterium itself, but it is hard to choose an appropriate regulatory gene. Our software analyzes the GRN forward and also simulates backward with an optimization algorithm to give users a reference for this choice. It considers not only the direct regulation but also the global GRN, and this global prediction makes it possible to control the expression of multiple genes in the network at the same time. The backward search is carried out with the Particle Swarm Optimization (PSO) algorithm.
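For readers unfamiliar with PSO, a generic textbook version is sketched below in Python. It is not the software's implementation: the function name `pso`, the inertia and acceleration coefficients, and the toy objective (squared distance between candidate and target expression levels) are all assumptions chosen for illustration.

```python
import random


def pso(objective, bounds, n_particles=20, n_iter=200,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the box `bounds` with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Hypothetical target: desired expression levels of two genes.
TARGET = [0.3, 0.8]


def expression_error(candidate):
    """Squared distance between candidate and target expression levels."""
    return sum((c - t) ** 2 for c, t in zip(candidate, TARGET))
```

A call such as `pso(expression_error, [(0.0, 1.0), (0.0, 1.0)])` returns the best candidate found and its error. In a backward prediction, the objective would instead compare the simulated network's expression with the user's input target.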

Input Target
Particle Swarm Optimization
Filter