Team:Heidelberg/Notebook/iGEM42
From 2013.igem.org
iGEM 42. The answer to everything.
Have you ever easily found the piece of information you wanted from past iGEM competitions? Right! That turned out to be very hard.
Maybe you should try this tool which was developed to do exactly this: Lead you directly to the information you want.
Reimplementation of scraping
As the scraping scripts we started out with had a rather high run time and were organised pretty complicated, the whole thing was rewritten. The spiders use the central team (https://igem.org/Team_List?year=%d) and results (https://igem.org/Results?year=%d) page per year to go through all the teams at once instead of opening, editing and closing the JSON results file for every team. Within each team page HtmlXPathSelectors are used to generate the single values. As this of course depending on the exact same page structure, nothing but the two central pages can be easily scraped using these detailed selectors.
Wiki Scraping
We wrote a pipeline to collect and print the data for every team properly. Thus proper script exiting and output file specification via the command line were easier to implement. The track, which was missing until now was included in the scraping process.
Furthermore several bugs were fixed. We had encountered problems regarding whitespaces, espacially the dublication of teams when their name contains a blank space. We also took care of the proper construction of all the spider objects.
Data conversion to R
The JSON file needs to be converted to R compatible data for the analysis. The target file contains one dataframe for all single value parameters and one list for all multiple value parameters / gib text contents. Here unique naming of the teams was achieved by combination of name and year. A full liste of which parameters and contents were converted is displayed in table 4.1.
Data frame parameters | List elements |
---|---|
|
|
First shiny app implemented
A first shiny app that can display the numerical data in a scatterplot was created. Besides shiny it also uses the RCharts package for more aesthetic graphs.
Different concepts for shiny files
Since there are two different concepts of how to implement the page surrounding a shiny app, we tried out both.
See week 20 for the final solution we chose for building our apps.
R builder functions for html
Files needed:
- ./server.R
- ./ui.R
This solution turned out to be very complicated due to the combination of directly accessible html tags, those which need the tags() function and styling rules within the ui.R. Thus using the other concept seems more intuitive.
Separated html code
File needed:
- ./server.R
- ./www/index.html
Here the app is built directly in html and css syntax as any other basic homepage. This on the other hand became complicated, when integrating the server output in the app.
Second app implemented
The second shiny app we impelmented was the timeline app. It counts teams, awards or students within each year and puts the numbers in a bar plot. For the plot we again use RCharts but this time with nvd3. In order to prevent errors we had to add empty values to the data since nvd3 can't handle incomplete datasets.
First hompage draft
A simple draft of the homepage was implemented using basic html and css tags.
This page is supposed to either become integrated in our wiki or to be a stand alone homepage.
Layout
page wrapper
Page Header
APP
iframe
App integration
In order to integrate the apps, the central iframe was used as target for the links in the menu bar. The link would then for example look like this:
<a href="apps/factsfigures/" target="center-frame">Facts & figures</a>
== Text analysis drafts ==
For the text analysis we used python and it's nltk platform (see [http://nltk.org]). The only text corpus used was the stopword corpus.
Independent of the analysis done a "stemmer" was run on the abstracts, which reduces all words to their very basic form. For the topwords and information content calculation simple counting was performed and for the information content the proportion of the stopwords corpus in the whole was determined.
For the extraction of the meshterms a list of terms in synthetic biology from Paul Oldahm and his colleagues (Oldham, P., Hall, S., & Burton, G. (2012). Synthetic Biology: Mapping the Scientific Landscape. PLoS ONE, 7(4), e34368. doi:10.1371/journal.pone.0034368) was used. Here the stemmer was applied to both the abstract and the words list and the matches were again counted.
Creation of several design suggestions
During a meeting we decided to create three design proposals, so that the team could vote which one they like best. Out of the three one should be a modern design similar to other web pages, having a virtual third dimension. With regard to the Hitchhiker's Guide to the Galaxy we wanted one "spacy" design and lust but not least one alchemical / old fashioned design to fit our team logo.
Note: Since we finally used the same design as on our wiki, it's not worth showing the other design proposals or the introductory texts at this point. These can all be found on our wiki.
Draft of filtering options
The drafts for most of the filtering functions were implemented. Table 1 gives an overview of the different parameters and their basic filter design.
Parameter | Type | Options* | Status |
---|---|---|---|
Year | numeric | 2007-2012 | draft |
Region | string | levels | draft |
Track | string | levels | draft |
Students | numeric | 0, 5, 10, 15, 20, >20 | draft |
Advisors | numeric | 0, 2, 5, 10, 15, >15 | draft |
Instructors | numeric | 0, 2, 5, 10, 15, >15 | draft |
Biobricks | numeric | 0, 5, 10, 20, 50, 100, 200, >200 | draft |
Championship | character vector | levels | draft |
Regional | character vector | levels | draft |
Medals | string | bronze, silver, medal | missing |
Score | numeric | 0-100 (steps of 10) | draft |
Abstract | binary | provided/missing | missing |
* Levels means all possible values the parameter can have. |
The general code for filtering a numerical range parameter X using R looks like this:
FilterForX <- function(data) { if (input$FILX_min == ">max") data <- data[-which(data$X < max),] else if (input$FILX_max == ">max") data <- data[-which(data$X < input$FILX_min),] else data <- data[-which(data$X < input$FILX_min | data$X > input$X_max),] return(data) }
The general code for filtering a single string parameter Y using R looks like this:
FilterForY <- function(data) { matchY <- rep(0, times=length(data$Y)) for (i in 1:length(input$FILY)) { matchY[which(data$Y == input$FILY[i])] <- 1 } data <- data[-which(matchY == 0),] rm(matchY) return(data) }
For each filter the corresponding selection elements (max, min or multiple selection) are added underneath each other below the plot of the app. They are grouped as teams success and team composition.
Scraping of BioBrick count
There is an API provided for accessing the registry, but unfortunately it only allows for search of a single specified BioBrick id. The fastest, but still very slow way to connect a team with it's BioBricks is doing an API request for every id in the teams parts range. The only assumption we made in order to speed things up, is that the BioBricks were submitted in a continuous parts range. Thus the matching is aborted when the first BioBrick in the parts rang can't be found.
Implementation of Scoring
In order to solve the very sensitive problem of rating a team's success we started out with a subjective scoring by every one of our team members for the different awards and the medals. This native scoring turned out to be pretty consistent an thus we just calculated mean values and put the on an exponential scale in order to achieve a harsh separation of the highest top scoring teams and those who didn't get that far. For every team the score of the single awards was added up. As the awards rewarded differend every year, we normalised the summarized score of the teams in one year to a scale from 0 to 100%.
The whole analysis was added to the R-script converting the JSON file to the RData file.
Further text analysis and data conversion
In order to do the text analysis for the methods extraction we collected a raw list of methods, which is displayed in table 10.1. They were clustered in Preprocessing, Processing and Analysis. The script doing the analysis in python again stems both the abstract and the methods and matches them.
Preprocessing | ||
---|---|---|
Fusion Proteins | Primer Design | cloning |
preparation of DNA | Restriction Digestion | Insert preparation |
cell fractionation | cell counting | |
Processing | ||
DNA sequencing | PCR | DNA Microarray |
arrays | interaction chromatography | purification |
Gel extraction | Ligation | Transformation |
FRET | DNA extraction | patch clamp |
Analysis | ||
Northern Blot | Southern Blot | Western blotting |
Bioinformatics | ELISA | Chromatography |
flow cytometry | X-Ray-crystallography | NMR |
Electron microscopy | Molecular dynamics | coimmunoprecipitation |
Electrophoretic mobility shift assay | southwestern blotting | size determination |
gel electrophoresis | macromolecule blotting and probing | immuno assays |
phenotypic analysis | imaging | spectroscopy |
spectrometry |
The RData file was extended and now also includes the meshterms within the data-list as well as the information content and the track in the data-frame.
Implementation of Topics app started
For the topics and the methods app we had several ideas on how to structure them. Besides a timeline of the hottest topics or histogramm view of the topics distribution the most important feature we had in mind was the direct search for the meshterms we extracted.
Pseudocode for the corresponding filtering functions and a draft of the interface were written, but the workload in the other projects was too big to actually create a functional app.
Filters fixed
The drafts for most of the filtering functions had already been implemented, but ended up non-functional. This was due to data removal using empty vectors, which produces a severe error in R. Table 18.1 gives an overview of the different parameters and their basic filter design.
Parameter | Type | Options* | Status |
---|---|---|---|
Year | numeric | 2007-2012 | final |
Region | string | levels | final |
Track | string | levels | final |
Students | numeric | 0, 5, 10, 15, 20, >20 | final |
Advisors | numeric | 0, 2, 5, 10, 15, >15 | final |
Instructors | numeric | 0, 2, 5, 10, 15, >15 | final |
Biobricks | numeric | 0, 5, 10, 20, 50, 100, 200, >200 | final |
Championship | character vector | levels | working |
Regional | character vector | levels | working |
Medals | string | bronze, silver, medal | missing |
Score | numeric | 0-100 (steps of 10) | final |
Abstract | binary | provided/missing | missing |
* Levels means all possible values the parameter can have. |
The general code for filtering a numerical range parameter X using R after this update is displayed below (changes are displayed in red).
FilterForX <- function(data) {
if (input$FILX_min == ">max") data <- data[-which(data$X < max),]
else if (input$FILX_max == ">max" & input$FilX_min != "min") data <- data[-which(data$X < input$FILX_min),]
else if (input$FILX_max == ">max" & input$FilX_min == "min") return(data)
else data <- data[-which(data$X < input$FILX_min | data$X > input$X_max),]
return(data)
}
The general code for filtering a single string parameter Y using R now looks like this:
FilterForY <- function(data) {
matchY <- rep(0, times=length(data$Y))
for (i in 1:length(input$FILY)) {
matchY[which(data$Y == input$FILY[i])] <- 1
}
delete <- which(matchY == 0)
if (length(delete) != 0) data <- data[-delete,]
rm(delete)
rm(matchY)
return(data)
}
The code for matching the awards is slighly more complicated, since the awards are saved in the data-list and we want to reduce the data-frame. This is the pseudocode for the functions:
iterate through all awards to be included { (re-)set a vector for deleting those items from the data-list iterate through all the teams in the data-list { if the award matches one of the team awards: - expand the vector for the teams to keep with the name - expand the vector for the teams already matched } delete those teams from the data-list, that were already positive in order to save runtime } reduce data-frame according to the matching resultAll the filter functions and corresponding layout elements were put together in one file to easily copy and paste them to the different apps.
Incomplete data
During the testing of the filters it was found, that there was no best team in 2007. After further investigation we found out, that the 2007 awards were on a different page than it was from 2008 onwards. Thus another scraper had be written and integrated into the pipeline in order to cover this page, too. This took a bit longer than for the other results pages, because its structure is entirely different from the standard one.
Restyling filters including javascript and css
In order to have the whole apps in a compact layout, the filters were put into a notebook. To navigate and style this javascript and css had to be included. This works through the simple "includeHTML()"-function shiny contains.
The tab heading are implemented as a unordered list containing links. These links activate a javascript, which shifts between the containers displayed in the notebook. This works by first hiding all containers in that area and then showing the one that was specified as parameter, when calling the function from the link.
The style of the tabs and the notebook were adjusted to match the border color and radius of the bootstrap styled elements.
Adjustments to page design
To allow for proper presentation in meetings, the design of the test-homepage was adjusted to current wiki design.
Data completion
As the data was not yet completely converted to R, the script doing so was altered to also include the results from the methods, meshterms and topics extractions, as well as the medals awarded. Prior to this the data update concerning the 2007 awards had to be manually corrected, since the registered team names and those on the results page were not exactly the same in some cases. These were:
- Wisconsin
- MIT
- St. Petersburg
- Davidson Missouri
- Boston U
- Mississippi
- Berkeley
- Michigan
- Colombia Israel
Expansion of Filters
The last filters missing (medal and abstract), another filter for team names and one for the information content of the abstract were implemented. Additionally the selection ranges of some filters were adjusted. See table 23.1 for the final filter setups.
Parameter | Type | Options* | Status |
---|---|---|---|
Year | numeric | 2007-2012 | final |
Region | string | levels | final |
Track | string | levels | final |
Students | numeric | 0, 5, 10, 15, 20, >20 | final |
Advisors | numeric | 0, 2, 4, 6, 8, 10, 12, 14, >14 | final |
Instructors | numeric | 0, 2, 5, 10, 15, >15 | final |
Biobricks | numeric | 0, 5, 10, 20, 50, 100, 200, >200 | final |
Championship | character vector | levels | final |
Regional | character vector | levels | final |
Medals | string | levels | final |
Score | numeric | 0-100 (steps of 10) | final |
Abstract | binary | all/only provided | final |
Information content | numeric | 0, 0.4, 0.45, 0.5, 0.55, 0.6 | final |
Team name | string | variable entry | final |
* Levels means all possible values the parameter can have, which is either generated automatically or manually. |
Abstract
The filter for the abstract only checks, whether an abstract was submitted or not. See the pseudocode:
if user wants to see all teams return full data set else { iterate through all row names in the data-frame { if the abstract of the team "row name" in the data-list matches the string "-- No abstract provided yet --" save the name to a vector } delete all teams without abstract return the reduced data set }Since the information content is also related to the abstract, they are put together in one tab of the filter notebook. As it turned out that the information content always is in a very small range around 50%, the possible selections were chosen accordingly. The filter function is similar to those of any other numeric parameter, where the only difference is that all choices are numeric and thus the different cases of minimum and maximum selection can be skipped.
Medal
The medal filter is just the same as the filters for track or region, since it is also saved in the data-frame.
Team Name
For filtering the team name a text entry box was introduced, together with a javascript, that checks on the entered search term. A list of all team names was created and converted to a javascript array using R and its package RJSONIO. The javascript code then compares the entered term(s) with this array and changes the text color to either red (no exact match) or black (exact match). To allow for multiple name entries comma-serapated terms can be entered and are automatically separated by the javascript for color changing and by the filter function.
The pseudocode of the filter funciton looks like this:
split the input string at every comma iterate through all strings { find all teams having the string in their name and keep their ids } reduce and return dataset
Team table output
All the apps so far only displayed data, but with no option to see which teams contrbuted to each portion of the graphs. Thus a table output was implemented, that shows the teams' names, years and wiki links. To allow for better overview, one can select how many tams should be display (none, 5, 10, 20, 50, 100 or all) and in which order. This can be done either by the score, displaying the best teams at the top of the list, by year starting with most recent teams, or alphabetically. The data-frame displayed always represents the fully filtered data. Thus the user can narrow the number of teams down using the filters and can then go directly to the corresponding wiki pages.
Facts and Figures app
Since the type of data handled in the scatterplot and the timeline app is very similar, the two apps were put together into one app changing it's appearance depending on the chosen x-axis scale. For this a conditional panel was used to either display selection for the scatterplot y-axis parameters or the one for the timeline-summing parameters.
In order to have a continuous design and to be able to target both plot types to the same div the scatterplot was changed from rPlot to nPlot. This means it is now generated using nvd3 instead of the standard R-methods.
Additionally a selection for the categories was added. The data displayed can now be grouped by either region, track or medal awarded. The grouping per year is no longer needed, since the timeline gives the optimal time-resolution of the data.
Methods app
The goal of the methods app was to quickly lead the user to wikis providing high quality protocols for various methods. For this purpose the methods were clustered and now the interface was implemented. It contains two main drop down menus - one for the method cluster and one for the particular method. The options given in the second drop down menu are determined by the first one via conditional panels. There is only one responsive filter applied to the data, which is similar to that for matching the awards. The only differences are, that exact matches instead of regular expressions are detected in the data set and that no empty rows have to be deleted resulting data set. The result is displayed only through the table containing all teams matching the method, which can again be customized regarding the number and order of teams.
To give an overview over the most popular - or until now the most mentioned - methods a table displaying the top 10 methods was added to be constantly displayed in the application. Unfortunately this list only contains 9 elements using the current list of methods and thus the list has to be extended and the text analysis has to be optimised.
Topics app
In the topics app the main element is the entry box, where the user can enter a term to find, which works very similar to the entry box for team name filter. The text turns red when there is no exact match in the meshterms assigned to the teams and otherwise black. Those teams exactly matching the searched term are again displayed in a tabular format. The filter function currently is exactly the same as for the methods, but has to be altered, to also allow for typing mistakes, case unsensitivity and partial string matches. Here we also added a list of the top 10 topics.
Data update and completion
There were many automatic and manual corrections done on the data. These included synchronisation of track and award naming, as well as some double or missing awards.
Name synchronisation
The naming of the tracks and awards was not conserved over the years, but basically kept the value they had. One example is the extension of the 2007 Energy track to the Food or Energy track in 2008. The same applies for other compined tracks, but there were also minor differences, that needed fixing, as for example medals starting with upper or lower case letters. Most of these were solved using regular expressions, when converting the data from JSON to R. See table 23.2 for all synchronisations made.
Championship awards | |||
---|---|---|---|
Regular expression | Replacement | Covered occurances | |
Grand Prize | Grand Prize |
| |
(1st)|(First) Runner Up | 1st Runner Up |
| |
(2nd)|(Second) Runner Up | 2nd Runner Up |
| |
Environment | Best Environment Project |
| |
Energy | Best Food & Energy Project |
| |
Health | Best Health & Medicine Project |
| |
Foundational | Best Foundational Advance Project |
| |
New Application | Best New Application Project |
| |
Part, Natural | Best New BioBrick Part, Natural |
(differenciating between great teams on this level | |
Best Model | Best Model |
| |
Information Processing | Best Information Processing Project |
| |
Software Tool | Best Software |
(Best Software Tools will be one with the Best Software) | |
Presentation | Best Presentation |
| |
Regional awards | |||
Regular expression | Replacement | Covered occurances | |
Grand Prize | Grand Prize |
Regional prizes always end | |
Finalist | Regional Finalist | ||
Human Practices | Best Human Practices Advance | ||
Experimental Measurement | Best Experimental Measurement Approach | ||
Model | Best Model | ||
Device, Engineered | Best New BioBrick Device, Engineered | ||
Part, Natural | Best New BioBrick Part, Natural | ||
Standard | Best New Standard | ||
Poster | Best Poster | ||
Presentation | Best Presentation | ||
Wiki | Best Wiki | ||
Safety | Safety Commendation | ||
Medals, Regions, Tracks | |||
Regular expression | Replacement | Covered occurances | |
[Bb]ronze | Bronze | upper or lower case medals | |
[Ss]ilver | Silver | ||
[Gg]old | Gold | ||
America | America | All regions on american continents were put together. | |
US | America | ||
Canada | America | ||
Medic | Health & Medicine |
| |
Energy | Food & Energy |
| |
Foundational | Foundational Advance |
|
Updated scoring function
After the in depth curation of all the award names, the scoring function had to be updated, since some awards had never been included in the scoring due to their rare awarding and some were lost in the naming issues. After updating the scores list, the data conversion to RData was rerun and the scoring was briefly checked by reviewing the list of the "best" teams.
Manual data curation
For some reason the 2011 championship awards were entirely missing on the standard results page. Thus we had to go directly to the jamboree results page and add the right awards to every team in 2011. This was done directly in the JSON file an the RData-file was updated imediately.
Bug-fix: NA-values
The award filters match the full team list to retrieve the names of the teams to keep in the dataset. These are then taken from the already reduced data set, which produces empty rows in the data for those teams, that match the award filter and were already removed the data-frame by another filter prior to the award matching. The naming of those empty rows is "NA.number". Thus in order to remove these empty rows those row-names containing "NA" were removed. This was a really bad idea, because we spend lots of time trying to find out why our tool doesn't like the UNAM MEXICO teams. This bug was fixed by adding a dot to the regular expression and separately removing the first empty row, which would exactly match "NA".
Update of methods
The list of the methods was slightly updated and the clusters were removed, since they turned out to be not very intuitive. In order to find methods, that are actually mentioned in abstracts we went through those of teams who were successful in either the new application track, experimental measurement aproches and other practice associated directions. The updated list is displayed in table 22.1.
After rerunning the postprocessing of our data sets the top 10 methods list actually contained 10 methods of which many were newly added.
Old | New | Removed |
---|---|---|
Fusion Proteins, Primer Design, preparation of DNA, Restriction Digestion, Insert preparation, cell fractionation, cell counting, PCR, interaction chromatography, purification, Gel extraction, Ligation, Transformation, FRET, DNA extraction, patch clamp, Northern Blot, Southern Blot, Bioinformatics, ELISA, Chromatography, flow cytometry, X-Ray-crystallography, NMR, Electron microscopy, Molecular dynamics, coimmunoprecipitation, Electrophoretic mobility shift assay, size determination, gel electrophoresis, macromolecule blotting and probing, immuno assays, phenotypic analysis, imaging, spectroscopy, spectrometry | traditional cloning, Gibson, bacterial artificial chromosome, protein interaction, FISH, assembly line, Sequencing, Microarray, computer simulation, drug production, encapsulation, genome deletion, mutagenesis, next generation sequencing, image processing, kill switch, Western Blot, Mass spectrometry, southwestern blot, fluorescence microscopy | cloning, DNA sequencing, DNA Microarray, arrays, Western blotting, southwestern blotting |
Bug fixing
We noticed, that not all of the award filters gave the right number of teams. This was due to false unlisting in the JSON to RData conversion script. After fixing this bug and after removing the false runner up awards in 2011 all the awards were finally correct.
By converting the methods to lower case before matching them methods missing beforehand, as for example Gibson and FRET, could also be detected.
Design adjustments
In order to have a nice layout when integrating the apps into a page resembling our wiki, a text box gicing a short introduction to the corresponding app was put next to the major input elements. This top section is now optically separated from the results area and all apps fit nicely into the minimum width of the iframe, which is 900px.
Chart details
As most of our charts are only displaying discrete values we wanted to get rid of the extra decimal positions in the axis labeling. This was achieved by adding a short piece of javascript in the corresponding nvd3 plot parameters. Since the number of tick marks was not reduced by the rounding of the numbers this produced repetitive tick marks if only small values were displayed. To avoid this the height and the position of the plot were changed depending on the number of ticks to be displayed. As the final detail browser compatibility tags were added to the css and axis labels were added to the plots.