Team:Heidelberg/Tempaltes/iGEM42-W-4

== Wiki Scraping ==

We wrote a pipeline that collects and prints the data for every team properly. This restructuring also made it easier to implement clean script exiting and to specify the output file via the command line. The track, which had been missing until now, was included in the scraping process.
Furthermore, several bugs were fixed. We had encountered problems with whitespace handling, especially the duplication of teams whose names contain a blank space. We also made sure that all spider objects are constructed properly.
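
As an illustration only, the following minimal sketch shows how a Scrapy spider along these lines could handle the points above: whitespace in team names is normalised before teams are stored (so a name containing an extra blank space no longer produces a duplicate entry), the track is scraped alongside the other fields, and the output file is chosen on the command line instead of being hard-coded. The start URL and CSS selectors are assumptions for the sketch, not the selectors used by our pipeline.

<pre>
import scrapy


class TeamSpider(scrapy.Spider):
    # Sketch of a team-list spider; not the pipeline's actual code.
    name = "igem_teams"
    # hypothetical team-list URL, used here only for illustration
    start_urls = ["https://igem.org/Team_List?year=2013"]

    def parse(self, response):
        for row in response.css("table tr"):  # hypothetical selector
            raw_name = row.css("td.team a::text").get()
            if raw_name is None:
                continue
            # collapse runs of whitespace so the same team is not
            # duplicated when its name contains an extra blank space
            team = " ".join(raw_name.split())
            yield {
                "team": team,
                "track": row.css("td.track::text").get(default="").strip(),
                "region": row.css("td.region::text").get(default="").strip(),
            }
</pre>

Run as, for example, <code>scrapy crawl igem_teams -o teams.json</code>, so that the output file is specified on the command line rather than inside the script.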

== Data conversion to R ==

The JSON file needs to be converted into R-compatible data for the analysis. The target file contains one data frame for all single-value parameters and one list for all multiple-value parameters and free-text contents. Unique naming of the teams is achieved by combining team name and year. A full list of the converted parameters and contents is given in Table 4.1.
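
The conversion itself targets R data structures, but as a rough illustration of the same split the sketch below shows it in Python: single-value parameters go into one flat table (one row per team, readable in R with read.csv()), multiple-value and free-text parameters stay nested in a JSON file (readable with, for example, jsonlite::fromJSON()), and each team is keyed by the combination of name and year. The field names are assumptions about the scraper output, not the exact keys in our JSON file.

<pre>
import csv
import json

# assumed field names, for illustration only
SINGLE_VALUE = ["year", "students_count", "advisors_count", "instructors_count",
                "regional_awards_count", "championship_awards_count",
                "biobrick_count", "region", "wiki", "url"]
MULTI_VALUE = ["regional_awards", "championship_awards", "parts_range",
               "advisor_names", "project_title", "abstract"]

with open("teams.json") as fh:
    teams = json.load(fh)

rows, nested = [], {}
for team in teams:
    # unique key: team name combined with year, since names alone can repeat
    key = "{}_{}".format(team["team"], team["year"])
    rows.append({"id": key, **{p: team.get(p) for p in SINGLE_VALUE}})
    # multiple-value / free-text parameters are kept nested per team
    nested[key] = {p: team.get(p) for p in MULTI_VALUE}

# one flat table for the data frame ...
with open("teams_single.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["id"] + SINGLE_VALUE)
    writer.writeheader()
    writer.writerows(rows)

# ... and one nested file for the list
with open("teams_multi.json", "w") as fh:
    json.dump(nested, fh, indent=2)
</pre>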

{| class="wikitable"
|+ Table 4.1: Parameters generated so far.
|-
! Data frame parameters !! List elements
|-
|
* Numerical:
** year
** students count
** advisors count
** instructors count
** regional awards count
** championship awards count
** biobrick count
* Character strings:
** region
** wiki
** url (Team overview page)
|
* year
* character vector of regional awards
* character vector of championship awards
* parts range
* advisor names
* project title
* abstract
|}