Wiki Scraping

We rewrote the pipeline so that it collects and writes the data for every team correctly. This made clean script termination and output file specification via the command line easier to implement. The track, which had been missing until now, is included in the scraping process.
Furthermore, several bugs were fixed. We had encountered problems with whitespace, especially the duplication of teams whose names contain a blank space. We also ensured the proper construction of all spider objects.
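The whitespace fix can be sketched as follows. This is a minimal illustration, not the actual pipeline code; the function names and the normalization rule (collapse runs of whitespace, strip the ends) are assumptions:

```python
def normalize_team_name(name):
    """Collapse internal whitespace and strip the ends so that
    'TU  Munich' and 'TU Munich ' map to the same key."""
    return " ".join(name.split())


def deduplicate_teams(names):
    """Keep only the first occurrence of each normalized team name."""
    seen = set()
    unique = []
    for name in names:
        key = normalize_team_name(name)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this in place, team names that differ only in spacing no longer produce duplicate entries.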

Data conversion to R

The JSON file needs to be converted to R-compatible data for the analysis. The target file contains one data frame for all single-value parameters and one list for all multi-value parameters and large text contents. Unique naming of the teams was achieved by combining name and year. A full list of the converted parameters and contents is given in Table 4.1.

Table 4.1: Parameters generated so far.
Data frame parameters:
  • Numerical:
    • year
    • students count
    • advisors count
    • instructors count
    • regional awards count
    • championship awards count
    • biobrick count
  • Character strings:
    • region
    • wiki
    • url (team overview page)

List elements:
  • year
  • character vector of regional awards
  • character vector of championship awards
  • parts range
  • advisor names
  • project title
  • abstract
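The split into a single-value data frame and a multi-value list, keyed by name plus year, can be sketched as below. The field names and the example record are assumptions for illustration, not the actual scraper schema:

```python
import json

# Hypothetical team record as produced by the scraper.
teams_json = json.dumps([
    {"name": "Heidelberg", "year": 2014, "students_count": 12,
     "region": "Europe", "regional_awards": ["Best Wiki"],
     "abstract": "Project abstract ..."},
])

# Single-value parameters go into the data frame; multi-value
# parameters and large text contents go into the list.
single_value = ("year", "students_count", "region")
multi_value = ("regional_awards", "abstract")

frame_rows = {}      # one row per team, data-frame-like
list_elements = {}   # per-team list of multi-value contents

for team in json.loads(teams_json):
    # Combine name and year to obtain a unique team identifier.
    key = f"{team['name']}_{team['year']}"
    frame_rows[key] = {p: team.get(p) for p in single_value}
    list_elements[key] = {p: team.get(p) for p in multi_value}
```

On the R side, the same structure corresponds to one data frame row per unique `name_year` key and one named list entry per team.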