Description

The aim is to merge and integrate large, publicly available environmental datasets. Measurement stations in different databases are typically not located at identical positions, so stations have to be linked based on geographical distance. In addition, some databases contain missing values and measurement artefacts; to deal with these, the affected data can be aggregated or deleted. Furthermore, some measurements are available on a regular grid in the plane, while others are recorded at specific coordinates.
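
As a minimal sketch of the distance-based linkage described above, the following base R code links each station of one database to its nearest neighbour in another, using the great-circle (haversine) distance. The station tables, the helper names haversine_km and link_nearest, and the 25 km cut-off are hypothetical and only serve as an illustration; for large databases the full distance matrix would likely have to be replaced by a more efficient search.

```r
# Great-circle (haversine) distance in kilometres between points given in
# decimal degrees; a spherical Earth with radius 6371 km is assumed.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical station tables from two different databases
stations_a <- data.frame(
  id  = c("A1", "A2", "A3"),
  lon = c(16.37, 13.04, 11.40),
  lat = c(48.21, 47.81, 47.27),
  stringsAsFactors = FALSE
)
stations_b <- data.frame(
  id  = c("B1", "B2"),
  lon = c(16.40, 11.35),
  lat = c(48.25, 47.26),
  stringsAsFactors = FALSE
)

# Link every station in `a` to the nearest station in `b`;
# pairs further apart than `max_km` (an assumed cut-off) stay unlinked.
link_nearest <- function(a, b, max_km = 25) {
  # distance matrix: rows = stations in a, columns = stations in b
  d <- outer(
    seq_len(nrow(a)), seq_len(nrow(b)),
    function(i, j) haversine_km(a$lon[i], a$lat[i], b$lon[j], b$lat[j])
  )
  nearest <- apply(d, 1, which.min)
  dist_km <- d[cbind(seq_len(nrow(a)), nearest)]
  matched <- dist_km <= max_km
  data.frame(
    id_a    = a$id,
    id_b    = ifelse(matched, b$id[nearest], NA_character_),
    dist_km = ifelse(matched, dist_km, NA_real_),
    stringsAsFactors = FALSE
  )
}

links <- link_nearest(stations_a, stations_b)
print(links)
```

In this toy example, station A2 has no partner within the cut-off and therefore remains unlinked. A suitable cut-off depends on the variable being measured and is a choice for the domain experts.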

The challenges are to adapt and write efficient code (e.g. using the Rcpp package and/or the R package data.table) for data manipulation and linkage. The packages rgdal, maptools and sp provide useful tools for dealing with such problems. The data should be accessed automatically and linked afterwards. Web scraping tools such as rvest and RCurl may additionally be used to access data that are not provided in standard formats. Some methods for accessing such databases already exist, for example
http://www.r-bloggers.com/harvesting-canadian-climate-data/

or packages that fetch data from the web, such as the package weatherData:
http://cran.r-project.org/web/packages/weatherData/index.html

or the package wux: http://cran.r-project.org/web/packages/wux/index.html

Another challenge is to handle data on different spatial and temporal scales. Tools for converting between coordinate reference systems are available (for example in the rgdal R package), but they still have to be applied so that data given in different reference systems are harmonized. Additionally, solutions for harmonizing the different time scales of the data (hourly, daily, monthly, yearly) should be based on expert knowledge provided by the mentors.
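
As a minimal sketch of temporal harmonization, the following base R code aggregates a daily series with missing values to monthly means and records how many valid days each month contains. The synthetic data, the column names and the 80% coverage threshold are assumptions made for illustration only; the actual aggregation rules would have to come from the mentors' expert knowledge.

```r
# Hypothetical daily temperature series for three months (synthetic values)
daily <- data.frame(
  date = seq(as.Date("2015-01-01"), as.Date("2015-03-31"), by = "day")
)
daily$temp <- 5 + 0.1 * seq_len(nrow(daily))
daily$temp[c(10, 45)] <- NA          # two artificially missing days

# Month key used to group the daily observations
daily$month <- format(daily$date, "%Y-%m")

# Monthly mean over the valid days, plus counts of valid and total days
mean_by_month  <- tapply(daily$temp, daily$month, mean, na.rm = TRUE)
valid_by_month <- tapply(!is.na(daily$temp), daily$month, sum)
days_by_month  <- tapply(daily$temp, daily$month, length)

monthly <- data.frame(
  month     = names(mean_by_month),
  mean_temp = as.numeric(mean_by_month),
  n_valid   = as.integer(valid_by_month),
  n_days    = as.integer(days_by_month),
  stringsAsFactors = FALSE
)

# Assumed rule: discard monthly means based on too few valid days
min_coverage <- 0.8                  # placeholder threshold
too_sparse <- monthly$n_valid / monthly$n_days < min_coverage
monthly$mean_temp[too_sparse] <- NA

print(monthly)
```

For large datasets the same grouping could be expressed with data.table, which the project mentions as one option for efficient data manipulation.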

The following data sets are of interest from an environmental scientist's point of view:
-  The European Earth observation programme Copernicus:
http://www.copernicus.eu/pages-principales/services/
https://sentinel.esa.int/web/sentinel/sentinel-data-access
-  Daily data from the European Climate Assessment & Dataset:
http://www.ecad.eu/dailydata/predefinedseries.php
-  European data centres for air pollution, biodiversity, climate change, land use and water:
http://www.eea.europa.eu/data-and-maps/european-data-centres
-  European soil data centre: http://esdac.jrc.ec.europa.eu
-  European forest data centre: http://forest.jrc.ec.europa.eu/efdac/
-  European Pollen database: http://www.europeanpollendatabase.net/data/downloads/
-  NASA ozone and air quality data: https://ozoneaq.gsfc.nasa.gov/data/ozone/
-  Environmental stratification:
http://www.wageningenur.nl/en/Expertise-Services/Research-Institutes/alterra/Projects/EBONE-2/Products/European-Environmental-Stratification.htm

Benefit for the Student

The student will gain in-depth experience in handling, linking and geocoding large data sets in R. The student will work together with experts in the field of environmental sciences and with researchers from biology and statistics.

Benefit for the Project

Integrating more information into the existing databases will allow better prediction of environmental changes and better analysis of spatial and temporal changes in climate. Richer databases are essential for more complex environmental modelling.

Requirements

Intermediate to advanced knowledge of R. Good knowledge of C++. Good knowledge of SQL and relational databases.

Mentors

Matthias Templ, Peter Filzmoser

More information

http://www.r-project.org
http://cran.r-project.org/web/packages/rgdal/index.html
http://cran.r-project.org/web/packages/data.table/index.html