Data privacy has become a prominent topic in society as well as in research. When data are made available to the public or to researchers, data privacy measures should ensure that individuals cannot be identified.
The state-of-the-art software tools for anonymizing large and complex data sets are open source: the R package sdcMicro for data anonymization and the R package simPop for simulating synthetic confidential data.
These software tools are used by various national and international institutions. The aim of this project is to raise the code quality of these anonymization packages. To this end, integrated tests should be implemented so that changes to one part of the code do not introduce bugs elsewhere. The integrated test battery should automatically test every modular part of the code.
Benefit for the Student
The student will gain thorough knowledge of the statistical environment and programming language R and will learn statistical methods for data privacy. In addition, the student will use modern testing and continuous integration tools such as the R package testthat and Travis CI (https://travis-ci.org/).
Benefit for the Project
The code base and the code complexity of the state-of-the-art packages sdcMicro and simPop have grown significantly over the last years. Both packages are implemented in an object-oriented manner, and every part of the code can potentially influence other parts. Changing parts of the code may therefore introduce bugs that are easily overlooked. An integrated test battery of independent tests, each covering one part of the code, will ensure that such bugs are identified immediately. In addition, writing the tests will naturally reveal ideas and possibilities for splitting the code into modular parts.
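To illustrate what one independent test in such a battery might look like, the following is a minimal sketch using testthat. The helper functions key_frequencies and is_k_anonymous and the toy data set are hypothetical and written only for this example; they are not part of sdcMicro or simPop. A real test would call the package's own functions instead.

```r
library(testthat)

# Hypothetical helper (not part of sdcMicro): for each record, count how many
# records share the same combination of key variables.
key_frequencies <- function(data, key_vars) {
  # Assumes the key values themselves do not contain the "|" separator.
  keys <- do.call(paste, c(data[key_vars], sep = "|"))
  as.integer(table(keys)[keys])
}

# Hypothetical helper: TRUE if every key combination occurs at least k times.
is_k_anonymous <- function(data, key_vars, k) {
  all(key_frequencies(data, key_vars) >= k)
}

# Small toy data set so that the expected results can be checked by hand.
toy <- data.frame(
  sex    = c("f", "f", "m", "m", "m"),
  region = c("A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

test_that("key_frequencies counts records per key combination", {
  expect_equal(
    key_frequencies(toy, c("sex", "region")),
    c(2L, 2L, 2L, 2L, 1L)
  )
})

test_that("is_k_anonymous detects records that are unique on the keys", {
  expect_true(is_k_anonymous(toy, "sex", k = 2))
  expect_false(is_k_anonymous(toy, c("sex", "region"), k = 2))
})
```

Each test_that block checks one small unit in isolation on data small enough to verify by hand, so a failure points directly to the part of the code that changed.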
An excellent programming background and at least basic knowledge of the statistical environment R are required. Knowledge of data privacy methods and statistical modelling is not mandatory but is a benefit.