Accessible data quality assessments with R


Data quality has been defined as the "degree to which a set of inherent characteristics of data fulfils requirements" (ISO 8000). This important aspect of any research data collection should be assessed thoroughly and efficiently. However, many users are uncertain which tools are best suited to assessing data quality: for the R programming language alone, hundreds of potentially relevant packages are available. This workshop provides an applied overview with recommendations on which R packages to use, covering the following dimensions of data quality assessment: integrity ("The degree to which the data conforms to structural and technical requirements."), completeness ("The degree to which expected data values are present."), consistency ("The degree to which data values are free of breaks in conventions or contradictions."), and accuracy ("The degree of agreement between observed and expected distributions and associations.").
Methods, Results
We provide an overview of three different approaches to assessing data quality with R. The first comprises packages for exploratory data analysis, which give a fast overview while making little use of additional metadata (e.g. SmartEDA). The second comprises packages for highly targeted, rule-based checks on distinct data properties (e.g. validate). The third comprises packages that produce extensive data quality reports driven by metadata (e.g. dataquieR). All approaches will be illustrated using a publicly available example data set. The example analysis starts without metadata, using packages of the first type, and we will review the scope and limitations of the results. Subsequently, we will show how additional checks can be performed using the functionality of the second package type. Finally, we will illustrate how to further improve the scope and efficiency of data quality assessments by setting up a metadata file that controls the assessment with packages of the third type.
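As a rough illustration of the second, rule-based approach, the following minimal sketch uses the validate package on a small made-up data frame. The data, the variable names (id, age, sex), and the three rules are assumptions chosen for illustration; they are not the workshop's example data set or its recommended rule set.

```r
library(validate)

# Hypothetical toy data, NOT the workshop's example data set
study <- data.frame(
  id  = c(1, 2, 3, 4),
  age = c(34, -2, 51, NA),
  sex = c("f", "m", "x", "m")
)

# Each rule targets one distinct data property (illustrative rules only)
rules <- validator(
  age_present  = !is.na(age),              # completeness
  age_in_range = age >= 0 & age <= 120,    # plausibility of values
  sex_coded    = sex %in% c("f", "m")      # adherence to an assumed coding convention
)

# Apply the rules to the data and tabulate passes, fails and NAs per rule
result <- confront(study, rules)
rule_summary <- summary(result)
print(rule_summary)
```

Each rule is evaluated record by record, so the summary shows how many records pass, fail, or cannot be evaluated because of missing values. The rules have to be written by hand, which is what distinguishes this approach from the metadata-driven one.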
Depending on the data quality dimensions of interest and on the availability of metadata, different approaches can be chosen to conduct data quality assessments. There is no one-size-fits-all approach, and the available R packages do not cover every potentially relevant aspect. The maturity of the metadata determines which levels of data quality reporting are possible.
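The idea of letting metadata control the assessment can be sketched in base R as follows. This is a hand-rolled illustration under stated assumptions, not the interface of dataquieR or any other package: the data, the metadata columns (variable, lower, upper), and the helper check_variable() are all hypothetical.

```r
# Hypothetical toy data; names and values are illustrative only
study <- data.frame(
  age = c(34, -2, 51, NA),
  sbp = c(120, 135, 400, 118)
)

# Hypothetical metadata table: one row per variable with admissible limits
meta <- data.frame(
  variable = c("age", "sbp"),
  lower    = c(0, 60),
  upper    = c(120, 300),
  stringsAsFactors = FALSE
)

# One generic check (hypothetical helper), parameterised by a metadata row
check_variable <- function(data, var, lower, upper) {
  x <- data[[var]]
  data.frame(
    variable       = var,
    pct_missing    = 100 * mean(is.na(x)),                    # completeness
    n_out_of_range = sum(x < lower | x > upper, na.rm = TRUE) # limit violations
  )
}

# Run the same check for every variable described in the metadata
report <- do.call(
  rbind,
  Map(function(v, l, u) check_variable(study, v, l, u),
      meta$variable, meta$lower, meta$upper)
)
rownames(report) <- NULL
print(report)
```

Adding a variable or changing a limit then only requires editing the metadata table, not the checking code. In this sketch the richer the metadata, the more checks can be derived from it automatically.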