Roberto Boselli et al.
4. The robust data quality analysis technique
The Robust Data Quality Analysis (RDQA) is an iterative technique that analyses and evaluates the degree of consistency of a data source by applying three distinct functions:
• Function T(): it takes as input the source (and “dirty”) dataset and returns a new, cleansed one. This function is used as a black box and can be implemented by means of several paradigms. In the labour case described in this paper, the T() cleansing function has been implemented as part of a larger ETL process. A large number of off-the-shelf tools implement the ETL paradigm; for our purposes the Talend tool (Talend, 2013) has been used.
• Function F(): it is a model-based function exploiting model checking techniques as described by Mezzanzanica et al. (2011). This function takes as input a data source (a set of sequences) whose consistency has to be analysed and produces as output: (1) a subset of sequences where at least one inconsistency has been found and (2) a subset of sequences where no inconsistencies have been found.
• Function DIFF(): it takes as input both the dirty and the cleansed versions of the same data source and, for each sequence stored in one of the analysed datasets, it classifies the sequence into (1) the set of sequences which have been altered by T() during the cleansing process, and (2) the set of sequences left untouched by T().
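The interfaces of F() and DIFF() can be sketched as two partition functions. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the representation of a dataset as a mapping from a sequence identifier to its ordered events, the `is_consistent` predicate (a stand-in for the model checking step) and all function names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Set, Tuple

# Assumed representation: a dataset maps a sequence id to its ordered events.
Dataset = Dict[Hashable, Sequence[str]]
Partition = Tuple[Set[Hashable], Set[Hashable]]


def f_partition(data: Dataset,
                is_consistent: Callable[[Sequence[str]], bool]) -> Partition:
    """F(): split sequence ids into (inconsistent, consistent).

    `is_consistent` is a hypothetical stand-in for the model checker.
    """
    bad = {k for k, seq in data.items() if not is_consistent(seq)}
    return bad, set(data) - bad


def _as_tuple(seq: Optional[Sequence[str]]) -> Optional[Tuple[str, ...]]:
    # None marks a sequence that is absent from one of the two datasets.
    return None if seq is None else tuple(seq)


def diff_partition(dirty: Dataset, cleansed: Dataset) -> Partition:
    """DIFF(): split sequence ids into (altered by T, untouched by T)."""
    keys = set(dirty) | set(cleansed)
    changed = {k for k in keys
               if _as_tuple(dirty.get(k)) != _as_tuple(cleansed.get(k))}
    return changed, keys - changed


def no_double_start(seq: Sequence[str]) -> bool:
    """Toy consistency rule (an assumption): no two consecutive 'start' events."""
    return all(not (a == b == "start") for a, b in zip(seq, seq[1:]))


# Toy data, invented for illustration only.
dirty = {"a": ["start", "end"], "b": ["start", "start"]}
cleansed = {"a": ["start", "end"], "b": ["start", "end", "start"]}

fs_plus, fs_minus = f_partition(dirty, no_double_start)
d_plus, d_minus = diff_partition(dirty, cleansed)
```

In this toy run, sequence "b" is reported as inconsistent by F() and as altered by DIFF(), while "a" is consistent and untouched.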
Figure 1 shows a schematic representation of the RDQA approach. An RDQA iteration works as follows:
• The T() function analyses the source dataset S, creating a cleansed version N of the source. S is composed of several event sequences (where an event sequence is a person's career in the labour reference scenario). N is composed of the same event sequences, some of which are left untouched while others are changed. Ideally, the former are consistent sequences while the latter are the cleansed versions of inconsistent ones. However, this is not always the case, as will be outlined later.
• The function F() analyses the consistency of the source S. The set Fs+ (Fs-) will contain all the sequences violating (satisfying) the consistency constraints.
• The function F() analyses the consistency of the cleansed dataset N. The set Fn+ (Fn-) will contain all the sequences violating (satisfying) the consistency constraints.
• The function DIFF() computes the differences between S and N (with respect to event sequences). The set D+ (D-) will contain all sequences changed (unchanged) by T().
• By inspecting the generated sets Fs+, Fs-, D+, D-, Fn+ and Fn-, we can analyse how a sequence (i.e., a career) has been handled by the cleansing process. The result is a Double Check Matrix, as described in Table 3.
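One possible way to combine the generated sets into a per-sequence classification is sketched below. The actual layout of the Double Check Matrix is the one given in Table 3; here, as an assumption made purely for illustration, each sequence is keyed by three booleans (member of Fs+, member of D+, member of Fn+), and membership in Fs-, D- and Fn- is taken to be the complement. The function name and the use of sets of sequence identifiers are hypothetical.

```python
from collections import Counter
from typing import Hashable, Set


def double_check_counts(fs_plus: Set[Hashable],
                        d_plus: Set[Hashable],
                        fn_plus: Set[Hashable],
                        all_ids: Set[Hashable]) -> Counter:
    """Count sequences per (in Fs+, in D+, in Fn+) combination.

    Fs-, D- and Fn- are assumed to be the complements within `all_ids`.
    """
    return Counter((k in fs_plus, k in d_plus, k in fn_plus) for k in all_ids)


# Toy input, invented for illustration only:
#   "a" consistent in S, untouched, consistent in N
#   "b" inconsistent in S, changed by T(), consistent in N
#   "c" inconsistent in S, untouched, still inconsistent in N
counts = double_check_counts(
    fs_plus={"b", "c"},
    d_plus={"b"},
    fn_plus={"c"},
    all_ids={"a", "b", "c"},
)
```

A key such as (True, False, True) then singles out sequences that F() found inconsistent in the source, that T() left untouched, and that are still inconsistent after cleansing, which is the kind of case where the two paradigms disagree.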
Figure 1: Schematic representation of the RDQA approach
The contribution of the Double Check Matrix (DCM) is twofold: on the one hand, it provides an estimation of the quality of the source data (i.e., a data quality analysis); on the other hand, it allows users to validate the data cleansing process by comparing two distinct cleansing paradigms, namely the empirical and the formal one.