Roberto Boselli et al.
The cases labelled with () represent the careers dropped from the cleansed dataset by the ETL (in spite of their consistency), since these careers refer to workers living and working in other regions, whose registrations should be stored elsewhere.
One RDQA iteration was computed on a 32-bit 2.2 GHz CPU (connected to a MySQL server through an ODBC driver) in about 50 minutes, using about 220 MB of RAM. The results in Table 3 are briefly discussed below.
Case 1: represents careers that were already clean and have been left untouched by T(). It provides an estimate of the consistent careers in the source archive.
Case 2: refers to careers considered consistent (by function F()) before but not after cleansing, although they have not been touched by T(). As expected, this subset is empty.
Case 3: describes consistent careers that have been improperly changed by T(). Although such careers remain consistent after the intervention of T(), the behaviour of T() has been investigated to prevent its changes from turning into errors in the future. This set was inspected in depth because of its high impact on the overall DCM (i.e., about 6% of the total careers). We discovered that the T() implementation improperly changed events and values for some kinds of careers. We detected two main intervention types performed by T() on case 3 careers. The sizes of the affected sets are summarised in Table 4. Both interventions 1 and 2 are wrong with respect to the expected semantics (although both produce consistent results); therefore, the function T() has to be fixed. Note that 0.08% of other interventions still remains. The emergence of the latter subset (which is currently under investigation) and its size bear witness to the contribution that formal methods can make to improving the data quality process.
Case 4: represents originally consistent careers that T() has made inconsistent. These careers were very useful for identifying and correcting bugs in the T() implementation.
Case 5: refers to careers considered inconsistent by function F() before, but consistent after cleansing, although they have not been touched by T(). As expected, this subset is also empty.
Case 6: describes inconsistent careers that T() was able neither to detect nor to correct, and which were consequently left untouched.
Case 7: gives the number of (originally) inconsistent careers that F() ultimately recognises as properly cleansed by T().
Case 8: represents originally inconsistent careers that have not been properly cleansed: despite an intervention of T(), the function F() still identifies them as inconsistent.
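The eight cases above can be read as the combinations of three boolean outcomes: whether F() judges the source career consistent, whether T() changed it, and whether F() judges the cleansed career consistent. The following is a minimal sketch of that reading, not the authors' implementation. The names `classify_case` and `toy_f`, the representation of a career as a tuple of events, and the definition of "touched" as "source differs from cleansed" are all assumptions made for illustration.

```python
from typing import Callable, Hashable

# Case numbering as described in the text, keyed by
# (consistent before, touched by T, consistent after).
CASES = {
    (True, False, True): 1,
    (True, False, False): 2,
    (True, True, True): 3,
    (True, True, False): 4,
    (False, False, True): 5,
    (False, False, False): 6,
    (False, True, True): 7,
    (False, True, False): 8,
}


def classify_case(
    source: Hashable,
    cleansed: Hashable,
    is_consistent: Callable[[Hashable], bool],
) -> int:
    """Return the case number (1-8) for one career.

    `is_consistent` stands in for F(); `cleansed` is the output of T().
    Assumption: a career is "touched" when T() returned something
    different from its input.
    """
    before = is_consistent(source)
    touched = source != cleansed
    after = is_consistent(cleansed)
    return CASES[(before, touched, after)]


def toy_f(career: tuple) -> bool:
    """Hypothetical stand-in for F(): events must alternate
    'start', 'stop', 'start', ... beginning with 'start'."""
    expected = "start"
    for event in career:
        if event != expected:
            return False
        expected = "stop" if expected == "start" else "start"
    return True


if __name__ == "__main__":
    # A consistent career that T() left untouched falls in case 1.
    print(classify_case(("start", "stop"), ("start", "stop"), toy_f))
    # An inconsistent career that T() repaired falls in case 7.
    print(classify_case(("start", "start"), ("start", "stop", "start"), toy_f))
```

Under these assumptions, an untouched career is evaluated twice by the same deterministic function on the same input, so cases 2 and 5 cannot occur, which agrees with the empty subsets reported above.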
Table 4: Composition of Case 3, shown in Table 3

Composition of Row 3 of Table 3    # Careers    %
TOTAL                              334,097      6
ETL Intervention 1                 238,794      71.47
ETL Intervention 2                 69,213       20.72
ETL Interventions 1 and 2          256          0.08
Other Interventions                26,346       7.89
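The figures in Table 4 can be cross-checked with a few lines of arithmetic. The sketch below assumes (this is not stated in the text) that the rows for Intervention 1 and Intervention 2 each include the careers that received both interventions, so the overlap must be subtracted once when summing. The variable names are illustrative.

```python
# Career counts taken from Table 4.
total = 334_097
intervention_1 = 238_794
intervention_2 = 69_213
both = 256          # careers that received interventions 1 and 2
other = 26_346


def share(count: int) -> float:
    """Percentage of the Case 3 total, rounded to two decimals."""
    return round(100 * count / total, 2)


# Inclusion-exclusion: careers hit by intervention 1 or 2 (or both).
one_or_two = intervention_1 + intervention_2 - both

if __name__ == "__main__":
    # Under the stated assumption the rows add up to the total exactly.
    print(one_or_two + other == total)
    for label, count in [
        ("ETL Intervention 1", intervention_1),
        ("ETL Intervention 2", intervention_2),
        ("ETL Interventions 1 and 2", both),
        ("Other Interventions", other),
    ]:
        print(f"{label}: {share(count)}")
```

With this reading the rows are mutually consistent: the counts sum to the Case 3 total and the recomputed percentages match the reported ones.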