Roberto Boselli et al.
6. Conclusions
In this paper we described both a formal and an empirical data quality technique that have been exploited to evaluate and improve the consistency dimension of an administrative archive, the Mandatory Communication Archive. As a first step we used an ETL-based approach to cleanse a source (and dirty) dataset. Then, we enhanced the results provided by the ETL through the RDQA. Table 5 summarises some results achieved by the RDQA, thanks to which we were able to measure the following.
• The initial consistency degree (Case1 + Case4 of Table 3) of the source dataset before the ETL's cleansing intervention. Note that the initial consistency degree is 35.65% considering careers data (i.e., 35.65% of the careers are consistent), whilst it drops to 12.42% if we consider events data (i.e., 12.42% of the mandatory communications belong to consistent careers). Hence, one could improperly argue that, by looking at events data, the remaining 88% of the events database is inconsistent. Rather, 88% of the events belong to worker careers presenting at least one inconsistency over time. This is due to the incremental nature (with respect to time) of the administrative database. To clarify this aspect, let us consider a simple scenario in which the first communication of a worker career presents an inconsistency (e.g., the career starts with a cessation event). As a result, this career will be marked as inconsistent and, from then on, all the following events related to this worker (regardless of their own consistency) will be marked as "belonging to an inconsistent career", with no chance to modify the inconsistency status of the career in the future. More generally, the consequence of the incremental dynamics of the database can be summarised as follows: "the sooner an inconsistent event happens, the higher the impact on the overall database consistency". In terms of careers, instead, we can state that the consistency of the database is 35%. This result is enough to motivate a cleansing process on the data before using them for decision-making purposes. Looking at the consistency degree of the cleansed dataset, we can observe that it has grown from 35% up to 87%.
• The room for improvement (Case3 + Case4 + Case6 + Case8 of Table 3) of the ETL routines, which gives a quantitative estimate of how much the ETL process could be improved.
• The quality improvement (Case7 - Case4 of Table 3) achieved by the ETL approach. Note that the use of formal methods to evaluate the ETL process makes this value more reliable. In other words, this value can be considered a witness of the ETL effectiveness, confirmed by the function F().
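The incremental career-marking dynamics described in the first point can be sketched in a few lines of Python. The event model and the consistency rule below are deliberately simplified and hypothetical (the paper's formal consistency checks are richer); the sketch only illustrates why the event-level consistency percentage is systematically lower than the career-level one once a career is flagged inconsistent early on.

```python
# Hypothetical sketch of career-level consistency marking: once a career is
# flagged inconsistent, all of its events count as "belonging to an
# inconsistent career". The start/cessation alternation rule is a
# simplification, not the paper's actual formal model.

def career_consistent(events):
    """A career is consistent here iff start/cessation events alternate
    correctly and the career does not open with a cessation."""
    active = False
    for ev in events:
        if ev == "start":
            if active:            # two starts in a row -> inconsistent
                return False
            active = True
        elif ev == "cessation":
            if not active:        # cessation before any start -> inconsistent
                return False
            active = False
    return True

def consistency_degrees(careers):
    """Return (career-level %, event-level %) consistency degrees."""
    ok = [c for c in careers if career_consistent(c)]
    career_pct = 100 * len(ok) / len(careers)
    event_pct = 100 * sum(len(c) for c in ok) / sum(len(c) for c in careers)
    return career_pct, event_pct

careers = [
    ["start", "cessation"],               # consistent
    ["cessation", "start", "cessation"],  # opens with a cessation: inconsistent
]
print(consistency_degrees(careers))  # (50.0, 40.0)
```

Even in this toy dataset the event-level degree (40%) sits below the career-level one (50%), mirroring the 12.42% vs. 35.65% gap reported for the real archive.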
Table 5: Some results achieved by the RDQA

                                 Careers (%)   Events (%)
Consistency on S                    35.65        12.42
Consistency on C                    86.94        84.25
Room for Improvement                11.40        14.52
Quality Improvement Achieved        51.29        71.83
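As a hedged sketch, the summary percentages of Table 5 can be derived from Table 3 case counts using the formulas quoted above (Case1 + Case4, Case3 + Case4 + Case6 + Case8, Case7 - Case4). The case counts in the example are hypothetical and do not reproduce the actual Mandatory Communication data.

```python
# Sketch: deriving the RDQA summary percentages from Table 3 case counts.
# The three formulas come from the text above; the counts are hypothetical.

def rdqa_summary(cases, total):
    """cases: dict 'Case1'..'Case8' -> record count; total: dataset size."""
    pct = lambda n: round(100 * n / total, 2)
    return {
        # consistent before the ETL intervention (Case1 + Case4)
        "initial_consistency": pct(cases["Case1"] + cases["Case4"]),
        # records the ETL could still handle better
        "room_for_improvement": pct(cases["Case3"] + cases["Case4"]
                                    + cases["Case6"] + cases["Case8"]),
        # net gain of the cleansing: (Case1 + Case7) - (Case1 + Case4)
        "quality_improvement": pct(cases["Case7"] - cases["Case4"]),
    }

counts = {"Case1": 10, "Case2": 0, "Case3": 5, "Case4": 20,
          "Case5": 0, "Case6": 5, "Case7": 60, "Case8": 10}  # hypothetical
print(rdqa_summary(counts, total=100))
# {'initial_consistency': 30.0, 'room_for_improvement': 40.0,
#  'quality_improvement': 40.0}
```

Note that the formulas are mutually coherent: if the final consistency is Case1 + Case7 and the initial one is Case1 + Case4, their difference is exactly Case7 - Case4, the quality improvement.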
Finally, we would like to remark that the aim of this work is to show how a data quality analysis can be performed on an administrative archive, i.e., by formally evaluating the effectiveness of the cleansing process with respect to data consistency.
In this regard, each single RDQA iteration allows one to identify, extract and derive knowledge about the cleansing process, contributing to the identification of bugs and incongruities. Indeed, the RDQA can be applied iteratively, refining the cleansing procedures until a satisfactory data quality level is reached.
Afterwards, the cleansed dataset can be investigated by domain experts to support decision makers' activities in the economic, government and business fields.
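The iterative application of the RDQA just described can be summarised as a simple loop. All function names and the threshold below are purely illustrative assumptions, not the paper's toolchain:

```python
# Illustrative sketch of the iterative RDQA process: cleanse the data,
# formally measure the consistency degree, refine the cleansing procedures
# using the incongruities found, and repeat until a target quality level
# is reached. Names and the 0.85 threshold are hypothetical.

def rdqa_loop(dataset, cleanse, check_consistency, refine,
              target=0.85, max_iter=10):
    """Return (cleansed data, consistency degree, iterations used)."""
    for i in range(1, max_iter + 1):
        cleansed = cleanse(dataset)
        degree = check_consistency(cleansed)  # fraction of consistent careers
        if degree >= target:
            return cleansed, degree, i
        cleanse = refine(cleanse, cleansed)   # improved cleansing procedure
    return cleansed, degree, max_iter
```

The loop terminates either when the formally measured consistency degree reaches the target or after a fixed number of refinement rounds, matching the "refine until satisfactory" workflow described above.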