Improving Data Cleansing Techniques on Administrative Databases
Roberto Boselli, Mirko Cesarini, Fabio Mercorio and Mario Mezzanzanica
Dept. of Statistics and Quantitative Methods, CRISP Research Centre, University of Milano-Bicocca, Milan, Italy
roberto.boselli@unimib.it, mirko.cesarini@unimib.it, fabio.mercorio@unimib.it, mario.mezzanzanica@unimib.it
Abstract: Business and governmental applications, web applications, and the ongoing relations between citizens and public administrations generate a great deal of data, a relevant subset of which can be considered longitudinal data. Such data are often used in several decision making activities in the context of active policy design and implementation, resource allocation, and service design and improvement. Unfortunately, the lower the quality of the data, the lower the reliability of the information derived from them. Hence, data cleansing activities play a key role in ensuring the effectiveness of the decision making process. In the last decade a great effort has been made by both the industrial and academic communities in developing algorithms and tools to assess data quality, dealing with a wide range of dimensions (e.g., consistency, accuracy, believability) in several fields (e.g., government, statistics, computer science). Nevertheless, scalability issues often affect theoretical methods, since the size of real-case datasets is often huge, while the lack of formality of many data cleansing techniques may affect the reliability of the cleansed data. Therefore, the application of such approaches to real-world domains still represents a challenging issue. This work aims to exploit both empirical and theoretical approaches by combining their capabilities in assessing and improving data cleansing procedures, providing experimental results in a motivating application domain. We focus on a scenario where the well-known ETL (Extract, Transform, Load) technique has been used to generate a new (cleansed) dataset from the original one. We then enhanced the ETL features by assessing the results through the Robust Data Quality Analysis (RDQA), a model-checking-based technique implemented to evaluate the consistency of both the source dataset and the cleansed one, providing useful insights into how the ETL procedures could be improved. We applied this methodology to a real application domain, namely the "Mandatory Notification System" designed by the Italian Ministry of Labour and Welfare. The system stores data concerning employment and active labour market policies for the Italian inhabitants; such data are stored in several databases managed at the territorial level. In such a context, the data used for decision making by policy makers and civil servants should be carefully and effectively managed, given the social relevance of labour market dynamics. We evaluated our approach on a database containing the career data of more than 5.5 million people, i.e. the citizens living in an Italian region. Thanks to the joint exploitation of both the ETL and the RDQA techniques, we performed a fine-grained evaluation of the data cleansing results.
Keywords: data cleansing, data quality, administrative databases, decision making
1. Introduction and contribution
In the last decade, the diffusion and application of Information Systems has grown apace, providing a relevant contribution to the definition and realisation of many IT services, including in the public sector. As a result, a great number of datasets have become available, many of which could be exploited to analyse, observe and explain social, economic and business phenomena in depth. Unfortunately, several studies report that the quality of the data stored in enterprise databases and public administration archives is very low, e.g. (Batini and Scannapieco, 2006), (Redman, 1998), causing unpredictable effects on the effectiveness and reliability of the statistics derived from them. To give an example, the causes of the Challenger Space Shuttle explosion have been imputed to ten different categories of data quality problems (Fisher and Kingma, 2001). Organisations are becoming more and more aware of the consequences and costs of low data quality, and therefore several plans, strategies, and actions have been implemented, e.g. as described in (Tee et al., 2007). Data quality is a broad concept composed of several dimensions, e.g., accuracy, consistency, accessibility; a complete survey can be found in (Batini and Scannapieco, 2006). In this paper we focus on the consistency dimension, referring to “the violation of semantic rules defined over a set of data items”.
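To make the consistency dimension concrete, the sketch below is our own minimal illustration (the schema, rules, and records are hypothetical, not drawn from the cited works): it checks two simple semantic rules over a handful of employment records, namely that an employment spell cannot end before it starts, and that two spells of the same worker must not overlap in time.

```python
# A minimal sketch of consistency as "semantic rules over data items".
# Schema, rules, and records are purely illustrative.
from datetime import date
from itertools import combinations

# Hypothetical career records: one row per employment spell of a worker.
records = [
    {"worker_id": "A", "start": date(2012, 3, 1), "end": date(2013, 2, 28)},
    {"worker_id": "A", "start": date(2012, 9, 1), "end": date(2014, 1, 15)},  # overlaps the first spell
    {"worker_id": "B", "start": date(2014, 6, 1), "end": date(2014, 1, 15)},  # ends before it starts
]

def bad_date_order(rec):
    """Rule 1: an employment spell cannot end before it starts."""
    return rec["end"] is not None and rec["end"] < rec["start"]

def overlapping(r1, r2):
    """Rule 2: two spells of the same worker must not overlap in time."""
    return (r1["worker_id"] == r2["worker_id"]
            and r1["start"] <= r2["end"] and r2["start"] <= r1["end"])

violations = [r for r in records if bad_date_order(r)]
violations += [(r1, r2) for r1, r2 in combinations(records, 2) if overlapping(r1, r2)]
print(f"{len(violations)} consistency violation(s) found")
```

Rules of this kind, expressed over entire career sequences rather than isolated records, are the type of consistency constraints that the RDQA technique introduced above is meant to verify at the scale of a whole longitudinal archive.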
In this paper we describe how data quality techniques can be used to improve the quality of a database, focusing on administrative databases, especially those presenting a longitudinal dynamic (i.e., repeated observations of the same subject at multiple time points). To this aim, we report a successful experience in the domain of labour market data, in which both ETL and formal-methods-based techniques have been jointly applied. Indeed, a challenge in the field of data quality is the development of techniques able to analyse and cleanse huge data archives while focusing on several data quality dimensions. From a statistical perspective, this activity