Error localisation and correction works can be further classified into: 1) those exploiting machine learning methods and 2) those exploiting data dependencies (formalised by domain experts) to detect and correct errors. In the latter case, domain experts are required to formalise the dependencies and rules.
1) Machine learning methods. Possible techniques and approaches include unsupervised learning, statistical methods, data profiling, range and threshold checking, pattern recognition, and clustering methodologies. It is well known that these methods can improve their performance in response to human feedback; however, the model resulting from the training phase cannot be easily accessed and interpreted by domain experts. In this paper we explore a different approach, in which the consistency models are explicitly built and validated by domain experts.
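To make the contrast concrete, the following is a minimal sketch of one of the unsupervised techniques listed above (clustering-based anomaly detection), not the approach proposed in this paper: records falling outside any dense cluster are flagged as suspicious, but the result carries no explicit, inspectable rule a domain expert could validate. The feature layout, DBSCAN parameters, and example data are illustrative assumptions.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    def flag_suspicious(records: np.ndarray, eps: float = 0.5, min_samples: int = 5) -> np.ndarray:
        # DBSCAN labels points in no dense cluster as noise (-1); flag them as potential errors.
        scaled = StandardScaler().fit_transform(records)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
        return labels == -1

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])  # one obvious outlier
    print(np.where(flag_suspicious(data))[0])  # indices of the flagged records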
2) Dependency-based methods. Several approaches rely on integrity constraints for identifying errors; however, they cannot address complex errors or several inconsistencies commonly found in real data (Fan, 2008; Maletic and Marcus, 2000). Other constraint types have been identified in the literature: multivalued dependencies, embedded multivalued dependencies, and conditional functional dependencies. Nevertheless, according to (Vardi, 1987), there are still semantic constraints that cannot be described. In (Arasu and Kaushik, 2009) a context-free-grammar-based framework is used to specify production rules that reconcile different representations of the same concept (e.g., Univ. → University). Such an approach mainly focuses on the attribute level, whilst the work presented in this paper focuses on set-of-records consistency.
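As an illustration of the dependency-based family, the sketch below checks a conditional functional dependency. The example constraint ("for Italian records, the ZIP code determines the city") and the field names are hypothetical, chosen only to show the mechanics of an attribute-level check.

    from collections import defaultdict

    def cfd_violations(records, condition, lhs, rhs):
        # Records satisfying `condition` that map the same `lhs` value to
        # different `rhs` values violate the conditional functional dependency.
        values, rows = defaultdict(set), defaultdict(list)
        for r in records:
            if condition(r):
                values[r[lhs]].add(r[rhs])
                rows[r[lhs]].append(r)
        return {k: rows[k] for k, v in values.items() if len(v) > 1}

    records = [
        {"country": "IT", "zip": "20126", "city": "Milano"},
        {"country": "IT", "zip": "20126", "city": "Milan"},   # same ZIP, different city spelling
        {"country": "IT", "zip": "00185", "city": "Roma"},
    ]
    print(cfd_violations(records, lambda r: r["country"] == "IT", "zip", "city"))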
Works on database repair focus on finding a consistent database that differs minimally from the original one; however, the authors of (Chomicki and Marcinkowski, 2005) point out that computational issues affect the algorithms used for performing minimal-change integrity maintenance. Deductive databases (Ramakrishnan and Ullman, 1995) add logic programming features to relational systems and can be used for managing consistency constraints. To the best of our knowledge, few works in the literature focus on deductive databases and data quality. Furthermore, scalability issues have to be investigated when dealing with large datasets.
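For illustration, a Datalog-style denial constraint of the kind a deductive database could enforce is sketched below as a plain Python check. The constraint ("a worker cannot hold two overlapping full-time contracts") and the record layout are assumptions made for the example, not the rules used in this paper.

    from itertools import combinations

    def overlapping(a, b):
        # ISO date strings compare correctly in lexicographic order.
        return a["start"] <= b["end"] and b["start"] <= a["end"]

    def violations(contracts):
        # Pairs of full-time contracts of the same worker whose validity periods overlap.
        return [
            (c1, c2)
            for c1, c2 in combinations(contracts, 2)
            if c1["worker"] == c2["worker"]
            and c1["type"] == c2["type"] == "full-time"
            and overlapping(c1, c2)
        ]

    contracts = [
        {"worker": "W1", "type": "full-time", "start": "2015-01-01", "end": "2015-12-31"},
        {"worker": "W1", "type": "full-time", "start": "2015-06-01", "end": "2016-05-31"},
    ]
    print(violations(contracts))  # the two contracts above overlap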
Consistent query answering works, e.g. (Bertossi, 2006), focus on techniques for retrieving consistent answers from inconsistent data; i.e., the focus is on automatic query modification rather than on fixing the source data. An answer is considered consistent when it appears in every possible repair of the original database.
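The notion of repair can be illustrated with a small sketch (the relation, key constraint, and query below are assumptions made for the example): every repair keeps exactly one tuple per key value, and only the answers returned by the query on every repair are consistent.

    from itertools import product

    def repairs(relation, key):
        # All maximal consistent subsets: exactly one tuple per key value.
        by_key = {}
        for t in relation:
            by_key.setdefault(t[key], []).append(t)
        return [list(choice) for choice in product(*by_key.values())]

    def consistent_answers(relation, key, query):
        results = [set(query(r)) for r in repairs(relation, key)]
        return set.intersection(*results) if results else set()

    emp = [
        {"id": 1, "dept": "Sales"},
        {"id": 1, "dept": "HR"},      # violates the key constraint on "id"
        {"id": 2, "dept": "Sales"},
    ]
    # Which departments appear? Only "Sales" survives every repair.
    print(consistent_answers(emp, "id", lambda r: {t["dept"] for t in r}))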
Other works and tools not included in the previous categories are now briefly surveyed. The problem of checking (and repairing) several integrity constraint types has been analysed in (Afrati and Kolaitis, 2009). Unfortunately, most of the adopted approaches can lead to hard computational problems. Finally, many data cleansing toolkits have been proposed for implementing filtering and transformation rules over data. A detailed survey of those tools is outside the scope of this paper; the interested reader can refer to (Maletic and Marcus, 2000; Vassiliadis, 2009).
3. The “mandatory notification system”
The Italian Law No. 264 of 1949 requires employers to report information about their employees to the public administration offices. These notifications are called Mandatory Communications (CO) and must be sent to the public administration within five days of the start, cessation, or modification of a working contract. Since 1997, the Ministry has developed an ICT infrastructure, called the “Mandatory Communication System” (The Italian Ministry of Labour and Welfare, 2012), for recording data concerning employment and active labour market policies, thus generating an administrative archive useful for studying labour market dynamics.
For the sake of clarity, it is important to highlight that a sequence of mandatory communications describes how a worker's career has evolved over time. In this sense, the longitudinal data extracted from the CO archives allow one to observe the overall flow of the labour market over a given observation period, providing insightful information about worker career paths, patterns, and trends. Such information can strongly support the decision-making processes of civil servants and policy makers.
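As a sketch of how such longitudinal sequences can be derived, the fragment below groups CO events by worker and orders them chronologically; the event schema and field names are assumptions made for the example, not the actual CO record format.

    from collections import defaultdict

    def build_careers(events):
        # Group CO events by worker and order each sequence chronologically.
        careers = defaultdict(list)
        for e in events:
            careers[e["worker_id"]].append(e)
        for seq in careers.values():
            seq.sort(key=lambda e: e["date"])
        return careers

    events = [
        {"worker_id": "W1", "date": "2015-03-01", "event": "start", "employer": "ACME"},
        {"worker_id": "W1", "date": "2016-07-31", "event": "cessation", "employer": "ACME"},
    ]
    print(build_careers(events)["W1"])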