13th European Conference on eGovernment – ECEG 2013 1 | Page 112

Roberto Boselli et al.
Table 2: The double check matrix cases
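Which set does a sequence belong to?

| Case | Fs+ | Fs- | D+  | D-  | Fn+ | Fn- |
|------|-----|-----|-----|-----|-----|-----|
| 1    | NO  | YES | NO  | YES | NO  | YES |
| 2    | NO  | YES | NO  | YES | YES | NO  |
| 3    | NO  | YES | YES | NO  | NO  | YES |
| 4    | NO  | YES | YES | NO  | YES | NO  |
| 5    | YES | NO  | NO  | YES | NO  | YES |
| 6    | YES | NO  | NO  | YES | YES | NO  |
| 7    | YES | NO  | YES | NO  | NO  | YES |
| 8    | YES | NO  | YES | NO  | YES | NO  |
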
It is worth noting that such inconsistencies prevent a correct analysis of labour market data. As an example, consider a person ending a working contract with a company at a given point in time. A communication is sent to the relevant job registry through the "Mandatory Communication System" to notify the end of the contract. Suppose that the contract started before the observation period, or that no corresponding contract start has been registered: in both cases the career becomes inconsistent. This inconsistency prevents the correct evaluation of some typical (and relevant) labour indicators, such as worker turnover, average contract duration, occupation trends, etc. Hence, data quality analysis and cleansing activities are paramount. The contract-closing communication appears "correct" when considered in isolation, which is the usual administrative approach to managing data; the inconsistency emerges only when the whole career is considered, a holistic approach that is common in analytical tasks.
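
The example above can be made concrete with a minimal sketch. It assumes a deliberately simplified event model with only `start` and `end` events and at most one open contract per company; the names `Event` and `is_consistent` are hypothetical and are not taken from the formal model used in the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """A simplified mandatory communication (hypothetical model)."""
    when: date
    kind: str      # "start" or "end"
    company: str


def is_consistent(career):
    """Return True if every 'end' closes a previously opened contract.

    Assumption: a worker holds at most one open contract per company.
    """
    open_contracts = set()
    for event in sorted(career, key=lambda e: e.when):
        if event.kind == "start":
            if event.company in open_contracts:
                return False  # contract started twice without an end
            open_contracts.add(event.company)
        elif event.kind == "end":
            if event.company not in open_contracts:
                return False  # end without a registered start
            open_contracts.remove(event.company)
        else:
            return False  # unknown event kind
    return True


# Each event is plausible in isolation, but the second career lacks a start.
complete = [Event(date(2005, 3, 1), "start", "ACME"),
            Event(date(2006, 2, 28), "end", "ACME")]
truncated = [Event(date(2004, 6, 30), "end", "ACME")]

print(is_consistent(complete))   # True
print(is_consistent(truncated))  # False
```

The check is longitudinal by construction: the verdict depends on the ordered sequence of events, not on any single communication.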
5. Experimental results
We applied both the ETL and the formal-methods-based techniques to the labour market data of an Italian Region. The dataset is composed of 47,154,010 mandatory communications representing the careers of 5,570,991 citizens observed from 1st January 2004 to 31st December 2011. Here we report the Double Check Matrix resulting from a single RDQA iteration. For the sake of clarity, we remark that the dataset was treated as longitudinal during the consistency evaluation, i.e. we evaluated the consistency of each career as a whole rather than focusing only on single events.
Table 3: The double check matrix computed on the careers data of an Italian region
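
| Case  | Row | Consistent in S | Touched by the ETL | Consistent in C | # Careers | % Careers | # Events   |
|-------|-----|-----------------|--------------------|-----------------|-----------|-----------|------------|
| 1     | 1   | YES             | NO                 | YES             | 1,984,692 | 35.63     | 5,367,306  |
| 2     | 2   | YES             | NO                 | NO              | 0         | 0         | 0          |
| 3     | 3   | YES             | YES                | YES             | 334,097   | 6.00      | 1,429,170  |
| *     | 4   | YES             | YES                | null            | 40,520    | 0.73      | 126,644    |
| 4     | 5   | YES             | YES                | NO              | 1,100     | 0.02      | 9,530      |
| 5     | 6   | NO              | NO                 | YES             | 0         | 0         | 0          |
| 6     | 7   | NO              | NO                 | NO              | 15,267    | 0.27      | 86,364     |
| 7     | 8   | NO              | YES                | YES             | 2,858,357 | 51.31     | 34,399,674 |
| 8     | 9   | NO              | YES                | NO              | 284,569   | 5.11      | 5,314,062  |
| *     | 10  | NO              | YES                | null            | 52,389    | 0.93      | 421,260    |
| TOTAL |     |                 |                    |                 | 5,570,991 | 100       | 47,154,010 |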
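
The assignment of a career to a case in Table 3 follows from three per-career flags. The following is a minimal sketch, assuming each flag is already available as a Python value, with `None` standing for the table's `null`. The function name `dcm_case` and the sample flags are hypothetical; the case numbering is simply read off Table 3.

```python
from collections import Counter


def dcm_case(consistent_in_s, touched_by_etl, consistent_in_c):
    """Map three flags to a Double Check Matrix case label as in Table 3.

    consistent_in_c may be None (the table's 'null'); such careers are
    labelled '*', mirroring the unnumbered rows of the table.
    """
    if consistent_in_c is None:
        return "*"
    # Table 3 lists YES before NO for the S and C columns,
    # and NO before YES for the ETL column.
    return (1
            + 4 * int(not consistent_in_s)
            + 2 * int(bool(touched_by_etl))
            + int(not consistent_in_c))


# Illustrative flags only (not taken from the dataset).
careers = [
    (True, False, True),   # case 1
    (False, True, True),   # case 7
    (False, True, True),   # case 7
    (False, True, None),   # '*'
]
counts = Counter(dcm_case(*flags) for flags in careers)
print(counts[7])    # 2
print(counts["*"])  # 1
```

Counting the labels over all careers would yield the "# Careers" column, under the assumption that the three flags are computed per career as described in the text.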