enables the conversion of administrative data (i.e., data collected as single and independent events) into statistical data. The latter allow one to analyse and describe a phenomenon as the result of multiple observations and measurements, from both a qualitative and a quantitative point of view. Along these lines, dealing with consistent data is required to guarantee the believability of the statistical results. An example can help clarify the matter. Table 1 shows a cruise ship travel plan. A ship usually travels by sea and stops at ports of call (intermediate destinations), making a checkin notification when entering a harbour and a checkout notification when exiting it. The reader will notice that the departure date from Lisbon is missing, since the ship should check out (of the previous harbour) before entering a new one. In this sense, the dataset is inconsistent.
Missing data and wrong values may not be noticed during the “daily operations”, but they can strongly affect the information derived for decision-making purposes. Indeed, even though each single notification of Table 1 can be considered “correct”, the missing departure from Lisbon is a problem, and many such gaps can strongly affect the statistics and indicators computed on the whole dataset. For instance, missing dates create uncertainty when computing an indicator like active travel days / overall cruise duration, and unpredictable effects may arise in the generated statistics, since the frequency of missing data is unknown and cannot be precisely estimated.
Table 1: Example of a cruise ship travel plan
EventId | ShipID | City      | Date            | Notification Type
e1      | S01    | Venice    | 12th April 2011 | checkin
e2      | S01    | Venice    | 15th April 2011 | checkout
e3      | S01    | Lisbon    | 30th April 2011 | checkin
e4      | S01    | Barcelona | 5th May 2011    | checkin
e5      | S01    | Barcelona | 8th May 2011    | checkout
...     | ...    | ...       | ...             | ...
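Before introducing the techniques adopted, the following minimal sketch (in Python; the in-memory tuple representation and the function name are our illustrative assumptions, not part of the original work) shows how the alternation rule described above, namely that a checkout must separate two consecutive checkins of the same ship, can be checked mechanically over the records of Table 1:

```python
from datetime import date

# The events of Table 1 (illustrative in-memory representation).
events = [
    ("e1", "S01", "Venice",    date(2011, 4, 12), "checkin"),
    ("e2", "S01", "Venice",    date(2011, 4, 15), "checkout"),
    ("e3", "S01", "Lisbon",    date(2011, 4, 30), "checkin"),
    ("e4", "S01", "Barcelona", date(2011, 5, 5),  "checkin"),
    ("e5", "S01", "Barcelona", date(2011, 5, 8),  "checkout"),
]

def find_inconsistencies(events):
    """Scan the time-ordered events of one ship and report every point
    where the checkin/checkout alternation is violated."""
    issues = []
    in_harbour = False  # the ship is assumed to start its history at sea
    for event_id, ship_id, city, day, kind in sorted(events, key=lambda e: e[3]):
        if kind == "checkin" and in_harbour:
            issues.append(f"{event_id}: checkin at {city} without a checkout from the previous harbour")
        elif kind == "checkout" and not in_harbour:
            issues.append(f"{event_id}: checkout at {city} while the ship was already at sea")
        in_harbour = (kind == "checkin")
    return issues

print(find_inconsistencies(events))
# -> ['e4: checkin at Barcelona without a checkout from the previous harbour']
```

Replaying the five events of Table 1 flags the Barcelona checkin (e4), because no checkout from Lisbon precedes it, which is exactly the inconsistency discussed above.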
To avoid this unpleasant side effect, we have exploited data quality techniques as follows:
Step 1: to analyse the source dataset, discovering and fixing inconsistencies so as to create a new, consistent dataset;
Step 2: to provide assurance of the effectiveness of the technique used at Step 1, and consequently of the reliability of the cleansed data, hence increasing the usefulness of the generated statistics.
The former task has been accomplished by means of the well-known ETL technique, while the latter has been performed through the Robust Data Quality Analysis (RDQA), a technique we applied as introduced by (Mezzanzanica et al., 2011). For the sake of completeness, we highlight that the RDQA has been implemented by applying formal methods, for instance model checking (Clarke, Grumberg and Peled, 1999). A discussion of this hardware/software formal verification technique is out of the scope of this paper; for this reason the RDQA will be presented without details about formal definitions, models and algorithm pseudo-codes. The interested reader can find all these details in the works of (Mezzanzanica et al., 2011) and (Mezzanzanica et al., 2012).
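Although the formal machinery behind the RDQA is out of scope here, its intuition can be conveyed with a small, purely illustrative sketch (ours, not the algorithm of Mezzanzanica et al.): the admissible behaviour of a ship is encoded as a two-state model (at sea / in harbour), every ship history in both the source and the cleansed dataset is replayed against it, and the share of histories violating the model is compared before and after the ETL step. The datasets below are hypothetical; the cleansed one assumes the ETL step has restored the missing Lisbon checkout.

```python
AT_SEA, IN_HARBOUR = "at_sea", "in_harbour"

# Admissible transitions of the consistency model: state -> notification -> next state.
MODEL = {
    AT_SEA:     {"checkin": IN_HARBOUR},
    IN_HARBOUR: {"checkout": AT_SEA},
}

def is_consistent(history):
    """Replay a time-ordered list of notification types against the model."""
    state = AT_SEA
    for notification in history:
        if notification not in MODEL[state]:
            return False
        state = MODEL[state][notification]
    return True

def violation_rate(dataset):
    """Fraction of ship histories in the dataset that violate the model."""
    if not dataset:
        return 0.0
    return sum(1 for h in dataset.values() if not is_consistent(h)) / len(dataset)

# Hypothetical datasets: the source one reproduces Table 1 (missing Lisbon checkout),
# the cleansed one assumes the ETL step has restored it.
source   = {"S01": ["checkin", "checkout", "checkin", "checkin", "checkout"]}
cleansed = {"S01": ["checkin", "checkout", "checkin", "checkout", "checkin", "checkout"]}

print(violation_rate(source), violation_rate(cleansed))  # 1.0 0.0
```

A drop in the violation rate from the source to the cleansed dataset gives a quantitative, model-based argument for the effectiveness of the cleansing procedure, which is the spirit of Step 2.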
2. Related works
Data quality has been addressed in different research domains, including statistics, management, and computer science, as reported in (Batini et al., 2009) and (Scannapieco, Missier and Batini, 2005). For the sake of clarity, the works surveyed in this section have been classified into three groups according to the (main) goal pursued: record linkage, error localisation and correction, and consistent query answering. The classification adopted is not strict, since several works could fall into more than one group.
Record linkage (also known as object identification, record matching, or the merge-purge problem) aims to bring together corresponding records from two or more data sources, or to find duplicates within a single source. The record linkage problem falls outside the scope of this paper and is therefore not further investigated.