Tutorials on Informatica Data Quality idq | Page 2
similarity between two strings, such as name and address. There are implementations that
are better suited to date strings and others that are ideal for numeric strings. In
the coming weeks, we’ll go through an overview of each of these implementations and how
to use them to your advantage!
Hamming Distance Algorithm
The Hamming distance algorithm is particularly useful when the position of the characters in
the string is important. Examples of such strings are telephone numbers, dates and postal
codes. The Hamming distance algorithm measures the minimum number of substitutions
required to change one string into the other or, equivalently, the number of errors that
transformed one string into the other.
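As a minimal sketch of the idea above, the distance can be computed by comparing the two strings position by position and counting the mismatches. The function name hamming_distance and the sample values are illustrative assumptions, not part of Informatica Data Quality.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The classic definition only covers strings of equal length.
        raise ValueError("Hamming distance requires strings of equal length")
    # Each mismatched position corresponds to one required substitution.
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# Hypothetical sample values: a phone number and a date with swapped digits.
print(hamming_distance("555-0142", "555-0147"))      # 1
print(hamming_distance("2023-01-15", "2023-10-15"))  # 2
```

Note that the swapped month digits in the second example count as two substitutions, because each position is compared independently.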
The Hamming distance is named after Richard Hamming, an American mathematician whose
accomplishments include many advances in information science. Perhaps as a result of
Hamming's time at Bell Laboratories, the Hamming distance algorithm is most often
associated with the analysis of telephone numbers. However, the advantages of the
algorithm apply to many types of strings and are not limited to numeric strings.
One condition must be met when using this algorithm: the strings being analyzed need to
be of the same length. Since the Hamming distance algorithm is based on the "cost" of
transforming one string into another through substitutions, strings of unequal length will
incur high penalties, because the unmatched positions are compared against null
character values.
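The penalty described above can be illustrated with a short sketch. It assumes, purely for illustration, that the shorter string is padded on the right with null characters and that the result is normalized to a score between 0 and 1. The function padded_hamming_score, the padding rule and the normalization are hypothetical and are not taken from Informatica's implementation.

```python
def padded_hamming_score(a: str, b: str, pad: str = "\0") -> float:
    """Return a similarity score from 0.0 to 1.0, where 1.0 means identical."""
    length = max(len(a), len(b))
    if length == 0:
        return 1.0  # two empty strings are treated as identical
    # Assumption: right-pad the shorter string with null characters.
    a_padded = a.ljust(length, pad)
    b_padded = b.ljust(length, pad)
    mismatches = sum(x != y for x, y in zip(a_padded, b_padded))
    return 1.0 - mismatches / length


# Equal length, one differing digit: mild penalty.
print(padded_hamming_score("90210", "90211"))        # 0.8
# One extra leading digit shifts every position: heavy penalty.
print(padded_hamming_score("5550142", "15550142"))   # 0.25
```

In the second example a single extra leading digit misaligns almost every position, which is why equal-length input matters for position-based comparison.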
Six Measures of Data Quality
The quality of the data records in your datasets can be described according to six key criteria,
and an effective quality management system will allow you to assess the quality of your data
in areas such as these:
Completeness: Concerned with missing data, that is, with fields in your dataset that have
been left empty or whose default values have been left unchanged. (For example, a date field
whose default setting of 01/01/1900 has not been edited.)
Conformity: Concerned with data values of a similar type that have been entered in a
confusing or unusable manner, e.g. numerical data that includes or omits a comma separator
($1,000 versus $1000).
Consistency: Concerned with the occurrence of disparate types of data records in a dataset
created for a single data type, e.g. the combination of personal and business information in
a dataset intended for business data only.
Integrity: Concerned with the recognition of meaningful associations between records in a
dataset. For example, a dataset may contain records for two or more individuals in a
household but provide no means for the organization to recognize or use this information.
Duplication: Concerned with data records that duplicate one another’s information, that is,
with identifying redundant records in the data set.
Accuracy: Concerned with the gener