Tutorials on Informatica Data Quality (IDQ) | Page 2

similarity between two strings, such as name and address. There are implementations that are better suited to date strings and others that are ideal for numeric strings. In the coming weeks, we’ll go through an overview of each of these implementations and how to use them to your advantage!

Hamming Distance Algorithm

The Hamming distance algorithm is particularly useful when the position of the characters in the string is important. Examples of such strings are telephone numbers, dates and postal codes. The algorithm measures the minimum number of substitutions required to change one string into the other, or equivalently the number of errors that transformed one string into the other. For example, the telephone numbers 5551234 and 5551284 differ in exactly one position, so their Hamming distance is 1.

The Hamming distance is named after Richard Hamming, an American mathematician whose accomplishments include many advances in information science. Perhaps as a result of Hamming’s time at Bell Laboratories, the Hamming distance algorithm is most often associated with the analysis of telephone numbers. However, the advantages of the algorithm apply to various types of strings and are not limited to numeric strings.

One condition must be met when using this algorithm: the strings being analyzed need to be of the same length. Since the Hamming distance algorithm is based on the “cost” of substituting characters to turn one string into another, strings of unequal length will incur high penalties, because the unmatched positions are compared against null character values.

Six Measures of Data Quality

The quality of the data records in your datasets can be described according to six key criteria, and an effective quality management system will allow you to assess the quality of your data in areas such as these:

Completeness: Concerned with missing data, that is, with fields in your dataset that have been left empty or whose default values have been left unchanged. (For example, a date field whose default setting of 01/01/1900 has not been edited.)
Conformity: Concerned with data values of a similar type that have been entered in a confusing or unusable manner, e.g. numerical data that includes or omits a comma separator ($1,000 versus $1000).
Consistency: Concerned with the occurrence of disparate types of data records in a dataset created for a single data type, e.g. the combination of personal and business information in a dataset intended for business data only.
Integrity: Concerned with the recognition of meaningful associations between records in a dataset. For example, a dataset may contain records for two or more individuals in a household but provide no means for the organization to recognize or use this information.
Duplication: Concerned with data records that duplicate one another’s information, that is, with identifying redundant records in the dataset.
Accuracy: Concerned with the gener
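
Returning to the Hamming distance described earlier, the idea can be made concrete with a short Python sketch. This is only an illustration of the general technique, not Informatica's implementation: the function names `hamming_distance` and `padded_hamming_distance` are hypothetical, and padding the shorter string with a null character is an assumption made here to show why unequal lengths are penalized heavily.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    # Each mismatched position costs exactly one substitution.
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def padded_hamming_distance(a: str, b: str, pad: str = "\0") -> int:
    """Hypothetical variant: right-pad the shorter string with a null
    character so that strings of unequal length can still be compared.
    This is an assumption for illustration, not a documented behavior."""
    width = max(len(a), len(b))
    return hamming_distance(a.ljust(width, pad), b.ljust(width, pad))


if __name__ == "__main__":
    # One digit differs, in the same position.
    print(hamming_distance("5551234", "5551284"))
    # A single dropped digit shifts every later character out of position.
    print(padded_hamming_distance("5551234", "551234"))
```

In the second call a single missing digit misaligns every character after it, so most positions count as mismatches. That is the practical reason to compare only strings of the same length, or to normalize lengths first.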