The primary goal of a Data Quality solution is to assemble data from one or more data sources. However, the process of bringing data together usually exposes a broad range of data quality issues that need to be addressed. For instance, incomplete or missing customer profile information may be uncovered, such as blank phone numbers or addresses. Or certain data may simply be incorrect, such as a customer record indicating that the customer lives in the city of Wisconsin, in the state of Green Bay.
1. Profiling
Profiling data helps you examine whether your existing data sources meet the quality standards of your solution. Properly profiling your data saves execution time because you identify issues that require immediate attention from the start – and avoid the unnecessary processing of unacceptable data sources. Data profiling becomes even more critical when working with raw data sources that do not have referential integrity or quality controls.
There are three main data profiling tasks: column statistics, value distribution, and pattern distribution. These tasks analyze individual and multiple columns to determine relationships between columns and tables. The purpose of these tasks is to develop a clearer picture of the content of your data; a minimal sketch of each appears after the list below.
• Column Statistics – This task identifies problems in your data, such as invalid dates, and reports minimum, maximum, and average statistics for numeric columns.
• Value Distribution – Identifies all of the values in each selected column and reports normal and outlier values, as well as missing, null, and whitespace values.
• Pattern Distribution – Identifies invalid strings or irregular patterns in your data, typically by testing values against expected regular expressions.
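The three profiling tasks map naturally onto a few lines in a general-purpose data tool. The sketch below is illustrative only and assumes a hypothetical pandas DataFrame named customers with name, phone, and zip columns; it is not tied to any particular Data Quality product.

```python
import pandas as pd

# Hypothetical sample data; in practice this would come from a source system.
customers = pd.DataFrame({
    "name":  ["Ann Smith", "Bob Jones", None, "Ann Smith"],
    "phone": ["262-555-0199", "5550199", "", "262-555-0199"],
    "zip":   ["53202", "5320", "53202", "53202"],
})

# Column statistics: summary statistics and null counts per column.
print(customers.describe(include="all"))
print(customers.isna().sum())

# Value distribution: frequency of each value, including nulls,
# which makes outliers and blanks easy to spot.
print(customers["zip"].value_counts(dropna=False))

# Pattern distribution: share of values matching an expected pattern
# (here, a 5-digit ZIP Code).
zip_ok = customers["zip"].str.fullmatch(r"\d{5}", na=False)
print(f"valid ZIP patterns: {zip_ok.mean():.0%}")
```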
2. Cleansing
After a data set successfully meets profiling standards, it still requires data cleansing and de-duplication to ensure that all business rules are properly met. Successful data cleansing requires the use of flexible, efficient techniques capable of handling complex quality issues hidden in the depths of large data sets. Data cleansing corrects errors and standardizes information that can ultimately be leveraged for MDM applications.
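As a rough sketch of what basic cleansing can look like, and continuing with the hypothetical customers DataFrame from the profiling example above, the following trims whitespace, normalizes name casing, treats empty strings as missing, and drops exact duplicates; real cleansing rules are of course far more extensive.

```python
import pandas as pd

# Continuing with the hypothetical `customers` DataFrame from the profiling sketch.
cleansed = customers.copy()

# Standardize casing and strip stray whitespace from names.
cleansed["name"] = cleansed["name"].str.strip().str.title()

# Treat empty phone strings as missing so they surface in later checks.
cleansed["phone"] = cleansed["phone"].replace("", pd.NA)

# Remove exact duplicate rows before finer-grained de-duplication rules run.
cleansed = cleansed.drop_duplicates()
```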
3. Parsing and Standardization
This technique parses and restructures data into a common format to help build more consistent data. For instance, the process can standardize addresses to a desired format, or to USPS® specifications, which are needed to enable CASS Certified™ processing. This phase is designed to identify, correct and standardize patterns of data across data sets, whether at the level of tables, columns, or individual rows.
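For illustration only, the sketch below standardizes a handful of common street-suffix variants to USPS-style abbreviations. The suffix map is a hypothetical subset; a real CASS Certified process involves far more than this simple lookup.

```python
import re

# Hypothetical subset of USPS-style street suffix abbreviations.
SUFFIXES = {"street": "ST", "st": "ST", "avenue": "AVE", "ave": "AVE",
            "road": "RD", "boulevard": "BLVD", "blvd": "BLVD"}

def standardize_street(line: str) -> str:
    """Uppercase an address line and normalize its trailing street suffix."""
    tokens = re.sub(r"[.,]", "", line).upper().split()
    if tokens and tokens[-1].lower() in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1].lower()]
    return " ".join(tokens)

print(standardize_street("123 north Main Street"))   # 123 NORTH MAIN ST
print(standardize_street("456 Oak Avenue."))         # 456 OAK AVE
```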
4. Matching
Data matching consolidates data records into identifiable groups and links/merges related records within or across data sets. This process locates matches in any combination of over 35 different components – from common ones like address, city, state, ZIP®, name, and phone – to other not-so-common elements like email address, company, gender and social security number. You can select from exact matching, Soundex, or phonetic matching, which recognizes sounds such as “ph” and “sh.” Data matching also recognizes nicknames (Liz, Beth, Betty, Betsy, Elizabeth) and alternate spellings (Gene, Jean, Jeanne).
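As a rough illustration of one of these techniques, the following is a minimal implementation of the classic American Soundex code combined with a tiny, hypothetical nickname lookup; production matching engines combine many more comparators and weights.

```python
def soundex(name: str) -> str:
    """Return a simplified American Soundex code (one letter + three digits)."""
    mapping = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            mapping[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    code, prev = name[0], mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":           # H and W do not reset the previous code
            prev = digit
    return (code + "000")[:4]

# Hypothetical nickname table; real systems use much larger reference data.
NICKNAMES = {"liz": "elizabeth", "beth": "elizabeth", "betsy": "elizabeth"}

def names_match(a: str, b: str) -> bool:
    """Match on canonical nickname, exact spelling, or Soundex code."""
    a, b = a.lower(), b.lower()
    a, b = NICKNAMES.get(a, a), NICKNAMES.get(b, b)
    return a == b or soundex(a) == soundex(b)

print(names_match("Liz", "Elizabeth"))   # True (nickname lookup)
print(names_match("Smith", "Smyth"))     # True (same Soundex code, S530)
```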
5. Enrichment
Data enrichment enhances the value of customer data by attaching additional pieces of data from other sources, including geocoding, demographic data, full-name parsing and genderizing, phone number verification, and email validation. The process provides a better understanding of your customer data because it reveals buyer behavior and loyalty potential.
• Address Verification – Verify addresses to the highest level of accuracy.
• Phone Validation – Fill in missing area codes, and update and correct area code/prefix.
• Email Validation – Validate, correct and clean up email addresses using three levels of verification: syntax, local database, and MX lookup. These checks catch general format (syntax) errors and domain name changes, correct improper email formats for common domains (e.g., Hotmail, AOL, Yahoo), validate the domain against a database of good and bad addresses, verify that the domain name exists, and parse email addresses into their components (a minimal syntax check is sketched after this list).
• Name Parsing and Genderizing – Parse full names into components and determine the gender of the first name.
• Residential Business Delivery Indicator – Identify the delivery type as residential or business.
• Geocoding – Add latitude/longitude coordinates to the postal codes of an address.
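The first (syntax) level of email validation is easy to sketch with nothing but the standard library. The pattern below is a deliberately simplified, hypothetical one; the database and MX lookup levels would require reference data and a DNS query on the domain, which are out of scope here.

```python
import re

# Deliberately simplified email pattern: local part, "@", domain, dot, TLD.
EMAIL_RE = re.compile(r"^([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})$")

def check_email(address: str) -> dict:
    """Syntax-check an email address and parse it into components."""
    match = EMAIL_RE.match(address.strip())
    if not match:
        return {"address": address, "valid_syntax": False}
    local, domain = match.groups()
    return {"address": address, "valid_syntax": True,
            "local_part": local, "domain": domain}

print(check_email("jane.doe@example.com"))
print(check_email("not-an-email"))
```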
6. Monitoring
This real-time monitoring phase puts automated processes into place to detect when data exceeds pre-set limits. Data monitoring is designed to help organizations immediately recognize and correct issues before the quality of data declines.
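As a minimal sketch of the idea, and assuming a hypothetical null-rate threshold on the customers data used earlier, a monitoring job might simply flag any column whose share of missing values exceeds a pre-set limit:

```python
import pandas as pd

# Hypothetical pre-set limit: no column may be more than 10% null.
NULL_RATE_LIMIT = 0.10

def check_null_rates(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns whose null rate exceeds the limit."""
    rates = frame.isna().mean()
    return rates[rates > NULL_RATE_LIMIT].index.tolist()

alerts = check_null_rates(customers)
if alerts:
    print(f"data quality alert: high null rate in {alerts}")
```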
There are three general ways to organize your data so that it can ultimately be merged: unique identifiers, attributes, and transactions.
Unique Identifiers – These identifiers define a business entity’s master system of record. As you bring together data from various data sources, an organization must have a consistent mechanism to uniquely identify, match, and link customer information across different business functions. While data connectivity provides the mechanism to access master data from various source systems, it is the Total Data Quality process that ensures integration with a high level of data quality and consistency. Once an organization’s data is cleansed, its unique identifiers can be shared among multiple sources. In essence, a business can consolidate its data into a ‘single customer view’ and feed that view back to its existing sources, ensuring accurate, consistent data across the enterprise.
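One common, if simplistic, way to generate such an identifier is to hash a normalized combination of fields. The sketch below is a hypothetical illustration, not a recommendation for any particular key scheme.

```python
import hashlib

def match_key(name: str, zip_code: str, phone: str) -> str:
    """Build a deterministic surrogate key from normalized customer fields."""
    normalized = "|".join(part.strip().lower() for part in (name, zip_code, phone))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

# The same customer, keyed identically across two source systems.
print(match_key("Ann Smith", "53202", "262-555-0199"))
print(match_key(" ann smith ", "53202", "262-555-0199"))
```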
Attributes – Once a unique identifier is determined for an entity, you can organize your data by adding attributes that provide meaningful business context, categorize the business entity into one or more groups, and provide more detail on the entity’s relationship to other business entities. These attributes may be obtained directly from source systems. While managing unique identifiers helps you eliminate duplicate records, you will likely still need to cleanse the attributes themselves, because the same entity can carry conflicting attribute values across different data sources.
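A simple way to resolve such conflicts is a survivorship rule. The sketch below keeps, for each entity key, the most recently updated record; the column names and the “most recent wins” rule are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical attribute records from two source systems.
attributes = pd.DataFrame({
    "customer_key": ["K1", "K1", "K2"],
    "segment":      ["Retail", "Wholesale", "Retail"],
    "updated_at":   pd.to_datetime(["2024-01-05", "2024-03-20", "2024-02-10"]),
})

# Survivorship: for each customer, keep the most recently updated record.
survivors = (attributes.sort_values("updated_at")
                       .groupby("customer_key", as_index=False)
                       .last())
print(survivors)
```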
Transactions – Creating a master business entity typically involves consolidating data from multiple source systems. Once you have identified a mechanism to bridge and cleanse the data, you can begin to categorize the entity based on the types of transactions or activities that the entity is involved in. When you work with transaction