Well, this is my first post in 2020 and it had better be good and helpful to you out there. Happy new year!

I want to discuss data profiling using SSIS in this post. If you are not familiar with data profiling, it's the process of examining a data set to make general sense of the data and its content. For example, I have a table called Customers which holds information on my customers, including email address, zip code, country, date of birth, CustomerID, etc. Suppose I want to answer general questions such as:

What is the maximum date of birth?

What is the minimum date of birth?

Are there zip codes that do not meet the US standard of 5 digits?

How many people live in a particular zip code?

Which columns are potential keys that could be used as the primary key?

What are the minimum and maximum lengths of the email addresses?

All of this information can easily be extracted just by profiling the data. Data profiling helps you determine whether the data fits the outcome you expect to deliver to consumers. You can profile the data in the staging environment before deciding whether or not to merge it into your data warehouse. For example, if you have a requirement that when you pull customers from a particular system, more than 90% must have their email address filled in, profiling the staging data gives you the count of nulls, and you can use that percentage to decide whether the data is good enough to move to the data warehouse. During the design stage of a data warehouse, you can also profile different tables to determine the appropriate table structure, including the data types and lengths to use in the final tables.
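
If you want to sanity-check that staging rule outside of SSIS, the query behind it is a simple count. Below is a minimal T-SQL sketch, assuming a hypothetical staging table stg.Customers with an EmailAddress column (adjust both names to your schema):

```sql
-- Hypothetical staging table and column names; adjust to your schema.
DECLARE @Total INT, @WithEmail INT;

SELECT
    @Total     = COUNT(*),
    @WithEmail = COUNT(EmailAddress)   -- COUNT(col) skips NULLs
FROM stg.Customers;

-- Load only when at least 90% of rows have an email address.
IF @WithEmail * 100.0 / NULLIF(@Total, 0) >= 90.0
    PRINT 'OK to merge into the data warehouse';
ELSE
    PRINT 'Data quality check failed - investigate before loading';
```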

There are different data profiling tools, but I am going to use SSIS to profile a table in the AdventureWorks data warehouse on my local instance for this demo. This demo assumes you already know SSIS and are familiar with SSIS 2008 or later; the Data Profiling task is available in SSIS 2008 and above.

The first step is to start SSIS and create a new SSIS project. By default a new package will be created; in this package, add a new Data Profiling task from the Common section of the SSIS toolbox. You should have what is shown in the screenshot below.

Next, click on the Data Profiling task (see the screenshot below) to configure it.

Choose FileConnection for the DestinationType.

For Destination, click and select New File Connection, then choose Create New File and enter a file name such as profileoutput.

Click the OK button to return to the screen shown above. Next, click the Quick Profile button to connect to a server and set up the profiling.

You should see the screen below.

A couple of things to note here: the SSIS Data Profiling task accepts only an ADO.NET connection, and it works only against SQL Server. You can only profile data that lives in SQL Server, so if your data is somewhere else, for example in a flat file or a spreadsheet, you must load it into SQL Server before you can use the SSIS data profiler.

To connect to the SQL Server instance where the data is, click the New button in the snapshot above, enter the SQL Server to connect to, select the database, and click OK. In my case I just connected to my local instance using a dot (.) and selected the AdventureWorks database. See the snapshot below.

Once you are done setting up the connection to the SQL Server instance and the database, click OK to go back to the profiler configuration form. You should see the screen below after clicking OK.

The next step is to select the table or view you want to profile. In my case I am going to profile the DimCustomer table in AdventureWorksDW2016, so I select it in the Table or View section, and the screen below shows all the available profiling options.

Now we are at the meat of data profiling and need to understand what each of the options in the Compute section means, so I will take them one at a time.

i. Column Null Ratio Profile: This goes over all the columns in the table and gives you the count of nulls in each column and the corresponding percentage of null values in that column. A hand-written T-SQL equivalent is shown below.
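
To see roughly what this profile computes, here is the equivalent query for a single column, using the AddressLine2 column of DimCustomer as an example:

```sql
-- Null count and null percentage for one column of DimCustomer.
SELECT
    COUNT(*)                                              AS TotalRows,
    SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END) AS NullCount,
    SUM(CASE WHEN AddressLine2 IS NULL THEN 1 ELSE 0 END) * 100.0
        / COUNT(*)                                        AS NullPercent
FROM dbo.DimCustomer;
```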

ii. Column Statistics Profile: This is mostly for profiling numeric and date columns. For a numeric column it calculates the minimum, maximum, mean, and standard deviation. For a date column, it calculates the minimum and maximum date. This is very useful information to get just from profiling the data; a manual equivalent is sketched below.
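
A hand-rolled version for two columns might look like this, assuming DimCustomer's YearlyIncome (numeric) and BirthDate (date) columns:

```sql
-- Min/max/mean/standard deviation for a numeric column,
-- and min/max for a date column (DimCustomer examples).
SELECT
    MIN(YearlyIncome)   AS MinIncome,
    MAX(YearlyIncome)   AS MaxIncome,
    AVG(YearlyIncome)   AS MeanIncome,
    STDEV(YearlyIncome) AS StdDevIncome,
    MIN(BirthDate)      AS MinBirthDate,
    MAX(BirthDate)      AS MaxBirthDate
FROM dbo.DimCustomer;
```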

iii. Column Value Distribution Profile: This calculates the uniqueness or repetition of values in a column. If a column has repeated values, it reports those values, the number of times each is repeated, and the corresponding percentage of the values in that column. See the sketch below.
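
Hand-written, the same distribution for one column looks roughly like this, using the Gender column of DimCustomer as an example:

```sql
-- Distinct values in a column with their counts and percentages.
SELECT
    Gender,
    COUNT(*)                                 AS ValueCount,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS PercentOfRows
FROM dbo.DimCustomer
GROUP BY Gender
ORDER BY ValueCount DESC;
```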

iv. Column Length Distribution Profile: The column length distribution is simply the distribution of the character lengths of the values in a column. It goes over the various lengths, reports how many values in the column have that number of characters, and gives the percentage. It can tell you, for example, that 40% of customers have an email address of length 25, 10% have one of length 30, and so on. A manual equivalent follows this item.
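
The equivalent T-SQL for one column is a GROUP BY on the value length; a sketch for the EmailAddress column:

```sql
-- How many values share each character length in EmailAddress.
SELECT
    LEN(EmailAddress)                        AS ValueLength,
    COUNT(*)                                 AS ValueCount,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS PercentOfRows
FROM dbo.DimCustomer
WHERE EmailAddress IS NOT NULL
GROUP BY LEN(EmailAddress)
ORDER BY ValueLength;
```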

v. Column Pattern Profile: Note that this is not selected by default. This calculation is resource intensive as it tries to report a set of regular expressions that cover the specified percentage of values in a string column. It helps you identify problems in your data, such as invalid strings, and can suggest regular expressions that could be used in the future to validate new values. I will not be selecting this option for the data profiling demo.
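
T-SQL has no full regular expression support, so you cannot reproduce this profile exactly by hand, but a rough substitute is to flag values that fail a LIKE pattern you expect. A deliberately crude sketch for email addresses:

```sql
-- Crude pattern check: flag emails that do not look like name@domain.tld.
-- LIKE is not a real regex, so this only approximates the pattern profile.
SELECT EmailAddress
FROM dbo.DimCustomer
WHERE EmailAddress IS NOT NULL
  AND EmailAddress NOT LIKE '%_@_%._%';
```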

vi. Candidate Key Profile: This profiles all the columns in the data and reports which columns, or combinations of columns, can be used as primary keys. Note that it has an option to increase the number of key columns to consider; the higher this value, the more column combinations it evaluates when looking for candidate keys. I will leave it at 1, meaning I want to find candidate keys from the data but each key must be a single column. The underlying test is sketched below.
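
The test itself is simple: a column is a candidate key when its distinct count equals the table's row count. A manual check for a couple of DimCustomer columns might look like this:

```sql
-- A column is a candidate key when its distinct count equals the row count.
SELECT
    COUNT(*)                     AS TotalRows,
    COUNT(DISTINCT CustomerKey)  AS DistinctCustomerKey,
    COUNT(DISTINCT EmailAddress) AS DistinctEmailAddress
FROM dbo.DimCustomer;
-- 100% key strength means the distinct count equals TotalRows.
```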

vii. Functional Dependency Profile: This is another resource-intensive computation, and it is not selected by default. This profile reports the extent to which a value in a column depends on the value in another column or set of columns. It has an option to select the number of determinant columns; the determinant columns are the columns on which a column's value depends. A rough manual version appears below.
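
Conceptually, the profile asks: for each value of the determinant column, how often does the dependent column hold exactly one value? A hand-written approximation, assuming the City and StateProvinceName columns of DimGeography in the same database (SSIS computes its strength figure differently, so treat this only as an illustration):

```sql
-- Rough functional dependency check: does City determine StateProvinceName?
-- Strength = share of rows whose City maps to exactly one state.
WITH PerCity AS (
    SELECT
        City,
        COUNT(*)                          AS RowsInCity,
        COUNT(DISTINCT StateProvinceName) AS DistinctStates
    FROM dbo.DimGeography
    GROUP BY City
)
SELECT
    SUM(CASE WHEN DistinctStates = 1 THEN RowsInCity ELSE 0 END) * 100.0
        / SUM(RowsInCity) AS DependencyStrengthPercent
FROM PerCity;
```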

With all these details laid out, click OK to complete the configuration in the screenshot above. You should now see the request page showing all the selected profiles to be computed.

Now run the package, wait for it to complete, then click on the Data Profiling task to open it again. You should see the screen below.

Now click on Open Profile Viewer, which opens the latest profiling run, as shown below.

Now you can click on each of the computed profiles to see the details. For example, the Candidate Key profile looks like this:

From the screenshot, CustomerAlternateKey, CustomerKey, and EmailAddress all have 100% key strength, so any of them could be used as the primary key for the table.

Look at the Column Length Distribution profiles as shown below.

Selecting AddressLine1 in the Column Length Distribution profiles shows that the minimum length is 8 and the maximum length is 33. In the length distribution for that column you can see that there are 335 values with a length of 23, forming 1.8124% of the values in that column. You can go through the other columns by clicking on them in the top pane and viewing the details in the panes on the right of the screenshot above. Next, let's see the Column Null Ratio profiles in the screenshot below.

From the screenshot above you can easily see that 98.3121% of the values in AddressLine2 are null, and that AddressLine1 has no nulls at all. This helps you quickly see which columns contain nulls so you can plan how to handle those values before moving the data into the final destination tables. Let's take a look at the Column Statistics profiles in the screenshot below.

You can see above that the Column Statistics profile reports only on numeric and date columns. This can also help you understand how your data looks. Next are the Column Value Distribution profiles below.

For the Column Value Distribution profile, you can see AddressLine1 has 12,797 distinct values and no repeated values. Let's look at AddressLine2 to see how it looks below.

So for AddressLine2, 166 of the values are distinct, but the value Verkaufsabteiung is repeated 34 times, forming 0.1839% of the values in the AddressLine2 column.

With all this information now at your fingertips, you can decide how to build your final schema, or design your ETL to have some intelligence for when the staging data does not meet the requirements to be merged into your final data warehouse.

I hope this helps you provide quality data in your data engineering role.
