This week l had to show on the fly to a colleague of mine why it is very important to always clean data coming from different sources before storing it in the data warehouse. Most often than not, when you are building data warehouse and you are pulling data from other system, developers do not trim the data they are pulling from these systems before storing it in the data warehouse and I have showed a demonstration to a colleague how this could affect other applications pulling data to use for other business needs.

To demonstrate this scenario,

Create a table call visitor

CREATE TABLE [dbo].[Visitor](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [varchar](50) NULL,
[Age] [int] NULL
) ON [PRIMARY]

Now put some values into this table like this

INSERT INTO [2smart4school].[dbo].[Visitor](Name,Age)
VALUES (‘Nelson Mandela’, 95),
(‘    Nelson Mandela’,95),
(‘Nelson Mandela    ‘,95)

The second Nelson Mandela has couple of spaces before the name and the last Nelson Mandela has couple of spaces after the name while the first Nelson Mandela has no spaces before or after it.

Now let’s create another table by selecting and trimming the data from visitor into another table called member

SELECT V.ID memeberid, LTRIM(RTRIM(v.Name)) AS memberNAME, v.Age memberage INTO dbo.Members
FROM [dbo].[Visitor] V

At this time, l hope you realize you will have the members table with Nelson Mandela with any spaces before or after any of the names.

Now lets go ahead and do a let join visitor and member  and see the output

SELECT v.*,m.*
FROM [2smart4school].[dbo].[Visitor] v
Left JOIN [2smart4school].[dbo].[Members] m
ON m.memberNAME = v.Name

The output is shown below. You can see that the Nelson Mandela with spaces before in the visitor table has not matching in the member table but the Nelson Mandela with spaces after it matches perfectly to the non space after Nelson Mandela in the member table. This is how tricky it is, so when you have space before a word, its a whole different than when you have space after it in SQL server. These things can happen a lot due to human error when entry data so the best option is to do a left and right trim on all character or varchar data you are storing in your data warehouse. This will reduce lot of headache for you in the future when you try merging or join to get results and your counts are never matching.

 

image

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

Name *