TSQL Tutorial 20: Index

In this tutorial, l’ll be discussing one of the most used database objects for optimizing the performance of queries. Indexes are very important and every developer should know when to create it or not. They are DB objects that are used to optimize the performance of T-SQL queries by organizing data in a particular pattern in a balance tree structure.

Lets take for example, a book. Some books are really humongous or you can even take little books, its not easy to find something like, where does the word peace exist in the bible. that’s going to take you forever to go from the beginning of the bible to the end to locate that word. And by the way, I’m not preaching any religion here so just chill out and understand the concept or the example I’m making about indexes. I know some people have problem with religion and fight all the time simply because of religion. To me, God is one. You can call it whatever you like, but God is one. OK, forget about that and pay attention.

So, if we want to find the word peace say in any book, the best way to find is to go to the index of the book. The index has words arranged in some alphabetical order so its easy to spot peace if it exists in the book, then you’ll find a chapter or page where that word is located, open that page and then scan to find the spot of the word you are looking for. That’s exactly the idea behind index in database design. When you create tables, they are normally stored in unordered, unstructured form called a heap. This heap is just like a mess or a pile of clothes you refused to arrange in your room for 1000 years. When looking for a particular shirt, you go inside your pile of mess and start digging until you find it or reach the floor and realize that shirt doesn’t exist in your mess. Exactly is tables in a database. After they are created, they just serve us a room, for you to dump everything into it, things will pile up forever and there is no arrangement or ordering. So the next time you are looking for a record from the table, you use a process called table scan going through everything in the table from the top to bottom to find your stuff. The unfortunate thing is, even if what you are looking for is at the top of the records in the heap, the process will still waist all the time and scan to the bottom before giving you the result. So you see, that’s not cool. It makes your searches slow and ridanculous. To prevent this ungodly behaviour of the heap, an index is created on the table using a particular column as the index key, with which the whole data will be arranged either logically or physically. For every index created on a table, SQL server allocates a seperate Balanced Tree structure which either holds the data or points to data according to the type of index created.

There are two major types of indexes, these are clustered index and non clustered index. A clustered index and non clustered index have the same concept of an index described for our book example, the only difference is that, in a clustered index the data from the heap is now stored at the nodes of the Balance tree when it is created, but for a non clustered index, the data still exists wherever it was but it provides a key to access the data. Now, don’t over work yourself, the balance tree is created automatically by the SQL server, all you have to do is issue the statement to create the index type you want and the server will take care of everything for you. Some information to know about the Balance tree is that every node is a page in sql server of 8 kb in size (practically capable of holding 8060 bytes of information). The Balance tree is built dynamically by SQL server according to the number of records coming into the table and can scale up to as many number of levels depending upon the size of each row, number of rows and clustering key for that index structure. The balance tree can have root, intermediate and leaf level nodes. The root and the Intermediate nodes holds the starting value of clustering key for its child node and pointers to reach those child nodes. The leaf node holds actual data pages in the case of clustered index or it holds pointers in the case of non clustered index.Lets take each index type one at a time and discuss it.

Clustered Index:

When a clustered index is created, all information is pulled from the Heap memory and is sorted according to the ascending order of the clustering key and dumped at the leaf nodes of the Balance tree structure. The clustering key is the column or columns used in creating the index. You can have up to 16 columns maximum or 900 bytes, which ever come first, as the composite key to create the index on. That means, the balance tree will be created and sorted according to all these columns as one. Either clustered or non clustered can take only 16 columns, why? well, l think Microsoft don’t want you to crash his server lol. But seriously, is that not too much work to sort and arrange all these freaking columns for the balance tree structure of the index? As I said, the clustered index stores data at the leaf nodes so data is no longer in heap memory. Data pages are physically sorted and stored at the leaf nodes and the logical ordering and physical order matches. That means, it’s not just logical, it is physically stored in the system in that manner. You wana open your computer and see it? hahah. Now, this is the most important thing to carry home, We can have only 1 clustered index per table but there can be multiple non clustered indexes per table. Why? Imagine having multiple clustered indexes on the table, that means the data will be repeatedly stored at the leaf nodes of all the indexes. Do you want the computer to explode? lol.
Clustered indexes will sort and store the data in Balance tree nodes, so creating multiple clustered indexes will
mean creating and storing multiple Balance trees, which is not sensible, you can tell your interviewer that.

After creating a clustered index on the table, the server does not do table scan no more, it does what is called clustered index seek. This is the idea, if you are searching for something based on the clustering key, the server goes to the root node and ask, Mr Root, I’m looking for this record with this clustering key, the root will tell the server, go to this child node, when the server reach that child, the child will direct you to the leaf node and you simply grab that record and leave. Isn’t that cool? You avoid running around everywhere looking for just one thingy lol. That’s what is called clustered index seek. Final thing about clustered index, when you create a primary key for a table, that primary key automatically creates a unique clustered index on the column declared as the primary key. Indexes can be unique and non unique, where unique means no repeating and a single null in that column will be accepted, non unique means, you can go crazy repeating stuff in that column as many times as you want, no sweat about it. Now, let’s talk about non clustered index.

Non Clustered Index:

This type of index does not store the table data at the leaf nodes. It logically sorts the data but it doesn’t store the data in the Balance tree leaf nodes. Data is either in the heap or the clustered index. If you want to create both clustered and non clustered index on a table, it is recommended to create the clustered index first before creating the non clustered index because if you do it the other way, the non clustered index will have to be recreated after the clustered index is created because the data is now stored at the leaf node of the clustered index. you dig? (means, you understand?). I know you are like, then why didn’t you just say that in the first place. The answer is simple, it makes me feel sassy when l talk like that lol.

OK, back to business, the logical ordering of the data does not match the physical ordering of the data in non clustered index, as we saw for the clustered index. The leaf nodes of the non clustered index balance tree structure has pointers to the actual data. The pointer is called Row Identifier (RID) if the table has no clustered index on it, which means the data is still in the heap format. The pointer is called a clustering Key if there exists a clustered index on the table. There can be 249 non clustered indexes for sql server 2005 and 999 non clustered index for sql server 2008.

There are two processes to retrieve the data when we are going through the non clustered index balance tree. One is Key look up in which case the data is stored in a clustered index and the other is RID lookup in which case the data is in a heap format. The main process of traversing the non clustered index balance tree to locate the appropriate pointer is called non clustered index seek. So with the Non Clustered index on a table, you can perform either Non clustered index seek + Key look(table with clustered index) or Non clustered index seek + RID look up(heap). Now, this is the silly part, if you create all this indexes but in your search, you did not user the column or columns on which the index or indexes are created, you do what is called index scan in other to retrieve the data.

Enough talk, let’s look at the implementation of the two index types we talked about before discussing something else.

Syntax to create an index

CREATE NONUNIQUE/UNIQUE NONCLUSTERED/CLUSTERED INDEX <index name> ON <table>(col1,col2,—-)

As simple as it is, it’s very powerful. Example,

CREATE UNIQUE CLUSTERED INDEX idx_Emp ON Employee (EmpID)

In the above, l’ve created a clustered index on the table Employee, named it idx_Emp and it’s created on the column EmpID. This index is Unique. By default, when you say Create Index , it will create unique Non clustered index so its recommended to stop being lazy and specifically state the type of index you want.

Another example,

CREATE UNIQUE NON CLUSTERED INDEX idx_customers ON Customer(CustID)

Understanding the concept matters, creating the index is as simple as ABC. Note, once an index is created on a column, you can not change the column on which the index is created using an ALTER statement on the index. To perform that task, you have to first drop the index and recreate it using the desired column. This is the code to drop an index.

DROP INDEX <indexname>

Example, DROP INDEX idx_customers

Let’s now extend our knowledge a little to look at some other types of indexes. These indexes are just extension of what we’ve already discussed.

Filtered Index:

This type of index allows you to index a portion of the rows and not the whole table. Filtered index is an extended version of the non clustered index, in which user has control over the number of records that can be added onto the leaf node for Balance tree structure. Only those entries that satisfy the WHERE clause condition while creating filtered index will be added to the leaf level of the balance tree structure. This index is more optimized than full table non clustered indexes which has all the entries for all the rows, in case your search predicates would always be from specific range. The idea is simple guys, you want to create the non clustered index but you are not interested in all the records. Say, you have table for sales and you want to create the non clustered index on sales that are within a year from today. Imagine how much space and time to create the index for 100 years of sales while your search will be base on only one year from now. This is where filtered indexes come to the rescue, be careful though, you can’t do this for clustered index. This applies to only non clustered index and it has less index storage and maintenance costs. Lets use our sales example to create the filtered index on the sales table. Lets say the table has a column salesdate that we want to create the index on and we want the index to be created for sales within a year from today. Now, you have to be careful, indexes refresh automatically when new data comes into the table so it is a good idea to know how much index to create or you’ll face the consequence of index refresh overhead. Which can slow down perform in transactional environments where there frequent insertions, deletions and updates.

CREATE NON UNIQUE NON CLUSTERED INDEX idx_Sales_Within_Year ON sales(salesdate)

WHERE DATEDIFF(day,salesdate,GETDATE())/365 <= 1

The WHERE condition is the filter on the index, without it, the entire table will be indexed, but now, we are restricting it to create non clustered index for only those records that are a year or less from today. OK, last one to go

Covering Index:

This is also an extension of the Non Clustered index. This type of index is to make lookups in case of non clustered index more optimized. In a Covering index, the non clustered index is created on a particular column but the balance tree is informed to include data for some other columns at the node. Remember how the data is stored at the leaf node of a clustered index, in this case, the non clustered index is going to store the data for those included columns at the leaf node. That means, once you reach the leaf node of the non clustered index, you can simply grab data for those columns and leave. No look ups for those columns. So the idea works fine if you include those columns you will be searching more often. The data is duplicated for those columns included, that means, the data is stored at the leaf node and stored in the heap or clustered index if there is a clustered index on the table. Again, we have to be careful using this index not to waste space. The columns which are covered in non clustered-index will be added only to the leaf nodes of Balance tree structure, those columns are not added to the root and intermediate nodes. Its purpose is only to prevent RID or Key lookups as the columns selected will be in the leaf node itself. Also, any index which is composed of multiple columns can include a maximum of 16 columns or have a max size of 900 bytes, whichever comes first but you can include as many as 1023 columns in a covering index to store those columns at the leaf node. Lets see the code for creating a covering index for a employees table that has columns employeeid,departmentid, firstname, lastname, dateofbirth,hiredate,title,marital statues. In this case, l want to create a covering index on departmentid and include firstname and lastname. The assumptions is that, my search queries will be using the department id a lot searching for all employees in the department and retrieving their firstname and lastname. Here we go,

CREATE NON CLUSTERED INDEX idx_EmpCovering ON Employee(departmentid)

INCLUDE (Firstname, Lastname)

Before we get out of here, lets look at some interesting differences between having a covering index and creating index on composite keys.

–Differences in covering index and composite index
1. In covering index, the index is on a column and those columns that are often selected are included.
2. You can use the covering index to search on the indexed column and select the included columns
3. In a composite index, all the columns are used for the b-tree. that means, search can be performed
4. using all the columns in the composite index and the logical sorting is done on all the columns in the composite index.
5. In covering index, the logical sorting is done only on the indexed column.
6. The columns in covering index are stored with the indexed column in the leaf nodes and the columns in composite index are
are also stored in the leaf node.
7. Covering index can be created only for non clustered indexes but composite index can be for both clustered
and non clustered.
8. The efficiency of the composite index is when all the columns are used in performing search at a time.
using just one of the column or some of the indexes in the composite key will not take full advantage of the indexed b-tree.

I hope this really helps someone, please add your feedback, in the next tutorial on TSQL, I’ll be discussing triggers, another interesting database object. Peace!

TSQL Tutorial 20: Index

Leave a Reply Cancel reply