SQLTeam.com | Weblogs | Forums

When to use a clustered and/or non-clustered index


#1

Hi,
I have been looking all over the net and in books to findwhen to use a clustered and/or non-clustered index; and all I get isdefinitions, not when to use one or both. I know that you should always have a primary key and therefore the system will make a clustered index; so by this alone you probably should always have one, but not sure and when you should have a non-clustered index. A lot of recitals go deep into what each of these are but not how and when to use them. Can you please tell me how I can learn this, or would you be able to give me some general steps to do this?

Thank you


#2

A primary key <> a clustered index. By default, when you create a primary key, SQL will build a clustered index on that key, assuming there is not already a clustered index. But, you can specify a nonclustered index for the PK.

e.g. you can write:

USE AdventureWorks2012;  
GO  
ALTER TABLE Production.TransactionHistoryArchive   
ADD CONSTRAINT PK_TransactionHistoryArchive_TransactionID PRIMARY KEY CLUSTERED (TransactionID);  
GO  

but you can also specify NONCLUSTERED on the ADD CONSTRAINT line.

Using a good clustered index can speed up data retrieval or searching in the same order as the CI. Microsoft recommends that all tables have a CI. But that need not be the PK. The PK should be a good, unique business key or surrogate key. The CI can be on different columns (and if not unique, a unique-ifier is added).

Too many shops (looking at my own!) use an IDENTITY as the PK and the CI which makes sense in some scenarios (e.g. DW apps) but not so much in others.

Since you're just learning about it, separate the two ideas and keep them separate. Then optimize your table, choosing the best CI and PK for business purposes. It's good to settle this quickly though, since the cost of changing either goes up as the table grows.

One other thing to keep in mind: if the table has a CI (actually I should say if your table is a CI, since tables are either CIs or heaps), then all other indexes record the CI key columns and use a key lookup to get other columns. that's one reason why integer identity columns are popular -- they keep the index entries narrow and hence the b-tree shallow -- both good things.

This can be outweighed by other factors though. If a table has lots of INSERTs, it can develop a "hot spot" at the end, increasing contention. One solution is to use a GUID for the CI key. But this is not a general recommendation. It depends on your unique situation.


#3
  1. All permanent, non-data-staging tables should have a clustered index. There is almost never a reason to have a nonclus index and no clus index (and the exceptions to that are so rare it's not practical to worry about them anyway).

  2. A PK can be nice to have, but it's not required. I know relational theory says it is, but we're modeling data for the real word, not for a mathematician's academic world.

  3. The clustering key should be the most common way data is looked up, esp. if it is ranges, such as a date. The more common a particular column(s) is(are) used for lookup, the more likely it is that it(they) should be the clustering key(s).

  4. SQL provides stats on index needs and uses in system views. Specifically they are:
    sys.dm_db_missing_index_*
    sys.dm_db_index_operational_stats
    sys.dm_db_index_usage_stats

If you have a specific table or two you want to analyze the CIs on, I can walk you thru that process.


#4

Thanks for the information on the clustered index everyone. So in adition to the cluster index ho would you know when to a no-cluster index?

Thank you


#5

To get an initial idea of what nonclus index(es) might be needed, you can again use the:
sys.dm_db_missing_index_*
views.
Also, you can review query plans to look for table scans and "missing index" reports from the "Estimate Query Plan".
But don't automatically create all indexes SQL "tells" you to. Be sure to test to make sure the index is actually being used and that it actually helps the query enough to be worthwhile.


#6

This has all been a really help
thank you


#7

If the PK is a GUID, you might cluster on something else for optimization purposes, but I'm not sure that performance improvement would ever be worth worrying about.


#8

From personal experience, clustering on a GUID is a bad idea unless it is mostly read-only table. I have a legacy inherited database that has a table with GUID as clustered PK. Lot of inserts and deletes, and the table gets fragmented like crazy. Something like 20 million records, and it drives me up the wall.


#9

JamesK, what would you suggest clustering on, in that case?


#10

If you do have to use a surrogate key as a clustering key, I would use INT or BIGINT rather than GUID. Ideally if the clustering key is ever-increasing i.e., new records that are inserted get inserted at the end, that would reduce page splits and fragmentation when inserting.

But the other side of the equation is that if you use the column(s) by which you query data most often as the clustering key, that will make queries more efficient.


#11

Again, you don't need or want to guess at index usage. Get the real stats from SQL itself:
sys.dm_db_missing_index_*
sys.dm_db_index_operational_stats
sys.dm_db_index_usage_stats