Should i used date limit on a clustered index table?Also 2c cindex?

sapator · March 24, 2017, 9:06am

Hello.
2 questions.
We have a table. We use a cluster index on these 2 columns
C1 C2
A 1
A 2
A 3
A 4
A 5
A 1
A 6
B 1

As you can see the first columns has mostly the same data and the second column has mostly different data.
The problem here is that to have a unique row selection you need both columns C1 and C2.
So we are creating a cluster index using both columns(cluster index C1 - C2 , with that order).
Is this OK?

Secondly.
Is there any difference (as the data grows) to do this:
Update table1 set x = 2, y=3 where C1=A and C2= 2 and Mydate between @dateFrom and @dateTo

rather than this:

Update table1 set x = 2, y=3 where C1=A and C2= 2

Would we have any performance upgrade on large data if we restrict the data to date periods rather than just using the cluster index?

Thanks.

ScottPletcher · March 24, 2017, 2:41pm

It depends on how many duplicates of C1 and C2 there are with different dates.

You could add dateFrom the clus index as well:

( C1, C2, dateFrom )

sapator · March 24, 2017, 2:50pm

Hi.
It's about 2 C1 valules
and guess it is about 10000 values per day on C2.

ScottPletcher · March 24, 2017, 2:51pm

So on avg you'll have 10K rows with dup key values, such as ( A, 1 ), etc.? In that case, if the rows are narrow, don't worry about it, if they are wide, add date.

sapator · March 24, 2017, 2:53pm

Hi.
What does it mean narrow and wide sql wise?
Thanks.

ScottPletcher · March 24, 2017, 4:01pm

Narrow = not too many bytes.
Wide = many bytes.

For example, a narrow row might be 100 bytes, and a wide row would be 1200 bytes.

sapator · March 25, 2017, 6:59pm

Thanks.

So this is an indexing standard?

I mean adding a date when rows have too many bytes? Does it include having too many columns ?
In my case the rows are Narrow but they consist of about 40 columns per row.

ScottPletcher · March 27, 2017, 4:33pm

I wouldn't call it a "standard". It's more about doing what's practical and leads to best overall performance for the current table data and structure.

40 columns is not bad, particularly if they are narrow. I've seen tables with 200+ columns!

sapator · March 28, 2017, 6:54am

Thanks.

BustazKool · March 28, 2017, 8:31pm

Here is another approach to consider. When using multiple columns in an index, it is helpful to use the most "selective" columns first. In this case, since C2 has the most variety, it should be the first column in the index. In an extreme case, if C1 was the first column in the index and was almost always one value, the index would likely never get used.
Note, also, that if your clustered index is not declared to be unique, the SQL engine will add a uniqueifier (is that a word?) to your index under the covers which will make your index wider and have some level of negative impact on performance. How negative? Only testing will show. Your sample data shows non-unique entries but if they are, in fact, unique, make sure that you define the index to be unique.

sapator · March 29, 2017, 10:59am

Hi.
The index is unique , the columns used is first C1 and C2 .
C1 has 2-3 values max but the ratio is 60-40. Meaning 60% C1 -A- and 40 C2 -B-

As per execution plan, if i do a select * i get a cluster index scan.
This may probably be because the table is relatively small right now (about 90.000 rows)?

thanks

ScottPletcher · March 29, 2017, 2:13pm

The first column in the clustering index should be the one you always / most often use to do look ups. If there is more than one column, and you always use both, then put the more selective one first. But, if you always search by c1 and only, say, 40% of the time by c2, then put c1 first, regardless of selectivity.

sapator · March 29, 2017, 9:45pm

No.
Sorry my bad,
I mean C1 has 2 values in ration 60-40.
C2 has 90.000 values and growing,
C1 and C2 form the index and I need both to make it a unique value (that is specified as unique in the double index also).
C1 is first on the index and c2 second,
Sorry for the misunderstanding,

ScottPletcher · March 29, 2017, 10:30pm

Do you (almost) always specify both values when searching the table?

If so, then your current clustered index, on ( c1, c2 ), is fine.

BustazKool · March 29, 2017, 10:42pm

Scott raises a good point. If your other queries filter or join on just the C1 value, you'd definitely want indexing that would be used. It might be the clustered index or another non-clustered index. In the context of the original query and index though, the leading C1 is, in essence, a binary value which makes it difficult for the SQL engine to find that useful. If, on the other hand, the C2 value is first in the index column list, it can efficiently locate the fairly unique value. The next question becomes, should the C2,C1 index be clustered or non-clustered? The answer, as Scott has pointed out, is a function of how often a C1 only query gets executed. If most of your queries use both C1 and C2, then I would take the time to re-create the clustered index using the more selective column (C2) first.

sapator · March 30, 2017, 6:49am

Hi.
Yes both C1 and C2 are always used on the query.
So from what I understand, C1 C2 should be the cluster index but C2 must be declared first on the index.
Correct?
Thanks.

Kristen · March 30, 2017, 10:52am

If C2 is more "selective" then YES.

I'm expect this is obvious but:

SELECT COUNT(*) AS [Count], COUNT(DISTINCT C1) AS [C1], COUNT(DISTINCT C2) AS [C2]
FROM YourTable

will tell you how many different values you have for C1 and C2 (and the total number of rows).

If either C1 or C2 is the same as COUNT then its unique, so perfectly selective. If not then pick the one with the bigger number and put it first in the index.

sapator · March 30, 2017, 11:24am

Yes they have the same count.
So I should just leave it as is?
Thanks.

BustazKool · March 30, 2017, 3:24pm

Um....what?
Or more correctly, what are three values returned? Your earlier responses indicated that C1 had essentially two values but that C2 had about the same number of different values as there were entries. Based on that, I'd expect Kristen's query to return 90000, 2, 90000 for the counts. If that is true and you always use C1 and C2 in your queries, then you want to re-create the index with C2 as the first column.

ScottPletcher · March 30, 2017, 4:21pm

As I stated before, since you always specify both, your existing index is fine and probably even preferable to changing it. You are always doing a full keyed lookup anyway, so the real concern is to reduce fragmentation. Based on what you're described, ( c1, c2 ) should be less fragmented than ( c2, c1 ), perhaps significantly so.