Hi all, I have to refactor a database with several tables of around 100,000,000 records each.
Each table has the following relevant columns:
ID (bigint)
Tenant (I currently have 10,000 tenants)
Year (every search always filters by year)
another 10-20 simple columns
I was thinking of having:
a nonclustered primary key on ID
a clustered index on (Tenant, Year, ID)
other accessory nonclustered indexes
Alternatively, I was thinking of partitioning each table by year.
What do you think would be the optimal architecture for this database?
All queries filter by tenant and year.
All updates are by ID.
Growth will be around 8,000,000 records per year on many tables.
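In DDL terms, the first option would look roughly like this (the table name `dbo.BigTable` and the column list are placeholders, just to show the shape):

```sql
CREATE TABLE dbo.BigTable
(
    ID     bigint   IDENTITY(1,1) NOT NULL,
    Tenant int      NOT NULL,
    [Year] smallint NOT NULL,
    -- ... another 10-20 simple columns ...
    CONSTRAINT PK_BigTable PRIMARY KEY NONCLUSTERED (ID)
);

-- Clustered on (Tenant, Year, ID) so tenant + year queries read a contiguous range:
CREATE CLUSTERED INDEX CX_BigTable
    ON dbo.BigTable (Tenant, [Year], ID);
```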
Keep the clustered index on ID, because the clustering key is included in every nonclustered index.
Put a nonclustered index on (Tenant, Year).
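As a rough sketch of that layout (object names are just examples):

```sql
-- Narrow, ever-increasing clustered key on ID:
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable PRIMARY KEY CLUSTERED (ID);

-- Seek index for the tenant + year filters; ID rides along
-- automatically because it is the clustering key:
CREATE NONCLUSTERED INDEX IX_BigTable_Tenant_Year
    ON dbo.BigTable (Tenant, [Year]);
```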
Do analytical queries get run against this table, and is most of the older data fairly static, rarely getting updated?
If so, it may be a good candidate for a columnstore index.
If not, it may be a good candidate for conversion to a partitioned view, where you split the single table into one table per year and apply a CHECK constraint on the year column in each.
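For example, a minimal partitioned-view sketch (table names, years, and the short column list are illustrative):

```sql
-- One base table per year, each with a CHECK constraint on Year:
CREATE TABLE dbo.BigTable_2023
(
    ID bigint NOT NULL PRIMARY KEY,
    Tenant int NOT NULL,
    [Year] smallint NOT NULL CHECK ([Year] = 2023)
);

CREATE TABLE dbo.BigTable_2024
(
    ID bigint NOT NULL PRIMARY KEY,
    Tenant int NOT NULL,
    [Year] smallint NOT NULL CHECK ([Year] = 2024)
);

-- The view unions them back together; a query with WHERE [Year] = 2024
-- only touches dbo.BigTable_2024 thanks to the CHECK constraints:
CREATE VIEW dbo.BigTable
AS
SELECT ID, Tenant, [Year] FROM dbo.BigTable_2023
UNION ALL
SELECT ID, Tenant, [Year] FROM dbo.BigTable_2024;
```

Note that if you want the view itself to be updatable, SQL Server additionally requires the partitioning column (Year here) to be part of each base table's primary key.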
My recommendation is that you try several methods and test how they work for you in your environment.
Kimberly Tripp has several blog posts and videos about database design strategies for very large tables over at https://www.sqlskills.com and in various channels all over YouTube.
If I have a clustered index on (Tenant, Year) and a nonclustered primary key on ID, is that still a good setup?
From what I understand, the clustered key is appended to every nonclustered index, right?
In that case, if ID is an identity with auto-increment, will I see any bad effects?
@SqlPippo in my application I have to display, in many places, content filtered by Tenant + Year. Considering that the number of columns in the SELECT will probably increase, isn't it better to have a clustered index on Tenant + Year so that all records are available without needing a lookup?
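As I understand it, with the clustered PK on ID I would instead need a covering index like this to avoid lookups (made-up column names), and it would have to grow every time my SELECT list grows:

```sql
CREATE NONCLUSTERED INDEX IX_BigTable_Tenant_Year_Covering
    ON dbo.BigTable (Tenant, [Year])
    INCLUDE (ColA, ColB, ColC);  -- every newly displayed column must be added here
```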
One last question: what about deriving the tenant from the ID value? For example:
IDs from 1 to 1,000,000 = tenant 1
IDs from 1,000,001 to 2,000,000 = tenant 2
That way I could build a clustered index on (ID, Year).
Do you think this is a good approach?
We can recommend things for you to test, but the only way to know what will work best for you is to try each solution taking into account maintainability, usability, and performance.
My wife goes to a store to buy running shoes. The employees are all runners and have experience running in many brands and styles of shoes. They can make recommendations based on their experience, but my wife still has to try the shoes on in the store before buying them, and even then still has to wear them for a while to find out if they work for her.
The point of that is to say that nothing can take the place of implementing a solution, testing it against your workload, measuring the performance, and learning what it feels like from a usability standpoint to see if it's going to be right for you.
As for the idea of assigning ID ranges to different tenants, what would the goal be? Improving maintainability, usability, or performance? Can you implement a test of that hypothesis to see if it can be confirmed?
@SqlHippo sure, I'm just gathering some high-level ideas, and I will run all the tests in my own scenario.
Can you explain one thing? I understand that, as a best practice, the clustered index key should be small and ever-increasing.
If I have (TenantId as int, Year), every new record is not sequential; for example, I might get an insert from tenant 2, then one from tenant 123,456, and then tenant 2 again. Am I wrong?
The clustered index should be relatively small as it is used in all the non-clustered indexes.
Whether the clustered index should be incremental depends on what you want to do with it. Generally an incremental clustered index helps because page splits then only occur at the end of the index ("good" page splits). In a highly concurrent environment, though, everyone inserting at that single hot spot can cause blocking problems, so it may be better to lower the fill factor and use a non-incremental clustered index that spreads the inserts throughout the table. You also need to take into account how the data is queried. It can get rather complicated and requires a lot of testing.
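For instance, a lower fill factor can be applied when rebuilding the index (the index name and the percentage here are just examples to illustrate the syntax):

```sql
-- Leave ~20% free space on leaf pages to absorb mid-index inserts
-- between rebuilds, reducing "bad" page splits:
ALTER INDEX CX_BigTable ON dbo.BigTable
    REBUILD WITH (FILLFACTOR = 80);
```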