Large Table Select Performance

bigeldm · October 11, 2018, 11:00pm

I inherited a process that has suddenly started to fail.
There is table that has 7126226 rows in it and the process is trying to find the min(createdate) for each rowid where a certain criteria is met.

There is a non-clustered index on the rowid and the table is not in rowid order.

This query is run 5 different times against the MDM_C_PARTY_HIST table matching to different tables and fields each time.

Here is one of the queries - in this case the PH.ROWID_OBJECT and the BU.ROWID_OBJECT have indexes but the PH.ROWID_BU does not.

SELECT
MIN(PH.HIST_CREATE_DATE)
FROM
dbo.MDM_C_PARTY_HIST AS PH
JOIN dbo.MDM_C_LKP_BU AS BU ON PH.ROWID_BU = BU.ROWID_OBJECT
WHERE
BU.BUS_UNIT_NAME <> 'PENDING'
GROUP BY
PH.ROWID_OBJECT

I have left this query run for 25 min and gotten no results. When I look at the query plan it says it\ estimates 580993000 rows.

Here is the question:
I am not sure the best approach for getting this to work.

Do I make the PH.ROWID_OBJECT index clustered instead of nonclustered?
Do I add an index for each field that the the BU.ROWID_OBJECT is being joined to on the PH table?

I am worried that adding more indexes will affect the load process. This table already takes more than an hour to load. If I add 5 more indexes, or add a clustered index won't that just show it down even more?

Is there a better approach to take ? I am at a total loss for how to address this.

Thanks

yosiasz · October 11, 2018, 11:22pm

as a starter without yet creating indices
which of these 2 tables has the least rows now
MDM_C_PARTY_HIST or MDM_C_LKP_BU

AndyC · October 12, 2018, 9:49am

Tables without a clustered index (aka heaps) generally don't perform well in SQL Server, particularly if the data in them changes and you end up with large numbers of forwarded records. Adding an appropriate CI is often a good solution particularly if there is something which meets the usual criteria of "small, incrementing and unique" as that will usually be sufficient to ensure performance doesn't take too much of a hit during inserts of new data.

If not, rebuilding the table can help because doing so will remove all forwarded records that currently exist:

ALTER TABLE tablename REBUILD

If that does help though, it's probably a sign you need to monitor such heaps for forwarded record counts and factor the need to rebuild the table into your maintenance procedures.

ScottPletcher · October 12, 2018, 2:28pm

The most critical performance factor for any table is to first get the best clustered index in place on that table.

Drop all nonclus indexes on table PH
Create a clus index on PH on ( ROWID_BU ), with FILLFACTOR of 95 or more
Recreate any other nonclus index(es) that you truly need on PH, but none that start with ROWID_BU since that's the clus index key
Drop all nonclus indexes on table BU
Create a clus index on PH on ( ROWID_OBJECT ), with FILLFACTOR of 95 or more
Recreate any other nonclus index(es) that you truly need on BU, but none that start with ROWID_OBJECT since that's the clus index key

bigeldm · October 12, 2018, 9:06pm

[yosiasz] MDM_C_LKP_BU is smaller - it only has 57 rows
the other table has 580993000 rows

bigeldm · October 12, 2018, 9:11pm

Hi AndyC - thanks for responding - I dropped all the indexes on this table and recreated the non-clustered index about a week ago. As I understand it that would mean there isn't a lot of forwarding happening correct? Or does the forwarding not get addressed in this manner?

ScottPletcher · October 12, 2018, 9:52pm

No. Again, non-clustered index is only useful in very specific cases. For best performance, you must cluster the large table on how you query it most often. You could spend hours futzing about with non-clus indexes and still not really get any performance gain from it.

bigeldm · October 12, 2018, 9:57pm

ok thanks this is very helpful

AndyC · October 13, 2018, 7:40pm

No, it wouldn't help with the forwarded records problem. Unlike a table with a clustered index (where all the data is effectively in an index) the table data in a heap is independent of the indexes. The only way to clean up the forwarded records is to rebuild the table.

btrimpop · November 2, 2018, 6:34pm

I'm not disagreeing about clustered indexes, but there is a caveat on clustered indexes. Clustered indexes physically reorder the table. If it is a table that has lots of inserts/updates to the clustered key, then you can hit performance walls in other parts of the application as the table is reordered to fit in the new clustered key..... I've had to actually remove clustered indexes from tables due to that problem. Queries ran fast, but data entry/edit/processng that hit the table hard died a horrible and painful death. I've gotten acceptable query performance by adding appropriate non-clustered indexes and good entry/edit/processing performance by removing the clustered index....

ScottPletcher · November 2, 2018, 7:00pm

Yes, all INSERTs will add a clustered key. But UPDATEs is not normally done to clus keys, and could cause big problems, esp. if it's more than a one-time update of the clus key when, say, the row is deactivated.

The (entire) table is not reordered. Page splits can occur, but there are certain techniques to help reduce those as well. The most extreme is to pad out the row so that one row takes a full page -- no page splits then!

I'd have to look at the details of that particular table and its usage, but almost all the time the best-chosen clust index improves overall performance against the table much more than just non-clus indexes. I use heap tables virtually only for staging tables.

btrimpop · November 2, 2018, 7:20pm

Right, As I said, I'm not disagreeing, just something to be aware of depending on the application and usage of the table. If it is a large heavily modified table where modifications impact the clustered key, whether inserts or updates, you can see performance issues in other parts of the application.

AndyC · November 3, 2018, 5:10pm

It's certainly true that selecting the right clustered index is important and you definitely don't want to be clustering on something you are likely to update. That's one of the reasons that primary keys are often selected for clustering, though that is not always the ideal choice.