Using count before top is a good practice

sapator · November 29, 2017, 11:59am

Hi.
I wanted to ask if using count before a top select is a viable solution.

I have a heavy unindexed database that need to select some data.
I was thinking of using a count to find how many rows are selected and then use a Top n select for the rows.
Is this something that is advised or not?
From the previous question I got another question that is, will a select top that gets all row but missing (rows -1) is slow?
I mean if I have a select that brings top 10 of 10 million rows but the rows from the criteria are only 9. Then will the top 10 bring the 9 rows and iterate through all the database in order to find one more row, that will never be found? If so, getting the 9 out of 10 rows and missing the 1 row, is extremely slow on millions of rows vs getting all 10 rows and apparently stopping the search?

Thanks.

Kristen · November 29, 2017, 2:05pm

SELECT TOP 10 from a table with millions of rows, where only 9 can be found, will depend on the indexes available.

If the Criteria (WHERE clause etc.) for the query is "covered" by an INDEX then finding 9-only, or first-10, will be very quick.

If there is no suitable index then SQL will use a Table Scan - i.e. check each row in turn. Yes, finding 9-only will be slow, but finding 10 may be slow too - maybe the 10th, and last, record to be found is right at the "end" of the table scan? So it will be chance whether SQL finds all the rows quickly (i.e. they happen to be at the start of the table scan), or not.

The best answer is "Create an index so that the Query is always fast"

If your query uses a table scan then it is always going to check all rows to get a COUNT()

I would not want to run the query twice - to get the COUNT and to get the TOP 10 - although when you do the COUNT that will load the Cache, so the TOP 10 query might then be faster - if the data is already in Cache.

I would try doing a COUNT and TOP 10 in one query, and comparing that with the speed of just doing the TOP 10 - maybe it will be very similar.

So maybe compare

SELECT [TotalRows] = COUNT(*)
FROM MyTable
WHERE TestCol1 = 123 AND  TestCol2 = 'XYZ'
--
SELECT TOP 10 Col1, Col2, ...
FROM MyTable
WHERE TestCol1 = 123 AND  TestCol2 = 'XYZ'

(NOTE: I presume you have NO Order By, because that would require a complete table scan and THEN sort to find the TOP 10)

with

SELECT TOP 10 [TotalRows] = COUNT(*) OVER()
		, Col1, Col2, ...
FROM	MyTable
WHERE TestCol1 = 123 AND  TestCol2 = 'XYZ'

One way to check the relative "cost" of each query is to wrap it with

SET STATISTICS IO ON; SET STATISTICS TIME ON

... your query here ...

SET STATISTICS IO OFF; SET STATISTICS TIME OFF;
GO

and compare the Scan Count and Logical Reads (NOT the Physical Reads).

You can clear cache (before each test) with:

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

but you should not use those on a Production Server

sapator · November 29, 2017, 2:22pm

Hi.
This is very helpful.
Can you please explain what I'm trying to accomplish with just using the OVER() without any arguments.
I'm a little lost at this part.

Thanks

sapator · November 29, 2017, 2:53pm

Hi.
OK so with over you take the total rows count. Why do we do that?
I assume it's 2 in one check?

So anyhow For the first two I get 1 scan count and reads 294 and 258 accordingly
and for the last one with the over, i get 3 scan count 3 logical reads 199 and scan count 1 logical read 294.

I'm not sure if more scans and less reads is better than less scan and more read.
Also I'm not sure what this means:
Table 'Worktable'. Scan count 3, logical reads 199, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

I have no 'Worktable'. What is this Worktable?

Thanks

sapator · November 29, 2017, 3:19pm

OK after a little read.
I guess scan count must by low and logical reads high but I'm not sure what is the better performance on these queries
From the looks of it. 1 scan count and 294 logical reads look better. So the simple count() is better than the top 10 without the over and the last one is very bad as it is 2 in one and we have both a count() and a top 10 that is adding an extra scan count.

No i was wrong. Actually logical reads must be low not high.
I don't think i need to go down so deep for this one.I'm creating more problems than solutions here

ScottPletcher · November 29, 2017, 5:01pm

If you've got a table with millions of rows that you need to search, you should create an index on the data. If at all reasonably possible, you'll want to use a clustered index. If you want to explore that further, provide more details on the specific columns and search conditions and we can design appropriate index(es).

If you can't / won't build an index, then just select the data you need to begin with. Counting first doesn't gain you anything. And, yes, a "TOP (10)" where there are only 9 rows would search to the end of the data, but since you can't know how many total rows match until you search all the data, you still lose by counting first, finding the count is 9, then limiting it to a "SELECT TOP (9)".

Kristen · November 29, 2017, 6:47pm

Yes, that was my plan. But ... I have no idea if two separate queries (one for the Total Count,. and the other for the TOP 10) would be more efficient. I was just wanting to show how you could do it in one query.

I was assuming that you want to tell the user "Here are the first 10 rows (out of a total of 999 matching rows)"

If that's not the case then, as Scott said, there is no benefit to doing a COUNT.

There's not a lot to choose between them, but on the face of it 294+258 logical reads is less than 199+294, so "probably" the second one is more efficient.

It would mean that SQL created some sort of temporary table (either because your code used one, or because SQL decided some "temporary working space" was the best way to solve the query)

If you have a SCAN COUNT higher than 2, and you then try adding an index, and then the SCAN COUNT comes down (maybe "comes down A LOT"), then that is a good indication that the efficiency of the query has improved.

If you have no indexes then you will probably only see SCAN COUNT = 1 - SQL makes one pass of the whole table. So on that basis a single pass, with fewer Logical Reads, suggests "more efficient" / "less work" [than one with MORE Logical Reads].

If you add an index it becomes more complicated, because the figures may look good but on closer investigation SQL is NOT using the index that you expected. e.g. SQL uses the PKey Clustered Index, rather than your Really Cool Index The solution is to make your Really Cool Index even more Super Cool (e.g. add more columns, or add Include Columns ... or perhaps decide that the Query cannot be improved by an index.

Once you get to that point just looking at Scan Count and Logical Reads is NOT enough, other more details analysis is needed.

But that won't matter if you can NOT add more indexes to the table.

sapator · November 30, 2017, 11:23am

Thanks both.

Yes I cannot add more indexes, or do be more precise, we need to file a ticket to the support company that builds the database and the explain why we need to do so etc etc and , yes in about a year , maybe we will get an answer of if this is doable but we do not do that for minimum issue like the one I am having.

Kristen · November 30, 2017, 11:27am

Restore a Backup of the database as an "Enquiry Version" (e.g. each night) and then run a script to create the indexes which would help the reports?

Would that be an option?

sapator · November 30, 2017, 12:30pm

Yeah it depends as the databases are about 100GB to 30GB but this is supposed to run inside a reporting software we are making and is somewhat live by the minute. I wouldn't mind loosing a couple of minutes but I can't have a blank previous date (as per backup solution). I wouldn't get that far to be honest. It's not as heavy as other reports that we had issues and we where discussing index fixes here.
Anyhow I would try to find some more criteria for the search, as using a datefrom - dateto, I guess that should limit the search enough.
But it is always interesting on learning new stuff as the above

Kristen · November 30, 2017, 12:52pm

There are other ways to get a "reporting database" - you can "replicate" the changes from LIVE to REPORTING in real time, or near-real-time, but still have different indexes on the Reporting solution. That will avoid the downtime of restoring PRODUCTION to REPORTING, and also prevent the REPORTING data being "stale"

Quite a lot of work involved in building such systems, and maintaining them though ...

... but ...

... running slow reports (i.e. slow because of missing indexes) can have catastrophic effects on a Production system - e.g. Locking / Blocking of updates - so there is the risk that once your Management Information Reporting Solution becomes popular you will have to move to a Reporting Database solution too ...

sapator · November 30, 2017, 1:49pm

It is as you say (for the slow system part as for the reporting replication I have no knowledge at all how to test or implement).
The locking and blocking of updates had happened a lot unfortunately ,
as we are not able to change indexes because we are bound to a company contract and they are reeeaally slow on changing core stuff on the databases.
We have moved all the heavy data insert jobs --> accounting - sap system <-- at night time and we constantly checking for issues. We can get away with some problems in reporting not showing data. The most important issues are the accounting and the live insert -- update of data.
This is why sometimes i use 'read uncommitted' selects so if we are to loose something, we are better of loosing a report run.
We have a main database that collects data from other databases but that is the night collection database.We can do pretty much whatever we want there but it still won't be live data.
The issues we are facing are the live report runs (that, as I've said we can get away a little) and not to impact the whole live operation when run. This is a constant fear and we check again and again for live data loss on insert -- update.
The other issue is the analysis server that is running for about 2 hours when building cubes.This is another issue but I have asked here a while back on any thoughts on how we can start troubleshooting this and I got no answer (I understand that this is a complete different monster of course).

So bottom line am kinda bound on the live database and not bound on the main night collect database

Kristen · November 30, 2017, 3:13pm

I understand that some reports will always need LIVE data ... but as I tell my people here "Look in that tray, there are 100 documents waiting to be entered. They will be entered by 5PM ... should you wait to run your report until then to be sure you have all the latest data?"

Also, whilst data is being entered the records are is changing. Press REFRESH on your report and the bottom-line will change - because someone has entered a new transaction. So you cannot use your Excel from 5 minutes ago with a different data set you have just reported on, because they are not in SYNC.

But if you use last-night's-data then it will be consistent, for the whole day.

Sometimes they understand what I am talking about, and are persuaded that a static dataset is better than a more up-to-date, but changing, one.

But of course if you must have LIVE-data then that is it.

Only other thought:

If you can SEARCH on Last Nights Data and then REPORT based on ONLY those records, you can JOIN the Reporting Table PKeys to the Live database Table to get live data. That is a Clustered Index Join and will be fast, so you can use Slow Search on REPORTING and fast enquiry/reporting on LIVE.

Of course that will only find data that existed last night (and has not changed, today on LIVE, in any of the search criteria columns).

Anyway : Just a thought

I expect you know, but READ UNCOMMITTED may include some row twice, and some rows not at all - as well as giving your Dirty Data of course. So totals are unreliable.

Many people expect the Dirty Data (which is rarely an issue for anyone as usually data is not rolled back) but the rows-included-twice and some-rows-missing can be an unexpected surprise to developers who are not aware that that happens (and is not infrequent).

sapator · November 30, 2017, 3:21pm

Hi.
Thanks for the food for thought.
Yes read uncommitted is used on reports that are not, may not be 100% accurate.