Randomly select records based on a condition

Hi,

I am using this SQL statement to get random records from a table. Is this a good method for getting random records? Any suggestions?

SELECT TOP 10 *
FROM GameConfirmResponses
WHERE Status = 0
ORDER BY NEWID();

Yes, that's probably the best available technique.


Unfortunately, that means the entire table is randomly sorted first, and only after that are the TOP 10 rows selected.

What is the Primary Key of that table? A "Random Stab'n'Grab" method can be far superior if the table is large, depending on the PK, of course.

[Id] [int] IDENTITY(1,1) NOT NULL,

Ok, then try this...

First, create this filtered, non-clustered index. Some would say the INCLUDE isn't necessary if the ID column is the Clustered Index key, but can you guarantee that will never change? You might think so, but you can't: someone might later decide a different Clustered Index would serve other queries better, even though the PK itself doesn't change.

--===== This FILTERED Non-Clustered Index (NCI) is much narrower
     -- than the Clustered Index (CI).
 CREATE INDEX By_Status ON dbo.GameConfirmResponses (Status) 
INCLUDE (ID) 
  WHERE Status = 0
;

Then, change your query to this...

   WITH cteGetRandomIDs AS
(--==== This gets 10 random IDs from the narrow NCI
 SELECT TOP 10
        ID
   FROM dbo.GameConfirmResponses
  WHERE Status = 0
  ORDER BY NEWID()
)--==== This surgically extracts the rows from the wider CI.
 SELECT *
   FROM dbo.GameConfirmResponses src
   JOIN cteGetRandomIDs rdm ON src.ID = rdm.ID
;

In my testing on a 20-million-row table where the CI had an average row width of 366 bytes, this reduced logical reads from 939,490 to just 8,270 (two orders of magnitude more "memory friendly"), cut CPU from 6,311 ms to 2,610 ms (more than twice as fast), and cut duration from 556 ms to 297 ms (almost twice as fast).
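If you'd like to reproduce that kind of comparison on your own table, SET STATISTICS IO, TIME is a simple way to do it. This is just a sketch using the table and column names from this thread; your numbers will, of course, differ:

```sql
--===== Measure logical reads, CPU time, and elapsed time for both versions.
     -- Run in SSMS; the statistics appear on the "Messages" tab.
SET STATISTICS IO, TIME ON;

--===== Original version: assigns a NEWID() to every qualifying row and sorts.
 SELECT TOP 10 *
   FROM dbo.GameConfirmResponses
  WHERE Status = 0
  ORDER BY NEWID();

--===== "Stab'n'Grab" version: random IDs from the narrow NCI first,
     -- then a surgical extraction of the full rows from the wider CI.
   WITH cteGetRandomIDs AS
(
 SELECT TOP 10 ID
   FROM dbo.GameConfirmResponses
  WHERE Status = 0
  ORDER BY NEWID()
)
 SELECT src.*
   FROM dbo.GameConfirmResponses src
   JOIN cteGetRandomIDs rdm ON src.ID = rdm.ID;

SET STATISTICS IO, TIME OFF;
```

Compare the "logical reads" lines and the "CPU time / elapsed time" lines for the two statements to see the difference on your own data.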

I don't know how many times a day your code will run, but those seemingly small savings in CPU and duration can very quickly add up to something substantial. The savings in logical reads matter even if the code runs just once a day, because it means that stuff that runs more often will stand a better chance of not getting knocked out of memory. And, of course, if this code runs a lot during the day, it stands a better chance of staying in memory itself, because it uses so little after the suggested change above.

If the table is tiny, I'd still do it the way I propose above, to prevent one of the many "Death by a Thousand Cuts" that a lot of databases suffer from, getting slower and slower because of excuses like "it's only going to be on small tables". :yum:


An explanation backed by numbers. Thank you, @JeffModen.

Aye. And thanks for the feedback, @Cenk . Much appreciated.