Long-performing queries on a single table

I have a table that is being used as a stack. There are around 200 connections to the server, any one of those connections, at any given time, may insert a record into the table. The table is highly volatile, records are being inserted and deleted all the time. It rarely contains more than 15-20 at any given time.
I have a process that runs every second. It connects to the server, runs a select on my stack table, and disconnects.
For years this has been running fine, with the process taking < 0.5 seconds to pull a record from the table. Of late, I am noticing intermittent timeouts (frequent) on this select. Any suggestions on where I can start troubleshooting this?
This is the select statement (the "id" column is a decimal, everything else is nvarchar):
@T_ID = id,
@T_TYPE = [type]
machinenum = @T_MachineNum
datetimepr <> 'INPROGRESS'
[date] = replace(convert(char(8),getdate(),11),'/','')
[time] <= replace(convert(char(8),getdate(),8),':','')
priority desc, [time] asc

And attached are the results of subsequent statistics running the same query.

Help! :slight_smile:

The server is an AWS instance. The client that's having the problem is another AWS instance in the same subnet.
If I simply run "select * from stack with(nolock)" — I get a consistent return in about 420 ms. Always, from everywhere, at any time. As soon as I add a where clause to it — I get the problem described above.

You'd need to change how the table is clustered. Specifically, cluster it to meet the query with a single seek and scan from that point (the scan will be needed since you have a "<>" condition).

I am clustered by [date]. I have additional non-clustered indexes on [time] and [priority]. Right now this select (in actual execution plan) performs a single clustered index scan that takes 22% of the overall execution, the remaining 78% being in the sort. So I am running a single scan. It runs in 460 ms one time, and in the very next second it can run for 26000 ms with absolutely identical criteria.

Every second is extreme (and super-extreme if you're genuinely making a new connection every time to run it; just DELAY for one second and run it again).

About all I see is to cluster on:
( date, machinenum, time )
and see if you can get a seek predicate for at least the date and machinenum in the query plan. SQL will still have to scan from that point, but it hopefully will still cut down the load.

Every second is actually slow — this is a telephony application where every 1/10 of a second counts, I'd run it more often if I could. Thanks for the suggestion, I'll try that. What worries me though is that this is not consistent behavior — at times it runs very fast, at other times — very slow. I find it difficult to reconcile this behavior with the clustered index setup — wouldn't it be a lot more consistent if it had an index problem?

I would look here first - is your [date] column date or datetime data type? Or is it character data - if so, why? Is [time] defined as a time data type - or is it integer, char, something else?

If your [date] column is a date data type - the above code could be causing a cardinality issue due to the implicit conversion. You might be able to avoid that with:

[date] = cast(getdate() as date)

Or - create a variable and use that in the code:

DECLARE @currentDate date = current_timestamp, @currentTime time(0) = cast(current_timestamp As time(0));

FROM ...
WHERE [date] = @currentDate
AND [time] <= @currentTime;

If all of your queries filter on the machine num, clustering on machinenum, [date], [time] should improve the scan.

Legacy app writes date and time as 8-charcter string each, I need to work within those limitations. So the columns are char data. I will recreate the clustered index as you and Scott suggested and see. Still, I don't see how this will address the consistency issue. My queries are not consistently performing poorly, but intermittently. I am trying to figure out how to pin down the cause of that "intermittance". I have a feeling that some other factor may be affecting it, but when I look at locks they are all shared, I can't see what it may be waiting on. Especially on a select from such a small table.

Even with NOLOCK, certain things can cause a delay.

First, make sure it's not a new connection every time, that would be huge overhead.

Second, verify that the log file is not dynamically growing. Log growth stalls all db processing.

Again, the idea of the clus index is to get a seek that eliminates some number of rows -- hopefully a majority -- from having to be read, esp. given that it runs every second. The required sort is also a relatively expensive operation, make sure it's not doing too large a memory grant nor an insufficient memory grant that forces it to page out the sort to disk. Typically that would require a large number of rows, but with NOLOCK you may see a lot more rows than a typical SELECT would see because they are in the process of being deleted, have moved and you've read them again, etc..

Finally, maybe reorg/rebuild the table every 5 minutes, give or take, if you can do ONLINE rebuilds.

Changing the clustered index had no effect whatsoever on the performance. :frowning:

I hear you on the connection thing, that's something I need to work on. However, the testing I did to produce these results in screenshots was done from Query Analyzer without recreating a connection, just hitting F5 time after time. And that still produced wildly different execution times.

Good point about the log file, but that would most likely affect every one of dozens (if not hundreds) of operations per minute on all the tables in this DB, and I am seeing a problem literally with only this one SP.

I can do rebuilds, every 5 minutes, I guess, problem is that the table never grows beyond 20 rows (30 at most). That would include rows being deleted, inserted, updated, etc. With such a tiny data collection and index, I fear the rebuild operation will introduce more of a delay than it saves.

There's another issue with overindexing here — whatever I may save on reads I may loose on inserts. I got a caller on the phone, waiting for the insert/read combination to be complete, and one second is a long time to wait on the phone.

If the table is only 20-30 rows, a reorg or online rebuild should finish almost instantaneously. The advantage might be in compressing data because a row or two were in a page by themselves for some reason.

What are the specs of the server? Ram etc
What recovery model is the db?

Quad core CPU with no processor affinity (<10% utilization in windows task manager), 2 GB SQL memory cap in the SQL Server properties, Boost SQL Server Priority ON, Full recovery model, T-log backup every 15 min, Diff backup every 6 hr, full every 24 hr.

Is this a physical or virt and what kind of disk? Ssd?

AWS instance, SSD, multi-instance SQL Server install

Is there some sort of throttling happening by AWS settings? Are you tipping over with some connection and AWS is throttling ? Intermittent does not sound due to indices

Yeah, you should of mentioned AWS to start with. That's a whole other array of delay possibilities.

Multi-instance? How are those configured?

Also - turn of 'Boost SQL Server Priority' - it is a deprecated feature and could be causing issues.

1 Like

Also have you thought of rearchitecturing and not use SQL server at all but opt for something that is more high throughput friendly :slightly_smiling_face: insert every sub seconds for 15 to 20 rows max at any one time does not scream SQL server imho.
Unless you use SQL server for something else also

If it's SQL 2012, you can't use memory-optimized tables. I'd suggest you do so if you could.

Does Amazon provide some type of RAM drive for SQL? You could maybe dedicate a small amount of RAM to that table. With the activity it has, it would be worth it.

SSD might be problematic with that many modifications going on.

1 Like