Should I use an index on this?

sapator · February 27, 2018, 9:28am

CREATE TABLE [dbo].[EOrderReminderConsent](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[BookingId] [nvarchar](50) NOT NULL,
	[ShowTime] [datetime] NOT NULL,
	[Email] [nvarchar](150) NULL,
	[EmailSend] [smallint] NULL,
	[RetryStatus] [smallint] NULL,
 CONSTRAINT [PK_EOrderReminderConsent] PRIMARY KEY CLUSTERED 
(
	[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

So The booking ID is there but rarely used in query selects.
The main use is on the email and ShowTime

Emails are almost always different, ShowTime is different by shows. I mean, suppose we have a show starting at 16:10 and another one at 18:00 then a data sample would be like this:

ShowTime email
16:10 email1@1.com
16:10 email2@1.com
16:10 email3@1.com
18:00 email4@1.com
18:00 email5@1.com

So email constantly changes and session from show to show.
The select would be select email where showtime = sometime

So what seems correct to me is create and index on showtime and included column on email.
That is, I'm note sure about the showtime as it is not a different value always.

Also another question. Suppose that we delete this rows when we send an email, 30 minutes before the show starts. Some shows of course may have been pre paid and the showtime can be days, even weeks or month away, so we would maintain values in the table. But I do not expect the maintained values to be more than 100.000-200.000. Is that worth using an index, considering that this table insert values constantly, so we are slowing the insertions down by using indexes.

Thanks.

harishgg1 · February 27, 2018, 10:01am

Indexes are used on tables with STATIC Data for FASTER Querying Purpose

Hope this answers your question

nigelrivett · February 27, 2018, 10:08am

Short answer is - try it and see.
How are you going to use the data.
From your suggested index it sounds like you want to get all the emails for a show?
From the table it also looks like you update the emailsend and retrystatus?

Maybe you are sending emails at a certain time before a show?
I wouldn't rely on the table being small as they have a tendency to grow.
Consider just an index on showtime - then you can get the IDs and use that to get the data for the email. It would reduce the complexity of the index and stop it growing if you add more columns to the table or need more for the email.

sapator · February 27, 2018, 11:05am

Hi.
When I send the email, it will be delete from a job that will run once a day.
The EmailSend is the key for the deletion. RetryStatus is used from exchange server but just ignore those columns.

A Query would be:

Select email from EOrderReminderConsent
where DATEDIFF(mi,Showdate ,getdate()) < 30

So I can get the shows that start in 30 minutes (hope i wrote the T-SQL correct).
Show an index in showtime and not email? How come?
I thought you should index different values (email) more that similar values
thanks.

sapator · February 27, 2018, 11:06am

Wrote the Datediff wrong. It's DATEDIFF(mi,getdate(),Showdate )

nigelrivett · February 27, 2018, 1:13pm

test it but I wouldn't expect that to use an index (might scan the whole index) on Showdate - try

where Showdate > dateadd(mi, -30, getdate())
or simpler
where Showdate > getdate() - '00:30:00'

sapator · February 27, 2018, 2:16pm

OK.
Thanks.
I suppose DATEDIFF has a potential issue to make the query not sargable?

ScottPletcher · February 27, 2018, 3:01pm

The select would be select email where showtime = sometime

Then cluster the table by showtime first. Don't fall into the horrible myth that "every table should be clustered by identity". That's just ridiculous.

ALTER TABLE dbo.EOrderReminderConsent DROP CONSTRAINT PK_EOrderReminderConsent;

ALTER TABLE dbo.EOrderReminderConsent ADD CONSTRAINT PK_EOrderReminderConsent ( ShowTime, Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [PRIMARY];

sapator · February 27, 2018, 3:35pm

Thanks, why are we adding fillfactor 99?
I read that specifying fillfactors other than default can decrease performance.
Thanks

sapator · February 27, 2018, 3:41pm

Hi.
Just find out that I need to search for the bookingid also on another query.

So

  update EOrderReminderConsent 
  set [RetryStatus] = [RetryStatus] + 1 
  where BookingId = ''
  and Email ='asdsad'

So in general. I will use the first query, an insert and an update .

sapator · February 27, 2018, 3:50pm

I believe that if I where to specify and index then what would make sense is BookingId and Email.
I'm not sure about the showtime. As I've said , showtime is the same for a specific show, so we can have a show with the same time for 100 clients and the next one with 50 clients and the next one with 150 clients etc.
So there would be 3 showtime values for 300 clients.

ScottPletcher · February 27, 2018, 4:02pm

You need to index on the WHERE columns. You could retain Id as the PK, just make it nonclustered.

ALTER TABLE dbo.EOrderReminderConsent DROP CONSTRAINT PK_EOrderReminderConsent;

CREATE UNIQUE CLUSTERED INDEX CL_EOrderReminderConsent ON dbo.EOrderReminderConsent ( ShowTime, Id ) WITH ( FILLFACTOR = 96, SORT_IN_TEMPDB = ON ) ON [PRIMARY];

ALTER TABLE dbo.EOrderReminderConsent ADD CONSTRAINT PK_EOrderReminderConsent PRIMARY KEY NONCLUSTERED ( Id ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [PRIMARY];

I read that specifying fillfactors other than default can decrease performance.

Very vague and ridiculous to apply as a general rule. How would you ever know, for any given db, what the default fillfactor was specified as, and why it was chosen?

From a performance perspective, there is actually no such thing as a "default fillfactor", and certainly no such thing as a "default clustering key". Both should be carefully chosen based on that specific table's needs.

sapator · February 27, 2018, 4:08pm

OK.
I should try to implement these tomorrow.
I'm not certain about the fillfactor. Haven't used it, will have to read about it.

P.S. Still don't know why we are using the showtime as it comes with values as I've said before.
Is there an explanation or a document I can read, because I was thinking that non unique values are best be avoided on indexes.Probably I'm wrong.

Thanks.

ScottPletcher · February 27, 2018, 6:05pm

How do you filter the result set? That is, what column(s) are used in the WHERE clause.

Based on what you said, I took ShowTime to be (almost) always used in the WHERE clause. If so, clustering the table on that will reduce the number of rows SQL has to consider, which will improve overall performance.

nigelrivett · February 28, 2018, 12:11am

It's a tradeoff and may decrease performance.
Clustering on the identity means that the clustered index (which is included in all indexes) is small and there is little reorganisation of data on inserts.
Clustering on other columns can mean that data gets fragmented and cause page splits.
Clustering on a column used in the where clause can reduce the number of reads needed to get the data after a seek but can increase the number of pages read in total.

For something like this I would consider using an index to get the the IDs then using those to get the data rows. That should reduce the temptation to add more indexes and help when updating and reduce fragmentation - and hopefully be fast enough for the requirement.

sapator · February 28, 2018, 8:40am

I'm a little confused about the suggestion though.

There is no use of the id column anywhere, we could just remove it but unfortunately there is no other column or column conjunction that guarantee a unique primary key.So after some extra requirements the most common query would be:

Select email,showdate,ShowTime,EmailSend,RetryStatus
form EOrderReminderConsent
WhereDATEDIFF(mi,getdate(),ShowTime) <=  45 and DATEDIFF(mi,getdate(),ShowTime) >=0
and EmailSend = 0  and RetryStatus  < 300
--or date check with nigelrivett suggestion, did not transformed that yet.

In my opinion and from what you told me, I should use an index to Showtime and id , for uniqueness and I would use an included column on email . I believe I do not need to use and index on smallint's (Emailsend and Retrystatus) due to the repetition and the small 1digit number but I'm open to suggestion.

Note that i do not want to overfeed the index since I will be doing constant inserts on the table.

So in sort I'm closer to ScottPletcher suggestion.
Any thoughts, opinions?
Thanks.

ScottPletcher · February 28, 2018, 4:06pm

The id by itself is unique. So adding it to the ShowTime will also automatically make that a unique combination. Presumably the ShowTimes are added (roughly) sequentially as well, so fragmentation will be low anyway. You're right about the other columns: you don't need, or want, to index on EmailSend or RetryStatus, etc.

But you MUST make sure that in the WHERE clause you do not perform any function(s) on ShowTime (in tech terms, that it's "sargable"). Otherwise SQL can't do a seek on the index. Something like:

Select email,showdate,ShowTime,EmailSend,RetryStatus
form EOrderReminderConsent
and EmailSend = 0 and RetryStatus < 300
Where ShowTime >= DATEADD(MINUTE, -45, GETDATE()) AND
ShowTime <= GETDATE()

You should get significantly better response times, for both INSERTs and SELECTs, because you won't have the overhead of additional nonclus indexes to support queries. And you're not constantly having to adjust those indexes. Eventually you most often end up with the entire table size repeated in nonclus covering indexes, which is such a waste.

What's your basis for this claim? If the ShowTimes are inserted in order, there shouldn't be really significant fragmentation. Don't get too obsessed with the INSERT. Data is INSERTed only once, but then read millions of times. Worst case, you could partition and rebuild the active part of the table, although that's almost never really needed. Or just reorg (or, if you have to, rebuild) the main table itself.

Again, it's a myth, and a performance-destroying one, that "(all) tables should be clustered by default on identity." There is no such thing as a "default clustering key". The single most important performance factor for any table is how it is clustered. So that shouldn't be left to simplistic, pre-defined "rules", but instead carefully chosen for each table based on its specific uses.

sapator · February 28, 2018, 4:11pm

Thanks. How about the email? Should i index it somewhere?

ScottPletcher · February 28, 2018, 7:38pm

I wouldn't, unless you search by email address a fair amount. That will be a large, and very volatile, column to index. In that index, you would see a lot of fragmentation.

nigelrivett · March 1, 2018, 2:11am

I'm guessing that showtimes won't be inserted in order given that the row has an email address and booking id. I'm guessing that the entry is added when a booking is made - you could know more about the application though.
I would say that the insert is the most important thing. If the query stops the insert from working then that can mess up the system whereas the extract should be built to be recoverable.

Odd - it used to be an incorrect statement in bol that tables should never be clustered on the identity - I hadn't heard people saying that it should always be - although it's sometimes a good idea.