Identify Changes In Data

squealer · April 7, 2019, 5:55pm

Hi, there is a script for a table with data below. I am trying to write a query that will take 3 parameters;
@trainername ,@classchange and @distancechange
@classchange and @distancechange can each have values of -1, 0,+1
I am trying to get a query which will find all instances of for a given trainer where the any of the horses have changes in the class and distance that match the values provided by @classchange and @distancechange.
For example, on 31/1/10 trainer Dave ran horse Clodhopper in a Class 5 Distance 6 race
on 24/1/19 trainer Dave ran Clodhopper in a Class 6 Distance 7 race.
If the input values of @classchange was +1 and @distancechange was +1 then that would be a match.
as would, any horse for trainer Dave where its next race was in a higher class and higher distance. An input of 0 in either case means that there was no change in distance or class, and -1 means that a subsequent race was in a lower class or a lower distance. I have no idea how to achieve this in sql. I just hope that I have made myself clear enough and that someone can help. Many thanks.

CREATE TABLE [dbo].[ChangesTest](
[DATE] [date] NULL,
[TRAINER] [nvarchar](255) NULL,
[HORSE] [nvarchar](255) NULL,
[CLASS] [float] NULL,
[DISTANCE] [float] NULL,
[RESULT] [float] NULL

) ON [PRIMARY]
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-01' AS Date), N'DAVE', N'AMBLE', 2, 6, 1)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-02' AS Date), N'DAVE', N'FALLSALOT', 4, 7, 3)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-03' AS Date), N'DAVE', N'PEGASUSNOT', 5, 6, 5)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-04' AS Date), N'DAVE', N'PEGASUSNOT', 4, 7, 4)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-05' AS Date), N'DAVE', N'AMBLE', 2, 5, 1)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-06' AS Date), N'DAVE', N'CLODHOPPER', 6, 7, 2)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-07' AS Date), N'DAVE', N'PEGASUSNOT', 5, 4, 1)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-08' AS Date), N'DAVE', N'FALLSALOT', 4, 3, 2)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-09' AS Date), N'DAVE', N'AMBLE', 3, 6, 7)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-10' AS Date), N'DAVE', N'CLODHOPPER', 4, 7, 3)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-11' AS Date), N'DAVE', N'AMBLE', 2, 5, 3)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-12' AS Date), N'DAVE', N'PEGASUSNOT', 5, 6, 4)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-13' AS Date), N'DAVE', N'AMBLE', 2, 5, 9)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-14' AS Date), N'DAVE', N'FALLSALOT', 6, 7, 8)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-15' AS Date), N'DAVE', N'AMBLE', 4, 7, 1)
GO
INSERT [dbo].[ChangesTest] ([DATE], [TRAINER], [HORSE], [CLASS], [DISTANCE], [RESULT]) VALUES (CAST(N'2019-01-16' AS Date), N'DAVE', N'CLODHOPPER', 4, 6, 1)
GO

Lewie · April 8, 2019, 3:35am

Your data doesn't match what your explanation said. i.e. dates not in the data
but from what I understand this maybe what you are looking for.

CREATE TABLE #ChangesTest ( [DATE] date, [TRAINER] nvarchar(255), [HORSE] nvarchar(255), [CLASS] float(8), [DISTANCE] float(8), [RESULT] float(8) )
INSERT INTO #ChangesTest
VALUES
( N'2019-01-01T00:00:00', N'DAVE', N'AMBLE', 2, 6, 1 ),
( N'2019-01-02T00:00:00', N'DAVE', N'FALLSALOT', 4, 7, 3 ),
( N'2019-01-03T00:00:00', N'DAVE', N'PEGASUSNOT', 5, 6, 5 ),
( N'2019-01-04T00:00:00', N'DAVE', N'PEGASUSNOT', 4, 7, 4 ),
( N'2019-01-05T00:00:00', N'DAVE', N'AMBLE', 2, 5, 1 ),
( N'2019-01-06T00:00:00', N'DAVE', N'CLODHOPPER', 6, 7, 2 ),
( N'2019-01-07T00:00:00', N'DAVE', N'PEGASUSNOT', 5, 4, 1 ),
( N'2019-01-08T00:00:00', N'DAVE', N'FALLSALOT', 4, 3, 2 ),
( N'2019-01-09T00:00:00', N'DAVE', N'AMBLE', 3, 6, 7 ),
( N'2019-01-10T00:00:00', N'DAVE', N'CLODHOPPER', 4, 7, 3 ),
( N'2019-01-11T00:00:00', N'DAVE', N'AMBLE', 2, 5, 3 ),
( N'2019-01-12T00:00:00', N'DAVE', N'PEGASUSNOT', 5, 6, 4 ),
( N'2019-01-13T00:00:00', N'DAVE', N'AMBLE', 2, 5, 9 ),
( N'2019-01-14T00:00:00', N'DAVE', N'FALLSALOT', 6, 7, 8 ),
( N'2019-01-15T00:00:00', N'DAVE', N'AMBLE', 4, 7, 1 ),
( N'2019-01-16T00:00:00', N'DAVE', N'CLODHOPPER', 4, 6, 1 )

DECLARE @trainername VARCHAR(40),
@Class INT,
@Distance INT

SET @trainername = 'DAVE';
SET @Class = -1;
SET @Distance = 1;

SELECT Z.*
FROM #changestest A
CROSS apply (SELECT TOP 1 B.*
FROM #changestest B
WHERE A.trainer = B.trainer
AND A.HORSE = B.HORSE
AND A.date < B.date
ORDER BY B.date ASC) Z
WHERE A.trainer = @trainername
AND Z.class = A.class + @Class
AND Z.distance = A.distance + @Distance;

squealer · April 8, 2019, 8:43am

Apologies for data not matching the explanation. Your solution provides exactly what I want, however it takes an age to complete, typically 8 or more minutes.

Lewie · April 8, 2019, 9:02am

If you add on an index it may help.
On trainer, horse, date

ScottPletcher · April 8, 2019, 4:32pm

Try clustering ChangesTest on ( Trainer, Date, Horse ). It's much better if you can specify it as UNIQUE key too.

squealer · April 8, 2019, 4:49pm

I'm not sure how to do that. Please elaborate. Thanks.

ScottPletcher · April 8, 2019, 5:24pm

Well, if there's not a clus index on the table already:

CREATE UNIQUE CLUSTERED INDEX ChangesTest__CL ON dbo.ChangesTest ( Trainer, Date, Horse ) WITH ( FILLFACTOR = 99, SORT_IN_TEMPDB = ON ) ON [PRIMARY];

squealer · April 8, 2019, 6:06pm

This is a picture of the execution plan after execution. The warning refers to an excessive memory grant.
There is a clustered index in existence, hence this error message; The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.ChangesTest' and the index name 'ChangesTest__CL'. The duplicate key value is (DAVE, 2019-01-01, AMBLE). I offer it because it might be of help. I suspect that my dbase is not as well normalized as it could and should be, but if anyone would care to offer any useful suggestions I'd be most grateful. Many thanks.

ScottPletcher · April 8, 2019, 7:55pm

Which specific version of SQL are you on? 2008? 2012? Etc.

Also, is this for any starting date and value?

For example, controls are:
Dave, +1, +1

D had class 5 and distance 6.
Then c 6 and d 7 -- match.
Then c 4 and d 5 -- no match.
Then c 5 and d 6 -- match?

Lewie · April 8, 2019, 9:52pm

If you are getting that error that means you have a duplicate for that data for that day.
Would a horse run more than one race a day?

Lewie · April 8, 2019, 10:04pm

Try adding the index: -
CREATE INDEX IX_MYIndex ON ChangesTest(trainer,HORSE,date) INCLUDE (class,distance)

squealer · April 9, 2019, 6:42am

sqlversion

In the example you gave, only the first example is a match ie ' then c 6 and d 7 -- match'
The query should take as inputs, the trainer, the class change the distance change for a current days racing.
That means, that for the race the horse is entered in today it is running class +1 and distance +1 (for example )compared to its last race. The query then finds examples of all instances where trainer Dave's horses have run at +1 +1 compared to their last race and then calculates the probability of a win based on that data. The probability calculations are not the issue for me. If I pull all the data off the database with a simple select ie 'select * for trainer Dave' and then query the result set in my vb.net app, then the desired probability figure is achieved in something under 20 seconds. I just had it in my mind that these sorts of calculations would be better, faster or more efficient, if done on the server side.

squealer · April 9, 2019, 6:47am

Using the following - (there is an asterix in the count() but it's not showing in the code pasted here )
SELECT horse,race_date, COUNT() as Occurences
FROM MAIN_TABLE
GROUP BY horse, RACE_DATE
HAVING COUNT() > 1

I can't find any duplicates. Of course, a trainer may have multiple runners on a single day.

I have added the suggested index, but to no obvious effect on the speed.

Lewie · April 10, 2019, 5:10am

Can you post the plan with the new index?
Which index was it that you added in the end?
Can you also post any other indexes on that table, I might be able to work with the ones you already have.

squealer · April 10, 2019, 7:13am

This is the index.

index_name	index_description	index_keys
IX_MYIndex	nonclustered located on PRIMARY	TRAINER, HORSE, DATE

And this the execution plan.

And in case it helps.

And finally - thank you for your continued help.

Lewie · April 10, 2019, 10:41am

That query plan looks odd. The %’s should add up to 100.

Select * into #temp from #changestest where trainer = @trainer
Create index tmp on #temp (horse,date) include (class,distance)

SELECT Z.*
FROM #temp A
CROSS apply (SELECT TOP 1 B.*
FROM #temp B
WHERE A.HORSE = B.HORSE
AND A.date <= B.date
ORDER BY B.date ASC) Z
WHERE Z.class = A.class + @Class
AND Z.distance = A.distance + @Distance;

squealer · April 10, 2019, 11:32am

This is the actual query used, I added the final line because I wanted to view the results.

HORSE_STATS
go
DECLARE @trainer VARCHAR(40),
@Class INT,
@Distance INT
SET @trainer = 'm johnston';
SET @Class = -1;
SET @Distance = 0;

Select * into #temp from MAIN_TABLE where trainer = @trainer
Create index tmp on #temp (horse,race_date) include (class,dist)
SELECT Z.*
FROM #temp A
CROSS apply (SELECT TOP 1 B.*
FROM #temp B
WHERE A.HORSE = B.HORSE
AND A.RACE_DATE <= B.RACE_DATE
ORDER BY B.RACE_DATE ASC) Z
WHERE Z.class = A.class + @Class
AND Z.DIST = A.DIST + @Distance;

select * from #temp

And these are the messages
(13802 rows affected)
(4 rows affected)
(1 row affected)
(0 rows affected)
(8 rows affected)
(1 row affected)
(13802 rows affected)
(2 rows affected)
(1 row affected)

As you can see, suddenly the query completed in the blink of an eye, but I have no idea why.

ScottPletcher · April 10, 2019, 2:14pm

That's not a clustered index as I specified, in bold even. The only possibility to get the seek you need is with a clus index. Non-clus indexes must be so extremely selective and fully covering or they're often not worth even creating.

squealer · April 10, 2019, 5:26pm

@ScottPletcher Apologies for the oversight. Attempting to create key as per your advice produced the following :- Warning! The maximum key length for a clustered index is 900 bytes. The index 'ChangesTest__CL' has maximum length of 1023 bytes. For some combination of large values, the insert/update operation will fail.
Msg 1505, Level 16, State 1, Line 3
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.ChangesTest' and the index name 'ChangesTest__CL'. The duplicate key value is (DAVE, 2019-01-01, AMBLE).
The statement has been terminated.
If that helps. Thank you.

ScottPletcher · April 10, 2019, 6:27pm

Ah, that's the issue, you have to explicitly specify CLUSTERED index; it doesn't have to be unique -- it helps SQL efficiency somewhat, but it's not 100 required.

Drop all existing indexes on table.
Create the clustered index:
CREATE CLUSTERED INDEX ChangesTest__CL ON dbo.ChangesTest ( Trainer, Date, Horse ) WITH ( FILLFACTOR = 85, SORT_IN_TEMPDB = ON );

Then get the Estimated Execution Plan for the query and see if the scan of the ChangesTest has a seek/seek predicate first.