for a project I am working on I was asked to solve this task:
Given the table
events
event_id int (autoincrement) --10B distinct values
event_ts datetime -- 10B
event_type int (1 = impression, 2 = click, 3 = purchase...) --20 types
product_id int --100K
client_id int --10M
client_type int --10
Q: find the amount of clients which have seen an impression of a product before purchasing it
I came up with two solutions:
With cteProdsClients As (
select e1.product_id ,e1.client_id
from events as e1
where event_type = ‘3’ and
EXISTS (SELECT e2.product_id ,e2.client_id
FROM events as e2
WHERE event_type = ‘1’ and e1.product_id = e2.product_id
AND e1.client_id = e2.client_id
AND e1.event_ts <e2.event_ts )
count(distinct product_id)
From cteProdsClients
)
SELECT
count(distinct client_id)
FROM
cteProdsClients ;
2)
With cteProdsClients As (
select e1.product_id ,e1.client_id
from events as e1 left join
(
SELECT e2.product_id ,e2.client_id
FROM events as e2
WHERE event_type = ‘1’ AND e1.event_ts <e2.event_ts
)
ON e1.product_id = e2.product_id AND e1.client_id = e2.client_id
WHERE event_type = ‘3’
)
SELECT
count(distinct client_id)
FROM
cteProdsClients ;
I need to not only create a query that would get the output but also do it in the most efficient, optimized way. Out of these 2 which is better? I would appreciate it if you fixed errors (if exist) and propose better solutions Thanks
;WITH cteProdsClients AS (
SELECT product_id, client_id,
MIN(CASE WHEN event_type = '1' THEN event_ts ELSE NULL END) AS first_impression,
MIN(CASE WHEN event_type = '3' THEN event_ts ELSE NULL END) AS first_purchase
FROM events
WHERE event_type IN ('1', '3')
GROUP BY product_id, client_id
HAVING MAX(CASE WHEN event_type = '1' THEN 1 ELSE 0 END) = 1 AND
MAX(CASE WHEN event_type = '3' THEN 1 ELSE 0 END) = 1
)
SELECT COUNT(DISTINCT client_id) AS client_count
FROM cteProdsClients
WHERE first_impression < first_purchase
@ScottPletcher thanks!
That is more efficient since there is no use in joins. Would you use indexes on some columns to improve optimization?
And if you have an idea to further optimize your code let me know since that's what I'm being tested in
An index keyed on ( event_type, product_id, client_id ) and INCLUDEing ( event_ts ) would cover that query, I think (you can verify by looking at the query plan after creating the index).
Some people object to multiple keys in an index, but in this case it should provide performance when SQL runs the query.
Yes, except if you want to make the key UNIQUE -- which is not a bad idea -- we need to add the ts to the keys.
CREATE [UNIQUE] INDEX index_name
ON events( event_type, product_id, client_id, event_ts ) --in that specific order!!
Since the query uses event_ts, that needs to be in the index to prevent SQL from having to go back to the main table to look it up. In technical terms, putting that column in the index -- either as a key or as an INCLUDEd column -- makes the index "covering".
Yes, it does help overall performance, plus SQL strongly prefers unique indexes.
Note that if in the future another column(s) is added to this query, that column(s) must be added to the index as well so that the index continues to be "covering".
CREATE UNIQUE INDEX index_name
ON events( event_type, product_id, client_id, event_ts )
WHERE event_type IN ('1', '3')
If you know that you'll always include that WHERE condition in the query. It would reduce the number of total rows to be queried and reduce the size of the index.
;WITH cteProdsClients AS (
SELECT product_id, client_id,
MIN(CASE WHEN event_type = '1' THEN event_ts ELSE NULL END) AS first_impression,
MIN(CASE WHEN event_type = '3' THEN event_ts ELSE NULL END) AS first_purchase
FROM events
WHERE event_type IN ('1', '3')
GROUP BY product_id, client_id
HAVING MIN(event_type) = '1' AND
MAX(event_type) = '3'
AND SUM(CASE WHEN event_type = '1' THEN 1 END)>=3
)
SELECT DISTINCT client_id
FROM cteProdsClients
WHERE first_impression < first_purchase
Do I use the index in my query or is it something that exists in my db and improves efficiency? (asking in general)
And would you say its a better idea than the unique filter:
CREATE [UNIQUE] INDEX index_name
ON events( event_type, product_id, client_id, event_ts )
The query optimizer would choose the index if it determines it would reduce the overall cost of the query. In case it doesn't, you could force it to use the index with a hint:
Generally speaking, it's usually better not to force an index hint. In this specific case, and as long as you don't modify the criteria, it should be OK. Make sure to test performance with and without the hint, and compare the query plans, before settling on using the hint.
Got it thanks a lot.
So you would stick with the filtered index over the unique index? And it serves both queries right? (the one with the count, and the one without)
@robert_volk and in addition,
would that index be helpful as well for the following query? if not which index would you use?
Q: Find the top 3 products(product name) by the highest number of clients purchasing them
select p.product_name, e.purchases
From (
Select top 3
e.product_id , count(distinct e.client_id) as purchases
From events as e
Where event_type = 3
Order by 2 DESC
) as e
left join products as p on e.product_id=p.product_id
It's hard to say, the only way to know is to generate a query plan and check, it will show either a seek or an index scan, and hover over the element to see the name of the index.
Question: why would you use a LEFT JOIN on events to the products table? Why/how would an event have a product ID that's not in products?
And I would generate a query plan and check but the problem is that there is no actual DB.
This is a task for a university course, not supposed to run it anywhere.
So would you add another index for that query? (top 3 products)
That would have been valuable to know from the beginning. "for a project I am working on I was asked to solve this task" doesn't indicate this is theoretical or that you're being graded on it.
Nor does it help to open a new thread:
...that doesn't include the discussion already here.
See my comment there. The SQL Server query optimizer (and some others) generates a query plan based on columns statistics that are calculated on the actual data. For an index to be useful, the statistics of the data in its column(s) have to be selective enough for the optimizer to say "using this index is substantially cheaper" than another method. The fact that you have no data makes any discussion about useful indexes moot.
We need to close one of these threads, they're duplicates. I'll let you choose which one.
Edit: closing this one, as discussion is continuing in the other thread.