Amount of clients which have seen an impression of a product before purchasing - most optimal answer for Billion-record table

for a project I am working on I was asked to solve this task:

Given the table

events
event_id int (autoincrement) --10B distinct values
event_ts datetime -- 10B
event_type int (1 = impression, 2 = click, 3 = purchase...) --20 types
product_id int --100K
client_id int --10M
client_type int --10

Q: find the amount of clients which have seen an impression of a product before purchasing it

I came up with two solutions:

With cteProdsClients As (
select e1.product_id ,e1.client_id 
from  events as e1
where event_type = ‘3’ and 

EXISTS (SELECT  e2.product_id ,e2.client_id 
              FROM events as e2
              WHERE event_type = ‘1’ and    e1.product_id  = e2.product_id 
                        AND e1.client_id  = e2.client_id
AND e1.event_ts  <e2.event_ts )

count(distinct product_id)
From cteProdsClients                    
)
SELECT
    count(distinct client_id)
FROM 
    cteProdsClients ;



2)

With cteProdsClients As (
select e1.product_id ,e1.client_id 
from  events as e1 left join 
(
  SELECT  e2.product_id ,e2.client_id 
              FROM events as e2   
 WHERE event_type = ‘1’  AND e1.event_ts  <e2.event_ts
)  
ON e1.product_id  = e2.product_id  AND e1.client_id  = e2.client_id
WHERE event_type = ‘3’ 
)
SELECT
    count(distinct client_id)
FROM 
    cteProdsClients ;

I need to not only create a query that would get the output but also do it in the most efficient, optimized way. Out of these 2 which is better? I would appreciate it if you fixed errors (if exist) and propose better solutions Thanks


;WITH cteProdsClients AS (
    SELECT product_id, client_id, 
        MIN(CASE WHEN event_type = '1' THEN event_ts ELSE NULL END) AS first_impression,
        MIN(CASE WHEN event_type = '3' THEN event_ts ELSE NULL END) AS first_purchase
    FROM events
    WHERE event_type IN ('1', '3')
    GROUP BY product_id, client_id
    HAVING MAX(CASE WHEN event_type = '1' THEN 1 ELSE 0 END) = 1 AND
       MAX(CASE WHEN event_type = '3' THEN 1 ELSE 0 END) = 1
)
SELECT COUNT(DISTINCT client_id) AS client_count
FROM cteProdsClients
WHERE first_impression < first_purchase
2 Likes

@ScottPletcher thanks!
That is more efficient since there is no use in joins. Would you use indexes on some columns to improve optimization?
And if you have an idea to further optimize your code let me know since that's what I'm being tested in

An index keyed on ( event_type, product_id, client_id ) and INCLUDEing ( event_ts ) would cover that query, I think (you can verify by looking at the query plan after creating the index).

Some people object to multiple keys in an index, but in this case it should provide performance when SQL runs the query.

1 Like

ok so you mean something like this?

CREATE [UNIQUE] INDEX index_name
ON events( event_type, product_id, client_id )
INCLUDE(event_ts );

Whats the reasoning behind the include part?
Thanks

Yes, except if you want to make the key UNIQUE -- which is not a bad idea -- we need to add the ts to the keys.


CREATE [UNIQUE] INDEX index_name
ON events( event_type, product_id, client_id, event_ts ) --in that specific order!!

Since the query uses event_ts, that needs to be in the index to prevent SQL from having to go back to the main table to look it up. In technical terms, putting that column in the index -- either as a key or as an INCLUDEd column -- makes the index "covering".

1 Like

Got it, thanks for the explanation
So you would stick with the unick key index? Does it further improve efficiency?

Yes, it does help overall performance, plus SQL strongly prefers unique indexes.

Note that if in the future another column(s) is added to this query, that column(s) must be added to the index as well so that the index continues to be "covering".

1 Like

You might even want to consider a filtered index:

CREATE UNIQUE INDEX index_name
ON events( event_type, product_id, client_id, event_ts )
WHERE event_type IN ('1', '3')

If you know that you'll always include that WHERE condition in the query. It would reduce the number of total rows to be queried and reduce the size of the index.

1 Like

But first you need to make sure you don't run under settings that would not allow a filtered index to be present.

1 Like

Very helpful thanks @ScottPletcher , @robert_volk
there is another task with a twist (same 'events' table):

Which clients have seen an impression of a product at least 3 times before purchasing it?
So instead of getting the amount, I have to get the clients

What would you add to Scott's query in the first comment? Or do differently..

;WITH cteProdsClients AS (
    SELECT product_id, client_id, 
        MIN(CASE WHEN event_type = '1' THEN event_ts ELSE NULL END) AS first_impression,
        MIN(CASE WHEN event_type = '3' THEN event_ts ELSE NULL END) AS first_purchase
    FROM events
    WHERE event_type IN ('1', '3')
    GROUP BY product_id, client_id
    HAVING MIN(event_type) = '1' AND
       MAX(event_type) = '3'
AND SUM(CASE WHEN event_type = '1' THEN 1 END)>=3
)
SELECT DISTINCT client_id
FROM cteProdsClients
WHERE first_impression < first_purchase

edit: slight tweaks to original

2 Likes

Do I use the index in my query or is it something that exists in my db and improves efficiency? (asking in general)
And would you say its a better idea than the unique filter:

CREATE [UNIQUE] INDEX index_name
ON events( event_type, product_id, client_id, event_ts )

The query optimizer would choose the index if it determines it would reduce the overall cost of the query. In case it doesn't, you could force it to use the index with a hint:

Generally speaking, it's usually better not to force an index hint. In this specific case, and as long as you don't modify the criteria, it should be OK. Make sure to test performance with and without the hint, and compare the query plans, before settling on using the hint.

1 Like

Got it thanks a lot.
So you would stick with the filtered index over the unique index? And it serves both queries right? (the one with the count, and the one without)

@robert_volk and in addition,
would that index be helpful as well for the following query? if not which index would you use?

Q: Find the top 3 products(product name) by the highest number of clients purchasing them

select p.product_name, e.purchases
From (
Select top 3
e.product_id , count(distinct e.client_id) as purchases
From events as e
Where event_type = 3
Order by 2 DESC
) as e
left join products as p on e.product_id=p.product_id

Thanks

It's hard to say, the only way to know is to generate a query plan and check, it will show either a seek or an index scan, and hover over the element to see the name of the index.

Question: why would you use a LEFT JOIN on events to the products table? Why/how would an event have a product ID that's not in products?

1 Like

The left join is only for getting the product_name column. And its done after getting a 3 - records table.
What about my previous question?

So you would stick with the filtered index over the unique index? And it serves both queries right? (the one with the count, and the one without)

And I would generate a query plan and check but the problem is that there is no actual DB.
This is a task for a university course, not supposed to run it anywhere.
So would you add another index for that query? (top 3 products)

That would have been valuable to know from the beginning. "for a project I am working on I was asked to solve this task" doesn't indicate this is theoretical or that you're being graded on it.

Nor does it help to open a new thread:

...that doesn't include the discussion already here.

See my comment there. The SQL Server query optimizer (and some others) generates a query plan based on columns statistics that are calculated on the actual data. For an index to be useful, the statistics of the data in its column(s) have to be selective enough for the optimizer to say "using this index is substantially cheaper" than another method. The fact that you have no data makes any discussion about useful indexes moot.

We need to close one of these threads, they're duplicates. I'll let you choose which one.

Edit: closing this one, as discussion is continuing in the other thread.