Hi all, I'm having some trouble with a query that takes too long to run. The query is meant to pull back an identifier and name from a very large table, restricted to the set of ids that appear in another large table. If possible, I'd like to avoid indexing the table or using other techniques that can't be contained within the query itself.
Here's my query. The customer_data table is huge, so I want to limit it to current customers only, by selecting id from the much smaller current_customers table:
select * from
(select id, first_name, last_name
 from customer_data
 where id in (select id from current_customers where year = 2022)) sub
If I simply pass the ids directly, it works great:
select * from
(select id, first_name, last_name
 from customer_data
 where id in (560, 561)) sub
Can anyone explain why the first query takes forever (I've had to terminate it each time since it never finishes), while the second works almost instantly? Is there a better way to optimize the first query? I've been researching this all day but can't seem to find a good solution. Thanks in advance!!
I would also recommend verifying that there actually is an id column in the current_customers table. Because the tables aren't aliased and the columns aren't qualified, it is entirely possible that current_customers has no id column at all, in which case the subquery silently resolves id against the outer query's customer_data table. That turns it into a correlated subquery that matches every customer_data row (as long as current_customers has any 2022 rows), so you end up scanning and returning the entire huge table, which would explain a query that never finishes.
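A minimal aliased sketch of the same query (assuming both tables really do have an id column; if current_customers does not, this version fails with a clear error instead of silently correlating):

select c.id, c.first_name, c.last_name
from customer_data c
where c.id in (select cc.id
               from current_customers cc
               where cc.year = 2022)

Many optimizers also do well with the equivalent exists form:

select c.id, c.first_name, c.last_name
from customer_data c
where exists (select 1
              from current_customers cc
              where cc.id = c.id
                and cc.year = 2022)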
When the SQL query you write has to go through HUGE data, it can take forever and ever. This is a very common issue since the beginning of time, and all different types of solutions have been thought of to overcome it:
Partitioning of data
Indexing
Running the query on rented hardware with huge capabilities
Going through the data in small parts at a time (see the sketch after this list)
Switching to another database (Oracle or MongoDB) meant for this purpose
Designing with huge data in mind from the inception, i.e. thinking about scalability
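As a sketch of the "small parts at a time" idea: keyset pagination walks the big table in id-ordered batches, so each query touches only a bounded slice. This is only an illustration; :last_id is a hypothetical bind parameter the application advances after each batch, the batch size of 1000 is arbitrary, and limit is MySQL/PostgreSQL syntax (other engines use fetch first 1000 rows only, or top):

select c.id, c.first_name, c.last_name
from customer_data c
join current_customers cc
  on cc.id = c.id
 and cc.year = 2022
where c.id > :last_id
order by c.id
limit 1000

Start with :last_id = 0, then repeat with :last_id set to the largest id returned, until a batch comes back empty.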