Can a joined query be as efficient as a single table?

I have one table whose data is meant to be immutable. It has a hash (nvarchar(64)) as a unique key. I want to add other tables with data related to it. Those other tables may be updated, so I think it makes sense to keep them separate. For the sake of explanation, the main table is a financial ledger with records for things like sales/refunds/discounts/etc. The ancillary tables hold additional data for each of those types, each with different columns.

The problem is that there are ~20M records in the financial ledger. I've tried making the hash unique on both tables, making sure it is indexed on both, and setting up a foreign key relationship. But querying a month's worth of data takes about 5 minutes across just 2 tables vs. 1 minute on 1 table.
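To make it concrete, this is roughly the shape of what I have and the kind of query that's slow (simplified; these aren't my real table or column names):

```sql
-- Simplified sketch of the current two-table layout: both tables keyed on
-- the nvarchar(64) hash, with a foreign key between them.
CREATE TABLE dbo.financial (
    [hash]    nvarchar(64)  NOT NULL PRIMARY KEY,
    dateadded datetime2     NOT NULL,
    amount    decimal(19,4) NOT NULL
);

CREATE TABLE dbo.sale_detail (
    [hash]    nvarchar(64)  NOT NULL PRIMARY KEY
              REFERENCES dbo.financial ([hash]),
    sale_info nvarchar(200) NULL
);

-- The kind of query that takes ~5 minutes: one month of ledger rows
-- joined to their per-type details (dates are just examples).
SELECT f.[hash], f.dateadded, f.amount, s.sale_info
FROM dbo.financial AS f
JOIN dbo.sale_detail AS s
    ON s.[hash] = f.[hash]
WHERE f.dateadded >= '20240101'
  AND f.dateadded <  '20240201';
```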

Is there anything I can do to divide this data? If the shared keys were integers, would that make a difference? Other than unique, indexed keys with a pk/fk relationship, is there anything else I can do to optimize performance?

As it stands now I have a single, massive table with a lot of null columns. It works, but I can't believe that this is the best approach there is.

Yes. Use an integer surrogate key in the related tables rather than the nvarchar(64).

Other performance issues would relate to indexing, but we would need many more details before being able to address any index issues.
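For example, something along these lines (just a sketch; the names are placeholders, and it assumes the tables don't already have an int key you can reuse):

```sql
-- Sketch: add an int surrogate key to the main table and reference it from
-- the related tables instead of the nvarchar(64) hash.
ALTER TABLE dbo.financial
    ADD financial_id int IDENTITY(1,1) NOT NULL;

ALTER TABLE dbo.financial
    ADD CONSTRAINT UQ_financial_id UNIQUE (financial_id);

ALTER TABLE dbo.sale_detail
    ADD financial_id int NULL;  -- backfill by joining on the hash, then switch to NOT NULL

ALTER TABLE dbo.sale_detail
    ADD CONSTRAINT FK_sale_detail_financial
        FOREIGN KEY (financial_id) REFERENCES dbo.financial (financial_id);
```

Then join the tables on the int column instead of the hash.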


You might also want to look at sparse columns.
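For example, something like this (sparse is declared per column; the names here are only illustrative):

```sql
-- Sketch: keep the single wide table, but mark the type-specific columns
-- SPARSE so that NULL values take no storage.
CREATE TABLE dbo.financial (
    id            int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    dateadded     datetime2     NOT NULL,
    amount        decimal(19,4) NOT NULL,
    refund_reason nvarchar(200) SPARSE NULL,  -- only populated for refunds
    discount_code nvarchar(50)  SPARSE NULL   -- only populated for discounts
);
```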


Thanks, I'll try changing the pk/fk to an int.

This is essentially the schema for 2 of the tables (I've dropped columns that are irrelevant to the point). Can you provide any more insight knowing this?

Pastebin: CREATE TABLE [tmp].[financial]( [id] [int] IDENTITY(1,1) NOT NULL, [dateadde…

Thanks. If possible, I'd rather divide the concepts into different tables. Otherwise, sparse columns are definitely the way to go.

I'd recommend separate tables rather than sparse columns (for one thing, it's a more normalized design).

Get rid of the IDENTITY() column in the child tables, and instead use the unique id from the main table. It can still be defined as a unique, clustered primary key; it just shouldn't be an IDENTITY() column. If the main table doesn't have an IDENTITY() column, add one to it. Despite what some people say, not every table should have an IDENTITY() column. That id column from the main table should be the linking fk.

Btw, the nvarchar(64) should still be there in the main table; it's just not used as the fk.
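A rough sketch of what I mean, with illustrative names only:

```sql
-- Main table keeps its IDENTITY id and the nvarchar(64) hash (the hash stays
-- unique, but isn't used for the FK). Child tables have no IDENTITY of their
-- own; the main table's id is their clustered PK and the FK.
CREATE TABLE dbo.financial (
    id        int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_financial PRIMARY KEY CLUSTERED,
    [hash]    nvarchar(64) NOT NULL
        CONSTRAINT UQ_financial_hash UNIQUE,
    dateadded datetime2 NOT NULL
);

CREATE TABLE dbo.refund_detail (
    id            int NOT NULL
        CONSTRAINT PK_refund_detail PRIMARY KEY CLUSTERED
        CONSTRAINT FK_refund_detail_financial REFERENCES dbo.financial (id),
    refund_reason nvarchar(200) NULL
);
```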


It sounds like you're dealing with a significant volume of data, which makes performance challenging. A few things to consider:

- Switching from the nvarchar(64) hash to an integer key should help, since integer comparisons and joins are cheaper than string comparisons.
- Consider partitioning the financial ledger table by date (or another relevant attribute) to limit the amount of data scanned by a month's query.
- Indexed views (SQL Server's form of materialized views) let you pre-join or pre-aggregate data for faster querying.
- Denormalizing certain data, i.e. storing frequently accessed columns together in the same table, avoids joins entirely.
- Finally, review your indexes, query plans, and database hardware for further gains.
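For instance, date partitioning could look roughly like this (a sketch only; the boundary dates, filegroup, and names are placeholders):

```sql
-- Sketch: monthly partitioning of the ledger on the date column.
CREATE PARTITION FUNCTION pf_financial_month (datetime2)
AS RANGE RIGHT FOR VALUES ('20240101', '20240201', '20240301');

CREATE PARTITION SCHEME ps_financial_month
AS PARTITION pf_financial_month ALL TO ([PRIMARY]);

CREATE TABLE dbo.financial_partitioned (
    id        int IDENTITY(1,1) NOT NULL,
    [hash]    nvarchar(64)  NOT NULL,
    dateadded datetime2     NOT NULL,
    amount    decimal(19,4) NOT NULL,
    CONSTRAINT PK_financial_partitioned
        PRIMARY KEY CLUSTERED (dateadded, id)  -- partitioning column must be part of the clustered key
) ON ps_financial_month (dateadded);
```

With that in place, a query filtered on dateadded only touches the partitions for the requested month(s).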