Select correct table based on year

The most flexible way is to create a partitioned view:

CREATE View AllDocs AS
SELECT * FROM [2020Docs]
UNION ALL
SELECT * FROM [2021Docs]
UNION ALL
SELECT * FROM [2022Docs]
UNION ALL
SELECT * FROM [2023Docs]
UNION ALL
SELECT * FROM [2024Docs]
GO

To get optimal performance, you'd need to add a constraint on the date column in each table, like so:


ALTER TABLE [2020Docs] ADD CONSTRAINT CHK_2020Docs CHECK (DateCol BETWEEN '2020-01-01' AND '2020-12-31 23:59:59.997')

You add a similar constraint on each table with the date values adjusted for the relevant year of that table.

Adding that constraint allows the view to use partition elimination when you query on that date column, and it will eliminate any table whose data would be out of that date range. For additional performance improvement you can add an index to that date column in each table:

CREATE INDEX dateCol ON [2020Docs](dateCol);
CREATE INDEX dateCol ON [2021Docs](dateCol);
... etc. for each table

What's important to remember is if you want to query an entire year:

SELECT * FROM AllDocs WHERE dateCol BETWEEN '2021-01-01' AND '2021-12-31 23:59:59'

and NOT like this:

SELECT * FROM AllDocs WHERE YEAR(dateCol)=2021

The first style can use an index and partition elimination, because it presents actual date values that can be compared immediately in both the indexes and the constraints. The second one does not, because the YEAR() function must be evaluated for every row, and it returns an integer type.

The can also query across year boundaries, as long as you use the first style with actual dates. Parameters also work:

DECLARE @start datetime='2021-06-13', @end datetime='2022-03-23'
SELECT * FROM AllDocs WHERE dateCol>=@start and dateCol<@end

I wouldn't advise having the application code try to determine which table to query based on a year or date value, it will get complicated and you'll need to modify it whenever you add a new year.

Frankly, keeping them all in separate tables is probably unnecessary, as long as you index the date column, they could comfortably go into a single documents table.

Lastly, and I'm sure you won't like hearing this but I'd feel delinquent if I didn't mention it, if you expect to be storing millions of PDF documents, you DO NOT WANT TO STORE THEM IN BLOB COLUMNS IN THE DATABASE. If you have any opportunity to change your database structure I cannot emphasize this enough.

They're files, store them as files in a filesystem. If you add them into the database as blob columns, your database will grow in size, backups will get larger, they can't compress because PDFs are already compressed, your maintenance windows will grow longer, it will become a serious headache in a year or less. Not to mention, your PDF data will never change, so you're backing up the same exact data every time, only more each time.

I'm speaking from far too much experience here. Currently working on moving 10 TB of JPEG images to AWS storage, there's perhaps 5-6 GB of actual data in the database.

Since no one listens to me and everyone loves Brent Ozar, here's Brent's take:

2 Likes