Select correct table based on year

K3JAE · October 28, 2022, 10:13am

This is a basic question and I believe I know the answer but wish to defer to the experts on this prior to attempting this. I'm wondering if there is a way to retrieve data from a database table based on a year... Being more specific:

I will have multiple tables with named 2020Docs, 2021Docs, 2022Docs, 2023Docs, 2024Docs etc.

Each of those tables then would have the exact same column structure.

I am wondering if there is a way to search for a record in, as an example, 2020Docs specifically but still maintain the ability to search into one of the other tables, if required, without having to write a new query each time.

My end thinking is since the DOCS table will contain a years worth of fairly large PDFs, transact time becomes slower and slower as the year progresses and more records are added. Rather than bloating the DB with multiple years of DOCS I was thinking of breaking them into individual years thus keeping transact times within reason.

Ultimately this will be done in a program where once a record is called the document will display on the screen. I have that functionality working but need to seriously speed up the transact time.

So, as a guidance, can I set a parameter, and more specifically, how to pass said parameter to look at a specific DOC table? Additionally, I will also need the ability to write (save documents) to each year's table as documents are produced. The later part will most likely only need to be written to the current years DB as all documents pertaining to the record will have been generated within days of completing a load.

I have not attempted this yet so I have no SQL to show my progress as this is only a preliminary idea for development that I believe will address my slow Transact time.

ScottPletcher · October 28, 2022, 1:46pm

We need more info.

On what columns specifically do you search? Do you search by date?

Typically you don't need to split out to several tables for something like this but details are critical. Most often you'd want to force the PDF data to a separate, off-page row to keep the main table much smaller for searching.

robert_volk · October 28, 2022, 1:53pm

The most flexible way is to create a partitioned view:

CREATE View AllDocs AS
SELECT * FROM [2020Docs]
UNION ALL
SELECT * FROM [2021Docs]
UNION ALL
SELECT * FROM [2022Docs]
UNION ALL
SELECT * FROM [2023Docs]
UNION ALL
SELECT * FROM [2024Docs]
GO

To get optimal performance, you'd need to add a constraint on the date column in each table, like so:


ALTER TABLE [2020Docs] ADD CONSTRAINT CHK_2020Docs CHECK (DateCol BETWEEN '2020-01-01' AND '2020-12-31 23:59:59.997')

You add a similar constraint on each table with the date values adjusted for the relevant year of that table.

Adding that constraint allows the view to use partition elimination when you query on that date column, and it will eliminate any table whose data would be out of that date range. For additional performance improvement you can add an index to that date column in each table:

CREATE INDEX dateCol ON [2020Docs](dateCol);
CREATE INDEX dateCol ON [2021Docs](dateCol);
... etc. for each table

What's important to remember is if you want to query an entire year:

SELECT * FROM AllDocs WHERE dateCol BETWEEN '2021-01-01' AND '2021-12-31 23:59:59'

and NOT like this:

SELECT * FROM AllDocs WHERE YEAR(dateCol)=2021

The first style can use an index and partition elimination, because it presents actual date values that can be compared immediately in both the indexes and the constraints. The second one does not, because the YEAR() function must be evaluated for every row, and it returns an integer type.

The can also query across year boundaries, as long as you use the first style with actual dates. Parameters also work:

DECLARE @start datetime='2021-06-13', @end datetime='2022-03-23'
SELECT * FROM AllDocs WHERE dateCol>=@start and dateCol<@end

I wouldn't advise having the application code try to determine which table to query based on a year or date value, it will get complicated and you'll need to modify it whenever you add a new year.

Frankly, keeping them all in separate tables is probably unnecessary, as long as you index the date column, they could comfortably go into a single documents table.

Lastly, and I'm sure you won't like hearing this but I'd feel delinquent if I didn't mention it, if you expect to be storing millions of PDF documents, you DO NOT WANT TO STORE THEM IN BLOB COLUMNS IN THE DATABASE. If you have any opportunity to change your database structure I cannot emphasize this enough.

They're files, store them as files in a filesystem. If you add them into the database as blob columns, your database will grow in size, backups will get larger, they can't compress because PDFs are already compressed, your maintenance windows will grow longer, it will become a serious headache in a year or less. Not to mention, your PDF data will never change, so you're backing up the same exact data every time, only more each time.

I'm speaking from far too much experience here. Currently working on moving 10 TB of JPEG images to AWS storage, there's perhaps 5-6 GB of actual data in the database.

Since no one listens to me and everyone loves Brent Ozar, here's Brent's take:

K3JAE · October 28, 2022, 2:44pm

The PETS_LOAD_NUMBER (which will be in all tables) will be the primary search column - regardless of year. ALL load number are preceded by the year then a sequential number behind it (ex: 2022001, 2022002, 2022003, etc)
so my SQL would be WHERE PETS_LOAD_NUMBER like '2022%' if I'm looking for an entire year otherwise WHERE PETS_LOAD_NUMBER = '2022001'

Not sure I understand your comment "you'd want to force the PDF data to a separate, off-page row." I am not familiar with what an off-page row is.

Schema for all tables are as follows (already created and 2 of them (2021Docs, 2022Docs) are currently populated).

ID int NOT NULL IDENTITY(1,1) PRIMARY KEY,
BROKER_LOAD_NUMBER nvarchar(15) NOT NULL,
PDF_FILENAME,nvarchar(50) NOT NULL,
PETS_LOAD_NUMBER int NOT NULL,
DOCUMENTS varbinary(MAX) NOT NULL

K3JAE · October 28, 2022, 3:12pm

Robert... This is very good info and gives me something to learn and chew on.. I am not a strong guru at doing SQL and everything you mentioned in the 2nd part of your response I am already starting to feel the pain of (slow transacts, trying to figure out how to code this in VB for my program, how to plan to adjust code every year to account for new years etc.)

Since these are new Databases, I am not too far along where I could change the db structure if that is the recommended way to go but not sure how to do that as a filesystem or as ScottPletcher suggested in his posting. I would prefer to find a good way to compress my transact time to a few seconds rather than up to 30+ seconds I am already experiencing. I did a BU of the DB last evening and it took near 4x longer than it had prior to me adding the new tables with the PDFs added as a blob.

My end goal is to be able, from within the program, to search for and display a record, with PDF, regardless of the year without the need to pre-load a 'master' dataview table as that takes way too long a time to load all the records with the data (which will become even larger by the week). I have been thinking on this and have come up with a few different ideas but need to understand all and then once I decide on the actual path to then implement it via VB code.

Once I figure out the SQL for this to do it properly then I have to work that into the VB program. Thus why I am asking ahead of time the best suggested and more importantly the most efficient way to do this before I dive into this mess! LOL

yosiasz · October 28, 2022, 3:33pm

Why did you decide to go with this design approach?

2020Docs

Instead of adding a column for year and keep it all in just one table?

ScottPletcher · October 28, 2022, 6:39pm

Your unique clustering key should be on:
(PETS_LOAD_NUMBER, ID)

To force all PDFs to a separate page:
EXEC sys.sp_tableoption 'tablename', 'large value types out of row', 1

Then:
UPDATE tablename SET DOCUMENTS = DOCUMENTS

Note: if a lot of the PDFs currently fit within the page (they are less than 8000 bytes), this UPDATE could take a while and generate lots of log records. If they were larger than 8000 bytes, they were already on separate pages.

What is the date column? I don't see it in your table ddl.

K3JAE · October 28, 2022, 7:33pm

@ScottPletcher - I do not need a date column or do I? I will only ever search on the PETS_LOAD_NUMBER column. I see no real reason to have a date column as it has no real relevance. My search will always consist of WHERE PETS_LOAD_NUMBER = 'yyyynnn' (substituting the 'yyyy' (year) and 'nnn' (sequential number) for whatever record I need)

K3JAE · October 28, 2022, 7:39pm

@yosiasz because currently I have a BLOB column and the larger the db gets the slower it loads. It is painfully slow now taking near 12 seconds to retrieve 1 record from just the current year records. It takes near 3 minutes to find the same record with all the records combined I have thus far in the test db and that is only 2 years worth of data. So looking forward this will become ever increasing processing time with additional years added to one table.

Additionally as @robert_volk mentioned the massively increased back-up time, additional coding every year etc. makes this a more feasible option for the moment.

ScottPletcher · October 28, 2022, 9:10pm

OK. I figured you had a date column because you said you wanted to partition by year. I had no idea the PETS_LOAD_NUMBER was a date by a different name. Please keep in mind, we know nothing about your data.

K3JAE · October 29, 2022, 1:59pm

@ScottPletcher Understand that totally. I apologize for the lack of information. Probably should have done this at the outset...
My 'project blueprint,' as it were, is as follows:

Absolutely must keep transact times to a minimum.
Must have ability to search for any record regardless of year if said record exists along with the PDF's in the form of my VB program (currently working at the moment but it is not in the most efficient way evidently)..
Search will always consist of WHERE PETS_LOAD_NUMBER = 'yyyynnn' (substituting the 'yyyy' (year) and 'nnn' (sequential number) for whatever record is requested (thus the date parameter of sorts). This information should be known at the outset.
Alternately have ability to search on BROKER_LOAD_NUMBER - which would NOT be a date parameter but is a unique number that is not repeated.
I need a better way (or best way) to store PDFs within the database since using a BLOB column has been deemed not the best way to do this. This is an area I am extremely weak (more like completely ignorant) in knowledge in possible options or how to implement.
Each record will have a PDF file included, each consisting of anywhere between 4mb - 20mb of data with the average size being ~7mb.

Unless I missed something this is the basic blueprint.

ScottPletcher · October 31, 2022, 6:22pm

Cool, thanks.

What I stated above for the unique clustered index is still correct: (PETS_LOAD_NUMBER, ID)

You'll also want to create a unique nonclustered index on ( BROKER_LOAD_NUMBER ).

As to storing the PDFs, I agree it's best to store them outside of the SQL db. If you really want SQL to manage those blobs, use FILESTREAM rather than directly using varbinary to store them.

yosiasz · October 31, 2022, 8:40pm

I would have separated the pdf to be on disk

I would have dealt with the pdf via the application for display.

So the application gets needed data including the path of the pdf and it would be up to the app to display the pdf from disk not from sql