Issues ingesting many transactional JSON files from Amazon S3 into an on-premises SQL Server

Hi. I need help with a problem at work involving event streaming, AWS S3, and ingesting transactional JSON files into a SQL Server database.

We currently use an on-premises SQL Server database that is sourced by a daily batch feed. However, we will shortly be moving to a new source system that supplies the data as transactional JSON files via an event stream on AWS S3. Despite the cloud storage, our database will remain on-premises for the time being. (Even if a cloud database would be a better solution, it is not an option, as we are under tight timescales to switch over to the new source.)

Note that I'm completely new to AWS S3, and our company is currently setting up an account so the source company can replicate the event stream into our S3 bucket, so please assume I know little in your response. I also have only basic SQL Server knowledge, but I'm learning fast :blush:

So, in summary my main questions are:

  1. What is the best way to ingest hundreds of thousands of transactional JSON files from an AWS S3 bucket into SQL Server? (Preferably a solution that doesn't involve purchasing extra software, if possible.)

  2. How can I read many transactional files (each representing an account key record) into SQL Server when the data varies from file to file (since a transactional file contains only the fields that were updated by the event)? Each file can contain anything from 100 to 300 fields. I can obviously write code to read an isolated file using many OPENROWSET and CROSS APPLY clauses, but how would I code around fields not existing in the JSON? E.g. if a field called colA has not been updated as part of the event, the JSON file won't represent this field as "colA": [] but will simply omit it altogether. If there is no way to code around this in SQL Server or SSIS, do you know of any software that will solve the problem? (Amazon Athena has been mentioned, but as far as I can see it's just a querying tool.)

  3. How can I go about automating the ingestion? I have heard I can build something in SSIS to automate this; is that right?

Many thanks in advance, and apologies for the lengthy, multi-faceted question :blush:

You can use a Foreach Loop in SSIS to load the multiple files into the database. I would load the files into a new staging table first. Once you have that table, you can create a stored procedure and fire the stored procedure from SSIS.

In the stored procedure you can compare the staging table with your target table and update only the fields that are needed.
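Roughly something like this, assuming the JSON has already been loaded into a staging table where a column is NULL whenever the event didn't include that field (all the table and column names here are just placeholders):

```sql
CREATE PROCEDURE dbo.ApplyAccountEvents
AS
BEGIN
    SET NOCOUNT ON;

    -- Update only the fields the event actually sent; keep the existing value otherwise.
    UPDATE t
    SET t.colA = COALESCE(s.colA, t.colA),
        t.colB = COALESCE(s.colB, t.colB)
        -- ...and so on for the remaining columns...
    FROM dbo.Account AS t
    JOIN dbo.AccountStaging AS s
        ON s.AccountKey = t.AccountKey;

    -- Clear staging for the next batch.
    TRUNCATE TABLE dbo.AccountStaging;
END;
```

One caveat with COALESCE: an event can never set a field back to NULL this way, so check whether that matters for your data.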

Learn SSIS - YouTube has some nice videos

Thanks for that. That video will be useful for looping in SSIS; however, it mainly deals with CSVs, and my source file format is JSON.

Regarding reading in many JSON files, each with different fields populated, do you know of any simple way of doing that? If I hard-code reading a JSON file that contains all possible fields into SQL Server (i.e. with OPENROWSET/CROSS APPLY statements etc.), how can that code read other JSON files where not all of those fields are present? The code would fall over if fields are missing.

Bert Wagner has some good JSON blogs for SQL Server:

This one is my go-to for when I have to parse indeterminate JSON:

It should work for your case where columns are missing. All of his JSON blogs are here; there are four pages' worth:

https://bertwagner.com/category/sql/development/json.html

He covers some pretty deep stuff.
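For what it's worth, the missing-field problem is more manageable than it sounds: when you give OPENJSON an explicit WITH clause listing every column you might ever receive, any property that isn't present in a particular document simply comes back as NULL instead of raising an error. A minimal sketch (the field names are invented):

```sql
-- A sample event document that omits colB entirely.
DECLARE @event NVARCHAR(MAX) = N'{"accountKey": "A123", "colA": "foo"}';

-- colB is not in the document, so it is returned as NULL rather than failing.
SELECT j.accountKey, j.colA, j.colB
FROM OPENJSON(@event)
WITH (
    accountKey VARCHAR(20)  '$.accountKey',
    colA       VARCHAR(100) '$.colA',
    colB       INT          '$.colB'
) AS j;
```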

Regarding AWS S3 files, if SSIS proves to be problematic, you may want to consider the AWS CLI tool, using the S3 commands:

https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html

Depending on file naming schemes, this may offer a faster way to process thousands of files, assuming you're going to pull them from S3 to a local or WAN file store in your domain. Unfortunately it doesn't allow for getting newer files based on date created or modified, but it does have a sync operation that will grab only files that don't exist in the destination. If you're only ever getting new files, and not reloading modified files, that might be beneficial.
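For example, a pull from the bucket into a local staging folder is a one-liner (the bucket name and local path are placeholders):

```
aws s3 sync s3://your-event-bucket/events/ C:\s3-staging\events\
```

It copies only the objects that aren't already in the destination (or that differ in size or timestamp), so you can run it repeatedly.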

Once the files are downloaded to your side, an SSIS Foreach Loop should work, or you can process the JSON files by importing them into NVARCHAR(MAX) columns in a SQL Server table. OPENROWSET, BULK INSERT, or the bcp utility should work fine for any of those. From there, you'd shred the documents with OPENJSON.
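As a rough illustration of that load step (the file path, table, and column names are placeholders), you pull the whole file in as a single value and then shred it:

```sql
-- SINGLE_CLOB reads the entire file as one value (use SINGLE_NCLOB for UTF-16 files).
DECLARE @json NVARCHAR(MAX);

SELECT @json = BulkColumn
FROM OPENROWSET(BULK 'C:\s3-staging\events\account-event-001.json', SINGLE_CLOB) AS src;

-- Shred into a staging table; properties missing from the file arrive as NULL.
INSERT INTO dbo.AccountStaging (AccountKey, colA, colB)
SELECT accountKey, colA, colB
FROM OPENJSON(@json)
WITH (
    accountKey VARCHAR(20)  '$.accountKey',
    colA       VARCHAR(100) '$.colA',
    colB       INT          '$.colB'
);
```

In an SSIS Foreach Loop you'd pass the file path in as a variable, which means building the OPENROWSET statement with dynamic SQL, since BULK only accepts a literal path.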

As a point of reference, I recently did a project that was the reverse of your situation: I had to extract JPEG images stored in binary columns in SQL Server and upload them to S3. It was all done via bcp and AWS CLI S3 commands. In the end there were 10s of millions of files exceeding 6 TB of space. It took about 2 weeks in various stages, mostly because our uplink to AWS was throttled.

Also note that you will pay network egress charges to pull data from S3. It's not much per GB, but it adds up, and you'll want to check your actual invoiced amounts regularly (daily, at least) until you get a good picture of how much it will cost. And if you're going to be dealing with hundreds of thousands of files, you'll need to manage them somehow on your local file store. You DO NOT want more than a few thousand files per folder on an NTFS file system; you'll see abysmal performance and possible directory corruption once you exceed 50K-100K files. (One option to consider is moving your processed files in S3 to another bucket, or a folder within the same bucket, so that they're not available for the next round of processing. This will involve additional costs, though.)
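If you go that route, the move is also a one-liner with the CLI (bucket and prefix names are placeholders):

```
aws s3 mv s3://your-event-bucket/events/ s3://your-event-bucket/processed/ --recursive
```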


Thanks so much for such a lengthy and detailed answer. I'll take a look at your recommendations and videos tomorrow :slightly_smiling_face:

You're welcome. Hope they're useful.

At the risk of getting too far ahead, if you find you're dealing with a lot of JSON that has only a few elements you're interested in, these utilities might help:

https://jqlang.github.io/jq/

GRON really helped me out when I had nearly identical JSON documents that differed in only one or two elements. I could turn it into a line-based file compare and simply grab the different elements I needed. That cut down the amount of data that actually had to go into SQL Server tremendously.
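For example, gron flattens JSON into one assignment per line, which is what makes the line-based compare possible (sample document invented):

```
$ gron account-event-001.json
json = {};
json.accountKey = "A123";
json.colA = "foo";
```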
