TB of data ingest

Greetings,

We are tasked with ingesting 25 TB worth of log data into our database(s) each day, which comes out to about a billion rows a day.
What approaches have you used to ingest large data sets?

  1. Do you break it down into multiple smaller files?
  2. Do you use SSIS?
  3. Do you use PowerShell?

I am trying to see what is out there that could crunch this big data set fast and furious. For now I am testing the ingest using SSIS, and it does not look too promising.

Thanks

I've always favoured BCP - and, if the table has a Clustered Index, pre-sorting the import data on that key (and providing the hint/parameter that it is pre-sorted).
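
For anyone who hasn't used it, that hint goes in BCP's -h switch - a rough sketch only, where the database, table, column and file names are made up:

    REM Hypothetical sketch: the data file has been pre-sorted on the clustered index key (LogDate here),
    REM so the ORDER hint lets SQL Server skip sorting the rows during the bulk load.
    bcp MyTargetDB.dbo.MyLogTable in x:\path\MyLogFile.dat -S . -T -N -b 100000 -h "ORDER(LogDate ASC), TABLOCK"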


BCP won't be any faster than SSIS or PowerShell...they all have to read the file and push the data to a table. If the source data is in a single file - that would be quite a file, and there is not much you can do to process it any faster.

If the log files are separate files - then you could split the load into multiple concurrent processes, but be careful you don't try to run too many at one time.
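
For example (the table and file names here are made up), even a few START commands will run loads side by side - just keep the number of concurrent sessions modest:

    REM Hypothetical sketch: three concurrent BCP sessions, one per pre-split log file.
    REM START without /WAIT lets them run in parallel; don't launch more than the server can handle.
    start "Load part 1" bcp MyTargetDB.dbo.MyLogTable in x:\logs\log_part1.dat -S . -T -c -b 100000
    start "Load part 2" bcp MyTargetDB.dbo.MyLogTable in x:\logs\log_part2.dat -S . -T -c -b 100000
    start "Load part 3" bcp MyTargetDB.dbo.MyLogTable in x:\logs\log_part3.dat -S . -T -c -b 100000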

Using SSIS you can improve the performance by making sure you select a reasonable batch/commit size on the OLE DB Destination. This allows SQL Server to commit the changes more frequently during the load process and avoid issues with growing the transaction log - as well as reducing the time it takes to actually load the data.

I am guessing that your SSIS process is using the default settings for the OLE DB Destination - which essentially will attempt to load all rows in a single transaction. That will cause the transaction log to grow until it fills the drive - and even if that does not happen, it will take an extremely long time to perform the commit once all data has been loaded.
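
Those settings live in the OLE DB Destination editor rather than in code, but if you want to see the effect of batched commits outside SSIS, the same idea can be tried with a plain BULK INSERT - a sketch only, with invented database, table and file names:

    REM Hypothetical sketch: BATCHSIZE makes SQL Server commit every 100,000 rows instead of
    REM holding the whole load in one transaction, which keeps the transaction log in check.
    sqlcmd -S . -E -d MyTargetDB -Q "BULK INSERT dbo.MyLogTable FROM 'x:\path\MyLogFile.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 100000, TABLOCK)"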


Thanks all!
I hope it's not one single file - yikes! It is log files coming from hardware. We will have some sort of listener, which maybe we can throttle during ingest.

My preference for BCP is ... familiarity :slight_smile:

The thing I like (which may well be achievable with all the others, I don't have knowledge enough to know) is the level of control we get. Apart from speed our biggest problem is "What went wrong"; I haven't had close contact with SSIS for Donkeys' Years, but back then I saw Client SQL Admins spending all morning figuring out what had gone wrong with overnight Data Imports, and "fixing them". In fairness the SQL Admins the Clients employed were not high calibre, so their Imports probably lacked any sort of useful diagnostic logging / capture.

We typically use BCP for a bunch of ETL file imports, overnight, in a Batch file, using a command looking something like this:

BCP ^
    MyTargetDB.dbo.MyTargetTable ^
    IN x:\path\MyFileName.BCP ^
    -e x:\path\MyFileName.ERR ^
    -o x:\path\MyFileName.OUT ^
    -S . ^
    -T ^
    -N ^
    -E ^
    -b 100000 ^
    >>x:\path\MyFileName.ER2

In particular, the "-e" parameter to catch errors, the "-o" to catch Output (typically the number of rows processed in each batch) and the Redirect of screen output each provide diagnostic information, in appropriate circumstances, that no single output does.

We can use the presence of the "-e" Error file as an indication that there is a problem, along with the ERRORLEVEL that BCP returns.
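
Something along these lines, in the Batch file after the BCP command (paths as per the made-up example above):

    REM Hypothetical sketch: fail the job if BCP returned a non-zero ERRORLEVEL
    REM or wrote anything to the -e error file.
    if errorlevel 1 goto ImportFailed
    if exist x:\path\MyFileName.ERR for %%F in (x:\path\MyFileName.ERR) do if %%~zF GTR 0 goto ImportFailed
    echo Import OK
    goto :eof

    :ImportFailed
    echo BCP import failed - check x:\path\MyFileName.ERR and x:\path\MyFileName.ER2
    exit /b 1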

We use "-N" for Native data format whenever we can (i.e. SQL-to-SQL transfer) which solves the parsing of CSV / dates / etc. problems, but when using CSV if there is a muckup in the data then I want to know the row where it occurred, and if the import can continue, ignoring that row, so much the better.

What if the connection is duff, or the import file has zero bytes? (We treat that as an error condition; it's only ever happened to us when some earlier process had a disk full.)
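
The zero-byte check is cheap to do up front - something like this (same made-up paths as above):

    REM Hypothetical sketch: treat a missing or zero-byte import file as a hard error before BCP runs.
    if not exist x:\path\MyFileName.BCP (
        echo ERROR: import file is missing >>x:\path\MyFileName.ER2
        exit /b 1
    )
    for %%F in (x:\path\MyFileName.BCP) do if %%~zF EQU 0 (
        echo ERROR: import file is empty >>x:\path\MyFileName.ER2
        exit /b 1
    )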

I wish I could remember all the other errors we have had, because a comprehensive list would greatly help in the building and testing of these String & Gum solutions - e.g. allowing deliberate simulation of all potential errors.

For batch processes that provide the same-named files once a day, we want to be sure that timestamps are appropriate. We also want to be able to control "flow" of the files - so maybe something like this (a rough Batch-file sketch follows the list):

  1. Check that files exist in source location and "look OK" - e.g. filesize and date
  2. Move the files to the IMPORT location (so that any newly arriving files don't interfere)
  3. Perform the BCP import
  4. Move the files to a DONE location - e.g. we cycle 10 folders DONE01 ... DONE10 so we have the last 10 data import files still available when a User reports a problem (for some users we cycle 100 folders!!)
  5. Have a means of discovering that the process failed - e.g. on START, if there are already files in the IMPORT location then either it's currently already running or the previous import failed.
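
A very rough sketch of that flow as a Batch file (folder names are invented, and the DONE01 ... DONE10 cycling is shown as a single DONE folder for brevity):

    REM Hypothetical sketch of the flow above: check, move, import, archive.

    REM 5. Files already in IMPORT mean a previous run is either still going - or it failed.
    if exist x:\import\*.BCP (
        echo ERROR: IMPORT folder not empty - previous run still running, or it failed
        exit /b 1
    )

    REM 1 + 2. Check the source files exist (size/date checks as discussed above) and move them
    REM        to the IMPORT location so newly arriving files don't interfere.
    if not exist x:\incoming\*.BCP (
        echo Nothing to import
        goto :eof
    )
    move x:\incoming\*.BCP x:\import\ >nul

    REM 3. Perform the BCP import, once per file (the bcp command shown earlier).
    for %%F in (x:\import\*.BCP) do (
        bcp MyTargetDB.dbo.MyTargetTable in "%%F" -S . -T -N -E -b 100000 -e "%%~dpnF.ERR" -o "%%~dpnF.OUT"
    )

    REM 4. Move the processed files to a DONE location so recent imports stay available.
    move x:\import\*.BCP x:\done\ >nul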

I have nothing against BCP - other than the fact that it requires a 2 step process. The first step is an extract process from the source system into a file - the second step is an import process into the destination.

If the source and destination are both database systems - especially if both are SQL Server - then SSIS can outperform BCP because it can extract and load the data at the same time. There are two processes in SSIS: the first extracts the data and sends it down the pipeline, and the second receives the data from the pipeline and loads it to the destination.

If you define the batch and commit sizes to an appropriate size for your system - you also avoid large transactions that can take quite a bit of time to perform the commit. You can also control the batch size with BCP - but you cannot start loading data until all data has been extracted from the source system.

For example - if the extract process takes 15 minutes and the load process takes 5 minutes then the total processing time would be at least 20 minutes (depending on network latency, file location, etc...). For SSIS the process would be finished in a bit over 15 minutes - as the final set of data would be loaded and committed a short time after the extract completed.

As you have outlined above - you also have to build up processes to manage the files appropriately. You also run the risk of running out of disk space...all of that has to be coded and scripted.

SSIS has all of the ability to log errors and manage processing flow - to the point of even being able (if you wanted) to redirect error rows to a different destination and continue loading all other rows. Of course, most implementations do not go to that length and someone then has to figure out what caused the error - which could take some time depending on your knowledge of the process and SSIS in general.

Yup, agree with that. In fact we do do server-to-server when it's SQL at both ends, I'd forgotten about that. But we also do BCP in that case where, to give a real-world example, the 3rd party that manages a Great Plains APP exports via BCP, ZIPs it up, transfers it to the SQL box (which since various hardware changes over the years is now probably in the same rack!!) that our DB is on, Unzips it there and leaves it in a folder for our stuff to import.

Their stuff has no error checking / handling and breaks about once a month ...

There is no permitted direct-connect between the two servers. All that is to work around the cost of [upgrade because of a] licensing restriction as we are definitely not allowed to query the G.P. database direct.

I have no idea what the licence would cost, but I'm fairly sure it would be cheaper, as well as being way way better for the client, than the cost of all the BUILD Consultancy and follow up SUPPORT consultancy that the Client has paid for over the years ...

I don't know how much effort that is to build in SSIS, and whether you can reuse the code, once you've figured out Best Practice?

FWIW we have a single BATCH file that does the BCP. It just expects some parameters for where the files are, and what folder to stuff any error files into, so what we consider to be our Best Practice for BCP :slight_smile: is reused wherever it is needed, and has been upgraded in all places it is used as we have discovered a Better Mousetrap.

Not actually very difficult once you have done it - but it does have to be added to each package, although that can be put into a 'template' to be reused. Most people don't bother going to that extreme as it really isn't that hard to figure out why something failed - at least now that we have the Integration Services Catalog available, which will give you a report of each task and the failure message.

Generally - the most difficult part of troubleshooting is identifying which row and what column actually caused the issue. The Integration Services Catalog (2012 and above) now reports the row and column, making it much easier to identify the problem and determine the fix.

For someone that has already spent the time and effort to build the code and processes to utilize BCP - moving to SSIS would take more time and effort than sticking with what works...however, if you don't already have that built, SSIS is much faster to implement, deploy and manage.

That's good to hear. Wish that had been there on day one way-back-when :slight_smile:

Yes, that is an ongoing problem of course. We usually pull data into Staging tables, and then into the actual table from there. If we are importing from some rubbish CSV file we make all columns in the Staging Table VARCHAR(8000) and then do data type and length tests on all columns / rows, so that
a) the data survives the import without generating an Import Error
b) we can give the User a comprehensible error report - so the user can go-fix the source and reimport, rather than having to involve Support

With a SQL-to-SQL transfer we still have the issue that the other DB may change a datatype or column width (and we'll be the last to know) ... a field-size change is unlikely to actually impact data until a user enters a new "Longer value", so we detect that, chop it off (i.e. for varchar) and report it as a non-critical error.

We have code that mechanically generates all the tests for that. Both WARNING and ERROR MESSAGE columns in the Staging table, and so on.
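
As a flavour of what those generated tests look like - a sketch only; Staging_MyTable, OrderDate, CustomerName and the message text are invented, and TRY_CONVERT assumes SQL 2012 or later:

    REM Hypothetical sketch: flag staging rows whose VARCHAR(8000) values won't convert or won't fit,
    REM so they can be reported back to the User rather than breaking the import.
    sqlcmd -S . -E -d MyTargetDB -Q "UPDATE dbo.Staging_MyTable SET ErrorMessage = 'OrderDate is not a valid datetime' WHERE OrderDate IS NOT NULL AND TRY_CONVERT(datetime, OrderDate) IS NULL"
    sqlcmd -S . -E -d MyTargetDB -Q "UPDATE dbo.Staging_MyTable SET WarningMessage = 'CustomerName longer than 100 characters - will be truncated' WHERE LEN(CustomerName) > 100"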

Where the remote has a ChangeDateTime column we pull only rows newer than "last time", but also any row (in the Staging table) that has an Error Message from the previous import - it might now recover, e.g. an FKey error that is now resolved.

All probably? a bit excessive for a suck-in of @yosiasz TBs of Log Data ...

Lately we've added a Unique ID for the error too :frowning: Our import is picking up so many goofy-data issues, with error messages from code that was implanted on the basis "This should never happen", that the errors are needing management by the Client User :frowning: Ridiculous that the stuff upstream doesn't validate the data to a level that I would consider reasonable ... but we are where we are. So now we need "THIS User corrects THAT error" and "I want to ignore ERROR-X" along with "I want to ignore this error, on this record, until the record next changes". Absolutely crazy ... but it has all become standard as part of our Import Toolkit.

OK, glad to report it is not a full dump of TB-sized files but streams of data (IoT-ish) coming from hardware. The TB figure is the sum of streamed data per day.

So I'm not sure BCP can come into the picture here, though I very much like that utility. Not sure SSIS can even handle streamed data. The next option might be Kafka (https://kafka.apache.org/).

This really depends on who manages the import process. Since I manage the processes, I don't need an application that displays the errors; I follow up with the sending application/system and have them fix it. In most cases that is an outside vendor (insurance), and we don't have any access to fix the data ourselves.

Because these are external files sent from a vendor that I have no access/control over - I want (actually, require) the process to fail immediately, before anything is loaded into our staging tables. We get multiple files from the vendor - and if one row of one file fails, then we cannot proceed with any downstream processing. For example - we have a member file, and if one row fails for that file then we cannot load and process the claims data from the other files, because that member will not exist until the member file is successfully loaded.

Now - when I have control over extracting the data, I can build my processes to account for data changes. For example, when I have a known table where the length can and will change without notification - I set up my destination with a large enough size to handle it (e.g. if the current max is 64 characters - set the destination to 100). On output from the source I cast the column to that specific size, and that way I won't run into any issues (except truncation if they ever exceed 100 characters).
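
Whether the extract is an SSIS source query or BCP, the cast goes straight into the SELECT - e.g. with BCP's queryout (a sketch; the server, table and column names are placeholders):

    REM Hypothetical sketch: cast the column to a fixed width on extract so a silent size change
    REM at the source can't break the load (worst case is truncation past 100 characters).
    bcp "SELECT CAST(CustomerName AS varchar(100)) AS CustomerName, OrderDate FROM SourceDB.dbo.SourceTable" queryout x:\path\Extract.dat -S SourceServer -T -N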

It really depends on the purpose here - in most cases for me we are building reporting/data warehouse systems, and end users won't get involved in the load/build process. So building an application to allow them to 'work' errors isn't really necessary, and actually makes the whole process more complicated for no gain.

Where is this data currently being streamed to, and how is that processed? Is there already a process in place to take the streamed data and feed it to a downstream system? If so - what does that look like, and can you build off that to route to your database?

Currently it is being streamed to a specific set of listener devices.
This is a brand new process that has not even been architected yet; we are in the discovery stage. It is just a matter of us eventually creating a listener process to capture the broadcast in a secure fashion.
Once we get the listeners stood up, the rest should be straightforward: either dump the data to some .dat / .csv / .txt / .xml files and then ingest, or, if SSIS has a stream data flow thingy, use that.

If SSIS looks like it will handle the stream well then I'd be inclined to also dump to CSV/whatever - so that if SSIS fails you have the means of patching up the missed data. If you find you don't actually do any patching then you could dispense with that Belt + Braces.

Damned stuff always breaks IME ...

SSIS is not an interface engine - it is an integration engine. Although you may be able to find something that might work - I would not recommend trying to utilize SSIS in that capacity.

Ideally - you should look at receiving the data from these listener devices through a true interface engine that is designed to handle TCP/IP real-time interfaces with a client/server configuration. The interface engine would start up a service process on a specified port - and each client would then connect to that server...sending the data through the stream.

The interface engine then processes each record and forwards that record to a destination - which can be a database, file, other device, etc...

If your only option is to convert the stream to a file - then you could have that process create a file that can be processed by SSIS in a batch load. Again - I would not recommend using SSIS as an interface engine as it is not designed for that purpose.
