Using Parallel data load in a single table...very unique feature!

jasdeepsingh · August 13, 2015, 1:28pm

Hi All,

We have heard of loading data in parallel mode in different tables
simultaneously to save time and optimize the ETL Loads. But here is a
new concept that can be used to perform parallel data load in ONE SINGLE TABLE.

It can be achieved in Visual Studio 2012 without needing it to be upgraded to 2014 version to use BDD.

In SSIS , while trying to optimize a long running package i could
clearly see that writing data in target table was taking quite a long
time than reading the data from source. All the database concepts had
been tried already to optimize the data load but it dint do any
significant improvement.

While trying various optimizations at SSIS level i came across one
component named Percentage Sampling. Basic purpose of this component is
to divide data in two parts on the basis of percentage. It creates
chunks or slices of data and loads them in tables as per business
requirement.

Ex : If we have customer data and due to technical constraints we can
send mails to 30% of customers at once , then one data chunk of 30% of
total data is loaded into a table of one server , remaining 70% into
another table of another server and accordingly mail is sent. This is
basically used in retail stores like Big Bazaar where huge data is
processed for mails and messages.

After understanding the basic purpose of this component it just
clicked into mind whether this component could be used for parallel
loading in one single table !!!! And it worked with drastic optimization
up to 46%

Following are the test cases :

Case 1: Serial Loading
Source – 1 (22Million rows), Percentage Sampling — 0, Target instances – 1, Total time: 1min 33sec.

This is purely serial computing and loading.

Case 2: One Stage Parallel Loading
Source – 1 (22Million rows), Percentage Sampling — 1, Target instances – 2, Total time: 1min 9sec.

This is parallel loading since we are loading data simultaneously in same target table using 1 stage of percentage sampling.

Case 3: Three Stage Parallel Loading
Source – 1 (22Million rows), Percentage Sampling — 3, Target instances – 4, Total time: 53sec.

This is advanced parallel loading since we are loading data
simultaneously in same target table using 3 stages of percentage
sampling.

Case 4: Seven Stage Parallel Loading
Source – 1 (22Million rows), Percentage Sampling — 7, Target instances – 8, Total time: 1min 32sec.

This is advanced parallel loading since we are loading data
simultaneously in same target table using 7 stages of percentage
sampling but not effective as it was in Case 3.

Basic logic behind Percentage Sampling and Parallel Computing

Case 1: Serial computing what it does is, it loads the target
table serially row by row. Hence among all the four cases Case 1 has the
maximum time elapsed.

Case 2: We see that same target table is being loaded
parallely with more than 1 input branch. Hence it is quicker. Multiple
rows get inserted at a time making it a more optimized process. Hence
Case 2 elapsed time is 1min 9sec which is 24 sec lesser than Case 1
making it 25% optimized than Case 1.

Case 3: Now the question arises whether the process can be
optimized further following the same approach to next level. In Case 3
we see the total time elapsed is 53 sec which is 46% optimized than Case 1 because of use of 3 stages Percentage sampling.

Case 4: Implementing the same logic further and making it a 7 stage Percentage sampling we observe that this case is 1% optimized
than case 1. But clearly we can see it is not as optimized as it was in
Case 3 because the Parallel sampling also depends on the core
configuration and CPU memory of the server on which the package is
running. Hence here we get the threshold value and no need to dig out
further.

% Optimization: Case 3 > Case 2 > Case 4 > Case 1

Important Notes
This approach is data dependent and server configuration details.
Hence multiple cases need to be implemented before coming to a
conclusionAs stated above in introduction Parallel Computing is not the reason
for which this Percentage Sampling transformation is present in SSIS.
We have innovated its implementation for Parallel Computing.