Hi All,

We have heard of loading data in parallel mode in different tables

simultaneously to save time and optimize the ETL Loads. But here is a

new concept that can be used to perform parallel data load in ONE SINGLE TABLE.

It can be achieved in Visual Studio 2012 without needing it to be upgraded to 2014 version to use BDD.

In SSIS , while trying to optimize a long running package i could

clearly see that writing data in target table was taking quite a long

time than reading the data from source. All the database concepts had

been tried already to optimize the data load but it dint do any

significant improvement.

While trying various optimizations at SSIS level i came across one

component named Percentage Sampling. Basic purpose of this component is

to divide data in two parts on the basis of percentage. It creates

chunks or slices of data and loads them in tables as per business

requirement.

Ex : If we have customer data and due to technical constraints we can

send mails to 30% of customers at once , then one data chunk of 30% of

total data is loaded into a table of one server , remaining 70% into

another table of another server and accordingly mail is sent. This is

basically used in retail stores like Big Bazaar where huge data is

processed for mails and messages.

After understanding the basic purpose of this component it just

clicked into mind whether this component could be used for parallel

loading in one single table !!!! And it worked with drastic optimization

up to 46%

Following are the test cases :

Case 1: Serial Loading

Source – 1 (22Million rows), Percentage Sampling — 0, Target instances – 1, Total time: 1min 33sec.

This is purely serial computing and loading.

Case 2: One Stage Parallel Loading

Source – 1 (22Million rows), Percentage Sampling — 1, Target instances – 2, Total time: 1min 9sec.

This is parallel loading since we are loading data simultaneously in same target table using 1 stage of percentage sampling.

Case 3: Three Stage Parallel Loading

Source – 1 (22Million rows), Percentage Sampling — 3, Target instances – 4, Total time: 53sec.

This is advanced parallel loading since we are loading data

simultaneously in same target table using 3 stages of percentage

sampling.

Case 4: Seven Stage Parallel Loading

Source – 1 (22Million rows), Percentage Sampling — 7, Target instances – 8, Total time: 1min 32sec.

This is advanced parallel loading since we are loading data

simultaneously in same target table using 7 stages of percentage

sampling but not effective as it was in Case 3.

Basic logic behind Percentage Sampling and Parallel Computing

Case 1: Serial computing what it does is, it loads the target

table serially row by row. Hence among all the four cases Case 1 has the

maximum time elapsed.

Case 2: We see that same target table is being loaded

parallely with more than 1 input branch. Hence it is quicker. Multiple

rows get inserted at a time making it a more optimized process. Hence

Case 2 elapsed time is 1min 9sec which is 24 sec lesser than Case 1

making it 25% optimized than Case 1.

Case 3: Now the question arises whether the process can be

optimized further following the same approach to next level. In Case 3

we see the total time elapsed is 53 sec which is 46% optimized than Case 1 because of use of 3 stages Percentage sampling.

Case 4: Implementing the same logic further and making it a 7 stage Percentage sampling we observe that this case is 1% optimized

than case 1. But clearly we can see it is not as optimized as it was in

Case 3 because the Parallel sampling also depends on the core

configuration and CPU memory of the server on which the package is

running. Hence here we get the threshold value and no need to dig out

further.

% Optimization: Case 3 > Case 2 > Case 4 > Case 1

Important Notes

This approach is data dependent and server configuration details.

Hence multiple cases need to be implemented before coming to a

conclusionAs stated above in introduction Parallel Computing is not the reason

for which this Percentage Sampling transformation is present in SSIS.

We have innovated its implementation for Parallel Computing.