SQLTeam.com | Weblogs | Forums

How to get Merge Join to work for two different data sources in SSIS

Hello all,

I apologize in advance if I am posting incorrectly, I am new to SQLTeam :slight_smile: . I am seeking help with an issue I've been having in SSIS; I have been working on creating/loading data into a database for a school project and have been having some issues with Merge Join. I’ve researched many issues the others have had with Merge Join and typically solve my own problems but this one is a bit tricky. I’ve created an SSIS package that should pull a column from a table in Access (this column contains duplicate names to which I utilize a sort later on in the data flow) as well as pull another column from a table in my SQL Server database. For both of these OLE DB Sources I have tried using the simple method of selecting the table through the data access mode but I thought perhaps this was contributing to many warning messages because it would always pull everything from the table as opposed to the one column from each that I wanted. I am now using the SQL Command option with an extremely simple query.

SELECT DISTINCT Name
FROM NameTable

For both OLE DB sources the query is the same except for the parameters selected. Following the source, I have a data conversion on each (because I found that Merge Join is a pansy when the data types don’t match) and I convert the Access one from DT_WSTR to DT_STR, while the SQL Server source is converted from DT_I4 to DT_STR. I then follow both with a sort, passing through the copy of Name and Tid, checking the “removing sorts with duplicate rows” option. Following that step, I then begin utilizing Merge Join with the Access source being my left input and the SQL Server Source (by source I am just referring to the side of the data flow, you’ll see in the image below) being the right input. Below I will also show how I am configuring the Merge Join, in case I’m doing it wrong. Lastly, I have my OLE DB Destination setup to drop this data into a table with the following columns, PrimaryKey column (it auto increments as new data is inserted), the Name column and the Tid column.

When I run the columns it says that it succeeds with no errors. I check my database and nothing has been written, I also note that in SSIS it says 0 rows written. I’m not sure what is going on as I enable the data viewers in between the sorts and the merge join and can see the data coming out both pipelines. Another important thing to note is that when I enable the data viewer after the Merge Join, it never shows up when I run the package, only the two after sort appear. At first I thought maybe the data wasn’t coming out of the Merge Join so I experimented with placing derive columns after the Merge Join and sure enough, the data does flow through. Even with those extra things in between the Merge Join and Destination, the data viewers never pop up. I mention this because I suspect that this is part of the problem. Below are also the messages that SSIS spits out after I run the package.

SSIS Messages:

SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" starting.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x40043006 at Data Flow Task, SSIS.Pipeline: Prepare for Execute phase is beginning.
Information: 0x40043007 at Data Flow Task, SSIS.Pipeline: Pre-Execute phase is beginning.
Information: 0x4004300C at Data Flow Task, SSIS.Pipeline: Execute phase is beginning.
Information: 0x40043008 at Data Flow Task, SSIS.Pipeline: Post Execute phase is beginning.
Information: 0x4004300B at Data Flow Task, SSIS.Pipeline: "OLE DB Destination" wrote 0 rows.
Information: 0x40043009 at Data Flow Task, SSIS.Pipeline: Cleanup phase is beginning.
SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" finished: Success.
The program '[9588] DtsDebugHost.exe: DTS' has exited with code 0 (0x0).

DataFlow

EDIT: I couldn't show a picture of my merge join but the left input is that Name data and the right input is the id number being joined.

Lastly, I did have a somewhat similar issue before and I solved it on my own by using one source with the right SQL query, but the same thing doesn’t apply here because I’m pulling from two different sources and I am having issues with the Merge Join this time around. The code I used last time:

SELECT a.T1id,
b.T2id,
c.Nameid
FROM Table1 AS a join
Table2 AS b
On a.T1id = b.T2id,
Name AS c
ORDER BY a.[T1id] ASC

I post this because, maybe someone might know of a way to right some SQL that will allow me to forgo using Merge Join again, where I can somehow grab both sets of data and join them, then dump them in my table in SQL Server.

As always, I greatly appreciate your help and if there are any questions of clarifications that need to be made, please ask and I will do my best to help you help me.

Thanks!

I would recommend you use a stored procedure

insert into destinationtable
select x,y,z
  from dbo.sourceStaging src
where not exists (
 select 1 from destinationtable tgt
where src.joiningId = tgt.joiningId )

update tgt
  set tgt.fieldA = src.fieldA
  from destinationtable tgt
 join dbo.sourceStaging src
 on src.joiningId = tgt.joiningId 
where tgt.fieldA <> src.fieldA   --update only if there is a change

If the only columns you have coming from one source is Name - and the other source is Id - then how are they related to each other?

You have to have some type of value (column) that is being used to relate each set of data. The Id column should be matched to an Id column from the other source.

Thank you, I'll have to look into how to use a stored procedure, but I assume it is a tool within SSIS right?

The data is coming from a denormalized source and they are related to each other because of where they are currently stored. In the future data will come in denormalized and for each set I will be adding it to the normalized schema I created. Does that answer your question?

Thank you!

@yosiasz Hopefully you won't be annoyed by this but do you think you could help me understand a little bit more about what is going on in the code sample you shared with me. I'll try and do my own research into using a stored procedure within SSIS (or maybe it is in SQL Server?). Your help is greatly appreciated!

You have really annoyed so you need to send me some beer!:beer::beers:

The following code inserts new data from a table called sourceStaging that does not exist in destinationtable
insert into destinationtable
select x,y,z
  from dbo.sourceStaging src
where not exists (
 select 1 from destinationtable tgt
where src.joiningId = tgt.joiningId )


The following code updates existing data in destinationtable  that also exists in sourceStaging . It will update the fieldA only if there has been a change in sourceStaging table 
update tgt
  set tgt.fieldA = src.fieldA
  from destinationtable tgt
 join dbo.sourceStaging src
 on src.joiningId = tgt.joiningId 
where tgt.fieldA <> src.fieldA   --update only if there is a change

Waiting for :beer:

FYI:
you might not necessarily need to use SSIS just to run stored procedures. if you are dealing with multiple data sources and data destination you could use SSIS but you could also leverage LinkedServers.

1 Like

@yosiasz :beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers: All the beers belong to you, ha! So is this code stored in from tool used in SSIS? Like one of the sources with SQL written in it? Sorry, I really am a total noob at this stuff. Trying to pick it up but there is a lot haha.

Thank you again!

not enough beer young Jedi.

so this code I posted could be part of a stored procedure then in SSIS package you you pick SSIS Tool Box, From ToolBox you select Execute SQL Task

1 Like

In order for you to be able to join data - in SQL Server or SSIS - you need to be able to relate the data. That means you need to have a common data element from each source that matches...

Since you are pulling data from 2 separate sources you can combine them in SSIS for one destination, or you can put them both into the destination and then join the data in SQL Server.

I would extract the data from Access directly into staging tables defined just for that source. I would then extract the data from the other source into separate staging tables defined just for that source. In the control flow - i would then add another step after the data flows that calls a stored procedure on the destination that joins the data and files it into the final destination table.

1 Like

@yosiasz :beers::beer::beers::beer::beers::beer::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers::beers:

Enough beer good sir? Also, so I create my package with the two sources in the package like shown in my original post and then create a stored procedure in SSMS with the T-SQL you provided (adjusted for my stuff of course) and then connect the data flow to the Execute SQL Statement tool?

Thank you again!

Yes execute sql. but if you are only dealing with stored procedures no need for SSIS, it can be done just using SQL Jobs that call those procs. but not sure of your specific environment and situation.

one beer missing. but it ok.

@yosiasz You are so knowledgeable senpai! Here is your beer! :beer: I guess the thing that I am unsure of is how I can ping the Access database with the stored procedure. For the most part, the data will come from Access and Excel, where old data is mostly in Access that needs migrating & normalizing and from Excel comes new data that also needs to be loaded into the new database in a normalized form. The other things is that sometimes there are variations in column names because the data comes from different colleges and they all have their own way of doing things >.<.

kohai, in that case, I would recommend using SSIS package to extract the data from Access and Excel then dump that data into a Staging table then from Staging table you take the rows into your clean table.
how often does the data from colleges come.

At least once a month, maybe twice. Most the times its new data and sometimes they are just updates to current/old data. So a good process would then be, incoming data -----(use SSIS to extract and load, not transform)--->SQL Server Staging Table-----(use stored procedure to extract, transform and load)----->store in normalized database?

Thank you again for your help :slight_smile:

yes that type of process. if the column names change, it is ok as long as they are always in the same position in the excel file.

@yosiasz I'm having some trouble getting the stored procedure to work. I think that is because I miss-communicated my situation or I am just using it wrong.

insert into NameID
select @name, @TID
from dbo.StageDB src
where not exists (
select 1 from NameID tgt
where src.Name = tgt.TID )

update tgt
set tgt.TID = src.Name
from NameID tgt
join dbo.StageDB src
on src.Name = tgt.TID
where tgt.TID <> src.Name --update only if there is a change

Before trying to construct this stored procedure, I migrated my Access database over to SQL Server via SQL Server Migration Assistant. So now I am trying to join the name column in the stage db (it has repeating names but I only need one) with the primary key from another table into the NameID table in my "clean" database. With the repeating parameters, that is why I thought SSIS would be a good method for doing this stuff.

I appreciate your continued support Senpai!! Also, here is your :beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer::beer:

Drink responsibly.

please post schemas

How would you prefer them to look? Want me to screenshot an ERD or just table names with columns as bullets below?

Table1

  • T1ID (pk)
  • Name
  • DOB

Thanks again!

sure screen shot for now is good.