
How to join linked server tables with openquery?

SELECT * FROM OPENQUERY(PANTS, 'SELECT LONG ,SHORT FROM JEANS')

SECOND QUERY

SELECT * FROM OPENQUERY(SHIRTS, 'SELECT WHITE ,BLACK FROM STORES')

Let's say LONG is in the STORES table and I want to join on it. How would I do that?

I expect you could do:

SELECT *
FROM	OPENQUERY(PANTS,'SELECT LONG ,SHORT FROM JEANS') AS P
	JOIN OPENQUERY(SHIRTS,' SELECT WHITE ,BLACK ,LONG FROM STORES') AS S
		ON S.LONG = P.LONG

but it may be very inefficient. Another option would be to pull the data from each OPENQUERY into a local #TEMP table and then do the JOIN locally.
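For example, a minimal sketch (the #TEMP table names are mine; the column lists just follow the queries above):

SELECT LONG, SHORT
INTO #JEANS
FROM OPENQUERY(PANTS, 'SELECT LONG ,SHORT FROM JEANS');

SELECT WHITE, BLACK, LONG
INTO #STORES
FROM OPENQUERY(SHIRTS, 'SELECT WHITE ,BLACK ,LONG FROM STORES');

-- Index the join column so the local JOIN doesn't have to scan
CREATE INDEX IX_STORES_LONG ON #STORES (LONG);

SELECT P.LONG, P.SHORT, S.WHITE, S.BLACK
FROM #JEANS AS P
	JOIN #STORES AS S
		ON S.LONG = P.LONG;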

Would probably be worth also trying:

SELECT  P.LONG ,P.SHORT ,S.WHITE ,S.BLACK -- ,S.LONG
FROM	PANTS.DBName.dbo.JEANS AS P
	JOIN SHIRTS.DBName1.dbo.STORES AS S
		ON S.LONG = P.LONG

but my expectation is that that will be "slow" too.

If you do go down the 4-part-naming route then I strongly recommend you set up Synonyms for the 4-part names, and do NOT hard-code the 4-part names in your code. MUCH easier to change a Synonym when the server / database changes, rather than having to alter, and test :frowning:, all the code.
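Something like this (a sketch - DBName / DBName1 are just the placeholder database names from above):

CREATE SYNONYM dbo.JEANS_REMOTE FOR PANTS.DBName.dbo.JEANS;
CREATE SYNONYM dbo.STORES_REMOTE FOR SHIRTS.DBName1.dbo.STORES;

SELECT P.LONG, P.SHORT, S.WHITE, S.BLACK
FROM dbo.JEANS_REMOTE AS P
	JOIN dbo.STORES_REMOTE AS S
		ON S.LONG = P.LONG;

When the server / database moves you just DROP and re-CREATE the Synonym; the rest of the code never changes.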

One table has about 14 million rows (based on the criteria) in a separate database, and the other table has about 5.7 million rows based on the criteria, so joining locally isn't ideal because of the amount of data, which is why I want to join beforehand. Now my code is timing out and I am not sure what the best solution is, given the amount of data I am working with and the fact that it's in two different databases. Any advice?

Joining locally is your ONLY option - because the data is coming from two separate sources. Regardless of how you construct the query - SQL Server has to pull all data from each source into local worktables or temp tables - then join the data.

Performance will be much better if you pull the data locally into temp tables - because you can control the process. If you allow SQL Server to do this - either using OPENQUERY or 4-part naming - it is highly likely that one or more connections will actually utilize a cursor to loop over each row instead of a set-based operation. You can validate that by looking at the source systems and seeing whether the linked server uses 'sp_cursor' for your queries.
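One way to check, while the remote query is running (a sketch to run on the source server; sys.dm_exec_cursors needs VIEW SERVER STATE permission):

SELECT c.session_id, c.creation_time, c.fetch_status, t.text
FROM sys.dm_exec_cursors(0) AS c
	CROSS APPLY sys.dm_exec_sql_text(c.sql_handle) AS t;

If the linked server's connection shows up here, it has fallen back to row-by-row cursor fetches instead of a single set-based pull.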

My recommendation would be to set up a process to copy all relevant data from each source into local permanent tables - and refresh that data on a daily/weekly/monthly basis. Based on the amount of data - I would recommend either SSIS or BULK INSERT processes, where you can control the batch and commit sizes, which will keep your transaction logs from growing too large and filling a drive.
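For the BULK INSERT route, something like this (a sketch - the file path, delimiters and target table are hypothetical):

BULK INSERT dbo.STORES_LOCAL
FROM 'D:\feeds\stores_extract.dat'
WITH (
	FIELDTERMINATOR = '|',
	ROWTERMINATOR = '\n',
	BATCHSIZE = 100000,	-- commit every 100,000 rows so the tlog space can be reused as you go
	TABLOCK
);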

Some may recommend using BCP - which will be a bit slower and will take up more space, because you have to export to a file and import from that file, whereas with SSIS the movement of data is all done in memory and is multi-threaded (i.e. as data is read from the source it is written to the destination and committed based on the batch and commit sizes defined).

Absolutely agree. That's how we do it: we never use direct OPENQUERY or 4-part naming, we always localise the data in some way. That won't work for everyone, though - a local JOIN of two remote databases will always be painful :frowning:

I don't suppose PANTS and SHIRTS are actually on the same server? If so, putting a SProc on that server, which does the query / JOIN, might solve the problem. You then have to hope that PANTS and SHIRTS don't subsequently move to different servers!!
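Something like this, assuming both databases sit on the server behind the PANTS linked server (the proc name is mine; DBName / DBName1 are the placeholders from earlier, and calling it remotely needs RPC Out enabled on the linked server):

-- Created on the remote server, so the JOIN runs there and only the
-- result set comes back over the wire
CREATE PROCEDURE dbo.GetJeansStores
AS
BEGIN
	SELECT P.LONG, P.SHORT, S.WHITE, S.BLACK
	FROM DBName.dbo.JEANS AS P
		JOIN DBName1.dbo.STORES AS S
			ON S.LONG = P.LONG;
END;

-- Called from the local server:
EXEC PANTS.DBName.dbo.GetJeansStores;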

We use OPENQUERY :slight_smile: and put changed/new data into a staging table (and delete anything no longer present in the remote). That job is hugely simplified if the Remote DB has a ChangeDate on all rows (which in our case it nearly always does). The ones that require a full BCP export from the source, an import locally, and THEN figuring out what the DIFF is take all night long to import - and they may actually only change a handful of rows.
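The incremental pull looks something like this (a sketch - the staging table and ChangeDate column are hypothetical; OPENQUERY can't take variables, so the date filter is sent via EXEC ... AT instead, which needs RPC Out enabled):

DECLARE @cutoff datetime = (SELECT MAX(ChangeDate) FROM dbo.STORES_STAGING);

INSERT INTO dbo.STORES_STAGING (WHITE, BLACK, LONG, ChangeDate)
EXEC ('SELECT WHITE, BLACK, LONG, ChangeDate
	FROM DBName1.dbo.STORES
	WHERE ChangeDate > ?', @cutoff) AT SHIRTS;

That way the WHERE clause runs on the remote server and only the changed rows come across the wire.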

The problem with OPENQUERY is that you cannot control the batch or commit sizes. If you need to pull 100,000,000 rows using OPENQUERY it will be done in a single transaction - which will cause the tlog to grow larger than necessary and will also make the process take longer than needed.

Using SSIS/BULK INSERT and setting a reasonable batch/commit size not only keeps the tlog at a reasonable size - it also performs better, because you never hit the point where committing the rows takes longer than extracting the data.

I have seen processes that took hours to complete cut down to minutes by using this technique.
