
How to join linked server tables with openquery?


#1

SELECT * FROM OPENQUERY(PANTS, 'SELECT LONG, SHORT FROM JEANS')

SECOND QUERY

SELECT * FROM OPENQUERY(SHIRTS, 'SELECT WHITE, BLACK FROM STORES')

Let's say LONG is in the STORES table and I want to join on it. How would I do that?


#2

I expect you could do:

SELECT *
FROM	OPENQUERY(PANTS,'SELECT LONG ,SHORT FROM JEANS') AS P
	JOIN OPENQUERY(SHIRTS,' SELECT WHITE ,BLACK ,LONG FROM STORES') AS S
		ON S.LONG = P.LONG

but it may be very inefficient. Another option would be to pull the data from each OPENQUERY into a local #TEMP table and then do the JOIN (locally).
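For example, a minimal sketch of that temp-table route, reusing the table and column names from the question (everything else is illustrative):

-- Pull each remote result set into a local temp table
SELECT LONG, SHORT
INTO	#Jeans
FROM	OPENQUERY(PANTS, 'SELECT LONG, SHORT FROM JEANS')

SELECT WHITE, BLACK, LONG
INTO	#Stores
FROM	OPENQUERY(SHIRTS, 'SELECT WHITE, BLACK, LONG FROM STORES')

-- Index the join column now that the data is local
CREATE CLUSTERED INDEX IX_Jeans_LONG ON #Jeans (LONG)
CREATE CLUSTERED INDEX IX_Stores_LONG ON #Stores (LONG)

SELECT	P.LONG, P.SHORT, S.WHITE, S.BLACK
FROM	#Jeans AS P
	JOIN #Stores AS S
		ON S.LONG = P.LONG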

Would probably be worth also trying:

SELECT  P.LONG ,P.SHORT ,S.WHITE ,S.BLACK -- ,S.LONG
FROM	PANTS.DBName.dbo.JEANS AS P
	JOIN SHIRTS.DBName1.dbo.STORES AS S
		ON S.LONG = P.LONG

but my expectation is that that will be "slow" too.

If you do go down the 4-part-naming route then I strongly recommend you set up Synonyms for the 4-part names, and do NOT hard-code the 4-part names in your code. MUCH easier to change a Synonym when the server / database changes, rather than having to alter, and test :frowning:, all the code.
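Something along these lines would do it (the synonym names are made up; DBName / DBName1 are the placeholders from above):

CREATE SYNONYM dbo.RemoteJeans  FOR PANTS.DBName.dbo.JEANS
CREATE SYNONYM dbo.RemoteStores FOR SHIRTS.DBName1.dbo.STORES

-- Code then only ever references the synonyms; if the server or database
-- moves you just drop/recreate the synonyms instead of editing the code
SELECT	P.LONG, P.SHORT, S.WHITE, S.BLACK
FROM	dbo.RemoteJeans AS P
	JOIN dbo.RemoteStores AS S
		ON S.LONG = P.LONG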


#3

One table has about 14 million rows based on the criteria, in a separate database, and the other table has about 5.7 million rows based on the criteria, so joining locally isn't ideal because of the amount of data, which is why I want to join beforehand. Now my code is timing out and I am not sure what the best solution is, given the amount of data I am working with and the fact that it's in two different databases. Any advice?


#4

Joining locally is your ONLY option - because the data is coming from two separate sources. Regardless of how you construct the query - SQL Server has to pull all data from each source into local worktables or temp tables - then join the data.

Performance will be much better if you pull the data locally into temp tables - because you can control the process. If you allow SQL Server to do this - either using OPENQUERY or 4-part naming - it is highly likely that one or more connections will actually utilize a cursor to loop over each row instead of a set-based operation. You can validate that by looking at the source systems and seeing whether or not the linked server uses 'sp_cursor' for your queries.

My recommendation would be to set up a process to copy all relevant data needed from each source into local permanent tables - and refresh that data on a daily/weekly/monthly basis. Based on the amount of data, I would recommend either SSIS or BULK INSERT processes where you can control the batch and commit sizes, which will keep your transaction logs from growing too large and filling a drive.

Some may recommend using BCP - which will be a bit slower and will take up more space because you have to export to a file and import from that file, whereas with SSIS the movement of data is all done in memory and is multi-threaded (e.g. as data is read from the source, it is written to the destination and committed based on the batch and commit sizes defined).
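If you do go the BULK INSERT route, a rough sketch with a controlled commit size might look like this (the staging table, file path and batch size are made up for illustration):

-- Import a previously exported native-format file in 100,000-row batches,
-- so each batch commits separately and the tlog can be reused
BULK INSERT dbo.Stores_Staging
FROM 'D:\Transfer\stores_extract.dat'
WITH (
	DATAFILETYPE = 'native',
	BATCHSIZE = 100000,
	TABLOCK
)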


#5

Absolutely agree. That's how we do it; we never use direct OPENQUERY or 4-part naming, we always localise the data in some way. That won't work for everyone though, but a local JOIN of two remote databases will always be painful :frowning:

I don't suppose PANTS and SHIRTS are actually on the same server? If so, a SProc on that server which does the query / JOIN might solve the problem. You then have to hope that PANTS and SHIRTS don't subsequently move to different servers!!
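A rough sketch, assuming the PANTS and SHIRTS linked servers really do point at the same remote server (so DBName and DBName1 are just two databases on it); the procedure name is made up:

-- On the remote server: the join stays remote, only the result crosses the wire
CREATE PROCEDURE dbo.JeansStoresJoin
AS
SELECT	P.LONG, P.SHORT, S.WHITE, S.BLACK
FROM	DBName.dbo.JEANS AS P
	JOIN DBName1.dbo.STORES AS S
		ON S.LONG = P.LONG
GO

-- On the local server, via either linked server name:
EXEC PANTS.DBName.dbo.JeansStoresJoin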

We use OPENQUERY :slight_smile: and put changed/new data into a staging table (and delete anything no longer present in the remote). That job is hugely simplified if the Remote DB has a ChangeDate on all rows (which in our case it nearly always does). The ones where we have to do a full BCP export from the source, import locally, and THEN figure out what the DIFF is take all night to import - and they may actually only change a handful of rows.
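A loose sketch of that ChangeDate-driven pull, reusing the SHIRTS / STORES names from the thread (the staging table, ChangeDate column and cut-off date are illustrative; in practice the cut-off comes from MAX(ChangeDate) in the staging table, which means building the OPENQUERY string dynamically since OPENQUERY only accepts a literal):

-- Only rows changed since the last pull come across the link
INSERT INTO dbo.Stores_Staging (WHITE, BLACK, LONG, ChangeDate)
SELECT	WHITE, BLACK, LONG, ChangeDate
FROM	OPENQUERY(SHIRTS, '
		SELECT WHITE, BLACK, LONG, ChangeDate
		FROM STORES
		WHERE ChangeDate > ''2024-01-01''
	')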


#6

The problem with OPENQUERY is that you cannot control the batch or commit sizes. If you need to pull 100,000,000 rows using OPENQUERY it will be done in a single transaction - which will cause the tlog to grow larger than necessary and will also make the process take longer than needed.

Using SSIS/BULK INSERT and setting a reasonable batch/commit size not only keeps the tlog at a reasonable size - but it also will perform better because you never reach that limit where committing the rows takes longer than extracting the data.

I have seen processes that took hours to complete cut down to minutes by using this technique.