Load from view via ODBC using WHERE clause

extrayda · April 18, 2017, 11:16am

Hi - apologies if this is a newbie question - I have googled it but found nothing definitive.

I am downloading data from an external source - SQL Server using Native ODBC.
Connectivity etc. is all fine, the download works, but is very SLOW.

I am connecting to a View on the source server, which is comprised of several tables (as you would expect).
One of the tables has ~4 million rows.
The load that I have on my machine (destination) is filtered on a created on date being between two values.
The created on date is one of the fields in the view.

The download takes about 20 minutes to return a single row (result is correct, it should only return a single row at the moment).
My question is:
Does the remote server generate the entire view, THEN apply my WHERE clause (given that the WHERE is based on one of the view fields)?

Have to check tomorrow, but if I remember correctly, if I query the view on the source server itself using the same where clause, it returns the record in less than a minute.
The tables underlying the view are indexed, I need to confirm whether everything is indexed properly - I have only just got that level of access. Unfortunately I can not currently view the execution plans on the source server.

Kristen · April 18, 2017, 5:56pm

It might be worst than that (given how long it takes), it is possible that not only is the Remote retrieving ALL rows but it is then ALSO sending ALL rows to your PC (across the network ...) for YOUR PC to apply the WHERE clause.

I would expect it to be a lot quicker than that, so that may suggest that the query needs optimising, but that has nothing to do with the 20 minutes you experience from the remote.

If you can:

Create a LINKED SERVER on your, local, server referencing the REMOTE server. Use the native SQL Driver, not ODBC.

Use OPENQUERY to make your query, rather than directly querying the server.

SELECT *
FROM OPENQUERY(YourLinkedServer, 'SELECT Col1, Col2, FROM Remotedatabase.dbo.RemoteView WHERE XXXX')

I would expect that to be the same speed as-if executed directly on the remote server, except if you are retrieving many rows / columns add a reasonable amount of time for the data to physically travel across the network.

extrayda · April 19, 2017, 6:30am

Hi Kristen - thanks for the response!
I had a bit of a further look at things today.

The issue is with my WHERE clause. What it actually does is looks for a max date value in the View, and returns data based either on that, or a default date range (depending on which will return data). The underlying field in the view is NOT indexed, and comes from the table with 4 million rows. Takes ages however I run it.
HOWEVER if I replace the selection with an actual date range, the query runs in about 10 seconds.
So you are right, it does run quicker than my 'less than a minute' statement - once I checked properly!

I have asked for the field in the underlying table of the view to be indexed on the modifiedon date (I don't have rights to do it myself). My understanding is that this does also then work for the view. Will see if that helps with performance.
If that doesn't work, then I should in theory be able to get the max date required from the underlying table directly, and use that in the WHERE clause.

Will try and report back on Friday.

Kristen · April 19, 2017, 6:45am

That's true if you query the server directly / locally, but you still need to do the things I described in order to be able to access the data from remote (assuming you want a reasonably complex query), otherwise SQL may still resolve keys parts of the query locally - and to do that it will pull all the data from the remote and then apply the filter locally. So you need to use a "pass through" approach to have the filtering done at the remote - which is what OPENQUERY does.

Assuming you are referring to a query made remotely then its probably just that SQL is able to pass-through the whole query, whereas your other example is sufficiently complex that it fails to do that - but, again, OPENQUERY would allow you to control how much of the query (ideally all of it!) is passed through to remote.

You can JOIN the OPENQUERY results to local tables - but the guts of the remote query will still all run remotely, so you can minimise the data which is passed back and THEN processed locally. That sort of query, if just done as plain, complex, SQL will often result is masses of data transferring to local and THEN being filtered out locally ... which will have dreadful performance.

extrayda · April 19, 2017, 7:04am

Wow - speedy response !
I'm not sure whether I will be allowed to set up a Linked server, but will push for that if required.

I do have other similar data loads from this server which I have cut down to a few minutes from 50 minutes in some cases (severe lack on indexing on the source !).
I am hopeful that this can be resolved in the same way (or a bit of creative fiddling). It is a once a day load, so a few minutes to achieve a load is OK, but 24 minutes is not (to be honest, no one else really cares, I just don't like to leave things that I know are very inefficient).
Given that the WHERE clause is looking for MAX values that in itself takes a VERY long time to run, I am hopeful

Kristen · April 19, 2017, 7:21am

its only on your LOCAL server - you have control over that presumably? After all you made an ODBC connection to remote on it [i.e. your local server], right?

That's presumably because SQL is pulling all the data locally to do the MAX, whereas when you do a date range it does that at the Remote end. Just a guess, might be something else, but there is no way that [I can think of!!] where:

SELECT MAX(MyMaxCol)
FROM MyTable

will be any slower than

SELECT Col1, Col2
FROM MyTable
WHERE MyMaxCol BETWEEN @Param1 AND @Param2

Either there is an index on MyMaxCol in which case MAX(MyMaxCol) will be "instant", and that will also help the RANGE filter, or there is no index - in which case, in both cases, SQL will have to scan every single row in the table.

So if RANGE filter is quick, and MAX is slow, the only explanation (assuming that the rest of the WHERE clause is the same) will be because the fast one was fully processed on the remote and the slow one pulled data locally and THEN did the processing.

But that's assuming apples-and-apples, if there are other variances in the WHERE clause, different JOINs, etc. between the two test queries that would influence the outcome too, of course.