Polybase bad when using a variable

Hello everyone...

I have just created my first Oracle connection, using the Polybase feature with SQL Server 2019. Here is the command I have used:

CREATE EXTERNAL TABLE dbo.COMPLETED_TRAINING (
EMP_NUM VARCHAR(24) COLLATE Latin1_General_100_BIN2_UTF8 NOT NULL,
COURSE_NUM Integer Not Null,
COMPLETED_DATE DateTime2 Not Null
)
WITH (
LOCATION=N'........',
DATA_SOURCE=[....]
);

Now, if I do a select statement with a hard-coded value, it comes back instantly:
select * from dbo.COMPLETED_TRAINING where EMP_NUM = '123456';

But, I would really like to use a variable. (Something I can't do using OPENQUERY). I have tried both of these Declare statements prior to executing the query:

Declare @EMP_NUM VARCHAR(24) = '123456';
Declare @EMP_NUM VARCHAR(24) = '123456' COLLATE Latin1_General_100_BIN2_UTF8;

select * from dbo.COMPLETED_TRAINING where EMP_NUM = @EMP_NUM;

In both cases the command takes forever to run.

I am sure it has to do with the collation. The collation I made during the "CREATE EXTERNAL TABLE" was done in that format, because if I didn't I got an error message that it was required.

When using the variable, it appears to be doing a full table scan instead of using the index that is on the Oracle server.

Execution plan using a variable (35 seconds):
image

I have also tried:
CREATE STATISTICS SampleStats ON dbo.COMPLETED_TRAINING (EMP_NUM) WITH FULLSCAN;

Any suggestions?

...and this is what the Execution plan looks like when I just hard-code the employee number:

image

So I have found some more details:

The same behavior takes place if I use a variable for a numeric field. (So it may not be a collation problem).

Using resource monitor and the Polybase logs, I can see clearly that when I use a variable; it loads the entire Oracle table into a temporary table (in the TEMPDB), and then searches it.

When I use hard-coded values; it uses the index in the Oracle table.

I have seen similar behavior when the developers use the .NET ".Parameters.AddWithValue" method, where it uses nVarchar by default even though the actual column is Varchar. When that happens, SQL does a full table scan. If they use .Parameters.Add(value,type).Value =, then it uses the index.

You see this kind of issue when using linked servers - the problem is that SQL Server does not have any information from the remote system to determine how to process the data. When you hard-code the query with a value - SQL Server sends that query as is to the remote system and the remote system processes the query based on the hard-coded values.

When you use a variable - SQL Server does not have any information available to generate an execution plan - so it removes the filter, queries for the full table and then applies the filter after the data has been returned.

Worse yet - in some cases - SQL Server will decide to utilize a cursor where it then requests a single row from the source system, checks the filter - then requests the next row.

This really has nothing to do with how you define the external table...or what collations are used. It depends entirely on how SQL Server generates the plan for that query from a remote system.

I think your only option is going to be dynamic SQL.

Thanks for the reply Jeff...

I would like think that it should know how to do it properly; so I am going to try and poke around a bit more to see if I can avoid ad-hoc queries. Here is some more info I found...

  1. When you look at the specs, it says it supports "Oracle up to v12". My external source is using 11g.

  2. The Polybase installation progress installs the Oracle driver ODBC driver, and that driver does not appear in the "traditional" ODBC list in the administration screen.

  3. The execution plan for the two different queries show different syntaxes. For example: When the command is sent with hard-coded values (and takes less than 1 second), it looks like this:

select * from dbo.COMPLETED_TRAINING where [emp_num]=@1;

It's passing the HARDCODED value I set to Oracle as a parameter. It already knows the data type, because when we run the Create External table command it checks it. (And complains if there is a mismatch.)

When I run the command with parameters (and it takes 35 seconds), it sends it like this:
select * from dbo.COMPLETED_TRAINING where [emp_num]=@emp_num;

I would like to think that it is smart enough to know how to put that value as an Oracle parameter as well.

Cheers!

This is the same issue as seen with linked servers - SQL Server can parameterize the query where the values are hard-coded but cannot do the same for variables.

This isn't a driver issue - rather it is an issue with how external (linked or remote) queries will be processed. In the first one, SQL Server takes the hard-coded values - parameterizes them - and sends a parameterized query to the external system.

In the second one - SQL Server does not parameterize the query...which results in pulling all of the data from the external system and then filtering using the local variable.

Again - this is seen across linked servers and now external tables (which probably use the same methodology as a linked server). The only way I know of to get around this is to either:

  1. Call a stored procedure on the external system with appropriate parameters
  2. Build the query dynamically and execute the dynamic query
  3. Build the query using dynamic SQL and OPENQUERY to force execution on the external system instead of locally.

You can also see additional issues as soon as you try to join the external table to a local table. Doing this will often force SQL Server to pull more data locally than is needed - or forces a cursor over the external data.

Thanks for the workarounds and explanations Jeff!

So to follow up, I was able to get the queries working from SQL Server 2019 to Oracle 11G using parameters, and without any special start-up parameters. Instead, I just put QUERYTRACEON into the select statement directly:

select * from dbo.COMPLETED_TRAINING
where EMP_NUM = @EMP_NUM
OPTION (QUERYTRACEON 4199)

To repeat from the beginning, all that I have do is install a new SQL Instance with Polybase selected. I didn't have to start the SQL Server with -T4199. (Worth mentioning that when I did try that, it broke all of the Polybase queries.)

I'm very happy now that I can just use normal parameters for my queries to the external Oracle DB. NO MORE OPENQUERY! WOOO HOOOOOOO :slightly_smiling_face: