Unexpected behavior in subquery or is it

llatinsky · January 19, 2017, 7:52pm

Hello. Could somebody explain the reasoning of this query working:

use tempdb
go
create TABLE Table_Test (Column_1 int)
insert Table_Test values(1),(2),(3);

create TABLE Table_Test2 (Column_2 int)

insert into Table_Test2 select 1

select*from table_test where Column_1 in (select Column_1 from Table_Test2)
select*from table_test where Column_1 in (select Column_2 from Table_Test2)

djj55 · January 19, 2017, 8:06pm

What is the problem?

llatinsky · January 19, 2017, 8:13pm

In the first select the first table's column name is used in the subquery and it acts like where column_1 = column_1
in the second query it selects only values that are present in column_2. they question was - how do I explain the validity of this behaviour in a subquery?

ScottPletcher · January 19, 2017, 9:27pm

SQL will resolve a column name to its "closest" valid table.

In this case, if Table_Test2 contains the column named, it will come from that table. If not, it will come from an upper/outer query.

The validity of SQL doing this can be seen in this query:
select*from table_test where exists(select * from Table_Test2 where Column_1 = Column_2 )

Such potential issues are why you should follow a simple rule:
ALWAYS explicitly indicate which table a column is from if it could be from more than one table:

selectfrom table_test tt where Column_1 in (select tt1.Column_1 from Table_Test2 tt2)
selectfrom table_test tt where Column_1 in (select tt2.Column_2 from Table_Test2 tt2)

Just looking at the final code in the first query will help you realize what you did wrong.

llatinsky · January 20, 2017, 1:46pm

Thank you for the explanation, Scott. I definitely understand what I did - just did not see an explanation why SQL Server does not give an invalid column name error. Is that an ANSI standard behavour tha tSQL Server resolves it to its "closest" valid table?

ScottPletcher · January 20, 2017, 6:46pm

I believe it is, yes. It's more convenient when writing SQL, at least in theory. In practice, it can of course lead to hidden bugs, as we've just seen.

llatinsky · January 20, 2017, 7:02pm

Thank you. I needed an explanation of the behavior - so it's always a correlated sub-query even if it feels we are not referring to the outer table inside of the inner select at that point

ScottPletcher · January 20, 2017, 8:07pm

Technically it doesn't have to be a correlated subquery. A subquery there could refer to the outer query, but it doesn't have to.

jeffw8713 · January 21, 2017, 3:21pm

You can think of this as a scoping issue...

The outer query cannot 'see' any columns from the inner/sub query - but the inner/sub query can 'see' all of the columns from the outer (containing) query. This is also valid through multiple levels:

SELECT {scope a}
  FROM table1
 WHERE value IN (SELECT {scope b} 
                   FROM table2
                  WHERE EXISTS (SELECT {scope c}
                                  FROM table3
                                 WHERE ...))

In the above - {scope a} has access to only columns from table1 - {scope b} has access to all columns from the outer query (table1) and its own query (table2) but cannot see anything in table3. {scope b} can see all columns from all tables since it is at the innermost level.

I also recommend that you always alias the tables - and use those aliases to reference all columns in all queries at all levels. I have seen where not using the alias caused an issue where instead of deleting the 7 rows identified in the sub-query - over 400,000 rows were deleted. All because the sub-query referenced an invalid column from the outer query and evaluated to true for those 400,000 rows.