SQLTeam.com | Weblogs | Forums

Unexpected behavior in subquery or is it


Hello. Could somebody explain the reasoning of this query working:

use tempdb
create TABLE Table_Test (Column_1 int)
insert Table_Test values(1),(2),(3);

create TABLE Table_Test2 (Column_2 int)

insert into Table_Test2 select 1

select*from table_test where Column_1 in (select Column_1 from Table_Test2)
select*from table_test where Column_1 in (select Column_2 from Table_Test2)


What is the problem?


In the first select the first table's column name is used in the subquery and it acts like where column_1 = column_1
in the second query it selects only values that are present in column_2. they question was - how do I explain the validity of this behaviour in a subquery?


SQL will resolve a column name to its "closest" valid table.

In this case, if Table_Test2 contains the column named, it will come from that table. If not, it will come from an upper/outer query.

The validity of SQL doing this can be seen in this query:
select*from table_test where exists(select * from Table_Test2 where Column_1 = Column_2 )

Such potential issues are why you should follow a simple rule:
ALWAYS explicitly indicate which table a column is from if it could be from more than one table:

selectfrom table_test tt where Column_1 in (select tt1.Column_1 from Table_Test2 tt2)
from table_test tt where Column_1 in (select tt2.Column_2 from Table_Test2 tt2)

Just looking at the final code in the first query will help you realize what you did wrong.


Thank you for the explanation, Scott. I definitely understand what I did - just did not see an explanation why SQL Server does not give an invalid column name error. Is that an ANSI standard behavour tha tSQL Server resolves it to its "closest" valid table?


I believe it is, yes. It's more convenient when writing SQL, at least in theory. In practice, it can of course lead to hidden bugs, as we've just seen.


Thank you. I needed an explanation of the behavior - so it's always a correlated sub-query even if it feels we are not referring to the outer table inside of the inner select at that point


Technically it doesn't have to be a correlated subquery. A subquery there could refer to the outer query, but it doesn't have to.


You can think of this as a scoping issue...

The outer query cannot 'see' any columns from the inner/sub query - but the inner/sub query can 'see' all of the columns from the outer (containing) query. This is also valid through multiple levels:

SELECT {scope a}
  FROM table1
 WHERE value IN (SELECT {scope b} 
                   FROM table2
                  WHERE EXISTS (SELECT {scope c}
                                  FROM table3
                                 WHERE ...))

In the above - {scope a} has access to only columns from table1 - {scope b} has access to all columns from the outer query (table1) and its own query (table2) but cannot see anything in table3. {scope b} can see all columns from all tables since it is at the innermost level.

I also recommend that you always alias the tables - and use those aliases to reference all columns in all queries at all levels. I have seen where not using the alias caused an issue where instead of deleting the 7 rows identified in the sub-query - over 400,000 rows were deleted. All because the sub-query referenced an invalid column from the outer query and evaluated to true for those 400,000 rows.