Temp Table

Mel · May 24, 2016, 2:40pm

Hello,

What is the different between query # 1 and # 2 and which is better to use Temporary Table from memory and the most efficient way?

QUERY # 1:
If OBJECT_ID ('tempdb.dbo#Yaks') IS NOT NULL DROP TABLE #Yaks

CREATE TABLE #Yaks
(
A int,
B char(30),
C varchar(max),
D varchar(100)
)

INSERT INTO #Yaks (A,B,C,D)
SELECT DISTINCT A,B,C,D
FROM MYTABLE

--FINAL SET
SELECT A,B,C,D FROM #Yaks

DROP TABLE #Yaks

QUERY # 2:
If OBJECT_ID ('tempdb.dbo#Yaks') IS NOT NULL DROP TABLE #Yaks

SELECT DISTINCT A,B,C,D
INTO #Yaks
FROM MYTABLE

--FINAL SET
SELECT A,B,C,D FROM #Yaks

DROP TABLE #Yaks

Thank you all

or just

JamesK · May 24, 2016, 3:08pm

For the most part, it may not matter which method you use. But, it can have an impact in some cases. Couple of examples:

With select into you have less control over the length/type of the columns. This can be a problem if you wanted to insert more rows into table at a later point, and those happened to have longer character data.
If you are in SIMPLE or BULK-LOGGED recovery model, SELECT INTO is minimally logged. So it can be more efficient.
If you don't know the columns/data types of the source data in advance, SELECT INTO takes care of that for you.

I usually construct the able and then insert into it except in cases where I am trying to cut corners (faster to do select into) when doing something quick and dirty.

ScottPletcher · May 24, 2016, 3:26pm

Neither method affects whether only memory is used for the temp table or whether it has to spill to disk instead. That is affected only by the volume of data.

However, a hybrid method is actually best, because:
Option #1 requires you to hard-code the column definitions in advance. That can cause errors and/or force changes when the underlying source table changes.
Option #2 places some locks and holds on system tables and can affect other people trying to load data into the same db. Also, creating a clustered index on the table, if needed, must be done after the table is loaded, which is far less efficient.

Instead, you can combine the two methods, like so:

If OBJECT_ID ('tempdb.dbo#Yaks') IS NOT NULL DROP TABLE #Yaks

--create the table structure, but without any rows in it.
SELECT TOP (0) A,B,C,D
INTO #Yaks
FROM MYTABLE

--if the temp table would benefit from a clustered index,
--(and very often it will!, such as when JOINs or ORDER BY are use in SELECTs on the table),
--create the clus index before the table is loaded.
--For example:
CREATE CLUSTERED INDEX Yaks__CL ON #Yaks ( A, B, C, D )

INSERT INTO #Yaks WITH (TABLOCK) --allows minimal logging, even in nontemp dbs.
SELECT DISTINCT A, B, C, D
FROM MYTABLE

SELECT A,B,C,D
FROM #Yaks
ORDER BY A, B, C, D --the clus index will mean no additional sort is required to do this ORDER BY

DROP TABLE #Yaks

Mel · May 24, 2016, 4:04pm

Hi ScottPletcher,

Did you mean Created Clustered or NonClustered Index on A,B,C,D?

Also Scott, what is the impact of WITH(TABLOCK)?

Thanks

ScottPletcher · May 24, 2016, 4:35pm

Clustered. There's almost never a good enough reason to build a nonclus index on a temp table (it should be clus index instead). Edit: That is, while people often build a single, nonclus index on a temp table, they really shouldn't, they should build a clus index instead.

The "WITH (TABLOCK)" allows SQL Server to use minimal logging if the database is in Simple or Bulk_Logged mode, which reduces the overhead of doing the INSERTs. I don't think that matters in tempdb, which I strongly suspect always does minimal logging anyway. But when you do an INSERT INTO ... SELECT in a non-tempdb db, using WITH (TABLOCK)" could really reduce the overhead of that table load.

Mel · May 24, 2016, 5:19pm

Thanks Scott. Appreciated !

Mel · May 24, 2016, 5:19pm

Thanks JamesK. Appreciated!

Kristen · May 24, 2016, 6:49pm

My "style" for that job is:

SELECT A,B,C,D
INTO #Yaks 
FROM MYTABLE
WHERE 1=0

I presume its much-of-a-much-ness, and thus just personal preference ...

I need to get into the habit of doing that ...

One thing which typically catches me out is when the Source Table includes an IDENTITY column:

SELECT MyIdentityCol,B,C,D
INTO #Yaks
FROM MYTABLE
WHERE 1=0

...

INSERT INTO #yaks 
SELECT MyIdentityCol,B,C,D
FROM MYTABLE

which fails because it is trying to insert values into the IDENTITY column ... so I have to create the table with:

SELECT CONVERT(int, MyIdentityCol) AS MyIdentityCol,
       B,C,D
INTO #Yaks
FROM MYTABLE
WHERE 1=0

which is a bit of a PITA ...

ScottPletcher · May 24, 2016, 9:12pm

You can add a UNIOn ALl after the first query which will cancel the identity property:

SELECT *
INTO #Yaks
FROM MYTABLE
WHERE 1=0
UNIO AL --filter at work rejects post as "sql injection" if I spell this correctly
SELECT *
FROM MYTABLE
WHERE 1 = 0

Mel · May 25, 2016, 1:19am

Hi Scott,

Now that I am looking at your code of:

--create the table structure, but without any rows in it.
SELECT TOP (0) A,B,C,D
INTO #Yaks
FROM MYTABLE
or

SELECT MyIdentityCol,B,C,D
INTO #Yaks
FROM MYTABLE
WHERE 1=0

What is the benefit having table without any rows in it? Is that to avoid like what you stated on the option # 1, hard-code the column definitions in advance?

Thanks Scott

Kristen · May 25, 2016, 4:53am

its the equivalent of your CREATE TABLE statement. Benefits are the two points that Scott mentioned:

#1 It will use the properties of the Source Table as the basis for defining the columns. Thus if the definition of the Source Table should change,in future, the #YAKS table will automatically adjust to match.

#2 if you were to do:

SELECT DISTINCT A,B,C,D
INTO #Yaks 
FROM MYTABLE

and IF MYTABLE holds a large number of rows, then during the insert process there is enough delay that locks etc. may cause interference with other people. Also, if using this method there is no opportunity to create a Clustered Index BEFORE the data is added to the table. Adding a Clustered Index to a table AFTER it is populated may require a lot of data page shuffling so preferable to create the Clustered Index first (either CREATE TABLE / CREATE CLUSTERED INDEX or SELECT ... INTO #YAKS ... WHERE 1=0 and then CREATE CLUSTERED INDEX prior to INSERT INTO #TEMP SELECT ...

JeffModen · May 25, 2016, 1:38pm

The locks are shared locks and about the only thing that it will cause problems with is the Explorer Window in SSMS. It won't affect people trying to load data into the same DB other than what normal blocking would do if the target is the same table.

On the post-creation of a clustered index, most of the time it's totally unnecessary because, ostensibly, you've only loaded what is needed and all code will result in a full table scan anyway. The use of a clustered index in such cases will only serve to slow things down.

There's also more than 1 type of efficiency... if the recovery model is BULK LOGGED or SIMPLE (as should be for an "import staging" database, which can really save your keester), the post-creation of a Clustered Index will be minimally logged and can be quite fast compared to not being able to use SELECT INTO. But, "It Depends" and, with that thought in mind, it's always good to do a comparison test rather than by doing thing by rote.

Mel · May 25, 2016, 1:43pm

Thank you Kristen

ScottPletcher · May 25, 2016, 1:53pm

JeffModen:

The locks are shared locks and about the only thing that it will cause problems with is the Explorer Window in SSMS. It won't affect people trying to load data into the same DB other than what normal blocking would do if the target is the same table.

On the post-creation of a clustered index, most of the time it's totally unnecessary because, ostensibly, you've only loaded what is needed and all code will result in a full table scan anyway. The use of a clustered index in such cases will only serve to slow things down.

There's also more than 1 type of efficiency... if the recovery model is BULK LOGGED or SIMPLE (as should be for an "import staging" database, which can really save your keester), the post-creation of a Clustered Index will be minimally logged and can be quite fast compared to not being able to use SELECT INTO. But, "It Depends" and, with that thought in mind, it's always good to do a comparison test rather than by doing thing by rote.

I've seen SELECT ... INTO hold metadata locks that affected other queries and loads, not just SSMS, particularly in SQL 2008.

The proper clustered index can be quite helpful to performance if the table is being selectively joined to or SELECTed from multiple times.

Minimal logging can also be achieved for INSERT INTO ... SELECT, as I noted above, thus making defining the clus index in advance of loading the table save a scan of the data while still gaining the benefits of minimal logging.

Kristen · May 25, 2016, 2:03pm

I sometimes use a #TEMP as s staging table for import of some data, and as such the Clustered Index point it probably moot - as Jeff describes.

But ... I also use a #TEMP where I want to reuse the data - either on multiple occasions, within an Sproc, or to cut & dice it in different ways - and as such I think a clustered index is very helpful for performance.

A lot of the time the number of rows in my #TEMP are so few, and the rows are relatively "narrow", that it probably isn't worth the bother

ScottPletcher · May 25, 2016, 3:01pm

I wanted to make one main point there really. I've seen far too many temp tables with a nonclustered index but no clustered index. That's frankly just silly. If you're going to create a single index on the table, make it a clustered index. If an index on that column(s) is really useful, it will be vastly more useful as the clus index in the vast majority of cases.

JeffModen · June 4, 2016, 5:45pm

I can't argue against something I've never seen happen and can't say it absolutely won't happen. That's the trouble with saying something won't happen. I have, however, seen it happen to the source table if it's done through a linked server.

[quote]The proper clustered index can be quite helpful to performance if the table is being selectively joined to or SELECTed from multiple times.
[/quote]

I've also seen it slow things down quite a bit. That happened to me when I was writing the first of two "Hierarchies on Steroids" articles. The source table needed to be self-joined as a part of a recursive CTE. The clustered index version ran quite a bit slower than the heap version.

To wit, and I'm sure you'd agree, "It Depends" and is worth testing for.

[quote]Minimal logging can also be achieved for INSERT INTO ... SELECT, as I noted above, thus making defining the clus index in advance of loading the table save a scan of the data while still gaining the benefits of minimal logging.
[/quote]

It sure can depending on "conditions". Your good post reminded me that I've not tested for this in more than a decade (just before 2005 hit the streets, IIRC) and that I need to do those tests again. Thanks, Scott.