Order By with Table Creation and Clustered Index not sorting by column used for cluster/order

catbeasy · April 24, 2020, 8:35pm

Its my understanding that creating a clustered index on a table sorts that table (by whatever column the clustered index refers to) physically on the disc (as opposed to non-clustered index).

I have done this and this is the first time i have bothered to look at the column I put the clustered index on and upon doing a select * on the table, it doesn't appear to sort by the clustered column.

Also, just as a lark, i created another table including an ordered by a particular column in creating the table (created from another populated table). When i looked at the results using Select * (with no order by), the records were not sorted by the order i had put in the table creation.

Was wondering why this is? Clearly without an order by, select * has an homunculus in it that decides what it wants the order to be..! Was thinking at least in the case of the clustered index, the retrieval would be off the physical disc and so expected the order by to be per the clustered column.

And it seems the order by in the creation of a table is a complete waste of time?

Thanks for any conceptual help here..

JamesK · April 24, 2020, 8:50pm

This is expected behavior.

By definition, a database table is an unordered collection. That SQL Server may store it in a specific order is no guarantee that the result set of a SELECT will return the data in that order. See the first bullet near the top of this page.

People who are into database theory make the distinction between a Relation and a Cursor. In layman's language (i.e., my language), a relation is a table - an unordered collection. A cursor is a result set that may come out of that relation, in a programming environment. Cursor has an order and that order is guaranteed to be something that you desire only if you specify the order by clause when querying the relation.

ScottPletcher · April 24, 2020, 9:25pm

The absolute rule in SQL is that if you don't specify an ORDER BY on the (outermost) SELECT, you are not guaranteed any particular order of rows coming back.

That said, when the table is first loaded, generally it is in physical order (although you still need an ORDER BY to guarantee ordered results). As the table is modified, SQL may use pointers and/or other physical "tricks" to provide logical order for the rows, even though the rows may not strictly be in physical order any more.

If you think about it, that's the only way it could work. Say you had a table with 1B rows and you added 3000 rows near the front of the table. Obviously you wouldn't want SQL re-writing the last 900M rows just to keep them in strict physical order.

jeffw8713 · April 25, 2020, 2:35pm

Yes and No...

If your expectation is that selecting the data from the table without an ORDER BY will be in a specific order - then yes it is a complete waste of time.

If your expectation is that an IDENTITY column is defined based upon that specific ORDER BY statement - then no, it isn't a complete waste of time.

The only value that I have found for using ORDER BY on an INSERT statement is to ensure the identity is computed based on that order - or when used with a TOP statement to ensure the correct rows are selected and inserted.

Note: there was a bug in SQL Server 2008 R2 (and lower) where inserting into a table variable with an ORDER BY would return the result from that table variable in the same order it was inserted without using an ORDER BY on the select statement. Unfortunately for a lot of people that relied on that undocumented feature - Microsoft fixed the bug in SQL Server 2012 which caused many issues for applications where the expected sort order was no longer valid.

JeffModen · April 26, 2020, 6:10pm

Heh... leave it to MS to fix a bug that should have been changed to a feature.

ScottPletcher · April 27, 2020, 2:24pm

I don't think so. MS should have a relational db follow relational rules when reasonably possible.

It'd be a bad idea to have one specific SELECT query always be in order and all others not. That would cause confusion for many people, because they'd then expect other table structures to follow the automatic-order-by rule.

JeffModen · April 27, 2020, 7:12pm

Yeah... I'll definitely agree with that. To wit, even if they did add it as a feature, I'd still do an ORDER BY just in case they changed their minds again.

catbeasy · April 28, 2020, 5:27pm

Hello, thank you all for your responses! That really clears things up for me about this particular issue. Only about 1.8 billion more to go..

One other thing that is related: So, you all had mentioned that doing a select on a table will not generate a specified order by on a table that had been created with an order by. Got that. However, what about for speeding up query return results via joins?

So, what if you did an order by upon table creation, say on an ID column, which you then JOINED to another table on that ID column. Would that help the join? i.e. would it speed up the data return? (lets assume the other table had an index on that ID column). Would it behave as an INDEX? Or would it still be the same situation where the order by upon table creation doesn't help?

I ask because i am in the habit of creating clustered or non clustered indexes on any temp tables i create in order to speed up data return when i join those temp tables to other tables.

Again, this is more for a conceptual understanding of what's going on..good learning this stuff! Thanks for any continued input!

ScottPletcher · April 29, 2020, 1:55pm

No, not unless there's an index on the ID column that SQL can use for the query.

I ask because i am in the habit of creating clustered or non clustered indexes on any temp tables i create in order to speed up data return when i join those temp tables to other tables.

First focus on creating the best clustered index for any table, but esp. temp tables, before worrying about nonclus indexes. It almost never makes performance sense to create a nonclus index on a temp table.

Also, you must be sure to create the clus index before loading the table. If you wait until after the table is loaded, the entire table must be scanned, sorted and rewritten.

jeffw8713 · April 29, 2020, 7:54pm

Even with a clustered or non-clustered index on a temp table - the full table is going to be read in any join operations unless the joins are defined with multiple columns or you include one or more columns from the temp table in the where clause or you are using that temp table for multiple steps where only a portion of the temp table is needed for each step.

It might be possible to avoid a sort operation on the join by including a clustered index - but whether that improves performance enough would depend on other factors, such as how many rows in that temp table and how expensive the sort operation.

ScottPletcher · April 29, 2020, 8:47pm

It's difficult to be sure when or whether a clus index will be effective, but it usually cost effective to add one to a temp table that will be joined on those columns, esp. if the nonclus table being joined to is also clustered that way or the query is using a nonclus index with the join column as the lead key. Even if those situations don't exist today, they could tomorrow.

In short, use a clus index as often as possible on a temp table, use nonclus indexes only after rigorous checking to make sure they are worth it.

The main exception would be extremely large tables that you only join to once. In that case, it may not be worth creating a clus index because of the sort required when inserting into the temp table.

Again, create the clus index before loading the temp table!!

Nonclus index(es) is(are) typically created after loading, but note that building the index after loading the will require a full scan of the temp table for every nonclus index built. That's another reason it's so rare that nonclus indexes are valuable on a temp table.

yosiasz · April 29, 2020, 11:09pm

Could you please expound on this a bit more. The join column as the lead key. One of the biggest challenges, includes me, is indexing. And one of the most opinionated subjects in SQL, which it should not be.

ScottPletcher · April 30, 2020, 1:18pm

Because if both inputs are already sorted on a given column(s), SQL may be able to use a MERGE join. MERGE joins are extremely efficient, i.e. the best join type there is, when they are applicable to a given query.

JeffModen · April 30, 2020, 2:39pm

Amen to that!