SQLTeam.com | Weblogs | Forums

Selecting hieararchies from a table with a twist

sql2008
sql2014
tsql

#1

Hi All,

I have a table called FileTable with the following columns:
ID int NOT NULL
ParentID int NULL

ID is the primary key value while ParentID is acting as a foreign key.
ParentID contains the ID of the node directly above it in the hierarchy. ParentID is only null when we talk about the root node, otherwise it contains the ID of the node above. The table contains several hierarchies, approx. 15000 at any single time, so performance is important.

Data in the table is like this:

ID ParentID

1 null
2 1
3 1
4 2
5 null
6 5
7 6
8 null
9 8

These data basically describe 3 hiearchies, with root node IDs 1, 5 and 8.

I have another table with a single int column called AccessTable.
It contains the following integers

FileRootID

1
8

What I want to achieve is to select only those hierarchies (meaning the root node and all nodes below it) from the FileTable, where the root ID is present in the AccessTable.

How do I achieve this?


Bill Of Materials Table get top-level part number
#2

Using a recursive CTE:

DECLARE @FileTable TABLE (ID INT, ParentID INT)

INSERT INTO @FileTable
VALUES (1, NULL), (2, 1), (3, 1), (4, 2), (5, NULL), (6, 5), (7, 6), (8, NULL), (9, 8);

DECLARE @AccessTable TABLE (FileRootID INT)

INSERT INTO @AccessTable (FileRootID)
VALUES (1), (8);

WITH cte
AS (
     SELECT f.*
     FROM @FileTable f
     INNER JOIN @AccessTable a ON f.ID = a.FileRootID
     WHERE f.ParentID IS NULL
     
     UNION ALL
     
     SELECT f.*
     FROM @FileTable f
     INNER JOIN cte r ON f.ParentID = r.id
     )
SELECT *
FROM cte

#3

Amazing. Thanks a lot.


#4

@Daniel,

There are a couple of concerns that I have for this...

You say there are >15000 individual "hierarchies" in the "FileTable" and that you're concerned with performance. With that in mind, I have some questions for you...

  1. How many total rows (nodes) in the table?
  2. How often does the table suffer inserts, updates, or deletes?
  3. Would it be better to display the hierarchies in hierarchical order rather than the level order that an rCTE would show?

#5

Hi Jeff,

We have approx. 100 000 records in the table, and yes, performance takes hit with recursive CTE no matter how we display the data. The query itself is running about 10 times slower. The difference is quite noticeable. The fact that SQL needs to build hierarchies takes a toll on query performance.
We ended up adding a HierarchyID INT NOT NULL column that is filled with the ID of the root node at the time nodes are inserted (the application has information on which hierarchy a node is inserted into). This way we can filter with a simple join without building the hierarchy.

However, I'm grateful for your concern about performance - which indeed turned out to be a problem - and for coming back to my issue. Much appreciated.


#6

Depending on the maximum depth of any given hierarchy, you can do this with a series of self-joins and unpivot. How deep can your hierarchies go?


#7

I think in all the tables where we have hierarchies we maintain a PATH column (as a fixed width list of IDs of all ancestors). I expect this was the "old way" before rCTEs were available etc.

Does make me wonder, based on performance issues etc., whether the tradeoff is still worthwhile of having to maintain the PATH (when rows are inserted into the hierarchy) and the space taken up by the wide list of ancestors, in order to avoid having to do (slow) recursive nested queries every time there is a Query on the table?


#8

Heh... yeah. I kinda figured you were having perf problems because of the rCTEs.

Understood on what you've done so far. I still need to know how often you add, modify, or delete any row or rows in that 100,000 node hierarchy. I may have something for you that will blow the socks off a lot of different methods.


#9

Are you preserving the original Adjacency List, Kristen, which I think is absolutely essential for reasons that'll become apparent depending on finding out how often Daniel's hierarchy table is updated.


#10

Hi Jeff,

The most performance critical part is the "select" because that's when users have to wait for the list to be displayed. Inserts and deletes happen on server side. There are approx. 800 hierarchies inserted every day (most of them during working hours), that are made up of about 5000 records. The same amount is deleted every day give or take a few percent.
Please note that hierarchies can have any depth (application doesn't impose a limit).

Thanks for the help again.


#11

It's precisely the SELECTs that I was worried about. My concern for inserts, updates, and deletes was to make sure that they're not done every second of the day.

I believe you may be in for a treat. I'll be back. I have to modify my hierarchical test table generator for the orchard with all the trees you have and do a couple of experiments.


#12

We have a PATH column with IDs for all ancestors (we debated whether that should include "self" or not, in the end we left it out to save space, but quite a lot of processing involves adding the Node ID for Self back in ... maybe we should have included it in the first place.

We also have, in separate columns, IDs for PARENT (NULL = Root) and immediately preceding SIBBLING (NULL=First at that level, for that Parent)

Only real issue is when a new record is inserted (or moved ...) ahead of exiting records at the same level as there is some shuffling around of next sibling. We also allow moving of an element, and all its children, to be inserted somewhere else on the tree - that needs the PATH fixing too.


#13

Outstanding! I know it's going to sound very strange but, because you kept the parent ID, I believe you're really going to like what you see here pretty soon. Keeping the "Adjacency List" (the parent/child thing, just to be sure) going in your hierarchy is something that a lot of people don't do and it's one of the things that makes moves, adds, and deletes (MAD process for short and, yeah, it can drive one mad) a virtual cake walk. In fact, using the simplicity of the "Adjacency List' to do all the MAD stuff is all you have to do. Everything is "auto-magic" and nasty fast especially when it comes to SELECTs to interrogate your hierarchy.


#14

Ok, here we go. First, a bit of background.

An "Adjacency List" (Parent/Child) is absolutely the best and easiest to maintain for MAD (moves, adds, deletes). It can certainly be done by the computer but, because each node is "aware" of one and only one other node (its parent), a human can fix the hierarchy if something goes wrong because it's pretty simple for humans to understand. This is they type of hierarchy that both Daniel and Kristen have. The bad part about "Adjacency Lists" is that they're terrible in the performance department when it comes to retrieving data.

Another type of hierarchy is called the "Hierarchical Path" . It has a couple of advantages not the least of which is to produce the correct sort order from left to right and top to bottom as you might see in an org chart. It also has the advantage of every node knowing its entire upline. It's a real bugger for humans to maintain, though, and for the same reasons. It's also a real pain to write any meaningful SELECTs against, which is why Microsoft wrote a bunch of SQL CLR behind the scenes to be able to use the HierarchyID datatype, which is nothing more than a "positional" Hierarchical Path. Kristen also has this type of hierarchy in her table(s).

For the fastest SELECTs against a hierarchy ever, you need a third type of hierarchy known as "Nested Sets". It uses two anchor IDs (which I call "bowers", which are the two large anchors on the front of ships). The structure of the bowers is what makes the SELECTs so incredibly fast. The HUGE disadvantage of Nested Sets is, similar to the Hierarchical Path, each node is aware of many other nodes and it's just about impossible for a human to maintain even in small hierarchies. Further, traditional methods for converting an Adjacency List to Nest Sets using traditional looping push-stack methods can take literally days to complete. They're also relatively slow and complex when it comes to MADs. They do have a serious performance advantage in that the Left Bower (usually an INT) is much more narrow than the hierarchical path (I refer to it as the "Sort Path) not to mention some incredibly simple and sargable (can do index seeks and range scans) code.

All of that has been solved in the form of a hierarchical table that contains all 3 methods, allowing for the advantages of each. I happen to be a personal friend :smirk: of the guy that put all of this together and, if you want a really good description of how it all works, you should read his article at the following URL.
Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

I've put together an example the resembles Daniel's original request except that, for ease of understanding for anyone, I've followed the idea of a 100,000 node employee "org chart" that uses EmployeeID and ManagerID because people seem to get a bit lost when you start talking about ParentID's and ChildID's. The code is easily modifiable using about a half dozen or so "Search'n'Replace" commands to suit your needs. I say that it resembles Daniel's original request in that the hierarchy has 100,000 nodes in it and around 15,000 "trees" within the "forest".

Let's get started in my next post below.


#15

The first thing that we need is a play area in the form of a new database so that we don't accidentally mess anything up in a real database. I used my initials as part of the database name with the idea that no one is likely to have an existing database named as such.

We also need a stored procedure that will generate a pure (no mistakes or accidental cycles) 100,000 node hierarchy. Please feel free to inspect the following code to be ensured that I'm not pulling any fast ones on anyone and that the code will not damage anything if executed as is.

And, apologies for the line wraps. My self imposed limit on code width is, when the cursor hits character # 120, it's time to wrap the line to prevent horizontal scrolling on my screen or line wraps on landscape paper.

Not to worry about how long it takes. It only takes about 4 seconds to execute on my laptop.

--=====================================================================================================================
--      This creates a test hierarchy database that you're not likely to already have.
--      DO NOT RUN ANY OF THE CODE THAT FOLLOWS IN ANY OTHER DATABASE BECAUSE THERE ARE TABLE DROPS!
--=====================================================================================================================
--===== Create the test database for our hierarchy testing
 CREATE DATABASE JBM_Hierarchy;
  ALTER DATABASE [JBM_Hierarchy] MODIFY FILE (NAME=N'JBM_Hierarchy'    , SIZE=500MB,FILEGROWTH=100MB)
  ALTER DATABASE [JBM_Hierarchy] MODIFY FILE (NAME=N'JBM_Hierarchy_log', SIZE=100MB,FILEGROWTH=100MB)
GO
-----------------------------------------------------------------------------------------------------------------------
--===== Create this stored procedure in our test database.  Details live in the comments.
    USE JBM_Hierarchy;
GO
 CREATE PROCEDURE dbo.BuildLargeEmployeeTable
/****************************************************************************
 Purpose:
 Create a randomized "well formed" Adjacency List hierarchy with indexes.

 Progammer's Notes:

 1. Each EmployeeID (except for the Root Node, of course) is assigned a
    random ManagerID number which is initially always less than the current
    EmployeeID to ensure that no cycles occur in the hierarcy.

 2. The second parameter used to call this stored procedure will optionally
    randomize the EmployeeIDss to make the hierarchy truly random as it would
    likely be in real life.  This, of course, takes a small amounnt of extra
    time.

 3. This code runs nasty fast and is great for testing hierarchical
    processing code. Including the index builds, this code will build a
    million node Adjacency List on a 4 processor (i5) laptop with 6GB of RAM
    in just several seconds.  The optional randomization adds just several 
    more seconds.

 Usage: 
--===== Create the hierarchy where all the ManagerIDs are less than the
     -- EmployeeIDs. This is the fastest option and will build a million node
     -- hierarchy in just about 7 seconds on a modern machine.
   EXEC dbo.BuildLargeEmployeeTable 1000000;

--===== Making the second parameter a non-zero value will further randomize
     -- the IDs in the hierarchy. This, of course, takes extra time and will
     -- build a million row hierarchy in about 17 seconds on a modern 
     -- machine.
   EXEC dbo.BuildLargeEmployeeTable 100000,1;

 Revision History:
 Initial concept and creation - Circa 2009 - Jeff Moden
 Rev 01 - 15 May 2010 - Jeff Moden 
        - Abort if current DB isn't "tempdb" to protect users.
 Rev 02 - 13 Oct 2012 - Jeff Moden
        - Add a randomization stop to make the hierarchy more like real life.
 Rev 03 - 15 May 2015 - Jeff Moden
        - Undo Rev 01 for a demonstration at the following URL.
http://forums.sqlteam.com/t/selecting-hieararchies-from-a-table-with-a-twist/1148
****************************************************************************/
--===== Declare the I/O parameters
        @pRowsToBuild INT,
        @pRandomize  TINYINT = 0
     AS

--===========================================================================
--      Presets
--===========================================================================
--===== Supresss the autodisplay of rowcounts to cleanup the display and to
     -- prevent false error returns if called from a GUI.
    SET NOCOUNT ON;

----===== Make sure that we're in a safe place to run this...
--     IF DB_NAME() <> N'tempdb'
--  BEGIN
--        RAISERROR('Current DB is NOT tempdb. Run aborted.',11,1);
--        RETURN;
--    END;

--===== Conditionaly drop the test table so we can do reruns more easily
     IF OBJECT_ID('dbo.Employee','U') IS NOT NULL
        DROP TABLE dbo.Employee;

--===========================================================================
        RAISERROR('Building the hierarchy...',0,1) WITH NOWAIT;
--===========================================================================
--===== Build the test table and populate it on the fly.
     -- Everything except ManagerID is populated here. The code uses a 
     -- technique called a "Psuedo-Cursor" (kudos to R. Barry Young for the
     -- term) to very quickly and easily build large numbers of rows.
 SELECT TOP (@pRowsToBuild)
        EmployeeID   = ISNULL(CAST(
                          ROW_NUMBER() OVER (ORDER BY (SELECT 1))
                       AS INT),0),
        ManagerID    = CAST(NULL AS INT),
        EmployeeName = CAST(NEWID() AS VARCHAR(36))
   INTO dbo.Employee
   FROM master.sys.all_columns ac1
  CROSS JOIN master.sys.all_columns ac2
  CROSS JOIN master.sys.all_columns ac3
;
RAISERROR('There are %u rows in the hierarchy.',0,1,@@ROWCOUNT) WITH NOWAIT;

--===== Update the test table with ManagerID's.  The ManagerID is some random
     -- value which is always less than the current EmployeeID to keep the 
     -- hierarchy "clean" and free from "loop backs".
 UPDATE dbo.Employee
    SET ManagerID = CASE 
                       WHEN EmployeeID > 1 
                       THEN ABS(CHECKSUM(NEWID())) % (EmployeeID-1) +1 
                       ELSE NULL 
                   END
;
--===========================================================================
--      Conditionally randomize the hierarchy to be more like real life
--===========================================================================
     IF @pRandomize <> 0
  BEGIN
        --===== Alert the operator
                RAISERROR('Randomizing the hierarchy...',0,1) WITH NOWAIT;

        --===== Create a randomized cross reference list to randomize the
             -- EmployeeIDs with.
         SELECT RandomEmployeeID = IDENTITY(INT,1,1),
                EmployeeID
           INTO #RandomXRef
           FROM dbo.Employee
          ORDER BY NEWID()
        ;
        --===== Update the ManagerIDs in the Employee table with the new 
             -- randomized IDs
         UPDATE emp
            SET emp.ManagerID = RandomEmployeeID
           FROM dbo.Employee emp
           JOIN #RandomXRef xref ON emp.ManagerID = xref.EmployeeID
        ;
        --===== Update the EmployeeIDs in the Employee table with the new 
             --randomized IDs
         UPDATE emp
            SET emp.EmployeeID = RandomEmployeeID
           FROM dbo.Employee emp
           JOIN #RandomXRef xref ON emp.EmployeeID = xref.EmployeeID
        ;
    END
   ELSE
  BEGIN
        --===== Alert the operator
                RAISERROR('The hierarchy is not randomized',0,1) WITH NOWAIT;
    END
;
--===========================================================================
--      Build the indexes necessary for performance.
--===========================================================================
--===== Alert the operator
        RAISERROR('Building the keys and indexes...',0,1) WITH NOWAIT;

--===== Add some indexes that most folks would likely have on such a table
  ALTER TABLE dbo.Employee 
    ADD CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeID)
;
 CREATE UNIQUE INDEX By_ManagerID_EmployeeID 
     ON dbo.Employee (ManagerID,EmployeeID)
;
  ALTER TABLE dbo.Employee 
    ADD CONSTRAINT FK_Employee_Employee FOREIGN KEY
        (ManagerID) REFERENCES dbo.Employee (EmployeeID) 
     ON UPDATE NO ACTION 
     ON DELETE NO ACTION
; 
--===========================================================================
--      Exit
--===========================================================================
RAISERROR('===============================================',0,1) WITH NOWAIT;
RAISERROR('RUN COMPLETE',0,1) WITH NOWAIT;
RAISERROR('===============================================',0,1) WITH NOWAIT;
GO
-----------------------------------------------------------------------------------------------------------------------
--===== Run this command to build a highly randomized 100,000 node "Adjacency List" test table 
     -- called "Employee".
   EXEC dbo.BuildLargeEmployeeTable 100000,1;

--===== And run this code to convert the table to a 15,000 tree orchard with a single top level.
     -- As is the nature of random numbers, there may be some duplication here. All that means
     -- is that there will be a random number of trees up to 16,000 and generally pretty close
     -- to 15,000 actual nodes.
  PRINT '--===== Number of individual trees created...';
 UPDATE emp 
    SET ManagerID = NULL
   FROM dbo.Employee emp
  WHERE emp.EmployeeID IN
(
 SELECT TOP 16000
        TreeTopNode = ABS(CHECKSUM(NEWID()))%100000+1
   FROM      sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
)
;
GO

Up to this point, we've only build a wad of test data resembling Daniel's original post. The next post below will the first real step in doing all I say that can be done.


#16

And now here's the magic. Create this stored procedure. Again, it's commented but, if you want to the "Nit-I-Grit-I" to be able to maintain it, please see the article I previously referred you to.

Note that any time you make a change to your original table, you should run this to rebuild the hierarchy. It only takes about 4 seconds including the two index builds that it does. While that may seem a bit inconvenient, I'm thinking that's probably at least as long as the stuff that Daniel and Kristen are doing. We'll show you how to use the table in my next post after this one.

--===== Since we're going to drop and create tables, ensure that we're in the test database.
    USE JBM_Hierarchy;
GO
 CREATE PROCEDURE dbo.RebuildHierarchy AS
/**********************************************************************************************************************
        We have something similar to what you have in your hierarchy now. The trouble is that we can't see the forest
        for the trees.  We have to build the forest.  You would need to do the same thing in your hierarchy. If it
        doesn't already exist, we have to build a top level "forest" node and then make all the trees realize that
        they're in the same forest by updating the "ManagerID" (Parent) of all the current NULLs at the top of each 
        tree to refer to the new top level "Forest" node.

        We also don't want to make any changes to the original table because that might blow up some of the code that
        does the inserts and checks for "cycles", etc.  So need to make a "copy" of the table with some additional
        columns that we'll need and put the "Forest" together.

        (duration usually less than 3 or 4 seconds on a 100,000 rows)

        Although the code is fairly well documented, if you want more detail on how this works, please see the article
        at the following URL:
        http://www.sqlservercentral.com/articles/Hierarchy/94040/

        --Jeff Moden - 16 May 2015
**********************************************************************************************************************/
--=====================================================================================================================
--      Presets
--=====================================================================================================================
--===== Suppress the auto-display of rowcounts to prevent a GUI from interpreting them as an error.
    SET NOCOUNT ON;

--===== Conditionally drop tables to make reruns easy
     IF OBJECT_ID('dbo.Hierarchy','U') IS NOT NULL DROP TABLE dbo.Hierarchy;

--=====================================================================================================================
--      If the special "Hierarchical Tally Table" that we're going to need doesn't already exist, build it now.
--=====================================================================================================================
     IF OBJECT_ID('dbo.HTally') IS NULL
  BEGIN
        --===== Create the HTally table to be used for splitting SortPath
         SELECT TOP 1000 --(4 * 1000 = VARBINARY(4000) in length)
                N = ISNULL(CAST(
                        (ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1)*4+1
                    AS INT),0)
           INTO dbo.HTally
           FROM      master.sys.all_columns ac1
          CROSS JOIN master.sys.all_columns ac2;

        --===== Add the quintessential PK for performance.
          ALTER TABLE dbo.HTally
            ADD CONSTRAINT PK_HTally 
                PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100;
    END
;
--=====================================================================================================================
--      Build the new table on-the-fly including some unused columns that we'll use in the second half of this proc.
--      The ISNULLs on the Bowers create NOT NULL columns so that we can make a unique index to really make our
--      future SELECTs really fly.
--=====================================================================================================================
WITH 
cteBuildPath AS 
(
--===== This creates the "Forest Node"
 SELECT  EmployeeID     = 0
        ,ManagerID      = NULL
        ,EmployeeName   = 'Forest Node'
        ,HLevel         = 1
        ,SortPath       = CAST(CAST(0 AS BINARY(4)) AS VARBINARY(4000))
  UNION ALL
--==== This is the "anchor" part of the recursive CTE
 SELECT  EmployeeID     = anchor.EmployeeID
        ,ManagerID      = 0
        ,EmployeeName   = anchor.EmployeeName
        ,HLevel         = 2
        ,SortPath       = CAST(CAST(anchor.EmployeeID AS BINARY(4)) AS VARBINARY(4000))
   FROM dbo.Employee AS anchor
  WHERE ManagerID IS NULL --This finds the top level of the individual trees in the original table
  UNION ALL 
--===== This is the "recursive" part of the CTE that adds 1 for each level
     -- and concatenates each level of EmployeeID's to the SortPath column.
 SELECT  recur.EmployeeID
        ,ManagerID = ISNULL(recur.ManagerID,0)
        ,recur.EmployeeName
        ,HLevel = cte.HLevel + 1
        ,SortPath = CAST(cte.SortPath + CAST(Recur.EmployeeID AS BINARY(4)) AS VARBINARY(4000))
   FROM dbo.Employee      AS recur
  INNER JOIN cteBuildPath AS cte 
          ON cte.EmployeeID = recur.ManagerID
)--==== This final INSERT/SELECT creates the "narrow surrogate" column (NodeNumber)
     --  that we'll be able to sort on as well as populating the new table.
 SELECT  EmployeeID     = ISNULL(sorted.EmployeeID,0)
        ,ManagerID      = sorted.ManagerID
        ,EmployeeName   = sorted.EmployeeName
        ,HLevel         = sorted.HLevel
        ,LeftBower      = ISNULL(CAST(0 AS INT),0)                     --NOT NULL column populated later
        ,RightBower     = ISNULL(CAST(0 AS INT),0)                     --NOT NULL column populated later
        ,NodeNumber     = ROW_NUMBER() OVER (ORDER BY sorted.SortPath) 
        ,NodeCount      = CAST(0 AS INT)                               --Populated later
        ,SortPath       = sorted.SortPath
   INTO dbo.Hierarchy
   FROM cteBuildPath AS sorted;

--=====================================================================================================================
--      Build the Nested Sets into the unpopulated columns of the new table along with some other helpful information.
--=====================================================================================================================
DECLARE @LeftBower INT;

--===== Build the NodeCount and the Bowers
WITH 
cteCountNodes AS
(--==== This splits the SortPath column every 4 bytes
     -- and counts the number of matching nodes
 SELECT EmployeeID = SUBSTRING(h.SortPath,ht.N,4), 
        NodeCount  = COUNT(*) --Includes current node
   FROM dbo.Hierarchy h 
  CROSS JOIN dbo.hTally ht
  WHERE ht.N BETWEEN 1 AND DATALENGTH(h.SortPath)
  GROUP BY SUBSTRING(h.SortPath,ht.N,4)
)--==== This calculates the Bowers based on the formulas
     -- previously discussed
 UPDATE dbo.Hierarchy
    SET NodeCount  = cn.NodeCount,
        @LeftBower = LeftBower = (hn.NodeNumber * 2) - hn.HLevel, 
        RightBower = ((cn.NodeCount-1) * 2) + @LeftBower + 1
   FROM dbo.Hierarchy hn
  INNER JOIN cteCountNodes cn
     ON hn.EmployeeID = cn.EmployeeID;

--===== All done. Add some indexes to maximize performance.
 CREATE UNIQUE    CLUSTERED INDEX IXC_Hierarchy         ON dbo.Hierarchy (LeftBower,RightBower,EmployeeID);
 CREATE UNIQUE NONCLUSTERED INDEX IX_Hierarchy_Employee ON dbo.Hierarchy (EmployeeID,LeftBower);
GO

#17

Ok... we're almost ready to rock. Just to make sure everything is up to date, let's run the hierarchy rebuild process, again. Run this code. Remember that this needs to be done after doing MADs to the original table, which we kept alive.

--===== Since we're going to drop and create tables, ensure that we're in the test database.
     -- As previously stated, this only takes about 4 secconds on the 100,000 rows.
    USE JBM_Hierarchy;

--===== Now, rebuild the entire hierarchy.
   EXEC dbo.RebuildHierarchy;

The original problem than Daniel posted also involves a table called "Access" where he maintains a list of "root nodes" that he needs to get the trees for. In his data, those node were the ones that marked the root of each tree with a NULL in the ParentID (ManagerID in this code example). Since we built a "Forest" out of all these nodes, those NULLs were changed to "0" to point to the root of the "Forest".

Let's create and populate that table as yet another test table. I don't know how many nodes Daniel has in this table but we'll use 5,000, which covers about a 3rd of the total "root nodes". That should be a decent test. So, run this code to build the "Access" table with 5,000 rows in it.

--===== Conditionally drop tables to make reruns easy
     IF OBJECT_ID('dbo.Access','U') IS NOT NULL DROP TABLE dbo.Access;

--===== Build the Access table with 500 random nodes that we know are the roots of trees in the forest.
     -- Because we're using NEWID(), which is a random number, we'll get different results each time.
     -- Again, the use of ISNULL is to make a NOT NULL column so that we can apply a PK, which is also unique.
 SELECT TOP 5000
        FileRootID = ISNULL(EmployeeID,0)
   INTO dbo.Access
   FROM dbo.Hierarchy
  WHERE ManagerID = 0
  ORDER BY NEWID()
;
--===== And now we can add the expected PK.
  ALTER TABLE dbo.Access
    ADD CONSTRAINT PK_Access PRIMARY KEY CLUSTERED (FileRootID) WITH FILLFACTOR = 100
;

All our test data is in place and we just rebuilt the hierarchy table. Because of this new structure, the rest is drop dead simple. The following code solves the original problem. To summarize, it looks up the bowers in the hierarchy for all the nodes listed in the Access table and then uses those bowers to find the complete downline for all 5,000 nodes that it found.

--===== Solve the problem using the new structure.
 SELECT h1.*
   FROM dbo.Hierarchy h1
   JOIN dbo.Hierarchy h2
     ON h1.LeftBower BETWEEN h2.LeftBower AND h2.RightBower
   JOIN dbo.Access a
     ON h2.EmployeeID = a.FileRootID
  ORDER BY h1.leftbower
;

Isn't that amazing how simple it is? I'm thinking Daniel will be mighty happy because that only took 686ms to run and it's sorted in correct hierarchical order, to boot!

So, to summarize what's been done.

  1. We created a new semi-permanent table that contains the original "Adjacency List", a "Hierarchical Path", and "Nested Sets", along with some other info and it only took us 4 seconds including the indexes we made. The same thing using old RBAR push-stack methods would have taken would have taken significantly longer (about 50 MINUTES according to previous testing).

  2. As an interesting side bar, none of the code that did the MAD on the original table needs to be changed because we made no changes to the original table.

  3. We built an "Access" table to simulate Daniel's original requirement.

  4. The code to ultimately solve the original problem is remarkably simple and fast. Similar code to do other hierarchical tasks will be equally as simple and fast.

Again, for a really deep dive explanation on how this code works, please see the following article.
Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

If you have to "aggregate" data in a hierarchy at multiple levels and very quickly, then you need to see Part 2 of that short series of articles. MLM'ers lover it.
Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations

I know its a really long set of posts but two people looked like they needed some serious help with their hierarches. Thanks for listening, folks.


#18

Very useful Jeff, thanks.

I forgot to mention that we also thought about haveing a SEQUENCE column, so we could just do:

SELECT Col1, Col2
FROM MyHierarchy
WHERE ...
ORDER BY Sequence

Its so long ago since we did the design that I can't remember how the discussion went . but clearly having a 1,2,3 number would mean a massive number of UPDATEs for any change.

I think that then went to "10,20,30" and a "Renumber the world" function when inserting a new node meant that there wasn't any available range left.

But why didn't we just add a FLOAT and halve-the-difference when inserting a node?

When we need to display things in order (the members of a branch to the tree) there is quite a bit of code to deduce, based on sibling links, what the order is.


#19

The Hierarchy table that I wrote about handles all that auto-magically. The LeftBower, NodeNumber, and SortPath (hierarchical path) can be used to sort with a simple "Order By" as I did in the example. The sort on LeftBower is normally the most effective because it's frequently involved in queries against the table.

Just curious, Kristen... what is your hierarchical table used for and how many nodes does it contain?


#20

Its a table of pages on a website. The pages are in hierarchy to match menu structure for the pages (we also allow "pseudo X-Ref pages" where we want a page from "somewhere else" in the tree to be available. Only about 1,500 records in the table.

I'll take a look at the column you mentioned in your code. Going to need a time when I have an hour spare to experiment :slight_smile: