The first thing that we need is a play area in the form of a new database so that we don't accidentally mess anything up in a real database. I used my initials as part of the database name with the idea that no one is likely to have an existing database named as such.
We also need a stored procedure that will generate a pure (no mistakes or accidental cycles) 100,000 node hierarchy. Please feel free to inspect the following code to be ensured that I'm not pulling any fast ones on anyone and that the code will not damage anything if executed as is.
And, apologies for the line wraps. My self imposed limit on code width is, when the cursor hits character # 120, it's time to wrap the line to prevent horizontal scrolling on my screen or line wraps on landscape paper.
Not to worry about how long it takes. It only takes about 4 seconds to execute on my laptop.
--=====================================================================================================================
-- This creates a test hierarchy database that you're not likely to already have.
-- DO NOT RUN ANY OF THE CODE THAT FOLLOWS IN ANY OTHER DATABASE BECAUSE THERE ARE TABLE DROPS!
--=====================================================================================================================
--===== Create the test database for our hierarchy testing
CREATE DATABASE JBM_Hierarchy;
ALTER DATABASE [JBM_Hierarchy] MODIFY FILE (NAME=N'JBM_Hierarchy' , SIZE=500MB,FILEGROWTH=100MB)
ALTER DATABASE [JBM_Hierarchy] MODIFY FILE (NAME=N'JBM_Hierarchy_log', SIZE=100MB,FILEGROWTH=100MB)
GO
-----------------------------------------------------------------------------------------------------------------------
--===== Create this stored procedure in our test database. Details live in the comments.
USE JBM_Hierarchy;
GO
CREATE PROCEDURE dbo.BuildLargeEmployeeTable
/****************************************************************************
Purpose:
Create a randomized "well formed" Adjacency List hierarchy with indexes.
Progammer's Notes:
1. Each EmployeeID (except for the Root Node, of course) is assigned a
random ManagerID number which is initially always less than the current
EmployeeID to ensure that no cycles occur in the hierarcy.
2. The second parameter used to call this stored procedure will optionally
randomize the EmployeeIDss to make the hierarchy truly random as it would
likely be in real life. This, of course, takes a small amounnt of extra
time.
3. This code runs nasty fast and is great for testing hierarchical
processing code. Including the index builds, this code will build a
million node Adjacency List on a 4 processor (i5) laptop with 6GB of RAM
in just several seconds. The optional randomization adds just several
more seconds.
Usage:
--===== Create the hierarchy where all the ManagerIDs are less than the
-- EmployeeIDs. This is the fastest option and will build a million node
-- hierarchy in just about 7 seconds on a modern machine.
EXEC dbo.BuildLargeEmployeeTable 1000000;
--===== Making the second parameter a non-zero value will further randomize
-- the IDs in the hierarchy. This, of course, takes extra time and will
-- build a million row hierarchy in about 17 seconds on a modern
-- machine.
EXEC dbo.BuildLargeEmployeeTable 100000,1;
Revision History:
Initial concept and creation - Circa 2009 - Jeff Moden
Rev 01 - 15 May 2010 - Jeff Moden
- Abort if current DB isn't "tempdb" to protect users.
Rev 02 - 13 Oct 2012 - Jeff Moden
- Add a randomization stop to make the hierarchy more like real life.
Rev 03 - 15 May 2015 - Jeff Moden
- Undo Rev 01 for a demonstration at the following URL.
http://forums.sqlteam.com/t/selecting-hieararchies-from-a-table-with-a-twist/1148
****************************************************************************/
--===== Declare the I/O parameters
@pRowsToBuild INT,
@pRandomize TINYINT = 0
AS
--===========================================================================
-- Presets
--===========================================================================
--===== Supresss the autodisplay of rowcounts to cleanup the display and to
-- prevent false error returns if called from a GUI.
SET NOCOUNT ON;
----===== Make sure that we're in a safe place to run this...
-- IF DB_NAME() <> N'tempdb'
-- BEGIN
-- RAISERROR('Current DB is NOT tempdb. Run aborted.',11,1);
-- RETURN;
-- END;
--===== Conditionaly drop the test table so we can do reruns more easily
IF OBJECT_ID('dbo.Employee','U') IS NOT NULL
DROP TABLE dbo.Employee;
--===========================================================================
RAISERROR('Building the hierarchy...',0,1) WITH NOWAIT;
--===========================================================================
--===== Build the test table and populate it on the fly.
-- Everything except ManagerID is populated here. The code uses a
-- technique called a "Psuedo-Cursor" (kudos to R. Barry Young for the
-- term) to very quickly and easily build large numbers of rows.
SELECT TOP (@pRowsToBuild)
EmployeeID = ISNULL(CAST(
ROW_NUMBER() OVER (ORDER BY (SELECT 1))
AS INT),0),
ManagerID = CAST(NULL AS INT),
EmployeeName = CAST(NEWID() AS VARCHAR(36))
INTO dbo.Employee
FROM master.sys.all_columns ac1
CROSS JOIN master.sys.all_columns ac2
CROSS JOIN master.sys.all_columns ac3
;
RAISERROR('There are %u rows in the hierarchy.',0,1,@@ROWCOUNT) WITH NOWAIT;
--===== Update the test table with ManagerID's. The ManagerID is some random
-- value which is always less than the current EmployeeID to keep the
-- hierarchy "clean" and free from "loop backs".
UPDATE dbo.Employee
SET ManagerID = CASE
WHEN EmployeeID > 1
THEN ABS(CHECKSUM(NEWID())) % (EmployeeID-1) +1
ELSE NULL
END
;
--===========================================================================
-- Conditionally randomize the hierarchy to be more like real life
--===========================================================================
IF @pRandomize <> 0
BEGIN
--===== Alert the operator
RAISERROR('Randomizing the hierarchy...',0,1) WITH NOWAIT;
--===== Create a randomized cross reference list to randomize the
-- EmployeeIDs with.
SELECT RandomEmployeeID = IDENTITY(INT,1,1),
EmployeeID
INTO #RandomXRef
FROM dbo.Employee
ORDER BY NEWID()
;
--===== Update the ManagerIDs in the Employee table with the new
-- randomized IDs
UPDATE emp
SET emp.ManagerID = RandomEmployeeID
FROM dbo.Employee emp
JOIN #RandomXRef xref ON emp.ManagerID = xref.EmployeeID
;
--===== Update the EmployeeIDs in the Employee table with the new
--randomized IDs
UPDATE emp
SET emp.EmployeeID = RandomEmployeeID
FROM dbo.Employee emp
JOIN #RandomXRef xref ON emp.EmployeeID = xref.EmployeeID
;
END
ELSE
BEGIN
--===== Alert the operator
RAISERROR('The hierarchy is not randomized',0,1) WITH NOWAIT;
END
;
--===========================================================================
-- Build the indexes necessary for performance.
--===========================================================================
--===== Alert the operator
RAISERROR('Building the keys and indexes...',0,1) WITH NOWAIT;
--===== Add some indexes that most folks would likely have on such a table
ALTER TABLE dbo.Employee
ADD CONSTRAINT PK_Employee PRIMARY KEY CLUSTERED (EmployeeID)
;
CREATE UNIQUE INDEX By_ManagerID_EmployeeID
ON dbo.Employee (ManagerID,EmployeeID)
;
ALTER TABLE dbo.Employee
ADD CONSTRAINT FK_Employee_Employee FOREIGN KEY
(ManagerID) REFERENCES dbo.Employee (EmployeeID)
ON UPDATE NO ACTION
ON DELETE NO ACTION
;
--===========================================================================
-- Exit
--===========================================================================
RAISERROR('===============================================',0,1) WITH NOWAIT;
RAISERROR('RUN COMPLETE',0,1) WITH NOWAIT;
RAISERROR('===============================================',0,1) WITH NOWAIT;
GO
-----------------------------------------------------------------------------------------------------------------------
--===== Run this command to build a highly randomized 100,000 node "Adjacency List" test table
-- called "Employee".
EXEC dbo.BuildLargeEmployeeTable 100000,1;
--===== And run this code to convert the table to a 15,000 tree orchard with a single top level.
-- As is the nature of random numbers, there may be some duplication here. All that means
-- is that there will be a random number of trees up to 16,000 and generally pretty close
-- to 15,000 actual nodes.
PRINT '--===== Number of individual trees created...';
UPDATE emp
SET ManagerID = NULL
FROM dbo.Employee emp
WHERE emp.EmployeeID IN
(
SELECT TOP 16000
TreeTopNode = ABS(CHECKSUM(NEWID()))%100000+1
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
)
;
GO
Up to this point, we've only build a wad of test data resembling Daniel's original post. The next post below will the first real step in doing all I say that can be done.