Index SCANS when using substring for NodePath in WHERE clause

craigdavidsmith · April 28, 2017, 5:11pm

We have a scenario where admins in our system can see users who belong to their department or any department under them. We, therefore have a few queries that join multiple times onto our Department table and compare the nodepaths for the admin and users to see who can see who...

AND ((SUBSTRING([UserDepartment].[NodePath], 0 + 1, LEN([AdminDepartment].[NodePath]))) =
[AdminDepartment].[NodePath])

For clarification, "UserDepartment" and "AdminDepartment" are both aliases for the "Department" table.

Obviously, this is going to perform a table SCAN on the Department table. For most of our clients this is okay since they have less than 30 departments. But we have a few clients that have over 5000 departments and the related queries are taking minutes to run due to this.

Does anyone have ideas on how to better query this? I'm also interested in hearing about better ways to architect this data storage but that would be a pretty fundamental change so I'm hoping for query/index change solutions.

Cheers.

Kristen · April 29, 2017, 6:48am

Physically split out (and duplicate the data of) that SUBSTRING into a new column. You could use a Computed Column in the table (so no actual code required to maintain that column in parallel, and you can Index a Computed Column)

Kristen · April 29, 2017, 7:00am

OK, ignore that, I've read it more carefully now and the SUBSTRING on the [UserDepartment] table alias is dependent on the LEN of the [AdminDepartment] alias.

Perhaps something like this would work? Dunno if it is SARGable though

AND [UserDepartment].[NodePath] LIKE [AdminDepartment].[NodePath] + '%'

or maybe this - the ">=" may help with an index evn if the "<=" doesn't

AND [UserDepartment].[NodePath] >= [AdminDepartment].[NodePath] 
AND [UserDepartment].[NodePath] <= [AdminDepartment].[NodePath] +'zzzz'

I'm trying to think how we manage this in our hierarchies - its decades since we wrote them so not sure I can remember! but we never have any performance problems with them so maybe we have something that would be useful to you.

I expect there are Best Practice articles on this too - I haven't bothered to google them, but someone here is likely to know of a good link

We have a PATH which is a list of all the IDs (converted to fixed length string with leading zero padding) of the parent levels. We also have the ID of the (next) sibling.

We also have a SEQUENCE number which maintains the whole tree in "walking order". (We used an INT and have to renumber it periodically to allow for any inserts, but with benefit of hindsight maybe a FLOAT would be better,with inserts being the mid-point of the two nodes either side).

craigdavidsmith · May 1, 2017, 7:29pm

Thanks for the response Kristen. I tried that syntax and it had the same performance so not sure what is going on there. I like the idea of pre-computing these relationships. I don't think I would do it as a column in the same table though. I'd probably just setup a "ChildrenDepartments" table with a many to one relationship with Departments. Then I just need to do...

WHERE IN (SELECT ChildID from Departments where DepartmentID = Admin.Department)

Would have to keep it up to date with a Trigger or something.

Sadly, this is a pretty major change so not sure I'll be able to allocate resources to it. But in theory.....

yosiasz · May 1, 2017, 9:31pm

fundamental change. Can you separate your OLTP data from your search data (flattened data ie elastic search, data warehouse, data mart etc )? A rearch that will pay great dividends later on. what is your pipeline look like? throughput etc

or recursive cte?
http://stackoverflow.com/questions/14241936/how-can-i-generate-a-hierarchy-path-in-sql-that-leads-to-a-given-node

craigdavidsmith · May 1, 2017, 10:18pm

We absolutely need to setup an OLAP database. It is definitely on our roadmap. We currently host on AWS and have quite an extensive setup (server clusters in the USA, Canada, Ireland, Australia, and China). We've explored using Redshift primarily related to a new BI tool we launched. We even hired a DBA with that as a stated objective. Unfortunately, that hire didn't work out and the Redshift initiative was shelved.