How should I structure my index to improve lookup time?

testinvite123 · December 4, 2016, 1:11pm

I have a table that stores IP address ranges for a city and there are millions of records in this table. I'm sure that many of you that deal with IP addresses have a similar table to me (I've simplified my table in this example):

CREATE TABLE [dbo].[IPRangeByCity]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [IPIntegerStart] [bigint] NOT NULL,
    [IPIntegerEnd] [bigint] NOT NULL,
    [Country] [nvarchar](150) NOT NULL,
    [City] [nvarchar](150) NULL

    CONSTRAINT [pk_IPRangeByCity] 
        PRIMARY KEY CLUSTERED([ID] ASC),
) ON [PRIMARY]
 GO

Now I don't save, update, or delete any records from this table. I only read from this table. When I read from this table, I take an IPv4 address, convert it to its integer form, and using the integer form of the IPv4 address, I lookup the city between the IP address range for this integer.

For example, let's say the IPv4 address is "187.245.227.116".

"187.245.227.116" converts to the integer 3153453940. Then I run the following select statement to find the city that is associated with this IP address:

select * from IPRangeByCity 
where 3153453940 between IPIntegerStart and IPIntegerEnd

My question is, if I only ever read from this table with the select statement above, how should I structure my index to improve the lookup time of the select statement?

Off the top of my head, if I set the index for this table to the column "IPIntegerStart", it seems like it may be a good index for my select statement. For example:

CONSTRAINT [pk_IPRangeByCity] PRIMARY KEY 
   CLUSTERED([IPIntegerStart] ASC)

However, I'm not really sure. Does anybody know what would be the best index to set my for my table, given my select statement? Should it be a clustered or non-clustered index? Should it be a multi column index (ie. an index with both the columns IPIntegerStart and IPIntegerEnd)? Any help would be appreciated. Thanks.

gbritton · December 5, 2016, 2:31pm

Have you loaded up your table? If so, check out the execution plan. Do you see any issues? Does it run acceptably (for millions of rows I'd expect no more than a few seconds) If not, post the plan and we can see what it's doing.

BTW, what about IPV6? Also, I'd probably store the subnet mask or CIDR suffix rather than the end of the range. but that's just me/

Also, what is the id column for? If it's not in use, drop it!

JeffModen · December 5, 2016, 3:46pm

To make this sing, you're actually going to need to change the primary key to a non-clustered index and assign the clustered index to the IPIntegerStart column. This is one of those rare places where having an IDENTITY column likely serves no practical purpose.

Kristen · December 5, 2016, 4:18pm

Would a Clustered Index of both IPIntegerStart, IPIntegerEnd be better so it covers the query, or is that overkill because it is the clustered index?

Because this is in effect a Read Only table I'm not sure that its even worth having an IDENTITY column at all - I certainly wouldn't have it as the Clustered Index.

JeffModen · December 5, 2016, 9:51pm

Maybe. My thought is "probably not". It would be really good if the CI were also unique and there were some sort of guarantee that there were no overlaps and the my answer would be "NO. The optimizer is smart enough to know that the next range isn't included.

testinvite123 · December 6, 2016, 9:20am

Thanks JeffModen. I tested a bunch of variations of clustered index, non clustered index, etc. You are right that making the IPIntegerStart column the clustered index had the largest impact on performance (I made it the primary key as well).

I tested other variations as well, such as:

Making both the IPIntegerStart and IPIntegerEnd the clustered index. The performance was comparable to just making IPIntegerStart the clustered index.
Making IPIntegerStart and IPIntegerEnd a non-clustered index, but leaving the ID (identity) column as the clustered primary key. The performance was bad, most likely because IPIntegerStart was not the clustered index.
Making IPIntegerStart the clustered index and IPIntegerEnd the non-clustered index. It doesn't seem that making IPIntegerEnd a non-clustered index has any huge impact on performance. Again, what mattered was that IPIntegerStart is set as the clustered index.

There is also an interesting solution that I saw at stackoverflow: http://dba.stackexchange.com/questions/14894/optimizing-ip-range-search/14896#14896. This solution could even be faster, I will test it out. In a nutshell, this solution requires that the table's data is partitioned into different segments, depending on the difference between the IPIntegerEnd - IPIntegerStart.

JeffModen · December 6, 2016, 12:54pm

Cool. Thanks for the feedback on all of that.

The reason why it worked so well for the CI on IPIntegerStart is because of two things. First, the CI means that it doesn't have to do a lookup on other columns because they're already a part of the leaf level of the CI. Second, you gave the CI the added benefit of making it UNIQUE by making it the PK. Setting it to a UNIQUE Clustered Index without making it the PK would have a similar effect making the need for IPIntegerEnd redundant there, as well.

A word of warning, though. The use of CIs for such things is NOT a panacea of performance and could actually hurt performance in many instances of SELECT if the table is wide and will certainly hurt a bit for INSERTs because IP addresses aren't ever-increasing and will cause table page splits (especially after index rebuilds/reorgs). A non-clustered index (even a non-covering one) can be much more effective on wide tables even if the NCI is a near duplicate of the CI because the NCI will be much more narrow and have many more rows per page than the CI will.

As for the "poor man's partitioning" by using an indexed computed column, my first blush thoughts would be that would cause unnecessary scanning but can be effective because a single range scan is more effective than doing a seek on each row. However, just as with real partitioning, I've found that partitioning only has a chance of helping bad code that happens to use the partitioning column as correctly SARGable criteria, which is frequently not the case.