Selecting data from a table with over 10 million records

I have 4 tables:

Shipment
ConsigneeAddress - holds address and ShipmentId
ShipperAddress - holds address and ShipmentId
AgentAddress - holds address and ShipmentId

Shipment has ShipmentId and ShipmentDate, plus other columns which aren't important to mention here, plus ConsigneeAddressId, ShipperAddressId and AgentAddressId.

These columns are related via FKs.

There is the argument that I could hold the Consignee, Shipper and Agent addresses in a single Address table, but I'll come to an issue with that later.

Currently my issue is that if Shipment has 10 million records, it means ConsigneeAddress, ShipperAddress and AgentAddress must have 10 million records each too.

So when it comes to reporting and doing inner joins to get data within a certain ShipmentDate range, it becomes horribly slow...

What to do?

As mentioned, putting all addresses in one Address table won't solve the issue either, as it will contain 3 records for every shipment... which means 10 million x 3 in one table...

Would be great if someone could help.

Thanks

Baldip

What are your indexes?

What you need is a single address table - and a single entry in that table for each possible address. Then, you want an intermediary table between the shipment and address tables that relates the shipment to a specific address and type.

One day you have assigned a shipment to a specific address as the consignee; the next day that same address could be used as the shipper; and the following day it could be used as the agent.

In your current structure you end up duplicating that information for each and every shipment. Even if we assume that a shipper will always be the shipper - you are duplicating that information for every shipment. An address doesn't change so why would you have that address in the table multiple times?
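
A rough sketch of that shape (table and column names here are my guesses, with INT keys assumed, not your actual schema):

```sql
-- One row per distinct address, shared by any number of shipments.
CREATE TABLE Address
(
    AddressId    int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    AddressLine1 varchar(100) NOT NULL,
    City         varchar(60)  NOT NULL,
    Country      varchar(60)  NOT NULL
    -- ... other address columns
);

-- Intermediary table: relates a shipment to an address in a given role.
CREATE TABLE ShipmentAddress
(
    ShipmentId      int NOT NULL REFERENCES Shipment (ShipmentId),
    AddressId       int NOT NULL REFERENCES Address (AddressId),
    AddressTypeCode smallint NOT NULL,  -- e.g. 1 = Consignee, 2 = Shipper, 3 = Agent
    CONSTRAINT PK_ShipmentAddress PRIMARY KEY (ShipmentId, AddressTypeCode)
);
```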

Shipment - clustered index on ShipmentId, nonclustered index on ShipmentDate

ConsigneeAddress
ShipperAddress
AgentAddress

All have a clustered index on their identity column.

Thanks for your feedback.

The issue with that is, I will have to check at every insert whether the address exists.

Just to add something on to this: this product will be used worldwide, so there could potentially be millions and millions of addresses in there. If I check every time I insert a record to see whether it exists, it will slow down my inserts.

What do you think?

Are your addresses not "reused" if someone wants another order shipped to a previously used address?

I see lots of invoicing systems that hold the address, in plain text, on every invoice. It's a copy of the address in some Address table, presumably "in case it changes": the version on the invoice is a record of it as-it-was-at-the-time. But I think it would be better to have an Address Version column that allows "version 2" of an address to be created (i.e. for a change that is sufficiently different to not be an in-record edit).

Alternatively, whenever an address is "first used" (as a Shipper, Agent or Consignee), set a "Used" flag; any change after that needs to create a new "version" record, so any usage of an address is guaranteed to point to the details as they were at the time it was used.
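
Something like this, roughly (column names are just placeholders):

```sql
-- Rough sketch: version + used-flag columns on the address table.
-- Once Used = 1 the row is frozen; any later change inserts a new
-- version row, and usages reference (AddressId, Version).
ALTER TABLE Address ADD
    Version smallint NOT NULL DEFAULT 1,
    Used    bit      NOT NULL DEFAULT 0;
```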

But that rather depends on people placing repeat orders etc., such that at least some of the addresses are reused.

When I phone up to place an order, and I am an existing / previous customer, I expect the telesales operator to be able to call up my account and ask if it is a previously used address. If not, or if I am a new customer, then the operator will create a new address.

I would NOT want to check if the address already exists [somewhere in the database - e.g. a previous resident at THIS address]: just an additional comma, or space, is going to make it a "different" address. However, if you DO want to go down that route, you can use the CHECKSUM function to create a lookup hash. That won't be unique to your address, but using it to index "possibly matching addresses" and then comparing the actual address should be a very fast operation. On 10 million records I would store both the CHECKSUM and the (summed) LENGTH of all the address fields, as a combination of LENGTH and CHECKSUM matching will probably come close to giving you a 1:1 hit.
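
For example, something along these lines (column names assumed; the principle is the thing):

```sql
-- Computed hash + length columns narrow the candidates cheaply;
-- only then is the actual text compared.
ALTER TABLE Address ADD
    AddrChecksum AS CHECKSUM(AddressLine1, AddressLine2, City, PostCode),
    AddrLength   AS (LEN(AddressLine1) + LEN(AddressLine2)
                     + LEN(City) + LEN(PostCode));

CREATE INDEX IX_Address_AddrChecksum ON Address (AddrChecksum, AddrLength);

-- At insert time (@Line1 etc. are the incoming address values):
SELECT AddressId
FROM   Address
WHERE  AddrChecksum = CHECKSUM(@Line1, @Line2, @City, @PostCode)
  AND  AddrLength   = LEN(@Line1) + LEN(@Line2) + LEN(@City) + LEN(@PostCode)
  AND  AddressLine1 = @Line1
  AND  AddressLine2 = @Line2
  AND  City         = @City
  AND  PostCode     = @PostCode;
```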

Yes, addresses can be reused, for sure.

But my issue is that if there actually are 10 million different addresses, my reporting becomes painfully slow...

The issue here isn't whether I can reuse the same address, but rather how I can solve the slowness if there are 10 million+ distinct addresses.

This application will be used worldwide, with potentially millions and millions of addresses.

Not sure why that is slow?

You want to JOIN Shipment to ConsigneeAddress, ShipperAddress and AgentAddress. Let's say each of those 3 address tables has an INT ID; I can't see that the fact that there are, say, 10 million rows in each table should make any difference. It should be fast to retrieve (unless you want to export all 10 million records :slight_smile: )

I am presuming that the ID of the address record, in each table (or a single central address table, if that is how you eventually decide to do it), is as short as possible, ideally just an INT. If it is a composite key containing all sorts of other attributes, such as "Country", or a huge GUID, then I would streamline that if you can.

I would want that ID / PKey to also be the clustered index with 100% FILLFACTOR, the statistics rebuilt relatively often, and automatic rebuild of stats (if you have that turned on) to be ASYNC rather than the [default] SYNC.
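
In T-SQL terms, something like this sketch (constraint names assumed, and assuming the key is being created fresh; ALTER DATABASE CURRENT needs SQL 2012 or later):

```sql
-- Rough sketch: narrow INT key as the clustered PK, packed pages.
-- (Drop any existing PK on the table first.)
ALTER TABLE ConsigneeAddress
    ADD CONSTRAINT PK_ConsigneeAddress
    PRIMARY KEY CLUSTERED (ConsigneeAddressId)
    WITH (FILLFACTOR = 100);

-- Let automatic stats updates happen in the background
-- instead of blocking the triggering query:
ALTER DATABASE CURRENT SET AUTO_UPDATE_STATISTICS_ASYNC ON;
```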

Is that about how you have it / envisage it? Or do you have other "baggage" that is perhaps upsetting performance?

Aaahhh that makes sense...

The IDs are GUIDs... I think that is where the problem may be.

I have been trying to convince my senior not to use the damn things as foreign keys, but he would not listen!!!

If the keys are coming from multiple systems that need to allocate them locally, or data is combined from multiple systems / businesses (or even, maybe, "countries"?), then GUIDs can be the correct solution.

But other than that they are a problem: wide (so fewer keys per page), and allocated randomly, so they fragment the index as they are added. (Unless you specifically use a sequential GUID, which would definitely help, but it's probably only any use if the keys are allocated centrally, so probably doesn't apply to any of the reasons for having GUIDs above ... in which case, if that applies to you, use an INT instead :slightly_smiling: )
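
If you are stuck with GUID keys and they are allocated centrally, the sequential flavour looks roughly like this (assuming ShipmentId is a uniqueidentifier):

```sql
-- NEWSEQUENTIALID() is only usable as a column DEFAULT; values are
-- (mostly) ascending, so inserts append to the end of the index
-- instead of splitting pages at random positions.
ALTER TABLE Shipment
    ADD CONSTRAINT DF_Shipment_ShipmentId
    DEFAULT NEWSEQUENTIALID() FOR ShipmentId;
```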

The other points are relevant though. If your GUID is not the clustered index, if you are not attending to housekeeping such as de-fragmenting the indexes and updating stats regularly, or if you have SYNC automatic stats rebuild instead of ASYNC, then all those things will contribute to poor performance.
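
The housekeeping itself is routine, e.g.:

```sql
-- Per-table maintenance (table name is just an example):
ALTER INDEX ALL ON dbo.ConsigneeAddress REORGANIZE;    -- or REBUILD if badly fragmented
UPDATE STATISTICS dbo.ConsigneeAddress WITH FULLSCAN;  -- refresh stats after heavy churn
```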

The standard performance destroyer!!

The first column of the clustered index on the shipment table should not be shipmentid since you report, and presumably query, mostly by time periods. The clus key should be ( shipmentdate, shipmentid ). You can and should leave the shipmentid as the pk but make it nonclustered.
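
A rough sketch of that change (index and constraint names are assumed; any foreign keys referencing the PK have to be dropped and re-created around this):

```sql
-- Remove the existing clustered PK on ShipmentId.
ALTER TABLE Shipment DROP CONSTRAINT PK_Shipment;

-- Cluster by date first so date-range reports scan contiguous pages.
CREATE UNIQUE CLUSTERED INDEX CL_Shipment
    ON Shipment (ShipmentDate, ShipmentId);

-- Keep ShipmentId as the primary key, just nonclustered now.
ALTER TABLE Shipment
    ADD CONSTRAINT PK_Shipment PRIMARY KEY NONCLUSTERED (ShipmentId);
```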

Also, since each shipment gets its own address rows, the clus key on the addresses table should be ShipmentId, NOT its own identity column.

Wouldn't it be better if the address just had a one-part key - e.g. identity - and didn't store the ShipmentId at all? (I'm not sure why ShipmentId is needed at all, notwithstanding that the O/P's first post does say "ConsigneeAddress - holds address and ShipmentId" - it seems superfluous to me, and bloats the width of the key, and thus reduces keys-per-index-page.)

I'm probably missing the bleeding obvious though!

I don't want to use identity alone regardless, particularly if each shipment row is getting its own address rows, which is what the OP stated.

The "identity as default clustering key" MYTH is incredibly strong, but it's also the most damaging performance myth in existence today.

I question very strongly whether it's worth the time and effort to resolve addresses across different clients. I think there would be so few overlaps that it's not worth the extreme effort and processing it would take.

Thus, cluster addresses either by ( ShipmentId, address_type_code ) [no identity needed!] or by ( CustomerId, AddressId ) if you want to store addresses by customer instead, in order to easily re-use them.
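
A sketch of the first option (column names are illustrative only):

```sql
-- Rough sketch: addresses stored per shipment, clustered so that all of a
-- shipment's addresses (and ranges of shipments) sit on adjacent pages.
CREATE TABLE ShipmentAddress
(
    ShipmentId      int          NOT NULL REFERENCES Shipment (ShipmentId),
    AddressTypeCode smallint     NOT NULL,  -- consignee / shipper / agent
    AddressLine1    varchar(100) NOT NULL,
    City            varchar(60)  NOT NULL,
    Country         varchar(60)  NOT NULL,
    -- ... other address columns
    CONSTRAINT PK_ShipmentAddress
        PRIMARY KEY CLUSTERED (ShipmentId, AddressTypeCode)
);
```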

Even more importantly, perhaps, for legal reasons you must keep the actual shipment address anyway, even if customer's shipping address changes in the future.

I need a bit of help understanding this please Scott.

Why is CustomerID + AddressID a better clustered index key than just a unique INT (assuming that CustomerID + AddressID is 2 x INT)?

Aren't the majority of lookups going to be "get address for this shipment", and thus the shortest key wins (i.e. narrower keys mean more keys per index page)?

Or are you suggesting that all addresses (Consignee, Shipper and Agent) should be in a single table with a ShipmentID + AddressID clustered index, so that, for a given shipment, the three address records are clustered and will thus be retrieved from one or two physical pages that are contiguous?

If addresses ARE unique to / duplicated for each Shipment (i.e. there is a 1:1 relationship between Shipment and ConsigneeAddress, ShipperAddress & AgentAddress), would there be a benefit to storing those three addresses IN the Shipment record?

Thank you all for your feedback.

I will be restructuring my tables and using INTs rather than GUIDs, and also setting my indexes properly.

Once I have done that, I will come back with some feedback :slightly_smiling:

Hopefully it'll give me better results

Thanks all!

Yes, a single address table, sorry, I should have been clearer about that. I'd just add an "address_type_code" (smallint) to that table. The other reason I included "a" ShipmentID is so that addresses could easily be shared across related companies/divisions, even if you don't want to share them across all clients. You could add a "ParentShipmentID" for sharing addresses.

would there be a benefit to storing those three addresses IN the Shipment record?

Potentially, yes. But there is also a downside of making each main row much wider, of course. And of violating normal form, in a sense, because you have multiple occurrences. It depends too on their definition of "shipment": a single physical shipment can technically end up going to more than one consignee. Obviously, if that's at all possible, a separate table is effectively required.

If not, you'd want to decide based on how often the addresses are used. My guess is not that often, compared to other uses of the table, so I'd still off-load the addresses to a different table.

Also, I can't see that being enough of a performance issue to force it to be changed and require recoding, but I could easily see a future issue that would force a separate table, and therefore cause recoding/rework, if everything is in a single table.
