Group Rows with Mutual Field Data

liammorgs01 · July 3, 2023, 12:29pm

Hi I'm currently trying to create a query to group customers together.

I'm currently wanting to group by "Phone Number", "Email" and maybe some other fields in the future.

My problem is that I may want to group rows which aren't directly linked but have mutual rows.

For example, in this diagram we can see that Customer 2 and Customer 5 are in no way related at all, however they both share connections with Customer 1 which would then allow them to group together.

Here is another View:

CustomerId         | FullName         | PhoneNumber        | Email
---------------------------------------------------------------------------------
1	               | Bill Smith       | 01612345678        | BillS@SqlTeam.com
2                  | Kelly Smith      | 01612345678        | KellyS@SqlTeam.com
3                  | Kevin Roach      | 07111111111        | KevinR@Example.com
4                  | Chris Ronald     | 07222222222        | ChrisR@Blog.co.uk
5                  | Bill Smith       | 07987654321        | BillS@SqlTeam.com

We can see that Customer 2 is connected to Customer 1 through a mutual Phone Number, and Customer 5 is connected to 1 through a mutual Email Address. However Customer 2 and Customer 5 share no information which would've originally grouped them.

My reasoning for needing this functionality is that one person / household could have multiple customer accounts all with differing information therefore I am trying to group them up as best as I can with as much mutual information as possible.

Thanks all, hopefully I have explained my situation well enough!

robert_volk · July 4, 2023, 12:20pm

CREATE TABLE CustomerLink(FirstID int NOT NULL
CONSTRAINT FK_Customer_FirstID REFERENCES CustomerTable(CustomerID)
,SecondID int NOT NULL
CONSTRAINT FK_Customer_SecondID REFERENCES CustomerTable(CustomerID)
,CONSTRAINT PK_CustomerLink PRIMARY KEY(FirstID,SecondID) WITH (IGNORE_DUP_KEY=ON)
,CONSTRAINT CHK_CustomerLink CHECK (FirstID<SecondID))

INSERT CustomerLink(FirstID,SecondID)
SELECT first_phone.CustomerID, second_phone.CustomerID
FROM CustomerTable first_phone
INNER JOIN CustomerTable second_phone ON first_phone.PhoneNumber=second_phone.PhoneNumber
WHERE first_phone.CustomerID<second_phone.CustomerID

INSERT CustomerLink(FirstID,SecondID)
SELECT first_email.CustomerID, second_email.CustomerID
FROM CustomerTable first_email
INNER JOIN CustomerTable second_email ON first_email.Email=second_email.Email
WHERE first_email.CustomerID<second_email.CustomerID

SELECT first_customer.FullName AS First_Customer, second_customer.FullName AS Second_Customer
FROM CustomerLink link
INNER JOIN Customer first_customer ON link.FirstID=first_customer.CustomerID
INNER JOIN Customer second_customer ON link.SecondD=second_customer.CustomerID

The CustomerLink table would hold the IDs of each account linked via whatever column, email, phone, shared Netflix, shared Steam account, etc. You'd run an INSERT for each matching criteria, JOINing on the proper column(s).

The PRIMARY KEY prevents duplicate matches, as it would pollute your results with dupes. The IGNORE_DUP_KEY=ON allows you to simply INSERT and the storage engine will reject dupes if it finds them, without throwing an error.

The CHECK constraint prevents symmetric links, e.g. allows a->b but not b->a, again to prevent unwanted dupes.

One issue I see with your current Customer table is Bill Smith. He has CustomerID 1 and 5. Is it the same Bill Smith, with 2 different phone numbers? Or two different Bill Smiths who share the same email? The latter scenario is likely a data quality problem.

And if CustomerID 1 and 5 are the same person, then the structure you have will produce links that don't accurately link the customer relationships (e.g., if I insert Donald Duck with phone 07987654321, he'd link to Bill Smith 5, instead of Bill Smith 1, which would be the proper link)

Edit: fixed copy/paste error in query