Conditional Join with multiple identifiers and lengths

ITTrucker · September 30, 2016, 7:22pm

I'm trying to compare some existing data that has some bad identifiers in our database to a new table of standardized data.

Here's a sample of the data and query I'm trying to join based on a couple different fields which is returning some bad data:

DECLARE @Existing TABLE (ID varchar(20),Name varchar(20))
DECLARE @New TABLE (ID varchar(20),Name varchar(20),Division varchar(20))

INSERT INTO @Existing
VALUES ('AB1234560','ACME'),('AB123450','Test Incorporated'),('DEF123456-00','Demo Co.')

INSERT INTO @New
VALUES ('AB1234560','ACME Inc.','Main'),('AB123456A','ACME Inc.','Child'),('AB123456AB','ACME Inc.','Child'),
('AB123450','Test Inc.','Main'),('AB12345A','Test Inc.','Child'),('AB12345AB','Test Inc.','Child'),
('DEF123456-00','Demo Corp.','Main'),('DEF123456-01','Demo Corp.','Child')

SELECT E.Name ExistName, E.ID ExistingID,
COALESCE(N1.Name, N2.Name, N3.Name) NewName,
COALESCE(N1.Division, N2.Division, N3.Division) NewDivision,
COALESCE(N1.ID, N2.ID, N3.ID) NewID
FROM
@Existing E LEFT OUTER JOIN
@New N1 ON LEFT(N1.ID,3) = 'DEF' AND LEFT(N1.ID,9) = LEFT(E.ID,9) LEFT OUTER JOIN
@New N2 ON LEFT(N2.ID,2) = 'AB' AND LEFT(N2.ID,8) = LEFT(E.ID,8) AND LEN(E.ID) = 9 LEFT OUTER JOIN
@New N3 ON LEFT(N3.ID,2) = 'AB' AND LEFT(N3.ID,7) = LEFT(E.ID,7) AND LEN(E.ID) = 8
GROUP BY E.Name, E.ID, N1.Name, N2.Name, N3.Name, N1.Division, N2.Division, N3.Division, N1.ID, N2.ID, N3.ID

This returns back good data for the Demo Corp and Acme, the one existing ID matches to new divisions and that's good, the problem is with the Test company, it matches to both Test and Acme.

The actual data has 4 types and it looks like 2 of those have the different ID lengths that I need to try to and account for so that will end up having 6 LEFT OUTER's with 6 COALESCE expressions per column.

Any better ideas? (Can't use the name in the join, it's a mess in our existing tables)

This is what I would like to have returned:
ACME Inc. AB1234560 ACME Inc. Child AB123456A
ACME Inc. AB1234560 ACME Inc. Child AB123456AB
ACME Inc. AB1234560 ACME Inc. Main AB1234560
Demo Corp. DEF123456-00 Demo Corp. Child DEF123456-01
Demo Corp. DEF123456-00 Demo Corp. Main DEF123456-00
Test Inc. AB123450 Test Inc. Child AB12345A
Test Inc. AB123450 Test Inc. Child AB12345AB
Test Inc. AB123450 Test Inc. Main AB123450

bitsmed · September 30, 2016, 9:56pm

Could this be what you're looking for:

select e.name as ExistName
      ,e.id as ExistingID
      ,n1.name as NewName 
      ,n1.division as NewDivision
      ,n2.id as NewID
  from @existing as e
       inner join @new as n1
               on n1.id=e.id
              and n1.division='Main'
       inner join @new as n2
               on n2.[name]=n1.[name]
;

ITTrucker · October 3, 2016, 12:56pm

Data set somewhat misleading, the names in the existing data aren't standardized and can't really be used in the data set.

The child divisions also have actual names (Shipping, Billing, etc...). So if I update the data set to the below code, I get bad records for ACME because it joins 2x to Acme and Test:

DECLARE @Existing TABLE (ID varchar(20),Name varchar(20))
DECLARE @New TABLE (ID varchar(20),Name varchar(20),Division varchar(20))

INSERT INTO @Existing
VALUES ('AB1234560','ACME'),('AB123450','Test Corp.'),('DEF123456-00','Demonstration Corp.')

INSERT INTO @New
VALUES ('AB1234560','ACME Inc.','Headquarters'),('AB123456A','ACME Inc.','Billing'),('AB123456AB','ACME Inc.','Shipping'),
('AB123450','Test Inc.','Admin'),('AB12345A','Test Inc.','Accounting'),('AB12345AB','Test Inc.','Transportation'),
('DEF123456-00','Demo Corp.','Administration'),('DEF123456-01','Demo Corp.','Shipping')

SELECT E.Name ExistName, E.ID ExistingID,
COALESCE(N1.Name, N2.Name, N3.Name) NewName,
COALESCE(N1.Division, N2.Division, N3.Division) NewDivision,
COALESCE(N1.ID, N2.ID, N3.ID) NewID
FROM
@Existing E LEFT OUTER JOIN
@New N1 ON LEFT(N1.ID,3) = 'DEF' AND LEFT(N1.ID,9) = LEFT(E.ID,9) LEFT OUTER JOIN
@New N2 ON LEFT(N2.ID,2) = 'AB' AND LEFT(N2.ID,8) = LEFT(E.ID,8) AND LEN(E.ID) = 9 LEFT OUTER JOIN
@New N3 ON LEFT(N3.ID,2) = 'AB' AND LEFT(N3.ID,7) = LEFT(E.ID,7) AND LEN(E.ID) = 8
GROUP BY E.Name, E.ID, N1.Name, N2.Name, N3.Name, N1.Division, N2.Division, N3.Division, N1.ID, N2.ID, N3.ID
ORDER BY COALESCE(N1.Name, N2.Name, N3.Name), COALESCE(N1.ID, N2.ID, N3.ID)

Returns a bad join. I want to get:

ExistName ExistingID NewName NewDivision NewID
ACME AB1234560 ACME Inc. Headquarters AB1234560
ACME AB1234560 ACME Inc. Billing AB123456A
ACME AB1234560 ACME Inc. Shipping AB123456AB
Demonstration Corp. DEF123456-00 Demo Corp. Administration DEF123456-00
Demonstration Corp. DEF123456-00 Demo Corp. Shipping DEF123456-01
Test Corp. AB123450 Test Inc. Admin AB123450
Test Corp. AB123450 Test Inc. Accounting AB12345A
Test Corp. AB123450 Test Inc. Transportation AB12345AB

Kristen · October 3, 2016, 1:44pm

Would a "Levenshtein distance" algorithm be any good? There is a discussion topic on here (in the old forum, so would need a Google Search to find it)

bitsmed · October 3, 2016, 4:09pm

How about:

select e.[name] as ExistName
      ,e.id as ExistingID
      ,n1.[name] as NewName 
      ,n2.division as NewDivision
      ,n2.id as NewID
  from @existing as e
       inner join @new as n1
               on n1.id=e.id
       inner join @new as n2
               on n2.[name]=n1.[name]
 order by e.[name]
         ,n2.id
;