Query displays multiple duplicate rows

Sorry to bother you again experts,

I have been struggling with this now for a couple of days.

We have a total of 10 tables.

However, the one table that relates to all other tables is the Employees table and it relates to the other 9 tables by employeeID.

More importantly, there is a one to many relationship between the Employees table and the other tables except the dateDetails table where the relationship is one to one.

In other words, with the exception of the dateDetails table, where the relationship between it and the Employees table is one to one, there exists one to many relationship between Employees table and the rest of the tables.

I thought the query below would work:

SELECT DISTINCT e.employeeID,IsNULL(s.sourcename,'NA') sourcename, IsNULL(s.sourceaddress,'NA') sourceaddress,
            IsNULL(sp.spousename,'NA') spousename, IsNULL(sp.spouseaddress,'NA') spouseaddress,
            IsNULL(dv.dividentName,'NA') dividentName, IsNULL(dv.dividentAddress,'NA') dividentAddress,
            IsNULL(r.reimbursementName,'NA') reimbursementName, IsNULL(r.reimbursementAddress,'NA') reimbursementAddress,
            IsNULL(h.HonorariaName,'NA') HonorariaName, IsNULL(h.HonorariaAddress,'NA') HonorariaAddress,
            IsNULL(g.giftName,'NA') giftName, IsNULL(g.giftAddress,'NA') giftAddress,
            IsNULL(o.orgName,'NA') orgName, IsNULL(o.orgAddress,'NA') orgAddress,
            IsNULL(cr.creditorName,'NA') creditorName, IsNULL(cr.creditorAddress, 'NA') creditorAddress
From Employees e
            INNER JOIN  SourceDetails s ON e.EmployeeID = s.EmployeeID
            INNER Join  SpouseDetails sp ON e.EmployeeID = sp.employeeID
            INNER JOIN  org o ON e.employeeID = o.employeeID
            INNER Join  DividentDetails dv ON e.EmployeeID = dv.EmployeeID
            INNER JOIN  ReimbursementDetails r ON e.EmployeeID = r.employeeID
            INNER Join  Honoraria h ON e.EmployeeID = h .EmployeeID
            INNER JOIN  GiftDetails g ON e.EmployeeID = g.employeeID
            INNER Join  dateDetails d ON e.EmployeeID = d.employeeID
            INNER JOIN  creditorDetails cr ON e.employeeID = cr.employeeID
WHERE cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240531 and e.employeeID = 7650
GROUP BY dv.dividentname,h.HonorariaName,g.giftName,o.orgName,o.orgAddress,g.giftAddress,h.HonorariaAddress,r.reimbursementName,r.reimbursementAddress, dv.dividentAddress,e.employeeID,sp.spousename,sp.spouseaddress,s.sourcename, s.sourceaddress,cr.creditorName, cr.creditorAddress
       ```
I tried using DISTINCT. I also tried GROUP BY but the query keeps displaying many duplicate records.

What I posted is the actual query we are using.

Could you experts help figure out what I am doing wrong please?

Many thanks in advance

I don't see how we can help here - we don't know the specifics for each table nor have you outlined how you want to eliminate 'duplicates'.

For each employee - which row from SourceDetails do you want to return? Which row from SpouseDetails? etc...

If all of those are one to one relationships - then which table has more than one row per employee? Figure out which table that is and which row you want, then determine how to return just that one row.

Since GROUP BY returns distinct rows based on the grouping then putting both DISTINCT and GROUP BY in the same query is useless. If you are not performing any aggregate functions - then DISTINCT would be the better option, but then again - you probably won't need either once you determine how to return the single row from each table you want.

One last note - don't perform functions on columns in the where clause. That will made the predicate non-sargable and will prevent SQL Server from using an index on that column if one is available.

Instead of: cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125
Use: d.dateCreated >= '20240125'

SQL Server will implicitly convert to string-literal for the date into a date or datetime data type (depending on the actual data type of the column) and then compare.

@jeffw8713 , thank you so much for your response. Much appreciated.

In the code above for instance, each or the tables like SourceDetails table has 3 rows.

One table, creditDetails has 4 rows.

The thing is that it is hard tell which table will have one or two or three rows.

It all depends on user's input.

I tried putting the queries broken down into UNION ALL and dump them into stored procedure like:

Select the two columns from SourceDetails where employeeID = 7650
UNION ALL
Select the two columns from SpouseDetails where employeeID = 7650

I also tried just using them individually in a stored procedure such as:

Begin
Select the two columns from SourceDetails where employeeID = 7650
end
Begin
Select the two columns from SpouseDetails where employeeID = 7650
End

! knew this would be a challenge but I thought I give it a shot by coming to you gurus.

The way I would solve this is by using a CTE:

WITH common_table_expression (Transact-SQL) - SQL Server | Microsoft Learn

Basically write a sql for each table with multiple records so you get 1 record for each employee. You can use aggregates like SUM, MAX or MIN or techniques like rownumber() to get the records that you want. If you don't need to aggregate you can use the main query. When all the tables are covered in a query or the main query you can start with the CTE. This way it will be easy to read and maintain.

As you didn't provide any sample data I cannot test the query if it works but it would look like this:

;WITH CreditDetails AS
(
SELECT
cr.employeeID,
IsNULL(MAX(cr.creditorName),'NA')) AS CreditorName,
IsNULL(MAX(cr.creditorAddress), 'NA') AS creditorAddress
FROM
creditorDetails cr
GROUP BY cr.employeeID
), SourceDetails AS
(
SELECT
s.EmployeeID,
IsNULL(MAX(s.sourcename,'NA')) sourcename,
IsNULL(MAX(s.sourceaddress,'NA')) sourceaddress
FROM
SourceDetails s
GROUP BY
s.EmployeeID
)
SELECT
e.employeeID,
IsNULL(s.sourcename,'NA') AS sourcename,
IsNULL(s.sourceaddress,'NA') AS sourceaddress,
IsNULL(cr.creditorName,'NA') AS creditorName,
IsNULL(cr.creditorAddress, 'NA') AS creditorAddress
FROM Employees e
LEFT OUTER JOIN CreditDetails cr
ON cr.employeeID=e.employeeID
LEFT OUTER JOIN SourceDetails s
ON s.employeeID=e.employeeID

@RogierPronk , thank you so much for your help with the code.

I am about to test it. If it fails, I am very happy to post a sample data if you like.

Thank you again.

The problem is what you consider to be duplicates. If there are 3 rows in the SourceDetails table - then each of those rows is not a duplicate. There is something different on each one of the rows in that table.

If you don't want all 3 rows - then you need to determine which one of those 3 rows you want and how you identify just that one row.

With that said - these are not duplicates. If there are 2 rows in GiftDetails for that Employee - then there were 2 gifts provided even if those gifts have the exact same name.

I also just noticed you are wrapping all of the columns with ISNULL. Since you are using an INNER JOIN it is almost a certainty that those columns will not be NULL. If they are null then there is really something more fundamentally wrong with the database - as you really should not have a row in the org table that does not have a name.

To further clarify - you are almost certainly going to get multiple rows for the results and each rows will be unique and not a duplicate.

Run these queries. Any tables that contain more than 1 matching row will cause duplicates when you join it to the Employees table. How you go about handling that depends on what you need from each and every table involved.

SELECT COUNT(*) AS Rows 
FROM SourceDetails s		
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM SpouseDetails sp		
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM org o					
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM DividentDetails dv		
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM ReimbursementDetails r	
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM Honoraria h			
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM GiftDetails g			
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM creditorDetails cr		
WHERE EmployeeID = 7650;

SELECT COUNT(*) AS Rows 
FROM dateDetails d			
WHERE EmployeeID = 7650
AND CONVERT(date, d.dateCreated) >= '2024-01-25' 
AND CONVERT(date, d.dateCreated) < '2024-05-31';

Hi @SqlHippo

I have done one better,

I have run the following queries, each querying its own table.

The results of the queries are attached as screenshot. I have superimposed some names so I don't get in trouble.

As you can see from the screenshots, each record is associated with a particular employee.

This is more like master /detail relationship.

An ideal solution for us would have been to run each query separately into one stored procedure.

@RogierPronk ,I ran the query you wrote and it was giving me one row for each table. I am not sure if using the MAX() function will be helpful.

Please see queries and the results they produce.

Whatever advice you give me from that point on, I am ok with.

Thank you guys very much for your patience and continued assistance.

SELECT s.employeeID,s.sourcename,s.sourceaddress FROM SourceDetails s 
inner join dateDetails d on s.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and s.employeeID=7650

SELECT sp.EmployeeID,sp.spousename,sp.spouseaddress FROM SpouseDetails sp
inner join dateDetails d on sp.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and sp.employeeID=7650

SELECT dv.EmployeeID,dv.dividentname,dv.dividentaddress FROM DividentDetails dv
inner join dateDetails d on dv.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and dv.employeeID=7650

SELECT r.EmployeeID,r.reimbursementname,r.reimbursementaddress FROM ReimbursementDetails r
inner join dateDetails d on r.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and r.employeeID=7650

SELECT h.EmployeeID,h.honorarianame,h.honorariaaddress FROM Honoraria h
inner join dateDetails d on h.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and h.employeeID=7650

SELECT g.EmployeeID,g.giftname,g.giftaddress FROM GiftDetails g
inner join dateDetails d on g.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and g.employeeID=7650

SELECT o.EmployeeID,o.orgname,o.orgaddress FROM Org o
inner join dateDetails d on o.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and o.employeeID=7650

SELECT cr.employeeID,cr.creditorName,cr.creditorAddress FROM creditorDetails cr
inner join dateDetails d on cr.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and cr.employeeID=7650

Since you have 3 rows from SourceDetails for EmployeeID 7650 - which row do you want? And how will you determine which of those 3 rows is the correct row?

Do you really have 3 different sources - or is it actually one source repeated 3 times?

How many dateDetail rows fall within the range of dates your are using?

Ok, the way I was told to display the data is this, an employee with ID of 7650 completes a form.

In survey question #1 which is SourceDetails table, the user enters three response to that survey #1.

In survey #2, the same employee who is complete survey question #1 (SourceDetails) to question #2 (CreditDetails), the employee again provides three separate answers.

That's why you are seeing the tables display 3 rows or 4 rows.

As I answered a question 2 days ago, I think when I said that it depends on user's input.

Another employee could provide NA to all 8 questions. In this case, it will be one row of data for each table.

So, I was told that if you choose an employee, say employee with ID of 7650, then all his/her responses should be displayed.

So, in this instance, an employee with ID of 7650 has all of those responses and selecting that employee will show his/her detailed responses.

So, we are looking for a way where you can select an employee with ID of 7650 and all the responses from SourceDetails table (3 rows) to CreditDetails table (4 rows) associated with that ID are displayed.

I hope this makes sense.

So where does the DISTINCT or GROUP BY come in?

If you JOIN from Employee to SourceDetails - on the EmployeeID - then you will get all of the rows related to that employee. If you want to filter the rows down to a specific set of survey questions - shouldn't you have something in the SourceDetails table that identifies the Survey?

How do you know the 3 rows in SourceDetails is related to the latest survey - and not a previous one?

@jeffw8713 , these surveys are taken once a year.

The date range shows the time frame the surveys are taken and each employeeID is generated during each time frame of the survey.

For instance, an employee has a unique ID s/he logs in to take the survey and once the survey is taken, a new employeeID is auto generated.

In the code I posted, this employeeID is 7650 and is tied to ALL the survey questions as the code I posted showed.

The only thing I didn't use here is the Unique ID which is not used across all tables.

Ok, I hope I am not throwing in confusions here.

This employee's unique ID is K1234. I made this up because I don't want to show the actual unique ID but the point is that this unique ID is not stored on any of those tables. What is stored in all the tables is the auto generated empoyeeID.

So, my thinking is that since this employeeID is related to the anwers provided by this employee across all tables, I thought that would be sufficient to pull all the records associated with this employee.
Base on the point you raised, can the unique ID be used to tie to the empoyeeID?

Not sure if that will make a difference.

For instance, if I change the query to this:

SELECT s.employeeID,s.sourcename,s.sourceaddress FROM SourceDetails s 
inner join dateDetails d on s.employeeid = d.employeeid 
where cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) >= 20240125 and cast(CONVERT(NCHAR(8),d.dateCreated,112) AS int) < 20240331 and s.employeeID=7650 and userID = 'K1234'

Will this make any difference?

I don't see how though. You guys are the experts.

Please tell me.

Once again, thanks a lot for all your responses.

It's very hard to give you a proper advice if you ask us to give advice and we don't have the whole story. First you want to have one result for an employee and at the end you provide us a complete new field with userID. We don't know how this user is connected to the tables so it's hard to give you any advice.

I really want to help you but you need to help me first to help you. If you provide a working data example and a desired result I can and will help you, something like this. You can provide the results you want in a screenprint from excel or something simulair.


DROP TABLE IF EXISTS #SourceDetails;

SELECT '7650'  AS EmployeeID, 'John Doe' as sourcename, '123 ABC rd' AS sourceaddress
INTO #SourceDetails
UNION ALL
SELECT '7650'  AS EmployeeID, 'John Doe' as sourcename, '123 ABC rd' AS sourceaddress
UNION ALL
SELECT '7650'  AS EmployeeID, 'John Doe' as sourcename, '123 ABC rd' AS sourceaddress

SELECT * FROM #SourceDetails