Planning out basic TSQL to manipulate table data

Hi

Can someone help me get into the thinking of knowing how to fix data in SQL tables (by trying not to give me an SQL routines I could run).
Ok, this is the situation…. Suppose I have a single table with has a column called ColumnA which has lots of duplicate values. I need to remove all the duplicate entries. Question is….if I had to write pseudo-code as a plan, what SQL should be written
Many thanks to anyone who can offer me any pointers.

Kind Regards
James

1 Like

Use a CTE to partition by ColumnA and assign a ROW_NUMBER. Then DELETE form that CTE where the row_number > 1.

1 Like

Hi Scott
And then should I INSERT the rows back into the original table from the temporary table? I think I'm getting the gist now - not totally certain yet though.
Cheers
James

You would delete from your actual table. In that delete you would join to the CTE. The CTE is part of the delete, not a temporary table, though a temporary table would work too.

1 Like

No need to join to the CTE. You can delete from the underlying table by deleting from the CTE.

2 Likes

Hi
I'm getting there. Could anyone advise how to change this code so it will output the rows to a table rather then to screen please. I need to run this SQL at night as this query could take a long time :-

WITH MyDupesRN AS ( SELECT * , ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate, NumValue) AS rn FROM #Test ) SELECT SomeDate, NumValue, rn FROM MyDupesRN WHERE rn>1;

Thanks
James

WITH MyDupesRN AS
(
SELECT *
, ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate, NumValue) AS rn
FROM #Test
)

INSERT INTO mytable (someDate, NumValue, rn)

SELECT SomeDate, NumValue, rn
FROM MyDupesRN
WHERE rn>1;

1 Like
WITH MyDupesRN AS 
(
SELECT *
, ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate) AS rn
FROM #Test
)
DELETE FROM MyDupesRN
WHERE rn > 1
1 Like

Nice! Learn something new everyday.

1 Like

Hi
These solutions are amazing. To think that a solution can be found so easily is mind boggling to me. However if I searched on Google for an answer I would have just assumed that CTE are just common commands lol.

Ok to advance on any solutions above. Suppose I only want to insert the output into a table ok without duplicating another column I've not mentioned as yet e.g patient_id.--- how would we go about doing that? This would stop my output table from getting too large as I suspect it will as I'm mainly interested in the patient_id.

Cheers
James

Hi
I think I worked out what I need using basic SQL. The number of columns with a HAVING statement will be around the 20 mark so I'm not sure if this efficient or not if anyone can say if using CTE is better.

If someone thinks the same thing will run better using CTE please me know :-

SELECT SomeDate, NumValue, PatientID INTO TABLESTORE FROM TESTTABLE GROUP BY SomeDate, NumValue, PatientID HAVING COUNT(SomeDate)>1 AND COUNT(NumValue)>1 AND COUNT(PatientID)>1 -- could be 20 of these ORDER BY PatientID

Cheers
James

think of a CTE as an easy way to encapsulate a query. In the query itself, you still have to state what you need. So, using a CTE is not necessarily better, just different. However if you have many subqueries (you don't) CTEs are a nice way of organizing things.

1 Like

Hi gbritton
I was initially and still finding the WITH statement harder to write but when its finished I definitely find it easier to read back for some reason - more elegant and satisfying to write. I going to stick with WITH :slight_smile: from now on.
Cheers
James

Hi Scott
Why did you miss out the part in your solution?

It probably makes no difference to the output but just wondering - do you always miss out these parameters?

WITH MyDupesRN (SomeDate, NumValue) AS

Cheers
James

Those are not parameters, they are output column names. They are not required if all the column names in the cte are unique. You cannot have duplicate columns names in the cte, since it is effectively a table.