Planning out basic TSQL to manipulate table data

mastersql · December 8, 2015, 11:11pm

Hi

Can someone help me get into the thinking of knowing how to fix data in SQL tables (by trying not to give me an SQL routines I could run).
Ok, this is the situation…. Suppose I have a single table with has a column called ColumnA which has lots of duplicate values. I need to remove all the duplicate entries. Question is….if I had to write pseudo-code as a plan, what SQL should be written
Many thanks to anyone who can offer me any pointers.

Kind Regards
James

ScottPletcher · December 8, 2015, 11:21pm

Use a CTE to partition by ColumnA and assign a ROW_NUMBER. Then DELETE form that CTE where the row_number > 1.

mastersql · December 8, 2015, 11:30pm

Hi Scott
And then should I INSERT the rows back into the original table from the temporary table? I think I'm getting the gist now - not totally certain yet though.
Cheers
James

TaraKizer · December 9, 2015, 12:15am

You would delete from your actual table. In that delete you would join to the CTE. The CTE is part of the delete, not a temporary table, though a temporary table would work too.

JeffModen · December 9, 2015, 1:38am

No need to join to the CTE. You can delete from the underlying table by deleting from the CTE.

mastersql · December 9, 2015, 2:28pm

Hi
I'm getting there. Could anyone advise how to change this code so it will output the rows to a table rather then to screen please. I need to run this SQL at night as this query could take a long time :-

WITH MyDupesRN AS ( SELECT * , ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate, NumValue) AS rn FROM #Test ) SELECT SomeDate, NumValue, rn FROM MyDupesRN WHERE rn>1;

Thanks
James

gbritton · December 9, 2015, 3:08pm

WITH MyDupesRN AS
(
SELECT *
, ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate, NumValue) AS rn
FROM #Test
)

INSERT INTO mytable (someDate, NumValue, rn)

SELECT SomeDate, NumValue, rn
FROM MyDupesRN
WHERE rn>1;

ScottPletcher · December 9, 2015, 4:45pm

WITH MyDupesRN AS 
(
SELECT *
, ROW_NUMBER() OVER (PARTITION BY SomeDate, NumValue ORDER BY SomeDate) AS rn
FROM #Test
)
DELETE FROM MyDupesRN
WHERE rn > 1

TaraKizer · December 9, 2015, 6:11pm

Nice! Learn something new everyday.

mastersql · December 9, 2015, 8:03pm

Hi
These solutions are amazing. To think that a solution can be found so easily is mind boggling to me. However if I searched on Google for an answer I would have just assumed that CTE are just common commands lol.

Ok to advance on any solutions above. Suppose I only want to insert the output into a table ok without duplicating another column I've not mentioned as yet e.g patient_id.--- how would we go about doing that? This would stop my output table from getting too large as I suspect it will as I'm mainly interested in the patient_id.

Cheers
James

mastersql · December 9, 2015, 10:30pm

Hi
I think I worked out what I need using basic SQL. The number of columns with a HAVING statement will be around the 20 mark so I'm not sure if this efficient or not if anyone can say if using CTE is better.

If someone thinks the same thing will run better using CTE please me know :-

SELECT SomeDate, NumValue, PatientID INTO TABLESTORE FROM TESTTABLE GROUP BY SomeDate, NumValue, PatientID HAVING COUNT(SomeDate)>1 AND COUNT(NumValue)>1 AND COUNT(PatientID)>1 -- could be 20 of these ORDER BY PatientID

Cheers
James

gbritton · December 10, 2015, 6:31pm

think of a CTE as an easy way to encapsulate a query. In the query itself, you still have to state what you need. So, using a CTE is not necessarily better, just different. However if you have many subqueries (you don't) CTEs are a nice way of organizing things.

mastersql · December 10, 2015, 8:39pm

Hi gbritton
I was initially and still finding the WITH statement harder to write but when its finished I definitely find it easier to read back for some reason - more elegant and satisfying to write. I going to stick with WITH from now on.
Cheers
James

mastersql · December 10, 2015, 8:43pm

Hi Scott
Why did you miss out the part in your solution?

It probably makes no difference to the output but just wondering - do you always miss out these parameters?

WITH MyDupesRN (SomeDate, NumValue) AS

Cheers
James

ScottPletcher · December 10, 2015, 9:41pm

Those are not parameters, they are output column names. They are not required if all the column names in the cte are unique. You cannot have duplicate columns names in the cte, since it is effectively a table.