Dear all
Using SQL Svr 2008 R2, I need to clean up duplicated records, as well as set a valid_to date.
My table has the following 5 columns:
id (identifying record id, however some numbers are skipped)
doc_id (this is the document id = invoice number)
qty (this is the quantity of sold products included in this doc_id)
valid_from (this is the creation or modification date of the doc_id; the qty can change later on)
valid_to (is initially null.. but should indicate the end of life date of this doc_id's version)
Every day, I'm inserting new records to the table, containing all new doc_ids, as well as any change made on existing doc_ids. It can also contain duplicates of already existing doc_ids without any change.
Example data (the real table contains about 3 mio rows): .id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | null 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | null 1018 | 520 | 6 | 2015-12-07 | null 1022 | 521 | 8 | 2015-12-08 | null
I'm looking for a good performing UPDATE query, that
a) removes unnecessary duplicates, meaning no change in qty
b) adds the valid_to which is always one day before the valid_from of the next version for a certain doc_id. Or it remains null if the current line is the most latest version of the doc_id
Do you have any suggestion?
If it makes sense, I can also add an index to some columns. And could also change the null value in valid_from to a certain date (for example 2300-01-01), as null is not indexable.
After the query executed, above table shoud look like this: .id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | 2015-12-02 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | 2015-12-07 1022 | 521 | 8 | 2015-12-08 | null
record id 1018 was removed because it was a duplicate.
declare @table table
(
id int,
doc int,
qty int,
valid_fr date,
valid_to date
)
insert into @table (id, doc, qty, valid_fr)
select 1001 , 520 , 5 , '2015-12-01' union all
select 1005 , 520 , 6 , '2015-12-03' union all
select 1012 , 521 , 2 , '2015-12-03' union all
select 1018 , 520 , 6 , '2015-12-07' union all
select 1022 , 521 , 8 , '2015-12-08'
----------------------------------
-- removes unnecessary duplicates
----------------------------------
delete d
from
(
select *, rn = row_number() over (partition by doc, qty order by id)
from @table
) d
where d.rn > 1
----------------------------------
-- Show the result after remove
----------------------------------
select *
from @table
----------------------------------
-- update valid_to
----------------------------------
; with
cte as
(
select *, rn = row_number() over (partition by doc order by valid_fr)
from @table
)
update c1
set valid_to = dateadd(day, -1, c2.valid_fr)
from cte c1
inner join cte c2 on c1.doc = c2.doc
and c1.rn = c2.rn - 1
----------------------------------
-- Show the final result
----------------------------------
select *
from @table
Many thanks khtan for your answer! For the removing duplicates part, there is one issue I'd like to mention, as not yet clearly stated in my example and not yet considered in your query:
The row should only be removed, if it is a repeat (same qty) of the very last row with same doc_id.
.id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | null 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | null 1018 | 520 | 6 | 2015-12-07 | null <- Remove this, as it unnecessarily repeats Qty of last row (id 1005) 1022 | 521 | 8 | 2015-12-08 | null 1023 | 520 | 5 | 2015-12-10 | null <- Do NOT remove this, bcs Qty is different from last row (id 1018), even if same Qty as earlier (id 1001)
Maybe a similar approach has to be used like in the update valid_to query (c1.rn = c2.rn+1) ?