How to clean up unnecessary records and set end of life date

janosh · December 18, 2015, 11:17pm

Dear all
Using SQL Svr 2008 R2, I need to clean up duplicated records, as well as set a valid_to date.
My table has the following 5 columns:

id (identifying record id, however some numbers are skipped)
doc_id (this is the document id = invoice number)
qty (this is the quantity of sold products included in this doc_id)
valid_from (this is the creation or modification date of the doc_id; the qty can change later on)
valid_to (is initially null.. but should indicate the end of life date of this doc_id's version)

Every day, I'm inserting new records to the table, containing all new doc_ids, as well as any change made on existing doc_ids. It can also contain duplicates of already existing doc_ids without any change.

Example data (the real table contains about 3 mio rows):
.id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | null 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | null 1018 | 520 | 6 | 2015-12-07 | null 1022 | 521 | 8 | 2015-12-08 | null
I'm looking for a good performing UPDATE query, that
a) removes unnecessary duplicates, meaning no change in qty
b) adds the valid_to which is always one day before the valid_from of the next version for a certain doc_id. Or it remains null if the current line is the most latest version of the doc_id

Do you have any suggestion?

If it makes sense, I can also add an index to some columns. And could also change the null value in valid_from to a certain date (for example 2300-01-01), as null is not indexable.

After the query executed, above table shoud look like this:
.id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | 2015-12-02 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | 2015-12-07 1022 | 521 | 8 | 2015-12-08 | null
record id 1018 was removed because it was a duplicate.

I appreciate any help!

regards, janosh

khtan · December 19, 2015, 12:23am

declare @table table
(
    id        int,
    doc        int,
    qty        int,
    valid_fr    date,
    valid_to    date
)
insert into @table (id, doc, qty, valid_fr)
select    1001 , 520 , 5 , '2015-12-01' union all
select    1005 , 520 , 6 , '2015-12-03' union all
select    1012 , 521 , 2 , '2015-12-03' union all
select    1018 , 520 , 6 , '2015-12-07' union all
select    1022 , 521 , 8 , '2015-12-08'

----------------------------------
-- removes unnecessary duplicates
----------------------------------

delete     d
from
(
    select    *, rn = row_number() over (partition by doc, qty order by id)
    from    @table
) d
where    d.rn    > 1

----------------------------------
-- Show the result after remove
----------------------------------

select    *
from    @table

----------------------------------
-- update valid_to
----------------------------------
; with 
cte as
(
    select    *, rn = row_number() over (partition by doc order by valid_fr)
    from    @table
)
update    c1
set    valid_to    = dateadd(day, -1, c2.valid_fr)
from    cte c1
    inner join cte c2    on    c1.doc    = c2.doc
                and    c1.rn    = c2.rn - 1

----------------------------------
-- Show the final result
----------------------------------

select    *
from    @table

janosh · December 19, 2015, 11:56pm

Many thanks khtan for your answer! For the removing duplicates part, there is one issue I'd like to mention, as not yet clearly stated in my example and not yet considered in your query:

The row should only be removed, if it is a repeat (same qty) of the very last row with same doc_id.

.id..|.doc.|qty|.valid_from.|.valid_to. 1001 | 520 | 5 | 2015-12-01 | null 1005 | 520 | 6 | 2015-12-03 | null 1012 | 521 | 2 | 2015-12-03 | null 1018 | 520 | 6 | 2015-12-07 | null <- Remove this, as it unnecessarily repeats Qty of last row (id 1005) 1022 | 521 | 8 | 2015-12-08 | null 1023 | 520 | 5 | 2015-12-10 | null <- Do NOT remove this, bcs Qty is different from last row (id 1018), even if same Qty as earlier (id 1001)
Maybe a similar approach has to be used like in the update valid_to query (c1.rn = c2.rn+1) ?

best regards, janosh.

khtan · December 20, 2015, 10:32am

yes. you can use similar approach to identify the duplicate and remove it

JeffModen · December 23, 2015, 3:00am

No need. Just change the ORDER BY to "DESC" in the ROW_NUMBER() and it will remove all duplicates except the first row with the DocID/Qty.