Delete before insert

arkiboys · January 25, 2016, 10:25am

Hi,
There are two tables
tbl_data_Load
tbl_Data

tbl_Data is the main table which holds data
tbl_Data_Load is the staging table which every now and then it gets populated when a job is run and then the data in it gets inserted into the table tbl_Data

The structure of the two tables is identical. It's just that the table tbl_Data obviously has more data after each load and the table tbl_Data_Load gets truncated after each insert into the table tbl_Data

Question:
Before each insert into tbl_Data, what kind of delete query can I use to run against tbl_Data so that the data is first deleted (in case it exists) before the data in tbl_Data_Load is loaded into tbl_Data ?

Thanks

djj55 · January 25, 2016, 1:23pm

So you do not want to merge data just replace?

DELETE A FROM tbl_Data A 
INNER JOIN tbl_Data_Load B ON A.ID = B.ID;

Notice that the ID is whatever column(s) that make it unique

arkiboys · January 25, 2016, 2:03pm

Hi, There are a lot of columns, so I guess replacing is the best option. thanks

Kristen · January 25, 2016, 5:32pm

I would recommend that you don't do that.

DELETE rows where the [tbl_Data] PKey can no longer be found in [tbl_Data_Load]

UPDATE any rows where the PKey already exists. Do not update unless at least one column has changed (that will fire any triggers, which increases CPU/work at best and at worst will cause associated tasks to run - e.g. sending a "Record has changed" email when ... there is actually no change!). Use a BINARY Collation for (CHAR) column comparisons to ensure that case-insensitive differences (like "'a' changed to 'A') are treated as significant.

If the Source data has an Update Date/Time column (or something else which is a reliable indicator of a change, such as Row Version or an incremented EditCount etc), which is reliable, use that to reduce the number of rows that are considered for updating.

INSERT any rows where the PKey does not already exist

Doing the three steps in that order should reduce the effort for UPDATE somewhat.

You can also use the MERGE command instead of three separate steps, but personally I find the syntax of the MERGE unwieldy and harder to get right-first-time. Its probably just me though ...

Limiting the UPDATE (.e.g by NOT pre-deleting all rows, and not updating rows that have not changed) will make a significantly (probably hugely significant!) difference to the volume of data recorded in the Transaction Log. That will save both Disk and CPU time.

jeffw8713 · January 25, 2016, 6:53pm

Don't do this - since this is a staging table for an incremental load - doing this will wipe out all previously loaded data. What you want to do is a MERGE process, whether that is using the MERGE command or an UPSERT (update/insert) or a REPLACE (delete/insert).

Using MERGE command you can do the update/insert in one statement as Kristen outlines above. Be careful with this as there are some known performance issues with this method. It will depend on how many rows are included in the incremental process.

Using UPSERT you update all rows that match between the tables and insert any rows in the staging table that don't exist in the final table.

Using REPLACE you delete everything in the destination (final) table - and insert all rows from the source. This can be a valuable process if the source system is sending a full update of specific data and there could be stale data in the destination. For example, if we had an invoice where transactions could be deleted from the source or moved to another invoice you would want to delete/insert to insure that previous rows that are no longer valid have been removed.