Newbie needs forensics help

I am a coder who inherited an application that interacts with an MS SQL Server database. The application is in use at hundreds of locations and generally works well. However, we have one site where rows from one table are getting deleted when they should not be.
I have been digging around online for ideas and, not being a DBA, find much of the discussion over my head. So, here I am.

The application database uses the Simple recovery model. It is my understanding that the Simple model was used because the log was growing too large and no one knew how to back it up so it would shrink.
I believe if I could view the transaction log I could discover when and by what means those rows are deleted. Correct?

From what I gather, if I wish to use/view the transaction log, the recovery model must be changed. But if the model is changed to Full, I will need a means to back up the log. Is that correct?
In a very old thread, I found that Tara Kizer had created a stored procedure to do that backup, but the link is broken. Suggestions appreciated.

You can use this command:

SELECT TOP (100) * FROM fn_dblog(NULL,NULL)

to view the log records. Deletes are still logged under the simple recovery model; it's just that those log records can be overwritten (re-used) at any time. Therefore, you need to capture the log info as soon as possible after the DELETEs.

For example, you could do SELECT * INTO db_log_table FROM fn_dblog(NULL, NULL) and then use queries to try to find the DELETE(s) in the log data.
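If you only get a brief window on the box, something like this would snapshot the log and then hunt for the deletes. This is just a hedged sketch: db_log_table is the name used above, the other names are standard fn_dblog columns, and LOP_DELETE_ROWS / LOP_BEGIN_XACT are the usual operation codes.

-- Snapshot the current log records into a permanent table before they are re-used
SELECT * INTO dbo.db_log_table FROM fn_dblog(NULL, NULL);

-- Find delete operations and join back to the start of each transaction for its time and name
SELECT d.[Current LSN], d.[Transaction ID], d.AllocUnitName, b.[Begin Time], b.[Transaction Name]
FROM dbo.db_log_table AS d
JOIN dbo.db_log_table AS b
  ON b.[Transaction ID] = d.[Transaction ID]
 AND b.Operation = 'LOP_BEGIN_XACT'
WHERE d.Operation = 'LOP_DELETE_ROWS';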

This must be done on the customer's machine, correct?
That is, I cannot back up the database and then examine it on my machine.
If true, the issue is that I rarely have access to the customer site, so getting the log while the deletes are still in there would be problematic.

I think it is hard to use the Transaction Log File to reverse engineer "what went wrong" with deletes. But I have not used any of the modern tools that interrogate the log, so my reluctance may be misplaced.

You could create an AFTER DELETE trigger on the table - for example, to log the Time, UserID (maybe) and the PKey ID of the row being deleted (to some new logging table that you create). As this is a production system you might consider that this adds some risk / fragility to an APP that is, in other respects, running Just Fine. Personally I wouldn't hesitate, although I'd test it first on a non-Production system :slightly_smiling:
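Something along these lines - a minimal sketch only, assuming your table is dbo.YourTable with an int ID primary key (all the names here are made up, adjust to suit):

-- Logging table to record every delete
CREATE TABLE dbo.YourTable_DeleteLog
(
    DeletedID  int           NOT NULL,
    DeletedAt  datetime2(0)  NOT NULL DEFAULT SYSUTCDATETIME(),
    DeletedBy  sysname       NOT NULL DEFAULT SUSER_SNAME(),
    HostName   nvarchar(128) NULL     DEFAULT HOST_NAME()
);
GO

CREATE TRIGGER dbo.trg_YourTable_AfterDelete
ON dbo.YourTable
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- "deleted" is the pseudo-table holding the rows being removed
    INSERT INTO dbo.YourTable_DeleteLog (DeletedID)
    SELECT d.ID
    FROM deleted AS d;
END;
GO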

You could use SQL Profiler to see what commands are being sent to the server (and from that, maybe, work out what is causing the deletion). It would help A LOT if you had repeatable circumstances: then you could put a FILTER in SQL Profiler so that it only logged, say, that one user's commands, start a trace and shout "GO" to the user, and then stop the trace when they say "It's finished". This is pretty easy to do, but it won't tell you what was deleted, just what SQL commands were issued - from which you would have a chance of working out which of them was having the side effect of deleting data. Personally I would do this after logging some data from the trigger, above, on the assumption that the trigger would tell you Who and When; then you could ask the user "What were you doing at XXX time?", establish between you what process caused the problem, and thus hopefully create a repeatable set of circumstances.
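If you would rather script the trace than drive the Profiler GUI, an Extended Events session does the same job of capturing one user's commands. This is my own hedged sketch of that alternative, not a Profiler feature; the session name, login and file path are all made up:

CREATE EVENT SESSION capture_one_user ON SERVER
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.sql_text, sqlserver.client_hostname)
     WHERE sqlserver.username = N'DOMAIN\SomeUser'),
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.sql_text, sqlserver.client_hostname)
     WHERE sqlserver.username = N'DOMAIN\SomeUser')
ADD TARGET package0.event_file (SET filename = N'C:\Temp\capture_one_user.xel');
GO
-- Start when the user says "GO", stop when they say "finished"
ALTER EVENT SESSION capture_one_user ON SERVER STATE = START;
-- ALTER EVENT SESSION capture_one_user ON SERVER STATE = STOP;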

Whilst you are in there with your broom sweeping up stuff :slightly_smiling: I suggest you sort out the SIMPLE Recovery Model. My suggestion would be:

Figure out how often backups are currently being taken. There are SQL queries that can be run which will tell you (one is sketched below), but you might be able to find the appropriate folder on the server where the backup files are stored and see how frequently they are made, and whether all the files are a similar size (or have names that suggest they are FULL backups rather than DIFFerentials). If you have a huge file on a Sunday, then a small file on a Monday, and bigger each day through Saturday, that would suggest FULL on Sunday and DIFF each weekday. Perhaps they have a FULL each night and a DIFF at noon.
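For the query route, a quick sketch against msdb's backup history (swap in the actual database name; type D = full, I = differential, L = log):

SELECT TOP (50)
       bs.database_name,
       bs.backup_start_date,
       bs.type,
       bs.backup_size / 1024 / 1024 AS backup_size_mb
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = N'YourDatabase'
ORDER BY bs.backup_start_date DESC;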

Armed with that, ask the users "Are you OK to lose ALL data changes between these backup time-points, and are you able to recreate all data that would be lost in a disaster?"

I'm guessing they will say "No" :slightly_smiling:

I don't think that Tara is still maintaining her scripts. The most commonly recommended backup script that I see these days is by Ola Hallengren but, personally, I think that script is for DBAs who want access to all possible knobs and buttons to twiddle, and that it isn't a good choice for "Newbie to install, fire and forget". Whereas I think that the scripts by Minion will do just that - if you install the Minion Backup script, turn it on, and then do nothing else, you will have a reasonable backup strategy. There are then buttons that you can twiddle if necessary.

Absent knowledge about the way that your client actually uses their database, I would suggest you go for Log backups every 10 - 15 minutes, 24 hours a day (don't stop them at night because "no one is at work"; there may well be overnight jobs running, if not now then in future).
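For reference, once the database is in FULL recovery a single log backup is just the command below (path and file name made up); in practice a canned script such as Minion generates unique file names and a SQL Agent job does the every-10-to-15-minutes scheduling for you.

-- Each log backup needs its own file name
BACKUP LOG [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_log_201601011015.trn';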

http://minionware.net/backup/

They have other stuff for rebuilding indexes, which would be worth doing too (if there is nothing, currently, that does that from time to time)

If you embark on that road, and your company has other clients using SIMPLE Recovery Model and you establish that their data is seriously at risk in the event of a disaster, I would suggest that you ought to try to get your company to sort that out - once you have a nice simple canned solution under your belt :slightly_smiling:


Also, you could use an INSTEAD OF DELETE trigger to prevent DELETEs unless the conditions were valid for the DELETE. If you never allow DELETEs on that table, you simply log that a DELETE was attempted and cancel the DELETE. If the app does do some legitimate DELETEs, you'd have to add a validating factor, such as setting a particular CONTEXT_INFO() value or pre-loading a specified table row, for example. The problem with that is, if others have access to the code, they can simply reproduce the validating condition and continue doing their own DELETEs.
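A minimal sketch of that idea, using CONTEXT_INFO() as the validating factor and a logging table like the one sketched earlier (the table, trigger and token values are all made up):

CREATE TRIGGER dbo.trg_YourTable_InsteadOfDelete
ON dbo.YourTable
INSTEAD OF DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Only sessions that set this token first are allowed to delete
    IF CAST(CONTEXT_INFO() AS varbinary(2)) = 0x1234
    BEGIN
        DELETE t
        FROM dbo.YourTable AS t
        JOIN deleted AS d ON d.ID = t.ID;
    END
    ELSE
    BEGIN
        -- Log the attempt and quietly cancel the delete
        INSERT INTO dbo.YourTable_DeleteLog (DeletedID)
        SELECT d.ID FROM deleted AS d;
    END
END;
GO

-- Legitimate code would run this first in its session:
-- SET CONTEXT_INFO 0x1234;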

I wouldn't recommend you turn on full recovery model unless you have someone with some expertise to watch over the log backup process. You might instead add differential backups or increase their frequency, if diffs are already being used.
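For completeness, a differential backup is just this (names and path made up):

-- A differential contains only changes since the last FULL backup, so it stays relatively small and quick
BACKUP DATABASE [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_diff.bak'
WITH DIFFERENTIAL;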

I took Kristen's advice, got permission from the customer and ran the Profiler over a weekend. The deletions did occur in that time, but examination of the result simply confirmed that the rows were not being actively deleted. That left a rollback. After some more research I came to use SET XACT_ABORT ON within the stored procedure that creates the rows. It used a transaction which apparently was failing but not being rolled back until a later query. The new procedure has been in place for over a week and so far there are no missing rows. In the past it never went that long without an instance.
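For anyone who finds this later, the shape of the fixed procedure is roughly as below. This is a hedged sketch only; the procedure, table and column names are made up rather than the real app's code.

CREATE PROCEDURE dbo.usp_CreateRows
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- any run-time error now rolls the whole transaction back immediately

    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO dbo.YourTable (Col1, Col2)
        VALUES ('value1', 'value2');

        -- ... further work that previously could fail part-way through ...

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;  -- re-raise so the caller knows the create failed
    END CATCH;
END;
GO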
