Our department has had some concerns about high VLF counts. We have a few databases with counts around 5,000.
I've read that high VLF counts can slow down log backups and restore operations, but I don't understand how that's possible. My understanding is that a high VLF count is like a fragmented log file. When tables are fragmented they can affect performance, but usually very little. So the log file has fragments -- why is that a big deal? Will this also make full backups slower, or just log backups?
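Something like this shows the VLF count per database (assuming SQL Server 2016 SP2 or later for sys.dm_db_log_info; DBCC LOGINFO works on older builds):

    SELECT d.[name]  AS database_name,
           COUNT(*)  AS vlf_count
    FROM sys.databases AS d
    CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
    GROUP BY d.[name]
    ORDER BY vlf_count DESC;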
Anyway, I'm trying some timing experiments with full backups while shrinking the log file and reducing the VLF count.
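The shrink-and-regrow step looks roughly like this (database and logical file names are placeholders; the shrink collapses the empty VLFs, and regrowing in large fixed increments creates fewer, bigger ones):

    USE [YourDatabase];
    -- Shrink the (mostly empty) log to collapse the unused VLFs
    DBCC SHRINKFILE (N'YourDatabase_log', 1024);   -- target size in MB

    -- Grow it back in large, fixed increments so fewer, larger VLFs are created
    ALTER DATABASE [YourDatabase]
        MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 32768MB);
    ALTER DATABASE [YourDatabase]
        MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 65536MB);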
I've encountered slow restore operations due to many VLFs. There's a very good writeup on getting detailed backup/restore data via extended events:
I've used this (or a variation of it) to find out what happens after a restore hits "100 percent finished"... and then just sits there for several minutes. There are VLF operations that continue after that point; they're highlighted in the images in that article. If you have thousands of VLFs, they can add considerable delay.
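The core of that kind of session is just the backup/restore progress event going to a file target. A minimal sketch, assuming SQL Server 2016 or later where sqlserver.backup_restore_progress_trace is available (the file path is a placeholder):

    CREATE EVENT SESSION [backup_restore_trace] ON SERVER
    ADD EVENT sqlserver.backup_restore_progress_trace
    ADD TARGET package0.event_file (SET filename = N'X:\XE\backup_restore_trace.xel')
    WITH (STARTUP_STATE = OFF);
    GO
    ALTER EVENT SESSION [backup_restore_trace] ON SERVER STATE = START;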
It's easy to test the impact: make backups of the database and log when the VLF count is high, then reduce the VLFs and make new backups. Restore all of them while capturing the extended events and compare. I remember trimming something like 30 minutes or more off a restore operation by reducing the VLF count down to 100 or so.
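The comparison itself is just the two sets of backups restored with the session running; roughly (names and paths are placeholders):

    -- Run 1: backup while the VLF count is high
    BACKUP DATABASE [YourDatabase]
        TO DISK = N'X:\Backups\YourDatabase_highVLF.bak'
        WITH INIT, COMPRESSION, STATS = 10;

    -- ...shrink/regrow the log, then take a second set with the low VLF count...

    -- Restore each backup (to a test name) with the event session running
    RESTORE DATABASE [YourDatabase_test]
        FROM DISK = N'X:\Backups\YourDatabase_highVLF.bak'
        WITH MOVE N'YourDatabase'     TO N'X:\Data\YourDatabase_test.mdf',
             MOVE N'YourDatabase_log' TO N'X:\Log\YourDatabase_test.ldf',
             STATS = 10;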
Full backups can be affected if there's a lot of write activity in the database after the backup starts, because the active part of the transaction log gets included in the full backup.
Edit: one more thing about high VLF counts -- they can make your log backup files larger even when there's no log activity. I remember having a log with 50K or so VLFs, completely empty, where the log backup was 128 MB or so. Getting it down to 100 VLFs made the backup 128 KB.
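You can pull the before/after sizes straight from the backup history in msdb, e.g.:

    SELECT TOP (10)
           bs.database_name,
           bs.backup_start_date,
           bs.backup_size / 1048576.0 AS backup_size_mb
    FROM msdb.dbo.backupset AS bs
    WHERE bs.database_name = N'YourDatabase'
      AND bs.type = 'L'                      -- log backups only
    ORDER BY bs.backup_start_date DESC;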
Thanks for that extended event, Robert! That is interesting.
So I did some testing. Same database, same server. I started with a log file of about 130 GB and 4,000 VLFs, and in a few steps got the log down to about 10 GB and 300 VLFs.
The restore times dropped significantly as the log file and the VLF count shrank! I was surprised. Then I grew the log back in steps to about 130 GB and 350 VLFs, and the restore times were about the same as they had been at comparable sizes on the way down.
In each scenario the log file was about 99% free.
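For reference, the free/used numbers can be checked with sys.dm_db_log_space_usage from inside the database in question:

    SELECT total_log_size_in_bytes / 1048576 AS total_log_mb,
           used_log_space_in_bytes / 1048576 AS used_log_mb,
           used_log_space_in_percent
    FROM sys.dm_db_log_space_usage;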
So that tells me the restore times were related to the size of the log file (even though it was mostly free space), and the VLF count wasn't really a factor.
So I checked the extended event. And sure enough, these events increased tenfold when we went from 350 VLFs to 4,000 VLFs:
Processing xxxx VLF headers
Processing VLF headers is complete
But this was small (approx 5%) compared to these events:
So, any idea why the restore time is related to these mostly empty log files? I'm guessing it's just the time required to allocate space for these huge files.
I haven't found a good resource on VLF stuff beyond what Kimberly and Paul cover on SQLSkills.com. Nic Cain had a blog post a long time ago on improving backup and restore performance with BUFFERCOUNT and MAXTRANSFERSIZE; he used trace flags to write additional messages to the output and error log, but it's the same info you get from the extended event. I was actually using it to tune that performance when I noticed the VLF messages.
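If it helps, that approach is roughly the following -- the trace flags write the extra backup/restore detail to the error log, and the two knobs go on the backup itself. The values here are only illustrative, and the trace flag numbers are from memory, so verify them before using:

    DBCC TRACEON (3004, 3213, 3605, -1);   -- extra backup detail, routed to the error log

    BACKUP DATABASE [YourDatabase]
        TO DISK = N'X:\Backups\YourDatabase.bak'
        WITH INIT, COMPRESSION,
             BUFFERCOUNT = 64,              -- illustrative, not a recommendation
             MAXTRANSFERSIZE = 4194304;     -- 4 MB, the maximum

    DBCC TRACEOFF (3004, 3213, 3605, -1);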
It's probably worth your time to set up a testing framework you can run through automation to get timings and details via the extended event; if nothing else, it would make a good blog post and/or presentation. You'd want to try multiple VLF counts, VLF sizes, total log sizes, and total backup sizes (amounts of log activity). I remember Nic finding a sweet spot for the backup settings, and I was seeing a bit of a sweet spot for VLF size and count when I was testing restores. You'd also want to control for autogrow events and variable VLF sizes; the newer VLF sizing algorithm will cause more variability than the older one.
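For the analysis side, the captured messages and timestamps can be pulled back out of the session's .xel files and lined up across runs. A rough sketch, assuming the placeholder path used above and that the event exposes its text in a trace_message field:

    SELECT CAST(event_data AS xml).value('(event/@timestamp)[1]', 'datetime2') AS event_time,
           CAST(event_data AS xml).value('(event/data[@name="trace_message"]/value)[1]',
                                         'nvarchar(4000)')                     AS trace_message
    FROM sys.fn_xe_file_target_read_file(N'X:\XE\backup_restore_trace*.xel', NULL, NULL, NULL)
    ORDER BY event_time;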
If you want to dig even deeper, you might need the Windows Debugger and the public debug symbols for SQL Server. You can find directions on SQLSkills if you want to go that route; that should provide the most detail on why high VLF counts slow the process down.
One other thing to keep in mind with a giant log file and restores is the time to write the file. My memory says SQL Server needs to zero out the disk for transaction logs -- even if the file is empty. I vaguely remember that Instant File Initialization really helps for data files but not log files. I also barely recall that changing in some release, but I can't remember the version.
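On recent builds you can at least see whether instant file initialization is available to the instance at all via sys.dm_server_services:

    SELECT servicename,
           instant_file_initialization_enabled
    FROM sys.dm_server_services;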
I would also assume that empty VLFs are handled much differently than full VLFs that need actual restoring.
But it's late on Christmas Eve, so I may be remembering wrong. It's probably worth a little research, though.