SQL Server Log Shipping – Testing the impact of large operations

Background

In a project we have here at Oasis Digital, our customer relies on the log shipping feature of Microsoft SQL Server 2000 to keep some secondary databases, reporting databases, up to date in almost-real-time: every few minutes a transaction log backup runs, then at a slightly longer interval, every 5-15 minutes, a restore of those logs runs on secondary reporting databases. We’ve developed some code so that the reports can automatically run against any available reporting database; thus as the reporting database occasionally go offline for log shipping restores, the end users never notice that there’s more than one different database being used for their reports.

Most of the time, this works very well. The users get reports quickly, and those reports (no matter how large and painful the queries are) never have any impact on the production OLTP databas. But we have been stung a few times by changes we make in the production database having an unexpectedly large impact on that log shipping process.

We have two theories on what happens:

Certain kinds of changes in the production database, such as adding a field, changing a field type, changing an index, can affect a very large number of pages of the database – since SQL Server log shipping occurs a page at a time, an operation that makes a small change to 100,000 pages of the database can result in an extremely large file being shipped between the production database and the reporting databases.
There are performance characteristics of that restore process which we don’t fully understand – sometimes a log restore takes considerably longer than its side would suggest. Sometimes a log-shipped change triggers such an occurrence.

The result of those two issues together is that sometimes we intentionally make a seemingly minor change in the database and it has a large, negative impact on that log shipping process: hours of reporting database downtime! Occasionally it’s taken more than a day for the log restore target DBs to catch up.

This affects the users severely, and is therefore a Very Bad Thing.

Therefore, we’ve been looking for a way to assess the impact of such schema and index changes, on the log ship process, before making them in production.

Testing / Measuring

Unfortunately the only reasonable answer appears to be the rather large hammer of recreate a copy of the production system, including log shipping. The databases in question are quite large, so this means two servers, or one server with a bunch of hard drives (to run the log ship source and destination on the same machine).

The process is:

Install a lot of hard drive space, OS, SQL Server
Restore a copy of the production DB, call this “test1”
Set up log shipping to one or two reporting DBs, test2 and test3

Then for each test run:

Run log shipping (so its all up to date)
Run test queries (like adding a field, changing a field type, changing a large index, etc)
Run log shipping again, and observe how big of a file is log shipped between the two DBs.

The main metric we’ll get out of this is for change X (i.e. for a certain SQL DDL change or DML change), we get a log ship of size N megabytes or N gigabytes. I suspect that with this kind of data in hand, we will soon discover the underlying “rules” and understand which changes result in very large logs, so most of the time we can tell which things are going to have a big impact, and schedule them around appropriately.

We can even automate the test process: feed in a piece of SQL to a program runs those steps. It might take hours of (cheap) computer time to run, but little (expensive) human time.

The Plot Thickens

What I’ve described above is the simple case. There’s a more complex case: we’ve observed that the worst delays tend to happen if a log backup occurs in the middle of a certain operation – even operations that are usually harmless, can perhaps result in a huge log ship if the log backup happens mid-operation. These also appear to be the logs that take especially long to restore on the recording on the destination database.

To test this, we can take a piece of DML that we think is going to take several minutes to run, start it, wait 30 seconds or so, and while it’s still running start a log backup. Then wait until the DML completes and start another log backup. We would gain several data points: the size of the mid-operation ship and the final data ship, the how long each takes to restore. I suspect we will learn that it’s a really bad idea to let a log backup start while running any potentially large operation.

To prevent that, perhaps we can automate these operations like so:

Run a snippet of SQL to disable log backups (log shipping)
Wait for any running backup to finish
Run the target SQL
Wait for it to finish
Reenable log backups (log shipping)

That’s as much detail as I have time to post; hopefully this will help someout out there with SQL Server log shipping problems. It would be great to hear from others out there who have experienced similar log shipping issues.