At work we have a set of database machines set up so that one is the primary machine, making backups once per day, and several other machines restore this backup once per day, for development, ad hoc reporting, and other secondary purposes. We started out with an obvious approach:
- back up on server1, to a file on server1
- SCP or rsync the file from server1 to server2
- restore the DB on server2
… but over time, as the data has grown, the inefficiency of such an approach has become equally obvious: the backup data goes back and forth across the network and to/from disk repeatedly. These steps count only the backup data, not the live storage in the DBMS:
- On to the disk on server1 (putting additional load on the primary DB machine)
- Off the disk on server1 (putting additional load on the primary DB machine)
- On to the disk on server2
- Off the disk on server2
This is also wasteful from a failure-recovery point of view, since the place we are least likely to need the backup is on the machine whose failure would lead us to need a backup.
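For comparison, the original store-and-forward approach amounted to something like the following (the file paths here are just illustrative):
pg_dump -Fc dbnameonserver1 >/backups/dbname.dump
scp /backups/dbname.dump server2:/backups/dbname.dump
ssh server2 pg_restore -Fc -v -O -x -d dbnameonserver2 /backups/dbname.dump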
Pipe it over the network instead
The project at hand uses PostgreSQL on Linux, so I’ll show example PG commands here. The principles apply equally well to other DBs and platforms of course, though some DBMSs or platforms might not offer backup and restore commands that stream data. (I’m looking at you, MS SQL Server!)
What we need is a pipe that goes over the network. One way to get such a pipe is with ssh (or rsh); run something like this from server1:
pg_dump -Fc dbnameonserver1 | ssh server2 pg_restore -Fc -v -O -x -d dbnameonserver2
This variation will simultaneously store the backup in a file on server1:
pg_dump -Fc dbnameonserver1 | tee dbname.dump | ssh server2 pg_restore -Fc -v -O -x -d dbnameonserver2
This variation (or something close to it; I last ran this several days ago) will store the backup in a file on server2 instead:
pg_dump -Fc dbnameonserver1 | ssh server2 "tee dbname.dump | pg_restore -Fc -v -O -x -d dbnameonserver2"
To reduce the CPU load from this, adjust SSH to use less CPU-intensive encryption, or avoid that entirely with rsh (but only if you have a trusted / local network).
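For example, ssh's -c option selects the cipher; something like this picks a cheaper one (arcfour is just one possibility from the OpenSSH builds of that era; check which ciphers your version actually offers):
pg_dump -Fc dbnameonserver1 | ssh -c arcfour server2 pg_restore -Fc -v -O -x -d dbnameonserver2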
Multicast / Broadcast it over the network instead
The above commands are good for point-to-point streaming backup / restore, but the scenario I have in mind has one primary machine and several (3, 4, or more) secondary machines. One answer is to run the above process repeatedly, once for each secondary machine, but that sends the whole backup over the network N times. Inefficiency! (== Blasphemy?)
To avoid that, simply use UDPCast. It’s a trivial install on Debian / Ubuntu:
apt-get install udpcast
(Be warned though: there is at least one annoying bug in the old (2004) UDPCast offered off-the-shelf in Debian / Ubuntu as of 2008. You might need to build the latest UDPCast from the source offered on its web site.)
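Building it from source is something like the following; this assumes the usual configure / make steps and uses a placeholder tarball name, so check the README in whatever release you download:
tar xzf udpcast-VERSION.tar.gz
cd udpcast-VERSION
./configure && make && make install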
Run this on server1:
pg_dump -Fc dbnameonserver1 | udp-sender --min-wait 5 --nokbd
Run this on server2 .. serverN:
udp-receiver --nokbd | pg_restore -Fc -v -O -x -d dbnameonserverN
With this approach, the backup data will be multicast (or broadcast, if multicast does not work and all the machines are on the same segment), traversing the network only once no matter how many receiving machines are set up. udp-receiver has a --pipe option, but I found that I occasionally got corruption with huge (50 GB+) transfers when using --file or --pipe. So I recommend this instead, to save a copy on the receiving end:
udp-receiver --nokbd | tee mydatabase.dump | pg_restore -Fc -v -O -x -d dbnameonserverN
Or perhaps you just want to receive the backup and store it on a file server, with this:
udp-receiver --nokbd >mydatabase.dump
To make all this happen automatically, set the sender to start at the same time as the receivers, via cron on the relevant machines. Use NTP to keep their clocks in sync, and adjust the udp-sender and udp-receiver options as needed (--min-wait t, --max-wait t) so the whole process starts smoothly in spite of minor timing variations.
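Here is a minimal crontab sketch of that scheduling; the times, wait values, and database names are placeholders to adjust for your environment, and the commands need non-interactive database authentication (e.g. a .pgpass file) since they run from cron:
# on server1 (sender), starting at 02:00
0 2 * * * pg_dump -Fc dbnameonserver1 | udp-sender --min-wait 5 --max-wait 60 --nokbd
# on server2 .. serverN (receivers), started a minute earlier so they are already waiting
59 1 * * * udp-receiver --nokbd | pg_restore -Fc -v -O -x -d dbnameonserverN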
As with the previous suggestion for rsh, the data will travel unencrypted over your network, so do this only if you trust your network (such as a LAN segment between your database servers).
Multicast / broadcast is a very useful technology, and with UDPCast it is quite easy to use. UDPCast also implements a checksum/retransmit mechanism; it is not a "bare", loss-prone UDP transmission.
An update: I tested several approaches, looking for the best performance when shipping a Postgres backup from one machine to several others. The best I found was to use pg_dump's -Z0 option (no PG compression of the backup) and install lzop (apt-get install lzop), a very fast compression tool. The sending command looks like this:
pg_dump -Fc -Z0 dbnameonserver1 | udp-sender --min-wait 5 --nokbd --pipe lzop
and on the receiving end:
udp-receiver --nokbd | lzop -d | pg_restore -Fc -v -O -x -d dbnameonserverN
lzop doesn't compress quite as well as gzip, but it uses much less CPU. It's probably not worth compressing at all over gigabit Ethernet, but it is worthwhile over 100 Mbit Ethernet between fast servers.