Kyle Cordes – Page 25

Move files into an existing directory structure, on Linux

I recently needed to move a large number of files (millions) in a deep directory structure, into another similar directory structure, “merging” the contents of each directory and creating any missing directories. This task is easily (though slowly) performed on Windows with Control-C Control-V in Explorer, but I could find no obvious way to do it on Linux.

There is quite a bit of discussion about this on the web, including:

Suggestions to do with with tar; this is a poor idea because it copies all the file data, taking an enormously long time.
Suggestions to do it with “mv -r”… but as far as I can tell, mv does not have a -r option.

After a little thought I came up with the script below. I’d love to have a Linux/Bash guru out there point out how awful it and and send me something better!

A critical feature for me is that it does not overwrite files; if a source file name/path overlaps a destinate file, the source file is left alone, untouched. This can be changed easily to overwrite instead: remove the [[-f]] test.

$ cat ~/bin/move_files_merge.sh
#!/bin/bash

# Move files from dir $1 to dir $2,
# merging in to existing dirs
# Call it like so:
# move_files_merge.sh /FROM/directory /TO/directory

# Lousy error handling:
# Exit if called with missing params.
[ "A" == "A${1}" ] && exit 1
[ "A" == "A${2}" ] && exit 1

echo finding all the source directories
cd $1
find . -type d -not -empty | sort | uniq >$2/dirlist.txt

echo making all the destination directories
cd $2
wc -l dirlist.txt
xargs --no-run-if-empty -a dirlist.txt mkdir -p
rm dirlist.txt

echo Moving the files
cd $1
# There is surely a better way to do this:
find . -type f -printf "[[ -f '$2/%p' ]] || mv '%p' '$2/%h'\n" | bash

echo removing empty source dirs
find $1 -depth -type d -empty -delete

echo done.

Large, economical RAID: 10 1 TB drives

I recently needed a file server with ample capacity, on which to store data backups. The typical data access pattern for backups is that data is written once and read rarely, so bulk backup storage has much lower performance requirements than, for example, disks used for database files.

I need to store a great number of files, and I had an old server to recycle, so I ended up with:

4U ASUS case with room for many internal drives
Qty1, leftover 320GB drive (to boot from)
Qty 10, 1 TB drives for data storage: WD Caviar Green
An extra SATA controller (plus the 8 SATA ports on the motherboard)
Ubuntu Linux 8.04.1 LTS
Software RAID6

The separate boot drive is for simplicity; it contains a trivial, vanilla Ubuntu install; if availability mattered more I could replace it with a RAID1 pair, or flash storage – even a cheap USB “key drive” would be sufficient, if I went to the trouble of setting up /var and /tmp to not write to it (thus avoid premature wearout).

The terabyte drives have one large RAID container partition each (quick work with sfdisk). The 10 of them in a RAID6 yield 8 drives worth of capacity. Adjusting also for the difference between marketing TB/GB and the real thing, plus a bit of filesystem overhead, I ended up with 7.3 TB of available storage. Here it is, with some data already loaded:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sde1             285G  1.3G  269G   1% /
varrun                2.0G  164K  2.0G   1% /var/run
varlock               2.0G     0  2.0G   0% /var/lock
udev                  2.0G  112K  2.0G   1% /dev
devshm                2.0G     0  2.0G   0% /dev/shm
/dev/md0              7.3T  3.1T  4.3T  42% /raid

I went with software RAID for simplicity, low cost, and easy management:

# cat /proc/mdstat
[....]
md0 : active raid6 sda1[0] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sdd1[3] sdc1[2] sdb1[1]
      7814079488 blocks level 6, 64k chunk, algorithm 2 [10/10] [UUUUUUUUUU]

I chose RAID6 over RAID5 because:

This array is so large, and a rebuild takes so long, that the risk of a second drive failing before a first failure is replaced and rebuilt seems high.
8 drives of capacity for 10 drives is a decent value. 5/10 (with RAID10) is not.

It turns out that certain default settings in Linux software RAID can yield instability under RAM pressure in very large arrays; after some online research I made the adjustments below and it appears solid. The sync_speed_max setting throttles back the RAID rebuild, helpful because I was able to start populating the storage during the very long rebuild process.

vi /etc/rc.local
echo 30000 >/proc/sys/vm/min_free_kbytes
echo 8192 >/sys/block/md0/md/stripe_cache_size
echo 10000 >/sys/block/md0/md/sync_speed_max

vi /etc/sysctl.conf
vm.vfs_cache_pressure=200
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100

7.3T is far beyond the size limit for ext2/3, so I went with XFS. XFS appears to deal well with the large size without any particular tuning, but increasing the read-ahead helps with my particular access pattern (mostly sequential), also in rc.local:

blockdev --setra 8192 /dev/sda
blockdev --setra 8192 /dev/sdb
blockdev --setra 8192 /dev/sdc
blockdev --setra 8192 /dev/sdd
blockdev --setra 8192 /dev/sde
blockdev --setra 8192 /dev/sdf
blockdev --setra 8192 /dev/sdg
blockdev --setra 8192 /dev/sdh
blockdev --setra 8192 /dev/sdi
blockdev --setra 8192 /dev/sdj
blockdev --setra 8192 /dev/sdk

I was happy to find that XFS has a good “defrag” capability; simply install the xfsdump toolset (apt-get install xfsdump) then schedule xfs_fsr to run daily in cron.

Power consumption seems reasonable at 3.0 amps under load.

Marvell: Not Marvellous

In this machine I happen to have an Intel D975XBX2 motherboard (in retrospect an awful choice, but already installed) which includes a Marvell 88SE61xx SATA controller. This controller does not get along well with Linux. Again with some online research, the fix is just a few commands:

vi /etc/modprobe.d/blacklist
# add at the end:
blacklist pata_marvell

vi /etc/initramfs-tools/modules
# add at the end:
sata_mv

# then regen the initrd:
update-initramfs -u

This works. But if I had it to do over, I’d rip out and throw away that motherboard, and replace it with any of the 99% of other motherboards that work well with Linux out of the box, or disable the Marvell controller and add another extra SATA controller on a card.

Is this as big as it gets?

Not by a long shot; this is a secondary, backup storage machine, worth a blog post because of the technical details. It has barely over $1000 worth of hard drives, and a total cost of under $3000 (even less for me, since I reused some old hardware). You can readily order off-the-shelf machines with much more storage (12, 16, 24, even 48 drives). The pricing per-byte is appealing up to the 16 or 24-drive level, then escalates to the stratosphere.

Factor

Over the holiday I looked at the Factor programming language, and was very impressed. It has a Lisp-like metacircular quality, and a remarkably wide set of features/libraries in spite of a very small development team and community. Unlike many other small language projects, Factor is fast, rich, and can produce shippable binaries. Its team cares about robustness, and operates a build farm for multiple platforms. If you can spare a few hours, first watch Slava’s Google Tech Talk, then download Factor and work through some tutorials.

Will Factor become popular? With a FORTH-like syntax, I suspect the answer is a firm No, the syntax is too foreign (even compared to Ruby, for example) for mainstream devleopers. But I find it fascinating nonetheless, and I will keep my eyes open for an opporunity to use it on a small, real project.

It would also make a great topic for the Lambda Lounge, a new St. Louis area user group about which I am quite excited.

Analyzing PostgreSQL logs with pgFouine (on Ubuntu 8.04)

pgFouine is a slick, useful, and free tool for analyzing PostgreSQL query workloads. It works without any impact on the running PostgreSQL: it analyzes the PG log output. The caveat is that it needs PG configured to write the right kind of log output.

Sadly, as of version 8.3 PG has a wrinkle in how it writes its logs: multi-line queries can get jumbled together in the stderr-based log, resulting in erroneous output from pgFouine. Hopefully a future PG will be able to write its logs without this issue, but in the meantime, the answer it to use syslog logging instead of native PG logging. This isn’t a bad idea anyway, since syslogd and friends are well proven.

On our project where this need arose, we use the Ubuntu Linux distribution, currently version 8.04. Ubuntu’s PG package sets up native stderr logging; here are the steps needed to change that to syslog logging instead. These steps are about the same for other distributions (or for manual compiles), but with different paths.

The setting shown here for log_min_duration_statement will log all queries that take more than 4 seconds to complete. Depending on your server, workload, and type of workload (OLTP vs. OLAP), this might be too high or too low.

Edit your postgresql.conf file:

vi /etc/postgresql/8.3/main/postgresql.conf

log_destination = 'syslog'
log_line_prefix = 'user=%u,db=%d '
log_min_duration_statement = 4000
silent_mode = on
logging_collector = off

With PostgreSQL 8.2, set redirect_stderr instead of logging_collector:

redirect_stderr = off

Next, setup where syslog will store the data, and add “local0,none” to the ;-separated list of what goes in to var/log/message. On my system it ended up looking like this, but of course it may vary depending on what else you’ve set up in syslog:

vi /etc/syslog.conf

# add this:

local0.*        -/var/log/postgresql/main.log

# edit this:

    *.=info;*.=notice;*.=warn;\
    auth,authpriv.none;\
    local0.none;\
    cron,daemon.none;\
    mail,news.none          -/var/log/messages

Restart syslogd to make the change take effect:

/etc/init.d/sysklogd restart

Then restart PG so it starts logging there:

/etc/init.d/postgresql-8.3 restart

Note that we are putting these new logs in the existing /var/log/postgresql directory which the Ubuntu PG package creates; if you install PG manually, create such a directory yourself, or set up syslog to write to the pg_log directory. The existing logs there will remain, holding only the messages from PG startup and shutdown (via /etc/init.d/postgresql). I find this unhelpful but harmless.

Log Rotation

By putting the files in this preexisting location, we take advantage of the log rotation already set up in /etc/logrotate.d/postgresql-common. On a busy server, you may want to adjust the rotation setting therein from weekly to daily, or add a line with “size 1000k” or so. Take a look at “man logrotate” to learn about many useful options, such as the ability to have these logs emailed to your DBA as they rotate.

pgFouine

Finally, you are ready to analyze logs. If you plan to analyze them on the same machine where you run your database (probably not a great idea), proceed (on Ubuntu) to get the PHP command line executable:

apt-get install php5-cli

Then download the pgFouine tarball, quietly curse the lack of an Ubuntu package, put it in your $PATH, and run it. Don’t be alarmed by its .php file extension; PHP is a usable (though not particularly charming) language for writing command line tools, as well as dynamic web pages.

cd /var/log/postgresql

pgfouine.php -file main.log  >somefile.html

View the HTML file in your web browser, and dig in to your worse queries. Good luck.

Multicast your DB backups with UDPCast

At work we have a set of database machines set up so that one is the primary machine, making backups once per day, and several other machines restore this backup once per day, for development, ad hoc reporting, and other secondary purposes. We started out with an obvious approach:

back up on server1, to a file on server1
SCP or rsync the file from server1 to server2
restore the DB on server2

… but over time as the data has grown the inefficiency of such an approach become equally obvious: the backup data goes back and forth across the network and to/from disk repeatedly. These steps only count the backup data, not the live storage in the DBMS:

On to the disk on server1 (putting additional load on the primary DB machine)
Off the disk on server1 (putting additional load on the primary DB machine)
On to the disk on server2
Off the disk on server2

This is also wasteful from a failure-recovery point of view, since the place we are least likely to need the backup is on the machine whose failure would lead us to need a backup.

Pipe it over the network instead

The project at hand uses PostgreSQL on Linux, so I’ll show example PG commands here. The principles apply equally well to other DBs and platforms of course, though some DBMSs or platforms might not offer backup and restore commands that stream data. (I’m looking at you, MS SQL Server!)

What we need is a pipe that goes over the network. One way to get such a pipe is with ssh (or rsh), something like so, run from server1:

pg_dump -Fc dbnameonserver1 | ssh server2 pg_restore -Fc -v -O -x -d dbnameonserver2

This variation will simultaneously store the backup in a file on server1:

pg_dump -Fc dbnameonserver1 | tee dbname.dump | ssh server2 pg_restore -Fc -v -O -x -d dbnameonserver2

This variation (or something close, I last run this several days ago) will store the backup in a file on server2 instead:

pg_dump -Fc dbnameonserver1 | ssh server2 "tee dbname.dump | pg_restore -Fc -v -O -x -d dbnameonserver2"

To reduce the CPU load from this, adjust SSH to use less CPU-intensive encryption, or avoid that entirely with rsh (but only if you have a trusted / local network).

Multicast / Broadcast it over the network instead

The above commands are good for point-to-point streaming backup / restore, but the scenario I have in mind has one primary machine and several (3, 4, or more) secondary machines. One answer is to run the above process repeatly, once for each secondary machine, but that sends the whole backup over the network N times. Inefficiency! (==Blashphemy?)

To avoid that, simply use UDPCast. It’s a trivial install on Debian / Ubuntu:

apt-get install udpcast

(Be warned though: there is at least one annoying bug in the old (2004) UDPCast offered off-the-shelf in Debian / Ubuntu as of 2008. You might need to the latest UDPCast source from its web site above.)

Run this on the server1:

pg_dump -Fc dbnameonserver1 | udp-sender --min-wait 5 --nokbd

Run this on the server2 .. serverN:

udp-receiver --nokbd | pg_restore -Fc -v -O -x -d dbnameonserverN

With this approach, the backup data will be multicast (or broadcast, if multicast does not work and if all the machines are on the same segment), only traversing the network once no matter how many receiving machines are set up. udp-receiver has a –pipe option, but I found that I occasionally get corruption with huge (50GB+) transfer, when using –file or –pipe. So I recommend this, to save a copy on the receiving end:

udp-receiver --nokbd | tee mydatabase.dump | pg_restore -Fc -v -O -x -d dbnameonserverN

Or perhaps you want to just receive and store the backup on a file server, with this:

udp-receiver --nokbd >mydatabase.dump

To make all this happen automatically, you’ll set the sender to start at the same time as the receivers in “cron” on the relevant machines. Use NTP to keep their clocks in sync, and adjust the udp-sender and udp-receiver options as needed to get the whole process to start smoothly in spite of minor timing variations (–min-wait t, –max-wait t).

As with the previous suggestion for rsh, the data will travel unencrypted over your network, so do this only if you trust your network (such as a LAN segment between your database servers).

Multicast / broadcast is very useful technology, and with UDPcast it is quite easy to use. UDPcast also implements a checksum/retransmit mechanism, it is not a “bare”, loss-prone UDP transmission.

Rhino + JavaScript + Swing, Look Ma No Java

A while back I was discussing the future of programming languages with a colleague, and we agreed that for all its foibles, JavaScript will continue to enjoy very wide and increasing use in the coming years. I wrote last year about Steve Yegge’s hints that JavaScript is the “next big languages”, see that post for the reasoning.

Based on all that, I set about writing a small test app to see what it’s like to program a Swing app with JS. After a day or so of work (spread over a few months), I offer my Rhino Swing Test App:

Run RSTA now via Java Web Start

Get the RSTA code (git, on GitHub)

It implements the same “flying boxes” animation demo that I presented a few years ago at the St. Louis JUG, but aside from a generic launcher class, the GUI is implemented entirely in JavaScript. To clarify, this is not web browser JavaScript; it is running in Rhino, in the JVM, using Swing classes.

The documentation for interaction between Java and JS is limited, but sufficient. For simplicity, I used Rhino as an interpreter, I did not compile to java bytecode. Nonetheless, the animation runs about as smoothly in JS as it does in Java, because the heavy lifting is done by the JDK classes.
I used Eclipse (with JavaScript support) to write this code, but of course JS makes much less code completion possible than Java, and I missed that. Typically I mildly prefer IDEs, but am also productive with a text editor. For working with a large API like Swing though, IDE support helps greatly.

Still, I recommend a look at this approach if you are fond of dynamic languages but need to build on the Java platform, and I intend to investigate server-side JS development also.