May 24 2011

Cloudy Data Storage, circa 2001

Published under Technology

Around 2000-2001, Oasis Digital built a system for a client which (in retrospect) took a “cloudy” approach to data storage. 2001 is a few years before that approach gained popularity, so it’s interesting to look back and see how our solution stacks up.

The problem domain was the storage of check images for banks; the images came out of a check-imaging device, a very specialized camera/scanner capable of photographing many checks per second, front and back. For example, to scan 1000 checks (a smallish run), it generated 2000 images. All of the images from a run were stored in a single archive file, accompanied by index data. OCR/mag-type data was also stored.

I don’t recall the exact numbers (and probably wouldn’t be able to talk about them anyway), so the numbers here are estimates to convey a sense of the scale of the problem in its larger installations:

  • Many thousands of images per day.
  • Archive files generally between 100 MB and 2 GB
  • Hundreds, then thousands, of these archive files
  • In an era when hard drives were much smaller than they are today

Our client considered various off-the-shelf high-capacity storage systems, but instead worked with us to construct a solution roughly as follows.

Hardware and Networking

  • Multiple servers were purchased and installed, over time.
  • Servers were distributed across sites, connected by a WAN.
  • Multiple hard drives (of capacity C) were installed in each server, without RAID.
  • Each storage drive on each server was made accessible remotely via Windows networking

Software

  • To keep the file count manageable, the files were kept in the many-image archives.
  • A database stored metadata about each image, including what file to find it in.
  • The offset of the image data within its archive file was also stored, so that it could be read directly without processing the whole archive.
  • Each archive file was written to N different drives, all on different servers, and some at different physical sites.
  • To pick where to store a new file, the software could simply look through the list of possibilities and check for sufficient free space.
  • A database kept track of every location where each archive file was stored.
  • An archive file could be read from any of its locations. Client software would connect to the database and learn of all the locations for a file.
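The read path described above can be sketched in a few lines of Python. This is only an illustration, not the code we shipped: the `ImageRecord` type, its field names, and the stored `length` field are all assumptions (the original system stored the offset; a stored length is the simplest way to know how much to read).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ImageRecord:
    """Hypothetical metadata row for one image (all names are illustrative)."""
    archive_locations: List[str]  # every path where the archive file is stored
    offset: int                   # byte offset of the image within the archive
    length: int                   # image size in bytes (assumed to be stored too)


def read_image(record: ImageRecord) -> bytes:
    """Try each known location in turn and seek straight to the image bytes."""
    last_error = None
    for path in record.archive_locations:
        try:
            with open(path, "rb") as archive:
                archive.seek(record.offset)
                data = archive.read(record.length)
            if len(data) == record.length:
                return data
        except OSError as err:
            # This copy is unreachable (server, site, or link down); try the next.
            last_error = err
    raise OSError("no reachable copy of the archive: {}".format(last_error))
```

Because each location is just a path on a network share, falling back to another copy needs nothing more than trying the next path.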

This system was read-mostly, and writes were not urgent. For writes, if N storage drives weren’t available, the operator (of the check-scanning system) would try again later. CAP and other concerns weren’t important for this application.
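As a rough sketch of the placement rule (N copies, all on different servers, spread across sites where possible), something like the following would do. The `Drive` type, its fields, and the two-pass preference for new sites are assumptions made for illustration; the real selection logic may well have differed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Drive:
    """Hypothetical description of one shared storage drive."""
    server: str
    site: str
    share_path: str
    free_bytes: int


def pick_drives(drives: List[Drive], file_size: int, n: int) -> Optional[List[Drive]]:
    """Pick n drives on n different servers, preferring sites not yet used.

    Returns None when fewer than n suitable drives exist, in which case the
    caller (the operator, in the system described here) retries later.
    """
    chosen: List[Drive] = []
    used_servers = set()
    used_sites = set()
    # First pass insists on a new site; second pass only insists on a new server.
    for prefer_new_site in (True, False):
        for drive in drives:
            if len(chosen) == n:
                return chosen
            if drive.free_bytes < file_size or drive.server in used_servers:
                continue
            if prefer_new_site and drive.site in used_sites:
                continue
            chosen.append(drive)
            used_servers.add(drive.server)
            used_sites.add(drive.site)
    return chosen if len(chosen) == n else None
```

Returning `None` rather than a partial list matches the behavior described above: a write either gets all N locations or is put off until later.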

Helpful Properties

  • Even if some servers, sites, or links were down, files remained generally accessible.
  • Offline media storage could be added, though I don’t recall if we got very far down that path.
  • The system was very insensitive to details like OSs, OS versions, etc. New storage servers and drives could be added with newer OS versions and bigger drive sizes, without upgrading old storage.
  • Drives could be made read-only once full, to avoid whole classes of possible corruption.
  • By increasing the number of servers, and number of hard drives over time, this basic design could scale quite far (for the era, anyway).

This approach delivered for our client a lot of the benefits of an expensive scalable storage system, at a fraction of the cost and using only commodity equipment.

Why do I describe this as cloud-like? Because from things I’ve read, this is similar (but much less sophisticated, of course) to the approach taken inside of Amazon S3 and other cloud data storage systems/services.

Key Lesson

Assume you are willing to pay to store each piece of data on N disks. You get much better overall uptime (given the right software) if those N disks are in N different machines spread across sites, than you do by putting those N disks in a RAID on the same machine. Likewise, you can read a file much faster from an old slow hard drive in the same building than you can from a RAID-6 SAN across a 2000-era WAN. The tradeoff is software complexity.
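A back-of-the-envelope calculation shows why. The sketch below assumes each machine is reachable independently with the same probability, which is a simplification (real failures are often correlated), and the 99% figure is made up for illustration.

```python
def availability_of_any(p_up: float, n: int) -> float:
    """Chance that at least one of n independent copies is reachable."""
    return 1.0 - (1.0 - p_up) ** n


# Illustrative numbers only: each machine (or its site link) is up 99% of the time.
single_machine_raid = availability_of_any(0.99, 1)   # N disks in one box: still one machine
three_machines = availability_of_any(0.99, 3)        # N = 3 copies on 3 machines

print("one machine with RAID: {:.4%}".format(single_machine_raid))
print("three separate machines: {:.4%}".format(three_machines))
```

RAID protects against a disk failing, but the data is still only as reachable as the one machine holding it; spreading the same N disks across machines takes the machine, its network link, and its site out of the single-point-of-failure list.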

 


May 23 2011

Upcoming Talk: Lua on iPhone and Android (using Corona)

Published under Technology

This Thursday (May 26, 2011), I will give a talk at the St. Louis Mobile Dev group on cross-mobile-platform development with Lua. There are various ways to do this (including rolling your own), but for simplicity I’m using Ansca’s Corona product.

As usual, I’ll zoom through some slides, and concentrate instead on the code. For some background on Lua, you may want to watch the video of my 20-minute Lua talk from last year’s Strange Loop.

Update: slides are available here.

 


May 16 2011

Coming this fall: Strange Loop 2011

Published under Technology

Coming this fall, Alex Miller is putting on the third year of his Strange Loop conference, Strange Loop 2011. It’s not in “The Loop” this time, because The Loop isn’t big enough to hold it!

I heartily recommend Strange Loop for any software developer interested in learning more about a wide variety of technical topics. Unlike many other events, this one stays close to the technology all the way through – you might see a higher ratio of code to text on the slides here than at any other conference.

(Again this year, my firm Oasis Digital is a sponsor, and I’ll probably submit a talk. I hesitate a bit though, because if I give a talk, I have to miss someone else’s talk in that timeslot.)

 


May 03 2011

Ancient History: JBuilder Open Tools

Published under Technology

Some years ago, the Java IDE marketplace looked quite different than it does today. VisualAge was very popular. Borland’s JBuilder was another top contender. Since then, many of the good ideas from VisualAge ended up in Eclipse, while the JBuilder of that era was replaced by a newer, Eclipse-based JBuilder. Not everything ended up on Eclipse, though: NetBeans matured to a slick IDE (with its own plugin ecosystem), as did IDEA.

But this post isn’t about today, it’s about a leftover bit of history. Back in that era, I had a section of this web site dedicated to the numerous JBuilder “Open Tools” (plugins) then available. That content is long obsolete and I removed it years ago. Remarkably, this site still gets hits every day from people (or perhaps bots) looking for it.

I agree strongly that Cool URIs don’t change, but that’s OK, because my old JBuilder Open Tools content just wasn’t very cool anyway.

On the off chance you landed on this page looking for it, here is a Google link for your convenience, or you can take a look at web.archive.org’s snapshot of my old list.

 


Feb 19 2011

Comparing OPML Files, or How to Leave NetNewsWire

Published under Technology

Recently I reached a level of excessive frustration with NetNewsWire (Mac) and decided it was time to move on. Problems with NetNewsWire include:

  1. NetNewsWire has no way to sync its subscription list to match your Google Reader subscription list. There is a Merge button in the Preferences that sounds like it should do this, but it does not work correctly. Once your lists get out of sync, they generally stay that way.
  2. NetNewsWire won’t prefetch images referenced in feeds. Without this, it is not useful for the most obvious purpose of a desktop reader: reading without a network connection. That’s a reasonable thing to leave out in early development, but in a mature product? What could they have been thinking?
  3. NetNewsWire fails (silently) to subscribe to Google Alerts feeds, apparently because Google Reader already knows about those feeds… but see #1.
  4. As many other users have reported, NetNewsWire frequently shows a different number of unread items from Google Reader, and no amount of Refreshing makes it match. The sync doesn’t quite work.

But to get rid of NetNewsWire, I needed to verify that I had all my feeds in Google Reader. This was easy:

  1. Export OPML feed list from NNW
  2. Export OPML feed list from Reader
  3. Use a bit of perl regex and diff (below) to extract and compare just the list of feed URLs
  4. Look over the diff, and copy-paste-subscribe the missing ones in Reader

The commands are:

perl -ne '/xmlUrl="([^"]*)"/ && print "$1\n"' <google-reader-subscriptions.xml  | sort >gr.urls
perl -ne '/xmlUrl="([^"]*)"/ && print "$1\n"' <nn.opml  | sort >nn.urls
diff gr.urls nn.urls

… which took much less time and far fewer keypresses than writing this post.
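For anyone who would rather not trust a regex with XML, the same comparison can be sketched in Python using the standard library's XML parser. The function names are made up for this example, and the file names in the usage note are the ones from the commands above.

```python
import xml.etree.ElementTree as ET


def feed_urls(source) -> set:
    """Collect every xmlUrl attribute from an OPML file (path or binary file object)."""
    root = ET.parse(source).getroot()
    return {
        outline.get("xmlUrl")
        for outline in root.iter("outline")
        if outline.get("xmlUrl")
    }


def missing_from(reference: set, other: set) -> list:
    """Feed URLs present in `other` but absent from `reference`, sorted."""
    return sorted(other - reference)
```

Calling `missing_from(feed_urls("google-reader-subscriptions.xml"), feed_urls("nn.opml"))` would list the feeds that still need to be subscribed in Reader. Unlike the one-liner, this also copes with attributes in single quotes and with several outlines on one line.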

Offline reading is still very useful; at the moment I’m trying a combination of Google Reader, Gruml, and Reeder (iPad). Those work very well – so well that the risk of time-wasting feeds must be managed aggressively: drop all but the most important, and don’t look every day.


Feb 19 2011

Fix timestamps after a mass file transfer

Published under Technology

I recently transferred a few thousand files, totalling gigabytes, from one computer to another over a slowish internet connection. At the end of the transfer, I realized the process I used had lost all the original file timestamps. Instead, all the files on the destination machine had a create/modify date of when the transfer occurred. In this particular case I had uploaded files to Amazon S3 from one end, then downloaded them from the other, but there are numerous other ways to transfer files that lose the timestamps; for example, many FTP clients do so by default.

This file transfer took many hours, so I wasn’t inclined to delete and try again with a better (timestamp-preserving) transfer process. Rather, it shouldn’t be very hard to fix them in-place.

Both machines were Windows servers; neither had a broad set of Unix tools installed. If I had those present, the most obvious solution would be a simple rsync command, which would fix the timestamps without retransferring the data. But without those tools present, and with an unrelated desire to keep these machines as “clean” as possible, plus a firewall obstacle to SSH, I looked elsewhere for a fix.

I did, however, happen to have a partial set of Unix tools (in the form of the MSYS tools that come with MSYSGIT) on the source machine. After a few minutes of puzzling, I came up with this approach:

  1. Run a command on the source machine
  2. … which looks up the timestamp of each file
  3. … and stores those in the form of a batch file
  4. Then copy this batch file to the destination machine and run it.

Here is the source machine command, executed at the top of the file tree to be fixed:

find . -print0 | xargs -0 stat -t "%d-%m-%Y %T"
 -f 'nircmd.exe setfilefoldertime "%N" "%Sc" "%Sm"'
 | tr '/' '\\' >~/fix_dates.bat

I have broken it up into several lines here, but it’s intended as one long command.

  • “find” gets the names of every file and directory in the file tree
  • xargs feeds these to the stat command
  • stat gets the create and modify dates of each file/directory, and formats the results in a very configurable way
  • tr converts the Unix-style “/” paths to Windows-style “\” paths.
  • The results are redirected to (stored in) a batch file.

As far as I can tell, the traditional set of Windows built-in command line tools does not include a way to set a file or directory’s timestamps. I haven’t spent much time with PowerShell yet, so I used the (very helpful) NIRCMD command line utilities, specifically the setfilefoldertime subcommand. The batch file generated by the above process is simply a very long list of lines like this:

nircmd.exe setfilefoldertime "path\filename" "19-01-2000 04:50:26" "19-01-2000 04:50:26"

I copied this batch file to the destination machine and executed it; it corrected the timestamps, and the problem was solved.
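If the source machine has Python but no Unix tools, the batch file could be generated along these lines instead. This is a sketch, not what I ran: the function names are invented, and it assumes `st_ctime` reports the creation time, which is true on Windows but not on Unix (where it is the inode change time).

```python
import os
import time

DATE_FORMAT = "%d-%m-%Y %H:%M:%S"  # same layout as the stat -t format above


def nircmd_line(path: str, created: float, modified: float) -> str:
    """Format one batch-file line for nircmd's setfilefoldertime subcommand."""
    win_path = path.replace("/", "\\")  # Unix-style separators to Windows-style
    return 'nircmd.exe setfilefoldertime "{}" "{}" "{}"'.format(
        win_path,
        time.strftime(DATE_FORMAT, time.localtime(created)),
        time.strftime(DATE_FORMAT, time.localtime(modified)),
    )


def batch_lines(root: str):
    """Yield one nircmd line per file and directory below root."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Assumption: st_ctime is the creation time (Windows behavior).
            yield nircmd_line(path, info.st_ctime, info.st_mtime)
```

Writing the result out is then a matter of running it from the top of the tree, for example `open("fix_dates.bat", "w").write("\n".join(batch_lines(".")))`. File names containing `%` would still need escaping for the batch interpreter, which this sketch does not attempt.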
