Data/API Servers/Services for Single Page Web Applications

A Client needs a Server, or at least a Service

Over the last few years the team at Oasis Digital has created various complex “single page” web applications, using AngularJS, KnockoutJS, and other tools. These applications’ assets and code are statically hosted (cheaply on a CDN if needed), but of course each application needs a backend data service (possibly composed of smaller or even “micro” services internally) to provide data and carry the results of user operations to permanent storage.

Context and Background

Our work context is typically data- and rule-centric line-of-business applications, hosted in a company data center, on virtual/cloud or dedicated hardware, or occasionally on a (more cloudy) PaaS like Heroku; so the advice here is for that context. “Backend as a Service” solutions are appealing, but I don’t have any experience with those.

The systems we build most often store data in a traditional RDBMS like PostgreSQL or MS SQL Server, but data service needs are similar with a NoSQL or other non-RDBMS data store. (Among the many topics I should write about: making effective, justified use of polyglot persistence with CQRS and related ideas.)

We have also worked extensively with multi-tier desktop applications, which have essentially the same data service needs, albeit with different data serialization formats. As a result, we have worked on and thought about data services extensively.

Building a Data API Service

For convenient, rapid, efficient development of robust data/API services, your tool set should have as many as possible of the following:

  1. A server-side programming language / platform, with its runtime environment, native or VM.
  2. A way of routing requests (hopefully with a RESTful pattern matching approach).
  3. Automatic unmarshaling of incoming data into data structures native to the programming language. If you find yourself writing code that takes apart JSON “by hand”, run away.
  4. Similarly, automatic marshaling of ordinary data structures into JSON. If you see code which uses string concatenation to build JSON, it should either be to meet some specific need for extra marshaling performance of a huge data structure, or shouldn’t be there at all. (A minimal sketch of items 2-4 and 8 follows this list.)
  5. Analogous support for other data formats. Successful systems live a long time. If you create one, someone will want to talk to it in a situation where JSON isn’t a good fit. A few years from now, you might think of JSON the way we think of XML now. Therefore, stay away from tools which are too deeply married to JSON or any other single data format.
  6. If you’re using a relational database, your data server toolkit should make it quite easy to talk to that database. Depending on the complexity of the relationship between your data services and the underlying data store, either an object relational mapper (ORM) or perhaps a table/query mapper is suitable. If you find yourself working with the low-level database API to pluck fields out, you are looking at a long investment with little payoff.
  7. Good support for a wide variety of database types (relational and otherwise). This reduces the risks from future database support requirements.
  8. A reasonable error handling system. Things will go wrong. When they do, an appropriate response should flow back to the client code, while a fully detailed explanation should land in a log or somewhere else suitable – ideally without re-inventing this on each project or for every API entry point.
  9. Depending on application needs, some way of maintaining a persistent connection (SSE, websocket, or fallback) to stream back changing information.
  10. A declarative way to specify security roles needed for subsets of your API (RESTful or otherwise).
  11. Monitoring / metrics.
  12. Scalability.
  13. Efficiency, so you are less likely to need to scale, and so that if you must scale, the cost isn’t awful.
  14. Rapid development supported by good tooling. Edit-compile-run cycles of a few seconds.
  15. A pony.
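
As a concrete illustration of a few of these items – routing, automatic JSON unmarshaling/marshaling, and a centralized error handler (items 2-4 and 8) – here is a minimal sketch using Express with TypeScript. This is just one of many suitable toolsets, not a recommendation; the /widgets resource and its tiny in-memory data layer are hypothetical placeholders.

    // A minimal sketch of items 2-4 and 8 using Express with TypeScript; one of
    // many suitable toolsets. The /widgets resource and the in-memory data layer
    // are hypothetical placeholders.
    import express, { NextFunction, Request, Response } from "express";

    interface Widget {
      id: number;
      name: string;
    }

    // Hypothetical in-memory data layer; in a real system this would sit on an
    // ORM or table/query mapper (item 6).
    const widgets = new Map<number, Widget>();
    let nextId = 1;

    const app = express();
    app.use(express.json()); // item 3: incoming JSON arrives as plain objects

    // Item 2: RESTful pattern-matched routes.
    app.get("/widgets/:id", (req: Request, res: Response) => {
      const widget = widgets.get(Number(req.params.id));
      if (!widget) {
        res.status(404).json({ error: "not found" });
        return;
      }
      res.json(widget); // item 4: ordinary data structures marshaled to JSON
    });

    app.post("/widgets", (req: Request, res: Response, next: NextFunction) => {
      try {
        const widget: Widget = { id: nextId++, name: String(req.body.name) };
        widgets.set(widget.id, widget);
        res.status(201).json(widget);
      } catch (err) {
        next(err); // hand any failure to the central error handler below
      }
    });

    // Item 8: one error handler; the client gets a terse response, the log gets
    // the detail.
    app.use((err: Error, _req: Request, res: Response, _next: NextFunction) => {
      console.error(err.stack);
      res.status(500).json({ error: "internal error" });
    });

    app.listen(3000);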

Is That All?

This is quite a checklist, but a toolset lacking these features means that a successful, growing project will probably need to reinvent many of them – shop carefully. In later posts, I’ll write more about particular technology stacks.


Cloudy Data Storage, circa 2001

Around 2000-2001, Oasis Digital built a system for a client which (in retrospect) took a “cloudy” approach to data storage. That was a few years before such approaches gained popularity, so it’s interesting to look back and see how our solution stacks up.

The problem domain was the storage of check images for banks; the images came out of a check-imaging device, a very specialized camera/scanner capable of photographing many checks per second, front and back. For example, to scan 1000 checks (a smallish run), it generated 2000 images. All of the images from a run were stored in a single archive file, accompanied by index data. OCR/mag-type data was also stored.

I don’t recall the exact numbers (and probably wouldn’t be able to talk about them anyway), so the numbers here are estimates to convey a sense of the scale of the problem in its larger installations:

  • Many thousands of images per day.
  • Archive files generally between 100 MB and 2 GB.
  • Hundreds, then thousands, of these archive files.
  • In an era when hard drives were much smaller than they are today.

Our client considered various off-the-shelf high-capacity storage systems, but instead worked with us to construct a solution roughly as follows.

Hardware and Networking

  • Multiple servers were purchased and installed, over time.
  • Servers were distributed across sites, connected by a WAN.
  • Multiple hard drives (of capacity C) were installed in each server, without RAID.
  • Each storage drive on each server was made accessible remotely via Windows networking.

Software

  • To keep the file count manageable, the images were kept in the many-image archive files rather than stored individually.
  • A database stored metadata about each image, including what file to find it in.
  • The offset of the image data within its archive file was also stored, so that it could be read directly without processing the whole archive.
  • Each archive file was written to N different drives, all on different servers, and some at different physical sites.
  • To pick where to store a new file, the software could simply look through the list of candidate drives and check for sufficient free space (see the sketch after this list).
  • A database kept track of where (all) each archive file was stored.
  • An archive file could be read from any of its locations: client software would query the database for all of a file’s locations, then read from whichever copy was reachable.
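
As a rough illustration of the placement and lookup logic described above, here is a sketch in TypeScript; all names, types, and the replication factor are illustrative, not the original implementation (which long predates TypeScript).

    // Rough sketch of the placement and lookup logic described above; every
    // name and number here is illustrative, not the original system.
    interface StorageDrive {
      server: string;    // which machine hosts the drive
      share: string;     // remote path to the drive, e.g. a Windows network share
      freeBytes: number;
    }

    interface ArchiveLocation {
      archiveId: string;
      share: string;     // one of the N places this archive file was written
    }

    interface ImageRecord {
      imageId: string;
      archiveId: string; // which archive file holds this image
      offset: number;    // byte offset of the image within the archive
      length: number;    // byte length of the image data
    }

    const REPLICAS = 3;  // "N", chosen per installation

    // Pick N drives with enough free space, all on different servers.
    function pickDrives(drives: StorageDrive[], sizeBytes: number): StorageDrive[] {
      const chosen: StorageDrive[] = [];
      for (const drive of drives) {
        if (drive.freeBytes < sizeBytes) continue;
        if (chosen.some((c) => c.server === drive.server)) continue;
        chosen.push(drive);
        if (chosen.length === REPLICAS) return chosen;
      }
      throw new Error("not enough drives with free space; retry the write later");
    }

    // To read one image: look up its archive and offset in the database, then
    // try each stored copy until one is reachable.
    async function readImage(
      image: ImageRecord,
      locations: ArchiveLocation[],
      readBytes: (share: string, offset: number, length: number) => Promise<Uint8Array>,
    ): Promise<Uint8Array> {
      for (const loc of locations.filter((l) => l.archiveId === image.archiveId)) {
        try {
          return await readBytes(loc.share, image.offset, image.length);
        } catch {
          // that server, site, or link may be down; try the next copy
        }
      }
      throw new Error("no reachable copy of archive " + image.archiveId);
    }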

This system was read-mostly, and writes were not urgent. For writes, if N storage drives weren’t available, the operator (of the check-scanning system) would try again later. CAP and other concerns weren’t important for this application.

Helpful Properties

  • Even if some servers, sites, or links were down, files remained generally accessible.
  • Offline media storage could be added, though I don’t recall if we got very far down that path.
  • The system was very insensitive to details like OSs, OS versions, etc. New storage servers and drives could be added with newer OS versions and bigger drive sizes, without upgrading old storage.
  • Drives could be made read-only once full, to avoid whole classes of possible corruption.
  • By increasing the number of servers, and number of hard drives over time, this basic design could scale quite far (for the era, anyway).

This approach delivered a lot of the benefits of an expensive scalable storage system for our client, at a fraction of the cost and using only commodity equipment.

Why do I describe this as cloud-like? Because from things I’ve read, this is similar (but much less sophisticated, of course) to the approach taken inside of Amazon S3 and other cloud data storage systems/services.

Key Lesson

Assume you are willing to pay to store each piece of data on N disks. You get much better overall uptime (given the right software) if those N disks are in N different machines spread across sites than you do by putting those N disks in a RAID on the same machine. Likewise, you can read a file much faster from an old slow hard drive in the same building than from a RAID-6 SAN across a 2000-era WAN. The tradeoff is software complexity.
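
To put rough numbers on the uptime claim, here is a back-of-the-envelope sketch; the per-machine availability figure is assumed, and correlated failures are ignored.

    // Back-of-the-envelope arithmetic behind the uptime claim, assuming
    // independent failures and a made-up per-machine availability.
    const machineAvailability = 0.99; // one server, including its local RAID (assumed)
    const replicas = 3;               // N copies on N separate machines

    // The data is unreachable only if all N machines are down at once.
    const replicatedAvailability = 1 - Math.pow(1 - machineAvailability, replicas);

    console.log(replicatedAvailability.toFixed(6)); // ~0.999999, versus 0.99 for one machine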


Data Center (Cloud) Cost Efficiency

A few months ago I mentioned James Hamilton’s comments on the micro-server trend. Today I came across a talk he gave at MIX10 in which he presented excellent real-world large-scale data, with insightful analysis, about the cost efficiency of data centers. (Here is a direct MP4 download, suitable for viewing across more platforms.)

I had an intuitive feel for many of his conclusions already, and had numbers to back that up on a small scale (as a customer of cloud services, a provider of SaaS, and an employer of people who operate systems). But I am very pleased whenever an opportunity comes along to replace intuition with data.

I won’t attempt to repeat his ideas here. I will simply recommend that you watch this (and other similar analyses) and get a decent understanding, before purchasing or deploying any in-house, self-hosted, or self-managed servers. The latter still makes sense in some situations, but in 2010 the cloud is the default right answer.

A number of the ideas he presents are iconoclastic; some popular trends, especially in enterprise data centers, turn out to be misguided.

Amazon S3: Now Much Safer for Important Data

A few weeks ago when I spoke at the St. Louis Cloud Computing User Group, one of the possible cloud storage worries I brought up was the prospect of a few misplaced (accidental or malicious) clicks deleting large swaths of data. This applies both to S3 (the market leader) and to other similar offerings. If you’ve tried out the various GUI tools for manipulating S3 “objects”, you’ve no doubt noticed that just a few clicks could delete thousands of objects (files) or even a whole bucket. Imagine a naive new employee (or worse) discarding terabytes of customer data; your business could be flushed down the drain in seconds.

Amazon has recently added a couple of features which greatly reduce this risk: Multi-Factor Authentication and Versioning. Using these features, it is now much more reasonable to store important data on S3 – the access needed to delete data can be controlled in such a way that even a malicious user, with access to credentials sufficient to do real work, nonetheless won’t be able to actually delete any data.
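
For readers who want to see what turning these features on looks like in code, here is a rough sketch using today’s AWS SDK for JavaScript (which postdates this post); the bucket name, region, and MFA device serial/code are placeholders.

    // Rough sketch: enable bucket versioning with MFA Delete using the current
    // AWS SDK for JavaScript (which postdates this post). Bucket name, region,
    // and the MFA device serial/code are placeholders.
    import { S3Client, PutBucketVersioningCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({ region: "us-east-1" });

    await s3.send(
      new PutBucketVersioningCommand({
        Bucket: "important-customer-data",
        // Device serial (or ARN) followed by the current one-time code.
        MFA: "arn:aws:iam::123456789012:mfa/admin 123456",
        VersioningConfiguration: {
          Status: "Enabled",
          MFADelete: "Enabled", // permanently deleting a version now requires the MFA token
        },
      })
    );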

As the various cloud offerings mature, I expect all major providers to offer increased “safety” features, and for technical audits to verify and require their use.

Upcoming talk: Cloud Computing User Group

The St. Louis Cloud Computing User Group launches on Jan. 21st at Appistry. Sam Charrington over there kicked it off, but I suspect it will shortly grow far past its Appistry roots.

I’m giving a talk (one of two) at the first meeting. Contrary to the initial description floating around, I won’t be speaking (in detail) about “Amazon Web Services from a Developer Perspective”. Rather, my talk will be broader, and from a developer+business perspective:

To the Cloud(s) and Back

Over the last few years, I’ve been to the Amazon cloud and back: on a real project I started with in-house file storage, moved to Amazon S3, then moved back. I’ve likewise used EC2 and tried a couple of competitors. I think this qualifies me to raise key questions:

  • Should you use (public) cloud storage? Why and why not?
  • Should you use (public) cloud CPUs? Why and why not?
  • How do you manage an elastic set of servers?
  • Can you trust someone else’s servers? Can you trust your own?
  • Can you trust someone else’s sysadmins? Can you trust your own?
  • What about backups?

This talk will mostly raise the questions, then offer some insights on some of the answers.

Update: Slides are online here.