Component Contracts in Service Oriented Systems

PRINCIPLE: Relationships must be governed by contracts that are monitored for performance.

To build a reliable system composed of many services, we need guidelines for making those services reliable, both in the technical sense and in the psychological sense of giving people confidence that things will work.

In a system of services, just like in a society, business relationships should be governed by contracts that are monitored for performance. Wherever a dependency exists between services, components, or teams, a contract needs to exist to govern that dependency. That contract comprises an agreement that defines the scope of responsibility of the service provider and the service consumer. Here’s a description of the contracts each service should provide to its customers.

Interface Contract

Every service must guarantee that its interface will remain consistent. Assuming the service is delivered over HTTP, the interface includes:

  • Names and meanings of query string parameters.
  • Definitions of what HTTP headers are used or ignored.
  • Format of any document body submitted in the request.
  • Format of the response body.
  • Use of HTTP methods.

Note that in this context, “consistent” does not have to mean unchanging. It only means that no backwards-incompatible changes can be made. If your service is designed on the same RESTful hypermedia principles as the web, your interface can remain consistent while growing over time.

The Interface Contract must be documented and available to both your customers and your delivery team. In fact, I would strongly recommend that the Interface Contract be created and delivered before you begin writing code for your service. It serves not only as documentation, but as the specification for developers to work from, and as the starting point for your test plan.
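
To make the “starting point for your test plan” idea concrete, here is a rough sketch of a contract test in Python using the requests library. The endpoint, fields, and status code are hypothetical stand-ins for whatever your Interface Contract actually documents:

    import requests

    BASE = "https://api.example.com/v1"

    def test_get_order_matches_contract():
        # Hypothetical contract: GET /orders/{id} returns 200, a JSON body,
        # and always includes the fields "id", "status", and "total".
        resp = requests.get(BASE + "/orders/42")
        assert resp.status_code == 200
        assert resp.headers["Content-Type"].startswith("application/json")
        body = resp.json()
        for required_field in ("id", "status", "total"):
            assert required_field in body

A suite like this, run by the provider before every release, catches backwards-incompatible changes before your customers do.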

If changes require breaking compatibility, the best policy is to expose a new version of your service at a different endpoint. You must then establish a deprecation cycle to ensure clients have time to move to the new version. Only after all clients have migrated to the new version can you stop providing the old version. Such deprecation cycles can be very long, depending on the complexity of the service and the velocity of client development. Avoid backwards-incompatible changes in your interface if at all possible.
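
Exposing the new version at a different endpoint can be as simple as routing the two versions side by side. A minimal sketch using Flask, with invented resource names (any framework with URL routing works the same way):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/v1/orders/<int:order_id>")
    def get_order_v1(order_id):
        # Old contract, kept alive for the length of the deprecation cycle.
        return jsonify({"id": order_id, "total": "19.99"})

    @app.route("/v2/orders/<int:order_id>")
    def get_order_v2(order_id):
        # New contract; the breaking change is isolated behind a new endpoint.
        return jsonify({"id": order_id, "total": {"amount": "19.99", "currency": "USD"}})

Both versions run in the same service until every client has migrated off /v1/.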

Service Level Agreement

Where your Interface Contract defines what your service will deliver, the Service Level Agreement (SLA) governs how it will be delivered (or how much). Things that need to be documented in your SLA include:

  • Availability: Uptime guarantees, scheduled maintenance windows, and communication policies around downtime.
  • Response time: What is the target for acceptable response times? What is the limit beyond which you will consider the service unavailable?
  • Throughput: How many requests is the service expected to handle? How many is the client allowed to send in a given time window?
  • Service classes: Are there certain kinds of requests that have non-standard response time or throughput requirements? Document them explicitly.

Your SLA should also describe how you monitor and report on conformance with the agreement. Measurements of these aspects of performance are usually called Key Performance Indicators (KPIs), and those measurements should be made available to your customers as well as your delivery team. These might be circulated in a regular email, or made available as a web-based dashboard.

If there is a financial arrangement involved in using the service, your SLA should also include remedies for non-conformance. However, even for services designed for internal consumption only, the SLA should be explicitly documented and agreed on by the service provider and the service consumer.

Internally, you should also monitor the error rate of your application and subtract it from your availability. A server that throws a 500 Internal Server Error was not available to the customer who received the error. If a high percentage of requests result in errors, you have an availability problem.
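
Here is a rough sketch of how the error rate and the response-time limit might be folded into a single availability KPI. The log format (status code and seconds as the last two fields) and the two-second limit are assumptions, not a standard:

    # Count a request as "unavailable" if it returned a 5xx error
    # or exceeded the SLA's response-time limit.
    SLA_RESPONSE_LIMIT = 2.0  # seconds; assumed SLA limit

    def availability(log_lines):
        total = unavailable = 0
        for line in log_lines:
            status, seconds = line.split()[-2:]
            total += 1
            if status.startswith("5") or float(seconds) > SLA_RESPONSE_LIMIT:
                unavailable += 1
        return 1.0 if total == 0 else (total - unavailable) / total

    with open("access.log") as f:
        print("availability: {:.3%}".format(availability(f)))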

Communication and Escalation Policy

The key to any relationship is communication. When you provide a service, you must have a communication plan around delivering that service to customers. Some of that communication is discussed above. Issues to cover in your communication plan include:

  • Notification of changes and new service features.
  • Notification of deprecation cycles.
  • Reporting on service level performance.
  • Notification of incidents and how problems that affect customers are being managed.

In addition to these important communications from you to your consumers, it is also important to establish how your customers will communicate to you.

  • How can your customers contact you with questions or concerns?
  • How do they report problems?
  • What are the business hours for normal communications, and what is the policy for after-hours emergencies?

Establishing these policies up front will help people remain calm when an emergency does occur. A clear communication plan lets you focus on solving problems rather than fielding complaints. It also ensures that the customer feels confident that you have things well in hand.

Conclusion

At any point where dependencies exist between systems (or teams), that relationship must be governed by a contract. That contract comprises an agreement that defines the scope of responsibility of the service provider, including the interface for the service, a Service Level Agreement that establishes Key Performance Indicators along with targets and limits, and a Communication and Escalation Policy to ensure good support for the running service.

With these parameters defined and clearly communicated, all parties should have confidence in the reliability of the service (or at least a clear path to getting there).

Toward a Reusable Content Repository

There are a plethora of web-based content management systems and website publishing systems in the world. Almost all of them are what you might call “full stack solutions,” meaning that they try to cover everything you need to cook up a full publishing system, from content editing to theming. WordPress is the most obvious example, but there are hundreds of such systems varying in complexity, cost, and implementation platform.

So many of the available products are full stack solutions that the market seems to have forgotten the possibility of anything else. What would it look like if you could assemble a CMS from ready-made components? What might those components be, and how would they interoperate?

Every web CMS that I have seen can be divided into three major components. They are:

  • Content Repository
  • Publishing Tools
  • Site Presentation

Each of those major components could further be described with a feature set that might be implemented with sub-components. The Site Presentation component might provide Themes or Sidebar Modules. The Publishing Tools might be as simple as a bare textarea, or might include WYSIWYG with spell checking and media embedding. The Content Repository is, almost universally, a relational database.

The Content Repository, I believe, is the reason that so many systems ship as full-stack solutions. There is no reusable Content Repository component that meets the general needs of content management systems. Without that central component, implementors are forced to bind both their Publishing Tools and their Site Presentation systems tightly to their own custom repository.

I would suggest the following feature set for a reusable Content Repository.

  • Flexible and extensible information architecture, with a sensible default that will work out of the box for most users.
  • Web API for content storage and retrieval (not just a native language API).
  • Fielded search and full-text search over stored objects.
  • Optional version history for content objects.
  • Optional explicit relationships between content objects.
  • Pluggable backends, allowing for implementations at different scales.

Most internal repositories are quite weak in this feature set. For example, very few embedded repositories implement full-text search. Of those that do, the implementation is often naive (SQL LIKE queries), leading to poor performance and poor scalability.

Most embedded repositories implement only a native-language API, not a web API, which prevents access to the content unless you also have access to the code (some see this as a feature rather than a bug). 
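
To make the distinction concrete, here is roughly what the smallest possible web API for storage and retrieval could look like, sketched in Flask with an invented URL scheme (an illustration only, not a proposal for the actual protocol):

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    objects = {}  # in-memory stand-in for a real, pluggable backend

    @app.route("/objects/<object_id>", methods=["GET"])
    def retrieve(object_id):
        if object_id not in objects:
            return jsonify({"error": "not found"}), 404
        return jsonify(objects[object_id])

    @app.route("/objects/<object_id>", methods=["PUT"])
    def store(object_id):
        objects[object_id] = request.get_json()
        return jsonify({"stored": object_id}), 201

Any Publishing Tool or Site Presentation system that speaks HTTP can use a repository like this, with or without access to its code.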

Relational databases are notoriously bad at flexible information architecture, so it has taken a lot of time and effort for content management systems to add flexibility. Tools like Drupal’s Content Construction Kit and WordPress Custom Post Types are getting there, but without a common base architecture to build on, every implementation is custom and incompatible with the next.

Whatever subset of the features listed above it implements, there are two key requisites that a reusable Content Repository must fulfill:

  • A published (and preferably simple) protocol for accessing its features.
  • A common base information architecture for content objects.

A Content Repository with these features would serve as a good backing store for Publishing Tools and Site Presentation systems alike, and would be agnostic to both. Any tool that understood the information architecture and protocol used by the Content Repository could build on it easily. Tools could ship with an embedded Content Repository, or connect to an external one that might be hosted on a provider’s servers. Most importantly, your Site Presentation would no longer need to be bundled with your Publishing Tools.

Content Repository Protocol

There are very few contenders for content repository protocols. I only know of two that might be reasonable to build on: AtomPub and CMIS. To my mind, neither is a complete solution, but examining them might help us develop one.

CMIS, despite its name, is geared more toward Document Management than Content Management (IMHO). 

CMIS is far from being simple, and it makes some assumptions about information architecture that make it awkward to use in many cases. For example, it assumes a distinction between documents and folders, and assumes that there is a single folder hierarchy for all content. This is a restrictive and unnecessary constraint that does not fit all use cases. 

It also requires repositories to implement a SQL-like query language, forcing them to map content to a relational model even when it is not stored that way. This makes implementations expensive, and makes certain kinds of queries difficult to craft.

AtomPub, on the other hand, is simple and well-crafted, but defines only a fraction of the feature set we would want in a protocol. For example, it has no defined search protocol at all. Some implementations extend the protocol to mimic Google’s GData search protocol, but it is not a standard, and Google is no longer using it.

Implementers of a reusable Content Repository will have to face the challenge of defining a new protocol for accessing it. That’s a pretty high barrier.

Common Information Architecture

When I talk about a common information architecture, I am referring to standardizing the shape of content objects in terms of the semantics of their field structure. We need a commonly understood set of metadata, so that tools can share content in a sensible way. Some metadata will be required by Publishing Tools, other metadata will be useful to Site Presentation systems, and some will be needed internally by the Content Repository itself.

Atom and RSS are format standards, but each also defines a base information architecture, and they are mostly compatible with each other. Neither is sufficient for a full Content Repository, but any information architecture incompatible with these formats is a non-starter.
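
For illustration, a content object whose metadata maps directly onto Atom’s element names (id, title, published, updated, author, summary, content, category) might be represented like this. The JSON-ish shape and the placeholder values are my own assumption, not part of either format:

    content_object = {
        "id": "urn:uuid:00000000-0000-0000-0000-000000000000",  # placeholder identifier
        "title": "Toward a Reusable Content Repository",
        "published": "2011-10-12T09:30:00Z",
        "updated": "2011-10-14T16:05:00Z",
        "author": {"name": "Example Author"},
        "summary": "Why there is no reusable Content Repository component.",
        "content": {"type": "html", "body": "<p>...</p>"},
        "categories": ["cms", "architecture"],
    }

Anything a feed reader needs is already there; a repository would layer its own internal metadata on top.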

The IPTC has done a huge amount of work in developing interoperability standards for the news industry, which is all about content management. Their G2 News Architecture is documented implicitly in the specifications for their XML exchange formats, and in their rNews metadata format for HTML. I think the G2 News Architecture is a great start on a common information architecture, but a reusable Content Repository would need to define a simpler useful subset of it if it wanted to gain wide adoption.

Conclusion: A Hole in the Market, or No Market?

There are no real conclusions to draw from this, only questions to ask. Namely, is there an under-served market for a reusable Content Repository out there? Perhaps everyone is content with their vertically integrated solutions, and no one is interested in mixing and matching their presentation layer with different publishing tools.

I suspect, however, that the market for a reusable Content Repository will emerge as a result of the proliferation of Internet-accessible devices. As people want access to their CMS across desktops, tablets, smartphones, and other devices, the utility of separating the presentation from the repository will become obvious.

Of course, the only way to know is to put in the hard work to build it, and see who bites.

CASTED: Cooperative Agents, Single Threaded, Event Driven

The past looked like this: A User logs into a Computer, launches a Program, and interacts with it.

The future looks like this: The Computer on your desk runs a Program (in the background) that collaborates with a Program running on the Computer in your pocket and another Program running on a Computer in the Cloud, operating on your behalf without the need to interact.

In the past, a Program and an Application were the same thing. More and more, the Applications of today and tomorrow are made up of multiple Programs running on multiple Computers but cooperating with each other to achieve some utility for You (formerly the User).

The web development community has lately been very excited about single-threaded event-driven servers like Node.js. These processes are very good at maintaining a large number of connections, each of which requires only a small amount of work. (These servers are not very good at the inverse case, a small number of clients asking for very hard work to be done. For that, you want a different model.)

This paradigm of a large number of connections and small amounts of work fits neatly into the world where large numbers of processes collaborate to create a useful result. Each process does a relatively small amount of work, but the value emerges from the coordination of the processes through their communication.
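
Node.js is the current poster child, but the model is not tied to a language. Here is a minimal sketch of the same single-threaded, event-driven pattern in Python’s asyncio (the protocol and port are invented for illustration):

    import asyncio

    async def handle(reader, writer):
        # Each connection asks for only a small amount of work:
        # read a line, acknowledge it, wait for the next one.
        while True:
            line = await reader.readline()
            if not line:
                break
            writer.write(b"ack: " + line)
            await writer.drain()
        writer.close()

    async def main():
        # One thread, one event loop, many thousands of open connections.
        server = await asyncio.start_server(handle, "0.0.0.0", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())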

Example: There is a process on your phone that displays emails. There is a process on the mail server that sends the messages to your phone. There is a process that examines messages as they arrive at the server to filter out junk mail. There is another process that examines the messages to rank them by importance and places some in your Priority Inbox. These processes are constantly running, on multiple servers, operating on your behalf in the background.

Years ago, Tim O’Reilly was writing about software above the level of a single device as part of his Web 2.0 concept. Tim’s classic example is the iPod + iTunes + iTunes Store triumvirate. You have servers on the Internet, a desktop or laptop computer, and a small handheld device all coordinating your data for you.

As more devices have computers embedded into them, there are more opportunities for cross-device applications. And as more such applications emerge, users will expect applications to coordinate across devices like this. If you are designing a new application today, you’d better be thinking about it as a distributed system of cooperating processes.

Evolving Systems vs Design Consultants – A Recurring Pattern

I often think of systems architecture as analogous to this word game I played as a child. I don’t know if the game has a name, but it is begun by selecting two words, say “cat” and “dog”. The goal is to begin with one word, and end with the other. The rules are: you can change only one letter each turn, and at the end of every turn, you must be left with a real word. Hence, one way the game might play out is CAT -> COT -> COG -> DOG. You might also get there through CAT -> COT -> DOT -> DOG. Either path is valid, but there is no direct “upgrade” from CAT to DOG.

This is an apt analogy for the problem of systems architecture when dealing with an operational system. The constraints of the system’s operation almost always prevent you from changing more than one component at a time. Every change to any component must result in a system that continues to operate. Real-life systems also tend to have far more components than a three-letter word; they comprise sentences, paragraphs, even whole novels.

In my work, I have occasionally had the good fortune to work with some great outside consultants. To date, I have always found these interactions to be productive and educational on multiple levels. It is a remarkable luxury to pick the brain of someone who is truly an expert in their field, and I try to take advantage of such opportunities whenever I can. In those interactions, I have noticed a curious recurring pattern.

Because of my role, I am often dealing with a consultant who is a systems designer. This expert comes in to help us improve the design of our systems. Unfortunately for her (or me), evolving operational systems tend to be more organically grown than designed, and the consultant must infer a design intent from examining the system as built, because the original design intent is lost in the mists of time.

Invariably, a conversation will occur that goes something like this.

“I see that you are using a COT in this part of the system,” the consultant will say, attempting to hide a smirk. “A DOG would be much more appropriate. Why don’t you try using a DOG?”

Of course, the consultant is being tactful here. No person in his right mind would use a COT as a replacement for a DOG. We, who built the system, are embarrassed even to be showing anyone this particular mangled part of our system. My response, when I have sufficient presence of mind to compose a rational one, always has a similar pattern.

“Well yes, ideally you want a DOG there, but when we were building this aspect of the system, we didn’t have enough budget left for a pre-built DOG component. It would have taken us several months to build a custom DOG, which would have caused us to miss our launch deadline. But we had a well-tested CAT component we had built for a different system, and that mostly did the job. We found we could use that if we made some adjustments to the FOOD component to accommodate the CAT, and we could do that faster than building a whole new DOG.”

Pause for a breath. Here’s where the explanation gets messy. “After we launched, we wanted to come back and fix this to use a DOG, as originally designed, but of course we couldn’t switch from a CAT to a DOG without changing the FOOD component again. Since we can only change one component at a time, during the upgrade process either the CAT or the DOG would get the wrong FOOD at some point, breaking the system.” Remember that constraint about changing only one component at a time?

“We can’t afford to break the system, we have live customers to support now.” Here’s that other constraint, every change must result in an operational system. Paying customers enforce that pretty strictly. It’s hard to say you’re lucky if you don’t have paying customers, but sometimes it feels that way.

“So instead, we have migrated to using a COT. It’s obviously not very efficient, but it fits, and it eliminates the dependency on the FOOD component (a COT does not eat). We’re planning to replace the COT with a COG in a future release, which should be a smooth transition, and free up some system resources. Once that’s done, we can use those resources to re-engineer the FOOD component to support a DOG, assuming management signs off on the additional cost.”

By this time, depending on the consultant’s level of experience, she will either be staring at me like I’m a lunatic, or shaking her head with a sympathetic grimace (usually the latter). In either case, the response is usually some variant of “I see.” And the final report will advise, “Upgrade from COT to DOG ASAP.”

Sigh.

There is no aspect of an organically grown system that could not be better designed in retrospect. The shape of the completed system is not governed solely by the appropriateness of the design/architecture. It is largely shaped by convenience, by the accessibility of specific tools or components, and by the cost-benefit trade-offs and time constraints imposed externally on the design process.

The line between sense and nonsense is squiggly, because it must be drawn through the whole history of the system. And it’s not always obvious which side of the line you are on.

Django Settings: Three Things Conflated

If you work on a large Django project, there’s a good chance that you would describe your settings file as “a mess” (or perhaps you use harsher language). You may even have broken your settings out into a whole package with multiple files to try and keep things organized. We’re highly skilled and organized developers; how does this happen to us?

I believe part of the problem is that the “settings” bucket holds three different kinds of things without differentiating between them. If you make a clear distinction between these things in your own mind (and in your code), dealing with settings will become easier, if not easy.

Project composition

The first class of settings comprises those used for project composition. One of the killer features of Django is that projects are composed of independent modules (apps). The most important settings in your project’s settings file define what apps make up the project and how they interact with each other. In other frameworks this would be done with code (well, technically settings are Python code), but in Django this is treated as configuration. Things like INSTALLED_APPS, MIDDLEWARE_CLASSES, and TEMPLATE_CONTEXT_PROCESSORS define how the components of your project are combined to achieve the desired functionality.

Settings whose values are (possibly a list of) Python modules normally fall into this category.

External resources

The second class of settings comprises those used for connecting to external resources. This is the area most broadly recognized as configuration. Settings like DATABASES and CACHES fall into this category. These are the things that The Twelve Factor App says should be provided by environment variables, and in fact it’s not that difficult to pull these values into your settings from the environment.

In addition to the obvious dictionaries defining pluggable back-ends, any setting whose value is a file system path or a URL likely falls into this category.
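
Pulling these values from the environment really is straightforward. A sketch of the DATABASES setting built from environment variables, twelve-factor style (the variable names are an arbitrary convention, not something Django mandates):

    # settings.py (excerpt)
    import os

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql_psycopg2",
            "NAME": os.environ.get("DB_NAME", "myproject"),
            "USER": os.environ.get("DB_USER", "myproject"),
            "PASSWORD": os.environ.get("DB_PASSWORD", ""),
            "HOST": os.environ.get("DB_HOST", "localhost"),
            "PORT": os.environ.get("DB_PORT", ""),
        }
    }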

Tunable parameters

The final class of settings comprises tunable parameters: values that would otherwise be hard-coded constants, abstracted out because 1) hard-coded values are bad, and 2) you (or users of the code) might want to change them from the defaults. Things like CACHE_MIDDLEWARE_SECONDS, DEBUG flags, DATE_FORMAT, and so on are examples of tunable parameters.

This is the area of greatest multiplication. Virtually every app you pull into your project is going to have some tunable parameters.

Conclusion

Armed with an understanding of the three kinds of values in your settings, you may now be able to devise a superior method of organizing them. You might start by sorting your settings.py file into three sections. Or you might decide to break them out into separate files in a package. Maybe you’ll start using different tools to manage the three types of settings differently. I don’t know; I don’t have the solution to this problem right now, just this one nugget of insight.
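
As one possible starting point, the “separate files in a package” option might look like this; the module names are just one convention I might choose, not a standard:

    # settings/__init__.py
    from .composition import *   # INSTALLED_APPS, MIDDLEWARE_CLASSES, TEMPLATE_CONTEXT_PROCESSORS, ...
    from .resources import *     # DATABASES, CACHES, MEDIA_ROOT, ...
    from .tuning import *        # DEBUG, CACHE_MIDDLEWARE_SECONDS, DATE_FORMAT, ...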

What successful methods have you used to organize settings in large projects?

Heroku and the Twelve Factor App: Architecting for High Velocity Web Operations

A while back I wrote that infrastructure should be delivered as code along with every web application, because web applications are not run by users; they are operated on behalf of users, and are therefore incomplete without the infrastructure needed to operate them. In that article, I mentioned Heroku, a platform-as-a-service company that makes a living operating other people’s web applications. Inspired by their experience in web operations, some of those folks recently wrote a guide to creating web applications that can be operated easily. They call it The Twelve Factor App.

There is a great deal to be learned from this 12 Factor guide and the platform Heroku has designed. Their business depends on consistent, repeatable, and successful deployment and operation of web applications, and they have this stuff precision-cut and well oiled. The guide, and the Heroku platform, make a clear distinction between what is part of the platform, and what is part of the application. Even if you are heeding my earlier advice and delivering infrastructure with your applications, you will benefit from understanding the points of separation 12 Factor recommends between your application and the platform on which it runs.

I had been planning to summarize each factor here, but the descriptions at the web site are sufficiently concise that a summary seems redundant. Just click through the links for each factor and read; it will only take you a few minutes, and it will be well worth your time.

1. One codebase tracked in revision control, many deploys.

2. Explicitly declare and isolate dependencies.

3. Store config in the environment.

4. Treat backing services as attached resources.

5. Strictly separate build and run stages.

6. Execute the app as one or more stateless processes.

7. Export services via port binding.

8. Scale out via the process model.

9. Maximize robustness with fast startup and graceful shutdown.

10. Keep development, staging, and production as similar as possible.

11. Treat logs as event streams.

12. Run admin/management tasks as one-off processes (see 6 above).
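
Several of these factors fit in a single small sketch. Purely for illustration (and assuming nothing about how Heroku itself does it), here is a process that takes its config from the environment (factor 3), keeps no state between requests (factor 6), and exports itself via port binding (factor 7):

    import os
    from wsgiref.simple_server import make_server

    GREETING = os.environ.get("GREETING", "hello")   # factor 3: config in the environment

    def app(environ, start_response):
        # factor 6: stateless; nothing is remembered between requests
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [GREETING.encode("utf-8")]

    port = int(os.environ.get("PORT", "8000"))       # factor 7: export via port binding
    make_server("0.0.0.0", port, app).serve_forever()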

Web Developers: Infrastructure is part of your Application!

One of the most difficult realities for web developers to face is that their application code, elegant and beautiful as it may (or may not) be, does not run in the ivory tower of Code Perfection. It runs on a real machine (or several) in a real data center, competing for resources to serve real clients, and tripping over all-too-real limitations of the environment.

Operations people, those shadowy, pager-carrying folks that developers call “sysadmins”, know that there is so much more to delivering a web application to its clients than simply deploying code. Web applications are not delivered the way packaged software was in the 90’s, on a shrink-wrapped CD-ROM like a book. Web applications are not products at all, they are services, and services don’t get to say “bring your own computer.” Services must be delivered complete, with an entire stack of running programs and systems underneath them.

A web application, whether Java, Ruby, Python, PHP, or LOLcode, is incomplete until it is paired with a stack of servers and services on which to run. Which language runtime must be installed? Which version of which web server(s)? How should the database server be tuned? How much RAM should be allocated to memcached? When should the logs be rotated? Developers often do not even think about these questions. When they do, the answers are usually provided as a narrative requirements list which some dedicated systems engineer must somehow translate into a working system.

Systems automation has now reached the point where this infrastructure can be delivered as code right along with the application code. Every web application should be delivered with Puppet configurations or Chef cookbooks to bring up a precisely tuned deployment stack designed for the application. Cloud-based infrastructure means you can even deliver the (virtual) hardware itself with the application. A good web application should come with a “deploy_to_ec2” script for instant production deployment.
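
As a taste of what a “deploy_to_ec2” script might begin with, here is a sketch using boto3. The AMI ID, key name, and instance type are placeholders, and a real script would go on to run Puppet or Chef against the instance and deploy the application code:

    # deploy_to_ec2.py -- sketch only
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",   # placeholder: your base image
        InstanceType="t3.small",  # placeholder: whatever the app actually needs
        KeyName="deploy-key",     # placeholder: your SSH key pair
        MinCount=1,
        MaxCount=1,
    )
    print("launched:", instances[0].id)
    # Next steps (not shown): wait for the instance, apply Puppet/Chef, deploy the code.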

Of course, there are other opinions. You may choose to outsource your operations work to a platform-as-a-service like Heroku or App Engine. If you want to live in a code-only world where infrastructure never crosses your mind, write your code to target deployment environments like these, and get used to the constraints they impose.

In my opinion, every web development team needs a systems engineer embedded as part of the team, developing and codifying the infrastructure alongside the application code. A web application delivered without infrastructure automation is incomplete.

Web Analytics for Operations

Web analytics packages, from free to exorbitant, have grown in complexity over the life of the web. That’s great news for marketers using the web as a tool to deliver a message to an audience. These tools allow them to measure audience reach, time spent viewing a page, return visits, session length, and other useful customer engagement factors that help shape the business strategy.

Unfortunately, while the marketers have won some great tools, where does that leave the techies who need to operate the infrastructure? We don’t need to know how long a visitor spent on the site, nor to measure the difference between a “page view” and an “interaction”; we need to know how many requests per second the application will generate. Where marketing-oriented analytics goes to great pains to filter out automated crawlers, we desperately need to know when a rampant robot is eating up server resources.

There isn’t much in the way of off-the-shelf software to fit our needs. Mostly, we grow our own solutions, cobbled together with a tool here and a tool there.

Lately I’ve had a need to do some log analysis over a large farm of Apache web servers. I looked at a few open source packages that I knew about, AWStats and Webalizer being perhaps the best known. But I wasn’t happy with either of these solutions. I wanted a tool that would allow me to aggregate not just hits, but time spent generating each page (in milliseconds), and I wanted to break down traffic by five-minute increments for a detailed shape in my graphs. So finally, and somewhat reluctantly, I settled on analog.

Analog is neither pretty nor user-friendly. The configuration file is touchy and somewhat arcane, and its convention for command line parameters is non-standard. However, analog generates 44 different reports, including time breakdowns from annual down to my desired five-minute interval, reports for successes, failures, redirects, and other interesting outcomes, and a processing time report with fine resolution. It can read compressed log files, and it has no problem processing files out of chronological order.

Most importantly, analog is blazingly fast. It chewed through my 20 million lines of compressed Apache logs in six minutes. The speed at which it consumes log files seems to be limited more by I/O rate than CPU, though as a single-process, single-threaded application, analog will only tax one of your CPU cores. If you find CPU a limiting factor on a multi-core system, you might try decompressing the files with gzip and piping the output to analog. This allows the decompression to happen in a separate process, and therefore on a separate CPU core, but I don’t know if that would speed things up much.

I’m still not entirely pleased with this solution. I would prefer a solution that was a little more intuitive, and a little easier to customize. Analog has plenty of knobs to turn, but there is no built-in extension mechanism, so it makes me work pretty hard to pull out custom metrics.
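
For comparison, the kind of custom metric I actually wanted is only a short script away if you skip the packaged reports entirely. A rough sketch, assuming Apache’s %D (microseconds to serve the request) has been appended to the end of each combined-format log line:

    import gzip
    from collections import defaultdict
    from datetime import datetime

    hits = defaultdict(int)
    micros = defaultdict(int)

    with gzip.open("access_log.1.gz", "rt") as log:
        for line in log:
            parts = line.split()
            # combined log format: the timestamp is field 4, e.g. [12/Oct/2011:14:03:27
            ts = datetime.strptime(parts[3].lstrip("["), "%d/%b/%Y:%H:%M:%S")
            bucket = ts.replace(minute=ts.minute - ts.minute % 5, second=0)
            hits[bucket] += 1
            micros[bucket] += int(parts[-1])  # assumed trailing %D field, in microseconds

    for bucket in sorted(hits):
        avg_ms = micros[bucket] / hits[bucket] / 1000.0
        print("{}  {:6d} hits  {:8.1f} ms avg".format(bucket, hits[bucket], avg_ms))

It is no replacement for analog’s 44 reports, but it produces exactly the five-minute traffic and timing shape I was after.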

I would love to hear what other folks are using to analyze their Apache logs. How do you get operational intelligence? Are you using remote logging? Shoot me an email or leave a comment.