SaaS Design Checklist

I’ve been asked several times recently about the design considerations for an application that is to be delivered via the Software-as-a-Service (SaaS) model.  In other words, beyond the core functionality of the application itself, what other features or system aspects need to be addressed for the business offering of that application as a commercial service?

The following is a list of such features or aspects.  I make no claim as to the all-inclusiveness of this list, but it’s a good start.  Certain system aspects that apply broadly whether the application is SaaS or just critical-internal have been omitted (e.g., disaster recovery, health monitoring, etc.).  As for the items listed, they may or may not apply to every situation, but each deserves serious consideration before being ruled out.

Security

  • Subscriber-Level Authentication & Authorization:  A “subscriber” is the entity with whom the business relationship exists for use of the SaaS application and comprises one or more users.  Each request must be authenticated to know the subscriber to which the user belongs thereby enabling license checks and usage metering.  Subscriber authorization comes into play if the application has separately licensable modules to which their users may or may not have access.
  • User-Level Authentication & Authorization:  As in all applications, each request must be authenticated to know the originating user and authorized to access specific capabilities by their role.  This authorization may be further constrained by subscriber-level authorization constraints.
  • Parametric Throttling:  A request may contain parameters that, if unchecked, could harm the system intentionally or otherwise.  For example, consider a request argument that dictates the number of records to return.  The application may protect itself from crippling values like 1,000,000,000 by simply throttling them to some configurable maximum like 500 (a minimal sketch follows this list).  Throttling rules may need to be subscriber-specific.
  • Frequency Throttling:  Also important generally but particularly for APIs is the notion of throttling request rates (i.e., maximum hits per second from the same IP address) to prevent anything from abusively heavy subscribers to denial of service attacks.  This is often achieved within the network infrastructure as opposed to the application itself, but this is an opportunity for making the point that a successful SaaS deployment is about more than just the software engineering.
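
To make the parametric throttling point concrete, here is a minimal Python sketch.  The per-subscriber limits table and all names are illustrative assumptions, not a prescription:

    # A minimal sketch of parametric throttling, assuming a hypothetical
    # per-subscriber configuration table; all names are illustrative.

    DEFAULT_MAX_RECORDS = 500

    def clamp_record_count(requested: int, subscriber_id: str,
                           limits: dict) -> int:
        """Clamp a record-count argument to a configurable maximum."""
        maximum = limits.get(subscriber_id, DEFAULT_MAX_RECORDS)
        return min(max(requested, 0), maximum)

    # A request for 1,000,000,000 records is quietly throttled to 500.
    assert clamp_record_count(1_000_000_000, "acme", {}) == 500
    # Subscriber-specific rules can raise (or lower) the ceiling.
    assert clamp_record_count(2_000, "bigco", {"bigco": 5_000}) == 2_000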

Service Level Agreements

  • Availability Monitoring:  SaaS contracts often carry SLAs that specify minimum application uptime.  When this is the case, a means for self-monitoring availability must be established, whether to tout your success, to be the first to know about issues, or simply to address disputes.  Be specific about how uptime is defined; there are many ways to measure it.
  • Performance Monitoring:  SLAs may also specify performance thresholds and require similar monitoring for similar reasons.  Individual performance data points should include the subscriber ID and request type to enable rollups in these dimensions, since a) different subscribers may demand different SLAs, and b) different request types may have inherently different performance characteristics that can be called out separately with different thresholds.  A minimal capture sketch follows this list.
  • Performance Exclusions:  Depending on the nature of the application or specific requests, there may be portions of execution time that should be excluded from performance calculations.  For example, the implementation of a request may call out to external services or execute a subscriber-specific workflow (i.e., things beyond the SaaS provider’s control).  Such activities may have been excluded from the performance SLAs and thus must be captured, enabling the appropriate adjustments.
  • Compliance Auditing:  Collecting all supporting data is necessary, but not sufficient.  Reporting on this data for the purpose of auditing specific SLAs must be established and should be exercised internally to avoid surprises.
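
As promised above, here is a minimal sketch of SLA-oriented performance capture, covering both the rollup dimensions and the exclusion adjustment.  The in-memory store and all names are illustrative; a real system would persist these data points:

    # A minimal sketch of SLA-oriented performance capture.

    from collections import defaultdict

    timings = defaultdict(list)  # (subscriber_id, request_type) -> [seconds]

    def record_timing(subscriber_id, request_type, elapsed, excluded=0.0):
        """Store elapsed time net of any SLA-excluded external time."""
        timings[(subscriber_id, request_type)].append(elapsed - excluded)

    def percentile(values, pct):
        """Nearest-rank percentile, for auditing a threshold-style SLA."""
        ordered = sorted(values)
        rank = max(1, round(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    # 1.90s wall time, of which 0.75s was an excluded external call.
    record_timing("acme", "search", elapsed=1.90, excluded=0.75)
    record_timing("acme", "search", elapsed=0.40)
    print(percentile(timings[("acme", "search")], 95))  # 1.15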

Subscription Servicing

  • Request Metering:  Requests incident on the application should be counted by subscriber ID and request type.  This enables usage monitoring by subscriber, which may be required to support billing depending on the business relationship.  It also enables internal sensing of more heavily used features; information that can be useful in several ways (e.g., tuning, marketing, deprecation).  A minimal metering sketch follows this list.
  • Subscriber-Level Reporting:  Separate from whatever reporting the application itself provides, there should be a means to generate summary information about a subscriber’s SaaS interaction whether periodically or on-demand.  This information may include usage levels, SLA compliance, license status, strange account activity if detectable, etc.  Minimally, the SaaS provider should be able to retrieve such information, but may also consider making it available to subscribers perhaps as an admin role capability.
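
A minimal metering sketch along these lines, with counters keyed by subscriber and request type (the storage and rollup shapes are illustrative only):

    # A minimal request-metering sketch.

    from collections import Counter

    usage = Counter()  # (subscriber_id, request_type) -> count

    def meter(subscriber_id, request_type):
        usage[(subscriber_id, request_type)] += 1

    def subscriber_summary(subscriber_id):
        """Roll up one subscriber's usage by request type for reporting."""
        return {rt: n for (sid, rt), n in usage.items()
                if sid == subscriber_id}

    for rt in ("search", "search", "report"):
        meter("acme", rt)
    print(subscriber_summary("acme"))  # {'search': 2, 'report': 1}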

External Services

  • Performance Monitoring:  Many applications integrate with externally provided services to perform portions of their request functionality (e.g., information retrieval, payment processing, etc.).  As the consumer of these services, the SaaS application should monitor their performance.  Whether or not the time spent waiting for these services is included in formal SLAs, it will absolutely impact user experience.  Downward trends may lead you to shop around for equivalent alternatives.
  • Availability Monitoring:  For all the same reasons, the apparent availability of any external services should be tracked.  Apparent availability is the percentage of calls to which the service responded in a functionally meaningful way (a minimal tracker is sketched after this list).
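
A minimal apparent-availability tracker might look like the following; what counts as “functionally meaningful” is whatever check makes sense for the call, and all names are illustrative:

    # A minimal apparent-availability tracker for an external service.

    class ExternalServiceTracker:
        def __init__(self):
            self.calls = 0
            self.meaningful = 0

        def record(self, responded_meaningfully: bool):
            self.calls += 1
            if responded_meaningfully:
                self.meaningful += 1

        @property
        def apparent_availability(self) -> float:
            # Fraction of calls answered in a functionally meaningful way.
            return self.meaningful / self.calls if self.calls else 1.0

    tracker = ExternalServiceTracker()
    for ok in (True, True, True, False):
        tracker.record(ok)
    print(tracker.apparent_availability)  # 0.75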

Resource Sharing

  • Multi-Tenancy:  A single infrastructure or slice thereof serving multiple subscribers is central to the SaaS economic model.  The most crucial aspect of this is the notion of a multi-tenant database schema.  The problems associated with isolating subscriber data logically rather than physically are easily offset by the maintenance benefits of dramatically reducing the number of production database instances.
  • Partitioning:  Economically, a single multi-tenant database may be ideal.  At scale, however, it may become necessary to have multiple databases each supporting a subset of subscribers.  This may be done to support different SLAs, to service very different usage patterns, to reduce the impact scope of an outage, or simply to handle high scale loads.
  • Selective Purging:  Even the best SaaS applications will lose subscribers.  Purging their data from a multi-tenant database is usually straightforward, but not so when it comes to backup media of multi-tenant databases.  If you’re entering into a contract that originates from the subscriber’s legal department, read the termination clause carefully and be sure it’s feasible.
  • Subscriber Portability:  If subscribers are partitioned across multiple databases, the need to move a subscriber from one instance to another will eventually arise as usage patterns change (the SaaS analog to rebalancing your 401k).  The biggest hurdle to this is avoiding ID collisions across databases (one mitigation is sketched after this list).  The catch-22 is that this is rarely considered in release 1.0, and the downstream fix usually requires prohibitively invasive surgery.
  • Cross Partition Monitoring:  Partitioning subscribers across multiple databases or even whole infrastructure slices obviously adds to operational complexity.  As the number of partitions grows, consider some form of central monitoring hub to assist the Operations support staff.  This can start out simple and evolve over time as the ROI increases, but good sensors within the application can greatly facilitate this when the time comes.
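
One mitigation for the ID collision problem, sketched minimally below, is to embed a partition number in every generated ID so rows can move between databases without rekeying; UUIDs are another common answer.  The bit layout here is purely illustrative:

    # A minimal sketch of partition-aware ID generation.

    PARTITION_BITS = 16

    def make_id(partition: int, local_seq: int) -> int:
        """Compose a globally unique 64-bit ID from a partition number
        and a partition-local sequence value (which must fit in the
        remaining 48 bits)."""
        return (partition << (64 - PARTITION_BITS)) | local_seq

    # IDs minted on partitions 1 and 2 can never collide, so a
    # subscriber's rows can be copied to another database untouched.
    assert make_id(1, 42) != make_id(2, 42)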

Flexibility & Extensibility

  • UI Customization:  User interface flexibility can range from take-it-as-is to full white labeling per subscriber.  It can be as trivial as showing the subscriber’s logo or as invasive as reworking every form and navigation path to comply with a subscriber’s internal process guidelines.  Ultimately the market will decide what level of customization capability is worth the engineering for a given application.
  • Data Model Customization:  Similarly, subscribers may have additional data fields, whole data objects, or even multimedia content that they wish to store alongside the application’s data model.  Again, this type of flexibility has many prices, and the value in supporting it needs to be assessed case by case (one common storage approach is sketched after this list).
  • Behavioral Customization:  A more complex type of flexibility is that of business behavior (e.g., configurable workflows, proprietary decisioning rules, calculation policies, etc.).  Unless tightly and explicitly bounded, this type of flexibility in a multi-tenant SaaS deployment can be an insidious slippery slope.  Tread carefully.
  • Platform API:  Many applications perform services that can be exposed via an API (e.g., web services).  Doing so can enable subscribers to incorporate the application more deeply using, for example, a Service Oriented Architecture (SOA) while increasing subscriber stickiness for the SaaS provider.  It also opens up the potential for multiple UIs, which may be a path to extreme UI customizations.  However, while exposing such APIs may appear straightforward, it is definitely not to be undertaken lightly.  More on this in another post. 
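
For data model customization, one common approach is to store subscriber-defined fields as extension records rather than altering the shared schema.  A minimal, illustrative sketch (all names are assumptions):

    # A minimal sketch of the extension-record pattern for
    # subscriber-defined fields in a multi-tenant data model.

    custom_fields = {}  # (subscriber_id, object_id) -> {field: value}

    def set_custom_field(subscriber_id, object_id, field, value):
        key = (subscriber_id, object_id)
        custom_fields.setdefault(key, {})[field] = value

    def get_custom_fields(subscriber_id, object_id):
        return dict(custom_fields.get((subscriber_id, object_id), {}))

    set_custom_field("acme", "order-17", "po_number", "PO-2009-0042")
    print(get_custom_fields("acme", "order-17"))
    # {'po_number': 'PO-2009-0042'}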

Dimensions of Scalability

Designing for scalability is one of the primary challenges of system and software architecture.  For those of us who practice architecture, it’s also great fun thanks to the high number of variables involved, the creativity required to discover exploits, the pattern matching to apply tricks and avoid traps, and the necessity to visualize the system in multiple possible futures.

In the broadest terms, “Is it scalable?” = “Will it break under growth?”  A few manifestations that are a bit more useful include “Will performance hold up as we add more users?”, “Will transaction processing time stay flat as the database grows?”, and “Will batch processing still complete within the allotted window as the size of our account base, data warehouse, or whatever multiplies?”.  Architects imagine the kinds of demand parameters that might occur over the life cycle of the system and incorporate mitigation plans.

These examples all pertain to the performance characteristics of a system.  However, there are other dimensions of scalability that are equally important when considering that system in a business context.

Strategic Dimensions

  1. Performance Scalability:  “An observation about the trend in performance in response to increasing demands.”
    Demand can refer to any of several parameters depending on the system, such as number of concurrent users, transaction rates, database size, etc.  Performance measures may include event processing time, batch throughput, user perception, and many others.  In any case, we consider a system to be scalable if we observe a flat or nearly flat performance curve (i.e., little or no performance degradation) as any given demand parameter rises.  In reality, even highly scalable systems tend to be scalable through some finite range of demand, beyond which some resource becomes constrained and causes degradation (a toy illustration follows this list).
  2. Operational Scalability:  “An observation about the trend in effort or risk required to maintain performance in response to increasing demands.”
    This may be best illustrated by example. Consider a web application that is experiencing sharp increases in usage and a mid-tier performance bottleneck as a result.  If the application was designed for mid-tier concurrency, the mitigation effort may be simply adding more application servers (i.e., low effort, low risk).  If not, then significant portions of the application may need to be redesigned and rebuilt (i.e., high effort, high risk).  The former case is operationally scalable.  As with performance scalability, operational scalability occurs in finite ranges.  Continuing the previous example, at some point the database may become the bottleneck typically requiring more extensive remedial action.
  3. Economic Scalability:  “An observation about the trend in cost required to maintain performance in response to increasing demands.”
    We consider a system to be economically scalable if the cost of maintaining its performance, reliability, or other characteristics increases slowly (ideally not at all, but keep dreaming) as compared with increasing loads.  The former types of scalability contribute here.  For example, squeezing maximum performance out of each server means buying fewer servers (i.e., performance scalability) and adding new servers when necessary is cheaper than redeveloping applications (i.e., operational scalability).  However, other independent cost factors can swing things including commodity vs. specialty hardware, open source vs. proprietary software licenses, levels of support contracts, levels of redundancy for fault tolerance, and the complexity of developmental software which impacts testing, maintenance, and release costs.
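
As a toy illustration of the “flat curve” idea and its finite range, consider the following invented data points:

    # A toy illustration of performance scalability: latency holds
    # nearly flat as demand rises, until some resource becomes
    # constrained. The data points are invented.

    demand  = [100, 1_000, 10_000, 100_000]   # concurrent users
    latency = [120,   125,    140,    480]    # ms per request

    baseline = latency[0]
    for users, ms in zip(demand, latency):
        print(f"{users:>7} users -> {ms / baseline:.2f}x baseline latency")
    # The jump at 100,000 users marks the edge of this system's
    # scalable range: some resource has become the constraint.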

Rocky Roads

Since the underlying theme of these additional dimensions is business context, it should be noted that rarely does an architect get to mitigate all imaginable scalability risks.  Usually this is simple economics.  In the early days of an application, for example, the focus is functionality, without which million-user performance will never get the chance to be an issue.  Furthermore, until its particular financial model is proven, excessive spending on scalability may be premature.

However, a good technology roadmap should project forward to anticipate as many scale factors as possible and have its vision corrected periodically.  Scalability almost always comes down to architecture, and an architectural change, which is usually pervasive by definition, is the last thing you want to treat as a hot-fix.

SOA in Good Eternal Company

There’s a place where good acronyms go to die.  I call it the GAG (Good Acronym Graveyard).  It’s a dark, foreboding place where over-hyped acronyms lie interred, separated from their perfectly valid and useful living legacies.

Terminal Terminology

The first GAG funeral that I personally witnessed in my career was Artificial Intelligence.  In the 80s and early 90s, AI was hyped to the point where our brains would surely atrophy as expert systems, neural networks, fuzzy sets, and other goodies put Homo sapiens out of business.  AI would be the computing industry’s super hero of the era.  But just as most of our super heroes eventually disappoint as they fail to live up to impossible expectations, AI came crashing down.  So many companies and investors were burnt in the process that the term itself became a pariah.  A proposal or business plan could promise to cure cancer, but it would be rejected out of hand if it included the term “AI”.

In reality, the AI funeral was for the term itself.  The living legacy of AI is all around us.  We have automated decisioning and diagnostic systems that use many expert systems concepts.  Rule based systems are widely used to codify business policies, determine insurance quotes, and manage the complexities of telecommunications billing.  Neural networks among other techniques are used in pattern analyses such as facial recognition and linguistics.  Just about every complex search technique in use today owes its roots to a university AI lab.  More generally, heuristic algorithms are now pervasive in everything from music recommendations to counter terrorism.

The principles and techniques of AI have been staggeringly successful, but the over-hyped term and its unreasonable expectations rest in peace in the GAG.  This was no time for sorrow, however.  With this burial went the wasteful distraction of trying to satisfy the insatiable.  Released from this burden, practitioners were free to focus and produce the awesome results that have transformed large tracts of the computing landscape.

So Soon SOA

Service Oriented Architecture or SOA has now entered the GAG.  Following a similar pattern to AI, there is nothing wrong with its principles.  In fact, SOA is exactly the transformative movement required by complex enterprises that need breakthrough advances in agility while avoiding the infeasible cost and limitations of wholesale legacy replacement.  Over the past several years, however, the term SOA has been over-hyped as a silver bullet, a specific technology, or a turnkey solution, depending on the agenda of the “hyper”.  Against these expectations, SOA must fail, and it has.

In a 01-05-2009 post entitled “SOA is Dead; Long Live Services”, Anne Thomas Manes writes the following insightful obituary:

SOA met its demise on January 1, 2009, when it was wiped out by the catastrophic impact of the economic recession.  SOA is survived by its offspring: mashups, BPM, SaaS, Cloud Computing, and all other architectural approaches that depend on “services”.

SOA is a strategy and an architecture (people tend to forget that’s what the “A” stands for).  It is a path to which enterprises must commit and in which they must invest in order to realize the returns.  When a project is framed as full blown “SOA”, compelling returns on investment are exceedingly difficult to devise and sell.  However, Software as a Service (SaaS) has gained acceptance as an agile, cost effective alternative to wide-scale software installation and maintenance.  Cloud computing is rapidly ascending to acceptance as a nimble alternative to sizing data centers to handle peak-plus demands.  Mashups are everywhere from grass-roots developers to the enterprise back office.  As these mindset changes continue to cure, the principles of SOA will flourish – even better without the baggage of the term itself.

Requiem

And so we gather together on this cold day in January of 2009 to lay to rest the body of SOA, but not its spirit.  We do not mourn this passing as untimely or empty.  Rather we rejoice in the opportunity to move past empty promises and impossible expectations.

Perhaps now that the GAG is sporting yet another tombstone, we can attend to the real business of enterprise transformation through service orientation.  Perhaps we can even throw in a little AI for good measure… D’OH!!!

Taxonomy for Web Service Sources

Various taxonomies for web services are possible.  A focus on technology might produce classifications such as transport (e.g., HTTP, JMS), representation (e.g., SOAP, REST), and response handling (e.g., blocking, asynchronous via polling, asynchronous via callback).  A focus on purpose might look more like data source vs. computational vs. legacy API exposure.

In the Web 2.0+ era, web services are proliferating wildly.  The mashup community is providing huge demand, satisfied in part by sites such as ProgrammableWeb, where over 1,000 web services can be found, and online services are increasingly opening their platforms via APIs.  Enterprise-level SOA (Service Oriented Architecture) initiatives, while slowed by a slowing economy, are also beginning to consume external services as well as exposing their own services for internal use.

In recognition of this proliferation, I believe a new taxonomy is required that addresses the source of services from the perspective of the user, orthogonal to technology or purpose.  By “user” in this context, I am referring to the human developer of any application that consumes web services.

Source Taxonomy

The figure illustrates a draft web service taxonomy where services are classified by the nature of their sources or providers as seen by potential users.

[Figure: draft source taxonomy for web services]

Classification:  Ownership

Ownership distinguishes services that are sourced within the user’s organization (e.g., company, business unit) from those sourced by parties unaffiliated with that organization.

Internal:  “Services sourced within the user’s organization implying some potential for control over their implementation.”  Examples include web service APIs to internal legacy systems as part of a SOA project.

External:  “Services sourced independently of the user’s organization.”  Examples include information services (e.g., news, market quotes, credit reports) and APIs to platforms like Twitter or SalesForce.

Classification:  Provision

Provisioning refers to the relationship between the entity supporting the web service endpoint from the user’s perspective (i.e., the provider) and the entity that supplies the functional implementation of the web service (i.e., the source).  When the provider is the source, the service is said to be original.  Conversely, if the provider is some form of third party intermediary between user and source, the service is said to be syndicated.

External / Original:  “External services that are called directly by a user’s application.”

External / Syndicated:  “External services for which users call a third party provider or syndicator which would then call the original source on their behalf.”  Presumably in this type of structure, the syndicator would add some value for acting as intermediary.  For example, a syndicator could serve as a common front for many original service sources thereby presenting an additional interface abstraction and a common point of billing and support.

Internal / Original vs. Syndicated:  Based on the foregoing definitions, the notion of a syndicated internal service seems oxymoronic.  The taxonomic intent is to enable larger enterprises to make the distinction, for example, between point-to-point calling of a legacy API (i.e., original source) versus the use of an intermediate hub, messaging service, or some other abstraction layer (i.e., a form of syndication).

Classification:  Differentiation

This classification addresses the potential for multiple sources to provide functionally equivalent services and how the user perceives the relative value of those sources.  A service is referred to as a commodity if it is possible for multiple sources to provide functionally equivalent implementations of that service, whether or not multiple such sources actually exist.  In contrast, a service is referred to as branded if there can be only one source of that service.

External / Original / Commodity:  “Services provided by an original source that can also be offered by other sources.”  The functional equivalence of these services can enable a user to select a source based on non-functional factors such as price, performance, and reliability.  Examples might include data services such as weather or financial market data.  In this scenario, the commodity sourcing decision must be performed by the user.  Despite functional equivalence, each source may present differing interfaces to which the user must code.

External / Syndicated / Commodity:  “Services provided by a syndicator fronting for potentially multiple functionally equivalent sources.”  The key value of commodity syndication lies in the fact that functional equivalence does not imply interface equivalence.  For a given commodity service, a commodity syndicator has the opportunity to normalize interfaces across functionally equivalent sources, thus providing the user with a single stable interface per service.  This scenario would support the commodity sourcing decision being made either by the user or transparently by the syndicator.  A normalization sketch follows.
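
A minimal sketch of that normalization value: one stable interface fronting functionally equivalent sources whose native payload shapes differ.  The vendors, payloads, and selection policy below are all hypothetical:

    # A minimal sketch of commodity syndication: interface
    # normalization over functionally equivalent sources.

    class WeatherSource:
        """The syndicator's single, stable interface for this commodity."""
        latency_ms = 0
        def current_temp_f(self, city: str) -> float:
            raise NotImplementedError

    class VendorA(WeatherSource):
        latency_ms = 80
        def current_temp_f(self, city):
            raw = {"tempF": 71.2}        # stand-in for vendor A's payload
            return raw["tempF"]

    class VendorB(WeatherSource):
        latency_ms = 45
        def current_temp_f(self, city):
            raw = {"celsius": 21.8}      # vendor B's differing shape
            return raw["celsius"] * 9 / 5 + 32

    def pick_source(sources):
        """The sourcing decision, made transparently by the syndicator;
        chosen here on latency, but price or apparent availability
        would work as well."""
        return min(sources, key=lambda s: s.latency_ms)

    source = pick_source([VendorA(), VendorB()])
    print(source.current_temp_f("Boston"))  # the user sees one interface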

External / Original / Branded:  “Services provided by an original source that can only be available from that source.”  The most common types of these services are APIs to specific applications or platforms.  For example, consider writing a mashup for your back office that uses the SalesForce API.  You cannot lightly repoint it at a different CRM application, since it is highly unlikely that the alternative’s API will be functionally equivalent at the service level, not to mention the fact that your company’s data lives at SalesForce.

External / Syndicated / Branded:  “Services that can only be available from a single source, but are accessed through a syndicator.”  This class is included for taxonomic completeness although it is unclear what significant value the syndicator would provide in this case.  There may be some value in a single gateway to multiple branded services for billing, support, or auditing purposes, but this alone hardly seems compelling relative to the overhead.

Classification:  Session

This classification recognizes that certain logical operations may require multiple web service calls.  While this may seem like a technical distinction, its relevance to this taxonomy is in the context of commodity source selection.

… / Commodity / Stateless:  “Completely independent web service calls enabling commodity sourcing decisions on a per-call basis if desired.”  This is the finest granularity of web service commoditization.  An example of this might be a request for a stock quote for a known ticker symbol.  A single call does the job, and there are any number of functionally equivalent service sources for this information.

… / Commodity / Stateful:  “A logically related group of web service calls that must all be made to the same source, thus necessitating a single commodity sourcing decision for the group.”  An example might be obtaining a credit report on a company.  A first call requests a “list of similars” based on the company name.  The returned list includes a set of possible matches with additional data for disambiguation and source-specific IDs.  After selecting the desired company from the list, the second call requests the actual report based on the ID.  The user may not care which source is used, but having made the sourcing decision for the first call, the rest of the conversation must return to the same source since it carries source-specific information.  A sticky-sourcing sketch follows.
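
A minimal sketch of the resulting “sticky” sourcing: the commodity decision is made once, on the first call, and every later call in the conversation returns to the same source.  All names below are hypothetical:

    # A minimal sketch of sticky sourcing for a stateful conversation.

    class Source:
        def __init__(self, name):
            self.name = name
        def invoke(self, operation, **args):
            return f"{self.name} handled {operation}"

    class Conversation:
        def __init__(self, sources, choose):
            self.source = choose(sources)   # the single sourcing decision

        def call(self, operation, **args):
            # Every call returns to the chosen source, since the IDs it
            # handed back earlier are source-specific.
            return self.source.invoke(operation, **args)

    def cheapest(sources):
        return sources[0]                   # stand-in sourcing policy

    convo = Conversation([Source("bureauA"), Source("bureauB")], cheapest)
    print(convo.call("list_of_similars", name="Acme Corp"))
    print(convo.call("credit_report", company_id="A-123"))  # same source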

Summary

The last 5 years have seen the rapid proliferation of available web services and a growing appetite among Web 2.0+ developers eager to consume them.  Thus far, the focus has been on what mashups and service-oriented applications can do and how to achieve them functionally.  Going forward, we will see increased attention to qualities of service, stability, and source redundancy analogous to that of cloud computing.  Lack of maturity in these areas is among the factors holding enterprises back from full-scale consumption of external web services in their business applications.  Concepts such as syndication and commoditization can play a key role in breaking through this barrier.

The Redundancy Principle

Architecting complex systems includes the pursuit of “ilities”: qualities that transcend functional requirements, such as scalability, extensibility, reliability, maintainability, and availability.  Performance and security are included as honorary “ilities” since, aside from being suffix-challenged, they live in the same family of “critical real-world system qualities other than functionality”.  The urge to include beer-flavored took a lot to conquer.

Reliability, maintainability, and availability have some overlap.  For example, most would agree that availability is a key aspect of reliability in addition to repeatable functional correctness.  Similarly, a highly maintainable system is not only one that is composed of easily replaceable commodity parts, but one that can be serviced while remaining available.

As an architect, designing for availability can be great fun.  It’s like a chess game where you have a set of pieces, in many cases multiples of the same kinds.  Your opponent is a set of failure modes.  You know that in combating these failures, pieces will be lost or sacrificed, but if well played, the game continues.

We [Don’t] Interrupt this Broadcast

Every component in a system is subject to failure.  Hardware components like servers and disk drives carry MTBF (mean time between failures) specifications.  Communication media and external services are essentially compositions of components that can fail.  Even software modules may be subject to latent defects, memory leaks, or other unstable states, however statistically rare.  Even the steel on a battleship rusts.  Failures cannot be avoided.  They can, however, be tolerated.

The single most effective weapon in the architect’s availability arsenal is redundancy.  Every high availability system incorporates redundancy in some way, shape, or form.

  • The aging U.S. national power grid provides remarkable uptime to the average household in spite of a desperately needed overhaul. At my house, electrical availability exceeds the IT-coveted five nines (i.e., 99.999%) and most outages can be traced to the local last mile.
  • The U.S. Department of Defense almost always contracts with dual sources for the manufacturing of weapon systems and typically on separate coasts in an attempt to survive disasters, natural or not.
  • The Global Positioning System comprises 27 satellites: 24 operational plus 3 redundant spares. The satellites are arranged such that a GPS receiver can “see” at least 4 of them at any point on earth. However, only 3 are minimally required to determine position, albeit with less accuracy.
  • Even the smallest private aircraft have magnetos: essentially small alternators that generate just enough energy to keep the spark plugs firing in case an alternator failure causes the battery to drain. Having experienced this particular failure mode as a pilot, I was happy indeed that this redundancy kept my engine available to its user.

Returning to the more grounded world of IT, redundancy can occur at many levels.  Disk drives and power supplies have among the highest failure rates of internal components, hence the RAID technology and dual power supply modules found in many servers and other devices.  Networks can be designed to enable redundant LAN paths among servers.  Servers can be clustered, assuming their applications have been designed accordingly.  Devices such as switches, firewalls, and load balancers can be paired for automatic failover.  The WAN can include multiple geographically disparate hosting sites.  The arithmetic that makes all of this work is sketched below.
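
Assuming independent failures (a simplification), redundant components are down only when all of them are down, while chained components erode availability.  The numbers here are illustrative:

    # A minimal sketch of the arithmetic behind the redundancy
    # principle, assuming independent failures.

    def parallel(*avail):
        """Redundant components: down only when all are down."""
        down = 1.0
        for a in avail:
            down *= (1.0 - a)
        return 1.0 - down

    def series(*avail):
        """A chain: up only when every component is up."""
        up = 1.0
        for a in avail:
            up *= a
        return up

    print(parallel(0.99, 0.99))  # 0.9999 -- two 99% servers give four 9s
    print(series(0.999, 0.999))  # ~0.998 -- chains erode availability
    # For perspective, five nines (99.999%) allows roughly 5.3 minutes
    # of downtime per year.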

Drawing the Line

The appropriate level of redundancy in any system reduces to an economic decision.  By definition, any expenses incurred to achieve redundancy are in excess of those required to deliver required functionality.  In some cases, though, redundant resources used to increase availability may provide ancillary benefits (e.g., a server cluster can increase both availability and throughput).

Redundancy decisions really begin as traditional risk analyses.  Consider the events to be addressed (e.g., an entire site going down, certain capabilities being unavailable, a specific application becoming inaccessible, each for some period of time).  Then determine the failure modes that can cause these conditions (e.g., a server locking up, a firewall going down, a lightning strike hitting the building).  Finally, consider the cost of each of these events as a function of its impact (e.g., lost revenue, SLA penalties, emergency maintenance, bad press) and the probabilities of its failure modes actually occurring.  The cost of redundancy to tolerate these failure modes can now be weighed dispassionately against its value (a toy calculation follows).
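
A toy version of that calculation, with entirely hypothetical events, probabilities, and dollar figures:

    # A minimal sketch of the risk analysis described above: weigh the
    # cost of redundancy against the expected cost of the failures it
    # would tolerate. All figures are hypothetical.

    events = [
        # (event, annual probability, impact cost in dollars)
        ("site outage",      0.02, 500_000),
        ("firewall failure", 0.10,  40_000),
        ("server lock-up",   0.50,   5_000),
    ]

    annual_expected_loss = sum(p * cost for _, p, cost in events)
    redundancy_cost = 30_000  # e.g., a failover pair plus maintenance

    print(annual_expected_loss)                    # 16500.0 here
    print(redundancy_cost < annual_expected_loss)  # invest if True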

As technologists, our purist hearts want to build the indestructible system.  Capture my bishops and rooks and my crusading knights will continue processing transactions.  However, the cost-benefit tradeoff drives the inexorable move from pure to real.

The good news is that many forms of redundancy within the data center are inexpensive or at least very reasonable these days given the commoditization of hardware and the pervasiveness of the redundancy principle.  Furthermore, if economics keeps you from realizing total redundancy, do not be disheartened.  We’re all currently subject to the upper bound that we live on only one planet.