
Dimensions of Scalability

Designing for scalability is one of the primary challenges of system and software architecture.  For those of us who practice architecture, it’s also great fun thanks to the high number of variables involved, the creativity required to discover exploits, the pattern matching to apply tricks and avoid traps, and the necessity to visualize the system in multiple possible futures.

In the broadest terms, “Is it scalable?” = “Will it break under growth?”  A few manifestations that are a bit more useful include “Will performance hold up as we add more users?”, “Will transaction processing time stay flat as the database grows?”, and “Will batch processing still complete within the allotted window as the size of our account base, data warehouse, or whatever multiplies?”.  Architects imagine the kinds of demand parameters that might occur over the life cycle of the system and incorporate mitigation plans.

These examples all pertain to the performance characteristics of a system.  However, there are other dimensions of scalability that are equally important when considering that system in a business context.

Strategic Dimensions

  1. Performance Scalability:  “An observation about the trend in performance in response to increasing demands.”
    Demand can refer to any of several parameters depending on the system, such as number of concurrent users, transaction rates, database size, etc.  Performance measures may include event processing time, batch throughput, user perception, and many others.  In any case, we consider a system to be scalable if we observe a flat or nearly flat performance curve (i.e., little or no performance degradation) as any given demand parameter rises.  In reality, even highly scalable systems tend to be scalable through some finite range of demand, beyond which some resource becomes constrained, causing degradation.  (A minimal measurement sketch follows this list.)
  2. Operational Scalability:  “An observation about the trend in effort or risk required to maintain performance in response to increasing demands.”
    This may be best illustrated by example.  Consider a web application that is experiencing sharp increases in usage and, as a result, a mid-tier performance bottleneck.  If the application was designed for mid-tier concurrency, the mitigation may simply be adding more application servers (i.e., low effort, low risk).  If not, then significant portions of the application may need to be redesigned and rebuilt (i.e., high effort, high risk).  The former case is operationally scalable.  As with performance scalability, operational scalability occurs in finite ranges.  Continuing the previous example, at some point the database may become the bottleneck, typically requiring more extensive remedial action.
  3. Economic Scalability:  “An observation about the trend in cost required to maintain performance in response to increasing demands.”
    We consider a system to be economically scalable if the cost of maintaining its performance, reliability, or other characteristics increases slowly (ideally not at all, but keep dreaming) as compared with increasing loads.  The first two types of scalability contribute here.  For example, squeezing maximum performance out of each server means buying fewer servers (i.e., performance scalability), and adding new servers when necessary is cheaper than redeveloping applications (i.e., operational scalability).  However, other independent cost factors can swing things, including commodity vs. specialty hardware, open source vs. proprietary software licenses, levels of support contracts, levels of redundancy for fault tolerance, and the complexity of developmental software, which impacts testing, maintenance, and release costs.
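
To make the performance dimension concrete, here is a minimal load-test sketch in Python.  It is not a real benchmark: `handle_request` is a hypothetical stand-in for the system under test, and the demand levels are arbitrary.  It samples throughput at rising concurrency and reports how well it scales:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(payload):
    """Stand-in for the system under test (hypothetical workload)."""
    time.sleep(0.01)  # simulate fixed per-request work
    return payload

def throughput(concurrency, requests=200):
    """Requests completed per second at a given concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(handle_request, range(requests)))
    return requests / (time.perf_counter() - start)

# A scalable system shows throughput growing roughly in step with demand;
# a flattening scaling factor marks the edge of its scalable range.
base = throughput(1)
for c in (1, 2, 4, 8, 16, 32):
    t = throughput(c)
    print(f"concurrency={c:>2}  throughput={t:7.1f} req/s  scaling={t / base:5.2f}x")
```

A scaling factor that tracks the concurrency level indicates a flat performance curve; the level at which the factor stops growing marks the edge of the system’s scalable range.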

Rocky Roads

Since the underlying theme of these additional dimensions is business context, it should be noted that rarely does an architect get to mitigate all imaginable scalability risks.  Usually this is simple economics.  In the early days of an application, for example, the focus is functionality, without which million-user performance may never become an issue.  Furthermore, until the application’s financial model is proven, heavy spending on scalability may be premature.

However, a good technology roadmap should project forward to anticipate as many scale factors as possible, and it should have its vision corrected periodically.  Scalability almost always comes down to architecture, and an architectural change, which is usually pervasive by definition, is the last thing you want to treat as a hot-fix.

The Redundancy Principle

Architecting complex systems includes the pursuit of “ilities”: qualities that transcend functional requirements, such as scalability, extensibility, reliability, maintainability, and availability.  Performance and security are included as honorary “ilities” since, aside from being suffix-challenged, they live in the same family of “critical real-world system qualities other than functionality”.  The urge to include beer-flavored took a lot to conquer.

Reliability, maintainability, and availability have some overlap.  For example, most would agree that availability is a key aspect of reliability in addition to repeatable functional correctness.  Similarly, a highly maintainable system is not only one that is composed of easily replaceable commodity parts, but one that can be serviced while remaining available.

As an architect, designing for availability can be great fun.  It’s like a chess game where you have a set of pieces, in many cases multiples of the same kinds.  Your opponent is a set of failure modes.  You know that in combating these failures, pieces will be lost or sacrificed, but if well played, the game continues.

We [Don’t] Interrupt this Broadcast

Every component in a system is subject to failure.  Hardware components like servers and disk drives carry MTBF (mean time between failures) specifications.  Communication media and external services are essentially compositions of components that can fail.  Even software modules may be subject to latent defects, memory leaks, or other unstable states, however statistically rare.  Even the steel on a battleship rusts.  Failures cannot be avoided.  They can, however, be tolerated.
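
A rough rule of thumb (assuming steady-state operation and a known mean time to repair, MTTR) converts those specifications into availability:

$$ A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} $$

For example, a component with a 10,000-hour MTBF and a 4-hour MTTR offers roughly 99.96% availability on its own, nowhere near five nines; tolerating failure, rather than avoiding it, is what closes that gap.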

The single most effective weapon in the architect’s availability arsenal is redundancy.  Every high availability system incorporates redundancy in some way, shape, or form.

  • The aging U.S. national power grid provides remarkable uptime to the average household in spite of a desperately needed overhaul. At my house, electrical availability exceeds the IT-coveted five nines (i.e., 99.999%) and most outages can be traced to the local last mile.
  • The U.S. Department of Defense almost always contracts with dual sources for the manufacturing of weapon systems and typically on separate coasts in an attempt to survive disasters, natural or not.
  • The Global Positioning System comprises 27 satellites: 24 operational plus 3 redundant spares. The satellites are arranged such that a GPS receiver can “see” at least 4 of them from any point on earth. However, only 3 are minimally required to determine a position, albeit with less accuracy (the math sketched just after this list shows why).
  • Even the smallest private aircraft have magnetos: essentially small alternators that generate just enough energy to keep the spark plugs firing in case an alternator failure causes the battery to drain. Having experienced this particular failure mode as a pilot, I was happy indeed that this redundancy kept my engine available to its user.
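
The four-satellite minimum falls out of standard positioning math (a textbook derivation, not specific to this post).  Each visible satellite $i$ at known position $(x_i, y_i, z_i)$ yields one pseudorange measurement $\rho_i$, leaving four unknowns: the receiver coordinates $(x, y, z)$ and its clock bias $b$ (scaled by the speed of light $c$):

$$ \rho_i = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2} + c\,b, \qquad i = 1, \dots, 4 $$

Four unknowns demand four equations, hence four satellites; with only three, the receiver must assume its altitude or clock bias, which is why the three-satellite fix loses accuracy.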

Returning to the more grounded world of IT, redundancy can occur at many levels.  Disk drives and power supplies have among the highest failure rates of internal components, hence the RAID technology and dual power supply modules found in many servers and other devices.  Networks can be designed to enable redundant LAN paths among servers.  Servers can be clustered, assuming their applications have been designed accordingly.  Devices such as switches, firewalls, and load balancers can be paired for automatic failover.  The WAN can include multiple geographically disparate hosting sites.
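
A back-of-the-envelope sketch in Python (assuming independent failures, an assumption that shared power, cooling, or software faults can violate) shows why pairing components pays off so handsomely:

```python
from math import prod

def serial(*availabilities):
    """All components must be up (e.g., the layers in a request path)."""
    return prod(availabilities)

def parallel(*availabilities):
    """At least one component must be up (e.g., paired failover devices)."""
    return 1 - prod(1 - a for a in availabilities)

def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60

single = 0.99                      # one 99%-available device
paired = parallel(single, single)  # two in automatic failover: 99.99%
print(f"single: {downtime_minutes_per_year(single):8.1f} min/yr")
print(f"paired: {downtime_minutes_per_year(paired):8.1f} min/yr")

# A request path is only as available as the serial product of its layers.
path = serial(parallel(0.99, 0.99), parallel(0.995, 0.995), 0.9999)
print(f"path:   {path:.5f}")
```

Two 99% devices in automatic failover yield 99.99%, cutting expected downtime from days per year to under an hour.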

Drawing the Line

The appropriate level of redundancy in any system reduces to an economic decision.  By definition, any expenses incurred to achieve redundancy are in excess of those required to deliver required functionality, although in some cases redundant resources used to increase availability may provide ancillary benefits (e.g., a server cluster can increase both availability and throughput).

Redundancy decisions really begin as traditional risk analyses.  Consider the events to be addressed (e.g., an entire site going down, certain capabilities being unavailable, or a specific application becoming inaccessible, each for some period of time).  Then determine the failure modes that can cause these conditions (e.g., a server locking up, a firewall going down, a lightning strike hitting the building).  Finally, consider the cost of each event as a function of its impact (e.g., lost revenue, SLA penalties, emergency maintenance, bad press) and the probability of its failure modes actually occurring.  The cost of the redundancy needed to tolerate these failure modes can now be weighed dispassionately against its value.
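
Sketched in Python (every probability and dollar figure below is hypothetical, chosen only to show the arithmetic), the comparison reduces to annualized expected loss versus the annualized cost of redundancy:

```python
# Each failure mode: (annual probability of occurrence, cost of impact in $).
failure_modes = {
    "server lock-up":   (0.50,  20_000),   # lost revenue + emergency maintenance
    "firewall outage":  (0.20,  50_000),   # SLA penalties
    "lightning strike": (0.01, 400_000),   # site down, bad press
}

redundancy_cost = 30_000  # annualized cost of the mitigating hardware (hypothetical)

# Expected annual loss = sum over failure modes of probability x impact.
expected_loss = sum(p * impact for p, impact in failure_modes.values())
print(f"expected annual loss: ${expected_loss:,.0f}")
print(f"redundancy cost:      ${redundancy_cost:,.0f}")
print("worth it" if redundancy_cost < expected_loss else "not worth it")
```

Note that the dispassionate answer can honestly come out “not worth it”, which is precisely the point of the exercise.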

As technologists, our purist hearts want to build the indestructible system.  Capture my bishops and rooks and my crusading knights will continue processing transactions.  However, the cost-benefit tradeoff drives the inexorable move from pure to real.

The good news is that many forms of redundancy within the data center are inexpensive or at least very reasonable these days given the commoditization of hardware and the pervasiveness of the redundancy principle.  Furthermore, if economics keeps you from realizing total redundancy, do not be disheartened.  We’re all currently subject to the upper bound that we live on only one planet.