Atrophy by Abstraction

In 1957, Isaac Asimov wrote a short story entitled “A Feeling of Power” about a future society where humans had lost to atrophy the ability to perform even simple arithmetic due to a total dependence on computers.  Now I could log volumes on the abysmal state of math knowledge in the humans wandering around in today’s society, but this piece isn’t about that.  It is however along parallel lines for software engineering.

I’ve been doing a fair amount of recruiting for my Engineering team this year and I’m happy to say I’ve hired great people, but it wasn’t easy.  One of the things I like about the process is that I learn a lot.  One of the things I hate is what I sometimes learn, like how many software engineers who have been working in object-oriented technologies for many years can’t give a lucid explanation of encapsulation or inheritance; and abandon all hope of polymorphism – log that knowledge as an exception.  These are the core pillars of object orientation, and if you can’t at least describe them (much less define them), you can’t use them correctly.

For the most part, I’m not talking about young engineers right out of school, although you’d think they would have forgotten less.  I’m talking about the 5-10 year senior engineer who stares as still as a fallen log hoping that the gods of seniority will suddenly inspire unto them the fundamentals.  And on the subject of “senior” in the title, calling a squirrel a duck doesn’t make it quack.  Admittedly Shakespeare put it more eloquently: “A rose by any other name would smell as sweet.”

Another favorite question of mine, as my colleagues well know, is the binary search.  I often ask engineering candidates, especially server-side and database types, to describe it relative to any other search technique.  Half the time the answer starts by naming certain Java classes – nope, pull up, this is not a Java question.  Overall, about 1 in 5 does pretty well.  For the rest, I usually resort to playing a game.

I’m thinking of a number from 1 to 100.  You need to guess it in the fewest number of tries and after each, I will tell you that you’re either correct, too low, or too high.

Almost everyone figures out that the logical first guess is 50.  It has no more of a chance of being right than any other guess, but at least you’re reliably cutting the space in half.  If I say “too high”, then guess 25.  If I then say “too low”, then guess 37, and so on.  That’s a binary search!  Start with a sorted collection and find what you need by successively dividing the search space by 2.
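
For the curious, here is a minimal sketch of that guessing game as code; the class and method names are mine, invented for illustration, and in practice you would simply reuse a built-in search.

    // A minimal binary search sketch over a sorted int array (hypothetical names).
    public final class GuessingGame {
        static int binarySearch(int[] sorted, int target) {
            int low = 0, high = sorted.length - 1;
            while (low <= high) {
                int mid = low + (high - low) / 2;        // guess the middle (avoids int overflow)
                if (sorted[mid] == target) return mid;   // "correct"
                if (sorted[mid] < target) low = mid + 1; // "too low": discard the lower half
                else high = mid - 1;                     // "too high": discard the upper half
            }
            return -1; // not present
        }

        public static void main(String[] args) {
            int[] numbers = new int[100];
            for (int i = 0; i < 100; i++) numbers[i] = i + 1;   // the numbers 1 to 100
            System.out.println(binarySearch(numbers, 37));       // prints 36 (zero-based index)
        }
    }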

Only once was someone’s first answer not 50 – they guessed 70 and my head exploded scattering debris for miles and miles.

I ask this question because knowing how things work matters.  If you don’t understand the binary search, for example, then you have no idea how an index helps a SQL select or why database inserts and updates are costly when indices are overused.  You may never have to code a binary search ever again thanks to abstraction and reuse, but just because something has been abstracted away from daily view doesn’t mean it isn’t executing at runtime.  Understanding this principle is crucial to being able to play in the high-scale league.

Folding a little math back into my binary search question, I usually ask the following just for fun since only about 1 in 50 comes close.  Given a collection of size N, what is the worst case number of tries before you’re sure to win?  More blank fallen log stares as they try to play out the guessing game in their heads, so I lead them down the path.  If N = 2^x (i.e., multiplying by 2, x times), then what is the inverse function x = f(N) (i.e., how many times can N be divided by 2)?  What is the inverse of exponentiation?  But this only helps 1 or 2 more out of 50.

If the many occurrences of the word “log” so far in this post weren’t enough of a clue…

If N = 2^x, then x = log₂ N
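
To make that concrete with the guessing game: for N = 100, 2^6 = 64 < 100 ≤ 128 = 2^7, so the worst case is 7 guesses.  Here is a rough throwaway check by repeated halving; the class name is arbitrary.

    // Count worst-case guesses by repeated halving, and compare with ceil(log2 N).
    public final class WorstCase {
        static int guessesNeeded(int n) {
            int guesses = 0;
            while (n > 0) {    // each guess discards at least half of what remains
                n /= 2;
                guesses++;
            }
            return guesses;
        }

        public static void main(String[] args) {
            System.out.println(guessesNeeded(100));                            // 7
            System.out.println((int) Math.ceil(Math.log(100) / Math.log(2)));  // 7, i.e., ceil(log2 100)
        }
    }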

Stealing a bit from Bill Maher and his HBO series, I think we need some New Rules:

  1. Not all programmers are engineers.  Programmers write computer programs, maybe even really good ones.  But to be an engineer, one has to know how things work or at least possess the intellectual curiosity to want to know.
  2. Calendars and titles do not make engineers senior.  Few things raise my resume red flags higher than seeing every position since kindergarten as Senior this, Lead that, or Super-Duper something else.  Take the time to learn your craft.  That will distinguish you.
  3. Abstraction without fundamentals is unstable.  It can cause us to mistake tight source code for code that performs well, not thinking about the mass of code sitting in base classes, libraries, and frameworks.  We can write resource-sloppy code and assume the garbage collector will magically clean up behind us.  Try that at scale.

Summing up, abstraction is good.  It has marked the forward movement of software engineering for many decades.  It drives productivity and at least potentially drives better designs, testability, extensibility, maintainability, and lots of other good characteristics.  But it can also give us permission to be lazy.  With or without title qualifiers, good engineers do not let this happen.  They are self-motivated to learn and they learn deep, not just broad.

Well, I’ve given away a few of my favorite interview questions here, but if my upcoming candidates suddenly know these answers, I can at least give them credit for reading the interviewer’s blog as preparation.

SaaS Design Checklist

I’ve been asked several times recently about the design considerations for an application that is to be delivered via the Software-as-a-Service (SaaS) model.  In other words, beyond the core functionality of the application itself, what other features or system aspects need to be addressed for the business offering of that application as a commercial service?

The following is a list of such features or aspects.  I make no claim as to the all-inclusiveness of this list, but it’s a good start.  Certain system aspects that apply broadly whether the application is SaaS or just a critical internal system have been omitted (e.g., disaster recovery, health monitoring, etc.).  As for the items listed, they may or may not apply to every situation, but they should at least be seriously considered before being ruled out.

Security

  • Subscriber-Level Authentication & Authorization:  A “subscriber” is the entity with whom the business relationship exists for use of the SaaS application and comprises one or more users.  Each request must be authenticated to know the subscriber to which the user belongs, thereby enabling license checks and usage metering.  Subscriber authorization comes into play if the application has separately licensable modules to which its users may or may not have access.
  • User-Level Authentication & Authorization:  As in all applications, each request must be authenticated to know the originating user and authorized to access specific capabilities by their role.  This authorization may be further constrained at the subscriber level.
  • Parametric Throttling:  A request may contain parameters that, if unchecked, could harm the system intentionally or otherwise.  For example, consider a request argument that dictates the number of records to return.  The application may protect itself from crippling values like 1,000,000,000 by simply throttling it to some configurable maximum like 500.  Throttling rules may need to be subscriber-specific.  A minimal clamping sketch follows this list.
  • Frequency Throttling:  Also important generally but particularly for APIs is the notion of throttling request rates (i.e., maximum hits per second from the same IP address) to prevent anything from abusively heavy subscribers to denial of service attacks.  This is often achieved within the network infrastructure as opposed to the application itself, but this is an opportunity for making the point that a successful SaaS deployment is about more than just the software engineering.
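
As a concrete illustration of parametric throttling, here is a minimal clamping sketch; the class, method names, and limit values are hypothetical placeholders, not prescriptions.

    // Hypothetical sketch: clamp a requested record count to a configurable, possibly
    // subscriber-specific maximum so a value like 1,000,000,000 becomes something survivable.
    import java.util.Map;

    public final class ParametricThrottle {
        private final int defaultMax;
        private final Map<String, Integer> maxBySubscriber;

        ParametricThrottle(int defaultMax, Map<String, Integer> maxBySubscriber) {
            this.defaultMax = defaultMax;
            this.maxBySubscriber = maxBySubscriber;
        }

        int clampRecordCount(String subscriberId, int requested) {
            int max = maxBySubscriber.getOrDefault(subscriberId, defaultMax);
            return Math.min(Math.max(requested, 1), max);
        }

        public static void main(String[] args) {
            ParametricThrottle throttle = new ParametricThrottle(500, Map.of("acme", 1000));
            System.out.println(throttle.clampRecordCount("acme", 1_000_000_000));  // 1000
            System.out.println(throttle.clampRecordCount("other", 1_000_000_000)); // 500
        }
    }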

Service Level Agreements

  • Availability Monitoring:  SaaS contracts often carry SLAs that specify minimum application uptime.  When this is the case, a means for self-monitoring availability must be established, whether to tout your success, to be the first to know about issues, or simply to settle disputes.  Be specific, because there are many ways to define uptime.
  • Performance Monitoring:  SLAs may also specify performance thresholds and require similar monitoring for similar reasons.  Individual performance data points should include the subscriber ID and request type to enable rollups in these dimensions, since a) different subscribers may demand different SLAs, and b) different request types may have inherently different performance characteristics that can be called out separately with different thresholds.
  • Performance Exclusions:  Depending on the nature of the application or specific requests, there may be portions of execution time that should be excluded from performance calculations.  For example, the implementation of a request may call out to external services or execute a subscriber-specific workflow (i.e., things beyond the SaaS provider’s control).  Such activities may have been excluded from the performance SLAs and thus must be captured so the appropriate adjustments can be made.  A rough recording sketch follows this list.
  • Compliance Auditing:  Collecting all supporting data is necessary, but not sufficient.  Reporting on this data for the purpose of auditing specific SLAs must be established and should be exercised internally to avoid surprises.
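
A rough sketch of capturing SLA-relevant performance points, keyed by subscriber and request type with excluded time subtracted, might look like the following; all names and the in-memory storage choice are assumptions made purely for illustration.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: accumulate per-(subscriber, request type) totals, subtracting
    // time spent in activities excluded from the SLA (e.g., external service calls).
    public final class SlaRecorder {
        private final Map<String, long[]> totals = new ConcurrentHashMap<>(); // [count, adjustedMillis]

        public void record(String subscriberId, String requestType,
                           long wallClockMillis, long excludedMillis) {
            long adjusted = Math.max(0, wallClockMillis - excludedMillis); // apply the exclusions
            totals.compute(subscriberId + "|" + requestType, (key, value) -> {
                if (value == null) value = new long[2];
                value[0]++;            // request count
                value[1] += adjusted;  // SLA-relevant time
                return value;
            });
        }

        public double averageMillis(String subscriberId, String requestType) {
            long[] value = totals.get(subscriberId + "|" + requestType);
            return (value == null || value[0] == 0) ? 0.0 : (double) value[1] / value[0];
        }

        public static void main(String[] args) {
            SlaRecorder recorder = new SlaRecorder();
            recorder.record("acme", "getQuote", 420, 150); // 150 ms spent in an external service
            System.out.println(recorder.averageMillis("acme", "getQuote")); // 270.0
        }
    }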

Subscription Servicing

  • Request Metering:  Requests incident on the application should be counted by subscriber ID and request type.  This enables usage monitoring by subscriber, which may be required to support billing depending on the business relationship.  It also enables internal sensing of more heavily used features, information that can be useful in several ways (e.g., tuning, marketing, deprecation).  A minimal metering sketch follows this list.
  • Subscriber-Level Reporting:  Separate from whatever reporting the application itself provides, there should be a means to generate summary information about a subscriber’s SaaS interaction whether periodically or on-demand.  This information may include usage levels, SLA compliance, license status, strange account activity if detectable, etc.  Minimally, the SaaS provider should be able to retrieve such information, but may also consider making it available to subscribers perhaps as an admin role capability.
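
A minimal request-metering sketch, counting by subscriber ID and request type, could be as simple as the following; the names are placeholders, and a real system would persist and roll these counts up.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: count requests per (subscriber ID, request type) for later rollups.
    public final class RequestMeter {
        private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

        public void meter(String subscriberId, String requestType) {
            counts.computeIfAbsent(subscriberId + "|" + requestType, key -> new LongAdder()).increment();
        }

        public long count(String subscriberId, String requestType) {
            LongAdder counter = counts.get(subscriberId + "|" + requestType);
            return counter == null ? 0 : counter.sum();
        }

        public static void main(String[] args) {
            RequestMeter meter = new RequestMeter();
            meter.meter("acme", "searchOrders");
            meter.meter("acme", "searchOrders");
            System.out.println(meter.count("acme", "searchOrders")); // 2
        }
    }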

External Services

  • Performance Monitoring:  Many applications integrate with externally provided services to perform portions of their request functionality (e.g., information retrieval, payment processing, etc.).  As the consumer of these services, the SaaS application should monitor their performance.  Whether or not the time spent waiting for these services to execute is included in formal SLAs, it will absolutely impact user experience.  Downward trends may lead you to shop around for equivalent alternatives.
  • Availability Monitoring:  For all the same reasons, the apparent availability of any external services should be tracked.  Apparent availability is the percentage of calls to which the service responded in a functionally meaningful way.

Resource Sharing

  • Multi-Tenancy:  A single infrastructure or slice thereof serving multiple subscribers is central to the SaaS economic model.  The most crucial aspect of this is the notion of a multi-tenant database schema.  The problems associated with isolating subscriber data logically rather than physically are easily offset by the maintenance benefits of dramatically reducing the number of production database instances.
  • Partitioning:  Economically, a single multi-tenant database may be ideal.  At scale, however, it may become necessary to have multiple databases, each supporting a subset of subscribers.  This may be done to support different SLAs, to service very different usage patterns, to reduce the impact scope of an outage, or simply to handle high-scale loads.  A simple routing sketch follows this list.
  • Selective Purging:  Even the best SaaS applications will lose subscribers.  Purging their data from a multi-tenant database is usually straightforward, but not so when it comes to backup media of multi-tenant databases.  If you’re entering into a contract that originates from the subscriber’s legal department, read the termination clause carefully and be sure compliance is feasible.
  • Subscriber Portability:  If subscribers are partitioned across multiple databases, the need to move a subscriber from one instance to another will eventually arise as usage patterns change (the SaaS analog to rebalancing your 401k).  The biggest hurdle to this is avoiding ID collisions across databases.  The catch-22 is that this is rarely considered in release 1.0 and the downstream fix usually requires prohibitively invasive surgery.
  • Cross Partition Monitoring:  Partitioning subscribers across multiple databases or even whole infrastructure slices obviously adds to operational complexity.  As the number of partitions grows, consider some form of central monitoring hub to assist the Operations support staff.  This can start out simple and evolve over time as the ROI increases, but good sensors within the application can greatly facilitate this when the time comes.
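
To illustrate the partitioning and portability points above, here is a minimal routing sketch.  The lookup-table approach (rather than a fixed hash) is one possible design, chosen because it lets a subscriber be remapped after its data is moved; all names are hypothetical.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: route each subscriber to the database partition holding its data.
    // A mutable lookup table keeps subscriber portability possible.
    public final class PartitionRouter {
        private final Map<String, String> partitionBySubscriber = new ConcurrentHashMap<>();
        private final String defaultPartition;

        public PartitionRouter(String defaultPartition) {
            this.defaultPartition = defaultPartition;
        }

        public void assign(String subscriberId, String partitionId) {
            partitionBySubscriber.put(subscriberId, partitionId); // e.g., after moving a subscriber
        }

        public String partitionFor(String subscriberId) {
            return partitionBySubscriber.getOrDefault(subscriberId, defaultPartition);
        }

        public static void main(String[] args) {
            PartitionRouter router = new PartitionRouter("db-shared-01");
            router.assign("acme", "db-dedicated-02");           // a heavy subscriber gets its own partition
            System.out.println(router.partitionFor("acme"));    // db-dedicated-02
            System.out.println(router.partitionFor("smallco")); // db-shared-01
        }
    }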

Flexibility & Extensibility

  • UI Customization:  User interface flexibility can range from take-it-as-is to full white labeling per subscriber.  It can be as trivial as showing the subscriber’s logo or as invasive as reworking every form and navigation flow to comply with a subscriber’s internal process guidelines.  Ultimately the market will decide what level of customization capability is worth the engineering for a given application.
  • Data Model Customization:  Similarly, subscribers may have additional data fields, whole data objects, or even multimedia content that they wish to store along with the application’s data model.  Again, this type of flexibility comes at a price, and the value of supporting it needs to be assessed case by case.
  • Behavioral Customization:  A more complex type of flexibility is that of business behavior (e.g., configurable workflows, proprietary decisioning rules, calculation policies, etc.).  Unless tightly and explicitly bounded, this type of flexibility in a multi-tenant SaaS deployment can be an insidious slippery slope.  Tread carefully.
  • Platform API:  Many applications perform services that can be exposed via an API (e.g., web services).  Doing so can enable subscribers to incorporate the application more deeply using, for example, a Service Oriented Architecture (SOA) while increasing subscriber stickiness for the SaaS provider.  It also opens up the potential for multiple UIs, which may be a path to extreme UI customizations.  However, while exposing such APIs may appear straightforward, it is definitely not to be undertaken lightly.  More on this in another post. 

Dimensions of Scalability

Designing for scalability is one of the primary challenges of system and software architecture.  For those of us who practice architecture, it’s also great fun thanks to the high number of variables involved, the creativity required to discover exploits, the pattern matching to apply tricks and avoid traps, and the necessity to visualize the system in multiple possible futures.

In the broadest terms, “Is it scalable?” = “Will it break under growth?”  A few manifestations that are a bit more useful include “Will performance hold up as we add more users?”, “Will transaction processing time stay flat as the database grows?”, and “Will batch processing still complete within the allotted window as the size of our account base, data warehouse, or whatever multiplies?”.  Architects imagine the kinds of demand parameters that might occur over the life cycle of the system and incorporate mitigation plans.

These examples all pertain to the performance characteristics of a system.  However, there are other dimensions of scalability that are equally important when considering that system in a business context.

Strategic Dimensions

  1. Performance Scalability:  “An observation about the trend in performance in response to increasing demands.”
    Demand can refer to any of several parameters depending on the system, such as number of concurrent users, transaction rates, database size, etc.  Performance measures may include event processing time, batch throughput, user perception, and many others.  In any case, we consider a system to be scalable if we observe a flat or nearly flat performance curve (i.e., little or no performance degradation) as any given demand parameter rises.  In reality, even highly scalable systems tend to be scalable through some finite range of demand beyond which some resource tends to become constrained, causing degradation.
  2. Operational Scalability:  “An observation about the trend in effort or risk required to maintain performance in response to increasing demands.”
    This may be best illustrated by example. Consider a web application that is experiencing sharp increases in usage and a mid-tier performance bottleneck as a result.  If the application was designed for mid-tier concurrency, the mitigation effort may be simply adding more application servers (i.e., low effort, low risk).  If not, then significant portions of the application may need to be redesigned and rebuilt (i.e., high effort, high risk).  The former case is operationally scalable.  As with performance scalability, operational scalability occurs in finite ranges.  Continuing the previous example, at some point the database may become the bottleneck typically requiring more extensive remedial action.
  3. Economic Scalability:  “An observation about the trend in cost required to maintain performance in response to increasing demands.”
    We consider a system to be economically scalable if the cost of maintaining its performance, reliability, or other characteristics increases slowly (ideally not at all, but keep dreaming) as compared with increasing loads.  The former types of scalability contribute here.  For example, squeezing maximum performance out of each server means buying fewer servers (i.e., performance scalability), and adding new servers when necessary is cheaper than redeveloping applications (i.e., operational scalability).  However, other independent cost factors can swing things, including commodity vs. specialty hardware, open source vs. proprietary software licenses, levels of support contracts, levels of redundancy for fault tolerance, and the complexity of the developed software, which impacts testing, maintenance, and release costs.

Rocky Roads

Since the underlying theme of these additional dimensions is business context, it should be noted that rarely does an architect get to mitigate all imaginable scalability risks.  Usually this is simple economics.  In the early days of an application, for example, the focus is functionality without which million-user performance may never get to be an issue.  Furthermore, until its particular financial model is proven, excessive spending on scalability may be premature.

However, a good technology roadmap should project forward to anticipate as many scale factors as possible and have its vision corrected periodically.  Scalability almost always comes down to architecture, and an architectural change, which is usually pervasive by definition, is the last thing you want to treat as a hot-fix.

SOA in Good Eternal Company

There’s a place where good acronyms go to die.  I call it the GAG (Good Acronym Graveyard).  It’s a dark foreboding place where over-hyped acronyms lie interred, separated from their perfectly valid and useful living legacies.

Terminal Terminology

The first GAG funeral that I personally witnessed in my career was that of Artificial Intelligence.  In the 80s and early 90s, AI was hyped to the point where our brains would surely atrophy as expert systems, neural networks, fuzzy sets, and other goodies would put Homo sapiens out of business.  AI would be the computing industry’s super hero of the era.  But just as most of our super heroes eventually disappoint as they fail to live up to impossible expectations, AI came crashing down.  So many companies and investors were burnt in the process that the term itself became a pariah.  A proposal or business plan could promise to cure cancer, but would be rejected out of hand if it included the term “AI”.

In reality, the AI funeral was for the term itself.  The living legacy of AI is all around us.  We have automated decisioning and diagnostic systems that use many expert systems concepts.  Rule based systems are widely used to codify business policies, determine insurance quotes, and manage the complexities of telecommunications billing.  Neural networks among other techniques are used in pattern analyses such as facial recognition and linguistics.  Just about every complex search technique in use today owes its roots to a university AI lab.  More generally, heuristic algorithms are now pervasive in everything from music recommendations to counter terrorism.

The principles and techniques of AI have been staggeringly successful, but the over-hyped term and its unreasonable expectations rest in peace in the GAG.  This was no time for sorrow, however.  With this burial went the wasteful distraction of trying to satisfy the insatiable.  Released from this burden, practitioners were free to focus and produce the awesome results that have transformed large tracts of the computing landscape.

So Soon SOA

Service Oriented Architecture or SOA has now entered the GAG.  Following a similar pattern to AI, there is nothing wrong with its principles.  In fact, SOA is exactly the transformative movement needed by complex enterprises that require breakthrough advances in agility while avoiding the infeasible cost and limitations of wholesale legacy replacement.  Over the past several years, however, the term SOA has been over-hyped as a silver bullet, a specific technology, or a turnkey solution depending on the agenda of the “hyper”.  To these expectations, SOA must fail and has failed.

In a January 5, 2009 post entitled “SOA is Dead; Long Live Services”, Anne Thomas Manes writes the following insightful obituary:

SOA met its demise on January 1, 2009, when it was wiped out by the catastrophic impact of the economic recession.  SOA is survived by its offspring: mashups, BPM, SaaS, Cloud Computing, and all other architectural approaches that depend on “services”.

SOA is a strategy and an architecture (people tend to forget that’s what the “A” stands for).  It is a path to which enterprises must commit and in which they must invest in order to realize the returns.  When a project is framed as full blown “SOA”, compelling returns on investment are exceedingly difficult to devise and sell.  However, Software as a Service (SaaS) has gained acceptance as an agile, cost effective alternative to wide-scale software installation and maintenance.  Cloud computing is rapidly ascending to acceptance as a nimble alternative to sizing data centers to handle peak-plus demands.  Mashups are everywhere from grass-roots developers to the enterprise back office.  As these mindset changes continue to cure, the principles of SOA will flourish – even better without the baggage of the term itself.

Requiem

And so we gather together on this cold day in January of 2009 to lay to rest the body of SOA, but not its spirit.  We do not mourn this passing as untimely or empty.  Rather we rejoice in the opportunity to move past empty promises and impossible expectations.

Perhaps now that the GAG is sporting yet another tombstone, we can attend to the real business of enterprise transformation through service orientation.  Perhaps we can even throw in a little AI for good measure… D’OH!!!

Taxonomy for Web Service Sources

Various taxonomies for web services are possible.  A focus on technology might produce classifications such as transport (e.g., HTTP, JMS), representation (e.g., SOAP, REST), and response handling (e.g., blocking, asynchronous via polling, asynchronous via callback).  A focus on purpose might look more like data source vs. computational vs. legacy API exposure.

In the Web 2.0+ era, web services are proliferating wildly.  The mashup community is generating huge demand, satisfied in part by sites such as ProgrammableWeb, where over 1,000 web services can be found, and online services are increasingly opening their platforms via APIs.  Enterprise-level SOA (Service Oriented Architecture) initiatives, while slowed by a slowing economy, are also beginning to consume external services as well as exposing their own services for internal use.

In recognition of this proliferation, I believe a new taxonomy is required that addresses the source of services from the perspective of the user, orthogonal to technology or purpose.  By “user” in this context, I am referring to the human developer of any application that consumes web services.

Source Taxonomy

The figure illustrates a draft web service taxonomy where services are classified by the nature of their sources or providers as seen by potential users.

[Figure: web service source taxonomy]

Classification:  Ownership

Ownership is the distinction between services that are sourced within the user’s organization (e.g., company, business unit) versus those sourced by parties unaffiliated with their organization.

Internal:  “Services sourced within the user’s organization implying some potential for control over their implementation.”  Examples include web service APIs to internal legacy systems as part of a SOA project.

External:  “Services sourced independent of the user’s organization.”  Examples include information services (e.g., news, market quotes, credit report) and APIs to platforms like Twitter or SalesForce.

Classification:  Provision

Provisioning refers to the relationship between the entity supporting the web service endpoint from the user’s perspective (i.e., the provider) and the entity that supplies the functional implementation of the web service (i.e., the source).  When the provider is the source, the service is said to be original.  Conversely, if the provider is some form of third party intermediary between user and source, the service is said to be syndicated.

External / Original:  “External services that are called directly by a user’s application.”

External / Syndicated:  “External services for which users call a third party provider or syndicator which would then call the original source on their behalf.”  Presumably in this type of structure, the syndicator would add some value for acting as intermediary.  For example, a syndicator could serve as a common front for many original service sources thereby presenting an additional interface abstraction and a common point of billing and support.

Internal / Original vs. Syndicated:  Based on the foregoing definitions, the notion of a syndicated internal service seems oxymoronic.  The taxonomic intent is to enable larger enterprises to make the distinction, for example, between point-to-point calling of a legacy API (i.e., original source) versus the use of an intermediate hub, messaging service, or some other abstraction layer (i.e., a form of syndication).

Classification:  Differentiation

This classification addresses the potential for multiple sources to provide functionally equivalent services and how the user perceives the relative value of those sources.  A service is referred to as a commodity if it is possible for multiple sources to provide functionally equivalent implementations of that service, whether or not multiple such sources actually exist.  In contrast, a service is referred to as branded if there can be only one source of that service.

External / Original / Commodity:  “Services provided by an original source that can also be offered by other sources.”  The functional equivalence of these services can enable a user to select a source based on non-functional factors such as price, performance, and reliability.  Examples might include data services such as weather or financial market data.  In this scenario, the commodity sourcing decision must be performed by the user.  Despite functional equivalence, each source may present differing interfaces to which the user must code.

External / Syndicated / Commodity:  “Services provided by a syndicator fronting for potentially multiple functionally equivalent sources.”  The key value of commodity syndication lies in the fact that functional equivalence does not imply interface equivalence.  For a given commodity service, a commodity syndicator has the opportunity to normalize interfaces across functionally equivalent sources, thus providing the user with a single stable interface per service.  This scenario would support the commodity sourcing decision being made either by the user or transparently by the syndicator.
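
A minimal sketch of what that interface normalization might look like from the user’s side follows; the quote service, the adapters, and the selection policy are all invented here purely for illustration.

    // Hypothetical sketch of commodity syndication: the user codes to one stable interface,
    // adapters normalize the differing APIs of functionally equivalent sources, and the
    // syndicator makes the sourcing decision transparently.
    public final class SyndicationSketch {
        interface QuoteService {
            double quote(String tickerSymbol);
        }

        static final class SourceAAdapter implements QuoteService {
            public double quote(String tickerSymbol) {
                // ...translate to source A's wire format and call it...
                return 101.25; // stubbed response
            }
        }

        static final class SourceBAdapter implements QuoteService {
            public double quote(String tickerSymbol) {
                // ...translate to source B's quite different interface...
                return 101.30; // stubbed response
            }
        }

        // A real syndicator could select on price, performance, or availability; random here for brevity.
        static QuoteService select() {
            return Math.random() < 0.5 ? new SourceAAdapter() : new SourceBAdapter();
        }

        public static void main(String[] args) {
            System.out.println(select().quote("IBM"));
        }
    }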

External / Original / Branded:  “Services provided by an original source that can only be available from that source.”  The most common types of these services are APIs to specific applications or platforms.  For example, consider writing a mashup for your back office that uses the SalesForce API.  You cannot lightly decide to call a different CRM application, since it is highly unlikely that its API will be functionally equivalent at the service level, not to mention the fact that your company’s data lives at SalesForce.

External / Syndicated / Branded:  “Services that can only be available from a single source, but are accessed through a syndicator.”  This class is included for taxonomic completeness although it is unclear what significant value the syndicator would provide in this case.  There may be some value in a single gateway to multiple branded services for billing, support, or auditing purposes, but this alone hardly seems compelling relative to the overhead.

Classification:  Session

This classification recognizes that certain logical operations may require multiple web service calls.  While this may seem like a technical distinction, its relevance to this taxonomy is in the context of commodity source selection.

… / Commodity / Stateless:  “Completely independent web service calls enabling commodity sourcing decisions on a per call basis if desired.”  This is the finest granularity of web service commoditization.  An example of this might be a request for a stock quote for a known ticker symbol.  A single call does the job and there is any number of functionally equivalent service sources for this information.

… / Commodity / Stateful:  “A logically related group of web service calls that all must be made to the same source, thus necessitating a single commodity sourcing decision for the group.”  An example might be obtaining a credit report on a company.  A first call requests a “list of similars” based on the company name.  The returned list includes a set of possible matches with additional data for disambiguation and source-specific IDs.  After selecting the desired company from the list, the second call requests the actual report based on the ID.  The user may not care which source is used, but having made the sourcing decision for the first call, the rest of this conversation must return to the same source since it carries source-specific information.
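
A rough sketch of honoring that constraint is to pin the sourcing decision to the conversation; everything here (the conversation ID, the source names, the selection rule) is a made-up placeholder.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: the commodity source chosen for the first call of a conversation is
    // pinned, so follow-up calls carrying source-specific IDs return to the same source.
    public final class ConversationPinning {
        private final Map<String, String> sourceByConversation = new ConcurrentHashMap<>();
        private final String[] equivalentSources = {"creditSourceA", "creditSourceB"};

        public String sourceFor(String conversationId) {
            return sourceByConversation.computeIfAbsent(conversationId,
                    id -> equivalentSources[Math.floorMod(id.hashCode(), equivalentSources.length)]);
        }

        public static void main(String[] args) {
            ConversationPinning pinning = new ConversationPinning();
            String first = pinning.sourceFor("conv-42");  // sourcing decision made on the first call
            String second = pinning.sourceFor("conv-42"); // the follow-up call goes to the same source
            System.out.println(first.equals(second));     // true
        }
    }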

Summary

The last 5 years have seen the rapid proliferation of available web services and a growing appetite of Web 2.0+ developers anxious to consume them.  Thus far, the focus has been on what mashups and service oriented applications can do and how to achieve them functionally.  Going forward, we will see increased attention to qualities of service, stability, and source redundancy analogous to that of cloud computing.  Lack of maturity in these areas is among the factors holding back enterprises from full scale consumption of external web services in their business applications.  Concepts such as syndication and commoditization can play a key role in breaking through this barrier.

New Math

Reductionism, the philosophical position that a complex system is nothing other than the sum of its parts, has largely fallen out of favor and for good reason.  Quantum physics and phenomena such as chaos theory and emergent network effects have effectively closed the book on the notion that truth can be found by setting aside the whole and just analyzing fundamental elements.

Given this preamble, I was recently reminded of a pretty cool algebra trick that I first came across about 15 years ago.  Starting with a = b as a given, this series of simple algebraic manipulations proves that 1 = 2; a principle I wish I could apply to my 401k.

a  =  b
ab  =  b²
ab – a²  =  b² – a²
a(b – a)  =  (b – a)(b + a)
a  =  b + a
a  =  a + a
a  =  2a
1  =  2

Clearly there’s a problem.  For readers who have not seen this and want to solve it, I’ll offer the following assurance without being a spoiler.  There is nothing lame here.  There are no hidden assumptions like this only works if you’re a subatomic particle or if you redefine the symbology of numbers.  The problem is simple and includes everything you need to know.

The frustrating part about this problem is that each individual step from one expression to the next is impeccable in its correctness, but ultimately 1 cannot equal 2.  There is of course an explanation, but it cannot be found solely in the exhaustive analysis of each step.

There is no profound moral to this post; just an observation that connecting dots can be more rewarding than just knowing each dot on a first name basis.  Said another way, if you’re beating your head against a wall because you don’t know where to go, try stepping back and reading the sign.

The Over-Under on Process

As long as there has been a Software Development Life Cycle (SDLC), there have been efforts to devise processes to manage it.  From the excruciating waterfalls of the 1980s (e.g., Mil-Spec 2167), through the OO methodology wars of the 1990s (e.g., OMT, Booch), to the broader processes of the 2000s (e.g., RUP, Agile), these processes have evolved along with technologies and business demands.

Various process aspects may be more or less applicable to a particular reality and they are always adapted in some way from the published baseline.  In my last company, we embraced Scrum as being closest to our sensibilities out of the box.  We then augmented the notion of the Product Owner with multiple feature owners recognizing that no one person can expertly represent the constituencies of market trends, immediate customer requests, and the underlying technical issues.

We also had two teams, Application and Platform, with interdependencies that couldn’t always be resolved within a single 3-week sprint.  So we concocted a process by which each team executed sprints separately; still 3 weeks, but offset by 1 week to give the Platform team a head start.  Pros and cons with this, but that’s another post.

The point is that SDLC management processes along with their human and non-human components form complex systems.  Their selection and adaptation must be performed thoughtfully and nothing substitutes for experience here since to a large degree, human behavior will be the make or break factor.

Fundamental Objectives

Any SDLC process worth implementing must achieve certain fundamental objectives irrespective of the underlying technology, the experience of the team, someone’s favorite textbook, the phase of the moon, or the flavor of the month.  In my view, these are they.

  1. Measurability: To borrow a famous maxim, you can’t manage what you don’t measure. Metrics may vary from one process to another, but fundamentally a well-defined process enables consistent and comparable measurement of activities so that they can be reviewed dispassionately, tuned, and reported.
  2. Repeatability & Predictability: As in most endeavors, practice makes perfect. The more releases, iterations, or other cycles a team executes, the more efficient that team can become, the closer estimates will align with reality, and the more the process itself can be tuned. With each cycle comes a new set of technical challenges. Procedural challenges should trend toward zero.
  3. Visibility & Transparency: One of the fundamentals of forging a team from a group of individuals is providing them with a fully connected view of the broader scope. Up a level, the Engineering department is a member of a team of departments many of which include direct stakeholders. A good process enables a comprehensible view to its inner workings and the impact of external forces, without which accountability will be a scarce resource.
  4. Decision Context: An urgent customer requirement comes in from left field. Can it be accommodated and what may be impacted (e.g., the release date, other tasks, which ones, etc.)? A good process provides a well-understood context for making hard choices without resorting to throwing food. Not everyone may leave happy, but everyone understands how the decision was made, why it was made, and the benefits and costs it carries.
  5. Comprehensibility: The team can’t execute what it can’t understand and none of these objectives will be realized if team members are following significantly different interpretations. The simpler the process, the more likely its compliance will be true to its intent. Furthermore, staff changes are inevitable. Shorter learning curves yield faster capacity availability.

Notice that I omitted rate of delivery and quality.  Clearly these are factors we all endeavor to maximize.  I would argue, however, that to achieve and sustain these without the foregoing is like trying to speed up a poker game by not looking at your cards.

Potential Pathologies

Processes can turn pathological: conditions in which even good qualities are accidentally subverted by being out of balance with other important factors.  Even the most well-meaning process practitioners can find themselves spiraling down the rabbit hole.  Here are a few of my favorites.

  1. Responsibility Transference: Now that we have a process, why burden ourselves with common sense? Processes are like any other system with many moving parts; they need to be initially debugged and then tuned over time. They should never be assumed to be so perfect that the brains of the participants can be disabled. This is like blindly coding to a specification even when errors are suspected assuming that the spec writers must have known what they were doing.
  2. Rigor Mortis: Can’t – move – process – not – letting me. When the house is on fire, don’t wait for a ruling on procedure; just grab a hose. There’s a fine line between adhering to the process and elevating it beyond the product. The process is a tool to meet objectives; it is not the objective in and of itself. Similar to the previous, there are times when common sense really does need to prevail with a logjam review to follow.
  3. Exception Domination: An estimated 60-70% of most source code goes to handling exceptions leaving the minority for primary functionality. An SDLC process rarely anticipates every odd circumstance. If it does, it probably has so many paths as to be incomprehensible to those trying to execute it. Unlike CPU-executed software, missing process paths are a good tradeoff for simplicity. Human collaboration can fill in the gaps.
  4. Illusion of Competence: Certifications such as ISO-9000 and SEI-CMM can be useful when properly applied. Their principles embody years of best practices and refinements. However, these are process certifications, not product certifications. A software development shop can be CMM Level-5 and still produce junk. It is not uncommon, for example, to find offshore shops touting these credentials having only been in business for a year – run for the hills. These are cases where more energy is spent looking like a world-class operation rather than being one.
  5. Numerous Definitions of Done: Is it done? Yes; well, except for testing. Is it done? Yes; um, it just needs to be reviewed and there’s that other thing. Is it done? Yes. Great, so the press release can go out? Well, no it’s being held back a release so we can stress test it some more. The use of the word “done” should be outlawed until its unambiguous definition is signed up for in blood by every member of the team. I have a theory that more project management frustration stems from the misuse of this word than any other singular cause. Done is definitely a 4-letter word.

Summary

Process, good.  Process plus people using their brains and talking to each other, better.  Done.

On This Thanksgiving

Yes, I’m a Star Trek fan.  I don’t play with phasers or go to conventions, but I like the characters, the settings, the cool often physics-defying technology, and the underlying theme of hope for the human race.  I watched one of the movies yesterday.  I’ll leave it to the trekkies to figure out which, but most of it takes place on Earth after traveling back in time to the year 2063.  From there a reference is made that within 50 years, all hunger, poverty, and war will be wiped out as the human race becomes united in a way it never thought possible.

I am reflecting on this hopeful future as this week of Thanksgiving has turned to one of mourning and remembrance of nearly 200 fallen in Mumbai.  They were of different nationalities, different genders, different ages, and different faiths, but had in common that they were brutally slaughtered by sociopaths.

Their murderers were not insane madmen.  They decided, planned, prepared, coordinated, and acted.  They not only embodied barbarism and hatred, but wasted potential.  We, they, and their victims share over 99.99% of the same DNA.  We, they, and their victims are all us.  How can we as a species transcend this incredible nonsense?  Can we ever, or is it just an inescapable consequence of billions of people crowded together with the various imbalances of human nature?

According to the Star Trek timeline, we have about 100 years to make good; less than one year per Mumbai victim.  What would it really take?  The human race has been at it for 100,000 years, so it would seem that normal evolution won’t cut it in the next 100.  It would take something truly disruptive, so global in scope, so profound in meaning, so undeniable in faith, that the remaining cells of these warped strains would be too small to be biologically viable and become in essence an endangered species.

Can this happen in the next century?  With all my heart I hope yes.  However, history says no.  Either way, I’ll be thinking good thoughts for the families and friends of those so sadly taken on this Thanksgiving.

When Antivirus = Virus

Security is not convenient.  Anyone who says differently is probably trying to sell you security products.  That being said, antivirus programs have been around just about as long as viruses.  You would think by now the major brands would have cracked the code.  Now I’m not talking about the daunting task of keeping up with new strains or heuristically predicting that a given byte sequence might be an unidentified threat.  I’m referring to basic usability and not bringing my computer to its knees.

While I am a professional and fairly accomplished technologist, I am also a user and sometimes I want to be just that; a user.  I use computers to get things done.  I don’t enjoy whiling away the hours dissecting, maintaining, or fixing them… which is how I spent a chunk of my yesterday.

Let me set the stage for my rant.  I have a reasonably good PC running Windows XP SP3 on a 3.2GHz Pentium with 1GB of RAM and an endless sea of disk space.  It’s definitely not state of the art, but it’s not so bad that it deserves to be punished by my antivirus software, which is the latest supported version and is fully up-to-date.

As to the offending software, I should probably not name it so as to avoid a slander suit.  However, it begins with “M” and rhymes loosely with Hack-A-Fee, Lack-A-Key, Smack-N-See and Sack-N-Free.

Antivirus is a Virus When…

  1. It actually causes pop-ups.  It seems like every couple weeks, I was getting a pop-up from the system tray offering me new products or urgently urging me to extend my subscription for a low, low price.  My subscription was good for another 6 months, but out of fear of exposure I had to either take the time to look that up or mindlessly succumb!  Aren’t pop-ups one of those intrusions to be prevented?  It’s like saying torture is evil, but the U.S. can do it because we’re good and not like those other guys.  Now that I’ve uninstalled this thing, maybe I’ll send it to Guantanamo.
  2. It fights to prevent you from doing basic things.  A while back, I set up my PC and another laptop as a home network to share a printer and move files around.  Windows, for all its faults, actually makes this very easy now, even for the average user.  After following the simple steps, my laptop couldn’t see my PC.  Realizing it was probably the product’s firewall, I could have switched back to the Windows firewall, but now I was on a mission.  I started checking for the switch that enables home networking, assuming that non-technologists also use this thing.  Nothing.  I had to check the firewall’s intrusion logs to find that requests to certain ports had been blocked within the last few minutes.  After going online to find out the common use of these ports, wouldn’t you know they are the ports typically used by Windows for home networking?  Going into the advanced firewall settings, I opened these ports, and my problem along with my patience dissolved.  I mean, who wouldn’t have figured that out.
  3. It causes your PC to gasp for air.  There’s really nothing quite like waiting 2 minutes to open a file or a browser link.  During that newly realized free time, one can make coffee, learn a foreign language, or watch all the snails zipping by in the proverbial passing lane.  I assumed it was performing realtime scanning, although what scan takes 2 minutes on a 30K document I’ll never know.  I systematically disabled each protection feature to find the culprit, but to no avail.  Only when I removed the program en masse did my computer rise from the dead, which tells me that it injected something beyond user control; something un-good.
  4. It must be forcibly removed to shut it down.  As I mentioned when troubleshooting my giant bucket of slow, I tried disabling each feature individually.  As far as I could tell after probing every menu, option, and orifice, there was no master switch akin to “turn this damn thing off”.  I actually had to uninstall the entire application, a choice about which I was un-conflicted by this time but still…  When you have to force your guests to leave your house at gunpoint to get them to stop breaking dishes, it’s time to rethink your guest list.

End of Rant

I’m a life-long technologist and technocrat.  I am fully capable of resolving any of the issues cited herein, but I don’t have time to waste on something that is now so pervasive and so basic to every personal computer whatever the flavor.

I ultimately did solve my problem by sending the offending software on its way to the harbor and replacing it with a choice that will remain nameless.  By the way, one of my favorite actors is Ed Norton, but he too may one day be replaced by lesser known understudies.

I realize that security requires diligence and diligence requires time.  As a CTO, I spend my days and nights worrying deeply about security on many levels.  But as a PC user, I don’t want to waste precious minutes of my life thinking about antivirus any more than I would want to throw a dinner party to discuss printer drivers.

The Redundancy Principle

Architecting complex systems includes the pursuit of “ilities”: qualities that transcend functional requirements, such as scalability, extensibility, reliability, maintainability, and availability.  Performance and security are included as honorary “ilities” since, aside from being suffix-challenged, they live in the same family of “critical real-world system qualities other than functionality”.  The urge to include “beer-flavored” took a lot to conquer.

Reliability, maintainability, and availability have some overlap.  For example, most would agree that availability is a key aspect of reliability in addition to repeatable functional correctness.  Similarly, a highly maintainable system is not only one that is composed of easily replaceable commodity parts, but one that can be serviced while remaining available.

As an architect, designing for availability can be great fun.  It’s like a chess game where you have a set of pieces, in many cases multiples of the same kinds.  Your opponent is a set of failure modes.  You know that in combating these failures, pieces will be lost or sacrificed, but if well played, the game continues.

We [Don’t] Interrupt this Broadcast

Every component in a system is subject to failure.  Hardware components like servers and disk drives carry MTBF (mean time between failures) specifications.  Communication media and external services are essentially compositions of components that can fail.  Even software modules may be subject to latent defects, memory leaks, or other unstable states, however statistically rare.  Even the steel on a battleship rusts.  Failures cannot be avoided.  They can, however, be tolerated.

The single most effective weapon in the architect’s availability arsenal is redundancy.  Every high availability system incorporates redundancy in some way, shape, or form.

  • The aging U.S. national power grid provides remarkable uptime to the average household in spite of a desperately needed overhaul. At my house, electrical availability exceeds the IT-coveted five nines (i.e., 99.999%), and most outages can be traced to the local last mile. (A quick bit of downtime arithmetic follows this list.)
  • The U.S. Department of Defense almost always contracts with dual sources for the manufacturing of weapon systems and typically on separate coasts in an attempt to survive disasters, natural or not.
  • The Global Positioning System comprises 27 satellites; 24 operational plus 3 redundant spares. The satellites are arranged such that a GPS receiver can “see” at least 4 of them at any point on earth. However, only 3 are minimally required to determine position albeit with less accuracy.
  • Even the smallest private aircraft have magnetos; essentially small alternators that generate just enough energy to keep spark plugs firing in case an alternator failure causes the battery to drain. Having experienced this particular failure mode as a pilot, I was happy indeed that this redundancy kept my engine available to its user.
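
As a quick aside on what “five nines” implies, the downtime budget arithmetic is simple; the figures below are just that arithmetic, not measurements of anything.

    // Downtime budget per year = minutes in a year * (1 - availability).
    public final class Nines {
        public static void main(String[] args) {
            double minutesPerYear = 365.25 * 24 * 60;
            for (double availability : new double[] {0.99, 0.999, 0.9999, 0.99999}) {
                System.out.printf("%.3f%% -> %.1f minutes of downtime per year%n",
                        availability * 100, minutesPerYear * (1 - availability));
            }
            // 99.999% works out to roughly 5.3 minutes of downtime per year.
        }
    }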

Returning to the more grounded world of IT, redundancy can occur at many levels.  Disk drives and power supplies have among the highest failure rates of internal components, hence the RAID technology and dual power supply modules found in many servers and other devices.  Networks can be designed to enable redundant LAN paths among servers.  Servers can be clustered, assuming their applications have been designed accordingly.  Devices such as switches, firewalls, and load balancers can be paired for automatic failover.  The WAN can include multiple geographically disparate hosting sites.

Drawing the Line

The appropriate level of redundancy in any system reduces to an economic decision.  By definition, any expenses incurred to achieve redundancy are in excess of those required to deliver required functionality.  In some cases, though, redundant resources used to increase availability may provide ancillary benefits (e.g., a server cluster can increase both availability and throughput).

Redundancy decisions really begin as traditional risk analyses.  Consider the events to be addressed (e.g., an entire site going down, certain capabilities being unavailable, or a specific application becoming inaccessible, each for some period of time).  Then determine the failure modes that can cause these conditions (e.g., a server locking up, a firewall going down, a lightning strike hitting the building).  Finally, consider the cost of each of these events as a function of its impact (e.g., lost revenue, SLA penalties, emergency maintenance, bad press) and the probabilities of its failure modes actually occurring.  The cost of the redundancy needed to tolerate these failure modes can now be weighed dispassionately against its value.
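
In other words, the decision is the familiar expected-value comparison.  A toy sketch, with entirely made-up numbers:

    // Toy sketch of the risk arithmetic: compare the expected annual cost of a failure event
    // against the annualized cost of the redundancy that would tolerate it. Figures are invented.
    public final class RedundancyDecision {
        public static void main(String[] args) {
            double probabilityPerYear = 0.2;       // chance the failure mode occurs in a given year
            double impactCost = 250_000;           // lost revenue, SLA penalties, emergency maintenance
            double redundancyCostPerYear = 30_000; // extra hardware, licenses, operational overhead

            double expectedAnnualLoss = probabilityPerYear * impactCost; // 50,000
            System.out.println("Expected annual loss: " + expectedAnnualLoss);
            System.out.println("Redundancy worth it:  " + (redundancyCostPerYear < expectedAnnualLoss));
        }
    }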

As technologists, our purist hearts want to build the indestructible system.  Capture my bishops and rooks and my crusading knights will continue processing transactions.  However, the cost-benefit tradeoff drives the inexorable move from pure to real.

The good news is that many forms of redundancy within the data center are inexpensive or at least very reasonable these days given the commoditization of hardware and the pervasiveness of the redundancy principle.  Furthermore, if economics keeps you from realizing total redundancy, do not be disheartened.  We’re all currently subject to the upper bound that we live on only one planet.