On Eliminating Error in Distributed Software Systems


This article is the companion to my article On Embracing Error in Distributed Software Systems and expands on my talk What Lies Between: The Challenges of Operationalising Microservices from QCon London 2019.

In designing, developing, operating, and evolving distributed software systems, failure must be considered intrinsic to the system. In any sufficiently large or complex system, there will always be something that has failed. Engineering for failure is essential. Systems must be responsive, resilient, reliable, and safe in the event of failure, especially at scale. What must be avoided is allowing the failure to become catastrophic.

Engineering for failure is particularly important, but also especially challenging, when elements of these systems are distributed in space—like messaging across services, or even vast geographies—or in time—like batching or buffering messages, and processing them at a later time, or on different schedules. It is also very challenging when we do not own or control parts of these systems, or when these systems interact with the physical world, like in manufacturing, industrial automation, critical infrastructure, the Internet of Things (IoT), or even consumer-facing applications that are connected to the Internet.

In this article, I will review some common techniques used to eliminate errors in software systems: testing, the type system, functional programming, and formal verification. I will explore the strengths of these techniques, but I will also highlight their limitations and explain why they are not jointly sufficient for eliminating error in distributed software systems.

Testing

There are many types of software tests—unit tests, integration tests, regression tests, system tests, acceptance tests, performance tests, scalability tests, security tests, compliance tests—to name just a few. Testing in production environments is effective for measuring service-level objectives and for detecting errors, failures, and anomalous conditions. Testing in production includes evaluating canary deployments; instrumentation; log analysis; container-level and service-level health checks; or even simulators that create synthetic requests to regularly exercise core aspects of the system. As I mentioned in my article Observations on Observability, Design of Experiments can be effective for gaining insights into the dynamic relationships among the state variables of the system and for evolving the system to more optimal operating conditions.

In the first few years of this blog, I wrote a number of articles on testing. It is important to have an effective balance of unit tests and end-to-end tests for supporting development activities and for reliably evolving and extending software systems, over time, through refactoring. Tests are also valuable as a specification or as documentation when collaborating with colleagues.[1] Techniques like fuzz testing can be used to find defects empirically. I believe one of the most important things about testing is that it allows us to independently validate our assumptions about the system by testing different hypotheses, experimenting, and learning, using the scientific method. This can fuel our creativity and inform development and operational objectives, allowing us to measure our performance against them.[2]

Tests are invaluable for reducing errors in software systems, especially during development, or when validating critical behaviours of the system through a small number of end-to-end tests, even in production environments. However, putting too much emphasis on testing can be detrimental, especially when compared to the advantages of leveraging the type system and functional programming. It is impossible to rely on testing to eliminate errors. A test cannot show an absence of error. A test simply proves that a set of conditions was true for a specific execution of the test—and this assumes the test itself is free of defects. For a sufficiently complex system, it is impossible to enumerate all negative test-cases, security test-cases, performance test-cases, or scalability test-cases. Furthermore, most systems evolve organically, over time, resulting in emergent and untested behaviours. Even from an organizational perspective, testing can slow progress and reduce quality. For example, if one team relies on another team for software testing and the relationship is poorly managed, the long feedback cycles can impair quality and amplify errors, rather than reduce them.

The Type System

Static typing eliminates errors by identifying them at compile time, rather than letting them manifest as run-time failures. How effective static typing is at eliminating errors, however, remains unclear. There are studies showing that statically-typed languages eliminate a significant number of errors, but if you want to argue that dynamically-typed languages are equally effective, you can also find studies supporting that thesis.

I have a strong preference for statically-typed languages.[3] I question if the studies supportive of dynamic typing focused too much on the errors encountered when implementing a brand-new feature, or a trivial reference feature, rather than on evolving a large, complex software system, with many dependencies, and with many different people contributing to it, over many years. Most importantly, I question if these studies, directly or indirectly, focused too much on comparing an implementation in a dynamically-typed language to an equivalent implementation in a statically-typed language. The type system has so much more to offer than simply implementing the equivalent program with fundamental types. For example, if you only use strings, especially when these strings can also be null or empty, it is hard to even say what a type error is, or understand the full extent of what can be specified with types. The same can be said of using sentinel values, like -1, null, or empty, to handle special conditions, rather than using types that actually model the domain.[4]
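
As a minimal sketch of the sentinel problem, in Scala, with a contrived sensor domain invented for illustration, compare hiding a special condition in a magic value with modelling it as data:

// A sentinel: every caller must remember that -1.0 means "sensor offline".
def currentTemperatureSentinel(): Double = -1.0

// Modelling the domain: the special condition is part of the type, and the
// compiler can insist that every caller handles it.
sealed trait Reading
final case class Temperature(celsius: Double) extends Reading
case object SensorOffline extends Reading

// A stub implementation, for the purposes of the sketch.
def currentTemperature(): Reading = SensorOffline

A pattern match on Reading that ignores SensorOffline produces a compiler warning, whereas a forgotten check for -1.0 only surfaces at run-time.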

The type system becomes much more advantageous when you move beyond fundamental types and start using more expressive types—like refinement types to constrain data when accepting input; option types to encode errors as data, instead of returning null or throwing exceptions; dependent types for collections that include their size within the type itself; or path-dependent types to prevent the mixing of specific instances of a type.
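
Of these, path-dependent types may be the least familiar. The following rough sketch, with an invented Exchange and Order, shows how they prevent an order created for one instance from being submitted to another:

class Exchange(val name: String) {
  // Order is a path-dependent type: each Exchange instance has its own Order type.
  final case class Order(symbol: String, quantity: Int)
  def submit(order: Order): Unit = println(s"$name accepted $order")
}

val nyse = new Exchange("NYSE")
val lse = new Exchange("LSE")

nyse.submit(nyse.Order("ACME", 100)) // Compiles
// nyse.submit(lse.Order("ACME", 100)) // Will not compile: lse.Order is not nyse.Order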

Returning to strings as types, consider the following example, in Scala, where an identifier is characterized by a three-character, lower-case string. Rather than indiscriminately assigning the identifier to a string, a refinement type is used to explicitly restrict the type to only a three-character, lower-case string.

// The refinement predicates and refineMV come from the refined library (eu.timepit.refined); imports are elided here.
type Id = Size[Equal[W.`3`.T]] And Forall[LowerCase]
val id = refineMV[Id]("abc") // Compiles: "abc" is exactly three lower-case characters

With an upper-case string, the program will not compile; similarly for a string that is too long or too short.

val uppercaseId = refineMV[Id]("ABC") // Will not compile
val tooLongId = refineMV[Id]("lmnop") // Will not compile
val tooShortId = refineMV[Id]("jk") // Will not compile

This significantly reduces the state space of the program, since there is no need to consider the behaviours for a null string, an empty string, an upper-case string, or a string longer or shorter than three characters. There is also no point in writing tests for these possibilities, since they will not compile.

Of course, not all variables can be assigned statically. More commonly, our programs must deal with dynamic assignment when input comes from messages, files, or computations. But a type can also be refined dynamically, at run-time. The following example refines a string into either a valid identifier or a None that must be handled by the caller.

import eu.timepit.refined.{refineV, W}
import eu.timepit.refined.boolean.And
import eu.timepit.refined.char.LowerCase
import eu.timepit.refined.collection.{Forall, Size}
import eu.timepit.refined.generic.Equal

object Id {
  type Id = String
  def refine(x: String): Option[Id] =
    refineV[Size[Equal[W.`3`.T]] And Forall[LowerCase]](x).right.toOption.map(_.value)
}

import org.scalatest.{FlatSpec, Matchers}

class IdSpec extends FlatSpec with Matchers {
  "abc" should "be a valid Id" in {
    val id = Id.refine("abc")
    id shouldEqual Some("abc")
  }

  "An uppercase Id" should "return None" in {
    val id = Id.refine("ABC")
    id shouldEqual None
  }

  "An Id that is too long" should "return None" in {
    val id = Id.refine("lmnop")
    id shouldEqual None
  }

  "An Id that is too short" should "return None" in {
    val id = Id.refine("jk")
    id shouldEqual None
  }
}

In software systems that interact with the physical world—like operational technologies in manufacturing or IoT—many errors are the result of inputs, from trusted or untrusted sources, that lead to undefined behaviours. Using the type system to provide value checks, establish trust, and limit the state space can be extremely important for informational and physical security, especially when inputs cross a trust boundary. When messaging among dependent services or systems, providing a strong schema for messages is a way of bridging the type system with the messaging layer to eliminate errors and build more reliable systems. This includes using tools like Protocol Buffers or gRPC, rather than passing around the ubiquitous CSV, JSON, or YAML strings that are far too popular in far too many systems.
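
As a rough sketch of checking values at a trust boundary, with an invented Command type and wire format for illustration, untrusted input can be parsed into a typed message once, at the edge, so that the rest of the system never handles raw strings:

import scala.util.Try

sealed trait Command
final case class SetTemperature(celsius: Double) extends Command
case object Shutdown extends Command

// Parse an untrusted string into a typed command, or an error to be handled explicitly.
def decode(raw: String): Either[String, Command] = raw.split(':') match {
  case Array("set-temperature", value) =>
    Try(value.toDouble).toOption.toRight(s"invalid temperature: $value").map(SetTemperature)
  case Array("shutdown") => Right(Shutdown)
  case _ => Left(s"unrecognized message: $raw")
}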

As powerful as a high-level type system is, the type system alone cannot address the dynamics or uncertainty in exchanging messages with services that are separated in space and time. Messages can be delayed, lost, reordered, or ignored. Furthermore, while some languages support compile-time checks for memory safety and memory leaks, there is certainly no type system that will prevent a program from allocating too much memory, or querying too much data, and running out of memory. Similarly, the type system cannot prevent a liveness probe that fails due to run-time dynamics from causing the control plane to repeatedly terminate an application container.[5]

Functional Programming

In essence, functional programming is programming with functions—functions in the mathematical sense, mapping a single input to a single output—rather than programming imperatively, using procedures that may have side-effects. Functions should be total, meaning they return an output for every input; deterministic, in that they always return the same output for the same input; and pure, in that their only effect is on computing the output, rather than manipulating internal or external state. Functional programming focuses on immutability, which means variables and expressions cannot change after they have been declared or assigned. Functions are composed to construct useful programs, rather than relying on object-oriented techniques like encapsulation or inheritance.
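
As a minimal sketch of these properties, with contrived functions for illustration, compare a procedure that reads the system clock and writes to standard output with a function that is total, deterministic, and pure:

// Neither deterministic nor pure: the result depends on the current year, and
// there is a side-effect on standard output.
def ageImpure(birthYear: Int): Int = {
  val age = java.time.Year.now.getValue - birthYear
  println(s"computed age: $age")
  age
}

// Total, deterministic, and pure: the output depends only on the inputs.
def age(currentYear: Int, birthYear: Int): Int = currentYear - birthYear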

Functional programming can be very effective in eliminating errors. Using immutable variables and data structures eliminates many errors and race conditions, and in some cases even the need for locks, since it is impossible for two threads to concurrently modify an immutable data-structure. It can also reduce cognitive load, because when you reference a variable, you do not need to understand everywhere in the system that it can be modified. Small, pure functions, with simple input-output models, are easy to reason about. Just like in modelling physical systems—electrical systems, mechanical systems, chemical systems—simple input-output models can then be composed to provide higher-level abstractions that describe more complex behaviours.
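
Returning to immutability, the following minimal sketch, with an invented Account type, shows immutable update: every change produces a new value, so a thread holding a reference to the original can never observe a partially-updated account, and no lock is required to read it.

final case class Account(owner: String, balance: BigDecimal) {
  // Rather than mutating this account, return a new one with the updated balance.
  def deposit(amount: BigDecimal): Account = copy(balance = balance + amount)
}

val before = Account("alice", BigDecimal(100))
val after = before.deposit(BigDecimal(25)) // A new value; `before` is unchanged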

Testing and the type system are both very complementary to functional programming. A function that is total, deterministic, and pure is much easier to test, since such a function will never throw exceptions, fail to return (e.g., by entering an infinite loop), or modify external state. The type system can restrict the number of tests that are required, since it can significantly restrict the number of inputs to a function and outputs from a function. For example, as we have already seen, option types can be useful for ensuring a function does not return a null value. The type system can also be leveraged to support property-based testing for eliminating errors.
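
As a rough sketch of property-based testing with ScalaCheck, assuming the Id.refine function from the previous section is in scope, a single property can assert, over arbitrarily generated strings, that refinement succeeds exactly when the input is a three-character, lower-case string:

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

object IdProperties extends Properties("Id") {
  // ScalaCheck generates many arbitrary strings and checks the property for each.
  property("refine accepts exactly three lower-case characters") =
    forAll { (s: String) =>
      Id.refine(s).isDefined == (s.length == 3 && s.forall(_.isLower))
    }
}

A property like this covers the negative cases from the earlier example-based tests, and many more, across a large number of generated inputs.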

Consider the following example, again in Scala, with a simple algebraic data type that defines two states, On and Off. The function isOn returns a Boolean mapping On to true and Off to false.

sealed trait State
case object On extends State
case object Off extends State

def isOn(x: State): Boolean = x match {
  case On => true
  case Off => false
}

This function is total, because it returns a value for each input, which is aided by the sealed trait, since the compiler will generate a warning if the match statement does not handle all of the states.[6] This function is deterministic, since the same input always results in the same output. It is also pure, as there are no side-effects. With the combination of its simple input-output model, and the type system restricting the possibilities to only two inputs and only two outputs, it is easy to test this function, exhaustively, using only two tests.

"isOn" should "return true for On" in {
  isOn(On) shouldBe true
}

"isOn" should "return false for Off" in {
  isOn(Off) shouldBe false
}

I recently reviewed changes to an application in a modern and popular, but non-functional, programming language. The changes were related to a race condition in resetting session state. The problem was a result of mutating global state. In some cases, the session needed to be completely reset, in other cases, only a subset of the session state needed to be reset. The issue was "resolved" by modifying the method that reset the session state from reset(data) to reset(data, true). In other words, by adding a Boolean to indicate when the session should be partially reset, instead of fully reset.

This approach made me cringe. I found it very challenging to review, since I have not reviewed code like this in years. Where else was this session state being mutated or read? Some people enjoy these mental gymnastics, delighting in understanding all of the internal intricacies of the system. I certainly did at one point in time, but it is largely unnecessary and I have seen more than enough errors from this approach to ever want to return to this style of programming.[7] It may not be long until another data race is found, or some new functionality is required, and this method evolves to reset(data, true, false, true).[8]
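
A functional alternative, sketched below with an invented Session type, would model the scope of the reset as an explicit type and return a new session, rather than mutating shared state and overloading a Boolean flag:

final case class Session(user: String, preferences: Map[String, String], cart: List[String])

sealed trait ResetScope
case object FullReset extends ResetScope
case object PartialReset extends ResetScope

// Return a new session, rather than mutating the existing one in place.
def reset(session: Session, scope: ResetScope): Session = scope match {
  case FullReset => Session(session.user, Map.empty, List.empty)
  case PartialReset => session.copy(cart = List.empty)
}

The compiler then insists that every caller state which kind of reset it wants, and a third kind of reset becomes a new case, rather than another Boolean.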

Functional programming may be the most effective approach for eliminating errors in software systems, especially when combined with a rich type-system and effective testing. It is also compelling for eliminating errors when composing smaller entities into larger ones. But as powerful as it is, similar to the type system, functional programming cannot guard against errors that result from system dynamics—like reading too many messages and running out of memory, overwhelming another system with messages, head-of-line blocking when processing messages from a work queue, or failures caused by network partitions.

Formal Verification

Formal verification is the discipline of proving program logic correct through static analysis, aided by tools and mathematics. As we have already seen, the type system can be more specific than simply ensuring an integer is not inadvertently assigned to a string. It can be used for something as specific as statically verifying that an identifier is restricted to a three-character, lower-case string. Formal verification goes beyond this and can be used to prove that a program never enters an unwanted state, like an infinite loop.

I am intrigued by formal verification—it is certainly something I want to learn more about—but unlike testing, the type system, and functional programming, I am not a practitioner of formal verification. In fact, for most of the code that I encounter, I wonder what place formal verification has. For example, consider the isOn function from the previous section. What does formal verification offer that testing and the type system do not? This example may seem overly simplistic, but for most of the programs that I write, my goal is for them to decompose into something this simple.

While I have not practised formal verification myself, I certainly appreciate when the consensus protocol used by a distributed system that I rely on has been formally verified, and I can see how formal verification is valuable when composing many smaller entities into larger systems, ensuring that the resulting behaviours are desirable.

We gain a much better understanding of algorithms by trying to formalize them. When implementing, you naturally focus on the common case because that’s what you want to optimize. Formal verification pushes you to consider the rare edge cases.
Dr Martin Kleppmann

Formal verification is a challenging, academic, error-prone, and time-consuming discipline. It is not widely used outside of academia or safety-critical industries—like healthcare, transportation, and critical infrastructure—certainly not when compared to testing, types, and functional programming. For a given problem, formal verification requires defining the correct assertions and selecting the correct algorithms. If the assumptions of the model are not correct, then all you have is a mechanized proof that tells you nothing about the system. Formal verification is particularly challenging in distributed software systems where there is non-determinism, like dropped messages, retries, network partitions, and failed services.

Those proofs turned out to be wrong due to assumptions which did not hold in all circumstances. Some proofs were even mechanized, but if you have the wrong assumptions then a mechanized proof does not help you.
Dr Martin Kleppmann

I imagine a strong type-system aids formal verification, since things like algebraic data types, refinement types, and dependent types, can limit the size of the state space, making formal verification more tractable. I imagine the same is true for functional programming, with its focus on immutability, input-output models, and the elimination of side effects.

If, or when, formal verification becomes a more accessible discipline, it would be great to rely on it to model everything about a system and verify that it is correct. But even so, model checkers do not scale well to large state-spaces.

The myth of formally verified systems: model checkers don't scale well to large state-spaces
Peter Alvaro

Guarantees do not necessarily compose into systems. None of the techniques that I have explored in this article compose cleanly; in a sufficiently large or complex system, it is difficult to carry the guarantees of the individual parts through to guarantees about the whole. For example, it is impossible to statically type, formally verify, or test all of Google.

Beware of bugs in the above code; I have only proved it correct, not tried it.
Donald Knuth

Distributed software systems are also designed, developed, operated, and maintained by people. Likewise, people will interact with the products and services these software systems support. This means we must include people as part of the system and incorporate sociotechnical and human-factor considerations in our quest to eliminate errors that lead to systemic failures. It does not matter how well tested, statically type-checked, purely functional, or formally verified a system is, if someone mistakenly, or maliciously, deploys the wrong version of the software to a production environment, or makes a consequential operational decision after misinterpreting a key signal.

How I View These Tools

While these tools are imperfect for eliminating error in distributed software systems, and while the guarantees they provide individually do not necessarily compose, they do have their strengths and they do help construct more reliable systems. I view these tools for eliminating error and preventing failure as follows:

  • Testing: Testing supports the process of software development. Tests provide independent verification of behaviours and contracts. Tests can serve as documentation or a specification. Tests support learning, creativity, and product development, through experimentation.
  • Types: Types provide compile-time safety. Static typing eliminates a host of errors by making these errors impossible. This is especially true when using advanced types, rather than just fundamental types. Types reduce the need for tests, by reducing the state-space, and can simplify code by modelling the domain in the type, rather than in program statements.
  • Functional Programming: The strengths of functional programming are input-output models, immutability, composition, and treating errors as data. The type system supports functional programming, and functional programming supports testability.
  • Formal Verification: Formal verification should be used for validating logic and algorithms. It is especially valuable for complex or safety-critical systems, but also for the foundational elements of systems, like state machines or consensus protocols.

I want all of these tools at my disposal for building sophisticated and reliable distributed software systems, but three things are missing: the first is embracing failure as part of the model, rather than trying to eliminate it; the second is handling run-time system dynamics, especially in complex systems; the third is considering the people that develop, operate, and rely on these systems as part of the system itself. I will explore approaches for handling these challenges in a subsequent article.

Thanks to Adam Kafka for his input on this article.


  1. Tests can be especially important for documentation and specification in dynamically-typed languages. ↩︎

  2. See my article Test-Driven Career Development where I expand on this idea. ↩︎

  3. In my career, I have used C, C++, C#, and Scala—all statically-typed languages. Over the past four years, I have been particularly enjoying the rich type-system that Scala has to offer. I still have much to learn about type systems, but I have no desire to ever work in a language that does not have an equally rich type-system. ↩︎

  4. See the article License Plate "NULL" for an amusing story about using the string "NULL" as a sentinel value. ↩︎

  5. For the non-intuitive dangers of using container-level probes, see my series Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot. ↩︎

  6. I always use the -Xfatal-warnings Scala compiler option to turn these warnings into fatal compilation errors. ↩︎

  7. This bug reminded me so much of my article The Most Difficult Bug I Have Encountered. ↩︎

  8. This Tweet is timely. ↩︎