Distributed Transactions

#transactions #distributed transactions #two-phase commit #transactionless #protocol design #contracts design #tradeoffs #architecture

Maciej Chałapuk

2015-11-24

Recently, I had few email conversations with past job coleague of mine. The most challenging one was on the topic of distributed transactions. It turns out we both had some experience with them, but our opinions were rather different. As this topic is pretty hot these days, I decided to write down our insights in this article. The goal is to list disadvantages of distributed transactions, show how they are overused, provide an alternative approach, and specify conditions under which each technique may be applied.

Why Not?

Software developers are attracted to distributed transactions because they provide all‑or‑nothing solution across multiple database resources. Standardized interfaces of a distributed resource (XA) and standardized transactional APIs (JTA, repoze) greatly reduce the time-cost of applying this technique. It results in less code written, because domain logic is free of transaction-related code. On the other hand, there are many downsides to distributed transactions.

Two Phase Commit¹ lowers maximum throughput of the system. The more nodes are involved in the transaction, the lower throughput is achievable. Cost of needed hardware is increased.
Risk of contention grows with number of users. More contention affects performance and limits scalability to a point where performance becomes unacceptably low. Performance problems can be resolved (in retrospective) by optimizing the data model and re-designing the transactions, but it means that a system built using distributed transactions will never be scale-agnostic².
All-or-nothing availability. If one resource inside a transaction is unavailable, whole operation must fail. Partial availability of the system is not possible. What's more, availability of a single operation is inversely proportional to number of resources at use in the transaction. A system containing 5 resources, each 99.999 % available, will be availabile pow(99.999, 5) ~= 99.995 % of the time.
All resources must implement the same transactional API. If a resource don't natively implement required interface, an adapter, which implements it, must be added to the system. This increases overall system complexity because of additional dependency added.
Distributed transaction is just a technical candy. The concept is based on the assumption that transactions⁴ between subsystems are a cross-cutting concern (like logging or concurrency), while in true they should be treated as parts of program's domain logic. Generic implementation of a distributed transaction is strictly technical and very complicated. Thinking about errors in the second phase of 2PC is hard and requires detailed knowledge of the commit protocol⁵, while transaction implemented as part of entity's interface is always simple.

Transaction⁴ Inside Contract

Let's say we have to model a newsletter subscription system with prizes for first 10 subscribers. Messages must be sent to winners during successful subscription. We have two resources to be used inside a distributed transaction—a database for storing subscriber entities and a mail server for sending emails to subscribers. Each time a subscriber is added to the database, distributed transaction must be set up. After adding subscriber, we have to count all subscribers added so far. If there is less than or equal 10, add an email to send in the transaction. If all is well, commit.

The relationship between subscriber entities and sent emails, which is an essential part of the model, gets marginalized when using distributed transactions (hidden inside a technical solution). All ACID properties are kept, yet not all of them are required across all resources. Next paragraph describe alternative model, wich doesn't use distributed transactions.

After checking the database for a winner, we can store another entity that represents an email. Then write a simple daemon that periodically checks the database for unsent emails, send them and mark them in database as sent. Done. Operations are now asynchronous, transactions are explicitly reflected in the contracts between subsystems. In case of an error occuring between sending and marking, message will be sent second time (in the next batch), which is acceptable. No distributed transactions needed.

A drawing of envelope and a database field list. Fields listen in white: id, subscriber_id, subject, message. Green color and plus sign suggests that another field is being added: is_sent.

Tradeoffs

Starting a green-field project is the best time to consider abandoning distributed transactions technique. Situation is less obvious when reality imposes many constrains. Following questions may be helpful in making a decision in this matter.

Are all ACID properties needed accross distributed resources?
What is the probability of an error in the second phase of 2PC?
Is an inconsistency after an error in 2PC acceptable?
What is the cost of manually removing such an inconsistency?
What is the cost of automatically removing such an inconsistency?
What are the requirements for availability?
What are the requirements for scalability?
How much of operations consist of communication with legacy systems?

Very often the most significant factor will be the time-cost comparison between distributed-transaction-based solution (containing cost of hardware) and solution with transactions inside contracts.

Dedicated solutions are often better than general ones, but since they are dedicated, they may be not compatible with legacy systems. Much may depend on company profile and policies. Start-ups must be innovative and will tend towards dedicated solutions. Corporations seek to standardize things, so they prefer vastly used and generally accepted technology.

References

"Distributed Transactions are Evil" and "Avoiding Distributed Transactions" pages on Cunningham & Cunningham wiki touch many problems connected with the topic.
Gregor Hohpe (ToughWorks) wrote a paper titled "Your Coffee Shop Doesn’t Use Two-Phase Commit", in which he states that real-world workflows are often asynchronuous, concurrent and simple. He suggests that information systems should be modeled the same way.
In white paper titled "Life beyond Distributed Transactions" Pat Helland (Amazon) describes the impact that distributed transactions have on scalability.
Kevin Hoffman, in his article on distributed transactions in microservice environment, suggests that the need of distributed transactions may indicate an violation of SRP.
In his article on 2PC, Dan Pritchett writes about how distributed transactions affect performance and availability. He also provides a simple alternative.
On his personal website, Martin Fowler (ToughWorks) wrote a short essay about transactionlessness in eBay's architecture. References are worth checking out.

Annotations

^{1} More advanced commitment protocols are out of scope of this document. They share many of the disadvantages listed in here.

^{2} With exception of data models designed for no contention. Shared Nothing architectural pattern satisfies non-functional requirement of no contention.

^{4} Term ‘transaction’ (not distributed transaction) is used throughout this article in traditional (non-technical) meaning — an execution of a deal.

^{5} Probability of error during second phase is extremely low (years may pass before one happens). Commit errors are often ignored by software developers (data integrity must be fixed manually or restored from backup after the incident).