A Story About Eventual Consistency

The migration of software to the cloud has begun, a long time ago… In fact, when I started writing software for the cloud somewhere in 2010, I expected us to be a lot further by now. Now, ten years later, cloud services are very mature, fairly cheap compared to on-premises hosting, and very secure. All this, makes me ask why some companies still don’t embrace the cloud. One example is scaling. It’s almost impossible to host a service on-premises and make it scale as it could in the cloud. Distributed systems leverage this power which makes it almost too easy to build software systems that scale extremely well. But every huge advantage comes with a downside. In this story about eventual consistency, I will show how how to tackle this problem using the Azure Service Bus.

Now what is a distributed system?

Let’s say you own a bank. Not one to sit on, but the financial one. Even better, you own a digital bank. No offices, everything online. Cheap, just brilliant. Now your success is pretty much bound to the performance of your online systems. You had a good chat with your colleagues and identified a couple of domains. One domain, the Transactions domain is expected to face the biggest load. A distributed system could be a nice solution here. So now, what is this distributed system?

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another

Not so distributed

OK fair enough, but what does this mean? To answer this, imagine a traditional Web API you used to create a couple of years ago. It has an endpoint, and when you call it, the request is processed and you return the outcome of the process in response.

Distributed

In the distributed version, you have this same endpoint, which starts the process, and once the process is started successfully, returns a message that the process was started. A 202 Accepted is a common response, meaning ‘Hey, I accepted the workload, I will do my best for you’.

What’s next?

Your API, in fact, sent a message to one or more processes, to do their job. This process may send a message to an additional process, starting a chain of things that happen, triggered by that single request.

Keeping track of operations

It’s wise to keep track of processes that ‘belong’ to each other. For example, when I use the transactions API and transfer money from Account A to Account B, I want to validate the transfer request, update the balance of Account A and update the balance of Account B, and update the ‘initiator’ (me) that the transfer was successful. But if the validation failed, I want to be able to relate that validation back to the original ‘massage’ that came from my API. To do this, it’s common to use a correlation ID. So when you start a process or a chain of processes, make sure they all have the same correlation ID, so you can relate to them as the same operation. Also, using this correlation ID makes your life as a developer easy when debugging or finding information in your system logs.

All nice, but what’s the problem?

The problem with this architecture is that things sometimes fail in the cloud. So given the example of the bank, when you transfer money from Account A to Account B, and the balance of Account A is updated, but your system failed to update Account B. You’re in big trouble there mister! This can all be related back to a concept called eventual consistency. This term is broadly used when using microservices which basically face this same problem. You start an operation that consists of multiple processes, and you want processes to complete, or else make sure you’re notified. But most important, these processes are related to each other. You want Account A and Account B to be updated simultaneously. And worst-case scenario… If something fails, they should at least fail both.

Azure Service Bus to the rescue!

The Azure Service Bus is a great tool for 1) allowing you to write distributed systems and 2) tackle your eventual consistency problem.

A brief introduction

The service bus is a messaging system that allows you to send and retrieve messages. It has a retry mechanism that allows you to retry processing messages when failed. In the case of too many failures, your message ends up in the dead letter queue (DLQ). The service bus contains queues and topics. A queue is used for a pile of work to be done. You can have multiple queue clients connected to the same queue, but only one of these clients will get a message when a message arrives in the queue. A topic is different. You can see a topic as a newspaper. A topic can have subscriptions. When a message arrives in a topic, all subscribers get a message. So compared to the newspaper, everyone with a subscription will get a copy as soon as a new newspaper comes out.

The nifty trick

Now for our eventual consistency problem, we’re going to use both queues and topics. We will create a queue for each process that we need. So basically this means four queues. A validation queue, two ‘update balance’ queues, and a ‘report back to the user’ queue.

Eventual Consistency The queues here are absolutely not mandatory, but for convenience. In case you wish to separate processes and start populating this queue through another process or mechanism, you now can. In a topic, it’s more difficult because you may also affect other subscribers to that topic.

This way, we’re sure that when a message arrives in the topic, it’s subscribers will (in this case both) forward that message to the queue. The automatic retry mechanism will make sure the message is processed. However, you do need to make sure you monitor the DQL in case messages could not be processed even after x tries. And in the occurrence of a bank transaction, you want to make sure that money transferred from Account A to Account B, is removed from Account A’s balance, and added to Account B’s balance.

In the event your infra is down, let’s say your Azure Functions project (as in the example of the drawing) fails, work is just added to the queue and waiting for your functions to become available again. And because Azure Functions scale so nicely, they will probably scale up by the time they’re running again to keep up with the demand.

On the other hand, let’s assume the Service Bus is down. The problem now is that messages cannot be sent to queues or topics. This is also not a problem, because in this case, the entire system fails. Now money is removed from Account A’s balance, and no money is added to Account B’s balance. Which is also good. Probably not desired, you’d like the system to succeed, but the outcome of both processes is consistent.

Last modified on 2020-06-19