Alok Menghrajani

Previously: security engineer at Square, co-author of HackLang, put the 's' in https at Facebook. Maker of CTFs.

This blog does not use any tracking cookies and does not serve any ads. Enjoy your anonymity; I have no idea who you are, where you came from, and where you are headed to. Let's dream of an Internet from times past.

Building multi-master distributed systems using feeds

Sep 19, 2016

tags: distributed systems | data replication | multi-master | tla+

At Square, backend engineers have a large set of tools that help build application at a datacenter level: app containers to handle application life cycle and monitoring, MySQL databases, zookeeper, logging, rpc, etc. When it comes to replicating data across our datacenters, we have to understand various tradeoffs and build the replication layer ourselves.

The advantage of having each team tackle data storage is that applications and algorithms can be tailored to specific use-cases. For example, we have an application which creates crypto keys and replicates them pre-emptively while preventing the same key from being concurrently assigned to two clients. It might have been harder to build such a feature with a generic distributed key-value store.

The disadvantage of having to care about data replication is that we end up building less reusable abstractions. Our lack of distributed systems skills also leads us to repeatedly make some mistakes.

In 2012, Bob Lee gave a talk: Engineering Elegance: The Secrets of Square's Stack , where he described our use of feeds. In this post, I will explain some of the issues our team has had to deal with over the last few years.

Feeds overview

The feed abstraction allows remote clusters to sync data from a given cluster. The model is poll based and the contract is roughly:

the master cluster keeps track of changes. It exposes an endpoint which takes a cursor. The response contains up to 10 data entries and an updated cursor. If there is no change to return, the same cursor is returned.
the other clusters keep track of their cursor and continuously hit the master datacenter's endpoint, updating their local state as they go along.

When building a multi-master application, we end up with N^2 feeds (where N is the number of datacenters or clusters).

The feed contract boils down to: "Two different feed consumers polling the same feed with the same cursor get back the exact same entries in the exact same order. (The number of necessary poll requests can vary between the consumers, for example if entries are added to the feed between the two poll requests.)"

At Square, the number of items we return is configurable. We also have a concept of shard. I'll ignore those details in this post.

Ordering issue

The first issue with using feeds to replicate data is that the application needs to handle ordering issues. Let's look at the following toy application: Two or more servers which are updating some internal value V at random intervals. We want the state to replicate and we want data to be eventually consistent.

The pseudo-code for this example might look like:

function init() {
  value = 0;
  current_version = 0;
  cursors = [0];
  schedule(job)
}

function job() {
  lock();
  value = rand();
  current_version++;
  release();
}

function feed_provider(cursor, num_results) {
  lock();
  if (cursor < current_version) {
    t = (current_version, [value]);
  } else if (cursor == current_version) {
    t = (current_version, []);
  } else {
    // scream...
  }
}

function feed_listener() {
  t = request(peer, cursors[0], 1)
  if (t.entries == []) { return; }
  lock();
  value = t.entries[0];
  cursors[0] = t.cursor;
  unlock();
}

The code will work fine when deployed to two servers. The values might look something like:

At any point in time, either the data is consistent or the feed is slightly behind and the data will become consistent. Things fall apart when the code is deployed to three servers. The values can end up looking something like:

Server 2 and 3 generate a value at roughly the same time, the two servers end up with the same value (thanks to the use of locks), but server 1 doesn’t see any of this and ends up with a different view of the world.

For fun, I modeled these two cases in TLA+. The model checks with two servers and fails with more servers.

Using timestamps for feeds

In order to track and return change sets, applications usually store changes in a dedicated table. An alternative is to have an updatedAt column and use map the cursor to the clock.

This approach however breaks the feed contract: polling the same feed with the same cursor will no longer return the same data.

Using timestamps also comes with a few implementation risks: you must ensure you have the right indexes (to make the feed query fast), you must take into account that server clocks can be slightly off, be careful about rounding (*), think about query ordering (**), etc.

* older versions of MySQL round timestamps at a seconds granularity. This can lead a failure to propagate data if you have more than 10 writes per second.

** query completion ordering can result in an updated value which is slightly in the past.

Auto-increment appears non-monotonic

When storing log entries in a MySQL table with an auto-increment id, we use the row id as the feed cursor. There is however a case where the feed can skip one or more rows.

Let's say a query is adding a row. MySQL will reserve a row id (e.g. 100). If another query is adding a row, MySQL will increases the row id (e.g. 102). There is however no guarantee that the first query will finish first and it's possible for the database to contain only row 102 for a brief period of time.

From the feeds point of view, the rows are being added in a non-monotonic fashion and row 100 can get skipped if the feed fetches 102:

Summary

To summarize, replicating data with feeds requires taking data ordering into account. Application developers need to decide if they can handle inconsistent views on their data or if they need to apply some kind of re-ordering logic.

Developers don't get to control their consistency model: feeds provide eventual consistency and you have to build on top of that.

In practice, it is easier to handle data which has an inherent order (e.g. counters or timestamps) or store unordered sets.

Ensuring the correctness of the code is hard. You can end up with code which fails to sync a few rows when some rare conditions are met.

Failure to properly monitor feeds or perform the right administrative operations leads to outages.

Given these findings, we tend to write our applications with a fallback to query all the remote datacenters if a piece of information is missing. This is only possible for specific kinds of datastructures but it has saved us multiple times!