More Resilient Schema Changes at Scale
Because many schema changes in MySQL are blocking, the accepted workaround for many years has been to use an external tool such as pt-online-schema-change or Gh-ost. These tools both work similarly:
- Copy the original schema definition to a new table
- Perform the
ALTER TABLE
operation on the new table - Copy the data from the original table to the new table (while subscribing to updates)
- Swap the new table in to replace the existing table
Gh-ost has been our tool of choice for many years, and has performed thousands of schema changes at scale without downtime.
However, due to reasons I will explain below, we embarked on a project to create our own version of Gh-ost from scratch. In homage to Gh-ost, we affectionately named this tool Spirit.
This is part 1 of 2. Tomorrow I will post Introducing Spirit, but for today let’s explain our motivations.
Reason #1 Resiliency
At Cash App, gh-ost is deployed in Kubernetes. Our Kubernetes services are expected to be resilient: if a pod fails, it is rescheduled and all work resumes.
Unfortunately with Gh-ost, on resume all progress is lost. When you consider that some of our longer DDL operations are best measured in weeks, it becomes a serious problem. A failure on day 15 means a restart from day 1, and there’s no guarantee there won’t be another failure on the second attempt.
When we analyzed our previous migrations, we found that if a schema change took longer than three days, it had a 40% chance of failure. That’s not to say that it causes an outright failure (schema changes will restart), but predictability is important. It’s hard to make plans when you want to deploy code that requires a schema change with an unknown ETA.
Reason #2 Performance
Cash App is built as a set of microservices, each with their own database. One of our organizational objectives is to make the developers of each of those services as productive as possible.
For our database team this translates to: if we don’t need to enforce sharding on an application, we won’t. As much as possible we will support applications to have larger databases if that helps serve this goal.
The single largest issue blocking us from having larger databases has been schema changes. That is to say that having a 1TB table is usually more of an issue than having a 10TB database.
The problems with schema changes are not just related to the time that it takes to alter the table (with longer time introducing more risk due to issue #1). The larger tables also tend to be more active with new inserts and updates arriving and having to be applied to the Gh-ost copy table.
Gh-ost was designed in a time when MySQL replication was single threaded. As a direct consequence of this, it itself is only single threaded. We frequently end up with scenarios that the incoming rate of changes is to tables arriving in parallel and exceeding the rate at which Gh-ost can alter the table. For large tables we end up in a situation where the schema changes will never finish.
Our Answer
With these two motivations driving us, we set a goal: we wanted to be able to comfortably support 5TB tables, migrating in 5 days. I built an initial prototype over the New Year’s break showing that it was indeed possible, and in mid-January it was greenlit as a project.
Because most tables at Cash App have an auto_increment PRIMARY KEY
, we were able to target our optimizations specifically for this use case, and in our integration we also added the ability to fallback to Gh-ost if required.
There are a couple of difficult parts about writing a schema change tool:
- The cutover algorithm to switch between tables is incredibly complex for MySQL 5.7. It gets easier in MySQL 8.0, but initially we decided to support both.
- Any bugs can potentially mean data loss. This is a very scary prospect for most people, but especially when you are storing financial data.
In part 2 tomorrow, I will introduce how we mitigated some of these risks (spoiler: there’s a checksum), but I will leave it at this for now!