This is part 2 of 2. See yesterday’s More Resilient Schema Changes at Scale post for an introduction.
INSTANT is not supported, Spirit creates a copy of the table definition, applies the
ALTER TABLE to the new table, and then copies data from the original table to the new table in parallel threads. Each of these threads copies the data in small chunks which are dynamically sized to a target chunk time. We have found this to be both faster and safer than using a fixed size chunk such as 1000 rows at a time.
While the table is being copied, Spirit subscribes to updates on the table using the binary log (i.e. it acts like a MySQL replica). This is the same mechanism that Gh-ost uses, but we have been able to implement two very helpful optimizations:
Key above watermark
This optimization ignores changes in the replication feed that occur above the highest point that rows either have been copied, or could have been copied.
Currently this optimization is only enabled in the case you have an
auto_increment PRIMARY KEY, since with collations determining ‘above’ is actually difficult.
This optimization has proven very effective for us, since we typically have
BIGINT auto_increment keys and see most activity on the newest inserted rows at the end of the table.
This optimization de-duplicates changes that occur in the replication feed. Provided that the
PRIMARY KEY does not have collations, we only need to store the
KEY and if the last modification was a delete or not.
Periodically we then parallel-flush the changes that have been stored in the map. If a row has been modified several times, we only need to flush it once.
We automatically disable this optimization if the
PRIMARY KEY has collations. Unfortunately when this optimization is disabled it requires that changes are flushed in order and thus also single threaded2.
Because Spirit tasks are resumable, you can kill a schema change at any point. Any schema change will automatically resume provided that Spirit is invoked with the same
ALTER statement, and the binary logs that it needs are still available. For us this means that our Kubernetes pods can be rescheduled, or we can roll out new deployments or upgrades without much consequence.
It also means that we can resize our RDS instances just for the purpose of applying schema changes faster. The benefit of this shouldn’t be undersold: we have noticed that schema change performance with Aurora drops significantly when the secondary indexes of the new table no longer fit in the buffer pool. It is often the case that scaling up the instance to have twice the RAM may have a 5x+ not a 2x improvement, since it changes the profile of the task to be less IO bound.
Writing a new schema change tool is risky, for a few reasons:
- Any bugs could lead to data loss. It’s possible that the copier process could mishandle character sets, time zones, partially invalid data (sql modes) or not handle disconnects, deadlocks or lock timeouts correctly.
- The cut-over process in MySQL versions less than 8.0 relies on a strange side-effect of how locks are prioritized, and is difficult to test for, leading to the possibility of future bugs being introduced.
- We want Spirit to run faster than Gh-ost, but at the same time it can’t impact production traffic.
On the first point, during development we introduced a checksum feature to compare the intersection of columns between the source table and the new table. We initially never intended to keep the checksum enabled, but in most cases it only adds about 10% of time to each schema change. With Spirit’s checksum, we are effectively able to prevent all data loss scenarios that occur before the cutover process occurs.
On the final point, Spirit is now much better at not impacting production systems, but it required real-world testing to get there. Early in development we saw issues where the last chunk in the table was much larger than earlier chunks, or the distribution of the data was unexpected. This is a hard problem, and I expect there will be future opportunities for improvement.
Why not do X instead?
The two most common questions are why don’t we use either Instant DDL (available in MySQL 8.0) or Blue/Green (advertised by AWS as an option to do schema changes). The answers are quite straightforward actually:
- Spirit always first attempts
INSTANTDDL. If it is applicable, then the schema change applies immediately and the operation is complete. In fact, we contributed the same feature to Gh-ost.
- Blue/Green depends on statement based replication, which is deprecated. It is unsafe to apply schema changes on the replica first when using Row-based Replication (default).
I like to describe Spirit as a workaround to a MySQL limitation. The MySQL team has been making more operations
INSTANT in newer versions of MySQL, and by nature of ‘always attempting instant first’, Spirit automatically supports them.
We look forward to a time in the future where we won’t need Spirit, but for that to happen all operations need to be
INSTANT with a version of
INPLACE that does not block on replicas.
Is Spirit a good fit for you?
We would like to think yes! But one caveat up front:
We acknowledge that our use-case is also a little bit different than GitHub’s, in that Gh-ost was written with a lot of emphasis on minimizing delay on read replicas. Spirit does implement throttling based on replication lag, but it only supports values 10s or greater.
This is because we only use binary-log replicas for disaster recovery and not read/write split as GitHub does. We made the deliberate decision to tolerate small ‘blips’ of replica delay in exchange for faster schema changes.
Thank you to my colleagues who contributed to Spirit’s development including Prudhvi Dhulipalla, Kolbe Kegel, Yinan Liang, Joy Nag and Daniel Nichter. Additional thanks to the MySQL Team at Oracle and the RDS team at AWS for reviewing our work and providing feedback.