Rolling Out Org-Wide Changes

Cash App has grown into a complex engineering organization. Our team of hundreds of engineers deploys dozens of services to AWS multiple times a day. Our teams span several time zones and locations, and each team has its own availability and velocity requirements.

Unfortunately, our custom Kubernetes management tooling couldn’t keep up with our organizational complexity or provide the service management features we required. Service teams also couldn’t dictate deployment strategies for their own services, and we wanted infrastructure that supported Continuous Delivery.

Changing this was a big challenge. We needed to change both the people process and the software tools without interrupting our teammates’ ability to deploy new code. We also needed to release onboarding docs to provide training for all the new things we’d be rolling out.

I’m happy with the process we used and really proud of its result. This post explains how we made this change. Hopefully it’ll help you to make your own organization-wide changes!

Step 1. Requirements List

We started by listing all the features we wanted out of a tool that handles Cash App service management. We did not want to invest in our own homegrown tooling at that point in time, so we searched for an open source, industry-standard tool that met our requirements. We made a shortlist, analyzed how well each tool met our requirements, and ranked them in terms of ease of use and integration with our environment.

We outlined the requirements, the candidate tools, and our analysis of each in a document, which we distributed to both the Platform team and the service teams that would be impacted by the change.
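One way to make this kind of ranking repeatable is a simple weighted scoring matrix. The sketch below is illustrative only, not the process we actually used: the tool names, criteria, weights, and scores are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Score per criterion, from 0 (does not meet) to 5 (fully meets).
    scores: dict


def rank(candidates, weights):
    """Return (name, weighted average score) pairs, best first."""
    total_weight = sum(weights.values())
    ranked = []
    for candidate in candidates:
        # A criterion missing from a candidate's scores counts as 0.
        weighted = sum(
            weight * candidate.scores.get(criterion, 0)
            for criterion, weight in weights.items()
        )
        ranked.append((candidate.name, weighted / total_weight))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and scores, for illustration only.
weights = {"requirements_met": 3, "ease_of_use": 2, "integration": 2}
candidates = [
    Candidate("tool_a", {"requirements_met": 5, "ease_of_use": 3, "integration": 4}),
    Candidate("tool_b", {"requirements_met": 3, "ease_of_use": 5, "integration": 3}),
]

for name, score in rank(candidates, weights):
    print(f"{name}: {score:.2f}")
```

Writing the weights down next to the scores makes the trade-offs visible to the teams reviewing the document, and makes it easy to see how the ranking shifts if a team disagrees with a weight.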

Step 2. Proof of Concept

From the shortlist, we chose Spinnaker, an open source continuous delivery tool that claimed to meet almost all of our requirements. We set up a local deployment to verify the features we were interested in and to understand what we needed to do to configure Spinnaker for our purposes. We tested the main integration points with our existing environment, as well as authentication and authorization, to ensure it fit with our existing security model.

We demoed the proof of concept to the team at our end-of-sprint meeting.

Step 3. Design Doc

Next, we created a design doc to capture our plan for integrating Spinnaker into our system. Our design docs are a high-level description of the problem space and solution, and can be broken down into the following:

Our design docs are sometimes expanded to include more, but almost always include the above items. We share the design doc to spread awareness — especially to teams who will be impacted by our change.

Step 4. Security

At this point, we wanted to ensure that the integration of Spinnaker into our current system was not going to make us vulnerable from a security perspective. We had multiple meetings with the security team and they closely vetted the integration and the system, to make sure points of potential attack were secure and there was minimal risk. All communication between Spinnaker and our system was secured and left clear audit trails. We also used the opportunity to improve our security tooling and infrastructure.

We recommend this kind of security review for any system you introduce to your organization, as an attack can compromise the integrity of user data, cause significant downtime, or leak PII. It is also an opportunity to improve the security tooling in your organization.

Step 5. Build

At this point, we’d made all the decisions and were ready to build easy-to-deploy, easy-to-manage Staging and Production Spinnaker environments for Cash App services. The major unknowns we were dealing with at the beginning, when determining what tool to use, were:

These points were answered, but building still took time. We naturally came across issues when integrating within a large system. We didn’t want to compromise on code quality or on the additional security tooling we built for secure Spinnaker communication, so we broke the work down into subproblems to ensure all components worked as expected.

Step 6. Early Adopters

We finished building and were ready to start testing real-world scenarios. We sent out a survey for early adopters and received responses from about 20 teams. We tested by doing the following:

Each step allowed us to understand what kind of information would be useful for onboarding, and the interviews helped to iteratively improve the guide. The interviews and observations of how teams were using our features also informed our future planning.

Step 7. General Availability

We finished testing with our early adopters and ironing out the kinks. We stress-tested the system with our Production clusters and launched our Spinnaker Platform to general availability.

This is when we celebrated!

Future Work

Launching is only the beginning. Rolling out an org-wide process or tool change takes time, and there is still more to do: new features to build, feedback to gather from teams on how well the tool is performing and meeting their needs, and further expansion. The journey is not over yet, but we have cleared a major hurdle and paved the way for more exciting features in the future.