Spotify keeps engineers and code in tune with fleet management
Since 2016, Spotify and Google Cloud have collaborated on solutions for Spotify’s developers, Google Cloud users, and open source communities. One of the most intensive and productive efforts for both companies has been their joint work to develop and optimize Google Cloud technologies in a way that also helps Spotify grow and scale. A recent example of this partnership is an automated fleet management solution that lets Spotify developers deliver secure, scalable, easy-to-manage apps and services faster than ever before.
The drivers behind fleet management
Even though Spotify is a large, established company with over 3,000 engineers, it still has a startup mindset. More than 500 ‘squads’ have specific goals and are empowered to achieve them. They own their product and technical strategies, and they’re constantly iterating both.
The challenge is that Spotify’s squads collectively manage more than 10,000 components, and all these backend services, data pipelines, websites, APIs, and front-end libraries are connected in a complex mesh of interdependencies. For example, the squad that manages playlists provides a full stack of components used by most of Spotify’s engineers. So not only must the playlist squad ensure their components are reliably supporting 550+ million global customers, they must also find the time to manage routine software updates and develop new playlist features.
Unprecedented scale required a new approach
Spotify’s complexity is only increasing. The company’s growth in users, content, and components is exponential. To meet demand and ease the pressure on engineers, several years ago the company abstracted away as much infrastructure as possible by lifting and shifting its platform to Google Cloud. Managed services automated many manual procurement and provisioning tasks, but engineers were still spending too much time on software maintenance. Releases, software updates, and new security threats kept squads tied up with tedious tasks. And platform migrations required significant effort by hundreds of teams over months, preventing them from doing new development.
Before fleet management, upgrading Spotify’s Java runtime took eight months.
Spotify saw that it had to evolve or it wouldn’t be able to scale and innovate fast enough to meet its requirements, let alone its goals. The company increased its use of automation and adopted a fleet management model. A step beyond infrastructure-as-a-service, fleet management removes many repetitive tasks for developers by providing backend services — like library updates, security patches, and even software observability — as part of infrastructure.
Niklas Gustavsson, VP and Chief Architect, explains, “We want to abstract away more levels of the technology stack and manage more commodity aspects of our platform so that developers’ work is more productive and fun.”
Making the shift to fleet management
One of Spotify’s biggest challenges in adopting fleet management was earning developers’ trust. The company had to show them they could go a step further and rest easy while automated processes pushed code changes to their components without any human interaction. Engineers had to be able to see for themselves that automation worked.
Spotify’s Backstage portal, which is now a Cloud Native Computing Foundation open source project, provided a single pane of glass into software components and cloud resources, but that wasn’t enough. The company had to give engineers easy fleetwide observability and controls, so they could see every component change and its impact. Spotify delivered those advanced insight capabilities using BigQuery.
Today, from Backstage, developers can make fleetwide changes, updating code used by 10 or 1,000 components without taking control away from the squads that own the components.
No one wants to go backwards
Today, more than 80% of Spotify’s production components are fleet-managed. As a result, developers are happier, squads iterate and ship new features dramatically faster, and security is better. Instead of squads spending weeks and months updating libraries using inconsistent processes, the internal and external software libraries supporting 2,600 components are automatically updated every day. And updates to the internal service framework used by backend services take less than a week; previously, those updates took several months.
With fleet management, updating Spotify’s service framework takes 7 days rather than 200 days.
More than 95% of Spotify’s developers say software quality has improved with fleet management. That’s because faster updates translate into healthier components with up-to-date internal and external libraries, frameworks, code improvements, bug fixes, and security patches. When the Log4j vulnerability emerged, developers rolled out the security fix to fleet-managed backend services in just nine hours. Manually deploying the fix to the remaining 20% of services, which were not fleet-managed, took eight days.
After merging the initial fix for Log4j, 80% of components were patched on the first day of the rollout.
Discovering new uses for existing products
As Spotify continues to expand its fleet management model, the company is looking to take on more complex changes to remove more toil and improve developer experience at Spotify — and at other organizations. Gustavsson explains, “We’re trying to figure out: do we want to externalize some of this stuff?”
Spotify has externalized many tools over the years, with Backstage being the most successful example. Today, more than 2,200 global adopters have built developer portals off the Backstage framework to improve their own developer experience and productivity. In December of 2022, Spotify also released a commercial plugin bundle subscription for adopters to enhance open-source Backstage, and Spotify plans to release more plugins to the bundle over time.
“Some of the infrastructure that we built for fleet management certainly doesn’t need to be unique at Spotify so we want to figure out what parts we can potentially open source or commercialize,” says Gustavsson. “Instead of every company building that portal on their own and building their own plugins, how about we figure out one shared framework in which we can all target our requirements.”
You can read more about Fleet Management at Spotify on the Spotify Engineering blog:
Part 1: Spotify’s Shift to a Fleet-First Mindset
Part 2: The Path to Declarative Infrastructure
Part 3: Fleet-wide Refactoring
Or listen to NerdOut@Spotify, the official tech podcast from Spotify R&D:
Episode 12: Fleet First
Episode 22: Declarative Infra and Beyond