Opening up Google’s Windows management tools
Managing a global fleet of Windows desktops, laptops, and servers for Google’s internal teams can be tricky, with a constant stream of new tools, high expectations, and stringent organizational needs for secure, code-based, scalable administration. Add in a globally distributed business and extended work-from-home requirements, and you have a recipe for potential trouble.
Today we’d like to walk you through some of the tools that the Windows Operations (WinOps) team uses at Google, and why we made (and open-sourced) them. Our team is constantly working to improve the process we use to manage our client fleet of laptops and desktops, and we’ve spent the past several years building open source, infrastructure-as-code tools to do just that.
Now that we’re all working from home, these choices have enabled us to keep operating at scale remotely. Let’s dig into a few common Windows administrative challenges and how our open tools can help.
Challenges with scale
When you manage Windows in a large, globally distributed business environment, problems of scalability are front and center. Many popular administrative tools are GUI-based, which makes them easy to learn but difficult to scale and integrate. An administrator is often limited to the functionality built into the product by its vendor. Many times, core management suites lack qualities that we would consider critical in a reliable production environment, including the ability to:
- Peer review edits and roll changes backward and forward on demand
- Implement platform testing, with support for automation pipelines
- Integrate seamlessly with tooling that also manages our other major platforms
Because they rely on explicit network-level access, many of these products also depend heavily on a well-defined corporate network, with clear distinctions between inside and outside.
At Google, we’ve been rethinking the way we manage Windows to address these limitations. We have built several tools that have helped us scale our environment globally and enabled us to consistently support Google employees, even when major unexpected events happen.
Open source products are increasingly a key to our success. With the right knowledge and investment, open source tools can be extended and tailored to our environment in ways other applications simply can’t. Our designs also focus heavily on configuration as code, rather than user interfaces: Code-based infrastructure provides optimal integration with other internal systems, and enables us to manage our fleet in ways that are audited, peer reviewed, and thoroughly tested. Finally, the principles of the BeyondCorp model dictate that our management layer operates from anywhere in the world, rather than only inside the company’s private network.
Let’s dig into some of these tools, organized by what they help us get done.
Prepping Windows devices
Glazier, a tool for imaging, marked our team’s first foray into open source. This Python-based tool is at the core of our Windows device preparation process. It focuses on text-based configuration, which we can manage using a version control system. As with code, the flexible format lets us write automated tests for our configuration files and trivially roll our deployments back and forward. File distribution is based around HTTPS, making it globally scalable and easy to proxy. Glazier supports modular actions (such as installing host certificates or gathering installation metrics), making it simple to extend with new capabilities over time as our environment changes.
Secure, modular imaging with Glazier helps prepare devices
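To illustrate why text-based configuration lends itself to automated testing, here is a minimal Python sketch of a configuration check that could run before a change is merged. It is not Glazier's actual schema or API: the JSON task-list layout, the action names in `KNOWN_ACTIONS`, and the `validate_config` function are all hypothetical, chosen only to show the idea.

```python
import json

# Hypothetical action names for illustration; a real imaging tool defines its own.
KNOWN_ACTIONS = {"install_certificate", "download_file", "report_metrics"}


def validate_config(text):
    """Return a list of human-readable problems found in a JSON task list."""
    try:
        tasks = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(tasks, list):
        return ["top level must be a list of tasks"]

    problems = []
    for index, task in enumerate(tasks):
        # Each task is assumed to be a single {action: arguments} mapping.
        if not isinstance(task, dict) or len(task) != 1:
            problems.append(f"task {index}: expected exactly one action")
            continue
        (action, args), = task.items()
        if action not in KNOWN_ACTIONS:
            problems.append(f"task {index}: unknown action {action!r}")
            continue
        if action == "download_file":
            # File distribution is HTTPS-based, so reject any other scheme.
            url = args.get("url", "") if isinstance(args, dict) else ""
            if not str(url).startswith("https://"):
                problems.append(f"task {index}: download_file needs an https:// url")
    return problems
```

A check like this can run in the same review pipeline as any other code change, so a malformed or insecure configuration is caught before it reaches a device, and a bad change can be reverted through version control.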
Traditional imaging tends to rely heavily on network trust and presence inside a secure perimeter. Systems like PXE, Active Directory, Group Policy, and System Center Configuration Manager require you to either set up a device on a trusted network segment or have sensitive infrastructure exposed to the open internet. The Fresnel project addressed these limitations by making it possible to deliver boot media securely to our employees, anywhere in the world. We then integrated it with Glazier, enabling our imaging process to obtain critical files required to bootstrap an image from any network. The result was an imaging process that could be started and completed securely from anywhere, on any network, which aligns with our broader BeyondCorp security model.
Fresnel enables imaging from any network in the world
The remote imaging and provisioning process included several other network trust dependencies that we had to resolve. Puppet provides the basis of our configuration management stack, while software delivery now leverages GooGet, an open source repository platform for Windows. GooGet’s open package format lends itself well to automation, while its simple, APT-like distribution mechanism is able to scale our package deployments globally. For both Puppet and GooGet, the underlying use of HTTPS provides security and accessibility from any network. We also use osquery to collect distributed host state and inventory.
GooGet helps us automate package distribution and deployment
Our infrastructure still has dependencies on classic Active Directory (AD), and the domain join process was a particular challenge for hosts that do not bootstrap from a trusted network. This led to the Splice project, which uses the Windows offline domain join API and Google Cloud services to enable domain joining from any network. Splice enables us to apply flexible business logic to the traditionally rigid domain join process. With the ability to implement custom authentication and authorization models, host inventory checks, and naming rules not typically available in AD environments, this project has given us the flexibility to extend our domain well beyond the classic network perimeter.
Splice helps us join new devices onto our Active Directory domain from anywhere
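The kind of business logic described above (authorization, inventory checks, naming rules) can be sketched in a few lines of Python. This is an illustration only, not Splice's implementation or API: `JoinRequest`, `authorize_join`, the hostname pattern, and the inventory mapping are hypothetical, and the sketch stops at the approval decision without performing any domain join.

```python
import re
from dataclasses import dataclass

# Hypothetical naming rule: three-letter site code, role letter, four digits.
HOSTNAME_RULE = re.compile(r"[A-Z]{3}-[LDS]-\d{4}")


@dataclass(frozen=True)
class JoinRequest:
    hostname: str   # name the device wants in the domain
    serial: str     # hardware serial reported by the device
    requester: str  # identity of the caller asking for the join


def authorize_join(request, inventory, allowed_requesters):
    """Decide whether a join request may proceed.

    inventory maps serial numbers to the hostname assigned to that device.
    Returns a (approved, reason) tuple.
    """
    if request.requester not in allowed_requesters:
        return False, "requester not authorized"
    if not HOSTNAME_RULE.fullmatch(request.hostname):
        return False, "hostname violates naming rule"
    if inventory.get(request.serial) != request.hostname:
        return False, "host not found in inventory under that name"
    return True, "ok"
```

Only after checks like these pass would a service go on to generate the offline join data for the device, which is what lets the policy live in code rather than in the network perimeter.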
Maintaining our fleet
Deployment is only the beginning of the device lifecycle; we also need to be able to manage our active fleet and keep it secure.
The Windows internal update mechanism is generally sufficient to keep the operating system patched, but we also wanted to be able to exercise some control over updates hitting our fleet. Specifically, we need the ability to rapidly deploy a critical update, or to postpone installing a problematic one. Enter Cabbie, a Windows service that builds upon Windows APIs to provide an additional management layer for patching. Cabbie gives us centralized control over the update agent on each machine in our fleet using our existing configuration management stack.
Centralized patch control using configuration management
We also have Windows servers to manage, and these hosts present unique challenges, distinct from those we face with our client fleet. One such challenge is how to schedule routine maintenance in a way that’s easily configurable, automated, and can be integrated with our various agents like Cabbie. This led to Aukera, a simple yet flexible service for defining recurring maintenance windows, establishing periods where a device can safely perform one or more automated activities that might otherwise be disruptive.
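As a rough illustration of a recurring maintenance window, the Python sketch below checks whether a given moment falls inside a weekly window. It does not reflect Aukera's configuration format or API: the `Window` type and `is_open` function are hypothetical, times are naive local datetimes, and windows are assumed to last no more than 24 hours.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class Window:
    weekdays: frozenset   # days the window opens; 0 = Monday ... 6 = Sunday
    start: time           # local time at which the window opens
    duration: timedelta   # how long it stays open (assumed <= 24 hours)


def is_open(window, now):
    """Return True if `now` falls inside an occurrence of the window."""
    # Check the occurrence that opens today and the one that opened
    # yesterday, since the latter may run past midnight.
    for days_back in (0, 1):
        day = (now - timedelta(days=days_back)).date()
        if day.weekday() not in window.weekdays:
            continue
        opened = datetime.combine(day, window.start)
        if opened <= now < opened + window.duration:
            return True
    return False


# Example: Saturdays at 22:00 for four hours.
weekend_patching = Window(frozenset({5}), time(22, 0), timedelta(hours=4))
```

An agent that performs disruptive work, such as installing updates or rebooting, could call a check like `is_open(weekend_patching, datetime.now())` before acting and defer the work when the answer is no.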
Building for the future
Our team was fortunate to have started many of these projects well before the spring of 2020, when many of us had to abruptly leave our offices behind. This was due, in part, to embracing the idea of building a Windows fleet for the future: one where every network is part of our company network. Whether our users are working at a business office, from home, or on a virtual machine in a cloud data center, our tools must be flexible, scalable, reliable, and manageable to meet their needs.
Most of the challenges we’ve discussed here are not unique to Google. Companies of all shapes and sizes can benefit from increasing security, scalability, and flexibility in their networks. Our goal in opening up these projects, and sharing the principles behind them, is to assist our peers in the Windows community to build stronger solutions for their own businesses.
To learn more about our wider fleet management strategy and operations, read our “Fleet Management at Scale” white paper.