How Unity analyzes petabytes of data in BigQuery for reporting and ML initiatives
Editor’s note: We’re hearing today from Unity Technologies, which offers a development platform for gaming, architecture, film and other industries. Here, Director of Engineering and Data Sampsa Jaatinen shares valuable insights for modern technology decision makers, whatever industry they’re in.
Unity Technologies is the world’s leading platform for creating and operating real-time 3D (RT3D) content. We’ve built and operated services touching billions of endpoints a month, as well as external services benefiting financial operations, customer success, marketing and many other functions. All of these services and systems generate information that is essential for understanding and operating our company’s business and services. For complete visibility, and to unlock the full potential of our data, we needed to break down silos and consolidate numerous data sources in order to efficiently manage and serve this data.
Centralizing data services
Data platforms are essential to keeping a business running and ensuring that we can continue serving our customers, no matter what disruptions or events are happening. Before migrating to Google Cloud, we used one solution to store datasets for machine learning, an enterprise data warehouse for enterprise data, and yet another solution for processing reports from streaming data. We saw an opportunity to reduce overhead and serve all our needs from the same source.
We wanted to centralize data services so we could build one set of solutions with a focused team instead of having different teams and business units creating their own siloed environments. A centralized data service can build once and serve multiple use cases. It also makes it easy to understand and govern the environment for compliance and privacy.
Of course, centralization has its challenges. If the internal central service provider is the gatekeeper for numerous things, the team will eventually become a bottleneck, especially if the central team members’ direct involvement is needed to unblock other teams. To avoid this scenario, the centralized data services team adopted a strategy of building an environment where customer teams can operate more independently by using self-service tooling.
With easy-to-use capabilities, our data users would be able to manage their own data and development schedules independently, while maintaining high standards and good practices for data privacy and access. These cornerstones, together with the specific features and capabilities we wanted to provide, guided our decision to choose a foundational technology. We needed to build atop a solution that fully supports our mission of connecting the data to business and machine learning needs within Unity.
Why we chose BigQuery
For these reasons, over two years ago we decided to migrate our entire infrastructure from another cloud service to Google Cloud and to base our analytics on BigQuery. We focused on a few main areas for this decision: scalability, features to support our diverse inputs and use cases, cost effectiveness that best fits our needs, and strong security and privacy.
The scale of data that Unity processes is massive. With more than 3 billion downloads of apps per month, and 50% of all games (averaged across console, mobile, and PC) powered with Unity, we operate one of the largest ad networks in the world. We also support billions of game players around the world. Our systems ingest and process tens of billions of events every day from Unity services. In addition, we rely on outside enterprise services, such as the CRM systems needed for our operations, whose data we want to integrate, combine, and serve alongside our own immense streaming datasets. Our data platform therefore has to ingest petabytes of data per month, and enable a variety of company stakeholders to use the platform and its analytics results to make critical business decisions.
The data we capture and store is used to serve insights to various internal teams. Product managers at Unity need to understand how their features and services are adopted, which also helps with development of future releases. Marketing uses the data to understand how markets are evolving and how to best engage with our existing and potential new customers. And decision makers from finance, business development, business operations, customer success, account representatives, and other teams need information about their respective domains to understand the present and recognize future gaming opportunities. In addition, the solution we chose needed to support Unity’s strong security and privacy practices. We enforce strict limitations on Unity employees’ access to datasets—the anonymization and encryption of this data is an absolute requirement and was important in making this decision.
The data platform we chose also had to support the machine learning that sits at the heart of many Unity services. Machine learning relies on a fast, closed feedback loop, where services generate data and then read it back to adjust their behavior toward a better outcome, for example providing a better user experience by offering more relevant recommendations on Unity’s learning material. We wanted a data platform that could easily handle these activities.
Migrating to BigQuery
The migration started as a regular lift and shift, but it required careful tweaking of table schemas, ETL jobs, and queries. It took slightly over six months and was a very complex engineering project, primarily because we had to conform to GDPR policies. Another key factor was transforming our fragmented ecosystem of databases and tools into a single unified data platform.
Throughout this process, we learned some valuable lessons that we hope will be useful to other companies with extreme analytics requirements. Here are a few of the considerations to understand.
Migration considerations
BigQuery requires a fixed schema, which has pros and cons (and differs from other products). A fixed schema removes flexibility on the side of the applications that write events, and forces stricter discipline on developers. But on the positive side, we can use this to our advantage, providing safe downstream operations since erroneous incoming records won’t break the data.
The fixed schema required us to build a schema management system. It lets the teams within Unity that generate, store, and process data create schemas, change them, and reprocess data that did not reach the target table because of a schema mismatch. The safety provided by schema enforcement, and the flexibility of self-serve schema management, are essential for us to roll these data ingestion capabilities out to our teams.
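To make the schema-mismatch flow concrete, here is a minimal sketch in plain Python. It is not Unity's schema management system: the function names, the in-memory `table` and `rejected` lists, and the rule that every column is optional are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical schema representation: column name -> expected Python type.
Schema = dict[str, type]
Record = dict[str, Any]


def matches(record: Record, schema: Schema) -> bool:
    """True when every field in the record is a known column of the right type.

    Assumption: all columns are optional, so a record may omit columns.
    """
    return all(
        name in schema and isinstance(value, schema[name])
        for name, value in record.items()
    )


def ingest(records: list[Record], schema: Schema,
           table: list[Record], rejected: list[Record]) -> None:
    """Append conforming records to the table; park the rest for reprocessing."""
    for record in records:
        (table if matches(record, schema) else rejected).append(record)


def reprocess(schema: Schema, table: list[Record],
              rejected: list[Record]) -> list[Record]:
    """Retry parked records after a schema change; return those still failing."""
    still_rejected: list[Record] = []
    ingest(rejected, schema, table, still_rejected)
    return still_rejected


schema = {"user_id": str, "event": str}
table: list[Record] = []
rejected: list[Record] = []

ingest(
    [
        {"user_id": "u1", "event": "start"},
        {"user_id": "u2", "event": "start", "level": 3},  # unknown column
    ],
    schema, table, rejected,
)

# A team adds the missing column, then replays the parked record.
schema["level"] = int
rejected = reprocess(schema, table, rejected)
```

After the schema change, the parked record lands in the table and nothing remains in the holding area, which is the self-service loop the paragraph above describes.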
Another consideration for us was data platform flexibility. On top of the ingested data, we aim to provide data aggregates for easy reporting and analysis, and an easy-to-use data processing toolset that lets anyone create new aggregates, joins, and samples of the data. Both the aggregates and the event-level data are available for reporting, analysis, and machine learning, all accessible in BigQuery in a flexible, scalable manner.
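As a rough illustration of the difference between event-level data and an aggregate, the sketch below rolls hypothetical event rows up into daily counts per event type. The field names and sample rows are invented for the example; in BigQuery the same rollup would typically be a `GROUP BY` query rather than Python.

```python
from collections import Counter


def daily_event_counts(events):
    """Count event-level rows per (date, event name) pair.

    Rough SQL equivalent:
    SELECT date, event, COUNT(*) FROM events GROUP BY date, event
    """
    return Counter((row["date"], row["event"]) for row in events)


# Hypothetical event-level rows, one per event.
events = [
    {"date": "2020-06-01", "event": "session_start", "user_id": "u1"},
    {"date": "2020-06-01", "event": "session_start", "user_id": "u2"},
    {"date": "2020-06-01", "event": "ad_shown", "user_id": "u1"},
    {"date": "2020-06-02", "event": "session_start", "user_id": "u1"},
]

counts = daily_event_counts(events)
print(counts[("2020-06-01", "session_start")])  # prints 2
```

Keeping both forms available means a dashboard can read the small aggregate while an analyst or a model can still go back to the individual events.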
Something else to keep in mind with any complex analytics system is that it’s important to understand who the target users are. Some people in our company only need a simple dashboard, and BigQuery’s integration with products like Data Studio makes that easy. Other users require more sophisticated reporting and the ability to create complex dashboards, and for them Looker may make more sense.
Support for machine learning was important for us. Some machine learning use cases benefit from easy-to-develop loops, where storing the data in BigQuery makes it easy to use AutoML and BigQuery ML. At the same time, other machine learning use cases may require highly customizable production solutions. For these situations, we’re developing Kubeflow-based solutions that can also consume data from BigQuery.
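For readers unfamiliar with BigQuery ML, the sketch below shows what such an easy-to-develop loop can look like: a model is trained with a single SQL statement over a table that already lives in BigQuery. The dataset, table, model, and label names are hypothetical and do not describe any Unity model; the helper only builds the statement so it can be inspected without credentials.

```python
def create_model_sql(model: str, source_table: str, label: str) -> str:
    """Build a BigQuery ML statement that trains a logistic regression model."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'logistic_reg', input_label_cols = ['{label}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# All names below are placeholders for illustration.
sql = create_model_sql(
    model="analytics.example_model",
    source_table="analytics.example_features",
    label="converted",
)
print(sql)

# Running the statement requires the google-cloud-bigquery package and
# valid credentials, for example:
#   from google.cloud import bigquery
#   bigquery.Client().query(sql).result()
```

Because training happens where the data is stored, there is no export step, which is what makes this kind of loop quick to iterate on.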
Next steps to modernize your analytics infrastructure
At Unity, we’ve been able to deploy a world-class analytics infrastructure, capable of ingesting petabytes of data from billions of events per day. We can now make that data available to key stakeholders in the organization within hours. After bringing together our previously siloed data solutions, we have seen improved internal processes, the ability to operationalize reporting, and quicker turnaround times for many requests. By ingesting all of our data into one system and serving every use case from a single source in BigQuery, we now run on a managed service that is highly scalable, flexible, and carries minimal overhead.
Check out all that is happening in machine learning at Unity, and if you want to work on similar challenges with a stellar team of engineers and scientists, browse our open ML roles.