GCP – MLOps System with AutoML and Pipeline in Vertex AI
When you build a machine learning product, you need to consider at least two MLOps scenarios. First, the model could be replaced later as breakthrough algorithms are introduced in academia or industry. Second, the model itself has to evolve as the data in the real world changes.
We can handle both scenarios with the services provided by Vertex AI. For example:
AutoML automatically figures out the best model based on your budget, data, and settings.
You can easily manage the dataset with Vertex Managed Datasets by creating a new one or adding additional data to an existing dataset.
And you can build a machine learning pipeline with Vertex Pipelines to automate a series of steps, from importing a dataset all the way to deploying the model.
This blog post shows you how to build such a system. You can find the full notebook for reproduction here. Many folks focus on the machine learning pipeline when it comes to MLOps, but there are more parts needed to build MLOps as a “system”. In this post, you will see how GCS (Google Cloud Storage) and Google Cloud Functions can help manage data and handle events in the MLOps system.
Architecture
Figure 1 Overall MLOps Architecture (original)
Figure 1 shows the overall architecture of this blog post. Let’s first go over which components are involved, and then see how they are connected in the two common workflows of the MLOps system.
Components
Vertex AI is at the heart of this system, and it leverages Vertex Managed Datasets, AutoML, Predictions, and Pipelines. With Vertex Managed Datasets we can not only create a dataset but also manage it as it grows. Vertex AutoML trains the best model it can for us without our needing to know much about modeling. Vertex Predictions creates an endpoint (REST API) for the client to communicate with.
It is a simple (fully managed) yet reasonably complete end-to-end MLOps workflow that goes from a dataset to a trained and deployed model. This workflow can be written programmatically with Vertex Pipelines. Vertex Pipelines emits the specification of a machine learning pipeline so that we can run the same pipeline whenever and wherever we want. We just need to know when and how to trigger the pipeline, and that is where the next two components, Cloud Functions and Cloud Storage, come in.
Cloud Functions is a serverless way to deploy your code on GCP. In this particular project, it is used to trigger the pipeline by listening for changes at a specified Cloud Storage location. Specifically, when a new dataset is added under a new SPAN-NUMBER folder, the pipeline is triggered to train on the whole dataset, and a new model is deployed.
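The function itself only needs a few lines of Python. The sketch below is not the exact function from the project; it assumes a first-generation, GCS-triggered Cloud Function, a pipeline specification already compiled with KFP and uploaded to GCS, and a pipeline parameter named span. Replace the placeholder names with your own.

```python
# Minimal sketch of a GCS-triggered Cloud Function that launches the pipeline.
# Requires google-cloud-aiplatform in requirements.txt. All names below are placeholders.
from google.cloud import aiplatform

PROJECT_ID = "your-gcp-project"
REGION = "us-central1"
PIPELINE_SPEC = "gs://your-bucket/pipeline/pipeline-spec.json"  # compiled KFP pipeline
PIPELINE_ROOT = "gs://your-bucket/pipeline-root"


def trigger_pipeline(event, context):
    """Runs on every object finalize event in the dataset bucket."""
    # event["name"] is the uploaded object path, e.g. "SPAN-2/images/cat/001.png"
    span_folder = event["name"].split("/")[0]

    aiplatform.init(project=PROJECT_ID, location=REGION)

    job = aiplatform.PipelineJob(
        display_name="automl-training-pipeline",
        template_path=PIPELINE_SPEC,
        pipeline_root=PIPELINE_ROOT,
        parameter_values={"span": span_folder},  # hypothetical pipeline parameter
    )
    job.submit()  # fire-and-forget; the function does not wait for the run to finish
```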
Workflow
This MLOps system works in the following manner. First, you prepare the dataset with either the Vertex Dataset built-in user interface or any external tool of your preference, and upload the prepared dataset into the designated GCS bucket under a new folder named SPAN-NUMBER. Cloud Functions then detects the change in the GCS bucket and triggers the Vertex Pipeline to run the jobs, from AutoML training to Endpoint deployment.
Inside the Vertex Pipeline, it checks if there is an existing dataset created previously. If the dataset is new, it creates a new Vertex Dataset by importing the dataset from the GCS location and emits the corresponding Artifact. Otherwise, it adds the additional dataset to the existing Vertex Dataset and emits an artifact.
When the Vertex Pipeline sees that the dataset is new, it trains a new AutoML model and deploys it by creating a new Endpoint. If the dataset is not new, it looks up the model ID from Vertex Model and decides whether a brand-new AutoML model or an updated AutoML model is needed. The reason for this second branch is to make sure a new model still gets created if, for some reason, no AutoML model exists yet. In either case, once the model is trained, the corresponding component emits an Artifact as well.
Directory structure to reflect different distributions
In this project, I have created two subsets of the CIFAR-10 dataset, one for SPAN-1 and the other for SPAN-2. A more general version of this project, which shows how to build training and batch evaluation pipelines and make them cooperate to evaluate the currently deployed model and trigger the retraining process, can be found here.
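For reference, the bucket layout looks roughly like this (the bucket name and the contents inside each span are illustrative; only the SPAN-NUMBER folder convention matters):

```
gs://<your-dataset-bucket>/
├── SPAN-1/        # first subset of CIFAR-10, used for the initial run
│   └── ...        # images (and any import files) for this span
└── SPAN-2/        # second subset, uploaded later to trigger retraining
    └── ...
```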
ML Pipeline with Kubeflow Pipelines (KFP)
We chose Kubeflow Pipelines (KFP) to orchestrate the pipeline. There are a few things I would like to highlight. First, it is good to know how to create branches with conditional statements in KFP. Second, you need to explore the AutoML API specification to fully leverage AutoML capabilities (such as training a model based on a previously trained one). Last but not least, you also need a way to emit Artifacts for Vertex Dataset and Vertex Model so that Vertex AI can recognize them. Let’s go through these one by one.
Branching strategy
In this project, there are two main branches and two sub-branches inside the second main branch. The main branches split the pipeline based on whether there is an existing Vertex Dataset. The sub-branches apply inside the second main branch, which is selected when an existing Vertex Dataset is found: the pipeline looks up the list of models and decides whether to train an AutoML model from scratch or on top of the previously trained one.
Machine learning pipelines written in KFP can contain branches using the special kfp.dsl.Condition syntax. For instance, we can create the branches like below.
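The snippet below is a condensed sketch of the branching structure rather than the full pipeline from the notebook: the branch bodies are left as placeholders, and the component arguments are illustrative.

```python
# Condensed sketch of the two main branches and the two sub-branches.
# get_dataset_id and get_model_id are custom components (one is sketched in the
# next section); fill each branch with real steps before compiling the pipeline.
from kfp.v2 import dsl


@dsl.pipeline(name="automl-training-pipeline")
def pipeline(project: str, region: str):
    dataset_task = get_dataset_id(
        project_id=project, region=region, dataset_display_name="cifar10"
    )

    # Branch 1: no Vertex Dataset exists yet -> create one and train from scratch.
    # The plain string return value of a component is exposed under the key "Output".
    with dsl.Condition(dataset_task.outputs["Output"] == "None", name="first-run"):
        ...  # create Vertex Dataset, AutoML training, Endpoint deployment

    # Branch 2: a Vertex Dataset already exists -> import the new span of data.
    with dsl.Condition(dataset_task.outputs["Output"] != "None", name="new-span"):
        model_task = get_model_id(
            project_id=project, region=region, model_display_name="cifar10"
        )

        # Sub-branch 2a: no previous model found -> train a new AutoML model.
        with dsl.Condition(model_task.outputs["Output"] == "None", name="train-scratch"):
            ...

        # Sub-branch 2b: a model exists -> train on top of it (see base_model below).
        with dsl.Condition(model_task.outputs["Output"] != "None", name="train-on-top"):
            ...
```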
get_dataset_id and get_model_id are custom KFP components that check whether an existing Vertex Dataset or Vertex Model is present, respectively; each returns the string 'None' if nothing is found and the resource name otherwise. They also output Vertex AI aware Artifacts. You will see what this means in the next section.
Emit Vertex AI Aware Artifacts
An Artifact is not only helpful for tracking the whole path of each experiment in the machine learning pipeline; it also displays metadata in the pipeline UI (the Vertex Pipeline UI in this case). When Vertex AI aware artifacts are emitted in the pipeline, the Vertex Pipeline UI displays links to the corresponding internal Vertex AI services, such as Vertex Dataset, so that users can click through to a web page with more details about them.
So how can you write a custom component that generates Vertex AI aware artifacts? First of all, the custom component should have Output[Artifact] among its parameters. Then you need to fill in the resourceName key of the metadata attribute with a special string format.
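Here is a sketch of such a component, approximating the get_dataset_id component from this project (the full notebook has the exact implementation); the display name filter is a placeholder.

```python
# Sketch of a custom component that emits a Vertex AI aware Artifact.
from kfp.v2.dsl import Artifact, Output, component


@component(packages_to_install=["google-cloud-aiplatform"])
def get_dataset_id(
    project_id: str,
    region: str,
    dataset_display_name: str,
    dataset: Output[Artifact],
) -> str:
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)

    # Look for an existing Vertex Dataset with the given display name.
    datasets = aiplatform.ImageDataset.list(
        filter=f'display_name="{dataset_display_name}"'
    )
    if len(datasets) == 0:
        return "None"

    # Vertex AI aware resource name:
    # projects/<project-id>/locations/<location>/<vertex-resource-type>/<resource-name>
    resource_name = (
        f"projects/{project_id}/locations/{region}/datasets/{datasets[0].name}"
    )
    dataset.metadata["resourceName"] = resource_name
    return resource_name
```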
The code above illustrates the get_dataset_id component used in the previous snippet (see the full notebook for the exact implementation). As you can see, dataset is declared among the parameters as Output[Artifact]. Even though it appears as a parameter, it is emitted automatically; you just need to fill in the necessary data as if it were a regular function variable.
The datasets variable holds the list of Vertex Datasets returned by the aiplatform.ImageDataset.list API. If the list is empty, the component simply returns 'None'. Otherwise, it returns the resource name of the found Vertex Dataset and, at the same time, fills in dataset.metadata['resourceName'] with that resource name. The Vertex AI aware resource name follows a special string format: 'projects/<project-id>/locations/<location>/<vertex-resource-type>/<resource-name>'.
The <vertex-resource-type> can be any value that points to an internal Vertex AI service. For instance, if you want to specify that the Artifact is a Vertex Model, you should replace <vertex-resource-type> with models. The <resource-name> is the unique ID of the resource, and it can be accessed through the name attribute of the resource found via the aiplatform API. The other custom component, get_model_id, is written in a very similar way.
AutoML based on the previous model
You sometimes want to train a new model on top of the previously best model. When that is possible, the new model will likely be better than one trained from scratch because it leverages the knowledge learnt beforehand.
Luckily, Vertex AutoML comes with this capability. The AutoMLImageTrainingJobRunOp component lets you do this by simply filling in the base_model argument as below:
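The following sketch shows what this step could look like inside the "train on top of the previous model" branch. It assumes dataset_task and model_task from the earlier sketches, the display names and budget are placeholders, and depending on your KFP and google-cloud-pipeline-components versions the artifacts may need to be typed as google.VertexDataset / google.VertexModel rather than plain Artifacts.

```python
# Sketch of an AutoML training step that builds on a previously trained model.
from google_cloud_pipeline_components import aiplatform as gcc_aip

training_task = gcc_aip.AutoMLImageTrainingJobRunOp(
    project=project,
    display_name="cifar10-automl-on-top",
    prediction_type="classification",
    dataset=dataset_task.outputs["dataset"],  # existing Vertex Dataset artifact
    base_model=model_task.outputs["model"],   # previously trained model artifact (assumed output name)
    model_display_name="cifar10-automl-model",
    budget_milli_node_hours=8000,             # 8 node hours; adjust to your budget
)
```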
When training a new AutoML model from scratch, you pass None for the base_model argument, which is its default value. However, you can set it to a VertexModel artifact, and the component will launch an AutoML training job based on that model.
One thing to be careful of is that VertexModel artifacts cannot be constructed the way you would construct a typical Python object. That means you cannot create an instance of a VertexModel artifact by simply setting the ID found in the Vertex Model dashboard. The only way to create one is to set its metadata['resourceName'] properly. The same rule applies to other Vertex AI related artifacts such as VertexDataset. You can see how the VertexDataset artifact is constructed properly to look up an existing Vertex Dataset and import additional data into it in the full notebook of this project here.
Cost
You can reproduce the same result from this project with the free $300 credit you get when creating a new GCP account. However, I list the actual costs of this kind of job for those who might wonder.
At the time of this blog post, Vertex Pipelines costs about $0.03 per pipeline run, and the underlying VM type for each pipeline component is e2-standard-4, which costs about $0.134/hour. Vertex AutoML training costs about $3.465/hour for image classification. GCS holds the actual data, which costs about $2.40/month for 100 GiB of capacity, and the Vertex Dataset itself is free.
To simulate the two different branches, the entire experiment took about one to two hours, and if we sum everything up, the total cost for this project is approximately $16.59. Please find more detailed pricing information about Vertex AI here.
Conclusion
Many people underestimate the capability of AutoML, but it is a great alternative for app and service developers who don’t have much of a machine learning background. Vertex AI is a great platform that provides AutoML as well as Pipeline features to automate the machine learning workflow. In this article, I have demonstrated how to set up and run a basic MLOps workflow on Vertex AI, from data ingestion, to training a model on top of the previously best one, to deploying the model. With this, we can let our machine learning model automatically adapt to changes in a new dataset. What’s left for you to implement is to integrate a model monitoring system to detect data/model drift (one example can be found here).