GCP – From dark data to bright insights: The dawn of smart storage
Organizations interested in AI today have access to amazing computational power with Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs), while foundational models like Gemini are redefining what’s possible. Yet for many enterprises a critical obstacle to AI is the data itself, specifically unstructured data. According to Enterprise Strategy Group, for most organizations, 61% of their total data is unstructured, the vast majority of which sits unanalyzed and unlabeled in archives as so-called “dark data.” With the help of AI, this untapped resource becomes an opportunity to unlock a treasure trove of insights.
At the same time, traditional tools only scratch the surface of unstructured data, and subject matter experts must build massive, manual preprocessing pipelines and define the data’s semantic meaning. This blocks any real analysis at scale and keeps companies from using even a fraction of what they store.
Now imagine a world where your unstructured data isn’t just stored, but understood. A world where you can ask complex questions of data such as images, videos, and documents, and get interesting answers in return. This isn’t just a futuristic vision — the era of smart storage is upon us. Today we are announcing new auto annotate and object contexts features that use AI to generate metadata and insights on your data, so you can then use your dark data for discovery, curation, and governance at scale. Better yet, the new features relieve you from having to build and manage your own object-analysis data pipelines.
Leveraging AI to transform dark data
Now, as unstructured data lands in Google Cloud, it’s no longer treated as a passive object. Instead, a data pipeline leverages AI to automatically process and understand the data, surfacing key insights and connections. Two new features are integral to this vision: auto annotate, which enriches your data by automatically generating metadata using Google’s pretrained AI models, and object contexts, which lets you attach custom, actionable tags to your data. Together, these two features can help transform passive data into active assets, unlocking use cases such as rapid data discovery for AI model training, streamlined data curation to reduce model bias, enhanced data governance to protect sensitive information, and the ability to build powerful, stateful workflows directly on your storage.
Making your data smart
Auto annotate, currently in a limited experimental release, automatically generates rich metadata (“annotations”) about objects stored in Cloud Storage buckets by applying Google’s advanced AI models, starting with image objects. Getting started is simple: enable auto annotate for your selected buckets or an entire project, pick one or more available models, and your entire image library will be annotated. Furthermore, new images are automatically annotated as they are uploaded. An annotation’s lifecycle is always tied to its object’s, simplifying management and helping to ensure consistency. Importantly, auto annotate operates under your control, only accessing object content to which you have explicitly granted permissions. Then, you can query the annotations, which are available as object contexts, through Cloud Storage API calls and Storage Insights datasets. The initial release uses pretrained models for generating annotations: object detection with confidence scores, image labeling, and objectionable content detection.
[Image: a sample of generated annotations for an object]
Then, with object contexts, you can attach custom key-value pair metadata directly to objects in Cloud Storage, including information generated by the new auto annotate feature. Currently in preview, object contexts are natively integrated with Cloud Storage APIs for listing and batch operations, as well as Storage Insights datasets for analysis in BigQuery. Each context includes object creation and modification timestamps, providing valuable lineage information. You can use Identity and Access Management (IAM) permissions to control who can add, change, or remove object contexts. When migrating data from Amazon S3 using Cloud Storage APIs, existing S3 Object Tags are automatically converted into contexts.
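To make the data model concrete, here is a minimal in-memory sketch of key-value contexts with timestamps. It does not call Cloud Storage at all: `StoredObject`, `ContextValue`, `set_context`, and `filter_by_context` are hypothetical names invented for illustration, and the sketch assumes timestamps are tracked per context entry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ContextValue:
    """One context entry: a value plus creation and modification times."""
    value: str
    create_time: datetime
    update_time: datetime


@dataclass
class StoredObject:
    """Hypothetical in-memory stand-in for an object in a bucket."""
    name: str
    contexts: Dict[str, ContextValue] = field(default_factory=dict)

    def set_context(self, key: str, value: str) -> None:
        """Add or overwrite a context, keeping the original creation time."""
        now = datetime.now(timezone.utc)
        existing = self.contexts.get(key)
        created = existing.create_time if existing else now
        self.contexts[key] = ContextValue(value, created, now)


def filter_by_context(objects: List[StoredObject], key: str, value: str) -> List[str]:
    """Return the names of objects whose context `key` equals `value`."""
    return [
        obj.name
        for obj in objects
        if key in obj.contexts and obj.contexts[key].value == value
    ]
```

In this sketch, overwriting a key updates the modification time but preserves the creation time, which is the kind of lineage information the timestamps are meant to provide.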
In short, object contexts provide a flexible, native way to enrich your data with context. Combined with a smart storage feature like auto annotate, object contexts convert data into information, letting you build sophisticated data management workflows directly within Cloud Storage.
Now, let’s take a deeper look at some of the new use cases these smart storage features deliver.
1. Data discovery
One of the most significant challenges in building new AI applications is data discovery — how to find the most relevant data across an enterprise’s vast and often siloed data stores. Locating specific images or information within petabytes of unstructured data can feel impossible. Auto annotate automatically generates rich, descriptive annotations for your data in Cloud Storage. Annotations, including labels and detected objects, are available within object contexts and fully indexed in BigQuery. After generating embeddings for them, you can then use BigQuery to run a semantic search for these annotations, effectively solving the “needle in a haystack” problem. For example, a large retailer with millions of product images can use auto annotate and BigQuery to quickly find ‘red dresses’ or ‘leather sofas’, accelerating catalog management and marketing efforts.
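The ranking step behind a semantic search can be sketched in a few lines. The example below is only an illustration of cosine-similarity ranking over embedding vectors: it runs in plain Python rather than BigQuery, and the object names and three-dimensional vectors are made-up toy values, not output from any embedding model.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec: List[float],
                annotation_vecs: Dict[str, List[float]],
                k: int = 2) -> List[str]:
    """Return the k object names whose annotation vectors best match the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in annotation_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings (assumed values): dimension 0 loosely stands for "dress",
# dimension 2 for "sofa". Real embeddings have hundreds of dimensions.
ANNOTATION_VECS = {
    "images/dress-001.jpg": [0.9, 0.1, 0.0],
    "images/sofa-114.jpg": [0.0, 0.2, 0.9],
    "images/dress-207.jpg": [0.8, 0.3, 0.1],
}
QUERY_VEC = [1.0, 0.0, 0.0]  # stand-in for the embedded query "red dresses"

print(top_matches(QUERY_VEC, ANNOTATION_VECS, k=2))
```

With these toy vectors, the two dress images rank above the sofa image. At real scale, the same comparison would run inside the warehouse over indexed embeddings rather than in a Python loop.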
2. Data curation for AI
Building effective AI models requires carefully curated datasets. Sifting through data to ensure it is widely representative (e.g., “does this dataset have cars in multiple colors?”) to reduce model bias, or to select specific training examples (e.g., “Find images with red cars”), is both time-consuming and error-prone. Auto annotate can identify attributes like colors and object types, which helps automate the selection of balanced datasets.
For instance, an autonomous vehicle company training models to recognize traffic signs could use auto annotate to identify and extract, from petabytes of on-road camera data, the images that contain the words ‘Stop’ or ‘Pedestrian Crossing’.
Vivint, a smart home and security company, has been using auto annotate to find and understand its data.
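As a rough illustration of the curation step, the sketch below checks how an attribute is distributed across annotated images and draws an equal number of examples per value. The record layout and the `color` key are assumptions made for this example, not the actual annotation schema.

```python
from collections import Counter, defaultdict
from typing import Dict, List

Record = Dict[str, str]


def attribute_distribution(records: List[Record], key: str) -> Counter:
    """Count how many records carry each value of the given attribute."""
    return Counter(r[key] for r in records if key in r)


def underrepresented(records: List[Record], key: str, min_count: int) -> List[str]:
    """List attribute values that appear fewer than min_count times."""
    counts = attribute_distribution(records, key)
    return sorted(value for value, n in counts.items() if n < min_count)


def balanced_sample(records: List[Record], key: str, per_class: int) -> Dict[str, List[str]]:
    """Pick up to per_class object names for each attribute value."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for r in records:
        if key in r:
            buckets[r[key]].append(r["name"])
    return {value: names[:per_class] for value, names in buckets.items()}


# Hypothetical annotation records for four car images.
RECORDS = [
    {"name": "cars/001.jpg", "color": "red"},
    {"name": "cars/002.jpg", "color": "red"},
    {"name": "cars/003.jpg", "color": "blue"},
    {"name": "cars/004.jpg", "color": "red"},
]

print(attribute_distribution(RECORDS, "color"))
print(underrepresented(RECORDS, "color", min_count=2))
```

Here the distribution check shows that blue cars are underrepresented, which is the signal you would use to go find more examples before training.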
“Our customers trust us to help make their homes and lives safer, smarter, and more convenient, and AI is at the heart of our product and customer experience innovations. Cloud Storage auto annotate’s rich metadata delivered in BigQuery helps us scale our data discovery and curation efforts, speeding up our AI development process from 6 months to as little as 1 month by finding the needle-in-a-haystack data essential to improve our models.” – Brandon Bunker, VP of Product, AI, Vivint
3. Governing unstructured data at scale
Unstructured data is constantly growing, and manually managing and governing that data to identify sensitive information, detect policy violations, or categorize it for lifecycle management is a challenge. Auto annotate and object contexts help solve these data governance and compliance challenges. For example, a retail customer can use auto annotate to identify and flag images containing visible customer personally identifiable information (PII) such as shipping labels or order forms. This information, stored in object context, can then trigger automated governance actions such as moving flagged objects to a restricted bucket or initiating a review process.
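One way to picture such a workflow is as a small planning function that reads each object's contexts and decides what to do next. This is a sketch under stated assumptions: the `pii_detected` key, its `true` and `needs_review` values, and the `restricted-bucket` name are all hypothetical, and the function only produces a plan rather than moving any data.

```python
from typing import Dict, List, Tuple

RESTRICTED_BUCKET = "restricted-bucket"  # hypothetical bucket name
PII_KEY = "pii_detected"                 # hypothetical context key


def plan_governance_actions(
    object_contexts: Dict[str, Dict[str, str]]
) -> List[Tuple[str, str]]:
    """Map each flagged object to a governance action.

    Objects flagged "true" are scheduled for a move to the restricted
    bucket, objects flagged "needs_review" are queued for human review,
    and everything else is left alone.
    """
    plan: List[Tuple[str, str]] = []
    for name in sorted(object_contexts):
        flag = object_contexts[name].get(PII_KEY)
        if flag == "true":
            plan.append((name, "move:" + RESTRICTED_BUCKET))
        elif flag == "needs_review":
            plan.append((name, "review"))
    return plan


# Hypothetical contexts for three uploaded images.
CONTEXTS = {
    "img/label.jpg": {"pii_detected": "true"},
    "img/sofa.jpg": {"pii_detected": "false"},
    "img/form.jpg": {"pii_detected": "needs_review"},
}

for object_name, action in plan_governance_actions(CONTEXTS):
    print(object_name, action)
```

Keeping the decision logic separate from the code that actually moves objects makes the policy easy to test and audit before it touches production data.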
BigID, a partner building solutions on Cloud Storage, reports that using object contexts is helping them manage their customers’ risk:
“Object contexts gives us a way to take the outputs of BigID’s industry-leading data classification solutions and apply labels to Cloud Storage objects. Object contexts will allow BigID labels to shed light onto data in Cloud Storage: identifying objects which contain sensitive information and helping them understand and manage their risk across AI, security, and privacy.” – Marc Hebrard, Principal Technical Architect, BigID
The future is bright for your data
At Google Cloud, we’re committed to building a future where your data is not just a passive asset but an active catalyst for innovation. Don’t keep your valuable data in the dark. Bring your data to Cloud Storage and enable auto annotate and object contexts to unlock its full potential with Gemini, Vertex AI, and BigQuery.
You can start using object contexts today, and reach out to us for an early look at auto annotate. Once you have access, simply enable auto annotate for selected buckets or on an entire project, pick one or more available models, and your entire image library will be annotated. You can then query the annotations that are available as object contexts through Cloud Storage API calls and Storage Insights datasets.
To learn more, read about our end-to-end vision in a showcase paper with Enterprise Strategy Group: Illuminating Dark Data With Smart Storage from Google Cloud.