GCP – Use Gemini 2.0 to speed up document extraction and lower costs
A few weeks ago, Google DeepMind released Gemini 2.0 for everyone, including Gemini 2.0 Flash, Gemini 2.0 Flash-Lite, and Gemini 2.0 Pro (Experimental). All of these models support at least 1 million input tokens, which opens up a wide range of use cases, from image generation to creative writing. It has also changed how we convert documents into structured data. Manual document processing is slow and expensive, and Gemini 2.0 makes it dramatically easier to chunk PDFs for RAG systems and even transform PDFs directly into insights.
Today, we’ll take a deep dive into a multi-step approach in which you use Gemini 2.0 to improve your document extraction by combining large language models (LLMs) with structured, externalized rules.
A multi-step approach to document extraction, made easy
A multi-step architecture, as opposed to relying on a single, monolithic prompt, offers significant advantages for robust extraction. This approach begins with modular extraction, where initial tasks are broken down into smaller, more focused prompts targeting specific content locations within a document. This modularity not only enhances accuracy but also reduces the cognitive load on the LLM.
Another benefit of a multi-step approach is externalized rule management. By managing post-processing rules externally, for instance, using Google Sheets or a BigQuery table, we gain the benefits of easy CRUD (Create, Read, Update, Delete) operations, improving both maintainability and version control of the rules. This decoupling also separates the logic of extraction from the logic of processing, allowing for independent modification and optimization of each.
Ultimately, this hybrid approach combines the strengths of LLM-powered extraction with a structured rules engine. LLMs handle the complexities of understanding and extracting information from unstructured data, while the rules engine provides a transparent and manageable system for enforcing business logic and decision-making. The following steps outline a practical implementation.
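To make the shape of that pipeline concrete before diving into the steps, here is a minimal pure-Python sketch. The function names are illustrative, and the stubs stand in for the two Gemini calls described below; in this sketch each rule's condition is a plain Python callable rather than a prompt.

```python
import json

def extract_metrics(document: str) -> list[dict]:
    # Step 1 (stub): in the real pipeline this is a Gemini call with a
    # focused extraction prompt against the PDF.
    return [{"metric_id": "water_consumption", "value": "3.4 billion",
             "unit": "gallons"}]

def apply_rules(extracted: list[dict], rules: list[dict]) -> list[dict]:
    # Step 2 (stub): the post implements this with a second Gemini call;
    # here each rule's condition is an ordinary predicate.
    by_id = {m["metric_id"]: m for m in extracted}
    return [{"rule_id": r["rule_id"], "alert_message": r["alert_message"]}
            for r in rules if r["condition"](by_id)]

def dispatch(alerts: list[dict]) -> list[str]:
    # Step 3: hand triggered alerts to downstream systems (here, JSON lines).
    return [json.dumps(a) for a in alerts]

rules = [{"rule_id": "AR000",
          "condition": lambda m: "water_consumption" in m,
          "alert_message": "Water metrics present; route for review."}]
alerts = dispatch(apply_rules(extract_metrics("report.pdf"), rules))
```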
Step 1: Extraction
Let’s test a sample prompt with a configurable set of rules. This hands-on example will demonstrate how easily you can define and apply business logic to extracted data, all powered by Gemini and Vertex AI.
First, we extract data from a document. Let’s use Google’s 2023 Environment Report as the source document. We use Gemini with the initial prompt below to extract data. This is not a known schema, but a prompt we’ve created for the purposes of this post. To create specific response schemas, use controlled generation with Gemini.
````
<PERSONA>
You are a meticulous AI assistant specializing in extracting key sustainability metrics and performance data from corporate environmental reports. Your task is to accurately identify and extract specific data points from a provided document, ensuring precise values and contextual information are captured. Your analysis is crucial for tracking progress against sustainability goals and supporting informed decision-making.

<INSTRUCTIONS>

**Task:**
Analyze the provided Google Environmental Report 2023 (PDF) and extract the following `key_metrics`. For each metric:

1. **`metric_id`**: A short, unique identifier for the metric (provided below).
2. **`description`**: A brief description of the metric (provided below).
3. **`value`**: The numerical value of the metric as reported in the document. Be precise (e.g., "10.2 million", not "about 10 million"). If a range is given, and a single value is not clearly indicated, you must use the largest of the range.
4. **`unit`**: The unit of measurement for the metric (e.g., "tCO2e", "million gallons", "%"). Use the units exactly as they appear in the report.
5. **`year`**: The year to which the metric applies (2022, unless otherwise specified).
6. **`page_number`**: The page number(s) where the metric's value is found. If the information is spread across multiple pages, list all relevant pages, separated by commas. If the value requires calculations based on the page, list the final answer page.
7. **`context`**: One sentence to put the metric in context.

**Metrics to Extract:**

```json
[
  {
    "metric_id": "ghg_emissions_total",
    "description": "Total GHG Emissions (Scope 1, 2 market-based, and 3)"
  },
  {
    "metric_id": "ghg_emissions_scope1",
    "description": "Scope 1 GHG Emissions"
  },
  {
    "metric_id": "ghg_emissions_scope2_market",
    "description": "Scope 2 GHG Emissions (market-based)"
  },
  {
    "metric_id": "ghg_emissions_scope3_total",
    "description": "Total Scope 3 GHG Emissions"
  },
  {
    "metric_id": "renewable_energy_capacity",
    "description": "Clean energy generation capacity from signed agreements (2010-2022)"
  },
  {
    "metric_id": "water_replenishment",
    "description": "Water replenished"
  },
  {
    "metric_id": "water_consumption",
    "description": "Water consumption"
  },
  {
    "metric_id": "waste_diversion_landfill",
    "description": "Percentage of food waste diverted from landfill"
  },
  {
    "metric_id": "recycled_material_plastic",
    "description": "Percentage of plastic used in manufactured products that was recycled content"
  },
  {
    "metric_id": "packaging_plastic_free",
    "description": "Percentage of product packaging that is plastic-free"
  }
]
```
````
The JSON output below, which we’ll assign to the variable `extracted_data`, represents the results of the initial data extraction by Gemini. This structured data is now ready for the next critical phase: applying our predefined business rules.
```python
extracted_data = [
    {"metric_id": "ghg_emissions_total",
     "description": "Total GHG Emissions (Scope 1, 2 market-based, and 3)",
     "value": "14.3 million", "unit": "tCO2e", "year": 2022, "page_number": "23",
     "context": "In 2022 Google's total GHG emissions, including Scope 1, 2 (market-based), and 3, amounted to 14.3 million tCO2e."},
    {"metric_id": "ghg_emissions_scope1",
     "description": "Scope 1 GHG Emissions",
     "value": "0.23 million", "unit": "tCO2e", "year": 2022, "page_number": "23",
     "context": "In 2022, Google's Scope 1 GHG emissions were 0.23 million tCO2e."},
    {"metric_id": "ghg_emissions_scope2_market",
     "description": "Scope 2 GHG Emissions (market-based)",
     "value": "0.03 million", "unit": "tCO2e", "year": 2022, "page_number": "23",
     "context": "Google's Scope 2 GHG emissions (market-based) in 2022 totaled 0.03 million tCO2e."},
    {"metric_id": "ghg_emissions_scope3_total",
     "description": "Total Scope 3 GHG Emissions",
     "value": "14.0 million", "unit": "tCO2e", "year": 2022, "page_number": "23",
     "context": "Total Scope 3 GHG emissions for Google in 2022 reached 14.0 million tCO2e."},
    {"metric_id": "renewable_energy_capacity",
     "description": "Clean energy generation capacity from signed agreements (2010-2022)",
     "value": "7.5", "unit": "GW", "year": 2022, "page_number": "14",
     "context": "By the end of 2022, Google had signed agreements for a clean energy generation capacity of 7.5 GW since 2010."},
    {"metric_id": "water_replenishment",
     "description": "Water replenished",
     "value": "2.4 billion", "unit": "gallons", "year": 2022, "page_number": "30",
     "context": "Google replenished 2.4 billion gallons of water in 2022."},
    {"metric_id": "water_consumption",
     "description": "Water consumption",
     "value": "3.4 billion", "unit": "gallons", "year": 2022, "page_number": "30",
     "context": "In 2022 Google's water consumption totalled 3.4 billion gallons."},
    {"metric_id": "waste_diversion_landfill",
     "description": "Percentage of food waste diverted from landfill",
     "value": "70", "unit": "%", "year": 2022, "page_number": "34",
     "context": "Google diverted 70% of its food waste from landfills in 2022."},
    {"metric_id": "recycled_material_plastic",
     "description": "Percentage of plastic used in manufactured products that was recycled content",
     "value": "50", "unit": "%", "year": 2022, "page_number": "32",
     "context": "In 2022 50% of plastic used in manufactured products was recycled content."},
    {"metric_id": "packaging_plastic_free",
     "description": "Percentage of product packaging that is plastic-free",
     "value": "34", "unit": "%", "year": 2022, "page_number": "32",
     "context": "34% of Google's product packaging was plastic-free in 2022."}
]
```
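Note that Gemini reports each `value` as a human-readable string such as "14.3 million". Before any numeric rule can compare these figures, they need to be normalized. A small helper for that (our own addition, not part of the post's prompts) might look like:

```python
import re

# Normalize human-readable magnitude strings ("14.3 million", "70",
# "2.4 billion") into floats for downstream numeric checks.
_SCALE = {"million": 1e6, "billion": 1e9}

def parse_value(text: str) -> float:
    match = re.match(r"([\d.]+)\s*(million|billion)?", text.strip())
    if match is None:
        raise ValueError(f"unparseable value: {text!r}")
    return float(match.group(1)) * _SCALE.get(match.group(2), 1.0)
```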
Step 2: Feed the extracted data into a rules engine
Next, we’ll feed this `extracted_data` into a rules engine, which, in our implementation, is another call to Gemini, acting as a powerful and flexible rules processor. Along with the extracted data, we’ll provide a set of validation rules defined in the `analysis_rules` variable. This engine, powered by Gemini, will systematically check the extracted data for accuracy, consistency, and adherence to our predefined criteria. Below is the prompt we provide to Gemini to accomplish this, along with the rules themselves.
```
<PERSONA>
You are a sustainability data analyst responsible for verifying the accuracy and consistency of extracted data from corporate environmental reports. Your task is to apply a set of predefined rules to the extracted data to identify potential inconsistencies, highlight areas needing further investigation, and assess progress towards stated goals. You are detail-oriented and understand the nuances of sustainability reporting.

<INSTRUCTIONS>

**Input:**

1. `extracted_data`: (JSON) The `extracted_data` variable contains the values extracted from the Google Environmental Report 2023, as provided in the previous turn. This is the output from the first Gemini extraction.
2. `analysis_rules`: (JSON) The `analysis_rules` variable contains a JSON string defining a set of rules to apply to the extracted data. Each rule includes a `rule_id`, `description`, `condition`, `action`, and `alert_message`.

**Task:**

1. **Iterate through Rules:** Process each rule defined in the `analysis_rules`.
2. **Evaluate Conditions:** For each rule, evaluate the `condition` using the data in `extracted_data`. Conditions may involve:
   * Accessing specific `metric_id` values within the `extracted_data`.
   * Comparing values across different metrics.
   * Checking for data types (e.g., ensuring a value is a number).
   * Checking page numbers for consistency.
   * Using logical operators (AND, OR, NOT) and mathematical comparisons (>, <, >=, <=, ==, !=).
   * Checking for the existence of data.
3. **Execute Actions:** If a rule's condition evaluates to TRUE, execute the `action` specified in the rule. The action describes *what* the rule is checking.
4. **Trigger Alerts:** If the condition is TRUE, generate the `alert_message` associated with that rule. Include relevant `metric_id` values and page numbers in the alert message to provide context.

**Output:**

Return a JSON array containing the triggered alerts. Each alert should be a dictionary with the following keys:

* `rule_id`: The ID of the rule that triggered the alert.
* `alert_message`: The alert message, potentially including specific values from the `extracted_data`.
```
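In practice, this second Gemini call can be assembled by serializing both variables directly into the prompt text. A sketch, where `RULES_ENGINE_INSTRUCTIONS` and `build_rules_prompt` are illustrative names standing in for the instruction block above:

```python
import json

# RULES_ENGINE_INSTRUCTIONS stands in for the full instruction block
# shown above; only the assembly step is sketched here.
RULES_ENGINE_INSTRUCTIONS = "<PERSONA>\n...\n<INSTRUCTIONS>\n..."

def build_rules_prompt(extracted_data, analysis_rules) -> str:
    # The rules-engine call sees both variables serialized as JSON
    # inside the prompt text.
    return (f"{RULES_ENGINE_INSTRUCTIONS}\n\n"
            f"extracted_data = {json.dumps(extracted_data, indent=2)}\n\n"
            f"analysis_rules = {json.dumps(analysis_rules, indent=2)}")

prompt = build_rules_prompt(
    [{"metric_id": "water_consumption", "value": "3.4 billion"}],
    {"rules": []})
```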
`analysis_rules` is a JSON object that contains the business rules we want to apply to the extracted data. Each rule defines a specific condition to check, an action to take if the condition is met, and an optional alert message if a violation occurs. The power of this approach lies in the flexibility of these rules; you can easily add, modify, or remove them without altering the core extraction process. The beauty of using Gemini is that the rules can be written in human-readable language and maintained by non-coders.
```python
analysis_rules = {
  "rules": [
    {
      "rule_id": "AR001",
      "description": "Check if all required metrics were extracted.",
      "condition": "extracted_data contains all metric_ids from the original extraction prompt",
      "action": "Verify the presence of all expected metrics.",
      "alert_message": "Missing metrics in the extracted data. The following metric IDs are missing: {missing_metrics}"
    },
    {
      "rule_id": "AR002",
      "description": "Check if total GHG emissions equal the sum of Scope 1, 2, and 3.",
      "condition": "extracted_data['ghg_emissions_total']['value'] != (extracted_data['ghg_emissions_scope1']['value'] + extracted_data['ghg_emissions_scope2_market']['value'] + extracted_data['ghg_emissions_scope3_total']['value']) AND extracted_data['ghg_emissions_total']['page_number'] == extracted_data['ghg_emissions_scope1']['page_number'] == extracted_data['ghg_emissions_scope2_market']['page_number'] == extracted_data['ghg_emissions_scope3_total']['page_number']",
      "action": "Sum Scope 1, 2, and 3 emissions and compare to the reported total.",
      "alert_message": "Inconsistency detected: Total GHG emissions ({total_emissions} {total_unit}) on page {total_page} do not equal the sum of Scope 1 ({scope1_emissions} {scope1_unit}), Scope 2 ({scope2_emissions} {scope2_unit}), and Scope 3 ({scope3_emissions} {scope3_unit}) emissions on page {scope1_page}. Sum is {calculated_sum}"
    },
    {
      "rule_id": "AR003",
      "description": "Check for unusually high water consumption compared to replenishment.",
      "condition": "extracted_data['water_consumption']['value'] > (extracted_data['water_replenishment']['value'] * 5) AND extracted_data['water_consumption']['unit'] == extracted_data['water_replenishment']['unit']",
      "action": "Compare water consumption to water replenishment.",
      "alert_message": "High water consumption: Consumption ({consumption_value} {consumption_unit}) is more than five times replenishment ({replenishment_value} {replenishment_unit}) on page {consumption_page} and {replenishment_page}."
    }
  ]
}
```
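Numeric rules like AR002 can also be re-checked deterministically in plain code alongside the Gemini-based engine. Here is a sketch of such a check (our own addition, with a small tolerance to absorb the rounding in the reported "million tCO2e" figures):

```python
# Deterministic re-check of rule AR002, complementing the LLM-based
# rules engine. A tolerance absorbs rounding in the reported figures
# (0.23 + 0.03 + 14.0 = 14.26, reported total 14.3).
def to_tco2e(text: str) -> float:
    number, _, unit = text.partition(" ")
    return float(number) * (1e6 if unit == "million" else 1.0)

extracted = {
    "ghg_emissions_total": "14.3 million",
    "ghg_emissions_scope1": "0.23 million",
    "ghg_emissions_scope2_market": "0.03 million",
    "ghg_emissions_scope3_total": "14.0 million",
}

def check_ar002(data: dict, tolerance: float = 0.1e6) -> list[str]:
    total = to_tco2e(data["ghg_emissions_total"])
    scope_sum = sum(to_tco2e(data[k]) for k in (
        "ghg_emissions_scope1", "ghg_emissions_scope2_market",
        "ghg_emissions_scope3_total"))
    if abs(total - scope_sum) > tolerance:
        return [f"AR002: total {total:,.0f} tCO2e != scope sum {scope_sum:,.0f} tCO2e"]
    return []
```

With the values above, the scope sum (14.26 million) is within tolerance of the reported total, so no alert fires; a materially wrong scope figure would trigger one.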
Step 3: Integrate your insights
Finally – and crucially – integrate the alerts and insights generated by the rules engine into existing data pipelines and workflows. This is where the real value of this multi-step process is unlocked. For our example, we can build robust APIs and systems using Google Cloud tools to automate downstream actions triggered by the rule-based analysis. Some examples of downstream tasks are:
- Automated task creation: Trigger Cloud Functions to create tasks in project management systems, assigning data verification to the appropriate teams.
- Data quality pipelines: Integrate with Dataflow to flag potential data inconsistencies in BigQuery tables, triggering validation workflows.
- Vertex AI integration: Leverage Vertex AI Model Registry to track data lineage and model performance related to extracted metrics and any corrections made.
- Dashboard integration: Use Looker, Google Sheets, or Data Studio to display alerts.
- Human-in-the-loop trigger: Build a trigger system, using Cloud Tasks, that flags which extractions a human reviewer should focus on and double-check.
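As one concrete example, a triggered alert can be turned into a task payload for a downstream handler. The field names and `build_task_payload` function here are illustrative only, not a real Cloud Tasks or Cloud Functions schema:

```python
import json

# Illustrative sketch: shape a triggered alert into a JSON task payload
# for a hypothetical HTTP-triggered verification handler.
def build_task_payload(alert: dict, assignee: str) -> str:
    return json.dumps({
        "title": f"Verify extraction: {alert['rule_id']}",
        "details": alert["alert_message"],
        "assignee": assignee,
        "source": "gemini-extraction-pipeline",
    })

payload = build_task_payload(
    {"rule_id": "AR002", "alert_message": "GHG totals do not reconcile."},
    assignee="data-quality-team")
```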
Make document extraction easier today
This hands-on approach provides a solid foundation for building robust, rule-driven document extraction pipelines. To get started, explore these resources:
- Gemini for document understanding: For a comprehensive, one-stop solution to your document processing needs, check out Gemini for document understanding. It simplifies many common extraction challenges.
- Few-shot prompting: Begin your Gemini journey with few-shot prompting. This powerful technique can significantly improve the quality of your extractions with minimal effort by providing examples within the prompt itself.
- Fine-tuning Gemini models: When you need highly specialized, domain-specific extraction results, consider fine-tuning Gemini models. This lets you tailor the model's performance to your exact requirements.

Read more for the details.