GCP – How to build a real-time voice agent with Gemini, Google ADK, and A2A protocol
Building advanced conversational AI has moved well beyond text.
Now, we can use AI to create real-time, voice-driven agents. These systems, however, demand low-latency, two-way communication, real-time information retrieval, and the ability to handle complex tasks. This guide shows you how to build one with Gemini and the Google Agent Development Kit (ADK), taking you step by step from a bare persona to an intelligent, responsive voice agent.
The foundational agent
First, we create an agent with a persona but no access to external tools. This is the simplest agent, relying only on its pre-trained knowledge. It’s a great starting point.
```python
# In app/server/streaming_service.py
from google.adk.agents import Agent
from core_utils import MODEL, SYSTEM_INSTRUCTION

self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    # The 'tools' list is omitted for now.
)
```
This agent can chat, but it lacks access to external information.
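Before adding tools, you can sanity-check this bare agent outside the streaming service. The snippet below is a minimal smoke test, not part of the guide’s code: it assumes google-adk’s Runner and InMemorySessionService APIs, whose exact signatures vary between releases, so verify them against your installed version.

```python
# A hypothetical smoke test for the tool-less agent. The Runner and
# InMemorySessionService usage is an assumption; check your ADK version.
import asyncio

from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from core_utils import MODEL, SYSTEM_INSTRUCTION

async def main():
    agent = Agent(
        name="voice_assistant_agent",
        model=MODEL,
        instruction=SYSTEM_INSTRUCTION,
    )
    sessions = InMemorySessionService()
    await sessions.create_session(app_name="voice_app", user_id="u1", session_id="s1")
    runner = Runner(agent=agent, app_name="voice_app", session_service=sessions)
    message = types.Content(role="user", parts=[types.Part(text="Introduce yourself.")])
    # run_async yields events as the agent produces them.
    async for event in runner.run_async(user_id="u1", session_id="s1", new_message=message):
        if event.content and event.content.parts and event.content.parts[0].text:
            print(event.content.parts[0].text)

asyncio.run(main())
```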
The advanced agent
To make the agent useful, we add tools. This lets the agent access live data and services. In streaming_service.py, we give the agent access to Google Search and Google Maps.
```python
# In app/server/streaming_service.py
import os

from google.adk.tools import GoogleSearch, MCPToolset
from google.adk.tools.mcp_tool.mcp_toolset import StdioServerParameters
from core_utils import MODEL, SYSTEM_INSTRUCTION

Maps_api_key = os.environ.get("Maps_API_KEY")

self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    tools=[
        GoogleSearch,
        MCPToolset(
            connection_params=StdioServerParameters(
                command="npx",
                args=["-y", "@modelcontextprotocol/server-google-maps"],
                env={"Maps_API_KEY": Maps_api_key},
            ),
        ),
    ],
)
```
A closer look at the tools
- Google Search: This pre-built ADK tool lets your agent perform Google searches to answer questions about current events and real-time information.
- MCP Toolset for Google Maps: This uses the Model Context Protocol (MCP) to connect your agent to a specialized server, in this case one that understands the Google Maps API. The main agent acts as an orchestrator, delegating tasks it can’t handle to specialist tools; see the sketch after this list.
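To make that delegation concrete, the sketch below spawns the same Maps MCP server the toolset uses and prints the tools it exposes. Treat it as an illustration only: get_tools() and close() follow ADK’s toolset interface as I understand it, and both the method names and the printed tool names may differ across ADK and MCP server versions.

```python
# A hedged sketch: inspect which tools the Google Maps MCP server
# exposes. get_tools()/close() are assumptions from ADK's toolset
# interface; verify against your installed versions.
import asyncio
import os

from google.adk.tools import MCPToolset
from google.adk.tools.mcp_tool.mcp_toolset import StdioServerParameters

async def list_maps_tools():
    toolset = MCPToolset(
        connection_params=StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-google-maps"],
            env={"Maps_API_KEY": os.environ.get("Maps_API_KEY", "")},
        ),
    )
    for tool in await toolset.get_tools():
        print(tool.name)  # e.g. maps_geocode, maps_search_places, ...
    await toolset.close()

asyncio.run(list_maps_tools())
```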
Engineering a natural conversation
The RunConfig object defines how the agent communicates. It controls aspects like voice selection and streaming mode.
```python
# In app/server/streaming_service.py (inside the handle_stream method)
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types
from core_utils import VOICE_NAME

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name=VOICE_NAME
            )
        )
    ),
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
    input_audio_transcription=types.AudioTranscriptionConfig(),
)
```
StreamingMode.BIDI (bi-directional) enables users to interrupt the agent, creating a more natural conversation.
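Barge-in only feels natural if the client also stops playback the moment the model is cut off. Here is a minimal sketch of the server side, assuming the ADK event carries an interrupted flag (mirroring the Gemini Live API’s server_content.interrupted signal):

```python
# Inside the response-handling task shown later (hypothetical; the
# `interrupted` attribute is an assumption based on the Live API).
import json

if getattr(event, "interrupted", False):
    # Tell the browser to flush its audio playback buffer immediately.
    await websocket.send(json.dumps({"type": "interrupted"}))
```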
The asynchronous core
Real-time voice chats require handling multiple tasks concurrently: listening, thinking, and speaking. Python’s asyncio and TaskGroup manage these parallel tasks.
```python
# In app/server/streaming_service.py (inside the handle_stream method)
import asyncio

async with asyncio.TaskGroup() as tg:
    # Task 1: Listens for audio from the user's browser.
    tg.create_task(receive_client_messages(), name="ClientMessageReceiver")
    # Task 2: Forwards audio to the Gemini service.
    tg.create_task(send_audio_to_service(), name="AudioSender")
    # Task 3: Listens for responses from Gemini.
    tg.create_task(receive_service_responses(), name="ServiceResponseReceiver")
```
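The guide doesn’t show the task bodies, but a hypothetical version of the first one clarifies the data flow: it parses each WebSocket frame from the browser and queues decoded audio for the sender task. The message shape mirrors the server-to-client messages shown later and is assumed here for the client-to-server direction.

```python
# Hypothetical body for Task 1, assuming the browser sends JSON frames
# shaped like {"type": "audio", "data": "<Base64-encoded PCM>"}.
import asyncio
import base64
import json

audio_queue: asyncio.Queue = asyncio.Queue()

async def receive_client_messages():
    async for raw in websocket:  # one WebSocket frame at a time
        message = json.loads(raw)
        if message.get("type") == "audio":
            # Decode and hand off to send_audio_to_service via the queue.
            await audio_queue.put(base64.b64decode(message["data"]))
```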
Translating the agent’s voice
The receive_service_responses task processes the agent’s output before sending it to the user. This output includes audio and text transcription.
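Before looking at the two payload types, it helps to see the loop they live in. The outline below is an assumption about the surrounding structure: an event stream (for example, from ADK’s runner.run_live(...)) whose events carry content with one or more parts, each inspected in turn.

```python
# Hypothetical outline of receive_service_responses; `live_events` is
# assumed to be the async event stream from runner.run_live(...).
async def receive_service_responses(live_events, websocket):
    async for event in live_events:
        if event.content and event.content.parts:
            for part in event.content.parts:
                # Audio and text handling for each part, shown below.
                ...
```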
Handling audio
Audio is handled using Base64 encoding to convert binary data into a text string for transmission.
```python
# --- Inside receive_service_responses ---
import base64
import json

# Handling an audio response.
if hasattr(part, "inline_data") and part.inline_data:
    # Encode the raw audio bytes into a Base64 text string.
    b64_audio = base64.b64encode(part.inline_data.data).decode("utf-8")
    # Package it in a JSON message, typed as "audio".
    await websocket.send(json.dumps({"type": "audio", "data": b64_audio}))
```
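On the receiving end, the same message decodes back into raw bytes. The sketch below is a stand-alone Python test client rather than the guide’s browser code; the ws://localhost:8765 endpoint and the websockets library are assumptions for illustration.

```python
# A hypothetical test client for the messages above. The endpoint URL
# is an assumption; point it at wherever your server listens.
import asyncio
import base64
import json

import websockets  # pip install websockets

async def listen():
    async with websockets.connect("ws://localhost:8765") as ws:
        async for raw in ws:
            message = json.loads(raw)
            if message["type"] == "audio":
                pcm_bytes = base64.b64decode(message["data"])
                print(f"received {len(pcm_bytes)} bytes of audio")
            elif message["type"] == "text":
                print("transcript:", message["data"])

asyncio.run(listen())
```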
Handling text
Text transcription is streamed for real-time feedback.
```python
# --- Inside receive_service_responses ---
# Handling a text response.
if hasattr(part, "text") and part.text:
    # Check whether the text is a streaming, partial thought.
    event_str = str(event)
    if "partial=True" in event_str:
        # Send it for real-time display on the client.
        await websocket.send(json.dumps({"type": "text", "data": part.text}))
```
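A natural companion to the partial check is telling the client when a turn ends, so the UI can finalize the streamed transcript. This is a hedged extension: the turn_complete flag is borrowed from the Gemini Live API and may surface differently on ADK events.

```python
# Hypothetical extension: signal the end of the model's turn so the
# client can commit the transcript (`turn_complete` is an assumption).
if getattr(event, "turn_complete", False):
    await websocket.send(json.dumps({"type": "turn_complete"}))
```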