GCP – How to build a real-time voice agent with Gemini, Google ADK, and A2A protocol
Building advanced conversational AI has moved well beyond text.
Now, we can use AI to create real-time, voice-driven agents. These systems, however, demand low-latency, two-way communication, real-time information retrieval, and the ability to handle complex tasks. This guide shows you how to build one with Gemini and the Google Agent Development Kit (ADK), taking you step by step from a bare persona to an intelligent, responsive voice agent.
The foundational agent
First, we create an agent with a persona but no access to external tools. This is the simplest agent, relying only on its pre-trained knowledge. It’s a great starting point.
```python
# In app/server/streaming_service.py
from google.adk.agents import Agent
from core_utils import MODEL, SYSTEM_INSTRUCTION

self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    # The 'tools' list is omitted for now.
)
```
This agent can chat, but it lacks access to external information.
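Before adding tools, you can sanity-check this bare agent outside the streaming service. The snippet below is a minimal smoke test, not part of the guide’s code: it assumes google-adk’s Runner and InMemorySessionService APIs, whose exact signatures vary between releases, so verify them against your installed version.

```python
# A hypothetical smoke test for the tool-less agent. The Runner and
# InMemorySessionService usage is an assumption; check your ADK version.
import asyncio

from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from core_utils import MODEL, SYSTEM_INSTRUCTION

async def main():
    agent = Agent(
        name="voice_assistant_agent",
        model=MODEL,
        instruction=SYSTEM_INSTRUCTION,
    )
    sessions = InMemorySessionService()
    await sessions.create_session(app_name="voice_app", user_id="u1", session_id="s1")
    runner = Runner(agent=agent, app_name="voice_app", session_service=sessions)
    message = types.Content(role="user", parts=[types.Part(text="Introduce yourself.")])
    # run_async yields events as the agent produces them.
    async for event in runner.run_async(user_id="u1", session_id="s1", new_message=message):
        if event.content and event.content.parts and event.content.parts[0].text:
            print(event.content.parts[0].text)

asyncio.run(main())
```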
The advanced agent
To make the agent useful, we add tools. This lets the agent access live data and services. In streaming_service.py, we give the agent access to Google Search and Google Maps.
```python
# In app/server/streaming_service.py
import os

from google.adk.tools import GoogleSearch, MCPToolset
from google.adk.tools.mcp_tool.mcp_toolset import StdioServerParameters
from core_utils import MODEL, SYSTEM_INSTRUCTION

Maps_api_key = os.environ.get("Maps_API_KEY")

self.agent = Agent(
    name="voice_assistant_agent",
    model=MODEL,
    instruction=SYSTEM_INSTRUCTION,
    tools=[
        GoogleSearch,
        MCPToolset(
            connection_params=StdioServerParameters(
                command="npx",
                args=["-y", "@modelcontextprotocol/server-google-maps"],
                env={"Maps_API_KEY": Maps_api_key},
            ),
        ),
    ],
)
```
A closer look at the tools
- Google Search: This pre-built ADK tool lets your agent perform Google searches to answer questions about current events and real-time information.
- MCP Toolset for Google Maps: This uses the Model Context Protocol (MCP) to connect your agent to a specialized server, in this case one that understands the Google Maps API. The main agent acts as an orchestrator, delegating tasks it can’t handle to specialist tools; see the sketch after this list.
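To make that delegation concrete, the sketch below spawns the same Maps MCP server the toolset uses and prints the tools it exposes. Treat it as an illustration only: get_tools() and close() follow ADK’s toolset interface as I understand it, and both the method names and the printed tool names may differ across ADK and MCP server versions.

```python
# A hedged sketch: inspect which tools the Google Maps MCP server
# exposes. get_tools()/close() are assumptions from ADK's toolset
# interface; verify against your installed versions.
import asyncio
import os

from google.adk.tools import MCPToolset
from google.adk.tools.mcp_tool.mcp_toolset import StdioServerParameters

async def list_maps_tools():
    toolset = MCPToolset(
        connection_params=StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-google-maps"],
            env={"Maps_API_KEY": os.environ.get("Maps_API_KEY", "")},
        ),
    )
    for tool in await toolset.get_tools():
        print(tool.name)  # e.g. maps_geocode, maps_search_places, ...
    await toolset.close()

asyncio.run(list_maps_tools())
```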
Engineering a natural conversation
The RunConfig object defines how the agent communicates. It controls aspects like voice selection and streaming mode.
```python
# In app/server/streaming_service.py (inside the handle_stream method)
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types
from core_utils import VOICE_NAME

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name=VOICE_NAME
            )
        )
    ),
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
    input_audio_transcription=types.AudioTranscriptionConfig(),
)
```
StreamingMode.BIDI (bi-directional) enables users to interrupt the agent, creating a more natural conversation.
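Barge-in only feels natural if the client also stops playback the moment the model is cut off. Here is a minimal sketch of the server side, assuming the ADK event carries an interrupted flag (mirroring the Gemini Live API’s server_content.interrupted signal):

```python
# Inside the response-handling task shown later (hypothetical; the
# `interrupted` attribute is an assumption based on the Live API).
import json

if getattr(event, "interrupted", False):
    # Tell the browser to flush its audio playback buffer immediately.
    await websocket.send(json.dumps({"type": "interrupted"}))
```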
The asynchronous core
Real-time voice chats require handling multiple tasks concurrently: listening, thinking, and speaking. Python’s asyncio and TaskGroup manage these parallel tasks.
```python
# In app/server/streaming_service.py (inside the handle_stream method)
import asyncio

async with asyncio.TaskGroup() as tg:
    # Task 1: Listens for audio from the user's browser.
    tg.create_task(receive_client_messages(), name="ClientMessageReceiver")
    # Task 2: Forwards audio to the Gemini service.
    tg.create_task(send_audio_to_service(), name="AudioSender")
    # Task 3: Listens for responses from Gemini.
    tg.create_task(receive_service_responses(), name="ServiceResponseReceiver")
```
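The guide doesn’t show the task bodies, but a hypothetical version of the first one clarifies the data flow: it parses each WebSocket frame from the browser and queues decoded audio for the sender task. The message shape mirrors the server-to-client messages shown later and is assumed here for the client-to-server direction.

```python
# Hypothetical body for Task 1, assuming the browser sends JSON frames
# shaped like {"type": "audio", "data": "<Base64-encoded PCM>"}.
import asyncio
import base64
import json

audio_queue: asyncio.Queue = asyncio.Queue()

async def receive_client_messages():
    async for raw in websocket:  # one WebSocket frame at a time
        message = json.loads(raw)
        if message.get("type") == "audio":
            # Decode and hand off to send_audio_to_service via the queue.
            await audio_queue.put(base64.b64decode(message["data"]))
```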
Translating the agent’s voice
The receive_service_responses task processes the agent’s output before sending it to the user. This output includes audio and text transcription.
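Before looking at the two payload types, it helps to see the loop they live in. The outline below is an assumption about the surrounding structure: an event stream (for example, from ADK’s runner.run_live(...)) whose events carry content with one or more parts, each inspected in turn.

```python
# Hypothetical outline of receive_service_responses; `live_events` is
# assumed to be the async event stream from runner.run_live(...).
async def receive_service_responses(live_events, websocket):
    async for event in live_events:
        if event.content and event.content.parts:
            for part in event.content.parts:
                # Audio and text handling for each part, shown below.
                ...
```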
Handling audio
Audio is handled using Base64 encoding to convert binary data into a text string for transmission.
```python
# --- Inside receive_service_responses ---
import base64
import json

# Handling an audio response.
if hasattr(part, "inline_data") and part.inline_data:
    # Encode the raw audio bytes into a Base64 text string.
    b64_audio = base64.b64encode(part.inline_data.data).decode("utf-8")
    # Package it in a JSON message, typed as "audio".
    await websocket.send(json.dumps({"type": "audio", "data": b64_audio}))
```
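On the receiving end, the same message decodes back into raw bytes. The sketch below is a stand-alone Python test client rather than the guide’s browser code; the ws://localhost:8765 endpoint and the websockets library are assumptions for illustration.

```python
# A hypothetical test client for the messages above. The endpoint URL
# is an assumption; point it at wherever your server listens.
import asyncio
import base64
import json

import websockets  # pip install websockets

async def listen():
    async with websockets.connect("ws://localhost:8765") as ws:
        async for raw in ws:
            message = json.loads(raw)
            if message["type"] == "audio":
                pcm_bytes = base64.b64decode(message["data"])
                print(f"received {len(pcm_bytes)} bytes of audio")
            elif message["type"] == "text":
                print("transcript:", message["data"])

asyncio.run(listen())
```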
Handling text
Text transcription is streamed for real-time feedback.
```python
# --- Inside receive_service_responses ---
# Handling a text response.
if hasattr(part, "text") and part.text:
    # Check whether the text is a streaming, partial thought.
    event_str = str(event)
    if "partial=True" in event_str:
        # Send it for real-time display on the client.
        await websocket.send(json.dumps({"type": "text", "data": part.text}))
```
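A natural companion to the partial check is telling the client when a turn ends, so the UI can finalize the streamed transcript. This is a hedged extension: the turn_complete flag is borrowed from the Gemini Live API and may surface differently on ADK events.

```python
# Hypothetical extension: signal the end of the model's turn so the
# client can commit the transcript (`turn_complete` is an assumption).
if getattr(event, "turn_complete", False):
    await websocket.send(json.dumps({"type": "turn_complete"}))
```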