GCP – Build and refine your audio generation end-to-end with Gemini 1.5 Pro
Generative AI is giving people new ways to experience audio content, from podcasts to audio summaries. For example, users are embracing NotebookLM’s recent Audio Overview feature, which turns documents into audio conversations. With one click, two AI hosts start up a lively “deep dive” discussion based on the sources you provide. They summarize your material, make connections between topics, and discuss back and forth.
While NotebookLM offers incredible benefits for making sense of complex information, some users want more control over generating unique audio experiences, such as creating their own podcasts. Podcasts are an increasingly popular medium for creators, business leaders, and users to listen to what interests them. Today, we’ll share how Gemini 1.5 Pro and the Text-to-Speech API on Google Cloud can help you create conversations with diverse voices and generate podcast scripts with custom prompts.
The approach: Expand your reach with diverse audio formats
A great podcast starts with accessible audio content. Gemini’s multimodal capabilities pair with our high-fidelity Text-to-Speech API, which offers 380+ voices across 50+ languages and custom voice creation. This unlocks new ways for users to experience content and expand their reach through diverse audio formats.
This approach also helps content creators reach a wider audience and streamline the content creation process, including:
- Expanded reach: Connect with an audience segment that prefers audio content.
- Increased engagement: Foster deeper connections with listeners through personalized audio.
- Content repurposing: Maximize the value of existing written content by transforming it into a new format, reaching a wider audience without starting from scratch.
Let’s take a look at how.
The architecture: Gemini 1.5 Pro and Text-to-Speech
Our audio overview creation architecture uses two powerful services from Google Cloud:
- Gemini 1.5 Pro: This advanced generative AI model excels at understanding and generating human-like text. We’ll use Gemini 1.5 Pro to:
  - Generate engaging scripts: Feed your podcast content overview to Gemini 1.5 Pro, and it can generate compelling conversational scripts, complete with introductions, transitions, and calls to action.
  - Adapt content for audio: Gemini 1.5 Pro can optimize written content for the audio format, ensuring a natural flow and engaging listening experience. It can also adjust the tone and style to suit any format such as podcasts.
- Text-to-Speech API: This API converts text into natural-sounding speech, giving a voice to your scripts. You can choose from various voices and languages to match your brand and target audience.
How to create an engaging podcast yourself, step-by-step
- Content preparation: Prepare the source content for your podcast, such as a blog post. Ensure it’s well-structured and edited for clarity. Consider dividing longer posts into multiple episodes for optimal listening duration.
- Gemini 1.5 Pro integration: Use Gemini 1.5 Pro to generate a conversational script from that content. Experiment with prompts to fine-tune the output until it has the desired style and tone. Example prompt: “Generate an engaging audio overview script from this blog post, including an introduction, transitions, and a call to action. Target audience is technical developers, engineers, and cloud architects.”
- Section extraction: For complex or lengthy posts, you might use Gemini 1.5 Pro to extract key sections and subsections as JSON, enabling a more structured approach to script generation.
A Python function that powers this step of the podcast creation process can look as simple as the one below:
```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# NOTE: generation_config and safety_settings are assumed to be defined
# elsewhere in your module.


def extract_sections_and_subsections(document1: Part, project="<your-project-id>", location="us-central1") -> str:
    """
    Extracts hierarchical sections and subsections from a Google Cloud blog post
    provided as a PDF document.

    This function uses the Gemini 1.5 Pro language model to analyze the structure
    of a blog post and identify its key sections and subsections. The extracted
    information is returned in JSON format for easy parsing and use in
    various applications.

    This is particularly useful for:

    * **Large documents:** Breaking down content into manageable chunks for
      efficient processing and analysis.
    * **Podcast creation:** Generating multi-episode series where each episode
      focuses on a specific section of the blog post.

    Args:
        document1 (Part): A Part object representing the PDF document,
            typically obtained using `Part.from_uri()`.
            For example:

                document1 = Part.from_uri(
                    mime_type="application/pdf",
                    uri="gs://your-bucket/your-pdf.pdf"
                )

        location: The region of your Google Cloud project. Defaults to "us-central1".
        project: The ID of your Google Cloud project. Defaults to "<your-project-id>".

    Returns:
        str: A JSON string representing the extracted sections and subsections.
            Returns an empty string if there are issues with processing or
            the model output.
    """
    vertexai.init(project=project, location=location)  # Initialize Vertex AI
    model = GenerativeModel("gemini-1.5-pro-002")

    prompt = """Analyze the following blog post and extract its sections and subsections.
    Represent this information in JSON format using the following structure:
    [
      {
        "section": "Section Title",
        "subsections": [
          "Subsection 1",
          "Subsection 2",
          // ...
        ]
      },
      // ... more sections
    ]"""

    try:
        responses = model.generate_content(
            ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
            generation_config=generation_config,
            safety_settings=safety_settings,
            stream=True,  # Stream results for better performance with large documents
        )

        # Concatenate the streamed chunks into a single string
        response_text = ""
        for response in responses:
            response_text += response.text

        return response_text

    except Exception as e:
        print(f"Error during section extraction: {e}")
        return ""
```
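The function above returns the model output as a raw string, so it still has to be parsed before you can loop over sections. The following is a minimal sketch of that step, not part of the original pipeline: the helper names `parse_model_json` and `iter_section_pairs` are hypothetical, and the fence-stripping logic assumes the model may sometimes wrap its JSON in a Markdown code fence.

```python
import json

# Markdown code-fence marker that a model may wrap around its JSON (assumption).
FENCE = "`" * 3


def parse_model_json(raw: str):
    """Parse JSON from model output, tolerating a surrounding Markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (the marker plus an optional "json" tag).
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return json.loads(text)


def iter_section_pairs(sections):
    """Yield (section, subsection) pairs from the parsed section list.

    A section without subsections yields a single pair with an empty
    subsection, so it is still covered by one script (a design assumption).
    """
    for entry in sections:
        title = entry.get("section", "")
        subsections = entry.get("subsections") or [""]
        for subsection in subsections:
            yield title, subsection
```

`json.loads` raises a `ValueError` on malformed output, so callers may want to catch it and retry the extraction request.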
Then, use Gemini 1.5 Pro to generate the podcast script for each section. Again, provide clear instructions in your prompts, specifying target audience, desired tone, and approximate episode length.
For each section and subsection, you can use a function like the one below to generate a script:
```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

# NOTE: generation_config and safety_settings are assumed to be defined
# elsewhere in your module.


def generate_podcast_content(section, subsection, document1: Part, targetaudience, guestname, hostname, project="<your-project-id>", location="us-central1") -> str:
    """Generates a podcast dialogue in JSON format from a blog post subsection.

    This function uses the Gemini model in Vertex AI to create a conversation
    between a host and a guest, covering the specified subsection content. It uses
    a provided PDF as source material and outputs the dialogue in JSON.

    Args:
        section: The blog post's main section (e.g., "Introduction").
        subsection: The specific subsection (e.g., "Benefits of Gemini 1.5").
        document1: A `Part` object representing the source PDF (created using
            `Part.from_uri(mime_type="application/pdf", uri="gs://your-bucket/your-pdf.pdf")`).
        targetaudience: The intended audience for the podcast.
        guestname: The name of the podcast guest.
        hostname: The name of the podcast host.
        project: Your Google Cloud project ID.
        location: Your Google Cloud project location.

    Returns:
        A JSON string representing the generated podcast dialogue.
    """
    print(f"Processing section: {section} and subsection: {subsection}")

    # Double braces ({{ }}) are literal braces inside the f-string.
    prompt = f"""Create a podcast dialogue in JSON format based on a provided subsection of a Google Cloud blog post (found in the attached PDF).
    The dialogue should be a lively back-and-forth between a host (R) and a guest (S), presented as a series of turns.
    The host should guide the conversation by asking questions, while the guest provides informative and accessible answers.
    The script must fully cover all points within the given subsection.
    Use clear explanations and relatable analogies.
    Maintain a consistently positive and enthusiastic tone (e.g., "Movies, I love them. They're like time machines...").
    Include only one introductory host greeting (e.g., "Welcome to our next episode...").
    No music, sound effects, or production directions.

    JSON structure:
    {{
      "multiSpeakerMarkup": {{
        "turns": [
          {{"text": "Podcast script content here...", "speaker": "R"}}, // R for host, S for guest
          // ... more turns
        ]
      }}
    }}

    Input Data:
    Section: "{section}"
    Subsections to cover in the podcast: "{subsection}"
    Target Audience: "{targetaudience}"
    Guest name: "{guestname}"
    Host name: "{hostname}"
    """

    vertexai.init(project=project, location=location)
    model = GenerativeModel("gemini-1.5-pro-002")

    responses = model.generate_content(
        ["""The pdf file contains a Google Cloud blog post required for podcast-style analysis:""", document1, prompt],
        generation_config=generation_config,  # Assuming these are defined already
        safety_settings=safety_settings,  # Assuming these are defined already
        stream=True,
    )

    # Concatenate the streamed chunks into a single string
    response_text = ""
    for response in responses:
        response_text += response.text

    return response_text
```
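Because the dialogue is model-generated, it can help to check its shape before sending it on for speech synthesis. The sketch below is one possible check, assuming the output has already been parsed into a dictionary with `json.loads`; the function name `validate_dialogue` is hypothetical, and the allowed speaker labels simply mirror the R and S labels requested in the prompt above.

```python
# Speaker labels requested in the prompt: R for host, S for guest.
ALLOWED_SPEAKERS = {"R", "S"}


def validate_dialogue(dialogue: dict) -> list:
    """Return the list of turns, or raise ValueError if the structure is off."""
    try:
        turns = dialogue["multiSpeakerMarkup"]["turns"]
    except (KeyError, TypeError) as exc:
        raise ValueError("missing multiSpeakerMarkup.turns") from exc

    if not isinstance(turns, list) or not turns:
        raise ValueError("turns must be a non-empty list")

    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            raise ValueError(f"turn {index}: expected an object")
        speaker = turn.get("speaker")
        text = turn.get("text")
        if speaker not in ALLOWED_SPEAKERS:
            raise ValueError(f"turn {index}: unexpected speaker {speaker!r}")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"turn {index}: empty text")

    return turns
```

Failing early with a clear message is usually cheaper than debugging a rejected synthesis request, and a failed check is a natural point to regenerate the script for that subsection.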
Next, feed the script generated by Gemini to the Text-to-Speech API. Choose a voice and language appropriate for your target audience and content.
A function like the one below can generate human-quality audio from that script using the Text-to-Speech API on Google Cloud.
```python
from googleapiclient.discovery import build


def generate_audio_from_text(input_json):
    """Generates audio using Google Text-to-Speech API.

    Args:
        input_json: A dictionary containing the 'multiSpeakerMarkup' for the TTS API,
            parsed from the JSON generated by the Gemini 1.5 Pro model in the
            generate_podcast_content() function.

    Returns:
        The MP3 audio content as a base64-encoded string (as returned by the
        REST API) if successful, None otherwise.
    """
    try:
        # Build the Text-to-Speech service
        service = build('texttospeech', 'v1beta1')

        # Prepare synthesis input
        synthesis_input = {
            'multiSpeakerMarkup': input_json['multiSpeakerMarkup']
        }

        # Configure voice and audio settings
        voice = {
            'languageCode': 'en-US',
            'name': 'en-US-Studio-MultiSpeaker'
        }

        audio_config = {
            'audioEncoding': 'MP3',
            'pitch': 0,
            'speakingRate': 0,
            'effectsProfileId': ['small-bluetooth-speaker-class-device']
        }

        # Make the API request
        response = service.text().synthesize(
            body={
                'input': synthesis_input,
                'voice': voice,
                'audioConfig': audio_config
            }
        ).execute()

        # Extract and return audio content
        audio_content = response['audioContent']
        return audio_content

    except Exception as e:
        print(f"Error during audio generation: {e}")
        return None
```
Finally, to store audio content already encoded as base64 MP3 data in Google Cloud Storage, you can use the google-cloud-storage Python library. This allows you to decode the base64 string and upload the resulting bytes directly to a designated bucket, specifying the content type as ‘audio/mp3’.
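The storage step described above might look like the following sketch. The helper names `decode_audio_content` and `upload_audio_to_gcs` are hypothetical, as are any bucket and object names you pass in; the client is assumed to pick up Application Default Credentials from the environment.

```python
import base64


def decode_audio_content(audio_content_b64: str) -> bytes:
    """Decode the base64 string found in the 'audioContent' response field."""
    return base64.b64decode(audio_content_b64)


def upload_audio_to_gcs(audio_content_b64: str, bucket_name: str, blob_name: str,
                        content_type: str = "audio/mp3") -> str:
    """Decode base64 MP3 data and upload the bytes to a Cloud Storage bucket.

    Returns the gs:// URI of the uploaded object.
    """
    # Imported here so the decoding helper works without the library installed.
    from google.cloud import storage

    client = storage.Client()  # Assumes Application Default Credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(
        decode_audio_content(audio_content_b64),
        content_type=content_type,
    )
    return f"gs://{bucket_name}/{blob_name}"
```

Keeping the decoding in its own function makes it easy to test locally, or to write the bytes to a local file instead of a bucket while experimenting.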
Hear it for yourself
While the Text-to-Speech API produces high-quality audio, you can further enhance your audio conversation with background music, sound effects, and professional editing tools. Hear it for yourself: download the audio conversation I created from this blog using Gemini 1.5 Pro and the Text-to-Speech API.
To start creating your own, explore the audio generation features of Google Cloud services such as the Text-to-Speech API and Gemini models, which you can try using the free tier. We recommend experimenting with different modalities like text and image prompts to experience Gemini’s potential for content creation.