How to Use Text to Speech on SpeechGen.io: Complete Guide

16-09-2025

🚀 Quick Start — Create Audio in 4 Steps

Step 1: Select Language

Open the language dropdown and select the language of your text. Over 150 languages are supported (see the AI voices library).

Step 2: Choose Voice

After selecting the language, a list of voices will appear. Listen to the samples and choose your favorite.

Step 3: Paste Text

Copy your text into the text box or upload a file (DOCX, PDF). For converting subtitles to speech, use the dedicated SRT to voice page.

Step 4: Click "Generate Speech" (blue button)

Wait for processing, then download the finished audio file.

That's it! Your first voiceover is ready in just a couple of minutes.

Text Preparation

Avoid:

Emojis and emoticons (may disrupt audio generation)
Exotic symbols: ✓, ★, ♦, ►, ♪, ©, ™, ®, ∞, •, ◦, ▪, ▫
Special Unicode symbols

💡 Tip: Text copied from PDF files may contain invisible characters that ruin the audio, so check it carefully before generating.
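The cleanup described above can be done before pasting the text. The following is a minimal sketch, not a SpeechGen tool: the function name clean_for_tts is hypothetical, and it assumes that dropping Unicode control, format, and "other symbol" characters is an acceptable way to remove invisible characters and most emojis.

```python
import unicodedata

# Symbols the guide lists as risky, used here as an explicit blocklist.
EXOTIC_SYMBOLS = set("✓★♦►♪©™®∞•◦▪▫")


def clean_for_tts(text: str) -> str:
    """Drop emojis, listed exotic symbols, and invisible characters (hypothetical helper)."""
    kept = []
    for ch in text:
        # Keep line breaks and tabs even though they are control characters.
        if ch in "\n\t":
            kept.append(ch)
            continue
        category = unicodedata.category(ch)
        # Cc = control, Cf = format (zero-width space, soft hyphen, BOM),
        # So = "other symbol", which covers most emojis and pictographs.
        if category in ("Cc", "Cf", "So"):
            continue
        if ch in EXOTIC_SYMBOLS:
            continue
        kept.append(ch)
    return "".join(kept)
```

This filter is deliberately coarse. Review the result before generating, because it can leave double spaces where a symbol was removed and it does not catch every emoji modifier.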

Limits and Restrictions

Supported languages: 150+ languages (full list).
Upload formats: plain text, DOCX, PDF, SRT.

Maximum per generation: 2,000,000 characters (≈ 285,000-330,000 words). This is enough to convert long-form content, such as an entire book or extensive documentation, in a single generation.
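For texts that exceed the per-generation limit, one option is to split them into several generations yourself. The sketch below is only an illustration: split_for_generation is a hypothetical helper, and it assumes that paragraphs are separated by blank lines and that the limit counts characters as Python's len() does.

```python
MAX_CHARS = 2_000_000  # per-generation limit stated in this guide


def split_for_generation(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring paragraph breaks."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= limit:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than the limit is cut at the limit.
        while len(paragraph) > limit:
            chunks.append(paragraph[:limit])
            paragraph = paragraph[limit:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted as a separate generation.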

Detailed Step-by-Step Instructions

Step 1: Upload Text

Paste text: Copy your text into the text box
Upload file: Or click the upload button and select a file (DOCX, PDF)
Check text: Make sure the text displays correctly

Step 2: Select Language

⚠️ Important: First select the correct language for your text

Open the language dropdown list
Find the needed language (150+ languages available)
For multi-language texts, use multi-voice audio generation

Step 3: Choose Voice

After selecting the language, a list of available voices will open. Listen to samples by clicking the play button for each voice to find the one that best suits your needs. You'll see different voice types available: Regular voices offer standard quality, PRO voices provide improved quality and naturalness, and Multi-language voices (marked with language codes like Ava_US, Ava_ES) allow you to maintain voice consistency across different languages. Take time to preview each voice as they vary significantly in tone, emotion, and character.

Step 4: Configure Parameters

Speech speed: from x0.1 (very slow) to x2.2 (very fast)
Voice pitch: from -20 to +20 (step 2)

Below the text box, above the generate button, you can adjust the pause settings:

Pauses between sentences: 150ms - 30 seconds
Pauses between paragraphs: 150ms - 30 seconds

Step 5: Generate Speech

Click the "Generate Speech" button below the text box to start the conversion process. The processing time depends on your text length - shorter texts complete in seconds while longer documents may take a few minutes. Once generation is complete, you'll be able to listen to the result directly in the browser to ensure it meets your expectations.

Step 6: Download

After generation completes, a "Download" button will appear. By default, you can simply download the file as MP3. However, if you need a different format (WAV or OPUS) or want to change the audio quality (sample rate from 8000 to 44000 Hz), you'll need to first select these options from the dropdown menus, regenerate the speech with your chosen settings, and then download the file with your preferred specifications.

Audio Parameter Settings

Speech Speed

Speed scale:

x0.1 - x0.9: Slowdown (for complex material, language learning)
x1.0: Normal speed (default)
x1.1 - x2.2: Speed up (for dynamic content)

Why this scale: Values below 1 slow speech down and values above 1 speed it up, so you can match the tempo to your audience.

Speed recommendations:

Education: x0.8-x1.0 (for better comprehension)
Presentations: x0.9-x1.1 (official pace)
Podcasts: x1.0-x1.2 (lively pace)
YouTube: x1.1-x1.4 (attention retention)

Voice Pitch

Pitch range: from -20 to +20 with step 2

Why step 2: A step of 2 units gives a noticeable but not abrupt change in pitch. Smaller steps would be hard to hear, and larger steps would be too dramatic.

Pitch influence:

Negative values (-2 to -20): Make voice lower, more serious, authoritative
Positive values (+2 to +20): Make voice higher, friendlier, more energetic
0: Neutral pitch (default)

Applications:

Business content: -4 to +2
Children's content: +4 to +12
Dramatic content: -8 to -16
Friendly content: +2 to +8
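If you keep voice settings in a script or spreadsheet, the ranges above can be checked before you enter them in the interface. This is a minimal sketch under stated assumptions: check_speed and snap_pitch are hypothetical names, and rounding odd pitch values toward 0 is a choice made here, not documented SpeechGen behavior.

```python
SPEED_RANGE = (0.1, 2.2)  # speed multipliers listed in this guide
PITCH_RANGE = (-20, 20)   # pitch units listed in this guide
PITCH_STEP = 2


def check_speed(speed: float) -> float:
    """Return the speed unchanged if it lies inside the documented range."""
    low, high = SPEED_RANGE
    if not low <= speed <= high:
        raise ValueError(f"speed x{speed} is outside x{low}-x{high}")
    return speed


def snap_pitch(pitch: int) -> int:
    """Clamp pitch to the documented range and snap it to a multiple of the step."""
    low, high = PITCH_RANGE
    clamped = max(low, min(high, pitch))
    # int() truncates toward zero, so odd values move toward neutral pitch (0).
    return int(clamped / PITCH_STEP) * PITCH_STEP
```

For example, snap_pitch(5) gives 4 and snap_pitch(-7) gives -6, both of which are selectable values on the pitch scale.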

Working with Pauses

Automatic Pauses

Pauses between sentences: 300ms (default)

Pauses between paragraphs: 400ms (default)

These settings can be changed in dropdown menus from 150ms to 30 seconds.

Manual Pause Insertion

Through interface:

Place cursor at the desired location in text
Click the "Pause" button in menu
The symbol .- will appear in text

Through tags:

Insert tag <break time="200ms"/> or <break time="2s"/> at the desired location

Pause rules:

Maximum pause: 30 seconds
Can place multiple pauses in a row for longer delay
Pauses don't consume additional limits
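The rules above (a 30-second maximum per pause, with several pauses in a row for a longer delay) can be sketched as a small helper. The name break_tags is hypothetical, and the sketch assumes that millisecond values such as "30000ms" are accepted the same way as the "200ms" and "2s" examples.

```python
MAX_PAUSE_MS = 30_000  # maximum single pause stated in this guide


def break_tags(total_ms: int) -> str:
    """Build one or more <break/> tags that together add up to total_ms."""
    if total_ms <= 0:
        raise ValueError("pause length must be positive")
    tags = []
    remaining = total_ms
    while remaining > 0:
        # Never emit a single pause longer than the documented maximum.
        step = min(remaining, MAX_PAUSE_MS)
        tags.append(f'<break time="{step}ms"/>')
        remaining -= step
    return "".join(tags)
```

For a 45-second delay, break_tags(45_000) returns two consecutive tags: one of 30000ms and one of 15000ms.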

When to use pauses:

Before important statements
After rhetorical questions
Between different topics
To create dramatic effect

Multi-Voice Audio

The dialogue function lets you use different voices within one text.

Applications:

Audiobooks: Different voices for characters
Educational dialogues: Teacher and student
Presentations: Main speaker and commentator
Podcasts: Multiple hosts

The multi-voice dialogue feature opens up creative possibilities beyond just character voices. Foreign language teachers, for instance, can use this function to demonstrate the same phrase at multiple speeds for language learning, helping students grasp pronunciation at different comprehension levels. For detailed techniques and classroom applications, see our guide on using text-to-speech for foreign language teaching.

Voice Selection

Multi-language Voices

Voices with language codes (e.g., Ava_US, Ava_ES, Ava_DE) are designed to keep the same recognizable voice across different languages. These multi-language voices let you create a unified style for multilingual content, because the same voice character can speak several languages. This is particularly useful in dialogue mode, where you can switch between languages while keeping one voice personality throughout your audio project.

Audio Segmentation

SpeechGen allows you to split your generated audio into multiple segments within a single synthesis project, making it perfect for video editors who need separate audio files for different scenes or chapters. This feature is particularly useful for creating voiceovers for YouTube videos, online courses, or any project requiring precise audio synchronization.

How to Create Segments

To split your audio, place your cursor where you want to divide the text and click the cut button in the menu panel. This inserts a <cut/> tag at that position. You can also type or paste this tag manually anywhere in your text. For custom filenames, use this format:

<cut name="your-filename"/>

This feature helps you organize segments with meaningful names like:

<cut name="intro"/>

<cut name="chapter-1"/>
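When the text is assembled by a script, the cut tags can be generated rather than typed. The sketch below is illustrative only: cut_tag and join_segments are hypothetical helpers, and turning names into lowercase hyphenated slugs is an assumption based on the example names above, not a documented requirement. Check in the interface which segment a named tag labels before relying on names.

```python
import re
from typing import Optional


def cut_tag(name: Optional[str] = None) -> str:
    """Return a <cut/> tag, optionally with a filename-friendly name."""
    if name is None:
        return "<cut/>"
    # Assumption: keep names to lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f'<cut name="{slug}"/>'


def join_segments(parts: list[str]) -> str:
    """Join text parts with an unnamed cut tag between each pair."""
    return "\n<cut/>\n".join(parts)
```

Joining three parts inserts two cut tags, which divides the text at two positions.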

Downloading and Managing Segments

Once you've added at least one segment tag, a "download segments" button appears after generation. Click it to download all segments at once, or use the "more" button on the audio player to access individual segments. Each file is automatically named with a unique ID, sequence number, and descriptive title (e.g., "7054789_1_first-sentence"), making it easy to identify and organize your audio files in your editing software.
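If you sort or rename downloaded segments in a script, the naming pattern described above can be parsed. This is a minimal sketch that assumes every file follows the ID_sequence_title pattern from the example. The names Segment and parse_segment_name are hypothetical.

```python
import os
from typing import NamedTuple


class Segment(NamedTuple):
    project_id: str
    index: int
    title: str


def parse_segment_name(filename: str) -> Segment:
    """Split a name like '7054789_1_first-sentence' into its three parts."""
    # Drop a file extension such as .mp3 if one is present.
    stem, _extension = os.path.splitext(filename)
    # Split on the first two underscores only, so the title may contain more.
    project_id, index, title = stem.split("_", 2)
    return Segment(project_id, int(index), title)
```

Sorting a list of parsed segments by their index field restores the original order even when the file manager sorts names alphabetically.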

Segment Limitations

Short segments: Up to 1000 segments per generation
Long segments: Up to 500 segments per generation

For larger projects, split into multiple generations. For comprehensive instructions, advanced techniques, and video tutorials, visit our complete audio segmentation documentation.

Intonation Setup

Some voices have intonation graphs. They are available on voices that display a settings icon next to the voice name, which is more than half of the voices in the library, including both regular and PRO options.

Drag points on the graph to change intonation
Raise points to increase pitch on certain words
Lower points to create a more serious tone
Experiment with different curves for naturalness

Select the sentence whose intonation you want to adjust and press the intonation button. The intonation graph will then appear.

Caching System and Limit Savings

Smart Cache

SpeechGen uses an intelligent caching system that significantly reduces how much of your character limit you spend. Each sentence (up to 100,000 characters) is saved in the cache for 7 days. When you regenerate your audio, any unchanged sentences are retrieved from the cache for free, so you only pay for new or edited sentences. This means you can make incremental edits to your text without consuming your entire character allowance each time. Project history is stored for 30 days, and files you add to favorites are kept permanently.

Storage periods:

Sentence cache: 7 days
Project history: 30 days
Favorite files: Stored permanently
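To get a feel for how the sentence cache affects cost, the effect can be modeled roughly. This is a simplified estimate, not SpeechGen's billing logic. It assumes the same voice and settings, an unexpired cache, and a naive sentence split on ".", "!", and "?". The function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence split: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def billable_characters(old_text: str, new_text: str) -> int:
    """Estimate characters charged when new_text is generated after old_text."""
    cached = set(split_sentences(old_text))
    # Only sentences that were not generated before count toward the limit.
    return sum(len(s) for s in split_sentences(new_text) if s not in cached)
```

Under these assumptions, changing one sentence in a long text yields an estimate equal to the length of that sentence alone.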

Troubleshooting Common Issues

Audio Quality Issues

Voice sounds unnatural:

Try PRO voices
Reduce speed to x0.9-x1.1
Check punctuation correctness
Use neutral pitch (0)

Incorrect pronunciation:

Make sure correct language is selected
Write complex words phonetically
Use SSML tags for precise control

Unnatural pauses:

Check punctuation
Configure pauses between sentences
Use manual pauses .- or <break time=""/>
Remove extra spaces and line breaks

SSML errors:

Check tag correctness
Not all voices support all SSML tags

Additional Features

SSML (Speech Synthesis Markup Language)

For expert voice control, use SSML tags:

<break time="2s"/> — pauses
<emphasis level="strong"> — voice emphasis
<prosody rate="slow" pitch="low"> — speech characteristics change

⚠️ Attention: Different voices support different sets of SSML tags. Test functionality for each specific voice.
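A common source of SSML errors is an unclosed tag or a missing quote. The following sketch checks only that a fragment is well-formed XML. It does not know which tags a given voice supports, and the function name ssml_is_well_formed is hypothetical. It also treats a bare "&" as an error, which may be stricter than the SpeechGen editor.

```python
import xml.etree.ElementTree as ET


def ssml_is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        # The <speak> wrapper only provides a single root for parsing.
        ET.fromstring(f"<speak>{fragment}</speak>")
    except ET.ParseError:
        return False
    return True
```

A fragment such as <emphasis level="strong">go without its closing </emphasis> tag fails this check, while the same fragment with the closing tag passes.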

History and Favorites

Project history: Automatically saved for 30 days
Favorites: Add important projects for permanent storage

Integration and API

An API is available for developers who want to integrate SpeechGen.io into their own applications and services.

My file won't upload to SpeechGen. What should I do?

First, check that your file is in a supported format (DOCX, PDF, or TXT). Make sure the file isn't corrupted and try uploading again. If the issue persists, copy the text manually and paste it directly into the text box. Also verify that your file size doesn't exceed the platform limits.

How long does SpeechGen keep my generated audio files?

Your project history is automatically saved for 30 days. The smart cache (for sentence-level savings) lasts 7 days. To keep files permanently, add them to your favorites. This ensures your important audio projects are never lost and remain accessible in your profile.

Can I use different voices for different characters in one audio file?

Yes! SpeechGen offers multi-voice audio generation (dialogue mode). You can assign different voices to different text sections, making it perfect for audiobooks with multiple characters, educational dialogues, or podcasts with multiple speakers. You can even use multi-language voices to switch between languages while maintaining character consistency.

What's the difference between regular and PRO voices in SpeechGen?

PRO voices offer superior quality and naturalness compared to regular voices. They typically have better emotional expression, more accurate pronunciation, and some support advanced features like intonation graphs. For professional projects like audiobooks, courses, or business presentations, PRO voices are recommended.

Does changing audio settings consume my character limits?

It depends on which settings you change. Adjusting speech speed or pitch requires full regeneration and will consume your character limits, as these changes affect the entire voice synthesis. However, you can freely modify pauses between sentences and paragraphs without any limit consumption. Additionally, SpeechGen uses smart caching: if you generate a large text, then edit just one sentence and regenerate, the system will only charge you for that single changed sentence, not the entire text. This caching system saves your unchanged sentences for 7 days, making iterative editing very economical.

Still Have Questions?

Get help from our community! Ask your questions in our Telegram chat: https://t.me/speechgen