SSML Markup Capabilities for Speech Synthesis
09-09-2025
SSML (Speech Synthesis Markup Language) is an XML-based markup language. It is used to annotate text that a speech synthesizer, such as a neural network voice, converts into speech.
- What is its purpose? With SSML, you can control tone, accent, and pronunciation, and add pauses and other audio effects.
- Usage goals: The main goal is to make synthesized speech sound natural and expressive. SSML also helps numbers, dates, phone numbers, and other specific information be pronounced correctly.
- Who created it? SSML was developed by the World Wide Web Consortium (W3C). This organization sets web standards.
- What is its mission? SSML aims to standardize and enhance speech synthesis methods in the digital space.
The SSML specification is available on the official W3C website: https://www.w3.org/TR/speech-synthesis/
Basic Rules for Writing SSML Tags
- SSML tags are enclosed in angle brackets, as in HTML. Example: <speak>text</speak>.
- Most tags need both an opening and a closing tag. Tags with no content, such as <break>, are written in self-closing form: <break time="2s"/>.
- Within tags, you can use attributes to adjust pronunciation settings.
- Some tags can be nested within others.
- SSML tag and attribute syntax follows XML standards.
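Because SSML follows XML syntax, the rules above can be checked mechanically before text is sent for synthesis. The following is a minimal sketch using Python's standard library; the function name is_well_formed is an illustrative assumption and is not part of SpeechGen. It only checks XML well-formedness, not whether a tag is valid SSML or supported by a given voice.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml: str) -> bool:
    """Return True if the string parses as XML, False otherwise.

    Hypothetical helper: it does not know which tags SSML defines,
    it only catches unclosed or badly nested tags.
    """
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


good = '<speak>Hello <break time="1s"/> world</speak>'
bad = '<speak>Hello <emphasis>world</speak>'  # <emphasis> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this catches the most common mistake, a missing closing tag, before any synthesis request is made.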
Supported Tags
SpeechGen supports the most common SSML tags. Some voices may ignore certain tags or attributes. Details are given in the documentation for each tag.
Below is a list of main tags with links to detailed documentation for each.
Break
Break – Pause. This is the most popular tag on SpeechGen. It lets you control the pause duration.
<break time="2s"/>
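When pauses are generated by a script rather than typed by hand, a small helper can keep the time attribute consistent. This is a sketch only: break_tag is a hypothetical name, and the choice to take milliseconds as input is an assumption. The W3C specification allows durations in seconds ("2s") or milliseconds ("250ms"); any maximum pause length depends on the service and is not checked here.

```python
def break_tag(milliseconds: int) -> str:
    """Build a self-closing <break> tag for the given pause length.

    Whole seconds are written as "2s", everything else as "250ms".
    Hypothetical helper for illustration.
    """
    if milliseconds < 0:
        raise ValueError("pause duration must not be negative")
    if milliseconds > 0 and milliseconds % 1000 == 0:
        return f'<break time="{milliseconds // 1000}s"/>'
    return f'<break time="{milliseconds}ms"/>'


print(break_tag(2000))
print(break_tag(250))
```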
Say-as
The say-as tag has the most settings of any SSML tag. Its interpret-as attribute controls how various types of information are pronounced.
Spell-out – Spell the text letter by letter
<say-as interpret-as="spell-out">Ashlee</say-as>
Cardinal
<say-as interpret-as="cardinal">123456789</say-as>
Ordinal
<say-as interpret-as="ordinal">3</say-as>
Fraction
<say-as interpret-as="fraction">3 1/2</say-as>
Date
<say-as interpret-as="date" format="ymd" detail="1">1969.07.21</say-as>
Time
<say-as interpret-as="time" format="hms12">2:30</say-as>
Telephone
My number is <say-as interpret-as="telephone">8883451715</say-as>
Currency
<say-as interpret-as="currency">79.4 USD</say-as>
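All of the say-as examples above share one shape: an interpret-as value, optional extra attributes such as format or detail, and the text to pronounce. The sketch below builds that shape from Python, escaping special characters so the result stays valid XML. The function name say_as and its signature are assumptions made for illustration; which interpret-as values and formats a voice accepts still depends on the service.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, **attrs: str) -> str:
    """Wrap text in a <say-as> tag (hypothetical helper).

    Extra keyword arguments become attributes, e.g. format="ymd".
    Text and attribute values are XML-escaped.
    """
    parts = [f"interpret-as={quoteattr(interpret_as)}"]
    parts += [f"{name}={quoteattr(value)}" for name, value in attrs.items()]
    return f"<say-as {' '.join(parts)}>{escape(text)}</say-as>"


print(say_as("Ashlee", "spell-out"))
print(say_as("1969.07.21", "date", format="ymd", detail="1"))
print(say_as("AT&T", "spell-out"))  # the ampersand is escaped as &amp;
```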
Alias
<sub alias="Doctor">Dr.</sub> Smith.
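If the same abbreviations recur across many texts, the sub tag can be applied from a lookup table instead of by hand. The following is a rough sketch under stated assumptions: the function add_aliases and the sample table are hypothetical, and matching is plain substring matching, so an abbreviation embedded in a longer word would also be replaced.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def add_aliases(text: str, aliases: dict) -> str:
    """Wrap every known abbreviation in <sub alias="...">...</sub>.

    Hypothetical helper. Text outside the matches is XML-escaped.
    """
    if not aliases:
        return escape(text)
    # Try longer abbreviations first so "Dr.-Ing." would win over "Dr."
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    out = []
    pos = 0
    for match in pattern.finditer(text):
        out.append(escape(text[pos:match.start()]))
        alias = quoteattr(aliases[match.group()])
        out.append(f"<sub alias={alias}>{escape(match.group())}</sub>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print(add_aliases("Dr. Smith.", {"Dr.": "Doctor"}))
```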
Prosody
<prosody pitch="x-low">I'm speaking this text with an x-low constant pitch</prosody>
Emphasis
<emphasis level="strong">And today the weather is sunny.</emphasis>
Phoneme
<phoneme alphabet="ipa" ph="haɪˈpɜːrbəli">Hyperbole</phoneme>
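The fragments above can be combined into one document. The sketch below assembles several of them under a single <speak> root with Python's xml.etree.ElementTree, which handles escaping and tag closing automatically. The function name build_document is illustrative, and whether a given service requires the <speak> wrapper is an assumption to verify against its documentation.

```python
import xml.etree.ElementTree as ET


def build_document() -> str:
    """Assemble a small SSML document from tags shown above (illustrative)."""
    speak = ET.Element("speak")
    speak.text = "My number is "

    number = ET.SubElement(speak, "say-as", {"interpret-as": "telephone"})
    number.text = "8883451715"
    number.tail = "."  # text that follows the closing </say-as>

    pause = ET.SubElement(speak, "break", {"time": "2s"})
    pause.tail = "Call "

    name = ET.SubElement(speak, "sub", {"alias": "Doctor"})
    name.text = "Dr."
    name.tail = " Smith."

    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(speak, encoding="unicode")


print(build_document())
```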