Using the <prosody> tag in SSML: A Comprehensive Guide

11-10-2023

The <prosody> tag controls several attributes of synthesized speech: pitch, volume, and rate (speed). This guide explains in detail how to use the <prosody> tag, with examples covering every way of expressing values, including Hertz, semitones, percentages, and relative values.

Definition

The <prosody> tag allows you to control the tone (pitch), speed (rate), and volume of the synthesized speech. This is especially useful when you want to convey certain emotions or emphasis in the synthesized speech.

Pitch changes should stay within 0.5 to 1.5 times the original pitch.

Attention: A <prosody> tag should ideally enclose a complete sentence. Applying it to individual words within a sentence might lead to unintended breaks in the speech flow. It is possible to use <prosody> on part of a sentence, but that part should ideally be the ending portion, not the middle or the beginning.
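One way to respect sentence boundaries is to split the text into sentences first and wrap a whole sentence at a time. The Python sketch below is only an illustration: the helper name wrap_sentence and the naive punctuation-based splitter are assumptions, not part of SSML or of any particular service.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def wrap_sentence(text: str, index: int, attribute: str, value: str) -> str:
    """Wrap the sentence at `index` in a <prosody> tag, leaving the others as is.

    Hypothetical helper: sentences are split naively after ".", "!" or "?".
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    parts = []
    for position, sentence in enumerate(sentences):
        escaped = escape(sentence)  # protect &, < and > inside the spoken text
        if position == index:
            escaped = f"<prosody {attribute}={quoteattr(value)}>{escaped}</prosody>"
        parts.append(escaped)
    return " ".join(parts)


print(wrap_sentence("Hello there. This part matters! Goodbye.", 1, "rate", "slow"))
```

Because the tag always surrounds a full sentence, the synthesizer never has to change prosody in the middle of a phrase.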

Pitch Attribute

The pitch attribute adjusts the tonality of the speech. Here's how you can express the pitch:

Semitones

Relative value in semitones (st): A relative value can be either a change up (+) or down (-) from the current pitch, expressed in semitones (st). For example:

<prosody pitch="+6st">The text will be spoken 6 semitones higher</prosody>
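To get a feel for how large a semitone shift is, it can help to convert it into a frequency ratio. The sketch below uses the standard equal-temperament relation, in which 12 semitones double the frequency. Whether a given voice engine applies exactly this relation is an assumption, and the function name is hypothetical.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


# +6st raises the pitch by a factor of roughly 1.41; -12st halves it.
print(round(semitones_to_ratio(6), 3))
print(round(semitones_to_ratio(-12), 3))
```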

 
 

Percentage value

This is a relative change expressed as a percentage, with "+" increasing the pitch and "-" decreasing the pitch. For example:

<prosody pitch="-40%">The text will be spoken 40% lower</prosody>

 
 

The advantage of relative values is that you can set any value, even a very extreme one. Keep in mind, though, that the effect may vary between voices.
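As a rough sanity check, a signed percentage can be turned into a multiplier and compared with the 0.5 to 1.5 range mentioned earlier. This Python sketch assumes that a percentage acts as a simple linear multiplier on the current pitch, which may not hold for every voice. Both function names are hypothetical.

```python
def percent_to_factor(value: str) -> float:
    """Turn a signed percentage such as "-40%" into a multiplier such as 0.6."""
    if not value or value[0] not in "+-" or not value.endswith("%"):
        raise ValueError(f"expected a signed percentage, got {value!r}")
    return 1.0 + float(value[:-1]) / 100.0


def within_recommended_range(factor: float, low: float = 0.5, high: float = 1.5) -> bool:
    """Check a multiplier against the 0.5 to 1.5 range given in this guide."""
    return low <= factor <= high


factor = percent_to_factor("-40%")
print(round(factor, 2), within_recommended_range(factor))
```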

Constant values

You can also use one of these predefined constant values:

  • x-low,
  • low,
  • medium,
  • high,
  • x-high,
  • default.

Let's take a look at some examples.

<prosody pitch="x-low">I'm speaking this text with an x-low constant pitch</prosody>

 
 

<prosody pitch="x-high">I'm speaking this text with an x-high constant pitch</prosody>

 
 

Hertz

Some voices additionally support setting the pitch in Hertz.

Relative value in Hertz (Hz): A change up (+) or down (-) from the current pitch, expressed in Hertz (Hz). For example:

<prosody pitch="-40Hz">The text will be spoken 40Hz lower.</prosody>
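Since pitch accepts several notations, a small check before sending text to a synthesizer can catch typos early. The sketch below recognizes only the forms described in this guide (constants, signed semitones, signed percentages, signed Hertz). The function name and the exact patterns are assumptions, and a particular voice may accept more or fewer forms.

```python
import re

PITCH_CONSTANTS = {"x-low", "low", "medium", "high", "x-high", "default"}
PITCH_PATTERNS = {
    "semitones": re.compile(r"[+-]\d+(\.\d+)?st"),
    "percent": re.compile(r"[+-]\d+(\.\d+)?%"),
    "hertz": re.compile(r"[+-]\d+(\.\d+)?Hz"),
}


def classify_pitch(value: str) -> str:
    """Return the notation used by a pitch value, or raise ValueError."""
    if value in PITCH_CONSTANTS:
        return "constant"
    for kind, pattern in PITCH_PATTERNS.items():
        if pattern.fullmatch(value):
            return kind
    raise ValueError(f"unrecognized pitch value: {value!r}")


for example in ("+6st", "-40%", "-40Hz", "x-high"):
    print(example, "->", classify_pitch(example))
```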

 
 

Rate Attribute

The rate attribute controls the speed at which the text is spoken. Here's how you can express the rate:

A relative value

Percentage value: This is a relative change expressed as a percentage, with "+" increasing the rate and "-" decreasing the rate. For example:

<prosody rate="-30%">This text will be spoken at a rate 30% slower than the current rate.</prosody>

 
 

Example of speeding up speech:

<prosody rate="+70%">This text will be spoken at a rate 70% faster than the current rate</prosody>

 
 

A non-negative percentage

Not all voices support signed relative percentages, and some accept only a percentage without a plus or minus sign. Such a non-negative percentage sets the speaking rate relative to normal speed: 100% means no change, 200% means the speech is twice as fast, and 50% means it is half as fast. The value must be between 20% and 200%.

Here's an example with a rate of 50%, which is half the normal speed, so the speech takes twice as long as usual.

<prosody rate="50%">This text will be spoken at a rate 50% slower than the current rate</prosody>

 
 

Here's an example where the rate is 150%. Since 100% is the normal speed, the speech is 50% faster.

<prosody rate="150%">This text will be spoken at 150% of the normal rate, which is 50% faster</prosody>
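The two percentage styles are easy to mix up, so here is a small Python sketch that converts a signed relative rate into the unsigned form and estimates how the duration changes. It assumes that the 20% to 200% limit applies after conversion and that duration scales inversely with rate. Both are simplifications, and the function names are hypothetical.

```python
def relative_to_absolute_rate(value: str) -> str:
    """Convert a signed rate such as "+70%" into the unsigned form "170%"."""
    if not value or value[0] not in "+-" or not value.endswith("%"):
        raise ValueError(f"expected a signed percentage, got {value!r}")
    absolute = 100.0 + float(value[:-1])
    if not 20.0 <= absolute <= 200.0:
        raise ValueError(f"{absolute:g}% is outside the 20% to 200% range")
    return f"{absolute:g}%"


def duration_factor(absolute_rate: str) -> float:
    """Approximate change in speaking time: "50%" -> 2.0 (twice as long)."""
    return 100.0 / float(absolute_rate.rstrip("%"))


print(relative_to_absolute_rate("+70%"), relative_to_absolute_rate("-30%"))
print(duration_factor("50%"), round(duration_factor("150%"), 2))
```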

 
 

A constant value

You can also use one of these predefined constant values: x-slow, slow, medium, fast, x-fast, or default.

<prosody rate="slow">I speak slowly and with expression, wow!</prosody>

 
 

Now let's make it speak quickly using x-fast.

<prosody rate="x-fast">I speak faster and with expression, wow! </prosody>

 
 

Named constants for the rate are handy when you need a simple, quick result.

Volume Attribute

The volume attribute controls the loudness of the speech. Here's how you can express the volume.

Values in decibels

You can set the value in decibels (dB) with a plus or minus sign.

Normal text. <prosody volume="-15dB">This text will be spoken at a volume 15dB lower than the current volume.</prosody> <prosody volume="+10dB">This text is 10 decibels higher.</prosody>
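Decibels are logarithmic, so the numbers are not as intuitive as percentages. By the usual convention for amplitude, a change of d decibels corresponds to a ratio of 10 to the power d/20, which makes about +6dB a doubling. The Python sketch below applies that convention. How a specific voice engine maps decibels to loudness is an assumption, and the function name is hypothetical.

```python
def db_to_amplitude_ratio(decibels: float) -> float:
    """Amplitude ratio for a change in decibels (20 dB is a factor of 10)."""
    return 10.0 ** (decibels / 20.0)


# -15dB is a bit under one fifth of the amplitude; +10dB is roughly 3.16 times.
print(round(db_to_amplitude_ratio(-15), 3))
print(round(db_to_amplitude_ratio(10), 3))
```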


Constants

  • silent,
  • x-soft,
  • soft,
  • medium,
  • loud,
  • x-loud,
  • default.

Here's an example:

<prosody volume="default">This is default volume.</prosody> <prosody volume="x-soft">This is x-soft volume.</prosody> <prosody volume="soft"> This is soft volume.</prosody> <prosody volume="loud"> This is loud volume.</prosody> <prosody volume="x-loud">This is x-loud volume. </prosody>


Percentage value

Setting the volume as a percentage is not available for all voices.

This is a relative change expressed as a percentage, with "+" increasing the volume and "-" decreasing the volume. For example:

<prosody volume="+50%">This is volume +50%</prosody>
<prosody volume="-50%">This volume is -50%</prosody>


Combining Attributes

You can combine pitch, rate, and volume within the <prosody> tag to customize the output of the synthesized speech, providing more nuance and emotion to your spoken content.

<prosody pitch="-2st" rate="fast" volume="+3dB">This is a combined example</prosody>
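If you generate SSML from a program, a small builder keeps the attributes in a consistent order and escapes the spoken text. This is a minimal sketch using Python's standard xml.sax.saxutils module. The prosody function itself is a hypothetical helper, and it does not check whether a voice supports the chosen values.

```python
from xml.sax.saxutils import escape, quoteattr

ALLOWED_ATTRIBUTES = ("pitch", "rate", "volume")


def prosody(text: str, **attributes: str) -> str:
    """Build a <prosody> element from any combination of pitch, rate and volume."""
    unknown = set(attributes) - set(ALLOWED_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unsupported attributes: {sorted(unknown)}")
    rendered = "".join(
        f" {name}={quoteattr(attributes[name])}"
        for name in ALLOWED_ATTRIBUTES
        if name in attributes
    )
    # escape() protects &, < and > so the spoken text cannot break the markup
    return f"<prosody{rendered}>{escape(text)}</prosody>"


print(prosody("This is a combined example", pitch="-2st", rate="fast", volume="+3dB"))
```

The call above reproduces the combined example shown in this section.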


Remember, the exact impact of these adjustments can vary based on the synthesizer and the voice being used.

Summary

The <prosody> tag in SSML is a powerful tool for controlling prosody, that is, the melodic and rhythmic aspects of synthesized speech. Using it effectively can greatly improve the expressiveness and naturalness of your text-to-speech applications.

  1. The <prosody> tag might work differently for different voices because Speechgen uses multiple neural network engines.
  2. For some voices, semitones might not work well, while for others, percentages might not be effective.
  3. The <prosody> tag is best used for an entire sentence or its ending. It may not work properly if you use it in the middle of a sentence.

If you have any questions about SSML, feel free to ask in our International Telegram chat at @speechgen. We're here to help!

 
