31-07-2023 , 11-10-2023
The <prosody> tag controls several attributes of speech synthesis: the pitch, volume, and rate (speed) of speech. This guide explains in detail how to use the <prosody> tag, with examples, covering all the measurement units, including Hertz, semitones, percentages, and relative values.
The <prosody> tag allows you to control the tone (pitch), speed (rate), and volume of the synthesized speech. This is especially useful when you want to convey certain emotions or emphasis in the synthesized speech.
Pitch changes should stay within 0.5 to 1.5 times the pitch of the original audio.
Attention: The <prosody> tags are ideally used to enclose a complete sentence. Applying them to individual words within a sentence might lead to unintended breaks in the speech flow. While it's possible to use <prosody> on a part of a sentence, it should ideally be restricted to the ending portion (and not in the middle or beginning).
The pitch attribute adjusts the tonality of the speech. Here's how you can express the pitch:
Relative value in semitones (st): A relative value can be either a change up (+) or down (-) from the current pitch, expressed in semitones (st). For example:
<prosody pitch="+6st">The text will be spoken 6 semitones higher</prosody>
Percentage value: This is a relative change expressed as a percentage, with "+" increasing the pitch and "-" decreasing the pitch. For example:
<prosody pitch="-40%">The text will be spoken 40% lower</prosody>
The advantage of relative values is that you can set any value, even a very extreme one. Keep in mind, though, that the effect may vary between voices.
You can also use one of these predefined constant values: x-low, low, medium, high, x-high, or default.
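To compare semitone and percentage shifts on one scale, both can be converted to a frequency ratio. The sketch below assumes equal-tempered semitones (one semitone is a factor of 2^(1/12)) and reads the 0.5 to 1.5 guideline above as a ratio of new pitch to original pitch. The function names are hypothetical, and a given voice may map these values differently.

```python
# Assumption: the 0.5-1.5 guideline is a ratio of new pitch to original pitch.
MIN_RATIO = 0.5
MAX_RATIO = 1.5


def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a shift in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def percent_to_ratio(percent: float) -> float:
    """Frequency ratio for a signed percentage change, e.g. -40 -> 0.6."""
    return 1.0 + percent / 100.0


def within_guideline(ratio: float) -> bool:
    """True if the ratio falls inside the assumed 0.5-1.5 range."""
    return MIN_RATIO <= ratio <= MAX_RATIO


if __name__ == "__main__":
    for label, ratio in [
        ("+6st", semitones_to_ratio(6)),
        ("-40%", percent_to_ratio(-40)),
        ("+12st", semitones_to_ratio(12)),
    ]:
        print(f"{label}: ratio {ratio:.3f}, within guideline: {within_guideline(ratio)}")
```

Under these assumptions, +6st and -40% both fall inside the guideline, while a full octave up (+12st, a ratio of 2.0) does not.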
Let's take a look at some examples.
<prosody pitch="x-low">I'm speaking this text with an x-low constant pitch</prosody>
<prosody pitch="x-high">I'm speaking this text with an x-high constant pitch</prosody>
Some voices also support setting the pitch in Hertz.
Relative value in Hertz (Hz): A change up (+) or down (-) from the current pitch, expressed in Hertz (Hz). For example:
<prosody pitch="-40Hz">The text will be spoken 40Hz lower.</prosody>
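If pitch values are generated by a program, it can help to check their format before sending the SSML. The sketch below accepts only the forms described in this guide: signed values in st, %, or Hz, plus the standard SSML pitch constants. The helper name is hypothetical, and a synthesizer may accept other forms or reject some of these for a particular voice.

```python
import re

# Standard SSML pitch constants (assumed to apply to the voice in use).
PITCH_CONSTANTS = {"x-low", "low", "medium", "high", "x-high", "default"}

# Signed number followed by a unit: +6st, -40%, -40Hz, +1.5st
_RELATIVE_PITCH = re.compile(r"[+-]\d+(?:\.\d+)?(?:st|%|Hz)")


def is_valid_pitch(value: str) -> bool:
    """Check a pitch value against the forms covered in this guide."""
    return value in PITCH_CONSTANTS or _RELATIVE_PITCH.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["+6st", "-40%", "-40Hz", "x-high", "6 semitones"]:
        print(candidate, is_valid_pitch(candidate))
```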
The rate attribute controls the speed at which the text is spoken. Here's how you can express the rate:
Percentage value: This is a relative change expressed as a percentage, with "+" increasing the rate and "-" decreasing the rate. For example:
<prosody rate="-30%">This text will be spoken at a rate 30% slower than the current rate.</prosody>
Example of speeding up speech:
<prosody rate="+70%">This text will be spoken at a rate 70% faster than the current rate</prosody>
Not all voices support relative rate changes with signed percentages; some voices accept only unsigned percentages. An unsigned (non-negative) percentage gives the speaking rate relative to normal: 100% means no change, 200% means the speech is twice as fast, and 50% means it is half as fast. The allowed range for this value is 20% to 200%.
Here's an example with a rate of 50%, which is half the normal speed (100%). The speech will be two times slower than usual.
<prosody rate="50%">This text will be spoken at a rate 50% slower than the current rate</prosody>
Here's an example where the rate is 150%. Since 100% is the normal speed, the speech is 50% faster, that is, one and a half times the normal speed.
<prosody rate="150%">This text will be spoken at 150% of the normal rate, which is 50% faster</prosody>
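The two percentage notations are easy to mix up, so here is a small sketch that converts a signed value into the unsigned form and clamps it to the 20% to 200% range mentioned above. It assumes that "+70%" and "170%" describe the same rate, which follows from the definitions in this guide but may not hold for every voice. The function names are hypothetical.

```python
MIN_RATE = 20   # lowest unsigned percentage, per the range given above
MAX_RATE = 200  # highest unsigned percentage, per the range given above


def signed_to_unsigned(signed: str) -> str:
    """Convert '+70%' to '170%' or '-30%' to '70%', clamped to 20-200%."""
    if len(signed) < 3 or signed[0] not in "+-" or not signed.endswith("%"):
        raise ValueError(f"expected a value like '+70%', got {signed!r}")
    change = float(signed[:-1])  # float() accepts a leading sign
    absolute = 100.0 + change
    clamped = min(max(absolute, MIN_RATE), MAX_RATE)
    return f"{clamped:g}%"


def duration_factor(unsigned: str) -> float:
    """How many times longer the speech lasts at the given unsigned rate."""
    return 100.0 / float(unsigned.rstrip("%"))


if __name__ == "__main__":
    print(signed_to_unsigned("+70%"))
    print(signed_to_unsigned("-30%"))
    print(duration_factor("50%"))
```

With these definitions, a rate of 50% doubles the duration of the speech, and a rate of 150% shortens it to two thirds.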
You can also use one of these predefined constant values: x-slow, slow, medium, fast, x-fast, or default.
<prosody rate="slow">I speak slowly and with expression, wow!</prosody>
Now let's make it speak quickly using x-fast.
<prosody rate="x-fast">I speak faster and with expression, wow!</prosody>
Word constants for the speech rate are useful when you need a simple, quick result.
The volume attribute controls the loudness of the speech. Here's how you can express the volume:
You can set the value in decibels (dB) with a plus or minus sign.
Normal text. <prosody volume="-15dB">And this text will be spoken at a volume 15dB lower than the current volume.</prosody> <prosody volume="+10dB">This text is 10 decibels higher.</prosody>
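Decibels are logarithmic, so it can be useful to see what a given value means as a linear factor. The SSML 1.1 specification defines a dB volume change in terms of signal amplitude, as 20·log10(a1/a0). The sketch below applies that formula; whether a particular voice follows it exactly is an assumption, and the function names are hypothetical.

```python
import math


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio a1/a0 for a change in dB, using dB = 20*log10(a1/a0)."""
    return 10.0 ** (db / 20.0)


def amplitude_ratio_to_db(ratio: float) -> float:
    """Inverse conversion: dB change for a given amplitude ratio."""
    return 20.0 * math.log10(ratio)


if __name__ == "__main__":
    for db in (-15, 10, 6):
        print(f"{db:+d}dB -> amplitude x{db_to_amplitude_ratio(db):.3f}")
```

By this formula, -15dB reduces the amplitude to roughly 18% of the original, and +10dB raises it a little over three times.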
You can also use predefined constant values such as default, x-soft, soft, loud, and x-loud. Here's an example:
<prosody volume="default">This is default volume.</prosody> <prosody volume="x-soft">This is x-soft volume.</prosody> <prosody volume="soft"> This is soft volume.</prosody> <prosody volume="loud"> This is loud volume.</prosody> <prosody volume="x-loud">This is x-loud volume. </prosody>
Setting the volume as a percentage is not supported by all voices.
Percentage value: This is a relative change expressed as a percentage, with "+" increasing the volume and "-" decreasing the volume. For example:
<prosody volume="+50%">This is volume +50%</prosody>
<prosody volume="-50%">This volume is -50%</prosody>
You can combine pitch, rate, and volume within the <prosody> tag to customize the output of the synthesized speech, providing more nuance and emotion to your spoken content.
<prosody pitch="-2st" rate="fast" volume="+3dB">This is a combined example</prosody>
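When SSML is assembled in code, a small helper can build the combined tag and escape special characters in the text. This is a minimal sketch using Python's standard library; the function name prosody and its parameters are hypothetical and not part of any SpeechGen API.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def prosody(
    text: str,
    pitch: Optional[str] = None,
    rate: Optional[str] = None,
    volume: Optional[str] = None,
) -> str:
    """Wrap text in a <prosody> tag, including only the attributes given."""
    attrs = ""
    for name, value in (("pitch", pitch), ("rate", rate), ("volume", volume)):
        if value is not None:
            # quoteattr adds the surrounding quotes and escapes the value
            attrs += f" {name}={quoteattr(value)}"
    # escape() protects &, < and > inside the spoken text
    return f"<prosody{attrs}>{escape(text)}</prosody>"


if __name__ == "__main__":
    print(prosody("This is a combined example", pitch="-2st", rate="fast", volume="+3dB"))
```

Called with the values from the example above, the helper reproduces the same tag. Escaping matters when the text contains characters such as "&", which would otherwise make the SSML invalid.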
Remember, the exact impact of these adjustments can vary based on the synthesizer and the voice being used.
The <prosody> tag in SSML is a powerful tool for shaping the prosody, that is, the melodic and rhythmic aspects, of synthesized speech. Understanding how to use it effectively can greatly improve the expressiveness and naturalness of your text-to-speech applications.
If you have any questions about SSML, feel free to ask in our International Telegram chat at @speechgen. We're here to help!