Script recording guidelines
Learn how to read and record the voice cloning script effectively.
In order to create a voice clone, we need the speaker to record a tailored script. This script is specially designed to provide all the speech data required for effective voice training.
The quality of the script reading, in terms of both performance and sound quality, will affect the resulting voice clone. We have put together these speaking and recording guidelines to help you achieve the best possible result.
Our scripts contain thousands of utterances used to train a voice model. We require a minimum of 1,000 high-quality utterances to train a production-ready voice.
We recommend that you record 100 utterances per session; each session will take around 45 minutes. This will reduce the risk of errors and voice fatigue.
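As a rough planning aid, the figures above can be turned into a session count. The sketch below is illustrative arithmetic only: the function name `plan_sessions` is hypothetical, and it assumes that every session covers 100 utterances and lasts about 45 minutes.

```python
import math

MIN_UTTERANCES = 1000         # minimum stated in these guidelines
UTTERANCES_PER_SESSION = 100  # recommended session size
MINUTES_PER_SESSION = 45      # approximate length of one session


def plan_sessions(total_utterances: int = MIN_UTTERANCES) -> tuple[int, float]:
    """Return (number of sessions, approximate total recording hours)."""
    # Round up so that a partial final session still counts as a session.
    sessions = math.ceil(total_utterances / UTTERANCES_PER_SESSION)
    hours = sessions * MINUTES_PER_SESSION / 60
    return sessions, hours


sessions, hours = plan_sessions()
print(f"{sessions} sessions, about {hours:.1f} hours of recording")
```

For the 1,000-utterance minimum this works out to 10 sessions, or roughly 7.5 hours of recording time, not counting breaks or retakes.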
The speaker should read the script as if they were narrating an article. A few examples below:
Below are a few tips to help ensure the highest-quality recording:
- Recording location: It is important to record in a quiet location and to use the same recording equipment throughout. We recommend recording in a professional studio and sitting at a consistent distance from the microphone. The speaker should make sure they are comfortable before recording, to eliminate the need for movement.
- Pronunciations: To ensure that words are mapped to their correct sounds, it is crucial that words are pronounced accurately and distinctly, exactly as they are in the script. The script has been normalized for text-to-speech, so you will notice some unusual punctuation and formatting (for example, '2020' written as 'twenty twenty'). Where letters should be pronounced individually, spaces or hyphens will be used to indicate breaks (for example, 'I S S', 'CAR-T'). The speaker should take the time to review the script beforehand and clarify the pronunciation of any unfamiliar or ambiguous words.
- Speaking style: Use a natural speaking style that you will be able to maintain consistently throughout the recordings. Each line that you record should be plausible in isolation. This means that you shouldn't give particular emphasis to any word which would rely on context from outside the text. While some variance is natural, it is important to keep volume, pitch, intonation, and tempo as consistent as possible.
- Voice quality: The speaker should take regular water breaks and rest their voice to ensure consistency. Rather than recording the script all at once, we recommend recording in multiple short sessions, to reduce the risk of the voice becoming tired or strained.
- Breathing and pausing: Make sure to pause after each utterance, and try to breathe away from the microphone before starting the next one. Otherwise, try to keep your breathing at a low and consistent volume, or else the voice clone's breaths can become unnatural and distracting.
We recommend saving each utterance as an individual .wav audio file, with the file name matching the utterance ID — for example, 1.wav.
If you wish to record multiple utterances per file, leave a pause of at least 3 seconds between utterances. The file name should match the utterance ID range, for example, 1-100.wav.
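The naming convention above can be checked before submission. The following is a minimal sketch, assuming utterance IDs are plain integers as in the examples `1.wav` and `1-100.wav`; the helper name `utterance_ids` is hypothetical and not part of any BeyondWords tooling.

```python
import re

# Matches "1.wav" (single utterance) or "1-100.wav" (inclusive ID range).
# Assumption: utterance IDs are plain integers, as in the examples above.
_NAME_RE = re.compile(r"^(\d+)(?:-(\d+))?\.wav$", re.IGNORECASE)


def utterance_ids(file_name: str) -> range:
    """Return the utterance IDs that a recording's file name claims to cover."""
    match = _NAME_RE.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    first = int(match.group(1))
    last = int(match.group(2)) if match.group(2) else first
    if last < first:
        raise ValueError(f"ID range is reversed in {file_name!r}")
    return range(first, last + 1)


print(list(utterance_ids("1.wav")))    # a single utterance
print(len(utterance_ids("1-100.wav")))  # number of utterances in the range
```

Returning a `range` makes it easy to compare the IDs covered by a set of files against the IDs in the script and spot gaps or duplicates.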
| Property | Requirement |
| --- | --- |
| File format | *.wav, mono |
| Sampling rate | 22 kHz minimum |
| Sample format | 16-bit PCM minimum |
| Peak volume levels | -3 dB to -6 dB |
| Signal-to-noise ratio (SNR) | > 35 dB |
| Environment noise, echo | Noise level at the start of the waveform, before speech begins: < -70 dB |
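Some of the requirements in the table above can be checked automatically before submission. The sketch below uses only Python's standard `wave` and `array` modules; the function name `check_wav` is hypothetical. It assumes that peak levels are measured relative to digital full scale and that "22 kHz" means at least 22,000 samples per second. It does not measure SNR or the noise floor, which require separating speech from silence.

```python
import array
import math
import sys
import wave


def check_wav(path: str) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes per sample
        frames = wav.readframes(wav.getnframes())

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate < 22000:
        problems.append(f"sampling rate {rate} Hz is below 22 kHz")
    if width < 2:
        problems.append(f"sample format is {8 * width}-bit, expected at least 16-bit")

    # Peak level is only computed for 16-bit mono files, to keep the sketch short.
    if width == 2 and channels == 1 and frames:
        samples = array.array("h")
        samples.frombytes(frames)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian
        peak = max(abs(s) for s in samples)
        # Assumption: dB values are relative to full scale (32768 for 16-bit).
        peak_db = 20 * math.log10(peak / 32768) if peak else float("-inf")
        if not -6.0 <= peak_db <= -3.0:
            problems.append(f"peak level {peak_db:.1f} dB is outside -6 dB to -3 dB")
    return problems
```

Calling `check_wav("1.wav")` on a compliant recording should return an empty list; any other result lists the requirements that the file appears to miss.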
Once recording is complete, you will need to submit the audio files to BeyondWords. Our team will then use the recordings to train your voice model.