Fun with speech synthesis and Chicken Scheme
I've always been fascinated with making my computer talk. I can still hear my old MacBook saying "I'm Alex, I'm a new voice for Leopard." Anyway, that's not an endorsement of Apple; they're trash[1]. I remember first starting to play Eve Online[2] and thinking how cool it would be to play "docking request… accepted" upon successful login to my laptop. Though I do have my laptop yell at me when it needs to be plugged in ("Battery critically low, consider charging!"), I've mostly come around to thinking that it could get pretty annoying if everything on my computer talked to me. Still, it can be fun to play with speech synthesis.
On Linux, there are a couple of options. I'm partial to the more robotic charm of espeak-ng, but there's also flite and festival. While the last two sound a little more natural, I find espeak the best combination of simple and featureful: it supports a good number of languages, tweakable parameters, and even text to IPA phoneme conversion!
Using espeak-ng as a C library
Espeak is a neat command line program, but it also comes with a fairly featureful (if a bit wonky) C library, speak_lib.h. While there's a getting started document and other assorted documentation, most of the C library is only documented in the header file itself. I won't be speaking about the callback/event API, as I'm not yet familiar with that part of the library.
Let's take a look at some of the base functions and explain some of the less self-explanatory aspects:
int espeak_Initialize(espeak_AUDIO_OUTPUT output, int buflength, const char *path, int options);
This must be called once per program (you can't re-initialize), and most of the library functions depend on an initialized state to work properly at all. For speaking text out loud, you want to pass AUDIO_OUTPUT_PLAYBACK to output.
espeak_ERROR espeak_Synth(const void *text, size_t size, unsigned int position, espeak_POSITION_TYPE position_type, unsigned int end_position, unsigned int flags, unsigned int* unique_identifier, void* user_data);
This is the base speech synthesis function. There are a few fun options here: you can set the character, word, or sentence in the text to start speaking at with position and position_type, using an integer and any of POS_CHARACTER, POS_WORD, or POS_SENTENCE. Through the flags parameter, you can also tell it to treat < > elements as espeak SSML, and [[ ]] elements as Kirshenbaum-encoded phonemes. The espeak_SynthMark function is very similar to this one, except you can specify the name of an SSML <mark name="example"> element as the beginning of synthesis.
Things to watch out for: currently the character option behaves just like the word option, and word and sentence both seem 1-indexed, which isn't documented anywhere.
espeak_ERROR espeak_Key(const char *key_name);
espeak_ERROR espeak_Char(wchar_t character);
These are kind of cool - they speak the name of the keyboard key (a single-character string) or character, respectively. espeak_Key will also conveniently speak a full string if you pass it that.
Note that espeak_Char behaves kinda weirdly with higher code-point Unicode characters, repeating the first word in the name twice. (For example, it will say character 0x1f617, KISSING FACE, as "kissing kissing".)
espeak_ERROR espeak_Synchronize(void);
You call this to wait for all audio to be spoken. Useful for making sure your program doesn't exit abruptly in the middle of speaking.
const char *espeak_TextToPhonemes(const void **textptr, int textmode, int phonememode);
I like this one a lot. I don't personally have much use for it, but I can think of one or two people who just might[3]. With it, you can translate text into IPA phonemes in either Kirshenbaum encoding or IPA UTF-8 characters. Note that this requires a voice to be set using any of the espeak_SetVoiceBy* functions.
The way to toggle between the two is with a bit flag in phonememode. It seems the documentation for this flag is off by 1 bit. Regardless, you can use espeakPHONEMES_TIE and espeakPHONEMES_IPA so you don't have to remember these numbers. Here are some simple examples of setting these flags:
int phonememode = 0;
phonememode |= espeakPHONEMES_IPA; // Use IPA symbols
phonememode |= espeakPHONEMES_TIE; // Use the separator (if set) as a tie
phonememode |= ' ' << 8;           // Use space as a separator
// Or
phonememode |= '-' << 8;           // Use a hyphen as a separator
// Or, use COMBINING DOUBLE BREVE BELOW as separator. To my understanding this
// is useful as a tie since it's a combining character.
phonememode |= 0x35C << 8;
The library also has options to set things like gender, language, age, speech rate, etc.
A friendlier option with Chicken Scheme
Anyway, all that to say - this library is definitely cool, but not the most straightforward in my opinion. I thought it would be fun to create bindings for it in Chicken Scheme. Here is the finished result with documentation.
To go into it in a bit more detail, here's a direct binding to espeak_Initialize:
(define initialize (foreign-lambda int "espeak_Initialize" int int c-string int))
That would work fine, and we could call espeak_Initialize directly from Scheme in a similar fashion as in C, but since this is a high-level language, it makes sense to make the API a little more user-friendly.
The first thing we can do is to make optional arguments just that - optional. We can pass some default value to the low level binding if the user doesn't want to specify that parameter. Next, we can take all those bit flags and pass them as individual optional arguments as well, doing the actual bit processing internally so the user doesn't have to worry about it. Keyword arguments in scheme are great for both of these, as you can pass in flags and options by name.
With these things in mind, the function signature of initialize becomes:
(initialize #!key output buflength path phoneme-events phoneme-ipa dont-exit)
It can be used as follows:
;; Usage example:
(initialize)
;; Or, e.g.
(initialize path: "some/path" buflength: 80)
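The flag handling described above (individual named options folded into one integer behind the scenes) can be sketched in plain C. This is only an illustration of the idea, not the binding's actual code: the init_options struct, the init_option_bits helper, and the INIT_* names are all hypothetical, and the bit values are what I believe the espeakINITIALIZE_* constants are in speak_lib.h, so check them against your header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the espeakINITIALIZE_* option bits.
   The values are assumed to match speak_lib.h; verify against your copy. */
#define INIT_PHONEME_EVENTS 0x0001
#define INIT_PHONEME_IPA    0x0002
#define INIT_DONT_EXIT      0x8000

/* Hypothetical options record. Any field left zeroed means
   "not specified", much like an omitted keyword argument. */
struct init_options {
    int buflength;       /* 0 = not specified */
    const char *path;    /* NULL = not specified */
    int phoneme_events;  /* the three booleans below are folded */
    int phoneme_ipa;     /* into a single bit mask */
    int dont_exit;
};

/* Fold the boolean fields into the one integer that the C function's
   `options` parameter expects. */
int init_option_bits(const struct init_options *o)
{
    assert(o != NULL);
    int bits = 0;
    if (o->phoneme_events) bits |= INIT_PHONEME_EVENTS;
    if (o->phoneme_ipa)    bits |= INIT_PHONEME_IPA;
    if (o->dont_exit)      bits |= INIT_DONT_EXIT;
    return bits;
}
```

With C99 designated initializers, a caller could write something like `struct init_options o = { .path = "some/path", .buflength = 80 };` and every unnamed field stays zero, which is about as close as C gets to keyword arguments.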
Next, we can keep track of whether we've already initialized with an internal module variable. That way, every function that requires initialization can implicitly call initialize first, but only if it hasn't been called already. No more need to initialize by hand. This makes it easy to start using the API defaults directly without having to remember what order to call things in, etc.
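The "initialize once, on demand" pattern looks roughly like this in C. Again, this is a sketch under assumptions: do_initialize stands in for the real call to espeak_Initialize, and ensure_initialized, speak, and initialize_count are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool initialized = false; /* module-level state, hidden from callers */
static int init_calls = 0;       /* only here so the sketch can be observed */

/* Hypothetical stand-in for the real call to espeak_Initialize. */
static void do_initialize(void)
{
    init_calls++;
}

/* Run the initialization the first time through, and never again. */
void ensure_initialized(void)
{
    if (!initialized) {
        do_initialize();
        initialized = true;
    }
}

/* Hypothetical public function that needs an initialized library. */
void speak(const char *text)
{
    assert(text != NULL);
    ensure_initialized();
    /* ... the real synthesis call would go here ... */
}

/* How many times initialization has actually run. */
int initialize_count(void)
{
    return init_calls;
}
```

However many times speak is called, initialize_count stays at 1 after the first call. Note that this simple version isn't thread-safe.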
As a similar example, the function espeak_TextToPhonemes won't run with just initialization; it requires a voice to be set. In that case, I ensure that if no voice is currently set, we grab a default by passing the result of a call to espeak_GetCurrentVoice to espeak_SetVoiceByProperties. This is exploiting a (maybe intentional) behavior by which if you pass a voice to this function with no properties set (all 0 or NULL), the voice seems to be chosen by the "priority byte" described in speak_lib.h.
Here's the updated signature for the espeak_TextToPhonemes binding:
(text->phonemes input #!key ipa tie (separator #\null))
Here's how to use it:
;; Example usage:
(text->phonemes "hello")
;; => "h@l'oU"
(text->phonemes "hello" ipa: #t)
;; => "həlˈəʊ"
(text->phonemes "hello" ipa: #t separator: #\-)
;; => "h-ə-l-ˈəʊ"
(text->phonemes "hello my name is" ipa: #t tie: #t separator: #\x35c)
;; => "həlˈə͜ʊ ma͜ɪ nˈe͜ɪm ɪz"
Another quality improvement is that sometimes when retrieving a voice using the library, the previously mentioned "priority byte" is set, but espeak_SetVoiceByProperties doesn't like that. I modified the binding to that function to remove that byte if present.
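For the curious, here's a rough C sketch of what removing that byte could look like. It rests on two assumptions that you should check against speak_lib.h: that the priority byte sits at the very start of the voice's languages string, and that it's a small value below 0x20, so it can be told apart from the first letter of a language code. The helper name skip_priority_byte is hypothetical, and this isn't the binding's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Return the language string without a leading priority byte.
   Assumption: a first byte below 0x20 is the priority, since a
   language code such as "en" or "es" starts with a printable letter. */
const char *skip_priority_byte(const char *languages)
{
    assert(languages != NULL);
    unsigned char first = (unsigned char)languages[0];
    if (first != 0 && first < 0x20)
        return languages + 1; /* step over the priority byte */
    return languages;         /* nothing to strip */
}
```

The function only moves a pointer forward, so it never modifies or copies the string owned by the library.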
And finally, I wrote a higher-level function for speech synthesis that combines many different parameters, say:
(say text #!key sync
     ;; Voice properties
     name language identifier gender age variant
     ;; Parameters
     rate volume pitch range punctuation capitals wordgap)
You can specify any number of these options (or none for the default voice) once, and subsequent calls will use the same settings.
(import espeak)
(say "Una nueva voz para Linux" language: "es" gender: gender/female rate: 200)
;; And this will sound the same.
(say "Y esto sonará igual!")
If only you could see me now, Alex.
Footnotes:
[1] That's mostly a joke.
[2] It's ok, I know the kind of person I sound like.
[3] Hi mom! No, it doesn't do Nigerian Pidgin, sorry, though you can technically add and improve languages.