Fun with speech synthesis and Chicken Scheme

I've always been fascinated with making my computer talk. I can still hear my old MacBook saying "I'm Alex, I'm a new voice for Leopard." Anyway, that's not an endorsement of Apple; they're trash [1]. I remember first starting to play Eve Online [2] and thinking how cool it would be to play "docking request… accepted" upon successful login to my laptop. Though I do have my laptop yell at me when it needs to be plugged in ("Battery critically low, consider charging!"), I've mostly come around to thinking that it could get pretty annoying if everything on my computer talked to me. Still, it can be fun to play with speech synthesis.

On Linux, there are a couple of options. I'm partial to the more robotic charm of espeak-ng, but there's also flite and festival. While the last two sound a little more natural, I find espeak the best combination of simple and featureful: it supports a good number of languages, tweakable parameters, and even text-to-IPA phoneme conversion!

Using espeak-ng as a C library

Espeak is a neat command line program, but it also comes with a fairly featureful (if a bit wonky) C library, speak_lib.h. While there's a getting started document and other assorted documentation, most of the C library is only documented in the header file itself. I won't be speaking about the callback/event API, as I'm not yet familiar with that part of the library.

Let's take a look at some of the base functions and explain some of the less self-explanatory aspects:

int espeak_Initialize(espeak_AUDIO_OUTPUT output, int buflength,
                      const char *path, int options);

This must be called once per program (you can't re-initialize), and most of the library functions depend on an initialized state to work properly at all. For speaking text out loud, you want to pass AUDIO_OUTPUT_PLAYBACK to output.

espeak_ERROR espeak_Synth(const void *text,
                          size_t size,
                          unsigned int position,
                          espeak_POSITION_TYPE position_type,
                          unsigned int end_position,
                          unsigned int flags,
                          unsigned int* unique_identifier,
                          void* user_data);

This is the base speech synthesis function. There are a few fun options here: with position and position_type you can choose the character, word, or sentence in the text at which to start speaking, by passing an integer together with one of POS_CHARACTER, POS_WORD, or POS_SENTENCE. Through the flags parameter, you can also tell it to treat < > elements as espeak SSML and [[ ]] elements as Kirshenbaum-encoded phonemes. The espeak_SynthMark function is very similar to this one, except that you specify the name of an SSML <mark name="example"> element as the beginning of synthesis.

Things to watch out for: currently the character option behaves just like the word option, and word and sentence both seem 1-indexed, which isn't documented anywhere.

espeak_ERROR espeak_Key(const char *key_name);
espeak_ERROR espeak_Char(wchar_t character);

These are kind of cool - they speak the name of the keyboard key (a single-character length string) or character, respectively. espeak_Key will also conveniently speak a full string if you pass it that.

Note that espeak_Char behaves kinda weirdly with higher code-point unicode characters, repeating the first word in the name twice. (For example, it will say character 0x1f617, KISSING FACE, as "kissing kissing".)

espeak_ERROR espeak_Synchronize(void);

You call this to wait for all audio to be spoken. Useful for making sure your program doesn't exit abruptly in the middle of speaking.

const char *espeak_TextToPhonemes(const void **textptr,
                                  int textmode,
                                  int phonememode);

I like this one a lot. I don't personally have much use for it, but I can think of one or two people who just might [3]. With it, you can translate text into phonemes in either Kirshenbaum encoding or IPA UTF-8 characters. Note that this requires a voice to be set using any of the espeak_SetVoiceBy* functions.

The way to toggle between the two is with a bit flag in phonememode. It seems the documentation for this flag is off by 1 bit. Regardless, you can use espeakPHONEMES_TIE and espeakPHONEMES_IPA so you don't have to remember these numbers. Here are some simple examples of setting these flags:

int phonememode = 0;
phonememode |= espeakPHONEMES_IPA; // Use IPA symbols
phonememode |= espeakPHONEMES_TIE; // Use the separator (if set) as a tie
phonememode |= ' ' << 8; // Use space as a separator
// Or
phonememode |= '-' << 8; // Use a hyphen as a separator
// Or, use COMBINING DOUBLE BREVE BELOW as separator. To my understanding this
// is useful as a tie since it's a combining character.
phonememode |= 0x35C << 8;
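
The packing used above can be wrapped in a small helper. This is only a sketch: pack_phoneme_mode and phoneme_mode_separator are hypothetical names, not part of speak_lib.h, and the unpacking assumes nothing is stored above the separator bits. It also makes it easier to see that only one separator should be OR-ed into a given mode value.

```c
#include <stdint.h>

/* Hypothetical helper (not part of espeak-ng): combine flag bits such as
   espeakPHONEMES_IPA | espeakPHONEMES_TIE with a separator code point,
   which is stored from bit 8 upward as in the examples above. */
int pack_phoneme_mode(int flag_bits, uint32_t separator)
{
    return (int)((uint32_t)flag_bits | (separator << 8));
}

/* Hypothetical helper: recover the separator code point from a packed
   mode value. Assumes no other data lives above the separator bits. */
uint32_t phoneme_mode_separator(int mode)
{
    return (uint32_t)mode >> 8;
}
```

With the real header included, a call might look like pack_phoneme_mode(espeakPHONEMES_IPA | espeakPHONEMES_TIE, 0x35C).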

The library also has options to set things like gender, language, age, speech rate, etc.

A friendlier option with Chicken Scheme

Anyway, all that to say - this library is definitely cool, but not the most straightforward in my opinion. I thought it would be fun to create bindings for it in Chicken Scheme. Here is the finished result with documentation.

To go into a bit more detail, here's a quick example of a direct binding to espeak_Initialize:

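(import (chicken foreign)) ; assumes CHICKEN 5, where foreign-lambda lives here
(foreign-declare "#include <espeak-ng/speak_lib.h>") ; header path may differ on your system
(define initialize
  (foreign-lambda int "espeak_Initialize" int int c-string int))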

That would work fine, and we could call espeak_Initialize directly from Scheme in a similar fashion as in C, but since this is a high-level language, it makes sense to make the API a little more user-friendly.

The first thing we can do is to make optional arguments just that: optional. We can pass some default value to the low-level binding if the user doesn't want to specify that parameter. Next, we can take all those bit flags and expose them as individual optional arguments as well, doing the actual bit processing internally so the user doesn't have to worry about it. Keyword arguments in Scheme are great for both of these, as you can pass in flags and options by name.

With these things in mind, the function signature (and a usage example) of initialize becomes:

(initialize #!key

And it can be used as follows, for example.

;; Usage example:
;; Or, e.g.
(initialize path: "some/path" buflength: 80)

Next, we can keep track of whether we've initialized already with an internal module variable. That way, initialize is implicitly called before every function that requires it, but only if it hasn't been called already. No more need to initialize by hand. This makes it easy to start using the API defaults directly without having to remember what order to call things in.
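
The guard itself is a small pattern. Here it is sketched in C rather than Scheme, to stay close to the library; all names are hypothetical, and fake_initialize is a stand-in for espeak_Initialize so the sketch compiles without the library installed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for espeak_Initialize: it only counts how many
   times it has been called. */
int init_calls = 0;
void fake_initialize(void)
{
    init_calls++;
}

/* Internal state: has the library been initialized yet? */
static bool initialized = false;

/* Run the initializer on the first call only. */
void ensure_initialized(void)
{
    if (!initialized) {
        fake_initialize();
        initialized = true;
    }
}

/* Hypothetical wrapper: every public function starts with the guard. */
void wrapped_say(const char *text)
{
    ensure_initialized();
    (void)text; /* a real wrapper would hand the text to espeak_Synth here */
}
```

Calling wrapped_say any number of times runs the initializer exactly once.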

As a similar example, espeak_TextToPhonemes won't run with just initialization; it also requires a voice to be set. In that case, if no voice is currently set, I grab a default by passing the result of espeak_GetCurrentVoice to espeak_SetVoiceByProperties. This exploits a (maybe intentional) behavior: if you pass this function a voice with no properties set (all 0 or NULL), the voice seems to be chosen by the "priority byte" described in speak_lib.h.

Here's the updated signature for the espeak_TextToPhonemes binding:

(text->phonemes input
                (separator #\null))

Here's how to use it:

;; Example usage:
(text->phonemes "hello") ;; => "h@l'oU"
(text->phonemes "hello" ipa: #t) ;; => "həlˈəʊ"
(text->phonemes "hello" ipa: #t separator: #\-) ;; => "h-ə-l-ˈəʊ"
(text->phonemes "hello my name is" ipa: #t tie: #t separator: #\x35c)
;; => "həlˈə͜ʊ ma͜ɪ nˈe͜ɪm ɪz"

Another quality improvement: sometimes when retrieving a voice using the library, the previously mentioned "priority byte" is set, but espeak_SetVoiceByProperties doesn't like that. I modified the binding to that function to remove that byte if present.
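
To illustrate the idea (this is not the binding's actual code), here is a rough C sketch. It assumes the priority byte sits in front of the language string, as I read the description of the languages field in speak_lib.h, and that it is a small non-printable value while language codes start with a printable character. strip_priority_byte is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: skip a leading priority byte, if one is present.
   Assumption: priority bytes are below 0x20 (non-printable), whereas a
   language code such as "en-gb" starts with a printable character. */
const char *strip_priority_byte(const char *languages)
{
    if (languages == NULL || languages[0] == '\0') {
        return languages; /* nothing to strip */
    }
    if ((unsigned char)languages[0] < 0x20) {
        return languages + 1; /* step over the priority byte */
    }
    return languages;
}
```

A string without a priority byte passes through unchanged, so the helper is safe to apply unconditionally.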

And finally, I wrote a higher-level function for speech synthesis that combines many different parameters, say:

(say text #!key
          ;; Voice properties
          name language identifier gender age variant
          ;; Parameters
          rate volume pitch range punctuation capitals wordgap)

You can specify any number of these options (or none for the default voice) once, and subsequent calls will use the same settings.

(import espeak)
(say "Una nueva voz para Linux"
     language: "es" gender: gender/female rate: 200)
;; And this will sound the same.
(say "Y esto sonará igual!")

If only you could see me now, Alex.



[1] That's mostly a joke.

[2] It's ok, I know the kind of person I sound like.

[3] Hi mom! No, it doesn't do Nigerian Pidgin, sorry, though you can technically add and improve languages.

Date: 2020-12-18