Phat Do researches synthetic speech
A Frisian voice from Vietnam
Phat Do likes language. He speaks Vietnamese – his mother tongue – as well as English, some Chinese, and Japanese. He is trying to learn Dutch with the help of Duolingo while also dabbling in Spanish and French. But he does not speak Frisian.
Yet the screen he’s placed in a small conference room at Campus Fryslân displays a Frisian saying: ‘Bûter, brea en griene tsiis, wa’t dat net sizze kin is gjin oprjochte Fries.’ Which translates as: ‘Butter, bread and green cheese, if you can’t say that, you’re no true Frisian.’
He clicks on a button to the right and a female voice starts to speak the phrase that – as legend goes – the medieval freedom fighter Grutte Pier used to tell real Frisians from the hated Dutch.
Click on the flags below to hear the Frisian pronunciation of synthetic voices ‘trained’ in Dutch, Finnish, Spanish, Japanese, and French.
The synthetic voice that uses Dutch as a source language sounds good, but the ‘ii’ sound in tsiis is not elongated enough. The intonation is off too. The pitch goes up in ‘oprjochte’ where it should stay level.
When Finnish is used as a source, the word ‘griene’ is cut short. The ‘e’ at the end is indiscernible. Here too the pitch goes up in ‘oprjochte’ where it shouldn’t.
With Spanish as a source language, the ‘ii’ in ‘tsiis’ sounds a bit like the ‘i’ in the English word ‘into’. The pitch at the end sounds good, though.
Japanese has too much ‘t’ sound in ‘bûter’. The ‘ii’ in ‘tsiis’ sounds more like the ‘a’ in the English ‘later’, which also goes for the shorter ‘i’ in ‘sizze’.
French as a source language produces too short of an ‘i’ in ‘tsiis’. The short ‘i’ in ‘sizze’ sounds like an elongated ‘a’.
It’s intelligible. Quite distinct even, for those who actually speak Frisian. Yes, there are some hiccups. The ‘ii’ sound of tsiis is definitely too short. The ‘o’ in oprjochte sounds a little off. But still, it’s rather good.
Geartsje de Vries
‘Look’, Do says. ‘You can make it say anything you want.’
He quickly goes to the website of regional broadcaster Omrop Fryslân and selects a random piece of text about the problems of local companies. Omrop Fryslân is good for a simple test like this, because he can be sure their written Frisian is accurate. He has little idea himself, though, what the text is actually about.
Instead of typing or clicking, we use our voices to interact with our devices
He pastes it into the text area of his program and clicks again. The female voice, modelled after Frisian voice actor Geartsje de Vries, reads the words aloud. Again, it’s not hard to make out, although the awkward pacing makes it more difficult to follow than the earlier short sentence.
It may not seem like much, but Do, who’s doing his PhD in voice technology at Campus Fryslân in Leeuwarden, has done something remarkable: he’s constructed a synthetic voice for Frisian. Google Translate may be able to translate words or sentences for you, but it cannot tell you how the words should actually sound. But Do’s homemade voice can, and that’s new.
What is more: he created the voice using only thirty minutes of recordings from audiobooks spoken by De Vries.
‘The use of voice technology is getting more and more substantial’, he explains. ‘We are moving away from typing or clicking. Instead, we use our voices to interact with our devices.’ Think of Alexa, Siri or Google Home. Think of Google Translate. But there are also websites that can be read aloud to someone who is blind or can’t read.
While that technology is widely available for languages like English, Mandarin, Spanish, or even Dutch, it’s a different matter for what Do calls ‘low-resource languages’: languages like Frisian that aren’t spoken by very many people.
Normally, creating a synthetic voice would require hundreds of hours of clearly spoken language with matching text. All that text has to be broken up into sentences or phrases, which must be transcribed into phonemes, which are the smallest units of sound that can distinguish one word from another.
These must then be linked to the matching audio and fed into the computer. After the machine has ‘learned’ how the words have to be pronounced, it starts predicting how unknown pieces of text might sound.
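The preparation described above – splitting text into phrases and transcribing each into phonemes – can be sketched with a toy grapheme-to-phoneme lookup. The lexicon and phoneme strings below are rough, invented transcriptions for illustration only, not an authoritative Frisian pronunciation dictionary; real systems use full lexicons plus forced alignment against the audio.

```python
# Toy illustration of the data-preparation step: map the words of a
# phrase to phoneme sequences via a small hand-made lexicon.
# The transcriptions are illustrative, not linguistically authoritative.
LEXICON = {
    "bûter": ["b", "u", "t", "ə", "r"],
    "brea": ["b", "r", "ɪə"],
    "en": ["ə", "n"],
    "griene": ["ɡ", "r", "iə", "n", "ə"],
    "tsiis": ["ts", "iː", "s"],
}

def phrase_to_phonemes(phrase):
    """Split a phrase into words and concatenate their phoneme sequences."""
    phonemes = []
    for word in phrase.lower().split():
        phonemes.extend(LEXICON.get(word, ["?"]))  # "?" marks unknown words
    return phonemes

print(phrase_to_phonemes("Bûter brea en griene tsiis"))
```

In a real pipeline, each phoneme sequence would then be time-aligned with the matching stretch of audio before training.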
That takes a lot of time, a lot of resources, and therefore a lot of money, which means low-resource languages – terms like ‘ethnic’ or ‘minority’ language can be politically loaded and are therefore avoided – may never get their own voice.
But it is important that they do, says Do. ‘Local governments who want a language to stay strong in their community would need it on their websites for visually challenged people’, he says. ‘But it is also relevant for learning languages or translation purposes.’
For a language to stay strong, governments need a synthetic voice on their websites
So Do is trying to find a way to create these synthetic voices as efficiently as possible for all low-resource languages. In his research, Frisian is just an example. It’s the ideal language for a case study, since at Campus Fryslân, he’s surrounded by native speakers.
‘I use a technique that is called “transfer learning”’, he explains. ‘It’s where you train a model with lots of data from one language first, and then make it adapt itself, using a little data from the target language.’
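The two-phase idea behind transfer learning can be illustrated with a deliberately tiny, hypothetical model: fit a one-parameter line by gradient descent on plentiful ‘source’ data, then adapt it with only a handful of ‘target’ examples. This is a sketch of the principle only; Do’s actual models are neural text-to-speech systems, not linear regressions.

```python
# Minimal sketch of transfer learning with a one-parameter model y = w * x:
# pretrain on plentiful "source" data, then briefly adapt on scarce
# "target" data, starting from the pretrained parameter.
def train(w, data, lr=0.01, steps=200):
    """Gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        w -= lr * grad
    return w

# Phase 1: pretrain on the source "language" (many examples, slope 2.0).
source = [(x, 2.0 * x) for x in range(1, 11)]
w = train(0.0, source)

# Phase 2: adapt to the target "language" (only 3 examples, slope 2.2).
target = [(x, 2.2 * x) for x in range(1, 4)]
w = train(w, target, steps=100)

print(round(w, 3))  # close to the target slope 2.2
```

The payoff is that phase 2 needs far less data than training from scratch, because the pretrained parameter already starts near the right answer – the same economy that lets Do get by with thirty minutes of Frisian audio.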
The technique itself isn’t new, Do stresses. But he believes it can be greatly improved, because until now, researchers have often used their gut feeling to choose the ideal ‘donor language’ to make up for the unavailable data. Usually, they go for a language from a similar family – for Frisian that might be Dutch or English, all West Germanic languages.
But after doing a meta-analysis of studies on the topic, Do came to believe that other factors than just family relation might play an important role. He set up an experiment in which he trained a Frisian voice using donor data from five different languages: Dutch, French, Spanish, Finnish and Japanese. He then asked native speakers to assess the audio of his synthetic voices and to judge how natural they felt them to be.
The result? ‘Using Dutch as a donor language gave the best quality’, Do says. He’d expected that. What was more interesting was the runner-up. Of the remaining options, Spanish or French might seem closest to Frisian. But no: ‘It turned out to be Finnish’, Do says. ‘A language which is part of the Uralic family.’
The synthetic voice that uses Dutch as a source language pronounces the sentence ‘Sa gau as it út it sicht rekket, twifel ik’ (As soon as it gets out of sight, I start wavering) almost perfectly, but the intonation at the end is not quite right.
When Finnish is used as a source, the word ‘gau’ sounds a little like ‘go’. The word ‘twifel’ has a ‘w’ sound that sounds like the English ‘wall’, but should sound like the English ‘far’.
With Spanish as a source language, the ‘gau’ sounds even more like an elongated ‘go’. The ‘w’ in ‘twifel’ is almost indiscernible.
Japanese doesn’t give a good ‘au’ sound either. Again it’s an o-sound, but with a twist. The ‘uu’ of ‘út’ has changed to ‘ah’. The ‘w’ in ‘twifel’ is English-sounding.
French as a source language produces an ‘l’ in ‘rekket’ where an ‘r’ should be heard. Here too the ‘w’ in ‘twifel’ sounds more like the English one in ‘wall’.
More important than language family, Do concluded, is the similarity of the sound – or phoneme – systems of two languages. Compare languages on their phoneme systems and you find that certain sounds are used in one language but not in another.
For example, the African !Xu languages have 141 phonemes – including many clicking sounds – whereas on the Papua New Guinean island of Bougainville, a language is spoken that has only 11. ‘It’s also about the frequency with which those phonemes are used, and how they are used together’, Do says. ‘In certain languages the ‘m’ often comes after the ‘a’ sound, but in other languages it’s never used that way.’
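Comparing phoneme systems can be made concrete with a simple set-overlap measure such as the Jaccard index. The tiny inventories below are invented for illustration and deliberately incomplete; a real comparison would use full inventories and, as Do notes, would also weigh phoneme frequencies and combinations.

```python
# Illustrative comparison of phoneme inventories via the Jaccard index
# (size of intersection divided by size of union). The inventories are
# toy, incomplete sets, not linguistic reference data.
def jaccard(a, b):
    return len(a & b) / len(a | b)

frisian  = {"p", "t", "k", "s", "f", "i", "e", "a", "o", "u", "ə"}
dutch    = {"p", "t", "k", "s", "f", "x", "i", "e", "a", "o", "u", "ə"}
japanese = {"p", "t", "k", "s", "h", "i", "e", "a", "o", "ɯ"}

print(jaccard(frisian, dutch))     # high overlap
print(jaccard(frisian, japanese))  # lower overlap
```

On this toy measure the Dutch-like inventory overlaps the Frisian-like one far more than the Japanese-like one does, mirroring the ranking Do found in his listening tests.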
I dream about making my very own personal assistant
The phoneme systems of Dutch and Finnish are closer to Frisian’s, whereas the Japanese system differs so much that the Japanese-trained voice scored last.
It’s a first step, says Do, who is currently working on improving his model. He now knows he should look at a language’s phoneme system. Next, he’s going to focus on the amount of data he will need from the source language and the target language. Could you make do with fewer hours of the target language, or, conversely, can you use too much data from the source language? Where is the ‘sweet spot’?
Hopefully, he will one day use all that knowledge to create his own Google Home or Alexa. Or rather – his partner’s. ‘She has always said that she wants me to someday build a little robot that resembles a certain cartoon character that we both like – Qoobee, a cute, yellow, chubby dragon from China. It may be far away still, but I dream about making my very own personal assistant.’