Thursday, 1 December 2011

An Analysis of the Phonology of Computer-Generated English Using the Festival Speech Synthesis System

Crystal Darby
LALS 3002
December 18, 2009
1 Introduction
The goal of computer-generated speech is to mimic natural human speech. Many scholars have
researched methods of creating a synthesized voice that sounds like the natural human voice. One
such effort is the open-source Festival Speech Synthesis System (henceforth
just ‘Festival’).
This paper begins by establishing the theoretical background for the phonological
phenomena that are tested in Festival. It then describes the methodology used to gather
data from Festival and presents the results. I then discuss the findings and expand on any
interesting data that resulted. Finally, I conclude with suggestions for the improvement
of Festival as a speech synthesizer to ensure high-quality and natural-sounding
computer-generated English.
This paper contrasts computer-generated English with the theoretical phonological
rules of English. It does not contrast computer-generated English with the speech of a
native-speaking human participant, on the argument that there are too many nuances concerning
dialectal variation and variation within speakers themselves. Also, Festival has a
good, but limited, choice of voices for comparison. I do not think it would be accurate to
compare the RP British English or American English voices that Festival provides
with the Canadian English available from a human participant. Nevertheless, this paper
attempts to uncover some of the theoretical models of English that Festival makes use of in
its program. The main research questions are:
1. Can Festival create human-like speech?
2. What kinds of phonological rules can we find evidence of being used in Festival?
3. Can Festival implement some allophonic variation rules for English?
This paper will attempt to answer these questions by focusing on some of the key areas
discussed in LALS 3002, Phonology I: features, morphophonemic analysis, phonological alternation,
syllables, stress, and intonation (Hayes, 2009). More specifically, this paper will
implement a wug test (Berko, 1958) to test the phonological rules for pluralization and
answer the overall question: can Festival create human-like speech?
2 Theoretical Background
The following theories have been established in the field of phonology (Hayes, 2009; O’Grady
& Archibald, 2004; Odden, 2005). This study makes use of the literature that exists on
English phonology and borrows and expands on the methods and procedures outlined in
this section (Berko, 1958; Labov, 1966).
2.1 Computer-generated Language
Computer-generated speech has often been criticized for its “hollow” or “tin can” sounding
voice, but many efforts have improved the quality of these voices. It does not seem unreasonable
to think that a computer can produce realistic-sounding speech sounds, because speech sounds
can be measured as waveforms and waveforms are easily modeled by computers. Of course, a
computer has no vocal tract, but computers are nevertheless able to
generate realistic-sounding segments in isolation because of this measurable quality. Phonetics
can be easily modeled, but what about phonology? Phonetics is the
study of speech sounds and phonology is the study of sound patterns. The
phoneme /t/, for example, is not always realized as [t] in natural speech; it can surface
as a quite different sound in certain environments. The tapping rule for English is as
follows:
Rule 1. Tapping (adapted from Hayes (2009, p. 32) to use features)
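$$
\begin{bmatrix} -\text{son} \\ -\text{approx} \\ -\text{voice} \end{bmatrix}
\rightarrow
\begin{bmatrix} +\text{son} \\ +\text{approx} \\ +\text{voice} \end{bmatrix}
\; / \; [+\text{syll}] \; \underline{\hspace{1em}} \;
\begin{bmatrix} +\text{syll} \\ -\text{stress} \end{bmatrix}
$$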
/t/ is realized as the tap [ɾ] when it is between a vowel and an unstressed vowel.
This rule allows us to generate the following output¹:
butter      attention
/bʌtɚ/      /ətɛnʃʌn/      underlying forms
ɾ           –              Rule 1. Tapping
[bʌɾɚ]      [ətɛ̃nʃʌ̃n]      surface forms
Even though we have “tt” orthographically in both “butter” and “attention,” it does
not necessarily correspond to [t] in the pronunciation. One exhaustive way to
solve this problem without phonology would be to record all of the words in a language and
have the speech synthesizer access these recordings and arrange them in the order of the
given input. However, it would be inefficient to record and store every possible phonemic
alternation, especially since languages have a creative quality: speakers can generate
new combinations of words on the spot to create never-before-heard utterances. Recording
every word of a language would be nearly impossible! Therefore, a well-designed speech
synthesizer must make use of phonology to derive the correct sounds of an
utterance. In the tapping case, it is not that bad if the synthesizer fails, because [t] and [ɾ]
do not contrast in English. That is, [bʌɾɚ] and [bʌtɚ] do not signal two different
words; they both correspond to the word “butter.” This may not be the case
in other languages. In Spanish, for example, /t/ and /ɾ/ contrast (Hayes, 2009, p. 32).
Thus, it is particularly important that computer-generated speech not only handle the
allophonic variation of one language but also respect the differences in other languages.
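To make the idea of a synthesizer "making use of phonology" concrete, the tapping rule can be sketched as a small function over a list of segments. This is only an illustration under stated assumptions: the segment representation (symbol plus a stress flag), the tiny vowel set, and the function name `apply_tapping` are hypothetical and are not taken from Festival. Nasalization is left out, as in the derivation above.

```python
# Toy vowel inventory: just the vowels needed for the two example words
# (an assumption for illustration, not a full inventory of English).
VOWELS = {"ʌ", "ɚ", "ə", "ɛ"}

def apply_tapping(segments):
    """Apply Rule 1: /t/ surfaces as [ɾ] between a vowel and an unstressed vowel.

    `segments` is a list of (symbol, stressed) pairs; the stress flag is
    only meaningful for vowels.
    """
    out = [symbol for symbol, _ in segments]
    for i in range(1, len(segments) - 1):
        symbol, _ = segments[i]
        prev_symbol, _ = segments[i - 1]
        next_symbol, next_stressed = segments[i + 1]
        if (symbol == "t" and prev_symbol in VOWELS
                and next_symbol in VOWELS and not next_stressed):
            out[i] = "ɾ"
    return "".join(out)

butter = [("b", False), ("ʌ", True), ("t", False), ("ɚ", False)]
attention = [("ə", False), ("t", False), ("ɛ", True), ("n", False),
             ("ʃ", False), ("ʌ", False), ("n", False)]

print(apply_tapping(butter))     # bʌɾɚ
print(apply_tapping(attention))  # ətɛnʃʌn (no tap: the following vowel is stressed)
```

The point of the sketch is that the rule needs more than spelling: it has to know which neighbors are vowels and where stress falls, which is exactly the information a recording-lookup approach would not supply.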
¹ The nasalization rule is also applied to get this form, but I have not spelled it out in this section.
2.2 Wug Testing
Berko (1958) performed a simple experiment to test for a child’s acquisition of inflectional
morphemes by presenting pictures of creatures and identifying them with English-sounding
names. The most famous of these creatures is the wug. Berko would show a picture of
the creature along with the sentence “This is a wug.” A second picture was presented and
children were prompted with “Now, there’s another one. There are two of them. Now, there
are two . . . ?” where the expected response would be “wugs,” pronounced [wʌgz]. If the
children were able to produce the proper pronunciation of the base noun with the affix, it
was assumed that the child had acquired the corresponding phonological rule.
In the case of [wʌgz], if we assume that the underlying form of the morpheme -s is /z/,
then the phonological rule that yields the [s] allomorph is:
Rule 2. Assimilation
$$
[+\text{voice}] \rightarrow [-\text{voice}] \; / \;
\begin{bmatrix} +\text{cons} \\ -\text{voice} \end{bmatrix} \; \underline{\hspace{1em}}
$$
/z/ becomes /s/ after voiceless consonants.
For the cases of -es, we again take /z/ to be the underlying form, and the following
rule yields the correct pronunciation:
Rule 3. Schwa insertion
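$$
\varnothing \rightarrow
\begin{bmatrix} -\text{front} \\ -\text{back} \\ -\text{round} \\ -\text{high} \\ -\text{low} \\ -\text{tense} \end{bmatrix}
\; / \;
\begin{bmatrix} +\text{cons} \\ +\text{cont} \\ -\text{son} \\ -\text{approx} \end{bmatrix}
\; \underline{\hspace{1em}} \;
\begin{bmatrix} +\text{cons} \\ +\text{cont} \\ +\text{delayed release} \\ -\text{voice} \\ -\text{son} \\ -\text{approx} \end{bmatrix}
$$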
/ə/ is inserted between a fricative or affricate and /s/.
We assume /z/ is the underlying form of -s because it occurs in more environments than
/s/ and /əz/. With these rules, we can generate the following phonological output:
cat         dog         horses
/kæt/       /dɑg/       /hɔɹs/       lexical entries
Morphology
kæt-z       dɑg-z       hɔɹs-z       Plural
Phonology
/kætz/      /dɑgz/      /hɔɹsz/      underlying forms
s           –           –            Rule 2. Assimilation
–           –           ə            Rule 3. Schwa insertion
[kæts]      [dɑgz]      [hɔɹsəz]     surface forms
Of course, these rules only apply to the -s suffix. Other morphological pluralization patterns
exist (e.g., mouse → mice), but since Festival equates orthography with sound, we can assume
that there is very little morphology involved in irregular pluralization. Section 4 outlines
the results of Festival’s implementation of the phonological rules of pluralization.
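The derivation in this section can be sketched in a few lines of Python. The sketch rests on two assumptions that are mine rather than the paper's: schwa insertion is applied before assimilation, so that the inserted vowel separates the suffix from the stem-final /s/ and the output matches the surface forms given above, and the trigger for schwa insertion is restricted to sibilants. The segment classes and the function name `pluralize` are hypothetical.

```python
# Segment classes for a toy inventory (assumptions for illustration only).
VOICELESS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ"}
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}

def pluralize(stem):
    """Attach underlying /z/ to a list of stem segments and derive the surface form."""
    form = stem + ["z"]                      # morphology: stem + /z/
    # Rule 3 (schwa insertion): separate a stem-final sibilant from the suffix.
    if form[-2] in SIBILANTS:
        form.insert(len(form) - 1, "ə")
    # Rule 2 (assimilation): /z/ devoices directly after a voiceless consonant.
    if form[-2] in VOICELESS:
        form[-1] = "s"
    return "".join(form)

for stem in (["k", "æ", "t"], ["d", "ɑ", "g"], ["h", "ɔ", "ɹ", "s"], ["w", "ʌ", "g"]):
    print(pluralize(stem))
```

Under these assumptions the four stems come out as kæts, dɑgz, hɔɹsəz, and wʌgz, which are the forms a speaker who has acquired the rules would be expected to produce in the wug test.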
2.3 Playing it by Ear
I believe that a weakness of this experiment is that it relies on my intuitions about what I hear
and uses these subjective judgments as results. Nevertheless, this method is a necessary
consequence of my decision not to compare computer-generated English
with the speech of a native English speaker. Here I describe the study whose
methodology influenced mine.
In 1966, Labov published The Social Stratification of English in New York
City, which reports a study he performed at three department
stores in New York City. The goal of the study was to analyze the presence or absence
of postvocalic [ɹ] in relation to social rank. That is,
Labov visited three department stores that were considered to have different social rankings:
Saks was considered the upper-class store, Macy’s the middle-class store, and
S. Klein the lower-class store. It was predicted that the employees of each store
would speak in a register that corresponds to the social rank of those who shop there, so
that an employee from Saks would speak in an upper-class register, and so on. Labov designed
a small test for the presence of postvocalic [ɹ] by posing a question that requires
the response “fourth floor,” which contains two instances of the target sound. He then had
the employees repeat “fourth floor” a second time in careful speech. The results showed
that the employees at the upper-class store, Saks, used postvocalic [ɹ] most frequently
in regular speech and emphasized it the most when repeating the phrase. Employees at the
middle-class store, Macy’s, emphasized postvocalic [ɹ] at the end of the
word (i.e., “floor”) but not necessarily inside the word (i.e., “fourth”). At the lower-class
store, S. Klein, employees most often did not pronounce postvocalic [ɹ] in regular speech
but often produced it in careful speech.
Labov’s experiment is still respected for its simplicity, even though he used no technical
equipment other than his own ears for his judgments. I will follow his methodology
and not consider it a weakness that I am playing it by ear, so to speak. However, as
the next section shows, I also use technology to support my intuitions.
3 Methodology
The methodology section is broken down into sections 3.1 Materials and 3.2
Procedure.
3.1 Materials
The Festival Speech Synthesis System was used as the primary source of data. I used the
SLT American female English voice from the online version of Festival, available at
http://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html. Research on the phonology
of the English language was used as a secondary source of data to compare with the data
from Festival (Hayes, 2009; Odden, 2005; O’Grady & Archibald, 2004; Berko, 1958; Labov,
1966). My own intuitions about English phonology were used as a third source of data, but
because my own dialect is Canadian English rather than American English,
I did not rely heavily on these intuitions.
Raw .wav data was collected directly from Festival. It was saved and played back directly
through the sound card on a Gentoo Linux system using the following command:
aplay filename.wav
This method of playback avoided any specialized filters that media
players might apply. The output was played through speakers, and I was the only interpreter
of the output. However,
Praat was also used to inspect the waveforms and spectrograms of the .wav files (Boersma
& Weenink, 2009).
3.2 Procedure
Target sentences were created to elicit certain responses from Festival (see Appendix A
for the input). Festival was accessed online at
http://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html and the target sentences were
entered with the SLT American
female English voice selected. The resulting .wav files were stored on a computer.
The files were played back using the aplay command as discussed in section 3.1. They
were played multiple times each to ensure a consistent interpretation of the data. The
interpretations were recorded.
Praat was used, when applicable, to support the interpretations of the data.
4 Results
Festival failed the wug test. After collecting data from four target sentences (see Appendix
A) testing for /s/, /z/, and /əz/, Festival generated the following:
cat         dog         horses
/kæt/       /dɑg/       /hɔɹs/       lexical entries
Morphology
kæt-s       dɑg-s       hɔɹs-s       Plural
Phonology
/kæts/      /dɑgs/      /hɔɹss/      underlying forms
–           –           –            Assimilation
–           –           ə            Schwa insertion
[kæts]      *[dɑgs]     *[hɔɹsəs]    surface forms
5 Discussion
Out of the three target words for the wug test, [kæts], [dɑgz], and [hɔɹsəz], only [kæts] was
produced correctly.
I believe that the correct production of [kæts] is actually the product of a mistake. In
section 2.2, I argued that /z/ is the underlying form of the morpheme -s because it is
the elsewhere case: its environments are not as easily predicted
as those of /s/ and /əz/. However, I argue that Festival chooses /s/ as the underlying form of -s.
This is not a radical suggestion, since Festival relies on orthographic input and the written
letter “s” very often corresponds to the phoneme /s/ (but not in all cases, as outlined in
section 2.2).
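The hypothesis can be checked against the observed forms with a short sketch. It models only the account proposed here (underlying /s/, schwa insertion retained, no voicing rule) and says nothing about how Festival is actually implemented; the function name `festival_like_plural`, the sibilant set, and the `OBSERVED` table are hypothetical names introduced for illustration, with the "heard" forms copied from the results and discussion.

```python
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}  # assumed trigger class for schwa insertion

def festival_like_plural(stem):
    """Hypothesized behavior: the suffix is /s/; only schwa insertion applies."""
    form = stem + ["s"]
    if form[-2] in SIBILANTS:
        form.insert(len(form) - 1, "ə")
    return "".join(form)

# Stem segments paired with the form I heard from Festival.
OBSERVED = {
    "cat": (["k", "æ", "t"], "kæts"),
    "dog": (["d", "ɑ", "g"], "dɑgs"),
    "horse": (["h", "ɔ", "ɹ", "s"], "hɔɹsəs"),
    "wug": (["w", "ʌ", "g"], "wʌgs"),
}

for word, (stem, heard) in OBSERVED.items():
    predicted = festival_like_plural(stem)
    print(word, predicted, predicted == heard)
```

Every predicted form matches what was heard, so the /s/ hypothesis is at least consistent with the data, including the accidentally correct [kæts].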
If we assume that the assimilation rule is ignored, *[dɑgs] can be produced. The same
output is produced for “wugs” as well: Festival produces *[wʌgs]. This is not terribly
surprising because “dogs” and “wugs” share the environment of
g ___ #
before the morphology is applied. What is surprising, however, is that the waveforms of
the /gs/ portion of “dogs” and “wugs” are not identical (see Figures 1 and 2).
[Figure 1: Waveform of the /gs/ portion of “dogs.” Utterance: “Now there are two dogs”; time axis 1.156–1.527 s.]
[Figure 2: Waveform of the /gs/ portion of “wugs.” Utterance: “Now there are two wugs”; time axis 1.1–1.479 s.]
I was surprised by this data because I did not think a speech synthesis system would have
much room for variation in pronunciation, especially within the same environment. I tested
for variation by recording the sentence “Now there are two dogs” a second time and
comparing the two recordings in Praat. They were identical. This rules out random
variation within the synthesizer; otherwise, the two waveforms of “Now there are two dogs”
would be slightly different.
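The identity check on the two recordings was done by eye in Praat, but the same comparison can be sketched with Python's standard `wave` module, assuming the saved files are uncompressed PCM .wav files. The function names and any file names used with them are hypothetical.

```python
import wave

def read_frames(path):
    """Return (params, raw_frames) for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        params = wav.getparams()
        frames = wav.readframes(wav.getnframes())
    return params, frames

def same_audio(path_a, path_b):
    """True when both files share a format and contain identical sample data."""
    params_a, frames_a = read_frames(path_a)
    params_b, frames_b = read_frames(path_b)
    # Compare channel count, sample width, and sample rate before the samples.
    same_format = params_a[:3] == params_b[:3]
    return same_format and frames_a == frames_b
```

Calling `same_audio` on the two saved copies of “Now there are two dogs” would return True if the synthesizer is deterministic for a given input, which is what the visual comparison suggested.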
There is a potential answer to this issue. It could be that the /dɑ/ and /wʌ/ portions of
the words have some effect on the pronunciation of the rest of the word. This is somewhat
unintuitive. However, it may be a design decision by the developers of Festival to have
an overall assimilation algorithm that takes into account all of the sounds of an utterance
and applies some sort of blending to make neighboring sounds more similar. It is difficult
to isolate
the exact reason why the /gs/ portions differ, especially since there seems to be no variation
between duplicated phrases.
It is interesting to note that Rule 2 (Assimilation) was ignored but Rule 3 (Schwa insertion)
was not. This can be seen by comparing the waveforms of “horses” and “horse” (see Figures
3 and 4).
[Figure 3: Waveform of “horses.” Utterance: “Now there are two horses”; time axis 0.9103–1.61 s.]
The /s/ sounds occur where the waveform is narrowest. We can see in Figure 3 that
there is something between the two /s/ sounds: the /ə/. The spectrogram in Figure 5
supports this further.
The /ə/ is visible in Figure 5 as dark formant bands. Formants show peak energy.
[Figure 4: Waveform of “horse.” Utterance: “horse”; time axis 0.08155–0.6432 s.]
[Figure 5: Spectrogram of /səs/. Time axis 1.11–1.66 s; frequency axis 0–5000 Hz.]
Vowels have the highest sonority of any sounds, and thus they have the most acoustic energy
and are easily spotted by their formants. The /ə/ is absent from Figure 4, so it cannot be in
the underlying form of “horse.”
So why is it that we have /ə/ in the surface form of “horses” but no
rule that voices the second /s/ to [z]? It could be that schwa insertion occurs
because of some other rule of English, such as one that blocks gemination. I tried to test this by
giving Festival the input “ss,” but the result was [ɛsɛs], as if Festival were “spelling out”
the letters rather than attempting to say them as a word. Because of this, it is difficult to
isolate why the /ə/ appears.
6 Conclusion
In conclusion, Festival failed the wug test. Berko (1958) had success with children of just
5 years of age in her study, yet this sophisticated piece of software was unable to match
human-like speech on this test. This result alone, however, cannot discredit Festival as an
inferior product.
There are technical limitations to making sweeping generalizations about Festival through
this study. This was a study of just one of Festival’s voices. It could be true that this voice
has not been programmed to follow rules like English pluralization but other voices have.
This becomes a limitation of the voice and not of Festival itself.
I have made suggestions as to why Festival may have failed to produce the correct outputs.
It may be the case that it assumes /s/ to be the underlying form of -s, in which case
the assimilation rule never applies. Also, schwa insertion may occur as a result of blocking
gemination or another such rule. This study cannot account for the differences observed in the same
immediate phonetic environment. I have suggested that there may be a sort of smoothing
algorithm that is applied to the utterance as a whole. Further attempts to isolate these
particular issues are needed.
References
Berko, J. (1958). The child’s learning of English morphology. Word, 14, 150–177.
Boersma, P. & Weenink, D. (2009). Praat: Doing phonetics by computer.
http://www.fon.hum.uva.nl/praat/.
Chomsky, N. & Halle, M. (1968). The Sound Pattern of English. New York: Harper & Row.
Hayes, B. (2009). Introductory Phonology. Malden, MA: Wiley-Blackwell.
Labov, W. (1966). The Social Stratification of English in New York City. Washington, DC:
The Center for Applied Linguistics.
Odden, D. (2005). Introducing Phonology. Cambridge: Cambridge University Press.
O’Grady, W. & Archibald, J. (2004). Contemporary Linguistic Analysis (5th ed.). Toronto:
Pearson-Longman.
University of Edinburgh (2009). The Festival Speech Synthesis System.
http://www.cstr.ed.ac.uk/projects/festival/.
A Appendix
The following items were typed into the Festival Speech Synthesis System at
http://www.cstr.ed.ac.uk/projects/festival/onlinedemo.html using the SLT American
female voice.
1. Now there are two cats.
2. Now there are two dogs. (x2)
3. Now there are two horses.
4. Now there are two wugs.
5. horse
source: http://crystal.dcbruce.com/wp-content/uploads/2010/08/3002-final-paper.pdf
