Bayes Risk Decoding and its Application to System Combination
Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doctor of Natural Sciences, submitted by Diplom-Informatiker Björn Hoffmeister from Aachen
Reviewers: Professor Dr.-Ing. Hermann Ney, Privatdozent Dr. Jean-Luc Gauvain. Date of the oral examination: July 18, 2011. This dissertation is available online on the website of the university library.
Abstract

Speech recognition is the task of converting an acoustic signal which contains speech to written text. The error of a speech recognition system is measured as the number of words in which the recognized and the spoken text differ. This work investigates and develops decoding and system combination approaches within the Bayes risk decoding framework with the objective of reducing the number of word errors. The investigated approaches are computationally too expensive to be applied in the speech decoder. Instead, the result of a first recognition run is used, which narrows the number of hypotheses and provides the result in a compact form, the word lattice. In the single-system decoding task a single word lattice is given, and in the lattice-based system combination task a word lattice is provided by each system. In both cases the goal is to minimize the number of word errors in the final hypothesis. In large vocabulary continuous speech recognition (LVCSR) tasks the number of word errors is computed as the Levenshtein distance between recognized and spoken text. The Bayes risk decoding framework yields the hypothesis with the least expected number of errors w.r.t. a specified loss function, given the true sentence posterior probabilities. However, the true probabilities are not known, nor is the computation of the Bayes risk hypothesis with the Levenshtein distance as loss function computationally feasible for a word lattice. Consequently, in lattice-based Bayes risk decoding and system combination two problems have to be addressed: first, how to compute an estimate for the sentence posterior probabilities given one or several word lattices; second, how to approximate the Levenshtein distance such that the computation of the Bayes risk hypothesis becomes computationally feasible.

Based on the separation of the posterior probability computation and the loss function in the Bayes risk decoding rule, a framework will be developed which covers the common approaches to lattice-based system combination, like ROVER, CNC, and DMC. Furthermore, it will be shown that the common approximations of the Levenshtein distance used in LVCSR tasks can be classified into two categories for which efficient Bayes risk decoders exist. The existing approximations will be investigated and compared. New loss functions will be developed which overcome drawbacks of the existing approximations to the Levenshtein distance, like the frequently observed deletion bias. A data structure of particular interest is the confusion network (CN). In previous work it was shown that a CN has a simple decoding rule in the Bayes risk framework. In this work new algorithms for deriving a CN from a word lattice will be developed and compared to existing methods. Furthermore, the CN will be the basis for several investigations aiming at improving the posterior probability estimates and the approximation of the Levenshtein distance. The methods looked into include classifier-based system combination and the usage of a windowed Levenshtein distance as loss function for the Bayes risk decoder. A further topic of research is the log-linear model combination, for which the enhancement with model- and word-dependent scaling factors will be investigated.

The methods are tested on the Chinese speech recognition systems used by RWTH Aachen in the GALE project and on the lattices provided within the English track of the 2007 TC-Star EPPS evaluation. The best performing system combination methods investigated in this work improve the error rates by up to 10% relative for intra-site combination experiments and by more than 20% relative for cross-site combinations compared to the best single system. The newly developed methods show a slight improvement over the existing approaches to lattice decoding and lattice-based system combination.
Zusammenfassung (Summary)

Automatic speech recognition deals with the task of converting spoken language into written text. The error of a speech recognition system is measured as the number of words in which the spoken and the recognized text differ. The subject of this work is the use of the Bayes risk framework with the aim of minimizing the error of a single system or of a combination of several systems. Owing to the complexity of the methods, all experiments and investigations in this work are carried out on word lattices. A word lattice is the compact representation of a restricted hypothesis space which is produced by a preceding recognition run. In the case of system combination, one word lattice is provided per system. The goal is to generate from the word lattices a final hypothesis that has a lower word error than any of the individual systems. In large vocabulary continuous speech recognition the word error is defined as the Levenshtein distance between the spoken and the recognized word sequence. If the true sentence probabilities are known, the Bayes risk framework yields the word sequence with the least expected error. In practice, however, the true probabilities are not known, nor is the computation of the Bayes risk hypothesis on a word lattice tractable when the Levenshtein distance is used as cost function. This leads to the following two problems: first, how to estimate probabilities from the system-dependent word lattices; second, how to approximate the Levenshtein distance such that the computation of the Bayes risk hypothesis becomes tractable.

In this work, based on the separation of the probability estimation and the cost function in the Bayes risk computation, a general framework for lattice-based system combination is developed. The framework covers the methods commonly used in practice, among others ROVER, CNC, and DMC. Furthermore, it is shown that the approximations of the Levenshtein distance common in speech recognition can be divided into two classes for which the Bayes risk hypothesis can be computed efficiently. The known approximations are investigated and compared. New methods are developed which compensate for the drawbacks of the existing approximations, in particular the frequently observed high proportion of deletions. A data structure of particular interest is the confusion network (CN). Previous work showed that the Bayes risk hypothesis of a CN can be computed in a trivial way. In this work new methods for converting a word lattice into a CN are presented and compared with existing methods. Furthermore, the CN forms the basis for several approaches to improving the probability estimates and to approximating the Levenshtein distance more accurately. The investigated approaches include classifier-based system combination and the use of a windowed Levenshtein distance as cost function in the computation of the Bayes risk hypothesis. A further topic investigated in this work is log-linear model combination, for which model- and word-dependent scaling factors are introduced. Experiments are carried out with the Chinese speech recognizers developed at RWTH Aachen in the course of the GALE project, and with the word lattices provided in the course of the 2007 TC-Star EPPS evaluation.

The best system combination methods investigated in this work show a relative improvement in word error rate of up to 10% for in-house lattice combination and of more than 20% for the combination of lattices from several project partners. Here the relative improvement refers to the error rate of the best single system. Compared to the existing methods for lattice-based system combination, the newly developed methods achieve slight improvements.
Acknowledgement

First of all I would like to thank my doctoral adviser, Prof. Dr.-Ing. Hermann Ney, head of the Chair of Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik 6, at RWTH Aachen University, for his support and his interest. He introduced me to speech recognition in 2004 when I started my studies as a PhD student, and he has since then given me the opportunity and the freedom to pursue my ideas.

I would also like to thank Dr. Jean-Luc Gauvain for agreeing to review this thesis and for his interest in this work.

I am very grateful to Dr. Ralf Schlüter for his support in the field of Bayes risk decision theory and its application to speech recognition. His supportive coaching helped me to make my decisions and to define my long-term research goals.

Special thanks go to Stephan Kanthak, who mentored me in my first year and introduced me to the concepts of transducers and their application to speech recognition.

I would like to thank all my colleagues in the speech recognition group for the great teamwork in doing (and winning) evaluations, designing our software, and developing new ideas. In no particular order these include Christian Gollan, Stefan Hahn, Georg Heigold, Jonas Lööf, Christian Plahl, and David Rybach.

During my time at the Lehrstuhl für Informatik 6 I worked together with many people whom I would like to thank for the fruitful collaborations. In particular, I thank Dustin Hillard for the great teamwork in developing the classifier-based approach to system combination, and Mei-Yuh Hwang for the challenging and exciting times in the GALE project.

For the good times and the memorable moments I had at the Lehrstuhl für Informatik 6 I would like to thank all my former and current colleagues, including Sasa Hasan, Oliver Bender, Thomas Deselaers, Philippe Dreuw, Saab Mansour, David Vilar-Torres, Arne Mauser, Evgeny Matusov, and many more. My thanks also go to our system administration team and our secretariat for their ever-available help and their excellent support.

I am very thankful for the friendly atmosphere and the support I received at the NTT Communication Laboratories, Kyoto, Japan, during my stay in 2009. Thanks go to all members of the laboratories, in particular to Erik McDermott, Takaaki Hori, and Shinji Watanabe.

Finally, I would like to thank my parents and all my family members for their understanding and encouragement during the long years of my doctoral studies and the writing of this thesis.
This work was partly funded by the European Commission under the integrated project TC-STAR (FP6-506738), was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation, and is partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR00106C0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of DARPA.
Contents

1 Introduction
  1.1 Statistical Speech Recognition
  1.2 Signal Analysis/Feature Extraction
  1.3 Acoustic Model
  1.4 Language Model
  1.5 Search
  1.6 Multi-Pass Search
    1.6.1 Lattices
    1.6.2 Speaker Adaptation
  1.7 Weighted Finite State Transducers
    1.7.1 Notation
    1.7.2 Algorithms
    1.7.3 WFSTs in ASR
  1.8 Bayes Risk Decoding: State of the Art
  1.9 Model and System Combination: State of the Art
    1.9.1 Log-linear Model Combination
    1.9.2 System Combination
    1.9.3 Cross-Adaptation

2 Scientific Goals

3 Lattice-Based System Combination in the Bayes Risk Decoding Framework
  3.1 WFSTs as a High-Level Programming Language for Lattice-Based System Combination
  3.2 Probabilities over Lattices
    3.2.1 Probabilities over a single Lattice
    3.2.2 Probabilities over the Lattice Intersection
    3.2.3 Probabilities over the Lattice Union
  3.3 Lattice-Based System Combination in the Bayes Risk Decoding Framework
    3.3.1 The MAP/Viterbi Decoding Framework
    3.3.2 MAP/Viterbi Decoding Results
    3.3.3 The Bayes Risk Decoding Framework with Local Cost Functions
  3.4 Confusion Network based System Combination in the Bayes Risk Decoding Framework
    3.4.1 Confusion Network Combination (CNC)
    3.4.2 ROVER: An Approximation of CNC
    3.4.3 Results
  3.5 The Lattice Combination Framework vs. State-of-the-Art in System Combination
  3.6 Lattice Pre-Processing for Bayes Risk Decoding and System Combination
    3.6.1 Lattice Normalization
    3.6.2 Lattice Pruning
    3.6.3 The non-Word Cloud Bias
    3.6.4 Results
  3.7 Parameter Optimization for Bayes Risk Decoding and System Combination
    3.7.1 Parameter Optimization based on the Downhill-Simplex Algorithm
    3.7.2 Parameter Optimization based on Minimum Risk Training
  3.8 Summary

4 Local Cost Functions for Bayes Risk Decoding
  4.1 Local Costs and the Deletion Bias
  4.2 Frame Error
    4.2.1 Partially Normalized Frame Error
    4.2.2 Symmetrically Normalized Frame Error
    4.2.3 Results
  4.3 Local Alignment based Error
    4.3.1 Povey's Approximation in MPE/MWE Training
    4.3.2 The 1/2 Overlap Approximation
    4.3.3 Results
  4.4 Confusion Network Distance based Error
    4.4.1 Distances between Arcs and Arc Clusters
    4.4.2 The Arc-Cluster CN Construction Algorithm
    4.4.3 The State-Cluster CN Construction Algorithm
    4.4.4 The Center-Frame CN Construction Algorithm
    4.4.5 Results
  4.5 Summary

5 Confusion Networks: Applications and Investigations
  5.1 Frame Level Confusion Networks
    5.1.1 Minimum and Inverse-Entropy Combination
    5.1.2 Time Alignment with Frame Level CNs
    5.1.3 Results
  5.2 Word Level Confusion Networks
    5.2.1 Confidence Warping
    5.2.2 The windowed Levenshtein Distance
    5.2.3 Results
  5.3 Summary

6 Classifier based System Combination
  6.1 Combination with Classification
    6.1.1 Features
    6.1.2 Classifiers and Training
    6.1.3 The iROVER Approach
    6.1.4 The iCNC Approach
    6.1.5 The iCN Approach
  6.2 Experiments
    6.2.1 Experimental Setup
    6.2.2 Results
    6.2.3 Analysis
  6.3 Summary

7 Log-Linear Model Combination vs. System Combination
  7.1 Log-Linear Model Combination with Word-Dependent Scaling Factors
  7.2 Experiments
    7.2.1 Experimental Setup
    7.2.2 Results
  7.3 Summary

8 Scientific Contributions

9 Outlook

A The Deletion Bias in LVCSR Decoding

B Corpora and Systems
  B.1 Chinese GALE Systems
    B.1.1 The Chinese 230h Testing System
    B.1.2 The RWTH Aachen Chinese GALE 2008 Evaluation System
  B.2 English TC-Star/EPPS Systems
    B.2.1 The RWTH Aachen English EPPS 2007 Evaluation System
    B.2.2 The English EPPS 2007 Evaluation Cross-site Combination

C Experimental Results
  C.1 The Chinese 230h Testing System
  C.2 The RWTH Aachen Chinese GALE 2008 Evaluation System
  C.3 The RWTH Aachen English EPPS 2007 Evaluation System
  C.4 The English EPPS 2007 Evaluation Cross-site Combination

D Symbols and Acronyms
  D.1 Mathematical Symbols
  D.2 Acronyms

List of Figures

List of Tables

Bibliography
Chapter 1 Introduction

Speech is the most common and most natural way for humans to communicate, even in times of email, chat, and blogs. This makes an automatic speech recognition (ASR) system the natural choice for a human-machine interface. In recent years a huge amount of audio and video data has become available on the World Wide Web. Most of these podcasts, news broadcasts, and home-made videos use speech as the natural form of communication. ASR is the first step in making the information contained in the speech data available to machine processing.

The speech recognition problem is defined as the task of converting an acoustic signal which contains speech (the speech signal) to written text (the recognized word sequence). The automatic speech recognizer serves as a human-machine interface or provides the input for further machine processing like machine translation. Depending on the specific task, ASR systems have to fulfill certain requirements, e.g. an ASR system which serves as a human-machine interface has to work in real time. The ASR systems considered in this thesis are large vocabulary continuous speech recognition (LVCSR) systems. The vocabulary contains 50,000 or more words, recognition is performed on complete utterances (as opposed to single-word recognition), and real-time operation is not required.

Modern LVCSR systems use a statistical approach to find the sequence of words with the highest probability given the acoustic features. The signal analysis, which converts the speech signal into a sequence of features, is carried out in a pre-processing step and is separate from the statistical approach.

The standard evaluation measure for LVCSR systems is the word error rate (WER). Bayes risk approaches in LVCSR aim at finding the word sequence which, given the speech signal, produces the least expected WER. The exact computation of the Bayes risk hypothesis in a modern LVCSR system is prohibitively expensive, so approximations are required. Usually Bayes risk decoding is applied in a post-processing step which follows a first decoding run that produces a set of alternative word sequences. In this thesis a variety of approximations for computing the minimum expected WER hypothesis are developed and analyzed.

The WER for an utterance can be greatly reduced by combining several ASR systems. In this thesis a general framework for system combination is developed by applying the approximate minimum expected WER decoder to multiple systems.
1.1 Statistical Speech Recognition

The statistical approach to ASR takes a sequence of acoustic features $x_1^T$ as input and aims at finding the sequence of words $w_1^N$ which maximizes the posterior probability. The statistical approach applies Bayes' decision rule [Bayes 1763]:

\[
x_1^T \rightarrow \hat{W} := \operatorname*{argmax}_{w_1^N, N} \; p(w_1^N \mid x_1^T) = \operatorname*{argmax}_{w_1^N, N} \; p(x_1^T \mid w_1^N) \, p(w_1^N) \tag{1.1}
\]
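As a toy illustration of the decision rule in Equation (1.1), the following sketch picks the best word sequence from a small, explicitly enumerated hypothesis set by adding acoustic and language model scores in the log domain. The function name `map_decode` and all scores are assumptions made up for this example; a real recognizer searches a vastly larger space and does not enumerate hypotheses this way.

```python
import math

def map_decode(hypotheses):
    """Return the word sequence maximizing log p(x|w) + log p(w).

    `hypotheses` maps a word sequence (tuple of words) to a pair
    (acoustic_log_likelihood, language_model_log_prob). Both scores are
    made-up illustration values, not outputs of a real recognizer.
    """
    return max(hypotheses, key=lambda w: sum(hypotheses[w]))

# Hypothetical scores for three competing word sequences.
toy_scores = {
    ("recognize", "speech"): (-12.0, math.log(0.020)),
    ("wreck", "a", "nice", "beach"): (-11.5, math.log(0.001)),
    ("recognize", "peach"): (-13.0, math.log(0.005)),
}

best = map_decode(toy_scores)
```

Here the second hypothesis has the best acoustic score, but the language model prior tips the decision towards the first one.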
The result of Equation (1.1) is referred to as the maximum a-posteriori (MAP) hypothesis. Equation (1.1) defines two stochastic models, the acoustic model $p(x_1^T \mid w_1^N)$ and the language model $p(w_1^N)$. The acoustic model computes the likelihood of observing the feature sequence $x_1^T$ given the word sequence $w_1^N$. The language model denotes the a-priori probability of the word sequence $w_1^N$.

A word $w_n$ in the word sequence $w_1^N$ is either taken from the finite alphabet $\Sigma$ (aka vocabulary) or equals the empty word $\epsilon$, that is $w_1^N \in (\Sigma \cup \{\epsilon\})^N$. The convention of allowing the empty word at any position in the word sequence will be frequently used later when dealing with confusion networks. In the computation of the equality of two word sequences the empty word is not considered, e.g. it holds "a $\epsilon$ b" = "a b" = "a b $\epsilon$".

Figure 1.1. Basic architecture of a statistical automatic speech recognition system according to [Ney 1990]: the speech input passes through feature extraction, which yields the feature vectors x_1 ... x_T; the global search process maximizes p(w_1 ... w_N) p(x_1 ... x_T | w_1 ... w_N) over w_1 ... w_N, using the acoustic model (subword units, pronunciation lexicon) and the language model, and outputs the recognized word sequence.

The extraction of the feature sequence $x_1^T$ from the continuous speech signal happens in a pre-processing step, the signal analysis. The signal analysis itself is based on models of the human auditory system. The resulting features are further processed by data-driven approaches, which ultimately yield the feature sequence $x_1^T$. Figure 1.1 summarizes the interaction between feature extraction, acoustic model, and language model during the search.

The search algorithm aims at finding the word sequence that fulfills Equation (1.1). The search space for an LVCSR system consists of all possible word sequences over the (finite) vocabulary. Its huge size makes the complete exploration of the search space prohibitive, and pruning techniques are used to restrict the effective number of hypotheses. The subset of the search space considered during the search process can be stored and used for applying sophisticated methods which are too complex to be applied to the full search space.

The main topic of this thesis is the application of Bayes risk decoding¹ and system combination as a post-processing step for LVCSR systems. The conventional decoding rule in Equation (1.1) aims at minimizing the number of incorrectly recognized word sequences or sentences. But the standard evaluation measure in LVCSR is the WER, which is based on the number of incorrectly recognized words. More precisely, the WER is the normalized Levenshtein or edit distance between the correct and the hypothesized sentence calculated on word level [Levenshtein 1966]. Considering the Levenshtein distance in the Bayes risk framework results in the decision rule

\[
x_1^T \rightarrow \hat{W} := \operatorname*{argmin}_{w_1^N, N} \sum_{v_1^M, M} p(v_1^M \mid x_1^T) \, \mathrm{Lev}(w_1^N, v_1^M),
\]

where $\mathrm{Lev}(w_1^N, v_1^M)$ denotes the Levenshtein distance between the two word sequences $w_1^N$ and $v_1^M$ [Bishop 2006]. Evaluating this decision rule for an LVCSR task is computationally infeasible even for the reduced search space and requires further approximations. This thesis investigates a variety of approximations for Bayes risk decoding with the Levenshtein distance as loss function.

¹ In the speech recognition literature the term "Minimum Bayes risk decoding" is frequently used. However, this terminology is misleading, as by definition the Bayes risk hypothesis is already the sequence producing the least number of expected errors, i.e. it is already the minimum.

A successful way to decrease the WER for an utterance is to combine several models or systems. In the model combination approach all knowledge sources are combined into a single log-linear model from which the posterior probability $p(w_1^N \mid x_1^T)$ is computed. The knowledge sources combined in the log-linear model usually consist of the language model and several acoustic models. In the cross-adaptation approach two or more independently trained systems are combined, where the interaction between the systems takes place in the speaker adaptation step. The third and most common approach is to introduce the system as a hidden variable and to compute the marginal over the resulting weighted, system-dependent posteriors

\[
p(w_1^N \mid x_1^T) = \sum_{j=1}^{J} p(w_1^N, j \mid x_1^T) = \sum_{j=1}^{J} p(j \mid x_1^T) \, p(w_1^N \mid j, x_1^T),
\]

for $J$ LVCSR systems. This type of combination is usually applied within the Bayes risk decoding framework. In this thesis all three approaches are considered, but the focus is on system combination within the Bayes risk framework.
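The Bayes risk decision rule with the Levenshtein loss and the marginalization over systems can be made concrete on a tiny N-best style example. This is a minimal sketch under stated assumptions: the hypothesis sets, posteriors, and system weights are invented, the function names are hypothetical, and both the candidate set and the sum over competing sentences are restricted to the listed hypotheses. It is not the lattice-based method developed in this thesis.

```python
def levenshtein(ref, hyp):
    """Word-level edit distance between two sequences of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def combine_posteriors(system_posteriors, system_weights):
    """Marginalize over systems: p(w|x) = sum_j p(j|x) p(w|j,x)."""
    combined = {}
    for weight, posteriors in zip(system_weights, system_posteriors):
        for words, prob in posteriors.items():
            combined[words] = combined.get(words, 0.0) + weight * prob
    return combined

def bayes_risk_decode(posteriors):
    """Pick the candidate with the least expected Levenshtein distance."""
    def expected_loss(w):
        return sum(p * levenshtein(w, v) for v, p in posteriors.items())
    return min(posteriors, key=expected_loss)

# Invented posteriors of two systems over three sentence hypotheses.
A = ("the", "cat", "sat")
B = ("a", "cat", "sits")
C = ("a", "cat", "sit")
system_1 = {A: 0.8, B: 0.2}
system_2 = {B: 0.5, C: 0.5}

combined = combine_posteriors([system_1, system_2], [0.5, 0.5])
map_hyp = max(combined, key=combined.get)   # highest combined posterior
bayes_hyp = bayes_risk_decode(combined)     # least expected word errors
```

In this example the two rules disagree: `A` has the highest combined posterior, but `B` is closer on average to the competing hypotheses and therefore has the lower expected number of word errors.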
1.2 Signal Analysis/Feature Extraction

The signal analysis and feature extraction module of the ASR system provides the statistical model with a sequence of observations or acoustic vectors. The goal is to keep only the information from the speech signal that is relevant for finding the correct word sequence. Discarding all the irrelevant information makes the acoustic model robust, e.g. to the intensity of the speech, to background noise, and to speaker gender and identity. The feature extraction of today's state-of-the-art LVCSR systems happens in three steps:

1. A first set of features is extracted from the speech signal based on models of the human auditory system.
2. The features are transformed, augmented, and/or reduced by parametric models, where the model parameters are estimated on the acoustic training data.
3. Speaker normalization steps are applied either to the features directly or to the acoustic model parameters in order to achieve speaker independence; usually the free parameters are estimated based on the result of a previous, unadapted recognition run.

The most common signal analysis applied in the first step is based on a short-term spectral analysis, usually a Fast Fourier Transform (FFT) [Rabiner & Schafer 1979]. Widely used procedures for further processing the FFT result yield the Mel Frequency Cepstral Coefficients (MFCCs) [Davis & Mermelstein 1980] or the Perceptual Linear Predictive (PLP) features [Hermansky 1990]. Another feature type now commonly used by RWTH Aachen is the Gammatone filter based features (GT), which work in the time domain [Aertsen & Johannesma+ 1980; Schlüter & Bezrukov+ 2007]. The recognition performance can be significantly improved by concatenating articulatory motivated acoustic features to the short-term FFT-based features [Kocharov & Zolnay+ 2005; Zolnay & Schlüter+ 2005].

An alternative approach which has become popular in recent years is the usage of phone posterior probability estimates as acoustic features. In this approach features from the first step are fed into a classifier, usually a neural network, which outputs the posterior estimates [Chen & Zhu+ 2004; Hermansky & Ellis+ 2000; Valente & Vepa+ 2007]. The parameters of the classifier are estimated on the training data.

The features described above were designed to cope with European languages and do not consider tone information, that is, the contour of the pitch for a syllable. For tonal languages like Chinese, state-of-the-art speech recognition systems integrate an additional tone feature [Chang & Zhou+ 2000; Chen & Gopinath+ 1997; Chen & Li+ 2001; Lei & Siu+ 2006].
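The short-term spectral analysis named as the usual first step can be sketched as follows: the signal is cut into overlapping frames, each frame is windowed, and its power spectrum is computed. The frame length of 25 ms, the frame shift of 10 ms, the Hamming window, and the test signal are assumptions chosen for illustration, and a plain discrete Fourier transform stands in for the FFT to keep the sketch dependency-free. The further processing towards MFCC or PLP features is not shown.

```python
import cmath
import math

def frame_signal(samples, frame_len, frame_shift):
    """Cut the signal into overlapping frames of `frame_len` samples."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, frame_shift)]

def power_spectrum(frame):
    """Hamming-window one frame and return its power spectrum.

    A plain O(n^2) DFT is used to stay dependency-free; a real front end
    would use an FFT routine instead.
    """
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

# Hypothetical input: 100 ms of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]

# 200 samples = 25 ms frames, 80 samples = 10 ms shift (assumed values).
frames = frame_signal(signal, frame_len=200, frame_shift=80)
spectra = [power_spectrum(f) for f in frames]

# The strongest spectral bin of the first frame, converted to Hertz.
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
peak_hz = peak_bin * sample_rate / 200
```

Each entry of `spectra` is one short-term spectrum; for the test tone the strongest bin of the first frame corresponds to the 1 kHz input frequency.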
Dynamic information can be included by augmenting the feature vector with the first and second derivatives. A more general approach is to apply the Linear Discriminant Analysis (LDA) [Fisher 1936] or the heteroscedastic LDA (HLDA) [Kumar & Andreou 1998] to a window of usually 9 or 11 of the original feature vectors. The result is a linear transformation which projects the original features into a lower dimensional feature space such that the class separability is maximized, assuming that the data given a class follows a normal distribution. The (H)LDA is also successfully used to combine acoustic features from several feature extraction procedures, e.g. several short-term FFT features [Schlüter & Zolnay+ 2006] or short-term FFT and tone features for Chinese systems [Ng & Zhang+ 2008; Plahl & Hoffmeister+ 2008a].

The third step puts the focus on gender and speaker independence of the acoustic features, which is hard to achieve and is usually not provided by the feature extraction procedures mentioned above. For example, the MFCC and PLP features are also used to detect the gender of the speaker [Stolcke & Bratt+ 2000] or even for speaker identification [Doddington & Przybocki+ 2000]. Several methods have been developed to reduce the speaker dependency of the acoustic features. Two widespread approaches are the vocal tract length normalization (VTLN) and the maximum likelihood linear regression (MLLR) transformation [Gales & Woodland 1996; Lee & Rose 1996; Leggetter & Woodland 1995]. The MLLR approach consists of a speaker-dependent linear transformation of the model parameters and is discussed in more detail in Section 1.6. A comprehensive comparison of speaker normalization and adaptation methods is given in [Pitz 2005].
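The two ways of adding dynamic information mentioned in this section, derivative augmentation and stacking a window of neighbouring frames as input to an (H)LDA projection, can be sketched as follows. The central-difference derivative, the edge handling by repeating the first and last frame, and the function names are assumptions of this sketch; the projection itself is not shown.

```python
def deltas(features):
    """Central-difference derivative of a feature sequence over time.

    `features` is a list of equally sized vectors (one per frame); the
    first and last frame are repeated at the edges.
    """
    padded = [features[0]] + features + [features[-1]]
    return [[(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(features))]

def add_dynamic_features(features):
    """Append first and second derivatives to every feature vector."""
    first = deltas(features)
    second = deltas(first)
    return [x + d1 + d2 for x, d1, d2 in zip(features, first, second)]

def stack_window(features, context):
    """Concatenate each frame with `context` neighbours on both sides."""
    padded = [features[0]] * context + features + [features[-1]] * context
    return [sum(padded[t:t + 2 * context + 1], [])
            for t in range(len(features))]

# Four frames of invented two-dimensional static features.
static = [[0.0, 1.0], [1.0, 1.0], [4.0, 1.0], [9.0, 1.0]]
augmented = add_dynamic_features(static)   # 2 static + 2 delta + 2 delta-delta
stacked = stack_window(static, context=4)  # window of 9 frames per position
```

With a context of 4 frames on each side, `stack_window` produces the window of 9 feature vectors mentioned above; a linear projection estimated by (H)LDA would then map each stacked vector to a lower dimensional space.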
1.3 Acoustic Model

The stochastic model which computes the likelihood of the acoustic feature sequence $x_1^T$ given a word sequence $w_1^N$ is called the acoustic model. For LVCSR systems, subword models like syllables, phonemes, or allophones are usually used instead of whole-word models. The pronunciation model $p(\psi_1^L \mid w_1^N)$ assigns a sequence of subword units $\psi_1^L$ to a sequence of words $w_1^N$. Most modern LVCSR systems use a finite pronunciation dictionary to store the (weighted) mapping from words to sequences of subword units. Assuming that the pronunciation of a word is independent of the adjacent words yields Equation (1.2), where $l_n$ denotes the position of the last subword unit of word $w_n$.

$$
p(x_1^T \mid w_1^N) = \sum_{\psi_1^L} p(x_1^T \mid \psi_1^L)\, p(\psi_1^L \mid w_1^N)
= \sum_{\psi_1^L} p(x_1^T \mid \psi_1^L) \prod_{n=1}^{N} p(\psi_{l_{n-1}+1}^{l_n} \mid w_n)
\tag{1.2}
$$
The advantage of subword units is that they reduce the model complexity, which allows a reliable parameter estimation. Another advantage is that the search vocabulary need not be equal to or a subset of the training vocabulary. The acoustic model for a new word with known pronunciation is assembled from the corresponding sequence of subword units. Even if a word is not in the pronunciation dictionary, i.e. a new word with unknown pronunciation, there exist algorithms which compute a matching sequence of subword units with high accuracy [Bisani & Ney 2003].

The common approach for modern LVCSR systems is to use a two-stage mapping. First, the pronunciation dictionary provides the weighted mapping from the word to a phoneme sequence. Second, each phoneme is mapped uniquely to a triphone, where a triphone is a phoneme together with its predecessor and successor; some systems use a larger context, so-called quinphones, septaphones, etc. The motivation for context-dependent phonemes is the observation that the articulation of a phoneme highly depends on the adjacent phonemes. In general, the acoustic realization of a phoneme is called an allophone, and the triphone is the most common way to model allophones in LVCSR systems. If the context is considered across word boundaries, the resulting acoustic model is called an across-word model [Sixtus 2003].

Natural speech shows a great variability in speaking rate. The quasi-standard approach to cope with the varying acoustic realization of subword units at different speaking rates is the Hidden Markov Model (HMM) [Baker 1975; Rabiner & Juang 1986]. An HMM is a stochastic finite state automaton whose states represent (hidden) random variables which cannot be observed directly. The output of an HMM is generated according to probability distributions which depend on the values $s_1^T$ of the hidden variables. The HMM is a generative model, and an HMM representing an acoustic model generates feature sequences $x_1^T$.
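The second stage of the two-stage mapping, from a phoneme sequence to triphones, can be sketched as follows. This is a minimal illustration rather than the actual RWTH implementation: the function name `to_triphones` is hypothetical, the word-boundary symbol `#` is borrowed from the triphone labels of Figure 1.2, and context across word boundaries is ignored.

```python
def to_triphones(phonemes, boundary="#"):
    """Map a phoneme sequence to (predecessor, phoneme, successor) triples.

    Assumption: the context at the word edges is the boundary symbol,
    i.e. no across-word context is modelled in this sketch.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    # One triphone per phoneme: the phoneme with its left and right neighbour.
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


# Phoneme sequence of the word "seven" as used in Figure 1.2.
seven = to_triphones(["s", "eh", "v", "un"])
```

Each phoneme yields exactly one triphone, so the mapping is unique once the phoneme sequence is fixed; the weighting of pronunciation variants stays in the dictionary.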
The acoustic probability for observing $x_1^T$ given the word sequence $w_1^N$ is the marginal over all possible state sequences:

$$
p(x_1^T \mid w_1^N) = \sum_{s_1^T : w_1^N} p(x_1^T, s_1^T \mid w_1^N)
= \sum_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t \mid x_1^{t-1}, s_1^{t}; w_1^N)\, p(s_t \mid s_1^{t-1}; w_1^N)
\tag{1.3}
$$
The equation is simplified by applying the first-order Markov assumption [Duda & Hart+ 2001]. The assumption states that the probabilities at time $t$ do not depend on previous observations, but only on the current and the immediately preceding state. Furthermore, it is assumed that the probability of an observation depends only on the current state. Under these assumptions Equation (1.3) simplifies to:

$$
p(x_1^T \mid w_1^N) = \sum_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t \mid s_t; w_1^N)\, p(s_t \mid s_{t-1}; w_1^N)
\tag{1.4}
$$
In the so-called Viterbi or maximum approximation the sum in Equation (1.4) is replaced by the maximum:

$$
p(x_1^T \mid w_1^N) = \max_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t \mid s_t; w_1^N)\, p(s_t \mid s_{t-1}; w_1^N)
\tag{1.5}
$$
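The difference between the full sum of Equation (1.4) and the maximum approximation of Equation (1.5) can be made concrete with a small sketch. It assumes a toy left-to-right HMM that must start in its first state and end in its last state; the function name and the data layout (nested lists of probabilities) are illustrative choices, not part of the original system.

```python
def forward_and_viterbi(emissions, transitions):
    """Return (sum, max) over all state sequences from first to last state.

    emissions[t][s]   : p(x_t | s)
    transitions[r][s] : p(s | r)
    Assumption: every path starts in state 0 and ends in the last state.
    """
    num_states = len(transitions)
    # Time frame 1: only the first state is allowed.
    fwd = [emissions[0][s] if s == 0 else 0.0 for s in range(num_states)]
    vit = list(fwd)
    for t in range(1, len(emissions)):
        # Equation (1.4): sum over all predecessor states.
        fwd = [emissions[t][s] * sum(fwd[r] * transitions[r][s] for r in range(num_states))
               for s in range(num_states)]
        # Equation (1.5): keep only the best predecessor state.
        vit = [emissions[t][s] * max(vit[r] * transitions[r][s] for r in range(num_states))
               for s in range(num_states)]
    return fwd[-1], vit[-1]


# Two states, three time frames; state 0 may loop or move forward.
emissions = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
transitions = [[0.6, 0.4], [0.0, 1.0]]
total, best = forward_and_viterbi(emissions, transitions)
```

In this toy case only two state sequences are possible, and the maximum approximation keeps the more probable one, so the Viterbi score is a lower bound on the full sum.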
According to Equation (1.4) two probability distributions have to be considered: the emission probability $p(x_t \mid s_t; w_1^N)$ and the transition probability $p(s_t \mid s_{t-1}; w_1^N)$. The emission probability denotes the probability of observing acoustic feature vector $x_t$ while being in state $s_t$. The transition probability is the probability of moving from state $s_{t-1}$ to state $s_t$.

A triphone is usually modeled by a linear HMM with three to six states. The possible transitions are the loop transition going from the state back to itself, the forward transition connecting to the next state, and the skip transition, which skips the next state and goes to the state after it. Six-state models like the topology introduced by Bakis [Bakis 1976] use the skip transition, whereas some three-state topologies forbid the skip. In the Bakis topology each two successive states are identical, which makes it almost equivalent to a three-state topology without skip. Both models are inadequate for fast speech, because they absorb at least 30 ms of speech given the standard frame shift for ASR systems of around 10 ms [Molau 2003]. In this case the common choice is a three-state model with skip. The HMM for a sequence of words is assembled by concatenating the HMMs of the corresponding triphone sequence.

Equation (1.4) and Equation (1.5) are also referred to as the time alignment problem. The result computed for a particular word sequence $w_1^N$ is called the forced acoustic alignment of $w_1^N$. An efficient algorithm for solving the time alignment problem based on dynamic programming [Bellman 1957; Ney 1984; Viterbi 1967] is the forward-backward algorithm for HMMs [Baum 1972; Rabiner & Juang 1986]. Figure 1.2 shows an example of a time alignment in speech recognition. For a part of the word “seven” the ultimate HMM is constructed using the Bakis topology and is aligned against a sequence of acoustic feature vectors.
In the time alignment the HMM is unrolled along the time axis and the resulting graph is referred to as a trellis. The trellis visualizes the complete search space for the time alignment. In the Viterbi approximation, cf. Equation (1.5), the solution is the path from the lower left to the upper right corner with the highest probability.

The emission probabilities $p(x_t \mid s_t; w_1^N)$ of the HMM are usually modeled by Gaussian mixture models (GMMs). Alternative approaches are discrete probabilities [Jelinek 1976], semi-continuous probabilities [Huang & Jack 1989] or other continuous probability distributions like mixtures of Laplacians [Haeb-Umbach & Aubert+ 1998; Levinson & Rabiner+ 1983]. The RWTH Aachen system uses the GMMs defined in Equation (1.6).

$$
p(x \mid s; w_1^N) = \sum_{l=1}^{L_s} c_{sl}\, \mathcal{N}(x \mid \mu_{sl}, \Sigma_{sl}; w_1^N)
\tag{1.6}
$$
The emission probability for state $s$ is described by a GMM of $L_s$ Gaussian densities $\mathcal{N}(x \mid \mu_{sl}, \Sigma_{sl}; w_1^N)$ with mean vector $\mu_{sl}$, covariance matrix $\Sigma_{sl}$, and non-negative mixture weights $c_{sl}$, where the mixture weights are subject to the constraint $\sum_{l=1}^{L_s} c_{sl} = 1$. The LVCSR systems at RWTH Aachen use only a single, globally pooled, diagonal covariance matrix $\Sigma$. The choice is made to avoid data sparseness problems in the acoustic model training. Using a diagonal covariance matrix requires that the components of the acoustic features are decorrelated, which the RWTH Aachen LVCSR systems achieve in the feature extraction step by applying a discrete cosine transformation.

The free parameters of the acoustic model, $\mu_{sl}$, $c_{sl}$, and $\Sigma$, are estimated by applying Maximum Likelihood (ML) estimation in combination with the Expectation Maximization (EM) algorithm [Dempster & Laird+ 1977]. In state-of-the-art LVCSR systems the ML/EM training is followed by a discriminative refinement of the acoustic model parameters [Bahl & Padmanabhan+ 1996; Schlüter 2000; Woodland & Povey 2002]. In the discriminative training step the objective is to maximize the a-posteriori probability of the correct sentence [Bahl & Brown+ 1986; Normandin & Lacouture+ 1994] or to minimize the word or phoneme error rate on the training data [Juang & Katagiri 1992; Kaiser & Horvat+ 2000; McDermott & Katagiri 2005; Povey & Woodland 2002].

In the RWTH Aachen system the transition probabilities are replaced by so-called time distortion penalties (TDPs). The TDPs depend only on the transition type, but not on the state itself. A special case is the HMM for the silence model, which consists of only a single state and has separate TDPs.

Figure 1.2. 6-state hidden Markov model in Bakis topology for the triphone s eh v in the word “seven” and the resulting trellis for a time alignment. The HMM segments are denoted by <1>, <2>, and <3>. (Axes: acoustic vectors X, HMM states S. Word: seven; phoneme sequence: s eh v un; triphones: # s eh, s eh v, eh v un.)
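As an illustration of Equation (1.6) with a single pooled diagonal covariance matrix, the following sketch evaluates the log emission probability of one feature vector. The function name and argument layout are hypothetical, and all mixture weights are assumed to be strictly positive so that their logarithm exists.

```python
import math


def gmm_log_likelihood(x, weights, means, pooled_var):
    """log p(x | s) for a mixture sharing one diagonal covariance matrix.

    x          : feature vector (list of floats)
    weights    : mixture weights c_sl, assumed positive and summing to 1
    means      : one mean vector per density
    pooled_var : diagonal of the pooled covariance matrix
    """
    # The normalisation term is identical for all densities
    # because the covariance matrix is shared.
    log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in pooled_var)
    log_terms = []
    for c, mu in zip(weights, means):
        dist = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, pooled_var))
        log_terms.append(math.log(c) + log_norm - 0.5 * dist)
    # log-sum-exp keeps the summation over densities numerically stable.
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))
```

With a pooled covariance only the weighted distance to the mean differs between densities, which is one reason why this model is cheap to evaluate.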
1.4 Language Model

The language model provides the a-priori probability $p(w_1^N)$ for a word sequence $w_1^N$. Ideally, it covers the syntax, the semantics, and the pragmatics of the language and the situation. In practice, a rather simple model is the standard for LVCSR systems. The m-gram model makes the assumption that the probability of the current word $w_n$ depends only on the previous $m-1$ words $w_{n-m+1}^{n-1}$ [Bahl & Jelinek+ 1983]. Equation (1.7) shows the factorization of the a-priori probability under the assumption of an $(m-1)$th-order Markov process.

$$
\begin{aligned}
p(w_1^N) &= p(w_N \mid w_1^{N-1}) \cdot p(w_{N-1} \mid w_1^{N-2}) \cdots p(w_1) \\
&= p(w_N \mid w_{N-m+1}^{N-1}) \cdot p(w_{N-1} \mid w_{N-m}^{N-2}) \cdots p(w_1)
\end{aligned}
\tag{1.7}
$$

A consecutive sequence of $m$ words is called an m-gram, and in the general case the history $h_n$ of word $w_n$ is a function of $w_{n-m+1}^{n-1}$. For the standard m-gram model $h_n$ is the identity; examples of alternative history functions are the class language model or the trigger models [Martin 2000].

The estimates for $p(w_n \mid w_{n-m+1}^{n-1})$ are usually based on the relative frequencies computed on a large training set of transcripts of speech and written text. The relative frequency is the optimal solution if the m-gram language model is optimized w.r.t. the perplexity (PP) of the training data.

$$
\log PP(w_1^N) = \log \left[ \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}^{n-1}) \right]^{-1/N}
= -\frac{1}{N} \sum_{n=1}^{N} \log p(w_n \mid w_{n-m+1}^{n-1})
\tag{1.8}
$$
The log-perplexity defined in Equation (1.8) is a common evaluation measure for m-gram language models. It equals the entropy of the model, and the perplexity itself can be interpreted as the average number of different words which follow any given history $h_n$. However, the number of possible m-grams grows exponentially in $m$, and for LVCSR tasks many m-grams are not seen in the training data or have only very few observations. Applied to the test data, any word sequence containing a single unseen m-gram has a probability of zero and an infinite log-PP. Therefore, the relative frequencies have to be smoothed. Common smoothing techniques are based on discounting followed by backing-off or interpolation [Generet & Ney+ 1995; Katz 1987; Ney & Essen+ 1994; Ney & Martin+ 1997]. In the discounting step probability mass is removed from the relative frequencies. The backing-off or interpolation step distributes the discounted probability mass over all unseen m-grams (backing-off) or over all m-grams (interpolation). A popular method to estimate the parameters of a smoothed language model is leaving-one-out, a cross-validation approach [Ney & Essen+ 1994].
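The need for smoothing can be illustrated with a small sketch of an unsmoothed bigram model (m = 2) and the log-perplexity of Equation (1.8). The function names and the sentence-start symbol `<s>` are assumptions made for this example; no discounting or backing-off is applied.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Relative-frequency estimates p(w | v) from tokenised sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words  # assumed sentence-start symbol
        for v, w in zip(padded, padded[1:]):
            history_counts[v] += 1
            bigram_counts[(v, w)] += 1
    return {vw: c / history_counts[vw[0]] for vw, c in bigram_counts.items()}


def log_perplexity(model, words):
    """-(1/N) * sum of log p(w_n | w_{n-1}); infinite for an unseen bigram."""
    padded = ["<s>"] + words
    total = 0.0
    for v, w in zip(padded, padded[1:]):
        p = model.get((v, w), 0.0)
        if p == 0.0:
            return math.inf  # a single unseen bigram gives probability zero
        total += math.log(p)
    return -total / len(words)


model = train_bigram([["a", "b"], ["a", "c"]])
seen = log_perplexity(model, ["a", "b"])
unseen = log_perplexity(model, ["b", "a"])
```

The second test sentence contains a bigram that never occurred in training, so its log-perplexity is infinite; a smoothed model would assign it a small but non-zero probability instead.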
1.5 Search

The search problem in ASR consists of finding an efficient algorithm and appropriate approximations for solving Equation (1.1), i.e. for finding the word sequence $\hat{W}$ which maximizes the a-posteriori probability $p(\hat{W} \mid x_1^T)$ for a given feature sequence $x_1^T$. As shown in Figure 1.1 the search combines the different knowledge sources: the acoustic model (including the pronunciation model) and the language model. If the acoustic model is an HMM as described in Equation (1.4) and the language model is an m-gram model following Equation (1.7), then Equation (1.9) describes the resulting optimization problem.

$$
\begin{aligned}
x_1^T \rightarrow \hat{W} &= \operatorname*{argmax}_{w_1^N, N} \left\{ \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}^{n-1}) \sum_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t \mid s_t; w_1^N)\, p(s_t \mid s_{t-1}; w_1^N) \right\} \\
&\overset{\text{Viterbi}}{=} \operatorname*{argmax}_{w_1^N, N} \left\{ \prod_{n=1}^{N} p(w_n \mid w_{n-m+1}^{n-1}) \max_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t \mid s_t; w_1^N)\, p(s_t \mid s_{t-1}; w_1^N) \right\}
\end{aligned}
\tag{1.9}
$$
The optimization problem can be efficiently solved by using dynamic programming [Bellman 1957]. The Markov assumptions and the Viterbi approximation yield a mathematical structure which divides the global optimization problem in Equation (1.9) into subproblems with local dependencies and allows the application of dynamic programming. In general, the search can be organized in two ways: depth-first or breadth-first. Prominent instances of the depth-first search (aka stack decoding algorithms) are the Dijkstra [Dijkstra 1959] and the A∗ algorithm [Jelinek 1969; Paul 1991]. The hypothesis space is explored in a time-asynchronous manner according to the stack organization. In the A∗ algorithm the stack is sorted by a heuristic estimate of the cost to complete a hypothesis. In contrast, in the breadth-first search all hypotheses are expanded in a time-synchronous manner [Baker 1975; Ney 1984; Sakoe 1979; Vintsyuk 1971].

However, for LVCSR tasks the resulting search space is still huge and a full exploration is prohibitive. Modern recognizers use pruning techniques to visit only the promising parts of the search space while avoiding search errors. A search error occurs if, due to pruning, the output of the recognizer differs from the solution of Equation (1.9). In an A∗ decoder pruning is applied by removing the least promising partial paths from the stack. The quality of the pruning depends on the quality of the heuristic cost estimate. In contrast, the standard pruning for breadth-first search decoders does not require an explicit heuristic. In a breadth-first search implementation the likelihoods for all hypotheses are computed at each time frame. The so-called beam pruning compares the likelihoods at each time frame and keeps only those hypotheses whose likelihoods are sufficiently close to that of the current best hypothesis [Lowerre 1976; Ney & Mergel+ 1987; Ortmanns & Ney 1995]. A careful tuning of the pruning parameters yields a considerable reduction of the search effort without a significant number of search errors.

Beam search approaches for LVCSR decoders are particularly effective in combination with lexical prefix trees [Ney & Häb-Umbach+ 1992; Ortmanns & Eiden+ 1998]. Pronunciations with common prefixes are merged in the lexical prefix tree. Pruning in the early stages of the tree removes whole subtrees and eventually discards large parts of the search space. Language model look-ahead techniques aim at considering the language model probabilities in the early stages of the lexical prefix tree [Alleva & Huang+ 1996; Ortmanns & Ney+ 1996; Steinbiss & Ney+ 1993].

Weighted finite state transducers (WFSTs) provide a generic way to optimize the search space [Allauzen & Mohri+ 2004; Mohri & Riley 1997]. The acoustic model (HMM) and the language model (m-gram model) have natural WFST representations, and the respective WFSTs can be combined and minimized by using generic algorithms. In particular, the lexical prefix tree and the language model look-ahead technique are implicitly applied by a WFST decoder using a minimized static search space transducer [Kanthak & Ney+ 2002]. WFSTs and the construction of the static search space transducer are discussed in Section 1.7.
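The beam pruning step described above can be sketched in a few lines. The sketch assumes that each active hypothesis carries a score in the negative log domain (smaller is better) and that the beam is a fixed threshold relative to the best score; the function name and the dictionary representation are illustrative only.

```python
def beam_prune(hypotheses, beam):
    """Keep the hypotheses whose score is within `beam` of the best one.

    hypotheses : dict mapping a hypothesis id to its negative
                 log-likelihood at the current time frame
    beam       : non-negative pruning threshold
    """
    best = min(hypotheses.values())
    return {h: score for h, score in hypotheses.items() if score <= best + beam}


active = {"a": 10.0, "b": 12.5, "c": 30.0}
survivors = beam_prune(active, beam=5.0)
```

A search error arises when a hypothesis removed here would have become the overall best one at a later time frame, which is why the beam width has to be tuned against the search effort.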
Other methods to reduce the computational complexity of the search include fast likelihood computation [Cardinal & Dumouchel+ 2008; Kanthak & Schütz+ 2000; Ortmanns & Ney+ 1997b; Parihar & Schlüter+ 2009; Ramasubramansian & Paliwal 1992], several look-ahead techniques [Alleva & Huang+ 1996; Häb-Umbach & Ney 1994; Ortmanns & Ney+ 1996], and multi-pass approaches, where a fast first pass reduces the search space for the ultimate Viterbi search [Ljolje & Pereira+ 1999; Murveit & Butzberger+ 1993; Ney & Aubert 1994; Ortmanns & Ney+ 1997a; Schwartz & Chow 1990].
1.6 Multi-Pass Search

State-of-the-art LVCSR recognizers perform multiple recognition and/or rescoring passes, see for example [Evermann & Chan+ 2003; Hoffmeister & Plahl+ 2007; Prasad & Matsoukas+ 2005]. Supervised adaptation techniques like standard VTLN, MLLR, constrained MLLR, and domain-specific language model adaptation require a reference transcription (supervisor). In a multi-pass decoder the output of the first, unadapted recognition run serves as supervisor for the adaptation step, which is followed by a second recognition run with the adapted models and/or adapted features.

Some models and techniques cannot be applied during the Viterbi search because of their complexity, like the language model used in [Emami & Papineni+ 2007] or the phoneme duration model in [Jennequin & Gauvain 2007]. They are applied to a restricted search space, which is the result of an extended Viterbi search: instead of finding a single hypothesis, the search algorithm narrows the search space. The result is an N-best list or a lattice containing the best scoring word sequences, which is subsequently rescored with the sophisticated model. N-best lists or lattices are also used for applying Bayes risk decoding with the (approximate) Levenshtein distance as loss function and for system combination approaches, cf. Section 1.8 and Section 1.9.
1.6.1 Lattices

A word or phoneme lattice is a directed, acyclic graph with time stamps on the states and labels on the arcs. In a word lattice the label is usually the word together with its pronunciation; in a phoneme lattice the label is simply a phoneme. In addition, for each arc the acoustic and the language model score from the Viterbi decoding is stored. An example of a word lattice produced by an LVCSR system is shown in Figure 1.3. The goal in lattice creation is to store a large number of hypotheses in a compact form, where the number of hypotheses is usually orders of magnitude larger than the size of any feasible N-best list. The exact properties of a lattice depend on the search algorithm, i.e. on the HMM decoder design, and on subsequent filter steps.

Figure 1.3. Lattice produced by the RWTH 2007 TC-STAR EPPS Evaluation System for English [Lööf & Gollan+ 2007].

The default Viterbi search of the RWTH Aachen LVCSR system is a time-synchronous word-conditioned tree search implementation [Beulen & Ortmanns+ 1999]. A word lattice produced by the decoder follows the word pair approximation, in which the assumption is made that the end time of the word in question depends only on the current and the preceding word hypothesis [Ney & Aubert 1994; Ortmanns & Ney+ 1997a]. The word pair approximation guarantees that at any time $t$ and for any word $w$ and predecessor word $v$ there exists only one lattice arc labeled with $w$. As a consequence the lattice is deterministic, i.e. each word sequence exists only once; in particular, the same word sequence cannot exist with different word boundaries. This makes a lattice which fulfills the word pair approximation compact.

However, the only guaranteed property of a lattice created by the RWTH Aachen decoder is that it contains the best sentence hypothesis with the correct scores and correct word boundaries. Due to the word pair approximation, hypotheses competing with the best one may have inaccurate word boundaries and thus overestimated acoustic scores. Furthermore, it is not guaranteed that a lattice of $M$ hypotheses contains the N-best list for $1 < N \le M$, i.e. the $N$ best scoring hypotheses. These constraints hold not only for the RWTH Aachen decoder but for any popular LVCSR decoder design; for example, a discussion of issues in creating lattices from a WFST decoder is given in [Ljolje & Pereira+ 1999].

HMM decoding results can be stored in a compact form due to the several independence assumptions made in the search, cf. Section 1.3 and Section 1.4. The assumptions restrict the dependencies for computing any probability applied in HMM decoding to a finite context. On a word (or phoneme) level and for LVCSR tasks this is the context for cross-word modeling and the context for computing an m-gram probability. The context can be stored in the lattice topology; for example, for any state in a lattice which stores bigram probabilities all incoming arcs must have the same label.
However, this also means that in general a lattice built from a trigram LM requires more arcs to represent the same number of hypotheses than a lattice which stores bigram probabilities. The advantage of storing all context information in the lattice topology is that it allows generic graph algorithms to be applied to the lattice, like the transducer operations introduced later in this chapter in Section 1.7. For example, the LM probability for a sentence is simply the product of the m-gram probabilities stored on the arcs along a path through the lattice.

The quality of a lattice is measured in terms of graph error rate (GER) and density, and the goal in lattice construction is to achieve a low GER for a small density. The GER of a word lattice $L$ is defined in Equation (1.10), where $\mathrm{Lev}(w_1^N, \tilde{w}_{r,1}^{N_r})$ is the Levenshtein distance between word sequence $w_1^N$ and reference $\tilde{w}_{r,1}^{N_r}$, and where $\tilde{N}$ is the number of reference words.

$$
\mathrm{GER}(L) = \min_{w_1^N, N :\; w_1^N \in L} \frac{\mathrm{Lev}(w_1^N, \tilde{w}_{r,1}^{N_r})}{\tilde{N}}
\tag{1.10}
$$

The density is defined as the ratio between the number of words in the reference $\tilde{N}$ and the number of arcs in the lattice $E(L)$. If the reference is unknown, then the density can be approximated by using the number of words in the Viterbi decoding result $\hat{N}$.

$$
\mathrm{density}(L) := \frac{\tilde{N}}{E(L)} \approx \frac{\hat{N}}{E(L)}
\tag{1.11}
$$
All lattices produced by the RWTH Aachen LVCSR system use the word-conditioned tree search decoder and the word-pair approximation; the resulting lattices are referred to as word-conditioned lattices. Word lattice densities presented in this work are always approximated densities. Furthermore, all lattices used in any experiment presented in this work store all context information in the lattice topology.
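The graph error rate of Equation (1.10) can be illustrated on a toy lattice. The following is a minimal sketch, assuming a small acyclic lattice stored as a dictionary of outgoing arcs; the function names and the example lattice are hypothetical, and the exhaustive path enumeration is only feasible for such toy cases.

```python
def levenshtein(hyp, ref):
    """Word-level Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # h has no counterpart (insertion error)
                            curr[j - 1] + 1,           # r has no counterpart (deletion error)
                            prev[j - 1] + (h != r)))   # match or substitution
        prev = curr
    return prev[-1]


def lattice_paths(arcs, state, finals):
    """Yield every word sequence leading from `state` to a final state."""
    if state in finals:
        yield []
    for word, target in arcs.get(state, []):
        for tail in lattice_paths(arcs, target, finals):
            yield [word] + tail


def graph_error_rate(arcs, initial, finals, reference):
    """Minimum Levenshtein distance over all lattice paths, divided by the reference length."""
    best = min(levenshtein(path, reference)
               for path in lattice_paths(arcs, initial, finals))
    return best / len(reference)


# Hypothetical toy lattice: state -> list of (word, target state).
toy_arcs = {0: [("the", 1), ("a", 1)], 1: [("cat", 2), ("hat", 2)], 2: [("sat", 3)]}
toy_reference = ["the", "cat", "sat", "down"]
print(graph_error_rate(toy_arcs, 0, {3}, toy_reference))
```

The best path "the cat sat" misses one reference word, so the sketch reports one error for four reference words.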
Table 1.1. Semirings used by WFSTs for speech recognition tasks.
Semiring    | K             | x ⊕ y                   | x ⊗ y | $\bar{0}$ | $\bar{1}$
probability | R+            | x + y                   | x · y | 0         | 1
log         | R ∪ {−∞, +∞}  | −log(exp(−x) + exp(−y)) | x + y | +∞        | 0
tropical    | R ∪ {−∞, +∞}  | min(x, y)               | x + y | +∞        | 0
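The three semirings of Table 1.1 can be written down directly. The following is a minimal sketch; the `Semiring` container and the helper `log_plus` are hypothetical names chosen for illustration, and only +∞ is handled as a special value.

```python
import math
from collections import namedtuple

# plus = ⊕, times = ⊗, zero = neutral element of ⊕, one = neutral element of ⊗
Semiring = namedtuple("Semiring", "plus times zero one")


def log_plus(x, y):
    """x ⊕ y in the log semiring, i.e. -log(exp(-x) + exp(-y)), computed stably."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


PROBABILITY = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
LOG = Semiring(log_plus, lambda x, y: x + y, math.inf, 0.0)
TROPICAL = Semiring(min, lambda x, y: x + y, math.inf, 0.0)

# Adding probabilities 0.2 and 0.3 in negated log space gives -log(0.5).
print(LOG.plus(-math.log(0.2), -math.log(0.3)), -math.log(0.5))
# The tropical semiring keeps only the better (smaller) of the two costs.
print(TROPICAL.plus(-math.log(0.2), -math.log(0.3)), -math.log(0.3))
```

Replacing `log_plus` by `min` is exactly the maximum (Viterbi) approximation that turns the log semiring into the tropical semiring.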
1.6.2 Speaker Adaptation
Speaker adaptation requires a speaker label S for each speech utterance, where utterances spoken by the same speaker form a speaker cluster. A common approach for unsupervised speaker clustering is to optimize the Bayesian information criterion (BIC) on the acoustic features of the clustered utterances [Chenand & Gopalakrishnan 1998; Tritschler & Gopinath 1999]. The commonly applied speaker adaptation methods in the RWTH Aachen LVCSR decoder are vocal tract length normalization (VTLN), maximum likelihood linear regression (MLLR), and constrained MLLR (CMLLR).

In VTLN the warping factor for a speaker S is chosen by a grid search which aims at maximizing the likelihood of the speaker cluster given the output of the previous recognition pass. The approach is computationally expensive and the RWTH Aachen system uses by default the fast-VTLN implementation, where the warping factor is selected by a classifier [Lee & Rose 1996; Molau 2003].

In the MLLR approach the parameters of the GMMs are adapted to the speaker by applying a speaker-dependent linear transformation to the means and variances. Equation (1.12) shows the unconstrained form of MLLR.

$$ \hat{\mu}_{sl}^{(S)} = A_s^{(S)} \mu_{sl} + b_s^{(S)}, \qquad \hat{\Sigma}_{sl}^{(S)} = H_s^{(S)} \Sigma_{sl} H_s^{(S)T} \qquad (1.12) $$

In the RWTH Aachen system only the means are adapted, but not the globally pooled, diagonal covariance matrix Σ. The state dependent transformation matrices $A_s^{(S)}$ for a given speaker S are tied according to a decision tree [Pitz 2005]. In the estimation step those transformation matrices are chosen which maximize the likelihood of the corresponding speaker cluster, where, as for VTLN, the output of the previous decoding pass serves as supervisor.

In the constrained form of MLLR the means and variances are transformed by the same matrices. The RWTH Aachen system uses CMLLR for speaker adaptive training (SAT), where only a single transformation per speaker is used. The resulting transformation is shown in Equation (1.13).

$$ \hat{\mu}_{sl}^{(S)} = A^{(S)} \mu_{sl} + b^{(S)}, \qquad \hat{\Sigma}^{(S)} = A^{(S)} \Sigma A^{(S)T} \qquad (1.13) $$
The advantage of CMLLR is that it can be implemented as a feature transformation, which makes the integration in an LVCSR system simple [Leggetter & Woodland 1995].
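Why the constrained transformation can be moved to the features can be checked numerically. The following is a minimal one-dimensional sketch, assuming a single Gaussian with scalar transformation a and bias b; all function names are hypothetical. Adapting mean and variance by the same scalar gives the same log-likelihood as transforming the observation with the inverse affine map and adding the log-Jacobian term.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def model_space_score(x, mean, var, a, b):
    """Adapt the model: mean -> a*mean + b, variance -> a*var*a."""
    return log_gauss(x, a * mean + b, a * var * a)


def feature_space_score(x, mean, var, a, b):
    """Leave the model untouched, transform the feature and add the log-Jacobian."""
    return log_gauss((x - b) / a, mean, var) - math.log(abs(a))


x, mean, var, a, b = 1.7, 0.4, 2.5, 1.3, -0.2
print(model_space_score(x, mean, var, a, b))
print(feature_space_score(x, mean, var, a, b))
```

Both scores agree up to rounding, so in this sketch the speaker-dependent part touches only the features while the model parameters stay shared.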
1.7 Weighted Finite State Transducers
Weighted finite state transducers (WFSTs) are directed graphs with an input label, an output label, and a weight on each arc. In speech recognition WFSTs are commonly used to represent the stochastic models, in particular the HMM-based acoustic model and the m-gram language model, and lattices. The representation as transducers allows them to be manipulated by generic WFST operations [Mohri & Pereira+ 2008]. Word lattices represented as WFSTs and the notation developed in this section as well as the presented algorithms are heavily used in the following chapters. Besides introducing the notation and algorithms, this section shows how the search and time alignment problem is tackled with the help of WFSTs.
1.7.1 Notation
A weighted finite state transducer T is a 7-tuple $(\Sigma_{\mathrm{in}}, \Sigma_{\mathrm{out}}, (K, \oplus, \otimes, \bar{0}, \bar{1}), S, s_I, S_F, E)$. The input and output labels are taken from the alphabets $\Sigma_{\mathrm{in}}$ and $\Sigma_{\mathrm{out}}$. An acceptor A is a transducer without $\Sigma_{\mathrm{out}}$.
The weights of a transducer or acceptor form a semiring $(K, \oplus, \otimes, \bar{0}, \bar{1})$, where a semiring has the following properties:
1. $(K, \oplus, \bar{0})$ is a commutative monoid:
• $(x \oplus y) \oplus z = x \oplus (y \oplus z)$
• $\bar{0} \oplus x = x \oplus \bar{0} = x$
• $x \oplus y = y \oplus x$
2. $(K, \otimes, \bar{1})$ is a monoid:
• $(x \otimes y) \otimes z = x \otimes (y \otimes z)$
• $\bar{1} \otimes x = x \otimes \bar{1} = x$
3. $\otimes$ distributes over $\oplus$:
• $x \otimes (y \oplus z) = (x \otimes y) \oplus (x \otimes z)$
• $(x \oplus y) \otimes z = (x \otimes z) \oplus (y \otimes z)$
4. $\bar{0}$ is an annihilator for $\otimes$:
• $\bar{0} \otimes x = x \otimes \bar{0} = \bar{0}$

The common semirings used in speech recognition are summarized in Table 1.1. The log semiring equals the probability semiring in negated, logarithmic probability space. Applying the maximum or Viterbi approximation to the log semiring results in the tropical semiring. The semirings used in this work (including the semirings listed in Table 1.1) have the additional property that the ⊗-operation is commutative and for each element x but $\bar{0}$ the ⊗-inverse element $x^{-1}$ exists in K.

The states in the WFST are denoted by S, the single initial state by $s_I$, and the set of final states by $S_F$. Final states can have weights, which are denoted by $w(s)$, $s \in S_F$. In a lattice each state s carries a time stamp denoted by $t(s)$. The set of arcs or edges in a WFST is denoted by $E \subseteq S \times \{\Sigma_{\mathrm{in}} \cup \epsilon\} \times \{\Sigma_{\mathrm{out}} \cup \epsilon\} \times K \times S$, where ε denotes the empty word. For an arc $e \in E$ the input label is denoted by $i(e)$, the output label by $o(e)$, the weight by $w(e)$, the source state by from(e), and the target state by to(e). For a state s the set of incoming arcs is denoted by in(s) and the set of outgoing arcs by out(s). The notations $e \in E$ and $s \in S$ are abbreviated by $e \in T$ and $s \in T$.

A (sub)path $a_1^L \in E \times \cdots \times E$ in transducer T is any consecutive sequence of arcs. The set of all paths starting from state s and ending in state s' is denoted by $\pi(s, s')$ and accordingly $\pi(S, S')$ is the set of all paths starting in $s \in S$ and ending in $s' \in S'$. Paths in $\pi(\{s_I\}, S_F)$ are called paths through T, other paths are called subpaths in T. As for edges and states, the notation $a_1^L \in \pi(\{s_I\}, S_F)$ is abbreviated by $a_1^L \in T$. For path $a_1^L$ the sequence of non-ε input labels is given by $i(a_1^L) \in \Sigma_{\mathrm{in}}^*$; $o(a_1^L)$ is defined analogously. The ⊗-product over the arc weights $w(a_1) \otimes w(a_2) \otimes \ldots \otimes w(a_L)$ of a (sub)path is denoted by $[\![a_1^L]\!]$ and the ⊕-sum over the product of each path through T by

$$ [\![T]\!] := \bigoplus_{a_1^L \in T} [\![a_1^L]\!]. \qquad (1.14) $$

The interpretation of $[\![T]\!]$ depends on the semiring: the tropical semiring yields the Viterbi decoding result for T. For adequate weights the result of the log or probability semiring can be interpreted as the normalization term for a probability distribution over the paths through T, i.e. $p(a_1^L \mid T) := \exp\left(-\left([\![T]\!]_{\log}^{-1} \otimes_{\log} [\![a_1^L]\!]_{\log}\right)\right)$. The weight for a sequence of input labels $w_1^N$ and output labels $v_1^M$ is the sum over all paths through T accepting $w_1^N$ as input and $v_1^M$ as output:

$$ [\![T]\!](w_1^N, v_1^M) := \bigoplus_{a_1^L \in T:\; i(a_1^L) = w_1^N \,\wedge\, o(a_1^L) = v_1^M} [\![a_1^L]\!] \otimes w(\mathrm{to}(a_L)) \qquad (1.15) $$

$$ [\![A]\!](w_1^N) := \bigoplus_{a_1^L \in A:\; i(a_1^L) = w_1^N} [\![a_1^L]\!] \otimes w(\mathrm{to}(a_L)) \qquad (1.16) $$
WFSTs have a natural graphical representation as shown in Figure 1.4 for a transducer and an acceptor.
[Figure 1.4: two graphs over the states 0, 1, and 2, where state 2 is final with weight 0.8. Panel a) shows an acceptor with the arcs a/0.5, b/0.3, c/0.0, and d/0.6; panel b) shows a transducer with the arcs a:d/0.5, b:c/0.3, c:b/0.0, and d:a/0.6.]
Figure 1.4. Graphical representation of a weighted acceptor a) and a weighted transducer b). An arc in the acceptor is labeled by i(e)/w(e), a transducer arc by i(e):o(e)/w(e). States are labeled with their state number and a final weight, if the state is final.
1.7.2 Algorithms
Single-Source Shortest-Distance. The shortest-distance of a state s to the final states of T is defined in Equation (1.17).

$$ d(s; T) := \bigoplus_{a_1^L \in \pi(s, S_F)} \left( \bigotimes_{l=1}^{L} w(a_l) \right) \otimes w(\mathrm{to}(a_L)) \qquad (1.17) $$

Starting from the initial state, $d(s_I; T)$ equals $[\![T]\!]$. For the tropical semiring with non-negative weights the Dijkstra algorithm can be used to compute $d(\cdot; T)$. The shortest path for acyclic WFSTs or WFSTs with idempotent semirings (like the tropical semiring) can be computed efficiently by using the Bellman-Ford algorithm, which applies a form of dynamic programming. In particular, the time complexity for acyclic WFSTs is O(E + S). A summary of efficient solutions to the single-source shortest-distance problem for arbitrary WFSTs and semirings is given in [Mohri 2002b].

Composition and Intersection. The composition of two transducers $T_1 \circ T_2$ is a mapping from sequences in $\Sigma_{\mathrm{in},1}^*$ to sequences in $\Sigma_{\mathrm{out},2}^*$ and is defined in Equation (1.18).

$$ [\![T_1 \circ T_2]\!](w_1^N, v_1^M) := \bigoplus_{u_1^L} [\![T_1]\!](w_1^N, u_1^L) \otimes [\![T_2]\!](u_1^L, v_1^M) \qquad (1.18) $$
The result of the composition of two acceptors A1 and A2 is their intersection: $w_1^N$ is accepted if both A1 and A2 accept $w_1^N$. Composition and intersection can be efficiently computed in time O((E1 + S1)(E2 + S2)) [Mohri 2004].

Determinization and Minimization. In a determinized WFST det(T) no two arcs leaving the same state have the same input label. In the common definition of determinization for WFSTs the empty word ε is treated as a normal label. However, in a strong sense determinization means that for acceptor A and input label sequence $w_1^N$ at most one path through A accepts $w_1^N$. The strong form of determinization is achieved by first removing the ε labels from A. All acyclic WFSTs and unweighted FSTs are determinizable, but cyclic WFSTs can be non-determinizable. A workaround is to convert a non-determinizable WFST into a determinizable WFST by inserting additional arcs labeled with so-called disambiguating input labels or disambiguators [Allauzen & Mohri 2004]. In the worst case the number of states in the determinized WFST grows exponentially even for acyclic transducers. A determinized, acyclic WFST can be efficiently minimized in time O(E), where the minimized WFST is the equivalent transducer with the minimal number of states; the complexity of the minimization depends on the semiring [Mohri 2004].

ε-removal. After applying ε-removal to an acceptor A the resulting acceptor $\mathrm{remove}_{\epsilon}(A)$ has no arcs with the empty word as input label. The complexity is $O(SE + S^2)$ for acyclic acceptors. The complexity for the general case and possible extensions of ε-removal to transducers are discussed in [Mohri 2002a, 2003].
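The single-source shortest-distance of Equation (1.17) can be sketched for the acyclic case. The following is a minimal illustration, assuming that states are numbered in topological order (every arc leads to a higher state number) and that ⊗ is ordinary addition, as in the log and tropical semirings; the function names and the toy acceptor are hypothetical.

```python
import math


def log_plus(x, y):
    """x ⊕ y in the log semiring, i.e. -log(exp(-x) + exp(-y))."""
    if x == math.inf:
        return y
    if y == math.inf:
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))


def shortest_distance(num_states, arcs, finals, plus):
    """Return d(s) for every state of an acyclic, topologically numbered WFST.

    arcs:   state -> list of (weight, target state), with target > state
    finals: final state -> final weight
    plus:   the ⊕ operation; the neutral element of ⊕ is +inf in both semirings
    """
    d = [math.inf] * num_states
    for s in reversed(range(num_states)):       # targets are finished before sources
        total = finals.get(s, math.inf)         # paths of length zero ending in s
        for weight, target in arcs.get(s, []):
            total = plus(total, weight + d[target])
        d[s] = total
    return d


# Hypothetical toy acceptor with two parallel arcs between consecutive states.
toy_arcs = {0: [(0.5, 1), (0.3, 1)], 1: [(0.0, 2), (0.6, 2)]}
toy_finals = {2: 0.8}
print(shortest_distance(3, toy_arcs, toy_finals, min)[0])       # cost of the best path
print(shortest_distance(3, toy_arcs, toy_finals, log_plus)[0])  # negated log of the total mass
```

Each state and each arc is visited once, which matches the O(E + S) complexity stated above for acyclic WFSTs.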
Project. The projection converts a transducer T into an acceptor project(T) and is defined in Equation (1.19).

$$ [\![\mathrm{project}(T)]\!](w_1^N) := \bigoplus_{v_1^M} [\![T]\!](w_1^N, v_1^M) \qquad (1.19) $$

Union. The union of two transducers accepts $(w_1^N, v_1^M)$ if $T_1$ or $T_2$ accepts $(w_1^N, v_1^M)$. Equation (1.20) defines the union of two WFSTs.

$$ [\![T_1 \cup T_2]\!](w_1^N, v_1^M) = [\![T_1]\!](w_1^N, v_1^M) \oplus [\![T_2]\!](w_1^N, v_1^M) \qquad (1.20) $$
Building the union has a time complexity of O(1): a new super-initial state is introduced and connected via arcs with the initial states of $T_1$ and $T_2$.

Miscellaneous. In the transposed WFST $T^T$ the arc direction is inverted. In $T^{-1}$ input and output label are exchanged. $\partial(s; T)$ denotes the sub-WFST of T with s as new initial state. The result of trim(T) has only coaccessible states, where a coaccessible state s is any state on a path through T, i.e. s can be reached from the initial state and at least one of the final states can be reached from s. If not explicitly mentioned otherwise, any WFST T is assumed to be trim. Several WFST libraries which include the algorithms presented in this section are publicly available, cf. [Allauzen & Riley+ 2007; Hetherington 2004; Kanthak & Ney 2004].
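The defining Equations (1.18) to (1.20) can be read as operations on weight tables. The following is a minimal sketch over the tropical semiring, assuming each transducer is given extensionally as a finite dictionary from (input sequence, output sequence) to weight. This representation and the function names are hypothetical and serve only to illustrate the definitions; they are not the graph algorithms referenced in the text.

```python
def compose(t1, t2, plus=min):
    """Equation (1.18): ⊕ over all intermediate sequences u of [[T1]](w, u) ⊗ [[T2]](u, v)."""
    out = {}
    for (w, u1), x in t1.items():
        for (u2, v), y in t2.items():
            if u1 == u2:
                key = (w, v)
                out[key] = plus(out[key], x + y) if key in out else x + y
    return out


def project(t, plus=min):
    """Equation (1.19): ⊕ over all output sequences, leaving a weight per input sequence."""
    out = {}
    for (w, _), x in t.items():
        out[w] = plus(out[w], x) if w in out else x
    return out


def union(t1, t2, plus=min):
    """Equation (1.20): ⊕ of the weights assigned by both transducers."""
    out = dict(t1)
    for key, x in t2.items():
        out[key] = plus(out[key], x) if key in out else x
    return out


# Hypothetical toy tables; label sequences are tuples, weights are tropical costs.
t1 = {(("a",), ("x",)): 1.0, (("a",), ("y",)): 0.5}
t2 = {(("x",), ("p",)): 0.25, (("y",), ("p",)): 1.0}
print(compose(t1, t2))
print(project(t1))
print(union(t1, {(("a",), ("x",)): 0.25}))
```

Passing a different `plus` function, for example a log-semiring sum, changes the semiring without touching the structure of the three operations.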
1.7.3 WFSTs in ASR
In ASR tasks transducers are often used for representing and manipulating lattices and for solving the search and the time alignment problem. WFSTs for lattice representation are discussed in detail in Chapter 3. This section briefly summarizes how the search and the time alignment problem are solved with the help of WFSTs.

The main idea in using WFSTs for solving the search problem is to factorize the problem into a set of simple-to-construct WFSTs and then to use generic WFST algorithms to solve the search problem defined in Equation (1.9), i.e. to find the Viterbi hypothesis $\hat{W}$ given feature sequence $x_1^T$. The common factorization for LVCSR systems consists of five transducers, cf. [Mohri & Pereira+ 2008]:
• O emission probabilities: An acyclic transducer with the acoustic feature $x_t$ as input, an HMM state s as output, and the likelihood $p(x_t \mid s)$ as weight.
• H HMM state to context-dependent (CD) phone mapping: A cyclic transducer which consists of the collection of all the triphone dependent HMMs; the weights are the transition probabilities $p(s \mid s')$.
• C CD phone to context-independent (CI) phone mapping: A cyclic, unweighted FST which maps triphones to their central phoneme.
• L CI phone to word mapping: The WFST representation of the pronunciation lexicon; the weights are the pronunciation probabilities $p(w \mid a_1^L)$.
• G language model probabilities: The representation of an m-gram language model with backing-off as acceptor; the weights are the language model probabilities $p(w_n \mid w_{n-m+1}^{n-1})$.

The five knowledge sources are combined via composition and the resulting form of the search problem is given in Equation (1.21), where probabilities are represented in negated log-space and the transducers are defined over the tropical semiring.

$$ x_1^T \rightarrow \hat{W} = o(\arg d(O \circ H \circ C \circ L \circ G)) \qquad (1.21) $$
The advantage of the transducer representation is that the static part of the search space H ◦ C ◦ L ◦ G can be optimized offline. Minimizing the static part significantly reduces the number of states and the runtime of the decoder. However, in practice the minimization of H ◦ C ◦ L ◦ G is not straightforward, because it is not determinizable and contains many ε arcs (ε-removal is prohibitive as it would cause a dramatic blow-up). The placement of the disambiguator arcs and of the ε labels is crucial for getting a small and efficient WFST decoder [Allauzen & Mohri 2004; Allauzen & Mohri+ 2004]. WFST decoders for LVCSR use a version of the single-source shortest-distance operation d(·) which includes pruning. For LVCSR tasks the full static search space transducer can become huge, and common decoder designs perform on-the-fly compositions $\circ_{\mathrm{fly}}$, which are applied during the decoding in combination with a pruning implementation $d_{\mathrm{prune}}(\cdot)$. The following list summarizes the most common decoder designs.

$$ [w_1^N]_{\mathrm{opt}} = o(\arg d_{\mathrm{prune}}(O \circ_{\mathrm{fly}} \min(H \circ C \circ L \circ G))) \qquad (1.22) $$

$$ [w_1^N]_{\mathrm{opt}} = o(\arg d_{\mathrm{prune}}(O \circ_{\mathrm{fly}} H \circ_{\mathrm{fly}} \min(C \circ L \circ G))) \qquad (1.23) $$

$$ [w_1^N]_{\mathrm{opt}} = o(\arg d_{\mathrm{prune}}(O \circ_{\mathrm{fly}} \min(H \circ C \circ L) \circ_{\mathrm{fly}} G)) \qquad (1.24) $$
Decoder design (1.22) uses the fully optimized static search space, where minimization is usually applied over the log semiring. Design (1.23) expands the HMM states on the fly, which significantly reduces the size of the precomputed WFST. The third decoder design (1.24) is conceptually equivalent to the word-conditioned tree search, the standard decoder at RWTH Aachen. Producing a lattice with a WFST decoder is conceptually simple: instead of applying $d_{\mathrm{prune}}(\cdot)$ the search space is only pruned. The time alignment problem is solved in the WFST framework by simply replacing acceptor G by acceptor R, which is a linear transducer representing the reference transcription.
1.8 Bayes Risk Decoding: State of the Art
Bayes risk decoding for ASR aims at finding the word sequence $\hat{W}$ with the minimum risk (aka minimum expected loss/error/cost) given feature sequence $x_1^T$ and given a loss function $L(\cdot, \cdot)$. Equation (1.25) shows the general form of the Bayes risk decision rule [Bishop 2006].

$$ x_1^T \rightarrow \hat{W} = \operatorname*{argmin}_{w_1^N, N} \sum_{v_1^M, M} p(v_1^M \mid x_1^T) \, L(w_1^N, v_1^M) \qquad (1.25) $$
In fact, Equation (1.1) is the instance of the Bayes risk decoder which uses the sentence error $L(w_1^N, v_1^M) := 1 - \delta(w_1^N, v_1^M)$ as loss function. However, the standard cost function for LVCSR tasks is the WER, which is defined as the Levenshtein distance normalized by the length of the reference string. Because of this discrepancy in the cost function the MAP (and Viterbi) decoding result is not optimal for LVCSR tasks, which motivates the application of cost functions that are closer to the WER. Usually, the normalization in the WER is omitted and the goal is to minimize the Levenshtein distance. However, the sum in Equation (1.25) prohibits the usage of a complex, non-local cost function like the Levenshtein distance during the search. Thus, Bayes risk decoding approaches with non-local loss functions are usually applied in a post-processing step on N-best lists or on word lattices.

N-best lists allow a direct computation of Equation (1.25) with the Levenshtein distance as loss function [Goel & Byrne+ 1998; Stolcke & König+ 1997]. Lattices contain many more hypotheses than any practicable N-best list and preserve more probability mass, especially for long utterances. On the downside, a direct computation of the Bayes risk decoding rule is still prohibitive for word lattices from an LVCSR system. A commonly used approximation is the confusion network (CN), for which Bayes risk decoding with an approximate Levenshtein distance as loss function is reduced to a local, word-wise decision problem [Mangu & Brill+ 1999, 2000]. In recent years several methods have been proposed to build confusion networks directly from lattices [Hakkani & Riccardi 2003; Hoffmeister & Schlüter+ 2009; Mangu & Brill+ 2000; Xue & Zhao 2005]. An extension to the CN decoding approach cuts the lattice into small, independent segments and computes the Levenshtein distance within the segments [Goel & Kumar+ 2004, 2000, 2001; Kumar & Byrne 2002]; the standard CN case is derived by allowing at most one word per segment.
Another extension is to replace the standard decision rule in CN decoding by a classifier, which can compensate for alignment errors and for unreliable probability estimates [Hoffmeister & Schlüter+ 2008; Mangu & Padmanabhan 2001; Venkataramani & Chakrabartty+ 2003, 2007]. In [Chien & Huang+ 2006] Bayesian priors are used in the risk computation, which model the uncertainty in the parameters of the acoustic and the language model. In [Goel & Byrne 2000] the authors aim at finding the Bayes risk hypothesis by doing an A* search over the lattice, where the algorithm requires an estimation of the residual costs. An estimation of the Bayes risk and a criterion to decide whether the Bayes risk hypothesis with the Levenshtein distance as loss function is different from the MAP hypothesis is given in [Schlüter & Scharrenbach+ 2005]. Other lattice-based approaches use modified loss functions which allow an efficient computation of the Bayes risk hypothesis [Hoffmeister & Schlüter+ 2009; Wessel & Schlüter+ 2000, 2001c; Xu & Povey+ 2009]. An algorithm for computing the lattice-based Bayes risk hypothesis with the Levenshtein distance as loss function using only generic transducer operations is presented in [Mohri 2003], but the algorithm has exponential worst case complexity.
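The direct computation of Equation (1.25) on an N-best list can be sketched as follows. This is a minimal illustration, assuming that the posteriors of the list entries are already normalized and that the minimization is restricted to the entries of the list itself; the function names and the toy list are hypothetical.

```python
def levenshtein(a, b):
    """Word-level Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def bayes_risk(candidate, nbest):
    """Expected Levenshtein loss of `candidate` under the N-best posteriors."""
    return sum(p * levenshtein(candidate, v) for v, p in nbest)


def mbr_decode(nbest):
    """Return the list entry with minimum Bayes risk."""
    return min((w for w, _ in nbest), key=lambda w: bayes_risk(w, nbest))


def map_decode(nbest):
    """Return the list entry with maximum posterior (sentence error loss)."""
    return max(nbest, key=lambda entry: entry[1])[0]


# Hypothetical toy N-best list: (word sequence, posterior probability).
toy_nbest = [(("a", "b", "c"), 0.40),
             (("x", "b", "d"), 0.35),
             (("x", "b", "e"), 0.25)]
print(map_decode(toy_nbest))
print(mbr_decode(toy_nbest))
```

In this toy list the two lower-ranked entries agree on most words, so the Levenshtein-based rule prefers one of them over the MAP entry, which is the kind of discrepancy between sentence error and word error discussed above.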
1.9 Model and System Combination: State of the Art
1.9.1 Log-linear Model Combination
The standard in ASR is to use a log-linear model with only two knowledge sources: the acoustic model and the language model. For optimal performance LVCSR systems introduce a language model scale which eventually turns Equation (1.1) into a log-linear model. The log-linear model can be used explicitly for model combination by simply adding more knowledge sources to the model, usually additional acoustic models [Metze & Waibel 2002a,b; Zolnay 2006]. In the discriminative model combination (DMC) each of the knowledge sources combined in the log-linear model gets its own scaling factor which is optimized for minimal word error rate [Beyerlein 1997, 1998; Vergyri 2000; Zolnay & Schlüter+ 2005]. In practice, performing a decoding with many acoustic models is expensive in terms of time and memory, and the common approach is to produce a lattice with a base decoder and rescore the lattice arcs with the additional knowledge sources.

In the standard LVCSR training procedures the interaction during the search between the several knowledge sources is not (fully) considered during model parameter estimation. An approach to compensate for this shortcoming of the model training is to capture the interactions in the log-linear model combination by using context-dependent scaling factors [Hoffmeister & Liang+ 2009; Huang & Belin+ 1993; Vergyri & Tsakalidis+ 2000].
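The effect of the scaling factors in a log-linear combination can be sketched on a handful of hypotheses. The following is a minimal illustration, assuming each hypothesis comes with one log-score per knowledge source and that the posterior is normalized over the given hypotheses only; the function name, the scores, and the scales are hypothetical.

```python
import math


def loglinear_posteriors(scores, scales):
    """Combine per-source log-scores with one scaling factor per source and normalize.

    scores: hypothesis -> list of log-scores, one per knowledge source
    scales: list of scaling factors, one per knowledge source
    """
    combined = {hyp: sum(scale * s for scale, s in zip(scales, source_scores))
                for hyp, source_scores in scores.items()}
    top = max(combined.values())                     # subtract the maximum for stability
    norm = sum(math.exp(c - top) for c in combined.values())
    return {hyp: math.exp(c - top) / norm for hyp, c in combined.items()}


# Hypothetical log-scores: [acoustic model, language model].
toy_scores = {"h1": [-10.0, -4.0], "h2": [-9.0, -6.0]}
print(loglinear_posteriors(toy_scores, [1.0, 2.0]))   # strong language model scale
print(loglinear_posteriors(toy_scores, [1.0, 0.1]))   # weak language model scale
```

With the larger language model scale the first hypothesis receives the higher posterior, with the smaller scale the second one does, which illustrates why the scaling factors are tuned for minimal word error rate.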
1.9.2 System Combination
An alternative to the log-linear model combination is the N-best list or lattice-based system combination, where the output of several decoders is combined. In the log-linear model combination all (acoustic) models are combined into a single system, whereas in the system combination approach a separate system is built from each of the acoustic front-ends. In the simplest approach only a single hypothesis from each system is combined, as in ROVER [Fiscus 1997]. The quality of the ROVER result can be significantly increased by using confidence scores [Mangu & Brill+ 2000; Wessel & Schlüter+ 2001a] or by replacing ROVER's simple decision rule by a classifier [Hillard & Hoffmeister+ 2007; Zhang & Rudnicky 2006]. Instead of a single hypothesis, N-best lists or confusion networks can be combined [Evermann & Woodland 2000; Mangu 2000; Ostendorf & Kannan+ 1991; Stolcke & Bratt+ 2000]. In [Ostendorf & Kannan+ 1991] the system-dependent N-best lists are merged into a single N-best list followed by a rescoring step. In the other approaches a super CN is derived by aligning the system-dependent N-best lists or CNs. A lattice combination approach which derives system weights from the Bayesian decision theory is presented in [Sankar 2005]. In [Hoffmeister & Schlüter+ 2008] a more general classifier is used to predict which system is correct. The minimum frame error decoding rule introduced in [Wessel & Schlüter+ 2001c] is extended in [Hoffmeister & Klein+ 2006] to a system combination approach. A similar method is used in [Chen & Lee 2006], where
alternatively a phoneme error based cost is minimized. A general approach to combine and decode lattices from several systems, which covers the two latter approaches, is discussed in [Hoffmeister & Schlüter+ 2009]. In [Omar & Mangu 2007] an approach is presented in which the scores from the first system drive the search of the second system, thereby aiming at minimizing a smoothed loss function. A comparison of ROVER with confidence scores, CN combination, and minimum frame error based lattice combination shows that all three approaches perform almost equally well [Hoffmeister & Hillard+ 2007]. The results presented in [Zolnay 2006] and in [Hoffmeister & Liang+ 2009] indicate that the performance of lattice-based system combination approaches is superior to the log-linear model combination.

The theoretical motivation for system combination comes from machine learning. The basic idea is that if one classifier is not perfect then the combination with more classifiers improves the result, if the classifiers make different kinds of errors [Dietterich 2000a]. The same author discusses a simple way of getting an ensemble of classifiers by randomizing decision trees [Dietterich 2000b]. In [Ramabhadran & Siohan+ 2006; Siohan & Ramabhadran+ 2005] the approach is applied to the estimation of the phonetic decision tree used in modern LVCSR systems, e.g. [Beulen 1999]. The usage of different acoustic front-ends or randomized decision trees works well in practice, but it does not guarantee that the resulting systems benefit from combination. In recent years some effort has been put into deriving an ensemble of complementary systems which benefit from each other in the system combination [Breslin & Gales 2006, 2007a,b; Willett & He 2008]. But so far, the gain from complementary system training is rather small.
1.9.3 Cross-Adaptation
Cross-adaptation is an alternative way of doing system combination which became popular in recent years [Soltau & Kingsbury+ 2005; Stüker & Fügen+ 2006]. Instead of applying the system combination in a post-processing step to decoding, the interaction between the systems is put into the speaker adaptation step of a multi-pass decoder. In the cross-adaptation approach the supervisor for MLLR adaptation, cf. Equation (1.12), is the output of an alternative system. In [Guiliani & Brugnara 2006] the approach is extended to multiple supervisors. The multiple supervisors are either reduced to a single supervisor in a preprocessing step by applying system combination methods, or the ultimate adaptation statistics are derived from the weighted average of the supervisor-dependent statistics [Guiliani & Brugnara 2007; Hoffmeister & Plahl+ 2007].
Chapter 2 Scientific Goals
System combination is an important technique in state-of-the-art, highly accurate LVCSR systems, in particular for those systems where a low error rate is mandatory and runtime is of secondary importance. The combination via a single log-linear model is theoretically well grounded and was extensively studied in [Beyerlein 2000; Vergyri 2000]. The log-linear model is eventually a sentence-wise combination approach, whereas the popular ROVER [Fiscus 1997] approach comes as an ad hoc method for word-wise system combination. ROVER is closely related to the common lattice-based combination via confusion network combination (CNC) [Evermann & Woodland 2000; Mangu 2000]. CNC is theoretically motivated by the Bayes risk decoding rule with the Levenshtein distance as loss function: a CN provides an upper bound to the Levenshtein alignment between any two paths through a lattice. An alternative approximation is based on the definition of the frame error and was introduced in [Wessel & Schlüter+ 2001c] and extended to system combination in [Hoffmeister & Klein+ 2006]. The first objective of this thesis is to develop a unified view on system combination and to explore the connections between the popular approaches.

The lattice-based approach to system combination, which applies a Bayes risk decoder with an approximate Levenshtein distance as loss function, proved to be a simple and successful method and is the most widespread combination technique used in state-of-the-art LVCSR systems. However, confusion networks and frame error are only two of a variety of Levenshtein distance approximations used in LVCSR for training and decoding. The second objective of the thesis is to categorize, investigate, and extend existing approximations to the Levenshtein distance, and to develop new ones, which can be used to build efficient and accurate Bayes risk decoders.

Besides the loss function, Bayes risk decoding relies on the quality of the posterior probability estimates.
Standard approaches based on the Bayes risk decoding rule blindly trust the posterior probabilities derived from the word lattices. However, these probabilities are only estimates of the true posteriors. The third objective is to find and explore approaches to deal with the bias in the lattice-based posterior estimates.

From the objectives a set of theoretical and experimental goals is derived and investigated in this thesis, which include:

Development of a unified view on system combination. Word lattices have a natural representation as weighted finite state transducers (WFSTs). Based on the WFST framework and the Bayes decoding rule a unified view on system combination is developed which covers the log-linear model, the minimum frame error combination, CNC, and many more. The framework is used to compare sentence error based decoders, e.g. the Viterbi decoder, and approximated Levenshtein distance based decoders with regard to their capability for system combination. At first glance the common approach to CN combination stands apart from the lattice-based framework, because it makes use of the special structure of a CN. In this work the interpretation of the CNC in the lattice-based Bayes risk combination and decoding framework is explored as well as its connection to ROVER.

Investigations on local cost functions used in Bayes risk decoding. Bayes risk decoding of word lattices derived from LVCSR systems with the Levenshtein distance as loss function is computationally prohibitive. The common approach is to place the necessary approximation in the loss function, i.e. to use an approximation of the Levenshtein distance. Different approximations to the Levenshtein distance are in use in ASR for different purposes. The key idea of the approximation is to reduce the dependencies in the loss function such that the computation of the loss has a local nature. The degree of locality is used to derive two general classes of loss functions, which yield efficiently computable Bayes risk decoders.
In practice, the common local losses show a strong deletion bias. This work continues the theoretical investigations on the deletion bias started in [Gibson 2008] with special attention to the frame error based losses. New variants of local loss functions are developed, which have a direct influence on the deletion/insertion ratio and thus reduce the inherent deletion bias.

Investigations on confusion networks. Confusion networks are used in speech recognition and speech processing for many purposes like confidence score computation, Bayes risk decoding, system combination, and other tasks like (speech) translation [Evermann & Woodland 2000; Mangu & Brill+ 1999; Matusov & Hoffmeister+ 2008]. The common algorithms for converting a word lattice into a CN are based on an arc or state clustering. The algorithms are parametrized and finding the right parameters is crucial for good performance. Inspired by the common approaches to CN construction two algorithms are developed and investigated. Furthermore, a conceptually new and simple algorithm is proposed, which is completely parameter-free. The algorithm is based on frame-wise word posterior probabilities and draws a connection between minimum frame error and minimum CN distance decoding. CN construction algorithms aim at approximating the Levenshtein alignment, but the heuristic nature of the common lattice-based algorithms does not allow any assumption to be made about the resulting alignment, besides that it is an upper bound to the exact Levenshtein distance. However, experimental results indicate that the CN alignments are close to the Levenshtein alignments. In this work an approach is investigated which uses the CN alignment to initialize a windowed Levenshtein distance. A hierarchy of approximate Bayes risk decoders is developed, which starts with the common CN decoding rule for a window of size one.
For a sufficiently large window the decoder eventually becomes the Bayes risk decoder with the exact, unwindowed Levenshtein distance as loss function.

Development of a new approach to system combination. The common system combination approaches formulated in the Bayes risk decoding framework have two major drawbacks. The first is the approximation of the Levenshtein distance and the second is the blind reliance on the posterior probability estimates derived from the word lattice. In this work an approach is proposed which aims at overcoming both problems: a classifier based system combination. Instead of using the combined posterior estimates directly, a set of posterior estimates and further features are fed into a classifier. The classifier also has access to the results of the standard approaches to system combination and decides on the ultimate output. Thus, the classifier can learn systematic biases in the approximation of the Levenshtein distance and in the probability estimates. The approach is investigated for different feature sets and classifiers. The investigation contains a detailed analysis of the error detection and error correction capabilities of the classifier based system combination.

Investigations on the log-linear model combination. Log-linear model combination for speech recognition has been studied before in [Beyerlein 2000; Vergyri 2000; Zolnay 2006]. However, a systematic comparison of system combination approaches using either sentence posterior probabilities based on a log-linear model combination or sentence posterior probabilities based on the weighted average of system-dependent sentence posteriors is lacking and will be given in this thesis. In the standard log-linear model combination each knowledge source has a single scaling factor. However, a single factor cannot reflect the dynamic change in the influence of the knowledge source.
This work investigates the use of word and knowledge source dependent scaling factors in the log-linear combination of several acoustic models. The log-linear combination is compared to a combination based on averaged system-dependent sentence posterior probabilities, where the knowledge source dependent scaling factors are applied in the computation of the system-dependent posteriors.

The remainder of the thesis is organized as follows: Chapter 3 develops the unified view on system combination based on the WFST framework. Classes of local cost functions are derived and efficient Bayes risk decoders for system combination based on the local cost functions are developed. The common approaches to lattice-based system combination are investigated and classified within the framework.
Concrete instances of local costs are introduced and investigated in Chapter 4. Three categories of local costs are explored: frame error based costs, costs defined via a local alignment, and the confusion network (CN) distance. For each cost function an efficient Bayes risk decoder with the corresponding cost as loss function is developed. In this work the primary function of lattice-derived CNs is to define an approximation of the Levenshtein alignments between any two paths through the lattice. Three algorithms for constructing a CN from a lattice are introduced and investigated in the chapter. Chapter 5 investigates applications of CNs defined on frame and on word level. In particular, a Bayes risk decoder with the windowed Levenshtein distance as cost function is developed, which also draws a connection between decoding with the CN distance and with the exact Levenshtein distance. The classifier-based approach to system combination is introduced in Chapter 6, and Chapter 7 is dedicated to investigations on the log-linear model combination. The thesis is concluded by a summary of the scientific contributions in Chapter 8 and an outlook in Chapter 9. The appendix contains a description of the systems and corpora used in the various combination experiments as well as detailed results for all systems and corpora.
Chapter 3 Lattice-Based System Combination in the Bayes Risk Decoding Framework

In the lattice-based system combination task the goal is to combine and decode lattices provided by several systems. The number of LVCSR systems to be combined is denoted by $J$ and the word lattice produced by the $j$th system by $\mathcal{L}_j$; word lattices will be introduced in the next section. The Bayes risk decoding framework requires sentence posterior probabilities for computing the optimal hypothesis, cf. Equation (1.25). The ultimate goal is to compute the sentence posterior probability from the $J$ lattices and thus to perform a system combination within the Bayes risk framework. Note that the MAP decoding rule is included in the considerations, because it is the instance of the Bayes risk decoding rule with the sentence error as loss function. In this chapter it will be shown that the sentence posteriors used in the common approaches to system combination are ultimately computed from either the lattice intersection or the lattice union. In the course of the chapter the Bayes risk decoders are investigated which arise from combining the lattice intersection or union with different loss functions.

The standard model used in LVCSR to describe the system-dependent sentence posterior probabilities is a log-linear model of the form

\[
p_\lambda(w_1^N \mid x_1^T) := \frac{\exp\Big(\sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i f_i(n; w_1^N, x_1^T)\Big)}{\sum_{v_1^M, M} \exp\Big(\sum_{m=1}^{M} \sum_{i=1}^{I} \lambda_i f_i(m; v_1^M, x_1^T)\Big)},
\tag{3.1}
\]
where the feature functions $f_i(\cdot; w_1^N, x_1^T)$ represent the $I$ knowledge sources used to solve the search problem. For simplicity it is assumed that each of the $J$ systems combines the same number $I$ of feature functions. Each feature function has its own scaling factor $\lambda_i$. In the general definition given in Equation (3.1) the feature functions depend on the whole sentence $w_1^N$. However, in practice the features use only a restricted context and the model has a compact representation in form of a word lattice. Common LVCSR systems combine only two knowledge sources, an HMM-based acoustic model and an m-gram language model:

\[
f_{AM}(n; w_1^N, x_1^T) := \log p\big(x_{t_{n-1}+1}^{t_n} \mid w_n\big)
\]
\[
f_{LM}(n; w_1^N, x_1^T) := \log p\big(w_n \mid w_{n-m+1}^{n-1}\big)
\]
For $\lambda_{AM} = \lambda_{LM} = 1$ the log-linear combination equals the factorization of the posterior probability $p(w_1^N \mid x_1^T)$ according to Bayes rule shown in Equation (1.1). However, the estimate of the acoustic model is usually less reliable than that of the language model, which the scaling factors in the log-linear model aim to compensate for. In MAP decoding the normalization in Equation (3.1) can be discarded and only the ratio between the $\lambda$s matters. If MAP decoding is applied to the standard combination of an acoustic and a language model, the acoustic model scale is usually set to one and the language model scale is optimized. For computing sentence posterior probabilities the normalization is needed and the absolute values of the scaling factors matter [Wessel & Macherey+ 1998; Woodland & Povey 2000]. All lattices used in this work for Bayes risk decoding and system combination experiments provide separate acoustic and language model scores. Furthermore, all lattices store any required context information in their topology as described in Section 1.6.1. Word lattices have a natural representation as weighted finite state transducers (WFSTs), in particular as acyclic weighted finite state acceptors. The next section introduces semirings over $\mathbb{R}^D$, which allow a log-linear model to be represented directly as a WFST.
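The role of the scaling factors can be illustrated on a toy hypothesis list. The following is a minimal sketch, not the implementation used in this work: the three sentences and their sentence-level feature sums are hypothetical, and the normalization of Equation (3.1) runs over this explicit list instead of a lattice. It shows that rescaling all $\lambda_i$ by a common factor leaves the MAP hypothesis unchanged but changes the posterior probabilities.

```python
import math

def loglinear_posteriors(features, scales):
    """Sentence posteriors of a log-linear model over an explicit hypothesis list.

    features: {sentence: (sum of log p_AM, sum of log p_LM)}
    scales:   (lambda_AM, lambda_LM)
    """
    scores = {w: sum(l * f for l, f in zip(scales, feats))
              for w, feats in features.items()}
    # Normalization constant, computed in log space for numerical stability.
    m = max(scores.values())
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {w: math.exp(s - log_norm) for w, s in scores.items()}

# Hypothetical feature sums (log-probabilities), for illustration only.
features = {
    "the cat sat": (-250.0, -12.0),
    "the cat sad": (-248.0, -15.5),
    "a cat sat":   (-252.0, -13.0),
}

post_a = loglinear_posteriors(features, (1.0, 20.0))        # AM scale 1, LM scale 20
post_b = loglinear_posteriors(features, (1.0 / 20.0, 1.0))  # same ratio, rescaled

map_a = max(post_a, key=post_a.get)
map_b = max(post_b, key=post_b.get)
print(map_a, map_b)
print(post_a["the cat sat"], post_b["the cat sat"])
```

Both settings select the same sentence, while the first setting concentrates nearly all probability mass on it and the second yields a much flatter distribution.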
In Section 3.2 the WFST framework and Equation (3.1) are combined for computing probabilities over word lattices. In Section 3.2.1 probabilities are derived from a single lattice. From lattice-dependent and thus system-dependent probabilities the next step is to obtain a combined probability in order to perform the system combination within the Bayes risk decoding framework. The combination via the lattice intersection is introduced and motivated in Section 3.2.2 and via the lattice union in Section 3.2.3. In Section 3.3 a general framework for the combination and Bayes risk decoding of several lattices is developed using the WFST framework. In Section 3.3.1 the MAP decoding of the lattice union and intersection is investigated. In Section 3.3.3 the MAP decoding rule is replaced by a Bayes risk decoder with an approximation of the Levenshtein distance as loss function. A classification of Levenshtein distance approximations is introduced. For two classes of loss functions Bayes risk decoders are developed, which efficiently decode single lattices, the lattice intersection, and the lattice union. System combination based on confusion networks is discussed in Section 3.4. The common confusion network combination (CNC) algorithm is derived from minimizing the Bayes risk of the combination result and it is shown that CNC is in fact a CN decoding of the lattice union. Finally, the ROVER method is introduced as an approximation of the CNC algorithm. The result of the previous three sections is an abstract view on lattice combination and decoding which includes the common approaches to lattice-based system combination, in particular the discriminative model combination (DMC), CNC and ROVER. The resulting combination and decoding framework is summarized in Section 3.5. The section briefly discusses the common approaches to lattice-based system combination and shows how they fit into the framework.
A crucial step for getting good results with Bayes risk decoding and system combination techniques is a careful preprocessing of the lattices. Section 3.6 discusses the normalization steps applied in intra- and cross-site lattice combination. Section 3.7 describes the optimization algorithm used to estimate the scaling factors of the log-linear model and further combination and decoding dependent parameters. In the same section the general difference between parameter optimization for Bayes risk decoding and minimum risk based parameter estimation is discussed.
3.1 WFSTs as a High-Level Programming Language for Lattice-Based System Combination

The WFST framework provides a high-level programming language which is used in this work to describe the lattice-based combination and decoding problems. The advantage of using the WFST framework and generic WFST operations is a compact description of the problems. Furthermore, a WFST representation immediately yields an algorithm for solving the problem. The algorithm allows a first complexity analysis and helps to identify the expensive subproblems which require further analysis or have to be replaced by sophisticated algorithms or by approximations.

A lattice $\mathcal{L}$ is defined as an acyclic WFST over the log or tropical vector semiring with time stamps on the states, where the log or tropical vector semiring is given by $(\mathbb{R}^D, \oplus, \otimes, \bar{0}, \bar{1}, \lambda)$. An arc weight $x \in \mathbb{R}^D$ and the vector $\lambda \in \mathbb{R}^D$ correspond to the arc-dependent features and to the scaling factors in the log-linear model defined in Equation (3.1). The standard log and tropical semiring are defined in Section 1.7. The log semiring equals the probability semiring in negated log space, i.e. $\exp(-x)$, $x \in \mathbb{R}$, is a homomorphism from the log to the probability semiring. Applying the Viterbi approximation to the log semiring yields the tropical semiring. The interpretation of an arc weight $x \in \mathbb{R}^D$ and vector $\lambda$ as the features and scaling factors in a log-linear model requires that the scalar product $\lambda \cdot x$ be a homomorphism from the log or tropical vector semiring to the standard log or tropical semiring. In other words, the definitions of the $\oplus$-operator, the $\otimes$-operator, and of the neutral elements have to satisfy

\[
\lambda \cdot (x \oplus y) = (\lambda \cdot x) \oplus (\lambda \cdot y), \qquad
\lambda \cdot (x \otimes y) = (\lambda \cdot x) \otimes (\lambda \cdot y).
\tag{3.2}
\]
Thus, it is guaranteed that the standard log and the log vector semiring as well as the standard tropical and the tropical vector semiring produce equivalent results as long as λ is kept fixed, e.g. finding the path through the lattice with the shortest distance.
A second, desired property is that the operators and neutral elements are independent of $\lambda$, i.e. that changing the $\lambda$-vector before or after applying an operation shall not affect the outcome of the operation. Equation (3.3) formally defines the property for an arbitrary operation $\odot$ and $\lambda \neq \lambda'$:

\[
\lambda \cdot (x \odot_{\lambda} y) = \lambda \cdot (x \odot_{\lambda'} y)
\tag{3.3}
\]
Drawing the connection to the log-linear model reveals the motivation for the second property: $\lambda$ corresponds to the scaling factors in the log-linear model. Therefore, operations which are independent of $\lambda$ can be applied without affecting the outcome of a subsequent optimization of the log-linear model. It is easy to see that, due to the distributivity of multiplication over addition, the following definitions of the neutral elements and the $\otimes$-operator fulfill Equation (3.2):

\[
\bar{0} := \begin{pmatrix} \infty \\ \vdots \\ \infty \end{pmatrix}, \qquad
\bar{1} := \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_D \end{pmatrix} \otimes \begin{pmatrix} y_1 \\ \vdots \\ y_D \end{pmatrix} := \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_D + y_D \end{pmatrix}
\]
The neutral elements and the $\otimes$-operator are defined independently of $\lambda$ and thus Equation (3.3) holds for the $\otimes$-operator. The $\oplus$-operator for the log and tropical semiring cannot be defined independently of the $\lambda$-vector and thus does not fulfill Equation (3.3). This becomes obvious when looking at the interpretation of the $\oplus$-operation in the log-linear model. Computing the $\oplus$-sum over all paths in the WFST equals the computation of the normalization constant in the log-linear model for the log semiring. In the tropical semiring the $\oplus$-sum is equivalent to finding the best scoring path. Both operations are obviously not independent of $\lambda$. For the log semiring many possible definitions of the $\oplus$-operator which fulfill Equation (3.2) exist. In a series of experiments, of all tested $\oplus$-operators the following definition gives the best approximation to Equation (3.3): $x \oplus_{\log} y = z$, where

\[
z_i := -\lambda_i^{-1} \,
\frac{\log\big(\exp(-\lambda_i x_i) + \exp(-\lambda_i y_i)\big)}{\sum_{d=1}^{D} \log\big(\exp(-\lambda_d x_d) + \exp(-\lambda_d y_d)\big)} \,
\log\Big(\exp\Big(-\sum_{d=1}^{D} \lambda_d x_d\Big) + \exp\Big(-\sum_{d=1}^{D} \lambda_d y_d\Big)\Big)
\tag{3.4}
\]

The $\oplus$-operator for the tropical vector semiring fulfilling Equation (3.2) is given by:

\[
x \oplus_{\mathrm{trop}} y :=
\begin{cases}
x, & \text{if } \lambda \cdot x \leq \lambda \cdot y \\
y, & \text{otherwise}
\end{cases}
\tag{3.5}
\]
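The homomorphism property of Equation (3.2) can be checked numerically for the operators defined above. The following is a minimal sketch under the assumption that weights are plain tuples of floats; the function names and the example values are chosen for illustration only. Note that the $\oplus_{\log}$ of Equation (3.4) is undefined if the sum in the denominator is zero, which the sketch does not guard against.

```python
import math

def dot(lam, x):
    return sum(l * v for l, v in zip(lam, x))

def logadd_neg(a, b):
    """Scalar log-semiring sum in negated log space: -log(exp(-a) + exp(-b))."""
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def otimes(x, y):
    """Component-wise addition; independent of lambda."""
    return tuple(a + b for a, b in zip(x, y))

def oplus_trop(x, y, lam):
    """Equation (3.5): keep the vector with the smaller scaled score."""
    return x if dot(lam, x) <= dot(lam, y) else y

def oplus_log(x, y, lam):
    """Equation (3.4): distribute the scalar log-sum over the dimensions."""
    # terms[d] = log(exp(-lam_d x_d) + exp(-lam_d y_d))
    terms = [-logadd_neg(l * a, l * b) for l, a, b in zip(lam, x, y)]
    # total = log(exp(-lam.x) + exp(-lam.y))
    total = -logadd_neg(dot(lam, x), dot(lam, y))
    denom = sum(terms)  # assumed to be non-zero
    return tuple(-(1.0 / l) * (t / denom) * total for l, t in zip(lam, terms))

lam = (1.0 / 20.0, 1.0)   # illustrative scaling factors
x, y = (25.0, 1.25), (30.0, 2.0)

z_log = oplus_log(x, y, lam)
z_trop = oplus_trop(x, y, lam)
z_times = otimes(x, y)
print(dot(lam, z_log), logadd_neg(dot(lam, x), dot(lam, y)))
```

The two printed values agree, i.e. applying the scalar product after the vector $\oplus$ gives the same result as applying the scalar $\oplus$ to the two scalar products.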
As pointed out before, the multidimensional semiring is not needed: instead the arc weights can be set to the precomputed scalar products and the standard log or tropical semiring can be applied. However, the theoretical advantage of the multidimensional semiring is that the scaling factors $\lambda$ are integrated in the model: the semiring itself describes the log-linear model of which $\lambda$ is part. Modifying the scaling factors changes the weight computation over the lattice and thus the outcome of operations like computing the best scoring path. Consequently, modifying $\lambda$ instantiates a new semiring.

The practical advantage of the multidimensional semirings over the corresponding single dimensional semirings is twofold. First, after decoding over the tropical vector semiring the system-dependent and model-dependent scores of the best hypothesis are available for postprocessing steps, e.g. machine translation. Second, $\lambda$-invariant operations can be performed on the lattice without affecting the weight computation over the lattice with changed scaling factors. That is, the parameters of the log-linear model can be optimized on the modified lattice, e.g. after a lattice preprocessing. In particular, algorithms which do not use the $\oplus$-operator, like composition, ε-removal (without determinization of the ε-closures), and trimming, are invariant to the scaling factors. Algorithms which use the $\oplus$-operator, like single-source shortest-path or determinization, are not $\lambda$-invariant. Changing from $\lambda'$ to $\lambda$, $\lambda' \neq \lambda$, after instead of before applying the $\oplus$-operator induces an error. An example of the development of the error for the $\oplus$-operator of the log vector semiring, cf. Equation (3.4), is shown in Figure 3.1.

[Figure 3.1: normalized error (y-axis) plotted against the LM scale (x-axis, 16 to 24) for x = (25, 1.25), (250, 12.5), and (2500, 125).]
Figure 3.1. Error induced by changing the LM scale after computing $x \oplus x$; the LM scale is initialized with 20. The correct sum results from changing the scaling factors before applying the $\oplus$-operator. The $\oplus$-operator is defined in Equation (3.4).

The setup for the example uses values which are typical for a single arc in a word lattice produced by an LVCSR decoder. The initial scaling factors are $\lambda' := (1/20, 1)$, where 20 is a typical value for the language model scale. The modified scaling factors $\lambda := (1/\alpha, 1)$ vary only in the first component with language model scale $\alpha \in \{15, 16, \ldots, 25\}$. The x-axis of the graph is the language model scale $\alpha$. The y-axis shows the normalized error defined as $\frac{y_{cor} - y_{appr}}{y_{cor}}$, where $y_{cor} := \lambda \cdot (x \oplus_{\lambda} x)$ and $y_{appr} := \lambda \cdot (x \oplus_{\lambda'} x)$. The error in a lattice is additive along a path. Thus, for long sentences or large scores the error induced by the $\oplus$-operator can become huge, which rules out tuning the scaling factors on the modified, e.g. determinized, lattice.

In practice, the obvious disadvantage of the vector semirings is that the computation of the $\oplus$- and $\otimes$-operators is more expensive for multidimensional weights. However, if speed is an issue the scores can be projected to a single dimension and the standard log or tropical semiring can be applied.

The time stamps stored at the transducer states are also subject to problems when applying generic transducer operations to word lattices. Operations which merge states, like the composition or determinization, destroy the uniqueness of the time stamp. In this case the time stamps are discarded. If in a composition only one transducer has time stamps, then the time stamps are transferred to the composition result. If both WFSTs have time stamps, then the time stamps are discarded.
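The normalized error described above can be recomputed with a few lines of code. The following is a sketch under the stated setup ($\lambda' = (1/20, 1)$, $\lambda = (1/\alpha, 1)$, and the three weight vectors from Figure 3.1); the helper `oplus_log_self` is a hypothetical name and specializes Equation (3.4) to the case $x \oplus x$, using $\log(2\exp(-a)) = \log 2 - a$. It is not claimed that the printed numbers reproduce the original plot exactly.

```python
import math

LOG2 = math.log(2.0)

def dot(lam, x):
    return sum(l * v for l, v in zip(lam, x))

def oplus_log_self(x, lam):
    """x (+) x according to Equation (3.4), evaluated with scaling factors lam."""
    terms = [LOG2 - l * v for l, v in zip(lam, x)]   # per-dimension log-sums
    total = LOG2 - dot(lam, x)                       # log-sum of the scalar scores
    denom = sum(terms)                               # assumed to be non-zero
    return tuple(-(t / denom) * total / l for l, t in zip(lam, terms))

def normalized_error(x, lam_init, lam_new):
    """(y_cor - y_appr) / y_cor, where the approximation applies (+) under lam_init."""
    y_cor = dot(lam_new, oplus_log_self(x, lam_new))
    y_appr = dot(lam_new, oplus_log_self(x, lam_init))
    return (y_cor - y_appr) / y_cor

lam_init = (1.0 / 20.0, 1.0)
weights = [(25.0, 1.25), (250.0, 12.5), (2500.0, 125.0)]

for alpha in range(15, 26):
    lam_new = (1.0 / alpha, 1.0)
    errors = [normalized_error(x, lam_init, lam_new) for x in weights]
    print(alpha, ["%+.4f" % e for e in errors])
```

By construction the error vanishes for $\alpha = 20$, where the scaling factors are unchanged.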
3.2 Probabilities over Lattices

3.2.1 Probabilities over a Single Lattice

Let $\mathcal{L}$ be a lattice in WFST representation as described in Section 3.1, produced by an LVCSR system given acoustic features $x_1^T$. According to the definitions given in Equation (1.15) and Equation (3.1) the posterior probability of a word sequence $w_1^N$ is given by

\[
p(w_1^N \mid x_1^T) = \Big[\exp\big(-\lambda \cdot [\![\mathcal{L}]\!]\big)\Big]^{-1} \exp\big(-\lambda \cdot [\![\mathcal{L}]\!](w_1^N)\big).
\tag{3.6}
\]

Defining the probability for a path $a_1^L$ through $\mathcal{L}$ as

\[
p(a_1^L \mid x_1^T) := \Big[\exp\big(-\lambda \cdot [\![\mathcal{L}]\!]\big)\Big]^{-1} \exp\Big(-\lambda \cdot \Big[\bigotimes_{l=1}^{L} w(a_l) \otimes w\big(\mathrm{to}(a_L)\big)\Big]\Big),
\tag{3.7}
\]

allows Equation (3.6) to be rewritten as

\[
p(w_1^N \mid x_1^T) = \sum_{\substack{a_1^L \in \mathcal{L}, \\ i(a_1^L) = w_1^N}} p(a_1^L \mid x_1^T).
\tag{3.8}
\]

The posterior probability for an arc $a$ is defined as the sum over all paths going through $a$:

\[
p(a \mid x_1^T) := \sum_{\substack{a_1^L \in \mathcal{L}, \\ \exists l:\, a_l = a}} p(a_1^L \mid x_1^T)
\tag{3.9}
\]

The next equation shows an efficient way to compute arc probabilities with the help of generic WFST operations:

\[
p(a \mid x_1^T) = \Big[\exp\big(-\lambda \cdot d(s_I)\big)\Big]^{-1} \exp\Big(-\lambda \cdot \Big[d\big(\partial(\mathrm{from}(a); \mathcal{L}^T)\big) \otimes w(a) \otimes d\big(\mathrm{to}(a)\big)\Big]\Big)
\tag{3.10}
\]

For lattice $\mathcal{L}$ and state $s$ the value $d(\partial(s; \mathcal{L}^T))$ is called forward score and $d(s)$ is called backward score; the resulting algorithm is known as the forward/backward algorithm. The forward scores for all states in an acyclic lattice can be efficiently computed in time $O(E + S)$ by calculating the forward score for the last state and storing all intermediate results. The backward scores can be computed analogously.
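The forward/backward computation of arc posteriors can be sketched as follows. This is an illustrative sketch, not the implementation used in this work. It assumes that the arc weights have already been projected to scalar costs $\lambda \cdot w(a)$ in negated log space, that states are integers numbered in topological order, that there is a single final state with weight $\bar{1}$, and that the toy lattice is hypothetical.

```python
import math
from collections import defaultdict

def logadd_neg(a, b):
    """Log-semiring sum in negated log space; math.inf is the zero element."""
    if a == math.inf:
        return b
    if b == math.inf:
        return a
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def arc_posteriors(arcs, initial, final):
    """arcs: list of (from_state, to_state, word, cost); returns one posterior per arc."""
    fwd = defaultdict(lambda: math.inf)   # forward score of each state
    bwd = defaultdict(lambda: math.inf)   # backward score d(s) of each state
    fwd[initial] = 0.0
    bwd[final] = 0.0
    # Forward pass: arcs ordered by source state, so fwd[s] is complete when used.
    for s, t, _, c in sorted(arcs, key=lambda a: a[0]):
        fwd[t] = logadd_neg(fwd[t], fwd[s] + c)
    # Backward pass: arcs ordered by descending target state.
    for s, t, _, c in sorted(arcs, key=lambda a: a[1], reverse=True):
        bwd[s] = logadd_neg(bwd[s], c + bwd[t])
    total = bwd[initial]                  # corresponds to d(s_I) in Equation (3.10)
    return [math.exp(-(fwd[s] + c + bwd[t] - total)) for s, t, _, c in arcs]

# Hypothetical two-slot lattice with scalar arc costs.
arcs = [(0, 1, "the", 1.0), (0, 1, "a", 2.0), (1, 2, "cat", 0.5), (1, 2, "cap", 1.5)]
post = arc_posteriors(arcs, initial=0, final=2)
for arc, p in zip(arcs, post):
    print(arc[2], round(p, 4))
```

In this toy lattice every path passes through exactly one arc per state pair, so the posteriors of the arcs between two consecutive states sum to one.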
Framewise word posterior probabilities $p_t(w \mid x_1^T)$ model the chance of observing word $w$ at time $t$:

\[
p_t(w \mid x_1^T) := \sum_{\substack{a_1^L \in \mathcal{L}, \\ \exists l:\, i(a_l) = w \,\wedge\, \mathrm{beg}(a_l) \leq t}} p(a_1^L \mid x_1^T)
= \sum_{\substack{a \in \mathcal{L}, \\ i(a) = w \,\wedge\, \mathrm{beg}(a) \leq t}} p(a \mid x_1^T)
\tag{3.11}
\]

Thus, the framewise word posteriors can be efficiently computed from the arc probabilities. Similarly, position or slotwise posterior probabilities can be computed from a lattice. Let us assume a function $\sigma : E(\mathcal{L}) \to \mathbb{N}$ which assigns each arc to a position or slot number. Under the assumption that $\sigma(a_1) < \sigma(a_2)$ holds for any two arcs $a_1$ and $a_2$ which lie on the same path such that $a_1$ precedes $a_2$, the probability for word $w$ being observed at position $s$ is given by:

\[
p_s(w \mid x_1^T) := \sum_{\substack{a_1^L \in \mathcal{L}, \\ \exists l:\, i(a_l) = w \,\wedge\, \sigma(a_l) = s}} p(a_1^L \mid x_1^T)
= \sum_{\substack{a \in \mathcal{L}, \\ i(a) = w \,\wedge\, \sigma(a) = s}} p(a \mid x_1^T)
\tag{3.12}
\]
3.2.2 Probabilities over the Lattice Intersection

Let $J$ be the number of LVCSR systems to be combined and $\mathcal{L}_j$ the lattice produced by the $j$th system given acoustic features $x_1^T$. For the intersection approach the semiring of all lattices is set either to the log or to the tropical vector semiring with dimensionality $I \cdot J$, where $I$ is the number of feature functions per system. The lattice from the $j$th system stores its scores in dimensions $(j - 1) \cdot I + 1$ to $j \cdot I$; the other dimensions are set to zero. The construction defines a log-linear model which combines the $I \cdot J$ knowledge sources provided by the $J$ word lattices, and the intersection of the $J$ lattices is the log-linear model combination of the $I \cdot J$ models: by the definition of the intersection each path and thus each arc in

\[
\mathcal{L}_\cap := \bigcap_{j=1}^{J} \mathcal{L}_j
\tag{3.13}
\]

has scores assigned from all $I \cdot J$ models. The sentence posterior probabilities can now be computed directly from the intersection result in the same way as for a single lattice, cf. Equation (3.6). However, in practice the intersection approach has several drawbacks:

• Building the intersection from lattices with many ε-arcs is expensive¹; often an ε-removal and a determinization of the lattices is necessary to make the intersection work.
• Time stamps are invalidated when applying standard transducer operations including the determinization and the intersection. Bayes risk decoders with loss functions which rely on correct word boundaries cannot be applied; this includes all Levenshtein distance approximations investigated in this thesis, cf. Chapter 4.
• The vocabulary of the intersection result is the intersection of the system-dependent vocabularies. Thus, the intersection increases the out-of-vocabulary (OOV) rate.
• The intersection of several lattices can be empty, if the lattices do not contain a common input sequence. This is especially the case if the systems use different vocabularies, e.g. in a cross-site system combination, or if rather long utterances are decoded.

¹ A decoder like the word-conditioned tree search decoder used in the RWTH Aachen system produces ε-arc free (and even deterministic) lattices. In this case ε-arcs can result from preparing the lattices for combination, e.g. by replacing non-word events like silence or noise by the empty word. Lattice preprocessing is discussed in detail in Section 3.6.

An alternative to the intersection approach is lattice rescoring. A base lattice is provided and rescored arc by arc with all $I \cdot J$ models. The approach resolves the drawbacks of the intersection but introduces new problems: all systems must have the same pronunciation dictionary, and the rescoring with fixed word boundaries usually results in inferior error rates. Results with the rescoring approach are given in Chapter 7. From a theoretical point of view intersection and rescoring describe the same model and thus are not distinguished in the abstract framework developed in this chapter.
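The effect of the intersection on the score layout and on the set of surviving hypotheses can be illustrated without any WFST machinery. The sketch below is a deliberate simplification: each lattice is replaced by an explicit, hypothetical list of word sequences with one $I$-dimensional feature vector each, which would not be feasible for real lattices. The function name `intersect_path_sets` is introduced here for illustration only.

```python
def intersect_path_sets(lattices):
    """Intersect J explicit path sets.

    lattices: list of J dicts {word_sequence: feature tuple of length I}.
    Returns {word_sequence: concatenated feature tuple of length I * J}; system j
    occupies dimensions (j - 1) * I + 1 to j * I, as described in the text.
    """
    common = set(lattices[0])
    for lat in lattices[1:]:
        common &= set(lat)
    return {w: tuple(f for lat in lattices for f in lat[w]) for w in common}

# Hypothetical (AM, LM) log scores per word sequence for two systems.
sys1 = {("a", "b"): (-10.0, -2.0), ("a", "c"): (-9.0, -3.0)}
sys2 = {("a", "b"): (-11.0, -2.5), ("d",): (-8.0, -4.0)}

both = intersect_path_sets([sys1, sys2])
print(both)   # only the shared word sequence survives, with I * J = 4 scores

# A system with a disjoint hypothesis set makes the intersection empty.
empty = intersect_path_sets([sys1, {("z",): (-1.0, -1.0)}])
print(empty)
```

The example mirrors two of the drawbacks listed above: word sequences known to only one system are lost, and the result can be empty.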
3.2.3 Probabilities over the Lattice Union

Again, let $J$ be the number of LVCSR systems to be combined. In the common approaches to system combination in ASR the sentence posterior probabilities are computed as the weighted average of the system-dependent sentence posteriors. The motivation is to introduce the system as a hidden variable and derive the posterior probability by marginalizing over the systems

\[
p(w_1^N \mid x_1^T) = \sum_{j=1}^{J} p(j \mid x_1^T)\, p(w_1^N \mid j, x_1^T) = \sum_{j=1}^{J} p(j)\, p_j(w_1^N \mid x_1^T),
\tag{3.14}
\]

where the model assumption is made that the system prior $p(j)$ is independent of the acoustic observation. This is the model used in ROVER with confidence scores [Fiscus 1997] and in confusion network combination (CNC) [Evermann & Woodland 2000].

Let $\mathcal{L}_j$ be the lattice produced by the $j$th system given acoustic features $x_1^T$. Looking at the definition of the union in Equation (1.20) and at the definition of the sentence posterior probability for a single lattice given in Equation (3.6), it is easy to see that the union over slightly modified lattices $\mathcal{L}_j$ yields the desired posterior probabilities. Each lattice $\mathcal{L}_j$ is modified such that it has a new initial state which is connected with the former initial state by an arc with weight $\omega_j \otimes [\![\mathcal{L}_j]\!]^{-1}$. Here, $\omega_j$ is simply the weighted negated logarithm of the $j$th system prior $p(j)$ such that $\exp\big(-\lambda \cdot (\omega_j \otimes x)\big) = p(j) \exp(-\lambda \cdot x)$. The modified lattice is denoted by $\mathcal{L}'_j$ and the union over the modified lattices by

\[
\mathcal{L}'_\cup := \bigcup_{j=1}^{J} \mathcal{L}'_j.
\tag{3.15}
\]

Equation (3.16) proves that the union over the modified lattices yields the desired posterior probabilities.

\[
\begin{aligned}
\exp\big(-\lambda \cdot [\![\mathcal{L}'_\cup]\!](w_1^N)\big)
&= \exp\Big(-\lambda \cdot \bigoplus_{j=1}^{J} [\![\mathcal{L}'_j]\!](w_1^N)\Big) \\
&= \sum_{j=1}^{J} \exp\Big(-\lambda \cdot \big(\omega_j \otimes [\![\mathcal{L}_j]\!]^{-1} \otimes [\![\mathcal{L}_j]\!](w_1^N)\big)\Big) \\
&= \sum_{j=1}^{J} p(j)\, p_j(w_1^N \mid x_1^T)
\end{aligned}
\tag{3.16}
\]
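The derivation in Equation (3.16) can be verified numerically on explicit hypothesis lists. The following sketch assumes scalar costs $\lambda \cdot [\![\mathcal{L}_j]\!](w_1^N)$ in negated log space and hypothetical system priors. It applies the modification $\omega_j \otimes [\![\mathcal{L}_j]\!]^{-1}$ to each path cost and compares the result of the union with the weighted average of the system-dependent posteriors.

```python
import math

def system_posteriors(costs):
    """Posteriors of a single system from scalar path costs (negated log scores)."""
    log_z = math.log(sum(math.exp(-c) for c in costs.values()))
    return {w: math.exp(-c - log_z) for w, c in costs.items()}

def union_posteriors(systems, priors):
    """Posteriors read off the modified union, following Equation (3.16)."""
    union = {}
    for costs, prior in zip(systems, priors):
        total = -math.log(sum(math.exp(-c) for c in costs.values()))  # lambda . [[L_j]]
        for w, c in costs.items():
            modified = -math.log(prior) - total + c   # omega_j (x) [[L_j]]^-1 (x) [[L_j]](w)
            # Log-semiring (+) over the systems, carried out in the probability domain.
            union[w] = union.get(w, 0.0) + math.exp(-modified)
    return union

# Hypothetical path costs of two systems and hypothetical system priors p(j).
systems = [{"a b": 1.0, "a c": 2.0}, {"a b": 1.5, "d": 0.5}]
priors = [0.6, 0.4]

combined = union_posteriors(systems, priors)

average = {}
for costs, prior in zip(systems, priors):
    for w, p in system_posteriors(costs).items():
        average[w] = average.get(w, 0.0) + prior * p

for w in sorted(combined):
    print(w, round(combined[w], 6), round(average[w], 6))
```

Both columns agree, and the combined posteriors sum to one because each modified lattice is normalized before the prior is applied.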
A direct advantage of the modified union approach over the intersection method is that the OOV rate in the union is reduced rather than increased. The union always exists, whereas the intersection might be empty. And in contrast to the lattice intersection, in the lattice union the time stamps always survive. This makes the union particularly interesting for all Bayes risk decoders based on a cost function which requires exact time stamps; this includes all Levenshtein distance approximations investigated in this thesis, cf. Chapter 4.
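The framewise word posteriors of the modified lattice union can be either computed directly from $\mathcal{L}'_\cup$, cf. Equation (3.11), or equivalently by averaging the system-dependent framewise word posterior probabilities:

\[
\begin{aligned}
p_t(w \mid x_1^T)
&= \sum_{\substack{a_1^L \in \mathcal{L}'_\cup, \\ \exists l:\, i(a_l) = w \,\wedge\, \mathrm{beg}(a_l) \leq t}} p(a_1^L \mid x_1^T) \\
&= \sum_{j=1}^{J} p(j) \sum_{\substack{a_1^L \in \mathcal{L}_j, \\ \exists l:\, i(a_l) = w \,\wedge\, \mathrm{beg}(a_l) \leq t}} p_j(a_1^L \mid x_1^T) \\
&= \sum_{j=1}^{J} p(j)\, p_{j,t}(w \mid x_1^T)
\end{aligned}
\tag{3.17}
\]

The same holds for the slotwise word posteriors, cf. Equation (3.12), given a slot function $\sigma : E(\mathcal{L}'_\cup) \to \mathbb{N}$ defined over the lattice union:

\[
\begin{aligned}
p_s(w \mid x_1^T)
&= \sum_{\substack{a_1^L \in \mathcal{L}'_\cup, \\ \exists l:\, i(a_l) = w \,\wedge\, \sigma(a_l) = s}} p(a_1^L \mid x_1^T) \\
&= \sum_{j=1}^{J} p(j) \sum_{\substack{a_1^L \in \mathcal{L}_j, \\ \exists l:\, i(a_l) = w \,\wedge\, \sigma(a_l) = s}} p_j(a_1^L \mid x_1^T) \\
&= \sum_{j=1}^{J} p(j)\, p_{j,s}(w \mid x_1^T)
\end{aligned}
\tag{3.18}
\]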
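3.3 Lattice-Based System Combination in the Bayes Risk Decoding Framework

3.3.1 The MAP/Viterbi Decoding Framework

The maximum a-posteriori (MAP) decoding rule for a word lattice $\mathcal{L}$ is derived by inserting Equation (3.6) in Equation (1.1):

\[
\begin{aligned}
x_1^T \to \hat{W}
&:= \operatorname*{argmin}_{w_1^N, N} \; \lambda \cdot [\![\mathcal{L}]\!]_{\log}(w_1^N) \\
&= \operatorname*{argmax}_{w_1^N, N} \sum_{\substack{a_1^L \in \mathcal{L}, \\ i(a_1^L) = w_1^N}} p(a_1^L \mid x_1^T) \\
&= \mathrm{best}\big(\mathrm{det}_{\log}(\mathrm{remove}_{\varepsilon}(\mathcal{L}))\big),
\end{aligned}
\tag{3.19}
\]

where $\mathrm{best}(\mathcal{L})$ returns the sequence of the input labels of the shortest path through $\mathcal{L}$; the weight of the shortest path equals $d_{\mathrm{trop}}(\mathcal{L})$. Applying the Viterbi approximation yields

\[
\begin{aligned}
x_1^T \to \hat{W}
&:= \operatorname*{argmax}_{w_1^N, N} \sum_{\substack{a_1^L \in \mathcal{L}, \\ i(a_1^L) = w_1^N}} p(a_1^L \mid x_1^T) \\
&\overset{\text{Viterbi}}{=} \operatorname*{argmax}_{w_1^N, N} \max_{\substack{a_1^L \in \mathcal{L}, \\ i(a_1^L) = w_1^N}} p(a_1^L \mid x_1^T) \\
&= \mathrm{best}(\mathcal{L}).
\end{aligned}
\tag{3.20}
\]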
The main difference in the implementation is that Viterbi decoding does not require the determinization. In contrast to the full search space of an HMM statewise decoding, the determinization of a lattice is computationally possible if the lattice is not too dense. But determinization is expensive, still has an exponential worst case complexity, and can cause a runtime in $O(\exp(|E(\mathcal{L})|))$ for the MAP decoder. In practice, a strong lattice pruning is applied before determinization. Lattice pruning is discussed further in Section 3.6.

The MAP and Viterbi decoder can easily be used for system combination by decoding the lattice intersection or the modified lattice union, cf. Section 3.2.2 and Section 3.2.3. However, in both cases the computation of the MAP hypothesis requires one or even several lattice determinizations, which can become expensive. In practice, especially the determinization of the lattice union turned out to be very expensive and makes the approach infeasible for LVCSR. On the other hand, the Viterbi decoder is not a suitable choice for decoding the lattice union, because the Viterbi approximation replaces the sum in Equation (3.14) by the maximum, which in effect results in a sentence posterior based system selection. In conclusion, the MAP/Viterbi decoding framework is not a good choice for intersection or union based, i.e. log-linear or averaged sentence posterior probability based, system combination. An exception is the arcwise rescoring based approach which is investigated further in Chapter 7.
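The difference between Equation (3.19) and Equation (3.20) shows up whenever a word sequence is represented by several lattice paths, e.g. paths which differ only in their word boundaries. The following toy sketch works on an explicit, hypothetical list of paths with their posterior probabilities rather than on a lattice, so no determinization is involved.

```python
from collections import defaultdict

def map_decode(paths):
    """Sum the posteriors of all paths with the same word sequence, then maximize."""
    totals = defaultdict(float)
    for words, prob in paths:
        totals[words] += prob
    return max(totals, key=totals.get)

def viterbi_decode(paths):
    """Return the word sequence of the single best path."""
    return max(paths, key=lambda p: p[1])[0]

# Hypothetical path posteriors: "a b" occurs twice with different time alignments.
paths = [("a b", 0.3), ("a b", 0.3), ("c d", 0.4)]

map_hyp = map_decode(paths)
viterbi_hyp = viterbi_decode(paths)
print(map_hyp, "|", viterbi_hyp)
```

Here the MAP rule selects "a b" with a summed posterior of 0.6, whereas the Viterbi approximation selects "c d", the single best path.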
3.3.2 MAP/Viterbi Decoding Results

In this section experimental results for the intersection and union based system combination in the MAP and Viterbi framework are given and discussed. Experiments are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation system. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. Only for the English EPPS 2007 evaluation cross-site combination are neither MAP nor intersection results produced. The setup uses extremely long utterances, which makes even the determinization of the system-dependent lattices computationally infeasible. For all experiments acoustic and language model scales and the system weights in the union based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7.

The first set of experiments compares lattice-based MAP and Viterbi decoding for a single system. The Chinese setup consists of three subsystems and the English setup of four. The experimental results are shown in Table 3.1 and Table 3.2. The results are summarized by decoder; for each decoder the system with the lowest error rate on the tuning set is highlighted. The results show no benefit for MAP based lattice decoding. In fact, the MAP decoding is slower and has the disadvantage that the MAP decoding result comes without time stamps (due to the determinization). The word boundaries are computed with the forced-alignment algorithm described later in Section 5.1.2.

Intersection results are produced with the MAP and the Viterbi decoder. In order to make the computation of the intersection efficient, the system-dependent lattices are made ε-arc free and are determinized. As a result, the intersection has no time stamps and, as for the MAP decoder, the word boundaries are computed according to Section 5.1.2.
For the union based lattice combination only Viterbi results are produced. MAP computation turned out to be expensive due to the determinization of the union; the runtime for a single experiment was days to weeks.

The results show that the intersection based combination approach works and the outcome improves over the results of the best single system. The improvements on the tuning set generalize to the test sets, where improvements of the same magnitude can be observed. For both the Chinese and the English system, the best approach reduces the error rate by 5% relative compared to the best single system. The Chinese system benefits from intersecting all three systems, whereas the error rates for the English system increase when intersecting more than two systems. A possible explanation is the OOV rate: the Chinese subsystems share the same vocabulary, whereas the four English lattice sets are produced with different vocabularies. For some utterances the intersection is empty and a backoff strategy is applied: the hypothesis from the best performing system (on the tuning set) is used. The percentage of utterances for which the intersection exists is included in Table 3.1 and Table 3.2. Again, for the intersection approach the MAP decoder cannot improve over the Viterbi decoder.

The modified union based combination decoded with the Viterbi approximation is in effect a system selection: the system with the hypothesis with the highest posterior probability is chosen. Even this simple selection scheme works and shows an improvement over the best single system. Throughout this work the Viterbi result of the best single system will serve as the baseline for the upcoming Bayes risk decoding and system combination results.

Table 3.1. Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.

                                      CER[%]: (del/ins) err
System     Combination                dev07¹              eval07              dev08
Viterbi Decoder
s1                                    (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
s2                                    (2.65/1.70) 14.82   (4.44/0.93) 15.02   (2.71/0.94) 13.54
s3                                    (2.65/1.64) 15.07   (4.57/1.04) 15.60   (2.84/0.93) 13.80
s1+s2      intersection (97.6%)       (2.55/1.58) 14.05   (4.43/0.91) 14.59   (2.75/0.84) 13.09
           union                      (2.59/1.65) 14.25   (4.44/0.92) 14.86   (2.74/0.89) 13.36
s1+s2+s3   intersection (92.4%)       (2.46/1.56) 13.91   (4.38/0.91) 14.57   (2.66/0.83) 12.65
           union                      (2.57/1.64) 14.09   (4.47/0.92) 14.83   (2.77/0.87) 13.17
MAP Decoder
s1                                    (2.67/1.56) 14.56   (4.42/0.91) 15.14   (2.88/0.85) 13.39
s2                                    (2.63/1.72) 14.80   (4.41/0.96) 15.00   (2.66/0.97) 13.47
s3                                    (2.66/1.63) 15.08   (4.56/1.04) 15.58   (2.83/0.92) 13.82
s1+s2      intersection (97.6%)       (2.48/1.64) 14.04   (4.37/0.91) 14.56   (2.63/0.85) 12.91
s1+s2+s3   intersection (92.4%)       (2.49/1.59) 14.01   (4.40/0.90) 14.45   (2.68/0.87) 12.63

¹ tuning set
3.3.3 The Bayes Risk Decoding Framework with Local Cost Functions

The definition of the lattice-based Bayes risk distinguishes between hypothesis space and summation space. The summation space is represented by lattice $S$ and describes the posterior probability distribution over all word sequences $w_1^N$ computed according to Equation (3.6). By the definition of the sentence posterior probability, cf. Equation (3.8), a word sequence which is not present in $S$ has a probability of zero and thus does not contribute to the posterior computation in the Bayes risk decoder. However, the Bayes risk hypothesis might not be contained in the summation space lattice $S$, as shown for example in Table 3.3. The size of the hypothesis space depends on the summation space and on the loss function, but in the general case it contains all possible word sequences. In practice, often only a subset of the complete hypothesis space is explored. The restricted hypothesis space is represented by lattice $H$. The Bayes risk for an arbitrary loss function $\mathcal{L}(\cdot,\cdot)$, summation space lattice $S$, and hypothesis space lattice $H$ is given by

$$
\begin{aligned}
x_1^T \to \hat{r} &:= \min_{v_1^M, M} \sum_{w_1^N, N} p(w_1^N \mid x_1^T)\, \mathcal{L}(v_1^M, w_1^N) \\
&= \min_{v_1^M, M} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, \mathcal{L}(v_1^M, i(b_1^K)) \\
&\leq \min_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, \mathcal{L}(i(a_1^L), i(b_1^K)),
\end{aligned}
\tag{3.21}
$$

where the inequality is caused by the possibly restricted hypothesis space: if the optimal hypothesis is not contained in $H$, then the result is larger than the exact Bayes risk $\hat{r}$.
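The effect of a restricted hypothesis space in Equation (3.21) can be illustrated on a tiny N-best list. The following is a minimal sketch, not the decoder used in this work: the three sentences, their posteriors, and the function names `levenshtein` and `bayes_risk_decode` are made up for illustration, and the loss is the exact word-level Levenshtein distance, which is only feasible here because the spaces are tiny.

```python
def levenshtein(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]


def bayes_risk_decode(hyp_space, posteriors):
    """Return (hypothesis, risk) with minimum expected Levenshtein loss.

    hyp_space:  iterable of candidate sentences (hypothesis space H)
    posteriors: dict sentence -> posterior (summation space S)
    """
    def risk(v):
        return sum(p * levenshtein(v.split(), w.split())
                   for w, p in posteriors.items())

    best = min(hyp_space, key=risk)
    return best, risk(best)


# Hypothetical summation space: three sentences with their posteriors.
summation_space = {"x b c": 0.34, "a y c": 0.33, "a b z": 0.33}

# H restricted to the summation space itself.
restricted = bayes_risk_decode(summation_space, summation_space)
# H extended by a sentence that has posterior zero in S.
extended = bayes_risk_decode(list(summation_space) + ["a b c"], summation_space)

print(restricted)  # hypothesis found inside S
print(extended)    # hypothesis with lower risk outside S
```

In this toy setup the sentence "a b c" has posterior zero but is one edit away from every sentence in the summation space, so it attains a lower risk than any hypothesis taken from the summation space itself.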
3.3 LatticeBased System Combination in the Bayes Risk Decoding Framework
Table 3.2. Results for the English EPPS 2007 evaluation systems, cf. Section B.2.1. Results are word error rates (WER [%]), given as (del/ins) err; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.

| Decoder | System | Combination | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|---|
| Viterbi | s1 | | (1.65/2.21) 11.09 | (1.38/1.36) 8.43 | (1.86/1.31) 9.81 |
| | s2 | | (1.77/2.28) 11.89 | (1.67/1.23) 8.70 | (2.12/1.31) 10.07 |
| | s3 | | (2.06/2.29) 12.43 | (1.80/1.30) 8.98 | (2.22/1.34) 10.76 |
| | s4 | | (2.04/2.18) 12.06 | (1.85/1.38) 9.44 | (2.68/1.42) 11.73 |
| | s1+s2 | intersection (99.5%) | (1.72/2.09) 10.85 | (1.48/1.25) 8.07 | (1.99/1.21) 9.29 |
| | | union | (1.82/2.00) 11.05 | (1.56/1.24) 8.33 | (2.04/1.23) 9.79 |
| | s1+s2+s3 | intersection (98.5%) | (1.73/2.17) 11.27 | (1.49/1.28) 8.18 | (1.93/1.28) 9.57 |
| | | union | (1.86/2.05) 11.23 | (1.59/1.26) 8.38 | (1.99/1.22) 9.66 |
| | s1+s2+s3+s4 | intersection (97.4%) | (1.72/2.17) 11.19 | (1.54/1.20) 8.12 | (1.99/1.24) 9.54 |
| | | union | (1.86/2.05) 11.24 | (1.59/1.26) 8.38 | (1.99/1.23) 9.67 |
| MAP | s1 | | (1.66/2.29) 11.19 | (1.41/1.43) 8.51 | (1.84/1.35) 9.84 |
| | s2 | | (1.85/2.27) 11.81 | (1.72/1.23) 8.73 | (2.18/1.33) 10.14 |
| | s3 | | (2.04/2.34) 12.46 | (1.79/1.33) 8.99 | (2.19/1.37) 10.77 |
| | s4 | | (1.97/2.32) 12.27 | (1.73/1.47) 9.45 | (2.56/1.54) 11.77 |
| | s1+s2 | intersection (99.5%) | (1.68/2.12) 10.84 | (1.46/1.29) 8.11 | (1.94/1.25) 9.40 |
| | s1+s2+s3 | intersection (98.5%) | (1.73/2.22) 11.28 | (1.48/1.31) 8.22 | (1.90/1.33) 9.62 |
| | s1+s2+s3+s4 | intersection (97.4%) | (1.70/2.23) 11.27 | (1.53/1.26) 8.20 | (1.96/1.33) 9.73 |

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign
Chapter 3 LatticeBased System Combination in the Bayes Risk Decoding Framework
Table 3.3. Example for the situation where the Bayes risk hypothesis $\hat{W}$, i.e. the hypothesis with the minimum expected word error rate, has a sentence posterior probability of zero and thus is not contained in the summation space.
| $w_1^N$ | $p(w_1^N \mid x_1^T)$ |
|---|---|
| “coca cola’s share in market” | 0.4 |
| “coca cola’s share the market” | 0.4 |
| “cola’s share in the market” | 0.2 |
| $\hat{W}$ = “coca cola’s share in the market” | error=1 |
In Bayes risk decoding of LVCSR lattices the main interest is in loss functions which approximate the Levenshtein distance. For the Levenshtein distance the hypothesis space is usually larger than the summation space defined by lattice $S$. In the general case the hypothesis space should provide exact word boundaries, which are required by most approximations of the Levenshtein distance. This consideration motivates using the time-conditioned form of the summation space lattice $S$ as the default hypothesis space lattice $H$: all states with the same time stamp are merged [Hoffmeister & Klein+ 2006; Hoffmeister & Schlüter+ 2009]. Thus, the resulting hypothesis space is a superset of the summation space, but preserves the correct time stamp for each state.

Two special cases arise from using the sentence error and the confusion network (CN) distance as loss functions. For the sentence error, i.e. for the MAP decoder, it is easy to see that hypothesis and summation space are equal. The CN distance is an approximation of the Levenshtein distance for which it is possible to search the complete hypothesis space; CNs and Bayes risk decoding with the CN distance as loss function are discussed later in Section 3.4.

The Bayes risk decoder is simply the lattice decoder which returns the path from the hypothesis space which minimizes the Bayes risk on the summation space w.r.t. a given loss function. The extension of the Bayes risk decoder to lattice-based system combination is straightforward: in Equation (3.21) the sentence posterior $p(w_1^N \mid x_1^T)$ is computed either from a log-linear model combination, cf. Section 3.2.2, or from the weighted average of the system-dependent sentence posterior probabilities, cf. Section 3.2.3. This is equivalent to using the lattice intersection $L_\cap$ or the modified lattice union $L'_\cup$ as summation space lattice $S$. The lattice intersection is not suitable for the Levenshtein distance approximations investigated in this work. All approximations require exact word boundaries, which are not preserved in the intersection. However, in Chapter 7 Bayes risk decoding with the CN distance as loss function is applied to the log-linear model derived from a lattice rescoring and compared to the modified lattice union.

The computation of the Bayes risk hypothesis from an LVCSR lattice using the Levenshtein distance as loss function is computationally prohibitive, so approximations are required. In the first decoding approaches N-best lists of moderate size were used and the Bayes risk with the Levenshtein distance as loss function was computed on the N-best lists [Goel & Byrne+ 1998; Stolcke & König+ 1997]. The N-best list approach is still computationally expensive, and the considered summation and hypothesis spaces are orders of magnitude smaller than for lattices. The standard approach to lattice-based decoding is to place the approximation in the loss function. The goal is to find a loss function which is close to the Levenshtein distance and at the same time enables an efficient computation of the lattice-based Bayes risk decoding rule defined in Equation (3.21). Looking at the decoding rule reveals that an efficient computation of the Bayes risk cannot have long-term dependencies in the loss computation. Long-term dependencies would require an expansion of the lattice structure, in the worst case the expansion to the full N-best list. Thus, the loss functions used for lattice-based Bayes risk decoding aim at reducing the dependencies.

Let $c(a_1^L, b_1^K)$ be a general cost function for a path $a_1^L$ through the hypothesis space lattice $H$ and a path $b_1^K$ through the summation space lattice $S$. The cost function is assumed to be additive, like the Levenshtein distance. The first approximation makes the cost function local to the hypothesis space
lattice, i.e. the cost for arc $a_l$ does not depend on the cost of arc $a_k$ with $k \neq l$:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} c(a_l, b_1^K)
\tag{3.22}
$$

The second approximation requires that only arcs compete which overlap in time:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} \; \sum_{\substack{b_j^k:\; o(a_l, b_i) > 0 \text{ for } i \in [j,k], \\ o(a_l, b_i) = 0 \text{ for } i \notin [j,k]}} c(a_l, b_j^k)
\tag{3.23}
$$

where $o(a, b)$ denotes the overlap in time of arc $a$ and arc $b$. Cost functions fulfilling Equation (3.22) and Equation (3.23) are called type one cost functions. In addition, the most common approximations for the Levenshtein distance are local in the summation space:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} \; \sum_{\substack{k=1: \\ o(a_l, b_k) > 0}}^{K} c(a_l, b_k)
\tag{3.24}
$$
These cost functions will be referred to as cost functions of the second type. For both types of local costs an efficient implementation of the Bayes risk decoder exists; for a type two cost function an efficiently computable Bayes risk decoder exists even if constraint (3.23) is violated.

For the derivation of the decoders the following notation is introduced. The set of all subpaths in $L$ which intersect in time with arc $a$ is denoted by $O_{\mathrm{sub}}(a; L)$. In other words, for each path $b_1^K$ with subpath $b_j^k \in O_{\mathrm{sub}}(a; L)$ it holds that $o(a, b_i) > 0$ for $i \in [j, k]$ and $o(a, b_i) = 0$ for $i \notin [j, k]$. Furthermore, the notation $b_j^k \in \phi_1^K$ means that $b_j^k$ is a subpath of path $\phi_1^K$. The Bayes risk for a cost function of the first type can be computed by finding the shortest path in an arc-wise rescored hypothesis space lattice:

$$
\begin{aligned}
x_1^T \to \hat{r} &:= \min_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T) \sum_{l=1}^{L} \; \sum_{b_j^k \in O_{\mathrm{sub}}(a_l; S)} c(a_l, b_j^k) \\
&= \min_{a_1^L \in H} \sum_{l=1}^{L} \; \sum_{b_j^k \in O_{\mathrm{sub}}(a_l; S)} \; \sum_{\substack{\phi_1^K \in S: \\ b_j^k \in \phi_1^K}} p(\phi_1^K \mid x_1^T)\, c(a_l, b_j^k) \\
&= \min_{a_1^L \in H} \sum_{l=1}^{L} \underbrace{\sum_{b_j^k \in O_{\mathrm{sub}}(a_l; S)} p(b_j^k \mid x_1^T)\, c(a_l, b_j^k)}_{:= c(a_l; S)} \\
&= d_{\mathrm{trop}}\bigl(\mathrm{rescore}(H, c(\cdot; S))\bigr)
\end{aligned}
\tag{3.25}
$$
The disadvantage of Bayes risk decoding with cost functions of the first type is that the computation still requires a local expansion of the summation space to obtain the partial paths $b_j^k$. The time overlap constraint is essential in order to restrict the expansion. Otherwise each path in the summation space would have to be compared to each arc in the hypothesis space, which would be computationally infeasible in LVCSR. Once the expansion is done, the computation of the partial path probability $p(b_j^k \mid x_1^T)$ is efficient: simply use the algorithm for computing arc posteriors and replace the single arc weight by the product over the arc weights in the partial path, $w(b_j) \otimes w(b_{j+1}) \otimes \ldots \otimes w(b_k)$. However, due to the local expansion the worst-case complexity is exponential in the number of arcs. In practice, the runtime depends strongly on the lattice structure. If a long word competes with a highly connected cloud of short words, a local exponential blow-up can occur. The union of lattices from several systems, and in particular the cross-site combination case, is prone to this behavior.
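The local expansion and its possible blow-up can be made concrete with a small sketch. The code below enumerates the overlapping subpaths of one arc in a toy lattice. The `Arc` type, the function names, and the example lattice are assumptions made for illustration; the sketch further assumes that states carry time stamps, so that all arcs leaving a state start at the same time.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Arc(NamedTuple):
    src: int     # source state
    dst: int     # target state
    word: str
    start: int   # start time in frames
    end: int     # end time in frames


def overlap(a: Arc, b: Arc) -> int:
    """Overlap in time o(a, b), measured in frames."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def overlapping_subpaths(a: Arc, lattice: List[Arc]) -> List[List[Arc]]:
    """Enumerate the maximal subpaths whose arcs all overlap with arc a."""
    outgoing = defaultdict(list)
    for b in lattice:
        outgoing[b.src].append(b)
    results = []

    def extend(path):
        nxt = [b for b in outgoing[path[-1].dst] if overlap(a, b) > 0]
        if not nxt:
            results.append(path)   # no overlapping successor: subpath is complete
        for b in nxt:
            extend(path + [b])

    for b in lattice:
        # A subpath starts with an overlapping arc that begins no later than a,
        # so that none of its predecessors can overlap with a.
        if overlap(a, b) > 0 and b.start <= a.start:
            extend([b])
    return results


# One long word competing with a "cloud" of short words:
# four consecutive positions with two alternatives each.
long_arc = Arc(0, 4, "international", 0, 4)
cloud = [Arc(t, t + 1, w, t, t + 1) for t in range(4) for w in ("a", "b")]
subpaths = overlapping_subpaths(long_arc, cloud)
print(len(subpaths))
```

With two alternatives at each of the four positions, the number of subpaths doubles with every position, which is the local exponential growth described above.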
In contrast to the type one cost functions, Bayes risk decoding with cost functions of the second type does not require the local expansion and guarantees an efficient computation of the Bayes risk hypothesis:

$$
\begin{aligned}
x_1^T \to \hat{r} &:= \min_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T) \sum_{l=1}^{L} \; \sum_{\substack{k=1: \\ o(a_l, b_k) > 0}}^{K} c(a_l, b_k) \\
&= \min_{a_1^L \in H} \sum_{l=1}^{L} \; \sum_{\substack{b \in E(S): \\ o(a_l, b) > 0}} \; \sum_{\substack{\phi_1^K \in S: \\ \exists k:\, \phi_k = b}} p(\phi_1^K \mid x_1^T)\, c(a_l, b) \\
&= \min_{a_1^L \in H} \sum_{l=1}^{L} \underbrace{\sum_{\substack{b \in E(S): \\ o(a_l, b) > 0}} c(a_l, b)\, p(b \mid x_1^T)}_{:= c(a_l; S)} \\
&= d_{\mathrm{trop}}\bigl(\mathrm{rescore}(H, c(\cdot; S))\bigr)
\end{aligned}
\tag{3.26}
$$
The time complexity of Equation (3.26) is in the worst case $O\bigl(|S(H)| + |S(S)| + |E(H)|\,|E(S)|\bigr)$. The arc-wise probabilities $p(a \mid x_1^T)$ for all arcs $a \in E(S)$ can be computed in time $O(|S(S)| + |E(S)|)$, because $S$ is acyclic. In the rescoring step, for each arc in the hypothesis space a sum over the posteriors of all arcs in the summation space is computed. Together with the subsequent Viterbi decoding step this yields the worst-case complexity. In the worst-case analysis the time overlap constraint cannot be taken into account. In practice, due to the time overlap constraint and provided the lattices are not too dense, the algorithm can be implemented such that the runtime grows almost linearly with the number of arcs. However, since efficient decoding for type two cost functions does not rely on the time locality constraint, the requirement can be declared optional. Indeed, in the following chapter a local cost function is investigated which is of type two but violates the time locality constraint: the cost function defined by the confusion network combination (CNC).

The two classes cover all cost functions which are commonly used in word error minimizing training and decoding approaches for LVCSR tasks; the topic is discussed further in Section 3.5. Instances of cost functions of type one and two and the resulting Bayes risk decoders are introduced and investigated in Chapter 4.
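The three steps behind Equation (3.26), arc posteriors, arc-wise rescoring, and shortest path, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: states are assumed to be numbered topologically with state 0 initial and the last state final, arc weights are plain probabilities, and `local_cost` is a hypothetical frame-overlap cost chosen only because it is of the second type.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    src: int
    dst: int
    word: str
    start: int      # start time in frames
    end: int        # end time in frames
    weight: float   # unnormalised arc probability


def overlap(a: Arc, b: Arc) -> int:
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def arc_posteriors(arcs: List[Arc], n_states: int) -> List[float]:
    """Forward-backward over an acyclic lattice with topologically numbered states."""
    fwd = [0.0] * n_states
    bwd = [0.0] * n_states
    fwd[0], bwd[-1] = 1.0, 1.0
    for a in sorted(arcs, key=lambda a: a.src):
        fwd[a.dst] += fwd[a.src] * a.weight
    for a in sorted(arcs, key=lambda a: -a.dst):
        bwd[a.src] += a.weight * bwd[a.dst]
    total = fwd[-1]
    return [fwd[a.src] * a.weight * bwd[a.dst] / total for a in arcs]


def local_cost(a: Arc, b: Arc) -> float:
    """Hypothetical type two cost: frames of a covered by a different word b."""
    return overlap(a, b) * (a.word != b.word)


def bayes_risk_decode(hyp: List[Arc], summ: List[Arc], n_hyp: int, n_sum: int):
    post = arc_posteriors(summ, n_sum)
    # Arc-wise rescoring: c(a; S) = sum over overlapping b of c(a, b) * p(b | x).
    cost = [sum(p * local_cost(a, b) for b, p in zip(summ, post) if overlap(a, b) > 0)
            for a in hyp]
    # Shortest path through the rescored hypothesis lattice.
    best = [(float("inf"), [])] * n_hyp
    best[0] = (0.0, [])
    for a, c in sorted(zip(hyp, cost), key=lambda ac: ac[0].src):
        cand = best[a.src][0] + c
        if cand < best[a.dst][0]:
            best[a.dst] = (cand, best[a.src][1] + [a.word])
    return best[-1]


# Toy lattice with three paths: "x p" (0.4), "y q" (0.3), "y r" (0.3).
lattice = [
    Arc(0, 1, "x", 0, 5, 0.4), Arc(0, 2, "y", 0, 5, 0.6),
    Arc(1, 3, "p", 5, 10, 1.0),
    Arc(2, 3, "q", 5, 10, 0.5), Arc(2, 3, "r", 5, 10, 0.5),
]
risk, path = bayes_risk_decode(lattice, lattice, 4, 4)
print(risk, path)
```

In this toy lattice the most probable single path is "x p", while the rescored shortest path starts with "y", because "y" collects the posterior mass of two paths. The double loop in the rescoring step corresponds to the $|E(H)|\,|E(S)|$ term of the worst-case complexity; a practical implementation would exploit the time overlap constraint to avoid visiting non-overlapping arcs.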
3.4 Confusion Network based System Combination in the Bayes Risk Decoding Framework

A confusion network (CN) is a sequence of word posterior probability distributions. The probabilities are derived from the sentence posteriors of a set of aligned word sequences. The CN can be interpreted as the sequence of alignment positions, where each position has a posterior distribution over all words aligned to that position. The alignment positions are often referred to as slots and the CN as a sequence of slots. The terminology presumably comes from the way CN construction algorithms work: each word is inserted into a slot.

A CN is completely described by a lattice $L$ and a function $\sigma: E(L) \to \mathbb{N}$ referred to as slot function. The slot function maps the lattice arcs to the CN slots, where for any two consecutive lattice arcs $a$ and $b$ it holds that $\sigma(a) < \sigma(b)$. In particular, the constraint guarantees that two arcs lying on the same path are not assigned to the same slot. The mapping is used to derive the slot-wise word posterior probabilities, which are computed according to Equation (3.18) for all words but the empty word. In order to guarantee a probability distribution, the probability for the empty word $\epsilon$ in slot $s$ is derived as

$$
p_s(\epsilon \mid x_1^T) = 1 - \sum_{w \neq \epsilon} p_s(w \mid x_1^T).
$$
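How a slot function turns arc posteriors into slot-wise distributions can be sketched as follows. The sketch assumes that the arc posteriors are already available and that, because no two arcs on one path share a slot, the slot-wise word posterior is the sum of the posteriors of all arcs with that word in that slot. The function name, the `<eps>` symbol, and the example values are hypothetical.

```python
from collections import defaultdict

EPS = "<eps>"  # hypothetical symbol for the empty word


def confusion_network(arc_words, arc_posteriors, slot_of, n_slots):
    """Build slot-wise word posterior distributions from arc posteriors.

    arc_words[i], arc_posteriors[i], slot_of[i] describe arc i:
    its word label, its posterior, and the slot assigned by the slot function.
    """
    slots = [defaultdict(float) for _ in range(n_slots)]
    for word, p, s in zip(arc_words, arc_posteriors, slot_of):
        slots[s][word] += p
    for dist in slots:
        # The empty word receives the probability mass not covered by any word.
        dist[EPS] = max(0.0, 1.0 - sum(dist.values()))
    return [dict(d) for d in slots]


# Toy example: every path passes slot 0, but paths with total
# posterior 0.2 skip slot 1.
words = ["the", "a", "cat", "hat"]
posts = [0.6, 0.4, 0.3, 0.5]
slots = [0, 0, 1, 1]
cn = confusion_network(words, posts, slots, n_slots=2)
print(cn)
```

The `max(0.0, ...)` guard only protects against small negative values caused by floating point rounding.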
A CN is defined as the ordered sequence of the slot-wise word posterior probability distributions and can be expressed as a word lattice without time stamps and with a sausage structure: all arcs leaving state $s_i$ end in state $s_{i+1}$. The CN in word lattice representation derived from lattice $L$ and slot function $\sigma(\cdot)$ is denoted by $\mathrm{CN}(L, \sigma(\cdot))$, or by $\mathrm{CN}(L)$ for an arbitrary slot function. From the construction it follows that the sentences accepted by $\mathrm{CN}(L)$ are a superset of the sentences accepted by the lattice itself. In this work only CNs derived from a lattice and a slot function are considered. Thus, a CN is always associated with a lattice and provides a unique mapping from the arcs of the lattice to the CN slots. Figure 3.2 visualizes the connection between lattice, slot function, and CN.

Figure 3.2. The figure shows a word lattice with time stamps at the states, a slot function, and the confusion network induced by the slot function.

The slot function of a CN defines an alignment between each pair of paths through the lattice: arcs assigned to the same slot compete with each other. CN construction algorithms aim at finding a slot function which covers the Levenshtein alignment between each pair of paths through the lattice. Instances of CN construction algorithms are introduced and discussed in the next chapter in Section 4.4.

The slot function can be used to define a local cost of the second type, which together with Equation (3.26) results in an efficient Bayes risk decoder. The cost is particularly simple to compute if the CN of the summation space lattice, $\mathrm{CN}(S)$, serves as hypothesis space. Any two word sequences $v_1^S$ and $w_1^S$ taken from $\mathrm{CN}(S)$ have equal length $S$, where $S$ is the number of slots in the CN. Making use of this property, the CN distance between $v_1^S$ and $w_1^S$ is given by

$$
c_{\mathrm{CN}}(v_1^S, w_1^S) = \sum_{s=1}^{S} \bigl(1 - \delta(v_s, w_s)\bigr).
\tag{3.27}
$$
Defining the appropriate rescoring function for the Bayes risk decoder, cf. Equation (3.26), using $\mathrm{CN}(S)$ as hypothesis space lattice $H$, and simplifying the resulting formula yields a simple decoding rule:

$$
x_1^T \to \hat{W}_1^S, \qquad \hat{W}_s := \operatorname*{argmax}_{w}\; p_s(w \mid x_1^T)
\tag{3.28}
$$

Furthermore, the usage of the CN as hypothesis space guarantees that the optimal hypothesis is included in $H$. Therefore $\hat{W}_1^S$ is the Bayes risk hypothesis and the Bayes risk itself is given by

$$
x_1^T \to \hat{r} = \sum_{s=1}^{S} \Bigl(1 - \max_{w}\; p_s(w \mid x_1^T)\Bigr).
\tag{3.29}
$$
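The decoding rule of Equation (3.28) and the slot-wise risk can be sketched directly. The CN below is a made-up example, the `<eps>` symbol for the empty word is an assumption, and the risk is accumulated as one minus the probability of the winning entry in each slot.

```python
def cn_decode(cn, eps="<eps>"):
    """Slot-wise CN decoding.

    cn is a list of dicts word -> posterior, one dict per slot.
    Returns the decoded word sequence (empty words removed) and the risk.
    """
    words, risk = [], 0.0
    for dist in cn:
        best = max(dist, key=dist.get)   # most probable entry of the slot
        risk += 1.0 - dist[best]         # expected error contributed by the slot
        if best != eps:
            words.append(best)
    return words, risk


# Hypothetical CN with three slots.
cn = [
    {"the": 0.6, "a": 0.4},
    {"cat": 0.3, "hat": 0.5, "<eps>": 0.2},
    {"sat": 0.45, "<eps>": 0.55},
]
hyp, risk = cn_decode(cn)
print(hyp, risk)
```

In the third slot the empty word wins, so the decoded sequence has only two words although the CN has three slots.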
The CN decoding rule was originally developed in [Mangu 2000], where the proofs of the above claims can also be found. The extension to the CN decoding of arbitrary hypothesis space lattices will be given in the next chapter in Section 4.4.

Confusion network based system combination can be done in two ways. The first way is to derive the slot function directly from the lattice intersection or the modified lattice union. An alternative way is to compute a slot function, and thus a CN, for each of the $J$ lattices and align the system-dependent slot sequences; the result of the alignment is again a CN, which can be decoded according to Equation (3.28). In the next section the common confusion network combination (CNC) algorithm proposed in [Evermann & Woodland 2000] is investigated in the Bayes risk decoding framework.
3.4.1 Confusion Network Combination (CNC)

In this section the confusion network combination (CNC) algorithm proposed in [Evermann & Woodland 2000] is derived by formulating the CN alignment problem in the Bayes risk decoding framework. Furthermore, it is shown that CNC computes a slot function over the lattice union, and thus CNC is nothing else but a Bayes risk decoding of the modified lattice union with a cost function of the second type.

The first step is to look at the alignment of two CNs derived from the two lattices $L_1$ and $L_2$. The corresponding slot-wise posterior probability distributions are denoted by $p_{1,n}(\cdot \mid x_1^T)$ and $p_{2,n}(\cdot \mid x_1^T)$. The alignment between the two CNs is defined on slot level and consists of pairs of slot numbers, where slot numbering starts from one:

$$
A := \bigl((k_1, l_1), (k_2, l_2), \ldots, (k_S, l_S)\bigr),
$$

where either $k_i < k_j$ for $i < j$ or $k_i = 0$, but not $k_i = l_i = 0$, and analogously for $l_i$. An alignment pair $(k, l)$ means that the $k$th slot of the first CN is aligned to the $l$th slot of the second CN. If $k = 0$ then the $l$th slot of the second CN is inserted, and vice versa. For convenience the posterior distribution $p_{\cdot,0}(\cdot \mid x_1^T)$ for the pseudo slot 0 is introduced; it equals one for the empty word and zero otherwise. The alignment can be used to build a new CN by averaging the slot-wise word posterior distributions of the aligned slots. Hence, the slot-wise word posterior probabilities for the combined CN are given by

$$
p_s(w \mid x_1^T) = p(1)\, p_{1,k_s}(w \mid x_1^T) + p(2)\, p_{2,l_s}(w \mid x_1^T).
\tag{3.30}
$$
On the other hand, it is easy to see that the combined CN defines a slot function over the lattice union $L_1 \cup L_2$: all arcs in $L_1$ which are assigned by the system-dependent slot function to slot $k_s$ are assigned to slot $s$ in the combined CN, and analogously for $L_2$ and slot $l_s$. Applying the slot function to the modified lattice union, cf. Equation (3.16), results in

$$
\begin{aligned}
p_s(w \mid x_1^T) &= \sum_{\substack{a_1^L \in L'_1 \cup L'_2, \\ \exists l:\, \sigma(a_l) = s \,\wedge\, i(a_l) = w}} p(a_1^L \mid x_1^T) \\
&= p(1) \sum_{\substack{a_1^L \in L'_1, \\ \exists l:\, \sigma(a_l) = k_s \,\wedge\, i(a_l) = w}} p_1(a_1^L \mid x_1^T) \;+\; p(2) \sum_{\substack{a_1^L \in L'_2, \\ \exists l:\, \sigma(a_l) = l_s \,\wedge\, i(a_l) = w}} p_2(a_1^L \mid x_1^T) \\
&= p(1)\, p_{1,k_s}(w \mid x_1^T) + p(2)\, p_{2,l_s}(w \mid x_1^T).
\end{aligned}
$$
That is, the difference between CNC and applying a CN construction algorithm directly to the modified lattice union is only in the resulting cost function. Both approaches define different local costs of the second type, where the CNC based cost function might violate the time overlap constraint.

The remaining question is how to find the CN alignment. The goal in CNC is to minimize the Bayes risk computed from the cost function defined by the combined CN. Obviously, the Bayes risk depends on the CN alignment. Let us denote the optimal alignment by $\hat{A}$. Using the definition of the CN distance, cf. Equation (3.27), the resulting optimization problem is defined as

$$
\begin{aligned}
\hat{A} &:= \operatorname*{argmin}_{A} \min_{v_1^S, S} \sum_{w_1^S} p_A(w_1^S \mid x_1^T)\, c_{\mathrm{CN}}(v_1^S, w_1^S) \\
&= \operatorname*{argmin}_{A} \min_{v_1^S, S} \sum_{w_1^S} p_A(w_1^S \mid x_1^T) \sum_{s=1}^{S} \bigl(1 - \delta(v_s, w_s)\bigr).
\end{aligned}
\tag{3.31}
$$
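Inserting Equation (3.30) into the optimization problem yields

$$
\begin{aligned}
\hat{A} &:= \operatorname*{argmin}_{A} \min_{v_1^S, S} \sum_{w_1^S} p_A(w_1^S \mid x_1^T) \sum_{s=1}^{S} \bigl(1 - \delta(v_s, w_s)\bigr) \\
&= \operatorname*{argmin}_{A} \min_{v_1^S, S} \sum_{w_1^S, S} \Biggl[\, p(1)\, p_1(w_1^S \mid x_1^T) \sum_{s=1}^{S} \bigl(1 - \delta(v_s, w_{k_s})\bigr) + p(2)\, p_2(w_1^S \mid x_1^T) \sum_{s=1}^{S} \bigl(1 - \delta(v_s, w_{l_s})\bigr) \Biggr] \\
&= \operatorname*{argmin}_{A} \min_{v_1^S} \sum_{s=1}^{S} \Bigl[\, 1 - \bigl( p(1)\, p_{1,k_s}(v_s \mid x_1^T) + p(2)\, p_{2,l_s}(v_s \mid x_1^T) \bigr) \Bigr].
\end{aligned}
\tag{3.32}
$$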
Equation (3.32) can be solved efficiently by dynamic programming, similar to the computation of the Levenshtein distance, but with CN slots being aligned instead of words. The local cost function for the dynamic programming is given by

$$
c(k, l) := 1 - \max_{w} \bigl\{\, p(1)\, p_{1,k}(w \mid x_1^T) + p(2)\, p_{2,l}(w \mid x_1^T) \,\bigr\}.
$$
The extension of the algorithm to the simultaneous alignment of multiple CNs is straightforward, but expensive. In practice, the common way is to approximate the multiple alignment by a sequence of pairwise alignments: CN 1 and 2 are aligned, the result is aligned to CN 3, and so on. As a rule of thumb the CNs are sorted according to their error rate, least error first.
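The pairwise alignment can be sketched as a Levenshtein-style dynamic program over slots that uses the local cost $c(k, l)$ and the pseudo slot 0. The following is a minimal sketch under stated assumptions: CNs are lists of word-to-posterior dicts, `<eps>` denotes the empty word, the system weights $p(1)$ and $p(2)$ are passed as `w1` and `w2`, and all function names and example values are hypothetical.

```python
EPS = "<eps>"
NULL_SLOT = {EPS: 1.0}   # pseudo slot 0: all mass on the empty word


def combine(slot1, slot2, w1, w2):
    """Weighted average of two slot distributions, cf. Equation (3.30)."""
    return {w: w1 * slot1.get(w, 0.0) + w2 * slot2.get(w, 0.0)
            for w in set(slot1) | set(slot2)}


def local_cost(slot1, slot2, w1, w2):
    return 1.0 - max(combine(slot1, slot2, w1, w2).values())


def cnc_align(cn1, cn2, w1=0.5, w2=0.5):
    """Align two CNs; returns (alignment, combined CN, total cost)."""
    K, L = len(cn1), len(cn2)
    cost = [[float("inf")] * (L + 1) for _ in range(K + 1)]
    back = [[None] * (L + 1) for _ in range(K + 1)]
    cost[0][0] = 0.0
    for k in range(K + 1):
        for l in range(L + 1):
            steps = []
            if k > 0 and l > 0:   # slot k aligned to slot l
                steps.append((cost[k - 1][l - 1]
                              + local_cost(cn1[k - 1], cn2[l - 1], w1, w2), (k, l)))
            if k > 0:             # slot k of CN 1 aligned to the pseudo slot
                steps.append((cost[k - 1][l]
                              + local_cost(cn1[k - 1], NULL_SLOT, w1, w2), (k, 0)))
            if l > 0:             # slot l of CN 2 aligned to the pseudo slot
                steps.append((cost[k][l - 1]
                              + local_cost(NULL_SLOT, cn2[l - 1], w1, w2), (0, l)))
            for c, pair in steps:
                if c < cost[k][l]:
                    cost[k][l], back[k][l] = c, pair
    # Trace back the alignment pairs from the end to the start.
    alignment, k, l = [], K, L
    while (k, l) != (0, 0):
        pk, pl = back[k][l]
        alignment.append((pk, pl))
        k -= 1 if pk else 0
        l -= 1 if pl else 0
    alignment.reverse()
    combined = [combine(cn1[k - 1] if k else NULL_SLOT,
                        cn2[l - 1] if l else NULL_SLOT, w1, w2)
                for k, l in alignment]
    return alignment, combined, cost[K][L]


cn_a = [{"the": 0.9, EPS: 0.1}, {"cat": 0.8, "hat": 0.2}]
cn_b = [{"the": 0.7, "a": 0.3}, {"big": 0.6, EPS: 0.4}, {"cat": 0.9, "cap": 0.1}]
alignment, combined, total = cnc_align(cn_a, cn_b)
print(alignment, total)
```

In this example the second slot of the second CN has no counterpart in the first CN and is aligned to the pseudo slot. For more than two CNs the function could be applied repeatedly to the combined result, which corresponds to the sequence of pairwise alignments described above; the weights would then have to be adjusted so that each system keeps its intended share.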
3.4.2 ROVER: An Approximation of CNC

Recognizer Output Voting Error Reduction (ROVER) is a system combination approach working on single-best results [Fiscus 1997]. ROVER is a simple but powerful approach to system combination, especially in combination with confidence scores; see for example [Hoffmeister & Hillard+ 2007] for a comparison with CNC and a frame error based system combination approach.

ROVER aligns and decodes the single-best results from $J$ systems. A single-best output can be interpreted as a CN with a single entry per slot, and thus ROVER can be interpreted as a combination of $J$ CNs. In ROVER with majority voting the assumption is made that $p_j(w_1^N \mid x_1^T) = 1$ and thus $p_{j,n}(w \mid x_1^T) = 1$, where $w_1^N$ is the system-dependent single-best output for system $j$. The decoding happens analogously to CNC: per slot the word with the highest averaged word posterior probability is chosen. That is, per slot the word wins for which the most systems voted. However, the assumption is usually wrong and the better model is the CN, which provides a slot-wise posterior distribution over all words. In fact, CNs are a common basis for computing word-wise confidence scores for LVCSR systems [Evermann & Woodland 2000; Hillard & Ostendorf 2006].

ROVER with confidence scores can now be derived from CNC by regarding the single-best hypothesis as the result of a slot-wise pruning of the system-dependent CNs: in each slot of the system-dependent CNs only the entry with the highest probability survives. That is, per slot and system only a single word is considered, but the word posterior probability is taken from the CN slot. The standard implementation of ROVER² always makes the assumption that $p_{j,n}(w \mid x_1^T)$ equals one for computing the alignment. In the subsequent slot-wise decoding step of the resulting CN either majority or confidence voting is applied.
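The difference between majority and confidence voting in the slot-wise decoding step can be sketched on already aligned single-best outputs. This follows the CN interpretation given above and is not the scoring formula of the NIST implementation; the function name, the `null_conf` parameter for deletions, the system weights, and the example values are assumptions for illustration.

```python
from collections import defaultdict


def rover_vote(aligned, weights, use_confidence, null_conf=0.5):
    """Slot-wise voting over aligned single-best outputs.

    aligned[j][s] is a (word, confidence) pair of system j in slot s,
    or None if system j has no word in that slot (deletion).
    """
    result = []
    for s in range(len(aligned[0])):
        score = defaultdict(float)
        for j, hyp in enumerate(aligned):
            if hyp[s] is None:
                word, conf = None, null_conf
            else:
                word, conf = hyp[s]
            # Majority voting treats every single-best entry as probability one.
            score[word] += weights[j] * (conf if use_confidence else 1.0)
        best = max(score, key=score.get)
        if best is not None:
            result.append(best)
    return result


# Three hypothetical systems, already aligned into two slots.
aligned = [
    [("the", 0.9), ("cat", 0.4)],
    [("the", 0.8), ("hat", 0.95)],
    [("a", 0.3), ("cat", 0.45)],
]
weights = [1 / 3, 1 / 3, 1 / 3]
majority = rover_vote(aligned, weights, use_confidence=False)
confidence = rover_vote(aligned, weights, use_confidence=True)
print(majority, confidence)
```

In the second slot two systems vote for "cat" with low confidence and one system votes for "hat" with high confidence, so the two voting schemes disagree. With only two equally weighted systems and majority voting every disagreement is a tie, and in this sketch the tie is resolved in favor of the first system.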
3.4.3 Results

In this section experimental results for the CN decoder and the different ways of CN based system combination are given. The CNs are computed with the arc-cluster algorithm introduced in the next chapter in Section 4.4.2, which is the default CN construction algorithm in the RWTH Aachen system. Experiments are presented for the Chinese 230h testing system, the English EPPS 2007 evaluation system, and the English EPPS 2007 evaluation cross-site combination. A detailed description of the
Table 3.4. Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates (CER [%]), given as (del/ins) err; the bracketed numbers show the deletion and insertion fraction.

| Decoder | System | Combination | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|---|
| Viterbi | s1 | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| | s2 | | (2.65/1.70) 14.82 | (4.44/0.93) 15.02 | (2.71/0.94) 13.54 |
| | s3 | | (2.65/1.64) 15.07 | (4.57/1.04) 15.60 | (2.84/0.93) 13.80 |
| | s1+s2 | ROVER w/o confidences | (2.66/1.57) 14.54 | (4.44/0.90) 15.13 | (2.86/0.85) 13.32 |
| | | ROVER w/ confidences | (2.49/1.59) 13.63 | (4.30/0.91) 14.09 | (2.64/0.94) 12.61 |
| | s1+s2+s3 | ROVER w/o confidences | (2.74/1.35) 13.55 | (4.59/0.75) 14.16 | (2.89/0.75) 12.61 |
| | | ROVER w/ confidences | (2.70/1.34) 13.22 | (4.55/0.74) 13.86 | (2.89/0.76) 12.47 |
| CN | s1 | | (2.79/1.45) 14.30 | (4.53/0.85) 14.96 | (2.85/0.80) 13.05 |
| | s2 | | (2.90/1.50) 14.52 | (4.62/0.81) 14.74 | (2.88/0.79) 13.35 |
| | s3 | | (2.97/1.48) 14.86 | (4.74/0.92) 15.42 | (3.01/0.85) 13.67 |
| | s1+s2 | union | (3.05/1.29) 13.54 | (4.69/0.73) 14.01 | (3.01/0.73) 12.54 |
| | | CNC | (2.93/1.34) 13.56 | (4.66/0.76) 13.99 | (2.93/0.74) 12.50 |
| | s1+s2+s3 | union | (2.88/1.24) 13.13 | (4.77/0.67) 13.73 | (3.01/0.73) 12.30 |
| | | CNC | (2.87/1.29) 13.17 | (4.68/0.70) 13.70 | (2.92/0.72) 12.21 |

¹ tuning set
systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments the acoustic and language model scales and the system weights in the union based combination and in the CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7. For ROVER the confidence score for making a deletion (also known as null confidence) is included in the optimization.

The first set of experiments compares Viterbi and CN decoding for a single system. The results in Table 3.4, Table 3.5, and Table 3.6 are consistent: for all systems, languages, and setups the CN decoder shows a small but consistent improvement of around 0.2% absolute over the Viterbi decoder.

The first set of combination experiments is done with ROVER with majority and confidence voting. The Viterbi hypotheses of the system-dependent lattices are combined and the confidence scores are derived from frame-wise word posterior probabilities according to [Wessel & Schlüter+ 2001a]. In preliminary experiments the ROVER combination of the system-dependent CN decoding results and CN based confidences were tested, but no significant differences in the results were observed. The ROVER combination gives a large improvement of up to 10% relative for the Chinese testing and the English evaluation system, and more than 20% relative for the English cross-site combination, compared to the best Viterbi result. The experimental results show that ROVER benefits from the confidence scores. ROVER also benefits from adding more systems: in all setups adding more systems further decreased the error rate. Note that ROVER with majority voting is not a suitable choice for two systems: the ROVER implementation will always take the word hypothesis from the first system.

In the further experiments CN based system combination is investigated. The CN decoding of the modified lattice union is compared with the CNC approach. For the Chinese testing and the English evaluation system both approaches show almost identical performance, but for the cross-site combination a small advantage for CNC is observed. Presumably, the advantage for CNC comes from the independence of the CNC algorithm from word boundaries. The word boundaries are needed to build the CNs, but no longer in the CN combination itself. In the Chinese testing and the English evaluation system

² The NIST ROVER implementation is part of the NIST Scoring Toolkit (SCTK), which is publicly available at http://www.itl.nist.gov/iad/mig/tools/.
Table 3.5. Results for the English EPPS 2007 evaluation system, cf. Section B.2.1. Results are word error rates (WER [%]), given as (del/ins) err; the bracketed numbers show the deletion and insertion fraction.

| Decoder | System | Combination | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|---|
| Viterbi | s1 | | (1.65/2.21) 11.09 | (1.38/1.36) 8.43 | (1.86/1.31) 9.81 |
| | s2 | | (1.77/2.28) 11.89 | (1.67/1.23) 8.70 | (2.12/1.31) 10.07 |
| | s3 | | (2.06/2.29) 12.43 | (1.80/1.30) 8.98 | (2.22/1.34) 10.76 |
| | s4 | | (2.04/2.18) 12.06 | (1.85/1.38) 9.44 | (2.68/1.42) 11.73 |
| | s1+s2 | ROVER w/o confidences | (1.65/2.20) 11.07 | (1.38/1.36) 8.41 | (1.85/1.30) 9.80 |
| | | ROVER w/ confidences | (1.97/1.70) 10.54 | (1.75/0.93) 7.90 | (2.28/0.95) 9.11 |
| | s1+s2+s3 | ROVER w/o confidences | (1.81/1.91) 10.90 | (1.49/1.13) 7.91 | (1.99/1.09) 9.32 |
| | | ROVER w/ confidences | (2.05/1.57) 10.42 | (1.79/0.87) 7.73 | (2.40/0.89) 9.17 |
| | s1+s2+s3+s4 | ROVER w/o confidences | (1.77/1.93) 10.92 | (1.45/1.17) 7.81 | (1.97/1.11) 9.28 |
| | | ROVER w/ confidences | (1.82/1.91) 10.70 | (1.47/1.08) 7.67 | (2.06/1.08) 9.15 |
| CN | s1 | | (1.90/1.92) 10.73 | (1.55/1.12) 8.22 | (2.09/1.16) 9.57 |
| | s2 | | (2.14/1.90) 11.42 | (1.90/1.08) 8.61 | (2.40/1.07) 9.78 |
| | s3 | | (2.29/1.98) 11.97 | (1.90/1.14) 8.83 | (2.47/1.15) 10.48 |
| | s4 | | (2.31/1.94) 11.87 | (2.09/1.17) 9.31 | (2.96/1.29) 11.57 |
| | s1+s2 | union | (2.02/1.56) 10.21 | (1.73/0.94) 7.79 | (2.25/0.93) 8.97 |
| | | CNC | (1.94/1.62) 10.22 | (1.66/0.99) 7.82 | (2.17/0.96) 8.98 |
| | s1+s2+s3 | union | (2.03/1.59) 10.21 | (1.74/0.94) 7.73 | (2.26/0.95) 8.96 |
| | | CNC | (1.95/1.60) 10.14 | (1.67/0.96) 7.70 | (2.22/0.95) 8.98 |
| | s1+s2+s3+s4 | union | (2.03/1.64) 10.33 | (1.70/0.96) 7.59 | (2.29/0.95) 8.94 |
| | | CNC | (1.88/1.65) 10.22 | (1.60/0.97) 7.59 | (2.18/0.91) 8.92 |

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign
Table 3.6. Results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates (WER [%]), given as (del/ins) err; the bracketed numbers show the deletion and insertion fraction.

| Decoder | System | Combination | eval06¹ | eval07 |
|---|---|---|---|---|
| Viterbi | LIMSI | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13 |
| | RWTH | | (1.47/1.33) 8.46 | (1.91/1.26) 9.71 |
| | UKA | | (1.76/1.31) 8.80 | (2.00/1.28) 10.22 |
| | IRST | | (2.35/1.40) 10.09 | (2.48/1.14) 9.81 |
| | LIMSI+RWTH | ROVER w/o confidences | (1.50/1.24) 7.87 | (1.70/1.20) 9.06 |
| | | ROVER w/ confidences | (1.63/0.91) 6.69 | (2.13/0.87) 7.85 |
| | LIMSI+RWTH+UKA | ROVER w/o confidences | (1.35/0.84) 6.58 | (1.86/0.78) 8.01 |
| | | ROVER w/ confidences | (1.43/0.76) 6.32 | (2.00/0.70) 7.77 |
| | LIMSI+RWTH+UKA+IRST | ROVER w/o confidences | (1.36/0.78) 6.38 | (1.82/0.79) 7.67 |
| | | ROVER w/ confidences | (1.37/0.79) 6.21 | (1.77/0.73) 7.26 |
| CN | LIMSI | | (1.65/1.33) 8.07 | (1.76/1.18) 8.96 |
| | RWTH | | (1.55/1.13) 8.24 | (2.07/1.15) 9.54 |
| | UKA | | (1.83/1.39) 8.98 | (2.08/1.33) 10.36 |
| | IRST | | (2.35/1.39) 10.06 | (2.47/1.13) 9.82 |
| | LIMSI+RWTH | union | (1.63/0.77) 6.46 | (2.17/0.71) 7.67 |
| | | CNC | (1.45/0.80) 6.38 | (1.88/0.75) 7.51 |
| | LIMSI+RWTH+UKA | union | (1.51/0.79) 6.38 | (2.04/0.77) 7.63 |
| | | CNC | (1.47/0.72) 6.27 | (1.87/0.68) 7.24 |
| | LIMSI+RWTH+UKA+IRST | union | (1.61/0.73) 6.28 | (2.19/0.67) 7.36 |
| | | CNC | (1.45/0.71) 6.14 | (1.87/0.69) 7.12 |

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign
Table 3.7. The table summarizes common approaches to lattice-based system combination. The methods are classified according to a) the lattice combination method and b) the decoder. The lattices are either combined via an intersection (or a theoretically equivalent lattice rescoring) or by building the lattice union. The decoder is either the Viterbi decoder, which is an approximation of the Bayes risk decoder with the sentence error as loss function, or the Bayes risk decoder with a local cost function as loss function. The local cost functions are of the second type for all methods but Povey’s MPE, which is of the first type.

| Combination | Decoder: Viterbi | Decoder: Bayes risk with local cost |
|---|---|---|
| intersection/rescoring | DMC | DMC + CN decoding |
| union | | CN, CNC, ROVER, N-best ROVER, frame error, Povey’s MPE |
all lattices are produced with the same decoder and thus all lattices have the same bias in their time stamps. On the other hand, for systems from different sites the bias is usually different [Baghai-Ravary & Kochanski+ 2009]. In conclusion, the advantage of CNC is that word boundaries are only used within a system, whereas for the CN decoding of the modified lattice union time stamps are compared across systems. This explains why a significant performance gap between the two approaches is only observed for the cross-site combination.

Like ROVER, both CN based system combination approaches benefit from adding more systems. In a direct comparison ROVER performs only slightly worse than CNC. While on the tuning set the performance is almost equal, ROVER seems to have a tendency to overfit on the test corpora. However, the comparison of the ROVER and CNC results indicates that in the CNC only very few word hypotheses per slot are eventually involved in the decision making. The ROVER and CN based combination methods clearly outperform the Viterbi or MAP decoding of the lattice intersection, cf. Section 3.3.2.
3.5 The Lattice Combination Framework vs. StateoftheArt in System Combination In the last three sections several methods to lattice and CN decoding and combination were discussed. As a result two decoding approaches and two combination methods were identified which allow to efficiently combine and decode lattices, directly or via CNs. In particular it was shown that CN combination and decoding can be implemented as a latticebased Bayes risk decoder with a CN based cost function. The result is a separation of the computation of the sentence posterior probabilities from the decoding process. This applies not only to CN based decoding approaches, but to a wide class of combination methods including the common approaches to system combination. The key result of the framework developed in this work is the separation of the computation of the sentence posterior probabilities from the decoding process. For a wide class of combination methods the probability computation is only driven by the way the lattices are combined: intersection or union. The choice of the lattice combination is independent of the decoder. Furthermore, the common decoders applied in latticebased system combination can be partitioned into two classes: the Viterbi decoders (the maximum approximation of the Bayes risk with the sentence error as loss function) and the Bayes risk decoders with a local cost function, e.g. CNC. For both classes of decoders efficient implementations exist. The common approaches to latticebased system combination can now be classified within the framework as shown in Table 3.7. Note that the lower left cell is empty, because the decoding of the lattice union with the Viterbi decoder is nonsense as pointed out in Section 3.2.3. The following list gives a short overview of the different methods. • DMC. In the discriminative model combination all knowledge sources are combined into a single loglinear model. 
The lattice scores can either be determined by intersecting the system-dependent lattices, cf. Section 3.2.2, or by rescoring the arcs in a given base lattice with all models, cf. [Beyerlein 1997; Vergyri 2000; Zolnay & Schlüter+ 2005] and also Chapter 7. The DMC approach was successfully used, for example, in the Philips/RWTH broadcast news system [Beyerlein & Aubert+ 1999], but was eventually superseded by the more flexible ROVER and CNC methods.
• DMC + CN decoding. All previously published work on DMC applied the Viterbi decoder. In [Hoffmeister & Liang+ 2009] and in this work, cf. Chapter 7, DMC is combined with CN decoding and compared to the CN decoding of the lattice union.
• CN, CNC, ROVER, N-best ROVER. These are the most popular methods for lattice decoding (besides Viterbi) and for lattice-based system combination [Evermann & Woodland 2000; Fiscus 1997; Mangu & Brill+ 1999; Stolcke & Bratt+ 2000]. Although several years old, these methods are still the combination approaches of choice for state-of-the-art LVCSR systems, see for example [Hsiao & Fuhs+ 2008; Huang & Marcheret+ 2009; Ng & Zhang+ 2008; Vergyri & Mandal+ 2008]. All four methods can be interpreted as a CN decoder applied to the modified lattice union, cf. Section 3.2.3 and Section 3.4.1. The methods differ in the way the CN is derived from the lattice union and in the summation and hypothesis space. A special case is ROVER with confidence scores, which can be regarded as an approximation of CNC where the system-dependent CNs are pruned to a single entry per CN slot. N-best ROVER is conceptually closer to CNC than to ROVER: the system-dependent lattices are heavily pruned and converted into N-best lists. This makes it possible to apply a CN construction algorithm to the system-dependent N-best lists which is less heuristic than the construction algorithms computing the CN directly from the lattice. The system-dependent CNs are then aligned and decoded as in CNC. That is, N-best ROVER is eventually CNC with a different cost function and a heavily restricted hypothesis and summation space.
The construction of a CN from a lattice is discussed in the next chapter in Section 4.4.
• Frame Error. The frame error and frame error based cost functions will be introduced and discussed in detail in the next chapter in Section 4.2. The idea is to count errors on a frame basis instead of on a word basis. The results are cost functions of the second type, i.e. the corresponding Bayes risk decoding rule can be computed efficiently. Experimental results show a strong connection between frame and word error, cf. [Wessel & Schlüter+ 2001c], which motivates the usage of frame error based costs as an approximation for the Levenshtein distance. The approaches to lattice combination presented in [Hoffmeister & Klein+ 2006] and [Chen & Lee 2006] are Bayes risk decoders with frame error based costs applied to the modified lattice union. The frame error based approach to system combination is also successfully used in state-of-the-art LVCSR systems, see for example [Plahl & Hoffmeister+ 2008a].
• Povey’s MPE. Povey’s MPE refers to the cost function used in [Povey & Woodland 2002] for a variant of discriminative acoustic model training which aims at minimizing the expected phoneme error. The same cost can be defined on word instead of phoneme level. The cost is of the first type as it lacks locality with respect to the reference. This and other cost functions of the first type are discussed in detail in the next chapter in Section 4.3. The cost was applied to Bayes risk decoding in [Xu & Povey+ 2009] and also to system combination in [Hoffmeister & Schlüter+ 2009].
Another approach to system combination frequently used in state-of-the-art LVCSR systems is cross-adaptation, cf. Section 1.9.3. Cross-adaptation is applied in the speaker adaptation step of the speech decoder and thus does not fit into the framework developed in this chapter. However, it can be stacked with the methods investigated in this work. Some cross-adaptation results are given in the appendix in Section C.2.
3.6 Lattice Pre-Processing for Bayes Risk Decoding and System Combination
A crucial step in Bayes risk decoding and system combination is the pre-processing. Vocabularies from different sites usually differ in their spelling and abbreviations, or simply use different encodings. LVCSR systems for Chinese are a special case: most state-of-the-art recognizers produce word level lattices, for example [Lei & Wu+ 2009; Plahl & Hoffmeister+ 2009], but the objective for Chinese LVCSR systems is the character error rate (CER) and not the word error rate (WER). For a Viterbi decoder, i.e. regarding the sentence error, this does not make a difference, but it does for a Bayes risk decoder which aims at minimizing the Levenshtein distance defined on character level. In the latter case the pre-processing includes the transformation of the word lattice into a character lattice. The next section discusses the normalization topics in detail. Lattice pruning reduces the size of the lattice and thus speeds up the decoding. Some algorithms, like the determinization, which has an exponential worst-case complexity, require a preceding pruning step to become computationally feasible. Lattice pruning is discussed in Section 3.6.2. If posterior probabilities are derived from lattices, the lattices require a pre-processing step which makes the probabilities comparable. Reasons and solutions for distorted posteriors are discussed in Section 3.6.2 and Section 3.6.3.
3.6.1 Lattice Normalization
In languages like English many words have different, but equally correct spellings, e.g. American vs. British English. While it is easy to agree on the spelling of a single word, the situation becomes ambiguous for expressions like “Tony’s”, which can indeed mean “Tony’s”, but can also be short for “Tony is” or “Tony has”. The NIST scoring tools³, the de facto standard evaluation tools for LVCSR tasks, allow all three alternatives in the computation of the error rate. However, simply substituting a lattice arc labeled with “Tony’s” by all three alternatives would change the posterior probability distribution defined by the lattice. Reweighting the alternatives solves the problem, but requires the estimation of appropriate weights. An approximation to the reweighting is to simply choose the most frequent alternative. This is the solution used throughout the experiments presented in this work, where the frequencies are computed from the training set. Other important normalizations include hyphens, as in “word-level” vs. “word level”, and abbreviations like “AM” vs. “A.M.” vs. “A. M.” Here, the solution used throughout the work is to expand a word or abbreviation to the alternative with the maximum number of tokens, which increases the probability of partial matches.
In Chinese LVCSR systems the objective is the character error rate (CER). Nevertheless, many systems like the RWTH Aachen system produce word level lattices. For decoding in the Bayes risk framework with the Levenshtein distance on character level as loss function, the word lattice arcs are split into character arcs.
All normalizations described so far are one-to-one or one-to-many mappings. Applied to a lattice they result in an arc mapping or an arc splitting. After an arc split new time stamps have to be estimated for the resulting subarcs, for which two algorithms are tested:
1. Approximate word boundaries. The duration of the arc is distributed over the subarcs according to the number of phonemes or characters per subword. The number of characters approximates the number of phonemes per subword and is used if the pronunciations, i.e. the phoneme sequences, for the subwords are not known. For the conversion of Chinese word lattices to character lattices the algorithm described in [Hoffmeister & Plahl+ 2007] is used for all experimental results presented in this work. The algorithm simply distributes the word arc duration uniformly among the character arcs.
2. Recognizer word boundaries. The word boundaries are derived from a forced acoustic alignment of the subwords. Computing the forced alignment is much more expensive than the approximate word boundary method and requires pronunciations for all subwords. Pronunciations and even acoustic models are not always available, especially when lattices are shared across several sites.

³ The NIST Scoring Toolkit (SCTK) is publicly available at http://www.itl.nist.gov/iad/mig/tools/.

[Figure 3.3: two word lattices, a) before and b) after filtering. The arcs carry labels of the form label/number, e.g. have/5, have/6, move/8, {si}/1, {si}/3, {b}/1, {b}/2, and {cough}/3.]
Figure 3.3. Illustration of the nonspeech cloud filter applied to a word lattice. In figure a) four paths connect the leftmost and the rightmost state, three of them starting with “have” and continuing with nonspeech arcs marked as “{·}”. These three paths define a nonspeech cloud and the nonspeech cloud filter removes all but the best scoring path through the cloud. The filter result is shown in figure b).
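The approximate word boundary method can be sketched in a few lines. The following is a minimal illustration, not the implementation from [Hoffmeister & Plahl+ 2007]; the function name `split_arc` and the representation of an arc as start and end time are assumptions made for this example. By default the weights are the character counts of the subwords, which reduces to a uniform split for single-character subwords.

```python
from typing import List, Optional, Tuple


def split_arc(start: float, end: float, subwords: List[str],
              weights: Optional[List[float]] = None) -> List[Tuple[str, float, float]]:
    """Split the time span [start, end] of one arc into consecutive subarcs.

    Hypothetical helper: the duration is distributed proportionally to
    `weights` (e.g. phoneme counts); if no weights are given, the number
    of characters per subword is used as an approximation.
    """
    if not subwords:
        raise ValueError("need at least one subword")
    if weights is None:
        weights = [float(len(s)) for s in subwords]
    total = sum(weights)
    duration = end - start
    arcs = []
    t, acc = start, 0.0
    for i, (sub, w) in enumerate(zip(subwords, weights)):
        acc += w
        # The last subarc ends exactly at the original arc end.
        nxt = end if i == len(subwords) - 1 else start + duration * acc / total
        arcs.append((sub, t, nxt))
        t = nxt
    return arcs


# A three-character word arc spanning frames 10 to 40 is split uniformly.
char_arcs = split_arc(10.0, 40.0, ["x", "y", "z"])
```

Passing phoneme counts as `weights` gives the phoneme-based variant of the same scheme.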
3.6.2 Lattice Pruning
Lattice pruning aims at removing unlikely paths from the lattice; the de facto standard is the forward/backward pruning described in [Sixtus & Ortmanns 1999]. The main motivation for lattice pruning is the reduction of the lattice size with the goal of reducing the memory and runtime requirements of lattice processing algorithms. Especially for algorithms with an exponential worst-case complexity, like the determinization, a preceding lattice pruning can become mandatory. The posterior probabilities over a pruned lattice are usually sharper than the posteriors from the unpruned base lattice, because unlikely hypotheses are removed from the probability distribution. That is, the comparability of the posteriors derived from two lattices depends, among other factors, on the lattice density. Thus, for lattice-based system combination the densities of the individual lattices should be in a similar range. For the system combination experiments presented in this work all lattices are pruned to the same density, where the density is computed according to Equation (1.11). A typical density for Bayes risk decoding and system combination tasks is between 30 and 100, whereas the lattices produced by a Viterbi decoder can have a density of several hundred up to several thousand. The bias in the system-dependent posteriors is investigated further in Chapter 5.
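The sharpening effect of pruning on the posteriors can be illustrated on a toy example. The sketch below is a simplification: it works on a flat list of sentence hypotheses with negative log scores instead of a lattice, and the beam-style function `prune` only mimics the idea of forward/backward pruning; both function names are assumptions for this illustration.

```python
import math
from typing import Dict


def posteriors(costs: Dict[str, float]) -> Dict[str, float]:
    """Normalize negative log scores (costs) into posterior probabilities."""
    best = min(costs.values())
    weights = {h: math.exp(-(c - best)) for h, c in costs.items()}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}


def prune(costs: Dict[str, float], beam: float) -> Dict[str, float]:
    """Keep only hypotheses whose cost is within `beam` of the best one."""
    best = min(costs.values())
    return {h: c for h, c in costs.items() if c - best <= beam}


full = {"hyp a": 0.0, "hyp b": 1.0, "hyp c": 3.0}
p_full = posteriors(full)
p_pruned = posteriors(prune(full, beam=1.5))
# The surviving hypotheses receive more probability mass after pruning.
```

Because the removed mass is redistributed over the survivors, two lattices pruned with very different thresholds yield posteriors on different scales, which is the comparability issue described above.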
3.6.3 The Non-Word Cloud Bias
Some systems use several models for nonspeech events, e.g. articulatory noise and stationary noise. If the acoustics of the different nonspeech events are similar and no other control of the occurrence of the nonspeech events is applied, like including them in the language model, then so-called “non-word clouds” appear in the lattices produced by the decoder. Due to the similarity of the models all non-word events are hypothesized in parallel with similar scores, and if they survive the pruning they appear as clouds in the lattice. The clouds do not harm the Viterbi result, but they influence the posteriors derived from the lattice: the posterior probabilities for words lying on paths which go through these clouds are overestimated [Hoffmeister & Klein+ 2006; Wessel & Schlüter+ 2001b]. The clouds can be removed from a lattice by applying an appropriate filter as described in [Hoffmeister & Klein+ 2006]. Figure 3.3 illustrates the function of the filter. In Figure 3.3 a) two arcs labeled with “have” start from the leftmost node and both arcs are followed by nonspeech events. From all the alternative paths starting with one of the “have” arcs and ending in the rightmost node, only a single one shall survive.
Table 3.8. Results for the Chinese 230h testing system, cf. Section B.1.1. Word-level vs. character-level decoding and approximated vs. exact character boundaries. Results are character error rates (CER[%]); the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

                                        CER[%]: (del/ins) err
System      Combination/Decoder         dev07¹               eval07               dev08
baseline                                (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
word level
s1+s2+s3    ROVER w/ confidences        (2.90/1.43) 13.38    (4.76/0.82) 14.03    (2.94/0.85) 12.55
            union/CN                    (3.42/1.33) 13.41    (5.25/0.75) 13.99    (3.36/0.77) 12.43
            CNC                         (3.14/1.40) 13.32    (5.05/0.85) 13.99    (3.11/0.84) 12.39
character level, approximated char. boundaries
s1+s2+s3    ROVER w/ confidences        (2.69/1.35) 13.24    (4.52/0.76) 13.89    (2.85/0.76) 12.47
            union/CN                    (2.98/1.27) 13.20    (4.80/0.71) 13.73    (3.06/0.73) 12.28
            CNC                         (2.86/1.29) 13.16    (4.70/0.69) 13.71    (2.93/0.74) 12.26
character level, char. boundaries from forced alignment
s1+s2+s3    ROVER w/ confidences        (2.70/1.34) 13.22    (4.55/0.74) 13.86    (2.89/0.76) 12.47
            union/CN                    (2.88/1.24) 13.13    (4.77/0.67) 13.73    (3.01/0.73) 12.30
            CNC                         (2.87/1.29) 13.17    (4.68/0.70) 13.70    (2.92/0.72) 12.21
¹ tuning set
For all nodes in the “non-word cloud”, all incoming arcs but the best scoring one are discarded. The result is the lattice in Figure 3.3 b). The dotted arc is removed by a subsequent trimming step.
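The two steps of the filter, discarding incoming arcs and trimming, can be sketched as follows. This is an illustration under several assumptions and not the filter from [Hoffmeister & Klein+ 2006]: the set of cloud nodes is taken as given, `score` is assumed to be a cost for which lower is better and which is already the quantity to be compared, and the names `Arc`, `filter_cloud`, and `trim` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Set


class Arc(NamedTuple):
    src: int
    dst: int
    label: str
    score: float  # assumed to be a cost: lower is better


def filter_cloud(arcs: List[Arc], cloud_nodes: Set[int]) -> List[Arc]:
    """For every cloud node keep only the best scoring incoming arc."""
    best: Dict[int, Arc] = {}
    for arc in arcs:
        if arc.dst in cloud_nodes:
            if arc.dst not in best or arc.score < best[arc.dst].score:
                best[arc.dst] = arc
    return [a for a in arcs if a.dst not in cloud_nodes or a is best[a.dst]]


def trim(arcs: List[Arc], initial: int, final: int) -> List[Arc]:
    """Remove arcs that do not lie on any path from `initial` to `final`."""
    fwd, bwd = {initial}, {final}
    changed = True
    while changed:
        changed = False
        for a in arcs:
            if a.src in fwd and a.dst not in fwd:
                fwd.add(a.dst)
                changed = True
            if a.dst in bwd and a.src not in bwd:
                bwd.add(a.src)
                changed = True
    return [a for a in arcs if a.src in fwd and a.dst in bwd]


lattice = [
    Arc(0, 1, "have", 5.0), Arc(0, 2, "have", 6.0),
    Arc(1, 3, "{si}", 1.0), Arc(2, 3, "{b}", 2.0),
    Arc(3, 4, "{b}", 1.0), Arc(0, 4, "move", 8.0),
]
filtered = trim(filter_cloud(lattice, cloud_nodes={3}), initial=0, final=4)
```

In this toy lattice the second “have” arc becomes a dead end after the filter and is removed by the trimming step.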
3.6.4 Results
In this section two lattice normalization issues are experimentally investigated. The first set of experiments is performed on the Chinese 230h testing system and compares Bayes risk decoding with the CN distance as loss function for word and character lattices, where two different approaches are investigated for deriving a character lattice from a given word lattice. The second set of experiments evaluates the impact of the lattice density for CN decoding for the Chinese 230h testing system and the English EPPS 2007 evaluation system. A detailed description of the systems is given in Appendix B. For all experiments acoustic and language model scales and the system weights in the union based combination and in the CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7. For ROVER the confidence score for making a deletion (null-confidence) is included in the optimization.
In the first set of experiments the combination and decoding of Chinese lattices is performed on word and on character level. The character lattices are derived from the word lattices by splitting the word arcs into character arcs using the algorithms described in Section 3.6.1. The results are shown in Table 3.8. Going from word to character level improves the CN based lattice decoding and combination, and the CER decreases by around 0.2% absolute. The results for the two arc splitting algorithms differ only slightly without showing a clear advantage for either. The observation makes sense considering that the duration of a character in spoken Chinese is similar for most characters. That is, instead of performing an expensive forced alignment it is sufficient to distribute the word duration uniformly among the character arcs.
In the second set of experiments the impact of the lattice density on the CN combination and decoding result is explored. For a density of one only the Viterbi hypothesis remains in the lattice and CNC degrades to ROVER with majority voting. In the case of the CNC of two Viterbi paths the implementation always chooses the hypothesis from the first CN. This explains why system s1 and the system combination s1+s2
[Figure 3.4: plot of CER[%] (y-axis, 13 to 14.8) against lattice density (x-axis, 0 to 60) for s1, s1+s2, and s1+s2+s3.]
Figure 3.4. CN decoding results for the Chinese 230h testing system, cf. Section B.1.1, for different lattice densities.
[Figure 3.5: plot of WER[%] (y-axis, 7.5 to 8.5) against lattice density (x-axis, 0 to 60) for s1, s1+s2, s1+s2+s3, and s1+s2+s3+s4.]
Figure 3.5. CN decoding results for the English EPPS 2007 evaluation system, cf. Section B.2.1, for different lattice densities.
show equal error rates for a density of one, cf. Figure 3.4 and Figure 3.5. Not surprisingly, the error drops significantly for densities larger than one. Remarkably, the optimal performance is already achieved for almost all experiments at a density of five. A further increase of the density helps only slightly, if at all. The conclusion is that only few words and eventually few lattice paths have an impact on the decision making. The conclusion is supported by the ROVER vs. CNC results from Section 3.4.3: ROVER with confidence scores performs almost as well as CNC, but considers only one hypothesis per system and slot. This can be interpreted as the CNC of heavily pruned CNs. In the experiments presented in this section the CNs are derived from heavily pruned lattices. The results indicate that in CNC only few hypotheses are considered and required in decision making.
3.7 Parameter Optimization for Bayes Risk Decoding and System Combination
The focus of this thesis is on the decoding and combination of lattices, where a lattice is a log-linear combination of feature functions. Throughout this work it is assumed that the feature functions are given and fixed, i.e. no parameters of the feature functions are optimized. Let J be the number of lattices to be combined and I be the number of feature functions per system; to simplify matters it is assumed that each system combines the same number of features. The parameters optimized for each combination experiment consist of the (J · I) scaling factors of the J system-dependent log-linear models, cf. Equation (3.6), the J system priors if used, cf. Equation (3.14), and a small number of combination and decoding specific parameters. The set of free parameters is denoted by θ. For most experiments the free parameters consist of two scaling factors per system (the acoustic model and the language model scale), a weight per system, and one or two method specific parameters. Thus, the typical size of θ ranges between 1 (single system with Viterbi decoding) and 13 (four systems with system weights and one method specific parameter). This small number of parameters is optimized on a development set via a direct error rate minimization using the downhill simplex algorithm as described in the next section. In Chapter 7 experiments with word-dependent scaling factors are presented, which increases the number of parameters up to several thousand. A direct parameter optimization is then prohibitive and instead the minimum risk training (MRT) approach described in Section 3.7.2 is applied.
By definition the Bayes risk is the lower bound of the overall risk (or expected loss) of any classifier. The overall risk for a speech recognition system $g(\cdot)$, which maps an acoustic feature sequence $x_1^T$ to a word sequence, is given by

$$ r := \sum_{x_1^T,T} \sum_{w_1^N,N} Pr(x_1^T, w_1^N)\, L\big(w_1^N, g(x_1^T)\big), \tag{3.33} $$

where $Pr(x_1^T, w_1^N)$ is the true joint probability of observing sentence $w_1^N$ and acoustic feature sequence $x_1^T$ together and $L(\cdot,\cdot)$ denotes an arbitrary loss function. The Bayes risk is defined as the risk of the optimal classifier:

$$ r_{\mathrm{opt}} := \min_{g(\cdot)} r = \min_{g(\cdot)} \sum_{x_1^T,T} \sum_{w_1^N,N} Pr(x_1^T, w_1^N)\, L\big(g(x_1^T), w_1^N\big) = \sum_{x_1^T,T} Pr(x_1^T) \min_{v_1^M,M} \sum_{w_1^N,N} Pr(w_1^N \mid x_1^T)\, L(v_1^M, w_1^N). \tag{3.34} $$

From the last equation it follows that the optimal classifier is given by

$$ g_{\mathrm{opt}}(x_1^T) = \operatorname*{argmin}_{v_1^M,M} \sum_{w_1^N,N} Pr(w_1^N \mid x_1^T)\, L(v_1^M, w_1^N), \tag{3.35} $$

if the true posterior distribution $Pr(w_1^N \mid x_1^T)$ is known. In practice the true posteriors are unknown and only a limited training or optimization set $[x_{r,1}^{T_r}, \tilde{w}_{r,1}^{N_r}]_{r=1}^{R}$ is available. However, Equation (3.35) motivates the usage of a classifier of the following form for decoding LVCSR lattices:

$$ g_\theta(x_1^T) = \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p_\theta(b_1^K \mid x_1^T)\, L(a_1^L, b_1^K). \tag{3.36} $$
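The decoding rule in Equation (3.36) can be made concrete on a small N-best list. The following sketch is an illustration only: it enumerates the hypothesis and summation space explicitly, which is feasible for a short list but not for a lattice, and the function names as well as the toy posteriors are assumptions.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple

Sentence = Tuple[str, ...]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def bayes_risk_decode(hyp_space: Iterable[Sentence],
                      posteriors: Dict[Sentence, float],
                      loss: Callable[[Sentence, Sentence], float] = levenshtein) -> Sentence:
    """Return the hypothesis with minimum expected loss under `posteriors`."""
    def expected_loss(a: Sentence) -> float:
        return sum(p * loss(a, b) for b, p in posteriors.items())
    return min(hyp_space, key=expected_loss)


# Toy posteriors (assumed values): the most probable sentence is not the
# one with the least expected number of word errors.
post = {("a", "b", "c"): 0.4, ("a", "b", "d"): 0.3, ("a", "e", "d"): 0.3}
map_hyp = max(post, key=post.get)
mbr_hyp = bayes_risk_decode(list(post), post)
```

With the sentence error (0/1 loss) in place of the Levenshtein distance, the same function returns the MAP hypothesis, which corresponds to the Viterbi class of decoders discussed in Section 3.5.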
Parameter optimization means finding those parameters $\hat\theta$ which yield the best approximation of $g(\cdot)$ on the empirical overall risk $\hat{r}$. The empirical risk is derived by approximating the true joint probability by $\widehat{Pr}(x_1^T, w_1^N)$, which is estimated on the training set. The direct parameter optimization and the MRT approach differ in the way they estimate the joint probability, where in particular the estimation used in MRT can lead to suboptimal results, cf. Section 3.7.2. Note that the goal of the optimization is to derive the Bayes risk classifier, but not necessarily to derive a good predictor for the Bayes risk itself. Under the assumption that $p_\theta(\cdot \mid x_1^T)$ can approximate the true posterior probability distribution arbitrarily closely, it is guaranteed that the set of classifiers having the form given in Equation (3.36) includes the Bayes risk classifier. But in general, the Bayes risk classifier is not unique. In particular, if the parameter set $\hat\theta$ describes a Bayes risk classifier, it does not necessarily follow that $p_{\hat\theta}(\cdot \mid x_1^T)$ is a good estimate of the true posteriors. In conclusion, after parameter optimization for Bayes risk decoding the interpretation of the lattice-derived probability $p_{\hat\theta}(w_1^N \mid x_1^T)$ as the true posterior probability of $w_1^N$ is questionable. However, only few parameters are optimized, and in acoustic and language model training the vast majority of the parameters are (at least initially) maximum likelihood trained. The two parameter optimization algorithms presented in this section choose by design, from all risk minimizing classifiers, one with parameters $\hat\theta$ close to the initial parameters. That is, in practice the interpretation of $p_{\hat\theta}(w_1^N \mid x_1^T)$ as a posterior is acceptable and is, for example, successfully used in confidence score computation, cf. Section 5.2.1.
3.7.1 Parameter Optimization based on the Downhill Simplex Algorithm
The approach uses the empirical risk as the objective function in the definition of the optimization problem

$$ \hat\theta := \operatorname*{argmin}_{\theta} \frac{1}{R} \sum_{r=1}^{R} L\big(\tilde{w}_{r,1}^{N_r}, g_\theta(x_{r,1}^{T_r})\big). \tag{3.37} $$

The classifier based on $\hat\theta$ minimizes the error on the training set and the estimate of the joint probability is the relative frequency

$$ \widehat{Pr}(x_1^T, w_1^N) := \frac{1}{R} \sum_{r=1}^{R} \delta(x_1^T, x_{r,1}^{T_r})\, \delta(w_1^N, \tilde{w}_{r,1}^{N_r}), $$

which converges to the true probability for sufficiently large training sets. The drawback of the approach is that the objective function is not differentiable and thus gradient-descent based optimization algorithms cannot be applied. In practice, the following algorithm for optimizing the parameters turned out to be fast and robust.
1. Optimize the language model scale $\beta_j$ of the jth system separately for each lattice such that the error rate of the Viterbi decoder is minimized.
2. Initialize θ, i.e. the set of all parameters, as follows. Set the scaling factor for the acoustic model of the jth system $\lambda_{j,AM}$ to $1/\beta_j$ and the language model scaling factor $\lambda_{j,LM}$ to one. The system prior p(j) is initialized with 1/J and for the combination and decoding parameters some defaults are assumed. Now, optimize each parameter in θ consecutively w.r.t. Equation (3.37).
3. Apply the Nelder-Mead downhill simplex optimization algorithm [Nelder & Mead 1965].
The parameters from step 1 are usually already close to the optimum. The optimization in step 2 can be accelerated by making use of the knowledge about the parameters to be optimized, e.g. the system priors have to sum up to one. The initial values for the third step are in most cases already very close to the optimum and only few more iterations are needed. Equation (3.37) is not differentiable, which motivates the usage of the Nelder-Mead downhill simplex algorithm, which was successfully applied to similar problems, e.g. [Zens & Hasan+ 2007]. However, starting directly with the downhill simplex algorithm is not recommended. The algorithm is sensitive to local minima, which can be avoided by choosing a good starting point, i.e. a point close to a good (ideally the global) minimum. This motivates the three-step architecture.
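The empirical risk objective of Equation (3.37) and the consecutive per-parameter optimization of step 2 can be sketched on toy data. Everything in the sketch is an assumption made for illustration: the N-best representation with an acoustic and a language model cost per hypothesis, the single free parameter (a language model scale), and the grid-based search. The thesis does not specify how the per-parameter search is carried out.

```python
from typing import Callable, List, Sequence, Tuple

# (words, acoustic cost, language model cost); a hypothetical N-best entry
Hyp = Tuple[Tuple[str, ...], float, float]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def decode(nbest: List[Hyp], theta: Sequence[float]) -> Tuple[str, ...]:
    """Viterbi-style decoding: pick the hypothesis with the lowest scaled cost."""
    lm_scale = theta[0]
    return min(nbest, key=lambda h: h[1] + lm_scale * h[2])[0]


def empirical_risk(theta: Sequence[float], data) -> float:
    """Average loss of the decoder output against the references, cf. (3.37)."""
    return sum(levenshtein(ref, decode(nbest, theta)) for nbest, ref in data) / len(data)


def coordinate_search(theta: Sequence[float],
                      objective: Callable[[List[float]], float],
                      grids: List[List[float]], sweeps: int = 2):
    """Optimize one parameter at a time over a grid, keeping strict improvements."""
    theta = list(theta)
    best = objective(theta)
    for _ in range(sweeps):
        for i, grid in enumerate(grids):
            for value in grid:
                cand = theta[:i] + [value] + theta[i + 1:]
                risk = objective(cand)
                if risk < best:
                    best, theta = risk, cand
    return theta, best


data = [
    ([(("a", "b"), 10.0, 4.0), (("a", "c"), 9.0, 6.0)], ("a", "b")),
    ([(("d",), 5.0, 1.0), (("e",), 3.0, 1.5)], ("e",)),
]
theta, risk = coordinate_search([0.0], lambda t: empirical_risk(t, data),
                                grids=[[1.0, 2.0, 8.0]])
```

The resulting `theta` would then serve as the starting point for step 3; one option, assuming SciPy is available, is `scipy.optimize.minimize` with `method="Nelder-Mead"` applied to the same objective.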
3.7.2 Parameter Optimization based on Minimum Risk Training
Minimum risk training (MRT) is well-known for its application to the parameter estimation of acoustic models, where it is usually referred to as minimum word error (MWE) or minimum phoneme error (MPE) training [Kaiser & Horvat+ 2000; Povey & Woodland 2002]. The optimization problem solved by minimum risk training is defined as

$$ \hat\theta := \operatorname*{argmin}_{\theta} \sum_{r=1}^{R} \sum_{w_1^N,N} p_\theta(w_1^N \mid x_{r,1}^{T_r})\, L(\tilde{w}_{r,1}^{N_r}, w_1^N). \tag{3.38} $$

The problem is differentiable and $\hat\theta$ can be computed with the help of the extended Baum-Welch algorithm or by gradient-descent based approaches. The implementation used throughout this work applies Rprop, a gradient-descent based optimization algorithm [Gunawardana & Mahajan+ 2005; Riedmiller & Braun 1993]. Remarkably, Equation (3.38) models the posterior probability, which results in the following model-based estimate of the joint probability:

$$ \widehat{Pr}_\theta(x_1^T, w_1^N) := p_\theta(w_1^N \mid x_1^T)\, \frac{1}{R} \sum_{r=1}^{R} \delta(x_1^T, x_{r,1}^{T_r}). $$

That is, the estimate of the risk depends twice on θ, in the probability of the occurrence of sentence $w_1^N$ and in the classifier:

$$ \hat{r}_\theta = \frac{1}{R} \sum_{r=1}^{R} \sum_{w_1^N,N} \widehat{Pr}_\theta(x_{r,1}^{T_r}, w_1^N)\, L\big(w_1^N, g_\theta(x_{r,1}^{T_r})\big). $$

The consequence is that MRT not only aims at optimizing the parameters of the classifier $g_\theta(\cdot)$, but at the same time changes the probability distribution over the training data such that the classifier fits. Consequently, the training does not aim at finding the true probability distribution over the training data and therefore the optimization will in general not converge to the Bayes risk classifier. In fact, it is easy to show that the resulting distribution is one for the class selected by the classifier and zero otherwise. The following example shows how the dependency of the empirical risk on θ can lead to a suboptimal solution. Let us assume that a feature extraction produces the same feature x five times, but the observed classes differ. The observations are: 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). In MRT the goal is to find the probability distribution $\hat{p}(\cdot \mid x)$ such that the following optimization problem is solved, cf. Equation (3.38), where the loss function of choice is the Levenshtein distance:

$$ \hat{p}(\cdot \mid x) := \operatorname*{argmin}_{p(\cdot \mid x)} \frac{1}{R} \sum_{r=1}^{R} \sum_{c} p(c \mid x)\, \mathrm{Lev}(c, \tilde{c}_r) = \operatorname*{argmin}_{p(\cdot \mid x)} \frac{1}{5} \Big( 5 \times p(111 \mid x) + 6 \times p(112 \mid x) + 4 \times p(211 \mid x) + 7 \times p(221 \mid x) \Big) $$
Table 3.9 shows the empirical posterior probability distribution, i.e. the relative frequencies, the posterior distribution resulting from the minimum risk training, the classification results, and the risks of the classification results. The classifier using the empirical posterior probabilities yields “111”, which minimizes the expected loss on the training set. On the other hand, the hypothesis of the classifier based on the MRT result is “211”, which is not an optimal solution for the training set: the expected loss on the training set is greater by 1/5 than for the optimal solution “111”. The consequence is that MRT will in general not produce the Bayes risk classifier even on infinite training data and with no model restrictions. In contrast, the empirical risk minimizing approach will yield the Bayes risk classifier under the same conditions. However, in practice MRT in combination with regularization is successfully applied to several optimization tasks involving thousands to millions of free parameters [Heigold & Deselaers+ 2008; Povey & Woodland 2002], i.e. where a direct parameter optimization is not applicable. The regularization applied in this work penalizes the deviation from an initial parameter; MRT with regularization is used in Chapter 7.
Table 3.9. Comparison of the posterior probability distributions resulting from maximum likelihood estimation and from MRT given the observations 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). The table also shows the Bayes risk hypothesis given the two distributions and the corresponding risks given the empirical distribution.

obs.            empirical $\widehat{Pr}(c \mid x)$    MRT $\hat{p}(c \mid x)$
1 × (x, 111)    1/5                                   0
2 × (x, 112)    2/5                                   0
1 × (x, 211)    1/5                                   1
1 × (x, 221)    1/5                                   0
$\hat{g}(x)$    111                                   211
r(111)          5/5                                   5/5
r(211)          6/5                                   4/5
Another problem of MRT concerns the default system combination used throughout this work: the weighted average of system-dependent posterior probabilities as defined in Equation (3.14). Under the assumption that θ consists only of the system-dependent parameters, i.e. $\theta = \{\Lambda_1, \ldots, \Lambda_J, p(1), \ldots, p(J)\}$, and that the system-dependent parameters are mutually exclusive, the optimization problem solved by MRT can be rewritten as

$$ \hat\theta := \operatorname*{argmin}_{\theta} \sum_{r=1}^{R} \sum_{w_1^N,N} \sum_{j=1}^{J} p(j)\, p_j(w_1^N \mid x_{r,1}^{T_r}, \Lambda_j)\, L(\tilde{w}_{r,1}^{N_r}, w_1^N) = \operatorname*{argmin}_{p(\cdot)} \sum_{j=1}^{J} p(j) \operatorname*{argmin}_{\Lambda_j} \sum_{r=1}^{R} \sum_{w_1^N,N} p_j(w_1^N \mid x_{r,1}^{T_r}, \Lambda_j)\, L(\tilde{w}_{r,1}^{N_r}, w_1^N). $$

Under the constraint that the system priors sum up to one, it is easy to see that the optimization problem has the following solution: optimize the system-dependent scaling factors $\Lambda_j$ separately and subsequently set the system prior of the best performing system to one. That is, minimum risk training (in contrast to empirical risk minimization) does not consider the interaction between the systems and ends up with a system selection. This makes MRT unsuitable for parameter optimization for all lattice union based system combination approaches, in particular for ROVER and CNC.
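The system selection effect follows from the objective being linear in the system priors. As a small sketch, assume the per-system expected losses have already been minimized over the scaling factors; the numeric values below are made up for illustration and the function names are hypothetical.

```python
from typing import List


def mrt_objective(priors: List[float], system_risks: List[float]) -> float:
    """Prior-weighted sum of the per-system expected losses."""
    return sum(p * r for p, r in zip(priors, system_risks))


def best_priors(system_risks: List[float]) -> List[float]:
    """Minimizer over the probability simplex: all mass on the best system."""
    j = min(range(len(system_risks)), key=system_risks.__getitem__)
    return [1.0 if i == j else 0.0 for i in range(len(system_risks))]


# Assumed per-system expected losses after optimizing each system separately.
risks = [0.9, 0.7, 1.1]
one_hot = best_priors(risks)
uniform = [1.0 / len(risks)] * len(risks)
```

Any prior that spreads mass over several systems yields an objective value that is at least as large as that of the one-hot prior, so the optimization never rewards a genuine combination.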
3.8 Summary
In this chapter a unified view on system combination has been developed which covers the most common approaches used in LVCSR. In the Bayes risk decoding framework system combination reduces to the problem of computing sentence posterior probabilities over multiple systems. A common approach is to use a single log-linear model which combines all knowledge sources from all systems. The alternative is to compute the weighted average of the system-dependent sentence posteriors. The two approaches have a natural representation in the transducer framework. A new semiring, the vector semiring, is introduced, which contains dimension-dependent scaling factors. Lattices are represented by weighted finite-state acceptors over the vector semiring. Thus, a lattice eventually defines a log-linear model distribution over sentences. The combination of several lattices can be done by building the intersection or the union. The intersection results directly in a log-linear model combination of the knowledge sources provided by the system-dependent lattices. A slightly modified union yields the weighted average of the system-dependent sentence posteriors. In both cases the result is again a lattice. The investigation of lattice decoding in the Bayes risk framework with the aim of minimizing the Levenshtein distance commences with categorizing approximate loss functions. Two classes of loss functions are derived and efficient Bayes risk decoders are developed. The characteristic of the two classes is the
locality of the loss computation: loss functions of the first class are local w.r.t. the reference arcs, i.e. in the computation of the loss for a single reference arc no context is considered. Loss functions of the second class are local w.r.t. reference and hypothesis. The intersection can be efficiently decoded in a MAP/Viterbi decoder, but not in a Bayes risk decoder with a common Levenshtein distance approximation as loss function, because the intersection invalidates the word boundaries which are needed in the loss computation. The reverse holds for the union approach: in the Viterbi decoder the union approach degenerates to a system selection and the MAP decoder is computationally expensive. But the Bayes risk decoder developed for a single lattice can be applied to the union and yields an efficient combination approach. An alternative to the lattice-based system combination is the confusion network combination (CNC). The lattices are first transformed into CNs and subsequently the CNs are aligned into a super CN followed by a standard CN decoding. The common CNC alignment rule is derived from formulating the combination problem within the Bayes risk decoding framework. Furthermore, it is shown that the difference between CNC and constructing a CN directly from the lattice union is only in the loss function, but not in the computation of the probabilities. That is, eventually CNC is a Bayes risk decoding of the lattice union. Finally, ROVER is introduced as an approximation to CNC. Experimental results show that a CN based combination performs better than the intersection approach and gives up to 10% relative improvement for intra-site and more than 20% relative improvement for cross-site combination experiments. Improvements are measured in terms of error rate reduction compared to the Viterbi decoding result of the best single system. The experiments indicate that only few hypotheses are needed for decision making. In particular, ROVER performs almost as well as CNC. Lattice normalization and parameter optimization are discussed at the end of the chapter. For all experiments the acoustic and language model scales of all systems and the combination technique specific parameters are tuned for minimum error rate via the downhill simplex method.
Chapter 4 Local Cost Functions for Bayes Risk Decoding
Local cost functions were introduced in the previous chapter in Section 3.3.3. In the Bayes risk decoding framework for LVCSR tasks, a local cost function approximates the Levenshtein distance and makes the computation of the Bayes risk hypothesis from a lattice computationally feasible. In this chapter local cost functions are investigated in detail. The first section discusses the general deletion bias of local costs. The remaining sections introduce several concrete implementations of local cost functions: based on the frame error in Section 4.2, based on local alignments in Section 4.3, and based on confusion networks (CN) in Section 4.4. In their common form all of the local costs show a deletion bias, especially the costs based on frame error and on local alignments. The reasons for the bias are investigated and improved versions of the cost functions are developed, which compensate for deletions. The section about CNs introduces and compares three approaches to CN construction, including a new approach based on frame-wise word posterior probabilities. The new approach has some interesting properties: in contrast to the common approaches to CN construction from lattices, the new algorithm is parameter-free and does not rely on distance functions comparing arcs or arc clusters.
4.1 Local Costs and the Deletion Bias
LVCSR systems tuned for minimum error rate have a general deletion bias: it is better to discard an unlikely word than to risk an insertion. The detailed proof and a further discussion of the bias are given in Appendix A. However, in practice the impact is negligible and the actual deletion bias of a system is mainly driven by the model approximations and, in case of a Bayes risk decoder, by the choice of the loss function.

Local cost functions as defined in Section 3.3.3 have an inherent deletion bias caused by the requirement that only arcs which overlap in time can compete. This requires exact time stamps for words, which do not exist in continuous speech. The discretization of the acoustic signal makes the situation worse. Short words like “I” or “a”, or quickly and unclearly spoken words like “have”, are good candidates for fluctuating word boundaries, especially if they occur in context with words starting or ending in the same or a similar vowel. These words can occur several times in the lattice with little or no overlap in time, even though they clearly refer to the same word position in the spoken sentence. Consequently, in a Bayes risk decoder using a loss function which requires exact word boundaries, these arcs are not aligned, which usually strengthens the hypothesis of the empty word and thus causes deletions. The situation is even worse in cross-site lattice combinations, because, as shown in [Baghai-Ravary & Kochanski+ 2009], LVCSR decoders usually show a systematic bias in where to set word boundaries. The specific bias of each concrete local cost function is discussed in the following sections when the respective cost function is introduced.

Local costs for discriminative acoustic model training are investigated in [Gibson 2008]. The work focuses on local alignment and frame error based costs, and the author comes to similar conclusions concerning the deletion bias of local cost functions.
4.2 Frame Error
The frame error is a common approximation of the Levenshtein distance and is used in discriminative acoustic model training [Gibson & Hain 2006; Zheng & Stolcke 2005] and in Bayes risk decoding [Wessel & Schlüter+ 2001c]. The plain frame error between two paths through a lattice is simply the number of time frames in which the overlapping arcs have different word labels. Let $a_t$ denote the arc in path $a_1^L$ which intersects with time frame $t$ and let $o(a, b)$ denote the overlap in time of arc $a$ and arc $b$. In order to achieve a simplified notation the helper function $h(a, b) := o(a, b)\,\delta\big(i(a), i(b)\big)$ is defined. The frame error between lattice path $a_1^L$ and lattice path $b_1^K$ is defined as

$$
\begin{aligned}
c_{FE}(a_1^L, b_1^K) &:= \sum_{t=1}^{T} \Big[ 1 - \delta\big(i(a_t), i(b_t)\big) \Big] \\
&= \sum_{k=1}^{K} \Big[ \mathrm{dur}(b_k) - \sum_{l=1}^{L} h(b_k, a_l) \Big] \\
&= \sum_{l=1}^{L} \Big[ \mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k) \Big].
\end{aligned}
\tag{4.1}
$$
Note that the computation of the pure frame error is symmetric w.r.t. summing over the hypothesis arcs or over the reference arcs. The frame error itself and all the modifications discussed in this section are local cost functions of the second type. That is, the Bayes risk hypothesis can be computed efficiently by using the Bayes risk decoder given in Equation (3.26).
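The symmetry of Equation (4.1) can be checked on a toy example. The following is a minimal sketch, not the decoder used in this work: arcs are assumed to be tuples `(word, begin, end)` with an exclusive end frame, both paths are assumed to cover the same frames starting at frame 0, and all function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: an arc is (word label, begin frame, end frame), end exclusive.
Arc = Tuple[str, int, int]


def dur(a: Arc) -> int:
    """Duration of an arc in frames."""
    return a[2] - a[1]


def overlap(a: Arc, b: Arc) -> int:
    """o(a, b): number of frames shared by the two arcs."""
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def h(a: Arc, b: Arc) -> int:
    """h(a, b) = o(a, b) * delta(i(a), i(b))."""
    return overlap(a, b) if a[0] == b[0] else 0


def label_at(path: List[Arc], t: int) -> str:
    """Word label of the arc of the path that covers frame t."""
    for word, beg, end in path:
        if beg <= t < end:
            return word
    raise ValueError(f"frame {t} is not covered by the path")


def frame_error_by_frames(a_path: List[Arc], b_path: List[Arc]) -> int:
    """First form of Eq. (4.1): count the frames with differing labels."""
    num_frames = a_path[-1][2]  # paths are assumed to start at frame 0
    return sum(label_at(a_path, t) != label_at(b_path, t) for t in range(num_frames))


def frame_error_by_arcs(outer: List[Arc], inner: List[Arc]) -> int:
    """Second and third form of Eq. (4.1): sum over the arcs of one path."""
    return sum(dur(a) - sum(h(a, b) for b in inner) for a in outer)


hyp = [("the", 0, 3), ("cat", 3, 8), ("sat", 8, 10)]
ref = [("the", 0, 4), ("hat", 4, 7), ("sat", 7, 10)]

print(frame_error_by_frames(hyp, ref))  # frame-wise form
print(frame_error_by_arcs(hyp, ref))    # sum over hypothesis arcs
print(frame_error_by_arcs(ref, hyp))    # sum over reference arcs
```

All three calls return the same value for this pair of paths, which is the symmetry exploited by the normalized variants discussed in the following sections.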
4.2.1 Partially Normalized Frame Error
In [Wessel & Schlüter+ 2001c] a modified version of the frame error is used as loss function for lattice-based Bayes risk decoding. The modified frame error has an additional normalization term with the intention to average between frame error and a word-like error. The resulting error is defined as

$$
c_{hypnFE}(a_1^L, b_1^K) := \sum_{l=1}^{L} \frac{\mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k)}{1 + \alpha \big( \mathrm{dur}(a_l) - 1 \big)}.
\tag{4.2}
$$
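The parameter α smoothly interpolates between frame-wise and word-wise normalization: α = 0 yields the plain frame error and α = 1 normalizes each arc by its duration. According to Equation (3.26) the Bayes risk decoding can be implemented as a lattice rescoring with the following rescoring function:

$$
\begin{aligned}
c_{hypnFE}(a; S, \alpha) &:= \frac{\mathrm{dur}(a) - \sum_{b \in E(S):\, o(a,b) > 0} h(a, b)\, p(b \mid x_1^T)}{1 + \alpha \big( \mathrm{dur}(a) - 1 \big)} \\
&= \frac{\mathrm{dur}(a) - \sum_{t=\mathrm{beg}(a)}^{\mathrm{end}(a)-1} p_t\big(i(a) \mid x_1^T\big)}{1 + \alpha \big( \mathrm{dur}(a) - 1 \big)}
\end{aligned}
\tag{4.3}
$$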
The resulting decoding rule is referred to as the min.hypnFE decoding rule¹, where hypnFE is short for hypothesis-side normalized frame error. The extension of the min.hypnFE decoding rule to system combination is straightforward by using Equation (3.14), i.e. by setting $p(w_1^N \mid x_1^T) = \sum_{j=1}^{J} p(j)\, p_j(w_1^N \mid x_1^T)$. This is equivalent to computing the frame-wise word posteriors according to Equation (3.17), i.e. $p_t(w \mid x_1^T) = \sum_{j=1}^{J} p(j)\, p_{j,t}(w \mid x_1^T)$, and is exactly the form given in [Hoffmeister & Klein+ 2006]. In [Chen & Lee 2006] the authors start from a different approach, but it is easy to see that their frame error based combination rule is exactly the min.hypnFE rule for system combination, where the lattice union serves as hypothesis space.

¹ In previous work the resulting decoder was referred to as min.fWER decoder. However, for a consistent notation throughout the thesis the name is changed to min.hypnFE decoder.
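The second form of the hypnFE rescoring function only needs frame-wise word posteriors, which also makes the extension to system combination a weighted average per frame. The following is a minimal sketch under stated assumptions: lattice arcs are given as hypothetical tuples `(word, begin, end, posterior)` with an exclusive end frame and already computed arc posteriors, and the function names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumption: a lattice arc is (word, begin frame, end frame (exclusive), arc posterior).
LatticeArc = Tuple[str, int, int, float]
FramePosteriors = List[Dict[str, float]]


def framewise_posteriors(arcs: Sequence[LatticeArc], num_frames: int) -> FramePosteriors:
    """p_t(w | x): sum of the posteriors of all arcs with label w covering frame t."""
    post = [defaultdict(float) for _ in range(num_frames)]
    for word, beg, end, p in arcs:
        for t in range(beg, end):
            post[t][word] += p
    return post


def combine(systems: Sequence[FramePosteriors], weights: Sequence[float]) -> FramePosteriors:
    """Weighted average of system-dependent frame-wise posteriors."""
    num_frames = len(systems[0])
    combined = [defaultdict(float) for _ in range(num_frames)]
    for post, weight in zip(systems, weights):
        for t in range(num_frames):
            for word, p in post[t].items():
                combined[t][word] += weight * p
    return combined


def hypnfe_cost(word: str, beg: int, end: int, post: FramePosteriors, alpha: float) -> float:
    """Hypothesis-side normalized frame error of a single hypothesis arc."""
    d = end - beg
    expected_correct = sum(post[t].get(word, 0.0) for t in range(beg, end))
    return (d - expected_correct) / (1.0 + alpha * (d - 1))


# Two toy systems over four frames.
post1 = framewise_posteriors([("a", 0, 2, 0.6), ("b", 0, 2, 0.4), ("c", 2, 4, 1.0)], 4)
post2 = framewise_posteriors([("a", 0, 2, 1.0), ("c", 2, 4, 1.0)], 4)
post12 = combine([post1, post2], [0.5, 0.5])

print(hypnfe_cost("a", 0, 2, post1, alpha=0.0))   # plain expected frame error
print(hypnfe_cost("a", 0, 2, post1, alpha=1.0))   # normalized by the arc duration
print(hypnfe_cost("a", 0, 2, post12, alpha=0.0))  # after system combination
```

Decoding then amounts to a shortest-path search over the hypothesis lattice with these arc costs, which is omitted here.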
Figure 4.1. The bias in partially normalized frame errors. In a) the frame error is normalized w.r.t. the hypothesis, which results in ignoring deletion errors (left side) while insertions are counted (right side). In b) the frame error is normalized w.r.t. the reference and insertion errors are ignored (left side) while deletions are counted.
4.2.2 Symmetrically Normalized Frame Error In Equation (4.2) a normalization term is introduced for the frame error. However, the normalization happens only w.r.t. the left argument, the hypothesis, which destroys the symmetry in the definition of the plain frame error, cf. Equation (4.1). Let us assume that the left argument is the hypothesis and the right argument the reference. The notation in Equation (4.1) stresses the symmetry. Breaking the symmetry and normalizing w.r.t. the hypothesis ignores deletions, while normalizing w.r.t. the reference ignores insertions. Figure 4.1 illustrates the behavior. The min.hypnFE decoding rule defined in Equation (4.2) normalizes w.r.t. the hypothesis which causes a deletion bias. Consequently, experimental results show a high deletion ratio for the min.hypnFE decoder which increases with larger α. Not surprisingly the optimal performance is achieved with a small α, usually around 0.05. On the other hand a normalization is reasonable, because the plain frame error depends on the duration of the words and is dominated by long words. Two approaches have been proposed to normalize the frame error without breaking the symmetry, i.e. without having a bias towards deletions or insertions. The first approach implements the cost function proposed in [Gibson 2008]. The symmetry of the error is achieved on arc level by counting the total number of frames at which two overlapping arcs differ, divided by the length of the shorter arc. The resulting rescoring function for the Bayes risk decoder of the second type, cf. Equation (3.26), is given by
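
$$
c_{arcnFE}(a; S) := \sum_{b \in E(S):\, o(a,b) > 0} p(b \mid x_1^T)\, \frac{\max\big(\mathrm{end}(a), \mathrm{end}(b)\big) - \min\big(\mathrm{beg}(a), \mathrm{beg}(b)\big) - h(a, b)}{\min\big(\mathrm{dur}(a), \mathrm{dur}(b)\big)}.
\tag{4.4}
$$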
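As an illustration of Equation (4.4), the following minimal sketch computes the arcnFE rescoring cost of one hypothesis arc against a set of summation space arcs. The tuple layout `(word, begin, end, posterior)` with an exclusive end frame and the function name are assumptions made for this example only.

```python
from typing import Sequence, Tuple

# Assumption: hypothesis arc is (word, begin, end); lattice arc additionally carries p(b | x).
HypArc = Tuple[str, int, int]
LatticeArc = Tuple[str, int, int, float]


def arcnfe_cost(a: HypArc, lattice_arcs: Sequence[LatticeArc]) -> float:
    """Arc-wise symmetrically normalized frame error of hypothesis arc a, cf. Eq. (4.4)."""
    word, beg, end = a
    cost = 0.0
    for b_word, b_beg, b_end, p in lattice_arcs:
        o = max(0, min(end, b_end) - max(beg, b_beg))
        if o == 0:
            continue  # only arcs overlapping in time compete
        h = o if word == b_word else 0
        # Frames covered by either arc, minus the frames in which both agree.
        differing = max(end, b_end) - min(beg, b_beg) - h
        cost += p * differing / min(end - beg, b_end - b_beg)
    return cost


summation_arcs = [("a", 1, 4, 0.7), ("b", 0, 2, 0.3)]
print(arcnfe_cost(("a", 0, 4), summation_arcs))
```

An arc that coincides exactly with a single competing arc of posterior one and the same label receives cost zero, as expected from the definition.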
So far, the arcnFE² was only tested for discriminative acoustic model training, where the approximation shows good results. The approach proposed in [Hoffmeister & Schlüter+ 2009] achieves the symmetry on path level by averaging the hypothesis-normalized and the reference-normalized frame error. The error is called pathnFE and is defined as

$$
c_{pathnFE}(a_1^L, b_1^K) := \gamma \sum_{l=1}^{L} \frac{\mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k)}{\mathrm{dur}(a_l)} + (1 - \gamma) \sum_{k=1}^{K} \frac{\mathrm{dur}(b_k) - \sum_{l=1}^{L} h(b_k, a_l)}{\mathrm{dur}(b_k)}.
\tag{4.5}
$$

The parameter γ allows biasing the error towards deletions or insertions; symmetry is achieved for γ = 0.5. Obviously, the error has a new bias: substitutions are penalized twice compared to insertions and deletions. However, experimental results do not show a significantly increased fraction of insertions or deletions in the error rates. Using the pathnFE error in Bayes risk decoding yields the min.pathnFE decoder. The rescoring function for a Bayes risk decoder of the second type is derived by inserting the definition of the error into Equation (3.26):

$$
\begin{aligned}
x_1^T \to \hat{W} &:= \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, c_{pathnFE}(a_1^L, b_1^K) \\
&= \operatorname*{argmin}_{a_1^L \in H} \Bigg\{ \sum_{l=1}^{L} \gamma\, \frac{\mathrm{dur}(a_l) - \sum_{b \in E(S)} h(a_l, b)\, p(b \mid x_1^T)}{\mathrm{dur}(a_l)} + (1 - \gamma) \sum_{b \in E(S)} p(b \mid x_1^T) - (1 - \gamma) \sum_{l=1}^{L} \sum_{b \in E(S)} \frac{h(b, a_l)\, p(b \mid x_1^T)}{\mathrm{dur}(b)} \Bigg\} \\
&= \operatorname*{argmin}_{a_1^L \in H} \sum_{l=1}^{L} \underbrace{\Bigg[ \gamma - \sum_{b \in E(S)} \Big( \gamma\, \frac{h(a_l, b)}{\mathrm{dur}(a_l)} + (1 - \gamma)\, \frac{h(a_l, b)}{\mathrm{dur}(b)} \Big)\, p(b \mid x_1^T) \Bigg]}_{=:\, c_{pathnFE}(a_l; S, \gamma)}
\end{aligned}
\tag{4.6}
$$

The term $(1 - \gamma) \sum_{b \in E(S)} p(b \mid x_1^T)$ does not depend on the hypothesis and is dropped in the last step, whereas the hypothesis-normalized part contributes the constant γ for each hypothesis arc. The hypothesis-normalized part of $c_{pathnFE}(\cdot\,; S, \gamma)$ equals γ times the hypnFE cost function with α = 1. For the path symmetric cost function the smoothing parameter α could easily be included in Equation (4.5), but in preliminary experiments it turned out not to be necessary for optimal performance.

² The author refers to the error as symmetrically normalised frame error (SNFE). However, for a consistent notation throughout the thesis the name is changed to arcnFE.

Table 4.1. Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three different approaches to word-wise frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

| System | Norm. | dev07¹: (del/ins) CER[%] | eval07: (del/ins) CER[%] | dev08: (del/ins) CER[%] |
|---|---|---|---|---|
| baseline | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| s1 | hyp. | (2.92/1.38) 14.35 | (4.62/0.79) 14.98 | (3.01/0.75) 13.13 |
| s1 | arc-sym. | (2.68/1.53) 14.42 | (4.45/0.90) 15.09 | (2.80/0.83) 13.09 |
| s1 | path-sym. | (2.52/1.61) 14.23 | (4.32/0.98) 14.96 | (2.75/0.94) 13.11 |
| s1+s2 | hyp. | (3.07/1.30) 13.57 | (4.69/0.68) 13.95 | (3.05/0.70) 12.54 |
| s1+s2 | arc-sym. | (2.83/1.41) 13.83 | (4.58/0.80) 14.21 | (2.85/0.70) 12.67 |
| s1+s2 | path-sym. | (2.57/1.58) 13.49 | (4.31/0.90) 13.93 | (2.65/0.89) 12.45 |
| s1+s2+s3 | hyp. | (3.06/1.23) 13.18 | (4.72/0.69) 13.71 | (3.01/0.72) 12.22 |
| s1+s2+s3 | arc-sym. | (2.85/1.30) 13.45 | (4.70/0.73) 14.09 | (2.92/0.72) 12.52 |
| s1+s2+s3 | path-sym. | (2.99/1.22) 13.06 | (4.76/0.66) 13.64 | (3.04/0.71) 12.22 |

¹ tuning set
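The per-arc pathnFE rescoring cost can be sketched in a few lines. The example assumes the form that follows from inserting Equation (4.5) into Equation (3.26), namely the constant γ per hypothesis arc minus the posterior-weighted, doubly normalized agreement with the competing arcs; the tuple layout `(word, begin, end, posterior)` with an exclusive end frame and the function name are hypothetical.

```python
from typing import Sequence, Tuple

# Assumption: hypothesis arc is (word, begin, end); lattice arc additionally carries p(b | x).
HypArc = Tuple[str, int, int]
LatticeArc = Tuple[str, int, int, float]


def pathnfe_cost(a: HypArc, lattice_arcs: Sequence[LatticeArc], gamma: float) -> float:
    """Path-symmetric normalized frame error contribution of hypothesis arc a."""
    word, beg, end = a
    expected_agreement = 0.0
    for b_word, b_beg, b_end, p in lattice_arcs:
        if b_word != word:
            continue  # h(a, b) = 0 for differing labels
        o = max(0, min(end, b_end) - max(beg, b_beg))
        expected_agreement += p * (
            gamma * o / (end - beg)                  # hypothesis-normalized part
            + (1.0 - gamma) * o / (b_end - b_beg)    # reference-normalized part
        )
    return gamma - expected_agreement


summation_arcs = [("a", 1, 4, 0.7), ("b", 0, 2, 0.3)]
print(pathnfe_cost(("a", 0, 4), summation_arcs, gamma=0.5))
print(pathnfe_cost(("a", 0, 4), summation_arcs, gamma=1.0))
```

For γ = 1 the value coincides with the hypnFE cost for α = 1 of the same arc, i.e. (dur(a) minus the expected number of correct frames) divided by dur(a).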
4.2.3 Results
In this section results for the Bayes risk decoder with frame error based local cost functions are presented and discussed. Experiments have been performed for single lattices and for union-based lattice combinations, cf. Section 3.2.3. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments acoustic and language model scales and the system weights in the union-based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. In addition, α for the min.hypnFE decoder and γ for the min.pathnFE decoder are included in the optimization process. The optimization algorithm is described in Section 3.7.
Table 4.2. Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three different approaches to word-wise frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

| System | Norm. | eval06¹: (del/ins) WER[%] | eval07: (del/ins) WER[%] |
|---|---|---|---|
| baseline | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13 |
| LIMSI | hyp. | (1.95/1.15) 8.08 | (2.22/0.99) 9.00 |
| LIMSI | arc-sym. | (1.72/1.34) 8.24 | (1.82/1.22) 9.19 |
| LIMSI | path-sym. | (1.68/1.32) 8.05 | (1.84/1.15) 9.00 |
| LIMSI+RWTH | hyp. | (1.60/0.85) 6.65 | (1.99/0.76) 7.73 |
| LIMSI+RWTH | arc-sym. | (1.57/1.29) 8.35 | (2.02/1.21) 9.58 |
| LIMSI+RWTH | path-sym. | (1.62/0.76) 6.46 | (2.09/0.73) 7.57 |
| LIMSI+RWTH+UKA | hyp. | (1.80/0.72) 6.48 | (2.21/0.68) 7.52 |
| LIMSI+RWTH+UKA | arc-sym. | (1.61/1.39) 8.23 | (1.74/1.27) 9.19 |
| LIMSI+RWTH+UKA | path-sym. | (1.53/0.74) 6.24 | (2.01/0.74) 7.28 |
| LIMSI+RWTH+UKA+IRST | hyp. | (1.70/0.79) 6.52 | (1.93/0.76) 7.26 |
| LIMSI+RWTH+UKA+IRST | arc-sym. | (1.57/1.28) 8.33 | (2.01/1.22) 9.55 |
| LIMSI+RWTH+UKA+IRST | path-sym. | (1.36/0.85) 6.10 | (1.81/0.85) 7.21 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
In the first set of experiments the three frame error based cost functions hypnFE, arcnFE, and pathnFE are compared. The definitions of the cost functions can be found in Equation (4.3), Equation (4.4), and Equation (4.6). The results for the Chinese system are summarized in Table 4.1. The arcnFE cost clearly performs worst. In a direct comparison of the partially normalized hypnFE and the symmetrically normalized pathnFE a small advantage of the pathnFE over the hypnFE is observed. The deletion/insertion ratio shows that the pathnFE cost has a reduced deletion ratio compared to the hypnFE cost. The parameters are tuned for minimum error rate, which means that a low del/ins ratio only appears if it is beneficial for the decoder performance. In fact, for the combination of three systems the del/ins ratio hardly changes between the min.hypnFE and the min.pathnFE decoder, which means that the optimal error rate has a rather high deletion rate. For the cross-site combination results shown in Table 4.2 the benefit from the reduced del/ins ratio of the min.pathnFE decoder is higher. Especially for the combination of three and four systems the error rate benefits from a lower del/ins ratio. Here again, the pathnFE cost outperforms the other two costs, and arcnFE clearly performs worst.

The second set of experiments investigates the influence of the size of the hypothesis space on the decoding result. By default, for experiments requiring exact word boundaries in the hypothesis, like the frame error based costs, the hypothesis space is the time-conditioned form of the summation space lattice, cf. Section 3.3.3. The summation space lattices are the result of a word-conditioned tree search decoder and thus are word-conditioned lattices. The experimental results presented in Table 4.3 and Table 4.4 compare the two hypothesis spaces: the summation space lattice and the time-conditioned form of the summation space lattice. In the combination experiments the summation space lattice is the union of the system-dependent lattices. Experiments are performed with the min.pathnFE decoder, which performed best among all tested frame error based decoders.

The results for the Chinese system show no clear advantage for the time-conditioned hypothesis space, whereas the cross-site combination clearly benefits from the increased size of the hypothesis space. The different behavior is due to the different decoding setups. The Chinese system uses many short segments. In contrast, the English cross-site combination uses only a few segments, each spanning a whole recording. Now, defining the hypothesis space as the union of the system-dependent lattices does not allow switching between the hypotheses of the different systems within a segment. For Chinese the restriction does no harm as the segments are short anyway. But for the English task with the long segments the restriction has a clear negative impact. Consequently, the large benefit of the time-conditioned hypothesis space is only observed in the combination case, but not for single lattice decoding.

Table 4.3. Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare the word-conditioned and the time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

| System | Time-cond. hyp. space | dev07¹: (del/ins) CER[%] | eval07: (del/ins) CER[%] | dev08: (del/ins) CER[%] |
|---|---|---|---|---|
| baseline | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| s1 | no | (2.74/1.47) 14.23 | (4.45/0.84) 14.87 | (2.93/0.85) 13.09 |
| s1 | yes | (2.52/1.61) 14.23 | (4.32/0.98) 14.96 | (2.75/0.94) 13.11 |
| s1+s2 | no | (2.64/1.52) 13.50 | (4.37/0.84) 13.98 | (2.73/0.86) 12.42 |
| s1+s2 | yes | (2.57/1.58) 13.49 | (4.31/0.90) 13.93 | (2.65/0.89) 12.45 |
| s1+s2+s3 | no | (2.80/1.32) 13.09 | (4.61/0.72) 13.67 | (2.89/0.77) 12.19 |
| s1+s2+s3 | yes | (2.99/1.22) 13.06 | (4.76/0.66) 13.64 | (3.04/0.71) 12.22 |

¹ tuning set

Table 4.4. Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare the word-conditioned and the time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

| System | Time-cond. hyp. space | eval06¹: (del/ins) WER[%] | eval07: (del/ins) WER[%] |
|---|---|---|---|
| baseline | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13 |
| LIMSI | no | (1.90/1.18) 8.07 | (2.11/1.03) 8.96 |
| LIMSI | yes | (1.68/1.32) 8.05 | (1.84/1.15) 9.00 |
| LIMSI+RWTH | no | (1.50/1.14) 6.94 | (1.93/0.97) 7.77 |
| LIMSI+RWTH | yes | (1.62/0.76) 6.46 | (2.09/0.73) 7.57 |
| LIMSI+RWTH+UKA | no | (1.55/0.93) 6.61 | (2.02/0.94) 7.80 |
| LIMSI+RWTH+UKA | yes | (1.53/0.74) 6.24 | (2.01/0.74) 7.28 |
| LIMSI+RWTH+UKA+IRST | no | (1.58/0.97) 6.60 | (2.03/0.88) 7.62 |
| LIMSI+RWTH+UKA+IRST | yes | (1.36/0.85) 6.10 | (1.81/0.85) 7.21 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
4.3 Local Alignment based Error In cost functions based on local alignments each word in the reference is aligned to all subpaths which overlap in time with the reference word. These are cost functions of the first type and the Bayes risk hypothesis can be computed according to Equation (3.25).
4.3.1 Povey’s Approximation in MPE/MWE Training
The most prominent cost function based on a local alignment is the approximation used for minimum risk acoustic model training in [Povey & Woodland 2002]. The cost between lattice path $a_1^L$ and lattice path $b_1^K$ is defined as

$$
c_{Povey}(a_1^L, b_1^K) := K - \sum_{l=1}^{L} \max_{k} \Big\{ -1 + \frac{o(a_l, b_k)}{\mathrm{dur}(b_k)} + \frac{h(a_l, b_k)}{\mathrm{dur}(b_k)} \Big\}.
\tag{4.7}
$$
In practice, the approximation is either applied on word level or on phoneme level. Accordingly, the combination with minimum risk acoustic model training is referred to as minimum word error (MWE) training and minimum phone error (MPE) training, respectively. MPE training is the de facto standard in discriminative acoustic model training for LVCSR systems. The MPE criterion is also used as loss function for Bayes risk decoding. In [Chen & Lee 2006] the authors develop an arc-wise cost based on the phoneme error approximation for word lattice-based system combination and decoding. However, the phoneme alignment is computed only within a word lattice arc, which eventually yields a cost function of the second type. The approach presented in [Xu & Povey+ 2009] rescores N-best lists with the phoneme error approximation used by Povey for MPE training. The cost function developed in this section is a modified version of Povey’s cost, but applied to words, and was first published in [Hoffmeister & Schlüter+ 2009].

A drawback of the approximations used in MPE/MWE training is that they show a strong deletion bias, as pointed out in [Gibson 2008; Zheng & Stolcke 2005] and experimentally verified for Bayes risk decoding in [Hoffmeister & Schlüter+ 2009]. Alternative criteria like the minimum phone frame error (MPFE) [Zheng & Stolcke 2005] have been proposed, but the MPFE objective is rather expensive to compute and requires a state alignment. Furthermore, neither phoneme nor state alignments might be available, e.g. in a cross-site system combination task.

The criterion proposed in this thesis modifies Povey’s original cost applied to words. Equation (4.7) is extended by an additional term which adds a penalty if the occurrence of a deletion is likely. The two terms in the original error definition in Equation (4.7) have the following interpretation: $-1 + \frac{o(a_l, b_k)}{\mathrm{dur}(b_k)}$ adds a penalty if an insertion is likely, and $\frac{h(a_l, b_k)}{\mathrm{dur}(b_k)}$ is the accuracy, thus indirectly modeling substitutions and deletions. An additional term $-1 + \frac{o(a_l, b_k)}{\mathrm{dur}(a_l)}$ is introduced which is similar to the insertion penalty, but normalized by the duration of the hypothesis word $\mathrm{dur}(a_l)$. The motivation is: if a long hypothesis word $a_l$ competes with a much shorter reference word $b_k$, then presumably a deletion takes place and is penalized by the new term. The new term is weighted by a scalar χ which allows the deletion penalty to be increased smoothly; setting χ = 0 yields Povey’s original criterion. The resulting rescoring function for the type one Bayes risk decoder, cf. Equation (3.25), is given by

$$
\begin{aligned}
x_1^T \to \hat{W} &:= \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, c_{\chi Povey}(a_1^L, b_1^K) \\
&= \operatorname*{argmin}_{a_1^L \in H} \Bigg\{ \underbrace{\sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, K}_{=\,\mathrm{const}(a_1^L)} + \sum_{l=1}^{L} \underbrace{\Bigg[ - \sum_{\substack{b_j^k:\ \exists \varphi_1^K \in S \text{ with } b_j^k = \varphi_\iota^\kappa, \\ o(a_l, b_i) > 0 \text{ for } i \in [j,k], \\ o(a_l, b_i) = 0 \text{ for } i \notin [j,k]}} p(b_j^k \mid x_1^T) \max_{i \in \{j, \dots, k\}} \Big\{ -1 + \frac{o(a_l, b_i)}{\mathrm{dur}(b_i)} + \chi \Big( -1 + \frac{o(a_l, b_i)}{\mathrm{dur}(a_l)} \Big) + \frac{h(a_l, b_i)}{\mathrm{dur}(b_i)} \Big\} \Bigg]}_{=:\, c_{\chi Povey}(a_l; S, \chi)} \Bigg\}.
\end{aligned}
\tag{4.8}
$$

Table 4.5. The substitution, insertion, and deletion error for the discrete and the continuous case of the 1/2 overlap approximation.

| error | discrete | continuous |
|---|---|---|
| substitution | δ(i(a), i(b)), o(a, b) > 0.5 | h(a, b) / dur(b) |
| insertion | 1(o(a, b) ≤ 0.5) | (dur(a) − o(a, b)) / dur(a) |
| deletion | 1(o(a, b) ≤ 0.5) | (dur(b) − o(a, b)) / dur(b) |
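The effect of the deletion penalty weight χ can be illustrated on a pair of paths. The sketch below implements the path-to-path cost only, i.e. Equation (4.7) extended by the χ term; the lattice-level rescoring in Equation (4.8) additionally averages over all reference subpaths and is not shown. The arc layout `(word, begin, end)` with an exclusive end frame and the function names are assumptions made for this example.

```python
from typing import List, Tuple

# Assumption: an arc is (word label, begin frame, end frame), end exclusive.
Arc = Tuple[str, int, int]


def dur(a: Arc) -> int:
    return a[2] - a[1]


def overlap(a: Arc, b: Arc) -> int:
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def h(a: Arc, b: Arc) -> int:
    return overlap(a, b) if a[0] == b[0] else 0


def chi_povey_cost(hyp: List[Arc], ref: List[Arc], chi: float = 0.0) -> float:
    """K minus the summed local accuracies; chi = 0 gives Povey's original cost."""
    total = float(len(ref))  # K, the number of reference words
    for a in hyp:
        total -= max(
            -1.0
            + overlap(a, b) / dur(b)                  # insertion-related term
            + chi * (-1.0 + overlap(a, b) / dur(a))   # additional deletion penalty
            + h(a, b) / dur(b)                        # accuracy term
            for b in ref
        )
    return total


hyp = [("the", 0, 3), ("cat", 3, 8), ("sat", 8, 10)]
ref = [("the", 0, 4), ("hat", 4, 7), ("sat", 7, 10)]

print(chi_povey_cost(hyp, ref))           # original criterion
print(chi_povey_cost(hyp, ref, chi=1.0))  # with full deletion penalty
print(chi_povey_cost(ref, ref))           # identical paths
```

In this toy case the long hypothesis word "cat" competes with the shorter reference word "hat", so a positive χ increases the cost of the hypothesis, while identical paths keep a cost of zero for any χ.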
4.3.2 The 1/2 Overlap Approximation
The use of fractional values in the error approximation in MPE/MWE training is owed to the locality of the approximation, because two hypothesis arcs a and a′ can be assigned to the same competing arc b. This flaw in the alignment can be avoided by requiring that a (or a′) can only be aligned to b if the fractional overlap exceeds one half. Following this consideration, two cost functions are developed in [Hoffmeister & Schlüter+ 2009] and presented in this section. The cost for an error can now be chosen to be discrete, as in the Levenshtein alignment, or again be smoothed by using normalized overlaps. For hypothesis arc a and reference arc b, Table 4.5 summarizes the definition of the substitution, insertion, and deletion error for both cases. In practice, the requirement of a minimum overlap of 0.5 is too strong for optimal error rate. Instead, 0.5 is replaced by the parameter β, which is empirically optimized on the tuning set. Equation (4.9) shows the resulting Bayes risk decoder for the continuous case; the discrete case follows analogously.

$$
\begin{aligned}
x_1^T \to \hat{W} &:= \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, c_{\beta Povey}(a_1^L, b_1^K) \\
&= \operatorname*{argmin}_{a_1^L \in H} \Bigg\{ \underbrace{\sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, K}_{=\,\mathrm{const}(a_1^L)} + \sum_{l=1}^{L} \Bigg[ \mathbf{1}\big( \forall b \in S : o(a_l, b) < \beta \big) - \sum_{\substack{b_j^k:\ \exists \varphi_1^K \in S \text{ with } b_j^k = \varphi_\iota^\kappa, \\ o(a_l, b_i) > \beta \text{ for } i \in [j,k], \\ o(a_l, b_i) \le \beta \text{ for } i \notin [j,k]}} p(b_j^k \mid x_1^T) \max_{i \in \{j, \dots, k\}} \Big\{ -1 + \frac{o(a_l, b_i)}{\mathrm{dur}(b_i)} + \Big( -1 + \frac{o(a_l, b_i)}{\mathrm{dur}(a_l)} \Big) + \frac{h(a_l, b_i)}{\mathrm{dur}(b_i)} \Big\} \Bigg] \Bigg\}
\end{aligned}
\tag{4.9}
$$

The penalty term $\mathbf{1}\big( \forall b \in S : o(a_l, b) < \beta \big)$ is necessary to count an insertion in the case that the minimum overlap requirement prohibits the alignment of the hypothesis arc $a_l$ to any arc from the summation space lattice. The 1/2 overlap cost function is almost identical to the modified version of Povey’s cost, cf. Equation (4.8). For β = 0.5 the additional deletion penalty becomes crucial, because due to the 1/2 overlap constraint the accuracy term no longer accounts for the deletion. Therefore, χ is set to one when choosing β = 0.5. For β < 0.5, and especially for β = 0, two hypothesis arcs can in fact be aligned to the same reference arc and the accuracy term (indirectly) penalizes the deletions. Consequently, the impact of the χ term has to be decreased in order to avoid an overestimation of the deletion error.

Table 4.6. Minimum local alignment error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare four variants of the local alignment based cost. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

| System | Criterion | dev07¹: (del/ins) CER[%] | eval07: (del/ins) CER[%] | dev08: (del/ins) CER[%] |
|---|---|---|---|---|
| baseline | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| s1 | POV | (2.89/1.39) 14.33 | (4.62/0.80) 15.03 | (3.00/0.75) 13.14 |
| s1 | χPOV | (2.32/1.68) 14.17 | (4.23/1.03) 15.01 | (2.61/0.97) 13.04 |
| s1 | βINT (cont.) | (2.70/1.51) 14.33 | (4.45/0.89) 14.98 | (2.84/0.84) 13.12 |
| s1 | βINT (disc.) | (2.61/1.55) 14.34 | (4.46/0.92) 15.01 | (2.73/0.88) 13.06 |
| s1+s2 | POV | (3.11/1.25) 13.60 | (4.75/0.67) 14.00 | (3.06/0.70) 12.57 |
| s1+s2 | χPOV | (2.47/1.53) 13.44 | (4.32/0.85) 13.93 | (2.58/0.86) 12.35 |
| s1+s2 | βINT (cont.) | (2.78/1.37) 13.48 | (4.51/0.75) 13.97 | (2.80/0.75) 12.49 |
| s1+s2 | βINT (disc.) | (2.68/1.45) 13.54 | (4.44/0.82) 14.00 | (2.78/0.81) 12.44 |
| s1+s2+s3 | POV | (3.12/1.14) 13.19 | (4.82/0.62) 13.74 | (3.16/0.68) 12.26 |
| s1+s2+s3 | χPOV | (2.61/1.33) 13.09 | (4.48/0.75) 13.67 | (2.72/0.78) 12.08 |
| s1+s2+s3 | βINT (cont.) | (2.69/1.31) 13.12 | (4.55/0.72) 13.70 | (2.81/0.74) 12.15 |
| s1+s2+s3 | βINT (disc.) | (2.58/1.38) 13.20 | (4.50/0.80) 13.87 | (2.74/0.77) 12.25 |

¹ tuning set
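The role of the minimum overlap β can be made concrete with a small sketch. It assumes, as one plausible reading that is not confirmed by the text, that the fractional overlap is the time overlap normalized by the duration of the reference arc; under this assumption at most one arc of a hypothesis path can cover more than half of a reference arc. The arc layout `(word, begin, end)` with an exclusive end frame and all function names are hypothetical.

```python
from typing import List, Tuple

# Assumption: an arc is (word label, begin frame, end frame), end exclusive.
Arc = Tuple[str, int, int]


def fractional_overlap(a: Arc, b: Arc) -> float:
    """Share of reference arc b that is covered by hypothesis arc a (assumed normalization)."""
    shared = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    return shared / (b[2] - b[1])


def alignable_hyp_arcs(hyp_path: List[Arc], b: Arc, beta: float) -> List[Arc]:
    """Hypothesis arcs that may be aligned to reference arc b for minimum overlap beta."""
    return [a for a in hyp_path if fractional_overlap(a, b) > beta]


def counts_as_insertion(a: Arc, ref_arcs: List[Arc], beta: float) -> bool:
    """True if the overlap constraint prohibits aligning a to any reference arc."""
    return not any(fractional_overlap(a, b) > beta for b in ref_arcs)


hyp_path = [("I", 8, 14), ("have", 14, 22)]
ref_arc = ("have", 10, 20)

print(alignable_hyp_arcs(hyp_path, ref_arc, beta=0.5))  # a single arc remains
print(alignable_hyp_arcs(hyp_path, ref_arc, beta=0.0))  # both arcs compete for ref_arc
print(counts_as_insertion(("I", 8, 14), [ref_arc], beta=0.5))
```

With β = 0.5 only one hypothesis arc can be assigned to the reference arc and the other one is counted as an insertion, whereas β = 0 reproduces the behavior of the original local alignment in which both arcs are assigned to the same reference arc.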
4.3.3 Results
In this section results for the Bayes risk decoder with local alignment based cost functions are presented and discussed. Experiments have been performed for single lattices and for union-based lattice combinations, cf. Section 3.2.3. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments acoustic and language model scales and the system weights in the union-based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. In addition, for the Bayes risk decoders based on Equation (4.8) and Equation (4.9) the deletion penalty weight χ and the minimum overlap β, respectively, are included in the parameter optimization. The optimization algorithm is described in Section 3.7.

Four different local alignment based cost functions are compared: the original cost approximation used by Povey for minimum risk training (abbreviated as POV), the modified cost with the additional deletion penalty term (χPOV), and the 1/2 overlap approximation (βINT) with continuous and discrete costs. The results are summarized in Table 4.6 and Table 4.7. For the English cross-site combination setup experimental results are presented only for single systems and the combination of two systems. For the combination of three and four systems the computation of the local alignment in the lattice union became infeasible. The reason is that long hypothesis words were aligned to a highly connected cloud of short words. During the alignment the cloud was expanded to a huge number of paths, each one being aligned against the hypothesis word.
Table 4.7. Minimum local alignment error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare four variants of the local alignment based cost. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

| System | Criterion | eval06¹: (del/ins) WER[%] | eval07: (del/ins) WER[%] |
|---|---|---|---|
| baseline | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13 |
| LIMSI | POV | (1.67/1.29) 8.04 | (1.87/1.15) 9.03 |
| LIMSI | χPOV | (1.62/1.40) 8.13 | (1.73/1.22) 8.99 |
| LIMSI | βINT (cont.) | (1.66/1.33) 8.07 | (1.79/1.14) 8.96 |
| LIMSI | βINT (disc.) | (1.65/1.28) 8.07 | (1.82/1.24) 9.09 |
| LIMSI+RWTH | POV | (1.78/0.72) 6.66 | (2.33/0.61) 7.73 |
| LIMSI+RWTH | χPOV | (1.48/0.90) 6.61 | (2.05/0.79) 7.70 |
| LIMSI+RWTH | βINT (cont.) | (1.44/0.92) 6.66 | (1.96/0.88) 7.96 |
| LIMSI+RWTH | βINT (disc.) | (1.66/0.87) 6.70 | (2.10/0.74) 7.81 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
Overall, the χPOV cost shows a good performance and gives for almost all experiments the best result. The χPOV cost profits from the deletion penalty and in most experiments the deletion ratio is significantly reduced compared to the POV result. The 1/2 approximation with the continuous cost can improve in some experiments over POV and also decreases the deletion ratio. The continuous version is clearly superior to the discrete version, but the overall performance of the χPOV cost is slightly better. In a direct comparison to the frame error based cost functions in Section 4.2.3 the local alignment based costs are competitive in terms of error rate, but clearly outperformed in terms of runtime.
4.4 Confusion Network Distance based Error Confusion networks (CNs) have already been introduced in the last chapter in Section 3.4. In this work the interest is in the CN derived directly from a lattice L via a function σ : E(L) → N, where for two consecutive arcs a and b holds σ(a) < σ(b). The integer σ(al ) is interpreted as the position of arc al within the alignment of lattice path aL 1 and any other path through L: two arcs mapped to the same position are aligned. Note that due to the constraint on σ(·) two arcs on the same path are never aligned. From the alignments positionwise word posterior probability distributions can be derived, cf. Section 3.4. In the common CN terminology the alignment positions are referred to as slots and σ(·) is called slot function. Following the terminology, a CN is defined as the ordered sequence of the slotwise word posterior distributions. The CN can be expressed as a word lattice without time stamps and with a sausage structure and is denoted by CN(L, σ(·)) or by CN(L) for an arbitrary slot function. From the construction follows that CN(L, σ(·)) is also a compact representation of the alignment between each two sentences accepted by L. From the alignment given by the slot function a path distance can be computed, cf. Equation (3.27), which is referred to as CN distance3 The CN distance yields a cost function of the second type and thus can be efficiently decoded in the Bayes risk framework. The Bayes risk decoder distinguishes between the hypothesis space lattice H and the summation space lattice S and thus slot functions for both lattices are required in order to produce an alignment between any path in H and any path in S. In the last chapter in Section 3.4 the special case of using the CN derived from S as hypothesis space, i.e. H := CN(S), was investigated. For the general case let us assume that two slot functions σH (·) and σS (·) exist and that the two arcs e ∈ E(H) and f ∈ E(S) are aligned if σH (a) = σS (b). 
Furthermore, let S be the number of slots in the corresponding CNs, i.e. $S := \max_{a \in E(H)} \sigma_H(a) = \max_{b \in E(S)} \sigma_S(b)$. The next step is to compute the CN distance between a path $a_1^L \in H$ and a path $b_1^K \in S$. Note that in general the sequence $\sigma(a_1), \sigma(a_2), \ldots, \sigma(a_L)$ is not consecutive but can have gaps; likewise for $b_1^K$. By the insertion of ε arcs the gaps can be filled, both paths are brought to the equal length S, and the CN distance is computed according to Equation (3.27). The positions for the insertions of arcs into a lattice can easily be found by the following algorithm. Given a lattice state s and a slot function σ(·), the minimum slot number for a state s is defined as

\[
\operatorname{min.\sigma}(s) := \min_{a \in \mathrm{out}(s)} \sigma(a), \tag{4.10}
\]

where $\operatorname{min.\sigma}(s_I) := 0$ for the initial state $s_I$ and $\operatorname{min.\sigma}(s_F) := S$ for all final states $s_F$. Given an arc a with σ(a) = n, for each $i \in [\operatorname{min.\sigma}(\mathrm{from}(a)), n)$ an arc with slot number i is inserted before a, and for each $j \in (n, \operatorname{min.\sigma}(\mathrm{to}(a)))$ an arc with slot number j is inserted after a. Figure 4.2 visualizes the algorithm.

Figure 4.2. The figure shows a lattice, a CN derived from the lattice, and a lattice in which all paths have the same length. The positions for the insertions of the arcs are derived from the CN according to the algorithm described in the text. The number at the arcs corresponds to the CN slot the arc is assigned to, and the number in the states is the minimum slot number from all outgoing arcs.

³ The terminology is somewhat misleading, as the path distance depends on the slot function, but not on the CN itself.

With the help of $\operatorname{min.\sigma}_H(\cdot)$ the CN distance between $a_1^L$ and $b_1^K$ can be computed without explicitly inserting arcs into the summation or hypothesis space lattice:

\[
c_{\mathrm{CN}}(a_1^L, b_1^K) = \sum_{l=1}^{L} \Bigg[ \mathbb{1}\Big\{ \exists k : \sigma_H(a_l) = \sigma_S(b_k) \wedge i(a_l) \neq i(b_k) \Big\}
+ \sum_{\substack{s = \operatorname{min.\sigma}_H(\mathrm{from}(a_l)): \\ s \neq \sigma_H(a_l)}}^{\operatorname{min.\sigma}_H(\mathrm{to}(a_l)) - 1} \mathbb{1}\Big\{ \exists k : \sigma_S(b_k) = s \wedge i(b_k) \neq \epsilon \Big\} \Bigg] \tag{4.11}
\]
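The explicit gap-filling variant of the CN distance can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: slot numbers are taken to be zero-based, a path is represented as a list of (slot, label) pairs, `<eps>` is a hypothetical symbol for the empty word, and Equation (3.27) is assumed to count the slots in which the two filled paths carry different labels.

```python
EPS = "<eps>"  # hypothetical symbol for the empty word


def fill_gaps(path, num_slots):
    """Expand a path given as (slot, label) pairs to one label per slot.

    Slots that the path skips are filled with the empty word.
    """
    labels = [EPS] * num_slots
    previous = -1
    for slot, label in path:
        # The slot function must be strictly ascending along a path.
        if not previous < slot < num_slots:
            raise ValueError("slots must be strictly ascending and < num_slots")
        labels[slot] = label
        previous = slot
    return labels


def cn_distance(path_a, path_b, num_slots):
    """Count the slots in which the two gap-filled paths disagree."""
    filled_a = fill_gaps(path_a, num_slots)
    filled_b = fill_gaps(path_b, num_slots)
    return sum(1 for x, y in zip(filled_a, filled_b) if x != y)


# Toy example (all values hypothetical): the hypothesis skips slot 1.
hyp = [(0, "eh"), (2, "hello")]
ref = [(0, "eh"), (1, "oh"), (2, "hallo")]
print(cn_distance(hyp, ref, 3))
```

In the toy example the two paths agree in slot 0 and differ in slots 1 and 2. Equation (4.11) avoids the explicit filling step by deriving the gap positions from the minimum slot numbers of the states.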
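The derivation of the rescoring function for the Bayes risk decoder is now straightforward:

\[
\begin{aligned}
x_1^T \to \hat{W} :=\;& \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K \mid x_1^T)\, c_{\mathrm{CN}}(a_1^L, b_1^K) \\
=\;& \operatorname*{argmin}_{a_1^L \in H} \sum_{l=1}^{L} \Bigg[ \sum_{\substack{b \in E(S): \\ \sigma_H(a_l) = \sigma_S(b) \,\wedge\, i(a_l) \neq i(b)}} p(b \mid x_1^T)
+ \sum_{\substack{s = \operatorname{min.\sigma}_H(\mathrm{from}(a_l)): \\ s \neq \sigma_H(a_l)}}^{\operatorname{min.\sigma}_H(\mathrm{to}(a_l)) - 1} \; \sum_{\substack{b \in E(S): \\ \sigma_S(b) = s \,\wedge\, i(b) \neq \epsilon}} p(b \mid x_1^T) \Bigg] \\
=\;& \operatorname*{argmin}_{a_1^L \in H} \sum_{l=1}^{L} \underbrace{\Bigg[ \Big( 1 - p_{\sigma_H(a_l)}\big(i(a_l) \mid x_1^T\big) \Big)
+ \sum_{\substack{s = \operatorname{min.\sigma}_H(\mathrm{from}(a_l)): \\ s \neq \sigma_H(a_l)}}^{\operatorname{min.\sigma}_H(\mathrm{to}(a_l)) - 1} \Big( 1 - p_s(\epsilon \mid x_1^T) \Big) \Bigg]}_{:=\; c_{\mathrm{CN}}\left(a_l;\, S, \sigma_S(\cdot), \sigma_H(\cdot)\right)}
\end{aligned}
\tag{4.12}
\]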
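The arc-wise cost in the last line of Equation (4.12) only needs the slot-wise word posteriors of the summation space. The following Python sketch evaluates it for a single hypothesis arc. The data layout (a list of dictionaries mapping words to posteriors), the function name `arc_cost`, and the symbol `<eps>` for the empty word are assumptions made for illustration only.

```python
EPS = "<eps>"  # hypothetical symbol for the empty word


def arc_cost(arc_label, arc_slot, min_slot_from, min_slot_to, slot_posteriors):
    """Cost of one hypothesis arc a_l, following the last line of Eq. (4.12).

    min_slot_from and min_slot_to play the role of min.sigma_H(from(a_l)) and
    min.sigma_H(to(a_l)); slot_posteriors[s] maps words to p_s(w | x).
    """
    # Substitution term: 1 - p_{sigma_H(a_l)}(i(a_l) | x)
    cost = 1.0 - slot_posteriors[arc_slot].get(arc_label, 0.0)
    # Gap term: every skipped slot s contributes 1 - p_s(eps | x)
    for s in range(min_slot_from, min_slot_to):
        if s != arc_slot:
            cost += 1.0 - slot_posteriors[s].get(EPS, 0.0)
    return cost


# Toy slot posteriors (hypothetical values).
posteriors = [
    {"eh": 0.7, EPS: 0.3},
    {"oh": 0.2, EPS: 0.8},
    {"hello": 0.9, "hallo": 0.1},
]
# Arc "hello" in slot 2 whose source state has minimum slot number 1:
# slot 1 is skipped, so its non-eps mass is added to the substitution term.
print(round(arc_cost("hello", 2, 1, 3, posteriors), 6))
```

Because the cost depends only on the arc itself and on the minimum slot numbers of its two states, it can be attached to the arcs of H once, after which the best path is found by a standard shortest-path search.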
A nice property of the CN distance is that by setting H = CN(S, $\sigma_S(\cdot)$) it is guaranteed that the optimal hypothesis is included in the hypothesis space. Furthermore, the construction of the CN from the slot function yields a sausage lattice in which each path has exactly length S, cf. Section 3.4. By the choice of CN(S, $\sigma_S(\cdot)$) as hypothesis space, the Bayes risk decoding rule defined in Equation (4.12) simplifies to the decoding rule given in Equation (3.28). In all experimental results presented in this work CN(S, $\sigma_S(\cdot)$) is used as hypothesis space.

The motivation behind CN distance based Bayes risk decoding is that the alignments defined by the slot function are good approximations of the Levenshtein alignments. The constraint that the outcome of the slot function for two consecutive arcs must be strictly ascending guarantees that the CN distance is an upper bound of the Levenshtein distance, cf. [Mangu 2000]. The slot function of choice minimizes the error on the training samples $[x_{r,1}^{T_r}, \tilde{w}_{r,1}^{N_r}]_{r=1}^{R}$ and is defined as

\[
\sigma_{\mathrm{opt}}(\cdot) := \operatorname*{argmin}_{\sigma(\cdot)} \frac{1}{R} \sum_{r=1}^{R} \mathrm{Lev}\Big( g_{\mathrm{CN}}\big(\cdot, L_r; \sigma(\cdot)\big),\, \tilde{w}_{r,1}^{N_r} \Big). \tag{4.13}
\]
However, no efficient algorithm is known to compute $\sigma_{\mathrm{opt}}(\cdot)$ from LVCSR lattices, and in practice heuristic approaches are used with at most a few free parameters, which are optimized on a tuning set. Algorithms computing a slot function from a lattice will be referred to as CN construction algorithms. A common heuristic used in many CN construction algorithms is the time overlap constraint defined in Section 3.3.3 for local cost functions. The constraint requires that arcs assigned to the same slot overlap in time and thus guarantees that two consecutive arcs cannot be aligned. However, the time overlap constraint causes a deletion bias in the subsequent CN decoding. Let us assume that the optimal CN alignment has S slots and that, due to the time overlap constraint, the outcome of the CN construction algorithm has $S' > S$ slots. A common situation in which the time overlap constraint causes such a suboptimal alignment is the occurrence of short words with fuzzy word boundaries, as pointed out in Section 4.1. Let us assume that the Levenshtein alignment would align these words, although due to the short duration and the fuzzy boundaries they have no or only little overlap in time. The CN construction algorithm would not align these words and would probably create extra CN slots. As a result, the same number of arcs is spread among more slots and the number of arcs per slot decreases. This weakens the probability for a specific word v if two v-arcs which should be aligned end up in different slots. In turn, this usually strengthens the probability of the empty word in the affected slots and eventually causes the deletion bias. The first algorithm that constructs a CN directly from a lattice, making use of the time overlap constraint, is introduced in [Mangu & Brill+ 1999]. The main idea is to cluster arcs, where the final clustering defines the slot function.
The construction can be sped up significantly by using a so-called pivot path [Hakkani & Riccardi 2003; Stolcke 2002]. An algorithm following this approach is presented in Section 4.4.2. The algorithm requires the computation of the distance between arcs and arc clusters. Section 4.4.1 introduces the distance functions used in the CN construction algorithms presented in this work. In [Xue & Zhao 2005] an algorithm is proposed which traverses the lattice in chronological order and thereby builds state clusters. The state clusters are then used to derive the final arc clusters. An algorithm extending the original version and overcoming some of its drawbacks is developed in Section 4.4.3. The third CN construction algorithm is based on frame-wise word posterior probabilities and is introduced in Section 4.4.4. The algorithm is proposed in [Hoffmeister & Schlüter+ 2009] and aims at finding in each iteration a single frame which defines the center of the next CN slot. The algorithm has some interesting properties, e.g. in contrast to the common CN construction algorithms it does not require a distance measure between arcs or arc clusters and is completely parameter-free.
4.4.1 Distances between Arcs and Arc Clusters
Distance functions between arcs and arc clusters are an important heuristic in the two CN construction algorithms presented in Section 4.4.2 and Section 4.4.3. A common choice in CN construction algorithms are the distance functions introduced in [Mangu & Brill+ 1999]. However, they depend on a phoneme alignment, which is not always available, especially not in cross-site system combinations, or which is expensive to compute. The distance functions used throughout this work are the result of empirical tests. The
Figure 4.3. CN construction with the arc-cluster algorithm.
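distance between two arcs a and b or between an arc a and an arc cluster C is computed by:

\[
\begin{aligned}
d_{\mathrm{arc}}(a, b) &:= \big( 2 - \delta(i(a), i(b)) \big) \cdot \frac{\max\{\mathrm{end}(a), \mathrm{end}(b)\} - \min\{\mathrm{beg}(a), \mathrm{beg}(b)\}}{\mathrm{dur}(a) + \mathrm{dur}(b)} \\
d_{\mathrm{slot}}(a, C) &:= \min_{b \in C} d_{\mathrm{arc}}(a, b)
\end{aligned}
\tag{4.14}
\]

In arc clustering algorithms the distances are usually weighted by the posteriors of the arcs, which yields the following weighted forms of the previously defined distances:

\[
\begin{aligned}
d_{\text{w-arc}}(a, b) &:= \left( 1 + \frac{p(a \mid x_1^T)\, p(b \mid x_1^T)}{\alpha} \right) d_{\mathrm{arc}}(a, b) \\
d_{\text{w-slot}}(a, C) &:= \min_{b \in C} d_{\text{w-arc}}(a, b)
\end{aligned}
\tag{4.15}
\]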
The weight ensures that the lattice arcs with a high probability of occurrence dominate the CN construction, where the parameter α controls the impact of the weight and is tuned on the development set.
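The distance functions of Equations (4.14) and (4.15) translate directly into code. The following Python sketch is an illustration under assumptions: the `Arc` class and its field names are hypothetical, the duration is taken to be `end - beg`, the weighted form follows the reading of Equation (4.15) in which the factor `1 + p(a)p(b)/alpha` multiplies the unweighted distance, and arcs are assumed to have a positive duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """Hypothetical lattice arc: word label, start and end frame, posterior."""
    label: str
    beg: int
    end: int
    posterior: float = 1.0

    @property
    def dur(self):
        return self.end - self.beg


def d_arc(a, b):
    """Unweighted arc distance, Eq. (4.14): time span scaled by label match."""
    span = max(a.end, b.end) - min(a.beg, b.beg)
    delta = 1 if a.label == b.label else 0
    return (2 - delta) * span / (a.dur + b.dur)


def d_slot(a, cluster):
    """Distance between an arc and an arc cluster: closest member wins."""
    return min(d_arc(a, b) for b in cluster)


def dw_arc(a, b, alpha):
    """Posterior-weighted arc distance, Eq. (4.15)."""
    return (1 + a.posterior * b.posterior / alpha) * d_arc(a, b)


def dw_slot(a, cluster, alpha):
    return min(dw_arc(a, b, alpha) for b in cluster)


# Toy arcs (hypothetical time stamps and posteriors).
hello_1 = Arc("hello", 10, 40, 0.5)
hello_2 = Arc("hello", 10, 40, 0.5)
eh = Arc("eh", 0, 10, 1.0)
print(d_arc(hello_1, hello_2), d_arc(eh, hello_1), dw_arc(hello_1, hello_2, 1.0))
```

Two arcs with the same label and identical time span obtain the smallest possible distance of 0.5, while arcs with different labels are penalized by a factor of two.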
4.4.2 The Arc-Cluster CN Construction Algorithm
The arc clustering algorithm presented in this section is based on a set of pivot arcs. The pivot arcs are used to initialize the arc clusters. In the next step the algorithm aims at assigning all arcs to the clusters. If some arcs cannot be assigned because they would violate the consistency of the arc clusters, i.e. they lack overlap in time with some arcs in a cluster, then additional pivot elements are chosen from the remaining arcs and the algorithm starts over. The idea of using a set of pivot arcs for initializing the arc clusters and subsequently clustering the remaining arcs is presumably the most common approach to CN construction for lattices and also for N-best lists, cf. [Hakkani & Riccardi 2003; Stolcke & Bratt+ 2000; Stolcke 2002]. The method presented in this section is also based on the idea of using a set of pivot arcs, but the algorithm itself was developed as part of this work. The pseudo code for the algorithm is given in Figure 4.4. The distance function used is the weighted distance defined in Equation (4.15). Figure 4.3 illustrates the algorithm on a small example. The first set of pivot arcs consists of the arcs from the best path through the lattice, in the example “eh” and “hello”. New clusters are initialized from the pivot arcs. The remaining arcs are now assigned in a greedy manner, so that the other “hello” falls into the same cluster as the first “hello”. The silence arc can no longer be assigned without violating the time overlap condition. In the next step the silence arc is added to the pivot elements, the algorithm starts over, and eventually three clusters are built.
    # Initialize pivot elements with the arcs making up the best hypothesis
    P <- [ e for e in L if e in best(L) ]
    while True do
        # Initialize remaining arcs
        R <- [ e for e in E(L) if e not in P ]
        # Use pivot elements to initialize the CN slots
        CN <- []
        foreach e in P do
            append(CN, [e])
        # Store remaining arcs together with their distance to the closest slot
        Q <- []
        foreach e in R do
            d_e <- min { d(e, S) for S in CN if overlap(e, S) > 0 }
            S_e <- argmin { d(e, S) for S in CN if overlap(e, S) > 0 }
            append(Q, (e, d_e, S_e))
        # Sort remaining arcs by their distance to the closest slot
        sort Q by d_e in increasing order
        # Assign remaining arcs to the closest slot, if possible
        # Store arcs that could not be assigned together with their
        # posterior probability
        Q' <- []
        while not empty(Q) do
            (e, d_e, S_e) <- pop(Q)
            if overlap(e, S_e) > 0 then
                append(S_e, e)
            else
                p_e <- p(e | x_1^T)
                append(Q', (e, p_e))
        # If no remaining arcs exist, then stop
        if empty(Q') then
            break
        # Sort remaining arcs by their posterior probability
        sort Q' by p_e in decreasing order
        # Add new pivot elements
        P' <- []
        while not empty(Q') do
            (e, p_e) <- pop(Q')
            if not overlap(e, P') then
                append(P', e)
        P <- P + P'
    finalize CN

Figure 4.4. Pseudo code for the arc-cluster CN construction algorithm.
Figure 4.5. CN construction with the state-cluster algorithm.
The time complexity of the algorithm for a lattice L is O(|E(L)|²) in the worst case. However, in practice the algorithm is the fastest of the three CN construction algorithms investigated in this work. The algorithm turned out to be very robust, i.e. it produces results among the best for all tested systems and conditions, including the union-based system combination. The clusters are built in a greedy manner and no properties can be assured besides that all arcs in a cluster overlap in time. The actual clustering result depends on the distance function used and on the choice of the initial pivot elements.
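The pivot-based clustering loop of Figure 4.4 can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: it uses the unweighted distance of Equation (4.14) instead of the weighted one, interprets the overlap between an arc and a slot as the minimum pairwise overlap with all arcs already in the slot, leaves the final ordering of the slots to the caller, and assumes that all arcs are distinct. The time stamps and posteriors of the toy lattice are hypothetical and only loosely modelled on the example of Figure 4.3.

```python
from collections import namedtuple

Arc = namedtuple("Arc", "label beg end posterior")


def overlap(a, b):
    """Number of frames shared by two arcs."""
    return max(0, min(a.end, b.end) - max(a.beg, b.beg))


def d_arc(a, b):
    span = max(a.end, b.end) - min(a.beg, b.beg)
    delta = 1 if a.label == b.label else 0
    return (2 - delta) * span / ((a.end - a.beg) + (b.end - b.beg))


def slot_overlap(arc, slot):
    """An arc fits a slot only if it overlaps with every arc in the slot."""
    return min(overlap(arc, other) for other in slot)


def d_slot(arc, slot):
    return min(d_arc(arc, other) for other in slot)


def arc_cluster(arcs, best_path):
    pivots = list(best_path)
    while True:
        slots = [[p] for p in pivots]
        remaining = [a for a in arcs if a not in pivots]
        # Closest consistent slot for every remaining arc (None if no slot fits).
        queue = []
        for arc in remaining:
            fitting = [s for s in slots if slot_overlap(arc, s) > 0]
            target = min(fitting, key=lambda s: d_slot(arc, s)) if fitting else None
            dist = d_slot(arc, target) if target is not None else float("inf")
            queue.append((dist, arc, target))
        queue.sort(key=lambda item: item[0])
        unassigned = []
        for _, arc, target in queue:
            # Re-check the overlap: the slot may have grown in the meantime.
            if target is not None and slot_overlap(arc, target) > 0:
                target.append(arc)
            else:
                unassigned.append(arc)
        if not unassigned:
            return slots
        # Promote the most probable, mutually non-overlapping leftovers to pivots.
        unassigned.sort(key=lambda a: a.posterior, reverse=True)
        new_pivots = []
        for arc in unassigned:
            if all(overlap(arc, p) == 0 for p in new_pivots):
                new_pivots.append(arc)
        pivots = pivots + new_pivots


eh = Arc("eh", 0, 10, 0.6)
hello_1 = Arc("hello", 10, 40, 0.6)
si = Arc("[si]", 0, 32, 0.4)
hello_2 = Arc("hello", 32, 40, 0.4)
slots = arc_cluster([eh, hello_1, si, hello_2], best_path=[eh, hello_1])
print([[a.label for a in s] for s in slots])
```

In the toy run the second “hello” joins the cluster of the first one, the silence arc no longer overlaps with all arcs of that cluster, and a second pass with the silence arc as additional pivot produces three clusters.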
4.4.3 The State-Cluster CN Construction Algorithm
The state clustering algorithm is proposed in [Xue & Zhao 2005]. The main idea of the algorithm is to visit the lattice states in chronological order and to add all states to the current cluster until the following condition is met: for the state in question s there exists an arc a such that a starts from a state in the current cluster and ends in s. If the condition is fulfilled, a new state cluster is started and initialized with s. Let C(s) denote the number of the state cluster to which state s is assigned, and let us assume that the state clusters are numbered in ascending order; then the constraint guarantees for each arc a that C(from(a)) < C(to(a)) holds. For the subsequent arc clustering step an empty arc cluster is initialized between each two consecutive state clusters. The arcs are traversed and arc a is assigned to the best matching arc cluster which lies between the state clusters given by the source and the target state of a. The state clustering constraint guarantees that after the arc clustering step the slot function constraint σ(a) < σ(b) holds for any two consecutive arcs a and b. By default the algorithm uses the unweighted arc distances, which makes the algorithm independent of the posterior probabilities computed from the lattice. The pseudo code for the algorithm is given in Figure 4.6 and an example in Figure 4.5, left side. The example illustrates a shortcoming of the algorithm: the greedy approach obviously fails to find the correct arc clustering. The greedy procedure aligns “eh” and the first “hello” before considering the second “hello”. Because the target state of the “eh” arc is the source state of the second “hello” arc, a new state cluster is started and the two “hello” arcs cannot be aligned. In this work an extension of the state-cluster algorithm is developed, which can compensate for the
    # Initialize states
    S <- [ s in S(L) ]
    sort S chronologically in increasing order
    # Initial state cluster
    C_0 <- [ pop(S) ]
    j <- 0
    # Initialize CN
    CN <- []
    while not empty(S) do
        # Process next state
        s <- pop(S)
        # If potential violation of the alignment property is detected,
        # then start new state and arc cluster (aka slot)
        if max { state_cluster_index(from(e)) for e in In(s) } = j then
            j <- j + 1
            C_j <- [], A_j <- []
            append(CN, A_j)
        append(C_j, s)
        # Find best arc slot for all incoming arcs
        foreach e in In(s) do
            i <- state_cluster_index(from(e))
            k <- argmin { d(e, A_k) for k in (i..j] }
            append(A_k, e)
    finalize CN

Figure 4.6. Pseudo code for the state-cluster CN construction algorithm.
shortcoming of the original method. The extension allows so-called back-splits, where an existing arc cluster is split and a new state cluster is inserted. The procedure compares the arc to be clustered, a, to all already clustered arcs which overlap in time with a. If an arc a' is found which matches a better than it matches any arc in its own current cluster, then the split is performed. The right side of Figure 4.5 illustrates the idea: when the matching arc cluster for the second “hello” is searched, the existing arc cluster containing “eh” and the first “hello” is split and both “hello” arcs are assigned to the right cluster. The complete pseudo code for the state-cluster algorithm with back-splitting is given in Figure 4.7. The worst-case time complexity for a lattice L is O(|E(L)|²) for both algorithms, like the pivot-path-based arc clustering algorithm from the previous section. In practice, the algorithm is fast, although slower than the pivot-path-based arc clustering algorithm. The performance is good; sometimes the results are even slightly better than for the pivot-path-based arc clustering. However, the algorithm is sensitive to the lattice structure, especially for union-based system combinations. For these cases the back-splitting improves the error rates significantly, but the algorithm is still not as robust as the one from the previous section. An interesting property of the algorithm is that it works quasi-online: in the original algorithm the processing of state s at time t affects only the incoming arcs of state s. In particular, when using the posterior-free distance functions, cf. Equation (4.14), it depends only on what happened chronologically before t.
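The state clustering step that underlies both variants can be sketched compactly. The following Python sketch covers only the assignment of state cluster indices, not the subsequent arc clustering or the back-splitting. The lattice representation (a dictionary of state times and a list of arc triples) and the toy lattice are hypothetical, and the sketch assumes that every arc points forward in time.

```python
def state_clusters(state_times, arcs):
    """Assign a state cluster index C(s) to every lattice state.

    state_times: dict mapping state -> time stamp
    arcs: list of (from_state, to_state, label) triples
    """
    incoming = {s: [] for s in state_times}
    for src, dst, _ in arcs:
        incoming[dst].append(src)

    cluster = {}
    current = 0
    # Visit the states in chronological order.
    for position, s in enumerate(sorted(state_times, key=state_times.get)):
        # Start a new cluster if an arc leads from the current cluster into s.
        if position > 0 and any(cluster.get(src) == current for src in incoming[s]):
            current += 1
        cluster[s] = current
    return cluster


# Toy lattice (hypothetical): two parallel paths between time 0 and time 40.
times = {"s0": 0, "s1": 10, "s2": 32, "s3": 40}
lattice_arcs = [
    ("s0", "s1", "eh"),
    ("s1", "s3", "hello"),
    ("s0", "s2", "[si]"),
    ("s2", "s3", "hello"),
]
clusters = state_clusters(times, lattice_arcs)
print(clusters)
```

For every arc of the toy lattice the cluster index of the source state is smaller than that of the target state, which is the property that later guarantees strictly ascending slot numbers along each path.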
4.4.4 The Center-Frame CN Construction Algorithm
The heuristic used in the center-frame algorithm as proposed in [Hoffmeister & Schlüter+ 2009] is based on the frame-wise word posterior probabilities $p_t(w \mid x_1^T)$ and the arc probabilities $p(a \mid x_1^T)$ computed from the lattice. In contrast to the CN construction algorithms presented in the two previous sections, the algorithm does not rely on distances between arcs and arc clusters. The core idea of the algorithm is to find in each iteration the frame t that best fulfills three conditions.
    # Initialize states
    S <- [ s in S(L) ]
    sort S chronologically in increasing order
    # Initial state cluster
    C_0 <- [ pop(S) ]
    j <- 0
    # Initialize CN
    CN <- []
    while not empty(S) do
        s <- pop(S)
        if max { state_cluster_index(from(e)) for e in In(s) } = j then
            j <- j + 1
            C_j <- [], A_j <- []
            append(CN, A_j)
        append(C_j, s)
        foreach e in In(s) do
            i <- state_cluster_index(from(e))
            k_fwd <- argmin { d(e, A_k) for k in (i..j] }
            d_fwd <- d(e, A_k_fwd)
            # Check whether we prefer a back insertion, i.e. insert edge e
            # into a slot where it might violate the slot consistency
            Q <- [ (k, d(e, A_k)) for k in [1..i] ]
            sort Q by d in increasing order
            while not empty(Q) do
                (k_bwd, d_bwd) <- pop(Q)
                # Try a back insertion, if it is cheaper
                if d_bwd < d_fwd then
                    I_left <- [], I_right <- []
                    foreach e_bwd in A_k_bwd do
                        if overlap(e_bwd, e) = 0 then
                            append(I_left, e_bwd)
                        else
                            append(I_right, e_bwd)
                    if empty(I_left) then
                        # Back insertion doesn't violate the slot
                        # consistency
                        append(A_k_bwd, e); break
                    else
                        # Back insertion violates the slot consistency
                        # Check whether a slot split is desired or not
                        F_left <- I_left, F_right <- [e]
                        while not empty(I_right) do
                            e_bwd <- pop(I_right)
                            if d(e_bwd, I_left) < d(e_bwd, e) then
                                append(F_left, e_bwd)
                            else
                                append(F_right, e_bwd)
                        # If at least one arc is assigned to the new
                        # slot F_right, then perform the split
                        if |F_right| > 1 then
                            replace(A_k_bwd, (F_left, F_right)); break
                # No back insertion happened
                d_bwd <- infinity
            # If no back insertion happened, then do the forward insertion
            if d_bwd = infinity then
                append(A_k_fwd, e)
    finalize CN

Figure 4.7. Pseudo code for the state-cluster CN construction algorithm with back-splitting.
The first condition requires the definition of the region of maximum overlap

\[
\mathrm{mo}(a) := \bigcap_{\substack{b \in L: \\ i(a) = i(b) \,\wedge\, o(a,b) > 0}} \big[ \mathrm{beg}(b), \mathrm{end}(b) \big) \tag{4.16}
\]

for an arc a. Now, the three conditions are:
1. t lies in the region of maximum overlap for all arcs it intersects with
2. the probability of the empty word has a minimum at time t
3. t lies in the center of all arcs it intersects with

Condition 1 overrules condition 2, and condition 2 overrules condition 3. That is, first the regions are selected which best fulfill condition 1. From these regions those time frames are selected which best fulfill condition 2, and condition 3 is used for the final selection. In the optimal case t is the center of all arcs which intersect with time frame t, and none of these arcs is an ε arc. Conditions 1 and 3 ensure that the arcs in the resulting slot are competitors. Condition 2 aims at reducing the probability of the empty word in a slot and thus at reducing the deletion bias of the resulting CN. In practice, condition 2 forces a compact CN with the fewest slots compared to the alternative CN construction algorithms.

The algorithm has a crucial drawback: the region of maximum overlap for arc a will be empty if there exist two arcs which have the same label as a and overlap with a, but do not mutually overlap. This case is referred to as the ambiguous case, because no unambiguous region of maximum overlap exists. An alternative definition of the region of maximum overlap can be derived based on frame-wise word posteriors, which in the unambiguous case is equivalent to the original definition given in Equation (4.16), but provides a meaningful set of time stamps in the ambiguous case. The new definition is based on the observation that

\[
p_t\big(i(a) \mid x_1^T\big) = \max_{\mathrm{beg}(a) \le \tau < \mathrm{end}(a)} p_\tau\big(i(a) \mid x_1^T\big), \quad \text{for } t \in \mathrm{mo}(a).
\]

That is, those time frames in an arc’s time span where the probability of the arc label is maximized are good candidates for the region of maximum overlap. The resulting region is referred to as the region of maximum probability and is defined as

\[
\mathrm{mp}(a) := \Big\{ t : \mathrm{beg}(a) \le t < \mathrm{end}(a) \,\wedge\, p_t\big(i(a) \mid x_1^T\big) = \max_{\mathrm{beg}(a) \le \tau < \mathrm{end}(a)} p_\tau\big(i(a) \mid x_1^T\big) \Big\}. \tag{4.17}
\]
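The difference between the two regions is easiest to see on a small ambiguous example. The following Python sketch is an illustration under assumptions: arcs are hypothetical tuples `(label, beg, end, posterior)` with half-open time spans, and the frame-wise word posterior is taken to be the sum of the posteriors of all arcs with that label which cover the frame.

```python
def frame_posteriors(arcs, num_frames):
    """Frame-wise word posteriors: post[t][w] = sum of p(a|x) over arcs a
    with label w and beg(a) <= t < end(a)."""
    post = [dict() for _ in range(num_frames)]
    for label, beg, end, p in arcs:
        for t in range(beg, end):
            post[t][label] = post[t].get(label, 0.0) + p
    return post


def region_of_max_overlap(arc, arcs):
    """mo(a), Eq. (4.16): intersection of the time spans of all arcs with
    the same label that overlap with a (including a itself)."""
    label, beg, end, _ = arc
    lo, hi = beg, end
    for other_label, b, e, _ in arcs:
        if other_label == label and min(end, e) - max(beg, b) > 0:
            lo, hi = max(lo, b), min(hi, e)
    return list(range(lo, hi))  # empty if the spans do not all intersect


def region_of_max_probability(arc, post):
    """mp(a), Eq. (4.17): frames of a where the label posterior is maximal."""
    label, beg, end, _ = arc
    values = [post[t][label] for t in range(beg, end)]
    p_max = max(values)
    return [t for t, v in zip(range(beg, end), values) if v == p_max]


# Ambiguous case (hypothetical values): the second and third arc both
# overlap with the first one, but not with each other.
arcs = [("hello", 10, 40, 0.5), ("hello", 5, 15, 0.3), ("hello", 35, 45, 0.2)]
post = frame_posteriors(arcs, 45)
print(region_of_max_overlap(arcs[0], arcs))
print(region_of_max_probability(arcs[0], post))
```

In this example the region of maximum overlap of the first arc is empty, whereas the region of maximum probability falls back to the frames shared with the more probable of the two competing arcs.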
The definition guarantees that the region of maximum probability is non-empty for all arcs and equals the region of maximum overlap in the unambiguous case. The resulting CN construction algorithm is illustrated in Figure 4.8. The only time frame that comes close to fulfilling all three conditions is frame 22. The slot derived from frame 22 contains both “hello” arcs. Assuming that the non-word “[si]” is regarded by the algorithm as the empty word, the next choice is frame 5, favoring the “eh” arc. Finally, the third slot is built from the silence arc. The complete algorithm is given in pseudo code in Figure 4.9. In the first two lines the algorithm is initialized, where E := {a ∈ L : i(a) ≠ ε} is the set of all non-ε arcs. The main loop starts in line 8 with updating the frame-wise word posteriors and the frame-wise average deviation from the arc center. In the experiments presented in Section 4.4.5 the deviation is measured by the l1 norm, but the l2 norm also works well. In line 14 the frame-wise posteriors are updated, where only arcs are considered whose region of maximum probability intersects with the current time frame. In line 19 the selection of the next slot building frame starts. Finally, in line 27 the next slot is created, where again only those arcs are considered whose region of maximum probability intersects with the slot building time frame. The algorithm has some nice properties. First of all, it does not require a distance function for arcs or arc clusters and it is completely parameter-free. The abandonment of the distance function has a direct consequence: the algorithm is invariant to the fragmentation of paths, i.e. consecutive arcs of silence, noise, or other non-words. All slots produced by the algorithm contain only non-ε arcs, whereas the other algorithms usually produce many slots containing only ε arcs. Furthermore, it is guaranteed
Figure 4.8. CN construction with the center-frame algorithm.
     1  # Initialize set of non-eps edges
     2  E <- [ e for e in E(L) if label(e) != eps ]
     3  # Initialize CN
     4  CN <- []
     5  # Main loop
     6  while not empty(E) do
     7      # Update frame-wise word-posteriors
     8      foreach t in [1..T] do
     9          p_t(eps | x_1^T) <- 1
    10          dev_t(x_1^T) <- sum { |t - center(e)| * p_t(e | x_1^T) for e in E }
    11          foreach w in W do
    12              p_t(w | x_1^T) <- sum { p_t(e | x_1^T) for e in E if label(e) = w }
    13      # Update frame-wise eps-posteriors
    14      foreach e in E do
    15          p_max <- max { p_t(label(e) | x_1^T) for t in [begin(e)..end(e)) }
    16          for t in [begin(e)..end(e)) with p_t(label(e) | x_1^T) = p_max do
    17              p_t(eps | x_1^T) <- p_t(eps | x_1^T) - p_max
    18      # Find next slot building frame
    19      n <- infinity
    20      foreach e in E do
    21          p_max <- max { p_t(label(e) | x_1^T) for t in [begin(e)..end(e)) }
    22          for t in [begin(e)..end(e)) with p_t(label(e) | x_1^T) = p_max do
    23              if ( p_t(eps | x_1^T) < p_n(eps | x_1^T) ) and
    24                 ( dev_t(x_1^T) < dev_n(x_1^T) ) then
    25                  n <- t
    26      # Build next slot
    27      S <- []
    28      foreach e in E with n in [begin(e)..end(e)) do
    29          p_max <- max { p_t(label(e) | x_1^T) for t in [begin(e)..end(e)) }
    30          if p_n(label(e) | x_1^T) = p_max then
    31              append(S, e)
    32              remove(E, e)
    33      append(CN, S)
    34  finalize CN

Figure 4.9. Pseudo code for the center-frame CN construction algorithm.
Table 4.8. CN decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

CER[%], given as (del/ins) err

| System | Comb. | CN alg. | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|---|
| baseline | | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| s1 | | arc-cluster | (2.79/1.45) 14.30 | (4.53/0.85) 14.96 | (2.85/0.80) 13.05 |
| s1 | | state-cluster (mod.) | (2.95/1.41) 14.31 | (4.69/0.82) 14.93 | (3.07/0.79) 13.10 |
| s1 | | center-frame | (2.81/1.45) 14.32 | (4.56/0.85) 14.95 | (2.89/0.80) 13.10 |
| s1+s2 | union | arc-cluster | (3.05/1.29) 13.54 | (4.69/0.73) 14.01 | (3.01/0.73) 12.54 |
| s1+s2 | union | state-cluster (mod.) | (3.47/1.20) 13.69 | (5.18/0.69) 14.22 | (3.45/0.66) 12.75 |
| s1+s2 | union | center-frame | (2.90/1.34) 13.54 | (4.60/0.74) 13.96 | (2.90/0.71) 12.43 |
| s1+s2 | CNC | arc-cluster | (2.93/1.34) 13.56 | (4.66/0.76) 13.99 | (2.93/0.74) 12.50 |
| s1+s2 | CNC | state-cluster (mod.) | (3.03/1.32) 13.53 | (4.72/0.72) 13.95 | (3.09/0.75) 12.66 |
| s1+s2 | CNC | center-frame | (2.91/1.36) 13.55 | (4.60/0.75) 13.95 | (2.91/0.74) 12.49 |
| s1+s2+s3 | union | arc-cluster | (2.88/1.24) 13.13 | (4.77/0.67) 13.73 | (3.01/0.73) 12.30 |
| s1+s2+s3 | union | state-cluster (mod.) | (3.38/1.14) 13.27 | (5.19/0.65) 13.77 | (3.34/0.64) 12.32 |
| s1+s2+s3 | union | center-frame | (2.74/1.33) 13.15 | (4.56/0.73) 13.65 | (2.87/0.74) 12.19 |
| s1+s2+s3 | CNC | arc-cluster | (2.87/1.29) 13.17 | (4.68/0.70) 13.70 | (2.92/0.72) 12.21 |
| s1+s2+s3 | CNC | state-cluster (mod.) | (2.93/1.26) 13.15 | (4.71/0.67) 13.65 | (3.03/0.70) 12.29 |
| s1+s2+s3 | CNC | center-frame | (2.74/1.34) 13.16 | (4.57/0.76) 13.74 | (2.86/0.77) 12.14 |

¹ tuning set
that all overlapping arcs with the same label are assigned to the same slot if an unambiguous solution exists. The worst-case complexity of the algorithm is O(T²). In the conducted experiments the center-frame algorithm needs between two and eight times longer than the pivot-path-based arc clustering algorithm, depending on the length and structure of the lattice. The produced CNs are the most compact of all three algorithms, and the decoding usually yields the lowest deletion ratio. The error rates are competitive with the arc clustering approach, for some tasks even better. The algorithm is also robust, showing good results under all test conditions, including the experiments with union-based system combinations, where it usually beats the other two algorithms.
4.4.5 Results
In this section results for the Bayes risk decoder with the CN distance as loss function are presented and discussed. Experiments have been performed for single lattices, for union-based lattice combinations, and for CN combinations; see Section 3.2.3 and Section 3.4.1 for details about the combination techniques. The CN decoder follows Equation (3.28) and considers the complete hypothesis space. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments the acoustic and language model scales and the system weights in the union-based lattice combination and in the CN combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7. In the first set of experiments three CN construction algorithms are compared: the arc-cluster algorithm introduced in Section 4.4.2, the state-cluster algorithm from Section 4.4.3 in the modified version with back-splitting, and the center-frame algorithm from Section 4.4.4.
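The CN decoder used in these experiments can be illustrated by a small sketch. It rests on two assumptions that are not spelled out in this section: the probability mass of all paths without an arc in a slot is assigned to the empty word, and Equation (3.28) amounts to choosing the word with the highest posterior in each slot, which is what Equation (4.12) reduces to when every hypothesis path has exactly one arc per slot. The arc representation `(label, posterior, slot)` and the symbol `<eps>` are hypothetical.

```python
from collections import defaultdict

EPS = "<eps>"  # hypothetical symbol for the empty word


def slot_posteriors(arcs, num_slots):
    """Build slot-wise word posterior distributions from a slot assignment.

    arcs: iterable of (label, posterior, slot) triples
    """
    slots = [defaultdict(float) for _ in range(num_slots)]
    for label, posterior, slot in arcs:
        slots[slot][label] += posterior
    for dist in slots:
        # Mass of the paths that skip this slot goes to the empty word.
        missing = 1.0 - sum(dist.values())
        if missing > 1e-12:
            dist[EPS] += missing
    return [dict(dist) for dist in slots]


def decode(slots):
    """Pick the most probable word per slot and drop empty words."""
    best = [max(dist, key=dist.get) for dist in slots]
    return [word for word in best if word != EPS]


# Toy example (hypothetical posteriors): the first slot is dominated by eps.
arcs = [("eh", 0.4, 0), ("hello", 0.7, 1), ("hallo", 0.3, 1)]
cn = slot_posteriors(arcs, 2)
print(decode(cn))
```

The example also shows the mechanism behind the deletion bias discussed above: as soon as the arc mass in a slot drops below one half and no single word dominates, the empty word wins and the slot is deleted from the hypothesis.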
Table 4.9. CN decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

WER[%], given as (del/ins) err

| System | Comb. | CN alg. | eval06¹ | eval07 |
|---|---|---|---|---|
| baseline | | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13 |
| LIMSI | | arc-cluster | (1.65/1.33) 8.07 | (1.76/1.18) 8.96 |
| LIMSI | | state-cluster (mod.) | (1.71/1.25) 8.04 | (1.88/1.14) 8.94 |
| LIMSI | | center-frame | (1.64/1.33) 8.08 | (1.75/1.18) 8.97 |
| LIMSI+RWTH | union | arc-cluster | (1.63/0.77) 6.46 | (2.17/0.71) 7.67 |
| LIMSI+RWTH | union | state-cluster (mod.) | (1.90/0.85) 6.95 | (2.29/0.79) 8.13 |
| LIMSI+RWTH | union | center-frame | (1.50/0.77) 6.39 | (1.92/0.73) 7.52 |
| LIMSI+RWTH | CNC | arc-cluster | (1.45/0.80) 6.38 | (1.88/0.75) 7.51 |
| LIMSI+RWTH | CNC | state-cluster (mod.) | (1.49/0.78) 6.38 | (1.96/0.75) 7.52 |
| LIMSI+RWTH | CNC | center-frame | (1.45/0.81) 6.41 | (1.88/0.80) 7.58 |
| LIMSI+RWTH+UKA | union | arc-cluster | (1.51/0.79) 6.38 | (2.04/0.77) 7.63 |
| LIMSI+RWTH+UKA | union | state-cluster (mod.) | (1.98/0.73) 6.57 | (2.63/0.69) 7.76 |
| LIMSI+RWTH+UKA | union | center-frame | (1.54/0.73) 6.30 | (1.89/0.69) 7.32 |
| LIMSI+RWTH+UKA | CNC | arc-cluster | (1.47/0.72) 6.27 | (1.87/0.68) 7.24 |
| LIMSI+RWTH+UKA | CNC | state-cluster (mod.) | (1.58/0.67) 6.25 | (2.04/0.64) 7.28 |
| LIMSI+RWTH+UKA | CNC | center-frame | (1.36/0.74) 6.23 | (1.77/0.76) 7.32 |
| LIMSI+RWTH+UKA+IRST | union | arc-cluster | (1.61/0.73) 6.28 | (2.19/0.67) 7.36 |
| LIMSI+RWTH+UKA+IRST | union | state-cluster (mod.) | (2.31/0.63) 6.61 | (2.90/0.52) 7.58 |
| LIMSI+RWTH+UKA+IRST | union | center-frame | (1.61/0.71) 6.23 | (2.00/0.61) 7.10 |
| LIMSI+RWTH+UKA+IRST | CNC | arc-cluster | (1.45/0.71) 6.14 | (1.87/0.69) 7.12 |
| LIMSI+RWTH+UKA+IRST | CNC | state-cluster (mod.) | (1.54/0.65) 6.10 | (2.04/0.57) 7.12 |
| LIMSI+RWTH+UKA+IRST | CNC | center-frame | (1.36/0.73) 6.11 | (1.82/0.67) 7.16 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
Table 4.10. Comparison of the original and the modified state-cluster CN construction algorithm for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

CER[%], given as (del/ins) err

| System | Comb. | CN alg. | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|---|
| baseline | | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| s1 | | state-cluster (orig.) | (3.10/1.43) 14.45 | (4.85/0.83) 15.02 | (3.25/0.81) 13.30 |
| s1 | | state-cluster (mod.) | (2.95/1.41) 14.31 | (4.69/0.82) 14.93 | (3.07/0.79) 13.10 |
| s1+s2+s3 | union | state-cluster (orig.) | (3.70/1.53) 13.88 | (5.36/1.10) 14.43 | (3.72/1.11) 13.03 |
| s1+s2+s3 | union | state-cluster (mod.) | (3.38/1.14) 13.27 | (5.19/0.65) 13.77 | (3.34/0.64) 12.32 |
| s1+s2+s3 | CNC | state-cluster (orig.) | (2.83/1.29) 13.14 | (4.67/0.71) 13.73 | (3.03/0.69) 12.30 |
| s1+s2+s3 | CNC | state-cluster (mod.) | (2.93/1.26) 13.15 | (4.71/0.67) 13.65 | (3.03/0.70) 12.29 |

¹ tuning set
The results are summarized in Table 4.8 and Table 4.9. The single system experiments show almost no difference in error rate, but a slightly higher deletion ratio for the state-cluster algorithm. The confusion network combination (CNC) experiments show a similar picture: the error rates are almost identical, and the state-cluster algorithm has the highest and the center-frame algorithm the lowest deletion ratio. The union-based lattice combinations are the most challenging tasks for the CN algorithms, because a single CN has to be constructed from several, sometimes diverse lattices. Particularly demanding is the cross-site combination, where the CN has to be built from lattices with different biases in the word boundaries. The results for the Chinese task show that the arc-cluster and the center-frame algorithm perform well and that their error rates do not differ from the CNC results. For the state-cluster algorithm the number of deletions increases heavily and raises the error rate compared to the CNC result. On the English cross-site combination task the CNC approach shows a small advantage over the union-based system combination. Presumably, the advantage comes from the independence of the CNC algorithm from time information. The time information is needed to build the system-dependent CNs, but no longer in the CN combination itself. In the Chinese testing system all lattices are produced with the same decoder and thus all lattices have the same bias in their time stamps. In a cross-site system combination, on the other hand, the lattices are usually produced by different decoders and vary in their bias, cf. [Baghai-Ravary & Kochanski+ 2009]. This explains the different behavior of the Chinese system and the English cross-site combination. Similar to the Chinese results, the state-cluster algorithm is inferior to the arc-cluster and the center-frame algorithm for the union-based combination. Again, a heavily increased deletion ratio is observed.
Among the union-based experiments the center-frame algorithm shows a small advantage over the arc-cluster method; the difference is small, but it can be observed in almost all experimental setups, cf. Appendix C. A direct comparison of the CNC results with the best frame error results, cf. Section 4.2.3, shows no significant difference in error rate. Compared with the CN decoding of the lattice union, the frame error approximation shows a small advantage for the cross-site combination. The CN combination and decoding approaches generalize well: for all experimental setups the improvements on the tuning and on the testing sets are of similar magnitude. The second set of experiments investigates the modification of the original state-cluster algorithm, the backsplitting. The results are summarized in Table 4.10 and Table 4.11. For the single lattice case and the CNC the backsplitting gives a small improvement, which makes the algorithm competitive with the arc-cluster and the center-frame algorithm. The performance of the original state-cluster algorithm on the lattice union is rather poor, and allowing backsplits results in a large improvement. However, the deletion ratio remains high and the performance on the lattice union stays inferior.
Table 4.11. Comparison of the original and the modified state-cluster CN construction algorithm for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

WER[%]: (del/ins) err
System                Comb.   CN alg.                 eval06 (1)          eval07
baseline                                              (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                         state-cluster (orig.)   (1.71/1.35) 8.13    (1.82/1.22) 9.04
LIMSI                         state-cluster (mod.)    (1.71/1.25) 8.04    (1.88/1.14) 8.94
LIMSI+RWTH+UKA+IRST   union   state-cluster (orig.)   (2.25/1.08) 7.39    (2.82/1.00) 8.30
LIMSI+RWTH+UKA+IRST   union   state-cluster (mod.)    (2.31/0.63) 6.61    (2.90/0.52) 7.58
LIMSI+RWTH+UKA+IRST   CNC     state-cluster (orig.)   (1.60/0.63) 6.15    (2.04/0.63) 7.14
LIMSI+RWTH+UKA+IRST   CNC     state-cluster (mod.)    (1.54/0.65) 6.10    (2.04/0.57) 7.12
(1) tuning set; eval06 was the official development set in the 2007 evaluation campaign
4.5 Summary

In this chapter three different approaches to the approximation of the Levenshtein distance have been investigated. The approximations belong to two classes of local cost functions for which efficient Bayes risk decoders exist. Local cost functions and efficient Bayes risk decoders for local costs of the first and second type were introduced in the previous chapter in Section 3.3.3. The cost based on a local alignment is an example of a local cost function of the first type, and the frame error and the CN distance based costs are examples of cost functions of the second type.

The frame error counts the number of frames in which hypothesis and reference disagree in the word label. In practice, the frame error is normalized in order to get a more word-like error. The common normalization used in Bayes risk decoding takes only the hypothesis into account. An investigation of the hypothesis-side normalization shows that it ignores deletions and thus causes a heavy deletion bias in decoding. A new frame error based cost is introduced which averages between hypothesis- and reference-side normalization. The new cost is compared to the original cost function and to a third frame error based cost, which applies a symmetric normalization on arc level. The new cost performs best in all experiments, and in some experiments a considerable decrease in error rate is observed.

The class of cost functions based on a local alignment includes the cost approximation used in Povey's implementation of MPE/MWE training. In Bayes risk decoding Povey's MWE cost shows a deletion bias and a modified version is developed. The modified cost contains an additional term which explicitly penalizes deletions. In the computation of the cost function two reference arcs on the same path can be assigned to the same hypothesis arc. Povey's cost function is designed to compensate for the flaw by computing fractional error counts.
The 1/2 overlap approximation is an alternative approach which allows two arcs to compete only if their overlap exceeds one half. The constraint guarantees that no two hypothesis arcs on the same path are assigned to the same reference word. From this approach two cost functions are derived using continuous and discrete costs. The experimental results reveal the deletion bias of Povey's cost approximation. The modified criterion reduces the deletion bias and shows the best error rates of the four compared cost functions. The results for the 1/2 overlap approximation with the continuous error counts are close to the modified criterion, whereas the version using discrete costs performs worse.

In the last section three confusion network (CN) construction algorithms are introduced. The arc-cluster and the state-cluster algorithm are based on common algorithms used in LVCSR lattice decoding. The center-frame algorithm is a new approach which does not rely on distances between arcs or arc clusters. The arc-cluster algorithm uses a set of pivot arcs to build an initial set of arc clusters, which are redefined in an iterative manner until all arcs are clustered. The state-cluster algorithm performs a chronological traversal of the lattice, thereby clustering the states. The state cluster information is used for building the ultimate arc clustering. The original state-clustering algorithm exhibits problems for some lattice structures. A modified version is developed which is able to compensate for the shortcoming. In the experimental tests the modified version performs better for all tested systems and conditions. The arc-cluster and the state-cluster algorithms are ultimately based on building arc clusters by comparing arcs. The center-frame algorithm works differently: in each iteration a single time frame is selected which defines the center of the next CN slot. The heuristic aims at choosing the time frame such that a compact CN arises with a high arc overlap within the slots. In the experimental comparison of the three algorithms the performance is similar for single lattices and for confusion network combinations. For union-based lattice combination the arc-cluster and center-frame algorithms are on the same level and outperform the state clustering approach.
Chapter 5 Confusion Networks: Applications and Investigations Confusion networks (CNs) have been introduced in Chapter 3, Section 3.4.1, and have been further discussed in Chapter 4, Section 4.4. A CN defines a sequence of slots, where each slot represents a posterior probability distribution over words. The CN can be interpreted as the result of an alignment of word sequences: words in the same slot are aligned. Having said this, the sequence of slots equates to the possible alignment positions. For each slot and word the CN provides the posterior probability of the observation of the word at the corresponding alignment position given the acoustic observations. In particular, the slotwise posterior probabilities are derived from a given alignment, which makes them independent of the posterior distributions of the adjacent slots. The independence yields the simple decoding rule for CNs: for each slot select the word with the highest slotwise posterior probability. In this chapter more applications of CNs are presented which make explicit use of the independence. In the last chapters CNs have been introduced on word and on Chinese character level. In this chapter also CNs defined on frame level are used. For example, the time alignment introduced in Section 1.3 can be expressed as a CN: for each time frame the acoustic alignment provides a probability distribution over all HMM states; in the Viterbi case the probability is zero for all but one state. The corresponding CN has a slot for each time frame and the slotwise distribution is defined over HMM states. In this chapter framewise CNs are used which provide per frame a distribution over all word labels. Thus, they can be interpreted as an acoustic alignment on word level instead of HMM state level. Figure 5.1 shows an example for a lattice and the derived wordwise and framewise CNs. In Section 5.1.1 a framewise entropy is computed from framewise defined CNs and used for a combination approach. 
Another application of framewise CNs is presented in Section 5.1.2, where word boundaries are derived from the frame level CNs. And in Section 5.2.1 the slotwise posteriors of a word level CN are warped for optimal performance in a CN combination. A CN derived from a lattice induces an alignment for each pair of paths in the lattice. The CN decoding result equals the Bayes risk decoding with the Levenshtein distance as loss function, if the CN defines the Levenshtein alignment for all path pairs. In practice, this is not the case for LVCSR tasks, but the true alignment is usually close to the CN alignment. This motivates the idea of using a windowed Levenshtein distance in the Bayes risk decoder, where the alignment is initialized by a CN alignment. In Section 5.2.2 the idea is explored in detail. It is shown that the resulting decoder with a window size of one equals the CN decoding rule and for a sufficiently large window it becomes the Bayes risk decoder with the Levenshtein distance as loss function.
5.1 Frame Level Confusion Networks

A framewise CN (fCN) is defined on word labels and is completely described by the framewise word posterior probability distributions $p_t(w|x_1^T)$ which define the slots in the fCN. The posterior distributions and thus the fCN are derived from a lattice according to Equation (3.11). In contrast to a word or arc alignment, the time alignment requires no explicit computation of the alignment: the alignment is implicitly given by the time stamps in the lattice. Thus, in contrast to a word-level CN, in the fCN the articulation of a word is usually spread over several slots.
5.1.1 Minimum and Inverse-Entropy Combination

The min.hypnFE decoding rule defined in the last chapter in Equation (4.3) is solely based on framewise word posterior distributions; no other lattice-based probabilities are required. For the union-based lattice
Figure 5.1. The first row shows a lattice. The second and the third row show the word-level and the frame-level CN derived from the lattice, respectively. In the word-level CN each slot assigns a single position to each word hypothesis. In the framewise CN each slot represents a single time frame and a word hypothesis is usually spread over several slots.
combination the framewise word posteriors are computed according to Equation (3.17) as the weighted average of the system-dependent framewise word posteriors. In [Misra & Bourlard+ 2003; Valente 2009] the authors propose alternative ways to combine framewise posteriors based on the framewise computed, system-dependent entropy. In their work neural network based framewise phoneme posterior probabilities are derived from several feature streams. System-dependent entropy values are computed from the posteriors and used for merging the phoneme posteriors into a new acoustic front-end. In this work the combination method is applied to the framewise word posteriors derived from the system-dependent lattices.

The basic idea of entropy based combination as proposed in [Misra & Bourlard+ 2003] is that the system with the lowest entropy is the most reliable system. From this idea the authors derive two approaches: for each frame make a hard or a soft decision for one of the systems based on the system-dependent entropy. In the first approach, at each time frame simply the posterior distribution of the system with the lowest entropy is chosen. The resulting combination rule is called the "minimum entropy" weighting scheme and is defined as follows, where the entropy of the posterior distribution $p_{j,t}(\cdot|x_1^T)$ is denoted by $H_{j,t}(x_1^T)$:

\[
p_t(w|x_1^T) := \sum_{j=1}^{J} \delta\Big(H_{j,t}(x_1^T),\ \min_k H_{k,t}(x_1^T)\Big)\, p_{j,t}(w|x_1^T) \tag{5.1}
\]
In the "inverse-entropy" weighting scheme the system-dependent posteriors are weighted according to the inverse of the system-dependent entropy values:

\[
p_t(w|x_1^T) := Z^{-1} \sum_{j=1}^{J} p(j)\, H_{j,t}^{-1}(x_1^T)\, p_{j,t}(w|x_1^T),
\qquad
Z := \sum_{j=1}^{J} p(j)\, H_{j,t}^{-1}(x_1^T) \tag{5.2}
\]
The "inverse-entropy" scheme can be interpreted as a smoothed version of the "minimum entropy" approach: the closer a system's entropy is to zero, the more it dominates the competitors. Results with the entropy based weighting schemes are presented and discussed in Section 5.1.3.
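The two weighting schemes of Equations (5.1) and (5.2) can be sketched for a single time frame as follows. This is a minimal illustration, not the implementation used in this work: the representation of a posterior distribution as a Python dict, the function names, the tie-breaking in the minimum entropy rule (the first system with minimal entropy is taken), and the entropy floor that avoids a division by zero are all assumptions made for the example.

```python
import math


def entropy(dist):
    """Shannon entropy (natural logarithm) of a dict word -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def min_entropy_combination(frame_posteriors):
    """Hard decision in the spirit of Eq. (5.1): return the posterior
    distribution of the system with the lowest entropy in this frame.
    Assumption: on ties the first minimal system is used."""
    return dict(min(frame_posteriors, key=entropy))


def inverse_entropy_combination(frame_posteriors, priors, floor=1e-10):
    """Soft decision as in Eq. (5.2): weight each system by p(j) / H_j.
    Assumption: the entropy is floored so that H_j = 0 does not divide by zero."""
    weights = [pj / max(entropy(dist), floor)
               for pj, dist in zip(priors, frame_posteriors)]
    z = sum(weights)
    combined = {}
    for weight, dist in zip(weights, frame_posteriors):
        for word, p in dist.items():
            combined[word] = combined.get(word, 0.0) + weight * p / z
    return combined


# Hypothetical frame: system 1 is confident, system 2 is undecided.
SYSTEM_1 = {"a": 0.9, "b": 0.1}
SYSTEM_2 = {"a": 0.5, "b": 0.5}
hard = min_entropy_combination([SYSTEM_1, SYSTEM_2])
soft = inverse_entropy_combination([SYSTEM_1, SYSTEM_2], priors=[0.5, 0.5])
```

In the toy frame the hard decision returns the distribution of the confident system, while the soft decision moves the uniform average towards it.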
5.1.2 Time Alignment with Frame Level CNs

Some lattice combination and decoding approaches, like the lattice intersection or the MAP decoding rule, erase the time stamps and invalidate the word boundaries. In theory, time stamps are not necessary for computing and optimizing the Levenshtein distance based error rate. However, in practice they are needed for applying the popular NIST scoring tools or for postprocessing steps applied to the decoding result, for example in the preparation for a subsequent translation step [Matusov & Mauser+ 2006].

A general way to produce new word boundaries is to perform a time alignment of the decoding result with an appropriate acoustic model. The drawback is that the alignment is expensive compared to the lattice decoding and that acoustic models are required. Especially in the cross-site system combination case an appropriate acoustic model is not always available, or it has an out-of-vocabulary (OOV) problem, i.e. the pronunciation lexicon at hand does not contain pronunciations for all words in the lattices. An alternative approach is to modify the lattice processing tools such that they compute approximate word boundaries. The approach is usually fast, but the drawbacks are that only approximate time stamps are computed and that generic algorithms, like the lattice determinization, have to be modified, i.e. generic WFST toolkits can no longer be used.

A third approach, similar to the acoustic time alignment, is presented in this section. The idea is to use the framewise word posterior distributions for computing the word boundaries. Given the framewise word posteriors $p_t(w|x_1^T)$ computed from lattice $L$ and the decoding result $w_1^N$ computed from the same lattice, the alignment problem is given by

\[
(w_1^N, x_1^T) \rightarrow \hat{t}_1^N := \operatorname*{argmax}_{t_1^N} \prod_{n=1}^{N} \prod_{\tau=t_{n-1}+1}^{t_n} p_\tau(w_n|x_1^T). \tag{5.3}
\]

The ending time of word $w_n$ is denoted by $t_n$, that is, the boundaries of $w_n$ are $[t_{n-1}+1, t_n]$. The alignment can be computed efficiently with dynamic programming. The approach follows directly from the recursive formulation of the problem

\[
h(t, n; w_1^N, x_1^T) := p_t(w_n|x_1^T) \cdot \max\big\{ h(t-1, n-1; w_1^N, x_1^T),\ h(t-1, n; w_1^N, x_1^T) \big\},
\]

where $h(0, 0; w_1^N, x_1^T) := 1$. Computing $h(T, N)$ and tracing the changes in the word index yields the desired word boundaries.

The algorithm is also suitable for system combination approaches which are not based on a single lattice, like the CN combination (CNC). The framewise word posteriors are computed according to Equation (3.17) as the weighted average of the system-dependent posteriors or, equivalently, directly from the modified lattice union as defined in Section 3.2.3. The choice of the union for computing the framewise posteriors guarantees that no OOV problem occurs during the alignment. The algorithm is used in [Hoffmeister & Hillard+ 2007] for computing word boundaries for the output of a CNC decoder and throughout this work to compute word boundaries for lattice intersection and MAP decoding results.
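The dynamic programming recursion for the word boundaries can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: the alignment is read as maximizing the product of framewise posteriors, the computation is carried out in the log domain, every word receives at least one frame, and partial alignments with h(t, 0) for t > 0 or h(0, n) for n > 0 are treated as impossible. The function name and the dict-per-frame input format are hypothetical.

```python
import math


def align_words(frame_posteriors, words):
    """Return the 1-based ending frame t_n of each word.

    frame_posteriors: list of T dicts (word -> posterior), one per time frame.
    words: decoding result w_1..w_N.
    """
    T, N = len(frame_posteriors), len(words)
    neg_inf = float("-inf")
    # h[t][n]: best log score for aligning frames 1..t to words 1..n
    h = [[neg_inf] * (N + 1) for _ in range(T + 1)]
    # starts[t][n]: True if word n starts at frame t on the best path
    starts = [[False] * (N + 1) for _ in range(T + 1)]
    h[0][0] = 0.0  # log of h(0, 0) := 1
    for t in range(1, T + 1):
        for n in range(1, min(t, N) + 1):
            p = frame_posteriors[t - 1].get(words[n - 1], 0.0)
            stay, advance = h[t - 1][n], h[t - 1][n - 1]
            best = max(stay, advance)
            if p <= 0.0 or best == neg_inf:
                continue
            h[t][n] = math.log(p) + best
            starts[t][n] = advance >= stay
    if h[T][N] == neg_inf:
        raise ValueError("no alignment with non-zero probability exists")
    # Trace the changes in the word index back from (T, N).
    ends = [0] * N
    ends[N - 1] = T
    t, n = T, N
    while t > 0:
        if starts[t][n]:
            if n > 1:
                ends[n - 2] = t - 1  # the previous word ends one frame earlier
            n -= 1
        t -= 1
    return ends


# Hypothetical four-frame example for the decoding result "a b".
FRAMES = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2},
          {"a": 0.2, "b": 0.8}, {"b": 1.0}]
boundaries = align_words(FRAMES, ["a", "b"])
```

For the toy input the product of posteriors is largest when "a" covers the first two frames and "b" the last two, so the returned ending times are 2 and 4.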
5.1.3 Results

In this section experimental results for the entropy-based combination of framewise word posterior probabilities in the minimum frame error framework are given and discussed. The corresponding minimum frame error decoder using the standard combination approach is defined in Section 4.2.1. Experiments are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. For all experiments the acoustic and language model scales, the system weights in the union-based combination approach, and the smoothing parameter α in the minimum frame error decoder are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7. The results are summarized in Table 5.1 and Table 5.2. Especially for the English cross-site combination, the inverse-entropy combination performs better than the minimum entropy approach. However, both entropy-based combination rules are inferior to the standard method of the weighted average.
Table 5.1. Entropy-based combination results for the Chinese 230h testing system, cf. Section B.1.1. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

CER[%]: (del/ins) err
System     Frame Comb.    dev07 (1)            eval07               dev08
baseline                  (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
s1+s2      average        (3.07/1.30) 13.57    (4.69/0.68) 13.95    (3.05/0.70) 12.54
s1+s2      min. entropy   (2.78/1.48) 13.65    (4.52/0.76) 13.95    (2.78/0.81) 12.38
s1+s2      inv. entropy   (3.12/1.28) 13.61    (4.79/0.71) 14.01    (3.12/0.69) 12.55
s1+s2+s3   average        (3.06/1.23) 13.18    (4.72/0.69) 13.71    (3.01/0.72) 12.22
s1+s2+s3   min. entropy   (2.84/1.40) 13.37    (4.65/0.77) 13.85    (2.99/0.79) 12.18
s1+s2+s3   inv. entropy   (3.08/1.23) 13.20    (4.82/0.70) 13.82    (3.09/0.70) 12.10
(1) tuning set
Table 5.2. Entropy-based combination results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

WER[%]: (del/ins) err
System                Frame Comb.    eval06 (1)          eval07
baseline                             (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI+RWTH            average        (1.60/0.85) 6.65    (1.99/0.76) 7.73
LIMSI+RWTH            min. entropy   (1.65/0.97) 6.84    (1.91/0.88) 7.80
LIMSI+RWTH            inv. entropy   (1.50/0.97) 6.64    (1.74/0.90) 7.61
LIMSI+RWTH+UKA        average        (1.80/0.72) 6.48    (2.21/0.68) 7.52
LIMSI+RWTH+UKA        min. entropy   (1.70/1.04) 6.84    (1.84/1.01) 7.84
LIMSI+RWTH+UKA        inv. entropy   (1.67/0.97) 6.64    (1.81/0.88) 7.43
LIMSI+RWTH+UKA+IRST   average        (1.70/0.79) 6.52    (1.93/0.76) 7.26
LIMSI+RWTH+UKA+IRST   min. entropy   (2.13/1.01) 7.88    (2.27/0.95) 8.50
LIMSI+RWTH+UKA+IRST   inv. entropy   (1.88/0.95) 7.29    (2.00/0.93) 8.02
(1) tuning set; eval06 was the official development set in the 2007 evaluation campaign
In their original work [Misra & Bourlard+ 2003; Valente 2009] the authors also find that the inverse entropy method is superior to the minimum entropy method. With both methods they improved over the simple average. However, in their experiments the biggest gains were observed for noisy data, whereas for clean speech almost no improvement was seen. The tasks considered in this work use clean speech only, which is presumably the reason why the combination does not benefit from the entropy-based approaches.
5.2 Word Level Confusion Networks

A word level CN is completely described by the slotwise word posterior probability distributions denoted by $p_s(w|x_1^T)$. In contrast to the framewise CN, in the wordwise CN the articulation of a word is never distributed over several slots: each slot represents a complete word. Alternatively, a word level CN is described by a word lattice $L$ and a slot function $\sigma: E(L) \rightarrow \mathbb{N}$ as introduced in Section 3.4. The posterior probability $p_s(w|x_1^T)$ is computed from the lattice according to Equation (3.12), where the slot function assigns each lattice arc to a single CN slot. The slot function fulfills the constraint that $\sigma(a) < \sigma(b)$ for any two consecutive lattice arcs $a$ and $b$. The constraint guarantees that $p_s(\cdot|x_1^T)$ is a probability distribution, cf. Section 3.4, and that the Levenshtein distance is a lower bound for the CN distance, cf. Section 4.4.
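The description of a word level CN by arcs and a slot function can be illustrated with a short sketch. Equation (3.12) is not reproduced in this section, so the sketch assumes the usual construction: the slotwise posterior of a word is the sum of the posteriors of all arcs with that word in the slot, and the remaining probability mass goes to the empty word. The arc posteriors below are derived from the three paths of the example lattice in Figure 5.2 (path probabilities 0.4, 0.3, and 0.3); the function names and the `<eps>` label are hypothetical.

```python
from collections import defaultdict

EPS = "<eps>"  # label used here for the empty word


def slot_posteriors(arcs, num_slots):
    """Accumulate arc posteriors into slotwise word posteriors.

    arcs: iterable of (word, slot, arc_posterior) triples, i.e. the slot
    function is given explicitly per arc.
    """
    slots = [defaultdict(float) for _ in range(num_slots)]
    for word, slot, posterior in arcs:
        slots[slot][word] += posterior
    for dist in slots:
        rest = 1.0 - sum(dist.values())
        if rest > 1e-12:  # paths that skip the slot contribute the empty word
            dist[EPS] += rest
    return slots


def decode_cn(slots):
    """CN decoding rule: pick the most probable word in each slot."""
    return [max(dist, key=dist.get) for dist in slots]


# Arc posteriors of the Figure 5.2 lattice (word, slot, posterior).
ARCS = [("a", 0, 0.4), ("a", 0, 0.6), ("b", 1, 0.4), ("b", 2, 0.3),
        ("c", 3, 0.4), ("c", 3, 0.3), ("c", 3, 0.3)]
cn = slot_posteriors(ARCS, num_slots=4)
best = decode_cn(cn)
```

Because the two "b" arcs fall into different slots, neither slot gives "b" a majority, and the decoding result omits the word.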
5.2.1 Confidence Warping

The confidence score for a word in the decoding output is a measure of how certain the decoder is about the hypothesized word. Thus, the confidence score can be interpreted as the probability that the hypothesized word is correct [Wessel 2002]. The common confidence scores in LVCSR are based on fCNs [Wessel & Schlüter+ 2001b] or on CNs [Evermann & Woodland 2000; Mangu & Brill+ 2000]. In the simplest approach the slotwise word posteriors derived from the CN are used directly as confidence scores. However, posterior probabilities derived directly from lattices are usually biased due to model assumptions, beam pruning in the search, and subsequent lattice pruning. If all systems in a system combination show the same bias, then the bias presumably does not affect the decoding result. But if completely different systems are combined, e.g. lattices contributed by different sites, the posteriors might be biased differently and the system-dependent bias affects the decoding result.

Focusing on the slotwise word posteriors derived from a CN, the bias can be measured by interpreting the posteriors as confidence scores. The normalized cross-entropy (NCE) or other confidence measures show how close the lattice-based posterior estimates are to the true posteriors [Hillard & Ostendorf 2006; Wessel 2002]. The bias of confidence scores derived from slotwise word posteriors and an algorithm to compensate for the bias are discussed for example in [Hillard & Ostendorf 2006]. Here, the idea is to improve the CNC based system combination, cf. Section 3.4.1, by introducing system-dependent warping functions which compensate for the bias in the word posterior probability distributions of the system-dependent CNs. The bias of slotwise word posteriors derived from LVCSR lattices is almost always characterized by overestimated large probabilities and underestimated small probabilities, or vice versa.
As a consequence, a simple, word- and slot-independent warping function is sufficient to considerably improve the CN based confidence scores. The warping function used in this work is defined in Equation (5.4), where $j$ denotes the system and $b_j$ and $\gamma_j$ are the two system-dependent parameters.

\[
h_j(x) :=
\begin{cases}
1 - (1 - b_j) \left( \dfrac{1 - x}{1 - b_j} \right)^{\gamma_j}, & \text{if } x > b_j \\[2ex]
b_j \left( \dfrac{x}{b_j} \right)^{\gamma_j}, & \text{otherwise}
\end{cases}
\qquad
p'_{j,s}(w|x_1^T) := \frac{h_j\big(p_{j,s}(w|x_1^T)\big)}{\sum_v h_j\big(p_{j,s}(v|x_1^T)\big)} \tag{5.4}
\]
Figure 5.4 in the results section, cf. Section 5.2.3, shows the warping function for b = 0.3 and γ = 0.4 and the result of its application to the slotwise word posteriors derived from a CN. In the right plot the true confidence scores computed on a tuning set are drawn against the confidence estimates. The warped confidence estimates are already very close to the true scores.
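A minimal sketch of the warping function and the subsequent renormalization of Equation (5.4) is given below. It follows the piecewise power-law reading of the equation with the pivot b and the exponent γ and assumes 0 < b < 1 and γ > 0; the function names are hypothetical, and the slot distribution is an invented toy example. The parameter values b = 0.3 and γ = 0.4 are the ones quoted for Figure 5.4.

```python
def warp(x, b, gamma):
    """Piecewise power-law warping of a probability x in [0, 1].

    The function is continuous at the pivot b and maps 0 to 0 and 1 to 1.
    Assumes 0 < b < 1 and gamma > 0.
    """
    if x > b:
        return 1.0 - (1.0 - b) * ((1.0 - x) / (1.0 - b)) ** gamma
    return b * (x / b) ** gamma


def warp_slot(dist, b, gamma):
    """Warp all word posteriors of one CN slot and renormalize them."""
    warped = {word: warp(p, b, gamma) for word, p in dist.items()}
    z = sum(warped.values())
    return {word: value / z for word, value in warped.items()}


# Hypothetical slot with one dominant word.
slot = {"a": 0.9, "b": 0.1}
warped_slot = warp_slot(slot, b=0.3, gamma=0.4)
```

With γ < 1 the function lowers probabilities above the pivot and raises probabilities below it, which counteracts overestimated large and underestimated small posteriors; γ > 1 has the opposite effect.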
In the application for CNC the two parameters $\gamma_j$ and $b_j$ of the system-dependent warping function $h_j(\cdot)$ are first optimized separately for each system for maximum NCE. The expectation is that this brings the slotwise word posterior estimates close to the true posteriors and makes the probabilities comparable among systems. Finally, the $J$ system-dependent $\gamma_j$ are included in the overall parameter optimization process and tuned directly for minimum error rate of the CNC decoding. CNC results with system-dependently warped slotwise word posterior probabilities are presented and discussed in Section 5.2.3.
5.2.2 The windowed Levenshtein Distance: from the CN Distance to the exact Levenshtein Distance

In this section the connection between CN decoding and Bayes risk decoding with the Levenshtein distance as loss function is developed. The idea is to relax the alignment defined by the CN until the Levenshtein alignment is computed. In this work a CN is derived from a lattice via a slot function which assigns a slot number to each arc in the lattice, cf. Section 3.4. In the CN distance the alignment between any two paths through the lattice is defined by the slot function: two arcs taken from the two paths compete with each other if they have the same slot number. For the computation of the Levenshtein distance each arc in the first path would have to be allowed to compete with each arc in the second path, obeying the monotonicity constraint of the Levenshtein alignment. Between the two extremes lies the windowed Levenshtein distance initialized with the CN alignment. For a window of size $2d + 1$, the arc from the first path with slot number $n$ can compete with one of the arcs from the second path with a slot number in $[n - d, n + d]$. For $d = 0$, the result is the original CN distance, and for sufficiently large $d$ the exact Levenshtein distance is obtained.

The idea of applying the windowed Levenshtein distance initialized with a CN alignment is motivated by experimental results. In preliminary experiments the alignment between the Viterbi hypothesis and the reference was derived from a common CN construction algorithm. It turned out that a symmetric window of size three or five was almost always sufficient to find the exact Levenshtein alignment. The example given in Figure 5.2 shows a common mistake in the alignment produced by a heuristically working CN construction algorithm. The two "b" arcs in the lattice do not overlap in time and thus they are not clustered into the same slot.
As a result the alignments defined by the CN are different from the Levenshtein alignments, and the outcome of the Bayes risk decoder with the CN distance as loss function differs from the result of the Bayes risk decoder with the Levenshtein distance. The windowed Levenshtein distance with a window size of three would be sufficient to get the correct Levenshtein alignments.

In the following the general windowed Levenshtein distance decoder with an arbitrary window size and an initial CN alignment is developed within the Bayes risk decoding framework. Afterwards it is shown that the result for a window of size one equals the CN decoding rule given in Equation (3.28) and that for a sufficiently large window the Bayes risk decoder with the Levenshtein distance as loss function is obtained. For a window of size one the decoding is a local decision which is made independently for each CN slot, i.e. the classic CN decoding. For a window larger than one the locality of the decision is no longer given and decoding becomes a nontrivial problem. The decoding of a slot now depends on the decisions made for the neighboring slots. Furthermore, the set of possible hypotheses increases beyond the hypotheses defined by the CN. In the following, dynamic programming equations are derived which efficiently compute an approximation of the Bayes risk decoding rule with the windowed Levenshtein distance as loss function. For a window of size one and for sufficiently large windows, i.e. for the CN distance and for the exact Levenshtein distance, the equations produce the correct Bayes risks.

The rest of the section is organized as follows. A recursive definition of the Levenshtein distance is introduced from which the windowed Levenshtein distance will be derived. At the same time the hypothesis space is constructed such that the inclusion of the Bayes risk hypothesis is guaranteed, where the size of the resulting hypothesis space is a function of the initial CN and the window size.
The computation of the Bayes risk consists of the computation of an outer loop going over the hypothesis space, an inner loop going over the summation space, and the computation of the loss function and the sentence posterior probability. It will be shown that loss and sentence probability can be computed without approximation, whereas the two loops require approximations in order to enable an efficient computation. Finally, the dynamic programming equations are derived, followed by an exact analysis of the runtime and memory requirements.
[Figure 5.2: example lattice with CN positions, the derived CN (slots a:0, b:1/EPS:1, b:2/EPS:2, c:3), and windowed Levenshtein alignments. N-best list with CN positions: 1. a:0 b:1 c:3, p=0.4; 2. a:0 b:2 c:3, p=0.3; 3. a:0 c:3, p=0.3. Panel a), window size 1 (CN case): Bayes risk hypothesis "a:0 EPS:1 EPS:2 c:3", err = 0.4x1 + 0.3x1 + 0.3x0 = 0.7. Panel b), window size 3: Bayes risk hypothesis "a:0 b:1 EPS:2 c:3", err = 0.4x0 + 0.3x0 + 0.3x1 = 0.3.]
Figure 5.2. Example for a typical error made by the common CN construction algorithms and the correction of the error by using a windowed Levenshtein distance, where the window is centered around the CN alignment. The example lattice consists of three paths which are listed to the right of the lattice together with their path probabilities. The arc labels in the lattice are composed of the word, the CN slot to which the arc is assigned, and the arc probability. The resulting CN is drawn below the lattice. To the right of the CN an example for the possible alignment position of arc “b:1” within a windowed Levenshtein alignment is given: a) shows the only possible alignment position for a window of size one, b) shows the possible alignment positions for a symmetric window of size three. The lower part of the figure shows the alignments for the Bayes risk hypotheses for different window sizes with the windowed Levenshtein distance as cost function. Alignment a) is the outcome for a window of size one, which is equivalent to the standard CN decoding. Alignment b) uses a symmetric window of size three. The larger window allows the alignment of “b:1” and “b:2” which compensates for the flaw in the CN construction, where the two arcs were assigned to different slots. The Bayes risk hypothesis for a window of size three is “a b c”, which is also the minimum WER hypothesis for the example lattice.
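The numbers of the example can be reproduced with a simplified sketch of the windowed Levenshtein distance. The sketch is an illustration under stated assumptions, not the decoder developed in this section: paths are given as lists of (word, slot) pairs without explicit empty words, two arcs may be aligned (match or substitution) only if their slot numbers differ by at most d, and insertions and deletions are always allowed at unit cost. The function names are hypothetical; the three paths and their probabilities 0.4, 0.3, and 0.3 are those of the example lattice.

```python
def windowed_levenshtein(hyp, ref, d):
    """Edit distance where arcs compete only within a slot window of 2d + 1.

    hyp, ref: lists of (word, slot) pairs with increasing slot numbers.
    d = 0 gives the CN distance; a sufficiently large d gives the
    unconstrained Levenshtein distance.
    """
    M, N = len(hyp), len(ref)
    cost = [[0] * (N + 1) for _ in range(M + 1)]
    for m in range(1, M + 1):
        cost[m][0] = m
    for n in range(1, N + 1):
        cost[0][n] = n
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            (v, slot_v), (w, slot_w) = hyp[m - 1], ref[n - 1]
            best = min(cost[m - 1][n], cost[m][n - 1]) + 1  # deletion / insertion
            if abs(slot_v - slot_w) <= d:  # arcs may compete inside the window
                best = min(best, cost[m - 1][n - 1] + (0 if v == w else 1))
            cost[m][n] = best
    return cost[M][N]


def expected_risk(hyp, nbest, d):
    """Posterior-weighted windowed Levenshtein distance of one hypothesis."""
    return sum(p * windowed_levenshtein(hyp, ref, d) for ref, p in nbest)


NBEST = [([("a", 0), ("b", 1), ("c", 3)], 0.4),
         ([("a", 0), ("b", 2), ("c", 3)], 0.3),
         ([("a", 0), ("c", 3)], 0.3)]
HYP_AC = [("a", 0), ("c", 3)]
HYP_ABC = [("a", 0), ("b", 1), ("c", 3)]
risk_cn = expected_risk(HYP_AC, NBEST, d=0)       # window size 1
risk_window = expected_risk(HYP_ABC, NBEST, d=1)  # window size 3
```

With d = 0 the hypothesis "a c" has risk 0.7, while "a b c" would cost 0.9 because "b:1" and "b:2" cannot be aligned. With d = 1 the two "b" arcs may compete and the risk of "a b c" drops to 0.3.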
The Levenshtein distance. For the following considerations a recursive definition of the Levenshtein distance is required. The recursion is defined via an auxiliary cost function $C$ such that $\mathrm{Lev}(v_1^M, w_1^N) = C(M, N; v_1^M, w_1^N)$. The recursion is given by

\[
\begin{aligned}
C(m, n; v_1^M, w_1^N) :=&\ \min \left\{
\begin{array}{l}
d(v_m, w_n) + C(m-1, n-1; v_1^M, w_1^N), \\
d(\epsilon, w_n) + C(m, n-1; v_1^M, w_1^N), \\
d(v_m, \epsilon) + C(m-1, n; v_1^M, w_1^N)
\end{array}
\right\} \\
=&\ \min_{j \in [1, n+1]} \left\{ \mathrm{Lev}(v_m, w_j^n) + C(m-1, j-1; v_1^M, w_1^N) \right\}.
\end{aligned}
\]

The equation describes a computation of the Levenshtein distance which is position-synchronous in $v_1^M$, where the computation of a cost at position $m$ depends only on costs at the previous position $m-1$. For the Levenshtein distance the local cost $d(v, w)$ is defined as

\[
d(v, w) := \begin{cases} 0 & \text{if } v = w \\ 1 & \text{otherwise,} \end{cases}
\]

but in general any local cost can be substituted.

The partial risk. The so-called partial risk of $w_1^n$, $n \le N$, given acoustic observation sequence $x_1^T$ and given word sequence $v_1^m$, $m \le M$, is defined as the partial Levenshtein cost weighted by the posterior probability of the complete hypothesis $w_1^N$:

\[
R(m, n; v_1^M, w_1^N) := p(w_1^N \mid x_1^T)\, C(m, n; v_1^M, w_1^N)
\]

The partial risk depends on the observed feature sequence $x_1^T$, but for the sake of clarity the dependency is dropped from the notation. The Bayes risk decoding rule with the Levenshtein distance as loss function can be rewritten in terms of the partial risk:

\[
\begin{aligned}
x_1^T \rightarrow g(x_1^T) :=&\ \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} p(w_1^N \mid x_1^T)\, \mathrm{Lev}(v_1^M, w_1^N) \\
=&\ \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} R(M, N; v_1^M, w_1^N)
\end{aligned}
\]
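The position-synchronous recursion can be illustrated with a short Python sketch. It is a minimal illustration and not the implementation used in this work; the function name `levenshtein`, the parameter `local_cost`, and the use of `None` as a stand-in for the empty word are assumptions made for the example.

```python
def levenshtein(v, w, local_cost=lambda a, b: 0 if a == b else 1):
    """Levenshtein distance between the sequences v and w.

    The computation is position-synchronous in v: the cost column for
    position m is derived from the column for position m - 1 only.
    """
    EPS = None  # illustrative stand-in for the empty word

    # Column for m = 0: every word of w is an insertion.
    prev = [0]
    for n in range(1, len(w) + 1):
        prev.append(prev[n - 1] + local_cost(EPS, w[n - 1]))

    for m in range(1, len(v) + 1):
        cur = [prev[0] + local_cost(v[m - 1], EPS)]
        for n in range(1, len(w) + 1):
            cur.append(min(
                prev[n - 1] + local_cost(v[m - 1], w[n - 1]),  # match or substitution
                cur[n - 1] + local_cost(EPS, w[n - 1]),        # insertion of w_n
                prev[n] + local_cost(v[m - 1], EPS),           # deletion of v_m
            ))
        prev = cur
    return prev[-1]
```

Passing a different `local_cost` corresponds to substituting another local cost $d(v, w)$ in the recursion.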
The summation and hypothesis space. The further steps require that all sentences in the hypothesis and summation space have equal length $S$. The summation space $\mathcal{S}$ can be restricted to word sequences $w_1^N$ with $p(w_1^N \mid x_1^T) > 0$; obviously, a word sequence with a probability of zero does not contribute to the summation. By inserting the empty word $\epsilon$, all word sequences in $\mathcal{S}$ can be expanded to equal length $S'$, which yields the aligned summation space $\mathcal{S}_{S'}$. The positions for inserting the $\epsilon$s are given by the initial CN alignment and $S'$ equals the number of slots in the CN.

Before continuing with the definitions of aligned summation and hypothesis spaces, some properties of the hypothesis space are investigated which motivate the next steps. The hypothesis space is in general larger than the aligned summation space, as illustrated in the following example:

$w_1^N$ | $p(w_1^N \mid x_1^T)$
abcdf | $0.\bar{3}$
bcde | $0.\bar{3}$
acde | $0.\bar{3}$
$g(x_1^T)$ = abcde | err = 1

The example shows that the Bayes risk hypothesis “abcde” is not contained in the summation space $\mathcal{S}$ = {“abcdf”, “bcde”, “acde”}. Furthermore, the hypothesis space can contain word sequences which are longer than the sequences in the aligned summation space. The next example shows such a case:

$w_1^N$ | $p(w_1^N \mid x_1^T)$
abcd | $0.\bar{3}$
bcde | $0.\bar{3}$
acde | $0.\bar{3}$
$g(x_1^T)$ = abcde | err = 1
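The expected error count of a given hypothesis in such examples can be reproduced with a few lines of Python. This is a hedged sketch for illustration only: the helper names `lev` and `expected_errors` are made up here, exact fractions are used for the posteriors of one third, and the code evaluates a given hypothesis rather than searching the hypothesis space.

```python
from fractions import Fraction


def lev(v, w):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(w) + 1))
    for m, a in enumerate(v, 1):
        cur = [m]
        for n, b in enumerate(w, 1):
            cur.append(min(prev[n - 1] + (a != b),  # match or substitution
                           cur[n - 1] + 1,          # insertion
                           prev[n] + 1))            # deletion
        prev = cur
    return prev[-1]


def expected_errors(hyp, posterior):
    """Expected Levenshtein distance of `hyp` under sentence posteriors."""
    return sum(p * lev(hyp, w) for w, p in posterior.items())


third = Fraction(1, 3)
example_1 = {"abcdf": third, "bcde": third, "acde": third}
example_2 = {"abcd": third, "bcde": third, "acde": third}

risk_1 = expected_errors("abcde", example_1)
risk_2 = expected_errors("abcde", example_2)
```

Both `risk_1` and `risk_2` evaluate to 1, the value listed as "err" in the two tables.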
Again, the Bayes risk hypothesis “abcde” is not contained in the summation space. Keep in mind that the goal is to define a hypothesis and a summation space in which all sequences have equal length $S$. Let $\hat{M}$ be the length of the shortest Bayes risk hypothesis. It is easy to see that for the Levenshtein distance as loss function $\hat{M} < 2S'$ holds: the maximum Levenshtein distance between two sequences is the number of words in the longer sequence, that is, an alignment with more insertions and deletions than the number of words in the longer sequence cannot be the Levenshtein alignment. $S$ is set to $2S'$ (or $2S' - 1$, if $S'$ is odd) and a new aligned summation space $\mathcal{S}_S$ is constructed by adding $\lfloor S'/2 \rfloor \times \epsilon$ as prefix and as suffix to every sequence in the old aligned summation space $\mathcal{S}_{S'}$.

Let us use the first example to produce the required quantities step by step. The summation space is given by $\mathcal{S}$ := {“abcdf”, “bcde”, “acde”}. By inserting $\epsilon$s at the appropriate positions, e.g. given by a CN, an aligned summation space with $S' = 5$ is derived: $\mathcal{S}_5$ consists of “abcdf” and of “bcde” and “acde” with one $\epsilon$ inserted each. The Bayes risk hypothesis is “abcde”, which has length 5, and thus $\hat{M} = 5$. For the final summation space $\lfloor S'/2 \rfloor = 2$ $\epsilon$s are attached to the beginning and to the end of each sentence in $\mathcal{S}_5$. The result is $\mathcal{S}_9$ and thus $S = 9$.

The next equations give the formal definitions of the summation space and of the set of all words in the summation space at position $n$, denoted by $\mathcal{S}_S^{(n)}$, where $\Sigma$ denotes the vocabulary:

\[
\begin{aligned}
\mathcal{S}_S &:= \left\{ w_1^S : p(w_1^S \mid x_1^T) > 0 \right\} \subset \left( \Sigma \cup \{\epsilon\} \right)^S, \\
\mathcal{S}_S^{(n)} &:= \left\{ w_n : w_1^S \in \mathcal{S}_S \right\}
\end{aligned}
\]

The hypothesis space corresponding to summation space $\mathcal{S}_S$ is defined with the help of $\mathcal{S}_S^{(n)}$ as

\[
\mathcal{H}_S := \left( \bigcup_{i=1}^{S} \mathcal{S}_S^{(i)} \right)^S .
\]
That is, at each position each word can occur which is contained anywhere in the summation space. It is easy to see that this hypothesis space contains all possible outcomes of the Bayes risk decoding rule with the CN distance or the (windowed) Levenshtein distance as loss function. It is worth mentioning that using $\mathcal{S}_S^{(n)}$ as hypothesis space at position $n$, as in the CN decoding rule, is in general not sufficient, as shown in the following example:

$w_1^N$ | $p(w_1^N \mid x_1^T)$
abcdf | $0.\bar{3}$
bcde | $0.\bar{3}$
acde | $0.\bar{3}$
$g(x_1^T)$ = abcde | err = 1

In summary, there always exists an $S$ such that the constructed hypothesis and summation space fulfill

\[
\begin{aligned}
x_1^T \rightarrow g(x_1^T) :=&\ \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} p(w_1^N \mid x_1^T)\, \mathrm{Lev}(v_1^M, w_1^N) \\
=&\ \operatorname*{argmin}_{v_1^S \in \mathcal{H}_S} \sum_{w_1^S \in \mathcal{S}_S} p(w_1^S \mid x_1^T)\, \mathrm{Lev}(v_1^S, w_1^S).
\end{aligned}
\]
In the remainder it is assumed that word sequences are taken from the aligned hypothesis and the aligned summation space of length S, i.e. all word sequences are assumed to have equal length S, where the empty word can occur at any position in the word sequence.
The windowed Levenshtein distance and the windowed risk. For an initial alignment of two word sequences $v_1^S$ and $w_1^S$ the window is defined as the maximum deviation $d$, $d \ge 0$, from the initial alignment, i.e. $v_n$ can be aligned to $w_{n-d}, \ldots, w_n, \ldots, w_{n+d}$. The resulting windowed cost is given by

\[
C_d(m, n; v_1^S, w_1^S) := \min_{j \in [m-d, n+1]} \left\{ \mathrm{Lev}(v_m, w_j^n) + C_d(m-1, j-1; v_1^S, w_1^S) \right\}.
\]

The windowed cost is only defined for $m - d \le n \le m + d$. It is more convenient to define $n$ in terms of the deviation $i$ from $m$, i.e. $n = m + i$ with $-d \le i \le d$:

\[
C_{d,i}(m; v_1^S, w_1^S) := \min_{j \in [-d, i+1]} \left\{ \mathrm{Lev}(v_m, w_{m+j}^{m+i}) + C_{d,j}(m-1; v_1^S, w_1^S) \right\}
\]

The notation can be interpreted as having a cost vector of fixed length $2d + 1$ at each position $m$. The definitions of the windowed Levenshtein distance and the windowed risk are now straightforward,

\[
\begin{aligned}
\mathrm{Lev}_d(v_1^S, w_1^S) &:= C_{d,0}(S; v_1^S, w_1^S) \\
R_{d,i}(S; v_1^S, w_1^S) &:= p(w_1^S \mid x_1^T)\, C_{d,i}(S; v_1^S, w_1^S)
\end{aligned}
\]

and the following inequalities are a direct consequence of the fact that the Levenshtein distance is a lower bound for the windowed Levenshtein distance:

\[
\begin{aligned}
\mathrm{Lev}(v_1^S, w_1^S) &= \mathrm{Lev}_S(v_1^S, w_1^S) \le \cdots \le \mathrm{Lev}_{d+1}(v_1^S, w_1^S) \le \mathrm{Lev}_d(v_1^S, w_1^S) \\
R(S, S; v_1^S, w_1^S) &= R_{S,0}(S; v_1^S, w_1^S) \le \cdots \le R_{d+1,0}(S; v_1^S, w_1^S) \le R_{d,0}(S; v_1^S, w_1^S)
\end{aligned}
\]
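The effect of the window can be illustrated with a banded dynamic program over two $\epsilon$-padded sequences of equal length. The following Python sketch is a simplified stand-in for the recursion above, not the exact definition: the band restricts the prefix lengths $(m, n)$ to $|m - n| \le d$, the empty string represents $\epsilon$, and inserting or deleting an $\epsilon$ is assumed to be free. All names are illustrative.

```python
EPS = ""  # illustrative representation of the empty word


def windowed_levenshtein(v, w, d):
    """Banded edit distance between aligned, equally long sequences.

    Only cells (m, n) with |m - n| <= d are reachable, so d = 0 reduces
    to a slot-wise comparison and a large d gives the plain Levenshtein
    distance of the sequences with empty words removed.
    """
    assert len(v) == len(w)
    S = len(v)
    INF = float("inf")
    cost = [[INF] * (S + 1) for _ in range(S + 1)]
    cost[0][0] = 0
    for m in range(S + 1):
        for n in range(S + 1):
            if (m == 0 and n == 0) or abs(m - n) > d:
                continue  # start cell, or cell outside the window
            best = INF
            if m > 0 and n > 0:  # align v_m with w_n
                best = min(best, cost[m - 1][n - 1] + (v[m - 1] != w[n - 1]))
            if m > 0:            # v_m stays unaligned
                best = min(best, cost[m - 1][n] + (v[m - 1] != EPS))
            if n > 0:            # w_n stays unaligned
                best = min(best, cost[m][n - 1] + (w[n - 1] != EPS))
            cost[m][n] = best
    return cost[S][S]


# Hypothetical aligned pair: the initial alignment is shifted by one slot.
v = ["a", "b", "c"]
w = ["b", "c", EPS]
costs = [windowed_levenshtein(v, w, d) for d in (0, 1, 3)]
```

For this pair `costs` is `[3, 1, 1]`: with `d = 0` every slot counts as an error, whereas a window of size three already recovers the true Levenshtein distance of one, in line with the chain of inequalities above.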
For the windowed Levenshtein alignment the following holds: a hypothesis word $v_m$ can only be aligned to a word in $\{w_{m-d}, \ldots, w_{m+d}\}$, $w_1^S \in \mathcal{S}_S$, and consequently the following hypothesis space is sufficient for the windowed Levenshtein distance decoder:

\[
\mathcal{H}_{S,d} := \left( \bigcup_{i=m-d}^{m+d} \mathcal{S}_S^{(i)} \right)_{m=1}^{S}
\]
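A small Python sketch shows how the position-wise word sets and the windowed hypothesis space can be collected from an aligned summation space. The function names and the toy data are hypothetical; positions outside the range 1 to $S$ are simply ignored here, which is an assumption of the sketch.

```python
EPS = ""  # illustrative representation of the empty word


def slot_words(aligned_space):
    """Set of words occurring at each position of the aligned sequences."""
    S = len(aligned_space[0])
    return [{seq[n] for seq in aligned_space} for n in range(S)]


def windowed_hypothesis_space(aligned_space, d):
    """Per-position union of the slot word sets within a window of +/- d."""
    slots = slot_words(aligned_space)
    S = len(slots)
    return [set().union(*slots[max(0, m - d):min(S, m + d + 1)])
            for m in range(S)]


# Hypothetical aligned summation space with S = 3 slots.
space = [("a", "b", "c"), (EPS, "b", "d"), ("a", "c", "c")]
h0 = windowed_hypothesis_space(space, 0)
h1 = windowed_hypothesis_space(space, 1)
```

With `d = 0` the result equals the slot-wise word sets used by the CN decoding rule; with `d = 1` each position additionally admits the words of its two neighbouring slots.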
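Taking the hypothesis space and the above approximation, the following inequalities for the windowed Bayes risk decoding rule are derived for going from a window size of $S$ down to $d$:

\[
\begin{aligned}
x_1^T \rightarrow r :=&\ \min_{v_1^M} \sum_{w_1^N} p(w_1^N \mid x_1^T)\, \mathrm{Lev}(v_1^M, w_1^N) \\
=&\ \min_{v_1^S \in \mathcal{H}_{S,S}} \sum_{w_1^S \in \mathcal{S}_S} R_{S,0}(S; v_1^S, w_1^S) \\
=&\ \min_{v_1^S \in \mathcal{H}_{S,S-1}} \sum_{w_1^S \in \mathcal{S}_S} R_{S-1,0}(S; v_1^S, w_1^S) \\
\le&\ \ldots \\
\le&\ \min_{v_1^S \in \mathcal{H}_{S,d}} \sum_{w_1^S \in \mathcal{S}_S} R_{d,0}(S; v_1^S, w_1^S)
\end{aligned}
\]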
The approximated posterior probability. The posterior probability is approximated by applying the chain rule and shortening the sequence in the condition (the “history”) to a fixed length $L \ge 0$, i.e. the posteriors are approximated by an $L$-gram model conditioned on the acoustic observations:

\[
\begin{aligned}
p(w_1^S \mid x_1^T) &= p(w_S \mid w_1^{S-1}, x_1^T)\, p(w_{S-1} \mid w_1^{S-2}, x_1^T) \cdots p(w_1 \mid x_1^T) \\
&\approx p(w_S \mid w_{S-L}^{S-1}, x_1^T)\, p(w_{S-1} \mid w_{S-L-1}^{S-2}, x_1^T) \cdots p(w_1 \mid x_1^T)
\end{aligned}
\]

For the partial product of the approximated posteriors a new notation is introduced, where $L$ is set to $2d$. That is, a history together with the predicted word spans exactly the window which is used for the windowed Levenshtein distance. The product is defined recursively as

\[
\begin{aligned}
\tilde{P}_d(n; w_1^S) :=&\ p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T)\, p(w_{n+d-1} \mid w_{n-d-1}^{n+d-2}, x_1^T) \cdots p(w_1 \mid x_1^T) \\
=&\ p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T)\, \tilde{P}_d(n-1; w_1^S).
\end{aligned}
\]
With the help of the approximated posteriors the following windowed risk is defined:

\[
\tilde{R}_{d,i}(n; v_1^S, w_1^S) := \tilde{P}_d(n; w_1^S)\, C_{d,i}(n; v_1^S, w_1^S)
\]

The approximation in the posteriors does not cause an approximation in the corresponding Bayes risk computation with the windowed Levenshtein distance as loss function. In other words, replacing the correct posteriors in the Bayes risk formula by the approximated ones still yields the correct result:

\[
\min_{v_1^S} \sum_{w_1^S} R_{d,0}(S; v_1^S, w_1^S) = \min_{v_1^S} \sum_{w_1^S} \tilde{R}_{d,0}(S; v_1^S, w_1^S)
\]
The reason is the locality of the errors in the windowed Levenshtein distance. The decision whether a sequence $w_1^S$ in the summation space contributes to the error of the hypothesized word $v_n$ is made in the local window around position $n$. Thus, in the summation the leading and trailing parts of the sequences in the summation space collapse. For a window of size $2d + 1$ a history of length $2d$ (or larger) is required in order to get the correct windowed Bayes risk result.

The first step of the proof is to show that Bayes risk decoding with the windowed Levenshtein distance relies only on the posterior probabilities of sequences of length $2d + 1$ centered at position $n$. Let the windowed alignment of $v_1^S$ and $w_1^S$ be denoted by $A_1^S$, where $A_n$ contains all the information required by the loss function $L(n; v_n, w_{n-d}^{n+d}, A_n)$ to compute the number of errors due to $v_n$: $v_n$ can be aligned to one of the words in $w_{n-d}^{n+d}$ or it can be an insertion. Furthermore, the alignment of $v_n$ can cause the alignment of one or several $\epsilon$s to words in $w_{n-d}^{n+d}$. The loss function $L$ is only an auxiliary construct for this proof and is not to be confused with the recursively defined cost function $C$. With the help of the loss function $L$ the Bayes risk for the windowed Levenshtein distance can be computed as:

\[
\begin{aligned}
x_1^T \rightarrow r_d :=&\ \min_{v_1^S} \sum_{w_1^S} p(w_1^S \mid x_1^T)\, \mathrm{Lev}_d(v_1^S, w_1^S) \\
=&\ \min_{v_1^S} \min_{A_1^S} \sum_{w_1^S} p(w_1^S \mid x_1^T) \sum_{n=1}^{S} L(n; v_n, w_{n-d}^{n+d}, A_n) \\
=&\ \min_{v_1^S} \min_{A_1^S} \sum_{n=1}^{S} \sum_{u_{n-d}^{n+d}} L(n; v_n, u_{n-d}^{n+d}, A_n) \sum_{w_1^S : w_{n-d}^{n+d} = u_{n-d}^{n+d}} p(w_1^S \mid x_1^T) \\
=&\ \min_{v_1^S} \min_{A_1^S} \sum_{n=1}^{S} \sum_{u_{n-d}^{n+d}} L(n; v_n, u_{n-d}^{n+d}, A_n)\, p(u_{n-d}^{n+d} \mid x_1^T)
\end{aligned}
\]
The crucial step of the proof is to show that $p(u_{n-d}^{n+d} \mid x_1^T)$ can be computed from the approximated word sequence posteriors. In other words, the proof is concluded by showing that the following equality holds:

\[
p(u_{n-d}^{n+d} \mid x_1^T) = \sum_{w_1^S : w_{n-d}^{n+d} = u_{n-d}^{n+d}} p(w_1^S \mid x_1^T) \overset{!}{=} \sum_{w_1^S : w_{n-d}^{n+d} = u_{n-d}^{n+d}} \prod_{i=1}^{S} p(w_i \mid w_{i-2d}^{i-1}, x_1^T)
\]

For $d = 0$ (window size of one) this is easy to see:

\[
\sum_{w_1^S : w_n = u_n} \prod_{i=1}^{S} p(w_i \mid x_1^T) = p(u_n \mid x_1^T) \prod_{i=1,\, i \ne n}^{S} \underbrace{\sum_{w} p(w \mid x_1^T)}_{=1} = p(u_n \mid x_1^T)
\]
Next, the proof is shown for $d = 1$; the extension to $d > 1$ is straightforward.

\[
\begin{aligned}
& \sum_{w_1^S : w_{n-1}^{n+1} = u_{n-1}^{n+1}} \prod_{i=1}^{S} p(w_i \mid w_{i-2}^{i-1}, x_1^T) \\
&= \sum_{w_1^S : w_{n-1}^{n+1} = u_{n-1}^{n+1}} p(u_{n+1} \mid u_{n-1}, u_n, x_1^T)\, p(u_n \mid w_{n-2}, u_{n-1}, x_1^T)\, p(u_{n-1} \mid w_{n-3}, w_{n-2}, x_1^T) \prod_{i=1}^{n-2} p(w_i \mid w_{i-2}^{i-1}, x_1^T) \prod_{i=n+2}^{S} p(w_i \mid w_{i-2}^{i-1}, x_1^T) \\
&= p(u_{n+1} \mid u_{n-1}, u_n, x_1^T) \sum_{w_{n-2}} p(u_n \mid w_{n-2}, u_{n-1}, x_1^T) \sum_{w_{n-3}} p(u_{n-1} \mid w_{n-3}, w_{n-2}, x_1^T) \underbrace{\left( \sum_{w_1^{n-4}} \prod_{i=1}^{n-2} p(w_i \mid w_{i-2}^{i-1}, x_1^T) \right)}_{= p(w_{n-3}, w_{n-2} \mid x_1^T)\ (*)} \underbrace{\sum_{w_{n+2}^{S}} \prod_{i=n+2}^{S} p(w_i \mid w_{i-2}^{i-1}, x_1^T)}_{=1} \\
&= p(u_{n+1} \mid u_{n-1}, u_n, x_1^T) \sum_{w_{n-2}} p(u_n \mid w_{n-2}, u_{n-1}, x_1^T) \sum_{w_{n-3}} p(u_{n-1} \mid w_{n-3}, w_{n-2}, x_1^T)\, p(w_{n-3}, w_{n-2} \mid x_1^T) \\
&= p(u_{n-1}, u_n, u_{n+1} \mid x_1^T)
\end{aligned}
\]

The proof is concluded by showing that assumption $(*)$ made in the derivation is correct:

\[
\begin{aligned}
& \sum_{w_1^{n-4}} \prod_{i=1}^{n-2} p(w_i \mid w_{i-2}^{i-1}, x_1^T) \\
&= \sum_{w_{n-4}} p(w_{n-2} \mid w_{n-4}, w_{n-3}, x_1^T) \sum_{w_{n-5}} p(w_{n-3} \mid w_{n-5}, w_{n-4}, x_1^T) \cdots \sum_{w_2} p(w_4 \mid w_2, w_3, x_1^T) \underbrace{\sum_{w_1} p(w_3 \mid w_1, w_2, x_1^T)\, p(w_2 \mid w_1, x_1^T)\, p(w_1 \mid x_1^T)}_{= p(w_2, w_3 \mid x_1^T)} \\
&= p(w_{n-3}, w_{n-2} \mid x_1^T)
\end{aligned}
\]

The approximated summation. In order to make the computation of the sum over the summation space $\mathcal{S}_S$ feasible on a structure like a lattice, the dependencies of the summands have to be reduced, i.e. for a window of size $2d + 1$ it is required that the sum at position $n$ depends only on its $2d$ predecessors. For the posterior probabilities this is achieved by defining a recursive function which computes the marginals of the approximated posteriors. No further approximation is required, because the context of the conditional posteriors is already limited, i.e.

\[
\sum_{w_1^{n-d-1}} \tilde{P}_d(n; w_1^S) = p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S).
\]

Making explicit use of the fact that the dependency is bounded by the window size, the marginals can be computed as

\[
\tilde{P}_d(n; w_{n-d}^{n+d}) := p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}).
\]
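The property proved above, namely that the window marginals of the $L$-gram approximated posterior coincide with the true window marginals, can be checked numerically on a toy distribution. The following Python sketch is only an illustration under stated assumptions: the distribution `dist` is made up, positions are 0-based, the conditionals are estimated by brute force from the full distribution, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def marginal(dist, lo, hi):
    """Marginal of `dist` over the positions lo, ..., hi - 1 (0-based)."""
    out = defaultdict(Fraction)
    for seq, p in dist.items():
        out[seq[lo:hi]] += p
    return dict(out)


def lgram_approximation(dist, L):
    """Product of position-dependent conditionals with history length L."""
    S = len(next(iter(dist)))
    alphabet = sorted({word for seq in dist for word in seq})
    approx = {}
    for seq in product(alphabet, repeat=S):
        q = Fraction(1)
        for i in range(S):
            lo = max(0, i - L)
            num = marginal(dist, lo, i + 1).get(seq[lo:i + 1], Fraction(0))
            if num == 0:
                q = Fraction(0)
                break
            q *= num / marginal(dist, lo, i)[seq[lo:i]]
        if q:
            approx[seq] = q
    return approx


# Toy posterior over two aligned sequences that share the context (b, c).
half = Fraction(1, 2)
dist = {("a", "b", "c", "d"): half, ("e", "b", "c", "f"): half}
d = 1
approx = lgram_approximation(dist, 2 * d)
windows_match = all(
    marginal(approx, lo, lo + 2 * d + 1) == marginal(dist, lo, lo + 2 * d + 1)
    for lo in range(len(next(iter(dist))) - 2 * d)
)
```

In this example the approximated distribution spreads its mass over four sequences instead of two, yet `windows_match` is `True`: every marginal over a window of $2d + 1 = 3$ adjacent positions is unchanged.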
Next, the so-called marginal risk is defined, which is computed over a window of fixed size. In order to reduce the dependency to the last $2d$ positions, the sum in the risk computation has to be approximated. The alignment of a hypothesis word is already limited to the last $2d$ positions by using the windowed cost function, but the approximation is still required because the risk contains a sum over a minimum and the minimum operation does not distribute over addition:

\[
\begin{aligned}
\sum_{w_1^{n-d-1}} \tilde{R}_{d,i}(n; v_1^S, w_1^S) &= \sum_{w_1^{n-d-1}} \tilde{P}_d(n; w_1^S)\, C_{d,i}(n; v_1^S, w_1^S) \\
&= p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \sum_{w_1^{n-d-1}} \min_{j \in [-d, i+1]} \left\{ \tilde{P}_d(n-1; w_1^S) \left[ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) + C_{d,j}(n-1; v_1^S, w_1^S) \right] \right\} \\
&\le p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \min_{j \in [-d, i+1]} \left\{ \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S)\, \mathrm{Lev}(v_n, w_{n+j}^{n+i}) + \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S)\, C_{d,j}(n-1; v_1^S, w_1^S) \right\} \\
&= p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \min_{j \in [-d, i+1]} \left\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S) + \sum_{w_1^{n-d-1}} \tilde{R}_{d,j}(n-1; v_1^S, w_1^S) \right\}
\end{aligned}
\]
Applying the approximation to $n-1, n-2, \ldots$ the following recursion is derived, which defines the approximated marginal risk:

\[
\tilde{R}_{d,i}(n; v_1^n, w_{n-d}^{n+d}) := p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \min_{j \in [-d, i+1]} \left\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) + \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{R}_{d,j}(n-1; v_1^{n-1}, w_{n-1-d}^{n-1+d}) \right\}
\]

The approximated marginal risk efficiently computes an approximation of the sum over the aligned summation space by considering only a context of fixed size, which is set to $2d$. The following inequality results from the approximation:

\[
\sum_{w_1^S \in \mathcal{S}_S} R_{d,0}(S; v_1^S, w_1^S) \le \sum_{w_{S-d}^{S+d} : w_1^S \in \mathcal{S}_S} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d})
\]
Unfortunately, the approximation destroys the hierarchy w.r.t. the window size: swapping sum and minimum can lead to the preference of an alignment which yields the lowest cost up to the current window, but not the lowest final cost.

The approximated minimum. The last operation preventing an efficient computation is the minimum over the hypothesis space. In general, the costs of two hypotheses can only be compared after all words in the hypotheses have been aligned, even when using the windowed Levenshtein distance. That is, when comparing two partial hypotheses up to position $n$, the minimum over all possible expansions to full length $S$ has to be taken into account in order to guarantee the correct result. The approximation consists of considering only the next $d$ positions instead of all positions up to $S$.
For the approximation the definition of the summation space over all subsequences in a given range is needed,

\[
\mathcal{S}_S^{(m,n)} := \left\{ w_m^n : w_1^S \in \mathcal{S}_S \right\},
\]

and also the definition of the hypothesis space at a given position and for a given range:

\[
\mathcal{H}_{S,d}^{(m)} := \bigcup_{i=m-d}^{m+d} \mathcal{S}_S^{(i)}, \qquad \mathcal{H}_{S,d}^{(m,n)} := \mathcal{H}_{S,d}^{(m)} \times \cdots \times \mathcal{H}_{S,d}^{(n)}.
\]

For computing the hypothesis up to position $n$ all possible expansions to length $n + d$ are considered, i.e. the algorithm looks $d$ positions into the future:

\[
\tilde{v}_1^{n-d} := \operatorname*{argmin}_{v_1^{n-d}} \min_{v_{n-d+1}^{n}} \sum_{w_{n-d}^{n+d}} \tilde{R}_{d,0}(n; v_1^n, w_{n-d}^{n+d})
\]

Applying the approximation to $n-d-1, n-d-2, \ldots$ yields the recursive definition

\[
\tilde{v}_{n-d} := \operatorname*{argmin}_{v_{n-d} \in \mathcal{H}_{S,d}^{(n-d)}} \min_{v_{n-d+1}^{n} \in \mathcal{H}_{S,d}^{(n-d+1,n)}} \sum_{w_{n-d}^{n+d} \in \mathcal{S}_S^{(n-d,n+d)}} \tilde{R}_{d,0}(n; \tilde{v}_1^{n-d-1} v_{n-d}^{n}, w_{n-d}^{n+d}).
\]

And the following inequality is a direct result of the definition of the Levenshtein distance:

\[
\min_{v_1^S \in \mathcal{H}_{S,d}} \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d}) \le \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; \tilde{v}_1^S, w_{S-d}^{S+d})
\]
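The dynamic programming equations. Putting it all together, the following dynamic programming equations are derived, which efficiently compute an approximation of the Bayes risk and the corresponding hypothesis for the windowed Levenshtein distance as loss function:

\[
\tilde{P}_d(n; w_{n-d}^{n+d}) := p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) \tag{5.5}
\]

\[
\tilde{R}_{d,i}(n; v_{n-d}^{n}, w_{n-d}^{n+d}) := p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T) \min_{j \in [-d, i+1]} \left\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) + \sum_{w_{n-d-1} \in \mathcal{S}_S^{(n-d-1)}} \tilde{R}_{d,j}(n-1; \tilde{v}_{n-1-d}\, v_{n-d}^{n-1}, w_{n-1-d}^{n-1+d}) \right\} \tag{5.6}
\]

\[
\tilde{v}_{n-d} := \operatorname*{argmin}_{v_{n-d} \in \mathcal{H}_{S,d}^{(n-d)}} \min_{v_{n-d+1}^{n} \in \mathcal{H}_{S,d}^{(n-d+1,n)}} \sum_{w_{n-d}^{n+d} \in \mathcal{S}_S^{(n-d,n+d)}} \tilde{R}_{d,0}(n; v_{n-d}^{n}, w_{n-d}^{n+d}) \tag{5.7}
\]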
The equations describe a nested recursion: the approximated risk at position $n$ and the final word hypothesis at position $n - d$ are computed alternately. The hypothesis word at position $n - d$ is the leftmost hypothesis word considered in computing the approximated risk at position $n$. The probabilities $p(w_{n+d} \mid w_{n-d}^{n+d-1}, x_1^T)$ can be efficiently computed in a preprocessing step from the summation space lattice, taking into account the arc and path alignment given by the initializing CN. The equations are initialized in the following way, where $v_n = w_n := \epsilon$ holds for all $n \le 0$ and $n > S$:

\[
\begin{aligned}
\tilde{P}_d(-d; w_{-2d}^{0}) &:= 1 \\
\tilde{R}_{d,i}(-d; v_{-2d}^{-d}, w_{-2d}^{0}) &:= 0 \\
\tilde{v}_{-d} &:= \epsilon
\end{aligned}
\]
The interpretation of the initialization is that the probability of the sequence of empty words preceding the ultimate hypothesis $\tilde{v}_1^S$ equals one and the corresponding risk is zero. For computing the approximated Bayes risk hypothesis it is sufficient to look at the last hypothesis word $\tilde{v}_S$, because the recursion will produce the remaining $S - 1$ elements and will compute the approximate Bayes risk for the complete hypothesis. The approximate risk for a window of size $2d + 1$ equals

\[
\begin{aligned}
& \tilde{R}_{d,0}(S+d+1; v_{S+1}^{S+d+1}, w_{S+1}^{S+2d+1}) \\
&= p(w_{S+2d+1} \mid w_{S+1}^{S+2d}, x_1^T) \min_{j \in [-d, 1]} \left\{ \mathrm{Lev}(v_{S+d+1}, w_{S+d+1+j}^{S+d+1}) \sum_{w_S} \tilde{P}_d(S+d; w_{S}^{S+2d}) + \sum_{w_S} \tilde{R}_{d,j}(S+d; \tilde{v}_S\, v_{S+1}^{S+d}, w_{S}^{S+2d}) \right\} \\
&= \sum_{w_S} \tilde{R}_{d,0}(S+d; \tilde{v}_S\, v_{S+1}^{S+d}, w_{S}^{S+2d}).
\end{aligned}
\]

In the $(S+d+1)$-th computation of the risk only empty words are aligned, because $w_n = v_n = \epsilon$ for $n > S$. The risk computation reduces to a simple sum and the sum depends on the last hypothesis element $\tilde{v}_S$. This initializes the nested recursion and in the next step $\tilde{v}_S$ is computed as

\[
\begin{aligned}
\tilde{v}_S &= \operatorname*{argmin}_{v_S} \sum_{w_S^{S+2d}} \tilde{R}_{d,0}(S+d; v_S^{S+d}, w_S^{S+2d}) \\
&= \operatorname*{argmin}_{v_S} \sum_{w_S} \tilde{R}_{d,0}(S+d; v_S^{S+d}, w_S^{S+2d}).
\end{aligned}
\]
The result depends on the risk at position $(S + d)$ and thus the recursive computation of the approximate Bayes risk is initiated. The risk computation terminates with $\tilde{v}_{-d} = \epsilon$. The first $d$ calls in the unrolled recursion just fill the right half of the window, which is used to predict the current hypothesis word. Thus, the first $(d + 1)$ hypothesis words produced by the recursion equal the empty word, i.e. $\tilde{v}_{-d}^{0} = \epsilon$, and $\tilde{v}_1^S$ is the ultimate hypothesis.

Figure 5.3 visualizes the approach for different window sizes. For getting the word hypothesis at position $n$ the decoder considers the alignment between any partial word sequence $v_n^{n+d}$ from the hypothesis space and any partial word sequence $w_n^{n+2d}$ from the summation space, as shown in figure b). For a window of size one the alignment is unique as shown in figure a), i.e. the alignment is already determined. For a sufficiently large window the complete word sequences $v_1^S$ and $w_1^S$ are considered, see figure c).

The time and space complexity. The runtime and memory requirements of the algorithm depend on the window size $d$ and on the initial CN alignment with length $S$, from which the aligned hypothesis and summation space are derived. For a full search the exact runtime and memory consumption can be computed; slightly simplified, the recursion has the following time and space requirements (the underbraces indicate the quantity whose computation requires the time or memory, respectively):

\[
\text{time:} \quad \sum_{n=-d+1}^{S+d+1} \Big( \underbrace{\left| \mathcal{S}_S^{(n-d-1,n+d)} \right|}_{\tilde{P}_d(n; w_{n-d}^{n+d})} + \underbrace{(2d+1) \left| \mathcal{H}_{S,d}^{(n-d,n)} \right| \left| \mathcal{S}_S^{(n-d-1,n+d)} \right|}_{\tilde{R}_{d,\cdot}(n; v_{n-d}^{n}, w_{n-d}^{n+d})} + \underbrace{\left| \mathcal{H}_{S,d}^{(n-d,n)} \right| \left| \mathcal{S}_S^{(n-d,n+d)} \right|}_{\tilde{v}_{n-d}} \Big)
\]

\[
\text{space:} \quad \sum_{n=-d+1}^{S+d+1} \Big( \underbrace{\left| \mathcal{S}_S^{(n-d,n+d)} \right|}_{\tilde{P}_d(n; w_{n-d}^{n+d})} + \underbrace{\left| \mathcal{H}_{S,d}^{(n-d,n)} \right| \left| \mathcal{S}_S^{(n-d,n+d)} \right|}_{\tilde{R}_{d,\cdot}(n; v_{n-d}^{n}, w_{n-d}^{n+d})} \Big) + \underbrace{S}_{\tilde{v}_1^S}
\]

The space complexity can be reduced by holding only the information necessary for computing the quantities at the current position; the sum is then replaced by two times the maximum.
Figure 5.3. The figure visualizes the alignments performed in the Bayes risk decoder with the windowed Levenshtein distance as loss function. Figure a) shows the CN alignment case, where the window size is one and thus the alignment is unique. For a window size of $2d + 1$ the computation of the hypothesis word at position $n$ considers the alignment between $v_n^{n+d}$ and $w_n^{n+2d}$ as shown in b). For a sufficiently large window size, that is $\ge 2S - 1$, the alignment between $v_1^S$ and $w_1^S$ is computed, see c), which yields the exact Levenshtein distance.
A further estimate can be made using the fact that $|\mathcal{S}_S^{(n)}| \le |\Sigma|$ and $|\mathcal{H}_{S,d}^{(n)}| \le |\Sigma|$, where $\Sigma$ denotes the vocabulary:

\[
\text{time: } O\!\left( d (S + d) |\Sigma|^{3d+3} \right) \qquad \text{space: } O\!\left( (S + d) |\Sigma|^{3d+2} \right)
\]

Due to the way the algorithm works no tracebacks are needed; in each step the algorithm produces a word of the final output. But if the alignment of the final hypothesis is desired, then tracebacks have to be stored.

The approximations. The following inequalities summarize the approximations applied in the windowed Levenshtein distance decoder with a window size of $2d + 1$, starting from the exact Bayes risk with the exact Levenshtein distance as loss function:

\[
\begin{aligned}
x_1^T \rightarrow r :=&\ \min_{v_1^M} \sum_{w_1^N} p(w_1^N \mid x_1^T)\, \mathrm{Lev}(v_1^M, w_1^N) \\
=&\ \min_{v_1^S} \sum_{w_1^S} R(S, S; v_1^S, w_1^S) \\
\le&\ \min_{v_1^S} \sum_{w_1^S} R_{d,0}(S; v_1^S, w_1^S) \\
=&\ \min_{v_1^S} \sum_{w_1^S} \tilde{R}_{d,0}(S; v_1^S, w_1^S) \\
\le&\ \min_{v_1^S} \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d}) \\
\le&\ \tilde{R}_{d,0}(S+d+1; v_{S+1}^{S+d+1}, w_{S+1}^{S+2d+1}) =: r_d(x_1^T)
\end{aligned}
\]
The first inequality is due to the windowed Levenshtein distance. The second inequality follows from swapping summation and minimization in the risk computation. And the third inequality is due to only considering a limited future when finding the next hypothesis word.

The limits. The nice property of the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function is that for $d = 0$ it becomes the well-known CN decoding rule, and for $d \ge S - 1$ it equals the Bayes risk decoder with the exact Levenshtein distance as loss function. For a window of size one, i.e. $d = 0$, the resulting decoding rule is the CN decoding rule introduced in Section 3.4. In the notation used in this section the decoding rule becomes

\[
[v_1^S]_{\mathrm{CN}} = \left[ \operatorname*{argmax}_{v_n} p(v_n \mid x_1^T) \right]_{n=1}^{S} .
\]

The decoding of the hypothesis word at position $n$ is independent of the adjacent hypothesis words. Thus, for the proof it is sufficient to investigate the result of Equation (5.7) for any $n$:

\[
\begin{aligned}
\tilde{v}_n &= \operatorname*{argmin}_{v_n} \sum_{w_n} \tilde{R}_{0,0}(n; v_n, w_n) \\
&= \operatorname*{argmin}_{v_n} \sum_{w_n} p(w_n \mid x_1^T) \Big\{ \underbrace{\mathrm{Lev}(v_n, w_n)}_{= d(v_n, w_n)} \underbrace{\sum_{w_{n-1}} \tilde{P}_0(n-1; w_{n-1})}_{= \sum_{w_1^{n-1}} p(w_1^{n-1} \mid x_1^T) = 1} + \sum_{w_{n-1}} \tilde{R}_{0,0}(n-1; \tilde{v}_{n-1}, w_{n-1}) \Big\} \\
&= \operatorname*{argmin}_{v_n} \Big\{ 1 - p(v_n \mid x_1^T) + \sum_{w_{n-1}} \tilde{R}_{0,0}(n-1; \tilde{v}_{n-1}, w_{n-1}) \Big\} \\
&= \operatorname*{argmin}_{v_n} \left\{ 1 - p(v_n \mid x_1^T) \right\}
\end{aligned}
\]
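The CN decoding rule obtained for $d = 0$ amounts to an independent argmax per slot. A minimal Python sketch follows; the data structure (one dictionary of word posteriors per slot), the empty string as stand-in for $\epsilon$, and the removal of empty words from the output are assumptions made for the example.

```python
EPS = ""  # illustrative representation of the empty word


def cn_decode(confusion_network):
    """Pick the word with the highest posterior in every slot.

    `confusion_network` is a list of slots; each slot maps a word
    (or EPS) to its slot-wise posterior probability.
    """
    best = [max(slot, key=slot.get) for slot in confusion_network]
    return [word for word in best if word != EPS]  # drop empty words


# Hypothetical three-slot confusion network.
cn = [
    {"a": 0.6, EPS: 0.4},
    {"b": 0.3, EPS: 0.7},
    {"c": 1.0},
]
hypothesis = cn_decode(cn)
```

Minimizing $1 - p(v_n \mid x_1^T)$ and maximizing $p(v_n \mid x_1^T)$ select the same word, which is why the sketch uses `max` directly.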
[Figure 5.4: two plots. Left panel, titled “gamma=0.4, breakpoint=0.3”: warping function h(x) over x, curves “unwarped” and “warped”. Right panel, titled “s1.limsi/eval07en”: estimated confidence over true confidence, curves “zero bias”, “unwarped”, and “warped”.]

Figure 5.4. Confidence warping applied to the lattices for eval07en produced by the LIMSI English EPPS 2007 evaluation system.
For $d = 0$ the alignment is completely determined by the initial CN alignment, which allows the risk for $v_n$ to be computed independently of $w_1^S$. This greatly reduces the runtime and the space requirement of the CN decoding rule, which are given by:

\[
\text{time: } O(S |\Sigma|) \qquad \text{space: } O(S)
\]

From the construction of the windowed Levenshtein distance decoder it is obvious that for a sufficiently large window, i.e. a window spanning the whole initial alignment, the result equals the outcome of the exact Bayes risk decoder with the Levenshtein distance as loss function. In fact, choosing $d \ge S - 1$ is sufficient for avoiding any approximation in the Bayes risk computation. The proof is done by inserting the window size into the equations which eventually yield the dynamic programming equations. The proof itself is mathematically straightforward, but bulky. Here, only the outline is given: first it is proved that $\tilde{P}_{S-1}(n; w_{n-S+1}^{n+S-1})$ is not an approximation, but computes the correct posterior probability for $w_1^S$. The result is used in showing that $\tilde{R}_{S-1,0}(S; v_1^S, w_1^S)$ computes the correct risk, i.e. equals $R(S, S; v_1^S, w_1^S)$, from which it follows that the result equals the exact Bayes risk with the Levenshtein distance as loss function and thus $\tilde{v}_1^S$ is the Bayes risk hypothesis.
5.2.3 Results

In this section experimental results for CN and fCN combination with posterior probability warping and for approximate Bayes risk decoding with the windowed Levenshtein distance as loss function are presented and discussed. Experimental results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems can be found in Appendix B. For all experiments acoustic and language model scales and the system weights in the union-based combination and in CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7. For the experiments applying the warping function defined in Equation (5.4) the system-dependent γs are included in the optimization.

The first set of experiments investigates the impact of the slot-wise posterior probability warping on the performance of the fCN and CN combination. In the fCN combination the min.hypnFE decoder defined in Section 4.2.1 is applied, which relies solely on frame-wise word posterior probabilities. In the union approach to lattice combination, the combined frame-wise word posteriors are computed as the weighted average of the system-dependent frame-wise posteriors, cf. Equation (3.17). The warping function is applied to the system-dependent frame-wise posteriors before computing the sum. The system-dependent γ in the warping function defined in Equation (5.4) is initialized for each system separately by maximizing on the tuning set the NCE value for the system-dependent Viterbi result. The confidence scores used for computing the NCE are derived from the frame-wise word posteriors according to [Wessel & Schlüter+ 2001a]. For the CN combination (CNC) the slot-wise word posterior probabilities from the system-dependent CNs are warped before feeding the CNs into the CNC algorithm. Again, the γ parameter in the warping function is initialized for each system separately by maximizing the NCE on the tuning set; the slot-wise word posterior probability is used directly as confidence score.

Figure 5.4 shows the resulting warping function for the LIMSI English EPPS 2007 evaluation system. The green line in the left plot shows the warping function with the γ parameter optimized for maximum NCE on the Viterbi path. The right graph shows the ideal confidence scores in red, the unwarped confidence scores in green, and the warped scores in blue.

In a contrast experiment the confidence scores for the unwarped system combination result are warped in a postprocessing step. The warping is applied to the frame-wise or slot-wise combined word posterior probabilities of the combination and decoding output, and the single γ is optimized for maximum NCE. The objective of the experiment is twofold: first, it shows how the NCE value of confidence scores based on frame-wise or slot-wise posterior probabilities can be improved in a simple postprocessing step. Second, the comparison with the system-dependent warping, where the γs are optimized for minimum error rate, indicates whether minimum error rate and maximum NCE go together.

Table 5.3. Combination results with system-dependent frame-wise and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. The warping is optimized for minimum character error rate. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

CER[%], given as (del/ins) err
System comb. | Warped | dev07¹ | eval07 | dev08
baseline | | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28
Frame Error Decoder
s1+s2 | no | (3.07/1.30) 13.57 | (4.69/0.68) 13.95 | (3.05/0.70) 12.54
s1+s2 | yes | (2.76/1.46) 13.56 | (4.46/0.82) 13.97 | (2.83/0.80) 12.48
s1+s2+s3 | no | (3.06/1.23) 13.18 | (4.72/0.69) 13.71 | (3.01/0.72) 12.22
s1+s2+s3 | yes | (3.04/1.25) 13.15 | (4.75/0.69) 13.69 | (3.01/0.70) 12.15
CNC Error Decoder
s1+s2 | no | (2.93/1.34) 13.56 | (4.66/0.76) 13.99 | (2.93/0.74) 12.50
s1+s2 | yes | (2.96/1.33) 13.55 | (4.65/0.74) 13.99 | (3.01/0.74) 12.59
s1+s2+s3 | no | (2.87/1.29) 13.17 | (4.68/0.70) 13.70 | (2.92/0.72) 12.21
s1+s2+s3 | yes | (2.92/1.25) 13.12 | (4.68/0.68) 13.69 | (2.99/0.72) 12.19
¹ tuning set

Table 5.4. Normalized cross entropy (NCE) results with frame-wise and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1.

NCE
System | Warping/Objective | dev07¹ | eval07 | dev08
Frame Error Decoder
s1+s2 | unwarped | 0.310 | 0.346 | 0.338
s1+s2 | system-dep./min. CER | 0.342 | 0.372 | 0.366
s1+s2 | system-indep./max. NCE | 0.348 | 0.375 | 0.376
s1+s2+s3 | unwarped | 0.320 | 0.340 | 0.342
s1+s2+s3 | system-dep./min. CER | 0.338 | 0.353 | 0.358
s1+s2+s3 | system-indep./max. NCE | 0.343 | 0.358 | 0.368
CNC Error Decoder
s1+s2 | unwarped | 0.307 | 0.347 | 0.333
s1+s2 | system-dep./min. CER | 0.334 | 0.376 | 0.368
s1+s2 | system-indep./max. NCE | 0.344 | 0.375 | 0.370
s1+s2+s3 | unwarped | 0.335 | 0.362 | 0.354
s1+s2+s3 | system-dep./min. CER | 0.338 | 0.364 | 0.366
s1+s2+s3 | system-indep./max. NCE | 0.355 | 0.377 | 0.378
¹ tuning set

Table 5.5. Combination results with system-dependent frame-wise and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The warping is optimized for minimum word error rate. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

WER[%], given as (del/ins) err
System comb. | Warped | eval06¹ | eval07
baseline | | (1.64/1.38) 8.16 | (1.74/1.23) 9.13
Frame Error Decoder
LIMSI+RWTH | no | (1.60/0.85) 6.65 | (1.99/0.76) 7.73
LIMSI+RWTH | yes | (1.53/0.82) 6.43 | (1.90/0.76) 7.54
LIMSI+RWTH+UKA | no | (1.80/0.72) 6.48 | (2.21/0.68) 7.52
LIMSI+RWTH+UKA | yes | (1.66/0.79) 6.46 | (1.92/0.76) 7.24
LIMSI+RWTH+UKA+IRST | no | (1.70/0.79) 6.52 | (1.93/0.76) 7.26
LIMSI+RWTH+UKA+IRST | yes | (1.53/0.83) 6.37 | (1.82/0.78) 7.07
CNC Error Decoder
LIMSI+RWTH | no | (1.45/0.80) 6.38 | (1.88/0.75) 7.51
LIMSI+RWTH | yes | (1.47/0.77) 6.33 | (1.95/0.69) 7.47
LIMSI+RWTH+UKA | no | (1.47/0.72) 6.27 | (1.87/0.68) 7.24
LIMSI+RWTH+UKA | yes | (1.46/0.72) 6.16 | (1.87/0.68) 7.32
LIMSI+RWTH+UKA+IRST | no | (1.45/0.71) 6.14 | (1.87/0.69) 7.12
LIMSI+RWTH+UKA+IRST | yes | (1.52/0.66) 6.11 | (2.00/0.59) 7.01
¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign

The error rates for the experiments with the Chinese system are shown in Table 5.3 and the NCE values in Table 5.4. Keep in mind that the unwarped system and the system with system-independently warped confidence scores have the same error rate, because the warping is applied after combination and decoding. For the Chinese system almost no improvement in CER is observed. The result is not surprising, as all three Chinese systems use the same decoder to produce the lattices. Thus, it can be expected that the bias in the lattice-derived posterior probabilities is the same for all lattice sets. The NCE value is increased by the system-dependent posterior warping by 5 to 10% relative over the unwarped baseline. The gain comes from the system-dependent optimization of the system-dependent γ for maximum NCE. In the subsequent combined optimization of all γs for minimum CER almost no changes in the γs are observed.
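For reference, the normalized cross entropy reported in the NCE tables can be computed as sketched below. The sketch assumes the definition commonly used in NIST evaluations, which is not restated in this section; the function name and the toy data are hypothetical, and the code requires confidences strictly between 0 and 1 and at least one correct and one incorrect word.

```python
import math


def normalized_cross_entropy(confidences, correct):
    """NCE of confidence scores given per-word correctness flags.

    Assumed definition: (H_max - H_conf) / H_max, where H_max is the
    entropy obtained when every word receives the average correctness
    as its confidence.
    """
    n_total = len(correct)
    n_correct = sum(correct)
    p_c = n_correct / n_total
    h_max = -(n_correct * math.log2(p_c)
              + (n_total - n_correct) * math.log2(1 - p_c))
    h_conf = -sum(math.log2(c if ok else 1 - c)
                  for c, ok in zip(confidences, correct))
    return (h_max - h_conf) / h_max


# Hypothetical toy data: three correct words and one error.
correct = [True, True, True, False]
nce_uninformative = normalized_cross_entropy([0.75] * 4, correct)
nce_informative = normalized_cross_entropy([0.9, 0.9, 0.9, 0.1], correct)
```

Under this definition, confidences equal to the average correctness give an NCE of zero, and confidences that separate correct from incorrect words give a positive value.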
Table 5.6. Normalized cross entropy (NCE) results with frame-wise and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2.

NCE
System | Warping/Objective | eval06¹ | eval07
Frame Error Decoder
LIMSI+RWTH | unwarped | 0.309 | 0.371
LIMSI+RWTH | system-dep./min. WER | 0.291 | 0.361
LIMSI+RWTH | system-indep./max. NCE | 0.318 | 0.378
LIMSI+RWTH+UKA | unwarped | 0.310 | 0.367
LIMSI+RWTH+UKA | system-dep./min. WER | 0.247 | 0.293
LIMSI+RWTH+UKA | system-indep./max. NCE | 0.322 | 0.384
LIMSI+RWTH+UKA+IRST | unwarped | 0.320 | 0.375
LIMSI+RWTH+UKA+IRST | system-dep./min. WER | 0.303 | 0.341
LIMSI+RWTH+UKA+IRST | system-indep./max. NCE | 0.343 | 0.401
CNC Error Decoder
LIMSI+RWTH | unwarped | 0.323 | 0.387
LIMSI+RWTH | system-dep./min. WER | 0.317 | 0.371
LIMSI+RWTH | system-indep./max. NCE | 0.332 | 0.394
LIMSI+RWTH+UKA | unwarped | 0.342 | 0.388
LIMSI+RWTH+UKA | system-dep./min. WER | 0.315 | 0.372
LIMSI+RWTH+UKA | system-indep./max. NCE | 0.356 | 0.405
LIMSI+RWTH+UKA+IRST | unwarped | 0.331 | 0.382
LIMSI+RWTH+UKA+IRST | system-dep./min. WER | 0.316 | 0.358
LIMSI+RWTH+UKA+IRST | system-indep./max. NCE | 0.344 | 0.402
¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
The final warping has virtually no impact on the decoding result. The gain in NCE from putting the warping into the postprocessing step and tuning it for maximum NCE is a little higher for an almost identical error rate.

The results for the English cross-site combination are summarized in Table 5.5 and in Table 5.6. For the frame-wise and slot-wise posterior probability warping a small decrease in error rate is observed. The improvements are larger for the frame error decoder, which on the other hand starts from a higher baseline. In contrast to the Chinese system, the NCE values decrease for the system-dependent posterior warping if optimized for minimum WER. On the other hand, for the post-decoding warping the NCE values increase slightly. The observation is consistent with the considerations in Section 3.7.1, where it is shown that the objective of the parameter optimization is to find a good classifier and not to find a good approximation of the true posteriors.

In the second set of experiments the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function is investigated. The windowed Levenshtein distance is initialized with the CN alignment derived from the arc-cluster CN construction algorithm described in Section 4.4.2. Table 5.7 and Table 5.8 summarize the results for the Chinese and the English task. The results are similar: increasing the window size does not help, and the error rates are even slightly worse for larger windows. An investigation of the resulting alignments did not give a conclusive explanation for the disappointing results. Notably, in the alignments for the English task erroneous alignments of short words appear frequently for windows larger than one. It seems that the windowed Levenshtein distance fails especially for clusters of short words, and that the word boundaries considered in the common Levenshtein distance approximations are a valuable hint for correctly aligning these words. However, before drawing any conclusions further investigations are needed, which are beyond the scope of this work.
5.3 Summary
In this chapter several applications based on confusion networks (CNs) have been presented. Confusion networks derived from word lattices have a simple structure: they can be regarded as a sequence of slots, where each slot defines a posterior probability distribution over the decoding vocabulary. In framewise defined CNs (fCNs) a slot represents a time frame and the articulation of a word is distributed among slots. In word-level CNs the articulation of a word is assigned to a single slot.
The first application uses the fCN to compute the time alignment for a word sequence. The method is similar to the common time alignment algorithm using an acoustic model. The difference is that the framewise scores are not computed by an acoustic model but are provided by the fCN. The algorithm is of particular interest for lattice-based combination and decoding experiments, where the decoder does not provide word boundaries. In this case, the fCN derived from the union of the system-dependent lattices can be used for computing new word boundaries. In particular, the union approach avoids out-of-vocabulary problems in the time alignment for (cross-site) system combination results.
In the second application entropy-based methods are used to combine several system-dependent fCNs. Entropy-based combination methods have been successfully applied in combining several feature streams in noisy environments. In this work the approach is integrated into the hypnFE decoder, cf. Section 4.2.1, which relies solely on framewise word posteriors. The standard combination consisting of the weighted average of the system-dependent framewise word posteriors is replaced by the entropy-based methods. However, in the experimental tests the entropy-based methods cannot beat the standard approach. The results presented in [Misra & Bourlard+ 2003] suggest that the method is most beneficial in the presence of noise, whereas all experiments conducted in this work use clean speech.
The third application aims at warping frame- or slotwise word posterior probabilities for optimal error rate. The motivation is twofold. First, by warping the posterior distributions the probability estimates achieve a better approximation of the true posteriors, which theoretically helps in Bayes risk decoding. Second, lattice-based posteriors are observed to have a system-specific bias; the posterior warping is a means for making the posteriors comparable among systems, especially in cross-site system combinations. The experimental results show a small benefit for the cross-site system combination, but no improvement for an intra-site combination. The confidence scores are directly derived from the warped frame- or slotwise word posteriors. An evaluation of the normalized cross entropy (NCE) shows that for all systems the posterior warping can increase the NCE if tuned for maximum NCE. However,
Table 5.7. Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the Chinese 230h testing system, cf. Section B.1.1. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
                          CER[%] (del/ins) err
System      Window size   dev07¹               eval07               dev08
baseline    -             (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
s1          1             (2.83/1.44) 14.33    (4.54/0.81) 14.91    (2.97/0.78) 13.15
            3             (2.71/1.49) 14.33    (4.48/0.86) 14.98    (2.84/0.85) 13.24
            5             (2.78/1.44) 14.35    (4.56/0.85) 14.95    (2.99/0.78) 13.22
s1+s2+s3    1             (2.87/1.25) 13.12    (4.69/0.69) 13.70    (2.94/0.77) 12.27
            3             (2.64/1.41) 13.27    (4.48/0.81) 13.80    (2.74/0.85) 12.34
            5             (2.75/1.33) 13.20    (4.56/0.72) 13.71    (2.80/0.74) 12.38
¹ tuning set
Table 5.8. Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
                                     WER[%] (del/ins) err
System                 Window size   eval06¹             eval07
baseline               -             (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                  1             (1.60/1.30) 8.01    (1.72/1.16) 8.94
                       3             (1.58/1.33) 8.02    (1.73/1.18) 8.99
                       5             (1.57/1.31) 8.01    (1.71/1.18) 8.98
LIMSI+RWTH+UKA+IRST    1             (1.61/0.69) 6.29    (2.07/0.60) 7.18
                       3             (1.42/0.82) 6.33    (1.80/0.76) 7.22
                       5             (1.43/0.80) 6.34    (1.81/0.75) 7.20
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign
in the cross-site combination experiments with the warping optimized for minimum error rate the NCE values decrease.
In the last section a windowed Levenshtein decoder is developed within the Bayes risk framework. The resulting decoder draws the connection between CN decoding and Bayes risk decoding with the exact Levenshtein distance as loss function: the windowed Levenshtein decoder is initialized with a CN alignment. The result for a window of size one equals that of the CN decoder, and for a sufficiently large window the Bayes risk decoder with the exact Levenshtein distance as loss function is obtained. Dynamic programming equations for the windowed Levenshtein decoder are given, which compute in polynomial time an approximation of the Bayes risk with the windowed Levenshtein distance as loss function. However, experimental results show no improvements for the windowed Levenshtein decoder with a symmetric window of size three or five over the standard CN decoder, i.e. over a window of size one.
Chapter 6 Classifier based System Combination
In the Bayes risk decoding framework presented in Chapter 3 two assumptions have been made: the probabilities derived from lattices are trustworthy and the local cost functions are good approximations of the Levenshtein distance. Section 5.2.1 discusses why the probabilities are not always reliable and not necessarily comparable among systems. The biases and drawbacks of the approximation of the Levenshtein distance by local cost functions are described in Chapter 4. That is, in practice, neither assumption is fulfilled.
The motivation for using classifiers in system combination follows directly from the above considerations: neither the lattice-based posterior probabilities nor the cost approximation should be trusted blindly. Instead, all available information is fed into a classifier. In the best case the classifier learns the underlying patterns, like the systematic bias of a cost approximation or the bias of a system-dependent posterior probability under certain conditions. Eventually, the classifier shall separate reliable from unreliable information and decide on the ultimate output of the system combination. The approach to classifier based system combination described in this section was introduced in [Hillard & Hoffmeister+ 2007] and further developed in [Hoffmeister & Schlüter+ 2008].
6.1 Combination with Classification
Confusion network combination (CNC) and also ROVER work in two steps. In the first step the system-dependent inputs (CNs or 1-bests) are aligned to a super CN. The second step consists of decoding the super CN, which is done in CNC and ROVER by a simple, slotwise decision rule. CNs, CN decoding, CNC, and ROVER have been discussed before in Chapter 3. Under the assumptions that the posterior probability estimates derived from the lattice are the true probabilities and that for each pair of paths in the lattice the CN alignment equals the Levenshtein alignment, the simple decision rule is optimal, i.e. the rule yields the hypothesis of the Bayes risk decoder with the Levenshtein distance as loss function. However, in practice, neither assumption is fulfilled. Posterior probabilities derived from a lattice usually show a bias due to model assumptions, beam pruning in the search, and subsequent lattice pruning. Even in the case that all Levenshtein alignments between all paths in the lattice can be expressed as a CN, the common, heuristic CN construction algorithms, like the algorithms presented in Section 4.4, usually do not find the optimal alignment.
In this work an approach is described which aims at compensating for inaccuracies in the probabilities and the alignment by using a classifier. The main idea is to take advantage of the super CN constructed in the first step of the CNC or ROVER algorithm. The classifier makes a slotwise decision on the super CN, where the decision is based on the slotwise word posteriors and other slotwise features derived from the system-dependent and also from the combined lattices. As pointed out in Section 5.2.2, the CN alignment usually deviates from the Levenshtein alignment in only one or two positions. This observation motivates the inclusion of context information for the current slot into the classification process.
The classifier works on a symmetric window of slots centered on the slot in question. The features from all slots in the window are concatenated and used by the classifier to predict the output for the current slot. The features are described in detail in Section 6.1.1. The context is brought in by augmenting the feature vector of the current slot by the features of the two adjacent slots. In the training phase the classifier can learn the systematic bias of the lattice-based probability estimates and the bias in the CNC or ROVER alignment. In particular, the classifier can learn a probability warping similar to the explicit model given in Section 5.2.1. The consideration of the context is akin to the windowed Levenshtein distance introduced in Section 5.2.2 with a window size of three. The particular classifiers used in this work are discussed in Section 6.1.2.
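The construction of the context window can be illustrated by a minimal sketch. The function name, the list-based data layout, and the padding of positions beyond the first and last slot are assumptions made for illustration; the text does not state how the boundaries of a CN are handled.

```python
def windowed_features(slot_features, index, radius=1, pad_value=0.0):
    """Concatenate the feature vectors of the slots in a symmetric window.

    slot_features: list of equally long feature lists, one per CN slot.
    index:         position of the slot to be classified.
    radius:        number of context slots on each side (1 gives a window of three).
    pad_value:     filler for positions outside the CN (an assumption).
    """
    dim = len(slot_features[0])
    vector = []
    for i in range(index - radius, index + radius + 1):
        if 0 <= i < len(slot_features):
            vector.extend(slot_features[i])
        else:
            vector.extend([pad_value] * dim)
    return vector


slots = [[0.9, 1.0], [0.4, 2.0], [0.7, 3.0]]
middle = windowed_features(slots, 1)  # all three slots lie inside the CN
first = windowed_features(slots, 0)   # the left neighbour is padded
```

With a radius of one the vector always has three times the per-slot dimension, so the classifier sees an input of fixed size regardless of the slot position.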
From the basic idea three approaches to combination with classification are derived. They are distinguished according to the alignment they are based on. The iROVER approach presented in Section 6.1.3 is based on the ROVER alignment, the iCNC approach in Section 6.1.4 is based on the CNC alignment, and the iCN approach in Section 6.1.5 directly uses the super CN derived from the CN combination. The last section presents and discusses the results. The "i" in the i-approaches refers to "improved" or "intelligent".
6.1.1 Features
The first feature for each word hypothesis is the information which systems have hypothesized the word. This information is crucial for the subsequent decoding, because in the iROVER and iCNC approaches the classifier’s prediction per slot is not the concrete word, but the system which produced, according to the classifier’s belief, the correct word. The other features are divided into three categories: the word features, the posterior features, and the decoder features.
The first category comprises the word features, which are computed on word level and do not necessarily require a lattice, i.e. they can be computed from any 1-best decoding result. The category consists of the acoustic and the language model score, the word duration, the number of characters, and the averaged character duration, which serves as an approximation of the average phoneme duration. The features are produced for each system separately. Furthermore, the information is added whether the word is in the list of the 10, 20, or 100 words causing the most errors on a tuning set. In the ROVER based approaches only the Viterbi results are aligned and the scores and time stamps of a word are unambiguous. In contrast, in a CN many word lattice arcs are collapsed into a single slot entry. In this case the averaged time stamps and scores are used, where the average is weighted according to the lattice arc posteriors. In [Hillard & Hoffmeister+ 2007] the word identity was added as a feature, but further experiments indicated that the feature is not helpful and rather caused overfitting on some setups. Results presented in this work do not use this feature.
The second category of features includes all features derived from lattice posterior probabilities and is referred to as posterior features. The features include the system-dependent CN confidence score and the entropy of the slotwise word posterior probability distribution.
If a CNC is available, the CNC confidence score and slot entropy are added. Furthermore, confidence scores based on framewise posterior probabilities, cf. [Wessel 2002], are included, which are computed across all systems as well as from the combined framewise posterior probabilities. The cross-system confidence score assigned by system A to a word hypothesis from system B is defined as follows: the confidence score is computed according to [Wessel 2002], where the required framewise word posterior probabilities are derived from the lattice provided by system A. This allows system A to give a confidence estimate for the hypothesis of system B. A classifier can use the cross-system confidence scores as an indicator for out-of-vocabulary (OOV) words.
The third and last feature category consists of the decisions of the standard approaches to system combination. The ROVER, CNC, and the min.hypnFE combination and decoding results are computed. For each word and each of these decoders the information is included whether the word would have been chosen by the decoder. ROVER alignment based experiments do not use CNC based features, because the CNC is superior to ROVER and thus, if a CNC is computed, the combination is based on the CNC alignment. The ROVER, CNC, and min.hypnFE decoders have been introduced in Section 3.4 and in Section 4.2.1.
The final feature vector consists at least of the word features of the current and the adjacent slots. The vector is augmented by the minimum distance in seconds to the adjacent slots. Depending on the setup, the features from the other two categories are added.
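Two of the features described above lend themselves to a short sketch: the entropy of a slotwise posterior distribution and the posterior-weighted average used when several lattice arcs collapse into one slot entry. The function names, the dictionary layout, and the use of the natural logarithm are assumptions; the text does not fix the base of the logarithm.

```python
import math


def slot_entropy(posteriors):
    """Entropy of a slotwise word posterior distribution (word -> probability)."""
    return -sum(p * math.log(p) for p in posteriors.values() if p > 0.0)


def weighted_average(values, arc_posteriors):
    """Average of per-arc values (e.g. start times) weighted by arc posteriors."""
    total = sum(arc_posteriors)
    return sum(v * p for v, p in zip(values, arc_posteriors)) / total


certain = slot_entropy({"house": 1.0})
ambiguous = slot_entropy({"house": 0.5, "mouse": 0.5})
start_time = weighted_average([1.0, 2.0], [0.25, 0.75])
```

A slot with a single certain word has zero entropy, whereas two equally likely words give the maximum for a binary choice, so the feature signals how contested a slot is.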
6.1.2 Classifiers and Training
The classifiers applied are Boostexter (BT) [Schapire & Singer 2000], random forests (RF) [Breiman 2001], and a log-linear model trained in the maximum entropy framework (Maxent) [Keysers & Och+ 2002]. For each slot in the provided CN the classifier makes an independent decision; context is included in the feature vector as explained in the previous section. The classifiers are trained on the CNs produced on the training set, where the reference transcription is matched to the CN via an oracle alignment. The result of the oracle alignment is used to assign a reference word to each slot. The ultimate slot labels for classifier training are either the systems which predicted the correct word for the slot or the rank of the reference word within the slot. The pure oracle error between CN and reference is computed by using the local cost defined in Equation (6.1), where $p_s(\cdot \mid x_1^T)$ denotes the slotwise word posterior distribution for slot number $s$, and the reference is denoted by $\tilde{w}_1^S$:
\[
c(\tilde{w}_s) := \begin{cases} 0, & \text{if } p_s(\tilde{w}_s \mid x_1^T) > 0 \\ 1, & \text{otherwise} \end{cases} \tag{6.1}
\]
The resulting alignment is not optimal for slot labeling, because it disregards the rank of the reference word in the slot. Especially in ambiguous alignments the reference word can be aligned to a slot with a low rank for the reference word instead of being aligned to the adjacent slot, where the reference word has a high rank. However, it is not clear what the optimal alignment for the classifier training is. Intuitively, for the classifier training the alignment shall assign the reference word to a slot where it has a high rank, and at the same time the alignment should minimize the oracle error rate. The alignment derived from minimizing the expected reference error exhibits (but does not guarantee) these properties and gives good results in practice. The corresponding local cost for computing the alignment is given by
\[
c(\tilde{w}_s) := 1 - p_s(\tilde{w}_s \mid x_1^T). \tag{6.2}
\]
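The difference between the two local costs can be made concrete with a small sketch. The function names, the dictionary representation of a slot, and the toy posteriors are hypothetical; only the two cost definitions are taken from Equations (6.1) and (6.2).

```python
def oracle_cost(slot, ref_word):
    """Local cost of Equation (6.1): 0 if the reference word occurs in the slot."""
    return 0.0 if slot.get(ref_word, 0.0) > 0.0 else 1.0


def expected_error_cost(slot, ref_word):
    """Local cost of Equation (6.2): one minus the posterior of the reference word."""
    return 1.0 - slot.get(ref_word, 0.0)


# Two adjacent slots that both contain the reference word "to" (toy values).
slot_a = {"the": 0.95, "to": 0.05}
slot_b = {"to": 0.90, "<eps>": 0.10}

oracle_costs = (oracle_cost(slot_a, "to"), oracle_cost(slot_b, "to"))
expected_costs = (expected_error_cost(slot_a, "to"), expected_error_cost(slot_b, "to"))
```

In this example the oracle cost cannot distinguish the two slots, while the expected-error cost clearly prefers the slot in which the reference word has the high rank, which is the behavior motivated in the text.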
Because the alignment is not well defined, there is no guarantee that the resulting labeling is optimal for training. Some training labels have to be considered wrong and the resulting training set noisy.
First experiments are done using Boostexter, a simple classifier which shows good performance on a wide range of tasks. The idea of BT is to learn a series of weak classifiers (decision stumps) and to reweight the training examples using AdaBoost; for the experiments presented in this work, real AdaBoost.MH with logistic loss is used, cf. [Schapire & Singer 2001].
The second classifier is the random forest, which is related to the Boostexter approach. In an RF the weak classifier is a full decision tree, and randomization is applied instead of boosting. The RF implementation used in this work is the Randomized C4.5 as suggested in [Dietterich 2000b]. Randomization is preferable to boosting in particular in the presence of noisy training data: boosting starts to focus on the incorrectly labeled and thus hard to classify examples, and randomization is a simple approach to avoid this bias. RFs have been successfully applied to several tasks, e.g. to a CN based confidence annotation task in [Xue & Zhao 2006].
An alternative to the two decision tree based approaches is the log-linear model. The model parameters are estimated using the Maxent Toolkit described in [Keysers & Och+ 2002].
The next three sections investigate different approaches for applying the classifiers to the system combination problem. In two of the setups the classifier predicts the system which is believed to produce the correct output. In the classifier training this setup causes multi-labels, because for each slot more than one system can be correct. BT training can handle multi-labels directly, unlike the RF implementation and the Maxent toolkit. Multi-label classification problems can be reduced to a single-label problem, cf. [Tsoumakas & Katakis 2007].
In preliminary tests two approaches were tested for the RF and the Maxent classifier. The first approach is to build a new label set which consists of one label for each combination of the original labels which occurs in the training set. In the preliminary experiments this approach worked best for Maxent. During the classification process the Maxent model assigns a probability to each label. From these probabilities the ultimate probabilities for the original labels are derived by splitting the new labels into the original ones and summing up the probabilities for each of the original labels. The investigated alternative tackles the multi-label problem by performing a one-vs-all classification. For each label a binary classifier is built. In classification all classifiers are applied and the final result is taken from the highest scoring classifier. This approach worked best for random forests, where the score is simply the number of trees within an RF which vote for the label in question.
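Both reductions can be sketched in a few lines. The function names, the representation of a combined label as a frozenset of system names, and the example numbers are assumptions made for illustration; they are not taken from the toolkits used in this work.

```python
from collections import defaultdict


def marginalize_label_sets(label_set_probs):
    """Split combined labels into the original ones and sum their probabilities.

    label_set_probs: dict mapping a frozenset of original labels to the
                     probability assigned to that combined label.
    """
    marginals = defaultdict(float)
    for label_set, prob in label_set_probs.items():
        for label in label_set:
            marginals[label] += prob
    return dict(marginals)


def one_vs_all_decision(scores):
    """Return the label whose binary classifier scores highest (e.g. tree votes)."""
    return max(scores, key=scores.get)


combined = {
    frozenset({"LIMSI"}): 0.5,
    frozenset({"LIMSI", "RWTH"}): 0.3,
    frozenset({"RWTH"}): 0.2,
}
marginals = marginalize_label_sets(combined)
winner = one_vs_all_decision({"LIMSI": 61, "RWTH": 48})
```

Note that the marginal probabilities of the original labels need not sum to one, since a combined label contributes to every system it contains.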
6.1.3 The iROVER Approach
In the iROVER approach the Viterbi results of the systems are aligned and the standard decoding rule is replaced by a classifier. A similar approach is investigated in [Zhang & Rudnicky 2006], where the authors apply a neural network to a set of basic features, but observe only a small improvement.
In the combination of J systems the Viterbi results are aligned with the ROVER tool. In addition, the min.hypnFE decoding result is added as the (J+1)-th system. The output of the classifier is one of the J+1 systems and the final output of iROVER is the word hypothesis of the predicted system. In the result of the ROVER alignment each slot contains exactly one word from each system, where the word can be the empty word. The feature vector for a slot is simply the concatenation of the features of the J+1 words in the slot. Thus, a feature vector of fixed size is constructed, which the classifier maps to the J+1 classes.
6.1.4 The iCNC Approach
Two approaches for improving CNC decoding by classification are investigated. The first approach is referred to as iCNC and directly follows the iROVER approach. A super CN is computed from the system-dependent CNs by performing the alignment step of the CNC method. For each slot the best hypothesis from each of the J systems is selected, i.e. the word which maximizes $p_{j,s}(\cdot \mid x_1^T)$ is the word hypothesis provided by the j-th system for the s-th slot. In addition, the CNC and the min.hypnFE hypotheses are added as the (J+1)-th and (J+2)-th system. Notably, the ROVER result does not have to be added explicitly, as it is contained in the J system-dependent words; a binary flag indicates for each word whether it equals the ROVER result. The classifier is then applied in the same way as in the iROVER approach.
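The candidate selection for a single slot of the super CN can be sketched as follows. The function name and the data layout (one posterior dictionary per system) are assumptions; the CNC, min.hypnFE, and ROVER words are taken as given inputs here.

```python
def icnc_candidates(slot, cnc_word, fe_word, rover_word):
    """Build the J+2 candidate words of one slot together with their ROVER flags.

    slot: list with one dict per system, mapping word -> slotwise posterior.
    """
    # Best word of each of the J systems for this slot.
    candidates = [max(posteriors, key=posteriors.get) for posteriors in slot]
    # CNC and min.hypnFE results act as the (J+1)-th and (J+2)-th system.
    candidates += [cnc_word, fe_word]
    rover_flags = [word == rover_word for word in candidates]
    return candidates, rover_flags


slot = [{"sat": 0.6, "sad": 0.4}, {"sad": 0.7, "sat": 0.3}]
words, flags = icnc_candidates(slot, cnc_word="sat", fe_word="sad", rover_word="sat")
```

The classifier then chooses one of the J+2 positions, and the word at that position is the output for the slot.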
6.1.5 The iCN Approach
The iCN approach uses the CNC in a different way, following the approach applied in [Mangu & Padmanabhan 2001] to a single CN. The decision is made slotwise among the N-best word hypotheses of the slot, where the hypotheses are ranked according to the averaged word posterior probability as in CNC decoding, cf. Section 3.4.1. Choosing N = 2 already gives an oracle error rate lower than the corresponding ROVER oracle error rate, i.e. in theory the iCN approach can compensate for more errors than iROVER or iCNC. For each word in the N-best list the feature vectors from all systems are concatenated. Each word is further tagged with whether or not it is the min.hypnFE or ROVER choice; the CNC choice is always the word with rank one. For N = 2 the construction results in a feature vector of fixed size and a binary classification problem.
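The ranking step can be sketched as below. The function name, the uniform averaging over systems, and the alphabetical tie-breaking are assumptions for illustration; system weights, if any, are omitted.

```python
def nbest_for_slot(slot, n=2):
    """Return the n words of a slot with the highest averaged posterior.

    slot: list with one dict per system, mapping word -> slotwise posterior;
          a word missing in a system counts with posterior zero.
    """
    words = set().union(*slot)
    averaged = {w: sum(p.get(w, 0.0) for p in slot) / len(slot) for w in words}
    # Sort by descending averaged posterior; ties are broken alphabetically.
    ranked = sorted(averaged, key=lambda w: (-averaged[w], w))
    return ranked[:n]


slot = [{"sat": 0.6, "sad": 0.3, "set": 0.1}, {"sad": 0.7, "sat": 0.3}]
two_best = nbest_for_slot(slot)
```

The first entry of the returned list corresponds to the CNC choice, so for N = 2 the classifier only decides whether to keep it or to switch to the runner-up.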
6.2 Experiments
Experiments are performed on the four lattice sets from the English EPPS 2007 evaluation cross-site combination setup. The corpus and the lattices are described in Appendix B. The baseline results for the single systems and system combinations with CNC, ROVER, and min.hypnFE decoding are summarized in Table 6.1. All results presented in this chapter are produced on the evaluation set, the TCStar/EPPS 2007 eval07 set.
6.2.1 Experimental Setup
The classifiers are trained on the development set (eval06) of the TCStar/EPPS 2007 Evaluation, which also serves as tuning set for any further parameter optimization. A larger training set is not available, as in the TCStar project only lattices for the eval06 and eval07 sets were produced and exchanged. Due to the limited training data a 10-fold cross-validation is applied for tuning the parameters of the classifiers. With the optimized parameter set the final classifier is trained on the complete data.
Table 6.2 summarizes the statistics for the classifiers and the corpora. The number of samples is the number of slots in which not all systems agree on the same word, i.e. where a nontrivial classification problem exists. These samples make up the effective training set for the classifiers. The number of features is the dimensionality of the ultimate feature vector fed into the classifier. The iROVER+FE classifier refers to the setup where the min.hypnFE decoding result is added as (J+1)-th system, whereas iROVER only combines the Viterbi results from the J systems.
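The selection of the effective training samples and a 10-fold split can be sketched as follows. Both function names are hypothetical, and the interleaved assignment of samples to folds is an assumption; the text does not specify how the folds were formed.

```python
def nontrivial_slots(aligned_slots):
    """Indices of slots in which not all systems agree on the same word.

    aligned_slots: list of slots, each a list with one word per system.
    """
    return [i for i, words in enumerate(aligned_slots) if len(set(words)) > 1]


def k_fold_indices(n_samples, k=10):
    """Yield (train, held_out) index lists for a k-fold cross-validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


aligned = [["the", "the"], ["cat", "cap"], ["<eps>", "a"], ["sat", "sat"]]
samples = nontrivial_slots(aligned)
splits = list(k_fold_indices(25))
```

Each sample is held out exactly once, so the tuning criterion averaged over the ten folds uses the complete development set without training and testing on the same slot.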
Table 6.1. Baseline results for eval07. ROVER results come with confidence score based voting and with majority voting. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.
                       WER[%] (del/ins) err
System                 Viterbi / ROVER                        CN(C)                min.hypnFE (comb.)
LIMSI                  (1.91/1.21) 9.38                       (2.00/1.07) 9.00     (1.86/1.18) 9.00
RWTH                   (1.93/1.26) 9.76                       (2.23/1.08) 9.52     (2.42/0.98) 9.62
UKA                    (2.12/1.26) 10.22                      (2.07/1.25) 10.09    (2.22/1.20) 10.14
IRST                   (2.41/1.18) 9.79                       (2.40/1.19) 9.82     (2.45/1.16) 9.80
LIMSI+RWTH             (2.71/0.53) 8.10 / (1.91/1.19) 9.38    (1.89/0.68) 7.42     (1.81/0.85) 7.62
LIMSI+RWTH+UKA         (2.30/0.70) 7.83 / (2.05/0.82) 7.95    (1.91/0.61) 7.09     (1.72/0.84) 7.38
LIMSI+RWTH+UKA+IRST    (2.02/0.67) 7.43 / (2.44/0.54) 7.52    (1.94/0.60) 7.05     (1.61/0.88) 7.17
Table 6.2. Corpora statistics for the training/tuning set (eval06) and the evaluation set (eval07).
                                                  #samples
System                 Comb.        #features     eval06    eval07
LIMSI+RWTH             iROVER       75            3,032     3,215
                       iROVER+FE    108           3,115     3,301
                       iCNC         71            647       659
                       iCN(N=2)     99            28,900    26,961
LIMSI+RWTH+UKA         iROVER       111           4,237     4,386
                       iROVER+FE    147           4,207     4,416
                       iCNC         94            1,709     1,801
                       iCN(N=2)     126           32,624    30,069
LIMSI+RWTH+UKA+IRST    iROVER       149           5,320     5,178
                       iROVER+FE    188           5,346     5,207
                       iCNC         166           3,696     3,354
                       iCN(N=2)     157           33,252    30,504
Table 6.3. CN oracle error rates for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.
             Oracle WER[%] (del/ins) err
Comb.        2 systems           3 systems           4 systems
iROVER       (1.38/0.49) 5.39    (1.22/0.42) 4.44    (1.10/0.34) 3.82
iROVER+FE    (1.30/0.41) 5.06    (1.12/0.33) 4.21    (0.97/0.24) 3.56
iCNC         (1.71/0.57) 6.59    (1.52/0.41) 5.13    (1.29/0.27) 3.70
iCN(N=2)     (0.87/0.33) 3.56    (0.86/0.29) 3.41
For Boostexter and Maxent the number of training iterations is optimized for each task separately. Random forests proved not to be sensitive to parameter tuning: eventually C4.5 is used with default parameters and 100 trees for all experiments.
6.2.2 Results
The first set of experiments explores the potential of the proposed approach. The ROVER or CNC alignment is performed for the evaluation set and the reference is aligned according to Equation (6.2). From the resulting CN the oracle error rate is computed. The oracle error is defined as the error of the optimal classifier: an error is counted only if the reference word is not present in the slot. Table 6.3 shows the oracle error rates for the four investigated setups. Comparing the table to the baseline results in Table 6.1 shows that the classifier based approaches have a huge potential for improving the error rate. The largest gap is observed for the iCN approach, where already the combination of two systems halves the baseline error rate.
In the next set of experiments the iROVER approach is investigated in detail. In particular, the importance of the different feature categories is explored. The results are summarized in Table 6.4. Using iROVER with only the simple word features already improves considerably over standard ROVER. Adding the posterior features boosts iROVER to the level of the min.hypnFE combination. Results with the Maxent classifier are only produced for the combination of two and three systems. The Maxent toolkit applies the Generalized Iterative Scaling (GIS) algorithm, which causes extremely long runtimes, e.g. 100K iterations for the iROVER/2-systems task and 1M iterations for the iROVER/3-systems task, without giving an advantage over BT and RF. Consequently, no further Maxent classifiers are trained.
In the remaining experiments the features of all three categories are combined. Table 6.5 shows the results for a direct comparison of the four i-approaches using BT and RF as classifier. iROVER+FE goes beyond iROVER and can overtake the min.hypnFE combination, but fails to improve over the CNC baseline. iCNC performs best and can slightly improve over standard CNC.
The iCN approach disappoints: it cannot improve clearly over standard CNC and is beaten by iCNC on the four-system combination task, even though it shows the lowest oracle error rate. The analysis of the unsatisfactory performance is the subject of the next section. Boostexter and random forests are mostly on the same level, with some advantages for RF. Especially for the hard iCN task the RF classifier seems to be more robust.
The results are rather sobering: the improvements over the standard approaches are present, but small. In particular, the CNC baseline is only slightly beaten by one of the classifier based approaches, the iCNC.
6.2.3 Analysis
The analysis of the classifier based combination methods is based on the error detection and correction statistics for the different approaches. Error detection is defined as the ability of the classifier to detect that the hypothesis chosen by the corresponding standard combination approach is not correct. The error correction statistics tell whether the classifier is able to replace a detected erroneous hypothesis by the correct word. The formal definitions of precision and recall for detecting and correcting wrong word
Table 6.4. iROVER combination results for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%.
                               WER[%] (del/ins) err
iROVER                         2 systems           3 systems           4 systems
word features
  Boostexter                   (2.14/0.86) 7.89    (2.10/0.75) 7.60    (2.17/0.69) 7.56
  Random forests               (2.14/0.88) 7.88    (2.15/0.72) 7.57    (2.02/0.74) 7.37
  Maxent                       (1.98/0.96) 7.92    (1.98/0.83) 7.78    -
word and posterior features
  Boostexter                   (2.04/0.81) 7.61    (2.07/0.77) 7.40    (2.11/0.73) 7.25
  Random forests               (2.08/0.83) 7.68    (2.13/0.70) 7.43    (2.07/0.70) 7.19
  Maxent                       (2.07/0.86) 7.77    (2.06/0.78) 7.60    -
Table 6.5. Combination results with Boostexter (BT) and random forests (RF) as classifier for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%.
                           WER[%] (del/ins) err
Comb.        Classifier    2 systems           3 systems           4 systems
iROVER       BT            (2.04/0.81) 7.61    (2.07/0.77) 7.40    (2.11/0.73) 7.25
             RF            (2.08/0.83) 7.68    (2.13/0.70) 7.43    (2.07/0.70) 7.19
iROVER+FE    BT            (1.87/0.79) 7.57    (1.89/0.71) 7.24    (1.76/0.75) 7.00
             RF            (1.91/0.78) 7.49    (1.88/0.69) 7.31    (1.90/0.62) 6.97
iCNC         BT            (1.86/0.69) 7.39    (1.90/0.62) 7.07    (1.97/0.58) 6.90
             RF            (1.88/0.69) 7.41    (1.96/0.63) 7.08    (1.99/0.60) 6.93
iCN(N=2)     BT            (1.93/0.71) 7.46    (1.88/0.68) 7.20    (1.91/0.72) 7.15
             RF            (1.90/0.71) 7.37    (1.87/0.63) 7.05    (1.94/0.65) 7.01
Table 6.6. Error detection and correction results for eval07 for four systems and with a random forest as classifier.
             Error Detection                        Error Correction
Comb.        recall              prec.              recall              prec.
iROVER       0.22 (357/1,658)    0.70 (357/514)     0.16 (269/1,658)    0.52 (269/514)
iROVER+FE    0.16 (267/1,629)    0.72 (267/369)     0.12 (189/1,629)    0.51 (189/369)
iCNC         0.14 (153/1,086)    0.71 (153/215)     0.10 (104/1,086)    0.48 (104/215)
iCN(N=2)     0.08 (153/2,018)    0.63 (153/244)     0.06 (113/2,018)    0.46 (113/244)
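hypotheses are given by:
\[
\begin{aligned}
\mathrm{rec}_{\mathrm{detect}}(S) &:= \frac{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}} \wedge w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}}\}}{\sum_{s \in S} \mathbf{1}\{w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}}\}} \\
\mathrm{prec}_{\mathrm{detect}}(S) &:= \frac{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}} \wedge w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}}\}}{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}}\}} \\
\mathrm{rec}_{\mathrm{correct}}(S) &:= \frac{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}} \wedge w_s = w_{s,\mathrm{ref}}\}}{\sum_{s \in S} \mathbf{1}\{w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}}\}} \\
\mathrm{prec}_{\mathrm{correct}}(S) &:= \frac{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}} \wedge w_s = w_{s,\mathrm{ref}}\}}{\sum_{s \in S} \mathbf{1}\{w_s \neq w_{s,\mathrm{base}}\}}
\end{aligned}
\tag{6.3}
\]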
The sequence of slots in the CN is denoted by $S$, the reference word for slot $s$ by $w_{s,\mathrm{ref}}$, the baseline hypothesis by $w_{s,\mathrm{base}}$, and the classifier hypothesis by $w_s$. The baseline result depends on the investigated combination approach: for iROVER it is the ROVER result, for iROVER+FE it is the min.hypnFE result, and for iCNC and iCN it is the CNC result.
Table 6.6 gives the performance obtained with RF as classifier applied to the four-systems task; the results for the other setups show the same tendencies. For iROVER, iROVER+FE, and iCNC the precision remains almost constant, whereas the recall decreases. This suggests that the iROVER approaches mostly compensate for errors which are already wiped out by standard CNC. Comparing iCNC and iCN shows that the absolute number of recovered and corrected errors is almost equal for both approaches, but iCN produces many more false positives. Thus, for the tested classifiers and features it helps to apply the ROVER constraint, i.e. to restrict the choice to hypotheses which occurred as best hypothesis for at least one system. On the other hand, the results indicate that the iCN approach suffers from having more choices, which implies that either the feature set or the modeling is still insufficient.
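The four statistics of Equation (6.3) can be computed from three slot-aligned word sequences, as in the following sketch. The function name, the returned dictionary keys, and the convention of returning zero for an empty denominator are assumptions; the toy sequences are invented for illustration.

```python
def detection_correction_stats(hyp, base, ref):
    """Precision and recall for error detection and correction, cf. Equation (6.3).

    hyp, base, ref: equally long lists holding, per slot, the classifier
                    hypothesis, the baseline hypothesis, and the reference word.
    """
    triples = list(zip(hyp, base, ref))
    changed = sum(h != b for h, b, _ in triples)
    base_errors = sum(b != r for _, b, r in triples)
    detected = sum(h != b and b != r for h, b, r in triples)
    corrected = sum(h != b and h == r for h, b, r in triples)

    def ratio(num, den):
        return num / den if den else 0.0  # convention for empty denominators

    return {
        "rec_detect": ratio(detected, base_errors),
        "prec_detect": ratio(detected, changed),
        "rec_correct": ratio(corrected, base_errors),
        "prec_correct": ratio(corrected, changed),
    }


ref = ["a", "b", "c", "d", "e"]
base = ["a", "x", "c", "y", "z"]   # three baseline errors
hyp = ["a", "b", "q", "w", "z"]    # three changes, one of them a correction
stats = detection_correction_stats(hyp, base, ref)
```

Applied to the counts reported in Table 6.6, the same ratios give, for instance, 357/1,658 ≈ 0.22 for the detection recall of iROVER.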
6.3 Summary

In this chapter classifier-based system combination has been introduced. The core idea in classifier-based system combination is that lattice-based posterior estimates and the common approximations of the Levenshtein distance have systematic biases which the classifier can learn and compensate for.
Based on ROVER and confusion network combination (CNC), three different approaches to classifier-based system combination have been developed. The common idea of the approaches is to apply the classifier to the super CN derived from the ROVER or CNC alignment. The CN can be decoded slot-wise and thus the decoding is reduced to a local classification problem. For each slot and each word in the slot a variety of features is computed. The features range from the simple word duration to sophisticated features based on the posterior probabilities derived from the system-dependent lattices. Context information is brought into the classification process by combining the feature vectors of the current and the adjacent slots.

In the iROVER approach the system-dependent Viterbi results are aligned with the ROVER tool. In the decoding step the classifier predicts for each slot which system hypothesized the correct word. In this approach the number of target classes equals the number of systems and is therefore small and fixed. The results of alternative combination methods can be added to iROVER by simply including their output as an additional system.

The iCNC approach works similarly to the iROVER approach. The system-dependent CNs are aligned to a super CN and for each slot and system only the word with the highest system-dependent posterior probability is kept. The CNC result is added as an additional system and decoding is performed according to the iROVER approach.

In the iCN approach, for each slot in the super CN derived from the CNC alignment only the N words with the highest posterior probabilities are kept. The classifier predicts for each slot the rank of the correct word. For N=2 the approach reduces to a binary decision problem.

For all approaches three classifiers are tested: Boostexter (BT), random forests (RF), and a log-linear model (Maxent). In the experimental results the iROVER and iCNC approaches slightly improve over the corresponding baseline methods. Overall, RF performs slightly better than BT and both are superior to Maxent. The best results are achieved with iCNC and RF as classifier, beating the standard CNC by 0.2% absolute on a four-system cross-site combination task.
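Two of the steps summarized above, keeping the N best words per slot (iCN) and adding context by combining the feature vectors of adjacent slots, can be sketched as follows. This is a simplified illustration: it assumes one feature vector per slot (the thesis computes features per slot and per word), zero padding at the sentence boundaries, and hypothetical function names.

```python
def top_n_words(slot_posteriors, n=2):
    """Keep the n words with the highest posterior in one CN slot.

    slot_posteriors: dict mapping word -> posterior probability.
    Returns a list of (word, posterior) pairs ordered by rank.
    """
    ranked = sorted(slot_posteriors.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


def stack_context(features, pad_value=0.0):
    """Concatenate each slot's feature vector with its left and right neighbour.

    features: list of equally long feature vectors, one per slot.
    Boundary slots are padded with pad_value (an assumption of this sketch).
    """
    dim = len(features[0])
    pad = [pad_value] * dim
    padded = [pad] + features + [pad]
    return [padded[i - 1] + padded[i] + padded[i + 1]
            for i in range(1, len(features) + 1)]


kept = top_n_words({"the": 0.6, "a": 0.3, "<eps>": 0.1}, n=2)
stacked = stack_context([[1.0], [2.0], [3.0]])
```

With N=2 the classifier only has to decide between the two kept words of each slot, which corresponds to the binary decision problem mentioned in the text.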
Chapter 7 Log-Linear Model Combination vs. System Combination

The standard log-linear model used in modern speech recognition systems combines the acoustic model and the language model with model-dependent scaling factors. If the combination is used in Viterbi decoding only, no normalization is required and a single scaling factor is sufficient: the language model scale. Equation (7.1) shows the model with LM scale $\beta$, where the normalization term $Z$ guarantees a probability distribution over all sentences:

\[
p_\beta(w_1^N \mid x_1^T) := Z^{-1} \prod_{n=1}^{N} p(w_n \mid x_{t_{n-1}+1}^{t_n}) \, p(w_n \mid w_{n-L}^{n-1})^{\beta}
\tag{7.1}
\]
The model in Equation (7.1) is a special case of the general log-linear model used in speech recognition, which is defined as

\[
p_\lambda(w_1^N \mid x_1^T) := Z^{-1} \exp\left( \sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i f_i(n; w_1^N, x_1^T) \right)
\tag{7.2}
\]

The feature functions $f_i(n; w_1^N, x_1^T)$ are in the simplest case the negated log probabilities provided by the acoustic and the language model. In practice, the feature functions used in LVCSR, like the negated logarithm of the HMM-based acoustic model or the L-gram language model, depend only on the local context given position $n$. Therefore, the model can be compactly stored as a word lattice.

In the log-linear model combination more knowledge sources are combined into a single log-linear model, usually several acoustic models. In theory, all knowledge sources can be used jointly to produce lattices with I-dimensional scores, where $I$ is the number of knowledge sources. The lattice is represented as a WFST over the log or tropical vector semiring; the connection between transducers over the vector semirings and the log-linear model is discussed in Chapter 3. However, the usage of many knowledge sources during the search is expensive in terms of memory and runtime. Instead, lattices are usually built with a single acoustic and a single language model. Using an appropriate semiring, cf. Section 3.1, the intersection of the lattices from several decoders results in a log-linear combination of the system-dependent knowledge sources. In practice, instead of the intersection the conceptually similar rescoring is used: lattices are produced with a single acoustic and a single language model and are subsequently rescored with the additional models.

In the discriminative model combination (DMC) the lattices are used to optimize the model-dependent scaling factors for minimum error rate [Beyerlein 2000; Vergyri 2000; Zolnay 2006]. The models in the combination are usually trained independently and the task of the scaling factors in the log-linear model combination is to capture the dependencies between the several models. In order to better describe the interaction between the knowledge sources, several scaling factors per model can be used.
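As an illustration of Equation (7.2), the following sketch computes sentence posteriors from sentence-level feature sums. It makes two simplifying assumptions that are not part of the thesis setup: the normalization runs over a short list of competing hypotheses instead of all sentences in a lattice, and the features are plain log probabilities, so that scales (1, β) recover Equation (7.1). All names and numbers are illustrative.

```python
import math


def loglinear_posteriors(feature_vectors, scales):
    """Posterior of each hypothesis under a log-linear model.

    feature_vectors[k][i]: feature i summed over all word positions of
                           hypothesis k (here: log probabilities).
    scales[i]:             scaling factor lambda_i of knowledge source i.
    """
    scores = [sum(l * f for l, f in zip(scales, fv)) for fv in feature_vectors]
    m = max(scores)                       # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    z = sum(exp_scores)                   # normalization term Z over the hypothesis list
    return [e / z for e in exp_scores]


# Two hypotheses with (acoustic log prob, language model log prob).
hyps = [(-10.0, -3.0), (-9.0, -5.0)]
beta = 2.0                                # language model scale of Eq. (7.1)
post = loglinear_posteriors(hyps, scales=(1.0, beta))
```

Raising the language model scale shifts probability mass toward the hypothesis with the better language model score; with the values above the scores are -16 and -19, so the first hypothesis receives a posterior of 1/(1+e^{-3}).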
In the following section the log-linear model is extended by word- and pronunciation-dependent scaling factors. The scaling factors are optimized for minimum error rate using MRT as described in Section 3.7.2, which effectively amounts to DMC with word- and pronunciation-dependent scaling factors. The concrete setup of the scaling factor training is discussed in Section 7.2.1. The approach to word-dependent scaling factors investigated in this section follows [Hoffmeister & Liang+ 2009]. Word- or word-class-dependent scaling factors were used before in [Huang & Belin+ 1993; Sarukkai & Ballard 1996]. In the first paper a joint training of the acoustic model, the language model, and the scaling factors is performed. In the latter work word-class-dependent scaling factors are used among other techniques in an adaptation step. Neither paper investigates the improvement coming solely from the word-dependent scaling factors. Another approach is applied in [Vergyri & Tsakalidis+ 2000], where an improvement of around 3% relative is reported for a DMC experiment by using scaling factors which depend on classes derived from several acoustic features.
A comparison of a log-linear model combination with model-dependent scaling factors and ROVER-based system combination is performed in [Zolnay 2006], where ROVER outperformed DMC. In Section 7.2.2 the log-linear model combination with and without word-dependent scaling factors and with CN decoding is compared to the CN decoding of the union-based lattice combination approach described in Section 3.2.3.
7.1 Log-Linear Model Combination with Word-Dependent Scaling Factors

In this work an extended form of the log-linear model as defined in Equation (7.2) is used, where the scaling factors are made word-dependent. It consists of a set of word-level feature functions $f_i(w_n; w_1^N, x_1^T)$ and a corresponding set of word-dependent scaling factors $\lambda_i(w_n)$:

\[
p_\lambda(w_1^N \mid x_1^T) := \frac{\exp\left( \sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i(w_n) f_i(w_n; w_1^N, x_1^T) \right)}{\sum_{v_1^M} \exp\left( \sum_{m=1}^{M} \sum_{i=1}^{I} \lambda_i(v_m) f_i(v_m; v_1^M, x_1^T) \right)}
\tag{7.3}
\]
In the following it is assumed that for each word its pronunciation is known. That is, a word $w_n$ is considered to be a tuple of the orthography $\mathrm{orth}(w_n)$ of the word and the pronunciation $\mathrm{pron}(w_n)$. The feature functions used are the logarithms of several acoustic models $p(\mathrm{pron}(w_n) \mid x_{t_{n-1}+1}^{t_n})$, of the pronunciation model $p(\mathrm{pron}(w_n) \mid \mathrm{orth}(w_n))$, of the L-gram language model $p(\mathrm{orth}(w_n) \mid \mathrm{orth}(w_{n-L}^{n-1}))$, and a word penalty. Going from a single scaling factor per model to word-dependent scaling factors is motivated by the following observations, which give reason to assume a word- and pronunciation-dependent interaction between the models.

• Varying discriminative power of the acoustic model: the discriminative power of an acoustic model is usually unsteady across phones and thus across pronunciations.
• Varying discriminative power among different acoustic models: different acoustic front-ends differ in their ability to discriminate among phones.
• Several modeling and training issues of the acoustic model, e.g. the severe independence assumptions and the presumably underestimated variances of the GMMs.

Furthermore, due to the word-dependent scaling factors the training of the model in Equation (7.3) estimates word-dependent pronunciation scores and the word penalty in a discriminative manner.
7.2 Experiments

Experiments are conducted on the Chinese 230h testing system described in detail in Appendix B. In addition to the 230h speech data for acoustic model training, a separate 120h corpus is created for the estimation of the word-dependent scaling factors. The two training sets do not overlap and have the same ratio between broadcast news and broadcast conversation data. The three acoustic models used in the experiments are based on the MFCC, PLP, and Gammatone filter (GT) front-ends.
7.2.1 Experimental Setup

The log-linear model combination of the three acoustic models with word-dependent scaling factors is applied in a lattice rescoring step. Lattices are produced with the MFCC system and are subsequently rescored arc-wise with fixed word boundaries. The language model scores are taken from the LM used in the decoding pass; a further language model rescoring of the lattices was omitted. For experiments on
Table 7.1. Training, tuning (dev07), and test sets. The word-dependent scaling factors are trained on the 120h “λ-training” set. For the first test set no word-segmented transcripts are available.

| Corpus | Duration | Running Words | Running Char.s | Vocabulary Words | Vocabulary Char.s |
|---|---|---|---|---|---|
| AM-training | ∼230h | 2.4M | 4.0M | 42.1K | 5.3K |
| λ-training | ∼120h | 1.3M | 2.2M | 33.7K | 4.4K |
| heldout | 1.5h | 12.7K | 21.5K | 4.4K | 1.8K |
| dev07¹ | 2.5h | 27.5K | 46.8K | 5.3K | 1.9K |
| eval07 | 1.6h | | 28.1K | | 1.7K |
| dev08 | 1h | 10.5K | 18.2K | 2.9K | 1.4K |

¹ tuning set
Table 7.2. Lattice rescoring results with various acoustic models. The lattice sets are generated with the MFCC model and subsequently rescored with the PLP and the Gammatone (GT) acoustic model, respectively, where the character boundaries are kept fixed. The acoustic models were estimated on the 230h AM training set. Entries are [%CER]: (del/ins) err.

| Acoustic Model | dev07¹ | eval07 | dev08 | heldout |
|---|---|---|---|---|
| MFCC | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44 | (2.02/1.19) 10.82 |
| PLP | (2.66/1.72) 15.19 | (4.44/1.07) 15.41 | (2.78/0.88) 13.90 | (2.22/1.12) 10.82 |
| GT | (2.71/1.65) 15.62 | (4.55/1.05) 16.15 | (2.76/0.93) 14.11 | (2.11/1.14) 10.74 |

¹ tuning set
character or syllable level the word arcs are first split into character arcs using the time information from an arc-wise forced alignment with the MFCC model. The lattices for the 120h scaling factor training set, for the development set, and for the test set are produced with identical setups. Unfortunately, the language model training data includes both training sets, which results in a much lower perplexity on the 120h scaling factor training set than on the development and test sets. In order to get an idea of how much performance is lost due to the discrepancy between the training and the evaluation setting, an additional heldout set is created by removing every hundredth segment from the 120h training set. Table 7.1 summarizes the corpus statistics.

Viterbi decoding results of the rescoring of the MFCC lattices with the three acoustic models are summarized in Table 7.2. The language model scale is optimized separately for each acoustic model. The MFCC-based model gives the lowest error rate on dev07 and dev08 and is within 0.1% absolute of the best model on eval07 and on the heldout set; it will be referred to as baseline in the remainder of this chapter.

The 120h training set is not sufficient to reliably estimate a scaling factor for each word. In order to get a robust estimation, only words which occur more often than a cutoff $N_{\min}$ get their own scale. The scaling factors for all other words are tied by a backing-off scale, where the backing-off scaling factor depends on the number of phonemes in the pronunciation of the word:

\[
\lambda_i(w) :=
\begin{cases}
\lambda_{i,w}, & \text{if } \#w > N_{\min} \\
\lambda_{i,|\mathrm{pron}(w)|}, & \text{otherwise}
\end{cases}
\tag{7.4}
\]

For experiments on character level only a single backing-off class is used. In order to get an idea of how important the lexical information is, an alternative set of scaling factors is built, where character-dependent scaling factors are tied among equal pronunciations, i.e. syllable classes are built. Table 7.3 shows the number of scaling factors per model for different cutoffs.
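The tying rule of Equation (7.4) can be sketched as a lookup that maps a word either to its own class or to a back-off class indexed by pronunciation length. The data layout (a `Counter` of training-set word counts, pronunciations as phoneme lists, one scale table per model) and all names and values are assumptions made for this illustration.

```python
from collections import Counter


def scale_class(word, pron, counts, n_min):
    """Return the key of the scaling-factor class used for a word (Eq. 7.4).

    word:   orthography of the word
    pron:   pronunciation as a list of phonemes
    counts: word counts on the scaling-factor training set
    n_min:  cutoff; only words seen more than n_min times get their own scale
    """
    if counts[word] > n_min:
        return ("word", word)
    return ("backoff", len(pron))        # tied by number of phonemes


def lookup_scale(scale_table, default, word, pron, counts, n_min):
    """Fetch lambda_i(w) for one model i from its table of trained scales."""
    return scale_table.get(scale_class(word, pron, counts, n_min), default)


counts = Counter({"nihao": 25, "shijie": 3})          # toy counts
acoustic_scales = {("word", "nihao"): 0.031, ("backoff", 5): 0.027}  # toy values

own = scale_class("nihao", ["n", "i", "h", "a", "o"], counts, n_min=20)
tied = scale_class("shijie", ["sh", "i", "j", "i", "e"], counts, n_min=20)
unseen = scale_class("xin", ["x", "i", "n"], counts, n_min=20)
lam = lookup_scale(acoustic_scales, 0.03, "shijie", ["sh", "i", "j", "i", "e"], counts, 20)
```

Words that never occur in the training set fall into a back-off class as well, since their count is zero.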
The vocabulary size is 60K and the table shows that even for 7K word-dependent scaling factors (∼10% vocabulary coverage) a high coverage of 90% of the running words in the development set is achieved. For character- and syllable-dependent scaling factors the coverage is almost complete. For most experiments five models are combined: the three acoustic models, the pronunciation model, and the language model. The interdependency between the several models is sufficiently described by
Table 7.3. Statistics for word-dependent scaling factors on dev07: number of word-dependent scaling factors and coverage of running words for a given cutoff $N_{\min}$.

| $N_{\min}$ | #classes | Running Words [%] |
|---|---|---|
| 200 | 997 | 67 |
| 50 | 3,596 | 83 |
| 20 | 6,904 | 90 |
| 10 | 10,911 | 93 |
| 5 | 16,665 | 96 |
putting the word-dependent scaling factors on four of the five models. Following the considerations from Section 7.1, the word-dependent scaling factors are put on the acoustic models and the pronunciation model (and on the word penalty, if used). For parameter estimation the minimum risk training (MRT) described in Section 3.7.2 is applied. The objective function is either the smoothed phoneme error (MPE training) or word error (MWE training) applied on character level. The estimation is done iteratively using Rprop, a gradient-descent algorithm [Riedmiller & Braun 1993]. The implementation of the MPE objective function follows directly [Povey & Woodland 2002]. The objective function applied in MWE training is the confusion network (CN) error computed on character level. The CNs are built from the training set lattices using the arc-cluster CN construction algorithm described in Section 4.4.2. Regularization turns out to be important, similar to the I-smoothing used in [Povey & Woodland 2002] for GHMM training. The objective function for MRT is defined as follows, where $L(\cdot, \cdot)$ denotes the loss function and $[x_{r,1}^{T_r}, \tilde{w}_{r,1}^{\tilde{N}_r}]_{r=1}^{R}$ the training samples:

\[
F(\lambda) := \frac{1}{R} \sum_{r=1}^{R} \sum_{w_1^N} p_\lambda(w_1^N \mid x_{r,1}^{T_r}) \, L(w_1^N, \tilde{w}_{r,1}^{\tilde{N}_r}) + \frac{C}{2} \left\| \lambda - \lambda^{(0)} \right\|_2^2
\tag{7.5}
\]
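The structure of Equation (7.5), an expected loss under the model posterior plus an L2 penalty that pulls the scales toward their initial values, can be sketched as below. The sketch replaces the sum over all lattice paths by a sum over a small hypothesis list per utterance and uses model-dependent (not word-dependent) scales; both simplifications, as well as the function and variable names, are assumptions of this example and not the thesis implementation.

```python
import math


def mrt_objective(samples, lam, lam0, c):
    """Regularized expected loss in the form of Eq. (7.5).

    samples: list of utterances; each utterance is a list of
             (feature_vector, loss) pairs, one per competing hypothesis,
             where loss is L(hypothesis, reference).
    lam:     current scaling factors
    lam0:    initial scaling factors lambda^(0)
    c:       regularization constant C
    """
    risk = 0.0
    for hyps in samples:
        scores = [sum(l * f for l, f in zip(lam, fv)) for fv, _ in hyps]
        m = max(scores)                                  # stabilize the exponentials
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        # Expected loss of this utterance under the posterior p_lambda.
        risk += sum(w / z * loss for w, (_, loss) in zip(weights, hyps))
    reg = 0.5 * c * sum((a - b) ** 2 for a, b in zip(lam, lam0))
    return risk / len(samples) + reg


# One utterance, two hypotheses with identical scores and losses 0 and 2.
toy = [[([0.0], 0.0), ([0.0], 2.0)]]
f_init = mrt_objective(toy, lam=[1.0], lam0=[1.0], c=0.1)
f_moved = mrt_objective(toy, lam=[2.0], lam0=[1.0], c=0.1)
```

In the toy case both hypotheses are equally likely, so the expected loss is 1.0; moving the scale away from its initial value without changing the posterior only adds the penalty 0.5 · 0.1 · 1² = 0.05.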
The initial set of scaling factors $\lambda^{(0)}$ is made up of the model-dependent scales derived from a direct error rate minimization on the development set. Thus, the initial LM scaling factors are around one and the acoustic model scaling factors are close to the inverse language model scale (as used in Viterbi decoding, where the acoustic model scale is fixed to one) divided by the number of acoustic models. The scaling factors are optimized until the objective function converges, and the scaling factors from the last training iteration are taken for decoding. The regularization constant $C$ is optimized on the development set for minimum error rate, which is expensive and is therefore not done in fine-grained steps. For lattice decoding the CN decoder with the arc-cluster CN construction algorithm described in Section 4.4.2 is used, which is consistent with the optimization criterion used for character-level MWE training.

In a final set of experiments the log-linear model combination is compared to the modified lattice union approach described in Section 3.2.3, which derives the combined sentence posterior probability as the weighted average of the system-dependent sentence posteriors. For a fair comparison of the log-linear model combination and the union-based system combination it is necessary to use equivalent word lattices. In the experiments, a system is simply defined as the log-linear combination of the language model, the pronunciation model, and a single acoustic model. The system-dependent sentence posteriors are computed from the rescored lattices by setting the scaling factor for all but one acoustic front-end to zero. That is, the lattices and lattice arc scores, i.e. the features, are the same for all experiments. The three sentence posterior distributions are then combined according to Equation (3.14) and the CN decoder is applied. The system weights and model scales are optimized for minimum error rate on the development set.
As pointed out in Section 3.7.2, scaling factor optimization via MRT is meaningless for the union-based system combination. Therefore, all experiments with word-dependent scaling factors use the λs which are optimized for the log-linear model combination.
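The difference between the two combination schemes compared in this setup can be sketched on a shared hypothesis list. The log-linear combination normalizes once over the jointly scaled scores, whereas the union-based scheme first derives one posterior distribution per system (all but one acoustic scale set to zero) and then averages these distributions with system weights. The exact form of Equation (3.14) is not reproduced in this chapter, so the weighted average below follows only the verbal description; hypothesis lists stand in for lattices, and all names and numbers are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]


def loglinear_combination(feats, scales):
    """One log-linear model over all knowledge sources."""
    return softmax([sum(l * f for l, f in zip(scales, fv)) for fv in feats])


def averaged_posteriors(feats, system_scales, weights):
    """Weighted average of system-dependent sentence posteriors."""
    per_system = [loglinear_combination(feats, s) for s in system_scales]
    return [sum(w * p[k] for w, p in zip(weights, per_system))
            for k in range(len(feats))]


# Features per hypothesis: log scores of three acoustic models and the LM.
feats = [[-10.0, -12.0, -11.0, -3.0],
         [-11.0, -10.0, -12.0, -4.0]]

joint = loglinear_combination(feats, scales=[1 / 3, 1 / 3, 1 / 3, 1.0])
# A "system" keeps exactly one acoustic model; the others get scale zero.
systems = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
avg = averaged_posteriors(feats, systems, weights=[1 / 3, 1 / 3, 1 / 3])
```

Both results are proper distributions over the same hypotheses, but they generally differ: the log-linear model multiplies the (scaled) model scores, while the average mixes the normalized system posteriors.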
7.2.2 Results

In the first set of experiments the different tying strategies for the scaling factors are investigated. Table 7.4 assembles the results for word-, character-, and syllable-dependent scaling factors for different cutoff values. The number of classes refers to the number of scaling factors per model; throughout, the language model gets only a single scaling factor. The baseline is the setup using a single scale per model, which is equivalent to the common DMC approach. The best improvement is achieved with 7K word-dependent scaling factors, but the difference among the cutoff values is tiny and, especially for 3K and more scaling factors, it might even disappear with a more fine-grained optimization of the regularization constant. The relative improvement in character error rate (CER) is around 2%, and a little better for the heldout set, where a relative improvement of 3% is observed. On the training set the error rate of the Viterbi decoding is measured and even here the gain is at most around 4% relative. In preliminary experiments with an additional word penalty no further improvements were observed: error rates changed only in the second decimal place.

Figure 7.1 shows detailed results for the training and evaluation of the 7K word-dependent scaling factors, the best performing setup. The left plot shows that the objective function (smoothed phoneme accuracy) improves smoothly and the CER on the training, heldout, and development set decreases smoothly. The right plot shows again the development set together with the two test sets. Both the Viterbi and the CN results are plotted. The plots for the other setups look rather similar.

The results with character-dependent scaling factors are similar to the word-dependent results, where on eval07 and on the heldout set the improvements are a little smaller. The differences in the word- and character-level baselines are due to fixing the boundaries of the character arcs with the MFCC model. When rescoring with the PLP and GT model the character boundaries are not optimal and the results are slightly worse compared to a word arc-wise rescoring. The results for character-level MWE training are a little worse than for MPE, but here again the differences are too small for drawing reliable conclusions. The CN decoder cannot benefit from MWE-trained, character-dependent scaling factors: the gap to the corresponding Viterbi results does not widen compared to the experiments using the MPE criterion. The syllable-dependent scaling factors are inferior to the character-dependent ones. The differences are small, but consistent among all test sets.

In the second set of experiments the log-linear model combination is compared with the system combination approach based on the weighted average of the system-dependent sentence posteriors. The results are summarized in Table 7.5. The word-dependent scaling factors are optimized for the log-linear model combination containing the three acoustic models using the MPE criterion. Obviously, the resulting scales cannot be applied directly in a log-linear model using only one of the three acoustic models, because the impact of the acoustic and the language model is not balanced anymore. As compensation an additional scaling factor per model is introduced and optimized on the development set. The results for the log-linear combination of a single acoustic model, the pronunciation model, and the language model are shown in the first part of the table. The next two parts show the results for the log-linear model combination and the averaged sentence posterior based system combination. The CN decoding of the averaged sentence posteriors clearly outperforms the CN decoding of the log-linear model combination. Notably, the relative improvement from the word-dependent scaling factors is almost the same for both approaches, even though they are optimized only for the log-linear combination.
The picture is completed by the results from the experiments with a single acoustic model, where the relative improvement is in the same range. That is, the log-linear model combination with the three acoustic models does not benefit from the joint training that considers all the acoustic models. The conclusion is that the word-dependent scaling factors presumably do not capture the dependencies between the acoustic models, but solely the interdependency of acoustic and language model.
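The statement that the relative gain from the word-dependent scaling factors is in the same range for all setups can be checked with simple arithmetic on the dev07 error rates reported in Table 7.5 (one scale per model versus 6,904 classes). The helper below is only an illustration; the numbers are copied from the table.

```python
def relative_improvement(before, after):
    """Relative error rate reduction in percent."""
    return 100.0 * (before - after) / before


# dev07 character error rates from Table 7.5: (#classes = 1, #classes = 6,904).
dev07 = {
    "MFCC only": (14.79, 14.44),
    "log-linear model comb.": (14.01, 13.69),
    "avg. sentence posteriors": (13.72, 13.35),
}
gains = {name: relative_improvement(b, a) for name, (b, a) in dev07.items()}
```

All three gains lie between 2% and 3% relative, which is consistent with the observation that the word-dependent scaling factors help about equally with one or with three acoustic models.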
7.3 Summary

In this chapter the log-linear model combination with word- and pronunciation-dependent scaling factors has been introduced. The goal is to describe within the log-linear model the interaction between the combined, but independently trained knowledge sources. The scaling factors are optimized for minimum error rate using the training method described in Section 3.7.2.
Table 7.4. CN-decoding results for the log-linear model combination using word-, character-, and syllable-dependent scaling factors. The scaling factors are trained on 120h using either minimum phone error (MPE) or minimum character error (MWE) training. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. Entries are [%CER]: (del/ins) err.

| Scaling factors | Criterion | #classes (cutoff) | dev07¹ | eval07 | dev08 | heldout |
|---|---|---|---|---|---|---|
| baseline | | | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44 | (2.02/1.19) 10.82 |
| word-dependent | MPE | 1 | (2.72/1.47) 13.94 | (4.53/0.88) 14.51 | (2.85/0.72) 12.73 | (2.28/0.88) 9.76 |
| | | 997 (200) | (2.81/1.38) 13.80 | (4.59/0.75) 14.26 | (2.86/0.69) 12.56 | (2.54/0.76) 9.59 |
| | | 3,596 (50) | (2.80/1.40) 13.76 | (4.57/0.75) 14.23 | (2.84/0.69) 12.57 | (2.61/0.74) 9.54 |
| | | 6,904 (20) | (2.81/1.40) 13.73 | (4.60/0.75) 14.19 | (2.84/0.69) 12.51 | (2.56/0.73) 9.43 |
| | | 10,911 (10) | (2.80/1.40) 13.73 | (4.57/0.76) 14.28 | (2.81/0.71) 12.53 | (2.55/0.75) 9.57 |
| | | 16,665 (5) | (2.82/1.39) 13.74 | (4.59/0.77) 14.25 | (2.80/0.69) 12.46 | (2.58/0.74) 9.52 |
| character-dependent | MPE | 1 | (2.70/1.50) 13.95 | (4.52/0.89) 14.59 | (2.69/0.76) 12.63 | (1.99/0.90) 9.77 |
| | | 2,708 (20) | (2.71/1.42) 13.79 | (4.57/0.81) 14.37 | (2.76/0.70) 12.40 | (2.37/0.80) 9.54 |
| | | 3,707 (5) | (2.72/1.42) 13.80 | (4.56/0.81) 14.37 | (2.76/0.70) 12.40 | (2.37/0.80) 9.55 |
| | MWE | 1 | (2.69/1.50) 13.95 | (4.52/0.89) 14.60 | (2.71/0.76) 12.64 | (1.99/0.90) 9.77 |
| | | 2,708 (20) | (2.87/1.36) 13.84 | (4.72/0.76) 14.42 | (2.88/0.69) 12.55 | (3.04/0.71) 9.77 |
| | | 3,707 (5) | (2.87/1.36) 13.83 | (4.72/0.77) 14.42 | (2.87/0.69) 12.50 | (3.03/0.71) 9.78 |
| syllable-dependent | MPE | 1 | (2.70/1.50) 13.95 | (4.52/0.89) 14.59 | (2.69/0.76) 12.63 | (1.99/0.90) 9.77 |
| | | 1,064 (20) | (2.72/1.43) 13.79 | (4.59/0.82) 14.51 | (2.81/0.71) 12.40 | (2.36/0.78) 9.60 |
| | MWE | 1 | (2.69/1.50) 13.95 | (4.52/0.89) 14.60 | (2.71/0.76) 12.64 | (1.99/0.90) 9.77 |
| | | 1,064 (20) | (3.01/1.33) 13.91 | (4.78/0.76) 14.61 | (3.01/0.70) 12.46 | (3.18/0.66) 9.84 |

¹ tuning set
[Figure 7.1: two line plots over training iterations 0 to 25. Left panel: objective function (phoneme accuracy) and %CER, with curves train (obj. func.), train (Viterbi), heldout (Viterbi), heldout (CN dec.), dev (Viterbi), dev (CN dec.). Right panel: %CER, with curves dev, test1, and test2, each for Viterbi and CN decoding.]

Figure 7.1. Results for the log-linear model combination for 25 training iterations and 6,904 word-dependent scaling factors. The word-dependent scaling factors are trained on 120h. The left plot shows the objective function and character error rates for the training set, the heldout set, and the development set. The right plot shows the progression of the error rates for the development set and the two test sets.
Table 7.5. CN-decoding results for log-linear model combinations and for a system combination using the weighted average of sentence posteriors. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. Entries are [%CER]: (del/ins) err.

| Setup | Acoustic Model(s) | #classes | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|---|
| baseline | | | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44 |
| model combination with one acoustic model | MFCC | 1 | (2.86/1.51) 14.79 | (4.58/0.92) 15.18 | (2.84/0.80) 13.32 |
| | | 6,904 | (2.88/1.44) 14.44 | (4.64/0.81) 15.07 | (2.86/0.70) 12.98 |
| | PLP | 1 | (2.92/1.55) 15.07 | (4.64/0.88) 15.20 | (3.07/0.80) 13.67 |
| | | 6,904 | (3.00/1.36) 14.71 | (4.73/0.84) 15.05 | (3.08/0.75) 13.54 |
| | GT | 1 | (2.89/1.57) 15.47 | (4.68/0.94) 15.97 | (2.89/0.91) 14.05 |
| | | 6,904 | (3.00/1.41) 15.20 | (4.81/0.84) 15.87 | (3.02/0.80) 13.68 |
| log-linear model comb. | MFCC+PLP+GT | 1 | (2.75/1.48) 14.01 | (4.52/0.88) 14.51 | (2.85/0.72) 12.71 |
| | | 6,904 | (2.82/1.39) 13.69 | (4.61/0.75) 14.18 | (2.82/0.69) 12.52 |
| avg. sentence posteriors | MFCC+PLP+GT | 1 | (2.88/1.37) 13.72 | (4.62/0.74) 14.13 | (2.86/0.72) 12.44 |
| | | 6,904 | (2.94/1.25) 13.35 | (4.72/0.63) 13.89 | (2.96/0.64) 12.13 |

¹ tuning set
In this work three acoustic models, a pronunciation model, and a language model are combined for a Chinese task. The training set for the word-dependent scaling factors consists of 120h, which is separate from the 230h used for acoustic model training. Many words of the 60K vocabulary occur only infrequently or not at all in the 120h training set, so that no reliable word-dependent scaling factors can be estimated for them. Those words are tied into a set of fallback classes. Each fallback class has its own scaling factor, where the fallback class for a particular word depends on the length of the word's pronunciation counted in number of phonemes. An alternative approach applicable for Chinese is investigated, where the words are split into characters and character-dependent scaling factors are used. In the experimental results the word-dependent scaling factors performed better than the character-dependent scales and a small but consistent gain in error rate is observed. The error rate is decreased by around 2% relative for all tasks.

In the final set of experiments the log-linear model combination is compared to the system combination via the modified lattice union, cf. Section 3.2.3. The union-based approach clearly outperforms the log-linear model combination both for model-dependent and for word-dependent scaling factors. Notably, the same relative gain from the word-dependent scaling factors is observed for both combination approaches, even though the word-dependent scaling factors are solely optimized for the log-linear model combination.
Chapter 8 Scientific Contributions

The goal of this work has been to investigate Bayes risk decoding techniques and system combination in the Bayes risk decoding framework for LVCSR systems. This work contains the following contributions which cover different aspects of Bayes risk decoding and system combination:

Development of a unified view on system combination. A unified view on system combination in the Bayes risk decoding framework has been developed, which covers most of the common approaches to system combination applied in state-of-the-art LVCSR systems. The log-linear model used in modern LVCSR systems has a natural representation as a weighted finite state transducer (WFST) over a vector semiring, in which the scaling factors of the log-linear model are part of the semiring. An arc label in the WFST is a single word and an arc weight is the vector of the values of word-wise feature functions. The vector usually consists of two scores, the score from the acoustic model and the language model score. Thus, in combination with time stamps assigned to the states, the WFST defines a word lattice, where the probabilities derived from the lattice follow the log-linear model. Context information like the language model history or cross-word boundaries is preserved in the WFST topology.

The log-linear model combination corresponds to a WFST intersection (or to the conceptually similar arc-wise rescoring of the WFST). Path and sentence posterior probabilities are derived directly from the single log-linear model. The common alternative used in system combination methods like confusion network combination is to compute the weighted average of the system-dependent sentence posterior probabilities, where the sentence posteriors are derived from the system-dependent lattices. The combination via the averaged system-dependent posteriors has its interpretation in the WFST framework as a slightly modified lattice union.
In the first combination method all system-dependent log-linear models are combined into a super log-linear model from which sentence posteriors are derived. In the second method system-dependent sentence posteriors are derived from the system-dependent log-linear models. The sentence posteriors are subsequently combined in a linear manner. Intersection and modified union implement the two common approaches for estimating sentence posterior probabilities from a set of word lattices. The lattice combination itself is accomplished by generic transducer operations which combine the system-dependent lattices into a single super lattice, either based on the lattice intersection or the lattice union. Thus, the combination and the decoding problem are separated and system combination is reduced to a single lattice decoding problem.

The lattice decoding is formulated in the Bayes risk framework, where the posterior probabilities are provided by the lattice. The common loss function for Bayes risk decoding for LVCSR tasks is the Levenshtein distance. The computation of the Bayes risk hypothesis from an LVCSR lattice with the Levenshtein distance as loss function is computationally prohibitive and in practice the Levenshtein distance is replaced by an approximation. In this work a classification for loss functions which aim at approximating the Levenshtein distance has been developed. The classes are based on the degree of locality of the approximations. Two classes of local loss functions have been derived which cover the common approximations used in LVCSR tasks, and for these two classes efficient Bayes risk decoders have been developed. The theoretical investigations show that the computation of the Bayes risk hypothesis from the union-based combination is more efficient if a local loss function is used rather than the sentence error.
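The Bayes risk decision rule with the Levenshtein distance as loss can be made concrete on a small scale. The sketch below evaluates the expected loss exactly, but only over a short list of hypotheses with given posteriors and with the candidate space restricted to that list; this is the setting in which the exact computation is still cheap, not the lattice-based decoders developed in this work. Function names and the toy posteriors are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def bayes_risk_decode(hyps):
    """Pick the hypothesis with minimum expected Levenshtein distance.

    hyps: list of (word_sequence, posterior) pairs.
    """
    def risk(cand):
        return sum(p * levenshtein(cand, w) for w, p in hyps)
    return min((w for w, _ in hyps), key=risk)


nbest = [(("a", "b", "c"), 0.4),
         (("a", "x", "c"), 0.3),
         (("a", "x", "d"), 0.3)]
map_hyp = max(nbest, key=lambda h: h[1])[0]
mbr_hyp = bayes_risk_decode(nbest)
```

In the toy list the most probable sentence is "a b c" with an expected loss of 0.9, whereas "a x c" has an expected loss of 0.7 and is therefore the Bayes risk hypothesis; the two decision rules can disagree even on three hypotheses.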
The generic Bayes risk decoder covers a variety of known approaches to system combination including the discriminative model combination (DMC) and the confusion network combination (CNC). In the confusion network combination the lattices are first transformed into CNs which are subsequently combined into a super CN. In this work it has been shown that the CNC decoding rule can also be expressed as a Bayes risk decoding of the lattice union with an appropriate cost function. Furthermore, it has been shown that the CNC cost function is optimal in terms of Bayes risk decoding under the constraint that
the system-dependent alignments can be expressed as CNs. The relation between CNC and ROVER has been established and ROVER with confidence voting has been derived as an approximation of the CNC. The experimental results show that lattice-based system combination improves over the decoding of the best single lattice for all investigated combination approaches and loss functions. The best results are achieved for the lattice union based Bayes risk decoder with either the CN distance or the symmetrically normalized frame error as loss function, where especially the CNC shows a small advantage in the cross-site combination tasks. ROVER degrades only slightly in error rate compared to CNC. For intra-site combination experiments the improvements are around 10% relative compared to the best single system's Viterbi result, and more than 20% relative for the cross-site combination task.

Investigations on the local cost functions used in Bayes risk decoding. In this work the common approximations to the Levenshtein distance used in LVCSR tasks have been compared for Bayes risk decoding of word lattices. Improved, but still efficiently computable loss functions have been developed based on an analysis of the drawbacks of the common approximations. The investigated loss functions include the CN distance, the frame error, and Povey's popular cost function for discriminative acoustic model training. The Bayes risk decoders with the common frame error based cost and Povey's cost show a strong deletion bias. A further analysis of the frame error based cost has revealed that the major reason is the normalization. In particular, it has been shown that the standard normalization of the frame error used for Bayes risk decoding ignores deletions. A modified version has been proposed which shows a lower deletion ratio and outperforms the original frame error based approach.
Likewise, a modified version of Povey's cost has been developed which successfully compensates for the deletion bias. Both modifications are parametrized and thus allow a direct tuning of the deletion ratio. In the experimental results the modified loss functions improve over the original versions and are competitive with, or on some tasks even slightly better than, the CN decoder, i.e. the Bayes risk decoder with the CN distance as loss function.

Investigations on confusion networks. The common algorithms for constructing confusion networks from word lattices are based on heuristics and require careful parameter tuning. The most common approaches are based on a direct arc clustering. Alternative algorithms first perform a fast state clustering that exploits the topology of the word lattice and then apply an arc clustering. In this work two implementations of CN construction algorithms, one based on arc clustering and one based on state clustering, have been developed. The arc clustering algorithm proved to be fast and robust over a wide range of systems and conditions. Though the main concept is inspired by existing approaches, the concrete algorithm is new. The state clustering algorithm follows the implementation of [Xue & Zhao 2005], but the experimental results show that their approach is inferior to the arc clustering algorithm. A modified version has been developed which improves over the original algorithm and proved to be competitive with the direct arc clustering approach. Both algorithms are parametrized, and careful parameter tuning is required for optimal performance. A new approach to lattice-based CN construction has been developed which is conceptually simple and parameter-free. The algorithm is based on frame-wise word posterior probabilities and proved to be competitive with, or on some tasks even better than, the two competing algorithms, though it is significantly slower. The sentence posterior probabilities derived from word lattices are only estimates of the true posteriors.
The structure of the CN allows the sentence posteriors to be broken down into word posteriors, which can then be compared with the empirical posterior estimates for a given development set. In this work a warping function has been applied to the slot-wise word posterior probability distributions defined by the CN in order to bring them closer to the true probability distributions. The technique is especially interesting for cross-site CN combinations, where the system-dependent posterior estimates can be expected to show different biases. In the experimental evaluation on a cross-site combination task the warping reduces the error rate, whereas for an intra-site combination almost no effect on the error rate is observed. However, in both cases the warping function significantly improves the quality of the posterior probability based confidence scores, measured in terms of the normalized cross-entropy.
Furthermore, in this work the connection between the CN distance and the Levenshtein distance has been explored. The lattice-based CN construction algorithms work heuristically and no assumption about the resulting alignment can be made. However, experiments indicate that the CN alignment is a close approximation of the Levenshtein alignment. The idea is to use the CN alignment as a starting point from which the Levenshtein alignment is reached. An approximate Bayes risk decoder with the windowed Levenshtein distance as loss function and the corresponding dynamic programming equations have been developed. The time and space requirements of the decoder are polynomial in the size of the window. The windowed Levenshtein distance can be initialized with any CN alignment, and it has been shown that for a window size of one the result is the common CN decoding rule. For any initial CN alignment and a sufficiently large window the decoder turns into the Bayes risk decoder with the exact Levenshtein distance as loss function. Unfortunately, the approximations made in the windowed Levenshtein distance based Bayes risk decoder mean that the approximated Bayes risk is not guaranteed to decrease monotonically with increasing window size. Though of theoretical interest, in the experimental evaluation the windowed Levenshtein decoder could not gain over the CN decoder in terms of error rate.

Development of a new approach to system combination. The common system combination approaches formulated in the Bayes risk decoding framework have two major drawbacks. The first is the approximation of the Levenshtein distance and the second is the blind reliance on the posterior probability estimates derived from the word lattices. In this work an approach has been introduced and analyzed which aims at overcoming both problems: a classifier based system combination. The experimental results show that under some conditions the classifier approach can clearly outperform the standard approach.
However, compared to the best performing common system combination methods, the classifier based approach gains only little. In the experiments several setups, feature sets, and classifiers have been compared.

Investigations on the log-linear model combination. The log-linear model combination is a common approach in speech recognition to combine several knowledge sources. It can be used as a means of system combination instead of approaches like CNC or ROVER. A common choice for a system combination setup is to build several systems which differ only in their acoustic front-end. The combination is then performed by averaging the weighted posterior probabilities derived from the individual systems. In the log-linear model combination, by contrast, only a single system is built by combining the acoustic models derived from the different acoustic front-ends into a single log-linear model from which the posterior probabilities are computed. In this work the performance of both combination approaches, applied in the Bayes risk decoding framework with the CN distance as loss function, has been compared experimentally. The combination approach based on separate systems clearly outperforms the log-linear model in terms of error rate. The second study introduces word-dependent scaling factors. Instead of using a single scaling factor per knowledge source, the scales are made dependent on both the word and the knowledge source. The experimental results show a small but consistent improvement in error rate. Again, the single log-linear model approach has been compared to the approach based on the averaged system-dependent posteriors, where in both approaches word-dependent scaling factors are applied. The results show that both approaches benefit from the word-dependent scales to the same extent and that the log-linear model combination remains inferior.
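The difference between the two combination rules can be illustrated with a minimal Python sketch. It is a simplification, not the setup used in the experiments: both rules are applied here to the word posteriors of a single, already aligned CN slot, whereas the log-linear model combination described above merges the acoustic models inside one system. The function names, the `<eps>` label for the empty word, the probability floor, and the toy posteriors are assumptions made for illustration only.

```python
import math

EPS = "<eps>"  # assumed label for the empty word in a CN slot


def _vocabulary(slot_posteriors):
    """Collect all words proposed by any system for this slot."""
    vocab = set()
    for posteriors in slot_posteriors:
        vocab.update(posteriors)
    return sorted(vocab)


def average_posteriors(slot_posteriors, weights):
    """Weighted linear average of the system-dependent word posteriors."""
    return {
        word: sum(lam * p.get(word, 0.0) for lam, p in zip(weights, slot_posteriors))
        for word in _vocabulary(slot_posteriors)
    }


def log_linear_posteriors(slot_posteriors, weights, floor=1e-10):
    """Weighted sum of log posteriors, renormalized over the slot.

    Words missing in a system are floored (an assumption of this sketch)
    so that the logarithm stays finite.
    """
    scores = {}
    for word in _vocabulary(slot_posteriors):
        log_score = sum(
            lam * math.log(max(p.get(word, 0.0), floor))
            for lam, p in zip(weights, slot_posteriors)
        )
        scores[word] = math.exp(log_score)
    norm = sum(scores.values())
    return {word: score / norm for word, score in scores.items()}


def decode_slot(posteriors):
    """CN decoding rule for one slot: pick the word with maximum posterior."""
    return max(sorted(posteriors), key=posteriors.get)


# Toy slot: two systems with equal weights.
system_a = {"the": 0.6, "a": 0.4}
system_b = {"the": 0.3, "a": 0.2, EPS: 0.5}
weights = [0.5, 0.5]

averaged = average_posteriors([system_a, system_b], weights)
log_linear = log_linear_posteriors([system_a, system_b], weights)
print(decode_slot(averaged), decode_slot(log_linear))
```

In this toy slot the linear average keeps a noticeable probability mass on the empty word proposed by only one system, whereas the log-linear rule almost removes it, because a single system assigning (near) zero probability dominates the product.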
Chapter 9 Outlook

In this thesis a unified view on system combination in the Bayes risk decoding framework has been developed. Several aspects of system combination and Bayes risk decoding for speech recognition have been investigated. The combination approaches are able to improve over the best single system by up to 20% relative. However, the oracle error rates for lattices and confusion networks (even with a single hypothesis per system as in ROVER) indicate a large potential for further improvements. In particular, none of the sophisticated combination techniques was able to considerably outperform the simple ROVER approach with word confidence scores. From these considerations the following theoretical and experimental questions remain open and may serve as a starting point for further research:

Bayes risk decoding.
• How much improvement can be expected from lattice-based Bayes risk decoding using the Levenshtein distance instead of the sentence error as loss function? This is the question of the general potential of word error minimization instead of sentence error minimization in speech recognition, under the constraint that the unmodified lattice-based posterior probability estimates are used. The follow-up question is: how close does Bayes risk decoding for LVCSR tasks with any suitable Levenshtein distance approximation get to the decoder with the exact Levenshtein distance? First experiments with the windowed Levenshtein distance initialized by a confusion network (CN) alignment were rather disappointing, because a more accurate error approximation did not immediately yield a lower error rate. However, the experimental results indicate that the windowed Levenshtein distance with a small symmetric window (a size of three or five seems to be sufficient) is a good candidate for a very close approximation of the exact Levenshtein distance. The experiments might give a starting point for further theoretical and experimental investigations.
• According to the experimental results presented in this thesis, none of the investigated approximate Levenshtein distances is superior for all systems and under all conditions. The question is whether one of the approximations is superior on a broader range of systems and conditions, or whether there is an even better, efficiently computable approximation.
• Several approaches tried to deal with the unreliability of the lattice-based posterior probabilities. So far, no approach could considerably outperform the plain probability estimates derived directly from the lattice, and the remaining question is: does a better approach exist to model and compensate for the bias in the lattice-based posterior estimates with the objective of reducing the error rate?

System combination techniques.
• The simple ROVER approach performs amazingly well and is hardly beaten by sophisticated combination techniques. We still lack a good understanding of why ROVER performs that well. A good starting point might be the view of ROVER as CNC with pruning. Then the question is: when do search errors occur due to pruning, and can the error be bounded? In other words, can we explain the ROVER performance by showing that even a heavily pruned CNC makes almost no search errors?
• The classifier based approaches to system combination are only at their beginning. There exist several possible extensions which might boost the performance. The first idea is to apply classifiers which consider the context of the complete sentence, like conditional random fields. The second
direction concerns the features, which are so far derived only from the lattices. The classifier based approach offers a simple way to bring additional knowledge sources into the combination process. The question is: do better classifiers and better feature functions exist for classifier based system combination?
• The interaction between cross-adaptation and lattice-based system combination has not yet been systematically explored. In fact, so far there is only intuition but no true understanding of why and how cross-adaptation improves the error rate.
• An open issue is still how to generate ASR systems such that they are optimal for system combination performance. A few approaches have been explored in [Breslin & Gales 2006, 2007a; Willett & He 2008], but none gave a considerable improvement. The question is: can we derive from a deeper analysis of the combination techniques a better algorithm for estimating complementary systems?

Confusion networks.
• The set of alignments stored in a confusion network is restricted and, in general, cannot express the Levenshtein alignments between all sentence pairs in the lattice. The question is: how severe is the restriction in practice?
• All lattice-based confusion network construction algorithms use heuristics to estimate the alignments. Ideally, the algorithm finds the CN which minimizes the Bayes risk with the CN error as loss function. The question is: does an efficiently computable algorithm exist which finds the optimal CN?
• The center-frame CN construction algorithm introduced in Section 4.4.4 shows some nice properties and is competitive with, or even better in error rate than, the standard algorithms. However, the algorithm is based on heuristics and, so far, it is slower than the common CN algorithms which are based on a direct arc clustering. Can the heuristics of the center-frame algorithm be further improved and can the construction be sped up?
Appendix A The Deletion Bias in LVCSR Decoding

The optimization of an ASR system for minimum word error rate (WER), the standard evaluation measure for LVCSR tasks, biases the system towards producing deletions. The main insight is: for an LVCSR system it is preferable to discard a word with a low confidence rather than to risk an insertion. The remainder of this appendix derives this intuition formally.

Let $w_1^N$ be the hypothesis and $\tilde{w}_1^{\tilde{N}}$ be the reference, and let $A = [(k_1, l_1), (k_2, l_2), \dots, (k_M, l_M)]$ denote the Levenshtein alignment between hypothesis and reference. The interpretation of the alignment is that hypothesis word $w_{l_m}$ and reference word $\tilde{w}_{k_m}$ are aligned, where $k_m$ or $l_m$ (but not both) can be zero and the word with index zero is the empty word $\epsilon$, i.e. the position is an insertion or a deletion. Let us assume the following cost function:

\[
c(w, \tilde{w}) :=
\begin{cases}
c_{\mathrm{cor}}, & \text{for } w = \tilde{w} \\
c_{\mathrm{sub}}, & \text{for } w \neq \tilde{w} \wedge w \neq \epsilon \wedge \tilde{w} \neq \epsilon \\
c_{\mathrm{ins}}, & \text{for } \tilde{w} = \epsilon \\
c_{\mathrm{del}}, & \text{for } w = \epsilon
\end{cases}
\tag{A.1}
\]

For the standard Levenshtein distance, $c_{\mathrm{cor}} = 0$ and $c_{\mathrm{sub}} = c_{\mathrm{ins}} = c_{\mathrm{del}} = 1$ hold. Given the Levenshtein alignment, the cost function for the Levenshtein distance, and a probability distribution over the hypothesis space, the expectation of the Levenshtein distance is given by

\[
E\left[\mathrm{Lev}(w_1^N, \tilde{w}_1^{\tilde{N}})\right]
= E\left[\sum_{m=1}^{M} c(w_{l_m}, \tilde{w}_{k_m})\right]
= \sum_{m=1}^{M} E_m\left[c(w_{l_m}, \tilde{w}_{k_m})\right],
\tag{A.2}
\]

where the expectation is computed over all sentences $w_1^N$ and the corresponding posterior probability $\mathrm{Pr}(w_1^N \mid x_1^T)$. Under the assumption of a fixed alignment, further investigations can be done position-wise along the alignment. The expected cost at position $m$ is given by

\[
E_m\left[c(w, \tilde{w})\right]
= \mathrm{Pr}_m(w \neq \tilde{w}, w \neq \epsilon, \tilde{w} \neq \epsilon \mid x_1^T)
+ \mathrm{Pr}_m(\tilde{w} = \epsilon \mid x_1^T)
+ \mathrm{Pr}_m(w = \epsilon \mid x_1^T).
\tag{A.3}
\]

The question of interest is now: when is it advantageous to delete $w$ at position $m$, i.e. to replace $w$ by the empty word $\epsilon$? The expectation for setting $w$ to $\epsilon$ is given by

\[
E_{m,\, w \to \epsilon}\left[c(w, \tilde{w})\right]
= \mathrm{Pr}_m(w = \tilde{w} \mid x_1^T)
+ \mathrm{Pr}_m(w \neq \tilde{w}, w \neq \epsilon, \tilde{w} \neq \epsilon \mid x_1^T)
+ \mathrm{Pr}_m(w = \epsilon \mid x_1^T).
\tag{A.4}
\]

An insertion cannot happen anymore, but an error occurs if $w$ equals the correct word $\tilde{w}$. A comparison of Equation (A.3) and Equation (A.4) shows when it is advantageous for the system to replace $w$ by the empty word $\epsilon$, i.e. to delete $w$:

\[
\begin{aligned}
& E_{m,\, w \to \epsilon}\left[c(w, \tilde{w})\right] < E_m\left[c(w, \tilde{w})\right] \\
\Leftrightarrow\; & \mathrm{Pr}_m(w = \tilde{w} \mid x_1^T) < \mathrm{Pr}_m(\tilde{w} = \epsilon \mid x_1^T) \\
\Leftrightarrow\; & \mathrm{Pr}_m(w = \tilde{w} \mid w \neq \epsilon, x_1^T) < \mathrm{Pr}_m(\tilde{w} = \epsilon \mid w \neq \epsilon, x_1^T)
\end{aligned}
\tag{A.5}
\]
The result in words: if the risk of an insertion is higher than the probability of the word being correct, then it is better to discard the word. Thus, a system optimized for minimum WER will have a (slight) deletion bias.
The above result can be used in a postprocessing step to the search: simply delete all words from the hypothesis for which Equation (A.5) is fulfilled. In practice, the estimate for $\mathrm{Pr}_m(w = \tilde{w} \mid w \neq \epsilon, x_1^T)$ is the word confidence score $\mathrm{conf}(w_{l_m})$ for hypothesis word $w_{l_m}$. The probability of an insertion at position $m$ can be roughly estimated as

\[
\mathrm{Pr}_m(\tilde{w} = \epsilon \mid w \neq \epsilon, x_1^T)
\approx \frac{\mathrm{ins}(A_{\mathrm{best}})}{N}
\approx \frac{\mathrm{ins}(A_{\mathrm{best}})}{\tilde{N}},
\]

where $A_{\mathrm{best}}$ is the alignment of the decoder output and the reference and $\mathrm{ins}(A_{\mathrm{best}})$ counts the number of insertions in the alignment. That is, the probability of an insertion is simply approximated by the insertion ratio in the WER computed between the original hypothesis $w_1^N$ and the reference $\tilde{w}_1^{\tilde{N}}$. Theoretically, deleting words in the decoding output, i.e. replacing words in $w_1^N$ with $\epsilon$, can change the Levenshtein alignment to the reference $\tilde{w}_1^{\tilde{N}}$, which is not considered in the above analysis. However, the error rate will presumably benefit from the new alignment: assume that a low-confidence word which was aligned to a reference word is replaced by $\epsilon$, and further assume that one of the adjacent hypothesis words causes an insertion in the current alignment. Then in the new alignment only a substitution appears, but no longer the insertion.

The practical use of the postprocessing algorithm is very limited. For the systems investigated in this work the insertion ratio is small (≪ 10%); only few words have such a small confidence score, and confidence scores in this range are usually not very reliable. In one experiment the approach was applied to a task with a rather high insertion ratio (almost 10%), and all words with a confidence score lower than a given threshold were discarded. The experiment was done for a Chinese task and the threshold was chosen empirically for minimum character error rate (CER). For thresholds between 0.3 and 0.4 an improvement in CER was observed. Experiments with common tasks which have a low insertion ratio showed only a slight improvement, if any, at the price of a highly increased deletion ratio.
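The postprocessing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the hypothesis is assumed to be a list of (word, confidence) pairs, the function names are hypothetical, and the insertion counts and confidence values are made up.

```python
def insertion_ratio(num_insertions, num_reference_words):
    """Rough estimate of the insertion probability: ins(A_best) / reference length."""
    return num_insertions / num_reference_words


def delete_low_confidence_words(hypothesis, threshold):
    """Apply the rule of Equation (A.5): drop every word whose confidence
    score is below the estimated insertion probability (or a tuned threshold)."""
    return [(word, conf) for word, conf in hypothesis if conf >= threshold]


# Toy example: 8 insertions against 100 reference words on some tuning set.
threshold = insertion_ratio(8, 100)
hypothesis = [("the", 0.95), ("uh", 0.04), ("cat", 0.80)]
pruned = delete_low_confidence_words(hypothesis, threshold)
print(pruned)
```

Passing an empirically tuned value instead of the insertion ratio as `threshold` corresponds to the Chinese experiment described above, where thresholds between 0.3 and 0.4 were found to help.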
Appendix B Corpora and Systems

Experiments have been performed on four setups with a total of 19 subsystems. Two systems were built for the Chinese track of the GALE project: a testing system used for fast technology tests and the RWTH Aachen GALE 2008 evaluation system. The Chinese corpora and systems are introduced in Section B.1. Further system combination experiments are done within the RWTH Aachen English TC-Star/EPPS 2007 evaluation system. For the same evaluation, word lattices were provided by four project partners and extensively used for cross-site combination experiments. The corpora used in the English track of the TC-Star/EPPS 2007 evaluation, the RWTH Aachen evaluation system, and the cross-site combination setup are described in Section B.2.
B.1 Chinese GALE Systems

The systems were developed as part of the participation of RWTH Aachen in the Global Autonomous Language Exploitation (GALE) project [Hoffmeister & Plahl+ 2007; Plahl & Hoffmeister+ 2008b, 2009]. The goal of the GALE program is to provide the technology for translating and analyzing huge volumes of speech and text in multiple languages. A particular subtask is the transcription of Chinese broadcast news (BN) and broadcast conversations (BC). Training and tuning/testing data are provided within the project and are summarized in Table B.1. The complete 1,600 hours of training data consist of the Hub4 and TDT4 data and the GALE data releases Y1Q14, P2R12, P3R12, and P4R1. Hub4 consists of 30h of carefully transcribed BN data. The 120h of TDT4 BN data come with closed captions and the GALE data releases with quick transcriptions¹. The 230h training set is a subset made up of the Hub4 data and 100h BN and 100h BC data taken from the GALE releases.

Two systems are used for the experiments, each consisting of several subsystems. The first system, described in Section B.1.1, is trained on the 230h training set and is used for technology testing and analysis. Section B.1.2 describes the latest RWTH Aachen Chinese system used in the GALE 2008 evaluation. Both systems share the same pronunciation dictionary, word list, and language model. The derivation of the pronunciation dictionary is described in detail in [Plahl & Hoffmeister+ 2008b]. Word list and language model were kindly provided by SRI/University of Washington (UW) and are equivalent

¹ The data is provided by LDC, and at least the Chinese Hub4 and TDT4 data is publicly available at http://ldc.upenn.edu; the GALE data releases are not yet publicly available.
Table B.1. Corpora statistics for the Chinese GALE systems.
Corpus #Segments training 230h 206K 1600h 1.3M tuning/ testing dev07 1,655 eval07 1,013 dev08 618
#Words
Audio data [h]
2.4M 15.5M
230 1,600
27.5K 10.5K
2.5 1.6 1.0
Table B.2. Subsystems in the Chinese 230 testing system.

Name    Acoustic Front-End   Randomized CART
s1      MFCC                 no
s1.r1   MFCC                 yes
s1.r2   MFCC                 yes
s2      PLP                  no
s2.r1   PLP                  yes
s3      GT                   no
s3.r1   GT                   yes
to the ones used in the SRI/UW GALE evaluation systems [Hwang & Peng+ 2007; Lei & Wu+ 2009]. The word list contains 60K words and the language model is a large 4-gram. A pruned version of the LM is used in the recognition runs and the full 4-gram is applied in a subsequent lattice rescoring step.
B.1.1 The Chinese 230h Testing System

The 230h testing system consists of seven subsystems, all maximum likelihood (ML) trained on the 230h training set; dev07 is used for parameter tuning. The subsystems vary in their acoustic front-end and some use a randomized phonetic decision tree (randomized CART). The following list gives an overview of the training setup and decoding structure for a single subsystem; a detailed discussion can be found in [Plahl & Hoffmeister+ 2008b].
• 3 × 1-state HMMs
• across-word acoustic model
• state-tying via (randomized) phonetic decision tree
• 4,501 mixtures with a total of 1.1M Gaussian densities
• 16 dimensional acoustic features
• LDA on 9 adjacent input frames (16 × 9 = 144 input features), reduced to 45 dimensions
• 1 tone feature including first and second derivatives
• 60K vocabulary
• 4-gram LM (PP_dev07 = 367)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: ML trained SAT/CMLLR acoustic model, MLLR
• 3. decoding pass: lattice rescoring with full LM

The randomization of the phonetic decision tree follows the approach described in [Dietterich 2000b] and was first applied to speech recognition in [Siohan & Ramabhadran+ 2005]. Table B.2 lists the resulting seven subsystems, which are used in the various system combination experiments. In the experiments, systems with different acoustic front-ends and with randomized phonetic decision trees are combined. In particular, the experiments shown in Appendix C compare the approaches to complementary system building via different acoustic front-ends and via randomized phonetic decision trees. Table B.3 gives an overview of the tested combinations. For the lattice decoding and combination experiments the word lattices are pruned to a density of 75.
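The input dimension quoted in the list above (9 adjacent frames of 16-dimensional features, 16 × 9 = 144) results from stacking each frame with its neighbours before the LDA projection. The following Python sketch shows only this stacking step. The function name `stack_frames` is hypothetical, and repeating the first and last frame at the utterance boundaries is an assumption of this sketch; the source does not specify the boundary handling, and the LDA projection to 45 dimensions is not shown.

```python
def stack_frames(frames, context=4):
    """Concatenate every frame with `context` neighbours on each side.

    With context=4 this gives a window of 9 adjacent frames. At the
    utterance boundaries the first/last frame is repeated (assumption).
    """
    num_frames = len(frames)
    stacked = []
    for t in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            # Clamp the index so that the window never leaves the utterance.
            index = min(max(t + offset, 0), num_frames - 1)
            window.extend(frames[index])
        stacked.append(window)
    return stacked


# Toy utterance: 20 frames of 16-dimensional features.
frames = [[float(t)] * 16 for t in range(20)]
stacked = stack_frames(frames)
print(len(stacked), len(stacked[0]))
```

With 17-dimensional base features, for instance 16 acoustic features plus one additional feature, the same window yields 17 × 9 = 153 LDA input features.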
Table B.3. System combinations for the Chinese 230 testing system.

Name                          #Systems   Acoustic Front-End(s)   Randomized CARTs
s1+s2                         2          MFCC, PLP               no
s1+s2+s3                      3          MFCC, PLP, GT           no
s1+s1.r1+s1.r2                3          MFCC                    yes
s1+s1.r1+s2+s2.r1+s3+s3.r1    6          MFCC, PLP, GT           yes
B.1.2 The RWTH Aachen Chinese GALE 2008 Evaluation System

This section describes the Chinese system used by RWTH Aachen in the GALE 2008 evaluation. The basic setup follows the Chinese 230h testing system, but additional techniques and the complete 1,600h training set are used. The additional techniques include neural network (NN) based phoneme posterior features, minimum phoneme error (MPE) discriminative acoustic model training, and cross-adaptation. The following list gives a summary of the training and decoding setup.
• 3 × 1-state HMMs
• across-word acoustic model
• state-tying via phonetic decision tree
• 4,501 mixtures with a total of 1.2M Gaussian densities
• 16 dimensional acoustic base features (+1 voicing feature)
• 1 tone feature
• LDA on 9 adjacent input frames ((16 + 1) × 9 = 153 input features; with voicing feature: (16 + 1 + 1) × 9 = 162 input features), reduced to 45 dimensions
• 35 (IDIAP) or 32 (ICSI) dimensional NN features, concatenated to the LDA result
• 60K vocabulary
• 4-gram LM (PP_dev07 = 367)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: MPE trained SAT/CMLLR acoustic model, MLLR or cross-system MLLR
• 3. decoding pass: lattice rescoring with full LM

A detailed discussion of the setup, including a description of the NN features, is given in [Plahl & Hoffmeister+ 2009]. Cross-adaptation and lattice-based system combination are two combination techniques which can easily be stacked: the systems are first cross-adapted, and the resulting cross-adapted lattices are subsequently combined. For the Chinese GALE 2008 evaluation system the interaction of cross-adaptation and lattice-based system combination is explored experimentally. The system consists of two core subsystems: one is based on MFCC features augmented with the NN features provided by IDIAP, and the other uses a PLP front-end together with the NN features provided by ICSI. Each of the two core subsystems exists in two flavors: with and without cross-adaptation. The cross-adapted system uses the final output of the other, non-cross-adapted system as supervisor in the CMLLR/MLLR adaptation step. Lattice-based system combination experiments are performed for the pair of non-cross-adapted as well as for the pair of cross-adapted subsystems. Table B.4 summarizes the differences between the subsystems. In the system combination experiments the two cross-adapted systems, called s1.x2 and s2.x1, and the two non-cross-adapted systems, called s1
Table B.4. Subsystems in the RWTH Aachen Chinese GALE 2008 evaluation system.

Name    Acoustic Front-End   NN features   Voicing feature   CMLLR/MLLR supervisor
s1      MFCC                 IDIAP         no                s1 (1. pass output)
s1.x2   MFCC                 IDIAP         no                s2 (final output)
s2      PLP                  ICSI          yes               s2 (1. pass output)
s2.x1   PLP                  ICSI          yes               s1 (final output)
Table B.5. Corpora statistics for the English EPPS systems.

Corpus                          #Segments   #Words   Audio data [h]
training         supervised     67K         660K     91.6
                 unsupervised                        187.2
tuning/testing   dev06          726         29K      3.2
                 eval06         742         30K      3.2
                 eval07         644         27K      2.9
and s2, are combined. In particular, the experiments presented in Appendix C show the effect of stacking cross-adaptation and lattice-based system combination. For the lattice combination experiments word lattices are produced with all four subsystems and pruned to a density of 75.
B.2 English TC-Star/EPPS Systems

The European parliament plenary sessions (EPPS) task was part of the TC-Star project. The objective is to transcribe debates from the European parliament. RWTH Aachen participated in all evaluations, which took place in 2005, 2006, and 2007 [Lööf & Bisani+ 2006; Lööf & Bisani+ 2006; Lööf & Gollan+ 2007]. In 2006 and 2007 the project partners agreed on sharing lattices from their best evaluation (sub)system for system combination experiments. In this work results are presented for the TC-Star 2007 English EPPS evaluation. The corpora statistics for the training and testing data are summarized in Table B.5. The eval06 set was the evaluation set in the 2006 evaluation and the official development set in the 2007 evaluation. Section B.2.1 describes the RWTH Aachen English EPPS 2007 evaluation system and Section B.2.2 the setup of the cross-site combination experiments based on the lattices shared after the 2007 evaluation.
B.2.1 The RWTH Aachen English EPPS 2007 Evaluation System

This section describes the English system used by RWTH Aachen for the EPPS task in the TC-Star 2007 evaluation campaign. Four subsystems are trained, varying in the acoustic front-ends and in the amount of training data. Parameter tuning is done on the eval06 corpus. An overview of the training and decoding setup is given by the following list.
• 3 × 2-state HMMs
• across-word acoustic model
• 4,501 mixtures with a total of 0.8M Gaussian densities
• state-tying via phonetic decision tree
Table B.6. Subsystems in the RWTH Aachen English EPPS 2007 evaluation system.

Name   Acoustic Front-End   NN features   Unsupervised training data
s1     MFCC                 no            yes
s2     MFCC                 no            no
s3     GT                   no            no
s4     MFCC                 yes           no
• 16 dimensional acoustic base features + 1 voicing feature
• LDA on 9 adjacent input frames ((16 + 1) × 9 = 153 input features), reduced to 45 dimensions
• neural network based phoneme posterior features
• 52K vocabulary
• 4-gram LM (PP_eval06 = 106)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: MPE trained SAT/CMLLR acoustic model, MLLR
• 3. decoding pass: lattice rescoring with full LM

The pronunciation lexicon is based on the English Beep lexicon; missing pronunciations are derived from a grapheme-to-phoneme conversion model, which is trained on the Beep lexicon [Bisani & Ney 2003]. For the decoding passes a pruned version of the 4-gram LM is used, and the full LM is applied in the lattice rescoring step. A detailed description of the system can be found in [Lööf & Gollan+ 2007]. The four subsystems and their main differences are listed in Table B.6. In the system combination experiments the combination of the first two systems, called s1+s2, of the first three systems, s1+s2+s3, and of all four systems, s1+s2+s3+s4, are used. In particular, the experimental results in Appendix C show how the combination benefits from adding more systems. For the lattice combination experiments word lattices were produced with all subsystems and pruned to a density of 75.
B.2.2 The English EPPS 2007 Evaluation Cross-site Combination

All partners who participated in the English EPPS task of the TC-Star 2007 evaluation campaign were asked to provide lattices from their best (sub)system for system combination experiments. In the end, four sites kindly distributed their lattices: CNRS/LIMSI [Lamel & Gauvain+ 2007], FBK/IRST (formerly ITC/IRST) [Falavigna & Bertoldi+ 2007], RWTH Aachen University [Lööf & Gollan+ 2007], and University of Karlsruhe (UKA) [Stüker & Fügen+ 2007]. The lattices provided by RWTH Aachen were produced by subsystem s1, cf. Section B.2.1. Word lattices are provided for the eval06 corpus (the official development set for the 2007 evaluation) and for the eval07 corpus. All sites used their own acoustic segmentation. For the lattice-based system combination experiments the segmentation was unified by concatenating the lattices recording-wise, where eval06 consists of five and eval07 of eight recordings. The lattices are normalized by applying the normalization rules used in scoring. The resulting lattices are pruned to a density of 50, where the target density is given by the least dense lattice set. All lattices come with separate acoustic and language model scores. Parameter optimization is done on the development set (eval06). System combination results are produced for the combination of the two best performing systems, LIMSI+RWTH, the three best performing systems, LIMSI+RWTH+UKA, and for the combination of all four systems, LIMSI+RWTH+UKA+IRST. Systematic results for each of the combinations are presented in Appendix C.
Appendix C Experimental Results

Detailed results for the systems introduced in Appendix B are given. Experimental results are produced and summarized for each system and for all combination methods and decoding rules introduced in Chapter 3 and Chapter 4. First, the results for the individual subsystems are presented, followed by the various combination results. The first set of results is produced with the minimum sentence error decoding rules discussed in Chapter 3. For single systems these are the Viterbi and the MAP decoder. In the system combination experiments the Viterbi and the MAP decoding rules are applied to the lattice intersection and the modified lattice union. MAP decoding results for the union based combination were eventually omitted, because a single decoding run took several days and thus no parameter optimization was possible in a reasonable amount of time. The second set of results is produced by Bayes risk decoders which aim at minimizing an approximate Levenshtein distance, in particular the approximations introduced in Chapter 4. The results are structured as follows: first, the results for the three confusion network (CN) construction algorithms are given. They are followed by the frame error results with different normalization approaches. Last, the four variants of the error approximation based on local alignments are added. The system combination experiments use the modified union approach to combine the system-dependent lattices. For comparison, ROVER and confusion network combination (CNC) results are included.
C.1 The Chinese 230h Testing System

This section summarizes the results for the Chinese 230h testing system introduced in Section B.1.1. All results are produced on character lattices. The character lattices are derived from word lattices by splitting the word arcs into character arcs, where the character boundaries are determined by a forced alignment of the characters within a word arc. The error measure is the character error rate (CER).
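As a reminder of how the reported figures are defined, the following Python sketch computes a character error rate together with the substitution, deletion, and insertion counts of one minimum-cost Levenshtein alignment. It is an illustrative sketch, not the scoring tool used for the experiments: the function names are hypothetical, no text normalization is applied, and when several optimal alignments exist the split into error types depends on the tie-breaking chosen here.

```python
def error_counts(hyp, ref):
    """Return (total, substitutions, deletions, insertions) of a
    minimum-cost Levenshtein alignment between hypothesis and reference."""
    rows, cols = len(hyp) + 1, len(ref) + 1
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i)  # only hypothesis tokens: insertions
    for j in range(1, cols):
        table[0][j] = (j, 0, j, 0)  # only reference tokens: deletions
    for i in range(1, rows):
        for j in range(1, cols):
            total, sub, dele, ins = table[i - 1][j - 1]
            if hyp[i - 1] == ref[j - 1]:
                diagonal = (total, sub, dele, ins)
            else:
                diagonal = (total + 1, sub + 1, dele, ins)
            total, sub, dele, ins = table[i - 1][j]
            insertion = (total + 1, sub, dele, ins + 1)
            total, sub, dele, ins = table[i][j - 1]
            deletion = (total + 1, sub, dele + 1, ins)
            # Tuples compare by total cost first; ties are broken arbitrarily
            # but deterministically by the remaining counts.
            table[i][j] = min(diagonal, insertion, deletion)
    return table[-1][-1]


def character_error_rate(hyp, ref):
    """CER in percent: Levenshtein distance normalized by the reference length."""
    total, _, _, _ = error_counts(hyp, ref)
    return 100.0 * total / len(ref)


reference = list("abcde")
hypothesis = list("abde")  # the character "c" is missing
print(error_counts(hypothesis, reference), character_error_rate(hypothesis, reference))
```

The tables in this appendix report the deletion and insertion rates in parentheses next to the total error rate; the corresponding rates follow from the counts above by dividing by the reference length.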
MFCC frontend (s1)
Results for the Chinese 230h testing system with the MFCC acoustic frontend. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.63/1.59) 14.54 | (4.42/0.91) 15.08 | (2.80/0.87) 13.28 |
| | MAP | (2.67/1.56) 14.56 | (4.42/0.91) 15.14 | (2.88/0.85) 13.39 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.79/1.45) 14.30 | (4.53/0.85) 14.96 | (2.85/0.80) 13.05 |
| | state-cluster | (2.95/1.41) 14.31 | (4.69/0.82) 14.93 | (3.07/0.79) 13.10 |
| | center-frame | (2.81/1.45) 14.32 | (4.56/0.85) 14.95 | (2.89/0.80) 13.10 |
| Frame Error, error norm.: | hyp. | (2.92/1.38) 14.35 | (4.62/0.79) 14.98 | (3.01/0.75) 13.13 |
| | arc-sym. | (2.68/1.53) 14.42 | (4.45/0.90) 15.09 | (2.80/0.83) 13.09 |
| | path-sym. | (2.52/1.61) 14.23 | (4.32/0.98) 14.96 | (2.75/0.94) 13.11 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.89/1.39) 14.33 | (4.62/0.80) 15.03 | (3.00/0.75) 13.14 |
| | Povey’s cost (mod.) | (2.32/1.68) 14.17 | (4.23/1.03) 15.01 | (2.61/0.97) 13.04 |
| | 1/2 overlap cost (cont.) | (2.70/1.51) 14.33 | (4.45/0.89) 14.98 | (2.84/0.84) 13.12 |
| | 1/2 overlap cost (disc.) | (2.61/1.55) 14.34 | (4.46/0.92) 15.01 | (2.73/0.88) 13.06 |

¹ tuning set
MFCC frontend and randomized CART (s1.r1)
Results for the Chinese 230h testing system with the MFCC acoustic frontend and a randomized phonetic decision tree. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.72/1.60) 14.61 | (4.57/0.94) 15.22 | (2.87/0.88) 13.58 |
| | MAP | (2.77/1.57) 14.59 | (4.57/0.92) 15.20 | (2.85/0.88) 13.53 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.91/1.46) 14.33 | (4.66/0.82) 14.83 | (2.99/0.83) 13.25 |
| | state-cluster | (3.05/1.43) 14.38 | (4.82/0.81) 14.89 | (3.11/0.82) 13.28 |
| | center-frame | (2.83/1.49) 14.33 | (4.55/0.87) 14.86 | (2.89/0.83) 13.19 |
| Frame Error, error norm.: | hyp. | (2.97/1.46) 14.39 | (4.68/0.85) 14.90 | (3.00/0.80) 13.24 |
| | arc-sym. | (2.77/1.54) 14.51 | (4.53/0.90) 14.99 | (2.85/0.86) 13.26 |
| | path-sym. | (2.65/1.60) 14.31 | (4.42/0.99) 14.87 | (2.77/0.89) 13.18 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.99/1.42) 14.40 | (4.65/0.82) 14.89 | (3.04/0.75) 13.20 |
| | Povey’s cost (mod.) | (2.47/1.71) 14.34 | (4.30/1.06) 14.94 | (2.58/0.97) 13.14 |
| | 1/2 overlap cost (cont.) | (2.72/1.52) 14.34 | (4.49/0.94) 14.93 | (2.81/0.83) 13.18 |
| | 1/2 overlap cost (disc.) | (2.71/1.55) 14.33 | (4.49/0.89) 14.90 | (2.83/0.85) 13.26 |

¹ tuning set
MFCC frontend and randomized CART (s1.r2)
Results for the Chinese 230h testing system with the MFCC acoustic frontend and a randomized phonetic decision tree. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.70/1.58) 14.49 | (4.51/0.96) 15.11 | (2.77/0.99) 13.56 |
| | MAP | (2.63/1.65) 14.51 | (4.44/0.94) 15.09 | (2.74/0.99) 13.53 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.98/1.44) 14.28 | (4.74/0.83) 15.05 | (3.10/0.84) 13.45 |
| | state-cluster | (3.17/1.38) 14.29 | (4.83/0.81) 15.04 | (3.25/0.84) 13.48 |
| | center-frame | (2.95/1.44) 14.25 | (4.71/0.83) 15.05 | (3.10/0.84) 13.43 |
| Frame Error, error norm.: | hyp. | (2.99/1.43) 14.27 | (4.71/0.85) 15.05 | (3.05/0.81) 13.42 |
| | arc-sym. | (2.85/1.48) 14.34 | (4.62/0.88) 15.08 | (2.94/0.87) 13.48 |
| | path-sym. | (2.59/1.60) 14.21 | (4.41/1.00) 14.90 | (2.73/0.96) 13.24 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.98/1.38) 14.28 | (4.72/0.78) 14.93 | (3.06/0.82) 13.28 |
| | Povey’s cost (mod.) | (2.72/1.51) 14.24 | (4.52/0.88) 14.95 | (2.80/0.92) 13.24 |
| | 1/2 overlap cost (cont.) | (2.73/1.50) 14.24 | (4.54/0.88) 14.95 | (2.84/0.90) 13.31 |
| | 1/2 overlap cost (disc.) | (2.70/1.55) 14.27 | (4.55/0.91) 15.02 | (2.87/0.93) 13.39 |

¹ tuning set
PLP frontend (s2)
Results for the Chinese 230h testing system with the PLP acoustic frontend. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.65/1.70) 14.82 | (4.44/0.93) 15.02 | (2.71/0.94) 13.54 |
| | MAP | (2.63/1.72) 14.80 | (4.41/0.96) 15.00 | (2.66/0.97) 13.47 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.90/1.50) 14.52 | (4.62/0.81) 14.74 | (2.88/0.79) 13.35 |
| | state-cluster | (3.12/1.44) 14.53 | (4.83/0.80) 14.80 | (3.12/0.74) 13.39 |
| | center-frame | (2.85/1.52) 14.48 | (4.59/0.82) 14.71 | (2.88/0.77) 13.35 |
| Frame Error, error norm.: | hyp. | (2.90/1.50) 14.55 | (4.63/0.81) 14.74 | (2.93/0.78) 13.36 |
| | arc-sym. | (2.78/1.58) 14.67 | (4.53/0.86) 14.83 | (2.78/0.83) 13.41 |
| | path-sym. | (2.66/1.63) 14.47 | (4.43/0.92) 14.73 | (2.71/0.89) 13.30 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.95/1.46) 14.54 | (4.67/0.76) 14.76 | (2.95/0.78) 13.43 |
| | Povey’s cost (mod.) | (2.52/1.69) 14.48 | (4.33/0.96) 14.72 | (2.56/0.94) 13.30 |
| | 1/2 overlap cost (cont.) | (2.76/1.56) 14.55 | (4.52/0.85) 14.75 | (2.73/0.84) 13.33 |
| | 1/2 overlap cost (disc.) | (2.77/1.53) 14.54 | (4.53/0.86) 14.76 | (2.81/0.87) 13.44 |

¹ tuning set
PLP frontend and randomized CART (s2.r1)
Results for the Chinese 230h testing system with the PLP acoustic frontend and a randomized phonetic decision tree. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.69/1.68) 14.73 | (4.45/0.99) 14.97 | (2.75/0.93) 13.51 |
| | MAP | (2.68/1.69) 14.72 | (4.43/0.96) 14.98 | (2.73/0.93) 13.38 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.91/1.56) 14.46 | (4.62/0.87) 14.77 | (2.97/0.81) 13.24 |
| | state-cluster | (3.12/1.51) 14.49 | (4.82/0.83) 14.82 | (3.12/0.80) 13.28 |
| | center-frame | (2.90/1.56) 14.47 | (4.59/0.86) 14.76 | (2.96/0.81) 13.19 |
| Frame Error, error norm.: | hyp. | (3.14/1.44) 14.50 | (4.78/0.79) 14.75 | (3.18/0.77) 13.30 |
| | arc-sym. | (2.80/1.61) 14.56 | (4.53/0.91) 14.85 | (2.89/0.85) 13.33 |
| | path-sym. | (3.07/1.48) 14.40 | (4.74/0.82) 14.75 | (3.09/0.78) 13.22 |
| Local Alignment based Error | Povey’s cost (orig.) | (3.03/1.46) 14.47 | (4.68/0.82) 14.75 | (3.07/0.78) 13.30 |
| | Povey’s cost (mod.) | (2.40/1.84) 14.42 | (4.25/1.06) 14.79 | (2.53/1.08) 13.16 |
| | 1/2 overlap cost (cont.) | (2.84/1.57) 14.46 | (4.57/0.88) 14.82 | (2.91/0.81) 13.22 |
| | 1/2 overlap cost (disc.) | (2.76/1.63) 14.48 | (4.45/0.92) 14.79 | (2.81/0.88) 13.22 |

¹ tuning set
GT frontend (s3)
Results for the Chinese 230h testing system with the acoustic frontend based on the Gammatone filter bank. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.65/1.64) 15.07 | (4.57/1.04) 15.60 | (2.84/0.93) 13.80 |
| | MAP | (2.66/1.63) 15.08 | (4.56/1.04) 15.58 | (2.83/0.92) 13.82 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.97/1.48) 14.86 | (4.74/0.92) 15.42 | (3.01/0.85) 13.67 |
| | state-cluster | (3.11/1.44) 14.88 | (4.91/0.88) 15.42 | (3.21/0.86) 13.82 |
| | center-frame | (2.89/1.50) 14.83 | (4.73/0.93) 15.42 | (2.98/0.85) 13.65 |
| Frame Error, error norm.: | hyp. | (2.98/1.49) 14.87 | (4.77/0.86) 15.41 | (3.06/0.83) 13.63 |
| | arc-sym. | (2.84/1.56) 15.01 | (4.67/0.96) 15.53 | (2.94/0.87) 13.80 |
| | path-sym. | (2.84/1.53) 14.76 | (4.72/0.95) 15.41 | (3.01/0.91) 13.71 |
| Local Alignment based Error | Povey’s cost (orig.) | (3.09/1.41) 14.91 | (4.89/0.83) 15.42 | (3.16/0.79) 13.82 |
| | Povey’s cost (mod.) | (2.74/1.57) 14.80 | (4.62/0.99) 15.42 | (2.89/0.92) 13.75 |
| | 1/2 overlap cost (cont.) | (2.87/1.50) 14.86 | (4.68/0.94) 15.45 | (2.96/0.88) 13.69 |
| | 1/2 overlap cost (disc.) | (2.79/1.56) 14.87 | (4.65/0.99) 15.49 | (2.88/0.94) 13.71 |

¹ tuning set
GT frontend and randomized CART (s3.r1)
Results for the Chinese 230h testing system with the acoustic frontend based on the Gammatone filter bank and a randomized phonetic decision tree. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.65/1.69) 15.23 | (4.56/1.08) 15.86 | (2.79/0.97) 14.15 |
| | MAP | (2.62/1.72) 15.23 | (4.53/1.08) 15.86 | (2.78/0.97) 14.05 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.91/1.54) 15.07 | (4.73/0.94) 15.66 | (2.91/0.84) 13.71 |
| | state-cluster | (3.13/1.46) 15.07 | (4.93/0.90) 15.68 | (3.12/0.80) 13.78 |
| | center-frame | (2.93/1.52) 15.05 | (4.71/0.94) 15.62 | (2.94/0.85) 13.75 |
| Frame Error, error norm.: | hyp. | (2.96/1.53) 15.09 | (4.72/0.95) 15.71 | (3.00/0.83) 13.74 |
| | arc-sym. | (2.87/1.59) 15.24 | (4.70/0.97) 15.76 | (2.88/0.85) 13.80 |
| | path-sym. | (2.41/1.85) 15.00 | (4.38/1.22) 15.72 | (2.58/1.03) 13.59 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.98/1.51) 15.13 | (4.79/0.93) 15.72 | (2.97/0.81) 13.71 |
| | Povey’s cost (mod.) | (2.65/1.66) 14.98 | (4.50/1.04) 15.61 | (2.71/0.94) 13.65 |
| | 1/2 overlap cost (cont.) | (2.80/1.59) 15.04 | (4.60/0.99) 15.65 | (2.82/0.88) 13.66 |
| | 1/2 overlap cost (disc.) | (2.73/1.62) 15.07 | (4.59/1.02) 15.68 | (2.78/0.89) 13.68 |

¹ tuning set
Combination of two acoustic frontends (s1+s2)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (2.55/1.58) 14.05 | (4.43/0.91) 14.59 | (2.75/0.84) 13.09 |
| | intersection, MAP | (2.48/1.64) 14.04 | (4.37/0.91) 14.56 | (2.63/0.85) 12.91 |
| | union, Viterbi | (2.59/1.65) 14.25 | (4.44/0.92) 14.86 | (2.74/0.89) 13.36 |
| Confusion Network (CN) Error | union, arc-cluster | (3.05/1.29) 13.54 | (4.69/0.73) 14.01 | (3.01/0.73) 12.54 |
| | union, state-cluster (mod.) | (3.47/1.20) 13.69 | (5.18/0.69) 14.22 | (3.45/0.66) 12.75 |
| | union, center-frame | (2.90/1.34) 13.54 | (4.60/0.74) 13.96 | (2.90/0.71) 12.43 |
| | CNC, arc-cluster | (2.93/1.34) 13.56 | (4.66/0.76) 13.99 | (2.93/0.74) 12.50 |
| | CNC, state-cluster (mod.) | (3.03/1.32) 13.53 | (4.72/0.72) 13.95 | (3.09/0.75) 12.66 |
| | CNC, center-frame | (2.91/1.36) 13.55 | (4.60/0.75) 13.95 | (2.91/0.74) 12.49 |
| | ROVER, w/o conf. | (2.66/1.57) 14.54 | (4.44/0.90) 15.13 | (2.86/0.85) 13.32 |
| | ROVER, w/ conf. | (2.49/1.59) 13.63 | (4.30/0.91) 14.09 | (2.64/0.94) 12.61 |
| Frame Error, error norm.: | asym. | (3.07/1.30) 13.57 | (4.69/0.68) 13.95 | (3.05/0.70) 12.54 |
| | arc-sym. | (2.83/1.41) 13.83 | (4.58/0.80) 14.21 | (2.85/0.70) 12.67 |
| | path-sym. | (2.57/1.58) 13.49 | (4.31/0.90) 13.93 | (2.65/0.89) 12.45 |
| Local Alignment based Error | Povey’s cost (orig.) | (3.11/1.25) 13.60 | (4.75/0.67) 14.00 | (3.06/0.70) 12.57 |
| | Povey’s cost (mod.) | (2.47/1.53) 13.44 | (4.32/0.85) 13.93 | (2.58/0.86) 12.35 |
| | 1/2 overlap cost (cont.) | (2.78/1.37) 13.48 | (4.51/0.75) 13.97 | (2.80/0.75) 12.49 |
| | 1/2 overlap cost (disc.) | (2.68/1.45) 13.54 | (4.44/0.82) 14.00 | (2.78/0.81) 12.44 |

¹ tuning set
Combination of three acoustic frontends (s1+s2+s3)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (2.46/1.56) 13.91 | (4.38/0.91) 14.57 | (2.66/0.83) 12.65 |
| | intersection, MAP | (2.49/1.59) 14.01 | (4.40/0.90) 14.45 | (2.68/0.87) 12.63 |
| | union, Viterbi | (2.57/1.64) 14.09 | (4.47/0.92) 14.83 | (2.77/0.87) 13.17 |
| Confusion Network (CN) Error | union, arc-cluster | (2.88/1.24) 13.13 | (4.77/0.67) 13.73 | (3.01/0.73) 12.30 |
| | union, state-cluster (mod.) | (3.38/1.14) 13.27 | (5.19/0.65) 13.77 | (3.34/0.64) 12.32 |
| | union, center-frame | (2.74/1.33) 13.15 | (4.56/0.73) 13.65 | (2.87/0.74) 12.19 |
| | CNC, arc-cluster | (2.87/1.29) 13.17 | (4.68/0.70) 13.70 | (2.92/0.72) 12.21 |
| | CNC, state-cluster (mod.) | (2.93/1.26) 13.15 | (4.71/0.67) 13.65 | (3.03/0.70) 12.29 |
| | CNC, center-frame | (2.74/1.34) 13.16 | (4.57/0.76) 13.74 | (2.86/0.77) 12.14 |
| | ROVER, w/o conf. | (2.74/1.35) 13.55 | (4.59/0.75) 14.16 | (2.89/0.75) 12.61 |
| | ROVER, w/ conf. | (2.70/1.34) 13.22 | (4.55/0.74) 13.86 | (2.89/0.76) 12.47 |
| Frame Error, error norm.: | asym. | (3.06/1.23) 13.18 | (4.72/0.69) 13.71 | (3.01/0.72) 12.22 |
| | arc-sym. | (2.85/1.30) 13.45 | (4.70/0.73) 14.09 | (2.92/0.72) 12.52 |
| | path-sym. | (2.99/1.22) 13.06 | (4.76/0.66) 13.64 | (3.04/0.71) 12.22 |
| Local Alignment based Error | Povey’s cost (orig.) | (3.12/1.14) 13.19 | (4.82/0.62) 13.74 | (3.16/0.68) 12.26 |
| | Povey’s cost (mod.) | (2.61/1.33) 13.09 | (4.48/0.75) 13.67 | (2.72/0.78) 12.08 |
| | 1/2 overlap cost (cont.) | (2.69/1.31) 13.12 | (4.55/0.72) 13.70 | (2.81/0.74) 12.15 |
| | 1/2 overlap cost (disc.) | (2.58/1.38) 13.20 | (4.50/0.80) 13.87 | (2.74/0.77) 12.25 |

¹ tuning set
Combination of three randomized trees (s1+s1.r1+s1.r2)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (2.57/1.60) 14.14 | (4.38/0.94) 15.04 | (2.77/0.91) 13.19 |
| | intersection, MAP | (2.56/1.59) 14.13 | (4.35/0.93) 14.97 | (2.77/0.91) 13.15 |
| | union, Viterbi | (2.64/1.68) 14.40 | (4.39/0.93) 15.02 | (2.78/0.90) 13.41 |
| Confusion Network (CN) Error | union, arc-cluster | (2.83/1.39) 13.83 | (4.59/0.78) 14.53 | (3.00/0.75) 12.79 |
| | union, state-cluster (mod.) | (3.13/1.29) 13.86 | (4.89/0.76) 14.60 | (3.21/0.71) 12.88 |
| | union, center-frame | (2.78/1.42) 13.82 | (4.53/0.82) 14.51 | (2.91/0.77) 12.81 |
| | CNC, arc-cluster | (2.86/1.39) 13.84 | (4.62/0.80) 14.54 | (3.01/0.75) 12.86 |
| | CNC, state-cluster (mod.) | (2.92/1.34) 13.82 | (4.67/0.79) 14.51 | (3.10/0.75) 12.88 |
| | CNC, center-frame | (2.78/1.43) 13.84 | (4.54/0.81) 14.52 | (2.95/0.77) 12.83 |
| | ROVER, w/o conf. | (2.61/1.54) 14.00 | (4.45/0.90) 14.81 | (2.72/0.88) 13.05 |
| | ROVER, w/ conf. | (2.70/1.47) 13.80 | (4.50/0.85) 14.63 | (2.89/0.83) 12.98 |
| Frame Error, error norm.: | asym. | (2.98/1.39) 13.89 | (4.63/0.84) 14.55 | (3.05/0.78) 12.90 |
| | arc-sym. | (2.85/1.41) 13.99 | (4.60/0.84) 14.66 | (2.92/0.76) 12.87 |
| | path-sym. | (2.83/1.41) 13.77 | (4.56/0.81) 14.45 | (2.99/0.80) 12.77 |
| Local Alignment based Error | Povey’s cost (orig.) | (3.06/1.28) 13.86 | (4.75/0.73) 14.59 | (3.15/0.70) 12.81 |
| | Povey’s cost (mod.) | (2.53/1.52) 13.74 | (4.30/0.93) 14.54 | (2.68/0.87) 12.71 |
| | 1/2 overlap cost (cont.) | (2.76/1.43) 13.83 | (4.53/0.85) 14.53 | (2.89/0.79) 12.80 |
| | 1/2 overlap cost (disc.) | (2.70/1.43) 13.79 | (4.48/0.87) 14.57 | (2.92/0.82) 12.88 |

¹ tuning set
Combination of three acoustic frontends and three randomized trees (s1+s1.r1+s2+s2.r1+s3+s3.r1)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (2.36/1.59) 13.78 | (4.32/0.91) 14.70 | (2.54/0.86) 12.53 |
| | intersection, MAP | (2.42/1.58) 13.77 | (4.31/0.88) 14.53 | (2.55/0.82) 12.44 |
| | union, Viterbi | (2.58/1.59) 14.22 | (4.55/1.01) 15.07 | (2.68/0.85) 13.06 |
| Confusion Network (CN) Error | union, arc-cluster | (2.98/1.26) 13.05 | (4.80/0.69) 13.82 | (3.10/0.70) 12.20 |
| | union, state-cluster (mod.) | (3.55/1.11) 13.19 | (5.45/0.60) 14.00 | (3.67/0.70) 12.53 |
| | union, center-frame | (2.73/1.36) 13.02 | (4.53/0.71) 13.63 | (2.81/0.72) 11.98 |
| | CNC, arc-cluster | (2.88/1.26) 13.05 | (4.70/0.69) 13.81 | (2.97/0.70) 12.14 |
| | CNC, state-cluster (mod.) | (3.00/1.23) 13.05 | (4.80/0.65) 13.74 | (3.08/0.70) 12.09 |
| | CNC, center-frame | (2.89/1.27) 13.10 | (4.64/0.70) 13.84 | (3.00/0.69) 12.13 |
| | ROVER, w/o conf. | (2.60/1.42) 13.15 | (4.44/0.79) 13.91 | (2.80/0.81) 12.34 |
| | ROVER, w/ conf. | (2.65/1.37) 12.97 | (4.50/0.75) 13.72 | (2.85/0.77) 12.15 |
| Frame Error, error norm.: | asym. | (3.00/1.25) 13.03 | (4.65/0.68) 13.68 | (2.91/0.70) 11.96 |
| | arc-sym. | (2.82/1.36) 13.35 | (4.66/0.77) 13.89 | (2.89/0.70) 12.15 |
| | path-sym. | (2.58/1.43) 12.89 | (4.38/0.79) 13.64 | (2.70/0.84) 11.97 |

¹ tuning set
C.2 The RWTH Aachen Chinese GALE 2008 Evaluation System
This section gives the results for the RWTH Aachen Chinese GALE 2008 evaluation system setup introduced in Section B.1.2. All results are produced on character lattices. The character lattices are derived from word lattices by splitting the word arcs into character arcs, where the character boundaries are determined by a forced alignment of the characters within a word arc. The error measure is the character error rate (CER).
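The splitting of a word arc into character arcs can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the `Arc` type and the function `split_word_arc` are hypothetical, arcs are reduced to a label and a time span in frames, and the inner character boundaries are assumed to be supplied by a forced alignment that is not shown here. How acoustic and language model scores are distributed over the resulting character arcs is left open, since it is not specified in this appendix.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Arc:
    """Hypothetical lattice arc: a label and a time span in frames."""
    label: str
    start: int  # first frame of the arc
    end: int    # frame at which the arc ends (exclusive)


def split_word_arc(arc: Arc, boundaries: List[int]) -> List[Arc]:
    """Split a word arc into one arc per character.

    `boundaries` holds the inner character boundaries in frames, as a
    forced alignment of the characters within the word arc would
    provide them (assumed to be given).
    """
    chars = list(arc.label)
    if len(boundaries) != len(chars) - 1:
        raise ValueError("need exactly one boundary between adjacent characters")
    times = [arc.start, *boundaries, arc.end]
    if any(a >= b for a, b in zip(times, times[1:])):
        raise ValueError("boundaries must be strictly increasing within the arc")
    # Consecutive time pairs delimit the character arcs.
    return [Arc(c, s, e) for c, s, e in zip(chars, times, times[1:])]
```

For a two-character word arc spanning frames 10 to 30 with an aligned boundary at frame 18, `split_word_arc(Arc("北京", 10, 30), [18])` yields one arc for each character, covering frames 10 to 18 and 18 to 30.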
MFCC+IDIAP-NN frontend (s1)
Results for the Chinese GALE 2008 evaluation system with the MFCC acoustic frontend combined with the neural network (NN) based phoneme posterior features provided by IDIAP. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.17/1.11) 9.56 | (4.17/0.71) 10.90 | (2.37/0.76) 9.25 |
| | MAP | (2.18/1.11) 9.55 | (4.19/0.70) 10.94 | (2.38/0.75) 9.26 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.22/1.09) 9.46 | (4.12/0.65) 10.91 | (2.34/0.64) 8.98 |
| | state-cluster | (2.38/1.06) 9.47 | (4.28/0.64) 10.92 | (2.52/0.63) 9.05 |
| | center-frame | (2.21/1.08) 9.44 | (4.13/0.65) 10.87 | (2.35/0.63) 8.98 |
| Frame Error, error norm.: | hyp. | (2.30/1.06) 9.47 | (4.23/0.65) 10.89 | (2.57/0.66) 9.15 |
| | arc-sym. | (2.15/1.35) 9.77 | (4.08/1.00) 11.28 | (2.32/0.91) 9.36 |
| | path-sym. | (2.38/1.02) 9.44 | (4.28/0.62) 10.85 | (2.65/0.61) 9.15 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.31/1.01) 9.46 | (4.21/0.62) 10.87 | (2.52/0.60) 9.10 |
| | Povey’s cost (mod.) | (2.27/1.05) 9.46 | (4.14/0.65) 10.87 | (2.45/0.64) 9.09 |
| | 1/2 overlap cost (cont.) | (2.12/1.11) 9.44 | (4.05/0.68) 10.97 | (2.30/0.65) 9.00 |
| | 1/2 overlap cost (disc.) | (2.27/1.10) 9.49 | (4.18/0.67) 10.87 | (2.43/0.64) 9.06 |

¹ tuning set
PLP+ICSI-NN frontend (s2)
Results for the Chinese GALE 2008 evaluation system with the PLP acoustic frontend combined with the NN based phoneme posterior features provided by ICSI. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.41/1.13) 9.96 | (4.14/0.76) 11.12 | (2.42/0.68) 9.24 |
| | MAP | (2.36/1.17) 9.95 | (4.11/0.77) 11.10 | (2.39/0.69) 9.19 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.45/1.05) 9.87 | (4.23/0.71) 11.05 | (2.47/0.60) 9.19 |
| | state-cluster | (2.60/1.04) 9.89 | (4.33/0.68) 11.02 | (2.61/0.58) 9.25 |
| | center-frame | (2.43/1.08) 9.87 | (4.21/0.70) 11.00 | (2.45/0.63) 9.22 |
| Frame Error, error norm.: | hyp. | (2.55/1.06) 9.88 | (4.26/0.70) 11.00 | (2.58/0.63) 9.25 |
| | arc-sym. | (2.34/1.28) 10.11 | (4.16/0.92) 11.30 | (2.34/0.86) 9.51 |
| | path-sym. | (2.29/1.17) 9.84 | (4.07/0.83) 11.03 | (2.38/0.75) 9.25 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.52/1.01) 9.85 | (4.29/0.66) 11.07 | (2.54/0.59) 9.25 |
| | Povey’s cost (mod.) | (2.29/1.20) 9.87 | (4.06/0.78) 11.02 | (2.33/0.72) 9.22 |
| | 1/2 overlap cost (cont.) | (2.47/1.03) 9.86 | (4.24/0.68) 11.01 | (2.51/0.61) 9.20 |
| | 1/2 overlap cost (disc.) | (2.61/1.02) 9.91 | (4.32/0.71) 11.12 | (2.62/0.65) 9.33 |

¹ tuning set
MFCC+IDIAP-NN frontend and cross-adaptation (s1.x2)
Results for the Chinese GALE 2008 evaluation system with the MFCC acoustic frontend combined with the NN based phoneme posterior features provided by IDIAP. The CMLLR/MLLR adaptation is performed as a cross-adaptation with the final output of system s2 as supervisor. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (1.87/1.16) 9.07 | (3.99/0.76) 10.67 | (2.11/0.72) 8.72 |
| | MAP | (1.86/1.18) 9.02 | (3.99/0.75) 10.68 | (2.10/0.71) 8.70 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (1.96/1.13) 8.98 | (4.08/0.68) 10.67 | (2.22/0.65) 8.63 |
| | state-cluster | (2.12/1.06) 8.98 | (4.20/0.67) 10.66 | (2.33/0.64) 8.67 |
| | center-frame | (1.95/1.11) 8.97 | (4.06/0.68) 10.65 | (2.18/0.64) 8.59 |
| Frame Error, error norm.: | hyp. | (2.00/1.11) 9.01 | (4.09/0.67) 10.66 | (2.26/0.67) 8.66 |
| | arc-sym. | (1.83/1.48) 9.39 | (3.94/1.12) 11.13 | (2.07/0.91) 8.97 |
| | path-sym. | (1.75/1.28) 8.96 | (3.84/0.82) 10.69 | (2.04/0.82) 8.67 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.16/1.00) 8.97 | (4.23/0.62) 10.58 | (2.47/0.61) 8.73 |
| | Povey’s cost (mod.) | (1.81/1.21) 8.91 | (3.90/0.78) 10.58 | (2.11/0.81) 8.63 |
| | 1/2 overlap cost (cont.) | (2.03/1.07) 8.96 | (4.10/0.68) 10.61 | (2.21/0.64) 8.67 |
| | 1/2 overlap cost (disc.) | (2.09/1.04) 8.99 | (4.13/0.67) 10.62 | (2.37/0.63) 8.77 |

¹ tuning set
PLP+ICSI-NN frontend (s2.x1)
Results for the Chinese GALE 2008 evaluation system with the PLP acoustic frontend combined with the NN based phoneme posterior features provided by ICSI. The CMLLR/MLLR adaptation is performed as a cross-adaptation with the final output of system s1 as supervisor. Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.01/1.15) 9.26 | (3.98/0.71) 10.60 | (2.17/0.72) 8.91 |
| | MAP | (1.97/1.20) 9.26 | (3.95/0.72) 10.60 | (2.09/0.71) 8.76 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.09/1.11) 9.24 | (4.04/0.66) 10.46 | (2.26/0.65) 8.79 |
| | state-cluster | (2.10/1.15) 9.27 | (4.10/0.68) 10.55 | (2.38/0.64) 8.77 |
| | center-frame | (2.07/1.13) 9.22 | (4.03/0.66) 10.45 | (2.24/0.63) 8.73 |
| Frame Error, error norm.: | hyp. | (2.04/1.16) 9.24 | (3.99/0.71) 10.47 | (2.19/0.71) 8.85 |
| | arc-sym. | (1.90/1.51) 9.64 | (3.89/1.11) 10.94 | (2.11/1.06) 9.18 |
| | path-sym. | (1.98/1.22) 9.21 | (3.96/0.75) 10.47 | (2.14/0.74) 8.73 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.08/1.13) 9.28 | (4.04/0.66) 10.52 | (2.24/0.61) 8.70 |
| | Povey’s cost (mod.) | (1.82/1.31) 9.20 | (3.91/0.81) 10.50 | (1.98/0.80) 8.75 |
| | 1/2 overlap cost (cont.) | (1.96/1.19) 9.25 | (3.92/0.70) 10.47 | (2.15/0.68) 8.66 |
| | 1/2 overlap cost (disc.) | (2.07/1.14) 9.28 | (4.01/0.68) 10.50 | (2.26/0.64) 8.79 |

¹ tuning set
Combination of two acoustic frontends (s1+s2)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (2.11/1.08) 9.12 | (4.10/0.73) 10.67 | (2.23/0.74) 8.80 |
| | intersection, MAP | (2.11/1.09) 9.10 | (4.09/0.72) 10.68 | (2.28/0.69) 8.73 |
| | union, Viterbi | (2.17/1.12) 9.50 | (4.09/0.75) 10.92 | (2.39/0.74) 9.23 |
| Confusion Network (CN) Error | union, arc-cluster | (2.36/0.94) 8.95 | (4.24/0.61) 10.46 | (2.57/0.56) 8.74 |
| | union, state-cluster (mod.) | (2.80/0.87) 9.13 | (4.67/0.56) 10.60 | (2.99/0.51) 8.86 |
| | union, center-frame | (2.32/0.96) 8.92 | (4.21/0.61) 10.45 | (2.46/0.59) 8.64 |
| | CNC, arc-cluster | (2.32/0.96) 8.91 | (4.21/0.63) 10.52 | (2.49/0.59) 8.71 |
| | CNC, state-cluster (mod.) | (2.44/0.93) 8.94 | (4.26/0.58) 10.43 | (2.60/0.54) 8.69 |
| | CNC, center-frame | (2.26/0.99) 8.93 | (4.19/0.63) 10.46 | (2.42/0.60) 8.67 |
| | ROVER, w/o conf. | (2.14/1.12) 9.55 | (4.14/0.72) 10.95 | (2.35/0.76) 9.22 |
| | ROVER, w/ conf. | (2.11/1.09) 9.02 | (4.08/0.73) 10.54 | (2.26/0.70) 8.69 |
| Frame Error, error norm.: | asym. | (2.33/1.02) 9.02 | (4.21/0.65) 10.54 | (2.51/0.65) 8.74 |
| | arc-sym. | (2.20/1.24) 9.40 | (4.08/0.85) 10.79 | (2.38/0.77) 8.99 |
| | path-sym. | (2.05/1.10) 8.87 | (4.01/0.72) 10.41 | (2.24/0.70) 8.57 |

¹ tuning set
Combination of two acoustic frontends, with cross-adaptation (s1.x2+s2.x1)
Entries are CER[%], given as (del/ins) err.

| Decoder | | dev07¹ | eval07 | dev08 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (1.91/1.15) 9.02 | (3.93/0.69) 10.50 | (2.21/0.72) 8.64 |
| | intersection, MAP | (1.92/1.16) 9.00 | (3.93/0.69) 10.50 | (2.17/0.70) 8.61 |
| | union, Viterbi | (2.04/1.11) 9.02 | (4.10/0.70) 10.69 | (2.31/0.72) 8.67 |
| Confusion Network (CN) Error | union, arc-cluster | (1.96/1.10) 8.84 | (4.06/0.64) 10.40 | (2.21/0.63) 8.54 |
| | union, state-cluster (mod.) | (2.27/1.01) 8.90 | (4.34/0.61) 10.52 | (2.61/0.62) 8.65 |
| | union, center-frame | (2.05/1.09) 8.85 | (4.05/0.62) 10.29 | (2.26/0.64) 8.48 |
| | CNC, arc-cluster | (2.04/1.10) 8.86 | (4.05/0.63) 10.33 | (2.26/0.64) 8.49 |
| | CNC, state-cluster (mod.) | (2.09/1.09) 8.87 | (4.15/0.63) 10.33 | (2.33/0.63) 8.54 |
| | CNC, center-frame | (1.96/1.12) 8.85 | (4.05/0.66) 10.37 | (2.20/0.66) 8.47 |
| | ROVER, w/o conf. | (1.87/1.16) 9.07 | (3.99/0.76) 10.67 | (2.11/0.72) 8.72 |
| | ROVER, w/ conf. | (1.84/1.20) 8.84 | (3.87/0.78) 10.41 | (2.05/0.74) 8.47 |
| Frame Error, error norm.: | asym. | (2.05/1.11) 8.88 | (4.02/0.65) 10.35 | (2.26/0.70) 8.60 |
| | arc-sym. | (1.81/1.36) 9.17 | (3.90/0.96) 10.84 | (2.07/0.78) 8.74 |
| | path-sym. | (1.87/1.20) 8.80 | (3.88/0.73) 10.36 | (2.13/0.75) 8.50 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.22/1.01) 8.87 | (4.18/0.59) 10.38 | (2.41/0.57) 8.54 |
| | Povey’s cost (mod.) | (1.89/1.15) 8.81 | (3.92/0.66) 10.25 | (2.09/0.69) 8.37 |
| | 1/2 overlap cost (cont.) | (2.00/1.09) 8.84 | (4.05/0.63) 10.36 | (2.23/0.63) 8.43 |
| | 1/2 overlap cost (disc.) | (2.02/1.13) 8.90 | (4.04/0.66) 10.37 | (2.22/0.66) 8.50 |

¹ tuning set
C.3 The RWTH Aachen English EPPS 2007 Evaluation System
This section gives the results for the RWTH Aachen English EPPS 2007 evaluation system introduced in Section B.2.1. The error measure is the word error rate (WER).
MFCC frontend with unsupervised training (s1)
Results for the English EPPS 2007 evaluation system with the MFCC acoustic frontend and model refinement with unsupervised training. Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (1.65/2.21) 11.09 | (1.38/1.36) 8.43 | (1.86/1.31) 9.81 |
| | MAP | (1.66/2.29) 11.19 | (1.41/1.43) 8.51 | (1.84/1.35) 9.84 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (1.90/1.92) 10.73 | (1.55/1.12) 8.22 | (2.09/1.16) 9.57 |
| | state-cluster | (2.06/1.78) 10.64 | (1.75/1.10) 8.25 | (2.22/1.10) 9.53 |
| | center-frame | (1.82/1.93) 10.73 | (1.54/1.15) 8.24 | (2.03/1.16) 9.56 |
| Frame Error, error norm.: | asym. | (1.89/1.91) 10.73 | (1.57/1.14) 8.24 | (2.02/1.13) 9.49 |
| | arc-sym. | (1.81/2.06) 11.05 | (1.54/1.24) 8.44 | (1.99/1.27) 9.91 |
| | path-sym. | (2.03/1.69) 10.53 | (1.72/1.03) 8.17 | (2.34/1.00) 9.51 |
| Local Alignment based Error | Povey’s cost (orig.) | (1.92/1.78) 10.66 | (1.62/1.09) 8.24 | (2.17/1.06) 9.52 |
| | Povey’s cost (mod.) | (1.80/1.96) 10.73 | (1.48/1.19) 8.24 | (1.98/1.19) 9.55 |
| | 1/2 overlap cost (cont.) | (1.77/1.96) 10.76 | (1.48/1.17) 8.20 | (1.98/1.17) 9.55 |
| | 1/2 overlap cost (disc.) | (1.90/1.97) 10.86 | (1.63/1.20) 8.39 | (2.15/1.25) 9.72 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
MFCC frontend (s2)
Results for the English EPPS 2007 evaluation system with the MFCC acoustic frontend. Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (1.77/2.28) 11.89 | (1.67/1.23) 8.70 | (2.12/1.31) 10.07 |
| | MAP | (1.85/2.27) 11.81 | (1.72/1.23) 8.73 | (2.18/1.33) 10.14 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.14/1.90) 11.42 | (1.90/1.08) 8.61 | (2.40/1.07) 9.78 |
| | state-cluster | (1.99/2.14) 11.74 | (1.80/1.09) 8.57 | (2.28/1.15) 9.90 |
| | center-frame | (1.90/2.08) 11.57 | (1.80/1.11) 8.59 | (2.19/1.14) 9.76 |
| Frame Error, error norm.: | asym. | (2.23/1.82) 11.44 | (2.04/0.96) 8.55 | (2.59/0.96) 9.75 |
| | arc-sym. | (1.78/2.32) 11.97 | (1.71/1.22) 8.82 | (2.15/1.28) 10.18 |
| | path-sym. | (2.15/1.94) 11.51 | (1.97/1.02) 8.57 | (2.42/1.06) 9.76 |
| Local Alignment based Error | Povey’s cost (orig.) | (1.91/2.16) 11.76 | (1.75/1.09) 8.57 | (2.24/1.15) 9.91 |
| | Povey’s cost (mod.) | (1.92/2.11) 11.71 | (1.78/1.11) 8.57 | (2.25/1.17) 9.91 |
| | 1/2 overlap cost (cont.) | (1.87/2.13) 11.65 | (1.74/1.12) 8.56 | (2.19/1.19) 9.98 |
| | 1/2 overlap cost (disc.) | (2.02/2.13) 11.72 | (1.97/1.18) 8.80 | (2.32/1.30) 10.10 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
MFCC+NN based phoneme posteriors frontend (s3)
Results for the English EPPS 2007 evaluation system with the MFCC frontend combined with NN based phoneme posterior features. Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.06/2.29) 12.43 | (1.80/1.30) 8.98 | (2.22/1.34) 10.76 |
| | MAP | (2.04/2.34) 12.46 | (1.79/1.33) 8.99 | (2.19/1.37) 10.77 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.29/1.98) 11.97 | (1.90/1.14) 8.83 | (2.47/1.15) 10.48 |
| | state-cluster | (2.40/1.93) 11.96 | (2.02/1.10) 8.82 | (2.56/1.10) 10.47 |
| | center-frame | (2.19/2.00) 11.95 | (1.89/1.17) 8.84 | (2.39/1.16) 10.45 |
| Frame Error, error norm.: | asym. | (2.54/1.66) 11.83 | (2.21/0.94) 8.82 | (2.80/0.92) 10.46 |
| | arc-sym. | (2.10/2.15) 12.34 | (1.82/1.26) 9.11 | (2.34/1.23) 10.74 |
| | path-sym. | (2.26/2.00) 12.00 | (1.92/1.17) 8.87 | (2.39/1.18) 10.44 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.29/1.90) 11.95 | (1.96/1.09) 8.85 | (2.49/1.10) 10.51 |
| | Povey’s cost (mod.) | (2.23/1.97) 12.03 | (1.91/1.14) 8.87 | (2.42/1.17) 10.49 |
| | 1/2 overlap cost (cont.) | (2.13/2.06) 12.06 | (1.81/1.22) 8.90 | (2.30/1.25) 10.59 |
| | 1/2 overlap cost (disc.) | (2.22/2.11) 12.13 | (1.88/1.27) 8.99 | (2.35/1.36) 10.77 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
GT frontend (s4)
Results for the English EPPS 2007 evaluation system with the acoustic frontend based on the Gammatone filter bank. Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | Viterbi | (2.04/2.18) 12.06 | (1.85/1.38) 9.44 | (2.68/1.42) 11.73 |
| | MAP | (1.97/2.32) 12.27 | (1.73/1.47) 9.45 | (2.56/1.54) 11.77 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (2.31/1.94) 11.87 | (2.09/1.17) 9.31 | (2.96/1.29) 11.57 |
| | state-cluster | (2.42/1.82) 11.78 | (2.21/1.15) 9.32 | (3.08/1.22) 11.54 |
| | center-frame | (2.19/1.93) 11.80 | (2.02/1.21) 9.31 | (2.85/1.31) 11.53 |
| Frame Error, error norm.: | asym. | (2.30/1.88) 11.78 | (2.07/1.18) 9.33 | (2.98/1.19) 11.47 |
| | arc-sym. | (2.18/2.08) 12.17 | (1.93/1.26) 9.41 | (2.78/1.31) 11.72 |
| | path-sym. | (2.25/2.01) 11.87 | (2.02/1.26) 9.31 | (2.86/1.30) 11.52 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.26/1.91) 11.79 | (2.06/1.18) 9.30 | (2.99/1.27) 11.59 |
| | Povey’s cost (mod.) | (2.31/1.86) 11.81 | (2.09/1.17) 9.33 | (3.03/1.18) 11.50 |
| | 1/2 overlap cost (cont.) | (2.16/2.02) 11.92 | (1.98/1.26) 9.35 | (2.86/1.27) 11.49 |
| | 1/2 overlap cost (disc.) | (2.29/2.00) 11.98 | (2.10/1.23) 9.43 | (3.15/1.27) 11.69 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
Combination of two acoustic frontends (s1+s2)
Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (1.72/2.09) 10.85 | (1.48/1.25) 8.07 | (1.99/1.21) 9.29 |
| | intersection, MAP | (1.68/2.12) 10.84 | (1.46/1.29) 8.11 | (1.94/1.25) 9.40 |
| | union, Viterbi | (1.82/2.00) 11.05 | (1.56/1.24) 8.33 | (2.04/1.23) 9.79 |
| Confusion Network (CN) Error | union, arc-cluster | (2.02/1.56) 10.21 | (1.73/0.94) 7.79 | (2.25/0.93) 8.97 |
| | union, state-cluster (mod.) | (2.25/1.55) 10.29 | (1.95/0.92) 7.80 | (2.50/0.93) 9.13 |
| | union, center-frame | (1.83/1.74) 10.29 | (1.60/1.04) 7.83 | (2.09/1.04) 9.04 |
| | CNC, arc-cluster | (1.94/1.62) 10.22 | (1.66/0.99) 7.82 | (2.17/0.96) 8.98 |
| | CNC, state-cluster (mod.) | (1.97/1.56) 10.19 | (1.79/0.95) 7.81 | (2.21/0.91) 8.94 |
| | CNC, center-frame | (1.82/1.77) 10.35 | (1.58/1.04) 7.81 | (2.06/1.02) 9.01 |
| | ROVER, w/o conf. | (1.65/2.20) 11.07 | (1.38/1.36) 8.41 | (1.85/1.30) 9.80 |
| | ROVER, w/ conf. | (1.97/1.70) 10.54 | (1.75/0.93) 7.90 | (2.28/0.95) 9.11 |
| Frame Error, error norm.: | asym. | (2.07/1.53) 10.18 | (1.80/0.90) 7.80 | (2.35/0.90) 9.01 |
| | arc-sym. | (1.83/2.04) 10.99 | (1.54/1.20) 8.37 | (2.04/1.25) 9.92 |
| | path-sym. | (2.00/1.62) 10.29 | (1.75/0.95) 7.76 | (2.25/0.96) 8.92 |
| Local Alignment based Error | Povey’s cost (orig.) | (1.97/1.57) 10.24 | (1.74/0.94) 7.86 | (2.26/0.90) 9.05 |
| | Povey’s cost (mod.) | (1.83/1.93) 10.58 | (1.52/1.11) 7.84 | (1.99/1.12) 9.08 |
| | 1/2 overlap cost (cont.) | (1.82/1.76) 10.36 | (1.58/1.04) 7.84 | (2.03/1.01) 9.04 |
| | 1/2 overlap cost (disc.) | (2.00/1.73) 10.49 | (1.82/1.07) 8.07 | (2.24/1.15) 9.30 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
Combination of three acoustic frontends (s1+s2+s3)
Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (1.73/2.17) 11.27 | (1.49/1.28) 8.18 | (1.93/1.28) 9.57 |
| | intersection, MAP | (1.73/2.22) 11.28 | (1.48/1.31) 8.22 | (1.90/1.33) 9.62 |
| | union, Viterbi | (1.86/2.05) 11.23 | (1.59/1.26) 8.38 | (1.99/1.22) 9.66 |
| Confusion Network (CN) Error | union, arc-cluster | (2.03/1.59) 10.21 | (1.74/0.94) 7.73 | (2.26/0.95) 8.96 |
| | union, state-cluster (mod.) | (2.22/1.55) 10.38 | (1.94/0.92) 7.79 | (2.51/0.89) 9.00 |
| | union, center-frame | (1.93/1.65) 10.24 | (1.65/0.99) 7.73 | (2.16/0.96) 8.99 |
| | CNC, arc-cluster | (1.95/1.60) 10.14 | (1.67/0.96) 7.70 | (2.22/0.95) 8.98 |
| | CNC, state-cluster (mod.) | (2.06/1.50) 10.11 | (1.79/0.91) 7.69 | (2.27/0.90) 8.94 |
| | CNC, center-frame | (1.89/1.64) 10.19 | (1.64/0.98) 7.69 | (2.16/0.95) 9.01 |
| | ROVER, w/o conf. | (1.81/1.91) 10.90 | (1.49/1.13) 7.91 | (1.99/1.09) 9.32 |
| | ROVER, w/ conf. | (2.05/1.57) 10.42 | (1.79/0.87) 7.73 | (2.40/0.89) 9.17 |
| Frame Error, error norm.: | asym. | (1.97/1.71) 10.50 | (1.73/0.99) 7.79 | (2.18/1.01) 9.01 |
| | arc-sym. | (1.85/2.31) 11.30 | (1.65/1.48) 8.78 | (2.14/1.44) 10.11 |
| | path-sym. | (1.98/1.64) 10.21 | (1.70/0.97) 7.70 | (2.27/1.00) 8.97 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.03/1.49) 10.14 | (1.75/0.91) 7.75 | (2.34/0.86) 9.01 |
| | Povey’s cost (mod.) | (1.74/1.93) 10.52 | (1.47/1.09) 7.70 | (1.98/1.18) 9.05 |
| | 1/2 overlap cost (cont.) | (1.97/1.54) 10.09 | (1.73/0.95) 7.74 | (2.28/0.92) 9.00 |
| | 1/2 overlap cost (disc.) | (1.91/1.80) 10.40 | (1.70/1.22) 8.01 | (2.13/1.27) 9.28 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
Combination of four acoustic frontends (s1+s2+s3+s4)
Entries are WER[%], given as (del/ins) err.

| Decoder | | dev06 | eval06¹ | eval07 |
|---|---|---|---|---|
| Sentence Error | intersection, Viterbi | (1.72/2.17) 11.19 | (1.54/1.20) 8.12 | (1.99/1.24) 9.54 |
| | intersection, MAP | (1.70/2.23) 11.27 | (1.53/1.26) 8.20 | (1.96/1.33) 9.73 |
| | union, Viterbi | (1.86/2.05) 11.24 | (1.59/1.26) 8.38 | (1.99/1.23) 9.67 |
| Confusion Network (CN) Error | union, arc-cluster | (2.03/1.64) 10.33 | (1.70/0.96) 7.59 | (2.29/0.95) 8.94 |
| | union, state-cluster (mod.) | (2.47/1.48) 10.33 | (2.07/0.86) 7.71 | (2.73/0.87) 9.09 |
| | union, center-frame | (1.93/1.68) 10.54 | (1.67/0.99) 7.71 | (2.24/0.92) 9.10 |
| | CNC, arc-cluster | (1.88/1.65) 10.22 | (1.60/0.97) 7.59 | (2.18/0.91) 8.92 |
| | CNC, state-cluster (mod.) | (1.96/1.63) 10.18 | (1.71/0.94) 7.60 | (2.21/0.88) 8.86 |
| | CNC, center-frame | (1.94/1.59) 10.25 | (1.66/0.98) 7.65 | (2.34/0.87) 9.03 |
| | ROVER, w/o conf. | (1.77/1.93) 10.92 | (1.45/1.17) 7.81 | (1.97/1.11) 9.28 |
| | ROVER, w/ conf. | (1.82/1.91) 10.70 | (1.47/1.08) 7.67 | (2.06/1.08) 9.15 |
| Frame Error, error norm.: | asym. | (1.91/1.76) 10.45 | (1.62/1.03) 7.69 | (2.18/1.02) 8.97 |
| | arc-sym. | (1.85/2.21) 11.20 | (1.62/1.41) 8.63 | (2.17/1.40) 10.06 |
| | path-sym. | (1.92/1.77) 10.37 | (1.63/1.02) 7.62 | (2.15/1.00) 8.89 |
| Local Alignment based Error | Povey’s cost (orig.) | (2.06/1.47) 10.11 | (1.77/0.90) 7.65 | (2.42/0.84) 8.93 |
| | Povey’s cost (mod.) | (1.95/1.72) 10.39 | (1.62/1.02) 7.73 | (2.24/0.99) 9.12 |
| | 1/2 overlap cost (cont.) | (1.85/1.72) 10.30 | (1.56/0.98) 7.64 | (2.16/0.98) 8.96 |
| | 1/2 overlap cost (disc.) | (2.01/1.79) 10.51 | (1.73/1.19) 7.92 | (2.34/1.16) 9.33 |

¹ tuning set; eval06 was the official development set in the 2007 evaluation campaign
C.4 The English EPPS 2007 Evaluation Cross-site Combination
This section gives the results for the English EPPS 2007 evaluation cross-site combination introduced in Section B.2.2. The error measure is the word error rate (WER).
The LIMSI System
Results for the lattices provided by CNRS/LIMSI within the TCStar English EPPS 2007 evaluation. Entries are WER[%], given as (del/ins) err.

| Decoder | | eval06¹ | eval07 |
|---|---|---|---|
| Sentence Error | Viterbi | (1.59/1.33) 8.04 | (1.71/1.21) 9.08 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (1.65/1.33) 8.07 | (1.76/1.18) 8.96 |
| | state-cluster | (1.71/1.25) 8.04 | (1.88/1.14) 8.94 |
| | center-frame | (1.64/1.33) 8.08 | (1.75/1.18) 8.97 |
| Frame Error, error norm.: | asym. | (1.95/1.15) 8.08 | (2.22/0.99) 9.00 |
| | arc-sym. | (1.72/1.34) 8.24 | (1.82/1.22) 9.19 |
| | path-sym. | (1.68/1.32) 8.05 | (1.84/1.15) 9.00 |
| Local Alignment based Error | Povey’s cost (orig.) | (1.67/1.29) 8.04 | (1.87/1.15) 9.03 |
| | Povey’s cost (mod.) | (1.62/1.40) 8.13 | (1.73/1.22) 8.99 |
| | 1/2 overlap cost (cont.) | (1.66/1.33) 8.07 | (1.79/1.14) 8.96 |
| | 1/2 overlap cost (disc.) | (1.65/1.28) 8.07 | (1.82/1.24) 9.09 |

¹ tuning set, the official development set in the 2007 evaluation campaign
The RWTH Aachen System
Results for the lattices provided by RWTH Aachen University within the TCStar English EPPS 2007 evaluation. Entries are WER[%], given as (del/ins) err.

| Decoder | | eval06¹ | eval07 |
|---|---|---|---|
| Sentence Error | Viterbi | (1.51/1.30) 8.42 | (1.95/1.25) 9.75 |
| Confusion Network (CN) Error, CN construct. alg.: | arc-cluster | (1.55/1.13) 8.24 | (2.07/1.15) 9.54 |
| | state-cluster | (1.62/1.11) 8.24 | (2.17/1.11) 9.51 |
| | center-frame | (1.55/1.14) 8.26 | (2.03/1.16) 9.54 |
| Frame Error, error norm.: | asym. | (1.84/0.96) 8.23 | (2.39/0.97) 9.54 |
| | arc-sym. | (1.47/1.27) 8.46 | (1.94/1.28) 9.83 |
| | path-sym. | (1.73/1.03) 8.21 | (2.33/1.00) 9.50 |
| Local Alignment based Error | Povey’s cost (orig.) | (1.63/1.09) 8.23 | (2.15/1.07) 9.49 |
| | Povey’s cost (mod.) | (1.58/1.13) 8.24 | (2.05/1.09) 9.47 |
| | 1/2 overlap cost (cont.) | (1.47/1.20) 8.24 | (1.99/1.18) 9.54 |
| | 1/2 overlap cost (disc.) | (1.69/1.22) 8.47 | (2.15/1.18) 9.66 |

¹ tuning set, the official development set in the 2007 evaluation campaign
The UKA System
Results for the lattices provided by University of Karlsruhe (UKA) within the TC-Star English EPPS 2007 evaluation. WER [%], given as (del/ins) err.

                                    eval06¹               eval07
Decoder                             (del/ins)    err      (del/ins)    err
Sentence Error
  Viterbi                           (1.77/1.29)  8.78     (2.00/1.29) 10.21
Confusion Network (CN) Error, CN-construct. alg.:
  arc-cluster                       (1.83/1.39)  8.98     (2.08/1.33) 10.36
  state-cluster                     (2.00/1.38)  9.03     (2.26/1.28) 10.38
  center-frame                      (1.81/1.42)  9.06     (2.08/1.39) 10.49
Frame Error, error norm.:
  asym.                             (1.93/1.35)  9.04     (2.19/1.33) 10.31
  arc-sym.                          (1.64/1.96) 10.04     (1.88/2.02) 11.61
  path-sym.                         (1.97/1.33)  8.97     (2.24/1.37) 10.34
Local Alignment based Error
  Povey's cost (orig.)              (1.88/1.32)  9.02     (2.17/1.35) 10.41
  Povey's cost (mod.)               (1.95/1.33)  9.03     (2.20/1.32) 10.40
  1/2 overlap cost (cont.)          (1.70/1.45)  9.06     (1.96/1.48) 10.53
  1/2 overlap cost (disc.)          (1.67/1.62)  9.19     (1.93/1.60) 10.60
¹ tuning set, the official development set in the 2007 evaluation campaign
The IRST System
Results for the lattices provided by FBK/IRST (formerly ITC/IRST) within the TC-Star English EPPS 2007 evaluation. WER [%], given as (del/ins) err.

                                    eval06¹               eval07
Decoder                             (del/ins)    err      (del/ins)    err
Sentence Error
  Viterbi                           (2.35/1.40) 10.09     (2.49/1.14)  9.82
Confusion Network (CN) Error, CN-construct. alg.:
  arc-cluster                       (2.35/1.39) 10.06     (2.47/1.13)  9.82
  state-cluster                     (2.34/1.39) 10.05     (2.46/1.14)  9.79
  center-frame                      (2.35/1.39) 10.06     (2.47/1.13)  9.82
Frame Error, error norm.:
  asym.                             (2.44/1.31) 10.04     (2.56/1.09)  9.81
  arc-sym.                          (2.34/1.40) 10.05     (2.45/1.15)  9.85
  path-sym.                         (2.43/1.32) 10.04     (2.55/1.10)  9.82
Local Alignment based Error
  Povey's cost (orig.)              (2.35/1.39) 10.06     (2.44/1.14)  9.80
  Povey's cost (mod.)               (2.33/1.38) 10.04     (2.44/1.14)  9.80
  1/2 overlap cost (cont.)          (2.31/1.41) 10.07     (2.40/1.17)  9.84
  1/2 overlap cost (disc.)          (2.28/1.43) 10.05     (2.39/1.19)  9.84
¹ tuning set, the official development set in the 2007 evaluation campaign
Combination of the LIMSI and the RWTH lattices
WER [%], given as (del/ins) err.

                                    eval06¹               eval07
Decoder                             (del/ins)    err      (del/ins)    err
Sentence Error, intersection
  Viterbi                           (1.58/1.32)  8.02     (1.71/1.20)  9.07
Confusion Network (CN) Error
  union, arc-cluster                (1.63/0.77)  6.46     (2.17/0.71)  7.67
  union, state-cluster (mod.)       (1.90/0.85)  6.95     (2.29/0.79)  8.13
  union, center-frame               (1.50/0.77)  6.39     (1.92/0.73)  7.52
  CNC, arc-cluster                  (1.45/0.80)  6.38     (1.88/0.75)  7.51
  CNC, state-cluster (mod.)         (1.49/0.78)  6.38     (1.96/0.75)  7.52
  CNC, center-frame                 (1.45/0.81)  6.41     (1.88/0.80)  7.58
  ROVER, w/o conf.                  (1.50/1.24)  7.87     (1.70/1.20)  9.06
  ROVER, w/ conf.                   (1.63/0.91)  6.69     (2.13/0.87)  7.85
Frame Error, error norm.:
  asym.                             (1.60/0.85)  6.65     (1.99/0.76)  7.73
  arc-sym.                          (1.57/1.29)  8.35     (2.02/1.21)  9.58
  path-sym.                         (1.62/0.76)  6.46     (2.09/0.73)  7.57
Local Alignment based Error
  Povey's cost (orig.)              (1.78/0.72)  6.66     (2.33/0.61)  7.73
  Povey's cost (mod.)               (1.48/0.90)  6.61     (2.05/0.79)  7.70
  1/2 overlap cost (cont.)          (1.44/0.92)  6.66     (1.96/0.88)  7.96
  1/2 overlap cost (disc.)          (1.66/0.87)  6.70     (2.10/0.74)  7.81
¹ tuning set, the official development set in the 2007 evaluation campaign
Combination of the LIMSI, the RWTH, and the UKA lattices
WER [%], given as (del/ins) err.

                                    eval06¹               eval07
Decoder                             (del/ins)    err      (del/ins)    err
Sentence Error, intersection
  Viterbi                           (1.75/1.25)  8.18     (1.84/1.17)  9.24
Confusion Network (CN) Error
  union, arc-cluster                (1.51/0.79)  6.38     (2.04/0.77)  7.63
  union, state-cluster (mod.)       (1.98/0.73)  6.57     (2.63/0.69)  7.76
  union, center-frame               (1.54/0.73)  6.30     (1.89/0.69)  7.32
  CNC, arc-cluster                  (1.47/0.72)  6.27     (1.87/0.68)  7.24
  CNC, state-cluster (mod.)         (1.58/0.67)  6.25     (2.04/0.64)  7.28
  CNC, center-frame                 (1.36/0.74)  6.23     (1.77/0.76)  7.32
  ROVER, w/o conf.                  (1.35/0.84)  6.58     (1.86/0.78)  8.01
  ROVER, w/ conf.                   (1.43/0.76)  6.32     (2.00/0.70)  7.77
Frame Error, error norm.:
  asym.                             (1.80/0.72)  6.48     (2.21/0.68)  7.52
  arc-sym.                          (1.61/1.39)  8.23     (1.74/1.27)  9.19
  path-sym.                         (1.53/0.74)  6.24     (2.01/0.74)  7.28
¹ tuning set, the official development set in the 2007 evaluation campaign
Combination of the LIMSI, the RWTH, the UKA and the IRST lattices
WER [%], given as (del/ins) err.

                                    eval06¹               eval07
Decoder                             (del/ins)    err      (del/ins)    err
Sentence Error, intersection
  Viterbi                           (2.46/1.51) 10.30     (2.52/1.17)  9.89
Confusion Network (CN) Error
  union, arc-cluster                (1.61/0.73)  6.28     (2.19/0.67)  7.36
  union, state-cluster (mod.)       (2.31/0.63)  6.61     (2.90/0.52)  7.58
  union, center-frame               (1.61/0.71)  6.23     (2.00/0.61)  7.10
  CNC, arc-cluster                  (1.45/0.71)  6.14     (1.87/0.69)  7.12
  CNC, state-cluster (mod.)         (1.54/0.65)  6.10     (2.04/0.57)  7.12
  CNC, center-frame                 (1.36/0.73)  6.11     (1.82/0.67)  7.16
  ROVER, w/o conf.                  (1.36/0.78)  6.38     (1.82/0.79)  7.67
  ROVER, w/ conf.                   (1.37/0.79)  6.21     (1.77/0.73)  7.26
Frame Error, error norm.:
  asym.                             (1.70/0.79)  6.52     (1.93/0.76)  7.26
  arc-sym.                          (1.57/1.28)  8.33     (2.01/1.22)  9.55
  path-sym.                         (1.36/0.85)  6.10     (1.81/0.85)  7.21
¹ tuning set, the official development set in the 2007 evaluation campaign
Appendix D Symbols and Acronyms
In this appendix, all relevant mathematical symbols and acronyms used in this thesis are defined for convenience. Detailed explanations are given in the corresponding chapters.
D.1 Mathematical Symbols
$x \oplus y$
collect operator in a semiring, $\oplus$-sum of $x$ and $y$
$x \otimes y$
extend operator in a semiring, $\otimes$-product of $x$ and $y$
1{cond}
equals one if condition cond is true, and zero otherwise
A
alignment between two word sequences or two CNs
a, b
word lattice arcs
$a_1^L$, $b_1^K$
paths through a word lattice, $a_1^L := a_1, a_2, \dots, a_L$ and $b_1^K := b_1, b_2, \dots, b_K$, where $a_l$ and $b_k$ are word lattice arcs
$b_j^k$
partial path in a word lattice, $b_j^k := b_j, b_{j+1}, \dots, b_k$, where $b_i$ is a word lattice arc
beg(a)
begin time of word lattice arc a
best(L)
non-ε input label sequence of the best path through lattice L
β
language model scale
c(b, a)
cost function, defined between word lattice arcs
c(b; S)
cost function, defined between an arc b from the hypothesis space lattice and the summation space lattice S
$c(b_1^K, a_1^L)$
cost function, defined between two paths through a word lattice
CN(L)
confusion network derived from word lattice L and an arbitrary slot function
CN(L, σ(·))
confusion network derived from word lattice L and slot function σ(·)
d(L)
single-source shortest distance for word lattice L starting from the initial state, score of the best path if computed over the tropical semiring
dur(a)
duration in number of time frames of word lattice arc a
δ(i, j)
Kronecker delta, equals one for i = j, and zero otherwise
end(a)
end time of word lattice arc a
E(L)
set of all lattice arcs in word lattice L
ε
the empty word
from(a)
source state of lattice arc a
$f_i(\dots)$
$i$th feature function in a log-linear model
$g(x_1^T)$
Bayes risk classifier applied to the acoustic observations $x_1^T$
H(·)
entropy
H
word lattice representing the hypothesis space of a Bayes risk decoder
h(a, b)
conditional overlap; overlap in number of time frames between two word lattice arcs, if both arcs have the same input label, zero otherwise
i
common index for the scaling factors and feature functions in a log-linear model
I
number of scaling factors or feature functions in a log-linear model
i(a)
input label of word lattice arc a
$i(a_1^L)$
sequence of non-ε input labels of a path through a word lattice
j
common index for the systems in a system combination or the lattices in a lattice-based system combination
J
number of systems in a system combination, number of lattices in a lattice-based system combination
k, l
common indices for the arcs in a path through a word lattice
L(·, ·)
loss function used in a Bayes risk decoder, defined for two word sequences or two paths through a word lattice
Lev(·, ·)
Levenshtein distance, defined for two word sequences or two paths through a word lattice
$L$, $L_j$
word lattice defined as a weighted finite-state acceptor, word lattice produced by the $j$th system
$\lambda$
log-linear model parameters, $\lambda = \lambda_1, \lambda_2, \dots, \lambda_I$
$\lambda_i$
log-linear model parameter, scaling factor of the $i$th feature function $f_i(\dots)$
$\lambda_i(w)$
word-dependent parameter in a log-linear model, word-dependent scaling factor of the $i$th feature function $f_i(\dots)$
m, n
common indices for the words in a word sequence
o(a, b)
overlap in number of time frames between two arcs in a word lattice
p(j)
prior probability for the jth system
$p(a \mid x_1^T)$
posterior for the word lattice arc $a$ given the acoustic observations $x_1^T$
$p(a_1^L \mid x_1^T)$
posterior for path $a_1^L$ through a word lattice given the acoustic observations $x_1^T$
$p(w_1^N)$
prior for the word sequence $w_1^N$, language model
$p(w_1^N \mid x_1^T)$
posterior for the spoken word sequence $w_1^N$ given the acoustic observations $x_1^T$
$p_s(w \mid x_1^T)$
defined by a confusion network (CN), posterior for the occurrence of word $w$ in CN slot $s$ given the acoustic observations $x_1^T$
$p_t(w \mid x_1^T)$
posterior for the occurrence of word $w$ at time frame $t$ given the acoustic observations $x_1^T$
$r(x_1^T)$
Bayes risk given the acoustic observations $x_1^T$
σ(a)
assigns word lattice arc a to a confusion network (CN) slot, in the computation of the CN distance two lattice arcs are aligned if they are assigned to the same slot
s
state in a word lattice
S
word lattice representing the summation space of a Bayes risk decoder
Σ
alphabet or vocabulary
t, τ
common indices for time frames
t(s)
time stamp of word lattice state s
to(a)
target state of lattice arc a
v, w
words from vocabulary Σ or the empty word
$v_1^M$, $w_1^N$
word sequences, where $v_1^M := v_1 v_2 \dots v_M$ and $w_1^N := w_1 w_2 \dots w_N$
w(a)
weight of word lattice arc a; for an arc in a system-dependent word lattice the weight usually consists of an acoustic and a language model score
$x_1^T$
sequence of acoustic observation vectors, $x_1^T = x_1 x_2 \dots x_T$
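Several of the symbols above are simple functions of the time boundaries of lattice arcs. The following Python sketch illustrates dur(a), o(a, b), and h(a, b) on a toy arc type. The `Arc` class, its field names, and the convention that the end time is exclusive are assumptions made for this illustration only; the thesis' own definitions in the corresponding chapters take precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """Toy word lattice arc (hypothetical type, for illustration only)."""
    label: str  # input label i(a)
    beg: int    # begin time beg(a), in time frames
    end: int    # end time end(a), in time frames (assumed exclusive)


def dur(a: Arc) -> int:
    """Duration dur(a) in number of time frames."""
    return a.end - a.beg


def overlap(a: Arc, b: Arc) -> int:
    """Overlap o(a, b) in number of time frames; zero for disjoint arcs."""
    return max(0, min(a.end, b.end) - max(a.beg, b.beg))


def cond_overlap(a: Arc, b: Arc) -> int:
    """Conditional overlap h(a, b): o(a, b) if both input labels agree, else zero."""
    return overlap(a, b) if a.label == b.label else 0
```

With `a = Arc("have", 10, 30)` and `b = Arc("have", 20, 45)`, the two arcs share the frames 20 to 29, so both the overlap and the conditional overlap count ten frames, whereas an arc with a different label over the same span has a conditional overlap of zero.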
D.2 Acronyms
ASR
Automatic Speech Recognition
BC
Broadcast Conversations
BN
Broadcast News
BR
Bayes Risk
CART
Classification And Regression Tree
CER
Character Error Rate
CMLLR
Constrained Maximum Likelihood Linear Regression
CN
Confusion Network
CNC
Confusion Network Combination
CNRS-LIMSI
Centre National de la Recherche Scientifique - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
DMC
Discriminative Model Combination
EPPS
European Parliament Plenary Sessions
FB
Forward Backward
FBK-IRST
Fondazione Bruno Kessler (formerly Istituto Trentino di Cultura) - Centro per la Ricerca Scientifica e Tecnologica
FE
Frame Error
FST
Finite State Transducer
GALE
Global Autonomous Language Exploitation
GT
Gammatone filter
HMM
Hidden Markov Model
IBM
International Business Machines
ICSI
International Computer Science Institute, Berkeley, California
IDIAP
Idiap Research Institute
IRST
see FBK-IRST
ITC-IRST
see FBK-IRST
LDA
Linear Discriminant Analysis
LIMSI
see CNRS-LIMSI
LM
Language Model
LVCSR
Large Vocabulary Continuous Speech Recognition
MAP
Maximum A Posteriori
MFCC
Mel Frequency Cepstral Coefficients
ML
Maximum Likelihood
MLLR
Maximum Likelihood Linear Regression
MPE
Minimum Phone Error
MPFE
Minimum Phone Frame Error
MWE
Minimum Word Error
NCE
Normalized Cross Entropy
NIST
National Institute of Standards and Technology
NN
Neural Network
PLP
Perceptual Linear Prediction
PP
Language Model Perplexity
Rprop
Resilient Propagation
ROVER
Recognizer Output Voting Error Reduction
RWTH
Rheinisch-Westfälische Technische Hochschule
SAT
Speaker Adaptive Training
SRI
SRI International
TC-STAR
Technology and Corpora for Speech to Speech Translation
UKA
Universität Karlsruhe
UW
University of Washington
VTLN
Vocal Tract Length Normalization
WER
Word Error Rate
WFST
Weighted Finite State Transducer
List of Figures
1.1 Basic architecture of a statistical automatic speech recognition system according to [Ney 1990].
1.2 6-state hidden Markov model in Bakis topology for the triphone ${}_{s}\mathrm{eh}_{v}$ in the word "seven" and the resulting trellis for a time alignment. The HMM segments are denoted by <1>, <2>, and <3>.
1.3 Lattice produced by the RWTH 2007 TC-Star EPPS Evaluation System for English [Lööf & Gollan+ 2007].
1.4 Graphical representation of a weighted acceptor a) and a weighted transducer b). An arc in the acceptor is labeled by i(e)/w(e), a transducer arc by i(e) : o(e)/w(e). States are labeled with their state number and a final weight, if the state is final.
3.1 Error induced by changing the LM scale after computing $x \oplus x$; the LM scale is initialized with 20. The correct sum results from changing the scaling factors before applying the $\oplus$-operator. The $\oplus$-operator is defined in Equation (3.4).
3.2 The figure shows a word lattice with time stamps at the states, a slot function, and the confusion network induced by the slot function.
3.3 Illustration of the non-speech cloud filter applied to a word lattice. In figure a) four paths are connecting the left most and the right most state, three of them starting with "have" and continuing with non-speech arcs marked as "{·}". These three paths define a non-speech cloud and the non-speech cloud filter removes all but the best scoring path through the cloud. The filter result is shown in figure b).
3.4 CN decoding results for the Chinese 230h testing system, cf. Section B.1.1, for different lattice densities.
3.5 CN decoding results for the English EPPS 2007 evaluation system, cf. Section B.2.1, for different lattice densities.
4.1 The bias in partially normalized frame errors. In a) the frame error is normalized w.r.t. the hypothesis, which results in ignoring deletion errors (left side) while insertions are counted (right side). In b) the frame error is normalized w.r.t. the reference and insertion errors are ignored (left side) while deletions are counted.
4.2 The figure shows a lattice, a CN derived from the lattice, and a lattice in which all paths have the same length. The positions for the insertions of the arcs are derived from the CN according to the algorithm described in the text. The number at the arcs corresponds to the CN slot the arc is assigned to and the number in the states is the minimum slot number from all outgoing arcs.
4.3 CN construction with the arc-cluster algorithm.
4.4 Pseudo code for the arc-cluster CN construction algorithm.
4.5 CN construction with the state-cluster algorithm.
4.6 Pseudo code for the state-cluster CN construction algorithm.
4.7 Pseudo code for the state-cluster CN construction algorithm with back-splitting.
4.8 CN construction with the center-frame algorithm.
4.9 Pseudo code for the center-frame CN construction algorithm.
5.1 The figure shows in the first row a lattice. The second and the third row show the word-level and the frame-level CN derived from the lattice, respectively. In the word-level CN each slot assigns a single position to each word hypothesis. In the frame-wise CN each slot represents a single time frame and a word hypothesis is usually spread among several slots.
5.2 Example for a typical error made by the common CN construction algorithms and the correction of the error by using a windowed Levenshtein distance, where the window is centered around the CN alignment. The example lattice consists of three paths which are listed to the right of the lattice together with their path probabilities. The arc labels in the lattice are composed of the word, the CN slot to which the arc is assigned, and the arc probability. The resulting CN is drawn below the lattice. To the right of the CN an example for the possible alignment position of arc "b:1" within a windowed Levenshtein alignment is given: a) shows the only possible alignment position for a window of size one, b) shows the possible alignment positions for a symmetric window of size three. The lower part of the figure shows the alignments for the Bayes risk hypotheses for different window sizes with the windowed Levenshtein distance as cost function. Alignment a) is the outcome for a window of size one, which is equivalent to the standard CN decoding. Alignment b) uses a symmetric window of size three. The larger window allows the alignment of "b:1" and "b:2" which compensates for the flaw in the CN construction, where the two arcs were assigned to different slots. The Bayes risk hypothesis for a window of size three is "a b c", which is also the minimum WER hypothesis for the example lattice.
5.3 The figure visualizes the alignments performed in the Bayes risk decoder with the windowed Levenshtein distance as loss function. Figure a) shows the CN alignment case, where the window size is one and thus the alignment is unique. For a window size of $2d + 1$ the computation of the hypothesis word at position $n$ considers the alignment between $v_n^{n+d}$ and $w_n^{n+2d}$ as shown in b). For sufficiently large window size, that is $\geq 2S - 1$, the alignment between $v_1^S$ and $w_1^S$ is computed, see c), which yields the exact Levenshtein distance.
5.4 Confidence warping applied to the lattices for eval07en produced by the LIMSI English EPPS 2007 evaluation system.
7.1 Results for the log-linear model combination for 25 training iterations and 6,904 word-dependent scaling factors. The word-dependent scaling factors are trained on 120h. The left plot shows the objective function and character error rates for the training set, the held-out set, and the development set. The right plot shows the progression of the error rates for the development set and the two test sets.
List of Tables 1.1
Semirings used by WFSTs for speech recognition tasks. . . . . . . . . . . . . . . . . . . .
3.1
Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results for the English EPPS 2007 evaluation systems, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ˆ , i.e. the hypothesis with the Example for the situation where the Bayes risk hypothesis W minimum expected word error rate, has a sentence posterior probability of zero and thus is not contained in the summation space. . . . . . . . . . . . . . . . . . . . . . . . . . . . Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . . . . Results for the English EPPS 2007 evaluation system, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . Results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The table summarizes common approaches to latticebased system combination. The methods are classified according to a) the lattice combination method and b) the decoder. 
The lattices are either combined via an intersection (or an theoretically equivalent lattice rescoring) or by building the lattice union. The decoder is either the Viterbi decoder, which is an approximation of the Bayes risk decoder with the sentence error as loss function, or the Bayes risk decoder with a local cost function as loss function. The local cost functions are of the second type for all methods but Povey’s MPE, which is of the first type. . . . . Results for the Chinese 230h testing system, cf. Section B.1.1. Wordlevel vs. characterlevel decoding and approximated vs. exact character boundaries. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . Comparison of the posterior probability distributions resulting from maximum likelihood estimation and from MRT training given the observations 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). The table also shows the Bayes risk hypothesis given the two distributions and the according risks given the empirical distribution. . . . . . . . . . . . .
3.2
3.3
3.4 3.5 3.6 3.7
3.8
3.9
4.1
4.2
Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three different approaches to wordwise frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minimum frame error decoding results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The experiments compare three different approaches to wordwise frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
30
31
32 38 39 40
41
45
50
56
57
159
List of Tables 4.3
Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1). The experiments compare the word and timeconditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . 4.4 Minimum frame error decoding results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The experiments compare the word and timeconditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . 4.5 The substitution, insertion, and deletion error for the discrete and the continuous case of the 1/2 overlap approximation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Minimum local alignment error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare four variants of the local alignment based cost. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . 4.7 Minimum local alignment error decoding results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The experiments compare four variants of the local alignment based cost. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.8 CN decoding results for the Chinese 230h testing system, cf. Section B.1.1. 
The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.9 CN decoding results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.10 Comparison of the original and the modified statecluster CN construction algorithm for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . 4.11 Comparison of the original and the modified statecluster CN construction algorithm for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . 5.1
5.2
5.3
160
Entropybased combination results for the Chinese 230h testing system, cf. Section B.1.1. Experiments are performed with the minimum frame error decoder with hypothesisside frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Entropybased combination results for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. Experiments are performed with the minimum frame error decoder with hypothesisside frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . Combination results with systemdependent frame and CNslotwise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. The warping is optimized for minimum character error rate. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
58 60
61
62
72
73
74
75
80
80
95
List of Tables 5.4 5.5
5.6 5.7
5.8
6.1
6.2 6.3 6.4
6.5
6.6
7.1 7.2
7.3 7.4
7.5
Normalized cross entropy (NCE) results with frame and CNslotwise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. . . . . . . . . . . . . . . . . . . . . Combination results with systemdependent frame and CNslotwise posterior warping for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The warping is optimized for minimum word error rate. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . Normalized cross entropy (NCE) results with frame and CNslotwise posterior warping for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. . . . . . . Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the Chinese 230h testing system, cf. Section B.1.1. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the English EPPS 2007 evaluation crosssite combination, cf. Section B.2.2. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . Baseline results for eval07. ROVER results come with confidence score based voting and with majority voting. 
Results are word error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Corpora statistics for the training/tuning set (eval06) and the evaluation set (eval07). . . CN oracle error rates for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iROVER combination results for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Combination results with Boostexter (BT) and random forests (RF) as classifier for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%. . . . . . . . . Error detection and correction results for eval07 for four systems and with a random forest as classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Training, tuning (dev07), and test sets. The worddependent scaling factors are trained on the 120h “λtraining” set. For the first test set no wordsegmented transcripts are available. Lattice rescoring results with various acoustic models. The lattice sets are generated with the MFCC model and subsequently rescored with the PLP and resp. with the Gammatone (GT) acoustic model, where the character boundaries are kept fixed. The acoustic models were estimated on the 230h AM training set. . . . . . . . . . . . . . . . . . . . . . . . . . Statistics for worddependent scaling factors on dev07: number of worddependent scaling factors and coverage of running words for a given cutoff Nmin . . . . . . . . . . . . . . . 
CNdecoding results for the loglinear model combination using word, character, and syllabledependent scaling factors. The scaling factors are trained on 120h using either minimum phone error (MPE) or minimum character error (MWE) training. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. CNdecoding results for loglinear model combinations and for a system combination using the weighted average of sentence posteriors. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. . . . . . . . . . . . . . . . . . .
B.1 Corpora statistics for the Chinese GALE systems.
B.2 Subsystems in the Chinese 230 testing system.
B.3 System combinations for the Chinese 230 testing system.
B.4 Subsystems in the RWTH Aachen Chinese GALE 2008 evaluation system.
B.5 Corpora statistics for the English EPPS systems.
B.6 Subsystems in the RWTH Aachen English EPPS 2007 evaluation system.
Bibliography

A. M. H. J. Aertsen, P. I. M. Johannesma, and D. J. Hermes. Spectro-temporal receptive fields of auditory neurons in the grassfrog. Biological Cybernetics, 38:235–248, November 1980.
Cyril Allauzen and Mehryar Mohri. An optimal pre-determinization algorithm for weighted transducers. Theoretical Computer Science, 328(1–2):3–18, November 2004.
Cyril Allauzen, Mehryar Mohri, Brian Roark, and Michael Riley. A generalized construction of integrated speech recognition transducers. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, Canada, May 2004.
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In 12th International Conference on Implementation and Application of Automata (CIAA 2007), volume 4783, pages 11–23, Prague, Czech Republic, July 2007. Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, Germany.
F. Alleva, X. D. Huang, and M. Y. Hwang. Improvements on the pronunciation prefix tree search organization. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 133–136, Atlanta, GA, USA, May 1996.
Ladan Baghai-Ravary, Greg Kochanski, and John Coleman. Precision of phoneme boundaries derived using hidden Markov models. In Interspeech, pages 2879–2883, Brighton, U.K., September 2009.
L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:179–190, March 1983.
L. R. Bahl, M. Padmanabhan, D. Nahamoo, and P. S. Gopalakrishnan. Discriminative training of Gaussian mixture models for large vocabulary speech recognition systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 613–616, Atlanta, GA, USA, May 1996.
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 49–52, Tokyo, Japan, May 1986.
J. K. Baker. Stochastic modeling for automatic speech understanding. In D. R. Reddy, editor, Speech Recognition, pages 512–542. Academic Press, New York, NY, USA, 1975.
R. Bakis. Continuous speech word recognition via centisecond acoustic states. In ASA Meeting, Washington, DC, USA, April 1976.
L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha, editor, Inequalities, volume 3, pages 1–8. Academic Press, New York, NY, 1972.
T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763. Reprinted in Biometrika, vol. 45, no. 3/4, pp. 293–315, December 1958.
R. E. Bellman. Dynamic programming. Princeton University Press, Princeton, NJ, USA, 1957.
K. Beulen. Phonetische Entscheidungsbäume für die automatische Spracherkennung mit großem Vokabular. PhD thesis, Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany, July 1999.
K. Beulen, S. Ortmanns, and C. Elting. Dynamic programming search techniques for across-word modeling in speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 609–612, Phoenix, AZ, March 1999.
P. Beyerlein. Discriminative model combination. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 238–245, Santa Barbara, CA, USA, December 1997.
P. Beyerlein. Discriminative model combination. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 481–484, Seattle, WA, USA, May 1998.
P. Beyerlein, X. L. Aubert, R. Haeb-Umbach, M. Harris, D. Klakow, A. Wendemuth, S. Molau, M. Pitz, and A. Sixtus. The Philips/RWTH system for transcription of broadcast news. In Proc. DARPA Broadcast News Workshop, pages 151–155, Herndon, VA, February 1999.
Peter Beyerlein. Diskriminative Modellkombination in Spracherkennungssystemen mit großem Wortschatz. PhD thesis, RWTH Aachen University, Aachen, Germany, October 2000.
M. Bisani and H. Ney. Multigram-based grapheme-to-phoneme conversion for LVCSR. In Interspeech, pages 933–936, Geneva, Switzerland, September 2003.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
C. Breslin and M. J. F. Gales. Generating complementary systems for speech recognition. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
C. Breslin and M. J. F. Gales. Complementary system generation using directed decision trees. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 337–340, Honolulu, HI, USA, April 2007a.
C. Breslin and M. J. F. Gales. Building multiple complementary systems using directed decision trees. In Interspeech, Antwerp, Belgium, August 2007b.
Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne, and Michel Comeau. GPU accelerated acoustic likelihood computations. In Interspeech, Brisbane, Australia, September 2008.
Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, and Kai-Fu Lee. Large vocabulary Mandarin speech recognition with different approaches in modeling tones. In International Conference on Spoken Language Processing (ICSLP), pages 983–986, Beijing, China, October 2000.
B. Chen, Q. Zhu, and N. Morgan. Learning long-term temporal features in LVCSR using neural networks. In Interspeech, Jeju Island, Korea, October 2004.
C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A. Picheny, and K. Shen. New methods in continuous Mandarin speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 3, pages 1543–1546, Rhodes, Greece, September 1997.
C. J. Chen, H. Li, L. Shen, and G. K. Fu. Recognize tone languages using pitch information on the main vowel of each syllable. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 61–64, Salt Lake City, USA, May 2001.
I-Fan Chen and Lin-Shan Lee. A new framework for system combination based on integrated hypothesis space. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
S. S. Chen and P. S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In DARPA Broadcast News Transcription and Understanding Workshop, pages 127–132, February 1998.
J. T. Chien, C. H. Huang, K. Shinoda, and S. Furui. Towards optimal Bayes decision for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.
S. B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4):357–366, August 1980.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1–38, 1977.
Thomas G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1–15, 2000a.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, August 2000b.
E. W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1:269–271, 1959.
G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds. The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective. Speech Communication, 31(2–3):225–254, June 2000.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, New York, NY, USA, 2001.
A. Emami, K. Papineni, and J. Sorenson. Large-scale distributed language modeling. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 37–40, Honolulu, HI, USA, April 2007.
G. Evermann and P. Woodland. Posterior probability decoding, confidence estimation and system combination. In NIST Speech Transcription Workshop, College Park, MD, USA, 2000.
G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, L. Wang, D. Mrva, and P. C. Woodland. Development of the 2003 CU-HTK conversational telephone speech transcription system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 261–264, Montreal, Canada, May 2003.
Daniele Falavigna, Nicola Bertoldi, Fabio Brugnara, Roldano Cattoni, Mauro Cettolo, Boxing Chen, Marcello Federico, Diego Giuliani, Roberto Gretter, Deepa Gupta, and Dino Seppi. The IRST English-Spanish translation system for European Parliament speeches. In International Conference on Spoken Language Processing (ICSLP), pages 2833–2837, Antwerp, Belgium, August 2007.
J. G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 347–354, Santa Barbara, CA, USA, December 1997.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.
M. J. F. Gales and P. C. Woodland. Mean and variance adaptation within the MLLR framework. Computer Speech and Language, 10(4):249–264, 1996.
M. Generet, H. Ney, and F. Wessel. Extensions to absolute discounting for language modeling. In European Conference on Speech Communication and Technology (Eurospeech), volume 2, pages 1245–1248, Madrid, Spain, September 1995.
M. Gibson and T. Hain. Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition. In Interspeech, Pittsburgh, PA, USA, September 2006.
Matthew Gibson. Minimum Bayes Risk Acoustic Model Estimation and Adaptation. PhD thesis, University of Sheffield, Sheffield, UK, November 2008.
V. Goel and W. J. Byrne. Minimum Bayes-risk automatic speech recognition. Computer Speech and Language, 14:115–136, 2000.
V. Goel, W. Byrne, and S. Khudanpur. LVCSR rescoring with modified loss functions: a decision theoretic perspective. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 425–428, Seattle, WA, USA, 1998.
V. Goel, S. Kumar, and W. J. Byrne. Segmental minimum Bayes-risk decoding for automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 12:234–249, 2004.
Vaibhava Goel, Shankar Kumar, and William Byrne. Segmental minimum Bayes-risk ASR voting strategies. In International Conference on Spoken Language Processing (ICSLP), pages 139–142, Beijing, China, October 2000.
Vaibhava Goel, Shankar Kumar, and William Byrne. Confidence based lattice segmentation and minimum Bayes-risk decoding. In European Conference on Speech Communication and Technology (Eurospeech), pages 2569–2572, Aalborg, Denmark, September 2001.
D. Giuliani and F. Brugnara. Acoustic model adaptation with multiple supervisions. In Proc. TC-Star Workshop on Speech-to-Speech Translation, pages 151–154, Barcelona, Spain, June 2006.
D. Giuliani and F. Brugnara. Experiments on cross-system acoustic model adaptation. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 117–122, Kyoto, Japan, December 2007.
A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt. Hidden conditional random fields for phone classification. In Interspeech, pages 117–120, Lisbon, Portugal, September 2005.
R. Häb-Umbach and H. Ney. Improvements in beam search for 10000-word continuous-speech recognition. IEEE Transactions on Speech and Audio Processing, 2(2):353–356, April 1994.
R. Haeb-Umbach, X. Aubert, P. Beyerlein, D. Klakow, M. Ullrich, A. Wendemuth, and P. Wilcox. Acoustic modeling in the Philips Hub-4 continuous-speech recognition system. In DARPA Broadcast News Transcription and Understanding Workshop, February 1998.
D. Hakkani and G. Riccardi. A general algorithm for word graph matrix decomposition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 596–599, Hong Kong, April 2003.
Georg Heigold, Thomas Deselaers, Ralf Schlüter, and Hermann Ney. Modified MMI/MPE: A direct evaluation of the margin in speech recognition. In International Conference on Machine Learning, pages 384–391, Helsinki, Finland, July 2008. A typo from the original publication was corrected (marked in red).
H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752, June 1990.
H. Hermansky, D. P. W. Ellis, and S. Sharma. Tandem connectionist feature stream extraction for conventional HMM systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1635–1638, Istanbul, Turkey, June 2000.
L. Hetherington. MIT finite-state transducer toolkit for speech and language processing. In International Conference on Spoken Language Processing (ICSLP), pages 2609–2612, Jeju Island, Korea, October 2004.
Dustin Hillard and Mari Ostendorf. Compensating for word posterior estimation bias in confusion networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, Toulouse, France, May 2006.
Dustin Hillard, Björn Hoffmeister, Mari Ostendorf, Ralf Schlüter, and Hermann Ney. iROVER: Improving system combination with classification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 65–68, Rochester, New York, April 2007.
Björn Hoffmeister, Tobias Klein, Ralf Schlüter, and Hermann Ney. Frame based system combination and a comparison with weighted ROVER and CNC. In Interspeech, pages 537–540, Pittsburgh, PA, USA, September 2006.
Björn Hoffmeister, Dustin Hillard, Stefan Hahn, Ralf Schlüter, Mari Ostendorf, and Hermann Ney. Cross-site and intra-site ASR system combination: Comparisons on lattice and 1-best methods. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1145–1148, Honolulu, HI, USA, April 2007.
Björn Hoffmeister, Christian Plahl, Peter Fritz, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Development of the 2007 RWTH Mandarin LVCSR system. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Kyoto, Japan, December 2007.
Björn Hoffmeister, Ralf Schlüter, and Hermann Ney. iCNC and iROVER: The limits of improving system combination with classification? In Interspeech, pages 232–235, Brisbane, Australia, September 2008.
Björn Hoffmeister, Ruoying Liang, Ralf Schlüter, and Hermann Ney. Log-linear model combination with word-dependent scaling factors. In Interspeech, pages 248–251, Brighton, U.K., September 2009.
Björn Hoffmeister, Ralf Schlüter, and Hermann Ney. Bayes risk approximations using time overlap with an application to system combination. In Interspeech, pages 1191–1194, Brighton, U.K., September 2009.
Roger Hsiao, Mark Fuhs, Yik-Cheung Tam, Qin Jin, and Tanja Schultz. The CMU-InterACT 2008 Mandarin transcription system. In Interspeech, pages 1445–1448, Brisbane, Australia, September 2008.
Jing Huang, Etienne Marcheret, Karthik Visweswariah, Vit Libal, and Gerasimos Potamianos. The IBM Rich Transcription 2007 speech-to-text systems for lecture meetings. Lecture Notes in Computer Science, 4625:429–441, 2009.
X. Huang, M. Belin, F. Alleva, and M. Hwang. Unified stochastic engine (USE) for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 636–639, Minneapolis, MN, USA, April 1993.
X. D. Huang and M. A. Jack. Semi-continuous hidden Markov models for speech signals. Computer Speech and Language, 3(3):239–251, 1989.
M. Y. Hwang, G. Peng, W. Wang, A. Faria, A. Heidel, and M. Ostendorf. Building a highly accurate Mandarin speech recognizer. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 490–495, Kyoto, Japan, December 2007.
F. Jelinek. A fast sequential decoding algorithm using a stack. IBM Journal of Research and Development, 13:675–685, November 1969.
F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4):532–556, April 1976.
N. Jennequin and J. L. Gauvain. Modeling duration via lattice rescoring. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, HI, USA, April 2007.
B. H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40(12):3043–3054, 1992.
J. Kaiser, B. Horvat, and Z. Kacic. A novel loss function for the overall risk criterion based discriminative training of HMM models. In Interspeech, volume 2, pages 887–890, Beijing, China, October 2000.
S. Kanthak and H. Ney. FSA: An efficient and flexible C++ toolkit for finite state automata using on-demand computation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 510–517, Barcelona, Spain, July 2004.
S. Kanthak, K. Schütz, and H. Ney. Using SIMD instructions for fast likelihood calculation in LVCSR. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1531–1534, Istanbul, Turkey, June 2000.
S. Kanthak, H. Ney, M. Riley, and M. Mohri. A comparison of two LVR search optimization techniques. In International Conference on Spoken Language Processing (ICSLP), pages 1309–1312, Denver, CO, USA, September 2002.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Speech and Audio Processing, 35:400–401, March 1987.
Daniel Keysers, Franz Josef Och, and Hermann Ney. Efficient maximum entropy training for statistical object recognition. In Informatiktage der Gesellschaft für Informatik, pages 342–345, Bad Schussenried, Germany, November 2002.
Daniil Kocharov, Andras Zolnay, Ralf Schlüter, and Hermann Ney. Articulatory motivated acoustic features for speech recognition. In Interspeech, pages 1101–1104, Lisbon, Portugal, September 2005.
N. Kumar and A. G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4):283–297, December 1998.
Shankar Kumar and William Byrne. Risk based lattice cutting for segmental minimum Bayes-risk decoding. In International Conference on Spoken Language Processing (ICSLP), pages 373–376, Denver, CO, USA, September 2002.
L. Lamel, J. L. Gauvain, G. Adda, C. Barras, E. Bilinski, O. Galibert, A. Pujol, H. Schwenk, and Xuan Zhu. The LIMSI 2006 TC-STAR EPPS transcription systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 997–1000, Honolulu, HI, USA, April 2007.
L. Lee and R. Rose. Speaker normalization using efficient frequency warping procedures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 353–356, Atlanta, GA, USA, May 1996.
C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171–185, 1995.
X. Lei, W. Wu, W. Wang, A. Mandal, and A. Stolcke. Development of the 2008 SRI Mandarin speech-to-text system for broadcast news and conversation. In Interspeech, Brighton, U.K., September 2009.
Xin Lei, Manhung Siu, Mei-Yuh Hwang, Mari Ostendorf, and Tan Lee. Improved tone modeling for Mandarin broadcast news speech recognition. In International Conference on Spoken Language Processing (ICSLP), pages 1237–1240, Pittsburgh, PA, USA, September 2006.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(10):707–710, 1966.
S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62(4):1035–1074, April 1983.
Andrej Ljolje, Fernando Pereira, and Michael Riley. Efficient general lattice generation and rescoring. In European Conference on Speech Communication and Technology (Eurospeech), pages 1251–1254, Budapest, Hungary, September 1999.
J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney. The 2006 RWTH parliamentary speeches transcription system. In TC-STAR Workshop on Speech-to-Speech Translation, pages 133–138, Barcelona, Spain, June 2006.
J. Lööf, M. Bisani, Ch. Gollan, G. Heigold, Björn Hoffmeister, Ch. Plahl, R. Schlüter, and H. Ney. The 2006 RWTH parliamentary speeches transcription system. In Interspeech, pages 105–108, Pittsburgh, PA, September 2006.
J. Lööf, Ch. Gollan, S. Hahn, G. Heigold, B. Hoffmeister, Ch. Plahl, D. Rybach, R. Schlüter, and H. Ney. The RWTH 2007 TC-STAR evaluation system for European English and Spanish. In Interspeech, Antwerp, Belgium, August 2007.
B. Lowerre. A Comparative Performance Analysis of Speech Understanding Systems. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 1976.
L. Mangu and M. Padmanabhan. Error corrective mechanisms for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 29–32, Salt Lake City, UT, USA, May 2001.
L. Mangu, E. Brill, and A. Stolcke. Finding consensus among words: Lattice-based word error minimization. In European Conference on Speech Communication and Technology (Eurospeech), volume 1, pages 495–498, Budapest, Hungary, September 1999.
L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14:373–400, 2000.
Lidia Mangu. Finding Consensus in Speech Recognition. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA, April 2000.
Sven C. Martin. Statistische Auswahl von Wortabhängigkeiten in der automatischen Spracherkennung. PhD thesis, RWTH Aachen University, Aachen, Germany, February 2000.
Evgeny Matusov, Arne Mauser, and Hermann Ney. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation, pages 158–165, Kyoto, Japan, November 2006.
Evgeny Matusov, Björn Hoffmeister, and Hermann Ney. ASR word lattice translation with exhaustive reordering is possible. In Interspeech, pages 2342–2345, Brisbane, Australia, September 2008.
E. McDermott and S. Katagiri. Minimum classification error for large scale speech recognition tasks using weighted finite state transducers. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
F. Metze and A. Waibel. A flexible stream architecture for ASR using articulatory features. In International Conference on Spoken Language Processing (ICSLP), pages 2133–2136, Denver, CO, USA, September 2002a.
F. Metze and A. Waibel. Auditory-based acoustic distinctive features and spectral cues for automatic speech recognition using a multi-stream paradigm. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 837–840, Orlando, FL, USA, May 2002b.
Hemant Misra, Hervé Bourlard, and Vivek Tyagi. New entropy based combination rules in HMM/ANN multi-stream ASR. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April 2003.
M. Mohri. Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13(1):129–143, 2002a.
M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002b.
M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.
M. Mohri. Weighted finite-state transducer algorithms: An overview. In Carlos Martín-Vide, Victor Mitrana, and Gheorghe Paun, editors, Formal Languages and Applications, Springer, Berlin, 2004.
M. Mohri and M. Riley. Weighted determinization and minimization for large vocabulary speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece, September 1997.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech recognition with weighted finite-state transducers. In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech Communication, Part E: Speech recognition, Springer, Heidelberg, Germany, 2008.
S. Molau. Normalization in the Acoustic Feature Space for Improved Speech Recognition. PhD thesis, RWTH Aachen, Aachen, Germany, 2003.
Hy Murveit, John Butzberger, Vassilios Digalakis, and Mitch Weintraub. Progressive-search algorithms for large-vocabulary speech recognition. In HLT ’93: Proceedings of the workshop on Human Language Technology, pages 87–90, Morristown, NJ, USA, 1993. Association for Computational Linguistics.
J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
H. Ney. The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Transactions on Speech and Audio Processing, 32(2):263–271, April 1984.
H. Ney. Acoustic modeling of phoneme units for continuous speech recognition. In L. Torres, E. Masgrau, and M. A. Lagunas, editors, Signal Processing V: Theories and Applications, Fifth European Signal Processing Conference, pages 65–72. Elsevier Science Publishers B. V., Barcelona, Spain, 1990.
H. Ney and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. In International Conference on Spoken Language Processing (ICSLP), volume 3, pages 1355–1358, Yokohama, Japan, September 1994.
H. Ney, D. Mergel, A. Noll, and A. Paeseler. A data-driven organization of the dynamic programming beam search for continuous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 833–836, Dallas, TX, USA, April 1987.
H. Ney, R. Häb-Umbach, B. H. Tran, and M. Oerder. Improvements in beam search for 10000-word continuous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 9–12, San Francisco, CA, March 1992.
H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in language modeling. Computer Speech and Language, 8(1):1–38, 1994.
H. Ney, S. C. Martin, and F. Wessel. Statistical language modeling using leaving-one-out. In S. Young and G. Bloothooft, editors, Corpus Based Methods in Language and Speech Processing, pages 1–26. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997.
Tim Ng, Bing Zhang, Kham Nguyen, and Long Nguyen. Progress in the BBN 2007 Mandarin speech to text system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1537–1540, Las Vegas, NV, USA, April 2008.
Y. Normandin, R. Lacouture, and R. Cardin. MMIE training for large vocabulary continuous speech recognition. In International Conference on Spoken Language Processing, pages 1367–1370, Yokohama, Japan, September 1994.
M. K. Omar and L. Mangu. An evaluation of lattice scoring using a smoothed estimate of word accuracy. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 1149–1152, Honolulu, HI, USA, April 2007.
S. Ortmanns and H. Ney. An experimental study of the search space for 20000-word speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 2, pages 901–904, Madrid, Spain, September 1995.
S. Ortmanns, H. Ney, and A. Eiden. Language-model look-ahead for large vocabulary speech recognition. In International Conference on Spoken Language Processing (ICSLP), volume 4, pages 2095–2098, Philadelphia, PA, October 1996.
S. Ortmanns, H. Ney, and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. Computer Speech and Language, 11(1):43–72, January 1997a.
S. Ortmanns, H. Ney, and T. Firzlaff. Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 1, pages 139–142, Rhodes, Greece, September 1997b.
S. Ortmanns, A. Eiden, and H. Ney. Improved lexical tree search for large vocabulary recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 817–820, Seattle, WA, USA, May 1998.
M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz, and J. R. Rohlicek. Integration of diverse recognition methodologies through reevaluation of N-best sentence hypotheses. In DARPA Speech and Natural Language Processing Workshop, pages 83–87, Pacific Grove, CA, USA, 1991.
Naveen Parihar, Ralf Schlüter, David Rybach, and Eric A. Hansen. Parallel fast likelihood computation for LVCSR using mixture decomposition. In Interspeech, Brighton, U.K., September 2009.
D. B. Paul. Algorithms for an optimal A∗ search and linearizing the search in the stack decoder. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 693–696, Toronto, Canada, May 1991.
M. Pitz. Investigations on Linear Transformations for Speaker Adaptation and Normalization. PhD thesis, RWTH Aachen University, 2005.
Christian Plahl, Björn Hoffmeister, Mei-Yuh Hwang, Danju Lu, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Recent improvements of the RWTH GALE Mandarin LVCSR system. In Interspeech, pages 2426–2429, Brisbane, Australia, September 2008a.
Christian Plahl, Björn Hoffmeister, Mei-Yuh Hwang, Danju Lu, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Recent improvements of the RWTH GALE Mandarin LVCSR system. In Interspeech, pages 2426–2429, Brisbane, Australia, September 2008b.
Christian Plahl, Björn Hoffmeister, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Development of the GALE 2008 Mandarin LVCSR system. In Interspeech, pages 2107–2110, Brighton, U.K., September 2009.
D. Povey and P. C. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 105–108, Orlando, FL, May 2002.
R. Prasad, S. Matsoukas, C. L. Kao, J. Z. Ma, D. X. Xu, T. Colthurst, O. Kimball, R. Schwartz, J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre. The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system. In Interspeech, Lisbon, Portugal, September 2005.
L. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall Signal Processing Series, Englewood Cliffs, NJ, 1979.
B. Ramabhadran, Olivier Siohan, L. Mangu, G. Zweig, M. Westphal, H. Schulz, and A. Soneiro. The IBM 2006 speech transcription system for European parliamentary speeches. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
V. Ramasubramanian and K. K. Paliwal. Fast k-dimensional tree algorithms for nearest neighbor search with application to vector quantization encoding. IEEE Transactions on Speech and Audio Processing, 40(3):518–528, March 1992.
M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In IEEE International Conference on Neural Networks (ICNN), pages 586–591, San Francisco, CA, USA, 1993.
H. Sakoe. Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Transactions on Speech and Audio Processing, 27:588–595, December 1979.
A. Sankar. Bayesian model combination (BAYCOM) for improved recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 845–848, Philadelphia, PA, USA, April 2005.
R. R. Sarukkai and D. H. Ballard. Improved spontaneous dialogue recognition using dialogue and utterance triggers by adaptive probability boosting. In Interspeech, volume 1, pages 208–211, Philadelphia, PA, USA, October 1996.
R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 2001.
R. Schlüter. Investigations on Discriminative Training Criteria. PhD thesis, RWTH Aachen University, Aachen, Germany, September 2000.
Ralf Schlüter, Thomas Scharrenbach, Volker Steinbiss, and Hermann Ney. Bayes risk minimization using metric loss functions. In European Conference on Speech Communication and Technology (Eurospeech), pages 1449–1452, Lisbon, Portugal, September 2005.
Ralf Schlüter, Andras Zolnay, and Hermann Ney. Feature combination using linear discriminant analysis and its pitfalls. In International Conference on Spoken Language Processing (ICSLP), pages 345–348, Pittsburgh, PA, USA, September 2006.
Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney. Gammatone features and feature combination for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, HI, USA, April 2007.
R. Schwartz and Y.-L. Chow. The N-best algorithm: An efficient and exact procedure for finding the N most likely sentence hypotheses. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 81–84, Albuquerque, NM, April 1990.
O. Siohan, B. Ramabhadran, and B. Kingsbury. Constructing ensembles of ASR systems using randomized decision trees. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
A. Sixtus. Across-Word Phoneme Models for Large Vocabulary Continuous Speech Recognition. PhD thesis, RWTH Aachen, January 2003.
A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 593–596, Phoenix, Arizona, USA, March 1999.
H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig. The IBM 2004 conversational telephony system for rich transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 205–208, Philadelphia, PA, USA, March 2005.
V. Steinbiss, H. Ney, R. Häb-Umbach, B.-H. Tran, U. Essen, R. Kneser, M. Oerder, H.-G. Meier, X. Aubert, C. Dugast, and D. Geller. The Philips research system for large-vocabulary continuous-speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), pages 2125–2128, Berlin, Germany, September 1993.
A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. Rao Gadde, M. Plauche, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng. The SRI March 2000 Hub-5 conversational speech transcription system. In NIST Speech Transcription Workshop, College Park, MD, USA, May 2000.
Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Interspeech, pages 901–904, Denver, CO, September 2002.
Andreas Stolcke, Yochai König, and Mitchel Weintraub. Explicit word error minimization in N-best list rescoring. In European Conference on Speech Communication and Technology (Eurospeech), pages 163–166, Rhodes, Greece, 1997.
Sebastian Stüker, Christian Fügen, Susanne Burger, and Matthias Wölfel. Cross-system adaptation and combination for continuous speech recognition: the influence of phoneme set and acoustic front-end. In Interspeech, Pittsburgh, PA, USA, September 2006.
Sebastian Stüker, Christian Fügen, Florian Kraft, and Matthias Wölfel. The ISL 2007 English speech transcription system for European parliament speeches. In International Conference on Spoken Language Processing (ICSLP), pages 2069–2072, Antwerp, Belgium, August 2007.
Alain Tritschler and Ramesh A. Gopinath. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In European Conference on Speech Communication and Technology (Eurospeech), pages 679–682, Budapest, Hungary, September 1999.
Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
Fabio Valente. A novel criterion for classifiers combination in multi-stream speech recognition. IEEE Signal Processing Letters, 16(7):561–564, July 2009.
Fabio Valente, Jithendra Vepa, Christian Plahl, Christian Gollan, Hynek Hermansky, and Ralf Schlüter. Hierarchical neural networks feature extraction for LVCSR system. In Interspeech, Antwerp, Belgium, August 2007.
V. Venkataramani, S.A. Chakrabartty, and W.J. Byrne. Support vector machines for segmental minimum Bayes risk decoding of continuous speech. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), St. Thomas, VI, USA, November 2003.
Veera Venkataramani, Shantanu Chakrabartty, and William Byrne. Ginisupport vector machines for segmental minimum Bayes risk decoding of continuous speech. Computer Speech and Language, 21, July 2007.
D. Vergyri, S. Tsakalidis, and W. Byrne. Minimum risk acoustic clustering for multilingual acoustic model combination. In International Conference on Spoken Language Processing (ICSLP), pages 873–876, Beijing, China, October 2000.
D. Vergyri, A. Mandal, W. Wang, A. Stolcke, J. Zheng, M. Graciarena, D. Rybach, C. Gollan, R. Schlüter, K. Kirchhoff, A. Faria, and N. Morgan. Development of the SRI/Nightingale Arabic ASR system. In Interspeech, pages 1437–1440, Brisbane, Australia, September 2008.
Dimitra Vergyri. Integration of multiple knowledge sources in speech recognition using minimum error training. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA, 2000.
T. K. Vintsyuk. Element-wise recognition of continuous speech composed of words from a specified dictionary. Kibernetika, 7:133–143, March 1971.
A. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, 1967.
F. Wessel. Word Posterior Probabilities for Large Vocabulary Continuous Speech Recognition. PhD thesis, RWTH Aachen, Aachen, Germany, 2002.
F. Wessel, K. Macherey, and R. Schlüter. Using word probabilities as confidence measures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 225–228, Seattle, WA, USA, May 1998.
Frank Wessel, Ralf Schlüter, and Hermann Ney. Using posterior word probabilities for improved speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1587–1590, Istanbul, Turkey, June 2000.
Frank Wessel, Ralf Schlüter, Klaus Macherey, and Hermann Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298, March 2001b.
Frank Wessel, Ralf Schlüter, Klaus Macherey, and Hermann Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298, March 2001a.
Frank Wessel, Ralf Schlüter, and Hermann Ney. Explicit word error minimization using word hypothesis posterior probabilities. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 33–36, Salt Lake City, Utah, May 2001c.
Daniel Willett and Chuang He. Discriminative training for complementariness in system combination. In Interspeech, Brisbane, Australia, September 2008.
P. C. Woodland and D. Povey. Large scale discriminative training for speech recognition. In Automatic Speech Recognition (ASR), pages 7–16, Paris, France, September 2000.
P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech & Language, 16(1):25–48, 2002.
Haihua Xu, Daniel Povey, Jie Zhu, and Guanyong Wu. Minimum hypothesis phone error as a decoding method for speech recognition. In Interspeech, pages 76–79, Brighton, U.K., September 2009.
J. Xue and Y. Zhao. Random forests-based confidence annotation using novel features from confusion network. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 1149–1152, Toulouse, France, May 2006.
Jian Xue and Yunxin Zhao. Improved confusion network algorithm and shortest path search from word lattice. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 853–856, Philadelphia, PA, USA, March 2005.
Richard Zens, Saša Hasan, and Hermann Ney. A systematic comparison of training criteria for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 524–532, Prague, Czech Republic, June 2007.
R. Zhang and A. Rudnicky. Investigations of issues for using multiple acoustic models to improve continuous speech recognition. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
J. Zheng and A. Stolcke. Improved discriminative training using phone lattices. In Interspeech, pages 2125–2128, Lisbon, Portugal, September 2005.
Andras Zolnay. Acoustic Feature Combination for Speech Recognition. PhD thesis, RWTH Aachen University, Aachen, Germany, August 2006.
Andras Zolnay, Ralf Schlüter, and Hermann Ney. Acoustic feature combination for robust speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 457–460, Philadelphia, PA, USA, March 2005.
Curriculum Vitae

Personal Information
Name: Björn Hoffmeister
Date of birth: November 26, 1976
Place of birth: Aachen, Germany
Nationality: German

Education
1983 – 1986: Trinkbornschule in Rödermark, Germany
1986 – 1993: Oswald-von-Nell-Breuning-Schule (former Rodgauschule) in Rödermark, Germany
1993 – 1996: Abitur, Alfred-Delp-Schule in Dieburg, Germany
1997 – 2003: Diplom in Informatik, Universität zu Lübeck, Germany

Working Experience
2003 – 2004: Institut für Theoretische Informatik, Universität zu Lübeck, Germany. Research assistant (machine learning)
2004 – 2010: Chair of Computer Science 6 (Human Language Technology and Pattern Recognition), RWTH Aachen University, Germany. Research assistant and Ph.D. student (statistical speech recognition)
Summer 2009: Internship at NTT Communication Science Laboratories, Kyoto, Japan