Bayes Risk Decoding and its Application to System Combination

Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the degree of Doctor of Natural Sciences, submitted by Diplom-Informatiker Björn Hoffmeister from Aachen.

Reviewers: Professor Dr.-Ing. Hermann Ney and Privatdozent Dr. Jean-Luc Gauvain. Date of the oral examination: 18 July 2011. This dissertation is available online on the web pages of the university library.

Abstract

Speech recognition is the task of converting an acoustic signal, which contains speech, to written text. The error of a speech recognition system is measured as the number of words in which the recognized and the spoken text differ. This work investigates and develops decoding and system combination approaches within the Bayes risk decoding framework with the objective of reducing the number of word errors. The investigated approaches are computationally too expensive to be applied in the speech decoder. Instead, the result of a first recognition run is used, which narrows the number of hypotheses and provides the result in a compact form, the word lattice. In the single system decoding task a single word lattice is given, and in the lattice-based system combination task a word lattice is provided by each system. In both cases the goal is to minimize the number of word errors in the ultimate hypothesis.

In large vocabulary continuous speech recognition (LVCSR) tasks the number of word errors is computed as the Levenshtein distance between recognized and spoken text. The Bayes risk decoding framework yields the hypothesis with the least expected number of errors w.r.t. a specified loss function, given the true sentence posterior probabilities. However, neither are the true probabilities known, nor is the computation of the Bayes risk hypothesis with the Levenshtein distance as loss function computationally feasible for a word lattice. Consequently, in lattice-based Bayes risk decoding and system combination two problems have to be addressed: first, how to compute an estimate of the sentence posterior probabilities given one or several word lattices; second, how to approximate the Levenshtein distance such that the computation of the Bayes risk hypothesis becomes computationally feasible.

Based on the separation of the posterior probability computation and the loss function in the Bayes risk decoding rule, a framework will be developed which covers the common approaches to lattice-based system combination, like ROVER, CNC, and DMC. Furthermore, it will be shown that the common approximations of the Levenshtein distance used in LVCSR tasks can be classified into two categories for which efficient Bayes risk decoders exist. The existing approximations will be investigated and compared. New loss functions will be developed which overcome drawbacks of the existing approximations of the Levenshtein distance, like the frequently observed deletion bias.

A data structure of particular interest is the confusion network (CN). In previous work it was shown that a CN has a simple decoding rule in the Bayes risk framework. In this work new algorithms for deriving a CN from a word lattice will be developed and compared to existing methods. Furthermore, the CN will be the basis for several investigations aiming at improving the posterior probability estimates and the approximation of the Levenshtein distance. The methods looked into include classifier-based system combination and the usage of a windowed Levenshtein distance as loss function for the Bayes risk decoder. A further topic of research is the log-linear model combination, for which the enhancement with model- and word-dependent scaling factors will be investigated.

The methods are tested on the Chinese speech recognition systems used by RWTH Aachen in the GALE project and on the lattices provided within the English track of the 2007 TC-Star EPPS evaluation.
The best performing system combination methods investigated in this work improve the error rates by up to 10% relative for intra-site combination experiments and by more than 20% relative for cross-site combinations compared to the best single system. The newly developed methods show a slight improvement over the existing approaches to lattice decoding and lattice-based system combination.


Zusammenfassung

Automatic speech recognition addresses the task of converting spoken language into written text. The error of a speech recognition system is measured as the number of words in which the spoken and the recognized text differ. The topic of this work is the use of the Bayes risk framework with the goal of minimizing the error of a single system or of a combination of several systems. Due to the complexity of the methods, all experiments and investigations in this work are carried out on word lattices. A word lattice is the compact representation of a restricted hypothesis space generated by a preceding recognition pass. In the case of system combination, one word lattice is provided per system. The goal is to generate from the word lattices a final hypothesis with a lower word error than that of any of the individual systems.

In large vocabulary continuous speech recognition the word error is defined as the Levenshtein distance between the spoken and the recognized word sequence. If the true sentence probabilities are known, the Bayes risk framework yields the word sequence with the least expected error. In practice, however, neither are the true probabilities known, nor is the computation of the Bayes risk hypothesis on a word lattice tractable when the Levenshtein distance is used as loss function. This leads to the following two problems: first, how to estimate probabilities from the system-dependent word lattices; and second, how to approximate the Levenshtein distance such that the computation of the Bayes risk hypothesis becomes tractable.

In this work, based on the separation of the probability estimation and the loss function in the Bayes risk computation, a general framework for lattice-based system combination is developed. The framework covers the methods commonly used in practice, among others ROVER, CNC, and DMC. Furthermore, it is shown that the approximations of the Levenshtein distance common in speech recognition fall into two classes for which the Bayes risk hypothesis can be computed efficiently. The known approximations are investigated and compared. New methods are developed which compensate for the drawbacks of the existing approximations, in particular the frequently observed high number of deletions.

A data structure of particular interest is the confusion network (CN). In previous work it was shown that the Bayes risk hypothesis of a CN can be computed in a trivial way. In this work new methods for converting a word lattice into a CN are presented and compared with existing methods. Furthermore, the CN forms the basis for several approaches to improved probability estimation and to a more accurate approximation of the Levenshtein distance. The investigated approaches include classifier-based system combination and the use of a windowed Levenshtein distance as loss function in the computation of the Bayes risk hypothesis. A further topic investigated in this work is the log-linear model combination, for which model- and word-dependent scaling factors are introduced.

Experiments are carried out with the Chinese speech recognition systems developed at RWTH Aachen in the course of the GALE project, as well as with the word lattices provided within the 2007 TC-Star EPPS evaluation. The best system combination methods investigated in this work show a relative improvement in word error rate of up to 10% for the in-house lattice combination and of more than 20% for the combination of word lattices from several project partners, where the relative improvement refers to the error rate of the best single system. Compared to the existing methods for lattice-based system combination, the newly developed methods achieve slight improvements.


Acknowledgement

First of all I would like to thank my doctoral adviser, Prof. Dr.-Ing. Hermann Ney, head of the Chair of Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik 6, at RWTH Aachen University, for his support and his interest. He introduced me to speech recognition in 2004 when I started my studies as a PhD student and has since then given me the opportunity and the freedom to pursue my ideas. I would also like to thank Dr. Jean-Luc Gauvain for agreeing to review this thesis and for his interest in this work.

I am very grateful to Dr. Ralf Schlüter for his support in the field of Bayes risk decision theory and its application to speech recognition. His supportive coaching helped me to make my decisions and to define my long-term research goals. Special thanks go to Stephan Kanthak, who mentored me in my first year and introduced me to the concepts of transducers and their application to speech recognition.

I would like to thank all my colleagues in the speech recognition group for the great team play in doing (and winning) evaluations, designing our software, and developing new ideas. In no particular order these include Christian Gollan, Stefan Hahn, Georg Heigold, Jonas Lööf, Christian Plahl, and David Rybach. During my time at the Lehrstuhl für Informatik 6 I worked together with many people whom I would like to thank for the fruitful collaborations, especially Dustin Hillard for the great teamwork in developing the classifier-based approach to system combination, and Mei-Yuh Hwang for the challenging and exciting times in the GALE project. For the good times and the memorable moments I had at the Lehrstuhl für Informatik 6 I would like to thank all my former and current colleagues, including Sasa Hasan, Oliver Bender, Thomas Deselaers, Philippe Dreuw, Saab Mansour, David Vilar-Torres, Arne Mauser, Evgeny Matusov, and many more. Also, my thanks go to our system administration team and our secretariat for their always available help and their excellent support.

I am very thankful for the friendly atmosphere and the support I received at the NTT Communication Laboratories, Kyoto, Japan during my stay in 2009. Thanks go to all members of the laboratories, in particular to Erik McDermott, Takaaki Hori, and Shinji Watanabe.

Finally, I would like to thank my parents and all my family members for their understanding and encouragement during the long years of my doctoral studies and the writing of this thesis.

This work was partly funded by the European Commission under the integrated project TC-STAR (FP6-506738), partly realized as part of the Quaero Programme, funded by OSEO, the French state agency for innovation, and partly based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001-06-C-0023. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of DARPA.


Contents

1 Introduction
  1.1 Statistical Speech Recognition
  1.2 Signal Analysis/ Feature Extraction
  1.3 Acoustic Model
  1.4 Language Model
  1.5 Search
  1.6 Multi-Pass Search
    1.6.1 Lattices
    1.6.2 Speaker Adaptation
  1.7 Weighted Finite State Transducers
    1.7.1 Notation
    1.7.2 Algorithms
    1.7.3 WFSTs in ASR
  1.8 Bayes Risk Decoding: State of the Art
  1.9 Model and System Combination: State of the Art
    1.9.1 Log-linear Model Combination
    1.9.2 System Combination
    1.9.3 Cross-Adaptation

2 Scientific Goals

3 Lattice-Based System Combination in the Bayes Risk Decoding Framework
  3.1 WFSTs as a High-Level Programming Language for lattice-based System Combination
  3.2 Probabilities over Lattices
    3.2.1 Probabilities over a single Lattice
    3.2.2 Probabilities over the Lattice Intersection
    3.2.3 Probabilities over the Lattice Union
  3.3 Lattice-Based System Combination in the Bayes Risk Decoding Framework
    3.3.1 The MAP/Viterbi Decoding Framework
    3.3.2 MAP/Viterbi Decoding Results
    3.3.3 The Bayes Risk Decoding Framework with Local Cost Functions
  3.4 Confusion Network based System Combination in the Bayes Risk Decoding Framework
    3.4.1 Confusion Network Combination (CNC)
    3.4.2 ROVER: An Approximation of CNC
    3.4.3 Results
  3.5 The Lattice Combination Framework vs. State-of-the-Art in System Combination
  3.6 Lattice Pre-Processing for Bayes Risk Decoding and System Combination
    3.6.1 Lattice Normalization
    3.6.2 Lattice Pruning
    3.6.3 The non-Word Cloud Bias
    3.6.4 Results
  3.7 Parameter Optimization for Bayes Risk Decoding and System Combination
    3.7.1 Parameter Optimization based on the Downhill-Simplex Algorithm
    3.7.2 Parameter Optimization based on Minimum Risk Training
  3.8 Summary

4 Local Cost Functions for Bayes Risk Decoding
  4.1 Local Costs and the Deletion Bias
  4.2 Frame Error
    4.2.1 Partially Normalized Frame Error
    4.2.2 Symmetrically Normalized Frame Error
    4.2.3 Results
  4.3 Local Alignment based Error
    4.3.1 Povey's Approximation in MPE/MWE Training
    4.3.2 The 1/2 Overlap Approximation
    4.3.3 Results
  4.4 Confusion Network Distance based Error
    4.4.1 Distances between Arcs and Arc Clusters
    4.4.2 The Arc-Cluster CN Construction Algorithm
    4.4.3 The State-Cluster CN Construction Algorithm
    4.4.4 The Center-Frame CN Construction Algorithm
    4.4.5 Results
  4.5 Summary

5 Confusion Networks: Applications and Investigations
  5.1 Frame Level Confusion Networks
    5.1.1 Minimum- and Inverse-Entropy Combination
    5.1.2 Time Alignment with Frame Level CNs
    5.1.3 Results
  5.2 Word Level Confusion Networks
    5.2.1 Confidence Warping
    5.2.2 The windowed Levenshtein Distance
    5.2.3 Results
  5.3 Summary

6 Classifier based System Combination
  6.1 Combination with Classification
    6.1.1 Features
    6.1.2 Classifiers and Training
    6.1.3 The iROVER Approach
    6.1.4 The iCNC Approach
    6.1.5 The iCN Approach
  6.2 Experiments
    6.2.1 Experimental Setup
    6.2.2 Results
    6.2.3 Analysis
  6.3 Summary

7 Log-Linear Model Combination vs. System Combination
  7.1 Log-Linear Model Combination with Word-Dependent Scaling Factors
  7.2 Experiments
    7.2.1 Experimental Setup
    7.2.2 Results
  7.3 Summary

8 Scientific Contributions

9 Outlook

A The Deletion Bias in LVCSR Decoding

B Corpora and Systems
  B.1 Chinese GALE Systems
    B.1.1 The Chinese 230h Testing System
    B.1.2 The RWTH Aachen Chinese GALE 2008 Evaluation System
  B.2 English TC-Star/EPPS Systems
    B.2.1 The RWTH Aachen English EPPS 2007 Evaluation System
    B.2.2 The English EPPS 2007 Evaluation Cross-site Combination

C Experimental Results
  C.1 The Chinese 230h Testing System
  C.2 The RWTH Aachen Chinese GALE 2008 Evaluation System
  C.3 The RWTH Aachen English EPPS 2007 Evaluation System
  C.4 The English EPPS 2007 Evaluation Cross-site Combination

D Symbols and Acronyms
  D.1 Mathematical Symbols
  D.2 Acronyms

List of Figures

List of Tables

Bibliography

Chapter 1

Introduction

Speech is the most common and most natural way for humans to communicate, even in times of e-mail, chat, and blogs. This makes an automatic speech recognition (ASR) system the natural choice for a human-machine interface. In recent years a huge amount of audio and video data has become available on the world-wide web. Most of these podcasts, news, and home-made videos use speech as the natural form of communication. ASR is the first step in making the information contained in the speech data available to machine processing.

The speech recognition problem is defined as the task of converting an acoustic signal, which contains speech (the speech signal), to written text (the recognized word sequence). The automatic speech recognizer serves as a human-machine interface or provides the input for further machine processing like machine translation. Depending on the specific task, ASR systems have to fulfill certain requirements, e.g. an ASR system which serves as a human-machine interface has to work in real-time. The ASR systems considered in this thesis are large vocabulary continuous speech recognition (LVCSR) systems. The vocabulary contains 50,000 or more words, recognition is performed on complete utterances (in contrast to isolated word recognition), and real-time operation is not required.

Modern LVCSR systems use a statistical approach to find the sequence of words with the highest probability given the acoustic features. The signal analysis, which converts the speech signal into a sequence of features, happens in a pre-processing step and remains separate from the statistical approach.

The standard evaluation measure for LVCSR systems is the word error rate (WER). Bayes risk approaches in LVCSR aim at finding the word sequence which produces the least expected WER given the speech signal. The exact computation of the Bayes risk hypothesis in a modern LVCSR system is prohibitive and requires approximations. Usually it is applied in a post-processing step which follows a first decoding run that produces a set of alternative word sequences. In this thesis a variety of approximations for computing the minimum expected WER hypothesis are developed and analyzed. The WER for an utterance can be greatly reduced by combining several ASR systems. In this thesis a general framework is developed for system combination by applying the approximate minimum expected WER decoder to multiple systems.

1.1 Statistical Speech Recognition

The statistical approach to ASR takes a sequence of acoustic features $x_1^T$ as input and aims at finding the sequence of words $w_1^N$ which maximizes the posterior probability. The statistical approach applies Bayes' decision rule [Bayes 1763]:

$$ x_1^T \rightarrow \hat{W} := \operatorname*{argmax}_{w_1^N, N} p(w_1^N | x_1^T) = \operatorname*{argmax}_{w_1^N, N} p(x_1^T | w_1^N)\, p(w_1^N) \qquad (1.1) $$

The result is referred to as the maximum a-posteriori (MAP) hypothesis. The equation defines two stochastic models, the acoustic model $p(x_1^T | w_1^N)$ and the language model $p(w_1^N)$. The acoustic model computes the likelihood for observing the feature sequence $x_1^T$ given the word sequence $w_1^N$. The language model denotes the a-priori probability of the word sequence $w_1^N$.

A word $w_n$ in the word sequence $w_1^N$ is either taken from the finite alphabet $\Sigma$ (aka vocabulary) or equals the empty word $\varepsilon$, that is $w_1^N \in \left(\Sigma \cup \{\varepsilon\}\right)^N$. The convention of allowing the empty word at any position in the word sequence will be frequently used later when dealing with confusion networks.

[Figure 1.1. Basic architecture of a statistical automatic speech recognition system according to [Ney 1990]: the feature extraction converts the speech input into feature vectors $x_1 \ldots x_T$; the global search process maximizes $p(w_1 \ldots w_N)\, p(x_1 \ldots x_T | w_1 \ldots w_N)$ over the word sequences $w_1 \ldots w_N$, using the acoustic model (sub-word units, pronunciation lexicon) and the language model, and outputs the recognized word sequence.]

In the computation of the equality of two word sequences the empty word is not considered, e.g. it holds "a b" = "a $\varepsilon$ b" = "a b $\varepsilon$".

The extraction of the feature sequence $x_1^T$ from the continuous speech signal happens in a pre-processing step, the signal analysis. The signal analysis itself is based on models of the human auditory system. The resulting features are further processed by data-driven approaches, which ultimately yield the feature sequence $x_1^T$.

Figure 1.1 summarizes the interaction between feature extraction, acoustic model, and language model during the search. The search algorithm aims at finding the word sequence that fulfills Equation (1.1). The search space for an LVCSR system consists of all possible word sequences over the (finite) vocabulary. Its huge size makes the complete exploration of the search space prohibitive, and pruning techniques are used to restrict the effective number of hypotheses. The subset of the search space considered during the search process can be stored and used for applying sophisticated methods which are too complex to be applied to the full search space.

The main topic of this thesis is the application of Bayes risk decoding¹ and system combination as a post-processing step for LVCSR systems. The conventional decoding rule in Equation (1.1) aims at minimizing the number of incorrectly recognized word sequences or sentences. But the standard evaluation measure in LVCSR is the WER, which is based on the number of incorrectly recognized words. More precisely, the WER is the normalized Levenshtein or edit distance between the correct and the hypothesized sentence calculated on word level [Levenshtein 1966]. Considering the Levenshtein distance in the Bayes risk framework results in the decision rule

$$ x_1^T \rightarrow \hat{W} := \operatorname*{argmin}_{w_1^N, N} \sum_{v_1^M, M} p(v_1^M | x_1^T)\, \mathrm{Lev}(w_1^N, v_1^M), $$

where $\mathrm{Lev}(v_1^M, w_1^N)$ denotes the Levenshtein distance between the two word sequences $v_1^M$ and $w_1^N$ [Bishop 2006].

¹ In the speech recognition literature the term "minimum Bayes risk decoding" is frequently used. However, this terminology is misleading, as by definition the Bayes risk hypothesis is already the sequence producing the least number of expected errors, i.e. it is already the minimum.
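To make the decision rule concrete, the following is a minimal illustrative sketch of Bayes risk decoding over a small N-best list with the Levenshtein distance as loss function. The hypotheses and posterior values are made up, and the argmin is restricted to the N-best entries themselves rather than to a lattice, so this is not the lattice-based decoder developed in this thesis.

```python
# Illustrative sketch: Bayes risk decoding over a toy N-best list with the
# Levenshtein distance as loss function. Hypotheses and posteriors are made up.

def levenshtein(a, b):
    """Word-level edit distance between two word sequences."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                   d[j - 1] + 1,       # insertion
                                   prev + (wa != wb))  # substitution / match
    return d[-1]

def bayes_risk_decode(nbest):
    """nbest: list of (word tuple, posterior). Returns the hypothesis with the
    least expected Levenshtein distance w.r.t. the given posterior distribution."""
    return min(nbest, key=lambda wp: sum(p * levenshtein(wp[0], v)
                                         for v, p in nbest))[0]

nbest = [(("this", "is", "our", "duty"), 0.5),
         (("this", "is", "our", "view"), 0.3),
         (("it", "is", "our", "duty"), 0.2)]
print(bayes_risk_decode(nbest))
```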

The computation of this equation for an LVCSR task is computationally not feasible even for the reduced search space and requires further approximations. This thesis investigates a variety of approximations for Bayes risk decoding with the Levenshtein distance as loss function.

A successful way to decrease the WER for an utterance is to combine several models or systems. In the model combination approach all knowledge sources are combined into a single log-linear model from which the posterior probability $p(w_1^N | x_1^T)$ is computed. The knowledge sources combined in the log-linear model usually consist of the language model and several acoustic models. In the cross-adaptation approach two or more independently trained systems are combined, where the interaction between the systems takes place in the speaker adaptation step. The third and most common approach is to introduce the system as a hidden variable and to compute the marginal over the resulting weighted, system-dependent posteriors

$$ p(w_1^N | x_1^T) = \sum_{j=1}^{J} p(w_1^N, j | x_1^T) = \sum_{j=1}^{J} p(j | x_1^T)\, p(w_1^N | j, x_1^T), $$

for $J$ LVCSR systems. This type of combination is usually applied within the Bayes risk decoding framework. In this thesis all three approaches are considered, but the focus is on system combination within the Bayes risk framework.
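As an illustration of this marginalization, here is a minimal sketch, assuming each system already provides sentence posteriors over a shared hypothesis set and that the system weights $p(j|x_1^T)$ are fixed constants (e.g. tuned on development data):

```python
# Minimal sketch of the system-combination marginal: the hypotheses, posteriors,
# and system weights below are toy values, not output of a real recognizer.
from collections import defaultdict

def combine_posteriors(system_posteriors, system_weights):
    """system_posteriors: list of dicts {hypothesis: p(w | j, x)}, one per system j.
    system_weights: list of p(j | x) values summing to one.
    Returns the combined posterior p(w | x) = sum_j p(j | x) * p(w | j, x)."""
    combined = defaultdict(float)
    for weight, posteriors in zip(system_weights, system_posteriors):
        for hyp, prob in posteriors.items():
            combined[hyp] += weight * prob
    return dict(combined)

# Toy example with two systems and uniform system weights.
sys1 = {"this is our duty": 0.6, "this is our view": 0.4}
sys2 = {"this is our duty": 0.5, "it is our duty": 0.5}
print(combine_posteriors([sys1, sys2], [0.5, 0.5]))
```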

1.2 Signal Analysis/ Feature Extraction

The signal analysis and feature extraction module of the ASR system provides the statistical model with a sequence of observations or acoustic vectors. The goal is to keep only the information from the speech signal that is relevant for finding the correct word sequence. Discarding all the irrelevant information makes the acoustic model robust, e.g. to the intensity of the speech, to background noise, and to speaker gender and identity. The feature extraction of today's state-of-the-art LVCSR systems happens in three steps:

1. A first set of features is extracted from the speech signal based on models of the human auditory system.
2. The features are transformed, augmented, and/or reduced by parametric models, where the model parameters are estimated on the acoustic training data.
3. Speaker normalization steps are applied either to the features directly or to the acoustic model parameters in order to achieve speaker independence; usually the free parameters are estimated based on the result of a previous, unadapted recognition run.

The most common signal analysis applied in the first step is based on a short-term spectral analysis, usually a Fast Fourier Transformation (FFT) [Rabiner & Schafer 1979]. Widely used procedures for further processing the FFT result yield the Mel Frequency Cepstral Coefficients (MFCCs) [Davis & Mermelstein 1980] or the Perceptual Linear Predictives (PLPs) [Hermansky 1990]. Another feature now commonly used by RWTH Aachen are the Gammatone filter based features (GT), which work in the time domain [Aertsen & Johannesma+ 1980; Schlüter & Bezrukov+ 2007]. The recognition performance can be significantly improved by concatenating articulatory motivated acoustic features to the short-term FFT-based features [Kocharov & Zolnay+ 2005; Zolnay & Schlüter+ 2005].

An alternative approach which became popular in recent years is the usage of phone posterior probability estimates as acoustic features. In this approach features from the first step are fed into a classifier, usually a neural network, which outputs the posterior estimates [Chen & Zhu+ 2004; Hermansky & Ellis+ 2000; Valente & Vepa+ 2007]. The parameters of the classifier are estimated on the training data.

The features described above were designed to cope with European languages and do not consider tone information, that is, the contour of the pitch for a syllable. For tonal languages like Chinese, state-of-the-art speech recognition systems integrate an additional tone feature [Chang & Zhou+ 2000; Chen & Gopinath+ 1997; Chen & Li+ 2001; Lei & Siu+ 2006].

Dynamic information can be included by augmenting the feature vector with the first and second derivatives. A more general approach is to apply the Linear Discriminant Analysis (LDA) [Fisher 1936] or the heteroscedastic LDA (HLDA) [Kumar & Andreou 1998] to a window of usually 9 or 11 of the original feature vectors. The result is a linear transformation which projects the original features into a lower dimensional feature space such that the class separability is maximized, assuming that the data given a class follows a normal distribution. The (H)LDA is also successfully used to combine acoustic features from several feature extraction procedures, i.e. several short-term FFT features [Schlüter & Zolnay+ 2006] or short-term FFT and tone features for Chinese systems [Ng & Zhang+ 2008; Plahl & Hoffmeister+ 2008a].

The third step puts the focus on gender and speaker independence of the acoustic features, which is hard to achieve and usually not attained by the feature extraction procedures mentioned above. For example, the MFCC and PLP features are also used to detect the gender of the speaker [Stolcke & Bratt+ 2000] or even for speaker identification [Doddington & Przybocki+ 2000]. Several methods have been developed to reduce the speaker dependency of the acoustic features. Two widespread approaches are the vocal tract length normalization (VTLN) and the MLLR transformation [Gales & Woodland 1996; Lee & Rose 1996; Leggetter & Woodland 1995]. The MLLR approach consists of a speaker-dependent linear transformation of the model parameters and is discussed in more detail in Section 1.6. A comprehensive comparison of speaker normalization and adaptation methods is given in [Pitz 2005].
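As a simple illustration of the derivative augmentation mentioned above, first and second order time derivatives can be approximated by finite differences over neighbouring frames. This is a simplified sketch only, not the exact RWTH Aachen front end:

```python
# Simplified sketch of augmenting a feature sequence with first and second
# derivatives (delta and delta-delta features) via central finite differences.
import numpy as np

def add_derivatives(features):
    """features: array of shape (T, D). Returns an array of shape (T, 3*D)
    containing the static features, their deltas, and delta-deltas."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])           # central difference
    padded_d = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
    delta2 = 0.5 * (padded_d[2:] - padded_d[:-2])      # second derivative
    return np.concatenate([features, delta, delta2], axis=1)

x = np.random.randn(100, 16)     # e.g. 100 frames of 16-dimensional features
print(add_derivatives(x).shape)  # (100, 48)
```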

1.3 Acoustic Model

The stochastic model which computes the likelihood of the acoustic feature sequence $x_1^T$ given a word sequence $w_1^N$ is called the acoustic model. For LVCSR systems usually sub-word models like syllables, phonemes, or allophones are used instead of whole-word models. The pronunciation model $p(\psi_1^L | w_1^N)$ assigns a sequence of sub-word units $\psi_1^L$ to a sequence of words $w_1^N$. Most modern LVCSR systems use a finite pronunciation dictionary to store the (weighted) mapping from words to sequences of sub-word units. Assuming that the pronunciation of a word is independent of adjacent words yields Equation (1.2).

$$ p(x_1^T | w_1^N) = \sum_{\psi_1^L} p(x_1^T | \psi_1^L)\, p(\psi_1^L | w_1^N) = \sum_{\psi_1^L} p(x_1^T | \psi_1^L) \prod_{n=1}^{N} p(\psi_{l_{n-1}+1}^{l_n} | w_n) \qquad (1.2) $$

The advantage of sub-word units is that they reduce the model complexity, which allows a reliable parameter estimation. Another advantage is that the search vocabulary need not be equal to or a subset of the training vocabulary. The acoustic model for a new word with known pronunciation is assembled from the corresponding sequence of sub-word units. Even if a word is not in the pronunciation dictionary, i.e. a new word with unknown pronunciation, there exist algorithms which compute a matching sequence of sub-word units with high accuracy [Bisani & Ney 2003].

The common approach for modern LVCSR systems is to use a two-stage mapping. First, the pronunciation dictionary provides the weighted mapping from the word to a phoneme sequence. This is followed by the unique mapping from phonemes to triphones, where a triphone is a phoneme together with its predecessor and successor; some systems use a larger context, so-called quinphones, septaphones, etc. The motivation for context-dependent phonemes is the observation that the articulation of a phoneme highly depends on the adjacent phonemes. In general, the acoustic realization of a phoneme is called an allophone, and the triphone is the most common way in LVCSR systems to model allophones. If the context is considered across word boundaries, the resulting acoustic model is called an across-word model [Sixtus 2003].

Natural speech shows a great variability in speaking rate. The quasi-standard approach to cope with the varying acoustic realization of sub-word units at different speaking rates is the Hidden Markov Model (HMM) [Baker 1975; Rabiner & Juang 1986]. An HMM is a stochastic finite state automaton, where the states represent (hidden) random variables which cannot be observed directly. The output of an HMM is generated according to probability distributions which depend on the values $s_1^T$ of the hidden variables. The HMM is a generative model, and an HMM representing an acoustic model generates feature sequences $x_1^T$.
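The two-stage mapping can be illustrated with a small sketch; the toy pronunciation dictionary below is hypothetical, and only within-word triphones are built:

```python
# Illustrative sketch of the two-stage mapping from words to context-dependent
# sub-word units. The toy pronunciation dictionary is hypothetical; only
# within-word triphones are built, i.e. no across-word context.
LEXICON = {"seven": ["s", "eh", "v", "un"],
           "one":   ["w", "ah", "n"]}

def words_to_triphones(words, lexicon=LEXICON):
    """Expand a word sequence into triphones (predecessor, phoneme, successor);
    '#' marks a word boundary."""
    triphones = []
    for word in words:
        padded = ["#"] + lexicon[word] + ["#"]
        for i in range(1, len(padded) - 1):
            triphones.append((padded[i - 1], padded[i], padded[i + 1]))
    return triphones

print(words_to_triphones(["seven", "one"]))
```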

The acoustic probability for observing $x_1^T$ given the word sequence $w_1^N$ is the marginal over all possible state sequences:

$$ p(x_1^T | w_1^N) = \sum_{s_1^T : w_1^N} p(x_1^T, s_1^T | w_1^N) = \sum_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t | x_1^{t-1}, s_1^t; w_1^N)\, p(s_t | s_1^{t-1}; w_1^N) \qquad (1.3) $$

The equation is simplified by applying the first-order Markov assumption [Duda & Hart+ 2001]. The assumption states that the probabilities at time $t$ do not depend on previous observations, but only on the current and the immediately preceding state. Furthermore, it is assumed that the probability of an observation depends only on the current state. Under these assumptions Equation (1.3) simplifies to:

$$ p(x_1^T | w_1^N) = \sum_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t | s_t; w_1^N)\, p(s_t | s_{t-1}; w_1^N) \qquad (1.4) $$

In the so-called Viterbi or maximum approximation the sum in Equation (1.4) is replaced by the maximum:

$$ p(x_1^T | w_1^N) = \max_{s_1^T : w_1^N} \prod_{t=1}^{T} p(x_t | s_t; w_1^N)\, p(s_t | s_{t-1}; w_1^N) \qquad (1.5) $$

According to Equation (1.4), two probability distributions have to be considered: the emission probability $p(x_t | s_t; w_1^N)$ and the transition probability $p(s_t | s_{t-1}; w_1^N)$. The emission probability denotes the probability of observing the acoustic feature vector $x_t$ while being in state $s_t$. The transition probability is the probability of moving from state $s_{t-1}$ to state $s_t$.

A triphone is usually modeled by a linear HMM with three to six states. The possible transitions are the loop transition going from a state back to itself, the forward transition connecting to the next state, and the skip transition, which skips the next state and goes to the state after the next. Six-state models like the topology introduced by Bakis [Bakis 1976] use the skip transition, whereas some three-state topologies forbid the skip. In the Bakis topology each two successive states are identical, which makes it almost equivalent to a three-state topology without skip. Both models are inadequate for fast speech, because they absorb at least 30ms of speech considering the standard frame shift for ASR systems of around 10ms [Molau 2003]. In this case the common choice is a three-state model with skip. The HMM for a sequence of words is assembled by concatenating the HMMs of the corresponding triphone sequence.

Equation (1.4) and Equation (1.5) are also referred to as the time alignment problem. The result computed for a particular word sequence $w_1^N$ is called the forced acoustic alignment of $w_1^N$. An efficient algorithm for solving the time alignment problem based on dynamic programming [Bellman 1957; Ney 1984; Viterbi 1967] is the forward-backward algorithm for HMMs [Baum 1972; Rabiner & Juang 1986]. Figure 1.2 shows an example of a time alignment in speech recognition. For a part of the word "seven" the corresponding HMM is constructed using the Bakis topology and aligned against a sequence of acoustic feature vectors. In the time alignment the HMM is unrolled along the time axis and the resulting graph is referred to as the trellis. The trellis visualizes the complete search space for the time alignment. In the Viterbi approximation, cf. Equation (1.5), the solution is the path from the lower left to the upper right corner with the highest probability.

[Figure 1.2. 6-state hidden Markov model in Bakis topology for the triphone "s eh v" in the word "seven" and the resulting trellis for a time alignment. The HMM segments are denoted by <1>, <2>, and <3>.]

The emission probabilities $p(x_t | s_t; w_1^N)$ of the HMM are usually modeled by Gaussian mixture models (GMMs). Alternative approaches are discrete probabilities [Jelinek 1976], semi-continuous probabilities [Huang & Jack 1989], or other continuous probability distributions like mixtures of Laplacians [Haeb-Umbach & Aubert+ 1998; Levinson & Rabiner+ 1983]. The RWTH Aachen system uses the GMMs defined in Equation (1.6).

$$ p(x | s; w_1^N) = \sum_{l=1}^{L_s} c_{sl}\, \mathcal{N}(x | \mu_{sl}, \Sigma_{sl}; w_1^N) \qquad (1.6) $$

The emission probability for state $s$ is described by a GMM of $L_s$ Gaussian densities $\mathcal{N}(x | \mu_{sl}, \Sigma_{sl}; w_1^N)$ with mean vector $\mu_{sl}$, covariance matrix $\Sigma_{sl}$, and non-negative mixture weights $c_{sl}$, where the mixture weights are subject to the constraint $\sum_{l=1}^{L_s} c_{sl} = 1$. The LVCSR systems at RWTH Aachen use only a single, globally pooled and diagonal covariance matrix $\Sigma$. The choice is made to avoid data sparseness problems in the acoustic model training. Using a diagonal covariance matrix requires that the components of the acoustic features are decorrelated, which happens for the RWTH Aachen LVCSR systems in the feature extraction step by applying a discrete cosine transformation.

The free parameters of the acoustic model $\mu_{sl}$, $c_{sl}$, and $\Sigma$ are estimated by applying Maximum Likelihood (ML) estimation in combination with the Expectation Maximization (EM) algorithm [Dempster & Laird+ 1977]. In state-of-the-art LVCSR systems the ML/EM training is followed by a discriminative refinement of the acoustic model parameters [Bahl & Padmanabhan+ 1996; Schlüter 2000; Woodland & Povey 2002]. In the discriminative training step the objective is to maximize the a-posteriori probability of the correct sentence [Bahl & Brown+ 1986; Normandin & Lacouture+ 1994] or to minimize the word or phoneme error rate on the training data [Juang & Katagiri 1992; Kaiser & Horvat+ 2000; McDermott & Katagiri 2005; Povey & Woodland 2002].

In the RWTH Aachen system the transition probabilities are replaced by so-called time distortion penalties (TDPs). The TDPs depend only on the transition type, but not on the state itself. A special case is the HMM for the silence model, which consists of only a single state and has separate TDPs.

= p(wn |w1N −1 ) · p(wN −1 |w1N −2 ) · · · p(w1 ) N −1 N −2 = p(wn |wN −m+1 ) · p(wN −1 |wN −m ) · · · p(w1 )

(1.7)

The consecutive sequence of m words is called an m-gram and in the general case the history hn of word n−1 wn is a function of wn−m+1 . For the standard m-gram model hn is the identity; examples for alternative

6

1.5 Search history functions are the class language model or the trigger models [Martin 2000]. n−1 The estimates for p(wn |wn−m+1 ) are usually based on the relative frequencies computed on a large training set of transcripts of speech and written text. The relative frequency is the optimal solution if the m-gram language model is optimized w.r.t. the perplexity (PP) of the training data. " log P P (w1N )

=

log

N Y

#−1/N n−1 p(wn |wn−m+1 )

n=1

= −

N 1 X n−1 log p(wn |wn−m+1 ) N n=1

(1.8)

The log-perplexity defined in Equation (1.8) is a common evaluation measure for m-gram language models. It equals the entropy of the model and can be interpreted as the number of different words which follow on average any given history hn . However, the number of possible m-grams grows exponentially in m and for LVCSR tasks many mgrams are not seen in the training data or have only very few observations. Applied to the test data, any word sequence containing a single unseen m-gram has a probability of zero and an infinite log-PP. Therefore, the relative frequencies have to be smoothed. Common smoothing techniques are based on discounting followed by backing-off or interpolation [Generet & Ney+ 1995; Katz 1987; Ney & Essen+ 1994; Ney & Martin+ 1997]. In the discounting step probability mass is removed from the relative frequencies. The backing-off or interpolation step distributes the discounted probability mass over all unseen m-grams (backing-off) or over all m-grams (interpolation). A popular method to estimate the parameters of a smoothed language model is leaving-one-out, a cross-validation approach [Ney & Essen+ 1994].

1.5 Search The search problem in ASR consists of finding an efficient algorithm and appropriate approximations for ˆ which maximizes the a-posteriori probability solving Equation (1.1), i.e. for finding the word sequence W ˆ |xT ) for a given feature sequence xT . As shown in Figure 1.1 the search combines the different p(W 1 1 knowledge sources: the acoustic model (including the pronunciation model) and the language model. If the acoustic model is an HMM as described in Equation (1.4) and the language model is an m-gram model following Equation (1.7), then Equation (1.9) describes the resulting optimization problem. ( N  X Y ) T Y n−1 T N N ˆ p(wn |w ) p(xt |st ; w1 )p(st |st−1 ; w1 ) x1 → W = argmax w1N ,N

Viterbi

=

argmax w1N ,N

n−m+1

n=1

( N Y n=1

N t=1 sT 1 :w1

 ) T Y n−1 N N p(wn |wn−m+1 ) max p(xt |st ; w1 )p(st |st−1 ; w1 ) N sT 1 :w1

(1.9)

t=1

The optimization problem can be efficiently solved by using dynamic programming [Bellman 1957]. The Markov assumptions and the Viterbi approximation yield a mathematical structure which divides the global optimization problem in Equation (1.9) into sub-problems with local dependencies and allows the application of dynamic programming. In general, the search can be organized in two ways: depth-first or breadth-first. Prominent instances of the depth-first search (aka stack decoding algorithms) are the Dijkstra [Dijkstra 1959] and the A∗ algorithm [Jelinek 1969; Paul 1991]. The hypotheses space is explored in a time-asynchronous manner according to the stack organization. In the A∗ algorithm the stack is sorted by a heuristic estimate of the cost to complete a hypothesis. In contrast, in the breadth-first search all hypotheses are expanded in a time-synchronous manner [Baker 1975; Ney 1984; Sakoe 1979; Vintsyuk 1971]. However, for LVCSR tasks the resulting search space is still huge and a full exploration is prohibitive. Modern recognizer use pruning techniques to visit only the promising parts of the search space thereby avoiding search errors. A search error occurs if due to pruning the output of the recognizer differs from the solution of Equation (1.9). In an A∗ decoder pruning is applied by removing the least promising partial paths from the stack. The quality of the pruning depends on the quality of the heuristic cost estimate. In

7

Chapter 1 Introduction contrast, the standard pruning for breadth-first search decoders does not require an explicit heuristic. In a breadth-first search implementation the likelihoods for all hypotheses are computed at each time frame. The so-called beam pruning compares at each time frame the likelihoods and keeps only those hypotheses which have likelihoods sufficiently close to the one of the current best hypothesis [Lowerre 1976; Ney & Mergel+ 1987; Ortmanns & Ney 1995]. A careful tuning of the pruning parameters yields a considerable reduction of the search effort without having a significant number of search errors. Beam search approaches for LVCSR decoders are in particular effective in combination with lexical prefix trees [Ney & H¨ ab-Umbach+ 1992; Ortmanns & Eiden+ 1998]. Pronunciations with common prefixes are laid together in the lexical prefix tree. Pruning in the early stages of the tree removes whole sub-trees and eventually discards large parts of the search space. Language model look-ahead techniques aim at considering the language model probabilities in the early stages of the lexical prefix tree [Alleva & Huang+ 1996; Ortmanns & Ney+ 1996; Steinbiss & Ney+ 1993]. Weighted finite state transducer (WFST) provide a generic way to optimize the search space [Allauzen & Mohri+ 2004; Mohri & Riley 1997]. The acoustic model (HMM) and the language model (m-gram model) have natural WFST representations and the respective WFSTs can be combined and minimized by using generic algorithms. In particular, the lexical prefix tree and the language model look-ahead technique are implicitly applied by a WFST decoder using a minimized static search space transducer [Kanthak & Ney+ 2002]. WFSTs and the construction of the static search space transducer are discussed in Section 1.7. Other methods to reduce the computational complexity of the search include fast likelihood computation [Cardinal & Dumouchel+ 2008; Kanthak & Sch¨ utz+ 2000; Ortmanns & Ney+ 1997b; Parihar & + Schl¨ uter 2009; Ramasubramansian &Paliwal 1992], several look-ahead techniques [Alleva &Huang+ 1996; H¨ ab-Umbach & Ney 1994; Ortmanns & Ney+ 1996], and multi-pass approaches, where a fast first pass reduces the search space for the ultimate Viterbi search [Ljolje & Pereira+ 1999; Murveit & Butzberger+ 1993; Ney & Aubert 1994; Ortmanns & Ney+ 1997a; Schwartz & Chow 1990].

1.6 Multi-Pass Search State-of-the-art LVCSR recognizers perform multiple recognition and/or re-scoring passes, see for example [Evermann & Chan+ 2003; Hoffmeister & Plahl+ 2007; Prasad & Matsoukas+ 2005]. Supervised adaptation techniques like standard VTLN, MLLR, constrained MLLR, and domain specific language model adaptation require a reference transcription (supervisor). In a multi-pass decoder the output of the first, unadapted recognition run serves as supervisor for the adaptation step, which is followed by a second recognition run with the adapted models and/or adapted features. Some models and techniques cannot be applied during the Viterbi search because of their complexity, like the language model used in [Emami &Papineni+ 2007] or the phoneme duration model in [Jennequin & Gauvain 2007]. They are applied to a restricted search space, which is the result of an extended Viterbi search: instead of finding a single hypothesis, the search algorithm narrows the search space. The result is an N -best list or a lattice containing the best scoring word sequences, which is subsequently re-scored with the sophisticated model. N -best lists or lattices are also used for applying Bayes risk decoding with the (approximate) Levenshtein distance as loss function and for system combination approaches, cf. Section 1.8 and Section 1.9.

1.6.1 Lattices A word or phoneme lattice is a directed, acyclic graph with time stamps on the states and labels on the arcs. In a word lattice the label is usually the word together with the pronunciation of the word, in a phoneme lattice the label is simply a phoneme. In addition, for each arc the acoustic and the language model score from the Viterbi decoding is stored. An example for a word lattice produced by a LVCSR system is shown in Figure 1.3. The goal in lattice creation is to store in a compact form a large number of hypotheses, where the number of hypotheses is usually by magnitudes larger than the size of any feasible N -best list. The exact properties of a lattice depend on the search algorithm, i.e. on the HMM decoder design, and on subsequent filter steps. The default Viterbi search of the RWTH Aachen LVCSR system is a time-synchronous wordconditioned tree search implementation [Beulen & Ortmanns+ 1999]. A word lattice produced by the

8

1.6 Multi-Pass Search the/[2.03222e-99 1.09534e-21 1] 130 t=579 is/[0 3.88192e-47 1] 142 t=518

137 t=571

means/[0 1.90778e-61 1]

the/[2.03222e-99 8.04422e-20 1]

*EPS*/[1.07264e-229 4.28655e-36 1] is/[0 4.0873e-44 1] 134 t=571

the/[2.03222e-99 7.64304e-15 1]

138 t=579

129 t=571

means/[0 6.23786e-66 1]

is/[0 1.35408e-26 1]

132 t=550

is/[8.85425e-283 3.88192e-47 1]

of/[9.54218e-180 4.77794e-40 1]

is/[0 5.74511e-49 1] and/[0 4.36953e-23 1] *EPS*/[0 8.27889e-22 1]

163 t=410

and/[4.83981e-130 1.41816e-30 1] 162 t=393

not/[1.29216e-266 4.8724e-46 1]

*EPS*/[2.999e-82 1.62337e-33 1] 152 t=410

not/[1.29216e-266 4.8724e-46 1]

153 t=432

*EPS*/[2.01606e-241 1.36457e-13 1]

violence/[0 1.5141e-82 1]

*EPS*/[1.1418e-223 1.36457e-13 1]

violence/[0 1.5141e-82 1]

131 t=502

is/[8.85425e-283 3.88192e-47 1]

113 t=579

109 t=574

means/[0 8.88299e-70 1]

105 t=620

24 t=777

158 t=387

nonviolence/[0 8.35058e-72 1]

0 t=0

and/[1.28654e-158 1.41816e-30 1] and/[6.34569e-156 1.41816e-30 1]

181 t=10

this/[1.20552e-220 2.31015e-28 1]

1 t=14

it/[7.72228e-188 1.08551e-16 1]

179 t=13

it/[1.84836e-197 1.08551e-16 1]

182 t=29

2 t=30,cw

is/[1.03919e-192 1.36373e-06 1]

is/[7.72228e-188 1.08551e-16 1] is/[1.84836e-197 1.08551e-16 1]

180 t=29,cw

183 t=46

our/[1.24367e-144 1.72061e-43 1]

our/[1.24352e-144 1.37505e-36 1] 3 t=46

184 t=62

4 t=62

duty/[0 2.27172e-31 1]

duty/[0 8.3294e-11 1]

view/[0 5.90157e-66 1]

185 t=114

176 t=113

177 t=116

to/[1.00356e-198 9.75866e-12 1]

duty/[0 8.3294e-11 1]

duty/[0 8.3294e-11 1]

6 t=130

promote/[0 4.92486e-42 1] promote/[0 4.92486e-42 1]

173 t=173

*EPS*/[1.81627e-68 1 1]

167 t=275

168 t=297

that/[4.19809e-258 4.60059e-32 1]

and/[4.83981e-130 3.28684e-23 1] 160 t=410

democracy/[0 1.07628e-67 1]

to/[1.00356e-198 0.000577 1]

the/[9.10426e-139 1.18146e-13 1]

154 t=187

7 t=178

155 t=246

*EPS*/[1.97076e-51 1 1]

156 t=250

that/[0 1.40043e-10 1]

164 t=295

*EPS*/[1.19395e-62 1 1]

165 t=298

169 t=391 *EPS*/[9.09939e-51 1 1]

democracy/[0 1.73989e-65 1]

view/[0 5.90157e-66 1]

the/[7.37339e-115 1.18146e-13 1]

promote/[0 4.92486e-42 1] 175 t=130

174 t=177

157 t=297

that/[0 1.40043e-10 1]

149 t=390

171 t=393

view/[0 4.91764e-64 1]

9 t=246

*EPS*/[1.97076e-51 1 1]

10 t=250

that/[0 3.32602e-23 1]

11 t=297

*EPS*/[2.94332e-56 1 1]

16 t=502

*EPS*/[2.94332e-56 1 1]

nonviolence/[0 8.35058e-72 1] not/[3.47273e-251 8.77168e-43 1]

*EPS*/[4.4412e-98 1.86521e-38 1]

98 t=785

*EPS*/[2.54177e-212 1.86521e-38 1]

95 t=795

and/[1.63804e-147 4.61212e-20 1]

23 t=774

*EPS*/[4.77718e-224 4.50347e-23 1]

37 t=795

change/[0 3.26491e-124 1]

124 t=550

106 t=637

a/[8.73928e-59 1.25481e-20 1]

133 t=574

127 t=574

110 t=574

14 t=432

violence/[0 6.75159e-77 1]

change/[0 1.66753e-70 1]

governing/[0 6.19343e-87 1] 34 t=838 governing/[0 6.19343e-87 1] 26 t=838

119 t=495

*EPS*/[1.07244e-229 6.2917e-39 1] *EPS*/[2.01569e-241 8.40927e-19 1]

a/[8.73928e-59 1.7473e-33 1] a/[8.73928e-59 1.7473e-33 1]

is/[8.85425e-283 8.7503e-46 1]

change/[0 1.66753e-70 1]

making/[0 1.11055e-36 1]

20 t=620

21 t=637

making/[0 1.11055e-36 1]

99 t=680

*EPS*/[8.34736e-110 4.50347e-23 1] *EPS*/[2.82451e-186 4.50347e-23 1] *EPS*/[4.4412e-98 1.56588e-28 1]

change/[0 1.66753e-70 1]

means/[0 6.94785e-54 1]

111 t=579

18 t=574

*EPS*/[2.91473e-69 1 1]

100 t=683

*EPS*/[4.77718e-224 2.25297e-17 1] 36 t=773

64 t=785

*EPS*/[4.88111e-69 1 1]

41 t=838

114 t=495

*EPS*/[1.14152e-223 8.40927e-19 1] *EPS*/[1.07244e-229 4.68203e-25 1]

65 t=785

38 t=820

ourselves/[0 6.7434e-59 1] governing/[0 1.3092e-105 1] governing/[0 1.3092e-105 1] governing/[0 1.3092e-105 1]

of/[1.09194e-231 6.02311e-44 1]

governing/[0 1.23805e-95 1]

of/[1.09194e-231 1.16289e-36 1] 56 t=820

and/[0 1.41816e-30 1] and/[1.18761e-278 1.41816e-30 1]

90 t=776

and/[0 1.41816e-30 1]

*EPS*/[4.88111e-69 1 1]

of/[9.54218e-180 1.16289e-36 1] of/[3.17936e-176 1.16289e-36 1]

and/[3.31275e-275 1.41816e-30 1]

55 t=795

58 t=835

59 t=838 governing/[0 1.23805e-95 1]

and/[0 1.41816e-30 1]

is/[8.85425e-283 5.74511e-49 1]

118 t=518

is/[8.85425e-283 5.74511e-49 1]

of/[3.63123e-228 1.16289e-36 1]

57 t=838

governing/[0 1.3092e-105 1]

is/[0 5.74511e-49 1]

92 t=820

*EPS*/[3.86884e-43 6.18154e-29 1] *EPS*/[7.2716e-55 1 1]

32 t=896

84 t=838

31 t=896

governing/[0 1.37219e-86 1]

ourselves/[0 6.7434e-59 1]

governing/[0 1.37219e-86 1]

of/[1.27195e-151 1.16289e-36 1] governing/[0 1.3092e-105 1] of/[1.09194e-231 1.16289e-36 1]

of/[9.54218e-180 2.81111e-47 1]

the/[8.10311e-104 2.73124e-27 1] 83 t=835

governing/[0 1.37219e-86 1] 77 t=838

of/[1.09194e-231 2.81111e-47 1] of/[3.17936e-176 2.81111e-47 1] 102 t=579

76 t=820

78 t=892

and/[1.35694e-105 1.59517e-28 1]

governing/[0 1.37219e-86 1]

79 t=898

ourselves/[0 1.01674e-78 1]

governing/[0 1.37219e-86 1] of/[3.63123e-228 2.81111e-47 1]

governing/[0 1.0267e-74 1]

45 t=895

ourselves/[0 3.36005e-80 1]

of/[1.27195e-151 2.81111e-47 1] and/[1.65264e-210 1.41816e-30 1]

148 t=574

ourselves/[0 1.69597e-103 1]

governing/[0 1.37219e-86 1] governing/[0 1.37219e-86 1]

60 t=832 *EPS*/[4.88111e-69 1 1]

91 t=820

116 t=574

a/[8.73928e-59 4.78153e-17 1] is/[0 7.75183e-19 1]

30 t=893

governing/[0 1.37219e-86 1]

of/[3.63123e-228 1.16289e-36 1] of/[9.54218e-180 1.16289e-36 1]

is/[0 8.7503e-46 1]

115 t=550

governing/[0 1.3092e-105 1] governing/[0 1.37219e-86 1]

of/[3.17936e-176 1.16289e-36 1]

of/[4.23801e-148 1.16289e-36 1] and/[0 1.77789e-26 1]

a/[8.73928e-59 4.88331e-20 1]

*EPS*/[2.94332e-56 1 1]

governing/[0 1.3092e-105 1]

of/[1.27195e-151 1.16289e-36 1] 61 t=820

the/[8.10311e-104 9.2114e-16 1]

means/[0 9.16526e-48 1]

101 t=574

*EPS*/[0 7.59968e-18 1] 146 t=498

27 t=895

governing/[0 1.23805e-95 1]

42 t=832

of/[1.27195e-151 6.02311e-44 1]

and/[0 1.77789e-26 1]

and/[1.65264e-210 1.77789e-26 1]

*EPS*/[1.32048e-59 1.56588e-28 1]

is/[0 5.74511e-49 1]

117 t=518

*EPS*/[6.07362e-212 4.68203e-25 1]

violence/[0 3.49312e-79 1]

governing/[0 1.23805e-95 1]

governing/[0 1.23805e-95 1]

of/[9.54218e-180 6.02311e-44 1] of/[4.23801e-148 6.02311e-44 1]

and/[0 1.41816e-30 1]

*EPS*/[2.82451e-186 2.25297e-17 1]

*EPS*/[7.2977e-31 1 1]

violence/[0 3.49312e-79 1]

governing/[0 6.19343e-87 1]

governing/[0 1.23805e-95 1] governing/[0 1.23805e-95 1]

40 t=835

governing/[0 6.19343e-87 1] 50 t=820

and/[3.31275e-275 1.77789e-26 1] and/[0 1.77789e-26 1]

*EPS*/[2.48188e-71 4.50347e-23 1] change/[0 1.66753e-70 1]

*EPS*/[3.88273e-19 2.25297e-17 1]

128 t=579

a/[8.73928e-59 5.8716e-22 1] *EPS*/[0 8.40927e-19 1]

governing/[0 6.19343e-87 1] governing/[0 6.19343e-87 1]

governing/[0 1.23805e-95 1]

39 t=838

of/[3.17936e-176 1.80878e-37 1]

of/[4.23801e-148 1.16289e-36 1] violence/[0 6.75159e-77 1] violence/[0 6.75159e-77 1]

violence/[0 6.75159e-77 1]

145 t=433

*EPS*/[4.88111e-69 1 1]

of/[3.63123e-228 4.77794e-40 1]

of/[1.27195e-151 1.80878e-37 1]

of/[3.17936e-176 6.02311e-44 1] and/[1.18761e-278 1.77789e-26 1]

94 t=767

means/[0 5.62587e-52 1]

means/[0 2.64702e-50 1] 17 t=550

22 t=681

a/[8.73928e-59 1.93838e-21 1]

a/[8.73928e-59 1.7473e-33 1]

is/[0 7.48063e-32 1]

126 t=550

making/[0 8.30359e-39 1] making/[0 8.30359e-39 1]

19 t=579

120 t=518 is/[8.85425e-283 4.0873e-44 1]

violence/[0 6.75159e-77 1] not/[1.29216e-266 8.77168e-43 1] not/[7.36733e-263 8.77168e-43 1]

97 t=820

and/[1.18761e-278 5.01491e-31 1]

governing/[0 1.3092e-105 1] 125 t=574

of/[1.45334e-156 0.00432893 1] *EPS*/[0 1.36457e-13 1] *EPS*/[0 5.17766e-15 1]

nonviolence/[0 8.35058e-72 1]

and/[1.37723e-177 4.61212e-20 1] 13 t=411

is/[0 1.35408e-26 1]

*EPS*/[6.07491e-212 4.85633e-20 1] 135 t=495

*EPS*/[4.8464e-45 7.62629e-44 1]

33 t=835

of/[3.17936e-176 4.77794e-40 1]

of/[3.63123e-228 6.02311e-44 1]

and/[0 5.01491e-31 1]

of/[4.23801e-148 1.80878e-37 1]

is/[0 3.88192e-47 1]

*EPS*/[0 8.40927e-19 1]

*EPS*/[0 1.36457e-13 1]

*EPS*/[0 5.17766e-15 1]

*EPS*/[9.10901e-57 1 1] 143 t=430

96 t=820

of/[3.63123e-228 1.80878e-37 1] 104 t=579

103 t=571

is/[0 5.74511e-49 1]

is/[0 5.74511e-49 1] 139 t=518 123 t=502

121 t=410

150 t=393

12 t=391

violence/[0 1.69133e-81 1] violence/[0 6.75159e-77 1]

122 t=498

nonviolence/[0 8.35058e-72 1] and/[5.80131e-128 4.61212e-20 1]

*EPS*/[9.09939e-51 1 1]

democracy/[0 1.73989e-65 1]

democracy/[0 6.95111e-60 1]

*EPS*/[1.56599e-44 7.59968e-18 1] 15 t=498

35 t=820

of/[9.54218e-180 1.80878e-37 1]

and/[0 5.01491e-31 1]

of/[1.45334e-156 4.0207e-09 1]

the/[2.03222e-99 2.1184e-22 1] is/[0 7.48063e-32 1] is/[0 4.0873e-44 1]

violence/[0 6.75159e-77 1] violence/[0 6.75159e-77 1] violence/[0 1.69133e-81 1]

161 t=432 nonviolence/[0 8.35058e-72 1] and/[5.80131e-128 1.07062e-19 1]

democracy/[0 6.95111e-60 1] 8 t=187

not/[1.29216e-266 6.25991e-37 1]

and/[1.63804e-147 1.07062e-19 1]

and/[4.87763e-158 4.61212e-20 1]

their/[3.80785e-129 1.56369e-37 1]

144 t=433

is/[0 5.74511e-49 1]

democracy/[0 1.07628e-67 1]

172 t=236

democracy/[0 1.73989e-65 1]

to/[1.91213e-173 0.000577 1]

5 t=114

*EPS*/[1.01484e-273 9.5777e-35 1] *EPS*/[9.36614e-317 9.5777e-35 1]

170 t=390 view/[0 5.90157e-66 1] to/[6.29912e-153 0.000577 1]

our/[9.086e-178 1.37505e-36 1] 178 t=62

166 t=241

to/[1.91213e-173 9.75866e-12 1]

*EPS*/[6.55955e-65 1 1]

107 t=502

and/[3.31275e-275 5.01491e-31 1] 159 t=393

*EPS*/[4.84131e-39 1.62337e-33 1]

and/[3.45924e-117 1.41816e-30 1]

*EPS*/[1.56599e-44 5.17766e-15 1]

of/[1.09194e-231 1.80878e-37 1]

means/[0 1.6659e-55 1] 140 t=571

*EPS*/[1.07264e-229 4.85633e-20 1]

*EPS*/[2.999e-82 1.70297e-31 1]

democracy/[0 1.73989e-65 1]

and/[0 4.36953e-23 1]

a/[8.73928e-59 2.34832e-25 1]

is/[0 5.74511e-49 1] *EPS*/[0 8.27889e-22 1]

151 t=393

25 t=820

of/[1.09194e-231 4.77794e-40 1] the/[2.03222e-99 7.64304e-15 1] the/[2.03222e-99 7.64304e-15 1]

*EPS*/[1.0355e-28 1 1] 108 t=550

141 t=495

nonviolence/[0 8.35058e-72 1]

and/[5.80131e-128 1.41816e-30 1]

112 t=571

136 t=518

governing/[0 1.0267e-74 1] ourselves/[0 5.51977e-93 1] 86 t=820

147 t=502

of/[4.23801e-148 2.81111e-47 1] of/[4.53105e-184 2.81111e-47 1]

85 t=832

the/[8.10311e-104 9.25424e-11 1]

43 t=838

governing/[0 2.18674e-68 1] governing/[0 2.18674e-68 1]

44 t=895

governing/[0 2.18674e-68 1]

and/[0 1.41816e-30 1]

51 t=815

of/[1.50984e-180 2.81111e-47 1]

62 t=815

not/[8.97892e-312 4.8724e-46 1]

governing/[0 2.18674e-68 1] a/[5.87733e-60 7.9538e-36 1]

not/[8.97892e-312 1.18765e-35 1]

and/[0 1.62969e-24 1]

52 t=837

governing/[0 1.50405e-97 1]

and/[0 1.62969e-24 1]

46 t=893

governing/[0 1.0267e-74 1]

81 t=838

and/[0 1.62969e-24 1]

66 t=776

47 t=896

*EPS*/[7.2716e-55 1 1] 80 t=835

governing/[0 1.0267e-74 1]

93 t=815

not/[8.97892e-312 4.8724e-46 1]

63 t=837

88 t=815

not/[8.97892e-312 3.80759e-40 1]

89 t=837

governing/[0 9.41392e-102 1] governing/[0 1.87262e-93 1]

53 t=893

49 t=896

82 t=895

ourselves/[0 3.14947e-75 1]

*EPS*/[7.2716e-55 1 1] 54 t=896 governing/[0 7.41553e-97 1]

off/[3.64777e-227 3.32672e-77 1] 87 t=820

72 t=838

off/[3.1881e-213 3.32672e-77 1] 67 t=821

offer/[2.24674e-222 1.95978e-62 1]

governing/[0 7.41553e-97 1] governing/[0 7.41553e-97 1]

offer/[2.57044e-236 1.95978e-62 1]

68 t=838

*EPS*

29 t=979

ourselves/[0 3.74879e-65 1]

*EPS*/[7.2716e-55 1 1]

75 t=896

69 t=895

ourselves/[0 2.6043e-66 1]

*EPS*/[7.2716e-55 1 1]

71 t=896

74 t=893

governing/[0 4.55675e-102 1]

and/[0 1.62969e-24 1]

28 t=976

ourselves/[0 2.14796e-51 1]

ourselves/[0 3.74879e-65 1]

73 t=895

governing/[0 7.41553e-97 1] and/[0 1.62969e-24 1]

ourselves/[0 5.51977e-93 1]

ourselves/[0 3.36005e-80 1] *EPS*/[7.2716e-55 1 1] 48 t=893

governing/[0 7.09985e-84 1]

ourselves/[0 2.6043e-66 1]

governing/[0 4.55675e-102 1] governing/[0 4.55675e-102 1]

70 t=893

Figure 1.3. Lattice produced by the RWTH 2007 TC-Star EPPS Evaluation System for English [L¨ oo ¨f & Gollan+ 2007].

decoder follows the word pair approximation, in which the assumption is made that the end time of the word in question depends only on the current and the preceding word hypothesis [Ney & Aubert 1994; Ortmanns & Ney+ 1997a]. The word pair approximation guarantees that at any time t and for any word w and predecessor word v there exists only one lattice arc labeled with w. As a consequence the lattice is deterministic, i.e. each word sequence exists only once; in particular, the same word sequence cannot exist with different word boundaries. This makes a lattice which fulfills the word pair approximation compact. However, the only guaranteed property of a lattice created by the RWTH Aachen decoder is that it contains the best sentence hypothesis with the correct scores and correct word boundaries. Due to the word pair approximation, hypotheses competing with the best one may have inaccurate word boundaries and thus overestimated acoustic scores. Furthermore, it is not guaranteed that a lattice of M hypotheses contains the N-best list for 1 < N ≤ M, i.e. the N best scoring hypotheses. These constraints hold not only for the RWTH Aachen decoder but for any popular LVCSR decoder design; for example, a discussion of issues in creating lattices from a WFST decoder is given in [Ljolje & Pereira+ 1999].

HMM decoding results can be stored in a compact form due to the several independence assumptions made in the search, cf. Section 1.3 and Section 1.4. The assumptions restrict the dependencies for computing any probability applied in HMM decoding to a finite context. On the word (or phoneme) level and for LVCSR tasks this is the context for cross-word modeling and the context for computing an m-gram probability. The context can be stored in the lattice topology; for example, for any state in a lattice which stores bigram probabilities all incoming arcs must have the same label. However, this also means that in general a lattice built from a trigram LM requires more arcs to represent the same number of hypotheses than a lattice which stores bigram probabilities. The advantage of storing all context information in the lattice topology is that it allows applying generic graph algorithms to the lattice, like the transducer operations introduced later in this chapter in Section 1.7. For example, the LM probability for a sentence is simply the product of the m-gram probabilities stored on the arcs along a path through the lattice.

The quality of a lattice is measured in terms of graph error rate (GER) and density, and the goal in lattice construction is to achieve a low GER for a small density. The GER of a word lattice L is defined in Equation (1.10), where $\mathrm{Lev}(w_1^N, \tilde{w}_{r,1}^{N_r})$ is the Levenshtein distance between word sequence $w_1^N$ and reference $\tilde{w}_{r,1}^{N_r}$, and $N_r$ is the number of reference words.

$$ \mathrm{GER}(L) = \min_{w_1^N \in L,\, N} \frac{\mathrm{Lev}(w_1^N, \tilde{w}_{r,1}^{N_r})}{N_r} \qquad (1.10) $$

The density is defined as the ratio between the number of arcs in the lattice $|E(L)|$ and the number of words in the reference $N_r$. If the reference is unknown, then the density can be approximated by using the number of words in the Viterbi decoding result $\hat{N}$.

$$ \mathrm{density}(L) := \frac{|E(L)|}{N_r} \approx \frac{|E(L)|}{\hat{N}} \qquad (1.11) $$
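To make Equations (1.10) and (1.11) concrete, the following minimal Python sketch computes the Levenshtein distance, the GER over an explicitly enumerated set of lattice hypotheses, and the approximated density. The function names and toy values are illustrative only; a real implementation would evaluate the GER by dynamic programming over the lattice graph instead of enumerating paths.

```python
def levenshtein(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[-1]

def graph_error_rate(hypotheses, reference):
    """Eq. (1.10): best achievable error over all word sequences in the lattice."""
    return min(levenshtein(h, reference) for h in hypotheses) / len(reference)

def approx_density(num_arcs, viterbi_hyp):
    """Eq. (1.11), approximated: arcs per word of the first-pass Viterbi result."""
    return num_arcs / len(viterbi_hyp)

reference = "to promote democracy and nonviolence".split()
hypotheses = ["to promote democracy and violence".split(),
              "the view that democracy is a means".split()]
print(graph_error_rate(hypotheses, reference))   # 0.2: one substitution in the best path
print(approx_density(1500, hypotheses[0]))       # 300.0 arcs per word
```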

All lattices produced by the RWTH Aachen LVCSR system use the word-conditioned tree search decoder and the word-pair approximation; the resulting lattices are referred to as word-conditioned lattices. Word lattice densities presented in this work are always approximated densities. Furthermore, all lattices used in any experiment presented in this work store all context information in the lattice topology.



Table 1.1. Semirings used by WFSTs for speech recognition tasks.

Semiring       K                  x ⊕ y                        x ⊗ y    0̄      1̄
probability    R+                 x + y                        x · y    0       1
log            R ∪ {−∞, +∞}       −log(exp(−x) + exp(−y))      x + y    +∞      0
tropical       R ∪ {−∞, +∞}       min(x, y)                    x + y    +∞      0
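The operations of Table 1.1 can be written down directly; the sketch below is a minimal Python rendering (the dictionary-based representation is an illustrative choice, not part of the thesis), with weights of the log and tropical semiring interpreted as negated log probabilities.

```python
import math

def log_add(x, y):
    """⊕ of the log semiring: -log(exp(-x) + exp(-y)), computed stably."""
    if math.isinf(x): return y
    if math.isinf(y): return x
    m = min(x, y)
    return m - math.log1p(math.exp(-abs(x - y)))

PROBABILITY = dict(plus=lambda x, y: x + y, times=lambda x, y: x * y, zero=0.0,      one=1.0)
LOG         = dict(plus=log_add,            times=lambda x, y: x + y, zero=math.inf, one=0.0)
TROPICAL    = dict(plus=min,                times=lambda x, y: x + y, zero=math.inf, one=0.0)

# The log semiring accumulates probability mass, the tropical semiring keeps
# the best (Viterbi) score:
a, b = 2.0, 3.0                       # two competing costs, i.e. -log p
print(LOG["plus"](a, b))              # ≈ 1.6867 = -log(exp(-2) + exp(-3))
print(TROPICAL["plus"](a, b))         # 2.0, the Viterbi approximation
```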

1.6.2 Speaker Adaptation

Speaker adaptation requires a speaker label S for each speech utterance, where utterances spoken by the same speaker form a speaker cluster. A common approach for unsupervised speaker clustering is to optimize the Bayesian information criterion (BIC) on the acoustic features of the clustered utterances [Chen & Gopalakrishnan 1998; Tritschler & Gopinath 1999]. The commonly applied speaker adaptation methods in the RWTH Aachen LVCSR decoder are vocal tract length normalization (VTLN), maximum likelihood linear regression (MLLR), and constrained MLLR (CMLLR).

In VTLN the warping factor for a speaker S is chosen by a grid search which aims at maximizing the likelihood of the speaker cluster given the output of the previous recognition pass. The approach is computationally expensive and the RWTH Aachen system uses by default the fast VTLN implementation, where the warping factor is selected by a classifier [Lee & Rose 1996; Molau 2003].

In the MLLR approach the parameters of the GMMs are adapted to the speaker by applying a speaker-dependent linear transformation to the means and variances. Equation (1.12) shows the unconstrained form of MLLR.

$$ \hat{\mu}^{(S)}_{sl} = A^{(S)}_{s}\, \mu_{sl} + b^{(S)}_{s}, \qquad \hat{\Sigma}^{(S)}_{sl} = H^{(S)}_{s}\, \Sigma_{sl}\, H^{(S)T}_{s} \qquad (1.12) $$

In the RWTH Aachen system only the means are adapted, but not the globally pooled, diagonal covariance matrix Σ. The state-dependent transformation matrices $A^{(S)}_{s}$ for a given speaker S are tied according to a decision tree [Pitz 2005]. In the estimation step those transformation matrices are chosen which maximize the likelihood of the corresponding speaker cluster, where, as for VTLN, the output of the previous decoding pass serves as supervisor.

In the constrained form of MLLR the means and variances are transformed by the same matrix. The RWTH Aachen system uses CMLLR for speaker adaptive training (SAT), where only a single transformation per speaker is used. The resulting transformation is shown in Equation (1.13).

$$ \hat{\mu}^{(S)}_{sl} = A^{(S)} \mu_{sl} + b^{(S)}, \qquad \hat{\Sigma}^{(S)} = A^{(S)} \Sigma\, A^{(S)T} \qquad (1.13) $$

The advantage of CMLLR is that it can be implemented as a feature transformation, which makes the integration into an LVCSR system simple [Leggetter & Woodland 1995].
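The feature-side view of CMLLR can be checked numerically; the following sketch is illustrative only (the helper names and toy dimensions are not taken from the thesis) and assumes a single speaker-dependent transform A, b as in Equation (1.13).

```python
import numpy as np

def gauss_logpdf(x, mu, Sigma):
    """Log density of a multivariate Gaussian."""
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def cmllr_feature_transform(x, A, b):
    """Feature-side CMLLR: x -> A^{-1} (x - b)."""
    return np.linalg.solve(A, x - b)

# N(x; A mu + b, A Sigma A^T) equals |det A|^{-1} N(A^{-1}(x - b); mu, Sigma),
# so adapting the model as in Eq. (1.13) and transforming the features are
# equivalent up to the Jacobian term.
rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D)) + 3.0 * np.eye(D)        # speaker-dependent transform
b, mu, x = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)
Sigma = np.diag(rng.uniform(0.5, 2.0, size=D))        # globally pooled, diagonal

lhs = gauss_logpdf(x, A @ mu + b, A @ Sigma @ A.T)
rhs = gauss_logpdf(cmllr_feature_transform(x, A, b), mu, Sigma) - np.linalg.slogdet(A)[1]
print(np.allclose(lhs, rhs))                          # True
```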

1.7 Weighted Finite State Transducers

Weighted finite state transducers (WFSTs) are directed graphs with an input label, an output label, and a weight on each arc. In speech recognition WFSTs are commonly used to represent the stochastic models, in particular the HMM-based acoustic model and the m-gram language model, and lattices. The representation as transducers allows manipulating them by generic WFST operations [Mohri & Pereira+ 2008]. Word lattices represented as WFSTs and the notation developed in this section as well as the presented algorithms are heavily used in the following chapters. Besides introducing the notation and algorithms, this section shows how the search and the time alignment problems are tackled with the help of WFSTs.

1.7.1 Notation

A weighted finite state transducer T is a 7-tuple (Σin, Σout, (K, ⊕, ⊗, 0̄, 1̄), S, sI, SF, E). The input and output labels are taken from the alphabets Σin and Σout. An acceptor A is a transducer without Σout.


The weights of a transducer or acceptor form a semiring (K, ⊕, ⊗, 0̄, 1̄), where a semiring has the following properties:

1. (K, ⊕, 0̄) is a commutative monoid:
   • (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z)
   • 0̄ ⊕ x = x ⊕ 0̄ = x
   • x ⊕ y = y ⊕ x
2. (K, ⊗, 1̄) is a monoid:
   • (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
   • 1̄ ⊗ x = x ⊗ 1̄ = x
3. ⊗ distributes over ⊕:
   • x ⊗ (y ⊕ z) = (x ⊗ y) ⊕ (x ⊗ z)
   • (x ⊕ y) ⊗ z = (x ⊗ z) ⊕ (y ⊗ z)
4. 0̄ is an annihilator for ⊗:
   • 0̄ ⊗ x = x ⊗ 0̄ = 0̄

The common semirings used in speech recognition are summarized in Table 1.1. The log semiring equals the probability semiring in negated, logarithmic probability space. Applying the maximum or Viterbi approximation to the log semiring results in the tropical semiring. The semirings used in this work (including the semirings listed in Table 1.1) have the additional property that the ⊗-operation is commutative and for each element x but 0̄ the ⊗-inverse element x⁻¹ exists in K.

The states in the WFST are denoted by S, the single initial state by sI, and the set of final states by SF. Final states can have weights, which are denoted by w(s), s ∈ SF. In a lattice each state s carries a time stamp denoted by t(s). The set of arcs or edges in a WFST is denoted by E ⊆ S × {Σin ∪ ε} × {Σout ∪ ε} × K × S, where ε denotes the empty word. For an arc e ∈ E the input label is denoted by i(e), the output label by o(e), the weight by w(e), the source state by from(e), and the target state by to(e). For a state s the set of incoming arcs is denoted by in(s) and the set of outgoing arcs by out(s). The notations e ∈ E and s ∈ S are abbreviated by e ∈ T and s ∈ T.

A (sub-)path $a_1^L \in E \times \cdots \times E$ in transducer T is any consecutive sequence of arcs. The set of all paths starting from state s and ending in state s′ is denoted by π(s, s′) and accordingly π(S, S′) is the set of all paths starting in s ∈ S and ending in s′ ∈ S′. Paths in π({sI}, SF) are called paths through T, other paths are called sub-paths in T. Likewise for edges and states, the notation $a_1^L \in \pi(\{s_I\}, S_F)$ is abbreviated by $a_1^L \in T$. For path $a_1^L$ the sequence of non-ε input labels is given by $i(a_1^L) \in \Sigma_{in}^*$; $o(a_1^L)$ is defined analogously. The ⊗-product over the arc weights $w(a_1) \otimes w(a_2) \otimes \ldots \otimes w(a_L)$ of a (sub-)path is denoted by $[[a_1^L]]$ and the ⊕-sum over the products of all paths through T by

$$ [[T]] := \bigoplus_{a_1^L \in T} [[a_1^L]]. \qquad (1.14) $$

The interpretation of [[T]] depends on the semiring: the tropical semiring yields the Viterbi decoding result for T. For adequate weights the result of the log or probability semiring can be interpreted as the normalization term for a probability distribution over the paths through T, i.e. $p(a_1^L | T) := \exp\!\big(-\big([[T]]_{\log}^{-1} \otimes_{\log} [[a_1^L]]_{\log}\big)\big)$. The weight for a sequence of input labels $w_1^N$ and output labels $v_1^M$ is the sum over all paths through T accepting $w_1^N$ as input and $v_1^M$ as output:

$$ [[T]](w_1^N, v_1^M) := \bigoplus_{\substack{a_1^L \in T:\\ i(a_1^L) = w_1^N \,\wedge\, o(a_1^L) = v_1^M}} [[a_1^L]] \otimes w(\mathrm{to}(a_L)) \qquad (1.15) $$

$$ [[A]](w_1^N) := \bigoplus_{\substack{a_1^L \in A:\\ i(a_1^L) = w_1^N}} [[a_1^L]] \otimes w(\mathrm{to}(a_L)) \qquad (1.16) $$

WFSTs have a natural graphical representation as shown in Figure 1.4 for a transducer and an acceptor.


Figure 1.4. Graphical representation of a weighted acceptor a) and a weighted transducer b). An arc in the acceptor is labeled by i(e)/w(e), a transducer arc by i(e):o(e)/w(e). States are labeled with their state number and a final weight, if the state is final.

1.7.2 Algorithms

Single-Source Shortest-Distance. The shortest-distance of a state s to the final states of T is defined in Equation (1.17).

$$ d(s; T) := \bigoplus_{a_1^L \in \pi(s, S_F)} \Big( \bigotimes_{l=1}^{L} w(a_l) \Big) \otimes w(\mathrm{to}(a_L)) \qquad (1.17) $$

Starting from the initial state, d(sI; T) equals [[T]]. For the tropical semiring with non-negative weights the Dijkstra algorithm can be used to compute d(·; T). The shortest path for acyclic WFSTs or WFSTs with idempotent semirings (like the tropical semiring) can be computed efficiently by using the Bellman-Ford algorithm, which applies a form of dynamic programming. In particular, the time complexity for acyclic WFSTs is O(|E| + |S|). A summary of efficient solutions to the single-source shortest-distance problem for arbitrary WFSTs and semirings is given in [Mohri 2002b].

Composition and Intersection. The composition of two transducers T1 ◦ T2 is a mapping from sequences in $\Sigma^*_{in,1}$ to sequences in $\Sigma^*_{out,2}$ and is defined in Equation (1.18).

$$ [[T_1 \circ T_2]](w_1^N, v_1^M) := \bigoplus_{u_1^L} [[T_1]](w_1^N, u_1^L) \otimes [[T_2]](u_1^L, v_1^M) \qquad (1.18) $$

The result of the composition of two acceptors A1 and A2 is their intersection: $w_1^N$ is accepted if A1 and A2 both accept $w_1^N$. Composition and intersection can be computed efficiently in time O((|E1| + |S1|)(|E2| + |S2|)) [Mohri 2004].

Determinization and Minimization. In a determinized WFST det(T) no two arcs leaving the same state have the same input label. In the common definition of determinization for WFSTs the empty word ε is treated as a normal label. However, in a strong sense determinization means that for acceptor A and input label sequence $w_1^N$ at most one path through A accepts $w_1^N$. The strong form of determinization is achieved by first removing ε-labels from A. All acyclic WFSTs and unweighted FSTs are determinizable, but cyclic WFSTs can be non-determinizable. A work-around is to convert a non-determinizable WFST into a determinizable WFST by inserting additional arcs labeled with so-called disambiguating input labels or disambiguators [Allauzen & Mohri 2004]. In the worst case the number of states in the determinized WFST grows exponentially, even for acyclic transducers. A determinized, acyclic WFST can be efficiently minimized in time O(|E|), where the minimized WFST is the equivalent transducer with the minimal number of states; the complexity of the minimization depends on the semiring [Mohri 2004].

ε-removal. After applying ε-removal to an acceptor A the resulting acceptor remove-ε(A) has no arcs with the empty word ε as input label. The complexity is O(|S||E| + |S|²) for acyclic acceptors. The complexity for the general case and possible extensions of ε-removal to transducers are discussed in [Mohri 2002a, 2003].
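For ε-free acceptors, composition reduces to the weighted intersection of Equation (1.18); the following minimal sketch over the tropical semiring uses an ad-hoc acceptor representation (a dictionary of arcs plus final weights) that is illustrative only and not the WFST format used elsewhere in this work.

```python
def intersect(a1, a2):
    """Intersection of two ε-free weighted acceptors over the tropical semiring.
    An acceptor is (arcs, finals): arcs maps (state, label) to [(next_state, weight)],
    finals maps a final state to its weight; state 0 is initial, ⊗ is +."""
    arcs, finals = {}, {}
    stack, seen = [(0, 0)], {(0, 0)}
    while stack:
        s1, s2 = stack.pop()
        if s1 in a1[1] and s2 in a2[1]:
            finals[(s1, s2)] = a1[1][s1] + a2[1][s2]
        for (q1, label), nexts1 in a1[0].items():
            if q1 != s1 or (s2, label) not in a2[0]:
                continue
            for n1, w1 in nexts1:
                for n2, w2 in a2[0][(s2, label)]:
                    arcs.setdefault(((s1, s2), label), []).append(((n1, n2), w1 + w2))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs, finals

# two toy acceptors that share only the label sequence "a b"
A1 = ({(0, "a"): [(1, 0.5)], (1, "b"): [(2, 0.3)], (1, "c"): [(2, 0.0)]}, {2: 0.8})
A2 = ({(0, "a"): [(1, 0.2)], (1, "b"): [(2, 0.1)]}, {2: 0.0})
arcs, finals = intersect(A1, A2)
print(sorted(arcs))    # [((0, 0), 'a'), ((1, 1), 'b')]
print(finals)          # {(2, 2): 0.8}
```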


Project. The projection converts a transducer T into an acceptor project(T) and is defined in Equation (1.19).

$$ [[\mathrm{project}(T)]](w_1^N) := \bigoplus_{v_1^M} [[T]](w_1^N, v_1^M) \qquad (1.19) $$

Union. The union of two transducers accepts $(w_1^N, v_1^M)$ if T1 or T2 accepts $(w_1^N, v_1^M)$. Equation (1.20) defines the union of two WFSTs.

$$ [[T_1 \cup T_2]](w_1^N, v_1^M) = [[T_1]](w_1^N, v_1^M) \oplus [[T_2]](w_1^N, v_1^M) \qquad (1.20) $$

Building the union has a time complexity of O(1): a new super-initial state is introduced and connected via ε-arcs with the initial states of T1 and T2.

Miscellaneous. In the transposed WFST $T^{T}$ the arc direction is inverted. In $T^{-1}$ input and output labels are exchanged. ∂(s; T) denotes the sub-WFST of T with s as new initial state. The result of trim(T) has only co-accessible states, where a co-accessible state s is any state on a path through T, i.e. s can be reached from the initial state and at least one of the final states can be reached from s. If not explicitly mentioned otherwise, any WFST T is assumed to be trim. Several WFST libraries which include the algorithms presented in this section are publicly available, cf. [Allauzen & Riley+ 2007; Hetherington 2004; Kanthak & Ney 2004].
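The single-source shortest-distance of Equation (1.17) is particularly simple for an acyclic acceptor whose states are numbered in topological order; the sketch below is a minimal illustration (data layout and names are not from the thesis) and reuses the log ⊕-operator from the sketch after Table 1.1.

```python
import math

def log_add(a, b):                       # ⊕ of the log semiring, as before
    if math.isinf(a): return b
    if math.isinf(b): return a
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def shortest_distance(n_states, arcs, finals, plus):
    """d(s; T) of Eq. (1.17) for every state of an acyclic weighted acceptor.
    arcs: (from_state, to_state, weight) tuples sorted by from_state
    (topological order); finals: {state: final_weight}; plus: the ⊕-operator."""
    d = [math.inf] * n_states
    for s, w in finals.items():
        d[s] = w
    for s_from, s_to, w in reversed(arcs):
        d[s_from] = plus(d[s_from], w + d[s_to])
    return d

arcs = [(0, 1, 2.0), (0, 1, 2.5), (1, 2, 1.0)]      # toy acyclic acceptor
finals = {2: 0.5}
print(shortest_distance(3, arcs, finals, min))       # tropical: [3.5, 1.5, 0.5]
print(shortest_distance(3, arcs, finals, log_add))   # log: d(0) ≈ 3.026
```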

1.7.3 WFSTs in ASR

In ASR tasks transducers are often used for representing and manipulating lattices and for solving the search and the time alignment problem. WFSTs for lattice representation are discussed in detail in Chapter 3. This section briefly summarizes how the search and the time alignment problem are solved with the help of WFSTs.

The main idea in using WFSTs for solving the search problem is to factorize the problem into a set of simple-to-construct WFSTs and then to use generic WFST algorithms to solve the search problem defined in Equation (1.9), i.e. to find the Viterbi hypothesis $\hat{W}$ given feature sequence $x_1^T$. The common factorization for LVCSR systems consists of five transducers, cf. [Mohri & Pereira+ 2008]:

• O, emission probabilities: an acyclic transducer with the acoustic feature $x_t$ as input, an HMM state s as output, and the likelihood $p(x_t|s)$ as weight.
• H, HMM state to context-dependent (CD) phone mapping: a cyclic transducer which consists of the collection of all the triphone-dependent HMMs; the weights are the transition probabilities $p(s|s')$.
• C, CD phone to context-independent (CI) phone mapping: a cyclic, unweighted FST which maps triphones to their central phoneme.
• L, CI phone to word mapping: the WFST representation of the pronunciation lexicon; the weights are the pronunciation probabilities $p(w|a_1^L)$.
• G, language model probabilities: the representation of an m-gram language model with backing-off as acceptor; the weights are the language model probabilities $p(w_n|w_{n-m+1}^{n-1})$.

The five knowledge sources are combined via composition and the resulting form of the search problem is given in Equation (1.21), where probabilities are represented in negated log-space and the transducers are defined over the tropical semiring.

$$ x_1^T \rightarrow \hat{W} = o\big( \arg d(O \circ H \circ C \circ L \circ G) \big) \qquad (1.21) $$


The advantage of the transducer representation is that the static part of the search space H ◦ C ◦ L ◦ G can be optimized offline. Minimizing the static part significantly reduces the number of states and the run-time of the decoder. However, in practice the minimization of H ◦ C ◦ L ◦ G is not straightforward, because it is not determinizable and contains many ε-arcs (ε-removal is prohibitive as it would cause a dramatic blow-up). The placement of the disambiguator arcs and of the ε-labels is crucial for getting a small and efficient WFST decoder [Allauzen & Mohri 2004; Allauzen & Mohri+ 2004].

WFST decoders for LVCSR use a version of the single-source shortest-distance operation d(·) which includes pruning. For LVCSR tasks the full static search space transducer can become huge and common decoder designs perform on-the-fly compositions ◦fly, which are applied during the decoding in combination with a pruning implementation dprune(·). The following list summarizes the most common decoder designs.

$$ [w_1^N]_{opt} = o\big( \arg d_{prune}(O \circ_{fly} \min(H \circ C \circ L \circ G)) \big) \qquad (1.22) $$
$$ [w_1^N]_{opt} = o\big( \arg d_{prune}(O \circ_{fly} H \circ_{fly} \min(C \circ L \circ G)) \big) \qquad (1.23) $$
$$ [w_1^N]_{opt} = o\big( \arg d_{prune}(O \circ_{fly} \min(H \circ C \circ L) \circ_{fly} G) \big) \qquad (1.24) $$

Decoder design (1.22) uses the fully optimized static search space, where minimization is usually applied over the log semiring. Design (1.23) expands the HMM states on the fly, which significantly reduces the size of the pre-computed WFST. The third decoder design (1.24) is conceptually equivalent to the word-conditioned tree search, the standard decoder at RWTH Aachen. Producing a lattice with a WFST decoder is conceptually simple: instead of applying dprune(·) and keeping only the best path, the search space is only pruned. The time alignment problem is solved in the WFST framework by simply replacing acceptor G by acceptor R, which is a linear transducer representing the reference transcription.

1.8 Bayes Risk Decoding: State of the Art

Bayes risk decoding for ASR aims at finding the word sequence $\hat{W}$ with the minimum risk (aka minimum expected loss/error/cost) given feature sequence $x_1^T$ and given a loss function $\mathcal{L}(\cdot, \cdot)$. Equation (1.25) shows the general form of the Bayes risk decision rule [Bishop 2006].

$$ x_1^T \rightarrow \hat{W} = \operatorname*{argmin}_{w_1^N,\, N} \sum_{v_1^M,\, M} p(v_1^M | x_1^T)\, \mathcal{L}(w_1^N, v_1^M) \qquad (1.25) $$

In fact, Equation (1.1) is the instance of the Bayes risk decoder which uses the sentence error $\mathcal{L}(w_1^N, v_1^M) := 1 - \delta(w_1^N, v_1^M)$ as loss function. However, the standard cost function for LVCSR tasks is the WER, which is defined as the Levenshtein distance normalized by the length of the reference string. Due to the discrepancy in the cost function the MAP (and Viterbi) decoding result is not optimal for LVCSR tasks, which motivates the application of cost functions that are closer to the WER. Usually, the normalization in the WER is omitted and the goal is to minimize the Levenshtein distance. However, the sum in Equation (1.25) prohibits the usage of a complex, non-local cost function like the Levenshtein distance during the search. Thus, Bayes risk decoding approaches with non-local loss functions are usually applied in a post-processing step on N-best lists or on word lattices.

N-best lists allow a direct computation of Equation (1.25) with the Levenshtein distance as loss function [Goel & Byrne+ 1998; Stolcke & König+ 1997]. Lattices possess many more hypotheses than any practicable N-best list and preserve more probability mass, especially for long utterances. On the downside, a direct computation of the Bayes risk decoding rule is still prohibitive for word lattices from an LVCSR system. A commonly used approximation is the confusion network (CN), for which Bayes risk decoding with an approximate Levenshtein distance as loss function is reduced to a local, word-wise decision problem [Mangu & Brill+ 1999, 2000]. In recent years several methods have been proposed to build confusion networks directly from lattices [Hakkani & Riccardi 2003; Hoffmeister & Schlüter+ 2009; Mangu & Brill+ 2000; Xue & Zhao 2005]. An extension to the CN decoding approach cuts the lattice into small, independent segments and computes the Levenshtein distance within the segments [Goel & Kumar+ 2004, 2000, 2001; Kumar & Byrne 2002]; the standard CN case is derived by allowing at most one word per segment.
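As noted above, an N-best list allows a direct evaluation of Equation (1.25) with the Levenshtein distance as loss function. The minimal Python sketch below uses toy posteriors; it reuses the levenshtein() routine from the sketch after Equation (1.11) and is not the lattice-based decoder developed later in this work.

```python
def levenshtein(hyp, ref):              # word-level edit distance, as before
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
    return d[-1]

def bayes_risk_decode(nbest):
    """Eq. (1.25) on an N-best list of (posterior, word sequence) pairs:
    return the entry with the least expected Levenshtein distance."""
    def risk(w):
        return sum(p * levenshtein(w, v) for p, v in nbest)
    return min((w for _, w in nbest), key=risk)

nbest = [(0.40, "promote democracy and nonviolence".split()),
         (0.30, "promote democracy and violence".split()),
         (0.30, "promote democracy in violence".split())]
print(" ".join(bayes_risk_decode(nbest)))
# -> "promote democracy and violence": the Bayes risk hypothesis differs from
#    the MAP hypothesis, which would be the first entry.
```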


Another extension is to replace the standard decision rule in CN decoding by a classifier, which can compensate for alignment errors and for unreliable probability estimates [Hoffmeister & Schlüter+ 2008; Mangu & Padmanabhan 2001; Venkataramani & Chakrabartty 2003, 2007]. In [Chien & Huang+ 2006] Bayesian priors are used in the risk computation, which model the uncertainty in the parameters of the acoustic and the language model. In [Goel & Byrne 2000] the authors aim at finding the Bayes risk hypothesis by doing an A*-search over the lattice, where the algorithm requires an estimate of the residual costs. An estimate of the Bayes risk and a criterion to decide whether the Bayes risk hypothesis with the Levenshtein distance as loss function is different from the MAP hypothesis is given in [Schlüter & Scharrenbach+ 2005]. Other lattice-based approaches use modified loss functions which allow an efficient computation of the Bayes risk hypothesis [Hoffmeister & Schlüter+ 2009; Wessel & Schlüter+ 2000, 2001c; Xu & Povey+ 2009]. An algorithm for computing the lattice-based Bayes risk hypothesis with the Levenshtein distance as loss function using only generic transducer operations is presented in [Mohri 2003], but the algorithm has exponential worst-case complexity.

1.9 Model and System Combination: State of the Art

1.9.1 Log-linear Model Combination

The standard in ASR is to use a log-linear model with only two knowledge sources: the acoustic model and the language model. For optimal performance LVCSR systems introduce a language model scale which eventually turns Equation (1.1) into a log-linear model. The log-linear model can be used explicitly for model combination by simply adding more knowledge sources to the model, usually additional acoustic models [Metze & Waibel 2002a,b; Zolnay 2006].

In the discriminative model combination (DMC) each of the knowledge sources combined in the log-linear model gets its own scaling factor which is optimized for minimal word error rate [Beyerlein 1997, 1998; Vergyri 2000; Zolnay & Schlüter+ 2005]. In practice, performing a decoding with many acoustic models is expensive in terms of time and memory and the common approach is to produce a lattice with a base decoder and re-score the lattice arcs with the additional knowledge sources.

In the standard LVCSR training procedures the interaction between the several knowledge sources during the search is not (fully) considered during model parameter estimation. An approach to compensate for this shortcoming of the model training is to capture the interactions in the log-linear model combination by using context-dependent scaling factors [Hoffmeister & Liang+ 2009; Huang & Belin+ 1993; Vergyri & Tsakalidis+ 2000].

1.9.2 System Combination

An alternative to the log-linear model combination is the N-best list or lattice-based system combination, where the output of several decoders is combined. In the log-linear model combination all (acoustic) models are combined into a single system, whereas in the system combination approach a separate system is built from each of the acoustic front-ends. In the simplest approach only a single hypothesis from each system is combined, like in ROVER [Fiscus 1997]. The quality of the ROVER result can be significantly increased by using confidence scores [Mangu & Brill+ 2000; Wessel & Schlüter+ 2001a] or by replacing ROVER's simple decision rule by a classifier [Hillard & Hoffmeister+ 2007; Zhang & Rudnicky 2006]. Instead of a single hypothesis, N-best lists or confusion networks can be combined [Evermann & Woodland 2000; Mangu 2000; Ostendorf & Kannan+ 1991; Stolcke & Bratt+ 2000]. In [Ostendorf & Kannan+ 1991] the system-dependent N-best lists are merged into a single N-best list followed by a re-scoring step. In the other approaches a super CN is derived by aligning the system-dependent N-best lists or CNs.

A lattice combination approach which derives system weights from Bayesian decision theory is presented in [Sankar 2005]. In [Hoffmeister & Schlüter+ 2008] a more general classifier is used to predict which system is correct. The minimum frame error decoding rule introduced in [Wessel & Schlüter+ 2001c] is extended in [Hoffmeister & Klein+ 2006] to a system combination approach. A similar method is used in [Chen & Lee 2006], where alternatively a phoneme error based cost is minimized. A general approach to combine and decode lattices from several systems, which covers the two latter approaches, is discussed in [Hoffmeister & Schlüter+ 2009]. In [Omar & Mangu 2007] an approach is presented where the scores from the first system drive the search of the second system, thereby aiming at minimizing a smoothed loss function. A comparison of ROVER with confidence scores, CN combination, and minimum frame error based lattice combination shows that all three approaches perform almost equally well [Hoffmeister & Hillard+ 2007]. The results presented in [Zolnay 2006] and in [Hoffmeister & Liang+ 2009] indicate that the performance of lattice-based system combination approaches is superior to the log-linear model combination.

The theoretical motivation for system combination comes from machine learning. The basic idea is that if one classifier is not perfect, then the combination with more classifiers improves the result, provided the classifiers make different kinds of errors [Dietterich 2000a]. The same author discusses a simple way of getting an ensemble of classifiers by randomizing decision trees [Dietterich 2000b]. In [Ramabhadran & Siohan+ 2006; Siohan & Ramabhadran+ 2005] the approach is applied to the estimation of the phonetic decision tree used in modern LVCSR systems, e.g. [Beulen 1999]. The usage of different acoustic front-ends or randomized decision trees works well in practice, but it does not guarantee that the resulting systems benefit from combination. In recent years some effort has been put into deriving an ensemble of complementary systems which benefit from each other in the system combination [Breslin & Gales 2006, 2007a,b; Willett & He 2008]. But so far, the gain from complementary system training is rather small.

1.9.3 Cross-Adaptation

Cross-adaptation is an alternative way of doing system combination which became popular in recent years [Soltau & Kingsbury+ 2005; Stüker & Fügen+ 2006]. Instead of applying the system combination in a post-processing step after decoding, the interaction between the systems is put into the speaker adaptation step of a multi-pass decoder. In the cross-adaptation approach the supervisor for MLLR adaptation, cf. Equation (1.12), is the output of an alternative system. In [Guiliani & Brugnara 2006] the approach is extended to multiple supervisors. The multiple supervisors are either reduced to a single supervisor in a pre-processing step by applying system combination methods, or the ultimate adaptation statistics are derived from the weighted average of the supervisor-dependent statistics [Guiliani & Brugnara 2007; Hoffmeister & Plahl+ 2007].


Chapter 2 Scientific Goals

System combination is an important technique in state-of-the-art, highly accurate LVCSR systems, in particular for those systems where a low error rate is mandatory and run-time is secondary. The combination via a single log-linear model is theoretically well grounded and was extensively studied in [Beyerlein 2000; Vergyri 2000]. The log-linear model is eventually a sentence-wise combination approach, whereas the popular ROVER [Fiscus 1997] approach comes as an ad-hoc method for word-wise system combination. ROVER is closely related to the common lattice-based combination via a confusion network combination (CNC) [Evermann & Woodland 2000; Mangu 2000]. CNC is theoretically motivated by the Bayes risk decoding rule with the Levenshtein distance as loss function: a CN provides an upper bound to the Levenshtein alignment between any two paths through a lattice. An alternative approximation is based on the definition of the frame error and was introduced in [Wessel & Schlüter+ 2001c] and extended to system combination in [Hoffmeister & Klein+ 2006]. The first objective of this thesis is to develop a unified view on system combination and to explore the connections between the popular approaches.

The lattice-based approach to system combination, which applies a Bayes risk decoder with an approximate Levenshtein distance as loss function, proved to be a simple and successful method and is the most widespread combination technique used in state-of-the-art LVCSR systems. However, confusion networks and frame error are only two of a variety of Levenshtein distance approximations used in LVCSR for training and decoding. The second objective of the thesis is to categorize, investigate, and extend existing approximations to the Levenshtein distance and to develop new ones, which can be used to build efficient and accurate Bayes risk decoders.

Besides the loss function, Bayes risk decoding relies on the quality of the posterior probability estimates. Standard approaches based on the Bayes risk decoding rule blindly trust the posterior probabilities derived from the word lattices. However, these probabilities are only estimates of the true posteriors. The third objective is to find and explore approaches to deal with the bias in the lattice-based posterior estimates.

From the objectives a set of theoretical and experimental goals is derived and investigated in this thesis, which includes:

Development of a unified view on system combination. Word lattices have a natural representation as weighted finite state transducers (WFSTs). Based on the WFST framework and the Bayes decoding rule a unified view on system combination is developed which covers the log-linear model, the minimum frame error combination, CNC, and many more. The framework is used to compare sentence error based decoders, e.g. the Viterbi decoder, and approximated Levenshtein distance based decoders with regard to their capability for system combination. At first glance the common approach to CN combination stays aside from the lattice-based framework, because it makes use of the special structure of a CN. In this work the interpretation of CNC in the lattice-based Bayes risk combination and decoding framework is explored as well as its connection to ROVER.

Investigations on local cost functions used in Bayes risk decoding. Bayes risk decoding of word lattices derived from LVCSR systems with the Levenshtein distance as loss function is computationally prohibitive. The common approach is to place the necessary approximation in the loss function, i.e. to use an approximation of the Levenshtein distance. Different approximations to the Levenshtein distance are in use in ASR for different purposes. The key idea of the approximation is to reduce the dependencies in the loss function such that the computation of the loss has a local nature. The degree of locality is used to derive two general classes of loss functions, which yield efficiently computable Bayes risk decoders.


In practice, the common local losses show a strong deletion bias. This work continues the theoretical investigations on the deletion bias started in [Gibson 2008] with special attention to the frame error based losses. New variants of local loss functions are developed, which have a direct influence on the deletion/insertion ratio and thus reduce the inherent deletion bias.

Investigations on confusion networks. Confusion networks are used in speech recognition and speech processing for many purposes like confidence score computation, Bayes risk decoding, system combination, and other tasks like (speech) translation [Evermann & Woodland 2000; Mangu & Brill+ 1999; Matusov & Hoffmeister+ 2008]. The common algorithms for converting a word lattice into a CN are based on an arc or state clustering. The algorithms are parametrized and finding the right parameters is crucial for good performance. Inspired by the common approaches to CN construction, two algorithms are developed and investigated. Furthermore, a conceptually new and simple algorithm is proposed, which comes completely parameter-free. The algorithm is based on frame-wise word posterior probabilities and draws a connection between minimum frame error and minimum CN distance decoding.

CN construction algorithms aim at approximating the Levenshtein alignment, but the heuristic nature of the common lattice-based algorithms does not allow any assumption about the resulting alignment, besides that it is an upper bound to the exact Levenshtein distance. However, experimental results indicate that the CN alignments are close to the Levenshtein alignments. In this work an approach is investigated which uses the CN alignment to initialize a windowed Levenshtein distance. A hierarchy of approximate Bayes risk decoders is developed, which starts with the common CN decoding rule for a window of size one. For a sufficiently large window the decoder eventually becomes the Bayes risk decoder with the exact, unwindowed Levenshtein distance as loss function.

Development of a new approach to system combination. The common system combination approaches formulated in the Bayes risk decoding framework have two major drawbacks. The first is the approximation of the Levenshtein distance and the second is the blind reliance on the posterior probability estimates derived from the word lattice. In this work an approach is proposed which aims at overcoming both problems: a classifier-based system combination. Instead of using the combined posterior estimates directly, a set of posterior estimates and further features are fed into a classifier. The classifier also has access to the results of the standard approaches to system combination and decides for the ultimate output. Thus, the classifier can learn systematic biases in the approximation of the Levenshtein distance and in the probability estimates. The approach is investigated for different feature sets and classifiers. The investigation contains a detailed analysis of the error detection and error correction capabilities of the classifier-based system combination.

Investigations on the log-linear model combination. Log-linear model combination for speech recognition has been studied before in [Beyerlein 2000; Vergyri 2000; Zolnay 2006]. However, a systematic comparison of system combination approaches using either sentence posterior probabilities based on a log-linear model combination or sentence posterior probabilities based on the weighted average of system-dependent sentence posteriors is lacking and will be given in this thesis.

In the standard log-linear model combination each knowledge source has a single scaling factor. However, a single factor cannot reflect the dynamic change in the influence of the knowledge source. This work investigates the usage of word and knowledge source dependent scaling factors in the log-linear combination of several acoustic models. The log-linear combination is compared to a combination based on averaged system-dependent sentence posterior probabilities, where the knowledge source dependent scaling factors are applied in the computation of the system-dependent posteriors.

The remainder of the thesis is organized as follows: Chapter 3 develops the unified view on system combination based on the WFST framework. Classes of local cost functions are derived and efficient Bayes risk decoders for system combination based on the local cost functions are developed. The common approaches to lattice-based system combination are investigated and classified into the framework.


Concrete instances of local costs are introduced and investigated in Chapter 4. Three categories of local costs are explored: frame error based costs, costs defined via a local alignment, and the confusion network (CN) distance. For each cost function an efficient Bayes risk decoder with the according cost as loss function is developed. In this work the primary function of lattice-derived CNs is to define an approximation of the Levenshtein alignments between any two paths through the lattice. Three algorithms for constructing a CN from a lattice are introduced and investigated in the chapter. Chapter 5 investigates applications of CNs defined on frame and on word level. In particular, a Bayes risk decoder with the windowed Levenshtein distance as cost function is developed, which also draws a connection between decoding with the CN distance and with the exact Levenshtein distance. The classifier-based approach to system combination is introduced in Chapter 6 and Chapter 7 is dedicated to investigations on the log-linear model combination. The thesis is concluded by a summary of the scientific contributions in Chapter 8 and an outlook in Chapter 9. The appendix contains a description of the systems and corpora used in the various combination experiments as well as detailed results for all systems and corpora.


Chapter 3 Lattice-Based System Combination in the Bayes Risk Decoding Framework

In the lattice-based system combination task the goal is to combine and decode lattices provided by several systems. The number of LVCSR systems to be combined is denoted by J and the word lattice produced by the j-th system by Lj; word lattices will be introduced in the next section. The Bayes risk decoding framework requires sentence posterior probabilities for computing the optimal hypothesis, cf. Equation (1.25). The ultimate goal is to compute the sentence posterior probability from the J lattices and thus to perform a system combination within the Bayes risk framework. Note that the MAP decoding rule is included in the considerations, because it is the instance of the Bayes risk decoding rule with the sentence error as loss function. In this chapter it will be shown that the sentence posteriors used in the common approaches to system combination are eventually computed from either the lattice intersection or the lattice union. In the course of the chapter the Bayes risk decoders are investigated which arise from combining the lattice intersection or union with different loss functions.

The standard model used in LVCSR to describe the system-dependent sentence posterior probabilities is a log-linear model of the form

$$ p_\lambda(w_1^N | x_1^T) := \frac{\exp\Big( \sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i f_i(n; w_1^N, x_1^T) \Big)}{\sum_{v_1^M, M} \exp\Big( \sum_{m=1}^{M} \sum_{i=1}^{I} \lambda_i f_i(m; v_1^M, x_1^T) \Big)}, \qquad (3.1) $$

where the feature functions $f_i(\cdot; w_1^N, x_1^T)$ represent the I knowledge sources used to solve the search problem. For simplicity it is assumed that each of the J systems combines the same number of I feature functions. Each feature function has its own scaling factor $\lambda_i$. In the general definition given in Equation (3.1) the feature functions depend on the whole sentence $w_1^N$. However, in practice the features use only a restricted context and the model has a compact representation in form of a word lattice. Common LVCSR systems combine only two knowledge sources, an HMM-based acoustic model and an m-gram language model:

$$ f_{AM}(n; w_1^N, x_1^T) := \log p\big(x_{t_{n-1}+1}^{t_n} \,\big|\, w_n\big) $$
$$ f_{LM}(n; w_1^N, x_1^T) := \log p\big(w_n \,\big|\, w_{n-m+1}^{n-1}\big) $$

For λAM = λLM = 1 the log-linear combination equals the factorization of the posterior probability $p(w_1^N|x_1^T)$ according to Bayes' rule shown in Equation (1.1). However, the estimate of the acoustic model is usually less reliable than the language model, which the scaling factors in the log-linear model aim at compensating for. In the MAP decoding the normalization in Equation (3.1) can be discarded and only the ratio between the λs is considered. If the MAP decoding is applied to the standard combination of an acoustic and a language model, the acoustic model scale is usually set to one and the language model scale is optimized. For computing sentence posterior probabilities the normalization is needed and the absolute values of the scaling factors matter [Wessel & Macherey+ 1998; Woodland & Povey 2000].

All lattices used in this work for Bayes risk decoding and system combination experiments provide separate acoustic and language model scores. Furthermore, all lattices store any required context information in their topology as described in Section 1.6.1. Word lattices have a natural representation as weighted finite state transducers (WFSTs), in particular as acyclic weighted finite state acceptors. The next section introduces semirings over $\mathbb{R}^D$ which allow representing a log-linear model directly as a WFST.
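For an explicitly enumerable hypothesis set (an N-best list standing in for a lattice), Equation (3.1) can be evaluated directly. The sketch below uses two knowledge sources with hypothetical scores and the scaling factors λ = (1/20, 1) that also appear in the example of Section 3.1; the function and variable names are illustrative only.

```python
import math

def sentence_posteriors(hyps, lam_am=1.0 / 20.0, lam_lm=1.0):
    """Eq. (3.1) over an enumerable hypothesis set: hyps maps a word sequence to
    its summed log acoustic and log LM scores; returns normalized posteriors."""
    scores = {w: lam_am * am + lam_lm * lm for w, (am, lm) in hyps.items()}
    shift = max(scores.values())                    # for numerical stability
    expo = {w: math.exp(s - shift) for w, s in scores.items()}
    z = sum(expo.values())
    return {w: e / z for w, e in expo.items()}

hyps = {("to", "promote", "democracy"):          (-3050.0, -21.0),
        ("to", "promote", "the", "view"):        (-3070.0, -19.0),
        ("to", "promote", "to", "mock", "racy"): (-3090.0, -27.0)}
post = sentence_posteriors(hyps)
best = max(post, key=post.get)
print(" ".join(best), round(post[best], 3))         # highest sentence posterior ≈ 0.731
```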


In Section 3.2 the WFST framework and Equation (3.1) are combined for computing probabilities over word lattices. In Section 3.2.1 probabilities are derived from a single lattice. From lattice- and thus system-dependent probabilities the next step is to get a combined probability in order to perform the system combination within the Bayes risk decoding framework. The combination via the lattice intersection is introduced and motivated in Section 3.2.2 and via the lattice union in Section 3.2.3.

In Section 3.3 a general framework for the combination and Bayes risk decoding of several lattices is developed using the WFST framework. In Section 3.3.1 the MAP decoding of the lattice union and intersection is investigated. In Section 3.3.3 the MAP decoding rule is replaced by a Bayes risk decoder with an approximation of the Levenshtein distance as loss function. A classification of Levenshtein distance approximations is introduced. For two classes of loss functions Bayes risk decoders are developed, which efficiently decode single lattices, the lattice intersection, and the lattice union.

System combination based on confusion networks is discussed in Section 3.4. The common confusion network combination (CNC) algorithm is derived from minimizing the Bayes risk of the combination result and it is shown that CNC is in fact a CN decoding of the lattice union. Finally, the ROVER method is introduced as an approximation of the CNC algorithm.

The result of the previous three sections is an abstract view on lattice combination and decoding which includes the common approaches to lattice-based system combination, in particular the discriminative model combination (DMC), CNC, and ROVER. The resulting combination and decoding framework is summarized in Section 3.5. The section briefly discusses the common approaches to lattice-based system combination and shows how they fit into the framework.

A crucial step for getting good results with Bayes risk decoding and system combination techniques is a careful pre-processing of the lattices. Section 3.6 discusses the several normalization steps applied in intra- and cross-site lattice combination. Section 3.7 describes the optimization algorithm used to estimate the scaling factors of the log-linear model and further combination and decoding dependent parameters. In the same section the general difference between parameter optimization for Bayes risk decoding and minimum risk based parameter estimation is discussed.

3.1 WFSTs as a High-Level Programming Language for Lattice-Based System Combination

The WFST framework provides a high-level programming language which is used in this work to describe the lattice-based combination and decoding problems. The advantage of using the WFST framework and generic WFST operations is a compact description of the problems. Furthermore, a WFST representation immediately yields an algorithm for solving the problem. The algorithm allows a first complexity analysis and helps to identify the expensive sub-problems which require further analysis or have to be replaced by sophisticated algorithms or by approximations.

A lattice L is defined as an acyclic WFST over the log or tropical vector semiring with time stamps on the states, where the log or tropical vector semiring is given by $(\mathbb{R}^D, \oplus, \otimes, \bar{0}, \bar{1}, \lambda)$. An arc weight $x \in \mathbb{R}^D$ and the vector $\lambda \in \mathbb{R}^D$ correspond to the arc-dependent features and to the scaling factors in the log-linear model defined in Equation (3.1). The standard log and tropical semirings are defined in Section 1.7. The log semiring equals the probability semiring in negated log space, i.e. $\exp(-x)$, $x \in \mathbb{R}$, is a homomorphism from the log to the probability semiring. Applying the Viterbi approximation to the log semiring yields the tropical semiring.

The interpretation of an arc weight $x \in \mathbb{R}^D$ and vector λ as the features and scaling factors in a log-linear model requires that the scalar product $\lambda \cdot x$ is a homomorphism from the log or tropical vector semiring to the standard log or tropical semiring. In other words, the definitions of the ⊕-operator, the ⊗-operator, and of the neutral elements have to satisfy

$$ \lambda \cdot (x \oplus y) = (\lambda \cdot x) \oplus (\lambda \cdot y), \qquad \lambda \cdot (x \otimes y) = (\lambda \cdot x) \otimes (\lambda \cdot y). \qquad (3.2) $$

Thus, it is guaranteed that the standard log and the log vector semiring as well as the standard tropical and the tropical vector semiring produce equivalent results as long as λ is kept fixed, e.g. finding the path through the lattice with the shortest distance.


A second, desired property is that the operators and neutral elements are independent of λ, i.e. that changing the λ-vector before or after applying an operation shall not affect the outcome of the operation. Equation (3.3) formally defines the property for an arbitrary operation ∘, where ∘λ denotes the operation instantiated with scaling factors λ.

$$ \lambda \cdot (x \circ_{\lambda} y) = \lambda \cdot (x \circ_{\lambda'} y), \qquad \lambda \neq \lambda' \qquad (3.3) $$

Drawing the connection to the log-linear model reveals the motivation for the second property: λ corresponds to the scaling factors in the log-linear model. Therefore, operations which are independent of λ can be applied without affecting the outcome of a subsequent optimization of the log-linear model. It is easy to see that due to the distributivity of multiplication over addition the following definitions of the neutral elements and the ⊗-operator fulfill Equation (3.2):

$$ \bar{0} := \begin{pmatrix} \infty \\ \vdots \\ \infty \end{pmatrix}, \qquad \bar{1} := \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} x_1 \\ \vdots \\ x_D \end{pmatrix} \otimes \begin{pmatrix} y_1 \\ \vdots \\ y_D \end{pmatrix} := \begin{pmatrix} x_1 + y_1 \\ \vdots \\ x_D + y_D \end{pmatrix} $$

The neutral elements and the ⊗-operator are defined independently of λ and thus Equation (3.3) holds for the ⊗-operator. The ⊕-operator for the log and tropical semiring cannot be defined independently of the λ-vector and thus does not fulfill Equation (3.3). This becomes obvious when looking at the interpretation of the ⊕-operation in the log-linear model: computing the ⊕-sum over all paths in the WFST equals the computation of the normalization constant in the log-linear model for the log semiring, and in the tropical semiring the ⊕-sum is equivalent to finding the best scoring path. Both operations are obviously not independent of λ.

For the log semiring many possible definitions of the ⊕-operator which fulfill Equation (3.2) exist. In a series of experiments, from all tested ⊕-operators the following definition gives the best approximation to Equation (3.3):

$$ x \oplus_{\log} y = z, \quad \text{where} \quad z_i := -\lambda_i^{-1}\, \log\!\big(\exp(-\lambda_i x_i) + \exp(-\lambda_i y_i)\big)\; \frac{\log\!\Big(\exp\big(-\sum_{d=1}^{D}\lambda_d x_d\big) + \exp\big(-\sum_{d=1}^{D}\lambda_d y_d\big)\Big)}{\sum_{d=1}^{D}\log\!\big(\exp(-\lambda_d x_d) + \exp(-\lambda_d y_d)\big)} \qquad (3.4) $$

The ⊕-operator for the tropical vector semiring fulfilling Equation (3.2) is defined by:

$$ x \oplus_{\mathrm{trop}} y := \begin{cases} x & \text{if } \lambda \cdot x \le \lambda \cdot y \\ y & \text{otherwise} \end{cases} \qquad (3.5) $$
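A minimal numerical check of the ⊕-operator of Equation (3.4) for D = 2 follows; the helper names are illustrative, while the scores (2500, 125) and the scaling factors (1/20, 1) are the values used in the example of Figure 3.1. The sketch verifies that the homomorphism property of Equation (3.2) holds for the λ the operator was instantiated with.

```python
import math

def log_add(a, b):                                  # scalar log semiring ⊕
    m = min(a, b)
    return m - math.log1p(math.exp(-abs(a - b)))

def oplus_log_vector(x, y, lam):
    """λ-dependent ⊕ of the log vector semiring, Eq. (3.4)."""
    per_dim = [log_add(l * xi, l * yi) for l, xi, yi in zip(lam, x, y)]
    total = log_add(sum(l * xi for l, xi in zip(lam, x)),
                    sum(l * yi for l, yi in zip(lam, y)))
    scale = total / sum(per_dim)
    return [p * scale / l for p, l in zip(per_dim, lam)]

lam = [1.0 / 20.0, 1.0]                    # λ0 = (1/20, 1) as in the Figure 3.1 example
x, y = [2500.0, 125.0], [2510.0, 120.0]    # arc weights: (AM score, LM score)
z = oplus_log_vector(x, y, lam)
lhs = sum(l * zi for l, zi in zip(lam, z))                  # λ · (x ⊕_λ y)
rhs = log_add(sum(l * xi for l, xi in zip(lam, x)),
              sum(l * yi for l, yi in zip(lam, y)))         # (λ·x) ⊕_log (λ·y)
print(abs(lhs - rhs) < 1e-9)               # True: Eq. (3.2) holds for this λ
# Re-evaluating with a modified λ' (e.g. LM scale 25 instead of 20) no longer
# satisfies the equality, which is exactly the error shown in Figure 3.1.
```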

As pointed out before, the multidimensional semiring is not strictly needed: instead the arc weights can be set to the precomputed scalar products and the standard log or tropical semiring can be applied. However, the theoretical advantage of the multidimensional semiring is that the scaling factors λ are integrated in the model: the semiring itself describes the log-linear model of which λ is a part. Modifying the scaling factors changes the weight computation over the lattice and thus the outcome of operations like computing the best scoring path. Consequently, modifying λ instantiates a new semiring.

The practical advantage of the multidimensional semirings over the corresponding single-dimensional semirings is twofold. First, after decoding over the tropical vector semiring the system- and model-dependent scores of the best hypothesis are available for post-processing steps, e.g. machine translation. Second, λ-invariant operations can be performed on the lattice without affecting the weight computation over the lattice with changed scaling factors. That is, the parameters of the log-linear model can be optimized on the modified lattice, e.g. after a lattice pre-processing. In particular, algorithms which do not use the ⊕-operator, like composition, ε-removal (without determinization of the ε-closures), and trimming, are invariant to the scaling factors. Algorithms which use the ⊕-operator, like single-source shortest-path or determinization, are not λ-invariant. Changing from λ′ to λ, λ′ ≠ λ, after instead of before applying the ⊕-operator induces an error.


Figure 3.1. Error induced by changing the LM scale after computing x ⊕ x; the LM scale is initialized with 20. The correct sum results from changing the scaling factors before applying the ⊕-operator. The ⊕-operator is defined in Equation (3.4). (The plot shows the normalized error over the LM scale for x = (25, 1.25), (250, 12.5), and (2500, 125).)

An example of the development of the error for the ⊕-operator of the log vector semiring, cf. Equation (3.4), is shown in Figure 3.1. The setup for the example uses values which are typical for a single arc in a word lattice produced by an LVCSR decoder. The initial scaling factors are λ′ := (1/20, 1), where 20 is a typical value for the language model scale. The modified scaling factors λ := (1/α, 1) vary only in the first component with language model scale α ∈ {15, 16, . . . , 25}. The x-axis of the graph is the language model scale α. The y-axis shows the normalized error defined as (y_cor − y_appr)/y_cor, where y_cor := λ · (x ⊕_λ x) and y_appr := λ · (x ⊕_{λ′} x). The error in a lattice is additive along a path. Thus, for long sentences or large scores the error induced by the ⊕-operator can become huge, which forbids the attempt to tune the scaling factors on the modified, e.g. determinized, lattice.

In practice, the obvious disadvantage of the vector semirings is that the computation of the ⊕- and ⊗-operators is more expensive for multidimensional weights. However, if speed is an issue the scores can be projected to a single dimension and the standard log or tropical semiring can be applied.

The time stamps stored at the transducer states are also subject to problems when applying generic transducer operations to word lattices. Operations which merge states, like composition or determinization, destroy the uniqueness of the time stamps; in this case the time stamps are discarded. If in a composition only one transducer has time stamps, then the time stamps are transferred to the composition result. If both WFSTs have time stamps, then the time stamps are discarded.

3.2 Probabilities over Lattices

3.2.1 Probabilities over a single Lattice

Let L be a lattice in WFST representation as described in Section 3.1, produced by an LVCSR system given acoustic features x_1^T. According to the definitions given in Equation (1.15) and Equation (3.1) the posterior probability of a word sequence w_1^N is given by

$$
p(w_1^N|x_1^T) = \Big[\exp\big(-\lambda\cdot[[L]]\big)\Big]^{-1} \exp\big(-\lambda\cdot[[L]](w_1^N)\big).
\tag{3.6}
$$

Defining the probability for a path a_1^L through L as

$$
p(a_1^L|x_1^T) := \Big[\exp\big(-\lambda\cdot[[L]]\big)\Big]^{-1}
\exp\Big(-\lambda\cdot\Big[\bigotimes_{l=1}^{L} w(a_l) \otimes w\big(\mathrm{to}(a_L)\big)\Big]\Big),
\tag{3.7}
$$

allows to rewrite Equation (3.6) as

$$
p(w_1^N|x_1^T) = \sum_{\substack{a_1^L\in L:\\ i(a_1^L)=w_1^N}} p(a_1^L|x_1^T).
\tag{3.8}
$$

The posterior probability for an arc a is defined as the sum over all paths going through a:

$$
p(a|x_1^T) := \sum_{\substack{a_1^L\in L:\\ \exists l:\, a_l = a}} p(a_1^L|x_1^T)
\tag{3.9}
$$

The next equation shows an efficient way to compute arc probabilities with the help of generic WFST operations:

$$
p(a|x_1^T) = \Big[\exp\big(-\lambda\cdot d(s_I)\big)\Big]^{-1}
\exp\Big(-\lambda\cdot\Big[d\big(\partial(\mathrm{from}(a); L^T)\big) \otimes w(a) \otimes d\big(\mathrm{to}(a)\big)\Big]\Big)
\tag{3.10}
$$

For lattice L and state s the value d(∂(s; LT )) is called forward score and d(s) is called backward score and the resulting algorithm is known as the forward/backward-algorithm. The forward scores for all states in an acyclic lattice can be efficiently computed in time O(|E| + |S|) by calculating the forward score for the last state and storing all intermediate results. The backward scores can be computed analogously.
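As an illustration of the forward/backward-algorithm, the following sketch computes arc posteriors for an acyclic lattice with scalar (already λ-scaled and combined) negative-log arc weights. It assumes a single initial and a single final state without final weight; the data layout and function names are hypothetical and only serve this example.

```python
import math
from collections import defaultdict

def neg_log_add(a, b):
    # -log(exp(-a) + exp(-b)) in the negative-log domain (log semiring ⊕)
    if a == math.inf: return b
    if b == math.inf: return a
    m = min(a, b)
    return m - math.log1p(math.exp(m - max(a, b)))

def arc_posteriors(arcs, initial, final, topo_states):
    """arcs: list of (from_state, to_state, label, weight); topo_states: states
    in topological order. Returns one posterior per arc, cf. Equation (3.10)."""
    out_arcs, in_arcs = defaultdict(list), defaultdict(list)
    for i, (u, v, _, _) in enumerate(arcs):
        out_arcs[u].append(i); in_arcs[v].append(i)
    fwd = defaultdict(lambda: math.inf); fwd[initial] = 0.0
    bwd = defaultdict(lambda: math.inf); bwd[final] = 0.0
    for s in topo_states:                       # forward scores
        for i in in_arcs[s]:
            u, _, _, w = arcs[i]
            fwd[s] = neg_log_add(fwd[s], fwd[u] + w)
    for s in reversed(topo_states):             # backward scores
        for i in out_arcs[s]:
            _, v, _, w = arcs[i]
            bwd[s] = neg_log_add(bwd[s], w + bwd[v])
    total = bwd[initial]                        # equals fwd[final]
    return [math.exp(-(fwd[u] + w + bwd[v] - total)) for (u, v, _, w) in arcs]
```

Both passes visit every arc once, matching the O(|E| + |S|) complexity stated above.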


Frame-wise word posterior probabilities p_t(w|x_1^T) model the chance of observing word w at time t:

$$
p_t(w|x_1^T) := \sum_{\substack{a_1^L\in L:\\ \exists l:\, i(a_l)=w \,\wedge\, \mathrm{beg}(a_l)\le t < \mathrm{end}(a_l)}} p(a_1^L|x_1^T)
= \sum_{\substack{a\in L:\\ i(a)=w \,\wedge\, \mathrm{beg}(a)\le t < \mathrm{end}(a)}} p(a|x_1^T)
\tag{3.11}
$$
Thus, the frame-wise word posteriors can be efficiently computed from the arc probabilities. Similarly, position- or slot-wise posterior probabilities can be computed from a lattice. Let us assume a function σ : E(L) → N which assigns each arc to a position or slot number. Under the assumption that σ(a_1) < σ(a_2) holds for any two arcs a_1 and a_2 which lie on the same path with a_1 preceding a_2, the probability for word w being observed at position s is given by:

$$
p_s(w|x_1^T) := \sum_{\substack{a_1^L\in L:\\ \exists l:\, i(a_l)=w \,\wedge\, \sigma(a_l)=s}} p(a_1^L|x_1^T)
= \sum_{\substack{a\in L:\\ i(a)=w \,\wedge\, \sigma(a)=s}} p(a|x_1^T)
\tag{3.12}
$$
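The two quantities above reduce to simple accumulations once the arc posteriors are available. The following sketch assumes integer frame indices, per-arc begin/end times and a slot assignment; it is only meant to illustrate Equations (3.11) and (3.12), the data structures are made up.

```python
from collections import defaultdict

def frame_wise_posteriors(arcs, post):
    # arcs: list of (word, beg, end) per arc; post: matching arc posteriors.
    # Equation (3.11): accumulate arc posteriors for every frame the arc covers.
    pt = defaultdict(lambda: defaultdict(float))
    for (w, beg, end), p in zip(arcs, post):
        for t in range(beg, end):
            pt[t][w] += p
    return pt

def slot_wise_posteriors(slot_of, labels, post):
    # Equation (3.12): accumulate arc posteriors per (slot, word) pair.
    ps = defaultdict(lambda: defaultdict(float))
    for s, w, p in zip(slot_of, labels, post):
        ps[s][w] += p
    return ps
```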

3.2.2 Probabilities over the Lattice Intersection

Let J be the number of LVCSR systems to be combined and L_j the lattice produced by the jth system given acoustic features x_1^T. For the intersection approach the semiring of all lattices is either set to the log or to the tropical vector semiring with dimensionality I · J, where I is the number of feature functions per system. The lattice from the jth system stores the scores in dimensions (j − 1) · I + 1 to j · I; the other dimensions are set to zero. The construction defines a log-linear model which combines the I × J knowledge sources provided by the J word lattices. And the intersection of the J lattices is the log-linear model combination of the J × I models: by the definition of the intersection each path and thus each arc in

$$
L_\cap := \bigcap_{j=1}^{J} L_j
\tag{3.13}
$$

has scores assigned from all J × I models. The sentence posterior probabilities can now be computed directly from the intersection result in the same way as for a single lattice, cf. Equation (3.6). However, in practice the intersection approach has several drawbacks:

• Building the intersection from lattices with many ε-arcs is expensive¹; often an ε-removal and a determinization of the lattices is necessary to make the intersection work.

• Time stamps are invalidated when applying standard transducer operations including the determinization and the intersection. Bayes risk decoders with loss functions which rely on correct word boundaries cannot be applied; this includes all Levenshtein distance approximations investigated in this thesis, cf. Chapter 4.

• The vocabulary of the intersection result is the intersection of the system-dependent vocabularies. Thus, the intersection increases the out-of-vocabulary (OOV) rate.

• The intersection of several lattices can be empty, if the lattices do not contain a common input sequence. This is especially the case if the systems use different vocabularies, e.g. in a cross-site system combination, or if rather long utterances are decoded.

¹ A decoder like the word-conditioned tree search decoder used in the RWTH Aachen system produces ε-arc free (and even deterministic) lattices. In this case ε-arcs can result from preparing the lattices for combination, e.g. by replacing non-word events like silence or noise by the empty word. Lattice pre-processing is discussed in detail in Section 3.6.


An alternative to the intersection approach is the lattice re-scoring. A base lattice is provided and arc-wise re-scored with all I · J models. The approach resolves the drawbacks of the intersection but introduces new problems: all systems must have the same pronunciation dictionary, and the re-scoring with fixed word boundaries usually causes inferior error rates. Results with the re-scoring approach are given in Chapter 7. From a theoretical point of view intersection and re-scoring describe the same model and thus are not distinguished in the abstract framework developed in this chapter.
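The dimension assignment described above for the intersection-based log-linear combination can be sketched in a few lines; the function below is purely illustrative and uses 0-based system indices.

```python
def stacked_weight(system_index, model_scores, num_systems):
    """Place the I model scores of one arc from system j (0-based) into
    dimensions j*I .. (j+1)*I-1 of a J*I-dimensional weight vector; all
    other dimensions are zero, cf. Section 3.2.2."""
    I = len(model_scores)
    vec = [0.0] * (num_systems * I)
    vec[system_index * I:(system_index + 1) * I] = model_scores
    return vec

# acoustic and LM score of an arc from the second of three systems:
print(stacked_weight(1, [812.4, 23.7], 3))  # -> [0.0, 0.0, 812.4, 23.7, 0.0, 0.0]
```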

3.2.3 Probabilities over the Lattice Union

Again, let J be the number of LVCSR systems to be combined. In the common approaches to system combination in ASR the sentence posterior probabilities are computed as the weighted average of the system-dependent sentence posteriors. The motivation is to introduce the system as a hidden variable and derive the posterior probability by marginalizing over the systems:

$$
p(w_1^N|x_1^T) = \sum_{j=1}^{J} p(j|x_1^T)\, p(w_1^N|j, x_1^T) = \sum_{j=1}^{J} p(j)\, p_j(w_1^N|x_1^T),
\tag{3.14}
$$

where the model assumption is made that the system prior p(j) is independent of the acoustic observation. This is the model used in ROVER with confidence scores [Fiscus 1997] and in confusion network combination (CNC) [Evermann & Woodland 2000]. Let L_j be the lattice produced by the jth system given acoustic features x_1^T. Looking at the definition of the union in Equation (1.20) and at the definition of the sentence posterior probability for a single lattice given in Equation (3.6), it is easy to see that the union over slightly modified lattices L_j yields the desired posterior probabilities. Each lattice L_j is modified such that it has a new initial state which is connected with the former initial state by an ε-arc with weight ω_j ⊗ [[L_j]]^{-1}. Here, ω_j is simply the weighted negated logarithm of the jth system prior p(j) such that exp(−λ · (ω_j ⊗ x)) = p(j) exp(−λ · x). The modified lattice is denoted by L′_j and the union over the modified lattices by

$$
L'_\cup := \bigcup_{j=1}^{J} L'_j.
\tag{3.15}
$$

Equation (3.16) proves that the union over the modified lattices yields the desired posterior probabilities:

$$
\exp\big[-\lambda\cdot[[L'_\cup]](w_1^N)\big]
= \sum_{j=1}^{J} \exp\big[-\lambda\cdot[[L'_j]](w_1^N)\big]
= \sum_{j=1}^{J} \exp\big[-\lambda\cdot\big(\omega_j \otimes [[L_j]]^{-1} \otimes [[L_j]](w_1^N)\big)\big]
= \sum_{j=1}^{J} p(j)\, p_j(w_1^N|x_1^T)
\tag{3.16}
$$

A direct advantage of the modified union approach over the intersection method is that the OOV rate in the union is reduced rather than increased. The union always exists, whereas the intersection might be empty. And in contrast to the lattice intersection, in the lattice union the time stamps always survive. This makes the union particularly interesting for all Bayes risk decoders based on a cost function which requires exact time stamps; this includes all Levenshtein distance approximations investigated in this thesis, cf. Chapter 4.
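A minimal sketch of the modified union construction is given below, working on scalar negative-log weights. The lattice dictionaries and the field 'log_total' (log Z_j, e.g. obtained from a forward/backward pass as in the earlier sketch) are assumptions made for this illustration; they are not part of any particular toolkit.

```python
import math

def modified_union(lattices, priors):
    """lattices: list of dicts with 'initial', 'arcs' (from, to, label, weight)
    over disjoint state names, and 'log_total' = log Z_j with
    Z_j = sum over all paths of exp(-path score).
    priors: system priors p(j). Builds L'_union, cf. Equation (3.15)."""
    union_arcs, new_initial = [], "I_union"
    for lat, pj in zip(lattices, priors):
        # epsilon arc carrying omega_j (x) [[L_j]]^{-1}: -log p(j) + log Z_j,
        # so that a path posterior in L'_j equals p(j) * p_j(path).
        union_arcs.append((new_initial, lat["initial"], "<eps>",
                           -math.log(pj) + lat["log_total"]))
        union_arcs.extend(lat["arcs"])
    return {"initial": new_initial, "arcs": union_arcs}
```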


The frame-wise word posteriors of the modified lattice union can either be computed directly from L′_∪, cf. Equation (3.11), or equivalently by averaging the system-dependent frame-wise word posterior probabilities:

$$
p_t(w|x_1^T) = \sum_{\substack{a_1^L\in L'_\cup:\\ \exists l:\, i(a_l)=w \,\wedge\, \mathrm{beg}(a_l)\le t < \mathrm{end}(a_l)}} p(a_1^L|x_1^T)
= \sum_{j=1}^{J} p(j) \sum_{\substack{a_1^L\in L_j:\\ \exists l:\, i(a_l)=w \,\wedge\, \mathrm{beg}(a_l)\le t < \mathrm{end}(a_l)}} p_j(a_1^L|x_1^T)
= \sum_{j=1}^{J} p(j)\, p_{j,t}(w|x_1^T)
\tag{3.17}
$$

The same holds for the slot-wise word posteriors, cf. Equation (3.12), given a slot function σ : E(L′_∪) → N defined over the lattice union:

$$
p_s(w|x_1^T) = \sum_{\substack{a_1^L\in L'_\cup:\\ \exists l:\, i(a_l)=w \,\wedge\, \sigma(a_l)=s}} p(a_1^L|x_1^T)
= \sum_{j=1}^{J} p(j) \sum_{\substack{a_1^L\in L_j:\\ \exists l:\, i(a_l)=w \,\wedge\, \sigma(a_l)=s}} p_j(a_1^L|x_1^T)
= \sum_{j=1}^{J} p(j)\, p_{j,s}(w|x_1^T)
\tag{3.18}
$$

3.3 Lattice-Based System Combination in the Bayes Risk Decoding Framework

3.3.1 The MAP/Viterbi Decoding Framework

The maximum a-posteriori (MAP) decoding rule for a word lattice L is derived by inserting Equation (3.6) in Equation (1.1):

$$
\begin{aligned}
x_1^T \to \hat{W} :={}& \operatorname*{argmin}_{w_1^N,N}\; \lambda\cdot[[L]]_{\log}(w_1^N) \\
={}& \operatorname*{argmax}_{w_1^N,N} \sum_{\substack{a_1^L\in L:\\ i(a_1^L)=w_1^N}} p(a_1^L|x_1^T) \\
={}& \mathrm{best}\Big(\mathrm{det}_{\log}\big(\mathrm{remove\text{-}}\varepsilon(L)\big)\Big),
\end{aligned}
\tag{3.19}
$$

where best(L) returns the sequence of the input labels of the shortest path through L; the weight of the shortest path equals d_trop(L). Applying the Viterbi approximation yields

$$
\begin{aligned}
x_1^T \to \hat{W} :={}& \operatorname*{argmax}_{w_1^N,N} \sum_{\substack{a_1^L\in L:\\ i(a_1^L)=w_1^N}} p(a_1^L|x_1^T) \\
\overset{\text{Viterbi}}{=}{}& \operatorname*{argmax}_{w_1^N,N} \max_{\substack{a_1^L\in L:\\ i(a_1^L)=w_1^N}} p(a_1^L|x_1^T) \\
={}& \mathrm{best}(L).
\end{aligned}
\tag{3.20}
$$

The main difference in the implementation is that Viterbi decoding does not require the determinization. In contrast to the full search space of an HMM state-wise decoding, the determinization of a lattice is computationally possible, if the lattice is not too dense. But determinization is expensive: it still has an exponential worst-case complexity and can cause a run-time in O(exp(|E(L)|)) for the MAP decoder. In practice, a strong lattice pruning is applied before determinization. Lattice pruning is discussed further in Section 3.6.

The MAP and Viterbi decoder can easily be used for system combination by decoding the lattice intersection or the modified lattice union, cf. Section 3.2.2 and Section 3.2.3. However, in both cases the computation of the MAP hypothesis requires one or even several lattice determinizations, which can become expensive. In practice, especially the determinization of the lattice union turned out to be very expensive and makes the approach infeasible for LVCSR. On the other hand, the Viterbi decoder is not a suitable choice for decoding the lattice union, because the Viterbi approximation replaces the sum in Equation (3.14) by the maximum, which eventually results in a sentence posterior based system selection. In conclusion, the MAP/Viterbi decoding framework is not a good choice for intersection or union based, i.e. log-linear or averaged sentence posterior probability based, system combination. An exception is the arc-wise re-scoring based approach which is investigated further in Chapter 7.
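For completeness, the Viterbi decoder of Equation (3.20) is just a single-source shortest-path search in the tropical semiring. The sketch below assumes an acyclic lattice with scalar negative-log arc weights and topologically sorted states; the representation is the same hypothetical one used in the earlier forward/backward sketch.

```python
import math

def viterbi_decode(arcs, initial, final, topo_states):
    """arcs: (from_state, to_state, label, weight). Returns the input label
    sequence of the best scoring path (epsilon labels are dropped)."""
    best = {s: (math.inf, None) for s in topo_states}   # (score, back-arc)
    best[initial] = (0.0, None)
    in_arcs = {s: [] for s in topo_states}
    for i, (u, v, _, w) in enumerate(arcs):
        in_arcs[v].append(i)
    for s in topo_states:
        for i in in_arcs[s]:
            u, _, _, w = arcs[i]
            cand = best[u][0] + w
            if cand < best[s][0]:
                best[s] = (cand, i)
    words, s = [], final                                # trace back
    while best[s][1] is not None:
        u, _, label, _ = arcs[best[s][1]]
        if label != "<eps>":
            words.append(label)
        s = u
    return list(reversed(words))
```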

3.3.2 MAP/Viterbi Decoding Results

In this section experimental results for the intersection and union based system combination in the MAP and Viterbi framework are given and discussed. Experiments are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation system. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. Only for the English EPPS 2007 evaluation cross-site combination are neither MAP nor intersection results produced: the setup uses extremely long utterances, which already makes the determinization of the system-dependent lattices computationally infeasible. For all experiments acoustic and language model scales and the system weights in the union based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7.

The first set of experiments compares lattice-based MAP and Viterbi decoding for a single system. The Chinese setup consists of three subsystems and the English setup of four. The experimental results are shown in Table 3.1 and Table 3.2. The results are summarized by decoder; for each decoder the system with the lowest error rate on the tuning set is highlighted. The results show no benefit for MAP based lattice decoding. In fact, the MAP decoding is slower and has the disadvantage that the MAP decoding result comes without time stamps (due to the determinization). The word boundaries are computed with the forced-alignment algorithm described later in Section 5.1.2.

Intersection results are produced with the MAP and the Viterbi decoder. In order to make the computation of the intersection efficient, the system-dependent lattices are made ε-arc free and are determinized. As a result, the intersection has no time stamps and, as for the MAP decoder, the word boundaries are computed according to Section 5.1.2. For the union based lattice combination only Viterbi results are produced: MAP computation turned out to be expensive due to the determinization of the union; the run-time for a single experiment took days to weeks.

The results show that the intersection based combination approach works and the outcome improves over the results of the best single system. The improvements on the tuning set generalize to the test sets and improvements of the same magnitude can be observed. For both the Chinese and the English system the best approach reduces the error rate by 5% relative compared to the best single system. The Chinese system benefits from intersecting all three systems, whereas the error rates for the English system increase when intersecting more than two systems. A possible explanation is the OOV rate: the Chinese subsystems share the same vocabulary, whereas the four English lattice sets are produced with different vocabularies. For some utterances the intersection is empty and a back-off strategy is applied: the hypothesis from the best performing system (on the tuning set) is used. The percentage of utterances for which the intersection exists is included in Table 3.1 and Table 3.2. Again, for the intersection approach the MAP decoder cannot improve over the Viterbi decoder.

The modified union based combination decoded with the Viterbi approximation is eventually a system selection: the system with the hypothesis with the highest posterior probability is chosen.



Table 3.1. Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.

System          Combination             dev07¹               eval07               dev08
                                        CER[%] (del/ins) err
Viterbi Decoder
  s1            -                       (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
  s2            -                       (2.65/1.70) 14.82    (4.44/0.93) 15.02    (2.71/0.94) 13.54
  s3            -                       (2.65/1.64) 15.07    (4.57/1.04) 15.60    (2.84/0.93) 13.80
  s1+s2         intersection (97.6%)    (2.55/1.58) 14.05    (4.43/0.91) 14.59    (2.75/0.84) 13.09
  s1+s2         union                   (2.59/1.65) 14.25    (4.44/0.92) 14.86    (2.74/0.89) 13.36
  s1+s2+s3      intersection (92.4%)    (2.46/1.56) 13.91    (4.38/0.91) 14.57    (2.66/0.83) 12.65
  s1+s2+s3      union                   (2.57/1.64) 14.09    (4.47/0.92) 14.83    (2.77/0.87) 13.17
MAP Decoder
  s1            -                       (2.67/1.56) 14.56    (4.42/0.91) 15.14    (2.88/0.85) 13.39
  s2            -                       (2.63/1.72) 14.80    (4.41/0.96) 15.00    (2.66/0.97) 13.47
  s3            -                       (2.66/1.63) 15.08    (4.56/1.04) 15.58    (2.83/0.92) 13.82
  s1+s2         intersection (97.6%)    (2.48/1.64) 14.04    (4.37/0.91) 14.56    (2.63/0.85) 12.91
  s1+s2+s3      intersection (92.4%)    (2.49/1.59) 14.01    (4.40/0.90) 14.45    (2.68/0.87) 12.63

¹ tuning set

Even this simple selection scheme works and shows an improvement over the best single system. Throughout this work the Viterbi result of the best single system will serve as the baseline for the upcoming Bayes risk decoding and system combination results.

3.3.3 The Bayes Risk Decoding Framework with Local Cost Functions

The definition of the lattice-based Bayes risk distinguishes between hypothesis space and summation space. The summation space is represented by lattice S and describes the posterior probability distribution over all word sequences w_1^N computed according to Equation (3.6). By the definition of the sentence posterior probability, cf. Equation (3.8), a word sequence which is not present in S has a probability of zero and thus makes no contribution to the posterior computation in the Bayes risk decoder. However, the Bayes risk hypothesis might not be contained in the summation space lattice S, as shown for example in Table 3.3. The size of the hypothesis space depends on the summation space and on the loss function, but in the general case it contains all possible word sequences. In practice, often only a subset of the complete hypothesis space is explored. The restricted hypothesis space is represented by lattice H. The Bayes risk for an arbitrary loss function L(·, ·), summation space lattice S, and hypothesis space lattice H is given by

$$
\begin{aligned}
x_1^T \to \hat{r} :={}& \min_{v_1^M,M} \sum_{w_1^N,N} p(w_1^N|x_1^T)\, L(v_1^M, w_1^N) \\
={}& \min_{v_1^M,M} \sum_{b_1^K\in S} p(b_1^K|x_1^T)\, L(v_1^M, i(b_1^K)) \\
\le{}& \min_{a_1^L\in H} \sum_{b_1^K\in S} p(b_1^K|x_1^T)\, L(i(a_1^L), i(b_1^K)),
\end{aligned}
\tag{3.21}
$$

where the inequality is caused by the possibly restricted hypothesis space: if the optimal hypothesis is not contained in H, then the result is larger than the exact Bayes risk r̂.



Table 3.2. Results for the English EPPS 2007 evaluation systems, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.

System          Combination             dev06                eval06¹              eval07
                                        WER[%] (del/ins) err
Viterbi Decoder
  s1            -                       (1.65/2.21) 11.09    (1.38/1.36)  8.43    (1.86/1.31)  9.81
  s2            -                       (1.77/2.28) 11.89    (1.67/1.23)  8.70    (2.12/1.31) 10.07
  s3            -                       (2.06/2.29) 12.43    (1.80/1.30)  8.98    (2.22/1.34) 10.76
  s4            -                       (2.04/2.18) 12.06    (1.85/1.38)  9.44    (2.68/1.42) 11.73
  s1+s2         intersection (99.5%)    (1.72/2.09) 10.85    (1.48/1.25)  8.07    (1.99/1.21)  9.29
  s1+s2         union                   (1.82/2.00) 11.05    (1.56/1.24)  8.33    (2.04/1.23)  9.79
  s1+s2+s3      intersection (98.5%)    (1.73/2.17) 11.27    (1.49/1.28)  8.18    (1.93/1.28)  9.57
  s1+s2+s3      union                   (1.86/2.05) 11.23    (1.59/1.26)  8.38    (1.99/1.22)  9.66
  s1+s2+s3+s4   intersection (97.4%)    (1.72/2.17) 11.19    (1.54/1.20)  8.12    (1.99/1.24)  9.54
  s1+s2+s3+s4   union                   (1.86/2.05) 11.24    (1.59/1.26)  8.38    (1.99/1.23)  9.67
MAP Decoder
  s1            -                       (1.66/2.29) 11.19    (1.41/1.43)  8.51    (1.84/1.35)  9.84
  s2            -                       (1.85/2.27) 11.81    (1.72/1.23)  8.73    (2.18/1.33) 10.14
  s3            -                       (2.04/2.34) 12.46    (1.79/1.33)  8.99    (2.19/1.37) 10.77
  s4            -                       (1.97/2.32) 12.27    (1.73/1.47)  9.45    (2.56/1.54) 11.77
  s1+s2         intersection (99.5%)    (1.68/2.12) 10.84    (1.46/1.29)  8.11    (1.94/1.25)  9.40
  s1+s2+s3      intersection (98.5%)    (1.73/2.22) 11.28    (1.48/1.31)  8.22    (1.90/1.33)  9.62
  s1+s2+s3+s4   intersection (97.4%)    (1.70/2.23) 11.27    (1.53/1.26)  8.20    (1.96/1.33)  9.73

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign



Table 3.3. Example for the situation where the Bayes risk hypothesis Ŵ, i.e. the hypothesis with the minimum expected word error rate, has a sentence posterior probability of zero and thus is not contained in the summation space.

w1N                                          p(w1N|x1T)
"coca cola's share in market"                0.4
"coca cola's share the market"               0.4
"cola's share in the market"                 0.2
Ŵ = "coca cola's share in the market"        error = 1

In Bayes risk decoding of LVCSR lattices the main interest is in loss functions which approximate the Levenshtein distance. In case of the Levenshtein distance the hypothesis space is usually larger than the summation space defined by lattice S. In the general case the hypothesis space shall provide exact word boundaries, which are required by most approximations of the Levenshtein distance. This consideration motivates the usage of the time-conditioned form of the summation space lattice S as the default hypothesis space lattice H: all states with the same time stamp are merged [Hoffmeister & Klein+ 2006; Hoffmeister & Schlüter+ 2009]. Thus, the resulting hypothesis space is a superset of the summation space, but preserves the correct time stamp for each state.

Two special cases arise from using the sentence error and the confusion network (CN) distance as loss functions. For the sentence error, i.e. for the MAP decoder, it is easy to see that hypothesis and summation space are equal. The CN distance is an approximation of the Levenshtein distance for which it is possible to search the complete hypothesis space; CNs and Bayes risk decoding with the CN distance as loss function are discussed later in Section 3.4.

The Bayes risk decoder is simply the lattice decoder which returns the path from the hypothesis space which minimizes the Bayes risk on the summation space w.r.t. a given loss function. The extension of the Bayes risk decoder to lattice-based system combination is straightforward: in Equation (3.21) the sentence posterior p(w_1^N|x_1^T) is computed either from a log-linear model combination, cf. Section 3.2.2, or from the weighted average of the system-dependent sentence posterior probabilities, cf. Section 3.2.3. This is equivalent to using the lattice intersection L_∩ or the modified lattice union L′_∪ as summation space lattice S. The lattice intersection is not suitable for the Levenshtein distance approximations investigated in this work: all approximations require exact word boundaries, which are not preserved in the intersection. However, in Chapter 7 Bayes risk decoding with the CN distance as loss function is applied to the log-linear model derived from a lattice re-scoring and compared to the modified lattice union.

The computation of the Bayes risk hypothesis from an LVCSR lattice using the Levenshtein distance as loss function is computationally prohibitive and approximations are required. In the first decoding approaches N-best lists of moderate size were used and the Bayes risk with the Levenshtein distance as loss function was computed on these lists [Goel & Byrne+ 1998; Stolcke & König+ 1997]. The N-best list approach is still computationally expensive and the considered summation and hypothesis spaces are orders of magnitude smaller than for lattices. The standard approach to lattice-based decoding is to place the approximation in the loss function. The goal is to find a loss function which is close to the Levenshtein distance and at the same time enables an efficient computation of the lattice-based Bayes risk decoding rule defined in Equation (3.21). Looking at the decoding rule reveals that an efficient computation of the Bayes risk cannot have long-term dependencies in the loss computation. Long-term dependencies would require an expansion of the lattice structure, in the worst case the expansion to the full N-best list. Thus, the loss functions used for lattice-based Bayes risk decoding aim at reducing the dependencies.
Let c(a_1^L, b_1^K) be a general cost function for a path a_1^L through the hypothesis space lattice H and a path b_1^K through the summation space lattice S. The cost function is assumed to be additive, like the Levenshtein distance. The first approximation makes the cost function local to the hypothesis space


lattice, i.e. the cost for arc a_l does not depend on the cost of arc a_k with k ≠ l:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} c(a_l, b_1^K)
\tag{3.22}
$$

The second approximation requires that only arcs compete which have overlap in time:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} \sum_{\substack{b_j^k:\; o(a_l,b_i)>0 \text{ for } i\in[j,k],\\ o(a_l,b_i)=0 \text{ for } i\notin[j,k]}} c(a_l, b_j^k)
\tag{3.23}
$$

where o(a, b) denotes the overlap in time of arc a and arc b. Cost functions fulfilling Equation (3.22) and Equation (3.23) are called type one cost functions. In addition, the most common approximations for the Levenshtein distance are local in the summation space:

$$
c(a_1^L, b_1^K) = \sum_{l=1}^{L} \sum_{\substack{k=1:\\ o(a_l,b_k)>0}}^{K} c(a_l, b_k)
\tag{3.24}
$$

These cost functions will be referred to as cost functions of the second type. For both types of local costs an efficient implementation of the Bayes risk decoder exists, where for a type two cost function an efficiently computable Bayes risk decoder exists even if constraint (3.23) is violated. For the derivation of the decoders the following notation is introduced. The set of all sub-paths in L which intersect in time with arc a is denoted by O_sub(a; L). In other words, for each path b_1^K with sub-path b_j^k ∈ O_sub(a; L) holds o(a, b_i) > 0 for i ∈ [j, k] and o(a, b_i) = 0 for i ∉ [j, k]. Furthermore, the notation b_j^k ∈ φ_1^K means that b_j^k is a sub-path of path φ_1^K. The Bayes risk for a cost function of the first type can be computed by finding the shortest path in an arc-wise re-scored hypothesis space lattice:

$$
\begin{aligned}
x_1^T \to \hat{r} :={}& \min_{a_1^L\in H} \sum_{b_1^K\in S} p(b_1^K|x_1^T) \sum_{l=1}^{L} \sum_{b_j^k\in O_{\mathrm{sub}}(a_l;S)} c(a_l, b_j^k) \\
={}& \min_{a_1^L\in H} \sum_{l=1}^{L} \sum_{b_j^k\in O_{\mathrm{sub}}(a_l;S)} \sum_{\substack{\phi_1^K\in S:\\ b_j^k\in\phi_1^K}} p(\phi_1^K|x_1^T)\, c(a_l, b_j^k) \\
={}& \min_{a_1^L\in H} \sum_{l=1}^{L} \underbrace{\sum_{b_j^k\in O_{\mathrm{sub}}(a_l;S)} p(b_j^k|x_1^T)\, c(a_l, b_j^k)}_{:=\,c(a_l;S)} \\
={}& d_{\mathrm{trop}}\big(\mathrm{rescore}(H, c(\cdot\,;S))\big)
\end{aligned}
\tag{3.25}
$$

The disadvantage of Bayes risk decoding with cost functions of the first type is that the computation still requires a local expansion of the summation space for getting the partial paths b_j^k. The time overlap constraint is essential in order to restrict the expansion. Otherwise each path in the summation space would have to be compared to each arc in the hypothesis space, which would be computationally infeasible in LVCSR. Once the expansion is done the computation of the partial path probability p(b_j^k|x_1^T) is efficient: simply use the algorithm for computing arc posteriors and replace the single arc weight by the product over the arc weights in the partial path, w(b_j) ⊗ w(b_{j+1}) ⊗ . . . ⊗ w(b_k). However, due to the local expansion the worst case complexity is exponential in the number of arcs. In practice, the run-time highly depends on the lattice structure. If a long word competes with a highly connected cloud of short words, then a local exponential blow-up can happen. In practice, the union of lattices from several systems and in particular the cross-site combination case shows such problematic behavior.


In contrast to the type one cost functions, Bayes risk decoding with cost functions of the second type does not require the local expansion and guarantees an efficient computation of the Bayes risk hypothesis:

$$
\begin{aligned}
x_1^T \to \hat{r} :={}& \min_{a_1^L\in H} \sum_{b_1^K\in S} p(b_1^K|x_1^T) \sum_{l=1}^{L} \sum_{\substack{k=1:\\ o(a_l,b_k)>0}}^{K} c(a_l, b_k) \\
={}& \min_{a_1^L\in H} \sum_{l=1}^{L} \sum_{\substack{b\in E(S):\\ o(a_l,b)>0}} \sum_{\substack{\phi_1^K\in S:\\ \exists k:\,\phi_k=b}} p(\phi_1^K|x_1^T)\, c(a_l, b) \\
={}& \min_{a_1^L\in H} \sum_{l=1}^{L} \underbrace{\sum_{\substack{b\in E(S):\\ o(a_l,b)>0}} c(a_l, b)\, p(b|x_1^T)}_{:=\,c(a_l;S)} \\
={}& d_{\mathrm{trop}}\big(\mathrm{rescore}(H, c(\cdot\,;S))\big)
\end{aligned}
\tag{3.26}
$$

The time complexity of Equation (3.26) is in the worst case O(|S(H)| + |S(S)| + |E(H)||E(S)|). The arc-wise probabilities p(a|x_1^T) for all arcs a ∈ E(S) can be computed in time O(|S(S)| + |E(S)|), because S is acyclic. In the re-scoring step, for each arc in the hypothesis space a sum over the posteriors of all arcs in the summation space is computed. Together with the subsequent Viterbi decoding step this yields the worst case complexity. In the worst-case analysis the time overlap constraint cannot be exploited. In practice, due to the time overlap constraint and for lattices which are not too dense, the algorithm can be implemented such that the run-time grows almost only linearly with the number of arcs. However, since an efficient decoding for type two cost functions does not rely on the time locality constraint, the requirement can be declared optional. Indeed, in the following chapter a local cost function is investigated which is of type two but violates the time locality constraint: the cost function defined by the confusion network combination (CNC).

The two classes cover all cost functions which are commonly used in word error minimizing training and decoding approaches for LVCSR tasks; the topic is discussed further in Section 3.5. Instances of cost functions of type one and two and the resulting Bayes risk decoders are introduced and investigated in Chapter 4.
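The re-scoring step for a type two cost function can be sketched as follows. The example assumes per-arc (label, begin, end) tuples and precomputed summation-space arc posteriors; a simple word-mismatch cost stands in for the concrete local costs of Chapter 4, and all names are illustrative.

```python
def expected_arc_costs(hyp_arcs, sum_arcs, sum_post, local_cost):
    """hyp_arcs / sum_arcs: lists of (label, beg, end); sum_post: posteriors of
    the summation-space arcs. Returns c(a_l; S) for every hypothesis arc,
    cf. Equation (3.26)."""
    costs = []
    for a in hyp_arcs:
        c = 0.0
        for b, p in zip(sum_arcs, sum_post):
            overlap = min(a[2], b[2]) - max(a[1], b[1])
            if overlap > 0:                      # time overlap constraint o(a,b) > 0
                c += p * local_cost(a, b)
        costs.append(c)
    return costs

def mismatch_cost(a, b):
    # a simple illustrative local cost: 0 for matching labels, 1 otherwise
    return 0.0 if a[0] == b[0] else 1.0
```

Replacing the weights of the hypothesis lattice by these expected costs and running a shortest-path search (e.g. the Viterbi sketch above) then yields the Bayes risk hypothesis.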

3.4 Confusion Network based System Combination in the Bayes Risk Decoding Framework

A confusion network (CN) is a sequence of word posterior probability distributions. The probabilities are derived from the sentence posteriors of a set of aligned word sequences. The CN can be interpreted as the sequence of alignment positions, where to each position belongs a posterior distribution over all words aligned to that position. The alignment positions are often referred to as slots and the CN as a sequence of slots. The terminology comes presumably from the way CN construction algorithms work: each word is inserted into a slot.

A CN is completely described by a lattice L and a function σ : E(L) → N referred to as slot function. The slot function maps the lattice arcs to the CN slots, where σ(a) < σ(b) holds for any two consecutive lattice arcs a and b. In particular, the constraint guarantees that two arcs lying on the same path are not assigned to the same slot. The mapping is used to derive the slot-wise word posterior probabilities, which are computed according to Equation (3.18) for all words but the empty word. In order to guarantee a probability distribution, the probability for the empty word ε in slot s is derived as

$$
p_s(\varepsilon|x_1^T) = 1 - \sum_{w\neq\varepsilon} p_s(w|x_1^T).
$$

A CN is defined as the ordered sequence of the slot-wise word posterior probability distributions and can be expressed as a word lattice without time stamps and with a sausage structure: all arcs leaving state si



Figure 3.2. The figure shows a word lattice with time stamps at the states, a slot function, and the confusion network induced by the slot function.

end in state s_{i+1}. The CN in word lattice representation derived from lattice L and slot function σ(·) is denoted by CN(L, σ(·)), or by CN(L) for an arbitrary slot function. From the construction it follows that the sentences accepted by CN(L) are a superset of the sentences accepted by the lattice itself. In this work only CNs derived from a lattice and a slot function are considered. Thus, a CN is always associated with a lattice and provides a unique mapping from the arcs of the lattice to the CN slots. Figure 3.2 visualizes the connection between lattice, slot function, and CN.

The slot function of a CN defines an alignment between each pair of paths through the lattice: arcs assigned to the same slot compete with each other. CN construction algorithms aim at finding a slot function which covers the Levenshtein alignment between each pair of paths through the lattice. Instances of CN construction algorithms are introduced and discussed in the next chapter in Section 4.4.

The slot function can be used to define a local cost of the second type which, together with Equation (3.26), results in an efficient Bayes risk decoder. The cost is particularly simple to compute if the CN of the summation space lattice CN(S) serves as hypothesis space. Any two word sequences v_1^S and w_1^S taken from CN(S) have equal length S, where S is the number of slots in the CN. Making use of this property, the CN distance between v_1^S and w_1^S is given by

$$
c_{CN}(v_1^S, w_1^S) = \sum_{s=1}^{S} \big(1 - \delta(v_s, w_s)\big).
\tag{3.27}
$$

Defining the appropriate re-scoring function for the Bayes risk decoder, cf. Equation (3.26), using CN(S) as hypothesis space lattice H, and simplifying the resulting formula yields a simple decoding rule:

$$
x_1^T \to \hat{W}_1^S, \qquad \hat{W}_s := \operatorname*{argmax}_{w}\; p_s(w|x_1^T)
\tag{3.28}
$$

Furthermore, the usage of the CN as hypothesis space guarantees that the optimal hypothesis is included in H. Therefore, Ŵ_1^S is the Bayes risk hypothesis and the Bayes risk itself is given by

$$
x_1^T \to \hat{r} = \sum_{s=1}^{S} \Big(1 - \max_{w}\; p_s(w|x_1^T)\Big).
\tag{3.29}
$$

The CN decoding rule was originally developed in [Mangu 2000], where the proofs of the above claims can also be found. The extension to the CN decoding of arbitrary hypothesis space lattices will be given in the next chapter in Section 4.4.

Confusion network based system combination can be done in two ways: the first way is to derive the slot function directly from the lattice intersection or modified lattice union. An alternative way is to compute a slot function and thus a CN for each of the J lattices and align the system-dependent slot sequences; the result of the alignment is again a CN which can be decoded according to Equation (3.28). In the next section the common confusion network combination (CNC) algorithm proposed in [Evermann & Woodland 2000] is investigated in the Bayes risk decoding framework.
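The decoding rule of Equation (3.28) is a simple slot-wise argmax; the following fragment illustrates it on a hypothetical CN given as a list of word-to-posterior dictionaries (the empty string standing for ε).

```python
def decode_cn(slots):
    """slots: list of dicts mapping word -> slot-wise posterior, '' = epsilon.
    Implements Equation (3.28) and drops chosen epsilon entries."""
    hyp = [max(dist, key=dist.get) for dist in slots]
    return [w for w in hyp if w != ""]

# example with two slots
print(decode_cn([{"coca": 0.8, "": 0.2}, {"cola's": 1.0}]))  # ['coca', "cola's"]
```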



3.4.1 Confusion Network Combination (CNC)

In this section the confusion network combination (CNC) algorithm proposed in [Evermann & Woodland 2000] is derived by formulating the CN alignment problem in the Bayes risk decoding framework. Furthermore, it is shown that the CNC computes a slot function over the lattice union and thus CNC is nothing else but a Bayes risk decoding of the modified lattice union with a cost function of the second type.

The first step is to look at the alignment of two CNs derived from the two lattices L_1 and L_2. The corresponding slot-wise posterior probability distributions are denoted by p_{1,n}(·|x_1^T) and p_{2,n}(·|x_1^T). The alignment between the two CNs is defined on slot level and consists of pairs of slot numbers, where slot numbering starts from one:

$$
A := \big((k_1, l_1), (k_2, l_2), \ldots, (k_S, l_S)\big),
$$

where either k_i < k_j for i < j or k_i = 0, but not k_i = l_i = 0, and analogously for l_i. An alignment pair (k, l) means that the kth slot from the first CN is aligned to the lth slot of the second CN. If k = 0 then the lth slot from the second CN is inserted, and vice versa. For convenience the posterior distribution p_{·,0}(·|x_1^T) for the pseudo slot 0 is introduced; it equals one for the empty word ε and zero otherwise. The alignment can be used to build a new CN by averaging the slot-wise word posterior distributions of the aligned slots. Hence, the slot-wise word posterior probabilities for the combined CN are given by

$$
p_s(w|x_1^T) = p(1)\, p_{1,k_s}(w|x_1^T) + p(2)\, p_{2,l_s}(w|x_1^T).
\tag{3.30}
$$

On the other hand it is easy to see that the combined CN defines a slot function over the lattice union L_1 ∪ L_2: all arcs in L_1 which are assigned by the system-dependent slot function to slot k_s are assigned to slot s in the combined CN, and analogously for L_2. Applying the slot function to the modified lattice union, cf. Equation (3.16), results in

$$
\begin{aligned}
p_s(w|x_1^T) ={}& \sum_{\substack{a_1^L\in L'_1\cup L'_2:\\ \exists l:\, \sigma(a_l)=s \,\wedge\, i(a_l)=w}} p(a_1^L|x_1^T) \\
={}& p(1) \sum_{\substack{a_1^L\in L_1:\\ \exists l:\, \sigma(a_l)=k_s \,\wedge\, i(a_l)=w}} p_1(a_1^L|x_1^T)
 + p(2) \sum_{\substack{a_1^L\in L_2:\\ \exists l:\, \sigma(a_l)=l_s \,\wedge\, i(a_l)=w}} p_2(a_1^L|x_1^T) \\
={}& p(1)\, p_{1,k_s}(w|x_1^T) + p(2)\, p_{2,l_s}(w|x_1^T).
\end{aligned}
$$

That is, the difference between CNC and applying a CN construction algorithm directly to the modified lattice union lies only in the resulting cost function. Both approaches define different local costs of the second type, where the CNC based cost function might violate the time overlap constraint.

The remaining question is how to find the CN alignment. The goal in CNC is to minimize the Bayes risk computed from the cost function defined by the combined CN. Obviously, the Bayes risk depends on the CN alignment. Let us denote the optimal alignment by Â. Using the definition of the CN distance, cf. Equation (3.27), the resulting optimization problem is defined as

$$
\hat{A} := \operatorname*{argmin}_{A} \min_{v_1^S,S} \sum_{w_1^S} p_A(w_1^S|x_1^T)\, c_{CN}(v_1^S, w_1^S)
= \operatorname*{argmin}_{A} \min_{v_1^S,S} \sum_{w_1^S} p_A(w_1^S|x_1^T) \sum_{s=1}^{S} \big(1 - \delta(v_s, w_s)\big).
\tag{3.31}
$$

Inserting Equation (3.30) into the optimization problem yields

$$
\begin{aligned}
\hat{A} :={}& \operatorname*{argmin}_{A} \min_{v_1^S,S} \sum_{w_1^S} p_A(w_1^S|x_1^T) \sum_{s=1}^{S} \big(1 - \delta(v_s, w_s)\big) \\
={}& \operatorname*{argmin}_{A} \min_{v_1^S,S} \sum_{w_1^S,S} \Big[ p(1)\, p_1(w_1^S|x_1^T) \sum_{s=1}^{S} \big(1 - \delta(v_s, w_{k_s})\big)
 + p(2)\, p_2(w_1^S|x_1^T) \sum_{s=1}^{S} \big(1 - \delta(v_s, w_{l_s})\big) \Big] \\
={}& \operatorname*{argmin}_{A} \min_{v_1^S,S} \sum_{s=1}^{S} \Big(1 - \big[p(1)\, p_{1,k_s}(v_s|x_1^T) + p(2)\, p_{2,l_s}(v_s|x_1^T)\big]\Big).
\end{aligned}
\tag{3.32}
$$

Equation (3.32) can be solved efficiently by dynamic programming similar to the computation of the Levenshtein distance, but CN slots are aligned instead of words. The local cost function for the dynamic programming is given by

$$
c(k, l) := 1 - \max_{w}\, \big\{ p(1)\, p_{1,k}(w|x_1^T) + p(2)\, p_{2,l}(w|x_1^T) \big\}.
$$

The extension of the algorithm to the simultaneous alignment of multiple CNs is straightforward, but expensive. In practice, the common way is to approximate the multiple alignment by a sequence of pairwise alignments: CN 1 and 2 are aligned, the result is aligned to CN 3, and so on. As a rule of thumb the CNs are sorted according to their error rate, least error first.
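A minimal sketch of the pairwise dynamic-programming alignment is given below. It assumes CNs represented as lists of word-to-posterior dictionaries (the empty string standing for ε and for the pseudo slot 0) and uses the local cost c(k, l) defined above; all names are illustrative and no particular toolkit is implied.

```python
def align_cns(cn1, cn2, p1=0.5, p2=0.5):
    """Levenshtein-style slot alignment of two CNs, cf. Equation (3.32);
    returns the combined CN with averaged slot-wise posteriors."""
    eps = {"": 1.0}                              # pseudo slot 0

    def cost(d1, d2):
        words = set(d1) | set(d2)
        return 1.0 - max(p1 * d1.get(w, 0.0) + p2 * d2.get(w, 0.0) for w in words)

    K, L = len(cn1), len(cn2)
    D = [[0.0] * (L + 1) for _ in range(K + 1)]
    back = [[None] * (L + 1) for _ in range(K + 1)]
    for k in range(1, K + 1):
        D[k][0] = D[k - 1][0] + cost(cn1[k - 1], eps); back[k][0] = (k - 1, 0)
    for l in range(1, L + 1):
        D[0][l] = D[0][l - 1] + cost(eps, cn2[l - 1]); back[0][l] = (0, l - 1)
    for k in range(1, K + 1):
        for l in range(1, L + 1):
            cands = [(D[k-1][l-1] + cost(cn1[k-1], cn2[l-1]), (k-1, l-1)),
                     (D[k-1][l]   + cost(cn1[k-1], eps),      (k-1, l)),
                     (D[k][l-1]   + cost(eps, cn2[l-1]),      (k, l-1))]
            D[k][l], back[k][l] = min(cands)
    combined, k, l = [], K, L                    # trace back and merge slots
    while (k, l) != (0, 0):
        pk, pl = back[k][l]
        d1 = cn1[k - 1] if k != pk else eps
        d2 = cn2[l - 1] if l != pl else eps
        words = set(d1) | set(d2)
        combined.append({w: p1 * d1.get(w, 0.0) + p2 * d2.get(w, 0.0) for w in words})
        k, l = pk, pl
    return list(reversed(combined))
```

Aligning a third CN against the result of this function gives the sequential approximation of the multiple alignment described above.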

3.4.2 ROVER: An Approximation of CNC

Recognizer Output Voting Error Reduction (ROVER) is a system combination approach working on single-best results [Fiscus 1997]. ROVER is a simple but powerful approach to system combination, especially in combination with confidence scores, see for example [Hoffmeister & Hillard+ 2007] for a comparison with CNC and a frame error based system combination approach.

ROVER aligns and decodes the single-best results from J systems. A single-best output can be interpreted as a CN with a single entry per slot and thus ROVER can be interpreted as a combination of J CNs. In ROVER with majority voting the assumption is made that p_j(w_1^N|x_1^T) = 1 and thus p_{j,n}(w|x_1^T) = 1, where w_1^N is the system-dependent single-best output for system j. The decoding happens analogously to the CNC: per slot the word with the highest averaged word posterior probability is chosen. That is, per slot the word wins for which the most systems voted.

However, the assumption is usually wrong and the better model is the CN, which provides a slot-wise posterior distribution over all words. And in fact CNs are a common base for computing word-wise confidence scores for LVCSR systems [Evermann & Woodland 2000; Hillard & Ostendorf 2006]. ROVER with confidence scores can now be derived from the CNC by regarding the single-best hypothesis as the result of a slot-wise pruning of the system-dependent CNs: in each slot of the system-dependent CNs only the entry with the highest probability survives. That is, per slot and system only a single word is considered, but the word posterior probability is taken from the CN slot. The standard implementation of ROVER² always makes the assumption that p_{j,n}(w|x_1^T) equals one for computing the alignment. In the subsequent slot-wise decoding step of the resulting CN either majority or confidence voting is applied.
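The slot-wise voting step can be illustrated as follows; the alignment itself is assumed to be given (e.g. produced by the NIST rover tool or by the alignment sketch above applied to one-entry CNs), and the data layout and the null-confidence value are purely illustrative.

```python
def rover_vote(aligned, confidences=None, null_conf=0.7):
    """aligned: list of slots, each a list of J words ('' = no word from that
    system); confidences: per-slot, per-system word confidences, or None for
    majority voting. null_conf is the confidence assigned to an empty entry."""
    hyp = []
    for s, slot in enumerate(aligned):
        scores = {}
        for j, w in enumerate(slot):
            c = 1.0 if confidences is None else (
                null_conf if w == "" else confidences[s][j])
            scores[w] = scores.get(w, 0.0) + c
        best = max(scores, key=scores.get)
        if best != "":
            hyp.append(best)
    return hyp

# majority voting over three aligned single-best hypotheses
print(rover_vote([["coca", "coca", ""], ["cola's", "cola's", "cola's"]]))
```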

3.4.3 Results

In this section experimental results for the CN decoder and the different approaches to CN based system combination are given. The CNs are computed with the arc-cluster algorithm introduced in the next chapter in Section 4.4.2, which is the default CN construction algorithm in the RWTH Aachen system. Experiments are presented for the Chinese 230h testing system, the English EPPS 2007 evaluation system, and the English EPPS 2007 evaluation cross-site combination. A detailed description of the



Table 3.4. Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction.

System          Combination               dev07¹               eval07               dev08
                                          CER[%] (del/ins) err
Viterbi Decoder
  s1            -                         (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
  s2            -                         (2.65/1.70) 14.82    (4.44/0.93) 15.02    (2.71/0.94) 13.54
  s3            -                         (2.65/1.64) 15.07    (4.57/1.04) 15.60    (2.84/0.93) 13.80
  s1+s2         ROVER w/o confidences     (2.66/1.57) 14.54    (4.44/0.90) 15.13    (2.86/0.85) 13.32
  s1+s2         ROVER w/ confidences      (2.49/1.59) 13.63    (4.30/0.91) 14.09    (2.64/0.94) 12.61
  s1+s2+s3      ROVER w/o confidences     (2.74/1.35) 13.55    (4.59/0.75) 14.16    (2.89/0.75) 12.61
  s1+s2+s3      ROVER w/ confidences      (2.70/1.34) 13.22    (4.55/0.74) 13.86    (2.89/0.76) 12.47
CN Decoder
  s1            -                         (2.79/1.45) 14.30    (4.53/0.85) 14.96    (2.85/0.80) 13.05
  s2            -                         (2.90/1.50) 14.52    (4.62/0.81) 14.74    (2.88/0.79) 13.35
  s3            -                         (2.97/1.48) 14.86    (4.74/0.92) 15.42    (3.01/0.85) 13.67
  s1+s2         union                     (3.05/1.29) 13.54    (4.69/0.73) 14.01    (3.01/0.73) 12.54
  s1+s2         CNC                       (2.93/1.34) 13.56    (4.66/0.76) 13.99    (2.93/0.74) 12.50
  s1+s2+s3      union                     (2.88/1.24) 13.13    (4.77/0.67) 13.73    (3.01/0.73) 12.30
  s1+s2+s3      CNC                       (2.87/1.29) 13.17    (4.68/0.70) 13.70    (2.92/0.72) 12.21

¹ tuning set

systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments acoustic and language model scales and the system weights in the union based combination and in the CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7. For ROVER the confidence score for making a deletion (aka null-confidence) is included in the optimization.

The first set of experiments compares Viterbi and CN decoding for a single system. The results in Table 3.4, Table 3.5, and Table 3.6 are consistent: for all systems, languages, and setups the CN decoder shows a small but consistent improvement of around 0.2% absolute over the Viterbi decoder.

The first set of combination experiments is done with ROVER with majority and confidence voting. The Viterbi hypotheses of the system-dependent lattices are combined and the confidence scores are derived from frame-wise word posterior probabilities according to [Wessel & Schlüter+ 2001a]. In preliminary experiments the ROVER combination of the system-dependent CN decoding results and CN based confidences was tested, but no significant differences in the results were observed. The ROVER combination gives a huge improvement of up to 10% relative for the Chinese testing and the English evaluation system, and more than 20% relative for the English cross-site combination compared to the best Viterbi result. The experimental results show that ROVER benefits from the confidence scores. And ROVER benefits from adding more systems: in all setups adding more systems further decreased the error rate. Note that ROVER with majority voting is not a suitable choice for two systems: the ROVER implementation will always take the word hypothesis from the first system.

In further experiments CN based system combination is investigated. The CN decoding of the modified lattice union is compared with the CNC approach. For the Chinese testing and the English evaluation system both approaches show an almost identical performance, but for the cross-site combination a small advantage for CNC is observed. Presumably, the advantage for CNC comes from the independence of the CNC algorithm from word boundaries. The word boundaries are needed to build the CNs, but no longer in the CN combination itself. In the Chinese testing and the English evaluation system

² The NIST ROVER implementation is part of the NIST Scoring Toolkit (SCTK), which is publicly available at http://www.itl.nist.gov/iad/mig/tools/.



Table 3.5. Results for the English EPPS 2007 evaluation system, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.

System          Combination               dev06                eval06¹              eval07
                                          WER[%] (del/ins) err
Viterbi Decoder
  s1            -                         (1.65/2.21) 11.09    (1.38/1.36)  8.43    (1.86/1.31)  9.81
  s2            -                         (1.77/2.28) 11.89    (1.67/1.23)  8.70    (2.12/1.31) 10.07
  s3            -                         (2.06/2.29) 12.43    (1.80/1.30)  8.98    (2.22/1.34) 10.76
  s4            -                         (2.04/2.18) 12.06    (1.85/1.38)  9.44    (2.68/1.42) 11.73
  s1+s2         ROVER w/o confidences     (1.65/2.20) 11.07    (1.38/1.36)  8.41    (1.85/1.30)  9.80
  s1+s2         ROVER w/ confidences      (1.97/1.70) 10.54    (1.75/0.93)  7.90    (2.28/0.95)  9.11
  s1+s2+s3      ROVER w/o confidences     (1.81/1.91) 10.90    (1.49/1.13)  7.91    (1.99/1.09)  9.32
  s1+s2+s3      ROVER w/ confidences      (2.05/1.57) 10.42    (1.79/0.87)  7.73    (2.40/0.89)  9.17
  s1+s2+s3+s4   ROVER w/o confidences     (1.77/1.93) 10.92    (1.45/1.17)  7.81    (1.97/1.11)  9.28
  s1+s2+s3+s4   ROVER w/ confidences      (1.82/1.91) 10.70    (1.47/1.08)  7.67    (2.06/1.08)  9.15
CN Decoder
  s1            -                         (1.90/1.92) 10.73    (1.55/1.12)  8.22    (2.09/1.16)  9.57
  s2            -                         (2.14/1.90) 11.42    (1.90/1.08)  8.61    (2.40/1.07)  9.78
  s3            -                         (2.29/1.98) 11.97    (1.90/1.14)  8.83    (2.47/1.15) 10.48
  s4            -                         (2.31/1.94) 11.87    (2.09/1.17)  9.31    (2.96/1.29) 11.57
  s1+s2         union                     (2.02/1.56) 10.21    (1.73/0.94)  7.79    (2.25/0.93)  8.97
  s1+s2         CNC                       (1.94/1.62) 10.22    (1.66/0.99)  7.82    (2.17/0.96)  8.98
  s1+s2+s3      union                     (2.03/1.59) 10.21    (1.74/0.94)  7.73    (2.26/0.95)  8.96
  s1+s2+s3      CNC                       (1.95/1.60) 10.14    (1.67/0.96)  7.70    (2.22/0.95)  8.98
  s1+s2+s3+s4   union                     (2.03/1.64) 10.33    (1.70/0.96)  7.59    (2.29/0.95)  8.94
  s1+s2+s3+s4   CNC                       (1.88/1.65) 10.22    (1.60/0.97)  7.59    (2.18/0.91)  8.92

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign



Table 3.6. Results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.

System                    Combination               eval06¹              eval07
                                                    WER[%] (del/ins) err
Viterbi Decoder
  LIMSI                   -                         (1.64/1.38)  8.16    (1.74/1.23)  9.13
  RWTH                    -                         (1.47/1.33)  8.46    (1.91/1.26)  9.71
  UKA                     -                         (1.76/1.31)  8.80    (2.00/1.28) 10.22
  IRST                    -                         (2.35/1.40) 10.09    (2.48/1.14)  9.81
  LIMSI+RWTH              ROVER w/o confidences     (1.50/1.24)  7.87    (1.70/1.20)  9.06
  LIMSI+RWTH              ROVER w/ confidences      (1.63/0.91)  6.69    (2.13/0.87)  7.85
  LIMSI+RWTH+UKA          ROVER w/o confidences     (1.35/0.84)  6.58    (1.86/0.78)  8.01
  LIMSI+RWTH+UKA          ROVER w/ confidences      (1.43/0.76)  6.32    (2.00/0.70)  7.77
  LIMSI+RWTH+UKA+IRST     ROVER w/o confidences     (1.36/0.78)  6.38    (1.82/0.79)  7.67
  LIMSI+RWTH+UKA+IRST     ROVER w/ confidences      (1.37/0.79)  6.21    (1.77/0.73)  7.26
CN Decoder
  LIMSI                   -                         (1.65/1.33)  8.07    (1.76/1.18)  8.96
  RWTH                    -                         (1.55/1.13)  8.24    (2.07/1.15)  9.54
  UKA                     -                         (1.83/1.39)  8.98    (2.08/1.33) 10.36
  IRST                    -                         (2.35/1.39) 10.06    (2.47/1.13)  9.82
  LIMSI+RWTH              union                     (1.63/0.77)  6.46    (2.17/0.71)  7.67
  LIMSI+RWTH              CNC                       (1.45/0.80)  6.38    (1.88/0.75)  7.51
  LIMSI+RWTH+UKA          union                     (1.51/0.79)  6.38    (2.04/0.77)  7.63
  LIMSI+RWTH+UKA          CNC                       (1.47/0.72)  6.27    (1.87/0.68)  7.24
  LIMSI+RWTH+UKA+IRST     union                     (1.61/0.73)  6.28    (2.19/0.67)  7.36
  LIMSI+RWTH+UKA+IRST     CNC                       (1.45/0.71)  6.14    (1.87/0.69)  7.12

¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign


Table 3.7. The table summarizes common approaches to lattice-based system combination. The methods are classified according to a) the lattice combination method and b) the decoder. The lattices are either combined via an intersection (or a theoretically equivalent lattice re-scoring) or by building the lattice union. The decoder is either the Viterbi decoder, which is an approximation of the Bayes risk decoder with the sentence error as loss function, or the Bayes risk decoder with a local cost function as loss function. The local cost functions are of the second type for all methods but Povey's MPE, which is of the first type.

Combination                  Decoder: Viterbi    Decoder: Bayes risk with local cost
intersection / re-scoring    DMC                 DMC + CN decoding
union                        -                   CN, CNC, ROVER, N-best ROVER, frame error, Povey's MPE

all lattices are produced with the same decoder and thus all lattices have the same bias in their time stamps. On the other hand, for systems from different sites the bias is usually different [Baghai-Ravary & Kochanski+ 2009]. In conclusion, the advantage of CNC is that word boundaries are only used within a system, whereas for the CN decoding of the modified lattice union time stamps are compared across systems. This explains why a significant performance gap between the two approaches is only observed for the cross-site combination.

Like ROVER, both CN based system combination approaches benefit from adding more systems. In a direct comparison ROVER performs only slightly worse than CNC. While on the tuning set the performance is almost equal, ROVER seems to have a tendency to overfit on the test corpora. However, the comparison of the ROVER and CNC results indicates that in the CNC only very few word hypotheses per slot are eventually involved in the decision making. The ROVER and CN based combination methods clearly outperform the Viterbi or MAP decoding of the lattice intersection, cf. Section 3.3.2.

3.5 The Lattice Combination Framework vs. State-of-the-Art in System Combination

In the last three sections several methods for lattice and CN decoding and combination were discussed. As a result two decoding approaches and two combination methods were identified which allow lattices to be combined and decoded efficiently, directly or via CNs. In particular it was shown that CN combination and decoding can be implemented as a lattice-based Bayes risk decoder with a CN based cost function. The result is a separation of the computation of the sentence posterior probabilities from the decoding process. This applies not only to CN based decoding approaches, but to a wide class of combination methods including the common approaches to system combination.

The key result of the framework developed in this work is the separation of the computation of the sentence posterior probabilities from the decoding process. For a wide class of combination methods the probability computation is only driven by the way the lattices are combined: intersection or union. The choice of the lattice combination is independent of the decoder. Furthermore, the common decoders applied in lattice-based system combination can be partitioned into two classes: the Viterbi decoders (the maximum approximation of the Bayes risk with the sentence error as loss function) and the Bayes risk decoders with a local cost function, e.g. CNC. For both classes of decoders efficient implementations exist. The common approaches to lattice-based system combination can now be classified within the framework as shown in Table 3.7. Note that the lower left cell is empty, because the decoding of the lattice union with the Viterbi decoder is nonsense as pointed out in Section 3.2.3. The following list gives a short overview of the different methods.


• DMC. In the discriminative model combination all knowledge sources are combined into a single log-linear model. The lattice scores can either be determined by intersecting the system-dependent lattices, cf. Section 3.2.2, or by re-scoring the arcs in a given base lattice with all models, cf. [Beyerlein 1997; Vergyri 2000; Zolnay & Schlüter+ 2005] and also Chapter 7. The DMC approach was successfully used, for example in the Philips/RWTH broadcast news system [Beyerlein & Aubert+ 1999], but eventually superseded by the more flexible ROVER and CNC methods.

• DMC + CN decoding. All previously published work on DMC applied the Viterbi decoder. In [Hoffmeister & Liang+ 2009] and in this work, cf. Chapter 7, DMC is combined with CN decoding and compared to the CN decoding of the lattice union.

• CN, CNC, ROVER, N-best ROVER. These are the most popular methods for lattice decoding (besides Viterbi) and for lattice-based system combination [Evermann & Woodland 2000; Fiscus 1997; Mangu & Brill+ 1999; Stolcke & Bratt+ 2000]. Although several years old, these methods are still the combination approaches of choice for state-of-the-art LVCSR systems, see for example [Hsiao & Fuhs+ 2008; Huang & Marcheret+ 2009; Ng & Zhang+ 2008; Vergyri & Mandal+ 2008]. All four methods can be interpreted as a CN decoder applied to the modified lattice union, cf. Section 3.2.3 and Section 3.4.1. The methods differ in the way the CN is derived from the lattice union and in the summation and hypothesis space. A special case is ROVER with confidence scores, which can be regarded as an approximation of CNC, where the system-dependent CNs are pruned to a single entry per CN slot. N-best ROVER is conceptually closer to CNC than to ROVER: the system-dependent lattices are heavily pruned and converted into N-best lists. This allows a CN construction algorithm to be applied to the system-dependent N-best lists, which works less heuristically than the construction algorithms computing the CN directly from the lattice. The system-dependent CNs are then aligned and decoded as in CNC. That is, N-best ROVER is eventually CNC with a different cost function and a heavily restricted hypothesis and summation space. The construction of a CN from a lattice is discussed in the next chapter in Section 4.4.

• Frame error. The frame error and frame error based cost functions will be introduced and discussed in detail in the next chapter in Section 4.2. The idea is to count errors on a frame instead of on a word basis. The results are cost functions of the second type, i.e. the according Bayes risk decoding rule can be computed efficiently. Experimental results show a strong connection between frame and word error, cf. [Wessel & Schlüter+ 2001c], which motivates the usage of frame error based costs as an approximation for the Levenshtein distance. The approaches to lattice combination presented in [Hoffmeister & Klein+ 2006] and [Chen & Lee 2006] are Bayes risk decoders with frame error based costs applied to the modified lattice union. The frame error based approach to system combination is also successfully used in state-of-the-art LVCSR systems, see for example [Plahl & Hoffmeister+ 2008a].

• Povey's MPE. Povey's MPE refers to the cost function used in [Povey & Woodland 2002] for a variant of discriminative acoustic model training which aims at minimizing the expected phoneme error. The same cost can be defined on word instead of phoneme level. The cost is of the first type as it lacks locality with respect to the reference. This and other cost functions of the first type are discussed in detail in the next chapter in Section 4.3. The cost was applied to Bayes risk decoding in [Xu & Povey+ 2009] and also to system combination in [Hoffmeister & Schlüter+ 2009].
Another approach to system combination frequently used in state-of-the-art LVCSR systems is the cross-adaptation, cf. Section 1.9.3. The cross-adaptation is applied in the speaker adaptation step of the speech decoder and thus does not fit into the framework developed in this chapter. But it can be stacked with the methods investigated in this work. Some cross-adaptation results are given in the appendix in Section C.2.



3.6 Lattice Pre-Processing for Bayes Risk Decoding and System Combination

A crucial step in Bayes risk decoding and system combination is the pre-processing. Vocabularies from different sites usually differ in their spelling, abbreviations, or simply use different encodings. A special case is LVCSR systems for Chinese: most state-of-the-art recognizers produce word level lattices, like for example [Lei & Wu+ 2009; Plahl & Hoffmeister+ 2009], but the objective for Chinese LVCSR systems is the character error rate (CER) and not the word error rate (WER). For a Viterbi decoder, i.e. regarding the sentence error, this does not make a difference, but it does for a Bayes risk decoder which aims at minimizing the Levenshtein distance defined on character level. In the latter case the pre-processing includes the transformation of the word lattice into a character lattice. The next section discusses the normalization topics in detail.

Lattice pruning reduces the size of the lattice and thus speeds up the decoding. Some algorithms, like the determinization, which has an exponential worst-case complexity, require a preceding pruning step to become computationally feasible. Lattice pruning is discussed in Section 3.6.2. If posterior probabilities are derived from lattices, the lattices require a pre-processing step which makes the probabilities comparable. Reasons for and solutions to distorted posteriors are discussed in Section 3.6.2 and Section 3.6.3.

3.6.1 Lattice Normalization

In languages like English many words have different, but equally correct spellings, e.g. American vs. British English. While it is easy to agree on the spelling of a single word, the situation becomes ambiguous for expressions like "Tony's", which can indeed mean "Tony's", but can also be short for "Tony is" or "Tony has". The NIST scoring tools³, the de-facto standard evaluation tools for LVCSR tasks, allow all three alternatives in the computation of the error rate. However, simply substituting a lattice arc labeled with "Tony's" by all three alternatives would change the posterior probability distribution defined by the lattice. Re-weighting the alternatives solves the problem, but requires the estimation of appropriate weights. An approximation to the re-weighting is to simply choose the most frequent alternative. This is the solution used throughout the experiments presented in this work, where the frequencies are computed from the training set. Other important normalizations include hyphens like in "word-level" vs. "word level" and abbreviations like "AM" vs. "A.M." vs. "A. M." Here, the solution used throughout the work is to expand a word or abbreviation to the alternative with the maximum number of tokens, which increases the probability of partial matches.

In Chinese LVCSR systems the objective is the character error rate (CER). Nevertheless, many systems like the RWTH Aachen system produce word level lattices. For decoding in the Bayes risk framework with the Levenshtein distance on character level as loss function the word lattice arcs are split into character arcs.

All normalizations described so far are one-to-one or one-to-many mappings. Applied to a lattice they result in an arc mapping or an arc splitting. After an arc split new time stamps have to be estimated for the resulting sub-arcs, for which two algorithms are tested:

1. Approximate word boundaries. The duration of the arc is distributed over the sub-arcs according to the number of phonemes or characters per sub-word. The number of characters approximates the number of phonemes per sub-word and is used if the pronunciations, i.e. the phoneme sequences, for the sub-words are not known. For the conversion of Chinese word lattices to character lattices the algorithm described in [Hoffmeister & Plahl+ 2007] is used for all experimental results presented in this work. The algorithm simply distributes the word arc duration uniformly among the character arcs; a small sketch is given below.

2. Recognizer word boundaries. The word boundaries are derived from a forced acoustic alignment of the sub-words. Computing the forced alignment is much more expensive than the approximate word boundary method and requires pronunciations for all sub-words. Pronunciations and even acoustic models are not always available, especially when lattices are shared across several sites.

³ The NIST Scoring Toolkit (SCTK) is publicly available at http://www.itl.nist.gov/iad/mig/tools/.

Figure 3.3. Illustration of the non-speech cloud filter applied to a word lattice. In figure a) four paths connect the leftmost and the rightmost state, three of them starting with "have" and continuing with non-speech arcs marked as "{·}". These three paths define a non-speech cloud and the non-speech cloud filter removes all but the best scoring path through the cloud. The filter result is shown in figure b).
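As an illustration of the approximate word boundary method for Chinese, the following sketch splits a single word arc into character arcs with uniformly distributed durations. It is only a toy example; the arc representation (frame indices, plain tuples) is chosen for the illustration and does not reflect the actual lattice data structures.

    def split_word_arc(start_frame, end_frame, word):
        """Split a Chinese word arc into one arc per character.

        The arc duration is distributed uniformly among the characters, as in
        the approximate word boundary method; frame indices are illustrative.
        """
        chars = list(word)
        duration = end_frame - start_frame
        arcs, t = [], start_frame
        for i, ch in enumerate(chars):
            # give every character an equal share, the last one takes the remainder
            if i == len(chars) - 1:
                t_next = end_frame
            else:
                t_next = start_frame + round((i + 1) * duration / len(chars))
            arcs.append((t, t_next, ch))
            t = t_next
        return arcs

    print(split_word_arc(100, 130, "中华人民"))  # four characters, 30 frames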

3.6.2 Lattice Pruning

Lattice pruning aims at removing unlikely paths from the lattice; the de-facto standard is the forward/backward pruning described in [Sixtus & Ortmanns 1999]. The main motivation for lattice pruning is the reduction of the lattice size with the goal of reducing the memory consumption and run-time of lattice processing algorithms. Especially for algorithms with an exponential worst-case complexity, like the determinization, a preceding lattice pruning can become mandatory. The posterior probabilities over a pruned lattice are usually sharper than the posteriors from the unpruned base lattice, because unlikely hypotheses are removed from the probability distribution. That is, the comparability of the posteriors derived from two lattices depends, among other factors, on the lattice density. Thus, for lattice-based system combination the densities of the individual lattices should be in a similar range. For the system combination experiments presented in this work all lattices are pruned to the same density, where the density is computed according to Equation (1.11). A typical density for Bayes risk decoding and system combination tasks is between 30 and 100, whereas the lattices produced by a Viterbi decoder can have a density of several hundred up to several thousand. The bias in the system-dependent posteriors is investigated further in Chapter 5.
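The idea behind forward/backward pruning can be sketched in a few lines: an arc is kept if the best path through it stays within a threshold of the globally best path. The sketch below works in the tropical (Viterbi) semiring on a toy arc list and assumes topologically ordered states with 0 as initial and the largest index as final state; it illustrates the principle only and is not a description of the implementation in [Sixtus & Ortmanns 1999].

    import math

    def prune_lattice(arcs, threshold):
        """Keep arcs whose best path through them is within `threshold`
        (in negative log-probability) of the globally best path.

        arcs: list of (from_state, to_state, neg_log_score) with
        topologically ordered integer states.
        """
        n = max(max(a[0], a[1]) for a in arcs) + 1
        fwd = [math.inf] * n
        bwd = [math.inf] * n
        fwd[0] = 0.0
        bwd[n - 1] = 0.0
        for s, t, w in sorted(arcs, key=lambda a: a[0]):               # forward pass
            fwd[t] = min(fwd[t], fwd[s] + w)
        for s, t, w in sorted(arcs, key=lambda a: a[1], reverse=True):  # backward pass
            bwd[s] = min(bwd[s], w + bwd[t])
        best = fwd[n - 1]
        return [a for a in arcs if fwd[a[0]] + a[2] + bwd[a[1]] <= best + threshold]

    lat = [(0, 1, 1.0), (0, 1, 4.0), (1, 2, 0.5)]
    print(prune_lattice(lat, threshold=2.0))   # drops the arc with score 4.0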

3.6.3 The non-Word Cloud Bias

Some systems use several models for non-speech events, e.g. articulatory noise and stationary noise. If the acoustics of the different non-speech events are similar and no other control of the occurrence of the non-speech events, like including them in the language model, is applied, then so-called "non-word clouds" appear in the lattices produced by the decoder. Due to the similarity of the models all non-word events are hypothesized in parallel with similar scores, and if they survive the pruning they appear as clouds in the lattice. The clouds do not harm the Viterbi result, but they influence the posteriors derived from the lattice: the posterior probabilities for words lying on paths which go through these clouds are overestimated [Hoffmeister & Klein+ 2006; Wessel & Schlüter+ 2001b]. The clouds can be removed from a lattice by applying an appropriate filter as described in [Hoffmeister & Klein+ 2006]. Figure 3.3 illustrates the function of the filter. In Figure 3.3 a) two arcs labeled with "have" start from the leftmost node and both arcs are followed by non-speech events. From all the alternative paths starting with one of the "have"-arcs and ending in the rightmost node, only a single one shall survive.



Table 3.8. Results for the Chinese 230h testing system, cf. Section B.1.1. Word-level vs. character-level decoding and approximated vs. exact character boundaries. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System       Combination/Decoder                 CER[%] (del/ins) err
                                     dev07¹              eval07              dev08
baseline                             (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
word level
s1+s2+s3     ROVER w/ confidences    (2.90/1.43) 13.38   (4.76/0.82) 14.03   (2.94/0.85) 12.55
             union/CN                (3.42/1.33) 13.41   (5.25/0.75) 13.99   (3.36/0.77) 12.43
             CNC                     (3.14/1.40) 13.32   (5.05/0.85) 13.99   (3.11/0.84) 12.39
character level, approximated char. boundaries
s1+s2+s3     ROVER w/ confidences    (2.69/1.35) 13.24   (4.52/0.76) 13.89   (2.85/0.76) 12.47
             union/CN                (2.98/1.27) 13.20   (4.80/0.71) 13.73   (3.06/0.73) 12.28
             CNC                     (2.86/1.29) 13.16   (4.70/0.69) 13.71   (2.93/0.74) 12.26
character level, char. boundaries from forced alignment
s1+s2+s3     ROVER w/ confidences    (2.70/1.34) 13.22   (4.55/0.74) 13.86   (2.89/0.76) 12.47
             union/CN                (2.88/1.24) 13.13   (4.77/0.67) 13.73   (3.01/0.73) 12.30
             CNC                     (2.87/1.29) 13.17   (4.68/0.70) 13.70   (2.92/0.72) 12.21
¹ tuning set

For all the nodes in the "non-word cloud", all incoming arcs but the best scoring one are discarded. The result is the lattice shown in Figure 3.3 b). The dotted arc is removed by a subsequent trimming step.
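The pruning step of the filter can be sketched as follows, assuming the states belonging to a non-speech cloud have already been identified (the detection of the clouds themselves is the part specific to [Hoffmeister & Klein+ 2006] and is not reproduced here); the data structures and names are illustrative only.

    def filter_non_speech_cloud(incoming_arcs_by_state, cloud_states):
        """For every state inside a previously identified non-speech cloud,
        keep only the best scoring incoming arc and drop all others.

        incoming_arcs_by_state: dict state -> list of (score, arc_id); lower
        score = better (negative log-probability).
        """
        kept = []
        for state, arcs in incoming_arcs_by_state.items():
            if state in cloud_states and len(arcs) > 1:
                kept.append(min(arcs)[1])   # keep only the best scoring incoming arc
            else:
                kept.extend(arc_id for _, arc_id in arcs)
        # a subsequent trimming step removes arcs that no longer lie on a
        # complete path, cf. the dotted arc in Figure 3.3 b)
        return kept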

3.6.4 Results

In this section two lattice normalization issues are experimentally investigated. The first set of experiments is performed on the Chinese 230h testing system and compares Bayes risk decoding with the CN distance as loss function for word and character lattices, where two different approaches for deriving a character lattice from a given word lattice are investigated. The second set of experiments evaluates the impact of the lattice density on CN decoding for the Chinese 230h testing system and the English EPPS 2007 evaluation system. A detailed description of the systems is given in Appendix B. For all experiments acoustic and language model scales and the system weights in the union based combination and in the CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described later in Section 3.7. For ROVER the confidence score for making a deletion (null-confidence) is included in the optimization.

In the first set of experiments the combination and decoding of Chinese lattices is performed on word and on character level. The character lattices are derived from the word lattices by splitting the word arcs into character arcs with the algorithms described in Section 3.6.1. The results are shown in Table 3.8. Going from word to character level improves the CN based lattice decoding and combination, and the CER decreases by around 0.2% absolute. The results for the two arc splitting algorithms differ only slightly, without a clear advantage for either of them. This observation is plausible considering that most characters in spoken Chinese have a similar duration. That is, instead of performing an expensive forced alignment it is sufficient to distribute the word duration uniformly among the character arcs.

In the second set of experiments the impact of the lattice density on the CN combination and decoding result is explored. For a density of one only the Viterbi hypothesis remains in the lattice and CNC degrades to ROVER with majority voting. In the case of the CNC of two Viterbi paths the implementation always chooses the hypothesis from the first CN. This explains why system s1 and the system combination s1+s2



Figure 3.4. CN decoding results for the Chinese 230h testing system, cf. Section B.1.1, for different lattice densities (CER[%] as a function of the lattice density for s1, s1+s2, and s1+s2+s3).

Figure 3.5. CN decoding results for the English EPPS 2007 evaluation system, cf. Section B.2.1, for different lattice densities (WER[%] as a function of the lattice density for s1, s1+s2, s1+s2+s3, and s1+s2+s3+s4).


show equal error rates for a density of one, cf. Figure 3.4 and Figure 3.5. Not surprisingly, the error drops significantly for densities larger than one. Remarkably, the optimal performance is already achieved in almost all experiments for a density of five. A further increase of the density helps only slightly, if at all. The conclusion is that only a few words and ultimately only a few lattice paths have an impact on the decision making. This conclusion is supported by the ROVER vs. CNC results from Section 3.4.3: ROVER with confidence scores performs almost as well as CNC, but considers only one hypothesis per system and slot. This can be interpreted as the CNC of heavily pruned CNs. In the experiments presented in this section the CNs are derived from heavily pruned lattices. The results indicate that in CNC only a few hypotheses are considered and required for decision making.

3.7 Parameter Optimization for Bayes Risk Decoding and System Combination

The focus of this thesis is on the decoding and combination of lattices, where a lattice is a log-linear combination of feature functions. Throughout this work it is assumed that the feature functions are given and fixed, i.e. no parameters of the feature functions are optimized. Let J be the number of lattices to be combined and I the number of feature functions per system; to simplify matters it is assumed that each system combines the same number of features. The parameters optimized for each combination experiment consist of the (J · I) scaling factors of the J system-dependent log-linear models, cf. Equation (3.6), the J system priors if used, cf. Equation (3.14), and a small number of combination and decoding specific parameters. The set of free parameters is denoted by θ. For most experiments the free parameters consist of two scaling factors per system (the acoustic model and the language model scale), a weight per system, and one or two method specific parameters. Thus, the typical size of θ ranges between 1 (single system with Viterbi decoding) and 13 (four systems with system weights and one method specific parameter). This small number of parameters is optimized on a development set via a direct error rate minimization using the Downhill-Simplex algorithm as described in the next section. In Chapter 7 experiments with word-dependent scaling factors are presented, which increase the number of parameters to several thousand. A direct parameter optimization is then prohibitive and instead the minimum risk training (MRT) approach described in Section 3.7.2 is applied.

By definition the Bayes risk is the lower bound of the overall risk (or expected loss) of any classifier. Thus, the overall risk for a speech recognition system g(·) is given by

    r := \sum_{x_1^T,T} \sum_{w_1^N,N} \Pr(x_1^T, w_1^N) \, L\big(w_1^N, g(x_1^T)\big),    (3.33)

where \Pr(w_1^N, x_1^T) is the true joint probability of observing sentence w_1^N and acoustic feature sequence x_1^T together and L(·,·) denotes an arbitrary loss function. The Bayes risk is defined as the risk of the optimal classifier:

    r_{opt} := \min_{g(\cdot)} r = \min_{g(\cdot)} \sum_{x_1^T,T} \sum_{w_1^N,N} \Pr(x_1^T, w_1^N) \, L\big(g(x_1^T), w_1^N\big)
             = \sum_{x_1^T,T} \Pr(x_1^T) \min_{v_1^M} \sum_{w_1^N,N} \Pr(w_1^N|x_1^T) \, L(v_1^M, w_1^N)    (3.34)

From the last equation it follows that the optimal classifier is given by

    g_{opt}(x_1^T) = \operatorname*{argmin}_{v_1^M,M} \sum_{w_1^N,N} \Pr(w_1^N|x_1^T) \, L(v_1^M, w_1^N),    (3.35)

if the true posterior distribution \Pr(w_1^N|x_1^T) is known. In practice the true posteriors are unknown and only a limited training or optimization set [x_{r,1}^{T_r}, \tilde w_{r,1}^{N_r}]_{r=1}^{R} is available. However, Equation (3.35) motivates the usage of a classifier of the following form for decoding LVCSR lattices:

    g_\theta(x_1^T) = \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p_\theta(b_1^K|x_1^T) \, L(a_1^L, b_1^K)    (3.36)
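To make Equation (3.36) concrete, the following toy sketch applies the rule to an N-best list with externally supplied posterior estimates; the sentence representation, the posterior dictionary and the per-position loss are purely illustrative and stand in for the lattice-based quantities used in this work.

    def bayes_risk_decode(hypotheses, posteriors, loss):
        """Pick the hypothesis with the least expected loss, cf. Equation (3.36).

        hypotheses: candidate sentences (hypothesis space)
        posteriors: dict sentence -> estimated posterior probability (summation space)
        loss: loss function L(hypothesis, reference), e.g. the Levenshtein distance
        """
        def expected_loss(hyp):
            return sum(p * loss(hyp, ref) for ref, p in posteriors.items())
        return min(hypotheses, key=expected_loss)

    # toy example with a per-position mismatch count as loss
    posts = {("a", "b"): 0.4, ("a", "c"): 0.35, ("x", "c"): 0.25}
    loss = lambda h, r: sum(w != v for w, v in zip(h, r))
    print(bayes_risk_decode(list(posts), posts, loss))   # -> ('a', 'c')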


Parameter optimization means finding those parameters θ̂ which yield the best approximation of g(·) in terms of the empirical overall risk r̂. The empirical risk is derived by approximating the true joint probability by \hat{\Pr}(x_1^T, w_1^N), which is estimated on the training set. The direct parameter optimization and the MRT approach differ in the way they estimate the joint probability, where in particular the estimation used in MRT can lead to sub-optimal results, cf. Section 3.7.2. Notably, the goal of the optimization is to derive the Bayes risk classifier, but not necessarily to derive a good predictor for the Bayes risk itself. Under the assumption that p_θ(·|x_1^T) can approximate the true posterior probability distribution arbitrarily well, it is guaranteed that the set of classifiers having the form given in Equation (3.36) includes the Bayes risk classifier. But in general, the Bayes risk classifier is not unique. In particular, if the parameter set θ̂ describes a Bayes risk classifier, it does not necessarily follow that p_θ̂(·|x_1^T) is a good estimate of the true posteriors. In conclusion, after parameter optimization for Bayes risk decoding the interpretation of the lattice-derived probability p_θ̂(w_1^N|x_1^T) as the true posterior probability of w_1^N is questionable. However, only few parameters are optimized, and in acoustic and language model training the vast majority of the parameters are (at least initially) maximum likelihood trained. The two parameter optimization algorithms presented in this section choose by design, from all risk minimizing classifiers, one with parameters θ̂ close to the initial parameters. That is, in practice the interpretation of p_θ̂(w_1^N|x_1^T) as a posterior probability is acceptable and, for example, successfully used in confidence score computation, cf. Section 5.2.1.

3.7.1 Parameter Optimization based on the Downhill-Simplex Algorithm

The approach uses the empirical risk as the objective function in the definition of the optimization problem

    \hat\theta := \operatorname*{argmin}_\theta \frac{1}{R} \sum_{r=1}^{R} L\big(\tilde w_{r,1}^{N_r}, g_\theta(x_{r,1}^{T_r})\big).    (3.37)

The classifier based on θ̂ minimizes the error on the training set, and the estimate of the joint probability is the relative frequency

    \hat{\Pr}(x_1^T, w_1^N) := \frac{1}{R} \sum_{r=1}^{R} \delta(x_1^T, x_{r,1}^{T_r}) \, \delta(w_1^N, \tilde w_{r,1}^{N_r}),

which converges to the true probability for sufficiently large training sets. The drawback of the approach is that the objective function is not differentiable and thus gradient-descent based optimization algorithms cannot be applied. In practice, the following algorithm for optimizing the parameters turned out to be fast and robust.

1. Optimize the language model scale β_j of the j-th system separately for each lattice such that the error rate of the Viterbi decoder is minimized.

2. Initialize θ, i.e. the set of all parameters, as follows. Set the scaling factor for the acoustic model of the j-th system λ_{j,AM} to 1/β_j and the language model scaling factor λ_{j,LM} to one. The system prior p(j) is initialized with 1/J and for the combination and decoding parameters some defaults are assumed. Now, optimize each parameter in θ consecutively w.r.t. Equation (3.37).

3. Apply the Nelder-Mead downhill simplex optimization algorithm [Nelder & Mead 1965]; a sketch of this step is given below.

The parameters from step 1 are usually already close to the optimum. The optimization in step 2 can be accelerated by making use of the knowledge about the parameters to be optimized, e.g. that the system priors have to sum up to one. The initial values for the third step are in most cases already very close to the optimum and only a few more iterations are needed. Equation (3.37) is not differentiable, which motivates the usage of the Nelder-Mead downhill simplex algorithm; it was successfully applied to similar problems, e.g. in [Zens & Hasan+ 2007]. However, starting directly with the downhill-simplex algorithm is not recommended. The algorithm is sensitive to local minima, which can be avoided by choosing a good starting point, i.e. a point close to a good (ideally the global) minimum. This motivates the three step architecture.
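A minimal sketch of step 3 using SciPy's Nelder-Mead implementation is given below. The functions decode(·) and error_rate(·) are hypothetical placeholders for the actual Bayes risk decoder and the CER/WER scoring, and the option values are illustrative, not the settings used in this work.

    import numpy as np
    from scipy.optimize import minimize

    def tune_scales(dev_lattices, references, theta0, decode, error_rate):
        """Refine the scaling factors with Nelder-Mead, starting from the
        consecutively optimized values of steps 1 and 2.
        """
        def objective(theta):
            hyps = [decode(lat, theta) for lat in dev_lattices]
            return error_rate(hyps, references)   # piecewise constant, not differentiable

        result = minimize(objective, np.asarray(theta0, dtype=float),
                          method="Nelder-Mead",
                          options={"xatol": 1e-3, "fatol": 1e-4, "maxiter": 200})
        return result.x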



3.7.2 Parameter Optimization based on Minimum Risk Training

Minimum risk training (MRT) is well-known for its application to the parameter estimation of acoustic models, where it is usually referred to as minimum word error (MWE) or minimum phoneme error (MPE) training [Kaiser & Horvat+ 2000; Povey & Woodland 2002]. The optimization problem solved by minimum risk training is defined as

    \hat\theta := \operatorname*{argmin}_\theta \sum_{r=1}^{R} \sum_{w_1^N,N} p_\theta(w_1^N|x_{r,1}^{T_r}) \, L(\tilde w_{r,1}^{N_r}, w_1^N).    (3.38)

The problem is differentiable and θ̂ can be computed with the help of the extended Baum-Welch algorithm or by gradient-descent based approaches. The implementation used throughout this work applies Rprop, a gradient-descent based optimization algorithm [Gunawardana & Mahajan+ 2005; Riedmiller & Braun 1993]. Remarkably, Equation (3.38) models the posterior probability, which results in the following model-based estimate of the joint probability:

    \hat{\Pr}_\theta(x_1^T, w_1^N) := p_\theta(w_1^N|x_1^T) \, \frac{1}{R} \sum_{r=1}^{R} \delta(x_1^T, x_{r,1}^{T_r})

That is, the estimate of the risk depends twice on θ, in the probability of the occurrence of sentence w_1^N and in the classifier:

    \hat r_\theta = \frac{1}{R} \sum_{r=1}^{R} \sum_{w_1^N,N} \hat{\Pr}_\theta(x_{r,1}^{T_r}, w_1^N) \, L\big(w_1^N, g_\theta(x_{r,1}^{T_r})\big)

The consequence is that MRT does not only aim at optimizing the parameters of the classifier g_θ(·), but at the same time changes the probability distribution over the training data such that the classifier fits. Consequently, the training does not aim at finding the true probability distribution over the training data and therefore the optimization will in general not converge to the Bayes risk classifier. In fact, it is easy to show that the resulting distribution is one for the class selected by the classifier and zero otherwise.

The following example shows how the dependency of the empirical risk on θ can lead to a sub-optimal solution. Let us assume that a feature extraction produces five times the same feature x, but the observed classes differ. The observations are: 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). In MRT the goal is to find the probability distribution p̂(·|x) such that the following optimization problem is solved, cf. Equation (3.38), where the loss function of choice is the Levenshtein distance:

    \hat p(\cdot|x) := \operatorname*{argmin}_{p(\cdot|x)} \frac{1}{R} \sum_{r=1}^{R} \sum_{c} p(c|x) \, \mathrm{Lev}(c, \tilde c_r)
                     = \operatorname*{argmin}_{p(\cdot|x)} \frac{1}{5} \big( 5 \, p(111|x) + 6 \, p(112|x) + 4 \, p(211|x) + 7 \, p(221|x) \big)

Table 3.9 shows the empirical posterior probability distribution, i.e. the relative frequencies, the posterior distribution resulting from the minimum risk training, the classification results, and the risks of the classification results. The classifier using the empirical posterior probabilities yields "111", which minimizes the expected loss on the training set. On the other hand, the hypothesis of the classifier based on the MRT result is "211", which is not an optimal solution for the training set: the expected loss on the training set is 1/5 greater than for the optimal solution "111". The consequence is that MRT will in general not produce the Bayes risk classifier, even on infinite training data and with no model restrictions. In contrast, the empirical risk minimizing approach will yield the Bayes risk classifier under the same conditions. However, in practice MRT in combination with regularization is successfully applied to several optimization tasks involving thousands to millions of free parameters [Heigold & Deselaers+ 2008; Povey & Woodland 2002], i.e. where a direct parameter optimization is not applicable. The regularization applied in this work penalizes the deviation from an initial parameter value; MRT with regularization is used in Chapter 7.



Table 3.9. Comparison of the posterior probability distributions resulting from maximum likelihood estimation and from MRT training given the observations 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). The table also shows the Bayes risk hypothesis given the two distributions and the according risks given the empirical distribution.

obs.             \hat{\Pr}(c|x)    MRT \hat{p}(c|x)
1 × (x, 111)     1/5               0
2 × (x, 112)     2/5               0
1 × (x, 211)     1/5               1
1 × (x, 221)     1/5               0
\hat{g}(x)       111               211
r(111)           5/5               5/5
r(211)           6/5               4/5

Another problem of MRT concerns the default system combination used throughout this work: the weighted average of system-dependent posterior probabilities as defined in Equation (3.14). Under the assumption that θ consists only of the system-dependent parameters, i.e. θ = {Λ_1, ..., Λ_J, p(1), ..., p(J)}, and that the system-dependent parameters are mutually exclusive, the optimization problem solved by MRT can be re-written as

    \hat\theta := \operatorname*{argmin}_\theta \sum_{r=1}^{R} \sum_{w_1^N,N} \Big[ \sum_{j=1}^{J} p(j) \, p_j(w_1^N|x_{r,1}^{T_r}, \Lambda_j) \Big] L(\tilde w_{r,1}^{N_r}, w_1^N)
              = \operatorname*{argmin}_{p(\cdot)} \sum_{j=1}^{J} p(j) \operatorname*{argmin}_{\Lambda_j} \sum_{r=1}^{R} \sum_{w_1^N,N} p_j(w_1^N|x_{r,1}^{T_r}, \Lambda_j) \, L(\tilde w_{r,1}^{N_r}, w_1^N).

Under the constraint that the system priors sum up to one, it is easy to see that the optimization problem has the following solution: optimize the system-dependent scaling factors Λ_j separately and subsequently set the system prior of the best performing system to one. That is, minimum risk training (in contrast to empirical risk minimization) does not consider the interaction between the systems and ends up with a system selection. This makes MRT unsuitable for parameter optimization for all lattice union based system combination approaches, in particular for ROVER and CNC.
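For completeness, the following sketch shows the kind of sign-based update rule behind Rprop (here in the iRprop− variant). It is an illustration of the general algorithm [Riedmiller & Braun 1993], not of the specific implementation or regularization used in this work, and the default constants are the commonly cited ones.

    import numpy as np

    def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=1.0):
        """One iRprop- style update: only the sign of the gradient is used,
        the per-parameter step size grows while the sign is stable and
        shrinks when it flips. All arguments are numpy arrays of equal shape.
        """
        same_sign = grad * prev_grad
        step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(same_sign < 0, 0.0, grad)   # skip the update after a sign change
        theta = theta - np.sign(grad) * step
        return theta, grad, step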

3.8 Summary

In this chapter a unified view on system combination has been developed which covers the most common approaches used in LVCSR. In the Bayes risk decoding framework system combination reduces to the problem of computing sentence posterior probabilities over multiple systems. A common approach is to use a single log-linear model which combines all knowledge sources from all systems. The alternative is to compute the weighted average of the system-dependent sentence posteriors. The two approaches have a natural representation in the transducer framework. A new semiring, the vector semiring, is introduced, which contains dimension-dependent scaling factors. Lattices are represented by weighted finite-state acceptors over the vector semiring. Thus, a lattice eventually defines a log-linear model distribution over sentences. The combination of several lattices can be done by building the intersection or the union. The intersection results directly in a log-linear model combination of the knowledge sources provided by the system-dependent lattices. A slightly modified union yields the weighted average of the system-dependent sentence posteriors. In both cases the result is again a lattice.

The investigation of lattice decoding in the Bayes risk framework with the aim of minimizing the Levenshtein distance commences with categorizing approximate loss functions. Two classes of loss functions are derived and efficient Bayes risk decoders are developed. The characteristic of the two classes is the locality of the loss computation: loss functions of the first class are local w.r.t. the reference arcs, i.e. in the computation of the loss for a single reference arc no context is considered. Loss functions of the second class are local w.r.t. reference and hypothesis. The intersection can be efficiently decoded with a MAP/Viterbi decoder, but not with a Bayes risk decoder using a common Levenshtein distance approximation as loss function, because the intersection invalidates the word boundaries which are needed in the loss computation. Vice versa for the union approach: in the Viterbi decoder the union approach degenerates to a system selection and the MAP decoder is computationally expensive. But the Bayes risk decoder developed for a single lattice can be applied to the union and yields an efficient combination approach.

An alternative to the lattice-based system combination is the confusion network combination (CNC). The lattices are first transformed into CNs and subsequently the CNs are aligned into a super CN, followed by a standard CN decoding. The common CNC alignment rule is derived from formulating the combination problem within the Bayes risk decoding framework. Furthermore, it is shown that the difference between CNC and constructing a CN directly from the lattice union lies only in the loss function, but not in the computation of the probabilities. That is, eventually CNC is a Bayes risk decoding of the lattice union. Finally, ROVER is introduced as an approximation to CNC.

Experimental results show that a CN based combination performs better than the intersection approach and gives up to 10% relative improvement for intra-site and more than 20% relative improvement for cross-site combination experiments. Improvements are measured in terms of error rate reduction compared to the Viterbi decoding result of the best single system. The experiments indicate that only a few hypotheses are needed for decision making. In particular, ROVER performs almost as well as CNC. Lattice normalization and parameter optimization are discussed at the end of the chapter. For all experiments the acoustic and language model scales of all systems and the combination technique specific parameters are tuned for minimum error rate via the downhill-simplex method.


Chapter 4 Local Cost Functions for Bayes Risk Decoding

Local cost functions were introduced in the last chapter in Section 3.3.3. In the Bayes risk decoding framework for LVCSR tasks, a local cost function approximates the Levenshtein distance and makes the computation of the Bayes risk hypothesis from a lattice computationally feasible. In this chapter local cost functions are investigated in detail. The first section discusses the general deletion bias of local costs. The remaining sections introduce several concrete implementations of local cost functions: based on the frame error in Section 4.2, based on local alignments in Section 4.3, and based on confusion networks (CNs) in Section 4.4. In their common form all of these local costs show a deletion bias, especially the costs based on the frame error and on local alignments. The reasons for the bias are investigated and improved versions of the cost functions are developed, which compensate for deletions. The section about CNs introduces and compares three approaches to CN construction, including a new approach based on frame-wise word posterior probabilities. The new approach has some interesting properties: in contrast to the common approaches to CN construction from lattices, the new algorithm is parameter-free and does not rely on distance functions comparing arcs or arc clusters.

4.1 Local Costs and the Deletion Bias

LVCSR systems tuned for minimum error rate have a general deletion bias: it is better to discard an unlikely word than to risk an insertion. The detailed proof and a further discussion of the bias are given in Appendix A. In practice, however, this effect is negligible and the actual deletion bias of a system is mainly driven by the model approximations and, in case of a Bayes risk decoder, by the choice of the loss function. Local cost functions as defined in Section 3.3.3 have an inherent deletion bias caused by the requirement that only arcs which overlap in time can compete. This requires exact time stamps for words, which do not exist in continuous speech. The discretization of the acoustic signal aggravates the situation. Short words like "I" or "a" or fast and unclearly spoken words like "have" are good candidates for fluctuating word boundaries, especially if they occur in the context of words starting or ending in the same or a similar vowel. These words can occur several times in the lattice with little or no overlap in time, even though they clearly refer to the same word position in the spoken sentence. Consequently, in a Bayes risk decoder using a loss function which requires exact word boundaries, these arcs are not aligned, which usually strengthens the hypothesis of the empty word and thus causes deletions. The situation is even worse in cross-site lattice combinations, because, as shown in [Baghai-Ravary & Kochanski+ 2009], LVCSR decoders usually show a systematic bias in where to set word boundaries. The specific bias of the concrete implementations of local costs is discussed in the next sections. Local costs for discriminative acoustic model training are investigated in [Gibson 2008]. That work focuses on local alignment and frame error based costs and the author comes to similar conclusions concerning the deletion bias of local cost functions.

4.2 Frame Error

The frame error is a common approximation of the Levenshtein distance and is used in discriminative acoustic model training [Gibson & Hain 2006; Zheng & Stolcke 2005] and in Bayes risk decoding [Wessel & Schlüter+ 2001c]. The plain frame error between two paths through a lattice is simply the number of time frames in which the overlapping arcs have different word labels. Let a_t denote the arc in path a_1^L which intersects with time frame t and let o(a,b) denote the overlap in time of arc a and arc b. In order to achieve a simplified notation the helper function h(a,b) := o(a,b) \, \delta(i(a), i(b)) is defined. The frame error between lattice path a_1^L and lattice path b_1^K is defined as

    c_{FE}(a_1^L, b_1^K) := \sum_{t=1}^{T} \big[ 1 - \delta(i(a_t), i(b_t)) \big]
                          = \sum_{k=1}^{K} \Big[ \mathrm{dur}(b_k) - \sum_{l=1}^{L} h(b_k, a_l) \Big]
                          = \sum_{l=1}^{L} \Big[ \mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k) \Big].    (4.1)

Note that the computation of the pure frame error is symmetric w.r.t. summing over the hypothesis arcs or over the reference arcs. The frame error itself and all the modifications discussed in this section are local cost functions of the second type. That is, the Bayes risk hypothesis can be computed efficiently by using the Bayes risk decoder developed in Equation (3.26).
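The plain frame error of Equation (4.1) can be illustrated with a few lines of Python; the path representation as (begin_frame, end_frame, word) triples is chosen purely for the example.

    def frame_error(path_a, path_b):
        """Plain frame error between two lattice paths, cf. Equation (4.1).

        A path is a list of arcs (begin_frame, end_frame, word) covering the
        segment without gaps; the error counts the frames on which the two
        paths carry different word labels.
        """
        def label_at(path, t):
            for beg, end, word in path:
                if beg <= t < end:
                    return word
            return None

        t_max = max(path_a[-1][1], path_b[-1][1])
        return sum(label_at(path_a, t) != label_at(path_b, t) for t in range(t_max))

    a = [(0, 5, "have"), (5, 10, "a")]
    b = [(0, 6, "have"), (6, 10, "an")]
    print(frame_error(a, b))   # 5 frames differ: frame 5 ('a' vs 'have') plus frames 6-9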

4.2.1 Partially Normalized Frame Error

In [Wessel & Schlüter+ 2001c] a modified version of the frame error is used as loss function for lattice-based Bayes risk decoding. The modified frame error has an additional normalization term with the intention to average between frame error and a word-like error; the resulting error is defined as

    c_{hyp\text{-}nFE}(a_1^L, b_1^K) := \sum_{l=1}^{L} \frac{ \mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k) }{ 1 + \alpha\,(\mathrm{dur}(a_l) - 1) }.    (4.2)

The parameter α smoothly interpolates between frame- and word-wise normalization. According to Equation (3.26) the Bayes risk decoding can be implemented as a lattice re-scoring with the following re-scoring function:

    c_{hyp\text{-}nFE}(a; S, \alpha) := \frac{ \mathrm{dur}(a) - \sum_{b \in E(S): o(a,b)>0} h(a,b)\, p(b|x_1^T) }{ 1 + \alpha\,(\mathrm{dur}(a) - 1) }
                                      = \frac{ \mathrm{dur}(a) - \sum_{t=\mathrm{beg}(a)}^{\mathrm{end}(a)-1} p_t(i(a)|x_1^T) }{ 1 + \alpha\,(\mathrm{dur}(a) - 1) }    (4.3)

The resulting decoding rule is referred to as the min.hyp-nFE decoding rule¹, where hyp-nFE is short for hypothesis-side normalized frame error. The extension of the min.hyp-nFE decoding rule to system combination is straightforward by using Equation (3.14), i.e. by setting p(w_1^N|x_1^T) = \sum_{j=1}^{J} p(j)\, p_j(w_1^N|x_1^T). This is equivalent to computing the frame-wise word posteriors according to Equation (3.17), i.e. p_t(w|x_1^T) = \sum_{j=1}^{J} p(j)\, p_{j,t}(w|x_1^T), and is exactly the form given in [Hoffmeister & Klein+ 2006]. In [Chen & Lee 2006] the authors start from a different approach, but it is easy to see that their frame error based combination rule is exactly the min.hyp-nFE rule for system combination, where the lattice union serves as hypothesis space.

¹ In previous work the resulting decoder was referred to as the min.fWER decoder. However, for a consistent notation throughout the thesis the name is changed to min.hyp-nFE decoder.
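As an illustration of the right-hand side of Equation (4.3), the following sketch scores a single hypothesis arc from frame-wise word posteriors; the data layout (a list of per-frame dictionaries) is a simplification chosen for the example.

    def hyp_nfe_score(arc, frame_posteriors, alpha=0.05):
        """Re-scoring term of Equation (4.3) for a single hypothesis arc.

        arc: (begin_frame, end_frame, word)
        frame_posteriors: list indexed by time frame, each entry a dict
                          word -> frame-wise word posterior p_t(w|x)
        """
        beg, end, word = arc
        dur = end - beg
        overlap_mass = sum(frame_posteriors[t].get(word, 0.0) for t in range(beg, end))
        return (dur - overlap_mass) / (1.0 + alpha * (dur - 1))

    posts = [{"have": 0.9, "move": 0.1}] * 4 + [{"have": 0.4, "move": 0.6}] * 2
    print(hyp_nfe_score((0, 6, "have"), posts))   # low cost: 'have' dominates most frames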



Figure 4.1. The bias in partially normalized frame errors. In a) the frame error is normalized w.r.t. the hypothesis, which results in ignoring deletion errors (left side) while insertions are counted (right side). In b) the frame error is normalized w.r.t. the reference and insertion errors are ignored (left side) while deletions are counted.

4.2.2 Symmetrically Normalized Frame Error

In Equation (4.2) a normalization term is introduced for the frame error. However, the normalization is applied only w.r.t. the left argument, the hypothesis, which destroys the symmetry in the definition of the plain frame error, cf. Equation (4.1). Let us assume that the left argument is the hypothesis and the right argument the reference. The notation in Equation (4.1) stresses the symmetry. Breaking the symmetry and normalizing w.r.t. the hypothesis ignores deletions, while normalizing w.r.t. the reference ignores insertions. Figure 4.1 illustrates the behavior. The min.hyp-nFE decoding rule defined in Equation (4.2) normalizes w.r.t. the hypothesis, which causes a deletion bias. Consequently, experimental results show a high deletion ratio for the min.hyp-nFE decoder which increases with larger α. Not surprisingly, the optimal performance is achieved with a small α, usually around 0.05. On the other hand a normalization is reasonable, because the plain frame error depends on the duration of the words and is dominated by long words.

Two approaches have been proposed to normalize the frame error without breaking the symmetry, i.e. without introducing a bias towards deletions or insertions. The first approach implements the cost function proposed in [Gibson 2008]. The symmetry of the error is achieved on arc level by counting the total number of frames at which two overlapping arcs differ, divided by the length of the shorter arc. The resulting re-scoring function for the Bayes risk decoder of the second type, cf. Equation (3.26), is given by

    c_{arc\text{-}nFE}(a; S) := \sum_{b \in E(S): o(a,b)>0} p(b|x_1^T) \, \frac{ \max(\mathrm{end}(a), \mathrm{end}(b)) - \min(\mathrm{beg}(a), \mathrm{beg}(b)) - h(a,b) }{ \min(\mathrm{dur}(a), \mathrm{dur}(b)) }.    (4.4)

So far, the arc-nFE² was only tested for discriminative acoustic model training, where the approximation shows good results. The approach proposed in [Hoffmeister & Schlüter+ 2009] achieves the symmetry on path level by averaging the hypothesis- and the reference-normalized frame error. The error is called path-nFE and is defined as

    c_{path\text{-}nFE}(a_1^L, b_1^K) := \gamma \sum_{l=1}^{L} \frac{ \mathrm{dur}(a_l) - \sum_{k=1}^{K} h(a_l, b_k) }{ \mathrm{dur}(a_l) } + (1-\gamma) \sum_{k=1}^{K} \frac{ \mathrm{dur}(b_k) - \sum_{l=1}^{L} h(b_k, a_l) }{ \mathrm{dur}(b_k) }.    (4.5)

The parameter γ allows biasing the error towards deletions or insertions; symmetry is achieved for γ = 0.5. Obviously, the error has a new bias: substitutions are penalized twice compared to insertions and deletions. However, experimental results do not show a significantly increased fraction of insertions or deletions in the error rates. Using the path-nFE error in Bayes risk decoding yields the min.path-nFE decoder. The re-scoring function for a Bayes risk decoder of the second type is derived by inserting the definition of the error into

² The author refers to the error as symmetrically normalised frame error (SNFE). However, for a consistent notation throughout the thesis the name is changed to arc-nFE.



Table 4.1. Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three different approaches to word-wise frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System       Norm.         CER[%] (del/ins) err
                           dev07¹              eval07              dev08
baseline                   (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
s1           hyp.          (2.92/1.38) 14.35   (4.62/0.79) 14.98   (3.01/0.75) 13.13
             arc-sym.      (2.68/1.53) 14.42   (4.45/0.90) 15.09   (2.80/0.83) 13.09
             path-sym.     (2.52/1.61) 14.23   (4.32/0.98) 14.96   (2.75/0.94) 13.11
s1+s2        hyp.          (3.07/1.30) 13.57   (4.69/0.68) 13.95   (3.05/0.70) 12.54
             arc-sym.      (2.83/1.41) 13.83   (4.58/0.80) 14.21   (2.85/0.70) 12.67
             path-sym.     (2.57/1.58) 13.49   (4.31/0.90) 13.93   (2.65/0.89) 12.45
s1+s2+s3     hyp.          (3.06/1.23) 13.18   (4.72/0.69) 13.71   (3.01/0.72) 12.22
             arc-sym.      (2.85/1.30) 13.45   (4.70/0.73) 14.09   (2.92/0.72) 12.52
             path-sym.     (2.99/1.22) 13.06   (4.76/0.66) 13.64   (3.04/0.71) 12.22
¹ tuning set

Equation (3.26):

    x_1^T \to \hat W := \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K|x_1^T) \, c_{path\text{-}nFE}(a_1^L, b_1^K)
                      = \operatorname*{argmin}_{a_1^L \in H} \sum_{l=1}^{L} \Big[ \gamma \, \frac{ \mathrm{dur}(a_l) - \sum_{b \in E(S)} h(a_l,b)\, p(b|x_1^T) }{ \mathrm{dur}(a_l) } + (1-\gamma) \sum_{b \in E(S)} p(b|x_1^T) - (1-\gamma) \sum_{b \in E(S)} \frac{ h(b,a_l)\, p(b|x_1^T) }{ \mathrm{dur}(b) } \Big]
                      = \operatorname*{argmin}_{a_1^L \in H} \sum_{l=1}^{L} \underbrace{ \Big[ \gamma \, \frac{ \mathrm{dur}(a_l) - \sum_{b \in E(S)} h(a_l,b)\, p(b|x_1^T) }{ \mathrm{dur}(a_l) } - (1-\gamma) \sum_{b \in E(S)} \frac{ h(a_l,b)\, p(b|x_1^T) }{ \mathrm{dur}(b) } \Big] }_{ := c_{path\text{-}nFE}(a_l; S, \gamma) }    (4.6)

where the term (1-\gamma) \sum_{b \in E(S)} p(b|x_1^T), which does not depend on the hypothesis, has been dropped in the last step. The left term in c_{path-nFE}(·; S, γ) equals (up to the weight γ) the hyp-nFE cost function with α = 1. For the path symmetric cost function the smoothing parameter α could easily be included in Equation (4.5), but in preliminary experiments it turned out not to be necessary for optimal performance.

4.2.3 Results

In this section results for the Bayes risk decoder with frame error based local cost functions are presented and discussed. Experiments have been performed for single lattices and for union based lattice combinations, cf. Section 3.2.3. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments acoustic and language model scales and the system weights in the union based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. In addition, α for the min.hyp-nFE decoder and γ for the min.path-nFE decoder are included in the optimization process. The optimization algorithm is described in Section 3.7.



Table 4.2. Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three different approaches to word-wise frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

System                   Norm.         WER[%] (del/ins) err
                                       eval06¹             eval07
baseline LIMSI                         (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                    hyp.          (1.95/1.15) 8.08    (2.22/0.99) 9.00
                         arc-sym.      (1.72/1.34) 8.24    (1.82/1.22) 9.19
                         path-sym.     (1.68/1.32) 8.05    (1.84/1.15) 9.00
LIMSI+RWTH               hyp.          (1.60/0.85) 6.65    (1.99/0.76) 7.73
                         arc-sym.      (1.57/1.29) 8.35    (2.02/1.21) 9.58
                         path-sym.     (1.62/0.76) 6.46    (2.09/0.73) 7.57
LIMSI+RWTH+UKA           hyp.          (1.80/0.72) 6.48    (2.21/0.68) 7.52
                         arc-sym.      (1.61/1.39) 8.23    (1.74/1.27) 9.19
                         path-sym.     (1.53/0.74) 6.24    (2.01/0.74) 7.28
LIMSI+RWTH+UKA+IRST      hyp.          (1.70/0.79) 6.52    (1.93/0.76) 7.26
                         arc-sym.      (1.57/1.28) 8.33    (2.01/1.22) 9.55
                         path-sym.     (1.36/0.85) 6.10    (1.81/0.85) 7.21
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign

In the first set of experiments the three frame error based cost functions hyp-nFE, arc-nFE, and path-nFE are compared. The definitions of the cost functions can be found in Equation (4.3), Equation (4.4), and Equation (4.6). The results for the Chinese system are summarized in Table 4.1. The arc-nFE cost clearly performs worst. In a direct comparison of the partially normalized hyp-nFE and the symmetrically normalized path-nFE a small advantage of the path-nFE over the hyp-nFE is observed. Looking at the deletion/insertion ratio shows that the path-nFE cost has a reduced deletion ratio compared to the hyp-nFE cost. The parameters are tuned for minimum error rate, which means that a low del/ins ratio only appears if it is beneficial for the decoder performance. In fact, for the combination of three systems the del/ins ratio hardly changes between the min.hyp-nFE and the min.path-nFE decoder, which means that the optimal error rate comes with a rather high deletion rate. For the cross-site combination results shown in Table 4.2 the benefit from the reduced del/ins ratio of the min.path-nFE decoder is larger. Especially for the combination of three and four systems the error rate benefits from a lower del/ins ratio. Here again, the path-nFE cost outperforms the other two costs, and arc-nFE clearly performs worst.

The second set of experiments investigates the influence of the size of the hypothesis space on the decoding result. By default, for experiments requiring exact word boundaries in the hypothesis, like the frame error based costs, the hypothesis space is the time-conditioned form of the summation space lattice, cf. Section 3.3.3. The summation space lattices are the result of a word-conditioned tree search decoder and thus are word-conditioned lattices. The experimental results presented in Table 4.3 and Table 4.4 compare the two hypothesis spaces: the summation space lattice and the time-conditioned form of the summation space lattice. In the combination experiments the summation space lattice is the union of the system-dependent lattices. Experiments are performed with the min.path-nFE decoder, which performed best among all tested frame error based decoders. The results for the Chinese system show no clear advantage for the time-conditioned hypothesis space, whereas the cross-site combination clearly benefits from the increased size of the hypothesis space. The reason for the different behavior lies in the different decoding setups. The Chinese system uses many short segments. In contrast, the English cross-site combination uses only a few segments, each spanning a whole recording. Defining the hypothesis space as the union of the system-dependent



Table 4.3. Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare the word- and time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System       Time-Cond. Hyp. Space    CER[%] (del/ins) err
                                      dev07¹              eval07              dev08
baseline                              (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
s1           no                       (2.74/1.47) 14.23   (4.45/0.84) 14.87   (2.93/0.85) 13.09
             yes                      (2.52/1.61) 14.23   (4.32/0.98) 14.96   (2.75/0.94) 13.11
s1+s2        no                       (2.64/1.52) 13.50   (4.37/0.84) 13.98   (2.73/0.86) 12.42
             yes                      (2.57/1.58) 13.49   (4.31/0.90) 13.93   (2.65/0.89) 12.45
s1+s2+s3     no                       (2.80/1.32) 13.09   (4.61/0.72) 13.67   (2.89/0.77) 12.19
             yes                      (2.99/1.22) 13.06   (4.76/0.66) 13.64   (3.04/0.71) 12.22
¹ tuning set

Table 4.4. Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare the word- and time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

System                   Time-Cond. Hyp. Space    WER[%] (del/ins) err
                                                  eval06¹             eval07
baseline LIMSI                                    (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                    no                       (1.90/1.18) 8.07    (2.11/1.03) 8.96
                         yes                      (1.68/1.32) 8.05    (1.84/1.15) 9.00
LIMSI+RWTH               no                       (1.50/1.14) 6.94    (1.93/0.97) 7.77
                         yes                      (1.62/0.76) 6.46    (2.09/0.73) 7.57
LIMSI+RWTH+UKA           no                       (1.55/0.93) 6.61    (2.02/0.94) 7.80
                         yes                      (1.53/0.74) 6.24    (2.01/0.74) 7.28
LIMSI+RWTH+UKA+IRST      no                       (1.58/0.97) 6.60    (2.03/0.88) 7.62
                         yes                      (1.36/0.85) 6.10    (1.81/0.85) 7.21
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign

lattices does not allow switching between the hypotheses of the different systems within a segment. For Chinese this restriction does no harm as the segments are short anyway. But for the English task with its long segments the restriction has a clear negative impact. Consequently, the large benefit of the time-conditioned hypothesis space is only observed in the combination case, but not for single lattice decoding.

4.3 Local Alignment based Error

In cost functions based on local alignments each word in the reference is aligned to all sub-paths which overlap in time with the reference word. These are cost functions of the first type and the Bayes risk hypothesis can be computed according to Equation (3.25).

4.3.1 Povey's Approximation in MPE/MWE Training

The most prominent cost function based on a local alignment is the approximation used for minimum risk acoustic model training in [Povey & Woodland 2002]. The cost between lattice path a_1^L and lattice path b_1^K is defined as

    c_{Povey}(a_1^L, b_1^K) := K - \sum_{l=1}^{L} \max_{k} \Big( -1 + \frac{o(a_l, b_k)}{\mathrm{dur}(b_k)} + \frac{h(a_l, b_k)}{\mathrm{dur}(b_k)} \Big).    (4.7)

In practice, the approximation is applied either on word level or on phoneme level. Accordingly, the combination with minimum risk acoustic model training is referred to as minimum word error (MWE) training and minimum phone error (MPE) training, respectively. MPE training is the de-facto standard in discriminative acoustic model training for LVCSR systems. The MPE criterion is also used as loss function for Bayes risk decoding. In [Chen & Lee 2006] the authors develop an arc-wise cost based on the phoneme error approximation for word lattice-based system combination and decoding. However, the phoneme alignment is computed only within a word lattice arc, which eventually yields a cost function of the second type. The approach presented in [Xu & Povey+ 2009] re-scores N-best lists with the phoneme error approximation used by Povey for MPE training. The cost function developed in this section is a modified version of Povey's cost, but applied on words, and was first published in [Hoffmeister & Schlüter+ 2009].

A drawback of the approximations used in MPE/MWE training is that they show a strong deletion bias, as pointed out in [Gibson 2008; Zheng & Stolcke 2005] and experimentally verified for Bayes risk decoding in [Hoffmeister & Schlüter+ 2009]. Alternative criteria like the minimum phone frame error (MPFE) [Zheng & Stolcke 2005] have been proposed, but the MPFE objective is rather expensive to compute and requires a state alignment. Furthermore, neither phoneme nor state alignments might be available, e.g. in a cross-site system combination task. The criterion proposed in this thesis modifies Povey's original cost applied to words. Equation (4.7) is extended by an additional term which adds a penalty if the occurrence of a deletion is likely. The two terms in the original error definition in Equation (4.7) have the following interpretation: -1 + o(a_l,b_k)/dur(b_k) adds a penalty if an insertion is likely, and h(a_l,b_k)/dur(b_k) is the accuracy, thus indirectly modeling substitutions and deletions. An additional term -1 + o(a_l,b_k)/dur(a_l) is introduced, which is similar to the insertion penalty but normalized by the duration of the hypothesis word dur(a_l). The motivation is: if a long hypothesis word a_l competes with a much shorter reference word b_k, then presumably a deletion takes place, which is penalized by the new term. The new term is weighted by a scalar χ which allows smoothly increasing the deletion penalty; setting χ = 0 yields Povey's original criterion. The resulting re-scoring function for



Table 4.5. The substitution, insertion, and deletion error for the discrete and the continuous case of the 1/2 overlap approximation.

error           discrete                               continuous
substitution    δ(i(a), i(b))   if o(a,b) > 0.5        h(a,b) / dur(b)
insertion       1               if o(a,b) ≤ 0.5        (dur(a) − o(a,b)) / dur(a)
deletion        1               if o(a,b) ≤ 0.5        (dur(b) − o(a,b)) / dur(b)

the type one Bayes risk decoder, cf. Equation (3.25), is given by

    x_1^T \to \hat W := \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K|x_1^T) \, c_{\chi Povey}(a_1^L, b_1^K)
                      = \operatorname*{argmin}_{a_1^L \in H} \Big[ \underbrace{ \sum_{b_1^K \in S} p(b_1^K|x_1^T)\, K }_{ =\mathrm{const}(a_1^L) } + \sum_{l=1}^{L} \underbrace{ \sum_{b_j^k} \Big( -p(b_j^k|x_1^T) \max_{i \in \{j,...,k\}} \Big[ -1 + \frac{o(a_l,b_i)}{\mathrm{dur}(b_i)} + \chi\Big(-1 + \frac{o(a_l,b_i)}{\mathrm{dur}(a_l)}\Big) + \frac{h(a_l,b_i)}{\mathrm{dur}(b_i)} \Big] \Big) }_{ := c_{\chi Povey}(a_l; S, \chi) } \Big],    (4.8)

where the inner sum runs over all sub-paths b_j^k for which a path \phi_1^\kappa \in S containing b_j^k exists and which satisfy o(a_l, b_i) > 0 for i \in [j,k] and o(a_l, b_i) = 0 for i \notin [j,k].
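The bracketed accuracy term inside Equation (4.8) is easy to illustrate in isolation. The sketch below evaluates it for one hypothesis arc against the arcs of a single reference path; the posterior weighting and the sub-path bookkeeping of the full decoder are omitted, and all names and the toy arcs are illustrative.

    def chi_povey_accuracy(hyp_arc, ref_arcs, chi=0.5):
        """Local accuracy of one hypothesis arc, cf. the bracket in Equation (4.8):
        the maximum over all time-overlapping reference arcs of
        (-1 + o/dur(b)) + chi*(-1 + o/dur(a)) + h/dur(b).

        Arcs are (begin_frame, end_frame, word); chi = 0 gives Povey's original term.
        """
        beg_a, end_a, word_a = hyp_arc
        dur_a = end_a - beg_a
        best = float("-inf")
        for beg_b, end_b, word_b in ref_arcs:
            o = max(0, min(end_a, end_b) - max(beg_a, beg_b))   # overlap in frames
            if o == 0:
                continue
            dur_b = end_b - beg_b
            h = o if word_a == word_b else 0                    # h(a,b) = o(a,b)*delta(i(a),i(b))
            best = max(best, (-1 + o / dur_b) + chi * (-1 + o / dur_a) + h / dur_b)
        return best

    # a long hypothesis arc overlapping a much shorter reference arc is penalized for chi > 0
    print(chi_povey_accuracy((0, 10, "have"), [(0, 2, "have")], chi=0.0))   # 1.0
    print(chi_povey_accuracy((0, 10, "have"), [(0, 2, "have")], chi=1.0))   # 0.2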

4.3.2 The 1/2 Overlap Approximation

The use of fractional values in the error approximation in MPE/MWE training is a tribute to the locality of the approximation, because two hypothesis arcs a and a' can be assigned to the same competing arc b. The flaw in the alignment can be avoided by requiring that a (or a') can only be aligned to b if the fractional overlap exceeds one half. Following this consideration, two cost functions are developed in [Hoffmeister & Schlüter+ 2009] and presented in this section. The cost for an error can now be chosen discrete, like in the Levenshtein alignment, or again be smoothed by using normalized overlaps. For hypothesis arc a and reference arc b Table 4.5 summarizes the definition of the substitution, insertion, and deletion error for both cases.

The requirement of a minimum overlap of 0.5 is in practice too strong for an optimal error rate. Instead, 0.5 is replaced by the parameter β, which is empirically optimized on the tuning set. Equation (4.9) shows the resulting Bayes risk decoder for the continuous case; the discrete case follows analogously.

    x_1^T \to \hat W := \operatorname*{argmin}_{a_1^L \in H} \sum_{b_1^K \in S} p(b_1^K|x_1^T) \, c_{\beta Povey}(a_1^L, b_1^K)
                      = \operatorname*{argmin}_{a_1^L \in H} \Big[ \underbrace{ \sum_{b_1^K \in S} p(b_1^K|x_1^T)\, K }_{ =\mathrm{const}(a_1^L) } + \sum_{l=1}^{L} \Big( 1\big[\forall b \in S : o(a_l,b) < \beta\big] + \sum_{b_j^k} \Big( -p(b_j^k|x_1^T) \max_{i \in \{j,...,k\}} \Big[ -1 + \frac{o(a_l,b_i)}{\mathrm{dur}(b_i)} + \Big(-1 + \frac{o(a_l,b_i)}{\mathrm{dur}(a_l)}\Big) + \frac{h(a_l,b_i)}{\mathrm{dur}(b_i)} \Big] \Big) \Big) \Big],    (4.9)

where the inner sum runs over all sub-paths b_j^k for which a path \phi_1^\kappa \in S containing b_j^k exists and which satisfy o(a_l, b_i) > \beta for i \in [j,k] and o(a_l, b_i) \le \beta for i \notin [j,k]. The penalty term 1[\forall b \in S : o(a_l,b) < \beta] is necessary to count an insertion in the case that the minimum overlap requirement prohibits the alignment of the hypothesis arc a_l to any arc from the summation space lattice.



Table 4.6. Minimum local alignment error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare four variants of the local alignment based cost. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System       Criterion         CER[%] (del/ins) err
                               dev07¹              eval07              dev08
baseline                       (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
s1           POV               (2.89/1.39) 14.33   (4.62/0.80) 15.03   (3.00/0.75) 13.14
             χPOV              (2.32/1.68) 14.17   (4.23/1.03) 15.01   (2.61/0.97) 13.04
             βINT (cont.)      (2.70/1.51) 14.33   (4.45/0.89) 14.98   (2.84/0.84) 13.12
             βINT (disc.)      (2.61/1.55) 14.34   (4.46/0.92) 15.01   (2.73/0.88) 13.06
s1+s2        POV               (3.11/1.25) 13.60   (4.75/0.67) 14.00   (3.06/0.70) 12.57
             χPOV              (2.47/1.53) 13.44   (4.32/0.85) 13.93   (2.58/0.86) 12.35
             βINT (cont.)      (2.78/1.37) 13.48   (4.51/0.75) 13.97   (2.80/0.75) 12.49
             βINT (disc.)      (2.68/1.45) 13.54   (4.44/0.82) 14.00   (2.78/0.81) 12.44
s1+s2+s3     POV               (3.12/1.14) 13.19   (4.82/0.62) 13.74   (3.16/0.68) 12.26
             χPOV              (2.61/1.33) 13.09   (4.48/0.75) 13.67   (2.72/0.78) 12.08
             βINT (cont.)      (2.69/1.31) 13.12   (4.55/0.72) 13.70   (2.81/0.74) 12.15
             βINT (disc.)      (2.58/1.38) 13.20   (4.50/0.80) 13.87   (2.74/0.77) 12.25
¹ tuning set

The 1/2 overlap cost function is almost identical to the modified version of Povey's cost, cf. Equation (4.8). For β = 0.5 the additional deletion penalty now becomes crucial, because due to the 1/2 overlap constraint the accuracy term no longer accounts for the deletion. Therefore, χ is set to one when choosing β = 0.5. For β < 0.5, and especially for β = 0, two hypothesis arcs can in fact be aligned to the same reference arc and the accuracy term (indirectly) penalizes the deletions. Consequently, the impact of the χ-term has to be decreased in order to avoid an overestimation of the deletion error.

4.3.3 Results

In this section results for the Bayes risk decoder with local alignment based cost functions are presented and discussed. Experiments have been performed for single lattices and for union based lattice combinations, cf. Section 3.2.3. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C. For all experiments acoustic and language model scales and the system weights in the union based combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. In addition, for the Bayes risk decoders based on Equation (4.8) and Equation (4.9) the deletion penalty weight χ and the minimum overlap β, respectively, are included in the parameter optimization. The optimization algorithm is described in Section 3.7.

Four different local alignment based cost functions are compared: the original cost approximation used by Povey for minimum risk training (abbreviated as POV), the modified cost with the additional deletion penalty term (χPOV), and the 1/2 overlap approximation (βINT) with continuous and discrete costs. The results are summarized in Table 4.6 and Table 4.7. For the English cross-site combination setup experimental results are presented only for single systems and the combination of two systems. For the combination of three and four systems the computation of the local alignment in the lattice union became infeasible. The reason is that long hypothesis words were aligned to a highly connected cloud of short words. During the alignment the cloud was expanded into a huge number of paths, each one being aligned against the hypothesis word.



Table 4.7. Minimum local alignment error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare four variants of the local alignment based cost. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

System               Criterion         WER[%] (del/ins) err
                                       eval06¹             eval07
baseline LIMSI                         (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                POV               (1.67/1.29) 8.04    (1.87/1.15) 9.03
                     χPOV              (1.62/1.40) 8.13    (1.73/1.22) 8.99
                     βINT (cont.)      (1.66/1.33) 8.07    (1.79/1.14) 8.96
                     βINT (disc.)      (1.65/1.28) 8.07    (1.82/1.24) 9.09
LIMSI+RWTH           POV               (1.78/0.72) 6.66    (2.33/0.61) 7.73
                     χPOV              (1.48/0.90) 6.61    (2.05/0.79) 7.70
                     βINT (cont.)      (1.44/0.92) 6.66    (1.96/0.88) 7.96
                     βINT (disc.)      (1.66/0.87) 6.70    (2.10/0.74) 7.81
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign

Overall, the χPOV cost shows good performance and gives the best result in almost all experiments. The χPOV cost profits from the deletion penalty, and in most experiments the deletion ratio is significantly reduced compared to the POV result. The 1/2 overlap approximation with the continuous cost improves over POV in some experiments and also decreases the deletion ratio. The continuous version is clearly superior to the discrete version, but the overall performance of the χPOV cost is slightly better. In a direct comparison to the frame error based cost functions from Section 4.2.3 the local alignment based costs are competitive in terms of error rate, but clearly outperformed in terms of run-time.

4.4 Confusion Network Distance based Error

Confusion networks (CNs) have already been introduced in the last chapter in Section 3.4. In this work the interest is in the CN derived directly from a lattice L via a function σ : E(L) → N, where σ(a) < σ(b) holds for two consecutive arcs a and b. The integer σ(a_l) is interpreted as the position of arc a_l within the alignment of lattice path a_1^L and any other path through L: two arcs mapped to the same position are aligned. Note that due to the constraint on σ(·) two arcs on the same path are never aligned. From the alignments position-wise word posterior probability distributions can be derived, cf. Section 3.4. In the common CN terminology the alignment positions are referred to as slots and σ(·) is called the slot function. Following this terminology, a CN is defined as the ordered sequence of the slot-wise word posterior distributions. The CN can be expressed as a word lattice without time stamps and with a sausage structure and is denoted by CN(L, σ(·)), or by CN(L) for an arbitrary slot function. From the construction it follows that CN(L, σ(·)) is also a compact representation of the alignment between each two sentences accepted by L. From the alignment given by the slot function a path distance can be computed, cf. Equation (3.27), which is referred to as the CN distance³.

The CN distance yields a cost function of the second type and thus can be efficiently decoded in the Bayes risk framework. The Bayes risk decoder distinguishes between the hypothesis space lattice H and the summation space lattice S, and thus slot functions for both lattices are required in order to produce an alignment between any path in H and any path in S. In the last chapter in Section 3.4 the special case of using the CN derived from S as hypothesis space, i.e. H := CN(S), was investigated. For the general case let us assume that two slot functions σ_H(·) and σ_S(·) exist and that two arcs a ∈ E(H) and b ∈ E(S) are aligned if σ_H(a) = σ_S(b). Furthermore, let S be the number of slots in the corresponding CNs, i.e. S := max_{a ∈ E(H)} σ_H(a) = max_{b ∈ E(S)} σ_S(b).


³ The terminology is somewhat misleading, as the path distance depends on the slot function, but not on the CN itself.


Figure 4.2. The figure shows a lattice, a CN derived from the lattice, and a lattice in which all paths have the same length. The positions for the insertions of the ε-arcs are derived from the CN according to the algorithm described in the text. The number at the arcs corresponds to the CN slot the arc is assigned to, and the number in the states is the minimum slot number over all outgoing arcs.

The next step is to compute the CN distance between a path a_1^L ∈ H and a path b_1^K ∈ S. Note that in general the sequence σ(a_1), σ(a_2), ..., σ(a_L) is not consecutive but can have gaps, and likewise for b_1^K. By the insertion of ε-arcs the gaps can be filled, both paths are brought to the equal length S, and the CN distance is computed according to Equation (3.27). The positions for the insertion of ε-arcs into a lattice are easily found by the following algorithm. Given a lattice state s and a slot function σ(·), the minimum slot number of state s is defined as

    min.σ(s) := \min_{a ∈ out(s)} σ(a),                                                      (4.10)

where min.σ(s_I) := 0 for the initial state s_I and min.σ(s_F) := S for all final states s_F. Given an arc a with σ(a) = n, then for each i ∈ [min.σ(from(a)), n) an ε-arc with slot number i is inserted before a, and for each j ∈ (n, min.σ(to(a))) an ε-arc with slot number j is inserted after a. Figure 4.2 visualizes the algorithm. With the help of min.σ_H(·) the CN distance between a_1^L and b_1^K can be computed without explicitly inserting ε-arcs into the summation or hypothesis space lattice:

    c_CN(a_1^L, b_1^K) = \sum_{l=1}^{L} \Big[ 1\big( ∃k : σ_H(a_l) = σ_S(b_k) ∧ i(a_l) ≠ i(b_k) \big)
                         + \sum_{s = min.σ_H(from(a_l)), s ≠ σ_H(a_l)}^{min.σ_H(to(a_l)) - 1} 1\big( ∃k : σ_S(b_k) = s ∧ i(b_k) ≠ ε \big) \Big]      (4.11)

The derivation of the re-scoring function for the Bayes risk decoder is now straightforward:

    x_1^T → \hat{W} := \argmin_{a_1^L ∈ H} \sum_{b_1^K ∈ S} p(b_1^K | x_1^T)\, c_CN(a_1^L, b_1^K)

                     = \argmin_{a_1^L ∈ H} \sum_{l=1}^{L} \Big[ \sum_{b ∈ E(S): σ_H(a_l) = σ_S(b) ∧ i(a_l) ≠ i(b)} p(b | x_1^T)
                       + \sum_{s = min.σ_H(from(a_l)), s ≠ σ_H(a_l)}^{min.σ_H(to(a_l)) - 1} \sum_{b ∈ E(S): σ_S(b) = s ∧ i(b) ≠ ε} p(b | x_1^T) \Big]

                     = \argmin_{a_1^L ∈ H} \sum_{l=1}^{L} \underbrace{ \Big[ \big( 1 - p_{σ_H(a_l)}(i(a_l) | x_1^T) \big)
                       + \sum_{s = min.σ_H(from(a_l)), s ≠ σ_H(a_l)}^{min.σ_H(to(a_l)) - 1} \big( 1 - p_s(ε | x_1^T) \big) \Big] }_{ =: c_CN(a_l; S, σ_S(·), σ_H(·)) }      (4.12)
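To make the decoding rule concrete, the following Python sketch computes the per-arc cost c_CN(a_l; S, σ_S(·), σ_H(·)) of Equation (4.12) from precomputed slot-wise posteriors. The data structures (a dictionary of slot-wise posterior distributions and an arc given by its label, its slot number, and the minimum slot numbers of its source and target state) are assumptions made for illustration only, not the representation used in the actual decoder.

    # Minimal sketch (not the original decoder): per-arc CN cost of Equation (4.12).
    # slot_post[s][w] is the slot-wise posterior p_s(w | x_1^T); the empty word is "".

    def arc_cn_cost(word, slot, min_slot_from, min_slot_to, slot_post):
        # substitution/correct term: 1 - p_{sigma_H(a)}(i(a) | x_1^T)
        cost = 1.0 - slot_post[slot].get(word, 0.0)
        # deletion term: slots skipped by the arc must be filled with epsilon
        for s in range(min_slot_from, min_slot_to):
            if s != slot:
                cost += 1.0 - slot_post[s].get("", 0.0)
        return cost

    # Example: an arc labeled "hello" assigned to slot 2; its source state has minimum
    # slot number 1 and its target state has minimum slot number 3, so slot 1 is skipped.
    slot_post = {1: {"hello": 0.6, "": 0.4},
                 2: {"hello": 0.7, "eh": 0.3}}
    print(arc_cn_cost("hello", 2, 1, 3, slot_post))  # -> 0.9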


A nice property of the CN distance is that by setting H = CN(S, σ_S(·)) it is guaranteed that the optimal hypothesis is included in the hypothesis space. Furthermore, the construction of the CN from the slot function yields a sausage lattice in which each path has exactly length S, cf. Section 3.4. By choosing CN(S, σ_S(·)) as hypothesis space the Bayes risk decoding rule defined in Equation (4.12) simplifies to the decoding rule given in Equation (3.28). In all experimental results presented in this work CN(S, σ_S(·)) is used as hypothesis space.

The motivation behind CN distance based Bayes risk decoding is that the alignments defined by the slot function are good approximations of the Levenshtein alignments. The constraint that the outcome of the slot function must be strictly ascending for two consecutive arcs guarantees that the CN distance is an upper bound of the Levenshtein distance, cf. [Mangu 2000]. The slot function of choice minimizes the error on the training samples [x_{r,1}^{T_r}, \tilde{w}_{r,1}^{N_r}]_{r=1}^{R} and is defined as

    σ_opt(·) := \argmin_{σ(·)} \frac{1}{R} \sum_{r=1}^{R} Lev\big( g_CN(·, L_r; σ(·)), \tilde{w}_{r,1}^{N_r} \big).                (4.13)

However, no efficient algorithm is known to compute σ_opt(·) from LVCSR lattices, and in practice heuristic approaches are used with at most a few free parameters which are optimized on a tuning set. Algorithms computing a slot function from a lattice will be referred to as CN construction algorithms.

A common heuristic used in many CN construction algorithms is the time overlap constraint defined in Section 3.3.3 for local cost functions. The constraint demands that arcs assigned to the same slot overlap in time and thus guarantees that two consecutive arcs cannot be aligned. However, the time overlap constraint causes a deletion bias in the subsequent CN decoding. Let us assume that the optimal CN alignment has S slots and that due to the time overlap constraint the outcome of the CN construction algorithm has S' > S slots. A common situation in which the time overlap constraint causes such a suboptimal alignment is the occurrence of short words with fuzzy word boundaries, as pointed out in Section 4.1. Let us assume that the Levenshtein alignment would align these words, although due to the short duration and the fuzzy boundaries they have no or only little overlap in time. The CN construction algorithm would not align these words and would probably create extra CN slots. Eventually, the same number of arcs is spread among more slots and the number of arcs per slot decreases. This weakens the probability of a specific word v if two v-arcs which should be aligned end up in different slots. In turn, this usually strengthens the probability of the empty word in the affected slots and eventually causes the deletion bias.

The first algorithm which constructs a CN directly from a lattice and thereby makes use of the time overlap constraint was introduced in [Mangu & Brill+ 1999]. The main idea is to cluster arcs, where the final clustering defines the slot function. The construction can be significantly sped up by using a so-called pivot path [Hakkani & Riccardi 2003; Stolcke 2002]. An algorithm following this approach is presented in Section 4.4.2. The algorithm requires the computation of the distance between arcs and arc clusters; Section 4.4.1 introduces the distance functions used in the CN construction algorithms presented in this work. In [Xue & Zhao 2005] an algorithm is proposed which traverses the lattice in chronological order and thereby builds state clusters. The state clusters are then used to derive the ultimate arc clusters. An algorithm extending and overcoming some drawbacks of the original version is developed in Section 4.4.3. The third CN construction algorithm is based on frame-wise word posterior probabilities and is introduced in Section 4.4.4. The algorithm was proposed in [Hoffmeister & Schlüter+ 2009] and aims at finding in each iteration a single frame which defines the center of the next CN slot. The algorithm has some interesting properties, e.g. in contrast to the common CN construction algorithms it does not require a distance measure between arcs or arc clusters, and it is completely parameter-free.

4.4.1 Distances between Arcs and Arc Clusters

Distance functions between arcs and arc clusters are an important heuristic in the two CN construction algorithms presented in Section 4.4.2 and Section 4.4.3. A common choice in CN construction algorithms are the distance functions introduced in [Mangu & Brill+ 1999]. However, they depend on a phoneme alignment, which is not always available, especially not in cross-site system combinations, or which is expensive to compute. The distance functions used throughout this work are eventually the result of empirical tests.



Figure 4.3. CN construction with the arc-cluster algorithm.

The distance between two arcs a and b, and between an arc a and an arc cluster C, is computed by:

    d_arc(a, b)  := \big( 2 - δ(i(a), i(b)) \big) \frac{ \max\{end(a), end(b)\} - \min\{beg(a), beg(b)\} }{ dur(a) + dur(b) }

    d_slot(a, C) := \min_{b ∈ C} d_arc(a, b)                                                 (4.14)

In arc clustering algorithms the distances are usually weighted by the posteriors of the arcs, which yields the following weighted forms of the previously defined distances:

    d_warc(a, b)  := \Big( 1 + \frac{ p(a|x_1^T)\, p(b|x_1^T) }{ α } \Big)\, d_arc(a, b)

    d_wslot(a, C) := \min_{b ∈ C} d_warc(a, b)                                               (4.15)

The weight sees to it that the lattice arcs with a high probability of occurrence dominate the CN construction, where the parameter α controls the impact of the weight and is tuned on the development set.
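As a small illustration, the following Python sketch implements the unweighted and weighted arc distances as reconstructed above; the arc representation (a tuple of label, begin time, end time, and posterior) is an assumption made for this sketch and not part of the original implementation.

    # Sketch of the arc and cluster distances of Equations (4.14) and (4.15).
    # An arc is assumed to be a tuple (label, beg, end, posterior); times are frame indices.

    def d_arc(a, b):
        label_a, beg_a, end_a, _ = a
        label_b, beg_b, end_b, _ = b
        delta = 1.0 if label_a == label_b else 0.0
        span = max(end_a, end_b) - min(beg_a, beg_b)
        return (2.0 - delta) * span / ((end_a - beg_a) + (end_b - beg_b))

    def d_warc(a, b, alpha):
        # the weighting factor grows with the posteriors of the two arcs
        return (1.0 + a[3] * b[3] / alpha) * d_arc(a, b)

    def d_wslot(a, cluster, alpha):
        return min(d_warc(a, b, alpha) for b in cluster)

    # Example: two overlapping arcs carrying the same label.
    a = ("hello", 10, 40, 0.7)
    b = ("hello", 12, 38, 0.2)
    print(d_arc(a, b), d_wslot(a, [b], alpha=1.0))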

4.4.2 The Arc-Cluster CN Construction Algorithm

The arc clustering algorithm presented in this section is based on a set of pivot arcs. The pivot arcs are used to initialize the arc clusters. In the next step the algorithm aims at assigning all arcs to the clusters. If some arcs cannot be assigned, because they would violate the consistency of the arc clusters, i.e. they lack overlap in time with some arcs in a cluster, then additional pivot elements are chosen from the remaining arcs and the algorithm starts over.

The idea of using a set of pivot arcs for initializing the arc clusters and subsequently clustering the remaining arcs is presumably the most common approach to CN construction for lattices and also for N-best lists, cf. [Hakkani & Riccardi 2003; Stolcke & Bratt+ 2000; Stolcke 2002]. The method presented in this section is also based on the idea of using a set of pivot arcs, but the algorithm itself was developed as part of this work. The pseudo code for the algorithm is given in Figure 4.4. The distance function used is the weighted distance defined in Equation (4.15).

Figure 4.3 illustrates the algorithm on a small example. The first set of pivot arcs are the arcs from the best path through the lattice, in the example "eh" and "hello". New clusters are initialized from the pivot arcs. The remaining arcs are then assigned in a greedy manner, which makes the other "hello" fall into the same cluster as the first "hello". The silence arc cannot be assigned without violating the time overlap condition. In the next step the silence arc is added to the pivot elements, the algorithm starts over, and eventually three clusters are built.



    # Initialize pivot elements with the arcs making up the best hypothesis
    P <- [ e for e in E(L) if e in best(L) ]
    while True do
        # Initialize remaining arcs
        R <- [ e for e in E(L) if e not in P ]
        # Use pivot elements to initialize the CN slots
        CN <- []
        foreach e in P do
            append(CN, [e])
        # Store remaining arcs together with their distance to the closest slot
        Q <- []
        foreach e in R do
            d_e <- min    { d(e, S) for S in CN if overlap(e, S) > 0 }
            S_e <- argmin { d(e, S) for S in CN if overlap(e, S) > 0 }
            append(Q, (e, d_e, S_e))
        # Sort remaining arcs by their distance to the closest slot
        sort Q by d_e in increasing order
        # Assign remaining arcs to the closest slot, if possible
        # Store arcs that could not be assigned together with their
        # posterior probability
        Q' <- []
        while not empty(Q) do
            (e, d_e, S_e) <- pop(Q)
            if overlap(e, S_e) > 0 then
                append(S_e, e)
            else
                p_e <- p(e | x_1^T)
                append(Q', (e, p_e))
        # If no remaining arcs exist, then stop
        if empty(Q') then
            break
        # Sort remaining arcs by their posterior probability
        sort Q' by p_e in decreasing order
        # Add new pivot elements
        P' <- []
        while not empty(Q') do
            (e, p_e) <- pop(Q')
            if not overlap(e, P') then
                append(P', e)
        P <- P + P'
    finalize CN

Figure 4.4. Pseudo code for the arc-cluster CN construction algorithm.



Figure 4.5. CN construction with the state-cluster algorithm.

The time complexity of the algorithm for a lattice L is in the worst case O(|E(L)|²). However, in practice the algorithm is the fastest of the three CN construction algorithms investigated in this work. The algorithm turned out to be very robust, i.e. it produces among the best results for all tested systems and conditions including the union based system combination. The clusters are built in a greedy manner and no properties can be assured besides that all arcs in a cluster overlap in time. The actual clustering result depends on the distance function used and on the choice of the initial pivot elements.

4.4.3 The State-Cluster CN Construction Algorithm

The state clustering algorithm was proposed in [Xue & Zhao 2005]. The main idea of the algorithm is to visit the lattice states in chronological order and to add all states to the current cluster until the following condition is met: for the state in question s there exists an arc a such that a starts from a state in the current cluster and ends in s. If the condition is fulfilled, a new state cluster is started and initialized with s. Let C(s) denote the number of the state cluster to which state s is assigned, and let us assume that the state clusters are numbered in ascending order; then the constraint guarantees that C(from(a)) < C(to(a)) holds for each arc a. For the subsequent arc clustering step an empty arc cluster is initialized between each two state clusters. The arcs are traversed and arc a is assigned to the best matching arc cluster which lies between the state clusters given by the source and the target state of a. The state clustering constraint guarantees that after the arc clustering step the slot function constraint σ(a) < σ(b) holds for each two consecutive arcs a and b. By default the algorithm uses the unweighted arc distances, which makes the algorithm independent of the posterior probabilities computed from the lattice. The pseudo code for the algorithm is given in Figure 4.6 and an example in Figure 4.5, left side.

The example illustrates a shortcoming of the algorithm: the greedy approach obviously fails in finding the correct arc clustering. The greedy procedure aligns "eh" and the first "hello" before considering the second "hello". Because the target state of the "eh" arc is the source state of the second "hello" arc, a new state cluster is started and the two "hello" arcs cannot be aligned.

In this work an extension of the state-cluster algorithm is developed which can compensate for the shortcoming of the original method.



    # Initialize states
    S <- [ s in S(L) ]
    sort S chronologically in increasing order
    # Initial state cluster
    C_0 <- [ pop(S) ]
    j <- 0
    # Initialize CN
    CN <- []
    while not empty(S) do
        # Process next state
        s <- pop(S)
        # If a potential violation of the alignment property is detected,
        # then start a new state and arc cluster (aka slot)
        if max { state_cluster_index(from(e)) for e in In(s) } = j then
            j <- j + 1
            C_j <- [], A_j <- []
            append(CN, A_j)
        append(C_j, s)
        # Find the best arc slot for all incoming arcs
        foreach e in In(s) do
            i <- state_cluster_index(from(e))
            k <- argmin { d(e, A_k) for k in (i..j] }
            append(A_k, e)
    finalize CN

Figure 4.6. Pseudo code for the state-cluster CN construction algorithm.

The extension allows so-called back-splits, where an existing arc cluster is split and a new state cluster is inserted. The procedure compares the arc to be clustered, a, to all already clustered arcs which overlap in time with a. If an arc a' is found which matches a better than any arc in its current cluster, then the split is accomplished. The right side of Figure 4.5 illustrates the idea: when the matching arc cluster for the second "hello" is searched, the existing arc cluster containing "eh" and the first "hello" is split and both "hello" arcs are assigned to the right cluster. The complete pseudo code for the state-cluster algorithm with back-splitting is given in Figure 4.7.

The time complexity for a lattice L is for both algorithms in the worst case O(|E(L)|²), like the pivot path based arc clustering algorithm from the previous section. In practice the algorithm is fast, though slower than the pivot path based arc clustering algorithm. The performance is good, sometimes the results are even slightly better than for the pivot path based arc clustering. But the algorithm is sensitive to the lattice structure, especially for union based system combinations. For these cases the back-splitting improves the error rates significantly, but the algorithm still does not work as robustly as the algorithm from the previous section. An interesting property of the algorithm is that it works quasi online: in the original algorithm the processing of state s at time t affects only the incoming arcs of state s. In particular, when using the posterior-free distance functions, cf. Equation (4.14), it depends only on what happened chronologically before t.

4.4.4 The Center-Frame CN Construction Algorithm

The heuristic used in the center-frame algorithm as proposed in [Hoffmeister & Schlüter+ 2009] is based on the frame-wise word posterior probabilities p_t(w|x_1^T) and the arc probabilities p(a|x_1^T) computed from the lattice. In contrast to the CN construction algorithms presented in the two previous sections, the algorithm does not rely on distances between arcs and arc clusters. The core idea of the algorithm is to find in each iteration the frame t that best fulfills three conditions.



    # Initialize states
    S <- [ s in S(L) ]
    sort S chronologically in increasing order
    # Initial state cluster
    C_0 <- [ pop(S) ]
    j <- 0
    # Initialize CN
    CN <- []
    while not empty(S) do
        s <- pop(S)
        if max { state_cluster_index(from(e)) for e in In(s) } = j then
            j <- j + 1
            C_j <- [], A_j <- []
            append(CN, A_j)
        append(C_j, s)
        foreach e in In(s) do
            i <- state_cluster_index(from(e))
            k_fwd <- argmin { d(e, A_k) for k in (i..j] }
            d_fwd <- d(e, A_k_fwd)
            # Check whether we prefer a back insertion, i.e. insert edge e
            # into a slot where it might violate the slot consistency
            Q <- [ (k, d(e, A_k)) for k in [1..i] ]
            sort Q by d in increasing order
            while not empty(Q) do
                (k_bwd, d_bwd) <- pop(Q)
                # Try a back insertion, if it is cheaper
                if d_bwd < d_fwd then
                    I_left <- [], I_right <- []
                    foreach e_bwd in A_k_bwd do
                        if overlap(e_bwd, e) = 0 do
                            append(I_left, e_bwd)
                        else
                            append(I_right, e_bwd)
                    if empty(I_left) then
                        # Back insertion doesn't violate the slot consistency
                        append(A_k_bwd, e); break
                    else
                        # Back insertion violates the slot consistency
                        # Check whether a slot split is desired or not
                        F_left <- I_left, F_right <- [ e ]
                        while not empty(I_right) do
                            e_bwd <- pop(I_right)
                            if d(e_bwd, I_left) < d(e_bwd, e) then
                                append(F_left, e_bwd)
                            else
                                append(F_right, e_bwd)
                        # If at least one arc is assigned to the new
                        # slot F_right, then perform the split
                        if |F_right| > 1 then
                            replace(A_k_bwd, (F_left, F_right)); break
                # No back insertion happened
                d_bwd <- infinity
            # If no back insertion happened, then do the forward insertion
            if d_bwd = infinity then
                append(A_k_fwd, e)
    finalize CN

Figure 4.7. Pseudo code for the state-cluster CN construction algorithm with back-splitting.


The first condition requires the definition of the region of maximum overlap

    mo(a) := \bigcap_{b ∈ L:\ i(a) = i(b) ∧ o(a,b) > 0} \big[ beg(b), end(b) \big)            (4.16)

for an arc a. Now, the three conditions are:

1. t lies in the region of maximum overlap of all arcs it intersects with
2. the probability of the empty word has a minimum at time t
3. t lies in the center of all arcs it intersects with

Condition 1 overrules condition 2 and condition 2 overrules condition 3. That is, first the regions are selected which best fulfill condition 1. From these regions those time frames are selected which best fulfill condition 2, and condition 3 is used for the final selection. In the optimal case t is the center of all arcs which intersect with time frame t and none of these arcs is an ε-arc. Conditions 1 and 3 ensure that the arcs in the resulting slot are competitors. Condition 2 aims at reducing the probability of the empty word in a slot and thus at reducing the deletion bias of the resulting CN. In practice, condition 2 enforces a compact CN with the fewest number of slots compared to the alternative CN construction algorithms.

The algorithm has a crucial drawback: the region of maximum overlap for arc a will be empty if there exist two arcs which have the same label as a and overlap with a, but do not mutually overlap. This case is referred to as the ambiguous case, because no unambiguous region of maximum overlap exists. An alternative definition of the region of maximum overlap can be derived based on frame-wise word posteriors, which is equivalent to the original definition given in Equation (4.16) in the unambiguous case, but provides a meaningful set of time stamps in the ambiguous case. The new definition is based on the observation that

    p_t(i(a)|x_1^T) = \max_{beg(a) ≤ τ < end(a)} p_τ(i(a)|x_1^T),   for t ∈ mo(a).

That is, those time frames in an arc's time span where the probability of the arc label is maximal are good candidates for the region of maximum overlap. The resulting region is referred to as the region of maximum probability and is defined as

    mp(a) := \Big\{ t : beg(a) ≤ t < end(a) ∧ p_t(i(a)|x_1^T) = \max_{beg(a) ≤ τ < end(a)} p_τ(i(a)|x_1^T) \Big\}.            (4.17)

The definition guarantees that the region of maximum probability is not empty for any arc and equals the region of maximum overlap in the unambiguous case.

The resulting CN construction algorithm is illustrated in Figure 4.8. The only time frame that comes close to fulfilling all three conditions is frame 22. The slot derived from frame 22 contains both "hello" arcs. Assuming that the non-word "[si]" is regarded by the algorithm as the empty word, the next choice is frame 5, covering the "eh" arc. Finally, the third slot is built from the silence arc.

The complete algorithm is given in pseudo code in Figure 4.9. In the first two lines the algorithm is initialized, where E := {a ∈ L : i(a) ≠ ε} is the set of all non-ε arcs. The main loop starts in line 8 with updating the frame-wise word posteriors and the frame-wise average deviation from the arc centers. In the experiments presented in Section 4.4.5 the deviation is measured by the l1-norm, but the l2-norm works well, too. In line 14 the frame-wise ε-posteriors are updated, whereby only arcs are considered whose region of maximum probability intersects with the current time frame. In line 19 the selection of the next slot building frame starts. Finally, in line 27 the next slot is created, where again only those arcs are considered whose region of maximum probability intersects with the slot building time frame.

The algorithm has some nice properties. First of all, it does not require a distance function for arcs or arc clusters and it is completely parameter-free. The abandonment of the distance function has a direct consequence: the algorithm is invariant to the fragmentation of ε-paths, i.e. consecutive arcs of silence, noise, or other non-words. All slots produced by the algorithm contain only non-ε arcs, whereas the other algorithms usually produce many slots containing only ε-arcs. Furthermore, it is guaranteed that all overlapping arcs with the same label are assigned to the same slot, if an unambiguous solution exists.
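As an illustration of Equation (4.17), the following Python sketch computes the region of maximum probability of an arc from frame-wise word posteriors; the simple dictionary representation of the posteriors is an assumption made for this sketch only.

    # Sketch: region of maximum probability mp(a) of Equation (4.17).
    # frame_post[t][w] is the frame-wise posterior p_t(w | x_1^T).

    def region_of_max_probability(label, beg, end, frame_post):
        probs = [frame_post[t].get(label, 0.0) for t in range(beg, end)]
        p_max = max(probs)
        return [beg + i for i, p in enumerate(probs) if p == p_max]

    # Example: an arc "hello" spanning frames 10..14 (end exclusive).
    frame_post = {10: {"hello": 0.2}, 11: {"hello": 0.5}, 12: {"hello": 0.5},
                  13: {"hello": 0.3}}
    print(region_of_max_probability("hello", 10, 14, frame_post))  # -> [11, 12]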



Figure 4.8. CN construction with the center-frame algorithm.

     1  # Initialize set of non-eps edges
     2  E <- [ e for e in E(L) if label(e) != eps ]
     3  # Initialize CN
     4  CN <- []
     5  # Main loop
     6  while not empty(E) do
     7      # Update frame-wise word posteriors
     8      foreach t in [1..T] do
     9          p_t(eps|x_1^T) <- 1
    10          dev_t(x_1^T) <- sum { |t - center(e)| * p_t(e|x_1^T) for e in E }
    11          foreach w in W do
    12              p_t(w|x_1^T) <- sum { p_t(e|x_1^T) for e in E if label(e) = w }
    13      # Update frame-wise eps-posteriors
    14      foreach e in E do
    15          p_max <- max { p_t(label(e)|x_1^T) for t in [begin(e)..end(e)) }
    16          for t in [begin(e)..end(e)) with p_t(label(e)|x_1^T) = p_max do
    17              p_t(eps|x_1^T) <- p_t(eps|x_1^T) - p_max
    18      # Find next slot building frame
    19      n <- infinity
    20      foreach e in E do
    21          p_max <- max { p_t(label(e)|x_1^T) for t in [begin(e)..end(e)) }
    22          for t in [begin(e)..end(e)) with p_t(label(e)|x_1^T) = p_max do
    23              if ( p_t(eps|x_1^T) < p_n(eps|x_1^T) ) and
    24                 ( dev_t(x_1^T) < dev_n(x_1^T) ) then
    25                  n <- t
    26      # Build next slot
    27      S <- []
    28      foreach e in E with n in [begin(e)..end(e)) do
    29          p_max <- max { p_t(label(e)|x_1^T) for t in [begin(e)..end(e)) }
    30          if p_n(label(e)|x_1^T) = p_max then
    31              append(S, e)
    32              remove(E, e)
    33      append(CN, S)
    34  finalize CN

Figure 4.9. Pseudo code for the center-frame CN construction algorithm.



Table 4.8. CN decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

                                                         CER[%] (del/ins) err
System       Comb.   CN alg.               dev07 (tuning set)    eval07               dev08
baseline s1          Viterbi               (2.63/1.59) 14.54     (4.42/0.91) 15.08    (2.80/0.87) 13.28
s1                   arc-cluster           (2.79/1.45) 14.30     (4.53/0.85) 14.96    (2.85/0.80) 13.05
                     state-cluster (mod.)  (2.95/1.41) 14.31     (4.69/0.82) 14.93    (3.07/0.79) 13.10
                     center-frame          (2.81/1.45) 14.32     (4.56/0.85) 14.95    (2.89/0.80) 13.10
s1+s2        union   arc-cluster           (3.05/1.29) 13.54     (4.69/0.73) 14.01    (3.01/0.73) 12.54
                     state-cluster (mod.)  (3.47/1.20) 13.69     (5.18/0.69) 14.22    (3.45/0.66) 12.75
                     center-frame          (2.90/1.34) 13.54     (4.60/0.74) 13.96    (2.90/0.71) 12.43
             CNC     arc-cluster           (2.93/1.34) 13.56     (4.66/0.76) 13.99    (2.93/0.74) 12.50
                     state-cluster (mod.)  (3.03/1.32) 13.53     (4.72/0.72) 13.95    (3.09/0.75) 12.66
                     center-frame          (2.91/1.36) 13.55     (4.60/0.75) 13.95    (2.91/0.74) 12.49
s1+s2+s3     union   arc-cluster           (2.88/1.24) 13.13     (4.77/0.67) 13.73    (3.01/0.73) 12.30
                     state-cluster (mod.)  (3.38/1.14) 13.27     (5.19/0.65) 13.77    (3.34/0.64) 12.32
                     center-frame          (2.74/1.33) 13.15     (4.56/0.73) 13.65    (2.87/0.74) 12.19
             CNC     arc-cluster           (2.87/1.29) 13.17     (4.68/0.70) 13.70    (2.92/0.72) 12.21
                     state-cluster (mod.)  (2.93/1.26) 13.15     (4.71/0.67) 13.65    (3.03/0.70) 12.29
                     center-frame          (2.74/1.34) 13.16     (4.57/0.76) 13.74    (2.86/0.77) 12.14

The worst case complexity of the algorithm is O(T²). In the conducted experiments the center-frame algorithm takes between two and eight times longer than the pivot path based arc clustering algorithm, depending on the length and structure of the lattice. The produced CNs are the most compact of all three algorithms and the decoding usually yields the lowest deletion ratio. The error rates are competitive with the arc clustering approach, for some tasks even better. Furthermore, the algorithm is robust, showing good results under all test conditions including the experiments with union based system combinations, where it usually beats the other two construction algorithms.

4.4.5 Results

In this section results for the Bayes risk decoder with the CN distance as loss function are presented and discussed. Experiments have been performed for single lattices, for union based lattice combinations, and for CN combinations; see Section 3.2.3 and Section 3.4.1 for details about the combination techniques. The CN decoder follows Equation (3.28) and considers the complete hypothesis space. Results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. More results for all systems and all setups can be found in Appendix C.

For all experiments the acoustic and language model scales and the system weights in the union based lattice combination and in the CN combination approach are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7.

In the first set of experiments three CN construction algorithms are compared: the arc-cluster algorithm introduced in Section 4.4.2, the state-cluster algorithm from Section 4.4.3 in the modified version with back-splitting, and the center-frame algorithm from Section 4.4.4.



Table 4.9. CN decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

                                                            WER[%] (del/ins) err
System                 Comb.   CN alg.               eval06 (tuning set)   eval07
baseline LIMSI                 Viterbi               (1.64/1.38) 8.16      (1.74/1.23) 9.13
LIMSI                          arc-cluster           (1.65/1.33) 8.07      (1.76/1.18) 8.96
                               state-cluster (mod.)  (1.71/1.25) 8.04      (1.88/1.14) 8.94
                               center-frame          (1.64/1.33) 8.08      (1.75/1.18) 8.97
LIMSI+RWTH             union   arc-cluster           (1.63/0.77) 6.46      (2.17/0.71) 7.67
                               state-cluster (mod.)  (1.90/0.85) 6.95      (2.29/0.79) 8.13
                               center-frame          (1.50/0.77) 6.39      (1.92/0.73) 7.52
                       CNC     arc-cluster           (1.45/0.80) 6.38      (1.88/0.75) 7.51
                               state-cluster (mod.)  (1.49/0.78) 6.38      (1.96/0.75) 7.52
                               center-frame          (1.45/0.81) 6.41      (1.88/0.80) 7.58
LIMSI+RWTH+UKA         union   arc-cluster           (1.51/0.79) 6.38      (2.04/0.77) 7.63
                               state-cluster (mod.)  (1.98/0.73) 6.57      (2.63/0.69) 7.76
                               center-frame          (1.54/0.73) 6.30      (1.89/0.69) 7.32
                       CNC     arc-cluster           (1.47/0.72) 6.27      (1.87/0.68) 7.24
                               state-cluster (mod.)  (1.58/0.67) 6.25      (2.04/0.64) 7.28
                               center-frame          (1.36/0.74) 6.23      (1.77/0.76) 7.32
LIMSI+RWTH+UKA+IRST    union   arc-cluster           (1.61/0.73) 6.28      (2.19/0.67) 7.36
                               state-cluster (mod.)  (2.31/0.63) 6.61      (2.90/0.52) 7.58
                               center-frame          (1.61/0.71) 6.23      (2.00/0.61) 7.10
                       CNC     arc-cluster           (1.45/0.71) 6.14      (1.87/0.69) 7.12
                               state-cluster (mod.)  (1.54/0.65) 6.10      (2.04/0.57) 7.12
                               center-frame          (1.36/0.73) 6.11      (1.82/0.67) 7.16

eval06 was the tuning set and the official development set in the 2007 evaluation campaign.



Table 4.10. Comparison of the original and the modified state-cluster CN construction algorithm for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

                                                    CER[%] (del/ins) err
System       Comb.   CN alg.                dev07 (tuning set)    eval07               dev08
baseline s1          Viterbi                (2.63/1.59) 14.54     (4.42/0.91) 15.08    (2.80/0.87) 13.28
s1                   state-cluster (orig.)  (3.10/1.43) 14.45     (4.85/0.83) 15.02    (3.25/0.81) 13.30
                     state-cluster (mod.)   (2.95/1.41) 14.31     (4.69/0.82) 14.93    (3.07/0.79) 13.10
s1+s2+s3     union   state-cluster (orig.)  (3.70/1.53) 13.88     (5.36/1.10) 14.43    (3.72/1.11) 13.03
                     state-cluster (mod.)   (3.38/1.14) 13.27     (5.19/0.65) 13.77    (3.34/0.64) 12.32
             CNC     state-cluster (orig.)  (2.83/1.29) 13.14     (4.67/0.71) 13.73    (3.03/0.69) 12.30
                     state-cluster (mod.)   (2.93/1.26) 13.15     (4.71/0.67) 13.65    (3.03/0.70) 12.29

The results are summarized in Table 4.8 and Table 4.9. The single system experiments show almost no difference in error rate, but a slightly higher deletion ratio for the state-cluster algorithm. The confusion network combination experiments show a similar picture: the error rates are almost identical, and the state-cluster algorithm has the highest and the center-frame algorithm the lowest deletion ratio.

The union based lattice combinations are the most challenging tasks for the CN construction algorithms, because a single CN has to be constructed from several, sometimes diverse lattices. Particularly demanding is the cross-site combination, where the CN has to be built from lattices with different biases in the word boundaries. The results for the Chinese task show that the arc-cluster and the center-frame algorithm do well and the error rates do not differ from the CNC results. For the state-cluster algorithm the number of deletions increases heavily and raises the error rate compared to the CNC result.

On the English cross-site combination task the CNC approach shows a small advantage over the union based system combination. Presumably, the advantage comes from the independence of the CNC algorithm from time information. The time information is needed to build the system-dependent CNs, but not anymore in the CN combination itself. In the Chinese testing system all lattices are produced with the same decoder and thus all lattices have the same bias in their time stamps. In a cross-site system combination, on the other hand, the lattices are usually produced by different decoders and vary in their bias, cf. [Baghai-Ravary & Kochanski+ 2009]. This explains the different behavior of the Chinese system and the English cross-site combination. Similar to the Chinese results, the state-cluster algorithm is inferior to the arc-cluster and center-frame algorithms for the union based combination. Again, a heavily increased deletion ratio is observed. Among the union based experiments the center-frame algorithm shows a small advantage over the arc-cluster method; although small, the difference can be observed in almost all experimental setups, cf. Appendix C.

A direct comparison of the CNC results with the best frame error results, cf. Section 4.2.3, shows no significant difference in error rate. Compared with the CN decoding of the lattice union the frame error approximation shows a small advantage for the cross-site combination. The CN combination and decoding approaches show good generalization abilities: for all experimental setups the improvements on the tuning and on the testing sets are of similar magnitude.

The second set of experiments investigates the modification of the original state-cluster algorithm, the back-splitting. The results are summarized in Table 4.10 and Table 4.11. For the single lattice case and for CNC the back-splitting gives a small improvement, making the algorithm competitive with the arc-cluster and the center-frame algorithm. The performance of the original state-cluster algorithm on the lattice union is rather poor and allowing back-splits results in a large improvement. However, the deletion ratio remains high and the performance on the lattice union stays inferior.



Table 4.11. Comparison of the original and the modified state-cluster CN construction algorithm for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

                                                            WER[%] (del/ins) err
System                 Comb.  CN alg.                eval06 (tuning set)   eval07
baseline LIMSI                Viterbi                (1.64/1.38) 8.16      (1.74/1.23) 9.13
LIMSI                         state-cluster (orig.)  (1.71/1.35) 8.13      (1.82/1.22) 9.04
                              state-cluster (mod.)   (1.71/1.25) 8.04      (1.88/1.14) 8.94
LIMSI+RWTH+UKA+IRST    union  state-cluster (orig.)  (2.25/1.08) 7.39      (2.82/1.00) 8.30
                              state-cluster (mod.)   (2.31/0.63) 6.61      (2.90/0.52) 7.58
                       CNC    state-cluster (orig.)  (1.60/0.63) 6.15      (2.04/0.63) 7.14
                              state-cluster (mod.)   (1.54/0.65) 6.10      (2.04/0.57) 7.12

eval06 was the tuning set and the official development set in the 2007 evaluation campaign.

4.5 Summary

In this chapter three different approaches to the approximation of the Levenshtein distance have been investigated. The approximations belong to two classes of local cost functions for which efficient Bayes risk decoders exist. Local cost functions and efficient Bayes risk decoders for local costs of the first and second type were introduced in the previous chapter in Section 3.3.3. The cost based on a local alignment is an example of a local cost function of the first type; the frame error and the CN distance based costs are examples of cost functions of the second type.

The frame error counts the number of frames in which hypothesis and reference disagree in the word label. In practice, the frame error is normalized in order to get a more word-like error. The common normalization used in Bayes risk decoding considers only the hypothesis. An investigation of the hypothesis-side normalization shows that it ignores deletions and thus causes a heavy deletion bias in decoding. A new frame error based cost is introduced which averages between hypothesis- and reference-side normalization. The new cost is compared to the original cost function and to a third frame error based cost, which applies a symmetric normalization on arc level. The new cost performs best in all experiments, and in some experiments a considerable decrease in error rate is observed.

The class of cost functions based on a local alignment includes the cost approximation used in Povey's implementation of MPE/MWE training. In Bayes risk decoding Povey's MWE cost shows a deletion bias and a modified version is developed. The modified cost contains an additional term which explicitly penalizes deletions. In the computation of the cost function two reference arcs on the same path can be assigned to the same hypothesis arc. Povey's cost function is designed to compensate for this flaw by computing fractional error counts. The 1/2 overlap approximation is an alternative approach which allows two arcs to compete only if their overlap exceeds one half. The constraint guarantees that no two hypothesis arcs on the same path are assigned to the same reference word. From this approach two cost functions are derived, using continuous and discrete costs. The experimental results reveal the deletion bias of Povey's cost approximation. The modified criterion reduces the deletion bias and shows the best error rates of the four compared cost functions. The results for the 1/2 overlap approximation with continuous error counts are close to the modified criterion, whereas the version using discrete costs performs worse.

In the last section three confusion network (CN) construction algorithms are introduced. The arc-cluster and the state-cluster algorithm are based on common algorithms used in LVCSR lattice decoding. The center-frame algorithm is a new approach which does not rely on distances between arcs or arc clusters. The arc-cluster algorithm uses a set of pivot arcs to build an initial set of arc clusters, which are re-defined in an iterative manner until all arcs are clustered. The state-cluster algorithm performs a chronological traversal of the lattice, thereby clustering the states.


The state cluster information is used for building the ultimate arc clustering. The original state-clustering algorithm exhibits problems for some lattice structures. A modified version is developed which is able to compensate for this shortcoming. In the experimental tests the modified version performs better for all tested systems and conditions. The arc-cluster and the state-cluster algorithms are eventually based on building arc clusters by comparing arcs. The center-frame algorithm works differently: in each iteration a single time frame is selected which defines the center of the next CN slot. The heuristic aims at choosing the time frame such that a compact CN arises with a high arc overlap within the slots. In the experimental comparison of the three algorithms the performance is similar for single lattices and for confusion network combinations. For union based lattice combination the arc-cluster and center-frame algorithms are on the same level and outperform the state clustering approach.


Chapter 5 Confusion Networks: Applications and Investigations

Confusion networks (CNs) have been introduced in Chapter 3, Section 3.4.1, and have been further discussed in Chapter 4, Section 4.4. A CN defines a sequence of slots, where each slot represents a posterior probability distribution over words. The CN can be interpreted as the result of an alignment of word sequences: words in the same slot are aligned. Thus, the sequence of slots corresponds to the possible alignment positions. For each slot and word the CN provides the posterior probability of observing the word at the corresponding alignment position given the acoustic observations. In particular, the slot-wise posterior probabilities are derived from a given alignment, which makes them independent of the posterior distributions of the adjacent slots. This independence yields the simple decoding rule for CNs: for each slot select the word with the highest slot-wise posterior probability. In this chapter further applications of CNs are presented which make explicit use of this independence.

In the previous chapters CNs have been introduced on word and on Chinese character level. In this chapter CNs defined on frame level are used as well. For example, the time alignment introduced in Section 1.3 can be expressed as a CN: for each time frame the acoustic alignment provides a probability distribution over all HMM states; in the Viterbi case the probability is zero for all but one state. The corresponding CN has a slot for each time frame and the slot-wise distribution is defined over HMM states. In this chapter frame-wise CNs are used which provide per frame a distribution over all word labels. Thus, they can be interpreted as an acoustic alignment on word level instead of HMM state level. Figure 5.1 shows an example of a lattice and the derived word-wise and frame-wise CNs. In Section 5.1.1 a frame-wise entropy is computed from frame-wise defined CNs and used for a combination approach. Another application of frame-wise CNs is presented in Section 5.1.2, where word boundaries are derived from the frame level CNs. And in Section 5.2.1 the slot-wise posteriors of a word level CN are warped for optimal performance in a CN combination.

A CN derived from a lattice induces an alignment for each pair of paths in the lattice. The CN decoding result equals the Bayes risk decoding with the Levenshtein distance as loss function if the CN defines the Levenshtein alignment for all path pairs. In practice this is not the case for LVCSR tasks, but the true alignment is usually close to the CN alignment. This motivates the idea of using a windowed Levenshtein distance in the Bayes risk decoder, where the alignment is initialized by a CN alignment. In Section 5.2.2 the idea is explored in detail. It is shown that the resulting decoder with a window size of one equals the CN decoding rule and that for a sufficiently large window it becomes the Bayes risk decoder with the Levenshtein distance as loss function.

5.1 Frame Level Confusion Networks

A frame-wise CN (fCN) is defined on word labels and is completely described by the frame-wise word posterior probability distributions p_t(w|x_1^T), which define the slots in the fCN. The posterior distributions and thus the fCN are derived from a lattice according to Equation (3.11). In contrast to a word or arc alignment, the time alignment requires no explicit computation of the alignment: the alignment is implicitly given by the time stamps in the lattice. Thus, in contrast to a word-level CN, in the fCN the articulation of a word is usually spread over several slots.
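The following Python sketch shows how frame-wise word posteriors of this kind can be accumulated from arc posteriors; the arc representation and the simple accumulation over each arc's time span are illustrative assumptions standing in for the lattice-based computation of Equation (3.11).

    from collections import defaultdict

    # Sketch: accumulate frame-wise word posteriors p_t(w | x_1^T) from arc posteriors.
    # Each arc is assumed to be (label, beg, end, posterior); times are frame indices.

    def frame_wise_posteriors(arcs, num_frames):
        post = [defaultdict(float) for _ in range(num_frames)]
        for label, beg, end, p in arcs:
            for t in range(beg, end):
                post[t][label] += p
        return post

    arcs = [("hello", 0, 3, 0.6), ("eh", 0, 2, 0.4), ("hello", 2, 3, 0.4)]
    post = frame_wise_posteriors(arcs, 3)
    print(dict(post[2]))  # e.g. {'hello': 1.0}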

5.1.1 Minimum- and Inverse-Entropy Combination

The min.hyp-nFE decoding rule defined in the last chapter in Equation (4.3) is solely based on frame-wise word posterior distributions; no other lattice-based probabilities are required.



Figure 5.1. The figure shows a lattice in the first row. The second and the third row show the word-level and the frame-level CN derived from the lattice. In the word-level CN each slot assigns a single position to each word hypothesis. In the frame-wise CN each slot represents a single time frame and a word hypothesis is usually spread among several slots.

For the union based lattice combination the frame-wise word posteriors are computed according to Equation (3.17) as the weighted average of the system-dependent frame-wise word posteriors. In [Misra & Bourlard+ 2003; Valente 2009] the authors propose alternative ways to combine frame-wise posteriors based on the frame-wise computed, system-dependent entropy. In their work neural network based frame-wise phoneme posterior probabilities are derived from several feature streams. System-dependent entropy values are computed from the posteriors and used for merging the phoneme posteriors into a new acoustic front-end. In this work the combination method is applied to the frame-wise word posteriors derived from the system-dependent lattices.

The basic idea of entropy based combination as proposed in [Misra & Bourlard+ 2003] is that the system with the lowest entropy is the most reliable system. From this main idea the authors derive two approaches: for each frame make a hard or a soft decision for one of the systems based on the system-dependent entropy. In the first approach, at each time frame simply the posterior distribution of the system with the lowest entropy is chosen. The resulting combination rule is called the "minimum entropy" weighting scheme and is defined as follows, where the entropy of the posterior distribution p_{j,t}(·|x_1^T) is denoted by H_{j,t}(x_1^T):

    p_t(w|x_1^T) := \sum_{j=1}^{J} δ\big( H_{j,t}(x_1^T), \min_k H_{k,t}(x_1^T) \big)\, p_{j,t}(w|x_1^T)                (5.1)

In the "inverse-entropy" weighting scheme the system-dependent posteriors are weighted according to the inverse of the system-dependent entropy values:

    p_t(w|x_1^T) := Z^{-1} \sum_{j=1}^{J} p(j)\, H_{j,t}^{-1}(x_1^T)\, p_{j,t}(w|x_1^T),      Z := \sum_{j=1}^{J} p(j)\, H_{j,t}^{-1}(x_1^T)                (5.2)

The "inverse-entropy" weighting can be interpreted as a smoothed version of the "minimum entropy" approach: the closer a system's entropy is to zero, the more it dominates the competitors. Results with the entropy based weighting schemes are presented and discussed in Section 5.1.3.
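A minimal Python sketch of the two weighting schemes of Equations (5.1) and (5.2) is given below; the per-system frame-wise distributions are assumed to be plain dictionaries, and uniform system priors p(j) are assumed for the inverse-entropy rule.

    import math

    # Sketch: minimum-entropy and inverse-entropy combination of frame-wise posteriors.
    # dists is a list of dictionaries, one per system, mapping word -> p_{j,t}(w | x_1^T).

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

    def minimum_entropy_combination(dists):
        # hard decision: take the distribution of the system with the lowest entropy
        return min(dists, key=entropy)

    def inverse_entropy_combination(dists, eps=1e-10):
        # soft decision: weight each system by the inverse of its entropy (uniform p(j))
        weights = [1.0 / (entropy(d) + eps) for d in dists]
        z = sum(weights)
        words = set().union(*dists)
        return {w: sum(wgt * d.get(w, 0.0) for wgt, d in zip(weights, dists)) / z
                for w in words}

    dists = [{"hello": 0.9, "eh": 0.1}, {"hello": 0.5, "eh": 0.5}]
    print(minimum_entropy_combination(dists))
    print(inverse_entropy_combination(dists))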



5.1.2 Time Alignment with Frame Level CNs

Some lattice combination and decoding approaches, like the lattice intersection or the MAP decoding rule, erase the time stamps and invalidate the word boundaries. In theory, time stamps are not necessary for computing and optimizing the Levenshtein distance based error rate. In practice, however, they are needed for applying the popular NIST scoring tools or for post-processing steps applied to the decoding result, for example in the preparation for a subsequent translation step [Matusov & Mauser+ 2006].

A general way to produce new word boundaries is to perform a time alignment of the decoding result with an appropriate acoustic model. The drawback is that the alignment is expensive compared to the lattice decoding and that acoustic models are required. Especially in the cross-site system combination case an appropriate acoustic model is not always available, or it has an out-of-vocabulary (OOV) problem, i.e. the pronunciation lexicon at hand does not contain pronunciations for all words in the lattices. An alternative approach is to modify the lattice processing tools such that they compute approximate word boundaries. This approach is usually fast, but the drawbacks are that only approximate time stamps are computed and that generic algorithms, like the lattice determinization, have to be modified, i.e. no generic WFST toolkits can be used anymore.

A third approach, similar to the acoustic time alignment, is presented in this section. The idea is to use the frame-wise word posterior distributions for computing the word boundaries. Given the frame-wise word posteriors p_t(w|x_1^T) computed from lattice L and given the decoding result w_1^N computed from the same lattice, the alignment problem is given by

    (w_1^N, x_1^T) → \hat{t}_1^N := \argmax_{t_1^N} \prod_{n=1}^{N} \prod_{τ = t_{n-1}+1}^{t_n} p_τ(w_n | x_1^T).                (5.3)

The ending time of word w_n is denoted by t_n, that is, the boundaries of w_n are [t_{n-1}+1, t_n]. The alignment can be efficiently computed using a dynamic programming approach, which can be derived from the recursive formulation of the problem

    h(t, n; w_1^N, x_1^T) := p_t(w_n | x_1^T) \max\big\{ h(t-1, n-1; w_1^N, x_1^T),\ h(t-1, n; w_1^N, x_1^T) \big\},

where h(0, 0; w_1^N, x_1^T) := 1. Computing h(T, N) and tracing the changes in the word index yields the desired word boundaries.

The algorithm is also suitable for system combination approaches which are not based on a single lattice, like the CN combination (CNC). The frame-wise word posteriors are computed according to Equation (3.17) as the weighted average of the system-dependent posteriors, or equivalently directly from the modified lattice union as defined in Section 3.2.3. The choice of the union for computing the frame-wise posteriors guarantees that no OOV problem occurs during the alignment. The algorithm is used in [Hoffmeister & Hillard+ 2007] for computing word boundaries for the output of a CNC decoder, and throughout this work to compute word boundaries for lattice intersection and MAP decoding results.
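A compact Python sketch of this dynamic program is shown below; it works in log-space and uses a small probability floor for unseen words, which are implementation choices of this sketch rather than part of the original description.

    import math

    # Sketch: word boundary estimation from frame-wise posteriors, cf. Equation (5.3).
    # frame_post[t][w] is p_t(w | x_1^T) for frames t = 0..T-1; words is the result w_1^N.

    def align_word_boundaries(words, frame_post):
        T, N = len(frame_post), len(words)
        NEG = float("-inf")
        h = [[NEG] * (N + 1) for _ in range(T + 1)]
        back = [[0] * (N + 1) for _ in range(T + 1)]
        h[0][0] = 0.0
        for t in range(1, T + 1):
            for n in range(1, N + 1):
                p = frame_post[t - 1].get(words[n - 1], 1e-20)
                stay, advance = h[t - 1][n], h[t - 1][n - 1]
                best = max(stay, advance)
                if best == NEG:
                    continue
                h[t][n] = math.log(p) + best
                back[t][n] = n if stay >= advance else n - 1
        # trace the changes in the word index to recover the word end times
        ends, n = [T], N
        for t in range(T, 0, -1):
            if back[t][n] == n - 1:
                n -= 1
                if n >= 1:
                    ends.append(t - 1)
        return list(reversed(ends))

    frame_post = [{"eh": 0.9}, {"eh": 0.6, "hello": 0.4}, {"hello": 0.8}, {"hello": 0.7}]
    print(align_word_boundaries(["eh", "hello"], frame_post))  # -> [2, 4]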

5.1.3 Results

In this section experimental results for the entropy-based combination of frame-wise word posterior probabilities in the minimum frame error framework are given and discussed. The corresponding minimum frame error decoder using the standard combination approach is defined in Section 4.2.1. Experiments are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems is given in Appendix B. For all experiments the acoustic and language model scales, the system weights in the union based combination approach, and the smoothing parameter α in the minimum frame error decoder are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7.

The results are summarized in Table 5.1 and Table 5.2. Especially for the English cross-site combination, the inverse-entropy combination performs better than the minimum entropy approach. However, both entropy-based combination rules are inferior to the standard method of the weighted average.



Table 5.1. Entropy-based combination results for the Chinese 230h testing system, cf. Section B.1.1. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

                                          CER[%] (del/ins) err
System      Frame Comb.     dev07 (tuning set)    eval07               dev08
baseline    Viterbi         (2.63/1.59) 14.54     (4.42/0.91) 15.08    (2.80/0.87) 13.28
s1+s2       average         (3.07/1.30) 13.57     (4.69/0.68) 13.95    (3.05/0.70) 12.54
            min. entropy    (2.78/1.48) 13.65     (4.52/0.76) 13.95    (2.78/0.81) 12.38
            inv. entropy    (3.12/1.28) 13.61     (4.79/0.71) 14.01    (3.12/0.69) 12.55
s1+s2+s3    average         (3.06/1.23) 13.18     (4.72/0.69) 13.71    (3.01/0.72) 12.22
            min. entropy    (2.84/1.40) 13.37     (4.65/0.77) 13.85    (2.99/0.79) 12.18
            inv. entropy    (3.08/1.23) 13.20     (4.82/0.70) 13.82    (3.09/0.70) 12.10

Table 5.2. Entropy-based combination results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

                                               WER[%] (del/ins) err
System                  Frame Comb.     eval06 (tuning set)   eval07
baseline LIMSI          Viterbi         (1.64/1.38) 8.16      (1.74/1.23) 9.13
LIMSI+RWTH              average         (1.60/0.85) 6.65      (1.99/0.76) 7.73
                        min. entropy    (1.65/0.97) 6.84      (1.91/0.88) 7.80
                        inv. entropy    (1.50/0.97) 6.64      (1.74/0.90) 7.61
LIMSI+RWTH+UKA          average         (1.80/0.72) 6.48      (2.21/0.68) 7.52
                        min. entropy    (1.70/1.04) 6.84      (1.84/1.01) 7.84
                        inv. entropy    (1.67/0.97) 6.64      (1.81/0.88) 7.43
LIMSI+RWTH+UKA+IRST     average         (1.70/0.79) 6.52      (1.93/0.76) 7.26
                        min. entropy    (2.13/1.01) 7.88      (2.27/0.95) 8.50
                        inv. entropy    (1.88/0.95) 7.29      (2.00/0.93) 8.02

eval06 was the tuning set and the official development set in the 2007 evaluation campaign.

In their original work [Misra & Bourlard+ 2003; Valente 2009] the authors also find that the inverse entropy method is superior to the minimum entropy method. With both methods they improved over the simple average. However, in their experiments the biggest gains were observed for noisy data, whereas for clean speech almost no improvement was seen. The tasks considered in this work use clean speech only, which is presumably the reason why the combination does not benefit from the entropy-based approaches.

5.2 Word Level Confusion Networks

A word level CN is completely described by the slot-wise word posterior probability distributions denoted by p_s(w|x_1^T). In contrast to the frame-wise CN, in the word-wise CN the articulation of a word is never distributed over several slots: each slot represents a complete word. Alternatively, a word level CN is described by a word lattice L and a slot function σ : E(L) → N as introduced in Section 3.4. The posterior probability p_s(w|x_1^T) is computed from the lattice according to Equation (3.12), where the slot function assigns each lattice arc to a single CN slot. The slot function fulfills the constraint that σ(a) < σ(b) for any two consecutive lattice arcs a and b. The constraint guarantees that p_s(·|x_1^T) is a probability distribution, cf. Section 3.4, and that the Levenshtein distance is a lower bound for the CN distance, cf. Section 4.4.
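To illustrate, the following Python sketch accumulates slot-wise word posteriors from arc posteriors via a given slot function and applies the simple slot-wise CN decoding rule; the plain data structures are assumptions of this sketch, not the actual lattice implementation.

    from collections import defaultdict

    # Sketch: slot-wise posteriors p_s(w | x_1^T) from arc posteriors and a slot function,
    # followed by the slot-wise argmax CN decoding rule.
    # Each arc is assumed to be (label, posterior); slot_of maps an arc index to its slot.

    def slot_posteriors(arcs, slot_of, num_slots):
        post = [defaultdict(float) for _ in range(num_slots)]
        for idx, (label, p) in enumerate(arcs):
            post[slot_of[idx]][label] += p
        for dist in post:
            # remaining probability mass is assigned to the empty word ""
            dist[""] += max(0.0, 1.0 - sum(dist.values()))
        return post

    def cn_decode(post):
        hyp = [max(dist, key=dist.get) for dist in post]
        return [w for w in hyp if w != ""]

    arcs = [("eh", 0.3), ("hello", 0.7), ("hello", 0.3)]
    print(cn_decode(slot_posteriors(arcs, {0: 0, 1: 1, 2: 1}, 2)))  # -> ['hello']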

5.2.1 Confidence Warping

The confidence score for a word in the decoding output is a measure of how certain the decoder is about the hypothesized word. Thus, the confidence score can be interpreted as an estimate of the probability that the hypothesized word is correct [Wessel 2002]. The common confidence scores in LVCSR are based on fCNs [Wessel & Schlüter+ 2001b] or on CNs [Evermann & Woodland 2000; Mangu & Brill+ 2000]. In the simplest approach the slot-wise word posteriors derived from the CN are used directly as confidence scores. However, posterior probabilities derived directly from lattices are usually biased due to model assumptions, beam pruning in the search, and subsequent lattice pruning. If all systems in a system combination show the same bias, then the bias presumably does not affect the decoding result. But if completely different systems are combined, e.g. lattices contributed by different sites, the posteriors might be biased differently and the system-dependent bias affects the decoding result.

Focusing on the slot-wise word posteriors derived from a CN, the bias can be measured by interpreting the posteriors as confidence scores. The normalized cross-entropy (NCE) or other confidence measures show how close the lattice-based posterior estimates are to the true posteriors [Hillard & Ostendorf 2006; Wessel 2002]. The bias of confidence scores derived from slot-wise word posteriors and an algorithm to compensate for the bias are discussed, for example, in [Hillard & Ostendorf 2006]. Here, the idea is to improve the CNC based system combination, cf. Section 3.4.1, by introducing system-dependent warping functions which compensate for the bias in the word posterior probability distributions of the system-dependent CNs.

The bias of slot-wise word posteriors derived from LVCSR lattices is almost always characterized by overestimated large probabilities and underestimated small probabilities, or vice versa. As a consequence, a simple, word- and slot-independent warping function is sufficient to considerably improve the CN based confidence scores. The warping function used in this work is defined in Equation (5.4), where $j$ denotes the system and $b_j$ and $\gamma_j$ are the two system-dependent parameters.

$$
h_j(x) :=
\begin{cases}
1 - (1-b_j)\left(\dfrac{1-x}{1-b_j}\right)^{\gamma_j} & \text{if } x > b_j \\[2mm]
b_j\left(\dfrac{x}{b_j}\right)^{\gamma_j} & \text{otherwise}
\end{cases}
\qquad
p'_{i,s}(w|x_1^T) := \frac{h_j\!\big(p_{i,s}(w|x_1^T)\big)}{\displaystyle\sum_{v} h_j\!\big(p_{i,s}(v|x_1^T)\big)}
\tag{5.4}
$$
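A minimal sketch of the warping of Equation (5.4); the function names are illustrative, and b = 0.3, γ = 0.4 in the usage example correspond to the setting shown in Figure 5.4.

```python
import numpy as np

def warp(x, b, gamma):
    """Piecewise power warping h_j(x) of Equation (5.4) with breakpoint b and exponent gamma."""
    x = np.asarray(x, dtype=float)
    upper = 1.0 - (1.0 - b) * ((1.0 - x) / (1.0 - b)) ** gamma   # branch for x > b
    lower = b * (x / b) ** gamma                                  # branch for x <= b
    return np.where(x > b, upper, lower)

def warp_slot_posteriors(slot_dist, b, gamma):
    """Apply the warping to one slot-wise distribution and renormalize (Equation (5.4))."""
    words = list(slot_dist.keys())
    h = warp(np.array([slot_dist[w] for w in words]), b, gamma)
    h = h / h.sum()
    return dict(zip(words, h))

# Example: the warping shown in Figure 5.4 uses b = 0.3 and gamma = 0.4.
warped = warp_slot_posteriors({"a": 0.7, "b": 0.2, "<eps>": 0.1}, b=0.3, gamma=0.4)
```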

Figure 5.4 in the results section, cf. Section 5.2.3, shows the warping function for b = 0.3 and γ = 0.4 and the result of its application to the slot-wise word posteriors derived from a CN. In the right plot the true confidence scores computed on a tuning set are drawn against the confidence estimates. The warped confidence estimates are already very close to the true scores.


In the application for CNC the two parameters $\gamma_j$ and $b_j$ of the system-dependent warping function $h_j(\cdot)$ are first optimized separately for each system for maximum NCE. The expectation is that this brings the slot-wise word posterior estimates close to the true posteriors and makes the probabilities comparable among systems. Eventually, the $J$ system-dependent $\gamma$s are included in the overall parameter optimization process and tuned directly for minimum error rate of the CNC decoding. CNC results with system-dependently warped slot-wise word posterior probabilities are presented and discussed in Section 5.2.3.
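The NCE criterion used for this initialization is only referenced above; the following sketch follows the standard normalized cross-entropy definition used for confidence evaluation and is meant purely as an illustration of the tuning criterion, not as the evaluation code used in this work.

```python
import numpy as np

def normalized_cross_entropy(confidences, correct, eps=1e-12):
    """Normalized cross entropy (NCE) of word confidence scores.

    confidences: array of confidence scores in (0, 1), one per hypothesized word.
    correct:     boolean array, True where the hypothesized word is correct.
    """
    c = np.clip(np.asarray(confidences, dtype=float), eps, 1.0 - eps)
    correct = np.asarray(correct, dtype=bool)

    p_c = correct.mean()                       # empirical probability of a correct word
    h_max = -correct.size * (p_c * np.log2(p_c) + (1.0 - p_c) * np.log2(1.0 - p_c))
    h_conf = -(np.log2(c[correct]).sum() + np.log2(1.0 - c[~correct]).sum())
    return (h_max - h_conf) / h_max            # 1.0 = perfect confidences, 0.0 = no gain
```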

5.2.2 The windowed Levenshtein Distance: from the CN Distance to the exact Levenshtein Distance

In this section the connection between CN decoding and Bayes risk decoding with the Levenshtein distance as loss function is developed. The idea is to relax the alignment defined by the CN until the Levenshtein alignment is computed. In this work a CN is derived from a lattice via a slot function which assigns a slot number to each arc in the lattice, cf. Section 3.4. In the CN distance the alignment between any two paths through the lattice is defined by the slot function: two arcs taken from the two paths compete with each other if they have the same slot number. For the computation of the Levenshtein distance each arc in the first path would have to be allowed to compete with each arc in the second path, obeying the monotonicity constraint of the Levenshtein alignment. Between these two extremes lies the windowed Levenshtein distance initialized with the CN alignment. For a window of size 2d + 1, the arc from the first path with slot number n can compete with one of the arcs from the second path with a slot number in [n − d, n + d]. For d = 0 the result is the original CN distance, and for sufficiently large d the exact Levenshtein distance is obtained.

The idea of applying the windowed Levenshtein distance initialized with a CN alignment is motivated by experimental results. In preliminary experiments the alignment between the Viterbi hypothesis and the reference was derived from a common CN construction algorithm. It turned out that almost always a symmetric window of size three or five was sufficient to find the exact Levenshtein alignment. The example given in Figure 5.2 shows a common mistake in the alignment produced by a heuristic CN construction algorithm. The two "b" arcs in the lattice do not overlap in time and thus they are not clustered into the same slot. As a result, the alignments defined by the CN differ from the Levenshtein alignments and the outcome of the Bayes risk decoder with the CN distance as loss function differs from the result of the Bayes risk decoder with the Levenshtein distance. A windowed Levenshtein distance with a window size of three would be sufficient to obtain the correct Levenshtein alignments.

In the following the general windowed Levenshtein distance decoder with an arbitrary window size and an initial CN alignment is developed within the Bayes risk decoding framework. Afterwards it is shown that the result for a window of size one equals the CN decoding rule given in Equation (3.28) and that for a sufficiently large window the Bayes risk decoder with the Levenshtein distance as loss function is obtained. For a window of size one the decoding is a local decision which is made independently for each CN slot, i.e. the classic CN decoding. For a window larger than one the locality of the decision is no longer given and decoding becomes a non-trivial problem. The decoding of a slot now depends on the decisions made for the neighboring slots. Furthermore, the set of possible hypotheses increases beyond the hypotheses defined by the CN. In the following, dynamic programming equations are derived which efficiently compute an approximation of the Bayes risk decoding rule with the windowed Levenshtein distance as loss function. For a window of size one and for sufficiently large windows, i.e. for the CN distance and for the exact Levenshtein distance, the equations produce the correct Bayes risks. The rest of the section is organized as follows.
A recursive definition of the Levenshtein distance is introduced from which the windowed Levenshtein distance will be derived. At the same time the hypothesis space is constructed such that the inclusion of the Bayes risk hypothesis is guaranteed, where the size of the resulting hypothesis space is a function of the initial CN and the window size. The computation of the Bayes risk consists of the computation of an outer loop going over the hypothesis space, an inner loop going over the summation space, and the computation of the loss function and the sentence posterior probability. It will be shown that loss and sentence probability can be computed without approximation, whereas the two loops require approximations in order to enable an efficient computation. Finally, the dynamic programming equations are derived, followed by an exact analysis of the run-time and memory requirements.



[Figure 5.2 (graphic): the example lattice contains the paths "a b c" over the first b arc (p = 0.4), "a b c" over the second b arc (p = 0.3), and "a c" (p = 0.3); the Bayes risk hypothesis for window size one aligns "a ε ε c" with risk 0.7, the Bayes risk hypothesis for window size three aligns "a b ε c" with risk 0.3.]

Figure 5.2. Example for a typical error made by the common CN construction algorithms and the correction of the error by using a windowed Levenshtein distance, where the window is centered around the CN alignment. The example lattice consists of three paths which are listed to the right of the lattice together with their path probabilities. The arc labels in the lattice are composed of the word, the CN slot to which the arc is assigned, and the arc probability. The resulting CN is drawn below the lattice. To the right of the CN an example for the possible alignment position of arc “b:1” within a windowed Levenshtein alignment is given: a) shows the only possible alignment position for a window of size one, b) shows the possible alignment positions for a symmetric window of size three. The lower part of the figure shows the alignments for the Bayes risk hypotheses for different window sizes with the windowed Levenshtein distance as cost function. Alignment a) is the outcome for a window of size one, which is equivalent to the standard CN decoding. Alignment b) uses a symmetric window of size three. The larger window allows the alignment of “b:1” and “b:2” which compensates for the flaw in the CN construction, where the two arcs were assigned to different slots. The Bayes risk hypothesis for a window of size three is “a b c”, which is also the minimum WER hypothesis for the example lattice.


The Levenshtein distance. For the following considerations a recursive definition of the Levenshtein distance is required. The recursion is defined via an auxiliary cost function, i.e. it holds $\mathrm{Lev}(v_1^M, w_1^N) = C(M, N; v_1^M, w_1^N)$. The recursion is given by

$$
\begin{aligned}
C(m, n; v_1^M, w_1^N) :={}& \min\big\{\, d(v_m, w_n) + C(m-1, n-1; v_1^M, w_1^N),\\
& \phantom{\min\big\{\,} d(\varepsilon, w_n) + C(m, n-1; v_1^M, w_1^N),\\
& \phantom{\min\big\{\,} d(v_m, \varepsilon) + C(m-1, n; v_1^M, w_1^N) \,\big\}\\
={}& \min_{j \in [1,\, n+1]} \big\{ \mathrm{Lev}(v_m, w_j^n) + C(m-1, j-1; v_1^M, w_1^N) \big\}.
\end{aligned}
$$

The equation describes a computation of the Levenshtein distance which is position-synchronous in $v_1^M$, where the computation of a cost at position $m$ depends only on costs at the previous position $m-1$. For the Levenshtein distance the local cost $d(v, w)$ is defined as

$$
d(v, w) := \begin{cases} 0 & \text{if } v = w \\ 1 & \text{otherwise.} \end{cases}
$$
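The recursion can be read directly as a dynamic program. The following Python sketch is illustrative only (not the decoder implementation of this work); the explicit ε symbol and the 0/1 local cost are the assumptions stated above.

```python
EPS = "<eps>"   # explicit empty-word symbol (assumption made for this sketch)

def local_cost(v, w):
    """0/1 local cost d(v, w); identical symbols (including epsilon) cost nothing."""
    return 0 if v == w else 1

def levenshtein(v, w):
    """Levenshtein distance Lev(v, w) via the auxiliary cost C(m, n), position-synchronous in v."""
    M, N = len(v), len(w)
    C = [[0] * (N + 1) for _ in range(M + 1)]     # C[m][n]: cost of aligning v[:m] against w[:n]
    for n in range(1, N + 1):
        C[0][n] = C[0][n - 1] + local_cost(EPS, w[n - 1])      # deletions only
    for m in range(1, M + 1):
        C[m][0] = C[m - 1][0] + local_cost(v[m - 1], EPS)      # insertions only
        for n in range(1, N + 1):
            C[m][n] = min(
                local_cost(v[m - 1], w[n - 1]) + C[m - 1][n - 1],  # substitution / match
                local_cost(EPS, w[n - 1]) + C[m][n - 1],           # deletion
                local_cost(v[m - 1], EPS) + C[m - 1][n],           # insertion
            )
    return C[M][N]

assert levenshtein("abcde", "abcdf") == 1
```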

In general, any other local cost can be substituted.

The partial risk. The so-called partial risk of $w_1^n$, $n \le N$, given the acoustic observation sequence $x_1^T$ and given the word sequence $v_1^m$, $m \le M$, is defined as the Levenshtein distance weighted by the posterior probability of the complete hypothesis $w_1^N$:

$$ R(m, n; v_1^M, w_1^N) := p(w_1^N|x_1^T)\, C(m, n; v_1^M, w_1^N) $$

The partial risk depends on the observed feature sequence $x_1^T$, but for the sake of clarity the dependency is discarded from the notation. The Bayes risk decoding rule with the Levenshtein distance as loss function can be re-written in terms of the partial risk:

$$
x_1^T \rightarrow g(x_1^T) := \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} p(w_1^N|x_1^T)\, \mathrm{Lev}(v_1^M, w_1^N)
= \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} R(M, N; v_1^M, w_1^N)
$$

The summation and hypothesis space. The further steps require that all sentences in the hypothesis and summation space have equal length $S$. The summation space $S$ can be restricted to word sequences $w_1^N$ with $p(w_1^N|x_1^T) > 0$; obviously, a word sequence with a probability of zero does not contribute to the summation. By inserting the empty word $\varepsilon$ all word sequences in $S$ can be expanded to equal length $S'$, which yields the aligned summation space $S_{S'}$. The positions for inserting the $\varepsilon$s are given by the initial CN alignment and $S'$ equals the number of slots in the CN. Before continuing with the definitions of the aligned summation and hypothesis spaces, some properties of the hypothesis space are investigated which motivate the next steps. The hypothesis space is in general larger than the aligned summation space, as illustrated in the following example:

$$
\begin{array}{ll}
w_1^N & p(w_1^N|x_1^T) \\
\text{abcdf} & 0.\bar{3} \\
\text{bcde} & 0.\bar{3} \\
\text{acde} & 0.\bar{3} \\
g(x_1^T) = \text{abcde} & \text{err} = 1
\end{array}
$$

The example shows that the Bayes risk hypothesis "abcde" is not contained in the summation space $S = \{$"abcdf", "bcde", "acde"$\}$. Furthermore, the hypothesis space can contain word sequences which are longer than the sequences in the aligned summation space. The next example shows such a case:


$$
\begin{array}{ll}
w_1^N & p(w_1^N|x_1^T) \\
\text{abcd} & 0.\bar{3} \\
\text{bcde} & 0.\bar{3} \\
\text{acde} & 0.\bar{3} \\
g(x_1^T) = \text{abcde} & \text{err} = 1
\end{array}
$$

Again, the Bayes risk hypothesis "abcde" is not contained in the summation space. Keep in mind that the goal is to define a hypothesis and a summation space in which all sequences have equal length $S$. Let $\hat{M}$ be the length of the shortest Bayes risk hypothesis. It is easy to see that for the Levenshtein distance as loss function $\hat{M} < 2S'$ holds: the maximum Levenshtein distance between two sequences is the number of words in the longer sequence, that is, an alignment with more insertions and deletions than the number of words in the longer sequence cannot be the Levenshtein alignment. $S$ is set to $2S'$ (or $2S'-1$, if $S'$ is odd) and a new aligned summation space $S_S$ is constructed by adding $\lfloor S'/2 \rfloor \times \varepsilon$ as prefix and as suffix to every sequence in the old aligned summation space $S_{S'}$.

Let us use the first example to produce the required quantities step-by-step. The summation space is given by $S := \{$"abcdf", "bcde", "acde"$\}$. By inserting $\varepsilon$s at the appropriate positions, e.g. given by a CN, an aligned summation space with $S' = 5$ is derived: $S_5 = \{$"abcdf", "εbcde", "aεcde"$\}$. The Bayes risk hypothesis is "abcde", which has length 5, and thus $\hat{M} = 5$. For the final summation space $\lfloor S'/2 \rfloor$ $\varepsilon$s are attached to the beginning and the end of each sentence in $S_5$. The result is $S_9 = \{$"εεabcdfεε", "εεεbcdeεε", "εεaεcdeεε"$\}$ and thus $S = 9$.

The next equations give the formal definitions of the summation space and of the set of all words in the summation space at position $n$, denoted by $S_S^{(n)}$, where $\Sigma$ denotes the vocabulary:

$$
S_S := \big\{ w_1^S : p(w_1^S|x_1^T) > 0 \big\} \subset \big(\Sigma \cup \{\varepsilon\}\big)^S,
\qquad
S_S^{(n)} := \big\{ w_n : w_1^S \in S_S \big\}
$$

The hypothesis space corresponding to the summation space $S_S$ is defined with the help of $S_S^{(n)}$ as

$$
H_S := \Big( \bigcup_{i=1}^{S} S_S^{(i)} \Big)^S .
$$
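A small sketch of the construction just described, assuming the ε positions of the initial alignment are already given by a CN; the names and the data layout are illustrative.

```python
EPS = "<eps>"

def pad_aligned_space(aligned_seqs):
    """Build the length-S aligned summation space S_S from CN-aligned sequences of length S'.

    aligned_seqs: list of equal-length word lists (EPS allowed), i.e. the aligned
                  summation space S_{S'} given by the initial CN alignment.
    Returns (padded sequences, per-position word sets S_S^(n), hypothesis word set).
    """
    s_prime = len(aligned_seqs[0])
    pad = s_prime // 2                                   # floor(S'/2) epsilons at both ends
    padded = [[EPS] * pad + list(seq) + [EPS] * pad for seq in aligned_seqs]

    S = len(padded[0])
    position_sets = [{seq[n] for seq in padded} for n in range(S)]   # S_S^(n)
    hypothesis_words = set().union(*position_sets)                   # union over all positions
    return padded, position_sets, hypothesis_words

# Example from the text (epsilon placement as given by a CN alignment):
aligned = [list("abcdf"), [EPS, "b", "c", "d", "e"], ["a", EPS, "c", "d", "e"]]
padded, pos_sets, hyp_words = pad_aligned_space(aligned)
```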

That is, at each position every word can occur which is contained anywhere in the summation space. It is easy to see that this hypothesis space contains all possible outcomes of the Bayes risk decoding rule with the CN distance or the (windowed) Levenshtein distance as loss function. It is worthwhile to mention that using $S_S^{(n)}$ as hypothesis space at position $n$, as in the CN decoding rule, is in general not sufficient, as shown in the following example:

$$
\begin{array}{ll}
w_1^N & p(w_1^N|x_1^T) \\
\text{abcdf} & 0.\bar{3} \\
\text{bcde} & 0.\bar{3} \\
\text{acde} & 0.\bar{3} \\
g(x_1^T) = \text{abcde} & \text{err} = 1
\end{array}
$$

In summary, there always exists an $S$ such that the constructed hypothesis and summation space fulfill

$$
x_1^T \rightarrow g(x_1^T) := \operatorname*{argmin}_{v_1^M} \sum_{w_1^N} p(w_1^N|x_1^T)\,\mathrm{Lev}(v_1^M, w_1^N)
= \operatorname*{argmin}_{v_1^S \in H_S} \sum_{w_1^S \in S_S} p(w_1^S|x_1^T)\,\mathrm{Lev}(v_1^S, w_1^S).
$$

In the remainder it is assumed that word sequences are taken from the aligned hypothesis and the aligned summation space of length $S$, i.e. all word sequences are assumed to have equal length $S$, where the empty word $\varepsilon$ can occur at any position in the word sequence.


The windowed Levenshtein distance and the windowed risk. For an initial alignment of two word sequences $v_1^S$ and $w_1^S$ the window is defined as the maximum deviation $d$, $d \ge 0$, from the initial alignment, i.e. $v_n$ can be aligned to $w_{n-d}, \ldots, w_n, \ldots, w_{n+d}$. The resulting windowed cost is given by

$$ C_d(m, n; v_1^S, w_1^S) := \min_{j \in [m-d,\, n+1]} \big\{ \mathrm{Lev}(v_m, w_j^n) + C_d(m-1, j-1; v_1^S, w_1^S) \big\}. $$

The windowed cost is only defined for $m - d \le n \le m + d$. It is more convenient to define $n$ in terms of the deviation $i$ from $m$, i.e. $n = m + i$ with $-d \le i \le d$:

$$ C_{d,i}(m; v_1^S, w_1^S) := \min_{j \in [-d,\, i+1]} \big\{ \mathrm{Lev}(v_m, w_{m+j}^{m+i}) + C_{d,j}(m-1; v_1^S, w_1^S) \big\} $$

The notation can be interpreted as having a cost vector of fixed length $2d+1$ at each position $m$. The definitions of the windowed Levenshtein distance and the windowed risk are now straightforward,

$$ \mathrm{Lev}_d(v_1^S, w_1^S) := C_{d,0}(S; v_1^S, w_1^S), \qquad R_{d,i}(S; v_1^S, w_1^S) := p(w_1^S|x_1^T)\, C_{d,i}(S; v_1^S, w_1^S), $$

and the following inequalities are a direct consequence of the fact that the Levenshtein distance is a lower bound for the windowed Levenshtein distance:

$$
\begin{aligned}
\mathrm{Lev}(v_1^S, w_1^S) &= \mathrm{Lev}_S(v_1^S, w_1^S) \le \cdots \le \mathrm{Lev}_{d+1}(v_1^S, w_1^S) \le \mathrm{Lev}_d(v_1^S, w_1^S) \\
R(S, S; v_1^S, w_1^S) &= R_{S,0}(S; v_1^S, w_1^S) \le \cdots \le R_{d+1,0}(S; v_1^S, w_1^S) \le R_{d,0}(S; v_1^S, w_1^S)
\end{aligned}
$$
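For concreteness, the windowed cost recursion can be sketched for two equal-length, ε-aligned sequences. This is an illustrative reference implementation of $\mathrm{Lev}_d$, not the decoder of this section; the ε symbol, the helper names, and the simple base case are assumptions made for the example.

```python
import math

EPS = "<eps>"

def lev_word_vs_span(v_word, span):
    """Lev(v_m, w_j^n): align a single word against a (possibly empty) span of words."""
    real = [w for w in span if w != EPS]            # epsilons in the span are free
    if v_word == EPS:
        return len(real)                            # all real words are deletions
    if not real:
        return 1                                    # v_word becomes an insertion
    best_sub = min(0 if v_word == w else 1 for w in real)
    return best_sub + (len(real) - 1)               # one match/substitution, rest deletions

def windowed_levenshtein(v, w, d):
    """Windowed Levenshtein distance Lev_d(v, w) for equal-length, epsilon-aligned sequences."""
    S = len(v)
    assert len(w) == S
    INF = math.inf
    # C[m][i + d] = C_{d,i}(m), deviation i in [-d, d]; position 0 is the empty prefix.
    C = [[INF] * (2 * d + 1) for _ in range(S + 1)]
    for j in range(-d, d + 1):
        # empty hypothesis prefix against w_1^j: every real word in there is a deletion
        C[0][j + d] = sum(1 for x in w[:max(0, j)] if x != EPS)
    for m in range(1, S + 1):
        for i in range(-d, d + 1):
            n = m + i
            if not (0 <= n <= S):
                continue
            best = INF
            for j in range(-d, i + 2):              # j in [-d, i+1]
                if not (-d <= j <= d):
                    continue                         # previous deviation outside the window
                prev = C[m - 1][j + d]
                if prev == INF:
                    continue
                span = w[max(0, m - 1 + j): m + i]   # w_{m+j} ... w_{m+i} (empty if j = i+1)
                best = min(best, lev_word_vs_span(v[m - 1], span) + prev)
            C[m][i + d] = best
    return C[S][d]                                   # Lev_d = C_{d,0}(S)

# With d = 1 the mis-clustered "b" from Figure 5.2 can still be matched:
assert windowed_levenshtein(["a", "b", EPS, "c"], ["a", EPS, "b", "c"], d=1) == 0
assert windowed_levenshtein(["a", "b", EPS, "c"], ["a", EPS, "b", "c"], d=0) == 2
```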

For the windowed Levenshtein alignment the following holds: a hypothesis word $v_m$ can only be aligned to a word in $\{w_{m-d}, \ldots, w_{m+d}\}$, $w_1^S \in S_S$, and consequently the following hypothesis space is sufficient for the windowed Levenshtein distance decoder:

$$ H_{S,d} := \prod_{m=1}^{S} \Big( \bigcup_{i=m-d}^{m+d} S_S^{(i)} \Big) $$

Taking the hypothesis space and the above approximation, the following inequalities for the windowed Bayes risk decoding rule are derived for going from a window size of $S$ down to $d$:

$$
\begin{aligned}
x_1^T \rightarrow r :={}& \min_{v_1^M} \sum_{w_1^N} p(w_1^N|x_1^T)\,\mathrm{Lev}(v_1^M, w_1^N) \\
={}& \min_{v_1^S \in H_{S,S}} \sum_{w_1^S \in S_S} R_{S,0}(S; v_1^S, w_1^S) \\
={}& \min_{v_1^S \in H_{S,S-1}} \sum_{w_1^S \in S_S} R_{S-1,0}(S; v_1^S, w_1^S) \\
\le{}& \;\cdots \\
\le{}& \min_{v_1^S \in H_{S,d}} \sum_{w_1^S \in S_S} R_{d,0}(S; v_1^S, w_1^S)
\end{aligned}
$$
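Before the efficient approximations are introduced, the exact windowed Bayes risk decision over $H_{S,d}$ can be written down by brute force. The sketch below reuses the `windowed_levenshtein` function from the previous sketch and enumerates the hypothesis space exhaustively, which is feasible only for toy examples such as the lattice of Figure 5.2; names and data layout are illustrative.

```python
from itertools import product

def windowed_bayes_risk_decode(summation_space, d):
    """Brute-force Bayes risk decoding with the windowed Levenshtein distance as loss.

    summation_space: list of (sequence, posterior) pairs, all sequences epsilon-aligned
                     to the same length S (the aligned summation space S_S).
    Returns the minimum-risk hypothesis and its risk. Exponential in S; toy use only.
    """
    S = len(summation_space[0][0])
    # per-position hypothesis sets H_{S,d}^{(m)}: union of the summation-space words in the window
    position_sets = []
    for m in range(S):
        words = set()
        for i in range(max(0, m - d), min(S, m + d + 1)):
            words.update(seq[i] for seq, _ in summation_space)
        position_sets.append(sorted(words))

    best_hyp, best_risk = None, float("inf")
    for hyp in product(*position_sets):            # enumerate H_{S,d}
        risk = sum(p * windowed_levenshtein(list(hyp), list(seq), d)
                   for seq, p in summation_space)
        if risk < best_risk:
            best_hyp, best_risk = list(hyp), risk
    return best_hyp, best_risk

# Toy check against Figure 5.2: d = 0 reproduces the CN-distance risk 0.7,
# a window of size three (d = 1) reduces the minimum risk to 0.3.
space = [(["a", "b", "<eps>", "c"], 0.4),
         (["a", "<eps>", "b", "c"], 0.3),
         (["a", "<eps>", "<eps>", "c"], 0.3)]
assert abs(windowed_bayes_risk_decode(space, d=0)[1] - 0.7) < 1e-9
assert abs(windowed_bayes_risk_decode(space, d=1)[1] - 0.3) < 1e-9
```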

The approximated posterior probability. The approximation of the posterior probability is obtained by applying the chain rule and shortening the sequence in the condition (the "history") to a fixed length $L \ge 0$, i.e. the posteriors are approximated by an $L$-gram model conditioned on the acoustic observations:

$$
\begin{aligned}
p(w_1^S|x_1^T) &= p(w_S|w_1^{S-1}, x_1^T)\, p(w_{S-1}|w_1^{S-2}, x_1^T) \cdots p(w_1|x_1^T) \\
&\approx p(w_S|w_{S-L}^{S-1}, x_1^T)\, p(w_{S-1}|w_{S-L-1}^{S-2}, x_1^T) \cdots p(w_1|x_1^T)
\end{aligned}
$$

For the partial product of the approximated posteriors a new notation is introduced, where $L$ is set to $2d$. That is, the length of the sub-sequences equals the size of the window which is used for the windowed Levenshtein distance. The product is defined recursively as

$$
\tilde{P}_d(n; w_1^S) := p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T)\, p(w_{n+d-1}|w_{n-d-1}^{n+d-2}, x_1^T) \cdots p(w_1|x_1^T)
= p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T)\, \tilde{P}_d(n-1; w_1^S).
$$

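A short sketch of the limited-history factorization and of the truncated product $\tilde{P}_d$; the conditional posterior is assumed to be available as a callable (e.g. estimated from the lattice), which is an assumption made only for this illustration.

```python
def approx_sequence_posterior(words, cond_posterior, L):
    """Limited-history factorization: p(w_1^S|x) ~ prod_n p(w_n | w_{n-L}^{n-1}, x)."""
    prob = 1.0
    for n, w in enumerate(words):
        history = tuple(words[max(0, n - L):n])     # at most L predecessor words
        prob *= cond_posterior(w, history)
    return prob

def P_tilde(n, words, cond_posterior, d):
    """P~_d(n; w_1^S): the factorization truncated after the factor for w_{n+d} (1-based n)."""
    L = 2 * d
    prob = 1.0
    for i in range(min(n + d, len(words))):         # factors for w_1 ... w_{n+d}
        history = tuple(words[max(0, i - L):i])
        prob *= cond_posterior(words[i], history)
    return prob

# Toy usage with a uniform conditional posterior (purely illustrative):
uniform = lambda w, hist: 0.5
p_approx = approx_sequence_posterior(["a", "b", "c"], uniform, L=2)
```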

With the help of the approximated posteriors the following windowed risk is defined:

$$ \tilde{R}_{d,i}(n; v_1^S, w_1^S) := \tilde{P}_d(n; w_1^S)\, C_{d,i}(n; v_1^S, w_1^S) $$

The approximation in the posteriors does not cause an approximation in the corresponding Bayes risk computation with the windowed Levenshtein distance as loss function. In other words, replacing the correct posteriors in the Bayes risk formula by the approximated ones still yields the correct result:

$$ \min_{v_1^S} \sum_{w_1^S} R_{d,0}(S; v_1^S, w_1^S) = \min_{v_1^S} \sum_{w_1^S} \tilde{R}_{d,0}(S; v_1^S, w_1^S) $$

The reason is the locality of the errors in the windowed Levenshtein distance. The decision whether a sequence $w_1^S$ in the summation space contributes to the error of the hypothesized word $v_n$ is made in the local window around position $n$. Thus, in the summation the fore and rear parts of each sequence in the summation space fall together. For a window of size $2d+1$ a history of length $2d$ (or larger) is required in order to get the correct windowed Bayes risk result.

The first step of the proof is to show that the Bayes risk decoding with the windowed Levenshtein distance relies only on the posterior probabilities of sequences of length $2d+1$ centered at position $n$. Let the windowed alignment of $v_1^S$ and $w_1^S$ be denoted by $A_1^S$, where $A_n$ contains all the information required by the loss function $L(n; v_n, w_{n-d}^{n+d}, A_n)$ to compute the number of errors due to $v_n$: $v_n$ can be aligned to one of the words in $w_{n-d}^{n+d}$ or it can be an insertion. Furthermore, the alignment of $v_n$ can cause the alignment of one or several $\varepsilon$s to words in $w_{n-d}^{n+d}$. The loss function $L$ is only an auxiliary construct for this proof and is not to be confused with the recursively defined cost function $C$. With the help of the loss function $L$ the Bayes risk for the windowed Levenshtein distance can be computed as:

$$
\begin{aligned}
x_1^T \rightarrow r_d :={}& \min_{v_1^S} \sum_{w_1^S} p(w_1^S|x_1^T)\,\mathrm{Lev}_d(v_1^S, w_1^S) \\
={}& \min_{v_1^S} \min_{A_1^S} \sum_{w_1^S} p(w_1^S|x_1^T) \sum_{n=1}^{S} L(n; v_n, w_{n-d}^{n+d}, A_n) \\
={}& \min_{v_1^S} \min_{A_1^S} \sum_{n=1}^{S} \sum_{u_{n-d}^{n+d}} L(n; v_n, u_{n-d}^{n+d}, A_n) \sum_{w_1^S :\, w_{n-d}^{n+d} = u_{n-d}^{n+d}} p(w_1^S|x_1^T) \\
={}& \min_{v_1^S} \min_{A_1^S} \sum_{n=1}^{S} \sum_{u_{n-d}^{n+d}} L(n; v_n, u_{n-d}^{n+d}, A_n)\, p(u_{n-d}^{n+d}|x_1^T)
\end{aligned}
$$

The crucial step of the proof is to show that $p(u_{n-d}^{n+d}|x_1^T)$ can be computed from the approximated word sequence posteriors. In other words, the proof is concluded by showing that the following equality holds:

$$
p(u_{n-d}^{n+d}|x_1^T) = \sum_{w_1^S :\, w_{n-d}^{n+d} = u_{n-d}^{n+d}} p(w_1^S|x_1^T)
\;\overset{!}{=}\; \sum_{w_1^S :\, w_{n-d}^{n+d} = u_{n-d}^{n+d}} \prod_{i=1}^{S} p(w_i|w_{i-2d}^{i-1}, x_1^T)
$$

For $d = 0$ (window size of one) this is easy to see:

$$
\sum_{w_1^S :\, w_n = u_n} \prod_{i=1}^{S} p(w_i|x_1^T)
= p(u_n|x_1^T) \prod_{\substack{i=1 \\ i \ne n}}^{S} \underbrace{\sum_{w} p(w|x_1^T)}_{=1}
= p(u_n|x_1^T)
$$


Next, the proof is shown for $d = 1$; the extension to $d > 1$ is straightforward.

$$
\begin{aligned}
& \sum_{w_1^S :\, w_{n-1}^{n+1} = u_{n-1}^{n+1}} \prod_{i=1}^{S} p(w_i|w_{i-2}^{i-1}, x_1^T) \\
&= \sum_{w_1^S :\, w_{n-1}^{n+1} = u_{n-1}^{n+1}} p(u_{n+1}|u_{n-1}, u_n, x_1^T)\, p(u_n|w_{n-2}, u_{n-1}, x_1^T)\, p(u_{n-1}|w_{n-3}, w_{n-2}, x_1^T) \prod_{i=1}^{n-2} p(w_i|w_{i-2}^{i-1}, x_1^T) \prod_{i=n+2}^{S} p(w_i|w_{i-2}^{i-1}, x_1^T) \\
&= p(u_{n+1}|u_{n-1}, u_n, x_1^T) \sum_{w_{n-2}} p(u_n|w_{n-2}, u_{n-1}, x_1^T) \sum_{w_{n-3}} p(u_{n-1}|w_{n-3}, w_{n-2}, x_1^T)
\underbrace{\sum_{w_1^{n-4}} \prod_{i=1}^{n-2} p(w_i|w_{i-2}^{i-1}, x_1^T)}_{=\,p(w_{n-3},\, w_{n-2}|x_1^T)\;(*)}
\;\underbrace{\sum_{w_{n+2}^{S}} \prod_{i=n+2}^{S} p(w_i|w_{i-2}^{i-1}, x_1^T)}_{=\,1} \\
&= p(u_{n+1}|u_{n-1}, u_n, x_1^T) \sum_{w_{n-2}} p(u_n|w_{n-2}, u_{n-1}, x_1^T) \sum_{w_{n-3}} p(u_{n-1}|w_{n-3}, w_{n-2}, x_1^T)\, p(w_{n-3}, w_{n-2}|x_1^T) \\
&= p(u_{n-1}, u_n, u_{n+1}|x_1^T)
\end{aligned}
$$

The proof is concluded by showing that assumption $(*)$ made in the last equation is correct:

$$
\begin{aligned}
\sum_{w_1^{n-4}} \prod_{i=1}^{n-2} p(w_i|w_{i-2}^{i-1}, x_1^T)
&= \sum_{w_{n-4}} p(w_{n-2}|w_{n-4}, w_{n-3}, x_1^T) \sum_{w_{n-5}} p(w_{n-3}|w_{n-5}, w_{n-4}, x_1^T) \cdots
\sum_{w_2} p(w_4|w_2, w_3, x_1^T) \underbrace{\sum_{w_1} p(w_3|w_1, w_2, x_1^T)\, p(w_2|w_1, x_1^T)\, p(w_1|x_1^T)}_{=\,p(w_2,\, w_3|x_1^T)} \\
&= p(w_{n-3}, w_{n-2}|x_1^T)
\end{aligned}
$$

The approximated summation. In order to make the computation of the sum over the summation space $S_S$ feasible on a structure like a lattice, the dependencies of the summands have to be reduced, i.e. for a window of size $2d+1$ it is required that the sum at position $n$ depends only on its $2d$ predecessors. For the posterior probabilities this is achieved by defining a recursive function which computes the marginals of the approximated posteriors. No further approximation is required, because the context of the conditional posteriors is already limited, i.e.

$$ \sum_{w_1^{n-d-1}} \tilde{P}_d(n; w_1^S) = p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S). $$

Making explicit use of the fact that the dependency is bounded by the window size, the marginals can be computed as

$$ \tilde{P}_d(n; w_{n-d}^{n+d}) := p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}). $$


Next, the so-called marginal risk is defined, which is computed over a window of fixed size. In order to reduce the dependency to the last $2d$ positions, the sum in the risk computation has to be approximated. The alignment of a hypothesis word is already limited to the last $2d$ positions by using the windowed cost function, but the approximation is still required because the cost function contains a sum over a minimum and the minimum operation does not distribute over addition:

$$
\begin{aligned}
\sum_{w_1^{n-d-1}} \tilde{R}_{d,i}(n; v_1^S, w_1^S)
&= \sum_{w_1^{n-d-1}} \tilde{P}_d(n; w_1^S)\, C_{d,i}(n; v_1^S, w_1^S) \\
&= p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \sum_{w_1^{n-d-1}} \min_{j\in[-d,\,i+1]} \Big\{ \tilde{P}_d(n-1; w_1^S)\,\mathrm{Lev}(v_n, w_{n+j}^{n+i}) + \tilde{P}_d(n-1; w_1^S)\, C_{d,j}(n-1; v_1^S, w_1^S) \Big\} \\
&\le p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \min_{j\in[-d,\,i+1]} \Big\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S) + \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S)\, C_{d,j}(n-1; v_1^S, w_1^S) \Big\} \\
&= p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \min_{j\in[-d,\,i+1]} \Big\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_1^{n-d-1}} \tilde{P}_d(n-1; w_1^S) + \sum_{w_1^{n-d-1}} \tilde{R}_{d,j}(n-1; v_1^S, w_1^S) \Big\}
\end{aligned}
$$

Applying the approximation to $n-1, n-2, \ldots$ the following recursion is derived, which defines the approximated marginal risk:

$$
\tilde{R}_{d,i}(n; v_1^n, w_{n-d}^{n+d}) := p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \min_{j\in[-d,\,i+1]} \Big\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) + \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{R}_{d,j}(n-1; v_1^{n-1}, w_{n-1-d}^{n-1+d}) \Big\}
$$

The approximated marginal risk efficiently computes an approximation of the sum over the aligned summation space by considering only a context of fixed size, which is set to $2d$. The following inequality results from the approximation:

$$ \sum_{w_1^S \in S_S} R_{d,0}(S; v_1^S, w_1^S) \;\le\; \sum_{w_{S-d}^{S+d} :\, w_1^S \in S_S} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d}) $$

Unfortunately, the approximation destroys the hierarchy w.r.t. the window size: the swapping of sum and minimum can cause the preference of an alignment which yields the lowest cost up to the current window, but not the lowest final cost.

The approximated minimum. The last operation preventing an efficient computation is the minimum over the hypothesis space. In general, the cost of two hypotheses can only be compared after the alignment of all words in the hypotheses, even when using the windowed Levenshtein distance. That is, when comparing two partial hypotheses up to position $n$, the minimum over all possible expansions to full length $S$ has to be taken into account in order to guarantee the correct result. The approximation happens by considering only the next $d$ positions instead of all positions up to $S$.


For the approximation, the definition of the summation space over all sub-sequences in a given range is needed,

$$ S_S^{(m,n)} := \big\{ w_m^n : w_1^S \in S_S \big\}, $$

and also the definition of the hypothesis space at a given position and for a given range:

$$ H_{S,d}^{(m)} := \bigcup_{i=m-d}^{m+d} S_S^{(i)}, \qquad H_{S,d}^{(m,n)} := H_{S,d}^{(m)} \times \cdots \times H_{S,d}^{(n)}. $$

For computing the hypothesis up to position $n$ all possible expansions to length $n+d$ are considered, i.e. the algorithm looks $d$ positions into the future:

$$ \tilde{v}_1^{n-d} := \operatorname*{argmin}_{v_1^{n-d}} \min_{v_{n-d+1}^{n}} \sum_{w_{n-d}^{n+d}} \tilde{R}_{d,0}(n; v_1^n, w_{n-d}^{n+d}) $$

Applying the approximation to $n-d-1, n-d-2, \ldots$ yields the recursive definition

$$ \tilde{v}_{n-d} := \operatorname*{argmin}_{v_{n-d} \in H_{S,d}^{(n-d)}} \;\min_{v_{n-d+1}^{n} \in H_{S,d}^{(n-d+1,n)}} \sum_{w_{n-d}^{n+d} \in S_S^{(n-d,n+d)}} \tilde{R}_{d,0}\big(n; (\tilde{v}_1^{n-d-1}, v_{n-d}, v_{n-d+1}^{n}), w_{n-d}^{n+d}\big). $$

And the following inequality is a direct result from the definition of the Levenshtein distance:

$$ \min_{v_1^S \in H_{S,d}} \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d}) \;\le\; \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; \tilde{v}_1^S, w_{S-d}^{S+d}) $$

The dynamic programming equations. Putting it all together, the following dynamic programming equations are derived, which efficiently compute an approximation of the Bayes risk and the corresponding hypothesis for the windowed Levenshtein distance as loss function:

$$ \tilde{P}_d(n; w_{n-d}^{n+d}) := p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) \tag{5.5} $$

$$ \tilde{R}_{d,i}(n; v_{n-d}^{n}, w_{n-d}^{n+d}) := p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T) \min_{j\in[-d,\,i+1]} \Big\{ \mathrm{Lev}(v_n, w_{n+j}^{n+i}) \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{P}_d(n-1; w_{n-d-1}^{n+d-1}) + \sum_{w_{n-d-1} \in S_S^{(n-d-1)}} \tilde{R}_{d,j}\big(n-1; (\tilde{v}_{n-1-d}, v_{n-d}^{n-1}), w_{n-1-d}^{n-1+d}\big) \Big\} \tag{5.6} $$

$$ \tilde{v}_{n-d} := \operatorname*{argmin}_{v_{n-d} \in H_{S,d}^{(n-d)}} \;\min_{v_{n-d+1}^{n} \in H_{S,d}^{(n-d+1,n)}} \sum_{w_{n-d}^{n+d} \in S_S^{(n-d,n+d)}} \tilde{R}_{d,0}(n; v_{n-d}^{n}, w_{n-d}^{n+d}) \tag{5.7} $$

The equations describe a nested recursion: alternately, the approximated risk at position $n$ and the final word hypothesis at position $n-d$ are computed. The hypothesis word at position $n-d$ is the leftmost hypothesis word considered in computing the approximated risk at position $n$. The probabilities $p(w_{n+d}|w_{n-d}^{n+d-1}, x_1^T)$ can be efficiently computed in a pre-processing step from the summation space lattice under consideration of the arc and path alignment given by the initializing CN. The equations are initialized in the following way, where $v_n = w_n := \varepsilon$ holds for all $n \le 0$ and $n > S$:

$$ \tilde{P}_d(-d; w_{-2d}^{0}) := 1, \qquad \tilde{R}_{d,i}(-d; v_{-2d}^{-d}, w_{-2d}^{0}) := 0, \qquad \tilde{v}_{-d} := \varepsilon $$

The interpretation of the initialization is that the probability of the sequence of empty words preceding the ultimate hypothesis $\tilde{v}_1^S$ equals one and the corresponding risk is zero. For computing the approximated Bayes risk hypothesis it is sufficient to look at the last hypothesis word $\tilde{v}_S$, because the recursion will produce the remaining $S-1$ elements and will compute the approximate Bayes risk for the complete hypothesis. The approximate risk for a window of size $2d+1$ equals

$$
\begin{aligned}
\tilde{R}_{d,0}(S+d+1; \tilde{v}_{S+1}^{S+d+1}, w_{S+1}^{S+2d+1})
&= p(w_{S+2d+1}|w_{S+1}^{S+2d}, x_1^T) \min_{j\in[-d,\,1]} \Big\{ \mathrm{Lev}(v_{S+d+1}, w_{S+d+1+j}^{S+d+1}) \sum_{w_S} \tilde{P}_d(S+d; w_{S}^{S+2d}) + \sum_{w_S} \tilde{R}_{d,j}\big(S+d; (\tilde{v}_S, v_{S+1}^{S+d}), w_{S}^{S+2d}\big) \Big\} \\
&= \sum_{w_S} \tilde{R}_{d,0}\big(S+d; (\tilde{v}_S, v_{S+1}^{S+d}), w_{S}^{S+2d}\big).
\end{aligned}
$$

In the $(S+d+1)$-th computation of the risk only empty words are aligned, because $w_n = v_n = \varepsilon$ for $n > S$. The risk computation reduces to a simple sum and the sum depends on the last hypothesis element $\tilde{v}_S$. This initializes the nested recursion and in the next step $\tilde{v}_S$ is computed as

$$ \tilde{v}_S = \operatorname*{argmin}_{v_S} \sum_{w_{S}^{S+2d}} \tilde{R}_{d,0}(S+d; v_{S}^{S+d}, w_{S}^{S+2d}) = \operatorname*{argmin}_{v_S} \sum_{w_S} \tilde{R}_{d,0}(S+d; v_{S}^{S+d}, w_{S}^{S+2d}). $$

The result depends on the risk at position $(S+d)$ and thus the recursive computation of the approximate Bayes risk is initiated. The risk computation terminates with $\tilde{v}_{-d} = \varepsilon$. The first $d$ calls in the unrolled recursion just fill the right half of the window, which is used to predict the current hypothesis word. Thus, the first $(d+1)$ hypothesis words produced by the recursion equal the empty word, i.e. $\tilde{v}_{-d}^{0} = \varepsilon \cdots \varepsilon$, and $\tilde{v}_1^S$ is the ultimate hypothesis.

Figure 5.3 visualizes the approach for different window sizes. For getting the word hypothesis at position $n$ the decoder considers the alignment between any partial word sequence $v_n^{n+d}$ from the hypothesis space and any partial word sequence $w_n^{n+2d}$ from the summation space, as shown by figure b). For a window of size one the alignment is unique as shown in figure a), i.e. the alignment is already determined. For a sufficiently large window the complete word sequences $v_1^S$ and $w_1^S$ are considered, see figure c).

The time and space complexity. The run-time and memory requirements of the algorithm depend on the window size $d$ and on the initial CN alignment with length $S$, from which the aligned hypothesis and summation space are derived. For a full search the exact run-time and memory consumption can be computed; slightly simplified, the recursion has the following time and space requirements (the underbraces indicate the quantity for whose computation the time resp. memory is used):

$$
\text{time: } \sum_{n=-d+1}^{S+d+1} \Big( \underbrace{\big|S_S^{(n-d-1,\,n+d)}\big|}_{\tilde{P}_d(n;\, w_{n-d}^{n+d})} \;+\; \underbrace{(2d+1)\,\big|H_{S,d}^{(n-d,\,n)}\big|\,\big|S_S^{(n-d-1,\,n+d)}\big|}_{\tilde{R}_{d,\cdot}(n;\, v_{n-d}^{n},\, w_{n-d}^{n+d})} \;+\; \underbrace{\big|H_{S,d}^{(n-d,\,n)}\big|\,\big|S_S^{(n-d,\,n+d)}\big|}_{\tilde{v}_{n-d}} \Big)
$$

$$
\text{space: } \sum_{n=-d+1}^{S+d+1} \Big( \underbrace{\big|S_S^{(n-d,\,n+d)}\big|}_{\tilde{P}_d(n;\, w_{n-d}^{n+d})} \;+\; \underbrace{\big|H_{S,d}^{(n-d,\,n)}\big|\,\big|S_S^{(n-d,\,n+d)}\big|}_{\tilde{R}_{d,\cdot}(n;\, v_{n-d}^{n},\, w_{n-d}^{n+d})} \Big) \;+\; \underbrace{S}_{\tilde{v}_1^S}
$$

The space complexity can be reduced by holding only the information necessary for computing the quantities at the current position; the sum is replaced by two times the maximum.



Figure 5.3. The figure visualizes the alignments performed in the Bayes risk decoder with the windowed Levenshtein distance as loss function. Figure a) shows the CN alignment case, where the window size is one and thus the alignment is unique. For a window size of 2d + 1 the computation of the hypothesis word at position n considers the alignment between $v_n^{n+d}$ and $w_n^{n+2d}$ as shown in b). For sufficiently large window size, that is ≥ 2S − 1, the alignment between $v_1^S$ and $w_1^S$ is computed, see c), which yields the exact Levenshtein distance.


A further estimate can be made using the fact that $|S_S^{(n)}| \le |\Sigma|$ and $|H_{S,d}^{(n)}| \le |\Sigma|$, where $\Sigma$ denotes the vocabulary:

$$ \text{time: } O\big(d\,(S+d)\,|\Sigma|^{3d+3}\big), \qquad \text{space: } O\big((S+d)\,|\Sigma|^{3d+2}\big) $$

Due to the function of the algorithm no tracebacks are needed; in each step the algorithm produces a word of the final output. But if the alignment of the final hypothesis is desired, then tracebacks have to be stored.

The approximations. The following inequalities summarize the approximations applied in the windowed Levenshtein distance decoder with a window size of $2d+1$, starting from the exact Bayes risk with the exact Levenshtein distance as loss function:

$$
\begin{aligned}
x_1^T \rightarrow r :={}& \min_{v_1^M} \sum_{w_1^N} p(w_1^N|x_1^T)\,\mathrm{Lev}(v_1^M, w_1^N) \\
={}& \min_{v_1^S} \sum_{w_1^S} R(S, S; v_1^S, w_1^S) \\
\le{}& \min_{v_1^S} \sum_{w_1^S} R_{d,0}(S; v_1^S, w_1^S) \\
={}& \min_{v_1^S} \sum_{w_1^S} \tilde{R}_{d,0}(S; v_1^S, w_1^S) \\
\le{}& \min_{v_1^S} \sum_{w_{S-d}^{S+d}} \tilde{R}_{d,0}(S; v_1^S, w_{S-d}^{S+d}) \\
\le{}& \;\tilde{R}_{d,0}(S+d+1; \tilde{v}_{S+1}^{S+d+1}, w_{S+1}^{S+2d+1}) =: r_d(x_1^T)
\end{aligned}
$$

The first inequality is due to the windowed Levenshtein distance. The second inequality follows from toggling summation and minimization in the risk computation. And the third inequality is due to only considering a limited future when finding the next hypothesis word.

The limits. The nice property of the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function is that for $d = 0$ it becomes the well-known CN decoding rule, and for $d \ge S-1$ it equals the Bayes risk decoder with the exact Levenshtein distance as loss function. For a window of size one, i.e. $d = 0$, the resulting decoding rule is the CN decoding rule introduced in Section 3.4. In the notation used in this section the decoding rule becomes

$$ [v_1^S]_{\mathrm{CN}} = \Big( \operatorname*{argmax}_{v_n} p(v_n|x_1^T) \Big)_{n=1}^{S}. $$

The decoding of the hypothesis word at position $n$ is independent of the adjacent hypothesis words. Thus, for the proof it is sufficient to investigate the result of Equation (5.7) for any $n$:

$$
\begin{aligned}
\tilde{v}_n &= \operatorname*{argmin}_{v_n} \sum_{w_n} \tilde{R}_{0,0}(n; v_n, w_n) \\
&= \operatorname*{argmin}_{v_n} \Big\{ \sum_{w_n} p(w_n|x_1^T)\, \underbrace{\sum_{w_{n-1}} \tilde{P}_0(n-1; w_{n-1})}_{=\,1}\, \underbrace{\mathrm{Lev}(v_n, w_n)}_{=\,d(v_n, w_n)} + \sum_{w_{n-1}} \tilde{R}_{0,0}(n-1; \tilde{v}_{n-1}, w_{n-1}) \Big\} \\
&= \operatorname*{argmin}_{v_n} \Big\{ \big(1 - p(v_n|x_1^T)\big) + \sum_{w_{n-1}} \tilde{R}_{0,0}(n-1; \tilde{v}_{n-1}, w_{n-1}) \Big\} \\
&= \operatorname*{argmin}_{v_n} \big( 1 - p(v_n|x_1^T) \big)
\end{aligned}
$$
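The slot-wise rule can be stated in a few lines; a minimal sketch with slot distributions as plain dictionaries and an assumed '<eps>' symbol, consistent with the CN of Figure 5.2.

```python
def cn_decode(slot_distributions, eps="<eps>"):
    """CN decoding rule: per slot, pick the word with the highest posterior; drop epsilons."""
    hypothesis = []
    for slot in sorted(slot_distributions):
        word = max(slot_distributions[slot], key=slot_distributions[slot].get)
        if word != eps:
            hypothesis.append(word)
    return hypothesis

# Example: the CN of Figure 5.2 yields "a c" under slot-wise decoding.
cn = {0: {"a": 1.0},
      1: {"b": 0.4, "<eps>": 0.6},
      2: {"b": 0.3, "<eps>": 0.7},
      3: {"c": 1.0}}
assert cn_decode(cn) == ["a", "c"]
```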

[Figure 5.4 (graphic): left panel, the warping function h(x) for gamma = 0.4 and breakpoint b = 0.3 (unwarped vs. warped); right panel, estimated vs. true confidence for the LIMSI eval07en lattices (zero bias, unwarped, warped).]

Figure 5.4. Confidence warping applied to the lattices for eval07en produced by the LIMSI English EPPS 2007 evaluation system.

For $d = 0$ the alignment is completely determined by the initial CN alignment, which allows computing the risk for $v_n$ independently of $w_1^S$. This greatly reduces the run-time and the space requirement of the CN decoding rule, which is given by:

$$ \text{time: } O(S\,|\Sigma|), \qquad \text{space: } O(S) $$

From the construction of the windowed Levenshtein distance decoder it is obvious that for a sufficiently large window, i.e. a window spanning the whole initial alignment, the result equals the outcome of the exact Bayes risk decoder with the Levenshtein distance as loss function. In fact, choosing $d \ge S-1$ is sufficient for avoiding any approximation in the Bayes risk computation. The proof is done by inserting the window size into the equations which eventually yield the dynamic programming equations. The proof itself is mathematically straightforward, but bulky. Here, only the outline is given: first it is proved that $\tilde{P}_{S-1}(n; w_{n-S+1}^{n+S-1})$ is not an approximation, but computes the correct posterior probability for $w_1^S$. The result is used in showing that $\tilde{R}_{S-1,0}(S; v_1^S, w_1^S)$ computes the correct risk, i.e. equals $R(S, S; v_1^S, w_1^S)$, from which it follows that the result equals the exact Bayes risk with the Levenshtein distance as loss function and thus $\tilde{v}_1^S$ is the Bayes risk hypothesis.

5.2.3 Results

In this section experimental results for CN and fCN combination with posterior probability warping and for approximate Bayes risk decoding with the windowed Levenshtein distance as loss function are presented and discussed. Experimental results are presented for the Chinese 230h testing system and for the English EPPS 2007 evaluation cross-site combination. A detailed description of the systems can be found in Appendix B. For all experiments the acoustic and language model scales and the system weights in the union based combination and in CNC are optimized for minimum character/word error rate (CER/WER) on the tuning set. The optimization algorithm is described in Section 3.7. For the experiments applying the warping function defined in Equation (5.4) the system-dependent γs are included in the optimization.

The first set of experiments investigates the impact of the slot-wise posterior probability warping on the performance of the fCN and CN combination. In the fCN combination the min.hyp-nFE decoder defined in Section 4.2.1 is applied, which relies solely on frame-wise word posterior probabilities. In the union approach to lattice combination, the combined frame-wise word posteriors are computed as the weighted average of the system-dependent frame-wise posteriors, cf. Equation (3.17). The warping function is applied to the system-dependent frame-wise posteriors before computing the sum. The system-dependent γ in the warping function defined in Equation (5.4) is initialized for each system separately by maximizing the NCE value for the system-dependent Viterbi result on the tuning set.



Table 5.3. Combination results with system-dependent frame- and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. The warping is optimized for minimum character error rate. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System                  Warped Comb.    CER[%] (del/ins) err
                                        dev07¹              eval07              dev08
baseline                                (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
Frame Error Decoder
s1+s2                   no              (3.07/1.30) 13.57   (4.69/0.68) 13.95   (3.05/0.70) 12.54
                        yes             (2.76/1.46) 13.56   (4.46/0.82) 13.97   (2.83/0.80) 12.48
s1+s2+s3                no              (3.06/1.23) 13.18   (4.72/0.69) 13.71   (3.01/0.72) 12.22
                        yes             (3.04/1.25) 13.15   (4.75/0.69) 13.69   (3.01/0.70) 12.15
CNC Error Decoder
s1+s2                   no              (2.93/1.34) 13.56   (4.66/0.76) 13.99   (2.93/0.74) 12.50
                        yes             (2.96/1.33) 13.55   (4.65/0.74) 13.99   (3.01/0.74) 12.59
s1+s2+s3                no              (2.87/1.29) 13.17   (4.68/0.70) 13.70   (2.92/0.72) 12.21
                        yes             (2.92/1.25) 13.12   (4.68/0.68) 13.69   (2.99/0.72) 12.19
¹ tuning set

Table 5.4. Normalized cross entropy (NCE) results with frame- and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1.

System      Warping/Objective           NCE
                                        dev07¹  eval07  dev08
Frame Error Decoder
s1+s2       unwarped                    0.310   0.346   0.338
            system-dep./min. CER        0.342   0.372   0.366
            system-indep./max. NCE      0.348   0.375   0.376
s1+s2+s3    unwarped                    0.320   0.340   0.342
            system-dep./min. CER        0.338   0.353   0.358
            system-indep./max. NCE      0.343   0.358   0.368
CNC Error Decoder
s1+s2       unwarped                    0.307   0.347   0.333
            system-dep./min. CER        0.334   0.376   0.368
            system-indep./max. NCE      0.344   0.375   0.370
s1+s2+s3    unwarped                    0.335   0.362   0.354
            system-dep./min. CER        0.338   0.364   0.366
            system-indep./max. NCE      0.355   0.377   0.378
¹ tuning set



Table 5.5. Combination results with system-dependent frame- and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The warping is optimized for minimum word error rate. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

System                      Warped Comb.    WER[%] (del/ins) err
                                            eval06¹             eval07
baseline                                    (1.64/1.38) 8.16    (1.74/1.23) 9.13
Frame Error Decoder
LIMSI+RWTH                  no              (1.60/0.85) 6.65    (1.99/0.76) 7.73
                            yes             (1.53/0.82) 6.43    (1.90/0.76) 7.54
LIMSI+RWTH+UKA              no              (1.80/0.72) 6.48    (2.21/0.68) 7.52
                            yes             (1.66/0.79) 6.46    (1.92/0.76) 7.24
LIMSI+RWTH+UKA+IRST         no              (1.70/0.79) 6.52    (1.93/0.76) 7.26
                            yes             (1.53/0.83) 6.37    (1.82/0.78) 7.07
CNC Error Decoder
LIMSI+RWTH                  no              (1.45/0.80) 6.38    (1.88/0.75) 7.51
                            yes             (1.47/0.77) 6.33    (1.95/0.69) 7.47
LIMSI+RWTH+UKA              no              (1.47/0.72) 6.27    (1.87/0.68) 7.24
                            yes             (1.46/0.72) 6.16    (1.87/0.68) 7.32
LIMSI+RWTH+UKA+IRST         no              (1.45/0.71) 6.14    (1.87/0.69) 7.12
                            yes             (1.52/0.66) 6.11    (2.00/0.59) 7.01
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign

The confidence scores used for computing the NCE are derived from the frame-wise word posteriors according to [Wessel & Schlüter+ 2001a].

For the CN combination (CNC) the slot-wise word posterior probabilities from the system-dependent CNs are warped before feeding the CNs into the CNC algorithm. Again, the γ-parameter in the warping function is initialized for each system separately by maximizing the NCE on the tuning set; the slot-wise word posterior probability is used directly as confidence score. Figure 5.4 shows the resulting warping function for the LIMSI English EPPS 2007 evaluation system. The green line in the left plot shows the warping function with the γ-parameter optimized for maximum NCE on the Viterbi path. The right graph shows the ideal confidence scores in red, the unwarped confidence scores in green, and the warped scores in blue.

In a contrast experiment the confidence scores for the unwarped system combination result are warped in a post-processing step. The warping is applied to the frame- or slot-wise combined word posterior probabilities of the combination and decoding output and the single γ is optimized for maximum NCE. The objective of the experiment is twofold: first, it shows how, in a simple post-processing step, the NCE value of confidence scores based on frame- or slot-wise posterior probabilities can be improved. Second, the comparison with the system-dependent warping, where the γs are optimized for minimum error rate, indicates whether minimum error rate and maximum NCE go together.

The error rates for the experiments with the Chinese system are shown in Table 5.3 and the NCE values in Table 5.4. Keep in mind that the unwarped system and the system with system-independently warped confidence scores have the same error rate, because warping is applied after combination and decoding. For the Chinese system almost no improvement in CER is observed. The result is not surprising as all three Chinese systems use the same decoder to produce the lattices. Thus, it can be expected that for all lattice sets the bias in the lattice-derived posterior probabilities is the same. The NCE value is increased by the system-dependent posterior warping by 5 to 10% relative over the unwarped baseline. The gain comes from the system-dependent optimization of the system-dependent γ for maximum NCE. In the subsequent combined optimization of all γs for minimum CER almost no changes in the γs are observed.



Table 5.6. Normalized cross entropy (NCE) results with frame- and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2.

System                      Warping/Objective           NCE
                                                        eval06¹ eval07
Frame Error Decoder
LIMSI+RWTH                  unwarped                    0.309   0.371
                            system-dep./min. WER        0.291   0.361
                            system-indep./max. NCE      0.318   0.378
LIMSI+RWTH+UKA              unwarped                    0.310   0.367
                            system-dep./min. WER        0.247   0.293
                            system-indep./max. NCE      0.322   0.384
LIMSI+RWTH+UKA+IRST         unwarped                    0.320   0.375
                            system-dep./min. WER        0.303   0.341
                            system-indep./max. NCE      0.343   0.401
CNC Error Decoder
LIMSI+RWTH                  unwarped                    0.323   0.387
                            system-dep./min. WER        0.317   0.371
                            system-indep./max. NCE      0.332   0.394
LIMSI+RWTH+UKA              unwarped                    0.342   0.388
                            system-dep./min. WER        0.315   0.372
                            system-indep./max. NCE      0.356   0.405
LIMSI+RWTH+UKA+IRST         unwarped                    0.331   0.382
                            system-dep./min. WER        0.316   0.358
                            system-indep./max. NCE      0.344   0.402
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign


The final warping has virtually no impact on the decoding result. The gain in NCE from putting the warping in the post-processing step and tuning it for maximum cross-entropy is a little higher for an almost identical error rate.

The results for the English cross-site combination are summarized in Table 5.5 and in Table 5.6. For the frame-wise and slot-wise posterior probability warping a small decrease in error rate is observed. The improvements are larger for the frame error decoder, which on the other hand starts from a higher baseline. In contrast to the Chinese system, the NCE values decrease for the system-dependent posterior warping if optimized for minimum WER. On the other hand, for the post-decoding warping the NCE values increase slightly. The observation is consistent with the considerations in Section 3.7.1, where it is shown that the objective of the parameter optimization is to find a good classifier and not to find a good approximation of the true posteriors.

In the second set of experiments the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function is investigated. The windowed Levenshtein distance is initialized with the CN alignment derived from the arc-cluster CN construction algorithm described in Section 4.4.2. Table 5.7 and Table 5.8 summarize the results for the Chinese and the English task. The results are similar: increasing the window size does not help, the error rates are even slightly worse for larger windows. An investigation of the resulting alignments did not give a final explanation for the disappointing results. Noticeably, in the alignments for the English task for windows larger than one, frequently erroneous alignments of short words appear. It seems that especially for clouds of short words the windowed Levenshtein distance fails and the word boundaries considered in the common Levenshtein distance approximations are a valuable hint for correctly aligning these words. However, before drawing any conclusions further investigations are needed which are beyond the scope of this work.

5.3 Summary

In this chapter several applications based on confusion networks (CNs) have been presented. Confusion networks derived from word lattices have a simple structure: they can be regarded as a sequence of slots, where each slot defines a posterior probability distribution over the decoding vocabulary. In frame-wise defined CNs (fCNs) a slot represents a time frame and the articulation of a word is distributed among slots. In word-level CNs the articulation of a word is assigned to a single slot.

The first application uses the fCN to compute the time alignment for a word sequence. The method is similar to the common time alignment algorithm using an acoustic model. The difference is that the frame-wise scores are not computed by an acoustic model but are provided by the fCN. The algorithm is of particular interest for lattice-based combination and decoding experiments, where the decoder does not provide word boundaries. In this case, the fCN derived from the union of the system-dependent lattices can be used for computing new word boundaries. In particular, the union approach avoids out-of-vocabulary problems in the time alignment for (cross-site) system combination results.

In the second application entropy-based methods are used to combine several system-dependent fCNs. Entropy-based combination methods have been successfully applied in combining several feature streams in noisy environments. In this work the approach is integrated into the hyp-nFE decoder, cf. Section 4.2.1, which relies solely on frame-wise word posteriors. The standard combination consisting of the weighted average of the system-dependent frame-wise word posteriors is replaced by the entropy-based methods. However, in the experimental tests the entropy-based methods cannot beat the standard approach. The results presented in [Misra & Bourlard+ 2003] suggest that the method is most beneficial in the presence of noise, whereas all experiments conducted in this work use clean speech.

The third application aims at warping frame- or slot-wise word posterior probabilities for optimal error rate. The motivation is twofold: by warping the posterior distributions the probability estimates achieve a better approximation of the true posteriors, which theoretically helps in Bayes risk decoding. The other motivation comes from the observation that lattice-based posteriors have a system-specific bias. The posterior warping is a means for making the posteriors comparable among systems, especially in cross-site system combinations. The experimental results show a small benefit for the cross-site system combination, but no improvement for an intra-site combination. The confidence scores are directly derived from the warped frame- or slot-wise word posteriors. An evaluation of the normalized cross entropy (NCE) shows that for all systems the posterior warping can increase the NCE, if tuned for maximum NCE.



Table 5.7. Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the Chinese 230h testing system, cf. Section B.1.1. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.

System      Window Size     CER[%] (del/ins) err
                            dev07¹              eval07              dev08
baseline                    (2.63/1.59) 14.54   (4.42/0.91) 15.08   (2.80/0.87) 13.28
s1          1               (2.83/1.44) 14.33   (4.54/0.81) 14.91   (2.97/0.78) 13.15
            3               (2.71/1.49) 14.33   (4.48/0.86) 14.98   (2.84/0.85) 13.24
            5               (2.78/1.44) 14.35   (4.56/0.85) 14.95   (2.99/0.78) 13.22
s1+s2+s3    1               (2.87/1.25) 13.12   (4.69/0.69) 13.70   (2.94/0.77) 12.27
            3               (2.64/1.41) 13.27   (4.48/0.81) 13.80   (2.74/0.85) 12.34
            5               (2.75/1.33) 13.20   (4.56/0.72) 13.71   (2.80/0.74) 12.38
¹ tuning set

Table 5.8. Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.

System                      Window Size     WER[%] (del/ins) err
                                            eval06¹             eval07
baseline                                    (1.64/1.38) 8.16    (1.74/1.23) 9.13
LIMSI                       1               (1.60/1.30) 8.01    (1.72/1.16) 8.94
                            3               (1.58/1.33) 8.02    (1.73/1.18) 8.99
                            5               (1.57/1.31) 8.01    (1.71/1.18) 8.98
LIMSI+RWTH+UKA+IRST         1               (1.61/0.69) 6.29    (2.07/0.60) 7.18
                            3               (1.42/0.82) 6.33    (1.80/0.76) 7.22
                            5               (1.43/0.80) 6.34    (1.81/0.75) 7.20
¹ tuning set, eval06 was the official development set in the 2007 evaluation campaign


However, in the cross-site combination experiments with the warping optimized for minimum error rate the NCE values decrease.

In the last section a windowed Levenshtein decoder is developed within the Bayes risk framework. The resulting decoder draws the connection between CN decoding and Bayes risk decoding with the exact Levenshtein distance as loss function: the windowed Levenshtein decoder is initialized with a CN alignment. The result for a window of size one equals the CN decoder and for a sufficiently large window the Bayes risk decoder with the exact Levenshtein distance as loss function is achieved. Dynamic programming equations for the windowed Levenshtein decoder are given, which compute in polynomial time an approximation of the Bayes risk with the windowed Levenshtein distance as loss function. However, experimental results show no improvements for the windowed Levenshtein decoder with a symmetric window of size three or five over the standard CN decoder, i.e. over a window of size one.


Chapter 6 Classifier based System Combination

In the Bayes risk decoding framework presented in Chapter 3 two assumptions have been made: the probabilities derived from lattices are trustworthy, and the local cost functions are good approximations of the Levenshtein distance. Section 5.2.1 discusses why the probabilities are not always reliable and not necessarily comparable among systems. The biases and drawbacks of the approximation of the Levenshtein distance by local cost functions are described in Chapter 4. That is, in practice, neither assumption is fulfilled.

The motivation for using classifiers in system combination follows directly from the above considerations: neither the lattice-based posterior probabilities nor the cost approximation should be trusted blindly. Instead, all available information is fed into a classifier. In the best case the classifier learns the underlying patterns, like the systematic bias of a cost approximation or the bias of a system-dependent posterior probability under certain conditions. Eventually, the classifier shall separate reliable from unreliable information and decide on the ultimate output of the system combination. The approach to classifier based system combination described in this section was introduced in [Hillard & Hoffmeister+ 2007] and further developed in [Hoffmeister & Schlüter+ 2008].

6.1 Combination with Classification

Confusion network combination (CNC) and also ROVER work in two steps. In the first step the system-dependent inputs (CNs or 1-bests) are aligned into a super CN. The second step consists of decoding the super CN, which is done in CNC and ROVER by a simple, slot-wise decision rule. CNs, CN decoding, CNC, and ROVER have been discussed before in Chapter 3.

Under the assumptions that the posterior probability estimates derived from the lattice are the true probabilities and that for each pair of paths in the lattice the CN alignment equals the Levenshtein alignment, the simple decision rule is optimal, i.e. the rule yields the hypothesis of the Bayes risk decoder with the Levenshtein distance as loss function. However, in practice, neither assumption is fulfilled. Posterior probabilities derived from a lattice usually show a bias due to model assumptions, beam pruning in the search, and subsequent lattice pruning. Even in the case that all Levenshtein alignments between all paths in the lattice can be expressed as a CN, the common, heuristic CN construction algorithms, like the algorithms presented in Section 4.4, usually do not find the optimal alignment.

In this work an approach is described which aims at compensating for inaccuracies in the probabilities and in the alignment by using a classifier. The main idea is to take advantage of the super CN constructed in the first step of the CNC or ROVER algorithm. The classifier makes a slot-wise decision on the super CN, where the decision is based on the slot-wise word posteriors and other slot-wise features derived from the system-dependent and also from the combined lattices.

As pointed out in Section 5.2.2, the CN alignment usually deviates from the Levenshtein alignment only in one or two positions. This observation motivates the inclusion of context information for the current slot into the classification process. The classifier works on a symmetric window of slots centered on the slot in question. The features from all slots in the window are concatenated and used by the classifier to predict the output for the current slot. The features are described in detail in Section 6.1.1. The context is brought in by augmenting the feature vector of the current slot with the features of the two adjacent slots.

In the training phase the classifier can learn the systematic bias of the lattice-based probability estimates and the bias in the CNC or ROVER alignment. In particular, the classifier can learn a probability warping similar to the explicit model given in Section 5.2.1. The consideration of the context is akin to the windowed Levenshtein distance introduced in Section 5.2.2 with a window size of three. The particular classifiers used in this work are discussed in Section 6.1.2.


From the basic idea three approaches to combination with classification are derived. They are distinguished according to the alignment they are based on. The iROVER approach presented in Section 6.1.3 is based on the ROVER alignment, the iCNC approach in Section 6.1.4 is based on the CNC alignment, and the iCN in Section 6.1.5 directly uses the super CN derived from the CN combination. The last section presents and discusses the results. The i in the i-approaches refers to improved or intelligent.

6.1.1 Features

The first feature for each word hypothesis is the information about which systems have hypothesized the word. This information is crucial for the subsequent decoding, because in the iROVER and iCNC approach the classifier's prediction per slot is not the concrete word, but the system which produced, according to the classifier's belief, the correct word. The other features are divided into three categories: the word features, the posterior features, and the decoder features.

The first category, the word features, is computed on word level and does not necessarily require a lattice, i.e. these features can be computed from any 1-best decoding result. The category consists of the acoustic and the language model score, the word duration, the number of characters, and the averaged character duration, which serves as an approximation for the average phoneme duration. The features are produced for each system separately. Furthermore, a flag is added indicating whether the word is in the list of the 10, 20, or 100 words causing the most errors on a tuning set. In the ROVER based approaches only the Viterbi results are aligned and the scores and time stamps of a word are unambiguous. In contrast, in a CN many word lattice arcs are collapsed into a single slot entry. In this case averaged time stamps and scores are used, where the average is weighted according to the lattice arc posteriors. In [Hillard & Hoffmeister+ 2007] the word identity was added as a feature, but further experiments indicated that the feature is not helpful and rather caused overfitting on some setups. Results presented in this work do not use this feature.

The second category of features includes all features derived from lattice posterior probabilities and is referred to as the posterior features. The features include the system-dependent CN confidence score and the entropy of the slot-wise word posterior probability distribution. If a CNC is available, the CNC confidence score and slot entropy are added. Furthermore, confidence scores based on frame-wise posterior probabilities, cf. [Wessel 2002], are included, which are computed across all systems as well as from the combined frame-wise posterior probabilities. The cross-system confidence score assigned by system A to a word hypothesis from system B is defined as follows: the confidence score is computed according to [Wessel 2002], where the required frame-wise word posterior probabilities are derived from the lattice provided by system A. This allows system A to give a confidence estimate for the hypothesis of system B. A classifier can use the cross-system confidence scores as an indicator for out-of-vocabulary (OOV) words.

The third and last feature category consists of the decisions of the standard approaches to system combination. The ROVER, CNC, and min.hyp-nFE combination and decoding results are computed, and for each word and each of these decoders a flag is included indicating whether the word would have been chosen by that decoder. ROVER alignment based experiments do not use CNC based features, because the CNC is superior to ROVER and thus, if a CNC is computed, the combination is based on the CNC alignment. The ROVER, CNC, and min.hyp-nFE decoders have been introduced in Section 3.4 and in Section 4.2.1.

The final feature vector consists at least of the word features of the current and the adjacent slots. The vector is augmented by the minimum distance in seconds to the adjacent slots. Depending on the setup, the features from the other two categories are added.
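The following sketch illustrates how the context is brought into the per-slot feature vector by concatenating the features of the adjacent slots; padding slots outside the CN with zeros is an assumption made for this sketch and is not prescribed by the thesis.

    import numpy as np

    def slot_features_with_context(slot_features, s, context=1):
        """Concatenate the feature vector of slot s with those of its
        'context' left and right neighbours (window size 2*context+1).
        Slots outside the CN are padded with zeros."""
        dim = len(slot_features[0])
        window = []
        for t in range(s - context, s + context + 1):
            if 0 <= t < len(slot_features):
                window.append(np.asarray(slot_features[t], dtype=float))
            else:
                window.append(np.zeros(dim))
        return np.concatenate(window)

With context=1 this yields the window of three slots used throughout this chapter.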

6.1.2 Classifiers and Training

The classifiers applied are Boostexter (BT) [Schapire & Singer 2000], random forests (RF) [Breiman 2001], and a log-linear model trained in the maximum entropy framework (Maxent) [Keysers & Och+ 2002]. For each slot in the provided CN the classifier makes an independent decision; context is included in the feature vector as explained in the previous section. The classifiers are learned on the CNs produced on the training set, where the reference transcription is matched to the CN via an oracle alignment. The result of the oracle alignment is used to assign to each slot a reference word. The ultimate slot labels for classifier training are either the systems which predicted the correct word for the slot or the rank of the reference word within the slot. The pure oracle error between CN and reference is computed by using the local cost defined in Equation (6.1), where p_s(·|x_1^T) denotes the slot-wise word posterior distribution for slot number s, and the reference is denoted by \tilde{w}_1^S:

    c(\tilde{w}_s) := \begin{cases} 0, & \text{if } p_s(\tilde{w}_s \mid x_1^T) > 0 \\ 1, & \text{otherwise} \end{cases}    (6.1)

The resulting alignment is not optimal for slot labeling, because it disregards the rank of the reference word in the slot. Especially in ambiguous alignments the reference word can be aligned to a slot in which it has a low rank instead of being aligned to the adjacent slot, where it has a high rank. However, it is not clear what the optimal alignment for classifier training is. Intuitively, the alignment should assign the reference word to a slot where it has a high rank, and at the same time it should minimize the oracle error rate. The alignment derived from minimizing the expected reference error shows (but does not guarantee) these properties and gives good results in practice. The corresponding local cost for computing the alignment is given by

    c(\tilde{w}_s) := 1 - p_s(\tilde{w}_s \mid x_1^T).    (6.2)

Because the alignment is not well defined, there is no guarantee that the resulting labeling is optimal for training. Some training labels have to be considered wrong and the resulting training set noisy.

First experiments are done using Boostexter, a simple classifier which shows good performance on a wide range of tasks. The idea of BT is to learn a series of weak classifiers (decision stumps) and to re-weight the training examples using AdaBoost; real AdaBoost.MH with logistic loss is used for the experiments presented in this work, cf. [Schapire & Singer 2001]. The second classifier is the random forest, which has some relations to the Boostexter approach. In an RF the weak classifier is a full decision tree, and randomization is applied instead of boosting. The RF implementation used in this work is the Randomized C4.5 as suggested in [Dietterich 2000b]. Randomization is in particular preferable to boosting in the presence of noisy training data: boosting starts to focus on the incorrectly labeled and thus hard to classify examples, and randomization is a simple way to avoid this bias. RFs have been successfully applied to several tasks, e.g. to a CN based confidence annotation task in [Xue & Zhao 2006]. An alternative to the two decision tree based approaches is the log-linear model. The model parameters are estimated using the Maxent Toolkit described in [Keysers & Och+ 2002].

The next three sections investigate different approaches for applying the classifiers to the system combination problem. In two of the setups the classifier predicts the system which is believed to produce the correct output. In the classifier training this setup causes multi-labels, because for each slot more than one system can be correct. BT training can directly handle multi-labels, unlike the RF implementation and the Maxent toolkit. Multi-label classification problems can be reduced to a single-label problem, cf. [Tsoumakas & Katakis 2007]. In preliminary tests two approaches were compared for the RF and the Maxent classifier. The first approach is to build a new label set which consists of one label for each combination of the original labels that occurs in the training set. In the preliminary experiments this approach worked best for Maxent. During the classification process the Maxent model assigns a probability to each label. From these probabilities the ultimate probabilities for the original labels are derived by splitting the new labels into the original ones and summing up the probabilities for each of the original labels. The investigated alternative tackles the multi-label problem by performing a one-vs-all classification. For each label a binary classifier is built. In classification all classifiers are applied and the final result is taken from the highest scoring classifier. This approach worked best for random forests, where the score is simply the number of trees within an RF which vote for the label in question.
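A minimal sketch of the two multi-label reductions described above; the helper names are illustrative and do not correspond to any specific toolkit interface.

    from collections import defaultdict

    def powerset_labels(multi_labels):
        """Label-powerset reduction: every combination of correct systems
        that occurs in the training data becomes one new single label,
        e.g. frozenset({0, 2}) for 'system 0 and system 2 are correct'."""
        return [frozenset(labels) for labels in multi_labels]

    def split_powerset_posterior(powerset_posterior):
        """Map probabilities of the combined labels back to the original
        system labels by summing over all combinations containing a system."""
        system_posterior = defaultdict(float)
        for label_set, prob in powerset_posterior.items():
            for system in label_set:
                system_posterior[system] += prob
        return dict(system_posterior)

    def one_vs_all_decision(binary_scores):
        """One-vs-all reduction: one binary classifier per label; the label
        of the highest scoring classifier wins (for an RF the score is the
        number of trees voting for the label)."""
        return max(binary_scores, key=binary_scores.get)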

6.1.3 The iROVER Approach

In the iROVER approach the Viterbi results of the systems are aligned and the standard decoding rule is replaced by a classifier. A similar approach is investigated in [Zhang & Rudnicky 2006], where the authors apply a neural network to a set of basic features, but observe only a small improvement.


In the combination of J systems the Viterbi results are aligned with the ROVER tool. In addition, the min.hyp-nFE decoding result is added as the (J+1)th system. The output of the classifier is one of the J+1 systems and the final output of iROVER is the word hypothesis of the predicted system. In the result of the ROVER alignment each slot contains exactly one word from each system, where the word can be the empty word. The feature vector for a slot is simply the concatenation of the features of the J+1 words in the slot. Thus, a feature vector of fixed size is constructed from which the classifier maps to the J+1 classes.
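The following sketch makes the iROVER decoding step concrete; the context concatenation from Section 6.1.1 is omitted for brevity, and classifier.predict is a placeholder for the trained BT/RF/Maxent model.

    import numpy as np

    def irover_decode(super_cn_slots, word_features, classifier):
        """iROVER decoding sketch: every slot of the ROVER alignment holds
        exactly one word per system (J Viterbi outputs + min.hyp-nFE); the
        classifier predicts the index of the system whose word is output."""
        hypothesis = []
        for s, slot_words in enumerate(super_cn_slots):
            feature_vector = np.concatenate(word_features[s])    # fixed size: (J+1) * dim
            winning_system = classifier.predict(feature_vector)  # class in {0, ..., J}
            word = slot_words[winning_system]
            if word != "":                                       # empty word -> no output
                hypothesis.append(word)
        return hypothesis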

6.1.4 The iCNC Approach

Two approaches for improving CNC decoding by classification are investigated. The first approach is referred to as iCNC and follows directly the iROVER approach. A super CN is computed from the system-dependent CNs by performing the alignment step of the CNC method. For each slot the best hypothesis from each of the J systems is selected, i.e. the word which maximizes p_{j,s}(·|x_1^T) is the word hypothesis provided by the jth system for the sth slot. In addition, the CNC and the min.hyp-nFE hypotheses are added as the (J+1)th and (J+2)th system. Notably, the ROVER result does not have to be explicitly added, as it is contained in the J system-dependent words; a binary flag indicates for each word whether it equals the ROVER result. The classifier is now applied in the same way as in the iROVER approach.

6.1.5 The iCN Approach

The iCN approach uses the CNC in a different way, following the approach applied in [Mangu & Padmanabhan 2001] to a single CN. The decision is made slot-wise among the N-best word hypothesis list of the slot, where the hypotheses are ranked according to the averaged word posterior probability as in CNC decoding, cf. Section 3.4.1. Choosing N=2 already gives an oracle error rate lower than the corresponding ROVER oracle error rate, i.e. in theory the iCN approach can compensate for more errors than iROVER or iCNC. For each word in the N-best list the feature vectors from all systems are concatenated. Each word is further tagged with whether or not it is the min.hyp-nFE or ROVER choice; the CNC choice is always the word with rank one. For N=2 the construction results in a feature vector of fixed size and a binary classification problem.
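A minimal sketch of the iCN(N=2) decision, assuming the super CN slots are again represented as word-to-posterior dictionaries and pair_features holds the concatenated feature vectors of the two top-ranked words per slot; classifier.predict is a placeholder returning the rank (0 or 1) of the word to output.

    def icn_decode(cnc_slots, pair_features, classifier):
        """iCN(N=2) sketch: keep the two words with the highest averaged
        posteriors per slot and let a binary classifier decide which of
        the two (rank 1 or rank 2) is output."""
        hypothesis = []
        for s, slot in enumerate(cnc_slots):
            top2 = sorted(slot.items(), key=lambda kv: kv[1], reverse=True)[:2]
            rank = classifier.predict(pair_features[s]) if len(top2) > 1 else 0
            word = top2[rank][0]
            if word != "":
                hypothesis.append(word)
        return hypothesis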

6.2 Experiments

Experiments are performed on the four lattice sets from the English EPPS 2007 evaluation cross-site combination setup. The corpus and the lattices are described in Appendix B. The baseline results for the single systems and system combinations with CNC, ROVER, and min.hyp-nFE decoding are summarized in Table 6.1. All results presented in this chapter are produced on the evaluation set, the TC-Star/EPPS 2007 eval07 set.

6.2.1 Experimental Setup

The classifiers are trained on the development set (eval06) of the TC-Star/EPPS 2007 Evaluation, which also serves as the tuning set for any further parameter optimization. A larger training set is not available, as in the TC-Star project only lattices for the eval06 and eval07 sets were produced and exchanged. Due to the limited training data a 10-fold cross-validation is applied for tuning the parameters of the classifiers. With the optimized parameter set the final classifier is trained on the complete data. Table 6.2 summarizes the statistics for the classifiers and the corpora. The number of samples is the number of slots in which not all systems agree on the same word, i.e. where a non-trivial classification problem exists. These samples make up the effective training set for the classifiers. The number of features is the dimensionality of the ultimate feature vector fed into the classifier. The iROVER+FE classifier refers to the setup where the min.hyp-nFE decoding result is added as the (J+1)th system, whereas iROVER only combines the Viterbi results from the J systems.


Table 6.1. Baseline results for eval07. ROVER results come with confidence score based voting and with majority voting. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.

System               | Viterbi / ROVER                      | CN(C)              | min.hyp-nFE (comb.)
LIMSI                | (1.91/1.21) 9.38                     | (2.00/1.07) 9.00   | (1.86/1.18) 9.00
RWTH                 | (1.93/1.26) 9.76                     | (2.23/1.08) 9.52   | (2.42/0.98) 9.62
UKA                  | (2.12/1.26) 10.22                    | (2.07/1.25) 10.09  | (2.22/1.20) 10.14
IRST                 | (2.41/1.18) 9.79                     | (2.40/1.19) 9.82   | (2.45/1.16) 9.80
LIMSI+RWTH           | (2.71/0.53) 8.10 / (1.91/1.19) 9.38  | (1.89/0.68) 7.42   | (1.81/0.85) 7.62
LIMSI+RWTH+UKA       | (2.30/0.70) 7.83 / (2.05/0.82) 7.95  | (1.91/0.61) 7.09   | (1.72/0.84) 7.38
LIMSI+RWTH+UKA+IRST  | (2.02/0.67) 7.43 / (2.44/0.54) 7.52  | (1.94/0.60) 7.05   | (1.61/0.88) 7.17

Table 6.2. Corpora statistics for the training/tuning set (eval06) and the evaluation set (eval07).

System               | Comb.     | #features | #samples eval06 | #samples eval07
LIMSI+RWTH           | iROVER    |  75       |  3,032          |  3,215
                     | iROVER+FE | 108       |  3,115          |  3,301
                     | iCNC      |  71       |    647          |    659
                     | iCN(N=2)  |  99       | 28,900          | 26,961
LIMSI+RWTH+UKA       | iROVER    | 111       |  4,237          |  4,386
                     | iROVER+FE | 147       |  4,207          |  4,416
                     | iCNC      |  94       |  1,709          |  1,801
                     | iCN(N=2)  | 126       | 32,624          | 30,069
LIMSI+RWTH+UKA+IRST  | iROVER    | 149       |  5,320          |  5,178
                     | iROVER+FE | 188       |  5,346          |  5,207
                     | iCNC      | 166       |  3,696          |  3,354
                     | iCN(N=2)  | 157       | 33,252          | 30,504


Table 6.3. CN oracle error rates for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.

Comb.      | 2 systems         | 3 systems         | 4 systems
iROVER     | (1.38/0.49) 5.39  | (1.22/0.42) 4.44  | (1.10/0.34) 3.82
iROVER+FE  | (1.30/0.41) 5.06  | (1.12/0.33) 4.21  | (0.97/0.24) 3.56
iCNC       | (1.71/0.57) 6.59  | (1.52/0.41) 5.13  | (1.29/0.27) 3.70
iCN(N=2)   | (0.87/0.33) 3.56  | -                 | (0.86/0.29) 3.41

For Boostexter and Maxent the number of training iterations is optimized for each task separately. Random forests proved not to be sensitive to parameter tuning: eventually C4.5 is used with default parameters and 100 trees for all experiments.

6.2.2 Results

The first set of experiments explores the potential of the proposed approach. The ROVER or CNC alignment is performed for the evaluation set and the reference is aligned according to Equation (6.2). From the resulting CN the oracle error rate is computed. The oracle error is defined as the error of the optimal classifier: an error is counted only if the reference word is not present in the slot. Table 6.3 shows the oracle error rates for the four investigated setups. Comparing the table to the baseline results in Table 6.1 shows that the classifier based approaches have a huge potential for improving the error rate. The largest gap is observed for the iCN approach, where already the combination of two systems halves the baseline error rate.

In the next set of experiments the iROVER approach is investigated in detail, especially the importance of the different feature categories. The results are summarized in Table 6.4. Using iROVER with only the simple word features already improves considerably over standard ROVER. Adding the posterior features boosts iROVER to the level of the min.hyp-nFE combination. Results with the Maxent classifier are only produced for the combination of two and three systems. The Maxent toolkit applies the Generalized Iterative Scaling (GIS) algorithm, which causes extremely long run-times, e.g. 100K iterations for the iROVER/2-systems task and 1M iterations for the iROVER/3-systems task, without giving an advantage over BT and RF. Eventually, no further Maxent classifiers are trained.

In the remaining experiments the features of all three categories are combined. Table 6.5 shows the results for a direct comparison of the four i-approaches using BT and RF as classifier. iROVER+FE improves over iROVER and can overtake the min.hyp-nFE combination, but fails to improve over the CNC baseline. iCNC performs best and can slightly improve over standard CNC. The iCN approach disappoints: it cannot clearly improve over standard CNC and is beaten by iCNC on the four system combination task, even though it shows the lowest oracle error rate. The analysis of this dissatisfying performance is the subject of the next section. Boostexter and random forests are mostly on the same level, with some advantages for RF. Especially for the hard iCN task the RF classifier seems to be more robust. The results are rather sobering: the improvements over the standard approaches are present, but small. Especially the CNC baseline is only slightly beaten by one of the classifier based approaches, the iCNC.

6.2.3 Analysis

The analysis of the classifier based combination methods is based on the error detection and correction statistics for the different approaches. Error detection is defined as the ability of the classifier to detect that the hypothesis chosen by the corresponding standard combination approach is not correct. The error correction statistics tell whether the classifier is able to replace a detected erroneous hypothesis by the correct word.


Table 6.4. iROVER combination results for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%.

iROVER                       | 2 systems         | 3 systems         | 4 systems
word features
  Boostexter                 | (2.14/0.86) 7.89  | (2.10/0.75) 7.60  | (2.17/0.69) 7.56
  Random forests             | (2.14/0.88) 7.88  | (2.15/0.72) 7.57  | (2.02/0.74) 7.37
  Maxent                     | (1.98/0.96) 7.92  | (1.98/0.83) 7.78  | -
word and posterior features
  Boostexter                 | (2.04/0.81) 7.61  | (2.07/0.77) 7.40  | (2.11/0.73) 7.25
  Random forests             | (2.08/0.83) 7.68  | (2.13/0.70) 7.43  | (2.07/0.70) 7.19
  Maxent                     | (2.07/0.86) 7.77  | (2.06/0.78) 7.60  | -

Table 6.5. Combination results with Boostexter (BT) and random forests (RF) as classifier for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%.

Comb.      |    | 2 systems         | 3 systems         | 4 systems
iROVER     | BT | (2.04/0.81) 7.61  | (2.07/0.77) 7.40  | (2.11/0.73) 7.25
           | RF | (2.08/0.83) 7.68  | (2.13/0.70) 7.43  | (2.07/0.70) 7.19
iROVER+FE  | BT | (1.87/0.79) 7.57  | (1.89/0.71) 7.24  | (1.76/0.75) 7.00
           | RF | (1.91/0.78) 7.49  | (1.88/0.69) 7.31  | (1.90/0.62) 6.97
iCNC       | BT | (1.86/0.69) 7.39  | (1.90/0.62) 7.07  | (1.97/0.58) 6.90
           | RF | (1.88/0.69) 7.41  | (1.96/0.63) 7.08  | (1.99/0.60) 6.93
iCN(N=2)   | BT | (1.93/0.71) 7.46  | (1.88/0.68) 7.20  | (1.91/0.72) 7.15
           | RF | (1.90/0.71) 7.37  | (1.87/0.63) 7.05  | (1.94/0.65) 7.01


Table 6.6. Error detection and correction results for eval07 for four systems and with a random forest as classifier.

Comb.      | Error Detection recall | Error Detection prec. | Error Correction recall | Error Correction prec.
iROVER     | 0.22 (357/1,658)       | 0.70 (357/514)        | 0.16 (269/1,658)        | 0.52 (269/514)
iROVER+FE  | 0.16 (267/1,629)       | 0.72 (267/369)        | 0.12 (189/1,629)        | 0.51 (189/369)
iCNC       | 0.14 (153/1,086)       | 0.71 (153/215)        | 0.10 (104/1,086)        | 0.48 (104/215)
iCN(N=2)   | 0.08 (153/2,018)       | 0.63 (153/244)        | 0.06 (113/2,018)        | 0.46 (113/244)

The formal definitions of precision and recall for detecting and correcting wrong word hypotheses are given by:

    \mathrm{rec}_{\mathrm{detect}}(S)  := \frac{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \wedge w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}} \}}{\sum_{s \in S} 1\{ w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}} \}}

    \mathrm{prec}_{\mathrm{detect}}(S) := \frac{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \wedge w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}} \}}{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \}}

    \mathrm{rec}_{\mathrm{correct}}(S) := \frac{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \wedge w_s = w_{s,\mathrm{ref}} \}}{\sum_{s \in S} 1\{ w_{s,\mathrm{base}} \neq w_{s,\mathrm{ref}} \}}

    \mathrm{prec}_{\mathrm{correct}}(S) := \frac{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \wedge w_s = w_{s,\mathrm{ref}} \}}{\sum_{s \in S} 1\{ w_s \neq w_{s,\mathrm{base}} \}}    (6.3)

The sequence of slots in the CN is denoted by S, the reference word for slot s by w_{s,ref}, the baseline hypothesis by w_{s,base}, and the classifier hypothesis by w_s. The baseline result depends on the investigated combination approach: for iROVER it is the ROVER result, for iROVER+FE it is the min.hyp-nFE result, and for iCNC and iCN it is the CNC result. Table 6.6 gives the performance obtained with RF as classifier applied to the four systems task; the results for the other setups show the same tendencies. For iROVER, iROVER+FE, and iCNC the precision remains almost constant, whereas the recall decreases. This suggests that the iROVER approaches mostly compensate for errors which are already wiped out by standard CNC. Comparing iCNC and iCN shows that the absolute number of recovered and corrected errors is almost equal for both approaches, but iCN produces many more false positives. Thus, for the tested classifiers and features it helps to apply the ROVER constraint, i.e. to restrict the choice to hypotheses which occurred as the best hypothesis of at least one system. On the other hand, the results indicate that the iCN approach suffers from having more choices, which implies that either the feature set or the modeling is still insufficient.
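The statistics of Equation (6.3) can be computed directly from the per-slot decisions; the following sketch assumes the classifier output, baseline decision, and aligned reference are available as equally long word sequences over the slot sequence S.

    def detection_correction_stats(hyp, base, ref):
        """Per-slot error detection/correction statistics following Eq. (6.3)."""
        changed         = sum(h != b for h, b in zip(hyp, base))
        baseline_errors = sum(b != r for b, r in zip(base, ref))
        detected        = sum(h != b and b != r for h, b, r in zip(hyp, base, ref))
        corrected       = sum(h != b and h == r for h, b, r in zip(hyp, base, ref))
        return {
            "rec_detect":   detected  / baseline_errors if baseline_errors else 0.0,
            "prec_detect":  detected  / changed         if changed         else 0.0,
            "rec_correct":  corrected / baseline_errors if baseline_errors else 0.0,
            "prec_correct": corrected / changed         if changed         else 0.0,
        }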

6.3 Summary

In this chapter classifier based system combination has been introduced. The core idea in classifier based system combination is that lattice-based posterior estimates and the common approximations of the Levenshtein distance have systematic biases which the classifier can learn and compensate for.


Based on ROVER and confusion network combination (CNC) three different approaches to classifier based system combination are developed. The common idea of the approaches is to apply the classifier to the super CN derived from the ROVER or CNC alignment. The CN can be decoded slot-wise and thus the decoding is reduced to a local classification problem. For each slot and each word in the slot a variety of features is computed. The features range from the simple word duration to sophisticated features based on the posterior probabilities derived from the system-dependent lattices. Context information is brought into the classification process by combining the feature vectors of the current and the adjacent slots.

In the iROVER approach the system-dependent Viterbi results are aligned with the ROVER tool. In the decoding step the classifier predicts for each slot which system hypothesized the correct word. In this approach the number of target classes equals the number of systems and is therefore small and fixed. The results of alternative combination methods can be added to iROVER by simply including their output as an additional system. The iCNC approach works similarly to the iROVER approach. The system-dependent CNs are aligned to a super CN and for each slot and system only the word with the highest system-dependent posterior probability is kept. The CNC result is added as an additional system and decoding is performed according to the iROVER approach. In the iCN approach, for each slot in the super CN derived from the CNC alignment only the N words with the highest posterior probabilities are kept. The classifier predicts for each slot the rank of the correct word. For N=2 the approach reduces to a binary decision problem.

For all approaches three classifiers are tested: Boostexter (BT), random forests (RF), and a log-linear model (Maxent). In the experimental results the iROVER and iCNC approaches can slightly improve over the corresponding baseline methods. Overall, RF performs slightly better than BT and both are superior to Maxent. The best results are achieved with iCNC and RF as classifier, beating the standard CNC by 0.2% absolute on a four systems cross-site combination task.


Chapter 7 Log-Linear Model Combination vs. System Combination

The standard log-linear model used in modern speech recognition systems combines the acoustic model and the language model with model-dependent scaling factors. If the combination is used in Viterbi decoding only, no normalization is required and a single scaling factor is sufficient: the language model scale. Equation (7.1) shows the model with LM scale β, where the normalization term Z guarantees a probability distribution over all sentences:

    p_\beta(w_1^N \mid x_1^T) := Z^{-1} \prod_{n=1}^{N} p(w_n \mid x_{t_{n-1}+1}^{t_n}) \, p(w_n \mid w_{n-L}^{n-1})^{\beta}    (7.1)

The model in Equation (7.1) is a special case of the general log-linear model used in speech recognition, which is defined as

    p_\lambda(w_1^N \mid x_1^T) := Z^{-1} \exp\left( \sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i f_i(n; w_1^N, x_1^T) \right).    (7.2)

The feature functions f_i(n; w_1^N, x_1^T) are in the simplest case the negated log probabilities provided by the acoustic and the language model. In practice, the feature functions used in LVCSR, like the negated logarithm of the HMM based acoustic model or the L-gram language model, depend only on the local context given position n. Therefore, the model can be compactly stored as a word lattice.

In the log-linear model combination more knowledge sources are combined into a single log-linear model, usually several acoustic models. In theory, all knowledge sources can be used jointly to produce lattices with I-dimensional scores, where I is the number of knowledge sources. The lattice is represented as a WFST over the log or tropical vector semiring; the connection between transducers over the vector semirings and the log-linear model is discussed in Chapter 3. However, the usage of many knowledge sources during the search is expensive in terms of memory and run-time. Instead, lattices are usually built with a single acoustic and a single language model. Using an appropriate semiring, cf. Section 3.1, the intersection of the lattices from several decoders results in a log-linear combination of the system-dependent knowledge sources. In practice, instead of the intersection the conceptually similar re-scoring is used: lattices are produced with a single acoustic and a single language model and are subsequently re-scored with the additional models. In the discriminative model combination (DMC) the lattices are used to optimize the model-dependent scaling factors for minimum error rate [Beyerlein 2000; Vergyri 2000; Zolnay 2006]. The models in the combination are usually trained independently and the task of the scaling factors in the log-linear model combination is to capture the dependencies between the several models. In order to better describe the interaction between the knowledge sources, several scaling factors per model can be used.

In the following section the log-linear model is extended by word- and pronunciation-dependent scaling factors. The scaling factors are optimized for minimum error rate using the MRT training described in Section 3.7.2, which is eventually DMC with word- and pronunciation-dependent scaling factors. The concrete setup of the scaling factor training is discussed in Section 7.2.1. The approach to word-dependent scaling factors investigated in this section follows [Hoffmeister & Liang+ 2009]. Word- or word class-dependent scaling factors were used before in [Huang & Belin+ 1993; Sarukkai & Ballard 1996]. In the first paper a joint training of the acoustic model, the language model, and the scaling factors is performed. In the latter work word class-dependent scaling factors are used among other techniques in an adaptation step. Neither paper investigates the improvement coming solely from the word-dependent scaling factors. Another approach is applied in [Vergyri & Tsakalidis+ 2000], where an improvement of around 3% relative is reported for a DMC experiment by using scaling factors which depend on classes derived from several acoustic features.
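The following sketch illustrates the log-linear model combination with model-dependent scaling factors in the spirit of Equation (7.2). As a simplification of this sketch, the normalization Z is computed over a given list of candidate sentences instead of over all sentences; feature values are taken to be negated log probabilities, so the minus sign turns them back into scaled log probabilities.

    import math

    def log_linear_posteriors(candidates, scales):
        """Sentence posteriors of a log-linear model combination.
        'candidates' is a list of sentences, each a list of per-word
        feature-function values (negated log probabilities of the combined
        knowledge sources); 'scales' holds one scaling factor per source."""
        log_scores = []
        for words in candidates:
            score = sum(-scales[i] * f_i           # -lambda_i * (-log p_i) = lambda_i * log p_i
                        for word_feats in words
                        for i, f_i in enumerate(word_feats))
            log_scores.append(score)
        z = max(log_scores)                        # log-sum-exp for numerical stability
        total = sum(math.exp(s - z) for s in log_scores)
        return [math.exp(s - z) / total for s in log_scores]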


A comparison of a log-linear model combination with model-dependent scaling factors and ROVER based system combination is performed in [Zolnay 2006], where ROVER outperformed DMC. In Section 7.2.2 the log-linear model combination with and without word-dependent scaling factors and with CN decoding is compared to the CN decoding of the union based lattice combination approach described in Section 3.2.3.

7.1 Log-Linear Model Combination with Word-Dependent Scaling Factors

In this work an extended form of the log-linear model as defined in Equation (7.2) is used, where the scaling factors are made word-dependent. It consists of a set of word level feature functions f_i(w_n; w_1^N, x_1^T) and a corresponding set of word-dependent scaling factors λ_i(w_n):

    p_\lambda(w_1^N \mid x_1^T) := \frac{\exp\left( \sum_{n=1}^{N} \sum_{i=1}^{I} \lambda_i(w_n) f_i(w_n; w_1^N, x_1^T) \right)}{\sum_{v_1^M} \exp\left( \sum_{m=1}^{M} \sum_{i=1}^{I} \lambda_i(v_m) f_i(v_m; v_1^M, x_1^T) \right)}    (7.3)

In the following it is assumed that for each word its pronunciation is known. That is, a word w_n is considered to be a tuple of the orthography orth(w_n) of the word and the pronunciation pron(w_n). The feature functions used are the logarithms of several acoustic models p(pron(w_n) | x_{t_{n-1}+1}^{t_n}), of the pronunciation model p(pron(w_n) | orth(w_n)), of the L-gram language model p(orth(w_n) | orth(w_{n-L}^{n-1})), and a word penalty.

Going from a single scaling factor per model to word-dependent scaling factors is motivated by the following observations, which give reason to assume a word- and pronunciation-dependent interaction between the models.

• Varying discriminative power of the acoustic model: the discriminative power of an acoustic model is usually unsteady across phones and thus across pronunciations.

• Varying discriminative power among different acoustic models: different acoustic front-ends differ in their ability to discriminate among phones.

• Several modeling and training issues of the acoustic model, e.g. the severe independence assumptions and the presumably underestimated variances of the GMMs.

Furthermore, due to the word-dependent scaling factors the training of the model in Equation (7.3) estimates word-dependent pronunciation scores and the word penalty in a discriminative manner.
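A minimal sketch of the numerator of Equation (7.3): each word contributes the sum of its feature values weighted by its word-dependent scaling factors. The representation of the scaling factors as callables mapping a word to a value is an assumption made for this sketch.

    def word_dependent_log_score(sentence, features, scales):
        """Unnormalized log score of a sentence under Eq. (7.3).
        'features[n]' holds the feature values of word n (e.g. the negated
        log scores of the MFCC, PLP, and GT acoustic models, the
        pronunciation model, and the language model); 'scales[i]' maps a
        word to its scaling factor for knowledge source i (the LM keeps a
        single, word-independent scale in the experiments)."""
        log_score = 0.0
        for word, word_feats in zip(sentence, features):
            for i, f_i in enumerate(word_feats):
                log_score += -scales[i](word) * f_i   # negated log probs -> scaled log probs
        return log_score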

7.2 Experiments

Experiments are conducted on the Chinese 230h testing system described in detail in Appendix B. In addition to the 230h speech data for acoustic model training, a separate 120h corpus is created for the estimation of the word-dependent scaling factors. The two training sets do not overlap and have the same ratio between broadcast news and broadcast conversation data. The three acoustic models used in the experiments are based on the MFCC, PLP, and Gammatone filter (GT) based front-ends.

7.2.1 Experimental Setup

The log-linear model combination of the three acoustic models with word-dependent scaling factors is applied in a lattice re-scoring step. Lattices are produced with the MFCC system and are subsequently arc-wise re-scored with fixed word boundaries. The language model scores are taken from the LM used in the decoding pass; a further language model re-scoring of the lattices was omitted. For experiments on character or syllable level the word arcs are first split into character arcs using the time information from an arc-wise forced alignment with the MFCC model.


Table 7.1. Training, tuning (dev07), and test sets. The word-dependent scaling factors are trained on the 120h “λ-training” set. For the first test set no word-segmented transcripts are available.

Corpus       | Duration | Running Words | Running Char.s | Vocabulary Words | Vocabulary Char.s
AM-training  | ∼230h    | 2.4M          | 4.0M           | 42.1K            | 5.3K
λ-training   | ∼120h    | 1.3M          | 2.2M           | 33.7K            | 4.4K
held-out     | 1.5h     | 12.7K         | 21.5K          | 4.4K             | 1.8K
dev07¹       | 2.5h     | 27.5K         | 46.8K          | 5.3K             | 1.9K
eval07       | 1.6h     | -             | 28.1K          | -                | 1.7K
dev08        | 1h       | 10.5K         | 18.2K          | 2.9K             | 1.4K

¹ tuning set

Table 7.2. Lattice re-scoring results with various acoustic models. The lattice sets are generated with the MFCC model and subsequently re-scored with the PLP and the Gammatone (GT) acoustic model, respectively, where the character boundaries are kept fixed. The acoustic models were estimated on the 230h AM training set. Results are character error rates [%CER]; the bracketed numbers show the deletion and insertion fraction.

Acoustic Model | dev07¹            | eval07            | dev08             | held-out
MFCC           | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44 | (2.02/1.19) 10.82
PLP            | (2.66/1.72) 15.19 | (4.44/1.07) 15.41 | (2.78/0.88) 13.90 | (2.22/1.12) 10.82
GT             | (2.71/1.65) 15.62 | (4.55/1.05) 16.15 | (2.76/0.93) 14.11 | (2.11/1.14) 10.74

¹ tuning set

The lattices for the 120h scaling factor training set, for the development set, and for the test set are produced with identical setups. Unfortunately, the language model training data includes both training sets, which results in a much lower perplexity on the 120h scaling factor training set than on the development and test sets. In order to get an idea of how much performance is lost due to the discrepancy between the training and evaluation setting, an additional held-out set is created by removing every hundredth segment from the 120h training set. Table 7.1 summarizes the corpora statistics.

Viterbi decoding results of the re-scoring of the MFCC lattices with the three acoustic models are summarized in Table 7.2. The language model scale is optimized separately for each acoustic model. The MFCC based model clearly outperforms the PLP and GT front-ends, and will be referred to as baseline in the remainder of this chapter.

The 120h training set is not sufficient to reliably estimate a scaling factor for each word. In order to get a robust estimation, only words which occur more often than a cut-off N_min get their own scale. The scaling factors for all other words are tied by a backing-off scale, where the backing-off scaling factor depends on the number of phonemes in the pronunciation of the word:

    \lambda_i(w) := \begin{cases} \lambda_{i,w}, & \text{if } \#w > N_{\min} \\ \lambda_{i,|pron(w)|}, & \text{otherwise} \end{cases}    (7.4)

For experiments on character level only a single backing-off class is used. In order to get an idea of how important the lexical information is, an alternative set of scaling factors is built, where character-dependent scaling factors are tied among equal pronunciations, i.e. syllable classes are built. Table 7.3 shows the number of scaling factors per model for different cut-offs. The vocabulary size is 60K and the table shows that even for 7K word-dependent scaling factors (∼10% vocabulary coverage) a high coverage of 90% of the running words in the development set is achieved. For character- and syllable-dependent scaling factors the coverage is almost complete. For most experiments five models are combined: the three acoustic models, the pronunciation model, and the language model.
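The backing-off tying of Equation (7.4) can be sketched as a simple lookup; the dictionary layout of the scaling factor store and the helper names are assumptions of this sketch.

    def tied_scale(word, knowledge_source, counts, pron_len, scales, n_min):
        """Backing-off tying following Eq. (7.4): words seen more often than
        the cut-off N_min in the 120h training set get their own scaling
        factor; all other words share a fallback scale indexed by the length
        of the word's pronunciation in phonemes."""
        if counts.get(word, 0) > n_min:
            return scales[knowledge_source]["word"][word]
        return scales[knowledge_source]["fallback"][pron_len(word)]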


Table 7.3. Statistics for word-dependent scaling factors on dev07: number of word-dependent scaling factors and coverage of running words for a given cut-off Nmin .

N_min | #classes | Running Words[%]
200   |    997   | 67%
 50   |  3,596   | 83%
 20   |  6,904   | 90%
 10   | 10,911   | 93%
  5   | 16,665   | 96%

The interdependency between the several models is sufficiently described by putting the word-dependent scaling factors on four of the five models. Following the considerations from Section 7.1 the word-dependent scaling factors are put on the acoustic models and the pronunciation model (and on the word penalty, if used). For parameter estimation the minimum risk training (MRT) described in Section 3.7.2 is applied. The objective function is either the smoothed phoneme error (MPE training) or the word error (MWE training) applied on character level. The estimation is done iteratively using Rprop, a gradient-descent algorithm [Riedmiller & Braun 1993]. The implementation of the MPE objective function follows directly [Povey & Woodland 2002]. The objective function applied in MWE training is the confusion network (CN) error computed on character level. The CNs are built from the training set lattices using the arc-cluster CN construction algorithm described in Section 4.4.2. Regularization turns out to be important, similar to the I-smoothing used in [Povey & Woodland 2002] for GHMM training. The objective function for MRT is defined as follows, where L(·,·) denotes the loss function and [x_{r,1}^{T_r}, \tilde{w}_{r,1}^{\tilde{N}_r}]_{r=1}^{R} the training samples:

    F(\lambda) := \frac{1}{R} \sum_{r=1}^{R} \sum_{w_1^N} p_\lambda(w_1^N \mid x_{r,1}^{T_r}) \, L(w_1^N, \tilde{w}_{r,1}^{\tilde{N}_r}) + \frac{C}{2} \|\lambda - \lambda^{(0)}\|_2^2    (7.5)

The initial set of scaling factors λ^(0) is made up of the model-dependent scales derived from a direct error rate minimization on the development set. Thus, the initial LM scaling factors are around one and the acoustic model scaling factors are close to the inverse language model scale (as used in Viterbi decoding, where the acoustic model scale is fixed to one) divided by the number of acoustic models. The scaling factors are optimized until convergence in the objective function occurs and the scaling factors from the last training iteration are taken for decoding. The regularization constant C is optimized on the development set for minimum error rate, which is expensive and therefore is not done in fine-grained steps. For lattice decoding the CN decoder with the arc-cluster CN construction algorithm described in Section 4.4.2 is used, which is consistent with the optimization criterion used for character level MWE training.

In a final set of experiments the log-linear model combination is compared to the modified lattice union approach described in Section 3.2.3, which derives the combined sentence posterior probability as the weighted average of the system-dependent sentence posteriors. For a fair comparison of the log-linear model combination and the union based system combination it is necessary to use equivalent word lattices. In the experiments, a system is simply defined as the log-linear combination of the language model, the pronunciation model, and a single acoustic model. The system-dependent sentence posteriors are computed from the re-scored lattices by setting the scaling factor for all but one acoustic front-end to zero. That is, the lattices and lattice arc scores, i.e. the features, are the same for all experiments. The three sentence posterior distributions are then combined according to Equation (3.14) and the CN decoder is applied. The system weights and model scales are optimized for minimum error rate on the development set. As pointed out in Section 3.7.2, scaling factor optimization via MRT is meaningless for the union based system combination. For all experiments with word-dependent scaling factors the λs optimized for the log-linear model combination are used.
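For illustration, the following sketch shows one sign-based Rprop parameter update as it could be used to minimize the regularized objective F(λ) in Equation (7.5). It implements the variant without weight backtracking (often called iRprop-); the exact variant and hyper-parameter values used in the thesis are not specified here, and the defaults below are the commonly used ones.

    import numpy as np

    def rprop_step(params, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=1.0):
        """One Rprop update: per-parameter step sizes grow while the gradient
        sign is stable and shrink when it flips; only the sign of the
        gradient is used for the parameter update."""
        sign_change = grad * prev_grad
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(sign_change < 0, 0.0, grad)   # skip parameters whose gradient flipped
        params = params - np.sign(grad) * step
        return params, grad, step                     # grad/step are carried to the next iteration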


7.2.2 Results

In the first set of experiments the different tying strategies for the scaling factors are investigated. Table 7.4 assembles the results for word-, character-, and syllable-dependent scaling factors for different cut-off values. The number of classes refers to the number of scaling factors per model; throughout, the language model gets only a single scaling factor. The baseline is the setup using a single scale per model, which is equivalent to the common DMC approach. The best improvement is achieved with 7K word-dependent scaling factors, but the difference among the cut-off values is tiny, and especially for 3K and more scaling factors it might even disappear with a more fine-grained optimization of the regularization constant. The relative improvement in character error rate (CER) is around 2%, a little better for the held-out set where a relative improvement of 3% is observed. On the training set the error rate of the Viterbi decoding is measured and even here the gain is at most around 4% relative. In preliminary experiments with an additional word penalty no further improvements were observed: error rates changed only in the second decimal place.

Figure 7.1 shows detailed results for the training and evaluation of the 7K word-dependent scaling factors, the best performing setup. The left plot shows that the objective function (smoothed phoneme accuracy) improves smoothly and the CER on the training, held-out, and development sets smoothly decreases. The right plot shows again the development set together with the two test sets. Both the Viterbi and the CN results are plotted. The plots for the other setups look rather similar.

The results with character-dependent scaling factors are similar to the word-dependent results, where on eval07 and on the held-out set the improvements are a little smaller. The differences in the word and character level baselines are due to fixing the boundaries of the character arcs with the MFCC model. When re-scoring with the PLP and GT models the character boundaries are not optimal and the results are slightly worse compared to a word arc-wise re-scoring. The results for character level MWE training are a little worse than for MPE, but here again the differences are too small for drawing reliable conclusions. The CN decoder cannot benefit from MWE trained, character-dependent scaling factors: the gap to the corresponding Viterbi results does not widen compared to the experiments using the MPE criterion. The syllable-dependent scaling factors are inferior to the character-dependent ones. The differences are small, but consistent among all test sets.

In the second set of experiments the log-linear model combination is compared with the system combination approach based on the weighted average of the system-dependent sentence posteriors. The results are summarized in Table 7.5. The word-dependent scaling factors are optimized for the log-linear model combination containing the three acoustic models using the MPE criterion. Obviously, the resulting scales cannot be applied directly in a log-linear model using only one of the three acoustic models, because the impact of the acoustic and the language model is no longer balanced. As compensation an additional scaling factor per model is introduced and optimized on the development set. The results for the log-linear combination of a single acoustic model, the pronunciation model, and the language model are shown in the first part of the table. The next two parts show the results for the log-linear model combination and the averaged sentence posterior based system combination. The CN decoding of the averaged sentence posteriors clearly outperforms the CN decoding of the log-linear model combination. Notably, the relative improvement from the word-dependent scaling factors is almost the same for both approaches, even if they are optimized only for the log-linear combination. The picture is completed by the results from the experiments with a single acoustic model, where the relative improvement is in the same range. That is, the log-linear model combination with the three acoustic models cannot benefit from the joint training considering all the acoustic models. The conclusion is that the word-dependent scaling factors presumably do not capture the dependencies between the acoustic models, but solely the interdependency of acoustic and language model.

7.3 Summary

In this chapter the log-linear model combination with word- and pronunciation-dependent scaling factors has been introduced. The goal is to describe within the log-linear model the interaction between the combined, but independently trained knowledge sources. The scaling factors are optimized for minimum error rate using the training method described in Section 3.7.2.


Table 7.4. CN-decoding results for the log-linear model combination using word-, character-, and syllable-dependent scaling factors. The scaling factors are trained on 120h using either minimum phone error (MPE) or minimum character error (MWE) training. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model.

Criterion | #classes (cut-off) | dev07¹            | eval07            | dev08             | held-out
baseline  |                    | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44 | (2.02/1.19) 10.82
word-dependent scaling factors
MPE       |      1             | (2.72/1.47) 13.94 | (4.53/0.88) 14.51 | (2.85/0.72) 12.73 | (2.28/0.88) 9.76
          |    997 (200)       | (2.81/1.38) 13.80 | (4.59/0.75) 14.26 | (2.86/0.69) 12.56 | (2.54/0.76) 9.59
          |  3,596 ( 50)       | (2.80/1.40) 13.76 | (4.57/0.75) 14.23 | (2.84/0.69) 12.57 | (2.61/0.74) 9.54
          |  6,904 ( 20)       | (2.81/1.40) 13.73 | (4.60/0.75) 14.19 | (2.84/0.69) 12.51 | (2.56/0.73) 9.43
          | 10,911 ( 10)       | (2.80/1.40) 13.73 | (4.57/0.76) 14.28 | (2.81/0.71) 12.53 | (2.55/0.75) 9.57
          | 16,665 (  5)       | (2.82/1.39) 13.74 | (4.59/0.77) 14.25 | (2.80/0.69) 12.46 | (2.58/0.74) 9.52
character-dependent scaling factors
MPE       |      1             | (2.70/1.50) 13.95 | (4.52/0.89) 14.59 | (2.69/0.76) 12.63 | (1.99/0.90) 9.77
          |  2,708 ( 20)       | (2.71/1.42) 13.79 | (4.57/0.81) 14.37 | (2.76/0.70) 12.40 | (2.37/0.80) 9.54
          |  3,707 (  5)       | (2.72/1.42) 13.80 | (4.56/0.81) 14.37 | (2.76/0.70) 12.40 | (2.37/0.80) 9.55
MWE       |      1             | (2.69/1.50) 13.95 | (4.52/0.89) 14.60 | (2.71/0.76) 12.64 | (1.99/0.90) 9.77
          |  2,708 ( 20)       | (2.87/1.36) 13.84 | (4.72/0.76) 14.42 | (2.88/0.69) 12.55 | (3.04/0.71) 9.77
          |  3,707 (  5)       | (2.87/1.36) 13.83 | (4.72/0.77) 14.42 | (2.87/0.69) 12.50 | (3.03/0.71) 9.78
syllable-dependent scaling factors
MPE       |      1             | (2.70/1.50) 13.95 | (4.52/0.89) 14.59 | (2.69/0.76) 12.63 | (1.99/0.90) 9.77
          |  1,064 ( 20)       | (2.72/1.43) 13.79 | (4.59/0.82) 14.51 | (2.81/0.71) 12.40 | (2.36/0.78) 9.60
MWE       |      1             | (2.69/1.50) 13.95 | (4.52/0.89) 14.60 | (2.71/0.76) 12.64 | (1.99/0.90) 9.77
          |  1,064 ( 20)       | (3.01/1.33) 13.91 | (4.78/0.76) 14.61 | (3.01/0.70) 12.46 | (3.18/0.66) 9.84

¹ tuning set

[Two plots of error rate against the training iteration (0-25): the left plot shows %CER for the training, held-out, and development sets (Viterbi and CN decoding) together with the objective function (phoneme accuracy); the right plot shows %CER for the development set and the two test sets (Viterbi and CN decoding).]

Figure 7.1. Results for the log-linear model combination for 25 training iterations and 6,904 word-dependent scaling factors. The word-dependent scaling factors are trained on 120h. The left plot shows the objective function and character error rates for the training set, the held-out set, and the development set. The right plot shows the progression of the error rates for the development set and the two test sets.


Table 7.5. CN-decoding results for log-linear model combinations and for a system combination using the weighted average of sentence posteriors. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model.

Acoustic Model(s) | #classes | dev07¹            | eval07            | dev08
baseline          |          | (2.60/1.64) 14.91 | (4.40/1.01) 15.45 | (2.69/0.89) 13.44
model combination with one acoustic model
MFCC              |     1    | (2.86/1.51) 14.79 | (4.58/0.92) 15.18 | (2.84/0.80) 13.32
                  | 6,904    | (2.88/1.44) 14.44 | (4.64/0.81) 15.07 | (2.86/0.70) 12.98
PLP               |     1    | (2.92/1.55) 15.07 | (4.64/0.88) 15.20 | (3.07/0.80) 13.67
                  | 6,904    | (3.00/1.36) 14.71 | (4.73/0.84) 15.05 | (3.08/0.75) 13.54
GT                |     1    | (2.89/1.57) 15.47 | (4.68/0.94) 15.97 | (2.89/0.91) 14.05
                  | 6,904    | (3.00/1.41) 15.20 | (4.81/0.84) 15.87 | (3.02/0.80) 13.68
log-linear model comb.
MFCC+PLP+GT       |     1    | (2.75/1.48) 14.01 | (4.52/0.88) 14.51 | (2.85/0.72) 12.71
                  | 6,904    | (2.82/1.39) 13.69 | (4.61/0.75) 14.18 | (2.82/0.69) 12.52
avg. sentence posteriors
MFCC+PLP+GT       |     1    | (2.88/1.37) 13.72 | (4.62/0.74) 14.13 | (2.86/0.72) 12.44
                  | 6,904    | (2.94/1.25) 13.35 | (4.72/0.63) 13.89 | (2.96/0.64) 12.13

¹ tuning set

In this work three acoustic models, a pronunciation model, and a language model are combined for a Chinese task. The training set for the word-dependent scaling factors consists of 120h, which is separated from the 230h used for acoustic model training. Many words of the 60K vocabulary occur only infrequently or not at all in the 120h training set and no reliable word-dependent scaling factors can be estimated for them. Those words are tied into a set of fallback classes. Each fallback class has its own scaling factor, where the fallback class for a particular word depends on the length of the word's pronunciation counted in number of phonemes. An alternative approach applicable to Chinese is investigated, where the words are split into characters and character-dependent scaling factors are used. In the experimental results the word-dependent scaling factors performed better than the character-dependent scales and a small but consistent gain in error rate is observed. The error rate is decreased by around 2% relative for all tasks. In the final set of experiments the log-linear model combination is compared to the system combination via the modified lattice union, cf. Section 3.2.3. The union based approach clearly outperforms the log-linear model combination for model- and for word-dependent scaling factors. Notably, the same relative gain from the word-dependent scaling factors is observed for both combination approaches, even though the word-dependent scaling factors are solely optimized for the log-linear model combination.


Chapter 8 Scientific Contributions

The goal of this work has been to investigate Bayes risk decoding techniques and system combination in the Bayes risk decoding framework for LVCSR systems. This work contains the following contributions which cover different aspects of Bayes risk decoding and system combination:

Development of a unified view on system combination. A unified view on system combination in the Bayes risk decoding framework has been developed, which covers most of the common approaches to system combination applied in state-of-the-art LVCSR systems. The log-linear model used in modern LVCSR systems has a natural representation as a weighted finite state transducer (WFST) over a vector semiring, in which the scaling factors of the log-linear model are part of the semiring. An arc label in the WFST is a single word and an arc weight is the vector of the values of word-wise feature functions. The vector usually consists of two scores, the score from the acoustic model and the language model score. Thus, in combination with time stamps assigned to the states, the WFST defines a word lattice, where the probabilities derived from the lattice follow the log-linear model. Context information like the language model history or cross-word boundaries is preserved in the WFST topology. The log-linear model combination corresponds to a WFST intersection (or to the conceptually similar arc-wise re-scoring of the WFST). Path and sentence posterior probabilities are derived directly from the single log-linear model. The common alternative used in system combination methods like confusion network combination is to compute the weighted average of the system-dependent sentence posterior probabilities, where the sentence posteriors are derived from the system-dependent lattices. The combination via the averaged system-dependent posteriors has its interpretation in the WFST framework as a slightly modified lattice union. In the first combination method all system-dependent log-linear models are combined into a super log-linear model from which sentence posteriors are derived. In the second method system-dependent sentence posteriors are derived from the system-dependent log-linear models. The sentence posteriors are subsequently combined in a linear manner. Intersection and modified union implement the two common approaches for estimating sentence posterior probabilities from a set of word lattices. The lattice combination itself is accomplished by generic transducer operations which combine the system-dependent lattices into a single super lattice, either based on the lattice intersection or the lattice union. Thus, the combination and the decoding problem are separated and system combination is reduced to a single lattice decoding problem. The lattice decoding is formulated in the Bayes risk framework, where the posterior probabilities are provided by the lattice. The common loss function for Bayes risk decoding for LVCSR tasks is the Levenshtein distance. The computation of the Bayes risk hypothesis from an LVCSR lattice with the Levenshtein distance as loss function is computationally prohibitive and in practice the Levenshtein distance is replaced by an approximation. In this work a classification for loss functions which aim at approximating the Levenshtein distance has been developed. The classes are based on the degree of locality of the approximations.
Two classes of local loss functions have been derived which cover the common approximations used in LVCSR tasks, and for these two classes efficient Bayes risk decoders have been developed. The theoretical investigations show that the computation of the Bayes risk hypothesis from the union based combination is more efficient if a local loss function is used rather than the sentence error. The generic Bayes risk decoder covers a variety of known approaches to system combination including the discriminative model combination (DMC) and the confusion network combination (CNC). In the confusion network combination the lattices are first transformed into CNs which are subsequently combined into a super CN. In this work it has been shown that the CNC decoding rule can also be expressed as a Bayes risk decoding of the lattice union with an appropriate cost function. Furthermore, it has been shown that the CNC cost function is optimal in terms of Bayes risk decoding under the constraint that the system-dependent alignments can be expressed as CNs.


The relation between CNC and ROVER has been established and ROVER with confidence voting has been developed as an approximation of the CNC. The experimental results show that lattice-based system combination improves over the decoding of the best single lattice for all investigated combination approaches and loss functions. The best results are achieved for the lattice union based Bayes risk decoder with either the CN distance or the symmetrically normalized frame error as loss function, where especially the CNC shows a small advantage in the cross-site combination tasks. ROVER degrades only slightly in error rate compared to CNC. For intra-site combination experiments the improvements are around 10% relative compared to the best single system's Viterbi result and more than 20% relative for the cross-site combination task.

Investigations on the local cost functions used in Bayes risk decoding. In this work the common approximations to the Levenshtein distance used in LVCSR tasks have been compared for Bayes risk decoding of word lattices. Improved, but still efficiently computable loss functions have been developed based on an analysis of the drawbacks of the common approximations. The investigated loss functions include the CN distance, the frame error, and Povey's popular cost function for discriminative acoustic model training. The Bayes risk decoders with the common frame error based cost and Povey's cost show a strong deletion bias. A further analysis of the frame error based cost has revealed that the major reason is the normalization. In particular, it has been shown that the standard normalization of the frame error used for Bayes risk decoding ignores deletions. A modified version has been proposed which shows a lower deletion ratio and outperforms the original frame error based approach. Likewise, a modified version of Povey's cost has been developed, which successfully compensates for the deletion bias. Both modifications are parametrized and thus allow a direct tuning of the deletion ratio. In the experimental results the modified loss functions improve over the original versions and are competitive with or on some tasks even slightly better than the CN decoder, i.e. the Bayes risk decoder with the CN distance as loss function.

Investigations on confusion networks. The common algorithms for constructing confusion networks from word lattices are based on heuristics and require a careful parameter tuning. The most common approaches are based on a direct arc clustering. Alternative algorithms do a fast state clustering by exploiting the topology of the word lattice, followed by a subsequent arc clustering. In this work two implementations of CN construction algorithms based on the arc and the state clustering have been developed. The arc clustering algorithm proved to work fast and robustly over a wide range of systems and conditions. Though the main concept is inspired by existing approaches, the concrete algorithm is new. The state clustering algorithm follows the implementation of [Xue & Zhao 2005], but the experimental results show that their approach is inferior to the arc clustering algorithm. A modified version has been developed, which improves over the original algorithm and proved to be competitive with the direct arc clustering approach. Both algorithms are parametrized and careful parameter tuning is required for optimal performance.
A new approach to lattice-based CN construction has been developed which is conceptually simple and parameter-free. The algorithm is based on frame-wise word posterior probabilities and proved to be competitive with, or on some tasks even better than, the two competing algorithms, though it is significantly slower.

The sentence posterior probabilities derived from word lattices are only estimates of the true posteriors. The structure of the CN allows the sentence posteriors to be broken down into word posteriors and to be compared with the empirical posterior estimates for a given development set. In this work a warping function has been applied to the slot-wise word posterior probability distributions defined by the CN in order to bring them closer to the true probability distributions. The technique is especially interesting for cross-site CN combinations, where it is to be expected that the system-dependent posterior estimates show different biases. In the experimental evaluation on a cross-site combination task the warping reduces the error rate, whereas for an intra-site combination almost no effect on the error rate is observed. However, in both cases the warping function significantly improves the quality of the posterior probability based confidence scores, measured in terms of the normalized cross-entropy.
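To make the slot-wise view concrete, the following is a minimal sketch of warping and decoding a CN, assuming a CN is represented as a list of slots, each slot a dictionary mapping a word (None for the empty word) to its posterior. The power-law warp with exponent gamma is only a stand-in for the warping function estimated on a development set in this work; the actual function and data structures may differ.

```python
# Minimal sketch of slot-wise CN posterior warping and CN decoding.
# Assumptions (not taken from the thesis): a CN is a list of slots, each slot a
# dict mapping a word (None = empty word) to its posterior; the warping is a
# simple power-law with exponent gamma, standing in for the warping function
# tuned on a development set.

def warp_slot(slot, gamma=1.0):
    """Apply a power-law warping to one slot and renormalize."""
    warped = {w: p ** gamma for w, p in slot.items()}
    z = sum(warped.values())
    return {w: p / z for w, p in warped.items()}

def cn_decode(cn, gamma=1.0):
    """Pick the word with the highest (warped) posterior in every slot;
    slots won by the empty word are skipped (deletion)."""
    hyp = []
    for slot in cn:
        slot = warp_slot(slot, gamma)
        best_word = max(slot, key=slot.get)
        if best_word is not None:
            hyp.append(best_word)
    return hyp

# Example: two slots, the second one is dominated by the empty word.
cn = [{"a": 0.7, "the": 0.2, None: 0.1},
      {"cat": 0.4, None: 0.6}]
print(cn_decode(cn, gamma=0.8))  # -> ['a']
```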


Furthermore, in this work the connection between the CN distance and the Levenshtein distance has been explored. The lattice-based CN construction algorithms work heuristically and no assumption about the resulting alignment can be made. However, experiments indicate that the CN alignment is a close approximation of the Levenshtein alignment. The idea is to use the CN alignment as a starting point from which the Levenshtein alignment is reached. An approximate Bayes risk decoder with the windowed Levenshtein distance as loss function and the according dynamic programming equations have been developed. Time and space requirements of the decoder are polynomial in the size of the window. The windowed Levenshtein distance can be initialized with any CN alignment, and it has been shown that for a window size of one the result is the common CN decoding rule. For any initial CN alignment and a sufficiently large window the decoder passes into the Bayes risk decoder with the exact Levenshtein distance as loss function. Unfortunately, the approximations made in the windowed Levenshtein distance based Bayes risk decoder prevent the approximate Bayes risk from decreasing monotonically with increasing window size. Though of theoretical interest, in the experimental evaluation the windowed Levenshtein decoder could not gain over the CN decoder in terms of error rate.

Development of a new approach to system combination. The common system combination approaches formulated in the Bayes risk decoding framework have two major drawbacks. The first is the approximation of the Levenshtein distance and the second is the blind reliance on the posterior probability estimates derived from the word lattices. In this work an approach has been introduced and analyzed which aims at overcoming both problems: classifier based system combination. The experimental results show that under some conditions the classifier approach can clearly outperform the standard approach. However, compared to the best performing common methods for system combination the classifier based approach gains only little. In the experiments several setups, feature sets, and classifiers have been compared.

Investigations on the log-linear model combination. The log-linear model combination is a common approach in speech recognition for combining several knowledge sources. It can be used as a means of system combination instead of approaches like CNC or ROVER. A common choice for a system combination setup is to build several systems which differ only in their acoustic front-end. The combination happens by averaging the weighted posterior probabilities derived from the several systems. Instead, in the log-linear model combination only a single system is built by combining the acoustic models derived from the several acoustic front-ends into a single log-linear model from which the posterior probabilities are computed. In this work the performance of both combination approaches, applied in the Bayes risk decoding framework with the CN distance as loss function, has been compared experimentally. The combination approach based on separate systems clearly outperforms the log-linear model in terms of error rate. The second study introduces word-dependent scaling factors. Instead of using a single scaling factor per knowledge source, the scales are made word- and knowledge source-dependent. The experimental results show a small but consistent improvement in error rate.
Again, the single log-linear model approach has been compared to the approach based on the averaged system-dependent posteriors, where in both approaches word-dependent scaling factors are applied. The results show that both approaches benefit from the word-dependent scales to the same degree, and the log-linear model combination remains inferior. An illustrative form of the log-linear combination with word-dependent scales is sketched below.
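The following display shows a generic form of the two variants discussed above, with a single scale per knowledge source and with model- and word-dependent scales. The symbols $q_i$ and $\lambda_{i,w}$ are generic placeholders; the exact parametrization used in this work may differ.

$$
p_{\lambda}(w_1^N \mid x_1^T) \;=\;
\frac{\prod_{i=1}^{I} q_i(w_1^N, x_1^T)^{\lambda_i}}
     {\sum_{v_1^M} \prod_{i=1}^{I} q_i(v_1^M, x_1^T)^{\lambda_i}}
\qquad\text{(one scale per knowledge source)}
$$

$$
p_{\lambda}(w_1^N \mid x_1^T) \;\propto\;
\prod_{i=1}^{I} \prod_{n=1}^{N} q_i(w_n, x_1^T)^{\lambda_{i,w_n}}
\qquad\text{(model- and word-dependent scales)}
$$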


Chapter 9 Outlook

In this thesis a unified view on system combination in the Bayes risk decoding framework has been developed. Several aspects of system combination and Bayes risk decoding for speech recognition have been investigated. The combination approaches are able to improve over the best single system by up to 20% relative. However, the oracle error rates for lattices and confusion networks (even with a single hypothesis per system, as in ROVER) indicate a large potential for further improvements. In particular, none of the sophisticated combination techniques was able to considerably outperform the simple ROVER approach with word confidence scores. From these considerations the following theoretical and experimental questions remain open and may serve as a starting point for further research:

Bayes risk decoding.
• How much improvement can be expected from lattice-based Bayes risk decoding using the Levenshtein distance instead of the sentence error as loss function? This is the question of the general potential of word error instead of sentence error minimization in speech recognition under the constraint that the unmodified lattice-based posterior probability estimates are used. The follow-up question is: how close does Bayes risk decoding for LVCSR tasks with any suitable approximation get to the decoder with the exact Levenshtein distance? First experiments with the windowed Levenshtein distance initialized by a confusion network (CN) alignment were rather disappointing, because a more accurate error approximation did not immediately yield a lower error rate. However, the experimental results indicate that the windowed Levenshtein distance with a small symmetric window (a size of three or five seems to be sufficient) is a good candidate for a very close approximation of the exact Levenshtein distance. The experiments might give a starting point for further theoretical and experimental investigations.
• According to the experimental results presented in this thesis, none of the investigated approximate Levenshtein distances is superior for all systems and under all conditions. The question is whether one of the approximations is superior on a broader range of systems and conditions, or whether an even better, efficiently computable approximation exists.
• Several approaches tried to deal with the unreliability of the lattice-based posterior probabilities. So far, no approach could considerably outperform the plain probability estimates derived directly from the lattice, and the remaining question is: does a better approach exist to model and compensate for the bias in the lattice-based posterior estimates with the objective of reducing the error rate?

System combination techniques.
• The simple ROVER approach performs amazingly well and is hardly beaten by the sophisticated combination techniques. We still lack a good understanding of why ROVER performs that well. A good starting point might be the view of ROVER as CNC with pruning (a minimal sketch of ROVER-style confidence voting is given at the end of this chapter). Then the question is: when do search errors occur due to pruning and can the error be bounded? In other words, can we explain the ROVER performance by showing that even a heavily pruned CNC makes almost no search errors?
• The classifier based approaches to system combination are only at their beginning. There exist several possible extensions which might boost the performance. The first idea is to apply classifiers which consider the context of the complete sentence, like conditional random fields.


The second direction is the features, which are so far derived only from the lattices. The classifier based approach offers a simple way to bring additional knowledge sources into the combination process. The question is: do better classifiers and better feature functions for classifier based system combination exist?
• The interaction between cross-adaptation and lattice-based system combination has not yet been systematically explored. In fact, so far there is only intuition but no true understanding of why and how cross-adaptation improves the error rate.
• A remaining issue is how to generate ASR systems such that they are optimal for system combination performance. A few approaches have been explored in [Breslin & Gales 2006, 2007a; Willett & He 2008], but none gave a considerable improvement. The question is: can a deeper analysis of the combination techniques yield a better algorithm for estimating complementary systems?

Confusion networks.
• The set of alignments stored in a confusion network is restricted and, in general, cannot express the Levenshtein alignments between all sentence pairs in the lattice. The question is: how severe is this restriction in practice?
• All lattice-based confusion network construction algorithms use heuristics to estimate the alignments. Ideally, the algorithm finds the CN which minimizes the Bayes risk with the CN error as loss function. The question is: does an efficiently computable algorithm exist which finds the optimal CN?
• The center-frame CN construction algorithm introduced in Section 4.4.4 shows some nice properties and is competitive in error rate with, or even better than, the standard algorithms. However, the algorithm is based on heuristics and, so far, it is slower than the common CN algorithms based on a direct arc clustering. Can the heuristics of the center-frame algorithm be further improved, and can the construction be sped up?
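To make the ROVER view referred to above concrete, the following is a minimal sketch of ROVER-style voting with word confidence scores. It assumes the system outputs are already aligned into slots; each slot is a list of (word, confidence) pairs, one per system, with None marking the empty word. The data layout and the interpolation weight alpha are illustrative, not the exact setup used in this work.

```python
# Minimal sketch of ROVER-style voting with word confidence scores.
# Assumptions (not from the thesis): the hypotheses are already aligned into
# slots; alpha interpolates between relative vote frequency and average
# word confidence.

from collections import defaultdict

def rover_vote(slots, alpha=0.5):
    hyp = []
    for slot in slots:
        votes = defaultdict(int)
        conf_sum = defaultdict(float)
        for word, conf in slot:
            votes[word] += 1
            conf_sum[word] += conf
        n_sys = len(slot)

        def score(word):
            freq = votes[word] / n_sys          # relative vote count
            avg_conf = conf_sum[word] / votes[word]
            return alpha * freq + (1.0 - alpha) * avg_conf

        best = max(votes, key=score)
        if best is not None:                    # empty word wins -> deletion
            hyp.append(best)
    return hyp

# Three systems voting on two slots.
slots = [[("a", 0.9), ("a", 0.6), ("the", 0.7)],
         [(None, 0.8), ("cat", 0.5), (None, 0.6)]]
print(rover_vote(slots, alpha=0.5))  # -> ['a']
```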


Appendix A The Deletion Bias in LVCSR Decoding

The optimization of an ASR system for minimum word error rate (WER), the standard evaluation measure for LVCSR tasks, biases the system towards producing deletions. The main insight is: for an LVCSR system it is preferable to discard a word with a low confidence rather than to risk an insertion. The remainder proves this intuition.

Let $w_1^N$ be the hypothesis, let $\tilde w_1^{\tilde N}$ be the reference, and let $A = [(k_1, l_1), (k_2, l_2), \ldots, (k_M, l_M)]$ denote the Levenshtein alignment between hypothesis and reference. The interpretation of the alignment is that hypothesis word $w_{k_m}$ and reference word $\tilde w_{l_m}$ are aligned, where $k_m$ or $l_m$ (but not both) can be zero and $w_0$ equals the empty word $\varepsilon$, i.e. the alignment position is an insertion or a deletion. Let us assume the following cost function:
$$
c(w, \tilde w) :=
\begin{cases}
c_{\mathrm{cor}}, & \text{for } w = \tilde w\\
c_{\mathrm{sub}}, & \text{for } w \neq \tilde w \wedge w \neq \varepsilon \wedge \tilde w \neq \varepsilon\\
c_{\mathrm{ins}}, & \text{for } \tilde w = \varepsilon\\
c_{\mathrm{del}}, & \text{for } w = \varepsilon
\end{cases}
\tag{A.1}
$$
For the standard Levenshtein distance $c_{\mathrm{cor}} = 0$ and $c_{\mathrm{sub}} = c_{\mathrm{ins}} = c_{\mathrm{del}} = 1$ holds. Given the Levenshtein alignment, the cost function for the Levenshtein distance, and a probability distribution over the hypothesis space, the expectation of the Levenshtein distance is given by

$$
\mathrm{E}\bigl[\mathrm{Lev}(w_1^N, \tilde w_1^{\tilde N})\bigr]
= \mathrm{E}\Bigl[\sum_{m=1}^{M} c(w_{k_m}, \tilde w_{l_m})\Bigr]
= \sum_{m=1}^{M} \mathrm{E}_m\bigl[c(w_{k_m}, \tilde w_{l_m})\bigr],
\tag{A.2}
$$

where the expectation is computed over all sentences $w_1^N$ and the according posterior probability $\mathrm{Pr}(w_1^N \mid x_1^T)$. Under the assumption of a fixed alignment, further investigations can be done alignment position-wise. The expected cost at position $m$ is given by

$$
\mathrm{E}_m\bigl[c(w, \tilde w)\bigr]
= \mathrm{Pr}_m(w \neq \tilde w, w \neq \varepsilon, \tilde w \neq \varepsilon \mid x_1^T)
+ \mathrm{Pr}_m(\tilde w = \varepsilon \mid x_1^T)
+ \mathrm{Pr}_m(w = \varepsilon \mid x_1^T).
\tag{A.3}
$$

The question of interest is now: when is it advantageous to delete $w$ at position $m$, i.e. to replace $w$ by the empty word $\varepsilon$? The expectation for setting $w$ to $\varepsilon$ is given by

$$
\mathrm{E}_{m,\, w \to \varepsilon}\bigl[c(w, \tilde w)\bigr]
= \mathrm{Pr}_m(w = \tilde w \mid x_1^T)
+ \mathrm{Pr}_m(w \neq \tilde w, w \neq \varepsilon, \tilde w \neq \varepsilon \mid x_1^T)
+ \mathrm{Pr}_m(w = \varepsilon \mid x_1^T).
\tag{A.4}
$$

An insertion cannot happen anymore, but an error occurs if $w$ equals the correct word $\tilde w$. A comparison of Equation (A.3) and Equation (A.4) shows when it is advantageous for the system to replace $w$ by the empty word $\varepsilon$, i.e. to delete $w$:
$$
\begin{aligned}
\mathrm{E}_{m,\, w \to \varepsilon}\bigl[c(w, \tilde w)\bigr] &< \mathrm{E}_m\bigl[c(w, \tilde w)\bigr]\\
\Leftrightarrow\quad \mathrm{Pr}_m(w = \tilde w \mid x_1^T) &< \mathrm{Pr}_m(\tilde w = \varepsilon \mid x_1^T)\\
\Leftrightarrow\quad \mathrm{Pr}_m(w = \tilde w \mid w \neq \varepsilon, x_1^T) &< \mathrm{Pr}_m(\tilde w = \varepsilon \mid w \neq \varepsilon, x_1^T)
\end{aligned}
\tag{A.5}
$$

The result in words: if the risk of an insertion is higher than the probability of the word being correct, then it is better to discard the word. Thus, a system optimized for minimum WER will have a (slight) deletion bias.


The above result can be used in a post-processing step after the search: simply delete all words from the hypothesis for which Equation (A.5) is fulfilled. In practice, the estimate for $\mathrm{Pr}_m(w = \tilde w \mid w \neq \varepsilon, x_1^T)$ is the word confidence score $\mathrm{conf}(w_{k_m})$ for hypothesis word $w_{k_m}$. The probability for an insertion at position $m$ can be roughly estimated as
$$
\mathrm{Pr}_m(\tilde w = \varepsilon \mid w \neq \varepsilon, x_1^T)
\approx \frac{\mathrm{ins}(A_{\mathrm{best}})}{\tilde N}
\approx \frac{\mathrm{ins}(A_{\mathrm{best}})}{N},
$$

where $A_{\mathrm{best}}$ is the alignment of the decoder output and the reference and $\mathrm{ins}(A_{\mathrm{best}})$ counts the number of insertions in the alignment. That is, the probability for an insertion is simply approximated by the insertion ratio in the WER computed between the original hypothesis $w_1^N$ and the reference $\tilde w_1^{\tilde N}$. Theoretically, by deleting words in the decoding output, i.e. by replacing words in $w_1^N$ with $\varepsilon$, the Levenshtein alignment to the reference $\tilde w_1^{\tilde N}$ can change, which is not considered in the above analysis. However, the error rate will presumably benefit from the new alignment: let us assume a low-confidence word which was actually aligned to a reference word is replaced by $\varepsilon$, and let us further assume that one of the adjacent hypothesis words causes an insertion in the current alignment; then in the new alignment only a substitution appears, but not the insertion anymore.

The practical use of the post-processing algorithm is very limited. For the systems investigated in this work the insertion ratio is small (well below 10%), only few words have such a small confidence score, and confidence scores in this range are usually not very reliable. In one experiment the approach was applied to a task with a rather high insertion ratio (almost 10%) and all words with a confidence score lower than a given threshold were discarded. The experiment was done for a Chinese task and the threshold was chosen empirically for minimum character error rate (CER). For thresholds between 0.3 and 0.4 an improvement in CER was observed. Experiments on common tasks with a low insertion ratio showed only a slight improvement, if at all, at the price of a highly increased deletion ratio.
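The following is a minimal sketch of this post-processing rule, assuming per-word confidence scores are available and the insertion prior has been estimated on a development alignment as described above. Function and variable names are illustrative.

```python
# Minimal sketch of the confidence-based deletion post-processing suggested by
# Equation (A.5): drop a hypothesis word if its confidence is below the
# estimated insertion probability (or an empirically tuned threshold).
# Assumptions (not from the thesis): hypotheses are lists of (word, confidence)
# pairs and the insertion prior is estimated on a development set.

def estimate_insertion_prior(num_insertions, num_hyp_words):
    """Rough estimate ins(A_best) / N used in the text."""
    return num_insertions / max(num_hyp_words, 1)

def prune_low_confidence_words(hypothesis, threshold):
    """Delete all words whose confidence falls below the threshold."""
    return [(w, c) for w, c in hypothesis if c >= threshold]

# Example: a development set with 950 hypothesis words and 85 insertions.
threshold = estimate_insertion_prior(num_insertions=85, num_hyp_words=950)
hyp = [("hello", 0.95), ("uh", 0.04), ("world", 0.88)]
print([w for w, _ in prune_low_confidence_words(hyp, threshold)])
# -> ['hello', 'world']
```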


Appendix B Corpora and Systems

Experiments have been performed on four setups with a total of 19 subsystems. Two systems were built for the Chinese track of the GALE project: a testing system used for fast technology tests and the RWTH Aachen GALE 2008 evaluation system. The Chinese corpora and systems are introduced in Section B.1. Further system combinations are done within the RWTH Aachen English TC-Star/EPPS 2007 evaluation system. For the same evaluation, word lattices were provided by four project partners and extensively used for cross-site combination experiments. The corpora used in the English track of the TC-Star/EPPS 2007 evaluation, the RWTH Aachen evaluation system, and the cross-site combination setup are described in Section B.2.

B.1 Chinese GALE Systems

The systems were developed as part of the participation of the RWTH Aachen in the Global Autonomous Language Exploitation (GALE) project [Hoffmeister & Plahl+ 2007; Plahl & Hoffmeister+ 2008b, 2009]. The goal of the GALE program is to provide the technology for translating and analyzing huge volumes of speech and text in multiple languages. A particular sub-task is the transcription of Chinese broadcast news (BN) and broadcast conversations (BC). Training and tuning/testing data is provided within the project and is summarized in Table B.1. The complete 1,600 hours of training data consist of the Hub4 and TDT4 data and the GALE data releases Y1Q1-4, P2R1-2, P3R1-2, and P4R1. Hub4 consists of 30h of carefully transcribed BN data. The 120h of TDT4 BN data come with closed captions and the GALE data releases with quick transcriptions¹. The 230h training set is a subset made up of the Hub4 data and 100h BN and 100h BC data taken from the GALE releases. Two systems are used for the experiments, each system consisting of several subsystems. The first system, described in Section B.1.1, is trained on the 230h training set and is used for technology testing and analysis. Section B.1.2 describes the latest RWTH Aachen Chinese system used in the GALE 2008 evaluation. Both systems share the same pronunciation dictionary, word list, and language model. The derivation of the pronunciation dictionary is described in detail in [Plahl & Hoffmeister+ 2008b]. Word list and language model are kindly provided by SRI/University of Washington (UW) and are equivalent

¹ The data is provided by LDC and at least the Chinese Hub4 and TDT4 data is publicly available at http://ldc.upenn.edu; the GALE data releases are not yet publicly available.

Table B.1. Corpora statistics for the Chinese GALE systems.

Corpus                        #Segments    #Words    Audio data [h]
training         230h         206K         2.4M      230
                 1600h        1.3M         15.5M     1,600
tuning/testing   dev07        1,655        27.5K     2.5
                 eval07       1,013                  1.6
                 dev08        618          10.5K     1.0



Table B.2. Subsystems in the Chinese 230h testing system.

Name     Acoustic Front-End    Randomized CART
s1       MFCC                  no
s1.r1    MFCC                  yes
s1.r2    MFCC                  yes
s2       PLP                   no
s2.r1    PLP                   yes
s3       GT                    no
s3.r1    GT                    yes

to the ones used in the SRI/UW GALE evaluation systems [Hwang & Peng+ 2007; Lei & Wu+ 2009]. The word list contains 60K words and the language model is a large 4-gram. A pruned version of the LM is used in the recognition runs and the full 4-gram is applied in a subsequent lattice re-scoring step.

B.1.1 The Chinese 230h Testing System

The 230h testing system consists of seven subsystems, all maximum likelihood (ML) trained on the 230h training set; dev07 is used for parameter tuning. The subsystems vary in their acoustic front-end and some use a randomized phonetic decision tree (randomized CART). The following list gives an overview of the training setup and decoding structure for a single subsystem; a detailed discussion can be found in [Plahl & Hoffmeister+ 2008b].
• 3 × 1-state HMMs
• across-word acoustic model
• state-tying via (randomized) phonetic decision tree
• 4,501 mixtures with a total of 1.1M Gaussian densities
• 16 dimensional acoustic features
• LDA on 9 adjacent input frames (16 × 9 = 144 input features), reduced to 45 dimensions
• 1 tone feature including first and second derivatives
• 60K vocabulary
• 4-gram LM (PP_dev07 = 367)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: ML trained SAT/CMLLR acoustic model, MLLR
• 3. decoding pass: lattice re-scoring with full LM
The randomization of the phonetic decision tree follows the approach described in [Dietterich 2000b] and was first applied to speech recognition in [Siohan & Ramabhadran+ 2005]. Table B.2 lists the resulting seven subsystems which are used in the various system combination experiments. In the experiments systems with different acoustic front-ends and with randomized phonetic decision trees are combined. In particular, the experiments shown in Appendix C compare the approaches to complementary system building via different acoustic front-ends and via randomized phonetic decision trees. Table B.3 gives an overview of the tested combinations. For the lattice decoding and combination experiments the word lattices are pruned to a density of 75.



Table B.3. System combinations for the Chinese 230h testing system.

Name                           #Systems    Acoustic Front-End(s)    Randomized CARTs
s1+s2                          2           MFCC, PLP                no
s1+s2+s3                       3           MFCC, PLP, GT            no
s1+s1.r1+s1.r2                 3           MFCC                     yes
s1+s1.r1+s2+s2.r1+s3+s3.r1     6           MFCC, PLP, GT            yes

B.1.2 The RWTH Aachen Chinese GALE 2008 Evaluation System

This section describes the Chinese system used by RWTH Aachen in the GALE 2008 evaluation. The basic setup follows the Chinese 230h testing system, but additional techniques and the complete 1,600h training set are used. The additional techniques include neural network (NN) based phoneme posterior features, minimum phoneme error (MPE) discriminative acoustic model training, and cross-adaptation. The following list gives a summary of the training and decoding setup.
• 3 × 1-state HMMs
• across-word acoustic model
• state-tying via phonetic decision tree
• 4,501 mixtures with a total of 1.2M Gaussian densities
• 16 dimensional acoustic base features (+1 voicing feature)
• 1 tone feature
• LDA on 9 adjacent input frames ((16+1) × 9 = 153 input features; with voicing feature: (16+1+1) × 9 = 162 input features), reduced to 45 dimensions
• 35 (IDIAP) or 32 (ICSI) dimensional NN features, concatenated to the LDA result
• 60K vocabulary
• 4-gram LM (PP_dev07 = 367)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: MPE trained SAT/CMLLR acoustic model, MLLR or cross-system MLLR
• 3. decoding pass: lattice re-scoring with full LM
A detailed discussion of the setup, including a description of the NN features, is given in [Plahl & Hoffmeister+ 2009].

Cross-adaptation and lattice-based system combination are two combination techniques which can be easily combined: first cross-adapting the systems and subsequently combining the resulting, cross-adapted lattices. For the Chinese GALE 2008 evaluation system the interaction of cross-adaptation and lattice-based system combination is experimentally explored. The system consists of two core subsystems: one is based on MFCC features augmented with the NN features provided by IDIAP, and the other uses a PLP front-end together with the NN features provided by ICSI. Each of the two core subsystems exists in two flavors: with and without cross-adaptation. The cross-adapted system uses the final output of the other, non-cross-adapted system as supervisor in the CMLLR/MLLR adaptation step. Lattice-based system combination experiments are performed for the pair of non-cross-adapted as well as for the pair of cross-adapted subsystems. Table B.4 summarizes the differences between the subsystems. In the system combination experiments the two cross-adapted systems, called s1.x2 and s2.x1, and the two non-cross-adapted systems, called s1



Table B.4. Subsystems in the RWTH Aachen Chinese GALE 2008 evaluation system.

Name     Acoustic Front-End    NN features    Voicing feature    CMLLR/MLLR supervisor
s1       MFCC                  IDIAP          no                 s1 (1. pass output)
s1.x2    MFCC                  IDIAP          no                 s2 (final output)
s2       PLP                   ICSI           yes                s2 (1. pass output)
s2.x1    PLP                   ICSI           yes                s1 (final output)

Table B.5. Corpora statistics for the English EPPS systems.

Corpus                           #Segments    #Words    Audio data [h]
training         supervised      67K          660K      91.6
                 unsupervised                 -         187.2
tuning/testing   dev06           726          29K       3.2
                 eval06          742          30K       3.2
                 eval07          644          27K       2.9

and s2, are combined. In particular, the experiments presented in Appendix C show the effect of stacking cross-adaptation and lattice-based system combination. For the lattice combination experiments word lattices are produced with all four subsystems and pruned to a density of 75.

B.2 English TC-Star/EPPS Systems

The European parliament plenary sessions (EPPS) task was part of the TC-Star project. The objective is to transcribe debates from the European parliament. RWTH Aachen participated in all evaluations, which took place in 2005, 2006, and 2007 [Lööf & Bisani+ 2006; Lööf & Gollan+ 2007]. In 2006 and 2007 the project partners agreed on sharing lattices from their best evaluation (sub-)system for system combination experiments. In this work results are presented for the TC-Star 2007 English EPPS evaluation. The corpora statistics for the training and testing data are summarized in Table B.5. The eval06 set was the evaluation set in the 2006 evaluation and the official development set in the 2007 evaluation. Section B.2.1 describes the RWTH Aachen English EPPS 2007 evaluation system and Section B.2.2 the setup of the cross-site combination experiments based on the lattices shared after the 2007 evaluation.

B.2.1 The RWTH Aachen English EPPS 2007 Evaluation System

This section describes the English system used by RWTH Aachen for the EPPS task in the TC-Star 2007 evaluation campaign. Four subsystems are trained, varying in the acoustic front-ends and in the amount of training data. Parameter tuning is done on the eval06 corpus. An overview of the training and decoding setup is given by the following list.
• 3 × 2-state HMMs
• across-word acoustic model
• 4,501 mixtures with a total of 0.8M Gaussian densities
• state-tying via phonetic decision tree



Table B.6. Subsystems in the RWTH Aachen English EPPS 2007 evaluation system.

Name    Acoustic Front-End    NN features    Unsupervised training data
s1      MFCC                  no             yes
s2      MFCC                  no             no
s3      GT                    no             no
s4      MFCC                  yes            no

• 16 dimensional acoustic base features + 1 voicing feature
• LDA on 9 adjacent input frames ((16+1) × 9 = 153 input features), reduced to 45 dimensions
• neural network based phoneme posterior features
• 52K vocabulary
• 4-gram LM (PP_eval06 = 106)
• 1. decoding pass: ML trained VTLN acoustic model (fast variant of VTLN)
• 2. decoding pass: MPE trained SAT/CMLLR acoustic model, MLLR
• 3. decoding pass: lattice re-scoring with full LM
The pronunciation lexicon is based on the English Beep lexicon and missing pronunciations are derived from a grapheme-to-phoneme conversion model, which is trained on the Beep lexicon [Bisani & Ney 2003]. For the decoding passes a pruned version of the 4-gram LM is used and the full LM is applied in the lattice re-scoring step. A detailed description of the system can be found in [Lööf & Gollan+ 2007].

The four subsystems and their main differences are listed in Table B.6. In the system combination experiments the combinations of the first two systems, called s1+s2, of the first three systems, s1+s2+s3, and of all four systems, s1+s2+s3+s4, are used. In particular, the experimental results in Appendix C show how the combination benefits from adding more systems. For the lattice combination experiments word lattices were produced with all subsystems and pruned to a density of 75.

B.2.2 The English EPPS 2007 Evaluation Cross-site Combination

All partners who participated in the English EPPS task of the TC-Star 2007 evaluation campaign were asked to provide lattices from their best (sub-)system for system combination experiments. In the end, four sites kindly distributed their lattices: CNRS/LIMSI [Lamel & Gauvain+ 2007], FBK/IRST (former ITC/IRST) [Falavigna & Bertoldi+ 2007], RWTH Aachen University [Lööf & Gollan+ 2007], and University of Karlsruhe (UKA) [Stüker & Fügen+ 2007]. The lattices provided by RWTH Aachen were produced by subsystem s1, cf. Section B.2.1.

Word lattices are provided for the eval06 corpus (the official development set for the 2007 evaluation) and for the eval07 corpus. All sites used their own acoustic segmentation. For the lattice-based system combination experiments the segmentation was unified by concatenating the lattices recording-wise, where eval06 consists of five and eval07 of eight recordings. The lattices are normalized by applying the normalization rules used in scoring. The resulting lattices are pruned to a density of 50, where the target density is given by the least dense lattice set. All lattices come with separate acoustic and language model scores. Parameter optimization is done on the development set (eval06).

System combination results are produced for the combination of the two best performing systems, LIMSI+RWTH, of the three best performing systems, LIMSI+RWTH+UKA, and for the combination of all four systems, LIMSI+RWTH+UKA+IRST. Systematic results for each of the combinations are presented in Appendix C.


Appendix C Experimental Results

Detailed results for the systems introduced in Appendix B are given. Experimental results are produced and summarized for each system and for all combination methods and decoding rules introduced in Chapter 3 and Chapter 4. First, the results for the several subsystems are presented, followed by the various combination results.

The first set of results is produced with the minimum sentence error decoding rules discussed in Chapter 3. For single systems these are the Viterbi and the MAP decoder. In the system combination experiments the Viterbi and the MAP decoding rules are applied to the lattice intersection and to the modified lattice union. MAP decoding results for the union based combination were eventually omitted, because a single decoding run took several days and thus no parameter optimization was possible in a reasonable amount of time.

The second set of results is produced by Bayes risk decoders which aim at minimizing an approximate Levenshtein distance, in particular the approximations introduced in Chapter 4. The results are structured as follows: first, the results for the three confusion network (CN) construction algorithms are given. They are followed by the frame error results with different normalization approaches. Finally, the four variants of the error approximation based on local alignments are added. The system combination experiments use the modified union approach to combine the system-dependent lattices. For comparison, ROVER and confusion network combination (CNC) results are included.
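For orientation, the two kinds of decision rules referred to throughout this appendix can be written in the following standard form; the notation is generic and may differ in detail from Chapter 3.

$$
\text{MAP / minimum sentence error:}\qquad
\hat w_1^N = \operatorname*{argmax}_{w_1^N} \; \mathrm{Pr}(w_1^N \mid x_1^T)
$$

$$
\text{Bayes risk with loss } L:\qquad
\hat w_1^N = \operatorname*{argmin}_{w_1^N} \;
\sum_{v_1^M} \mathrm{Pr}(v_1^M \mid x_1^T)\, L(w_1^N, v_1^M)
$$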

C.1 The Chinese 230h Testing System

This section summarizes the results for the Chinese 230h testing system introduced in Section B.1.1. All results are produced on character lattices. The character lattices are derived from word lattices by splitting the word arcs into character arcs, where the character boundaries are determined by a forced alignment of the characters within a word arc. The error measure is the character error rate (CER).
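As a reference for how the reported numbers are computed, the following is a minimal sketch of the character error rate: the Levenshtein distance between recognized and reference character sequences, normalized by the reference length. It is standard dynamic programming and not one of the lattice-based decoders evaluated in this appendix.

```python
# Minimal sketch of the character error rate (CER) via the Levenshtein
# distance, using a single-row dynamic programming table.

def levenshtein(hyp, ref):
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        prev_diag, d[0] = d[0], i
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            prev_diag, d[j] = d[j], min(d[j] + 1,          # hyp char unmatched (insertion)
                                        d[j - 1] + 1,      # ref char unmatched (deletion)
                                        prev_diag + cost)  # match or substitution
    return d[len(ref)]

def cer(hyp, ref):
    """Error rate in percent, normalized by the reference length."""
    return 100.0 * levenshtein(hyp, ref) / max(len(ref), 1)

print(round(cer(list('今天天气好'), list('今天天气很好')), 2))  # one deletion -> 16.67
```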

MFCC front-end (s1)
Results for the Chinese 230h testing system with the MFCC acoustic front-end.

                                     CER[%] (del/ins) err
Decoder                              dev07¹               eval07               dev08
Sentence Error
  Viterbi                            (2.63/1.59) 14.54    (4.42/0.91) 15.08    (2.80/0.87) 13.28
  MAP                                (2.67/1.56) 14.56    (4.42/0.91) 15.14    (2.88/0.85) 13.39
Confusion Network (CN) Error, CN construct. alg.:
  arc-cluster                        (2.79/1.45) 14.30    (4.53/0.85) 14.96    (2.85/0.80) 13.05
  state-cluster                      (2.95/1.41) 14.31    (4.69/0.82) 14.93    (3.07/0.79) 13.10
  center-frame                       (2.81/1.45) 14.32    (4.56/0.85) 14.95    (2.89/0.80) 13.10
Frame Error, error norm.:
  hyp.                               (2.92/1.38) 14.35    (4.62/0.79) 14.98    (3.01/0.75) 13.13
  arc-sym.                           (2.68/1.53) 14.42    (4.45/0.90) 15.09    (2.80/0.83) 13.09
  path-sym.                          (2.52/1.61) 14.23    (4.32/0.98) 14.96    (2.75/0.94) 13.11
Local Alignment based Error
  Povey's cost (orig.)               (2.89/1.39) 14.33    (4.62/0.80) 15.03    (3.00/0.75) 13.14
  Povey's cost (mod.)                (2.32/1.68) 14.17    (4.23/1.03) 15.01    (2.61/0.97) 13.04
  1/2 overlap cost (cont.)           (2.70/1.51) 14.33    (4.45/0.89) 14.98    (2.84/0.84) 13.12
  1/2 overlap cost (disc.)           (2.61/1.55) 14.34    (4.46/0.92) 15.01    (2.73/0.88) 13.06
¹ tuning set



MFCC front-end and randomized CART (s1.r1) Results for the Chinese 230h testing system with the MFCC acoustic front-end and a randomized phonetic decision tree. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.72/1.60) 14.61 (2.77/1.57) 14.59

(4.57/0.94) 15.22 (4.57/0.92) 15.20

(2.87/0.88) 13.58 (2.85/0.88) 13.53

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.91/1.46) 14.33 (3.05/1.43) 14.38 (2.83/1.49) 14.33

(4.66/0.82) 14.83 (4.82/0.81) 14.89 (4.55/0.87) 14.86

(2.99/0.83) 13.25 (3.11/0.82) 13.28 (2.89/0.83) 13.19

(2.97/1.46) 14.39 (2.77/1.54) 14.51 (2.65/1.60) 14.31

(4.68/0.85) 14.90 (4.53/0.90) 14.99 (4.42/0.99) 14.87

(3.00/0.80) 13.24 (2.85/0.86) 13.26 (2.77/0.89) 13.18

(2.99/1.42) (2.47/1.71) (2.72/1.52) (2.71/1.55)

(4.65/0.82) (4.30/1.06) (4.49/0.94) (4.49/0.89)

(3.04/0.75) (2.58/0.97) (2.81/0.83) (2.83/0.85)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

14.40 14.34 14.34 14.33

14.89 14.94 14.93 14.90

13.20 13.14 13.18 13.26

MFCC front-end and randomized CART (s1.r2) Results for the Chinese 230h testing system with the MFCC acoustic front-end and a randomized phonetic decision tree. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.70/1.58) 14.49 (2.63/1.65) 14.51

(4.51/0.96) 15.11 (4.44/0.94) 15.09

(2.77/0.99) 13.56 (2.74/0.99) 13.53

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.98/1.44) 14.28 (3.17/1.38) 14.29 (2.95/1.44) 14.25

(4.74/0.83) 15.05 (4.83/0.81) 15.04 (4.71/0.83) 15.05

(3.10/0.84) 13.45 (3.25/0.84) 13.48 (3.10/0.84) 13.43

(2.99/1.43) 14.27 (2.85/1.48) 14.34 (2.59/1.60) 14.21

(4.71/0.85) 15.05 (4.62/0.88) 15.08 (4.41/1.00) 14.90

(3.05/0.81) 13.42 (2.94/0.87) 13.48 (2.73/0.96) 13.24

(2.98/1.38) (2.72/1.51) (2.73/1.50) (2.70/1.55)

(4.72/0.78) (4.52/0.88) (4.54/0.88) (4.55/0.91)

(3.06/0.82) (2.80/0.92) (2.84/0.90) (2.87/0.93)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

134

14.28 14.24 14.24 14.27

14.93 14.95 14.95 15.02

13.28 13.24 13.31 13.39

C.1 The Chinese 230h Testing System

PLP front-end (s2) Results for the Chinese 230h testing system with the PLP acoustic front-end. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.65/1.70) 14.82 (2.63/1.72) 14.80

(4.44/0.93) 15.02 (4.41/0.96) 15.00

(2.71/0.94) 13.54 (2.66/0.97) 13.47

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.90/1.50) 14.52 (3.12/1.44) 14.53 (2.85/1.52) 14.48

(4.62/0.81) 14.74 (4.83/0.80) 14.80 (4.59/0.82) 14.71

(2.88/0.79) 13.35 (3.12/0.74) 13.39 (2.88/0.77) 13.35

(2.90/1.50) 14.55 (2.78/1.58) 14.67 (2.66/1.63) 14.47

(4.63/0.81) 14.74 (4.53/0.86) 14.83 (4.43/0.92) 14.73

(2.93/0.78) 13.36 (2.78/0.83) 13.41 (2.71/0.89) 13.30

(2.95/1.46) (2.52/1.69) (2.76/1.56) (2.77/1.53)

(4.67/0.76) (4.33/0.96) (4.52/0.85) (4.53/0.86)

(2.95/0.78) (2.56/0.94) (2.73/0.84) (2.81/0.87)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

14.54 14.48 14.55 14.54

14.76 14.72 14.75 14.76

13.43 13.30 13.33 13.44

PLP front-end and randomized CART (s2.r1) Results for the Chinese 230h testing system with the PLP acoustic front-end and a randomized phonetic decision tree. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.69/1.68) 14.73 (2.68/1.69) 14.72

(4.45/0.99) 14.97 (4.43/0.96) 14.98

(2.75/0.93) 13.51 (2.73/0.93) 13.38

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.91/1.56) 14.46 (3.12/1.51) 14.49 (2.90/1.56) 14.47

(4.62/0.87) 14.77 (4.82/0.83) 14.82 (4.59/0.86) 14.76

(2.97/0.81) 13.24 (3.12/0.80) 13.28 (2.96/0.81) 13.19

(3.14/1.44) 14.50 (2.80/1.61) 14.56 (3.07/1.48) 14.40

(4.78/0.79) 14.75 (4.53/0.91) 14.85 (4.74/0.82) 14.75

(3.18/0.77) 13.30 (2.89/0.85) 13.33 (3.09/0.78) 13.22

(3.03/1.46) (2.40/1.84) (2.84/1.57) (2.76/1.63)

(4.68/0.82) (4.25/1.06) (4.57/0.88) (4.45/0.92)

(3.07/0.78) (2.53/1.08) (2.91/0.81) (2.81/0.88)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

14.47 14.42 14.46 14.48

14.75 14.79 14.82 14.79

13.30 13.16 13.22 13.22

135

Appendix C Experimental Results

GT front-end (s3) Results for the Chinese 230h testing system with the acoustic front-end based on the Gammatone filter bank. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.65/1.64) 15.07 (2.66/1.63) 15.08

(4.57/1.04) 15.60 (4.56/1.04) 15.58

(2.84/0.93) 13.80 (2.83/0.92) 13.82

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.97/1.48) 14.86 (3.11/1.44) 14.88 (2.89/1.50) 14.83

(4.74/0.92) 15.42 (4.91/0.88) 15.42 (4.73/0.93) 15.42

(3.01/0.85) 13.67 (3.21/0.86) 13.82 (2.98/0.85) 13.65

(2.98/1.49) 14.87 (2.84/1.56) 15.01 (2.84/1.53) 14.76

(4.77/0.86) 15.41 (4.67/0.96) 15.53 (4.72/0.95) 15.41

(3.06/0.83) 13.63 (2.94/0.87) 13.80 (3.01/0.91) 13.71

(3.09/1.41) (2.74/1.57) (2.87/1.50) (2.79/1.56)

(4.89/0.83) (4.62/0.99) (4.68/0.94) (4.65/0.99)

(3.16/0.79) (2.89/0.92) (2.96/0.88) (2.88/0.94)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

14.91 14.80 14.86 14.87

15.42 15.42 15.45 15.49

13.82 13.75 13.69 13.71

GT front-end and randomized CART (s3.r1) Results for the Chinese 230h testing system with the acoustic front-end based on the Gammatone filter bank and a randomized phonetic decision tree. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.65/1.69) 15.23 (2.62/1.72) 15.23

(4.56/1.08) 15.86 (4.53/1.08) 15.86

(2.79/0.97) 14.15 (2.78/0.97) 14.05

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.91/1.54) 15.07 (3.13/1.46) 15.07 (2.93/1.52) 15.05

(4.73/0.94) 15.66 (4.93/0.90) 15.68 (4.71/0.94) 15.62

(2.91/0.84) 13.71 (3.12/0.80) 13.78 (2.94/0.85) 13.75

(2.96/1.53) 15.09 (2.87/1.59) 15.24 (2.41/1.85) 15.00

(4.72/0.95) 15.71 (4.70/0.97) 15.76 (4.38/1.22) 15.72

(3.00/0.83) 13.74 (2.88/0.85) 13.80 (2.58/1.03) 13.59

(2.98/1.51) (2.65/1.66) (2.80/1.59) (2.73/1.62)

(4.79/0.93) (4.50/1.04) (4.60/0.99) (4.59/1.02)

(2.97/0.81) (2.71/0.94) (2.82/0.88) (2.78/0.89)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

136

15.13 14.98 15.04 15.07

15.72 15.61 15.65 15.68

13.71 13.65 13.66 13.68

C.1 The Chinese 230h Testing System

Combination of two acoustic front-ends (s1+s2)

                                     CER[%] (del/ins) err
Decoder                              dev07¹               eval07               dev08
Sentence Error
  intersection  Viterbi              (2.55/1.58) 14.05    (4.43/0.91) 14.59    (2.75/0.84) 13.09
                MAP                  (2.48/1.64) 14.04    (4.37/0.91) 14.56    (2.63/0.85) 12.91
  union         Viterbi              (2.59/1.65) 14.25    (4.44/0.92) 14.86    (2.74/0.89) 13.36
Confusion Network (CN) Error
  union   arc-cluster                (3.05/1.29) 13.54    (4.69/0.73) 14.01    (3.01/0.73) 12.54
          state-cluster (mod.)       (3.47/1.20) 13.69    (5.18/0.69) 14.22    (3.45/0.66) 12.75
          center-frame               (2.90/1.34) 13.54    (4.60/0.74) 13.96    (2.90/0.71) 12.43
  CNC     arc-cluster                (2.93/1.34) 13.56    (4.66/0.76) 13.99    (2.93/0.74) 12.50
          state-cluster (mod.)       (3.03/1.32) 13.53    (4.72/0.72) 13.95    (3.09/0.75) 12.66
          center-frame               (2.91/1.36) 13.55    (4.60/0.75) 13.95    (2.91/0.74) 12.49
  ROVER   w/o conf.                  (2.66/1.57) 14.54    (4.44/0.90) 15.13    (2.86/0.85) 13.32
          w/ conf.                   (2.49/1.59) 13.63    (4.30/0.91) 14.09    (2.64/0.94) 12.61
Frame Error, error norm.:
  asym.                              (3.07/1.30) 13.57    (4.69/0.68) 13.95    (3.05/0.70) 12.54
  arc-sym.                           (2.83/1.41) 13.83    (4.58/0.80) 14.21    (2.85/0.70) 12.67
  path-sym.                          (2.57/1.58) 13.49    (4.31/0.90) 13.93    (2.65/0.89) 12.45
Local Alignment based Error
  Povey's cost (orig.)               (3.11/1.25) 13.60    (4.75/0.67) 14.00    (3.06/0.70) 12.57
  Povey's cost (mod.)                (2.47/1.53) 13.44    (4.32/0.85) 13.93    (2.58/0.86) 12.35
  1/2 overlap cost (cont.)           (2.78/1.37) 13.48    (4.51/0.75) 13.97    (2.80/0.75) 12.49
  1/2 overlap cost (disc.)           (2.68/1.45) 13.54    (4.44/0.82) 14.00    (2.78/0.81) 12.44
¹ tuning set

Combination of three acoustic front-ends (s1+s2+s3) Decoder Sentence Error intersection union

Viterbi MAP Viterbi

Confusion Network (CN) Error union arc-cluster state-cluster (mod.) center-frame CNC arc-cluster state-cluster (mod.) center-frame ROVER w/o conf. w/ conf. Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

dev071

CER[%] (del/ins) err eval07

dev08

(2.46/1.56) 13.91 (2.49/1.59) 14.01 (2.57/1.64) 14.09

(4.38/0.91) 14.57 (4.40/0.90) 14.45 (4.47/0.92) 14.83

(2.66/0.83) 12.65 (2.68/0.87) 12.63 (2.77/0.87) 13.17

(2.88/1.24) (3.38/1.14) (2.74/1.33) (2.87/1.29) (2.93/1.26) (2.74/1.34) (2.74/1.35) (2.70/1.34)

(4.77/0.67) (5.19/0.65) (4.56/0.73) (4.68/0.70) (4.71/0.67) (4.57/0.76) (4.59/0.75) (4.55/0.74)

(3.01/0.73) (3.34/0.64) (2.87/0.74) (2.92/0.72) (3.03/0.70) (2.86/0.77) (2.89/0.75) (2.89/0.76)

13.13 13.27 13.15 13.17 13.15 13.16 13.55 13.22

13.73 13.77 13.65 13.70 13.65 13.74 14.16 13.86

12.30 12.32 12.19 12.21 12.29 12.14 12.61 12.47

(3.06/1.23) 13.18 (2.85/1.30) 13.45 (2.99/1.22) 13.06

(4.72/0.69) 13.71 (4.70/0.73) 14.09 (4.76/0.66) 13.64

(3.01/0.72) 12.22 (2.92/0.72) 12.52 (3.04/0.71) 12.22

(3.12/1.14) (2.61/1.33) (2.69/1.31) (2.58/1.38)

(4.82/0.62) (4.48/0.75) (4.55/0.72) (4.50/0.80)

(3.16/0.68) (2.72/0.78) (2.81/0.74) (2.74/0.77)

13.19 13.09 13.12 13.20

13.74 13.67 13.70 13.87

12.26 12.08 12.15 12.25

137

Appendix C Experimental Results

Combination of three randomized trees (s1+s1.r1+s1.r2) Decoder Sentence Error intersection union

Viterbi MAP Viterbi

Confusion Network (CN) Error union arc-cluster state-cluster (mod.) center-frame CNC arc-cluster state-cluster (mod.) center-frame ROVER w/o conf. w/ conf. Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

dev071

CER[%] (del/ins) err eval07

dev08

(2.57/1.60) 14.14 (2.56/1.59) 14.13 (2.64/1.68) 14.40

(4.38/0.94) 15.04 (4.35/0.93) 14.97 (4.39/0.93) 15.02

(2.77/0.91) 13.19 (2.77/0.91) 13.15 (2.78/0.90) 13.41

(2.83/1.39) (3.13/1.29) (2.78/1.42) (2.86/1.39) (2.92/1.34) (2.78/1.43) (2.61/1.54) (2.70/1.47)

(4.59/0.78) (4.89/0.76) (4.53/0.82) (4.62/0.80) (4.67/0.79) (4.54/0.81) (4.45/0.90) (4.50/0.85)

(3.00/0.75) (3.21/0.71) (2.91/0.77) (3.01/0.75) (3.10/0.75) (2.95/0.77) (2.72/0.88) (2.89/0.83)

13.83 13.86 13.82 13.84 13.82 13.84 14.00 13.80

14.53 14.60 14.51 14.54 14.51 14.52 14.81 14.63

12.79 12.88 12.81 12.86 12.88 12.83 13.05 12.98

(2.98/1.39) 13.89 (2.85/1.41) 13.99 (2.83/1.41) 13.77

(4.63/0.84) 14.55 (4.60/0.84) 14.66 (4.56/0.81) 14.45

(3.05/0.78) 12.90 (2.92/0.76) 12.87 (2.99/0.80) 12.77

(3.06/1.28) (2.53/1.52) (2.76/1.43) (2.70/1.43)

(4.75/0.73) (4.30/0.93) (4.53/0.85) (4.48/0.87)

(3.15/0.70) (2.68/0.87) (2.89/0.79) (2.92/0.82)

13.86 13.74 13.83 13.79

14.59 14.54 14.53 14.57

12.81 12.71 12.80 12.88

Combination of three acoustic front-ends and three randomized trees (s1+s1.r1+s2+s2.r1+s3+s3.r1) dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error intersection Viterbi MAP union Viterbi

(2.36/1.59) 13.78 (2.42/1.58) 13.77 (2.58/1.59) 14.22

(4.32/0.91) 14.70 (4.31/0.88) 14.53 (4.55/1.01) 15.07

(2.54/0.86) 12.53 (2.55/0.82) 12.44 (2.68/0.85) 13.06

Confusion Network (CN) Error union arc-cluster state-cluster (mod.) center-frame CNC arc-cluster state-cluster (mod.) center-frame ROVER w/o conf. w/ conf.

(2.98/1.26) (3.55/1.11) (2.73/1.36) (2.88/1.26) (3.00/1.23) (2.89/1.27) (2.60/1.42) (2.65/1.37)

(4.80/0.69) (5.45/0.60) (4.53/0.71) (4.70/0.69) (4.80/0.65) (4.64/0.70) (4.44/0.79) (4.50/0.75)

(3.10/0.70) (3.67/0.70) (2.81/0.72) (2.97/0.70) (3.08/0.70) (3.00/0.69) (2.80/0.81) (2.85/0.77)

Decoder

Frame Error error norm.:

1

138

tuning set

asym. arc-sym. path-sym.

13.05 13.19 13.02 13.05 13.05 13.10 13.15 12.97

(3.00/1.25) 13.03 (2.82/1.36) 13.35 (2.58/1.43) 12.89

13.82 14.00 13.63 13.81 13.74 13.84 13.91 13.72

(4.65/0.68) 13.68 (4.66/0.77) 13.89 (4.38/0.79) 13.64

12.20 12.53 11.98 12.14 12.09 12.13 12.34 12.15

(2.91/0.70) 11.96 (2.89/0.70) 12.15 (2.70/0.84) 11.97


C.2 The RWTH Aachen Chinese GALE 2008 Evaluation System

Results for the RWTH Aachen Chinese GALE 2008 evaluation system introduced in Section B.1.2. All results are produced on character lattices. The character lattices are derived from word lattices by splitting the word arcs into character arcs, where the character boundaries are determined by a forced alignment of the characters within a word arc. The error measure is the character error rate (CER).

MFCC+IDIAP-NN front-end (s1) Results for the Chinese GALE 2008 evaluation system with the MFCC acoustic front-end combined with the neural network (NN) based phoneme posterior features provided by IDIAP. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.17/1.11) 9.56 (2.18/1.11) 9.55

(4.17/0.71) 10.90 (4.19/0.70) 10.94

(2.37/0.76) 9.25 (2.38/0.75) 9.26

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.22/1.09) 9.46 (2.38/1.06) 9.47 (2.21/1.08) 9.44

(4.12/0.65) 10.91 (4.28/0.64) 10.92 (4.13/0.65) 10.87

(2.34/0.64) 8.98 (2.52/0.63) 9.05 (2.35/0.63) 8.98

(2.30/1.06) 9.47 (2.15/1.35) 9.77 (2.38/1.02) 9.44

(4.23/0.65) 10.89 (4.08/1.00) 11.28 (4.28/0.62) 10.85

(2.57/0.66) 9.15 (2.32/0.91) 9.36 (2.65/0.61) 9.15

(2.31/1.01) (2.27/1.05) (2.12/1.11) (2.27/1.10)

(4.21/0.62) (4.14/0.65) (4.05/0.68) (4.18/0.67)

(2.52/0.60) (2.45/0.64) (2.30/0.65) (2.43/0.64)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

9.46 9.46 9.44 9.49

10.87 10.87 10.97 10.87

9.10 9.09 9.00 9.06

PLP+ICSI-NN front-end (s2) Results for the Chinese GALE 2008 evaluation system with the PLP acoustic front-end combined with the NN based phoneme posterior features provided by ICSI. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.41/1.13) 9.96 (2.36/1.17) 9.95

(4.14/0.76) 11.12 (4.11/0.77) 11.10

(2.42/0.68) 9.24 (2.39/0.69) 9.19

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.45/1.05) 9.87 (2.60/1.04) 9.89 (2.43/1.08) 9.87

(4.23/0.71) 11.05 (4.33/0.68) 11.02 (4.21/0.70) 11.00

(2.47/0.60) 9.19 (2.61/0.58) 9.25 (2.45/0.63) 9.22

(2.55/1.06) 9.88 (2.34/1.28) 10.11 (2.29/1.17) 9.84

(4.26/0.70) 11.00 (4.16/0.92) 11.30 (4.07/0.83) 11.03

(2.58/0.63) 9.25 (2.34/0.86) 9.51 (2.38/0.75) 9.25

(2.52/1.01) (2.29/1.20) (2.47/1.03) (2.61/1.02)

(4.29/0.66) (4.06/0.78) (4.24/0.68) (4.32/0.71)

(2.54/0.59) (2.33/0.72) (2.51/0.61) (2.62/0.65)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

9.85 9.87 9.86 9.91

11.07 11.02 11.01 11.12

9.25 9.22 9.20 9.33

139

Appendix C Experimental Results

MFCC+IDIAP-NN front-end and cross-adaptation (s1.x2) Results for the Chinese GALE 2008 evaluation system with the MFCC acoustic front-end combined with the NN based phoneme posterior features provided by IDIAP. The CMLLR/MLLR adaptation is performed as a cross-adaptation with the final output of system s2 as supervisor. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(1.87/1.16) 9.07 (1.86/1.18) 9.02

(3.99/0.76) 10.67 (3.99/0.75) 10.68

(2.11/0.72) 8.72 (2.10/0.71) 8.70

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(1.96/1.13) 8.98 (2.12/1.06) 8.98 (1.95/1.11) 8.97

(4.08/0.68) 10.67 (4.20/0.67) 10.66 (4.06/0.68) 10.65

(2.22/0.65) 8.63 (2.33/0.64) 8.67 (2.18/0.64) 8.59

(2.00/1.11) 9.01 (1.83/1.48) 9.39 (1.75/1.28) 8.96

(4.09/0.67) 10.66 (3.94/1.12) 11.13 (3.84/0.82) 10.69

(2.26/0.67) 8.66 (2.07/0.91) 8.97 (2.04/0.82) 8.67

(2.16/1.00) (1.81/1.21) (2.03/1.07) (2.09/1.04)

(4.23/0.62) (3.90/0.78) (4.10/0.68) (4.13/0.67)

(2.47/0.61) (2.11/0.81) (2.21/0.64) (2.37/0.63)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

8.97 8.91 8.96 8.99

10.58 10.58 10.61 10.62

8.73 8.63 8.67 8.77

PLP+ICSI-NN front-end (s2.x1) Results for the Chinese GALE 2008 evaluation system with the PLP acoustic front-end combined with the NN based phoneme posterior features provided by ICSI. The CMLLR/MLLR adaptation is performed as a cross-adaptation with the final output of system s1 as supervisor. dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error Viterbi MAP

(2.01/1.15) 9.26 (1.97/1.20) 9.26

(3.98/0.71) 10.60 (3.95/0.72) 10.60

(2.17/0.72) 8.91 (2.09/0.71) 8.76

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.09/1.11) 9.24 (2.10/1.15) 9.27 (2.07/1.13) 9.22

(4.04/0.66) 10.46 (4.10/0.68) 10.55 (4.03/0.66) 10.45

(2.26/0.65) 8.79 (2.38/0.64) 8.77 (2.24/0.63) 8.73

(2.04/1.16) 9.24 (1.90/1.51) 9.64 (1.98/1.22) 9.21

(3.99/0.71) 10.47 (3.89/1.11) 10.94 (3.96/0.75) 10.47

(2.19/0.71) 8.85 (2.11/1.06) 9.18 (2.14/0.74) 8.73

(2.08/1.13) (1.82/1.31) (1.96/1.19) (2.07/1.14)

(4.04/0.66) (3.91/0.81) (3.92/0.70) (4.01/0.68)

(2.24/0.61) (1.98/0.80) (2.15/0.68) (2.26/0.64)

Decoder

Frame Error error norm.:

hyp. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

140

9.28 9.20 9.25 9.28

10.52 10.50 10.47 10.50

8.70 8.75 8.66 8.79

C.2 The RWTH Aachen Chinese GALE 2008 Evaluation System

Combination of two acoustic front-ends (s1+s2) dev071

CER[%] (del/ins) err eval07

dev08

Sentence Error intersection Viterbi MAP union Viterbi

(2.11/1.08) 9.12 (2.11/1.09) 9.10 (2.17/1.12) 9.50

(4.10/0.73) 10.67 (4.09/0.72) 10.68 (4.09/0.75) 10.92

(2.23/0.74) 8.80 (2.28/0.69) 8.73 (2.39/0.74) 9.23

Confusion Network (CN) Error union arc-cluster state-cluster (mod.) center-frame CNC arc-cluster state-cluster (mod.) center-frame ROVER w/o conf. w/ conf.

(2.36/0.94) (2.80/0.87) (2.32/0.96) (2.32/0.96) (2.44/0.93) (2.26/0.99) (2.14/1.12) (2.11/1.09)

(4.24/0.61) (4.67/0.56) (4.21/0.61) (4.21/0.63) (4.26/0.58) (4.19/0.63) (4.14/0.72) (4.08/0.73)

(2.57/0.56) (2.99/0.51) (2.46/0.59) (2.49/0.59) (2.60/0.54) (2.42/0.60) (2.35/0.76) (2.26/0.70)

Decoder

Frame Error error norm.:

1

asym. arc-sym. path-sym.

8.95 9.13 8.92 8.91 8.94 8.93 9.55 9.02

(2.33/1.02) 9.02 (2.20/1.24) 9.40 (2.05/1.10) 8.87

10.46 10.60 10.45 10.52 10.43 10.46 10.95 10.54

(4.21/0.65) 10.54 (4.08/0.85) 10.79 (4.01/0.72) 10.41

8.74 8.86 8.64 8.71 8.69 8.67 9.22 8.69

(2.51/0.65) 8.74 (2.38/0.77) 8.99 (2.24/0.70) 8.57

tuning set

Combination of two acoustic front-ends, with cross-adaptation (s1.x2+s2.x1) Decoder Sentence Error intersection union

Viterbi MAP Viterbi

Confusion Network (CN) Error union arc-cluster state-cluster (mod.) center-frame CNC arc-cluster state-cluster (mod.) center-frame ROVER w/o conf. w/ conf. Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (mod.) 1/2 overlap cost (cont.) (disc.) 1 tuning set

dev071

CER[%] (del/ins) err eval07

dev08

(1.91/1.15) 9.02 (1.92/1.16) 9.00 (2.04/1.11) 9.02

(3.93/0.69) 10.50 (3.93/0.69) 10.50 (4.10/0.70) 10.69

(2.21/0.72) 8.64 (2.17/0.70) 8.61 (2.31/0.72) 8.67

(1.96/1.10) (2.27/1.01) (2.05/1.09) (2.04/1.10) (2.09/1.09) (1.96/1.12) (1.87/1.16) (1.84/1.20)

(4.06/0.64) (4.34/0.61) (4.05/0.62) (4.05/0.63) (4.15/0.63) (4.05/0.66) (3.99/0.76) (3.87/0.78)

(2.21/0.63) (2.61/0.62) (2.26/0.64) (2.26/0.64) (2.33/0.63) (2.20/0.66) (2.11/0.72) (2.05/0.74)

8.84 8.90 8.85 8.86 8.87 8.85 9.07 8.84

10.40 10.52 10.29 10.33 10.33 10.37 10.67 10.41

8.54 8.65 8.48 8.49 8.54 8.47 8.72 8.47

(2.05/1.11) 8.88 (1.81/1.36) 9.17 (1.87/1.20) 8.80

(4.02/0.65) 10.35 (3.90/0.96) 10.84 (3.88/0.73) 10.36

(2.26/0.70) 8.60 (2.07/0.78) 8.74 (2.13/0.75) 8.50

(2.22/1.01) (1.89/1.15) (2.00/1.09) (2.02/1.13)

(4.18/0.59) (3.92/0.66) (4.05/0.63) (4.04/0.66)

(2.41/0.57) (2.09/0.69) (2.23/0.63) (2.22/0.66)

8.87 8.81 8.84 8.90

10.38 10.25 10.36 10.37

8.54 8.37 8.43 8.50

141


C.3 The RWTH Aachen English EPPS 2007 Evaluation System

Results for the RWTH Aachen English EPPS 2007 evaluation system introduced in Section B.2.1. The error measure is the word error rate (WER).

MFCC front-end with unsupervised training (s1) Results for the English EPPS 2007 evaluation system with the MFCC acoustic front-end and model refinement with unsupervised training. Decoder

dev06

WER[%] (del/ins) err eval061

eval07

Sentence Error Viterbi MAP

(1.65/2.21) 11.09 (1.66/2.29) 11.19

(1.38/1.36) 8.43 (1.41/1.43) 8.51

(1.86/1.31) 9.81 (1.84/1.35) 9.84

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(1.90/1.92) 10.73 (2.06/1.78) 10.64 (1.82/1.93) 10.73

(1.55/1.12) 8.22 (1.75/1.10) 8.25 (1.54/1.15) 8.24

(2.09/1.16) 9.57 (2.22/1.10) 9.53 (2.03/1.16) 9.56

(1.89/1.91) 10.73 (1.81/2.06) 11.05 (2.03/1.69) 10.53

(1.57/1.14) 8.24 (1.54/1.24) 8.44 (1.72/1.03) 8.17

(2.02/1.13) 9.49 (1.99/1.27) 9.91 (2.34/1.00) 9.51

Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (1.92/1.78) 10.66 (1.62/1.09) 8.24 (mod.) (1.80/1.96) 10.73 (1.48/1.19) 8.24 1/2 overlap cost (cont.) (1.77/1.96) 10.76 (1.48/1.17) 8.20 (disc.) (1.90/1.97) 10.86 (1.63/1.20) 8.39 1 tuning set, eval06 was the official development set in the 2007 evaluation

(2.17/1.06) (1.98/1.19) (1.98/1.17) (2.15/1.25) campaign

9.52 9.55 9.55 9.72

MFCC front-end (s2) Results for the English EPPS 2007 evaluation system with the MFCC acoustic front-end. dev06

WER[%] (del/ins) err eval061

eval07

Sentence Error Viterbi MAP

(1.77/2.28) 11.89 (1.85/2.27) 11.81

(1.67/1.23) 8.70 (1.72/1.23) 8.73

(2.12/1.31) 10.07 (2.18/1.33) 10.14

Confusion Network (CN) Error CNconstrcut. alg.: arc-cluster state-cluster center-frame

(2.14/1.90) 11.42 (1.99/2.14) 11.74 (1.90/2.08) 11.57

(1.90/1.08) 8.61 (1.80/1.09) 8.57 (1.80/1.11) 8.59

(2.40/1.07) 9.78 (2.28/1.15) 9.90 (2.19/1.14) 9.76

(2.23/1.82) 11.44 (1.78/2.32) 11.97 (2.15/1.94) 11.51

(2.04/0.96) 8.55 (1.71/1.22) 8.82 (1.97/1.02) 8.57

(2.59/0.96) 9.75 (2.15/1.28) 10.18 (2.42/1.06) 9.76

Decoder

Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (1.91/2.16) 11.76 (1.75/1.09) 8.57 (2.24/1.15) 9.91 (mod.) (1.92/2.11) 11.71 (1.78/1.11) 8.57 (2.25/1.17) 9.91 1/2 overlap cost (cont.) (1.87/2.13) 11.65 (1.74/1.12) 8.56 (2.19/1.19) 9.98 (disc.) (2.02/2.13) 11.72 (1.97/1.18) 8.80 (2.32/1.30) 10.10 1 tuning set, eval06 was the official development set in the 2007 evaluation campaign

142

C.3 The RWTH Aachen English EPPS 2007 Evaluation System

MFCC+NN based phoneme posteriors front-end (s3) Results for the English EPPS 2007 evaluation system with the MFCC front-end combined with NN based phoneme posterior features. dev06

WER[%] (del/ins) err eval061

eval07

Sentence Error Viterbi MAP

(2.06/2.29) 12.43 (2.04/2.34) 12.46

(1.80/1.30) 8.98 (1.79/1.33) 8.99

(2.22/1.34) 10.76 (2.19/1.37) 10.77

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.29/1.98) 11.97 (2.40/1.93) 11.96 (2.19/2.00) 11.95

(1.90/1.14) 8.83 (2.02/1.10) 8.82 (1.89/1.17) 8.84

(2.47/1.15) 10.48 (2.56/1.10) 10.47 (2.39/1.16) 10.45

(2.54/1.66) 11.83 (2.10/2.15) 12.34 (2.26/2.00) 12.00

(2.21/0.94) 8.82 (1.82/1.26) 9.11 (1.92/1.17) 8.87

(2.80/0.92) 10.46 (2.34/1.23) 10.74 (2.39/1.18) 10.44

Decoder

Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (2.29/1.90) 11.95 (1.96/1.09) 8.85 (mod.) (2.23/1.97) 12.03 (1.91/1.14) 8.87 1/2 overlap cost (cont.) (2.13/2.06) 12.06 (1.81/1.22) 8.90 (disc.) (2.22/2.11) 12.13 (1.88/1.27) 8.99 1 tuning set, eval06 was the official development set in the 2007 evaluation

(2.49/1.10) (2.42/1.17) (2.30/1.25) (2.35/1.36) campaign

10.51 10.49 10.59 10.77

GT front-end (s4) Results for the English EPPS 2007 evaluation system with the acoustic front-end based on the Gammatone filter bank. dev06

WER[%] (del/ins) err eval061

eval07

Sentence Error Viterbi MAP

(2.04/2.18) 12.06 (1.97/2.32) 12.27

(1.85/1.38) 9.44 (1.73/1.47) 9.45

(2.68/1.42) 11.73 (2.56/1.54) 11.77

Confusion Network (CN) Error CN construct. alg.: arc-cluster state-cluster center-frame

(2.31/1.94) 11.87 (2.42/1.82) 11.78 (2.19/1.93) 11.80

(2.09/1.17) 9.31 (2.21/1.15) 9.32 (2.02/1.21) 9.31

(2.96/1.29) 11.57 (3.08/1.22) 11.54 (2.85/1.31) 11.53

(2.30/1.88) 11.78 (2.18/2.08) 12.17 (2.25/2.01) 11.87

(2.07/1.18) 9.33 (1.93/1.26) 9.41 (2.02/1.26) 9.31

(2.98/1.19) 11.47 (2.78/1.31) 11.72 (2.86/1.30) 11.52

Decoder

Frame Error error norm.:

asym. arc-sym. path-sym.

Local Alignment based Error Povey’s cost (orig.) (2.26/1.91) 11.79 (2.06/1.18) 9.30 (mod.) (2.31/1.86) 11.81 (2.09/1.17) 9.33 1/2 overlap cost (cont.) (2.16/2.02) 11.92 (1.98/1.26) 9.35 (disc.) (2.29/2.00) 11.98 (2.10/1.23) 9.43 1 tuning set, eval06 was the official development set in the 2007 evaluation

(2.99/1.27) (3.03/1.18) (2.86/1.27) (3.15/1.27) campaign

11.59 11.50 11.49 11.69



Combination of two acoustic front-ends (s1+s2). All numbers are WER[%], given as (del/ins) err.

Decoder                        dev06              eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (1.72/2.09) 10.85  (1.48/1.25)  8.07  (1.99/1.21)  9.29
  intersection  MAP            (1.68/2.12) 10.84  (1.46/1.29)  8.11  (1.94/1.25)  9.40
  union         Viterbi        (1.82/2.00) 11.05  (1.56/1.24)  8.33  (2.04/1.23)  9.79
Confusion Network (CN) Error:
  union  arc-cluster           (2.02/1.56) 10.21  (1.73/0.94)  7.79  (2.25/0.93)  8.97
  union  state-cluster (mod.)  (2.25/1.55) 10.29  (1.95/0.92)  7.80  (2.50/0.93)  9.13
  union  center-frame          (1.83/1.74) 10.29  (1.60/1.04)  7.83  (2.09/1.04)  9.04
  CNC    arc-cluster           (1.94/1.62) 10.22  (1.66/0.99)  7.82  (2.17/0.96)  8.98
  CNC    state-cluster (mod.)  (1.97/1.56) 10.19  (1.79/0.95)  7.81  (2.21/0.91)  8.94
  CNC    center-frame          (1.82/1.77) 10.35  (1.58/1.04)  7.81  (2.06/1.02)  9.01
  ROVER  w/o conf.             (1.65/2.20) 11.07  (1.38/1.36)  8.41  (1.85/1.30)  9.80
  ROVER  w/ conf.              (1.97/1.70) 10.54  (1.75/0.93)  7.90  (2.28/0.95)  9.11
Frame Error, error norm.:
  asym.                        (2.07/1.53) 10.18  (1.80/0.90)  7.80  (2.35/0.90)  9.01
  arc-sym.                     (1.83/2.04) 10.99  (1.54/1.20)  8.37  (2.04/1.25)  9.92
  path-sym.                    (2.00/1.62) 10.29  (1.75/0.95)  7.76  (2.25/0.96)  8.92
Local Alignment based Error:
  Povey's cost (orig.)         (1.97/1.57) 10.24  (1.74/0.94)  7.86  (2.26/0.90)  9.05
  Povey's cost (mod.)          (1.83/1.93) 10.58  (1.52/1.11)  7.84  (1.99/1.12)  9.08
  1/2 overlap cost (cont.)     (1.82/1.76) 10.36  (1.58/1.04)  7.84  (2.03/1.01)  9.04
  1/2 overlap cost (disc.)     (2.00/1.73) 10.49  (1.82/1.07)  8.07  (2.24/1.15)  9.30
(1) Tuning set; eval06 was the official development set in the 2007 evaluation campaign.

Combination of three acoustic front-ends (s1+s2+s3). All numbers are WER[%], given as (del/ins) err.

Decoder                        dev06              eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (1.73/2.17) 11.27  (1.49/1.28)  8.18  (1.93/1.28)  9.57
  intersection  MAP            (1.73/2.22) 11.28  (1.48/1.31)  8.22  (1.90/1.33)  9.62
  union         Viterbi        (1.86/2.05) 11.23  (1.59/1.26)  8.38  (1.99/1.22)  9.66
Confusion Network (CN) Error:
  union  arc-cluster           (2.03/1.59) 10.21  (1.74/0.94)  7.73  (2.26/0.95)  8.96
  union  state-cluster (mod.)  (2.22/1.55) 10.38  (1.94/0.92)  7.79  (2.51/0.89)  9.00
  union  center-frame          (1.93/1.65) 10.24  (1.65/0.99)  7.73  (2.16/0.96)  8.99
  CNC    arc-cluster           (1.95/1.60) 10.14  (1.67/0.96)  7.70  (2.22/0.95)  8.98
  CNC    state-cluster (mod.)  (2.06/1.50) 10.11  (1.79/0.91)  7.69  (2.27/0.90)  8.94
  CNC    center-frame          (1.89/1.64) 10.19  (1.64/0.98)  7.69  (2.16/0.95)  9.01
  ROVER  w/o conf.             (1.81/1.91) 10.90  (1.49/1.13)  7.91  (1.99/1.09)  9.32
  ROVER  w/ conf.              (2.05/1.57) 10.42  (1.79/0.87)  7.73  (2.40/0.89)  9.17
Frame Error, error norm.:
  asym.                        (1.97/1.71) 10.50  (1.73/0.99)  7.79  (2.18/1.01)  9.01
  arc-sym.                     (1.85/2.31) 11.30  (1.65/1.48)  8.78  (2.14/1.44) 10.11
  path-sym.                    (1.98/1.64) 10.21  (1.70/0.97)  7.70  (2.27/1.00)  8.97
Local Alignment based Error:
  Povey's cost (orig.)         (2.03/1.49) 10.14  (1.75/0.91)  7.75  (2.34/0.86)  9.01
  Povey's cost (mod.)          (1.74/1.93) 10.52  (1.47/1.09)  7.70  (1.98/1.18)  9.05
  1/2 overlap cost (cont.)     (1.97/1.54) 10.09  (1.73/0.95)  7.74  (2.28/0.92)  9.00
  1/2 overlap cost (disc.)     (1.91/1.80) 10.40  (1.70/1.22)  8.01  (2.13/1.27)  9.28
(1) Tuning set; eval06 was the official development set in the 2007 evaluation campaign.


Combination of four acoustic front-ends (s1+s2+s3+s4). All numbers are WER[%], given as (del/ins) err.

Decoder                        dev06              eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (1.72/2.17) 11.19  (1.54/1.20)  8.12  (1.99/1.24)  9.54
  intersection  MAP            (1.70/2.23) 11.27  (1.53/1.26)  8.20  (1.96/1.33)  9.73
  union         Viterbi        (1.86/2.05) 11.24  (1.59/1.26)  8.38  (1.99/1.23)  9.67
Confusion Network (CN) Error:
  union  arc-cluster           (2.03/1.64) 10.33  (1.70/0.96)  7.59  (2.29/0.95)  8.94
  union  state-cluster (mod.)  (2.47/1.48) 10.33  (2.07/0.86)  7.71  (2.73/0.87)  9.09
  union  center-frame          (1.93/1.68) 10.54  (1.67/0.99)  7.71  (2.24/0.92)  9.10
  CNC    arc-cluster           (1.88/1.65) 10.22  (1.60/0.97)  7.59  (2.18/0.91)  8.92
  CNC    state-cluster (mod.)  (1.96/1.63) 10.18  (1.71/0.94)  7.60  (2.21/0.88)  8.86
  CNC    center-frame          (1.94/1.59) 10.25  (1.66/0.98)  7.65  (2.34/0.87)  9.03
  ROVER  w/o conf.             (1.77/1.93) 10.92  (1.45/1.17)  7.81  (1.97/1.11)  9.28
  ROVER  w/ conf.              (1.82/1.91) 10.70  (1.47/1.08)  7.67  (2.06/1.08)  9.15
Frame Error, error norm.:
  asym.                        (1.91/1.76) 10.45  (1.62/1.03)  7.69  (2.18/1.02)  8.97
  arc-sym.                     (1.85/2.21) 11.20  (1.62/1.41)  8.63  (2.17/1.40) 10.06
  path-sym.                    (1.92/1.77) 10.37  (1.63/1.02)  7.62  (2.15/1.00)  8.89
Local Alignment based Error:
  Povey's cost (orig.)         (2.06/1.47) 10.11  (1.77/0.90)  7.65  (2.42/0.84)  8.93
  Povey's cost (mod.)          (1.95/1.72) 10.39  (1.62/1.02)  7.73  (2.24/0.99)  9.12
  1/2 overlap cost (cont.)     (1.85/1.72) 10.30  (1.56/0.98)  7.64  (2.16/0.98)  8.96
  1/2 overlap cost (disc.)     (2.01/1.79) 10.51  (1.73/1.19)  7.92  (2.34/1.16)  9.33
(1) Tuning set; eval06 was the official development set in the 2007 evaluation campaign.



C.4 The English EPPS 2007 Evaluation Cross-site Combination

Results for the English EPPS 2007 evaluation cross-site combination introduced in Section B.2.2. The error measure is the word error rate (WER).
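All tables in this section report the WER together with its deletion and insertion fractions, written as (del/ins) err. As a reminder of how such a breakdown is obtained from a Levenshtein alignment between reference and hypothesis, the following minimal sketch computes it for two made-up sentences. It is only an illustration, not the scoring tool used for the evaluations reported here, and its tie-breaking between equally good alignments may differ from standard scoring software.

```python
# Illustrative sketch only: WER with a deletion/insertion breakdown via Levenshtein alignment.

def wer_breakdown(reference, hypothesis):
    """Return (err, del, ins) in percent of the reference length."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = (errors, deletions, insertions) for aligning reference[:i] with hypothesis[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0)
    for i in range(1, n + 1):
        e, d, k = dp[i - 1][0]
        dp[i][0] = (e + 1, d + 1, k)          # only deletions of reference words
    for j in range(1, m + 1):
        e, d, k = dp[0][j - 1]
        dp[0][j] = (e + 1, d, k + 1)          # only insertions of hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            e, d, k = dp[i - 1][j - 1]
            best = (e + sub, d, k)                       # match or substitution
            e, d, k = dp[i - 1][j]
            best = min(best, (e + 1, d + 1, k))          # deletion
            e, d, k = dp[i][j - 1]
            best = min(best, (e + 1, d, k + 1))          # insertion
            dp[i][j] = best
    errors, deletions, insertions = dp[n][m]
    scale = 100.0 / max(n, 1)
    return errors * scale, deletions * scale, insertions * scale

if __name__ == "__main__":
    ref = "the european parliament met in brussels".split()   # invented example
    hyp = "the parliament met in in brussels".split()          # invented example
    err, dele, ins = wer_breakdown(ref, hyp)
    print("(%.2f/%.2f) %.2f" % (dele, ins, err))               # prints (16.67/16.67) 33.33
```

The substitution fraction is the remainder err minus del minus ins; it is not listed separately in the tables.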

The LIMSI System: Results for the lattices provided by CNRS/LIMSI within the TC-Star English EPPS 2007 evaluation. All numbers are WER[%], given as (del/ins) err.

Decoder                      eval06(1)          eval07
Sentence Error:
  Viterbi                    (1.59/1.33)  8.04  (1.71/1.21)  9.08
Confusion Network (CN) Error, CN construct. alg.:
  arc-cluster                (1.65/1.33)  8.07  (1.76/1.18)  8.96
  state-cluster              (1.71/1.25)  8.04  (1.88/1.14)  8.94
  center-frame               (1.64/1.33)  8.08  (1.75/1.18)  8.97
Frame Error, error norm.:
  asym.                      (1.95/1.15)  8.08  (2.22/0.99)  9.00
  arc-sym.                   (1.72/1.34)  8.24  (1.82/1.22)  9.19
  path-sym.                  (1.68/1.32)  8.05  (1.84/1.15)  9.00
Local Alignment based Error:
  Povey's cost (orig.)       (1.67/1.29)  8.04  (1.87/1.15)  9.03
  Povey's cost (mod.)        (1.62/1.40)  8.13  (1.73/1.22)  8.99
  1/2 overlap cost (cont.)   (1.66/1.33)  8.07  (1.79/1.14)  8.96
  1/2 overlap cost (disc.)   (1.65/1.28)  8.07  (1.82/1.24)  9.09
(1) Tuning set, the official development set in the 2007 evaluation campaign.

The RWTH Aachen System: Results for the lattices provided by RWTH Aachen University within the TC-Star English EPPS 2007 evaluation. All numbers are WER[%], given as (del/ins) err.

Decoder                      eval06(1)          eval07
Sentence Error:
  Viterbi                    (1.51/1.30)  8.42  (1.95/1.25)  9.75
Confusion Network (CN) Error, CN construct. alg.:
  arc-cluster                (1.55/1.13)  8.24  (2.07/1.15)  9.54
  state-cluster              (1.62/1.11)  8.24  (2.17/1.11)  9.51
  center-frame               (1.55/1.14)  8.26  (2.03/1.16)  9.54
Frame Error, error norm.:
  asym.                      (1.84/0.96)  8.23  (2.39/0.97)  9.54
  arc-sym.                   (1.47/1.27)  8.46  (1.94/1.28)  9.83
  path-sym.                  (1.73/1.03)  8.21  (2.33/1.00)  9.50
Local Alignment based Error:
  Povey's cost (orig.)       (1.63/1.09)  8.23  (2.15/1.07)  9.49
  Povey's cost (mod.)        (1.58/1.13)  8.24  (2.05/1.09)  9.47
  1/2 overlap cost (cont.)   (1.47/1.20)  8.24  (1.99/1.18)  9.54
  1/2 overlap cost (disc.)   (1.69/1.22)  8.47  (2.15/1.18)  9.66
(1) Tuning set, the official development set in the 2007 evaluation campaign.


The UKA System: Results for the lattices provided by University of Karlsruhe (UKA) within the TC-Star English EPPS 2007 evaluation. All numbers are WER[%], given as (del/ins) err.

Decoder                      eval06(1)          eval07
Sentence Error:
  Viterbi                    (1.77/1.29)  8.78  (2.00/1.29) 10.21
Confusion Network (CN) Error, CN construct. alg.:
  arc-cluster                (1.83/1.39)  8.98  (2.08/1.33) 10.36
  state-cluster              (2.00/1.38)  9.03  (2.26/1.28) 10.38
  center-frame               (1.81/1.42)  9.06  (2.08/1.39) 10.49
Frame Error, error norm.:
  asym.                      (1.93/1.35)  9.04  (2.19/1.33) 10.31
  arc-sym.                   (1.64/1.96) 10.04  (1.88/2.02) 11.61
  path-sym.                  (1.97/1.33)  8.97  (2.24/1.37) 10.34
Local Alignment based Error:
  Povey's cost (orig.)       (1.88/1.32)  9.02  (2.17/1.35) 10.41
  Povey's cost (mod.)        (1.95/1.33)  9.03  (2.20/1.32) 10.40
  1/2 overlap cost (cont.)   (1.70/1.45)  9.06  (1.96/1.48) 10.53
  1/2 overlap cost (disc.)   (1.67/1.62)  9.19  (1.93/1.60) 10.60
(1) Tuning set, the official development set in the 2007 evaluation campaign.

The IRST System: Results for the lattices provided by FBK/IRST (former ITC/IRST) within the TC-Star English EPPS 2007 evaluation. All numbers are WER[%], given as (del/ins) err.

Decoder                      eval06(1)          eval07
Sentence Error:
  Viterbi                    (2.35/1.40) 10.09  (2.49/1.14)  9.82
Confusion Network (CN) Error, CN construct. alg.:
  arc-cluster                (2.35/1.39) 10.06  (2.47/1.13)  9.82
  state-cluster              (2.34/1.39) 10.05  (2.46/1.14)  9.79
  center-frame               (2.35/1.39) 10.06  (2.47/1.13)  9.82
Frame Error, error norm.:
  asym.                      (2.44/1.31) 10.04  (2.56/1.09)  9.81
  arc-sym.                   (2.34/1.40) 10.05  (2.45/1.15)  9.85
  path-sym.                  (2.43/1.32) 10.04  (2.55/1.10)  9.82
Local Alignment based Error:
  Povey's cost (orig.)       (2.35/1.39) 10.06  (2.44/1.14)  9.80
  Povey's cost (mod.)        (2.33/1.38) 10.04  (2.44/1.14)  9.80
  1/2 overlap cost (cont.)   (2.31/1.41) 10.07  (2.40/1.17)  9.84
  1/2 overlap cost (disc.)   (2.28/1.43) 10.05  (2.39/1.19)  9.84
(1) Tuning set, the official development set in the 2007 evaluation campaign.



Combination of the LIMSI and the RWTH lattices. All numbers are WER[%], given as (del/ins) err.

Decoder                        eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (1.58/1.32)  8.02  (1.71/1.20)  9.07
Confusion Network (CN) Error:
  union  arc-cluster           (1.63/0.77)  6.46  (2.17/0.71)  7.67
  union  state-cluster (mod.)  (1.90/0.85)  6.95  (2.29/0.79)  8.13
  union  center-frame          (1.50/0.77)  6.39  (1.92/0.73)  7.52
  CNC    arc-cluster           (1.45/0.80)  6.38  (1.88/0.75)  7.51
  CNC    state-cluster (mod.)  (1.49/0.78)  6.38  (1.96/0.75)  7.52
  CNC    center-frame          (1.45/0.81)  6.41  (1.88/0.80)  7.58
  ROVER  w/o conf.             (1.50/1.24)  7.87  (1.70/1.20)  9.06
  ROVER  w/ conf.              (1.63/0.91)  6.69  (2.13/0.87)  7.85
Frame Error, error norm.:
  asym.                        (1.60/0.85)  6.65  (1.99/0.76)  7.73
  arc-sym.                     (1.57/1.29)  8.35  (2.02/1.21)  9.58
  path-sym.                    (1.62/0.76)  6.46  (2.09/0.73)  7.57
Local Alignment based Error:
  Povey's cost (orig.)         (1.78/0.72)  6.66  (2.33/0.61)  7.73
  Povey's cost (mod.)          (1.48/0.90)  6.61  (2.05/0.79)  7.70
  1/2 overlap cost (cont.)     (1.44/0.92)  6.66  (1.96/0.88)  7.96
  1/2 overlap cost (disc.)     (1.66/0.87)  6.70  (2.10/0.74)  7.81
(1) Tuning set, the official development set in the 2007 evaluation campaign.

Combination of the LIMSI, the RWTH, and the UKA lattices. All numbers are WER[%], given as (del/ins) err.

Decoder                        eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (1.75/1.25)  8.18  (1.84/1.17)  9.24
Confusion Network (CN) Error:
  union  arc-cluster           (1.51/0.79)  6.38  (2.04/0.77)  7.63
  union  state-cluster (mod.)  (1.98/0.73)  6.57  (2.63/0.69)  7.76
  union  center-frame          (1.54/0.73)  6.30  (1.89/0.69)  7.32
  CNC    arc-cluster           (1.47/0.72)  6.27  (1.87/0.68)  7.24
  CNC    state-cluster (mod.)  (1.58/0.67)  6.25  (2.04/0.64)  7.28
  CNC    center-frame          (1.36/0.74)  6.23  (1.77/0.76)  7.32
  ROVER  w/o conf.             (1.35/0.84)  6.58  (1.86/0.78)  8.01
  ROVER  w/ conf.              (1.43/0.76)  6.32  (2.00/0.70)  7.77
Frame Error, error norm.:
  asym.                        (1.80/0.72)  6.48  (2.21/0.68)  7.52
  arc-sym.                     (1.61/1.39)  8.23  (1.74/1.27)  9.19
  path-sym.                    (1.53/0.74)  6.24  (2.01/0.74)  7.28
(1) Tuning set, the official development set in the 2007 evaluation campaign.


Combination of the LIMSI, the RWTH, the UKA and the IRST lattices. All numbers are WER[%], given as (del/ins) err.

Decoder                        eval06(1)          eval07
Sentence Error:
  intersection  Viterbi        (2.46/1.51) 10.30  (2.52/1.17)  9.89
Confusion Network (CN) Error:
  union  arc-cluster           (1.61/0.73)  6.28  (2.19/0.67)  7.36
  union  state-cluster (mod.)  (2.31/0.63)  6.61  (2.90/0.52)  7.58
  union  center-frame          (1.61/0.71)  6.23  (2.00/0.61)  7.10
  CNC    arc-cluster           (1.45/0.71)  6.14  (1.87/0.69)  7.12
  CNC    state-cluster (mod.)  (1.54/0.65)  6.10  (2.04/0.57)  7.12
  CNC    center-frame          (1.36/0.73)  6.11  (1.82/0.67)  7.16
  ROVER  w/o conf.             (1.36/0.78)  6.38  (1.82/0.79)  7.67
  ROVER  w/ conf.              (1.37/0.79)  6.21  (1.77/0.73)  7.26
Frame Error, error norm.:
  asym.                        (1.70/0.79)  6.52  (1.93/0.76)  7.26
  arc-sym.                     (1.57/1.28)  8.33  (2.01/1.22)  9.55
  path-sym.                    (1.36/0.85)  6.10  (1.81/0.85)  7.21
(1) Tuning set, the official development set in the 2007 evaluation campaign.


Appendix D Symbols and Acronyms

In this appendix, all relevant mathematical symbols and acronyms which are used in this thesis are defined for convenience. Detailed explanations are given in the corresponding chapters.

D.1 Mathematical Symbols

x ⊕ y            collect operator in a semiring, ⊕-sum of x and y
x ⊗ y            extend operator in a semiring, ⊗-product of x and y
1{cond}          equals one if condition cond is true, and zero otherwise
A                alignment between two word sequences or two CNs
a, b             word lattice arcs
a_1^L, b_1^K     paths through a word lattice, a_1^L := a_1, a_2, ..., a_L and b_1^K := b_1, b_2, ..., b_K, where a_l and b_k are word lattice arcs
b_j^k            partial path in a word lattice, b_j^k := b_j, b_{j+1}, ..., b_k, where b_i is a word lattice arc
beg(a)           begin time of word lattice arc a
best(L)          non-ε input label sequence of the best path through lattice L
β                language model scale
c(b, a)          cost function, defined between word lattice arcs
c(b; S)          cost function, defined between an arc b from the hypothesis space lattice and the summation space lattice S
c(b_1^K, a_1^L)  cost function, defined between two paths through a word lattice
CN(L)            confusion network derived from word lattice L and an arbitrary slot function
CN(L, σ(·))      confusion network derived from word lattice L and slot function σ(·)
d(L)             single-source shortest distance for word lattice L starting from the initial state, score of the best path if computed over the tropical semiring
dur(a)           duration in number of time frames of word lattice arc a
δ(i, j)          Kronecker delta, equals one for i = j, and zero otherwise
end(a)           end time of word lattice arc a
E(L)             set of all lattice arcs in word lattice L
ε                the empty word
from(a)          source state of lattice arc a
f_i(...)         ith feature function in a log-linear model
g(x_1^T)         Bayes risk classifier applied to the acoustic observations x_1^T
H(·)             entropy
H                word lattice representing the hypothesis space of a Bayes risk decoder
h(a, b)          conditional overlap; overlap in number of time frames between two word lattice arcs, if both arcs have the same input label, zero otherwise
i                common index for the scaling factors and feature functions in a log-linear model
I                number of scaling factors resp. feature functions in a log-linear model
i(a)             input label of word lattice arc a
i(a_1^L)         sequence of non-ε input labels of a path through a word lattice
j                common index for the systems in a system combination or the lattices in a lattice-based system combination
J                number of systems in a system combination, number of lattices in a lattice-based system combination
k, l             common indices for the arcs in a path through a word lattice
L(·, ·)          loss function used in a Bayes risk decoder, defined for two word sequences or two paths through a word lattice
Lev(·, ·)        Levenshtein distance, defined for two word sequences or two paths through a word lattice
L, L_j           word lattice defined as a weighted finite state acceptor, word lattice produced by the jth system
λ                log-linear model parameters, λ = λ_1, λ_2, ..., λ_I
λ_i              log-linear model parameter, scaling factor of the ith feature function f_i(...)
λ_i(w)           word-dependent parameter in a log-linear model, word-dependent scaling factor of the ith feature function f_i(...)
m, n             common indices for the words in a word sequence
o(a, b)          overlap in number of time frames between two arcs in a word lattice
p(j)             prior probability for the jth system
p(a|x_1^T)       posterior for the word lattice arc a given the acoustic observations x_1^T
p(a_1^L|x_1^T)   posterior for path a_1^L through a word lattice given the acoustic observations x_1^T
p(w_1^N)         prior for the word sequence w_1^N, language model
p(w_1^N|x_1^T)   posterior for the spoken word sequence w_1^N given the acoustic observations x_1^T
p_s(w|x_1^T)     defined by a confusion network (CN), posterior for the occurrence of word w in CN slot s given the acoustic observations x_1^T
p_t(w|x_1^T)     posterior for the occurrence of word w at time frame t given the acoustic observations x_1^T
r(x_1^T)         Bayes risk given the acoustic observations x_1^T
σ(a)             assigns word lattice arc a to a confusion network (CN) slot; in the computation of the CN distance two lattice arcs are aligned if they are assigned to the same slot
s                state in a word lattice
S                word lattice representing the summation space of a Bayes risk decoder
Σ                alphabet or vocabulary
t, τ             common indices for time frames
t(s)             time stamp of word lattice state s
to(a)            target state of lattice arc a
v, w             words from vocabulary Σ or the empty word ε
v_1^M, w_1^N     word sequences, where v_1^M := v_1 v_2 ... v_M and w_1^N := w_1 w_2 ... w_N
w(a)             weight of word lattice arc a; for an arc in a system-dependent word lattice the weight usually consists of an acoustic and a language model score
x_1^T            sequence of acoustic observation vectors, x_1^T = x_1 x_2 ... x_T
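As a small, self-contained illustration of the semiring notation above (x ⊕ y, x ⊗ y, and the single-source shortest distance d(L)), the sketch below defines the tropical and the log semiring and computes the shortest distance over a toy acyclic lattice in topological order. The toy lattice, its state numbering, and the helper names are invented for this example and are not taken from the thesis or from any particular WFST toolkit; over the tropical semiring the result is the best-path score, as stated in the definition of d(L).

```python
# Illustrative sketch only: collect/extend operators of two common semirings and a
# single-source shortest distance over a small, made-up acyclic lattice.
import math

# Tropical semiring: collect = min, extend = +  (shortest distance = best-path score)
tropical = (min, lambda x, y: x + y, math.inf, 0.0)

# Log semiring: collect = -log(e^-x + e^-y), extend = +  (-log of the sum over paths)
def log_add(x, y):
    if math.isinf(x):
        return y
    if math.isinf(y):
        return x
    return min(x, y) - math.log1p(math.exp(-abs(x - y)))

log_semiring = (log_add, lambda x, y: x + y, math.inf, 0.0)

def shortest_distance(num_states, arcs, semiring):
    """d[s] = collect over all paths from state 0 to s of the extend-product of arc weights.
    Assumes states are numbered topologically (source < target for every arc)."""
    collect, extend, zero, one = semiring
    d = [zero] * num_states
    d[0] = one
    for src, dst, weight in sorted(arcs, key=lambda arc: arc[0]):
        d[dst] = collect(d[dst], extend(d[src], weight))
    return d

# Toy lattice with two paths 0->1->3 and 0->2->3; weights play the role of -log scores.
arcs = [(0, 1, 1.2), (0, 2, 0.7), (1, 3, 0.4), (2, 3, 1.1)]
print(shortest_distance(4, arcs, tropical))      # best-path score per state
print(shortest_distance(4, arcs, log_semiring))  # -log of the summed path scores per state
```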



D.2 Acronyms

ASR          Automatic Speech Recognition
BC           Broadcast Conversations
BN           Broadcast News
BR           Bayes Risk
CART         Classification And Regression Tree
CER          Character Error Rate
CMLLR        Constrained Maximum Likelihood Linear Regression
CN           Confusion Network
CNC          Confusion Network Combination
CNRS-LIMSI   Centre National de la Recherche Scientifique - Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
DMC          Discriminative Model Combination
EPPS         European Parliament Plenary Sessions
FB           Forward Backward
FBK-IRST     Fondazione Bruno Kessler (former Istituto Trentino di Cultura) - Centro per la Ricerca Scientifica e Tecnologica
FE           Frame Error
FST          Finite State Transducer
GALE         Global Autonomous Language Exploitation
GT           Gammatone filter
HMM          Hidden Markov Model
IBM          International Business Machines
ICSI         International Computer Science Institute, Berkeley, California
IDIAP        Idiap Research Institute
IRST         see FBK-IRST
ITC-IRST     see FBK-IRST
LDA          Linear Discriminant Analysis
LIMSI        see CNRS-LIMSI
LM           Language Model
LVCSR        Large Vocabulary Continuous Speech Recognition
MAP          Maximum A Posteriori
MFCC         Mel Frequency Cepstral Coefficients
ML           Maximum Likelihood
MLLR         Maximum Likelihood Linear Regression
MPE          Minimum Phone Error
MPFE         Minimum Frame Phone Error
MWE          Minimum Word Error
NCE          Normalized Cross Entropy
NIST         National Institute of Standards and Technology
NN           Neural Network
PLP          Perceptual Linear Prediction
PP           Language Model Perplexity
Rprop        Resilient Propagation
ROVER        Recognizer Output Voting Error Reduction
RWTH         Rheinisch-Westfälische Technische Hochschule
SAT          Speaker Adaptive Training
SRI          SRI International
TC-STAR      Technology and Corpora for Speech to Speech Translation
UKA          Universität Karlsruhe
UW           University of Washington
VTLN         Vocal Tract Length Normalization
WER          Word Error Rate
WFST         Weighted Finite State Transducer


List of Figures

1.1  Basic architecture of a statistical automatic speech recognition system according to [Ney 1990].
1.2  6-state hidden Markov model in Bakis topology for the triphone s eh v in the word "seven" and the resulting trellis for a time alignment. The HMM segments are denoted by <1>, <2>, and <3>.
1.3  Lattice produced by the RWTH 2007 TC-Star EPPS Evaluation System for English [Lööf & Gollan+ 2007].
1.4  Graphical representation of a weighted acceptor a) and a weighted transducer b). An arc in the acceptor is labeled by i(e)/w(e), a transducer arc by i(e):o(e)/w(e). States are labeled with their state number and a final weight, if the state is final.
3.1  Error induced by changing the LM scale after computing x ⊕ x; the LM scale is initialized with 20. The correct sum results from changing the scaling factors before applying the ⊕-operator. The ⊕-operator is defined in Equation (3.4).
3.2  The figure shows a word lattice with time stamps at the states, a slot function, and the confusion network induced by the slot function.
3.3  Illustration of the non-speech cloud filter applied to a word lattice. In figure a) four paths are connecting the left most and the right most state, three of them starting with "have" and continuing with non-speech arcs marked as "{·}". These three paths define a non-speech cloud and the non-speech cloud filter removes all but the best scoring path through the cloud. The filter result is shown in figure b).
3.4  CN decoding results for the Chinese 230h testing system, cf. Section B.1.1, for different lattice densities.
3.5  CN decoding results for the English EPPS 2007 evaluation system, cf. Section B.2.1, for different lattice densities.
4.1  The bias in partially normalized frame errors. In a) the frame error is normalized w.r.t. the hypothesis, which results in ignoring deletion errors (left side) while insertions are counted (right side). In b) the frame error is normalized w.r.t. the reference and insertion errors are ignored (left side) while deletions are counted.
4.2  The figure shows a lattice, a CN derived from the lattice, and a lattice in which all paths have the same length. The positions for the insertions of the ε-arcs are derived from the CN according to the algorithm described in the text. The number at the arcs corresponds to the CN slot the arc is assigned to and the number in the states is the minimum slot number from all outgoing arcs.
4.3  CN construction with the arc-cluster algorithm.
4.4  Pseudo code for the arc-cluster CN construction algorithm.
4.5  CN construction with the state-cluster algorithm.
4.6  Pseudo code for the state-cluster CN construction algorithm.
4.7  Pseudo code for the state-cluster CN construction algorithm with back-splitting.
4.8  CN construction with the center-frame algorithm.
4.9  Pseudo code for the center-frame CN construction algorithm.
5.1  The figure shows in the first row a lattice. The second and the third row show the word-level resp. frame-level CN derived from the lattice. In the word-level CN each slot assigns a single position to each word hypothesis. In the frame-wise CN each slot represents a single time frame and a word hypothesis is usually spread among several slots.
5.2  Example for a typical error made by the common CN construction algorithms and the correction of the error by using a windowed Levenshtein distance, where the window is centered around the CN alignment. The example lattice consists of three paths which are listed to the right of the lattice together with their path probabilities. The arc labels in the lattice are composed of the word, the CN slot to which the arc is assigned, and the arc probability. The resulting CN is drawn below the lattice. To the right of the CN an example for the possible alignment position of arc "b:1" within a windowed Levenshtein alignment is given: a) shows the only possible alignment position for a window of size one, b) shows the possible alignment positions for a symmetric window of size three. The lower part of the figure shows the alignments for the Bayes risk hypotheses for different window sizes with the windowed Levenshtein distance as cost function. Alignment a) is the outcome for a window of size one, which is equivalent to the standard CN decoding. Alignment b) uses a symmetric window of size three. The larger window allows the alignment of "b:1" and "b:2" which compensates for the flaw in the CN construction, where the two arcs were assigned to different slots. The Bayes risk hypothesis for a window of size three is "a b c", which is also the minimum WER hypothesis for the example lattice.
5.3  The figure visualizes the alignments performed in the Bayes risk decoder with the windowed Levenshtein distance as loss function. Figure a) shows the CN alignment case, where the window size is one and thus the alignment is unique. For a window size of 2d + 1 the computation of the hypothesis word at position n considers the alignment between v_n^{n+d} and w_n^{n+2d} as shown in b). For sufficiently large window size, that is ≥ 2S − 1, the alignment between v_1^S and w_1^S is computed, see c), which yields the exact Levenshtein distance.
5.4  Confidence warping applied to the lattices for eval07en produced by the LIMSI English EPPS 2007 evaluation system.
7.1  Results for the log-linear model-combination for 25 training iterations and 6,904 word-dependent scaling factors. The word-dependent scaling factors are trained on 120h. The left plot shows the objective function and character error rates for the training set, the held-out set, and the development set. The right plot shows the progression of the error rates for the development set and the two test sets.

List of Tables

1.1  Semirings used by WFSTs for speech recognition tasks.
3.1  Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.
3.2  Results for the English EPPS 2007 evaluation systems, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The bracketed percentages in the rows with the intersection results are the percentages of segments for which the lattice intersection is not empty. In case of an empty intersection the lattice from the first system is decoded.
3.3  Example for the situation where the Bayes risk hypothesis Ŵ, i.e. the hypothesis with the minimum expected word error rate, has a sentence posterior probability of zero and thus is not contained in the summation space.
3.4  Results for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction.
3.5  Results for the English EPPS 2007 evaluation system, cf. Section B.2.1. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.
3.6  Results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction.
3.7  The table summarizes common approaches to lattice-based system combination. The methods are classified according to a) the lattice combination method and b) the decoder. The lattices are either combined via an intersection (or a theoretically equivalent lattice rescoring) or by building the lattice union. The decoder is either the Viterbi decoder, which is an approximation of the Bayes risk decoder with the sentence error as loss function, or the Bayes risk decoder with a local cost function as loss function. The local cost functions are of the second type for all methods but Povey's MPE, which is of the first type.
3.8  Results for the Chinese 230h testing system, cf. Section B.1.1. Word-level vs. character-level decoding and approximated vs. exact character boundaries. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
3.9  Comparison of the posterior probability distributions resulting from maximum likelihood estimation and from MRT training given the observations 1 × (x, 111), 2 × (x, 112), 1 × (x, 211), and 1 × (x, 221). The table also shows the Bayes risk hypothesis given the two distributions and the according risks given the empirical distribution.
4.1  Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three different approaches to word-wise frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
4.2  Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three different approaches to word-wise frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
4.3  Minimum frame error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare the word- and time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
4.4  Minimum frame error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare the word- and time-conditioned hypothesis space for the minimum frame error decoder with path symmetric normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
4.5  The substitution, insertion, and deletion error for the discrete and the continuous case of the 1/2 overlap approximation.
4.6  Minimum local alignment error decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare four variants of the local alignment based cost. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
4.7  Minimum local alignment error decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare four variants of the local alignment based cost. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
4.8  CN decoding results for the Chinese 230h testing system, cf. Section B.1.1. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
4.9  CN decoding results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The experiments compare three CN construction algorithms for single lattice decoding and for system combination. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
4.10 Comparison of the original and the modified state-cluster CN construction algorithm for the Chinese 230h testing system, cf. Section B.1.1. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system.
4.11 Comparison of the original and the modified state-cluster CN construction algorithm for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system.
. 5.1

5.2

5.3

160

Entropy-based combination results for the Chinese 230h testing system, cf. Section B.1.1. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Entropy-based combination results for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. Experiments are performed with the minimum frame error decoder with hypothesis-side frame error normalization. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . Combination results with system-dependent frame- and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. The warping is optimized for minimum character error rate. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

58

58 60

61

62

72

73

74

75

80

80

95

List of Tables 5.4 5.5

5.6 5.7

5.8

6.1

6.2 6.3 6.4

6.5

6.6

7.1 7.2

7.3 7.4

7.5

Normalized cross entropy (NCE) results with frame- and CN-slot-wise posterior warping for the Chinese 230h testing system, cf. Section B.1.1. . . . . . . . . . . . . . . . . . . . . Combination results with system-dependent frame- and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The warping is optimized for minimum word error rate. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . Normalized cross entropy (NCE) results with frame- and CN-slot-wise posterior warping for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. . . . . . . Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the Chinese 230h testing system, cf. Section B.1.1. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of system s1, the best single system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Results with the approximate Bayes risk decoder with the windowed Levenshtein distance as loss function for the English EPPS 2007 evaluation cross-site combination, cf. Section B.2.2. The windowed Levenshtein distance is initialized with a CN alignment; for a window size of one the CN decoding result is produced. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the LIMSI system, the best single system. . . . . . . . . . . . . . . . . . . . . . . Baseline results for eval07. ROVER results come with confidence score based voting and with majority voting. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Corpora statistics for the training/tuning set (eval06) and the evaluation set (eval07). . . CN oracle error rates for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iROVER combination results for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Combination results with Boostexter (BT) and random forests (RF) as classifier for eval07. Results are word error rates; the bracketed numbers show the deletion and insertion fraction. The WER for the Viterbi decoding result of the best single system is 9.38%. . . . . . . . . Error detection and correction results for eval07 for four systems and with a random forest as classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Training, tuning (dev07), and test sets. The word-dependent scaling factors are trained on the 120h “λ-training” set. For the first test set no word-segmented transcripts are available. Lattice re-scoring results with various acoustic models. The lattice sets are generated with the MFCC model and subsequently re-scored with the PLP and resp. 
with the Gammatone (GT) acoustic model, where the character boundaries are kept fixed. The acoustic models were estimated on the 230h AM training set. . . . . . . . . . . . . . . . . . . . . . . . . . Statistics for word-dependent scaling factors on dev07: number of word-dependent scaling factors and coverage of running words for a given cut-off Nmin . . . . . . . . . . . . . . . CN-decoding results for the log-linear model combination using word-, character-, and syllable-dependent scaling factors. The scaling factors are trained on 120h using either minimum phone error (MPE) or minimum character error (MWE) training. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. CN-decoding results for log-linear model combinations and for a system combination using the weighted average of sentence posteriors. Results are character error rates; the bracketed numbers show the deletion and insertion fraction. The baseline is the Viterbi decoding result of the MFCC model, the best single acoustic model. . . . . . . . . . . . . . . . . . .

95

96 97

99

99

105 105 106

107

107 108

113

113 114

116

117

161

B.1  Corpora statistics for the Chinese GALE systems.
B.2  Subsystems in the Chinese 230 testing system.
B.3  System combinations for the Chinese 230 testing system.
B.4  Subsystems in the RWTH Aachen Chinese GALE 2008 evaluation system.
B.5  Corpora statistics for the English EPPS systems.
B.6  Subsystems in the RWTH Aachen English EPPS 2007 evaluation system.


Bibliography A. M. H. J. Aertsen, P. I. M. Johannesma, and D. J. Hermes. Spectro-temporal receptive fields of auditory neurons in the grassfrog. Biological Cybernetics, 38:235–248, November 1980. Cyril Allauzen and Mehryar Mohri. An optimal pre-determinization algorithm for weighted transducers. Theoretical Computer Science, 328(1-2):3 – 18, November 2004. Cyril Allauzen, Mehryar Mohri, Brian Roark, and Michael Riley. A generalized construction of integrated speech recognition transducers. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal, Canada, May 2004. Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. Openfst: a general and efficient weighted finite-state transducer library. In 12th International Conference on Implementation and Application of Automata (CIAA 2007), volume 4783, pages 11–23, Prague, Czech Republic, July 2007. Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, Germany. P. Alleva, X. D. Huang, and M. Y. Hwang. Improvements on the pronunciation prefix tree search organization. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 133–136, Atlanta, GA, USA, May 1996. Ladan Baghai-Ravary, Greg Kochanski, and John Coleman. Precision of phoneme boundaries derived using hidden markov models. In Interspeech, pages 2879–2883, Brighton, U.K., September 2009. L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:179–190, March 1983. L. R. Bahl, M. Padmanabhan, D. Nahamoo, and P. S. Gopalakrishnan. Discriminative training of Gaussian mixture models for large vocabulary speech recognition systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 613–616, Atlanta, GA, USA, May 1996. L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 49–52, Tokyo, Japan, May 1986. J. K. Baker. Stochastic modeling for automatic speech understanding. In D. R. Reddy, editor, Speech Recognition, pages 512–542. Academic Press, New York, NY, USA, 1975. R. Bakis. Continuous speech word recognition via centisecond acoustic states. In ASA Meeting, Washington, DC, USA, April 1976. L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha, editor, Inequalities, volume 3, pages 1–8. Academic Press, New York, NY, 1972. T. Bayes. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763. Reprinted in Biometrika, vol. 45, no. 3/4, pp. 293–315, December 1958. R. E. Bellman. Dynamic programming. Princeton University Press, Princeton, NJ, USA, 1957. K. Beulen. Phonetische Entscheidungsb¨ aume f¨ ur die automatische Spracherkennung mit großem Vokabular. PhD thesis, Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany, July 1999.


Appendix D Bibliography K. Beulen, S. Ortmanns, and C. Elting. Dynamic programming search techniques for across-word modeling in speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 609–612, Phoenix, AZ, March 1999. P. Beyerlein. Discriminative model combination. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 238 – 245, Santa Barbara, CA, USA, December 1997. P. Beyerlein. Discriminative model combination. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 481 – 484, Seattle, WA, USA, May 1998. P. Beyerlein, X. L. Aubert, R. Haeb-Umbach, M. Harris, D. Klakow, A. Wendemuth, S. Molau, M. Pitz, and A. Sixtus. The philips/rwth system for transcription of broadcast news. In Proc. DARPA Broadcast News Workshop,, pages 151–155, Herndon, VI, February 1999. Peter Beyerlein. Diskriminative Modellkombination in Spracherkennungssystemen mit großem Wortschatz. PhD thesis, RWTH Aachen University, Aachen, Germany, October 2000. M. Bisani and H. Ney. Multigram-based grapheme-to-phoneme conversion for LVCSR. In Interspeech, pages 933–936, Geneva, Switzerland, September 2003. C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. C. Breslin and M. J. F. Gales. Generating complementary systems for speech recognition. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006. C. Breslin and M. J. F. Gales. Complementary system generation using directed decision trees. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 337– 340, Honululu, HI, USA, April 2007a. C. Breslin and M. J. F. Gales. Building multiple complementary systems using directed decision trees. In Interspeech, Antwerp, Belgium, August 2007b. Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne, and Michel Comeau. Gpu accelerated acoustic likelihood computations. In Interspeech, Brisbane, Australia, September 2008. Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, and Kai-Fu Lee. Large vocabulary mandarin speech recognition with different approaches in modeling tones. In International Conference on Spoken Language Processing (ICSLP), pages 983–986, Beijing, China, October 2000. B. Chen, Q. Zhu, and N. Morgan. Learning long-term temporal features in LVCSR using neural networks. In Interspeech, Jeju Island, Korea, October 2004. C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A. Picheny, and K. Shen. New methods in continuos Mandarin speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 3, pages 1543–1546, Rhodes, Greece, September 1997. C. J. Chen, H. Li, L. Shen, and G. K. Fu. Recognize tone languages using pitch information on the main vowel of each syllable. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 61–64, Salt Lake City, USA, May 2001. I-Fan Chen and Lin-Shan Lee. A new framework for system combination based on integrated hypothesis space. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006. S. S. Chenand and P. S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the bayesian information criterion. In DARPA Broadcast News Transcription and Understanding Workshop, pages 127–132, February 1998.


Appendix D Bibliography J. T. Chien, C. H. Huang, K. Shinoda, and S. Furui. Towards optimal bayes decision for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006. S.B. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-28(4):357 – 366, August 1980. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1 – 38, 1977. Thomas G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857: 1–15, 2000a. Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, August 2000b. E. W. Dijkstra. A note on two problems in connection with graphs. Numerische Mathematik, 1:269–271, 1959. G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds. The NIST speaker recognition evaluation – overview, methodology, systems, results, perspective. Speech Communication, 31(2–3): 225–254, June 2000. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, New York, NY, USA, 2001. A. Emami, K. Papineni, and J. Sorenson. Large-scale distributed language modeling. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 37–40, Honolulu, HI, USA, April 2007. G. Evermann and P. Woodland. Posterior probability decoding, confidence estimation and system combination. In NIST Speech Transcription Workshop, College Park, MD, USA, 2000. G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, L. Wang D. Mrva, and P.C. Woodland. Development of the 2003 cu-htk conversational telephone speech transcription system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 261–264, Montreal, Canada, May 2003. Daniele Falavigna, Nicola Bertoldi, Fabio Brugnara, Roldano Cattoni, Mauro Cettolo Boxing Chen, Marcello Federico, Diego Giuliani, Roberto Gretter, Deepa Gupta, and Dino Seppi. The irst english-spanish translation system for european parliament speeches. In International Conference on Spoken Language Processing (ICSLP), pages 2833–2837, Antwerp, Belgium, August 2007. J.G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 347 – 354, Santa Barbara, CA, USA, December 1997. R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(179-188), 1936. M. J. E. Gales and P. C. Woodland. Mean and variance adaptation within the mllr framework. Computer Speech and Language, 10(4):249–264, 1996. M. Generet, H. Ney, and F. Wessel. Extensions to absolute discounting for language modeling. In European Conference on Speech Communication and Technology (Eurospeech), volume 2, pages 1245– 1248, Madrid, Spain, September 1995. M. Gibson and T. Hain. Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition. In Interspeech, Pittsburgh, PA, USA, September 2006.


Appendix D Bibliography Matthew Gibson. Minimum Bayes Risk Acoustic Model Estimation and Adaptation. PhD thesis, University of Sheffield, Sheffield, UK, November 2008. V. Goel and W.J. Byrne. Minimum bayes-risk automatic speech recognition. Computer Speech and Language, 14:115–136, 2000. V. Goel, W. Byrne, and S Khudanpur. Lvcsr rescoring with modified loss functions: a decision theoretic perspective. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 425–428, Seattle, WA, USA, 1998. V. Goel, S. Kumar, and W.J. Byrne. Segmental minimum bayes-risk decoding for automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 12:234 – 249, 2004. Vaibhava Goel, Shankar Kumar, and William Byrne. Segmental minimum bayes-risk asr voting strategies. In International Conference on Spoken Language Processing (ICSLP), pages 139–142, Beijing, China, October 2000. Vaibhava Goel, Shankar Kumar, and William Byrne. Confidence based lattice segmentation and minimum bayes-risk decoding. In European Conference on Speech Communication and Technology (Eurospeech), pages 2569–2572, Aalborg, Denmark, September 2001. D. Guiliani and F. Brugnara. Acoustic model adaptation with multiple supervisions. In Proc. TC-Star Workshop on Speech-to-Speech Translation, pages 151–154, Barcelona, Spain, June 2006. D. Guiliani and F. Brugnara. Experiments on cross-system acoustic model adaptation. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 117–122, Kyoto, Japan, December 2007. A. Gunawardana, M. Mahajan, A. Acero, and J.C. Platt. Hidden conditional random fields for phone classification. In Interspeech, pages 117 – 120, Lisbon, Portugal, September 2005. R. H¨ ab-Umbach and H. Ney. Improvements in beam search for 10000-word continuous-speech recognition. IEEE Transactions on Speech and Audio Processing, 2(2):353–356, April 1994. R. Haeb-Umbach, X. Aubert, P. Beyerlein, D. Klaskow, M. Ullrich, A. Wendemuth, and P. Wilcox. Acoustic modeling in the philips hub-4 continous-speech recognition system. In DARPA Broadcast News Transcription and Understanding Workshop, February 1998. D. Hakkani and G. Riccardi. A general algorithm for word graph matrix decomposition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 596–599, Hong Kong, April 2003. Georg Heigold, Thomas Deselaers, Ralf Schl¨ uter, and Hermann Ney. Modified mmi/mpe: A direct evaluation of the margin in speech recognition. In International Conference on Machine Learning, pages 384–391, Helsinki, Finland, July 2008. A typo from the original publication was corrected (marked in red). H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738 – 1752, June 1990. H. Hermansky, D.P.W. Ellis, and S. Sharma. Tandem connectionist feature stream extraction for conventional HMM systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1635–1638, Istanbul, Turkey, June 2000. L. Hetherington. Mit finite-state transducer toolkit for speech and language processing. In International Conference on Spoken Language Processing (ICSLP), pages 2609–2612, Jeju Island, Korea, October 2004. Dustin Hillard and Mari Ostendorf. Compensating for word posterior estimation bias in confusion networks. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, Toulouse, France, May 2006.


Dustin Hillard, Björn Hoffmeister, Mari Ostendorf, Ralf Schlüter, and Hermann Ney. iROVER: Improving system combination with classification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pages 65–68, Rochester, New York, April 2007.
Björn Hoffmeister, Tobias Klein, Ralf Schlüter, and Hermann Ney. Frame based system combination and a comparison with weighted ROVER and CNC. In Interspeech, pages 537–540, Pittsburgh, PA, USA, September 2006.
Björn Hoffmeister, Dustin Hillard, Stefan Hahn, Ralf Schlüter, Mari Ostendorf, and Hermann Ney. Cross-site and intra-site ASR system combination: Comparisons on lattice and 1-best methods. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1145–1148, Honolulu, HI, USA, April 2007.
Björn Hoffmeister, Christian Plahl, Peter Fritz, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Development of the 2007 RWTH Mandarin LVCSR system. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Kyoto, Japan, December 2007.
Björn Hoffmeister, Ralf Schlüter, and Hermann Ney. iCNC and iROVER: The limits of improving system combination with classification? In Interspeech, pages 232–235, Brisbane, Australia, September 2008.
Björn Hoffmeister, Ruoying Liang, Ralf Schlüter, and Hermann Ney. Log-linear model combination with word-dependent scaling factors. In Interspeech, pages 248–251, Brighton, U.K., September 2009.
Björn Hoffmeister, Ralf Schlüter, and Hermann Ney. Bayes risk approximations using time overlap with an application to system combination. In Interspeech, pages 1191–1194, Brighton, U.K., September 2009.
Roger Hsiao, Mark Fuhs, Yik-Cheung Tam, Qin Jin, and Tanja Schultz. The CMU-InterACT 2008 Mandarin transcription system. In Interspeech, pages 1445–1448, Brisbane, Australia, September 2008.
Jing Huang, Etienne Marcheret, Karthik Visweswariah, Vit Libal, and Gerasimos Potamianos. The IBM rich transcription 2007 speech-to-text systems for lecture meetings. Lecture Notes in Computer Science, 4625:429–441, 2009.
X. Huang, M. Belin, F. Alleva, and M. Hwang. Unified stochastic engine (USE) for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 636–639, Minneapolis, MN, USA, April 1993.
X. D. Huang and M. A. Jack. Semi-continuous hidden Markov models for speech signals. Computer Speech and Language, 3(3):329–252, 1989.
M.-Y. Hwang, G. Peng, W. Wang, A. Faria, A. Heidel, and M. Ostendorf. Building a highly accurate Mandarin speech recognizer. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 490–495, Kyoto, Japan, December 2007.
F. Jelinek. A fast sequential decoding algorithm using a stack. IBM Journal of Research and Development, 13:675–685, November 1969.
F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(10):532–556, April 1976.
N. Jennequin and J. L. Gauvain. Modeling duration via lattice rescoring. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, HI, USA, April 2007.
B.-H. Juang and S. Katagiri. Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40(12):3043–3054, 1992.
J. Kaiser, B. Horvat, and Z. Kacic. A novel loss function for the overall risk criterion based discriminative training of HMM models. In Interspeech, volume 2, pages 887–890, Beijing, China, October 2000.


S. Kanthak and H. Ney. FSA: An efficient and flexible C++ toolkit for finite state automata using on-demand computation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 510–517, Barcelona, Spain, July 2004.
S. Kanthak, K. Schütz, and H. Ney. Using SIMD instructions for fast likelihood calculation in LVCSR. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1531–1534, Istanbul, Turkey, June 2000.
S. Kanthak, H. Ney, M. Riley, and M. Mohri. A comparison of two LVR search optimization techniques. In International Conference on Spoken Language Processing (ICSLP), pages 1309–1312, Denver, CO, USA, September 2002.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Speech and Audio Processing, 35:400–401, March 1987.
Daniel Keysers, Franz Josef Och, and Hermann Ney. Efficient maximum entropy training for statistical object recognition. In Informatiktage der Gesellschaft für Informatik, pages 342–345, Bad Schussenried, Germany, November 2002.
Daniil Kocharov, Andras Zolnay, Ralf Schlüter, and Hermann Ney. Articulatory motivated acoustic features for speech recognition. In Interspeech, pages 1101–1104, Lisbon, Portugal, September 2005.
N. Kumar and A. G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4):283–297, December 1998.
Shankar Kumar and William Byrne. Risk based lattice cutting for segmental minimum Bayes-risk decoding. In International Conference on Spoken Language Processing (ICSLP), pages 373–376, Denver, CO, USA, September 2002.
L. Lamel, J.-L. Gauvain, G. Adda, C. Barras, E. Bilinski, O. Galibert, A. Pujol, H. Schwenk, and Xuan Zhu. The LIMSI 2006 TC-STAR EPPS transcription systems. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 997–1000, Honolulu, HI, USA, April 2007.
L. Lee and R. Rose. Speaker normalization using efficient frequency warping procedures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 353–356, Atlanta, GA, USA, May 1996.
C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171–185, 1995.
X. Lei, W. Wu, W. Wang, A. Mandal, and A. Stolcke. Development of the 2008 SRI Mandarin speech-to-text system for broadcast news and conversation. In Interspeech, Brighton, U.K., September 2009.
Xin Lei, Manhung Siu, Mei-Yuh Hwang, Mari Ostendorf, and Tan Lee. Improved tone modeling for Mandarin broadcast news speech recognition. In International Conference on Spoken Language Processing (ICSLP), pages 1237–1240, Pittsburgh, PA, USA, September 2006.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(10):707–710, 1966.
S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Technical Journal, 62(4):1035–1074, April 1983.
Andrej Ljolje, Fernando Pereira, and Michael Riley. Efficient general lattice generation and rescoring. In European Conference on Speech Communication and Technology (Eurospeech), pages 1251–1254, Budapest, Hungary, September 1999.
J. Lööf, M. Bisani, C. Gollan, G. Heigold, B. Hoffmeister, C. Plahl, R. Schlüter, and H. Ney. The 2006 RWTH parliamentary speeches transcription system. In TC-STAR Workshop on Speech-to-Speech Translation, pages 133–138, Barcelona, Spain, June 2006.


J. Lööf, M. Bisani, Ch. Gollan, G. Heigold, Björn Hoffmeister, Ch. Plahl, R. Schlüter, and H. Ney. The 2006 RWTH parliamentary speeches transcription system. In Interspeech, pages 105–108, Pittsburgh, PA, September 2006.
J. Lööf, Ch. Gollan, S. Hahn, G. Heigold, B. Hoffmeister, Ch. Plahl, D. Rybach, R. Schlüter, and H. Ney. The RWTH 2007 TC-STAR evaluation system for European English and Spanish. In Interspeech, Antwerp, Belgium, August 2007.
B. Lowerre. A Comparative Performance Analysis of Speech Understanding Systems. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 1976.
L. Mangu and M. Padmanabhan. Error corrective mechanisms for speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 29–32, Salt Lake City, UT, USA, May 2001.
L. Mangu, E. Brill, and A. Stolcke. Finding consensus among words: Lattice-based word error minimization. In European Conference on Speech Communication and Technology (Eurospeech), volume 1, pages 495–498, Budapest, Hungary, September 1999.
L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: Word error minimization and other applications of confusion networks. Computer Speech and Language, 14:373–400, 2000.
Lidia Mangu. Finding Consensus in Speech Recognition. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA, April 2000.
Sven C. Martin. Statistische Auswahl von Wortabhängigkeiten in der automatischen Spracherkennung. PhD thesis, RWTH Aachen University, Aachen, Germany, February 2000.
Evgeny Matusov, Arne Mauser, and Hermann Ney. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation, pages 158–165, Kyoto, Japan, November 2006.
Evgeny Matusov, Björn Hoffmeister, and Hermann Ney. ASR word lattice translation with exhaustive reordering is possible. In Interspeech, pages 2342–2345, Brisbane, Australia, September 2008.
E. McDermott and S. Katagiri. Minimum classification error for large scale speech recognition tasks using weighted finite state transducers. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
F. Metze and A. Waibel. A flexible stream architecture for ASR using articulatory features. In International Conference on Spoken Language Processing (ICSLP), pages 2133–2136, Denver, CO, USA, September 2002a.
F. Metze and A. Waibel. Auditory-based acoustic distinctive features and spectral cues for automatic speech recognition using a multi-stream paradigm. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 837–840, Orlando, FL, USA, May 2002b.
Hemant Misra, Hervé Bourlard, and Vivek Tyagi. New entropy based combination rules in HMM/ANN multi-stream ASR. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, April 2003.
M. Mohri. Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13(1):129–143, 2002a.
M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002b.
M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003.


M. Mohri. Weighted finite-state transducer algorithms: An overview. In Carlos Martín-Vide, Victor Mitrana, and Gheorghe Paun, editors, Formal Languages and Applications. Springer, Berlin, 2004.
M. Mohri and M. Riley. Weighted determinization and minimization for large vocabulary speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece, September 1997.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Speech recognition with weighted finite-state transducers. In Larry Rabiner and Fred Juang, editors, Handbook on Speech Processing and Speech Communication, Part E: Speech Recognition. Springer, Heidelberg, Germany, 2008.
S. Molau. Normalization in the Acoustic Feature Space for Improved Speech Recognition. PhD thesis, RWTH Aachen, Aachen, Germany, 2003.
Hy Murveit, John Butzberger, Vassilios Digalakis, and Mitch Weintraub. Progressive-search algorithms for large-vocabulary speech recognition. In HLT '93: Proceedings of the Workshop on Human Language Technology, pages 87–90, Morristown, NJ, USA, 1993. Association for Computational Linguistics.
J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
H. Ney. The use of a one-stage dynamic programming algorithm for connected word recognition. IEEE Transactions on Speech and Audio Processing, 32(2):263–271, April 1984.
H. Ney. Acoustic modeling of phoneme units for continuous speech recognition. In L. Torres, E. Masgrau, and M. A. Lagunas, editors, Signal Processing V: Theories and Applications, Fifth European Signal Processing Conference, pages 65–72. Elsevier Science Publishers B. V., Barcelona, Spain, 1990.
H. Ney and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. In International Conference on Spoken Language Processing (ICSLP), volume 3, pages 1355–1358, Yokohama, Japan, September 1994.
H. Ney, D. Mergel, A. Noll, and A. Paeseler. A data-driven organization of the dynamic programming beam search for continuous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 833–836, Dallas, TX, USA, April 1987.
H. Ney, R. Häb-Umbach, B.-H. Tran, and M. Oerder. Improvements in beam search for 10000-word continuous speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 9–12, San Francisco, CA, March 1992.
H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in language modeling. Computer Speech and Language, 2(8):1–38, 1994.
H. Ney, S. C. Martin, and F. Wessel. Statistical language modeling using leaving-one-out. In S. Young and G. Bloothooft, editors, Corpus Based Methods in Language and Speech Processing, pages 1–26. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997.
Tim Ng, Bing Zhang, Kham Nguyen, and Long Nguyen. Progress in the BBN 2007 Mandarin speech to text system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1537–1540, Las Vegas, NV, USA, April 2008.
Y. Normandin, R. Lacouture, and R. Cardin. MMIE training for large vocabulary continuous speech recognition. In International Conference on Spoken Language Processing, pages 1367–1370, Yokohama, Japan, September 1994.
M. K. Omar and L. Mangu. An evaluation of lattice scoring using a smoothed estimate of word accuracy. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 1149–1152, Honolulu, HI, USA, April 2007.


S. Ortmanns and H. Ney. An experimental study of the search space for 20000-word speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 2, pages 901–904, Madrid, Spain, September 1995.
S. Ortmanns, H. Ney, and A. Eiden. Language-model look-ahead for large vocabulary speech recognition. In International Conference on Spoken Language Processing (ICSLP), volume 4, pages 2095–2098, Philadelphia, PA, October 1996.
S. Ortmanns, H. Ney, and X. Aubert. A word graph algorithm for large vocabulary continuous speech recognition. Computer Speech and Language, 11(1):43–72, January 1997a.
S. Ortmanns, H. Ney, and T. Firzlaff. Fast likelihood computation methods for continuous mixture densities in large vocabulary speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), volume 1, pages 139–142, Rhodes, Greece, September 1997b.
S. Ortmanns, A. Eiden, and H. Ney. Improved lexical tree search for large vocabulary recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 817–820, Seattle, WA, USA, May 1998.
M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz, and J. R. Rohlicek. Integration of diverse recognition methodologies through reevaluation of N-best sentence hypotheses. In DARPA Speech and Natural Language Processing Workshop, pages 83–87, Pacific Grove, CA, USA, 1991.
Naveen Parihar, Ralf Schlüter, David Rybach, and Eric A. Hansen. Parallel fast likelihood computation for LVCSR using mixture decomposition. In Interspeech, Brighton, U.K., September 2009.
D. B. Paul. Algorithms for an optimal A* search and linearizing the search in the stack decoder. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 693–696, Toronto, Canada, May 1991.
M. Pitz. Investigations on Linear Transformations for Speaker Adaptation and Normalization. PhD thesis, RWTH Aachen University, 2005.
Christian Plahl, Björn Hoffmeister, Mei-Yuh Hwang, Danju Lu, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Recent improvements of the RWTH GALE Mandarin LVCSR system. In Interspeech, pages 2426–2429, Brisbane, Australia, September 2008.
Christian Plahl, Björn Hoffmeister, Georg Heigold, Jonas Lööf, Ralf Schlüter, and Hermann Ney. Development of the GALE 2008 Mandarin LVCSR system. In Interspeech, pages 2107–2110, Brighton, U.K., September 2009.
D. Povey and P. C. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 105–108, Orlando, FL, May 2002.
R. Prasad, S. Matsoukas, C. L. Kao, J. Z. Ma, D. X. Xu, T. Colthurst, O. Kimball, R. Schwartz, J. L. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre. The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system. In Interspeech, Lisbon, Portugal, September 2005.
L. Rabiner and B.-H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall Signal Processing Series, Englewood Cliffs, NJ, 1979.


B. Ramabhadran, Olivier Siohan, L. Mangu, G. Zweig, M. Westphal, H. Schulz, and A. Soneiro. The IBM 2006 speech transcription system for European parliamentary speeches. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
V. Ramasubramanian and K. K. Paliwal. Fast k-dimensional tree algorithms for nearest neighbor search with application to vector quantization encoding. IEEE Transactions on Speech and Audio Processing, 40(3):518–528, March 1992.
M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The Rprop algorithm. In IEEE International Conference on Neural Networks (ICNN), pages 586–591, San Francisco, CA, USA, 1993.
H. Sakoe. Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition. IEEE Transactions on Speech and Audio Processing, 27:588–595, December 1979.
A. Sankar. Bayesian model combination (BAYCOM) for improved recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 845–848, Philadelphia, PA, USA, April 2005.
R. R. Sarukkai and D. H. Ballard. Improved spontaneous dialogue recognition using dialogue and utterance triggers by adaptive probability boosting. In Interspeech, volume 1, pages 208–211, Philadelphia, PA, USA, October 1996.
R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 2001.
R. Schlüter. Investigations on Discriminative Training Criteria. PhD thesis, RWTH Aachen University, Aachen, Germany, September 2000.
Ralf Schlüter, Thomas Scharrenbach, Volker Steinbiss, and Hermann Ney. Bayes risk minimization using metric loss functions. In European Conference on Speech Communication and Technology (Eurospeech), pages 1449–1452, Lisbon, Portugal, September 2005.
Ralf Schlüter, Andras Zolnay, and Hermann Ney. Feature combination using linear discriminant analysis and its pitfalls. In International Conference on Spoken Language Processing (ICSLP), pages 345–348, Pittsburgh, PA, USA, September 2006.
Ralf Schlüter, Ilja Bezrukov, Hermann Wagner, and Hermann Ney. Gammatone features and feature combination for large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, HI, USA, April 2007.
R. Schwartz and Y.-L. Chow. The N-best algorithm: An efficient and exact procedure for finding the N most likely sentence hypotheses. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 81–84, Albuquerque, NM, April 1990.
O. Siohan, B. Ramabhadran, and B. Kingsbury. Constructing ensembles of ASR systems using randomized decision trees. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
A. Sixtus. Across-Word Phoneme Models for Large Vocabulary Continuous Speech Recognition. PhD thesis, RWTH Aachen, January 2003.
A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 593–596, Phoenix, Arizona, USA, March 1999.


H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig. The IBM 2004 conversational telephony system for rich transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 205–208, Philadelphia, PA, USA, March 2005.
V. Steinbiss, H. Ney, R. Häb-Umbach, B.-H. Tran, U. Essen, R. Kneser, M. Oerder, H. G. Meier, X. Aubert, C. Dugast, and D. Geller. The Philips research system for large-vocabulary continuous-speech recognition. In European Conference on Speech Communication and Technology (Eurospeech), pages 2125–2128, Berlin, Germany, September 1993.
A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. Rao Gadde, M. Plauche, C. Richey, E. Shriberg, K. Sönmez, F. Weng, and J. Zheng. The SRI March 2000 Hub-5 conversational speech transcription system. In NIST Speech Transcription Workshop, College Park, MD, USA, May 2000.
Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Interspeech, pages 901–904, Denver, CO, September 2002.
Andreas Stolcke, Yochai König, and Mitchel Weintraub. Explicit word error minimization in N-best list rescoring. In European Conference on Speech Communication and Technology (Eurospeech), pages 163–166, Rhodes, Greece, 1997.
Sebastian Stüker, Christian Fügen, Susanne Burger, and Matthias Wölfel. Cross-system adaptation and combination for continuous speech recognition: The influence of phoneme set and acoustic front-end. In Interspeech, Pittsburgh, PA, USA, September 2006.
Sebastian Stüker, Christian Fügen, Florian Kraft, and Matthias Wölfel. The ISL 2007 English speech transcription system for European parliament speeches. In International Conference on Spoken Language Processing (ICSLP), pages 2069–2072, Antwerp, Belgium, August 2007.
Alain Tritschler and Ramesh A. Gopinath. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In European Conference on Speech Communication and Technology (Eurospeech), pages 679–682, Budapest, Hungary, September 1999.
Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
Fabio Valente. A novel criterion for classifiers combination in multistream speech recognition. IEEE Signal Processing Letters, 16(7):561–564, July 2009.
Fabio Valente, Jithendra Vepa, Christian Plahl, Christian Gollan, Hynek Hermansky, and Ralf Schlüter. Hierarchical neural networks feature extraction for LVCSR system. In Interspeech, Antwerp, Belgium, August 2007.
V. Venkataramani, S. A. Chakrabartty, and W. J. Byrne. Support vector machines for segmental minimum Bayes risk decoding of continuous speech. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), St. Thomas, VI, USA, November 2003.
Veera Venkataramani, Shantanu Chakrabartty, and William Byrne. Gini support vector machines for segmental minimum Bayes risk decoding of continuous speech. Computer Speech and Language, 21, July 2007.
D. Vergyri, S. Tsakalidis, and W. Byrne. Minimum risk acoustic clustering for multilingual acoustic model combination. In International Conference on Spoken Language Processing (ICSLP), pages 873–876, Beijing, China, October 2000.
D. Vergyri, A. Mandal, W. Wang, A. Stolcke, J. Zheng, M. Graciarena, D. Rybach, C. Gollan, R. Schlüter, K. Kirchhoff, A. Faria, and N. Morgan. Development of the SRI/Nightingale Arabic ASR system. In Interspeech, pages 1437–1440, Brisbane, Australia, September 2008.
Dimitra Vergyri. Integration of Multiple Knowledge Sources in Speech Recognition Using Minimum Error Training. PhD thesis, Johns Hopkins University, Baltimore, Maryland, USA, 2000.


T. K. Vintsyuk. Elementwise recognition of continuous speech composed of words from a specified dictionary. Kibernetika, 7:133–143, March 1971.
A. Viterbi. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory, 13:260–269, 1967.
F. Wessel. Word Posterior Probabilities for Large Vocabulary Continuous Speech Recognition. PhD thesis, RWTH Aachen, Aachen, Germany, 2002.
F. Wessel, K. Macherey, and R. Schlüter. Using word probabilities as confidence measures. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 225–228, Seattle, WA, USA, May 1998.
Frank Wessel, Ralf Schlüter, and Hermann Ney. Using posterior word probabilities for improved speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1587–1590, Istanbul, Turkey, June 2000.
Frank Wessel, Ralf Schlüter, Klaus Macherey, and Hermann Ney. Confidence measures for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 9(3):288–298, March 2001.
Frank Wessel, Ralf Schlüter, and Hermann Ney. Explicit word error minimization using word hypothesis posterior probabilities. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 33–36, Salt Lake City, Utah, May 2001.
Daniel Willett and Chuang He. Discriminative training for complementariness in system combination. In Interspeech, Brisbane, Australia, September 2008.
P. C. Woodland and D. Povey. Large scale discriminative training for speech recognition. In Automatic Speech Recognition (ASR), pages 7–16, Paris, France, September 2000.
P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech and Language, 16(1):25–48, 2002.
Haihua Xu, Daniel Povey, Jie Zhu, and Guanyong Wu. Minimum hypothesis phone error as a decoding method for speech recognition. In Interspeech, pages 76–79, Brighton, U.K., September 2009.
J. Xue and Y. Zhao. Random forests-based confidence annotation using novel features from confusion network. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 1149–1152, Toulouse, France, May 2006.
Jian Xue and Yunxin Zhao. Improved confusion network algorithm and shortest path search from word lattice. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 853–856, Philadelphia, PA, USA, March 2005.
Richard Zens, Saša Hasan, and Hermann Ney. A systematic comparison of training criteria for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing, pages 524–532, Prague, Czech Republic, June 2007.
R. Zhang and A. Rudnicky. Investigations of issues for using multiple acoustic models to improve continuous speech recognition. In International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006.
J. Zheng and A. Stolcke. Improved discriminative training using phone lattices. In Interspeech, pages 2125–2128, Lisbon, Portugal, September 2005.


Andras Zolnay. Acoustic Feature Combination for Speech Recognition. PhD thesis, RWTH Aachen University, Aachen, Germany, August 2006.
Andras Zolnay, Ralf Schlüter, and Hermann Ney. Acoustic feature combination for robust speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 457–460, Philadelphia, PA, USA, March 2005.


Curriculum Vitae

Personal Information
Name: Björn Hoffmeister
Date of birth: November 26, 1976
Place of birth: Aachen, Germany
Nationality: German

Education
1983 – 1986   Trinkbornschule in Rödermark, Germany
1986 – 1993   Oswald-von-Nell-Breuning-Schule (former Rodgauschule) in Rödermark, Germany
1993 – 1996   Abitur, Alfred-Delp-Schule in Dieburg, Germany
1997 – 2003   Diplom in Informatik, Universität zu Lübeck, Germany

Working Experience
2003 – 2004   Institut für Theoretische Informatik, Universität zu Lübeck, Germany; research assistant (machine learning)
2004 – 2010   Chair of Computer Science 6 (Human Language Technology and Pattern Recognition), RWTH Aachen University, Germany; research assistant and Ph.D. student (statistical speech recognition)
Summer 2009   Internship at NTT Communication Science Laboratories, Kyoto, Japan
