CUED HTK Large Vocabulary Recognition Systems
Over the years a number of generations of HTK-based large vocabulary
systems have been built at CUED. The initial impetus for each
generation was often a DARPA/NIST evaluation. This section gives a
brief overview of the features of these systems and how they relate to
the features present in released versions of HTK. Each of these
systems described below has represented the state-of-the-art when it
was produced (either the lowest error rate in the evaluation or not a
statistically significant difference to the lowest error rate system).
The major tool currently missing from the distributed HTK releases for
reproducing these systems is a capable large vocabulary decoder
supporting trigram and 4-gram language models and cross-word triphone
and quinphone acoustic models. However, as the sections below show,
many other features have been incorporated into the CUED HTK systems.
In future we hope to make many of these available in released versions
of HTK.
The first true HTK-based large vocabulary system was built at CUED in
1993 for that year's DARPA/NIST Wall Street Journal evaluation. The
core training used an extended version of HTK V1.5 which included
decision tree state clustering, and the supplied bigram and trigram
language models were used. The decoder used was JRlx (see ICASSP94
Paper).
For the 1994 (Hub 1) evaluation a number of other features were used,
including maximum likelihood linear regression (MLLR) adaptation and
the use of quinphone models in a lattice-rescoring pass with a 65k
4-gram language model. The quinphone state-clustering and decoding
facilities are not present in any released HTK software (see ICASSP95
Paper).
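MLLR adapts the Gaussian means of a speaker-independent model using
one or more affine transforms estimated from a small amount of
adaptation data, so that the adapted mean is mu_hat = A*mu + b. The
following is a rough illustrative sketch of that core operation only;
the function name and matrix layout are assumptions, and the actual
implementation estimates the transform to maximise the likelihood of
the adaptation data and shares transforms across Gaussians via a
regression class tree:

    import numpy as np

    def mllr_adapt_means(means, W):
        """Apply one MLLR transform to a set of Gaussian means.

        means : (N, d) array of component mean vectors
        W     : (d, d+1) transform [A | b], so mu_hat = A @ mu + b
        """
        A, b = W[:, :-1], W[:, -1]
        return means @ A.T + b

Because a single transform updates many Gaussians at once, useful
adaptation is possible from far less data than would be needed to
re-estimate the means directly.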
The 1995 (Hub 3) evaluation focussed on large vocabulary transcription
of clean and noisy speech. The CUED system was the first HTK system to
use a PLP parameterisation, and it used MLLR mean and variance
adaptation (available in HTK V2.2). MLLR was applied before lattice
generation and the lattices were then rescored with adapted quinphone
models. Models for noisy environments were trained using single-pass
retraining (present in HTK V2.0) from clean-speech models (see 1996
DARPA Paper).
The 1996 broadcast news evaluation (Hub 4) required transcription of
pre-segmented and labelled portions of broadcast news audio. The
front-end used was PLP derived from a mel-scale filterbank. Models
(triphones and
quinphones) were adapted to broadcast news data via MLLR and MAP for
each data type. Triphones were used to generate word lattices (4-gram
LM) and then these were re-scored with quinphones (see 1997 DARPA Paper).
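MAP adaptation, in contrast to MLLR, interpolates each Gaussian mean
between its prior (source-model) value and the statistics gathered for
that component from the adaptation data, so poorly observed components
stay close to the prior. A minimal sketch of the standard MAP mean
update; the helper name and the fixed prior weight tau are
illustrative assumptions, not HTK code:

    import numpy as np

    def map_adapt_mean(mu_prior, obs, gamma, tau=10.0):
        """MAP update of one Gaussian mean.

        mu_prior : (d,) prior mean from the source model
        obs      : (T, d) adaptation frames
        gamma    : (T,) occupation probabilities of this component
        tau      : prior weight; larger tau trusts the prior more
        """
        occ = gamma.sum()
        weighted_sum = gamma @ obs      # sum_t gamma(t) * o(t)
        return (tau * mu_prior + weighted_sum) / (tau + occ)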
For the 1997 broadcast news evaluation (Hub 4), speech was first
segmented using Gaussian mixture models and a phone recogniser. Groups
of clustered segments were then used for MLLR adaptation, and word
lattices were generated (with a 4-gram LM interpolated with a class
trigram) using triphone HMMs trained on 70 hours of broadcast news
data. These lattices were re-scored with quinphone HMMs, and the
quinphone and triphone outputs were combined using the NIST ROVER
program (see 1998 DARPA Paper and ICASSP98 Paper).
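ROVER combines the outputs of several recognisers by aligning their
word strings into a single composite network and voting within each
aligned slot, optionally weighting votes by confidence. A minimal
sketch of the unweighted voting stage only, assuming the alignment has
already been computed (the NIST tool itself also performs the
alignment):

    from collections import Counter

    def rover_vote(aligned_hyps):
        """aligned_hyps: equal-length word lists, one per system,
        with None marking a gap (insertion/deletion) in that system.
        Returns the majority-vote word sequence."""
        output = []
        for slot in zip(*aligned_hyps):
            word, _ = Counter(slot).most_common(1)[0]
            if word is not None:    # a gap can also win the vote
                output.append(word)
        return output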
The 1998 broadcast news evaluation (Hub 4) system was an evolution of
the 1997 system. The full system included cluster-based variance
normalisation, vocal-tract length normalisation (VTLN) and
full-variance transforms (none of these are included in released
versions of HTK up to V3.0). Separate LMs were built for different
sources and interpolated to form a single model. Again combined
triphone and quinphone rescoring passes were used (see 1999 DARPA
Paper). For this evaluation CUED and Entropic also built a system that
operated in less than ten times real-time, using the fast Entropic
decoder with a two-pass strategy. The result was about a 2% absolute
increase in error rate relative to the full system, which ran in 300
times real-time (see 1999 "10XRT" DARPA Paper).
The 1998 conversational speech evaluation (Hub 5) required the
transcription of telephone conversations. The CUED HTK system used
models trained and tested using conversation side-based cepstral
mean/variance normalisation and VTLN with reduced bandwidth speech
analysis. Both triphone and quinphone HMMs were trained on 180 hours
of data and used in a multi-stage recognition process, first
generating lattices with MLLR-adapted triphones and then rescoring
these with adapted quinphones. The outputs of the triphone and
quinphone stages were again combined with ROVER (see 1999 ICASSP
Paper).
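Conversation side-based cepstral mean/variance normalisation computes
the statistics over all frames of one side of the call, removing the
fixed channel offset and equalising the dynamic range of each cepstral
dimension. A simplified batch sketch (the HTK tools differ in detail):

    import numpy as np

    def cmvn(feats, eps=1e-8):
        """Normalise a (T, d) matrix of cepstral features for one
        conversation side to zero mean and unit variance per dim."""
        mean = feats.mean(axis=0)
        std = feats.std(axis=0)
        return (feats - mean) / (std + eps)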
The March 2000 HTK Hub5 system built on the 1998 system, but used the
standard maximum likelihood estimation (MLE) tools provided in HTK
alongside maximum mutual information estimation (MMIE) training for
both triphones and quinphones. The soft-tying technique was used with
the MLE HMMs, and pronunciation probabilities and full-variance
adaptation were included along with standard MLLR. The system also
used confusion networks, generated by post-processing lattices at each
stage, to produce minimum expected word error rate output and
confidence scores, and to allow combination of the MLE/MMIE triphone
and quinphone outputs (see Speech Transcription Workshop 2000 Paper).
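A confusion network collapses a word lattice into a linear sequence of
bins, each containing competing words with their posterior
probabilities. Picking the highest-posterior word in each bin
minimises the expected word error rate (rather than the sentence error
rate that ordinary Viterbi decoding minimises), and the winning
posterior serves directly as a word confidence score. A minimal sketch
of the final selection step, assuming the bins have already been built
from the lattice:

    def consensus_decode(confusion_network):
        """confusion_network: list of bins; each bin maps a word
        (or None for 'no word here') to its posterior probability.
        Returns (words, confidences) for the minimum expected WER
        hypothesis."""
        words, confidences = [], []
        for bin_posteriors in confusion_network:
            best = max(bin_posteriors, key=bin_posteriors.get)
            if best is not None:
                words.append(best)
                confidences.append(bin_posteriors[best])
        return words, confidences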
In March 2001 a revised version of the HTK Hub5 system was produced
which included an improved MMIE training procedure, lattice-based MLLR
estimation and revised normalisation procedures. The NIST March 2001
evaluation data included data recorded over conventional telephone
lines as well as data from calls over cellular channels. For a full
description and results see the 2001 LVCSR workshop presentation.
The system developed for the Switchboard part of the April 2002 Rich
Transcription evaluation used acoustic models trained with Minimum
Phone Error (MPE) training. The standard PLP features were augmented
with third differentials and then projected down to 39 dimensions
using an HLDA transform. Other new features included speaker adaptive
training (SAT), a special single pronunciation dictionary (SPRON) and
a new LVR decoder (HDecode). A faster version of the full system,
running in less than 10 times real-time, was also developed. For a
full description and results see the 2002 Rich Transcription workshop
presentation.
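In the HLDA front-end mentioned above, the usual static PLP
coefficients plus first and second differentials are augmented with
third differentials (52 dimensions with 13 statics) and projected down
to 39 dimensions by a transform estimated under a diagonal-covariance
Gaussian assumption. A sketch of applying such a pipeline, assuming a
39x52 transform has already been estimated; the differential
computation here is simplified relative to HTK's windowed regression
formula:

    import numpy as np

    def add_differentials(static, order=3):
        """Append differentials up to the given order to a (T, 13)
        matrix of static PLP features, giving (T, 13 * (order + 1))."""
        blocks = [static]
        for _ in range(order):
            blocks.append(np.gradient(blocks[-1], axis=0))
        return np.concatenate(blocks, axis=1)

    def hlda_project(feats_52, hlda_39x52):
        """Keep only the 39 'useful' HLDA dimensions."""
        return feats_52 @ hlda_39x52.T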
For the April 2003 Rich Transcription Evaluation a number of LVR
systems were developed. The unlimited-compute conversational telephone
speech system (CTS, previously known as Switchboard or Hub5) was
similar in structure to the 2002 system, but utilised improved
acoustic and language models and performed automatic segmentation of
the audio data. A less than 10xRT CTS system was also developed, which
employed 2-way system combination and lattice-based adaptation; it
achieved word error rates only 5-7% relative worse than the full
(190xRT) system.
For the first time since 1998 a CU-HTK Broadcast News system was
developed. It was designed to run in less than 10xRT and incorporated
many features found in recent CTS systems (for example HLDA, MPE, SAT,
lattice-based adaptation and multi-way system combination) plus a
number of new techniques such as MPE-MAP for training gender-dependent
models and improved MPE training. An initial version of a CTS system
for Mandarin Chinese was built.
For more details see our two ASRU papers (BN; fast systems) and a
series of 2003 Rich Transcription workshop presentations (CTS; BN;
fast systems; Mandarin).
Phil Woodland, Gunnar Evermann
September 2000
Last updated September 2003