CUED HTK Large Vocabulary Recognition Systems
Over the years a number of generations of HTK-based large vocabulary
systems have been built at CUED. The initial impetus for each
generation was often a DARPA/NIST evaluation. This section gives a
brief overview of the features of these systems and how they relate to
the features present in released versions of HTK. Each of these
systems described below has represented the state-of-the-art when it
was produced (either the lowest error rate in the evaluation or not a
statistically significant difference to the lowest error rate system).
The major tool currently missing from the distributed HTK releases for
reproducing these systems is a capable large vocabulary decoder
supporting trigram and 4-gram language models and cross-word triphone
and quinphone acoustic models. However, as the sections below show,
many other features have been incorporated into the CUED HTK systems.
In future we hope to make many of these available in released versions
of HTK.
The first true HTK-based large vocabulary system was built at CUED in
1993 for that year's DARPA/NIST Wall Street Journal evaluation. The
core training used an extended version of HTK V1.5 which included
decision tree state clustering, and the supplied bigram and trigram
language models were used. The decoder used was JRlx (see ICASSP94
Paper).
For the 1994 (Hub 1) evaluation a number of other features were used,
including maximum likelihood linear regression (MLLR) adaptation and
the use of quinphone models in a lattice-rescoring pass with a 65k
4-gram language model. The quinphone state-clustering and decoding
facilities are not present in any released HTK software (see ICASSP95
Paper).
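MLLR adapts the Gaussian means of a speaker-independent model using
one or more affine transforms estimated from a small amount of
adaptation data, so that the adapted mean is mu_hat = A*mu + b. The
following is a rough illustrative sketch of that core operation only;
the function name and matrix layout are assumptions, and the actual
implementation estimates the transform to maximise the likelihood of
the adaptation data and shares transforms across Gaussians via a
regression class tree:

    import numpy as np

    def mllr_adapt_means(means, W):
        """Apply one MLLR transform to a set of Gaussian means.

        means : (N, d) array of component mean vectors
        W     : (d, d+1) transform [A | b], so mu_hat = A @ mu + b
        """
        A, b = W[:, :-1], W[:, -1]
        return means @ A.T + b

Because a single transform updates many Gaussians at once, useful
adaptation is possible from far less data than would be needed to
re-estimate the means directly.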
The 1995 (Hub 3) evaluation focussed on large vocabulary transcription
of clean and noisy speech. The CUED system was the first HTK system to
use a PLP parameterisation, and it used MLLR mean and variance
adaptation (available in HTK V2.2). MLLR was applied before lattice
generation and the lattices were then rescored with adapted quinphone
models. Models for noisy environments were trained using single-pass
retraining (present in HTK V2.0) from clean-speech models (see 1996
DARPA Paper).
The 1996 broadcast news evaluation (Hub 4) required transcription of
pre-segmented and labelled portions of broadcast news audio. The
front-end used was PLP derived from a mel-scale filterbank. Models
(triphones and
quinphones) were adapted to broadcast news data via MLLR and MAP for
each data type. Triphones were used to generate word lattices (4-gram
LM) and then these were re-scored with quinphones (see 1997 DARPA Paper).
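MAP adaptation, in contrast to MLLR, interpolates each Gaussian mean
between its prior (source-model) value and the statistics gathered for
that component from the adaptation data, so poorly observed components
stay close to the prior. A minimal sketch of the standard MAP mean
update; the helper name and the fixed prior weight tau are
illustrative assumptions, not HTK code:

    import numpy as np

    def map_adapt_mean(mu_prior, obs, gamma, tau=10.0):
        """MAP update of one Gaussian mean.

        mu_prior : (d,) prior mean from the source model
        obs      : (T, d) adaptation frames
        gamma    : (T,) occupation probabilities of this component
        tau      : prior weight; larger tau trusts the prior more
        """
        occ = gamma.sum()
        weighted_sum = gamma @ obs      # sum_t gamma(t) * o(t)
        return (tau * mu_prior + weighted_sum) / (tau + occ)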
For the 1997 broadcast news evaluation (Hub 4), speech was first
segmented using Gaussian mixture models and a phone recogniser. Groups
of clustered segments were then used for MLLR adaptation, and word
lattices were generated (with a 4-gram LM interpolated with a class
trigram) using triphone HMMs trained on 70 hours of broadcast news
data. These lattices were re-scored with quinphone HMMs, and the
quinphone and triphone outputs were combined using the NIST ROVER
program (see 1998 DARPA Paper and ICASSP98 Paper).
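ROVER combines the outputs of several recognisers by aligning their
word strings into a single composite network and voting within each
aligned slot, optionally weighting votes by confidence. A minimal
sketch of the unweighted voting stage only, assuming the alignment has
already been computed (the NIST tool itself also performs the
alignment):

    from collections import Counter

    def rover_vote(aligned_hyps):
        """aligned_hyps: equal-length word lists, one per system,
        with None marking a gap (insertion/deletion) in that system.
        Returns the majority-vote word sequence."""
        output = []
        for slot in zip(*aligned_hyps):
            word, _ = Counter(slot).most_common(1)[0]
            if word is not None:    # a gap can also win the vote
                output.append(word)
        return output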
The 1998 broadcast news evaluation (Hub 4) system was an evolution of
the 1997 system. The full system included cluster-based variance
normalisation, vocal-tract length normalisation (VTLN) and
full-variance transforms (none of these are included in released
versions of HTK up to V3.0). Separate LMs were built for different
sources and interpolated to form a single model. Again combined
triphone and quinphone rescoring passes were used (see 1999 DARPA
Paper). For this evaluation CUED and Entropic also built a system that
operated in less than ten times real-time, using the fast Entropic
decoder with a two-pass strategy. The result was about a 2% absolute
increase in error rate relative to the full system, which ran in 300
times real-time (see 1999 "10XRT" DARPA Paper).
The 1998 conversational speech evaluation (Hub 5) required the
transcription of telephone conversations. The CUED HTK system used
models trained and tested using conversation side-based cepstral
mean/variance normalisation and VTLN with reduced bandwidth speech
analysis. Both triphone and quinphone HMMs were trained on 180 hours
of data and used in a multi-stage recognition process, first
generating lattices with MLLR-adapted triphones and then rescoring
these with adapted quinphones. The outputs of the triphone and
quinphone stages were again combined with ROVER (see 1999 ICASSP
Paper).
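Conversation side-based cepstral mean/variance normalisation computes
the statistics over all frames of one side of the call, removing the
fixed channel offset and equalising the dynamic range of each cepstral
dimension. A simplified batch sketch (the HTK tools differ in detail):

    import numpy as np

    def cmvn(feats, eps=1e-8):
        """Normalise a (T, d) matrix of cepstral features for one
        conversation side to zero mean and unit variance per dim."""
        mean = feats.mean(axis=0)
        std = feats.std(axis=0)
        return (feats - mean) / (std + eps)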
The March 2000 HTK Hub5 system built on the 1998 system, but used the
standard maximum likelihood estimation (MLE) tools provided in HTK
alongside maximum mutual information estimation (MMIE) training for
both triphones and quinphones. The soft-tying technique was used with
the MLE HMMs, and pronunciation probabilities and full-variance
adaptation were included along with standard MLLR. The system also
used confusion networks, generated by post-processing lattices at each
stage, to produce minimum expected word error rate output and
confidence scores, and to allow combination of the MLE/MMIE triphone
and quinphone outputs (see Speech Transcription Workshop 2000 Paper).
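A confusion network collapses a word lattice into a linear sequence of
bins, each containing competing words with their posterior
probabilities. Picking the highest-posterior word in each bin
minimises the expected word error rate (rather than the sentence error
rate that ordinary Viterbi decoding minimises), and the winning
posterior serves directly as a word confidence score. A minimal sketch
of the final selection step, assuming the bins have already been built
from the lattice:

    def consensus_decode(confusion_network):
        """confusion_network: list of bins; each bin maps a word
        (or None for 'no word here') to its posterior probability.
        Returns (words, confidences) for the minimum expected WER
        hypothesis."""
        words, confidences = [], []
        for bin_posteriors in confusion_network:
            best = max(bin_posteriors, key=bin_posteriors.get)
            if best is not None:
                words.append(best)
                confidences.append(bin_posteriors[best])
        return words, confidences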
In March 2001 a revised version of the HTK Hub5 system was produced
which included an improved MMIE training procedure, lattice-based MLLR
estimation and revised normalisation procedures. The NIST March 2001
evaluation data included data recorded over conventional telephone
lines as well as data from calls over cellular channels. For a full
description and results see the 2001 LVCSR workshop presentation.
The system developed for the Switchboard part of the April 2002 Rich
Transcription evaluation used acoustic models trained with Minimum
Phone Error (MPE) training. The standard PLP features were augmented
with third differentials and then projected down to 39 dimensions
using an HLDA transform. Other new features included speaker adaptive
training (SAT), a special single pronunciation dictionary (SPRON) and
a new LVR decoder (HDecode). A faster version of the full system,
running in less than 10 times real-time, was also developed. For a
full description and results see the 2002 Rich Transcription workshop
presentation.
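In the HLDA front-end mentioned above, the usual static PLP
coefficients plus first and second differentials are augmented with
third differentials (52 dimensions with 13 statics) and projected down
to 39 dimensions by a transform estimated under a diagonal-covariance
Gaussian assumption. A sketch of applying such a pipeline, assuming a
39x52 transform has already been estimated; the differential
computation here is simplified relative to HTK's windowed regression
formula:

    import numpy as np

    def add_differentials(static, order=3):
        """Append differentials up to the given order to a (T, 13)
        matrix of static PLP features, giving (T, 13 * (order + 1))."""
        blocks = [static]
        for _ in range(order):
            blocks.append(np.gradient(blocks[-1], axis=0))
        return np.concatenate(blocks, axis=1)

    def hlda_project(feats_52, hlda_39x52):
        """Keep only the 39 'useful' HLDA dimensions."""
        return feats_52 @ hlda_39x52.T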
For the April 2003 Rich Transcription Evaluation a number of LVR
systems were developed. The unlimited-compute conversational telephone
speech system (CTS, previously known as Switchboard or Hub5) was
similar in structure to the 2002 system, but utilised improved
acoustic and language models and performed automatic segmentation of
the audio data. A less than 10xRT CTS system was also developed, which
employed 2-way system combination and lattice-based adaptation; it
achieved word error rates only 5-7% relative worse than the full
(190xRT) system.
For the first time since 1998 a CU-HTK Broadcast News system was
developed. It was designed to run in less than 10xRT and incorporated
many features found in recent CTS systems (for example HLDA, MPE, SAT,
lattice-based adaptation and multi-way system combination) plus a
number of new techniques such as MPE-MAP for training gender-dependent
models and improved MPE training. An initial version of a CTS system
for Mandarin Chinese was built.
For more details see our two ASRU papers (BN; fast systems) and a
series of 2003 Rich Transcription workshop presentations (CTS; BN;
fast systems; Mandarin).
Phil Woodland, Gunnar Evermann
September 2000
Last updated September 2003