Confidence Estimation of Speech Recognition Modules Using Deep Learning

School of Engineering | Bachelor's thesis
An electronic archive copy is available locally at the Harald Herlin Learning Centre. Aalto University staff can access the electronic bachelor's theses by logging into Aaltodoc with their personal Aalto user ID.

Date

2024-09-20

Major/Subject

Computational Engineering

Mcode

ENG3082

Degree programme

Aalto Bachelor’s Programme in Science and Technology

Language

en

Pages

27

Abstract

This paper provides a general overview of confidence estimation in automatic speech recognition (ASR) systems, focusing on two state-of-the-art methods: OpenAI’s Whisper and NVIDIA’s NeMo framework. The goal of the study is to address challenges in ASR by improving confidence estimation modules. The methodology involved evaluating Whisper on the LibriSpeech and TIMIT datasets, with performance measured by Word Error Rate (WER). Moreover, confidence estimation techniques were applied to the Conformer-CTC and Conformer-Transducer models built in the NeMo framework. The results align with previous studies of the same methods and also demonstrate the exceptional performance of Whisper on the TIMIT dataset, with a WER as low as 2.73% for large-v1. For NeMo, the proposed modification to the confidence estimation methods, particularly the use of Gibbs entropy-based measures, showed improvements in certain metrics for RNN-T methods on the LibriSpeech 'test-other' dataset. This paper confirms that while Whisper and NeMo demonstrate strong performance, there is room for improvement in confidence estimation techniques.
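The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The function names `word_error_rate` and `entropy_confidence` are hypothetical, and the linear normalization of Gibbs entropy by the log of the vocabulary size is an assumption, being only one of several possible entropy-based confidence measures.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def entropy_confidence(probs: list[float]) -> float:
    """Confidence in [0, 1] from one output distribution.

    Assumption: confidence = 1 - H(p) / ln(V), where H is the Gibbs
    (Shannon) entropy in nats and V is the number of classes.
    """
    vocab_size = len(probs)
    if vocab_size < 2:
        raise ValueError("need at least two classes")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(vocab_size)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(entropy_confidence([0.7, 0.1, 0.1, 0.1]))
```

Under this normalization a uniform distribution gives a confidence of 0 and a one-hot distribution gives 1. In a real system the per-frame or per-token scores would still need to be aggregated into word-level confidences.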

Supervisor

St-Pierre, Luc

Thesis advisor

Rech, Silas

Keywords

ASR, confidence estimation, end-to-end models
