Audiovisual Speech Processing

  • Author: Gérard Bailly
  • Publisher: Cambridge University Press
  • ISBN: 110737815X
  • Category: Language Arts & Disciplines
  • Languages: en
  • Pages: 507

When we speak, we configure the vocal tract, which shapes the visible motions of the face and the patterning of the audible speech acoustics. Similarly, we use these visible and audible behaviors to perceive speech. This book showcases a broad range of research investigating how these two types of signals are used in spoken communication, how they interact, and how they can be used to enhance the realistic synthesis and recognition of audible and visible speech. The volume begins by addressing two important questions about human audiovisual performance: how auditory and visual signals combine to access the mental lexicon, and where in the brain this and related processes take place. It then turns to the production and perception of multimodal speech and how structures are coordinated within and across the two modalities. Finally, the book presents overviews and recent developments in machine-based recognition and synthesis of audiovisual (AV) speech.


Audiovisual Speech Recognition: Correspondence between Brain and Behavior

  • Author: Nicholas Altieri
  • Publisher: Frontiers E-books
  • ISBN: 2889192512
  • Category: Brain
  • Languages: en
  • Pages: 102

Perceptual processes mediating recognition, including the recognition of objects and spoken words, are inherently multisensory. This is true in spite of the fact that sensory inputs are segregated in the early stages of neuro-sensory encoding. In face-to-face communication, for example, auditory information is processed in the cochlea, encoded in the auditory sensory nerve, and processed in lower cortical areas. Eventually, these “sounds” are processed in higher cortical pathways such as the auditory cortex, where they are perceived as speech. Likewise, visual information obtained from observing a talker’s articulators is encoded in lower visual pathways. Subsequently, this information undergoes processing in the visual cortex prior to the extraction of articulatory gestures in higher cortical areas associated with speech and language. As language perception unfolds, information garnered from the visual articulators interacts with language processing in multiple brain regions, via visual projections to auditory, language, and multisensory brain regions. The association of auditory and visual speech signals makes the speech signal a highly “configural” percept. An important direction for the field is thus to provide ways to measure the extent to which visual speech information influences auditory processing and, likewise, to assess how the unisensory components of the signal combine to form a configural/integrated percept. Numerous behavioral measures, such as accuracy (e.g., percent correct, susceptibility to the “McGurk effect”) and reaction time (RT), have been employed to assess multisensory integration ability in speech perception. Alongside these, neural measures such as fMRI, EEG, and MEG have been employed to examine the locus and/or time course of integration. The purpose of this Research Topic is to find converging behavioral and neural assessments of audiovisual integration in speech perception.
A further aim is to investigate speech recognition ability in normal-hearing, hearing-impaired, and aging populations. As such, the purpose is to obtain neural measures from EEG as well as fMRI that shed light on the neural bases of multisensory processes, while connecting them to model-based measures of reaction time and accuracy in the behavioral domain. In doing so, we endeavor to gain a more thorough description of the neural bases and mechanisms underlying integration in higher-order processes such as speech and language recognition.


Audiovisual Speech Processing

  • Author: Gérard Bailly
  • Publisher: Cambridge University Press
  • ISBN: 1107006821
  • Category: Computers
  • Languages: en
  • Pages: 507

This book presents a complete overview of all aspects of audiovisual speech, including perception, production, brain processing and technology.


Cognitively Inspired Audiovisual Speech Filtering

  • Author: Andrew Abel
  • Publisher: Springer
  • ISBN: 3319135090
  • Category: Computers
  • Languages: en
  • Pages: 134

This book presents a summary of the cognitively inspired basis behind multimodal speech enhancement, covering the relationship between the audio and visual modalities in speech, as well as recent research into audiovisual speech correlation. A number of audiovisual speech filtering approaches that make use of this relationship are also discussed. A novel multimodal speech enhancement system, making use of both visual and audio information to filter speech, is presented, and the book explores the extension of this system with fuzzy logic to demonstrate an initial implementation of an autonomous, adaptive, and context-aware multimodal system. This work also discusses the challenges of testing such a system and the limitations of many current audiovisual speech corpora, and describes a suitable approach towards developing a corpus designed to test this novel, cognitively inspired speech filtering system.


Speech and Audio Processing

  • Author: Ian Vince McLoughlin
  • Publisher: Cambridge University Press
  • ISBN: 1316558673
  • Category: Technology & Engineering
  • Languages: en
  • Pages: 403

With this comprehensive and accessible introduction to the field, you will gain all the skills and knowledge needed to work with current and future audio, speech, and hearing processing technologies. Topics covered include mobile telephony, human-computer interfacing through speech, medical applications of speech and hearing technology, electronic music, audio compression and reproduction, big data audio systems, and the analysis of sounds in the environment. All of this is supported by numerous practical illustrations, exercises, and hands-on MATLAB® examples on topics as diverse as psychoacoustics (including some auditory illusions), voice changers, speech compression, signal analysis and visualisation, stereo processing, low-frequency ultrasonic scanning, and machine learning techniques for big data. With its pragmatic and application-driven focus, and concise explanations, this is an essential resource for anyone who wants to rapidly gain a practical understanding of speech and audio processing and technology.


Advances in Audiovisual Speech Processing for Robust Voice Activity Detection and Automatic Speech Recognition

  • Author: Fei Tao (Electrical engineer)
  • Publisher:
  • ISBN:
  • Category: Automatic speech recognition
  • Languages: en
  • Pages:

Speech processing systems are widely used in existing commercial applications, including virtual assistants in smartphones and home assistant devices. Speech-based commands provide convenient hands-free functionality for users. Two key speech processing systems in practical applications are voice activity detection (VAD), which aims to detect when a user is speaking to a system, and automatic speech recognition (ASR), which aims to recognize what the user is saying. A limitation in these speech tasks is the drop in performance observed in noisy environments or when the speech mode differs from neutral speech (e.g., whisper speech). Emerging audiovisual solutions provide principled frameworks to increase the robustness of the systems by incorporating features describing lip motion. This study proposes novel audiovisual solutions for VAD and ASR tasks. The dissertation introduces unsupervised and supervised audiovisual voice activity detection (AV-VAD). The unsupervised approach combines visual features that are characteristic of the semi-periodic nature of articulatory production around the orofacial area. The visual features are combined using principal component analysis (PCA) to obtain a single feature. The threshold between speech and non-speech activity is automatically estimated with the expectation-maximization (EM) algorithm. The decision boundary is improved by using the Bayesian information criterion (BIC) algorithm, resolving temporal ambiguities caused by different sampling rates and anticipatory movements. The supervised framework corresponds to the bimodal recurrent neural network (BRNN), which captures the task-related characteristics in the audio and visual inputs and models the temporal information within and across modalities. The approach relies on three subnetworks implemented with long short-term memory (LSTM) networks.
This framework is implemented with either hand-crafted features or feature representations derived directly from the data (i.e., an end-to-end system). The study also extends this framework by improving the temporal modeling with advanced LSTMs (A-LSTMs). For audiovisual automatic speech recognition (AV-ASR), the study explores the use of visual features to compensate for the mismatch observed when the system is evaluated with whisper speech. We propose supervised adaptation schemes which significantly reduce the mismatch between normal and whisper speech across speakers. The study also introduces the gating neural network (GNN), which aims to attenuate the effect of unreliable features, creating AV-ASR systems that improve, or at least maintain, the performance of an ASR system implemented with speech alone. Finally, the dissertation introduces the front-end alignment neural network (AliNN) to address the temporal alignment problem between audio and visual features. This front-end system is important because lip motion often precedes speech (e.g., anticipatory movements). The framework relies on an RNN with an attention model. The resulting aligned features are concatenated and fed to conventional back-end ASR systems, yielding performance improvements. The proposed AV-VAD and AV-ASR systems are evaluated on large audiovisual corpora, achieving competitive performance under real-world scenarios and outperforming conventional audio-based VAD and ASR systems as well as alternative audiovisual systems proposed in previous studies. Taken collectively, this dissertation makes algorithmic advances for audiovisual systems, representing novel contributions to the field of multimodal processing.
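The unsupervised AV-VAD pipeline described above (visual features reduced to a single score with PCA, then a speech/non-speech threshold estimated with EM over a two-component mixture) can be sketched as follows. This is a minimal illustration on synthetic lip-motion features, not the dissertation's implementation; the feature values, the component count, and the crossing-point threshold rule are all assumptions, and the BIC refinement step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for orofacial visual features: 4-D descriptors per frame,
# 300 non-speech frames followed by 300 speech frames (values are invented).
non_speech = rng.normal(0.0, 0.3, size=(300, 4))
speech = rng.normal(2.0, 0.6, size=(300, 4))
feats = np.vstack([non_speech, speech])

# PCA via SVD: reduce the visual features to a single activity score per frame.
centered = feats - feats.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
score = centered @ vt[0]
# Orient the component so that higher scores mean more orofacial activity.
if np.corrcoef(score, np.linalg.norm(feats, axis=1))[0, 1] < 0:
    score = -score

# Two-component 1-D Gaussian mixture fitted with EM (non-speech vs. speech).
mu = np.array([score.min(), score.max()])
var = np.array([score.var(), score.var()])
pi = np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibility of each component for each frame.
    dens = pi * np.exp(-0.5 * (score[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
    # M-step: re-estimate weights, means, and variances.
    n = resp.sum(axis=0)
    mu = (resp * score[:, None]).sum(axis=0) / n
    var = (resp * (score[:, None] - mu) ** 2).sum(axis=0) / n
    pi = n / len(score)

# Threshold: the point between the component means where the two weighted
# densities cross, i.e. the equal-posterior decision boundary.
grid = np.linspace(mu[0], mu[1], 1000)
joint = pi * np.exp(-0.5 * (grid[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
threshold = grid[np.argmin(np.abs(joint[:, 0] - joint[:, 1]))]
is_speech = score > threshold
```

With well-separated activity distributions, the EM-estimated boundary recovers the frame labels almost perfectly; on real lip features, the dissertation's BIC step would additionally refine this boundary in time.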


Audiovisual Speech Processing

  • Author: Luis Morís Fernández
  • Publisher:
  • ISBN:
  • Category:
  • Languages: en
  • Pages: 0


Robust Speech Recognition of Uncertain or Missing Data

  • Author: Dorothea Kolossa
  • Publisher: Springer Science & Business Media
  • ISBN: 3642213170
  • Category: Technology & Engineering
  • Languages: en
  • Pages: 387

Automatic speech recognition suffers from a lack of robustness with respect to noise, reverberation and interfering speech. The growing field of speech recognition in the presence of missing or uncertain input data seeks to ameliorate those problems by using not only a preprocessed speech signal but also an estimate of its reliability to selectively focus on those segments and features that are most reliable for recognition. This book presents the state of the art in recognition in the presence of uncertainty, offering examples that utilize uncertainty information for noise robustness, reverberation robustness, simultaneous recognition of multiple speech signals, and audiovisual speech recognition. The book is appropriate for scientists and researchers in the field of speech recognition who will find an overview of the state of the art in robust speech recognition, professionals working in speech recognition who will find strategies for improving recognition results in various conditions of mismatch, and lecturers of advanced courses on speech processing or speech recognition who will find a reference and a comprehensive introduction to the field. The book assumes an understanding of the fundamentals of speech recognition using Hidden Markov Models.
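The core idea the book builds on, scoring only the feature components judged reliable and marginalising out the rest, can be illustrated with a toy diagonal-Gaussian classifier. This is a hedged sketch of missing-data marginalisation in general, not code from the book; the class models, the reliability mask, and the corrupted observation are invented for illustration.

```python
import numpy as np

# Hypothetical diagonal-Gaussian class models over a 5-dim feature vector.
means = {"yes": np.array([1.0, 2.0, 0.0, -1.0, 0.5]),
         "no":  np.array([-1.0, 0.0, 1.0, 2.0, -0.5])}
var = np.full(5, 0.25)

def marginal_loglik(x, reliable, mu, var):
    """Log-likelihood over the dimensions flagged reliable; unreliable
    dimensions are marginalised out (dropped from the product)."""
    r = np.asarray(reliable).astype(bool)
    d = x[r] - mu[r]
    return -0.5 * np.sum(d * d / var[r] + np.log(2 * np.pi * var[r]))

# Observation whose dims 2 and 3 are masked by noise; the reliability
# estimate marks them unreliable so the garbage values are ignored.
x = np.array([0.9, 2.1, 9.0, 9.0, 0.4])
reliable = np.array([1, 1, 0, 0, 1])

scores = {c: marginal_loglik(x, reliable, mu, var) for c, mu in means.items()}
best = max(scores, key=scores.get)
```

Scoring all five dimensions would let the two corrupted values dominate and flip the decision; marginalising them out recovers the class supported by the clean dimensions, which is exactly the selective-focus behaviour the blurb describes.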


Audio and Speech Processing with MATLAB

  • Author: Paul Hill
  • Publisher: CRC Press
  • ISBN: 0429813961
  • Category: Technology & Engineering
  • Languages: en
  • Pages: 330

Speech and audio processing has undergone a revolution in recent decades, one that has accelerated in the last few years, generating game-changing technologies such as truly successful speech recognition systems, a goal that had remained out of reach until very recently. This book gives the reader a comprehensive overview of such contemporary speech and audio processing techniques, with an emphasis on practical implementations and illustrations using MATLAB code. Core concepts are covered first, giving an introduction to the physics of audio and vibration together with their representations using complex numbers, Z transforms, and frequency analysis transforms such as the FFT. Later chapters describe the human auditory system and the fundamentals of psychoacoustics. Insights, results, and analyses given in these chapters are subsequently used as the basis for understanding the middle section of the book, covering wideband audio compression (MP3 audio etc.), speech recognition, and speech coding. The final chapter covers musical synthesis and applications, describing methods such as AM, FM, and ring modulation techniques (with MATLAB examples of each), and gives a final example of the use of time-frequency modification to implement a so-called phase vocoder for time stretching (in MATLAB).

Features:

  • A comprehensive overview of contemporary speech and audio processing techniques, from perceptual and physical acoustic models to a thorough background in relevant digital signal processing techniques, together with an exploration of speech and audio applications.
  • A carefully paced progression in the complexity of the described methods, building, in many cases, from first principles.
  • Speech and wideband audio coding, together with a description of the associated standardised codecs (e.g. MP3, AAC, and GSM).
  • Speech recognition: feature extraction (e.g. MFCC features), Hidden Markov Models (HMMs), and deep learning techniques such as Long Short-Term Memory (LSTM) methods.
  • Book- and computer-based problems at the end of each chapter.
  • Numerous real-world examples backed up by many MATLAB functions and code.
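Of the modulation techniques the final chapter covers, ring modulation is the simplest to sketch: the signal is just multiplied by a sinusoidal carrier, which leaves energy only at the sum and difference frequencies. Below is a short NumPy illustration (the book's examples are in MATLAB; the sample rate and frequencies here are arbitrary choices for demonstration).

```python
import numpy as np

fs = 16_000                       # sample rate (Hz); 1 s of audio below
t = np.arange(fs) / fs
tone_hz, carrier_hz = 220.0, 500.0

tone = np.sin(2 * np.pi * tone_hz * t)            # stand-in for a voice signal
ring = tone * np.sin(2 * np.pi * carrier_hz * t)  # ring modulation: plain product

# Product-to-sum identity: sin(a)*sin(b) = 0.5*[cos(a-b) - cos(a+b)],
# so the spectrum holds energy only at 500-220 = 280 Hz and 500+220 = 720 Hz.
spectrum = np.abs(np.fft.rfft(ring))
peaks_hz = np.argsort(spectrum)[-2:]   # with N = fs, rfft bin index equals Hz
```

Because the carrier suppresses the original pitch and replaces it with inharmonic sum/difference partials, ring modulation produces the metallic "robot voice" effect often used as a simple voice changer.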


Toward a Unified Theory of Audiovisual Integration in Speech Perception

  • Author: Nicholas Altieri
  • Publisher: Universal-Publishers
  • ISBN: 1599423618
  • Category:
  • Languages: en
  • Pages:

Auditory and visual speech recognition unfolds in real time and occurs effortlessly for normal-hearing listeners. However, model-theoretic descriptions of the systems-level cognitive processes responsible for integrating auditory and visual speech information are currently lacking, primarily because they rely too heavily on accuracy rather than reaction-time predictions. Speech and language researchers have argued about whether audiovisual integration occurs in a parallel or a coactive fashion, and also about the extent to which audiovisual integration occurs efficiently. The Double Factorial Paradigm introduced in Section 1 is an experimental paradigm equipped to address dynamic processing issues related to architecture (parallel vs. coactive processing) as well as efficiency (capacity). Experiment 1 employed a simple word discrimination task to assess both architecture and capacity in high-accuracy settings. Experiments 2 and 3 assessed these same issues using auditory and visual distractors in Divided Attention and Focused Attention tasks, respectively. Experiment 4 investigated audiovisual integration efficiency across different auditory signal-to-noise ratios. The results can be summarized as follows: integration typically occurs in parallel with an efficient stopping rule; integration occurs automatically in both the focused and divided attention versions of the task; and audiovisual integration is only efficient (in the time domain) when the clarity of the auditory signal is relatively poor, although considerable individual differences were observed. In Section 3, these results were captured within the milieu of parallel linear dynamic processing models with cross-channel interactions. Finally, in Section 4, I discussed broader implications of this research, including applications for clinical research and neurobiological models of audiovisual convergence.
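The workload capacity the abstract refers to is commonly quantified with the capacity coefficient of Townsend and Nozawa, which compares the integrated hazard of the audiovisual reaction-time distribution with the sum of the unimodal hazards; values above 1 indicate efficient integration. The sketch below estimates it from synthetic reaction times; the gamma distributions and their parameters are invented for illustration and are not the data analyzed in this work.

```python
import numpy as np

def cumulative_hazard(rts, t_grid):
    """Empirical integrated hazard H(t) = -log S(t) from a sample of RTs,
    where S(t) is the survivor function (fraction of RTs exceeding t)."""
    rts = np.asarray(rts, dtype=float)
    surv = np.array([(rts > t).mean() for t in t_grid])
    surv = np.clip(surv, 1e-6, 1.0)   # guard against log(0) in the far tail
    return -np.log(surv)

rng = np.random.default_rng(1)
# Hypothetical RT samples in seconds: the audiovisual condition is faster
# than either unimodal condition, as integration accounts predict.
rt_a = rng.gamma(9.0, 0.05, 2000)     # audio-only,  mean about 0.45 s
rt_v = rng.gamma(10.0, 0.05, 2000)    # visual-only, mean about 0.50 s
rt_av = rng.gamma(6.0, 0.05, 2000)    # audiovisual, mean about 0.30 s

t_grid = np.linspace(0.2, 0.6, 50)
# Capacity coefficient C(t) = H_AV(t) / (H_A(t) + H_V(t)); C(t) > 1 means
# the redundant-signals condition is processed more efficiently than an
# unlimited-capacity parallel baseline would allow.
c_t = cumulative_hazard(rt_av, t_grid) / (
    cumulative_hazard(rt_a, t_grid) + cumulative_hazard(rt_v, t_grid) + 1e-12)
```

With these synthetic distributions the coefficient exceeds 1 across the mid-range of the RT window, the supercapacity signature; on real data the shape of C(t), not just its level, is what distinguishes parallel from coactive architectures.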