Research Article  |   April 01, 2013
Gaze Patterns and Audiovisual Speech Enhancement
 
Author Affiliations & Notes
  • Astrid Yi
    University of Toronto, Ontario, Canada
  • Willy Wong
    University of Toronto, Ontario, Canada
  • Moshe Eizenman
    University of Toronto, Ontario, Canada
  • Correspondence to Willy Wong: willy.wong@utoronto.ca
  • Editor: Anne Smith
  • Associate Editor: Karen Forrest
Article Information
Journal of Speech, Language, and Hearing Research, April 2013, Vol. 56, 471-480. doi:10.1044/1092-4388(2012/10-0288)
History: Received October 17, 2010; Revised March 25, 2011; Accepted August 14, 2012
 
Web of Science® Times Cited: 9

Purpose In this study, the authors sought to quantify the relationships between speech intelligibility (perception) and gaze patterns under different auditory–visual conditions.

Method Eleven subjects listened to low-context sentences spoken by a single talker while viewing the face of one or more talkers on a computer display. Subjects either maintained their gaze at a specified angular distance (0°, 2.5°, 5°, 10°, or 15°) from the center of the talker's mouth (CTM) or moved their eyes freely on the computer display. Eye movements were monitored with an eye-tracking system, and speech intelligibility was evaluated as the mean percentage of correctly perceived words.
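
The two measures named in the Method can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names, the viewing distance, the pixel scaling, and the flat-screen geometry (CTM assumed to lie straight ahead of the viewer) are all assumptions made for illustration, and the abstract does not say whether the percentage of correct words was averaged per sentence or pooled over all words; per-sentence averaging is assumed here.

```python
import math

def visual_angle_deg(gaze_xy, ctm_xy, viewing_distance, units_per_pixel=1.0):
    """Angular distance (degrees) between a gaze point and the CTM.

    Assumption: a flat display with the CTM straight ahead of the viewer;
    viewing_distance and units_per_pixel share the same length unit.
    """
    dx = (gaze_xy[0] - ctm_xy[0]) * units_per_pixel
    dy = (gaze_xy[1] - ctm_xy[1]) * units_per_pixel
    offset = math.hypot(dx, dy)  # on-screen distance from the CTM
    return math.degrees(math.atan2(offset, viewing_distance))

def percent_words_correct(trials):
    """Mean over sentences of the percentage of words reported correctly.

    trials: list of (words_correct, words_total) pairs, one per sentence.
    Assumption: per-sentence averaging rather than pooling all words.
    """
    scores = [100.0 * correct / total for correct, total in trials]
    return sum(scores) / len(scores)
```

Under these assumptions, a fixation would count as within 2.5° of the CTM when `visual_angle_deg(...) <= 2.5`.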

Results With a single talker and a fixed point of gaze, speech intelligibility was similar for all fixations within 10° of the CTM. With visual cues from two talker faces and a speech signal from one of the talkers, speech intelligibility was similar to that of a single talker for fixations within 2.5° of the CTM. With natural viewing of a single talker, gaze strategy changed with speech-signal-to-noise ratio (SNR). For low speech-SNR, a strategy that brought the point of gaze directly to within 2.5° of the CTM was used in approximately 80% of trials, whereas in high speech-SNR it was used in only approximately 50% of trials.

Conclusions With natural viewing of a single talker and high speech-SNR, subjects can shift their gaze between points on the talker's face without compromising speech intelligibility. With low speech-SNR, subjects change their gaze patterns to fixate primarily on points close to the talker's mouth. The latter strategy is essential for optimizing speech intelligibility when there are simultaneous visual cues from multiple talkers (i.e., when some of the visual cues are distracters).
