Method and system for visual context aware automatic speech recognition
Abstract:
Accuracy of transcript is of foremost importance in Automatic Speech Recognition (ASR). State of the art system mostly rely on spelling correction based contextual improvement in ASR, which is generally a static vocabulary based biasing approach. Embodiments of the present disclosure provide a method and system for visual context aware ASR. The method provides biasing using shallow fusion biasing approach with a modified beam search decoding technique, which introduces a non-greedy pruning strategy to allow biasing at the sub-word level. The biasing algorithm brings in the visual context of the robot to the speech recognizer based on a dynamic biasing vocabulary, improving the transcription accuracy. The dynamic biasing vocabulary, comprising objects in a current environment accompanied by their self and relational attributes, is generated using a bias prediction network that explicitly adds label to objects, which are detected and captioned via a state of the art dense image captioning network.
Information query
Patent Agency Ranking
0/0