Abstract:
Systems and methods are provided for associating a phonetic pronunciation with a name by receiving the name, mapping the name to a plurality of monosyllabic components that are combinable to construct the phonetic pronunciation of the name, receiving a user input to select one or more of the plurality, and combining the selected one or more of the plurality of monosyllabic components to construct the phonetic pronunciation of the name.
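The method above can be sketched as follows. This is a minimal, hypothetical illustration: the example name, the candidate component inventory, and the function names are assumptions, not taken from the patent.

```python
# Hypothetical sketch: mapping a name to per-syllable monosyllabic
# components and combining a user-selected subset into a pronunciation.
# The name "Siobhan" and its candidate components are illustrative.

COMPONENT_CANDIDATES = {
    "Siobhan": [["shi", "see", "sio"], ["vawn", "bahn", "von"]],
}

def map_name_to_components(name):
    """Return, per syllable, the list of candidate monosyllabic components."""
    return COMPONENT_CANDIDATES.get(name, [])

def combine_selection(candidates, selections):
    """Combine one user-selected component per syllable into a pronunciation."""
    return "-".join(syllable[i] for syllable, i in zip(candidates, selections))

candidates = map_name_to_components("Siobhan")
# The user input selects the first candidate for each syllable.
pronunciation = combine_selection(candidates, [0, 0])
print(pronunciation)  # shi-vawn
```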
Abstract:
One or more media items can be bound to a voice call using a binding protocol. The binding protocol allows call participants to more easily transfer media items to other call participants using one or more user interfaces. A call participant can initiate a media transfer by selecting the media and a communication modality for transferring the media. The binding protocol can be active or lazy. In lazy binding, the call participant can select the desired media for transfer before the voice call is established, and subsequently mark the media for binding with the voice call. In active binding, the call participant can select and transfer the desired media item during the voice call, and the media item is automatically bound to the voice call. The media item can be transferred using a user-selected communication modality over an independent data communication channel.
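The two binding modes can be sketched in code. The classes, method names, and modalities below are illustrative assumptions; the abstract does not specify an API.

```python
# Hypothetical sketch of active vs. lazy binding of media to a voice call.
from dataclasses import dataclass, field

@dataclass
class VoiceCall:
    active: bool = False
    bound_media: list = field(default_factory=list)

class BindingSession:
    def __init__(self):
        self.pending = []  # media marked before the call (lazy binding)
        self.call = None

    def mark_for_binding(self, media, modality):
        """Lazy binding: select and mark media before the call is established."""
        self.pending.append((media, modality))

    def establish_call(self):
        self.call = VoiceCall(active=True)
        # Lazily marked media are bound once the call exists.
        for media, modality in self.pending:
            self._bind(media, modality)
        self.pending.clear()
        return self.call

    def transfer(self, media, modality):
        """Active binding: a transfer during the call is bound automatically."""
        if self.call is None or not self.call.active:
            raise RuntimeError("no active voice call")
        self._bind(media, modality)

    def _bind(self, media, modality):
        # The transfer itself would occur over an independent data channel.
        self.call.bound_media.append((media, modality))

session = BindingSession()
session.mark_for_binding("photo.jpg", "MMS")   # lazy: before the call
call = session.establish_call()
session.transfer("notes.pdf", "email")         # active: during the call
print(call.bound_media)  # [('photo.jpg', 'MMS'), ('notes.pdf', 'email')]
```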
Abstract:
Systems and processes for robust end-pointing of speech signals using speaker recognition are provided. In one example process, a stream of audio having a spoken user request can be received. A first likelihood that the stream of audio includes user speech can be determined. A second likelihood that the stream of audio includes user speech spoken by an authorized user can be determined. A start-point or an end-point of the spoken user request can be determined based at least in part on the first likelihood and the second likelihood.
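A toy version of the example process can be sketched as follows. The combination rule (a per-frame geometric mean of the two likelihoods) and the threshold are assumptions for illustration; the abstract only says the endpoints depend on both likelihoods.

```python
# Hypothetical sketch: end-pointing a spoken request by combining a
# speech likelihood with a speaker (authorized-user) likelihood per frame.

def find_endpoints(speech_lik, speaker_lik, threshold=0.5):
    """Return (start, end) frame indices of the spoken user request."""
    # Per-frame score from both likelihoods (geometric mean, an assumption).
    scores = [(a * b) ** 0.5 for a, b in zip(speech_lik, speaker_lik)]
    voiced = [i for i, s in enumerate(scores) if s >= threshold]
    if not voiced:
        return None
    return voiced[0], voiced[-1]

# High speech likelihood in frames 2-6, but the authorized user only
# speaks in frames 2-4, so the end-point tightens to frame 4.
speech  = [0.1, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
speaker = [0.1, 0.1, 0.8, 0.8, 0.8, 0.2, 0.2, 0.1]
print(find_endpoints(speech, speaker))  # (2, 4)
```

Using speaker likelihood alongside speech likelihood is what makes the end-pointing "robust": background speech from other talkers scores high on the first likelihood but low on the second.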
Abstract:
Systems and processes for generating a shared pronunciation lexicon and using the shared pronunciation lexicon to interpret spoken user inputs received by a virtual assistant are provided. In one example, the process can include receiving pronunciations for words or named entities from multiple users. The pronunciations can be tagged with context tags and stored in the shared pronunciation lexicon. The shared pronunciation lexicon can then be used to interpret a spoken user input received by a user device by determining a relevant subset of the shared pronunciation lexicon based on contextual information associated with the user device and performing speech-to-text conversion on the spoken user input using the determined subset of the shared pronunciation lexicon.
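The shared-lexicon flow can be sketched as below. The tag scheme (`locale:` prefixes), the overlap-based matching rule, and the class name are illustrative assumptions; the abstract does not define a tag format.

```python
# Hypothetical sketch of a shared pronunciation lexicon with context tags.
from collections import defaultdict

class SharedLexicon:
    def __init__(self):
        # word or named entity -> list of (pronunciation, context_tags)
        self.entries = defaultdict(list)

    def add(self, word, pronunciation, tags):
        """Store a user-contributed pronunciation tagged with context."""
        self.entries[word].append((pronunciation, set(tags)))

    def subset(self, context):
        """Relevant subset: entries whose tags overlap the device context."""
        context = set(context)
        return {
            word: [p for p, tags in prons if tags & context]
            for word, prons in self.entries.items()
        }

lex = SharedLexicon()
lex.add("Worcester", "WUH-stuh", ["locale:en_GB"])
lex.add("Worcester", "WOOS-ter", ["locale:en_US"])

# A device reporting a US locale gets only the relevant pronunciation,
# which would then drive speech-to-text conversion.
subset = lex.subset(["locale:en_US"])
print(subset["Worcester"])  # ['WOOS-ter']
```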

Abstract:
The method is performed at an electronic device with one or more processors and memory storing one or more programs for execution by the one or more processors. A first speech input including at least one word is received. A first phonetic representation of the at least one word is determined, the first phonetic representation comprising a first set of phonemes selected from a speech recognition phonetic alphabet. The first set of phonemes is mapped to a second set of phonemes to generate a second phonetic representation, where the second set of phonemes is selected from a speech synthesis phonetic alphabet. The second phonetic representation is stored in association with a text string corresponding to the at least one word.
Abstract:
A method comprising: providing a plurality of pronunciation guessers, each of the plurality of pronunciation guessers being associated with a respective phonetic alphabet of a language or a locale; determining a user language or a user locale; associating a first phonetic alphabet with the user language or the user locale; receiving at each pronunciation guesser a representation of a name; guessing, at each pronunciation guesser, a phonetic pronunciation of one or more components of the name; mapping the phonetic pronunciation of the one or more components of the name guessed by each of the plurality of pronunciation guessers to the first phonetic alphabet to generate a list of guessed pronunciations; receiving an audio pronunciation of the name; and selecting a combination of components from the list of guessed pronunciations that, when pronounced, substantially matches the audio pronunciation of the name.
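The claim above can be sketched end to end. Everything concrete here is an assumption for illustration: the two guessers, their per-locale alphabets and mapping tables, the example name, and the use of a string-similarity ratio as the "substantially matches" test.

```python
# Hypothetical sketch: multiple per-locale pronunciation guessers each
# guess components of a name; guesses are mapped into the first (user-
# locale) phonetic alphabet, and the combination of components closest
# to the user's audio pronunciation is selected.
from itertools import product
from difflib import SequenceMatcher

# Each guesser returns per-component guesses in its own alphabet, plus
# a mapping from that alphabet into the first phonetic alphabet.
GUESSERS = [
    {"guess": lambda name: [["ZHAN"], ["LOOK"]],
     "mapping": {"ZHAN": "zhahn", "LOOK": "luke"}},
    {"guess": lambda name: [["JEEN"], ["LUK"]],
     "mapping": {"JEEN": "jeen", "LUK": "luk"}},
]

def guessed_pronunciations(name):
    """Per component, pool all guesses mapped into the first alphabet."""
    components = None
    for g in GUESSERS:
        mapped = [[g["mapping"][c] for c in comp] for comp in g["guess"](name)]
        if components is None:
            components = mapped
        else:
            for acc, new in zip(components, mapped):
                acc.extend(new)
    return components

def select_best(name, audio_pronunciation):
    """Pick the component combination best matching the audio pronunciation."""
    best, best_score = None, -1.0
    for combo in product(*guessed_pronunciations(name)):
        candidate = " ".join(combo)
        score = SequenceMatcher(None, candidate, audio_pronunciation).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best

# Components from different guessers can mix: "zhahn" (guesser 1) with
# "luk" (guesser 2) best matches the user's spoken pronunciation.
print(select_best("Jean-Luc", "zhahn luk"))  # zhahn luk
```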