Apple releases papers revealing Siri’s secrets

With more than 500 million users worldwide, Apple’s cross-platform Siri virtual assistant is clearly one of the company’s key areas of interest. Last week, Apple published a series of preprint research papers on improving voice trigger detection and speaker verification, as well as language identification for multilingual speakers.

Speaker verification and voice trigger detection

In the first paper, a team of Apple researchers presented an artificial intelligence model trained to perform both automatic speech recognition and speaker recognition tasks.

As they explain in the abstract, commands to a voice assistant are usually prefixed with a trigger phrase (e.g., “Hey Siri”), and processing that phrase involves two steps.

First, the AI must determine whether the phonetic content of the input audio matches that of the trigger phrase (voice trigger detection); second, it must determine whether the speaker’s voice matches the voice of a registered user (speaker verification).
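For illustration only, this conventional pipeline can be thought of as a sequential gate, sketched below in Python; the trigger_detector and speaker_verifier callables and the thresholds are hypothetical stand-ins, not Apple’s actual components.

```python
def should_wake(audio, trigger_detector, speaker_verifier,
                trigger_threshold=0.5, speaker_threshold=0.5):
    """Conventional two-step gate (illustrative): both checks must pass."""
    # Step 1: does the phonetic content match the trigger phrase?
    if trigger_detector(audio) < trigger_threshold:
        return False
    # Step 2: does the voice match an enrolled user?
    return speaker_verifier(audio) >= speaker_threshold


# Example usage with dummy scorers returning scores in [0, 1]
wake = should_wake(b"raw-audio-bytes",
                   trigger_detector=lambda a: 0.92,
                   speaker_verifier=lambda a: 0.81)
```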

Typically, the two tasks are treated independently. But the co-authors hypothesized that knowing who is speaking may help infer what was said in the acoustic signal, and vice versa, which would help in evaluating both attributes.

In response, the researchers designed three models capable of learning both phonetic and speaker information and trained them on a data set containing more than 16,000 hours of annotated samples, of which 5,000 hours of audio carried phonetic labels (the rest carried speaker labels).

In addition, more than 100 subjects contributed to the corpus using a smart speaker device in a range of acoustic settings, including a quiet room, external noise from a television or kitchen appliance in the room, and music played back by the recording device at loud volume.

It’s worth noting that 2,000 hours of continuous audio recordings from television, radio, and podcasts that contain no trigger phrases were added so that false positives could be measured.

The models showed they could learn both phonetic and speaker information, and with the same number of parameters (the internal variables the model learns from data) as the baselines, they achieved accuracy on each task at least as good as that of the corresponding baseline model.

In fact, one of the three models outperformed the speaker verification baseline under the “multiple” setting, and improved on the baseline by 7.6% on the text-independent task.

The researchers say the results are interesting because the models were trained on disjoint data sets, meaning each audio sample carried either a phonetic label or a speaker label, never both.

Looking at the results, the researchers arrived at a flexible design that trains a model on multiple related tasks by concatenating the training data for the different tasks, rather than obtaining multiple labels for each training sample. From a practical point of view, sharing computation between the two tasks saves on-device memory, computation time or latency, and power/battery consumption.
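As a rough illustration of that idea (not Apple’s actual architecture), a shared encoder with two task heads can be trained on concatenated data sets by skipping the loss for whichever label a sample is missing. A minimal PyTorch-style sketch, with all layer choices and sizes assumed:

```python
import torch
import torch.nn as nn

class SharedSpeechSpeakerModel(nn.Module):
    """Shared encoder with a phonetic head and a speaker head (illustrative sizes)."""
    def __init__(self, n_mels=40, hidden=256, n_phone_classes=64, n_speakers=100):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.phone_head = nn.Linear(hidden, n_phone_classes)    # phonetic / trigger targets
        self.speaker_head = nn.Linear(hidden, n_speakers)       # speaker targets

    def forward(self, feats):
        out, _ = self.encoder(feats)                  # (B, T, hidden)
        phone_logits = self.phone_head(out)           # per-frame phonetic logits
        speaker_logits = self.speaker_head(out.mean(dim=1))  # utterance-level speaker logits
        return phone_logits, speaker_logits

def multitask_loss(phone_logits, speaker_logits, phone_labels, speaker_labels):
    """Each sample carries only one kind of label; -1 marks a missing label."""
    ce = nn.CrossEntropyLoss(ignore_index=-1)
    loss = torch.zeros((), device=phone_logits.device)
    if (phone_labels != -1).any():
        loss = loss + ce(phone_logits.reshape(-1, phone_logits.size(-1)),
                         phone_labels.reshape(-1))
    if (speaker_labels != -1).any():
        loss = loss + ce(speaker_logits, speaker_labels)
    return loss
```

Because the encoder is shared, a single forward pass serves both tasks, which is where the memory, latency, and power savings described above would come from.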

False trigger mitigation

A complementary study addressed reducing false triggers, meaning that speech not intended for a voice assistant like Siri is deliberately ignored by the assistant.

The researchers say they used a graph neural network (GNN), a type of artificial intelligence model that operates on graph structures in which each node is associated with a label, with the goal of predicting the labels of nodes without ground truth.

In the paper, the researchers wrote:

Voice-triggered smart assistants typically detect a trigger phrase before they start listening for the user request… False triggers usually originate from background noise or speech that sounds similar to the trigger phrase. Therefore, reducing false triggers is an important aspect of building a privacy-centric, non-intrusive smart assistant.

In future work, the team plans to extend GNN-based processing to other tasks, such as user intent classification.
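The paper’s model details aren’t reproduced here, but the general shape of GNN-based node classification can be sketched as follows; the simple mean-aggregation layer, feature sizes, and two-class output (false trigger vs. intended request) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SimpleGraphLayer(nn.Module):
    """One round of message passing: average neighbour features, then transform."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency matrix with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear(adj @ x / deg))

class NodeClassifier(nn.Module):
    """Predict a label per node, e.g. 'false trigger' vs. 'intended request'."""
    def __init__(self, in_dim=16, hidden=32, n_classes=2):
        super().__init__()
        self.gnn1 = SimpleGraphLayer(in_dim, hidden)
        self.gnn2 = SimpleGraphLayer(hidden, hidden)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x, adj):
        h = self.gnn2(self.gnn1(x, adj), adj)
        return self.out(h)   # per-node logits
```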

Multilingual speaker identification

In another paper, Apple researchers explored a speaker language identification system tailored to multilingual users.

Speech recognition systems, they note, are highly accurate for most languages. But when more than one language is in play, language identification does not perform as well as it should. Based on this observation, the researchers decided to work on a speaker language identification system.

Notably, a recent study commissioned by The Washington Post showed that popular smart speakers from Google and Amazon were 30 percent more likely to understand native-born users than users with non-American accents.

At the same time, corpora like Switchboard, which companies such as IBM and Microsoft use to gauge the error rates of their speech models, have been shown to skew measurably toward users from particular regions of the country.

In response, the co-authors incorporated knowledge of usage patterns into a dictation system that can make decisions for speakers in more than 60 locales.

An acoustic sub-model makes predictions based on the evidence carried by the speech signal, while a context-aware prediction component factors in various interaction-context signals; together they select the optimal monolingual automatic speech recognition system.

The context signals carry information about the conditions under which the dictation request was made, including the installed dictation locales, the currently selected dictation locale, and whether the user toggled the dictation locale before making the request.

Importantly, they help when the speech signal is too short for the acoustic model to produce a reliable prediction. For example, if a user speaks both English and German, a short, ambiguous utterance like “naIn” could be the negative “nein” in German or the number “nine” in English.
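A rough sketch of how an acoustic estimate and interaction-context signals might be combined to choose a single ASR locale appears below; the weighting scheme, signal names, and scores are assumptions for illustration, not the paper’s actual formulation.

```python
def select_asr_locale(acoustic_probs, installed_locales, selected_locale,
                      recently_toggled, context_weight=0.3):
    """
    acoustic_probs: dict mapping locale -> probability from the acoustic sub-model.
    Context signals (installed/selected locales, recent toggling) act as a prior
    that matters most when the audio is too short to be decisive.
    """
    scores = {}
    for locale, p_acoustic in acoustic_probs.items():
        prior = 0.0
        if locale in installed_locales:
            prior += 0.5
        if locale == selected_locale:
            prior += 0.25 if recently_toggled else 0.5
        scores[locale] = (1 - context_weight) * p_acoustic + context_weight * prior
    return max(scores, key=scores.get)


# Example: a short utterance like "naIn" is acoustically ambiguous between
# German and English, so the context prior breaks the tie.
choice = select_asr_locale(
    acoustic_probs={"de-DE": 0.51, "en-US": 0.49},
    installed_locales={"de-DE", "en-US"},
    selected_locale="en-US",
    recently_toggled=False,
)
```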

In addition, to evaluate the system, the researchers developed a custom metric called average user accuracy (AUA), which they believe better reflects “population-level” usage patterns of the model.
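The paper’s precise definition isn’t given here, but one plausible reading of such a metric is accuracy computed per user and then averaged across users, so that the score reflects how well typical users are served rather than raw utterance counts. A minimal sketch under that assumption:

```python
from collections import defaultdict

def average_user_accuracy(records):
    """
    records: iterable of (user_id, correct) pairs, where correct is True/False.
    Accuracy is computed per user first, then averaged over users.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for user_id, correct in records:
        hits[user_id] += int(correct)
        totals[user_id] += 1
    per_user = [hits[u] / totals[u] for u in totals]
    return sum(per_user) / len(per_user) if per_user else 0.0
```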

Trained on an internal corpus of 128,000 dictation utterances from multilingual speakers, each paired with interaction-context information, the system achieved an average accuracy of 87% across all language combinations while improving worst-case accuracy by more than 60% relative to the baseline.

In addition, after the team tuned parameters to balance accuracy and latency against the computational load of running the model on-device, average latency dropped from 2 seconds to 1.2 seconds, with an impact on AUA of no more than 0.05%.