孙科｜Ke Sun - Wake-Free Voice Interaction

This is the industry's first and only implementation of "wake-free" voice interaction on mobile phones. Users do not need to say a wake word (such as "Siri" or "Hey Google"): bringing the bottom of the phone to the mouth is enough to speak a command or converse with the assistant, which makes the experience quick, private, and natural. Since its release on Honor's flagship foldable phone, the Magic Vs, in 2022, wake-free voice interaction has been a selling point for Honor's flagship Magic series and foldable Magic V series, with over a million units shipped.

Product Manager. As the owner of this product feature, I (1) surveyed academic human-computer interaction research, identified a suitable prototype, and analyzed its product value and feasibility; (2) initiated the project and collaborated closely with several departments within the company (audio algorithms, chip, software, intelligence) and with a university to bring it to product; and (3) expanded the functional scope of the voice assistant.

UX Design. Responsible for the user experience design of the feature, including quantitative analysis and definition of users' natural interaction behaviors, optimization of the voice interaction state logic and visual expression, and interface design for new voice interaction features.

Voice interaction has long been considered important and full of potential on smartphones. In reality, however, very few users use voice assistants. Voice interaction currently supports a limited range of functions and scenarios, but we believe the way it is triggered is itself a major barrier. At present, all voice assistants on smartphones need to be activated by a "wake word" (such as "Siri" or "Hey Google"). The disadvantages are obvious. First, wake words have social acceptability issues: they easily attract the attention of others and compromise privacy. Second, having to say a wake word every time the assistant is used lowers interaction efficiency.

Therefore, we aim to implement 'wake-free voice interaction' on smartphones, bringing improvements and breakthroughs in the following three aspects:

Interaction efficiency. Users can simply bring the bottom of the phone close to their mouth to directly converse with the assistant. This eliminates the need to say a wake word, thereby increasing the efficiency of the interaction.
Interaction naturalness. This action is also consistent with how users already speak into their smartphones, which we refer to as the users' natural interaction behavior expression.
Social acceptability. The absence of a wake word can significantly reduce the attention from others nearby, making the voice interaction more discreet and protecting privacy.

We anticipate that this novel voice interaction technology will provide a different experience for users, increase the usage frequency of voice assistants on phones, and open up opportunities for more assistant functionalities.

Wake-free voice interaction is a successful example of industry-research collaboration and cross-departmental joint development.

After aligning on the ideal experience of wake-free voice interaction, we began to work out its possible implementation principles and technical approaches. Three criteria were the most critical: recognition accuracy, always-on operation at low power, and no additional sensors. The obvious first idea was to use the phone's built-in inertial sensors to detect the user's motion and posture when moving the phone close to the mouth, similar to Apple Watch's Raise to Speak. However, simple experiments and analysis showed that users' motions while using their phones are diverse and highly unpredictable. Relying solely on inertial sensors would cause a large number of false recognitions.

Therefore, we turned our attention to related academic fields and found a human-computer interaction paper in which researchers showed that a smartphone's built-in microphone can itself sense how close a sound source is. Specifically, the researchers collected and analyzed sound signals and found that a signal feature known as "pop noise" differs significantly depending on whether the speaker is close to or far from the microphone. Pop noise is the explosive sound produced when the airflow from speaking passes through the microphone, resulting in a significant amplitude change in the audio signal. The researchers detected the pop noise in the frequency domain and used a neural network to identify its characteristics and so distinguish how close the user's speech is to the microphone. For this reason, we often refer to the feature as "breath waking".
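The frequency-domain idea can be illustrated with a small sketch. It assumes, as is commonly reported for breath pops, that a pop concentrates energy at low frequencies, and it swaps the paper's neural network for a fixed threshold on a single feature. The function names, the 200 Hz cutoff, and the 0.5 threshold are illustrative assumptions, not values from the paper or from the shipped product.

```python
import cmath
import math


def band_energy(frame, sample_rate, f_lo, f_hi):
    """Sum of squared DFT magnitudes for bins with f_lo <= frequency < f_hi."""
    n = len(frame)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if f_lo <= freq < f_hi:
            # Naive single-bin DFT; a real implementation would use an FFT.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)
            )
            energy += abs(coeff) ** 2
    return energy


def low_band_ratio(frame, sample_rate=16000, cutoff_hz=200.0):
    """Share of the frame's spectral energy below cutoff_hz (assumed value)."""
    total = band_energy(frame, sample_rate, 0.0, sample_rate / 2 + 1)
    if total == 0.0:
        return 0.0  # silent frame: no evidence of a pop
    return band_energy(frame, sample_rate, 0.0, cutoff_hz) / total


def looks_close(frame, sample_rate=16000, threshold=0.5):
    """Stand-in for the classifier: a fixed threshold (assumed value)."""
    return low_band_ratio(frame, sample_rate) > threshold
```

With a 20 ms frame at 16 kHz, a strong 50 Hz component next to a weaker 1 kHz tone is classified as close, while the same tone with only a faint 50 Hz component is not. A single hand-set threshold like this would be far less robust than a trained model; the sketch only shows where the feature comes from.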

We contacted the university researchers, obtained their demo, improved it so that it could be tested in our actual usage scenarios, and verified the feasibility of both the user experience and the implementation.

Moving from a demo and theory to actual deployment entails a significant amount of work. The implementation of wake-free voice interaction spans three major departments: audio, chip, and software.

Audio. Implement and optimize the algorithm for detecting speaking distance from audio signals. It is crucial to reduce the computation per second as much as possible while ensuring the accuracy of recognition.

Chip. Port the detection algorithm to a low-power chip to achieve the goal of all-day, 24-hour real-time monitoring.

Software. Re-engineer the underlying infrastructure of voice interaction to support immediate sound receiving and response without an activation state.
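The software requirement above can be sketched as a small state machine. One plausible way to receive sound immediately, without a separate activation state, is to keep a short rolling "pre-roll" buffer while idle so that the audio heard just before the detector fires is not lost. This is an assumption for illustration, not the shipped design: `WakeFreeListener`, the `is_close_speech` callback, and both frame counts are hypothetical, and the same callback is used here to detect both the start and the end of a command for brevity.

```python
from collections import deque
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CAPTURING = "capturing"


class WakeFreeListener:
    """Capture a command as soon as close-range speech is detected."""

    def __init__(self, is_close_speech, preroll_frames=10, hangover_frames=3):
        self.is_close_speech = is_close_speech  # hypothetical detector callback
        self.preroll = deque(maxlen=preroll_frames)  # rolling buffer while idle
        self.hangover_frames = hangover_frames  # non-speech frames that end a command
        self.state = State.IDLE
        self.command = []
        self._misses = 0

    def feed(self, frame):
        """Process one audio frame; return the captured frames when a command ends."""
        close = self.is_close_speech(frame)
        if self.state is State.IDLE:
            self.preroll.append(frame)
            if close:
                # Seed the command with the buffered audio, trigger frame included.
                self.command = list(self.preroll)
                self.preroll.clear()
                self._misses = 0
                self.state = State.CAPTURING
            return None
        # CAPTURING: keep every frame until enough consecutive non-speech frames pass.
        self.command.append(frame)
        self._misses = 0 if close else self._misses + 1
        if self._misses >= self.hangover_frames:
            finished, self.command = self.command, []
            self.state = State.IDLE
            return finished
        return None
```

Feeding frames one at a time returns `None` until a command is complete, at which point the caller receives every frame from the pre-roll through the trailing non-speech frames and can pass them to speech recognition.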

Quantitative analysis and definition of users' natural interaction behavior expression. Through user experiments, we studied and modeled the movement of users bringing the phone to their mouth to speak. The resulting model served as input for R&D to balance recognition accuracy against interaction naturalness.

Optimization of the guide interface and assistant avatar expression. In terms of interface design, wake-free voice interaction does not require much explicit new design. The most crucial point is how to help users discover and start using the new feature, so we added visual guidance at all touchpoints of the mobile voice assistant (when it is activated in conventional ways, in settings, etc.). In addition, in line with the system's visual style upgrade, we redesigned the assistant's avatar and the expression of its various states.

Technically. The final result exceeded our initial expectations: the recognition accuracy of wake-free activation was 2 percentage points higher than that of regular wake words. The main reason is that regular voice activation is often affected by ambient noise, while wake-free interaction largely avoids this issue because of its detection principle.

User Experience. Thanks to the improved interaction efficiency, naturalness, and social acceptability of wake-free voice interaction, the usage rate of the AI assistant on Honor's phones increased by 22%, and the continued usage rate of the wake-free feature reached 76%.

Commercially. Wake-free voice interaction has become a selling feature for successive generations of Honor's flagship products, including the Magic series, the foldable Magic V series, and the Magic OS system software, and it continues to be promoted at launch events.

[1] Official Website. https://www.hihonor.com/cn/shop/product/10086498408796.html/

[2] Product Launch. https://youtu.be/APFr6P4mG9w?t=1048

[3] Official Account. https://mp.weixin.qq.com/s/zAyDY-9JuEjceH_vBNF5og

Wake-Free Voice Interaction

Role

Background and Opportunity

Process

Solution & Demo

Technical Implementation

Design

Outcome

Additional Reference