Video conferencing should be accessible to everyone, including users who communicate in sign language. But because most video conferencing systems automatically switch the window focus to whoever is speaking aloud, it is difficult for signers to get the floor and communicate easily and effectively.
Real-time sign language detection in video conferencing is challenging: the system must classify a large volume of video input, which makes the task computationally heavy. These challenges partly explain why there has been little research on sign language detection.
At ECCV 2020 and the SLRTP 2020 workshop, Google’s research team presented a real-time sign language detection model and described how it can be used to identify the active “speaker” in video conferencing systems.
1, design ideas.
To fit into the conferencing solutions offered by mainstream video conferencing systems, the research team adopted a lightweight, plug-and-play model. The model uses little CPU, which minimizes its impact on client call quality. To reduce the input dimensionality, the relevant information is first extracted from the video, and each frame is then classified.
“Since sign language involves both the user’s body and hands, we first run the human pose estimation model PoseNet, which reduces the input from an entire HD image to a small set of landmarks on the user’s body, such as the eyes, nose, shoulders, and hands. We then use these landmarks to calculate the frame-to-frame optical flow, which quantifies the user’s motion without retaining user-specific information. Each pose is normalized by the width of the person’s shoulders so that the model attends to the person signing over a range of distances from the camera. Finally, the optical flow is normalized by the video’s frame rate before being passed to the model.”
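The preprocessing described above can be sketched roughly as follows. This is a minimal illustration, not the team’s code: the keypoint names, the dictionary layout, and the helper functions are all assumptions made for the example, and the keypoints would in practice come from a pose estimator such as PoseNet.

```python
import math

def shoulder_width(keypoints):
    """Distance between the two shoulder keypoints of one frame."""
    (lx, ly) = keypoints["left_shoulder"]
    (rx, ry) = keypoints["right_shoulder"]
    return math.hypot(lx - rx, ly - ry)

def normalize(keypoints):
    """Scale every keypoint so that the shoulder width equals 1.

    Assumes both shoulders are visible and not at the same position.
    """
    width = shoulder_width(keypoints)
    return {name: (x / width, y / width) for name, (x, y) in keypoints.items()}

def optical_flow(prev_keypoints, curr_keypoints, fps):
    """Per-keypoint movement between two consecutive frames.

    The displacement of each normalized keypoint is multiplied by the
    frame rate, so videos recorded at different frame rates give
    comparable values (movement per second rather than per frame).
    """
    prev_n = normalize(prev_keypoints)
    curr_n = normalize(curr_keypoints)
    flow = {}
    for name in curr_n:
        dx = curr_n[name][0] - prev_n[name][0]
        dy = curr_n[name][1] - prev_n[name][1]
        flow[name] = math.hypot(dx, dy) * fps
    return flow

# Hypothetical pixel coordinates for two consecutive frames at 30 fps.
prev_frame = {"left_shoulder": (100, 200), "right_shoulder": (300, 200),
              "right_wrist": (200, 400)}
curr_frame = {"left_shoulder": (100, 200), "right_shoulder": (300, 200),
              "right_wrist": (230, 440)}
print(optical_flow(prev_frame, curr_frame, fps=30))
```

Only the movement magnitudes reach the classifier, which is why no image of the user needs to be kept.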
To test the effectiveness of this approach, the team used the German Sign Language corpus (DGS), which contains long videos of people signing, with span annotations. As a baseline, a linear regression model was trained to predict from the optical flow data when a person is signing. This baseline reaches about 80% accuracy and takes only about 3 μs (0.000003 seconds) per frame. Using the optical flow of the previous 50 frames as context for the linear model raises accuracy to 83.4%.
The team then used a long short-term memory (LSTM) architecture, which achieved 91.5% accuracy with a processing time of approximately 3.5 milliseconds (0.0035 seconds) per frame.
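One way to picture the 50-frame context is as a sliding window whose contents are flattened into a single feature vector for a linear model. The sketch below is an assumption about how such a window could be built; the class name, the zero padding at the start of a video, and the weights are hypothetical and not taken from the team’s code.

```python
from collections import deque

class ContextWindow:
    """Keeps the current frame's optical flow plus a fixed number of previous frames."""

    def __init__(self, num_keypoints, context_frames=50):
        zero_frame = [0.0] * num_keypoints
        # Pad with zeros so the feature vector has a fixed length from the first frame.
        self.frames = deque([zero_frame] * (context_frames + 1),
                            maxlen=context_frames + 1)

    def push(self, flow):
        """Add the newest frame and return the flattened features, oldest frame first."""
        self.frames.append(list(flow))
        return [value for frame in self.frames for value in frame]

def linear_score(features, weights, bias):
    """Score of a linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Example with 2 keypoints and a context of 3 previous frames.
window = ContextWindow(num_keypoints=2, context_frames=3)
features = window.push([1.0, 2.0])
print(len(features))  # (3 + 1) frames * 2 keypoints = 8 values
```

With the full 50-frame context, each prediction sees 51 frames of optical flow at once, so the linear model can pick up on short bursts of motion that a single frame cannot show.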
2, proof of concept.
In a real-world scenario, a working sign language detection model is only the first step; the team also needed a way to trigger the active speaker function of the video conferencing system. The team developed a lightweight online sign language detection demo that can be connected to any video conferencing system and sets the signer as the active speaker.
When the detection model determines that a user is signing, it transmits an ultrasonic audio tone through a virtual audio cable, which any video conferencing system can detect as if the signer were “talking.” The audio is transmitted at 20 kHz, which is usually outside the range of human hearing. Because video conferencing systems typically detect an active speaker by audio volume rather than by recognizing speech, the application is fooled into treating the signer as speaking.
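The idea can be illustrated with a short sketch that generates a 20 kHz tone while signing is detected and silence otherwise. The 48 kHz sample rate, the chunk size, the amplitude, and the volume threshold are assumptions chosen for the example; routing the samples into a virtual audio cable is left out.

```python
import math

SAMPLE_RATE = 48_000  # Hz (assumed); must be more than twice the tone frequency
TONE_HZ = 20_000      # tone frequency mentioned in the article

def tone_chunk(is_signing, num_samples=480, amplitude=0.5, start_sample=0):
    """Return one chunk of audio: a 20 kHz sine wave while signing, silence otherwise."""
    if not is_signing:
        return [0.0] * num_samples
    step = 2 * math.pi * TONE_HZ / SAMPLE_RATE
    return [amplitude * math.sin(step * (start_sample + n))
            for n in range(num_samples)]

def rms(samples):
    """Root-mean-square level, a simple measure of audio volume."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def looks_like_speech(samples, threshold=0.1):
    """Hypothetical volume-only check of the kind a conferencing app might apply."""
    return rms(samples) > threshold

print(looks_like_speech(tone_chunk(is_signing=True)))
print(looks_like_speech(tone_chunk(is_signing=False)))
```

A volume-based check like this cannot tell the inaudible tone from a voice, which is what allows the signer to be promoted to active speaker.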
The model’s online video demo source code has now been published on GitHub.
GitHub link: https://github.com/AmitMY/sign-language-detector
3, the demonstration process.
In the video, the research team demonstrates how to use the model. The yellow chart in the video shows the model’s confidence that signing is taking place. When the user signs, the value rises to nearly 100; when the user stops signing, it falls to 0.
To further validate the model, the team also conducted a user experience study. Participants were asked to use the model during video conferences and to sign as usual. They were also asked to sign over each other to test the speaker-switching behavior. The model detected when participants were signing, registered it as audible speech, and successfully identified the signing participants.
As it stands, both the motivation for this work and the practical methods it uses are grounded in real-world deployment. In practice, however, large-scale use may surface unexpected user needs, for example the large differences between the sign languages of different countries and regions. How to generalize these capabilities to serve more people is the question that needs serious thought before this work moves into a commercial setting.