Speech recognition remains a difficult problem in AI and machine learning. As a step toward solving it, OpenAI today open-sourced Whisper, an automatic speech recognition system the company claims enables “robust” transcription in multiple languages, as well as translation from those languages into English.
Countless organizations have developed sophisticated speech recognition systems, which sit at the heart of software and services from tech giants such as Google, Amazon, and Meta. But what makes Whisper different, according to OpenAI, is that it was trained on 680,000 hours of multilingual and “multitask” data collected from the web, which the company says improves its recognition of idiosyncratic accents, background noise, and technical jargon.
“The primary target audience for [the Whisper] models are AI researchers studying the robustness, generalization, capabilities, biases, and constraints of current models. However, Whisper could also be quite useful as an automatic speech recognition solution for developers, especially for English speech recognition,” OpenAI wrote in the GitHub repo for Whisper, from which several versions of the system can be downloaded. “[The models] show strong ASR results in ~10 languages. They may exhibit additional capabilities if fine-tuned on certain tasks such as voice activity detection, speaker classification, and speaker diarization, but these areas have not been robustly evaluated.”
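For developers curious what using the released models looks like in practice, here is a minimal sketch. The `whisper.load_model()` / `model.transcribe()` calls and model names ("tiny" through "large") come from the open-source package's README; the helper function and the audio file name are illustrative assumptions, not part of OpenAI's release.

```python
def format_segments(result):
    """Render a Whisper transcription result as timestamped lines.

    `result` is the dict returned by model.transcribe(), which includes
    "text" (full transcript), "language" (detected language code), and
    "segments" (a list of dicts with "start"/"end" times and "text").
    """
    lines = []
    for seg in result.get("segments", []):
        # Each segment carries start/end offsets in seconds.
        lines.append(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)

# Typical use (requires `pip install openai-whisper`, ffmpeg, and an audio file;
# "speech.mp3" is a placeholder path):
#   import whisper
#   model = whisper.load_model("base")   # "tiny" ... "large" trade speed for accuracy
#   result = model.transcribe("speech.mp3")
#   print(result["language"])
#   print(format_segments(result))
```

Larger checkpoints are slower but more accurate; the smaller ones are what make the near-real-time applications discussed below plausible.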
Whisper has limitations, particularly in the area of text prediction. Because the system was trained on a large amount of “noisy” data, OpenAI warns that its transcriptions may include words that weren’t actually spoken, likely because the model is simultaneously trying to predict the next word in the audio and to transcribe the recording itself. Additionally, Whisper does not perform equally well across languages, with higher error rates for speakers of languages that are poorly represented in the training data.
Unfortunately, that last part is nothing new in the world of speech recognition. Bias has long plagued even the best systems: a 2020 Stanford University study found that with systems from Amazon, Apple, Google, IBM, and Microsoft, white users encountered far fewer errors than Black users (word error rates of roughly 19% versus 35%).
Nevertheless, OpenAI sees Whisper’s transcription capabilities being used to improve existing accessibility tools.
“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that allow for near-real-time speech recognition and translation,” the company continues on GitHub. “The real value of beneficial applications built on top of Whisper models suggests that the disparate performance of these models may have real economic implications… [W]hile we hope the technology will be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could enable more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”
Whisper’s release is not necessarily indicative of OpenAI’s future plans. While there is an increasing focus on commercial initiatives such as DALL-E 2 and GPT-3, the company continues to pursue several purely theoretical research threads, including systems that learn to perform tasks by watching videos.