Speech is an essential part of communication. If you need to incorporate speech into your applications, the Azure AI Foundry Speech service is a great fit. In this talk, we will get an overview of the capabilities of Azure AI Speech, seeing how to use the service for text-to-speech and speech-to-text operations. We will investigate multi-lingual speech translation, analyze the components of speech, and even dive into custom neural voices, giving your applications a unique voice. We'll also compare Azure AI Speech to OpenAI's Whisper model, learning which performs better in specific scenarios. Along the way, we will work with the .NET and Python libraries that make this service available to a wide audience of developers. Finally, because cost is a critical piece of any cloud technology conversation, we'll get an idea of how much it all costs.
ADDITIONAL MEDIA
No recordings or additional media are available for this talk.
Microsoft's Azure Speech documentation hub is the canonical reference for the service. It includes overviews, quickstarts, how-tos, and API reference for every language the SDK supports.
The Azure-Samples/cognitive-services-speech-sdk repo on GitHub has runnable samples in C#, C++, Java, JavaScript, Python, Objective-C, and Swift. It's the fastest way to go from "hello world" to something real.
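As a taste of what those samples cover, a one-shot speech-to-text call in Python is only a few lines. This is a sketch following the quickstart pattern, not a definitive implementation: `transcribe_once` is a name of my choosing, the import is deferred so the file loads even without the SDK installed, and you supply your own key and region.

```python
def transcribe_once(key: str, region: str) -> str:
    """Recognize a single utterance from the default microphone.

    Minimal sketch of the SDK quickstart pattern; the import is deferred
    so this module loads even where the Azure Speech SDK isn't installed.
    """
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    # recognize_once() listens until the speaker pauses, then returns one result.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""
```

The same shape carries over almost verbatim to the C# samples in the repo, which is part of why the samples are such a fast on-ramp.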
The pronunciation assessment how-to covers the full scoring model (accuracy, fluency, completeness, prosody, miscue detection), along with the JSON response shape for word- and phoneme-level results.
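Since the how-to centers on that JSON response shape, here is a hedged sketch of pulling scores out of it with nothing but the standard library. The trimmed response below follows the documented field names (`NBest`, `PronunciationAssessment`, `Words`, `ErrorType`) but the scores themselves are made up for illustration.

```python
import json

# A trimmed example response following the documented shape for
# word-level pronunciation assessment results (scores are invented).
response = json.loads("""
{
  "NBest": [{
    "PronunciationAssessment": {
      "AccuracyScore": 96.0,
      "FluencyScore": 98.0,
      "CompletenessScore": 100.0,
      "PronScore": 97.2
    },
    "Words": [
      {"Word": "hello",
       "PronunciationAssessment": {"AccuracyScore": 99.0, "ErrorType": "None"}},
      {"Word": "world",
       "PronunciationAssessment": {"AccuracyScore": 74.0, "ErrorType": "Mispronunciation"}}
    ]
  }]
}
""")

best = response["NBest"][0]
overall = best["PronunciationAssessment"]["PronScore"]

# Flag any word the service marked with a miscue (ErrorType other than "None").
flagged = [w["Word"] for w in best["Words"]
           if w["PronunciationAssessment"]["ErrorType"] != "None"]

print(overall)   # 97.2
print(flagged)   # ['world']
```

Phoneme-level results nest one level deeper under each word, in the same pattern.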
The speech translation overview explains real-time multi-language translation including the newer multi-lingual mode that auto-detects the source language and handles mid-session language switching.
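For a feel for the translation API, here is a fixed-source sketch in Python (the auto-detecting multi-lingual mode described in the overview needs extra source-language configuration, which I've left out for brevity). As before, `translate_once` is a hypothetical helper name and the SDK import is deferred.

```python
def translate_once(key: str, region: str, targets=("fr", "de")) -> dict:
    """Translate one spoken utterance into several target languages.

    Sketch of the real-time translation API with a fixed source language;
    the import is deferred so the module loads without the SDK installed.
    """
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.translation.SpeechTranslationConfig(
        subscription=key, region=region)
    config.speech_recognition_language = "en-US"
    for lang in targets:
        config.add_target_language(lang)

    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=config)
    result = recognizer.recognize_once()

    # result.translations maps each target language code to its translation.
    if result.reason == speechsdk.ResultReason.TranslatedSpeech:
        return dict(result.translations)
    return {}
```

One recognizer call fans out to every target language you registered, which is what makes the multi-language demos in the talk so compact.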
Language and voice support is the definitive list of locales, neural voices, and which features (translation, pronunciation, prosody) are available where. It's essential when scoping a demo or a production rollout.
The pronunciation assessment tool and the broader Speech Studio overview let you try reading, speaking, and gaming scenarios with no code, making them a great way to get a feel for the scoring before writing any SDK integration.
The azure-cognitiveservices-speech package on PyPI is the Python SDK distribution. Keep it bookmarked for release notes and to check which version of the SDK is current before pinning dependencies.