MedASR is a speech-to-text model based on the Conformer architecture, pre-trained for medical dictation and transcription. Developers can use MedASR as a foundational model to build efficient healthcare voice applications.
MedASR was trained on a diverse corpus of de-identified medical speech comprising approximately 5,000 hours of physician dictations and clinical conversations. The training data spans multiple specialties, including radiology, internal medicine, and family medicine. This 105-million-parameter model accepts single-channel (mono) audio, sampled at 16 kHz as an int16 waveform, and generates text-only transcriptions.
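The input constraint above (16 kHz mono int16) can be enforced with a small preprocessing step. The helper below is a minimal sketch, not part of MedASR's API: it down-mixes and resamples arbitrary float audio using linear interpolation, which is adequate for illustration but coarser than the polyphase resamplers production code would typically use.

```python
import numpy as np

TARGET_RATE = 16_000  # MedASR expects 16 kHz input


def prepare_audio(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert float audio in [-1, 1] to 16 kHz mono int16.

    waveform: shape (samples,) for mono or (samples, channels) for multi-channel.
    """
    # Down-mix multi-channel audio to mono by averaging channels.
    if waveform.ndim == 2:
        waveform = waveform.mean(axis=1)
    # Resample via linear interpolation (a crude but dependency-free sketch).
    if sample_rate != TARGET_RATE:
        target_len = round(len(waveform) * TARGET_RATE / sample_rate)
        old_points = np.arange(len(waveform))
        new_points = np.linspace(0, len(waveform) - 1, target_len)
        waveform = np.interp(new_points, old_points, waveform)
    # Scale [-1, 1] floats to the int16 range, clipping any overflow.
    scaled = np.clip(waveform * 32767.0, -32768, 32767)
    return scaled.astype(np.int16)


# Example: one second of 44.1 kHz stereo tone becomes 16,000 mono int16 samples.
tone = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 44_100))
stereo = np.stack([tone, tone], axis=1)
mono16k = prepare_audio(stereo, 44_100)
```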
MedASR is recommended for dictation tasks involving specialized medical terminologies and transcribing physician-patient conversations. For applications that require analyzing the semantic content of the transcribed text (such as summarization or question answering), the output of MedASR can be used as input for generative models like MedGemma.
Common use cases
The following sections present some common use cases for the model. You're free to pursue any use case, as long as it adheres to the Health AI Developer Foundations terms of use.
Medical Dictation and Transcription
MedASR allows developers to incorporate automatic speech recognition (ASR) capabilities, specifically tuned for the medical domain, into their products. Unlike general-purpose ASR models, MedASR has been trained on extensive corpora of medical dictations and clinical conversations.
This makes it well-suited for:
- Radiology Dictation: Accurate transcription of imaging reports containing complex anatomical and pathological terms.
- Clinical Documentation: Transcribing physician-patient interactions to assist in generating clinical notes.
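Dictations and visit recordings can run long, so a common pattern before transcription is to split the audio into overlapping fixed-length windows. The helper below is a sketch of that pattern; the window and overlap lengths are illustrative assumptions, not documented MedASR limits.

```python
import numpy as np

SAMPLE_RATE = 16_000  # MedASR expects 16 kHz input


def chunk_audio(waveform: np.ndarray, window_s: float = 30.0,
                overlap_s: float = 2.0) -> list:
    """Split a long mono recording into overlapping fixed-length windows.

    Overlap helps avoid cutting words at window boundaries; downstream code
    would transcribe each window and reconcile the overlapping text.
    """
    size = int(window_s * SAMPLE_RATE)                 # samples per window
    step = int((window_s - overlap_s) * SAMPLE_RATE)   # hop between windows
    return [waveform[i:i + size] for i in range(0, len(waveform), step)]


# A 90-second recording with 30 s windows and 2 s overlap yields windows
# starting every 28 seconds: at 0 s, 28 s, 56 s, and 84 s.
recording = np.zeros(90 * SAMPLE_RATE, dtype=np.int16)
windows = chunk_audio(recording)
```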
See the Quick start notebook to try running the model locally.
Fine-tuning for Specialized Contexts
MedASR is intended as a starting point for developers. While it performs well on general medical dictation, it can be fine-tuned to adapt to specific requirements that may fall outside its pre-training data, such as:
- English Accents: Adapting the model to specific speaker demographics.
- Acoustic Environments: Improving performance in noisy environments or with lower-quality recording hardware.
- Vocabulary Expansion: Adding recognition for additional medical vocabulary that was not included in its training.
- Formatting: Improving consistency in the handling of temporal data such as dates, times, and durations.
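Fine-tuning on your own data typically starts from paired audio files and reference transcripts. The snippet below sketches one common way to organize such pairs, as a JSONL manifest; the field names and file paths here are hypothetical, and the Fine-tuning notebook defines the actual format MedASR expects.

```python
import json

# Hypothetical audio/transcript pairs; paths and schema are illustrative only.
examples = [
    {"audio_path": "dictations/radiology_001.wav",
     "transcript": "No acute cardiopulmonary abnormality."},
    {"audio_path": "dictations/radiology_002.wav",
     "transcript": "Follow-up chest x-ray on March 3, 2024 at 9:30 a.m."},
]

# Write one JSON object per line, a layout many ASR training tools accept.
with open("train_manifest.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```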
For an example of how to fine-tune MedASR on your own data, see the Fine-tuning notebook.
Generative AI Integration
MedASR serves as a critical component in multimodal healthcare applications. By converting speech to text, it enables integration with Large Language Models (LLMs).
For example, a developer can build a pipeline where:
- MedASR transcribes a raw audio recording of a patient visit.
- MedGemma takes that transcript as input to automatically generate a Subjective, Objective, Assessment and Plan (SOAP) note or summarize key symptoms and medications.
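The two-step pipeline above can be sketched structurally as a pair of composed functions. Both bodies below are stubs standing in for real MedASR and MedGemma inference calls, which neither model's API is assumed here; only the data flow from audio to transcript to prompt is illustrated.

```python
def transcribe_with_medasr(waveform) -> str:
    """Stub for a MedASR inference call; returns canned text for illustration."""
    return "Patient reports intermittent chest pain for the past two weeks."


def build_soap_prompt(transcript: str) -> str:
    """Assemble a prompt for MedGemma; a real pipeline would send this
    string to the LLM and return its generated SOAP note."""
    return ("Convert the following visit transcript into a SOAP note "
            "(Subjective, Objective, Assessment, Plan):\n\n" + transcript)


# Real code would pass 16 kHz mono int16 audio instead of None.
transcript = transcribe_with_medasr(None)
prompt = build_soap_prompt(transcript)
```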