MedASR is a speech-to-text model based on the Conformer architecture, pre-trained for medical dictation and transcription. Developers can use MedASR as a foundational model to build efficient healthcare voice applications.
MedASR was trained on a diverse corpus of de-identified medical speech comprising approximately 5,000 hours of physician dictations and clinical conversations. The training data spans multiple specialties, including radiology, internal medicine, and family medicine. This 105-million-parameter model accepts single-channel (mono) audio, sampled at 16 kHz as an int16 waveform, and generates text-only transcriptions.
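The input constraint above (16 kHz mono int16) can be enforced with a small preprocessing step. The helper below is a minimal sketch, not part of MedASR's API: it down-mixes and resamples arbitrary float audio using linear interpolation, which is adequate for illustration but coarser than the polyphase resamplers production code would typically use.

```python
import numpy as np

TARGET_RATE = 16_000  # MedASR expects 16 kHz input


def prepare_audio(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Convert float audio in [-1, 1] to 16 kHz mono int16.

    waveform: shape (samples,) for mono or (samples, channels) for multi-channel.
    """
    # Down-mix multi-channel audio to mono by averaging channels.
    if waveform.ndim == 2:
        waveform = waveform.mean(axis=1)
    # Resample via linear interpolation (a crude but dependency-free sketch).
    if sample_rate != TARGET_RATE:
        target_len = round(len(waveform) * TARGET_RATE / sample_rate)
        old_points = np.arange(len(waveform))
        new_points = np.linspace(0, len(waveform) - 1, target_len)
        waveform = np.interp(new_points, old_points, waveform)
    # Scale [-1, 1] floats to the int16 range, clipping any overflow.
    scaled = np.clip(waveform * 32767.0, -32768, 32767)
    return scaled.astype(np.int16)


# Example: one second of 44.1 kHz stereo tone becomes 16,000 mono int16 samples.
tone = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 44_100))
stereo = np.stack([tone, tone], axis=1)
mono16k = prepare_audio(stereo, 44_100)
```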
MedASR is recommended for dictation tasks involving specialized medical terminologies and transcribing physician-patient conversations. For applications that require analyzing the semantic content of the transcribed text (such as summarization or question answering), the output of MedASR can be used as input for generative models like MedGemma.
Common use cases
The following sections present some common use cases for the model. You're free to pursue any use case, as long as it adheres to the Health AI Developer Foundations terms of use.
Medical Dictation and Transcription
MedASR allows developers to incorporate automatic speech recognition (ASR) capabilities, specifically tuned for the medical domain, into their products. Unlike general-purpose ASR models, MedASR has been trained on extensive corpora of medical dictations and clinical conversations.
This makes it well-suited for:
- Radiology Dictation: Accurate transcription of imaging reports containing complex anatomical and pathological terms.
- Clinical Documentation: Transcribing physician-patient interactions to assist in generating clinical notes.
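Dictations and visit recordings can run long, so a common pattern before transcription is to split the audio into overlapping fixed-length windows. The helper below is a sketch of that pattern; the window and overlap lengths are illustrative assumptions, not documented MedASR limits.

```python
import numpy as np

SAMPLE_RATE = 16_000  # MedASR expects 16 kHz input


def chunk_audio(waveform: np.ndarray, window_s: float = 30.0,
                overlap_s: float = 2.0) -> list:
    """Split a long mono recording into overlapping fixed-length windows.

    Overlap helps avoid cutting words at window boundaries; downstream code
    would transcribe each window and reconcile the overlapping text.
    """
    size = int(window_s * SAMPLE_RATE)                 # samples per window
    step = int((window_s - overlap_s) * SAMPLE_RATE)   # hop between windows
    return [waveform[i:i + size] for i in range(0, len(waveform), step)]


# A 90-second recording with 30 s windows and 2 s overlap yields windows
# starting every 28 seconds: at 0 s, 28 s, 56 s, and 84 s.
recording = np.zeros(90 * SAMPLE_RATE, dtype=np.int16)
windows = chunk_audio(recording)
```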
See the Quick start notebook to try running the model locally.
Fine-tuning for Specialized Contexts
MedASR is intended as a starting point for developers. While it performs well on general medical dictation, it can be fine-tuned to adapt to specific requirements that may fall outside its pre-training data, such as:
- English Accents: Adapting the model to specific speaker demographics.
- Acoustic Environments: Improving performance in noisy environments or with lower-quality recording hardware.
- Vocabulary Expansion: Adding recognition for additional medical vocabulary that was not included in its training.
- Formatting: Improving consistency in the handling of temporal data such as dates, times, and durations.
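Fine-tuning on your own data typically starts from paired audio files and reference transcripts. The snippet below sketches one common way to organize such pairs, as a JSONL manifest; the field names and file paths here are hypothetical, and the Fine-tuning notebook defines the actual format MedASR expects.

```python
import json

# Hypothetical audio/transcript pairs; paths and schema are illustrative only.
examples = [
    {"audio_path": "dictations/radiology_001.wav",
     "transcript": "No acute cardiopulmonary abnormality."},
    {"audio_path": "dictations/radiology_002.wav",
     "transcript": "Follow-up chest x-ray on March 3, 2024 at 9:30 a.m."},
]

# Write one JSON object per line, a layout many ASR training tools accept.
with open("train_manifest.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```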
For an example of how to fine-tune MedASR on your own data, see the Fine-tuning notebook.
Generative AI Integration
MedASR serves as a critical component in multimodal healthcare applications. By converting speech to text, it enables integration with Large Language Models (LLMs).
For example, a developer can build a pipeline where:
- MedASR transcribes a raw audio recording of a patient visit.
- MedGemma takes that transcript as input to automatically generate a Subjective, Objective, Assessment and Plan (SOAP) note or summarize key symptoms and medications.
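The two-step pipeline above can be sketched structurally as a pair of composed functions. Both bodies below are stubs standing in for real MedASR and MedGemma inference calls, which neither model's API is assumed here; only the data flow from audio to transcript to prompt is illustrated.

```python
def transcribe_with_medasr(waveform) -> str:
    """Stub for a MedASR inference call; returns canned text for illustration."""
    return "Patient reports intermittent chest pain for the past two weeks."


def build_soap_prompt(transcript: str) -> str:
    """Assemble a prompt for MedGemma; a real pipeline would send this
    string to the LLM and return its generated SOAP note."""
    return ("Convert the following visit transcript into a SOAP note "
            "(Subjective, Objective, Assessment, Plan):\n\n" + transcript)


# Real code would pass 16 kHz mono int16 audio instead of None.
transcript = transcribe_with_medasr(None)
prompt = build_soap_prompt(transcript)
```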