Question 1

What audio formats are supported for transcription?

Accepted Answer

The tool supports MP3, WAV, OGG, FLAC, M4A, WebM, and most other common audio formats. The maximum file size is 10 MB. For longer recordings, consider splitting the audio into segments that each fit under the limit. Files recorded at standard quality (44.1 kHz, 128 kbps or higher) produce the best transcription results.
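
As a rough illustration of the limits above, here is a minimal sketch that checks a file before upload and splits a WAV recording into shorter segments. The names `check_upload` and `split_wav` are hypothetical, the extension list simply mirrors the formats named in this answer, and treating "10 MB" as 10 × 1024 × 1024 bytes is an assumption. The splitter uses only Python's standard `wave` module, so it handles uncompressed WAV only; other formats would need a separate audio library.

```python
import io
import wave
from pathlib import Path

# Assumption: "10 MB" is read as 10 MiB; the tool may count bytes differently.
MAX_BYTES = 10 * 1024 * 1024
# Mirrors the formats listed in the answer; not an authoritative list.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a", ".webm"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems


def split_wav(data: bytes, max_seconds: float) -> list[bytes]:
    """Split WAV bytes into consecutive WAV segments of at most max_seconds each."""
    segments = []
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames_per_segment = max(1, int(src.getframerate() * max_seconds))
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy channel count, sample width and rate from the source;
                # the frame count in the header is corrected when dst closes.
                dst.setparams(params)
                dst.writeframes(frames)
            segments.append(buf.getvalue())
    return segments
```

Each returned segment is a complete WAV file in memory, so it can be checked with `check_upload` and submitted on its own. Cutting at fixed intervals can split a word in half; cutting at pauses would be gentler on accuracy but is beyond this sketch.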

Question 2

Which languages can Whisper transcribe?

Accepted Answer

Whisper Large V3 supports 99 languages and detects the language automatically from the audio content, so you don't need to specify it. Supported languages include English, French, Arabic, Spanish, German, Chinese, Japanese, Portuguese, Russian, Italian, Korean, Turkish, Hindi, and many more. Accuracy is highest for languages with large amounts of audio training data, particularly English and major European languages.

Question 3

How accurate is the transcription?

Accepted Answer

For clear speech recordings in well-supported languages, Whisper Large V3 achieves near-human accuracy. Word error rates for English are typically under 5% for clear studio recordings. Performance degrades with heavy background noise, strong accents, overlapping speakers, or very specialized technical vocabulary not in the training data. For professional-grade transcription of challenging audio, manual review is recommended.
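
To make the word error rate figure concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` is illustrative, and the normalization (lowercasing and whitespace splitting only) is a simplifying assumption; published benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing "the cat sat on the mat" with "the cat sat on a mat" gives one substitution over six reference words, a WER of about 16.7%. A WER under 5% means fewer than one word in twenty differs from the reference.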

Question 4

Why does transcription sometimes take a while?

Accepted Answer

Transcription time depends on audio length and model availability. The Whisper Large V3 model has over 1.5 billion parameters and requires GPU resources to run. If the model hasn't been used recently, the first request may take 20–30 seconds while the model loads. Once the model is warm, subsequent transcriptions are faster. Longer audio files naturally take more processing time than short clips.
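
One way a client could cope with the cold-start delay described above is to retry with increasing pauses until the model is ready. The sketch below is generic and does not describe how this tool or any particular API signals a loading model: `ModelLoadingError`, `call_with_warmup_retry`, and the default timings are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ModelLoadingError(Exception):
    """Hypothetical error a client raises when the model is still loading."""


def call_with_warmup_retry(
    request: Callable[[], T],
    max_wait_seconds: float = 60.0,
    initial_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request() and retry with exponential backoff while the model loads."""
    waited = 0.0
    delay = initial_delay
    while True:
        try:
            return request()
        except ModelLoadingError:
            # Give up once the next pause would exceed the total wait budget.
            if waited + delay > max_wait_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 16.0)  # cap each pause at 16 seconds
```

With the defaults, the pauses are 2, 4, 8, and 16 seconds, which comfortably covers a 20–30 second load. The `sleep` parameter exists only so the waiting can be replaced in tests.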

Question 5

Is my audio file stored after transcription?

Accepted Answer

No. Your audio file is sent to the transcription API, processed in real time, and the text is returned to your browser. We do not store audio files on our servers. The file exists in memory only for the duration of the API call and is discarded after the response.
