villesau / whisper-timestamped

Transcribes audio using Whisper Large V3 with precise word-level timestamps and confidence scores.


Run time and cost

This model costs approximately $0.050 to run on Replicate, or around 20 runs per $1, though the exact cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 88 seconds, though predict time varies significantly with the inputs.

Readme

Whisper-Timestamped Transcription Model (Large V3)

Overview

This model provides speech recognition with word-level timestamps using the whisper-timestamped library and Whisper Large V3. It’s designed for transcribing audio files, offering precise timing information for each transcribed word.

Features

  • Uses Whisper Large V3 for state-of-the-art speech recognition
  • Efficient and accurate word-level timestamps
  • Voice Activity Detection (VAD) to improve transcription accuracy
  • Confidence scores for each word
  • Detection of speech disfluencies
  • Support for multiple languages
  • Options for transcription or translation to English

Usage

To use this model, provide an audio file. The model will process the audio and return a JSON object containing the transcription with detailed timing information for segments and individual words.
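A minimal sketch of calling the model from the Replicate Python client is shown below. The input key "audio" and the unpinned model reference are assumptions, so check the input schema on this page before relying on them.

```python
# Minimal sketch using the Replicate Python client (pip install replicate).
# The input key "audio" is an assumption; check the input schema on this page
# for the exact parameter names, and pin a version with ":<version>" if needed.
import replicate

output = replicate.run(
    "villesau/whisper-timestamped",
    input={
        "audio": open("interview.mp3", "rb"),  # local file; a public URL string also works
    },
)

print(output)  # JSON-like structure with segments and word-level timing
```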

For detailed information on input parameters and output format, please refer to the model’s input/output specifications on this page.
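As a rough orientation, the upstream whisper-timestamped library nests a list of words inside each segment, with start/end times in seconds and a per-word confidence score. The field names below are assumptions carried over from that library, so defer to the output specification above.

```python
# Illustrative output shape only; the field names ("segments", "words", "start",
# "end", "confidence") follow the upstream whisper-timestamped format and are
# assumed to match this model's output.
example_output = {
    "text": "hello world",
    "language": "en",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": "hello world",
            "confidence": 0.93,
            "words": [
                {"text": "hello", "start": 0.0, "end": 0.5, "confidence": 0.95},
                {"text": "world", "start": 0.6, "end": 1.2, "confidence": 0.91},
            ],
        }
    ],
}

# Flatten to a simple list of words with timings.
for segment in example_output["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:5.2f}-{word["end"]:5.2f}  {word["text"]:<10} '
              f'(confidence {word["confidence"]:.2f})')
```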

About

This model is hosted on Replicate and uses the whisper-timestamped library, an extension of OpenAI’s Whisper, together with the Whisper Large V3 checkpoint. For more information about whisper-timestamped, visit the GitHub repository.