zsxkib / whisper-lazyloading

Convert speech in audio to text w/ `tiny`, `base`, `small`, `medium`, and `large-v3` models

Run time and cost

This model runs on Nvidia T4 GPU hardware.

Readme

Whisper w/ Lazy Loading

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition, translation, and language identification.

This version lets you choose among several model sizes, so you can trade speed for accuracy depending on the use case.

Model Versions

| Model Size | Description |
|------------|-------------|
| tiny | Fastest, lowest accuracy |
| base | Fast, lower accuracy |
| small | Balanced speed and accuracy |
| medium | Slower, higher accuracy |
| large-v3 | Slowest, highest accuracy |

If you only need the large-v3 model, check out our single-model version.
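
As a rough illustration of what "lazy loading" across these sizes could look like, here is a minimal sketch that loads a model only the first time its size is requested and reuses it afterwards. The `LazyModelCache` class and the `loader` argument are hypothetical names chosen for this example, not this model's actual code; with the `openai-whisper` package installed, `whisper.load_model` could be passed in as the loader.

```python
from typing import Any, Callable, Dict

# Sizes taken from the table above.
VALID_SIZES = ("tiny", "base", "small", "medium", "large-v3")


class LazyModelCache:
    """Hypothetical helper: load each model size on first use, then reuse it."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        # `loader` is any function that maps a size name to a loaded model,
        # for example `whisper.load_model` (assumption: openai-whisper is installed).
        self._loader = loader
        self._models: Dict[str, Any] = {}

    def get(self, size: str) -> Any:
        if size not in VALID_SIZES:
            raise ValueError(f"unknown model size {size!r}; expected one of {VALID_SIZES}")
        if size not in self._models:
            # First request for this size: pay the loading cost once.
            self._models[size] = self._loader(size)
        return self._models[size]

    def loaded_sizes(self) -> list:
        """Return the sizes that have been loaded so far, in load order."""
        return list(self._models)


if __name__ == "__main__":
    # Stand-in loader so the sketch runs without downloading any weights.
    cache = LazyModelCache(loader=lambda size: f"<model {size}>")
    cache.get("tiny")
    cache.get("tiny")  # served from the cache, no second load
    print(cache.loaded_sizes())
```

The point of the design is that a request for `tiny` never pays the cost of loading `large-v3`; only the sizes that are actually used end up in memory.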

Model Description

Approach

Whisper uses a Transformer sequence-to-sequence model trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. All of these tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many different stages of a traditional speech processing pipeline.
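
To make the "sequence of tokens" idea concrete, the sketch below assembles the kind of special-token prefix that tells the decoder which task to perform. The token spellings follow the format described in the Whisper paper; the function `build_task_prefix` is a hypothetical helper for illustration and works on plain strings, not on the real tokenizer.

```python
from typing import List


def build_task_prefix(language: str = "en",
                      task: str = "transcribe",
                      timestamps: bool = False) -> List[str]:
    """Hypothetical helper: build the special-token prefix for one decoding task.

    `language` is a language code such as "en" or "fr" (not validated here).
    `task` selects same-language transcription or translation into English.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"task must be 'transcribe' or 'translate', got {task!r}")

    tokens = [
        "<|startoftranscript|>",  # marks the start of decoding
        f"<|{language}|>",        # language tag (predicted or supplied)
        f"<|{task}|>",            # task selector
    ]
    if not timestamps:
        # Ask the decoder for plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    print(" ".join(build_task_prefix("fr", "translate")))
```

Because the task is encoded in the prefix, the same decoder weights can serve transcription, translation, and language identification without separate pipeline stages.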

[Blog] [Paper] [Model card]

License

The code and model weights of Whisper are released under the MIT License. See LICENSE for further details.

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}