🎙️

Speech to Text API - Whisper & More

Transcribe audio and video with Whisper Large V3 on VoltageGPU. OpenAI-compatible API. Up to 10x cheaper than alternatives.

VoltageGPU runs OpenAI Whisper Large V3 and other speech-to-text models on GPU-accelerated infrastructure for fast, accurate transcription. Process hours of audio in minutes, support 99+ languages, and get word-level timestamps. Our OpenAI-compatible API makes migration effortless, and our pricing is up to 10x lower than hosted alternatives.

Key Benefits

🎯

Whisper Large V3

The most accurate open-source speech recognition model, with 99%+ accuracy on clean English audio.

🌐

99+ Languages

Transcribe audio in over 99 languages with automatic language detection. No model switching needed.

🔌

OpenAI-Compatible

Use the same OpenAI SDK and API format. Migrate from OpenAI Whisper API by changing one URL.

⏱️

Word-Level Timestamps

Get precise word-level timestamps for subtitle generation, content navigation, and searchable audio.

💰

10x Cheaper

Whisper API on VoltageGPU costs ~$0.003/min vs $0.006/min on OpenAI. Even cheaper for bulk processing.

📚

Batch Processing

Transcribe hundreds of hours of audio in parallel. Ideal for podcast archives, call centers, and media companies.
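The parallel-processing claim above can be sketched with Python's standard thread pool. This is a minimal illustration, not part of the VoltageGPU API: `batch_transcribe` and `transcribe_fn` are hypothetical names, and in real use `transcribe_fn` would open each file and call `client.audio.transcriptions.create` as in the code example below.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_transcribe(paths, transcribe_fn, max_workers=8):
    """Run transcribe_fn over many audio files in parallel.

    Returns a dict mapping each path to its transcript, or to the
    exception raised for that file, so one bad file does not abort
    the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file, remembering which future belongs to which path.
        futures = {pool.submit(transcribe_fn, p): p for p in paths}
        for future in futures:
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going on per-file failures
                results[path] = exc
    return results
```

Because transcription is I/O-bound (waiting on the API), threads are sufficient here; `max_workers` would be tuned to your rate limits.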


Code Example

Python
from openai import OpenAI

# Initialize VoltageGPU client
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="YOUR_VOLTAGE_API_KEY"
)

# Transcribe an audio file
with open("interview.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],
    )

print(f"Transcription: {transcript.text}")
print(f"Language: {transcript.language}")
print(f"Duration: {transcript.duration}s")

# Access word-level timestamps
for word in transcript.words:
    print(f"  [{word.start:.2f}s - {word.end:.2f}s] {word.word}")

# Translation (any language to English)
with open("french_podcast.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(f"English translation: {translation.text}")

Frequently Asked Questions

How accurate is Whisper Large V3 on VoltageGPU?
Whisper Large V3 achieves state-of-the-art accuracy: 99%+ on clean English audio, 95%+ on conversational English, and 90%+ on most of the 99+ supported languages. Accuracy depends on audio quality, background noise, and language.
How does pricing compare to OpenAI Whisper API?
VoltageGPU's Whisper API costs approximately $0.003 per minute of audio, versus $0.006 per minute on OpenAI. Transcribing 10,000 minutes of audio per month saves you $30 per month, and bulk pricing is even lower.
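The arithmetic behind that comparison, as a quick sanity check. The two rates are the published per-minute prices quoted above; the helper name is illustrative.

```python
VOLTAGE_RATE = 0.003  # USD per minute of audio on VoltageGPU
OPENAI_RATE = 0.006   # USD per minute of audio on OpenAI

def monthly_savings(minutes_per_month: float) -> float:
    """Dollars saved per month at the two list prices."""
    return round(minutes_per_month * (OPENAI_RATE - VOLTAGE_RATE), 2)

# 10,000 minutes/month at half the per-minute price saves $30/month.
print(monthly_savings(10_000))
```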
What audio formats are supported?
We support all major audio and video formats: MP3, MP4, WAV, FLAC, OGG, WebM, M4A, and more. Maximum file size is 500MB per request. For larger files, we recommend splitting them into chunks.
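One way to split a file that exceeds the 500MB limit is to compute slightly overlapping chunk boundaries and cut at them with a tool such as ffmpeg (`-ss`/`-to`). A minimal sketch: `chunk_spans` is a hypothetical helper, and the 10-minute chunk length and 5-second overlap are illustrative choices, not API limits.

```python
def chunk_spans(duration_s, chunk_s=600, overlap_s=5):
    """Return (start, end) spans in seconds covering duration_s.

    Consecutive spans overlap by overlap_s so that words spoken at a
    chunk boundary are not lost; deduplicate them when merging
    transcripts.
    """
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # back up slightly for the next chunk
    return spans
```

Each span would then be transcribed independently (and in parallel, if you like) before stitching the results back together.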
Can I get subtitles in SRT or VTT format?
Yes. Set response_format to "srt" or "vtt" in your API request to get ready-to-use subtitle files with timestamps. You can also request "verbose_json" for word-level timestamps and build custom subtitle formats.
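For the custom-subtitle path, the word-level timestamps from a "verbose_json" response can be formatted into SRT cues by hand. A sketch under those assumptions; `srt_timestamp` and `make_cue` are illustrative helpers, not part of the API.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def make_cue(index: int, start: float, end: float, text: str) -> str:
    """One numbered SRT cue block for the words between start and end."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
```

In practice you would group the `transcript.words` entries into phrase-sized cues (by pause length or word count) and join the resulting blocks with blank lines.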
Is real-time transcription supported?
Currently, VoltageGPU Whisper API processes pre-recorded audio files. For real-time streaming transcription, you can deploy a custom Whisper streaming pipeline on a VoltageGPU pod with WebSocket support.


Start Building Now

Deploy a GPU pod in under 60 seconds. $5 free credits, no credit card required.

Browse Available GPUs →
Explore Models