Text Embeddings & RAG Pipeline
Build production RAG pipelines with BGE-M3 embeddings on VoltageGPU. Semantic search, document retrieval, and knowledge bases at scale.
Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning power of LLMs. VoltageGPU provides both the embedding models to vectorize your data and the LLM inference to generate answers. Run BGE-M3 and other embedding models via API, store vectors in your preferred database, and query them alongside DeepSeek, Llama, or Qwen for accurate, grounded responses.
Key Benefits
Fast Embeddings
BGE-M3 on GPU generates 1,000+ embeddings per second. Vectorize millions of documents in minutes.
OpenAI-Compatible
Drop-in replacement for OpenAI embeddings API. Change one line of code to switch providers.
Multilingual
BGE-M3 supports 100+ languages natively. Build multilingual search and RAG without separate models.
End-to-End Stack
Embeddings + LLM inference on the same platform. No data transfer costs between providers.
Cost-Effective
Embedding API at $0.005 per 1M tokens vs $0.13 on OpenAI. 26x cheaper for the same quality.
Private & Secure
Your documents never leave VoltageGPU infrastructure. No data used for model training.
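Throughput figures like 1,000+ embeddings per second assume batched requests rather than one document per call. A minimal client-side batching sketch; the batch size of 64 is an assumption, so check the API's per-request input limit:

```python
# Split documents into fixed-size batches before calling the embeddings
# endpoint. Batch size of 64 is illustrative, not an API limit.
def batched(items, size=64):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

docs = [f"document {i}" for i in range(150)]
for batch in batched(docs):
    # Each batch would be sent as one embeddings request, e.g.:
    # client.embeddings.create(model="BAAI/bge-m3", input=batch)
    print(len(batch))
```

Each iteration maps to a single embeddings request, so 150 documents cost three API calls instead of 150.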
Recommended GPUs
Recommended Models
Code Example
from openai import OpenAI

# Initialize VoltageGPU client
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="YOUR_VOLTAGE_API_KEY",
)

# Step 1: Generate embeddings for your documents
documents = [
    "VoltageGPU offers H100 GPUs at $2.49 per hour.",
    "Fine-tuning with LoRA reduces VRAM requirements by 10x.",
    "RAG pipelines combine retrieval with LLM generation.",
    "BGE-M3 supports multilingual embeddings in 100+ languages.",
]

embeddings_response = client.embeddings.create(
    model="BAAI/bge-m3",
    input=documents,
)
vectors = [e.embedding for e in embeddings_response.data]
print(f"Generated {len(vectors)} embeddings of dim {len(vectors[0])}")

# Step 2: Embed the query (after storing the document vectors in your DB)
query = "How much does an H100 cost?"
query_embedding = client.embeddings.create(
    model="BAAI/bge-m3",
    input=[query],
).data[0].embedding

# Step 3: Use retrieved context with an LLM
retrieved_docs = ["VoltageGPU offers H100 GPUs at $2.49 per hour."]
# Join outside the f-string: backslashes inside f-string expressions
# are a SyntaxError on Python versions before 3.12
context = "\n".join(retrieved_docs)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
Frequently Asked Questions
What is RAG and why should I use it?
What embedding models are available on VoltageGPU?
How does VoltageGPU embedding pricing compare to OpenAI?
Can I use VoltageGPU embeddings with Pinecone, Weaviate, or ChromaDB?
How many documents can I embed per second?
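The code example above hard-codes retrieved_docs; in a real pipeline, a vector database (Pinecone, Weaviate, ChromaDB, etc.) ranks stored vectors by similarity to the query embedding. A minimal pure-Python sketch of that ranking step, using cosine similarity and toy 3-dimensional vectors in place of real BGE-M3 embeddings (which are 1024-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(doc_vecs[i], query_vec),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy vectors standing in for real embeddings
docs = ["H100 pricing", "LoRA fine-tuning", "RAG pipelines"]
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]  # closest to the first document

print(top_k(query_vec, doc_vecs, docs, k=1))  # → ['H100 pricing']
```

A vector database performs the same ranking with approximate-nearest-neighbor indexes so it scales to millions of vectors; the brute-force loop here is only for illustrating the logic.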
Explore Other Use Cases
Start Building Now
Deploy a GPU pod in under 60 seconds. $5 free credits, no credit card required.
Browse Available GPUs →
Explore Models