Text Embeddings & RAG Pipeline

Build production RAG pipelines with BGE-M3 embeddings on VoltageGPU. Semantic search, document retrieval, and knowledge bases at scale.

Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning power of LLMs. VoltageGPU provides both the embedding models to vectorize your data and the LLM inference to generate answers. Run BGE-M3 and other embedding models via API, store vectors in your preferred database, and query them alongside DeepSeek, Llama, or Qwen for accurate, grounded responses.

Key Benefits

Fast Embeddings

BGE-M3 on GPU generates 1,000+ embeddings per second. Vectorize millions of documents in minutes.

🔌

OpenAI-Compatible

Drop-in replacement for OpenAI embeddings API. Change one line of code to switch providers.

🌍

Multilingual

BGE-M3 supports 100+ languages natively. Build multilingual search and RAG without separate models.

🔗

End-to-End Stack

Embeddings + LLM inference on the same platform. No data transfer costs between providers.

💰

Cost-Effective

Embedding API at $0.005 per 1M tokens vs $0.13 on OpenAI. 26x cheaper for the same quality.

🔒

Private & Secure

Your documents never leave VoltageGPU infrastructure. No data used for model training.


Code Example

Python
from openai import OpenAI

# Initialize VoltageGPU client
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="YOUR_VOLTAGE_API_KEY"
)

# Step 1: Generate embeddings for your documents
documents = [
    "VoltageGPU offers H100 GPUs at $2.49 per hour.",
    "Fine-tuning with LoRA reduces VRAM requirements by 10x.",
    "RAG pipelines combine retrieval with LLM generation.",
    "BGE-M3 supports multilingual embeddings in 100+ languages.",
]

embeddings_response = client.embeddings.create(
    model="BAAI/bge-m3",
    input=documents,
)

vectors = [e.embedding for e in embeddings_response.data]
print(f"Generated {len(vectors)} embeddings of dim {len(vectors[0])}")

# Step 2: Query with RAG (after storing vectors in your DB)
query = "How much does an H100 cost?"
query_embedding = client.embeddings.create(
    model="BAAI/bge-m3",
    input=[query],
).data[0].embedding

# Step 3: Use retrieved context with an LLM
retrieved_docs = ["VoltageGPU offers H100 GPUs at $2.49 per hour."]

context = "\n".join(retrieved_docs)  # join outside the f-string (backslashes in f-string expressions are a SyntaxError before Python 3.12)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": query},
    ],
)

print(response.choices[0].message.content)
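Step 3 above hardcodes the retrieved document; in practice `retrieved_docs` comes from ranking your stored vectors against the query embedding. A minimal in-memory version using cosine similarity, with tiny 3-dim toy vectors standing in for real 1024-dim BGE-M3 embeddings (a vector database does this same ranking at scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, docs, k=1):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(
        zip(doc_vecs, docs),
        key=lambda pair: cosine_similarity(query_vec, pair[0]),
        reverse=True,
    )
    return [doc for _, doc in ranked[:k]]

# Toy stand-ins for real embeddings of your corpus
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.1, 0.9]]
docs = ["H100 pricing", "LoRA fine-tuning", "RAG pipelines"]
query_vec = [0.85, 0.15, 0.05]  # pretend this embeds "How much does an H100 cost?"

retrieved_docs = top_k(query_vec, doc_vecs, docs, k=1)
print(retrieved_docs)  # → ['H100 pricing']
```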

Frequently Asked Questions

What is RAG and why should I use it?
RAG (Retrieval-Augmented Generation) is a technique that enhances LLM responses with relevant information retrieved from your own documents. Instead of relying solely on the model's training data, RAG grounds answers in your specific knowledge base, reducing hallucinations and providing up-to-date information.
What embedding models are available on VoltageGPU?
We currently offer BGE-M3 (BAAI/bge-m3), which is one of the highest-performing embedding models supporting dense, sparse, and multi-vector retrieval across 100+ languages. Additional embedding models are added based on demand.
How does VoltageGPU embedding pricing compare to OpenAI?
VoltageGPU embeddings cost approximately $0.005 per 1M tokens compared to OpenAI's text-embedding-3-small at $0.02 and text-embedding-3-large at $0.13 per 1M tokens. This makes VoltageGPU 4-26x cheaper depending on the OpenAI model.
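A quick sanity check of those multipliers, using the per-1M-token prices quoted above:

```python
voltage_price = 0.005   # VoltageGPU BGE-M3, $ per 1M tokens
openai_small = 0.02     # OpenAI text-embedding-3-small
openai_large = 0.13     # OpenAI text-embedding-3-large

print(round(openai_small / voltage_price))  # → 4   (4x cheaper)
print(round(openai_large / voltage_price))  # → 26  (26x cheaper)

# Embedding a 500M-token corpus at each rate:
tokens_in_millions = 500
print(f"${tokens_in_millions * voltage_price:.2f}")  # → $2.50
print(f"${tokens_in_millions * openai_large:.2f}")   # → $65.00
```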
Can I use VoltageGPU embeddings with Pinecone, Weaviate, or ChromaDB?
Yes. VoltageGPU embeddings are standard float vectors that work with any vector database. Generate embeddings via our API and store them in Pinecone, Weaviate, ChromaDB, Qdrant, Milvus, or any other vector store.
How many documents can I embed per second?
BGE-M3 on GPU processes approximately 1,000+ short documents (256 tokens each) per second. For a corpus of 1 million documents, full embedding takes about 15-20 minutes and, at $0.005 per 1M tokens, costs about $1.30.
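Those figures follow from simple arithmetic (throughput and price as quoted above; wall-clock time assumes a single GPU with no batching overhead):

```python
docs = 1_000_000
tokens_per_doc = 256
docs_per_second = 1000            # approximate BGE-M3 GPU throughput
price_per_million_tokens = 0.005  # VoltageGPU embedding rate

minutes = docs / docs_per_second / 60
cost = docs * tokens_per_doc / 1_000_000 * price_per_million_tokens

print(f"{minutes:.1f} minutes")  # → 16.7 minutes
print(f"${cost:.2f}")            # → $1.28
```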

Explore Other Use Cases

Start Building Now

Deploy a GPU pod in under 60 seconds. $5 free credits, no credit card required.

Browse Available GPUs →
Explore Models