Text Embeddings & RAG Pipeline
Build production RAG pipelines with BGE-M3 embeddings on VoltageGPU. Semantic search, document retrieval, and knowledge bases at scale.
Retrieval-Augmented Generation (RAG) combines the knowledge of your documents with the reasoning power of LLMs. VoltageGPU provides both the embedding models to vectorize your data and the LLM inference to generate answers. Run BGE-M3 and other embedding models via API, store vectors in your preferred database, and query them alongside DeepSeek, Llama, or Qwen for accurate, grounded responses.
Key Benefits
Fast Embeddings
BGE-M3 on GPU generates 1,000+ embeddings per second. Vectorize millions of documents in minutes.
OpenAI-Compatible
Drop-in replacement for OpenAI embeddings API. Change one line of code to switch providers.
Multilingual
BGE-M3 supports 100+ languages natively. Build multilingual search and RAG without separate models.
End-to-End Stack
Embeddings + LLM inference on the same platform. No data transfer costs between providers.
Cost-Effective
Embedding API at $0.005 per 1M tokens vs $0.13 on OpenAI. 26x cheaper for the same quality.
Private & Secure
Your documents never leave VoltageGPU infrastructure. No data used for model training.
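Throughput figures like 1,000+ embeddings per second assume batched requests rather than one document per call. A minimal client-side batching sketch; the batch size of 64 is an assumption, so check the API's per-request input limit:

```python
# Split documents into fixed-size batches before calling the embeddings
# endpoint. Batch size of 64 is illustrative, not an API limit.
def batched(items, size=64):
    """Yield successive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

docs = [f"document {i}" for i in range(150)]
for batch in batched(docs):
    # Each batch would be sent as one embeddings request, e.g.:
    # client.embeddings.create(model="BAAI/bge-m3", input=batch)
    print(len(batch))
```

Each iteration maps to a single embeddings request, so 150 documents cost three API calls instead of 150.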
Recommended GPUs
Recommended Models
Code Example
from openai import OpenAI

# Initialize VoltageGPU client
client = OpenAI(
    base_url="https://api.voltagegpu.com/v1",
    api_key="YOUR_VOLTAGE_API_KEY",
)

# Step 1: Generate embeddings for your documents
documents = [
    "VoltageGPU offers H100 GPUs at $2.49 per hour.",
    "Fine-tuning with LoRA reduces VRAM requirements by 10x.",
    "RAG pipelines combine retrieval with LLM generation.",
    "BGE-M3 supports multilingual embeddings in 100+ languages.",
]

embeddings_response = client.embeddings.create(
    model="BAAI/bge-m3",
    input=documents,
)
vectors = [e.embedding for e in embeddings_response.data]
print(f"Generated {len(vectors)} embeddings of dim {len(vectors[0])}")

# Step 2: Embed the query (after storing the document vectors in your DB)
query = "How much does an H100 cost?"
query_embedding = client.embeddings.create(
    model="BAAI/bge-m3",
    input=[query],
).data[0].embedding

# Step 3: Use retrieved context with an LLM
retrieved_docs = ["VoltageGPU offers H100 GPUs at $2.49 per hour."]
# Join outside the f-string: backslashes inside f-string expressions
# are a SyntaxError on Python versions before 3.12
context = "\n".join(retrieved_docs)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-0324",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)
Frequently Asked Questions
What is RAG and why should I use it?
What embedding models are available on VoltageGPU?
How does VoltageGPU embedding pricing compare to OpenAI?
Can I use VoltageGPU embeddings with Pinecone, Weaviate, or ChromaDB?
How many documents can I embed per second?
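The code example above hard-codes retrieved_docs; in a real pipeline, a vector database (Pinecone, Weaviate, ChromaDB, etc.) ranks stored vectors by similarity to the query embedding. A minimal pure-Python sketch of that ranking step, using cosine similarity and toy 3-dimensional vectors in place of real BGE-M3 embeddings (which are 1024-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(doc_vecs[i], query_vec),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]

# Toy vectors standing in for real embeddings
docs = ["H100 pricing", "LoRA fine-tuning", "RAG pipelines"]
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]  # closest to the first document

print(top_k(query_vec, doc_vecs, docs, k=1))  # → ['H100 pricing']
```

A vector database performs the same ranking with approximate-nearest-neighbor indexes so it scales to millions of vectors; the brute-force loop here is only for illustrating the logic.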
Explore Other Use Cases
Start Building Now
Deploy a GPU pod in under 60 seconds. $5 free credits, no credit card required.
Browse Available GPUs →
Explore Models