Essential terms and definitions for GPU compute and AI infrastructure
Checkpoint - Saved state of a model during training, allowing training to resume or roll back.
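In PyTorch, for example, a checkpoint is typically written with torch.save and restored with torch.load; a minimal sketch with a stand-in model:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save a checkpoint: model weights plus optimizer state, so training can resume.
torch.save({
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pt")

# Restore the saved state to resume or roll back.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
```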
Epoch - One complete pass through the entire training dataset.
Fine-tuning - Adapting a pre-trained model to a specific task or domain, typically by continuing training on a smaller, task-specific dataset.
Inference - The process of using a trained model to make predictions on new data.
LLM (Large Language Model) - An AI model trained on vast amounts of text data for natural language understanding and generation.
Tensor - A multi-dimensional array, the basic data structure of deep learning computations.
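A minimal PyTorch illustration: a tensor carries both a shape and a numeric dtype.

```python
import torch

# A 2x3 tensor: two rows, three columns of 32-bit floats.
t = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(t.shape)   # torch.Size([2, 3])
print(t.dtype)   # torch.float32
```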
Token - The basic unit of text processed by language models; in English text, one token averages roughly four characters.
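The exact split depends on the tokenizer. A quick way to see it is OpenAI's tiktoken library, shown here with the cl100k_base encoding as an example:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("GPU compute is priced per second.")
print(len(tokens))                        # number of tokens, not characters
print([enc.decode([t]) for t in tokens])  # how the text was split
```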
Training - The process of teaching a model by feeding it data and iteratively adjusting its parameters to reduce error.
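As a sketch of how training and epochs fit together: each epoch makes one full pass over the dataset, adjusting parameters batch by batch (PyTorch with hypothetical toy data):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]  # toy batches

for epoch in range(3):                  # one epoch = one full pass over the data
    for inputs, targets in data:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                 # compute gradients
        optimizer.step()                # adjust parameters
```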
Markup - The percentage added to a base cost to determine the final price.
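A worked example with hypothetical numbers: a 20% markup on a $2.00/hr base cost.

```python
base_cost = 2.00          # hypothetical base cost per GPU-hour, USD
markup = 0.20             # 20% markup
final_price = base_cost * (1 + markup)
print(final_price)        # 2.40
```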
SLA (Service Level Agreement) - A contract guaranteeing performance and availability metrics, such as uptime.
API (Application Programming Interface) - A set of protocols and definitions for building and integrating application software.
Hot reload - Updating code or models without restarting the service.
Jupyter Notebook - An interactive computing environment for data science and machine learning.
OpenAI-compatible API - An API that follows OpenAI's interface standards, allowing easy migration between providers.
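With the official openai Python client, compatibility usually means only the base URL and key change; the endpoint and model name below are placeholders.

```python
from openai import OpenAI

# Point the standard OpenAI client at any compatible provider.
client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="example-model",                           # provider's model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```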
GPU (Graphics Processing Unit) - Specialized hardware designed for massively parallel processing, essential for AI training and inference.
VRAM (Video Random Access Memory) - Dedicated memory on a GPU used to store model weights and intermediate computations.
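A common back-of-the-envelope check is whether a model's weights fit in VRAM: parameter count times bytes per parameter. For example, a 7B-parameter model in FP16 (2 bytes per parameter) needs about 14 GB for weights alone, before activations and KV cache.

```python
params = 7e9            # 7B-parameter model
bytes_per_param = 2     # FP16 = 16 bits = 2 bytes
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.0f} GB")   # ~14 GB for weights alone
```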
Decentralized - Describes a distributed system with no single point of control or failure.
Docker - A platform for developing, shipping, and running applications in containers.
Kubernetes - A container orchestration platform for automating deployment and scaling.
Pod - A containerized GPU instance that provides isolated compute resources.
SSH (Secure Shell) - A protocol for secure remote access to computing resources.
Egress - Data transfer out of a cloud service, often subject to additional fees.
Quantization - Reducing the numerical precision of a model's weights to decrease memory usage and increase inference speed.
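A minimal sketch of the idea using symmetric int8 quantization in NumPy; real schemes (per-channel scales, calibration) are more involved.

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric int8 quantization: map the float range onto [-127, 127].
scale = np.abs(weights).max() / 127
q = np.round(weights / scale).astype(np.int8)    # 1 byte per value vs 4

# Dequantize to approximate the original values.
deq = q.astype(np.float32) * scale
print(np.abs(weights - deq).max())               # small rounding error
```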
Batching - Processing multiple requests together to improve hardware utilization and efficiency.
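The efficiency gain comes from running one large operation instead of many small ones; the NumPy sketch below compares per-request matrix-vector products with a single batched matrix multiply.

```python
import numpy as np

W = np.random.randn(512, 512).astype(np.float32)   # stand-in model weights
requests = [np.random.randn(512).astype(np.float32) for _ in range(32)]

# One at a time: 32 separate matrix-vector products.
outputs = [W @ x for x in requests]

# Batched: stack the requests and do a single matrix multiply.
batch = np.stack(requests)          # shape (32, 512)
batched_outputs = batch @ W.T       # shape (32, 512), same results

print(np.allclose(np.stack(outputs), batched_outputs, atol=1e-4))
```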
Cold start - The initial delay when starting a service or loading a model for the first time.
Latency - The time delay between sending a request and receiving a response.
P95 - The 95th-percentile response time: 95% of requests complete faster than this value.
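Given a list of measured response times, P95 is a simple percentile computation (NumPy shown; the latencies here are hypothetical).

```python
import numpy as np

latencies_ms = [112, 98, 105, 240, 101, 97, 130, 1450, 102, 99]  # hypothetical
p95 = np.percentile(latencies_ms, 95)
print(p95)   # 95% of these requests finished faster than this value
```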
Throughput - The amount of data processed per unit of time; for LLM inference, typically measured in tokens per second.
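A worked example with hypothetical numbers:

```python
tokens_generated = 4096      # hypothetical output size
elapsed_seconds = 8.0        # hypothetical wall-clock time
throughput = tokens_generated / elapsed_seconds
print(f"{throughput:.0f} tokens/sec")   # 512 tokens/sec
```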
CUDA - NVIDIA's parallel computing platform and programming model for GPUs.
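From Python, CUDA is usually reached through a framework; the PyTorch sketch below checks for a CUDA device and moves a computation onto the GPU, falling back to CPU if none is present.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b          # runs as a CUDA kernel when a GPU is present
print(c.device)
```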
FP16 - Half-precision floating-point format using 16 bits, common in AI workloads.
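Halving precision halves memory: the NumPy comparison below stores the same array in FP32 and FP16.

```python
import numpy as np

x32 = np.ones(1_000_000, dtype=np.float32)
x16 = x32.astype(np.float16)
print(x32.nbytes)   # 4000000 bytes
print(x16.nbytes)   # 2000000 bytes, half the memory
```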
This glossary covers the most common terms in GPU computing and AI infrastructure.