Nvidia NIM - Rerank
Use Nvidia NIM Rerank models through LiteLLM.
| Property | Details | 
|---|---|
| Description | Nvidia NIM provides high-performance reranking models for semantic search and retrieval-augmented generation (RAG) | 
| Provider Doc | Nvidia NIM Rerank API |
| Supported Endpoint | /rerank | 
Overview
Nvidia NIM rerank models help you:
- Reorder search results by relevance to a query
- Improve RAG (Retrieval-Augmented Generation) accuracy
- Filter and rank large document sets efficiently
Supported Models:
- All rerank models available on the Nvidia NIM platform
See Nvidia NIM for the full list of rerank models that LiteLLM supports.
Usage
LiteLLM Python SDK
LLaMa 1B Model:
import litellm
import os
os.environ['NVIDIA_NIM_API_KEY'] = "nvapi-..."
response = litellm.rerank(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="What is the GPU memory bandwidth of H100 SXM?",
    documents=[
        "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast chip-to-chip interconnect, delivering 900GB/s of bandwidth.",
        "A100 provides up to 20X higher performance over the prior generation.",
        "Accelerated servers with H100 deliver 3 terabytes per second (TB/s) of memory bandwidth per GPU."
    ],
    top_n=3,
)
print(response)
Mistral 4B Model:
import litellm
import os
os.environ['NVIDIA_NIM_API_KEY'] = "nvapi-..."
response = litellm.rerank(
    model="nvidia_nim/nvidia/nv-rerankqa-mistral-4b-v3",
    query="What is the GPU memory bandwidth of H100 SXM?",
    documents=[
        "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast chip-to-chip interconnect, delivering 900GB/s of bandwidth.",
        "A100 provides up to 20X higher performance over the prior generation.",
        "Accelerated servers with H100 deliver 3 terabytes per second (TB/s) of memory bandwidth per GPU."
    ],
    top_n=3,
)
print(response)
Response:
{
    "results": [
        {
            "index": 2,
            "relevance_score": 6.828125,
            "document": {
                "text": "Accelerated servers with H100 deliver 3 terabytes per second (TB/s) of memory bandwidth per GPU."
            }
        },
        {
            "index": 0,
            "relevance_score": -1.564453125,
            "document": {
                "text": "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast chip-to-chip interconnect, delivering 900GB/s of bandwidth."
            }
        }
    ]
}
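Each result carries the `index` of the document in the original `documents` list, and scores can be negative, as in the example above. A minimal sketch of consuming such a response, assuming it has already been converted to a plain dict shaped like that example. The helper name `rank_documents` and the `min_score` cutoff are illustrative and not part of LiteLLM:

```python
from typing import Any, Dict, List, Optional, Tuple


def rank_documents(
    response: Dict[str, Any], min_score: Optional[float] = None
) -> List[Tuple[int, float, str]]:
    """Return (original_index, score, text) tuples, highest score first."""
    # Sort defensively rather than relying on the order of the response.
    ranked = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    rows = []
    for result in ranked:
        score = result["relevance_score"]
        if min_score is not None and score < min_score:
            continue  # drop results below the (hypothetical) cutoff
        rows.append((result["index"], score, result["document"]["text"]))
    return rows


# Sample data copied from the response shown above.
sample = {
    "results": [
        {
            "index": 2,
            "relevance_score": 6.828125,
            "document": {
                "text": "Accelerated servers with H100 deliver 3 terabytes per second (TB/s) of memory bandwidth per GPU."
            },
        },
        {
            "index": 0,
            "relevance_score": -1.564453125,
            "document": {
                "text": "The Hopper GPU is paired with the Grace CPU using NVIDIA's ultra-fast chip-to-chip interconnect, delivering 900GB/s of bandwidth."
            },
        },
    ]
}

for index, score, text in rank_documents(sample, min_score=0.0):
    print(f"doc {index} (score {score:.2f}): {text}")
```

A score cutoff like this is a design choice rather than a requirement; a sensible threshold depends on the model and the data.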
Usage with LiteLLM Proxy
1. Setup Config
Add Nvidia NIM rerank models to your proxy configuration:
model_list:
  - model_name: nvidia-rerank
    litellm_params:
      model: nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2
      api_key: os.environ/NVIDIA_NIM_API_KEY
2. Start Proxy
litellm --config /path/to/config.yaml
3. Make Rerank Requests
curl -X POST http://0.0.0.0:4000/rerank \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia-rerank",
    "query": "What is the GPU memory bandwidth of H100?",
    "documents": [
      "H100 delivers 3TB/s memory bandwidth",
      "A100 has 2TB/s memory bandwidth",
      "V100 offers 900GB/s memory bandwidth"
    ],
    "top_n": 2
  }'
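The same request can be sent from Python. A minimal sketch using only the standard library, assuming the proxy listens on http://0.0.0.0:4000 and accepts the key `sk-1234`, as in the curl example. `build_rerank_request` and `send` are hypothetical helper names:

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional


def build_rerank_request(
    base_url: str,
    api_key: str,
    model: str,
    query: str,
    documents: List[str],
    top_n: Optional[int] = None,
) -> urllib.request.Request:
    """Build a POST request for the proxy's /rerank endpoint."""
    payload: Dict[str, Any] = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        payload["top_n"] = top_n
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Dict[str, Any]:
    """Send the request and decode the JSON body (requires a running proxy)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


request = build_rerank_request(
    base_url="http://0.0.0.0:4000",
    api_key="sk-1234",
    model="nvidia-rerank",  # the model_name alias from the proxy config
    query="What is the GPU memory bandwidth of H100?",
    documents=[
        "H100 delivers 3TB/s memory bandwidth",
        "A100 has 2TB/s memory bandwidth",
        "V100 offers 900GB/s memory bandwidth",
    ],
    top_n=2,
)
# Call send(request) once the proxy from step 2 is running.
```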
API Parameters
Required Parameters
| Parameter | Type | Description | 
|---|---|---|
| model | string | The Nvidia NIM rerank model name with the nvidia_nim/ prefix | 
| query | string | The search query to rank documents against | 
| documents | array | List of documents to rank (1-1000 documents) | 
Optional Parameters
| Parameter | Type | Default | Description | 
|---|---|---|---|
| top_n | integer | All documents | Number of top-ranked documents to return | 
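The constraints in the tables above can be checked on the client before a request is sent. This is a sketch only: `validate_rerank_args` is a hypothetical helper, not a LiteLLM function. The 1-1000 document limit and the `nvidia_nim/` prefix come from the tables above; rejecting an empty query or a `top_n` below 1 is an assumption. The prefix check applies to direct SDK calls, not to proxy aliases such as `nvidia-rerank`.

```python
from typing import Optional, Sequence

MAX_DOCUMENTS = 1000  # upper bound stated in the Required Parameters table


def validate_rerank_args(
    model: str,
    query: str,
    documents: Sequence[str],
    top_n: Optional[int] = None,
) -> None:
    """Raise ValueError if the arguments violate the documented constraints."""
    if not model.startswith("nvidia_nim/"):
        raise ValueError("model must start with the 'nvidia_nim/' prefix")
    if not query:
        # Assumption: an empty query is treated as a client-side error.
        raise ValueError("query must be a non-empty string")
    if not 1 <= len(documents) <= MAX_DOCUMENTS:
        raise ValueError(f"documents must contain 1-{MAX_DOCUMENTS} items")
    if top_n is not None and top_n < 1:
        # Assumption: top_n, when given, must be a positive integer.
        raise ValueError("top_n must be at least 1")


# Passes silently when the arguments are acceptable.
validate_rerank_args(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="GPU performance",
    documents=["High performance computing", "Fast GPU processing"],
    top_n=2,
)
```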
Nvidia-Specific Parameters
truncate: Controls how text is truncated if it exceeds the model's context window
- "NONE": No truncation (request may fail if too long)
- "END": Truncate from the end of the text
import litellm
response = litellm.rerank(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="GPU performance",
    documents=["High performance computing", "Fast GPU processing"],
    top_n=2,
    truncate="END",  # Nvidia-specific parameter
)
Authentication
Set your Nvidia NIM API key:
Environment Variable:
export NVIDIA_NIM_API_KEY="nvapi-..."
Python:
import litellm
import os
os.environ['NVIDIA_NIM_API_KEY'] = "nvapi-..."
# Or pass directly
response = litellm.rerank(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="test",
    documents=["doc1"],
    api_key="nvapi-...",
)
API Endpoint
The rerank endpoint uses a different base URL than chat/embeddings:
- Chat/Embeddings: https://integrate.api.nvidia.com/v1/
- Rerank: https://ai.api.nvidia.com/v1/
LiteLLM automatically uses the correct endpoint for rerank requests.
Custom API Base URL
You can override the default base URL in several ways:
Option 1: Environment Variable
export NVIDIA_NIM_API_BASE="https://your-custom-endpoint.com"
Option 2: Pass as parameter
import litellm
response = litellm.rerank(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="test",
    documents=["doc1"],
    api_base="https://your-custom-endpoint.com",
)
Option 3: Full URL (including model path)
If you have the complete endpoint URL, you can pass it directly:
import litellm
response = litellm.rerank(
    model="nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
    query="test",
    documents=["doc1"],
    api_base="https://your-custom-endpoint.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking",
)
LiteLLM will detect the full URL (by checking for /retrieval/ in the path) and use it as-is.
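The full-URL detection described above can be pictured with the following sketch. It is an illustration, not LiteLLM's actual code: `resolve_rerank_url` is a hypothetical name, the expansion of a base URL into `/v1/retrieval/<model>/reranking` is inferred from the Option 3 example, and giving the `api_base` parameter precedence over the environment variable is an assumption.

```python
import os
from typing import Optional

# Default rerank base URL, as listed in the API Endpoint section.
DEFAULT_RERANK_BASE = "https://ai.api.nvidia.com/v1/"
PROVIDER_PREFIX = "nvidia_nim/"


def resolve_rerank_url(model: str, api_base: Optional[str] = None) -> str:
    """Return the URL a rerank request would be sent to (illustrative only)."""
    # Assumption: explicit parameter first, then env var, then the default.
    base = api_base or os.environ.get("NVIDIA_NIM_API_BASE") or DEFAULT_RERANK_BASE

    # A path containing /retrieval/ is treated as a complete endpoint URL.
    if "/retrieval/" in base:
        return base

    # Strip the provider prefix to get the model path, e.g. "nvidia/<name>".
    model_path = model[len(PROVIDER_PREFIX):] if model.startswith(PROVIDER_PREFIX) else model

    base = base.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"  # assumption inferred from the Option 3 example URL
    return f"{base}/retrieval/{model_path}/reranking"


print(
    resolve_rerank_url(
        "nvidia_nim/nvidia/llama-3_2-nv-rerankqa-1b-v2",
        api_base="https://your-custom-endpoint.com",
    )
)
```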
How do I get an API key?
Get your Nvidia NIM API key from Nvidia's website.