Inference API
The Inference API allows you to send inference requests to various LLM providers through a single, unified endpoint. This simplifies the process of integrating with multiple providers and managing your API keys.
Below, you will find detailed information about the available endpoints, including the required headers, parameters, and example requests and responses for each operation.
Inference Models
Use the Unified Inference API to send inference requests to different LLM providers.
Request Model
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
model | string | The model to use for the inference request, see our Supported Models. | Yes | N/A |
messages | array | An array of message objects. | Yes | N/A |
parameters | object | An object of parameters for the inference request. | No | N/A |
Parameters
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
temperature | number | The sampling temperature to use. | No | 0.7 |
maxTokens | number | The maximum number of tokens to generate. | No | 256 |
topP | number | The cumulative probability for nucleus sampling. | No | 0.9 |
frequencyPenalty | number | The penalty applied to tokens based on how frequently they have already appeared in the text. | No | N/A |
presencePenalty | number | The penalty applied to tokens that have already appeared in the text at least once. | No | N/A |
stream | boolean | Whether to stream the response. | No | false |
includeUsage | boolean | Whether to include usage information in the response. | No | true |
endUserId | string | The ID of the end user. | No | N/A |
stop | string or null | The stopping sequence for the generated text. | No | null |
jsonOutput | boolean | Whether to output the response in JSON format. | No | N/A |
jsonSchema | object | The schema for the JSON output. | No | N/A |
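For orientation, a request body combining these fields might look like the following (values are illustrative; see the API Reference below for full request examples):
{
  "model": "llama3:latest",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarize this paragraph in one sentence."
        }
      ]
    }
  ],
  "parameters": {
    "temperature": 0.7,
    "maxTokens": 256,
    "topP": 0.9,
    "stream": false
  }
}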
Filters
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
documents_classification | string | The classification level of the documents to filter: internal, external, or confidential. | No | all |
documents_tags | string[] | An array of tags to filter the documents by. | No | N/A |
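The tables above do not show where filters sit in the request body. Purely as an illustrative sketch, assuming they are passed in a top-level filters object alongside messages and parameters (this placement is an assumption, not a documented field), a filtered request might look like:
{
  "model": "llama3:latest",
  "messages": [ ... ],
  "filters": {
    "documents_classification": "internal",
    "documents_tags": ["hr", "policies"]
  }
}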
Response Model
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
schema | string | The schema version of the response. | Yes | N/A |
message | object | The message object containing role and content. | Yes | N/A |
model | string | The model ID used for the inference request. | Yes | N/A |
usage | object | An object containing usage information. | Yes | N/A |
stop_reason | string | The reason why the generation stopped. | Yes | N/A |
created | number | The timestamp when the response was created. | Yes | N/A |
Usage
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
inputTokens | number | The number of tokens in the prompt. | Yes | N/A |
outputTokens | number | The number of tokens in the completion. | Yes | N/A |
totalTokens | number | The total number of tokens used. | Yes | N/A |
Stream Response Model
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
schema | string | The schema version of the response. | Yes | N/A |
created | number | The timestamp when the response was created. | Yes | N/A |
model_id | string | The model ID used for the inference request. | Yes | N/A |
content | string | The content of the response. | Yes | N/A |
usage | object | An object containing usage information. | Yes | N/A |
Shared
Messages
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
role | string | The role of the message sender (e.g., 'user'). | Yes | N/A |
content | array | An array of content objects. | Yes | N/A |
role can be one of user, assistant, or system.
Content
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
type | string | The type of content (e.g., 'text'). | Yes | N/A |
text | string | The text content of the message. | Yes | N/A |
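Putting Messages and Content together, a messages array that carries a short conversation history might look like:
"messages": [
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "Hello, how are you?" }
    ]
  },
  {
    "role": "assistant",
    "content": [
      { "type": "text", "text": "I'm doing well. How can I help you today?" }
    ]
  },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "Translate your last reply to French." }
    ]
  }
]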
Usage
Property Name | Type | Description | Required | Default |
---|---|---|---|---|
inputTokens | number | The number of input tokens. | Yes | N/A |
outputTokens | number | The number of output tokens. | Yes | N/A |
totalTokens | number | The total number of tokens used. | Yes | N/A |
API Reference
Inference Endpoint
This example shows how to send an inference request to an LLM provider.
Required headers
- Name: x-api-key
  Type: string
  Description: The API key of your UsageGuard account.
- Name: x-connection-id
  Type: string
  Description: The ID of the connection to use for this request.
Required fields
- Name: model
  Type: string
  Description: The model to use for the inference request.
- Name: messages
  Type: array
  Description: An array of message objects: typically the user prompt, optionally preceded by prior user and assistant messages when passing conversation history.
- Name: parameters
  Type: object
  Description: An object of parameters for the inference request.
Optional fields
All of the following are passed inside the parameters object (see the example request below).
- Name: maxTokens
  Type: number
  Description: The maximum number of tokens to generate.
- Name: temperature
  Type: number
  Description: The sampling temperature to use.
- Name: topP
  Type: number
  Description: The cumulative probability for nucleus sampling.
- Name: frequencyPenalty
  Type: number
  Description: The penalty applied to tokens based on how frequently they have already appeared in the text.
- Name: presencePenalty
  Type: number
  Description: The penalty applied to tokens that have already appeared in the text at least once.
Request
POST /v1/inference/chat HTTP/1.1
Host: api.usageguard.com
x-api-key: Your_UsageGuard_API_key
x-connection-id: Your_Connection_ID
Content-Type: application/json
{
  "model": "llama3:latest",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Translate the following English text to French: 'Hello, how are you?'"
        }
      ]
    }
  ],
  "parameters": {
    "maxTokens": 60,
    "temperature": 0.7
  }
}
200 OK
{
  "schema": "inference.response.v1",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "Bonjour, comment ça va?"
      }
    ]
  },
  "model": "meta.llama3-8b-instruct-v1:0",
  "usage": {
    "inputTokens": 30,
    "outputTokens": 88,
    "totalTokens": 118
  },
  "stop_reason": "end_turn",
  "created": 1721205824
}
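As a client-side illustration, a minimal TypeScript sketch of the same request using fetch might look like the following. The header names and body shape come from the example above; a runtime with built-in fetch (e.g., Node 18+) is assumed.

// Sketch: send a non-streaming inference request and print the reply.
async function runInference(apiKey: string, connectionId: string): Promise<void> {
  const res = await fetch("https://api.usageguard.com/v1/inference/chat", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "x-connection-id": connectionId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama3:latest",
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Translate the following English text to French: 'Hello, how are you?'" },
          ],
        },
      ],
      parameters: { maxTokens: 60, temperature: 0.7 },
    }),
  });
  if (!res.ok) throw new Error(`Inference request failed: ${res.status}`);

  const data = await res.json();
  // message.content is an array of content objects; take the first text part.
  console.log(data.message.content[0].text);
  console.log(`Total tokens: ${data.usage.totalTokens}`);
}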
Streaming Response Example
This example shows how to send a chat completion request to the Llama3 model and handle the streaming response.
Required headers
- Name: x-api-key
  Type: string
  Description: The API key of your UsageGuard account.
- Name: x-connection-id
  Type: string
  Description: The ID of the connection to use for this request.
- Name: Content-Type
  Type: string
  Description: application/json
Required fields
- Name: model
  Type: string
  Description: The model to use for the inference request.
- Name: messages
  Type: array
  Description: An array of message objects: typically the user prompt, optionally preceded by prior user and assistant messages when passing conversation history.
- Name: parameters
  Type: object
  Description: An object of parameters for the inference request.
- Name: stream
  Type: boolean
  Description: Whether to stream the response. Set to true inside parameters to receive a streamed response.
Request
POST /v1/inference/chat HTTP/1.1
Host: api.usageguard.com
x-api-key: Your_UsageGuard_API_key
x-connection-id: Your_Connection_ID
Content-Type: application/json
{
  "model": "llama3:latest",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Translate the following English text to French: 'Hello, how are you?'"
        }
      ]
    }
  ],
  "parameters": {
    "maxTokens": 60,
    "temperature": 0.7,
    "stream": true
  }
}
Streaming Response
...
data: {"schema":"inference.response.chunk.v1","created":1721208440,"model_id":"meta.llama3-8b-instruct-v1:0","content":"ça"}
data: {"schema":"inference.response.chunk.v1","created":1721208440,"model_id":"meta.llama3-8b-instruct-v1:0","content":"va"}
data: {"schema":"inference.response.chunk.v1","created":1721208440,"model_id":"meta.llama3-8b-instruct-v1:0","content":"?"}
data: {"schema":"inference.response.usage.chunk.v1","usage":{"inputTokens":40,"outputTokens":131,"totalTokens":171},"created":1721208440,"model_id":"meta.llama3-8b-instruct-v1:0","content":""}
data: [DONE]
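A minimal TypeScript sketch of consuming this stream might look like the following, assuming the body is a standard server-sent-events stream of data: lines ending with data: [DONE] as shown above (Node 18+ assumed for fetch and process.stdout):

// Sketch: stream a chat completion and print content chunks as they arrive.
async function streamInference(apiKey: string, connectionId: string): Promise<void> {
  const res = await fetch("https://api.usageguard.com/v1/inference/chat", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "x-connection-id": connectionId,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "llama3:latest",
      messages: [
        { role: "user", content: [{ type: "text", text: "Hello, how are you?" }] },
      ],
      parameters: { maxTokens: 60, temperature: 0.7, stream: true },
    }),
  });
  if (!res.body) throw new Error("Response has no body to stream");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // SSE lines are newline-delimited; keep any partial line in the buffer.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice("data: ".length).trim();
      if (payload === "[DONE]") return;
      const chunk = JSON.parse(payload);
      if (chunk.schema === "inference.response.chunk.v1") {
        process.stdout.write(chunk.content); // print tokens as they stream in
      } else if (chunk.schema === "inference.response.usage.chunk.v1") {
        console.log(`\nTotal tokens: ${chunk.usage.totalTokens}`);
      }
    }
  }
}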