Inference API

The Inference API allows you to send inference requests to various LLM providers through a single, unified endpoint. This simplifies the process of integrating with multiple providers and managing your API keys.

Below, you will find detailed information about the available endpoints, including the required headers, parameters, and example requests and responses for each operation.

Inference Models

Use the Unified Inference API to send inference requests to different LLM providers.

Request Model

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| model | string | The model to use for the inference request; see our Supported Models. | Yes | N/A |
| messages | array | An array of message objects. | Yes | N/A |
| parameters | object | An object of parameters for the inference request. | No | N/A |

Parameters

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| temperature | number | The sampling temperature to use. | No | 0.7 |
| max_tokens | number | The maximum number of tokens to generate. | No | 256 |
| top_p | number | The cumulative probability for nucleus sampling. | No | 0.9 |
| stream | boolean | Whether to stream the response. | No | false |
| include_usage | boolean | Whether to include usage information in the response. | No | true |
| end_user_id | string | The ID of the end user. | No | N/A |
| stop | string or null | The stopping sequence for the generated text. | No | null |
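As a minimal sketch, the snippet below assembles a request body from the fields documented above. The model name and parameter values are illustrative placeholders only; note that the request examples in the API Reference section below use camelCase parameter names (maxTokens, topP), while the table above lists snake_case names.

```python
# Illustrative request body built from the documented fields.
# Model name and values are placeholders, not recommendations.
payload = {
    "model": "llama3:latest",
    "messages": [
        {"role": "user", "content": "Summarize this paragraph in one sentence."}
    ],
    "parameters": {
        "temperature": 0.7,     # sampling temperature (default 0.7)
        "max_tokens": 256,      # maximum tokens to generate (default 256)
        "top_p": 0.9,           # nucleus sampling cutoff (default 0.9)
        "stream": False,        # set True for a streamed response
        "include_usage": True,  # include usage info in the response (default true)
        "stop": None,           # optional stop sequence
    },
}
```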

Response Model

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| schema | string | The schema version of the response. | Yes | N/A |
| message | object | The message object containing role and content. | Yes | N/A |
| model | string | The model ID used for the inference request. | Yes | N/A |
| usage | object | An object containing usage information. | Yes | N/A |
| stop_reason | string | The reason why the generation stopped. | Yes | N/A |
| created | number | The timestamp when the response was created. | Yes | N/A |

Usage

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| prompt_tokens | number | The number of tokens in the prompt. | Yes | N/A |
| completion_tokens | number | The number of tokens in the completion. | Yes | N/A |
| total_tokens | number | The total number of tokens used. | Yes | N/A |

Stream Response Model

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| schema | string | The schema version of the response. | Yes | N/A |
| created | number | The timestamp when the response was created. | Yes | N/A |
| model_id | string | The model ID used for the inference request. | Yes | N/A |
| content | string | The content of the response. | Yes | N/A |
| usage | object | An object containing usage information. | Yes | N/A |

Shared

Messages

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| role | string | The role of the message sender (e.g., 'user'). | Yes | N/A |
| content | array | An array of content objects. | Yes | N/A |

Content

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| type | string | The type of content (e.g., 'text'). | Yes | N/A |
| text | string | The text content of the message. | Yes | N/A |

Usage

| Property Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| inputTokens | number | The number of input tokens. | Yes | N/A |
| outputTokens | number | The number of output tokens. | Yes | N/A |
| totalTokens | number | The total number of tokens used. | Yes | N/A |
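Once a response has been parsed, the message content and these usage fields can be read directly. The helper below is a minimal sketch that assumes the non-streaming response shape shown in the example in the API Reference section below; the function name is illustrative.

```python
# Minimal sketch: pull the generated text and token usage out of a parsed
# (non-streaming) response dictionary, assuming the example response shape below.
def summarize_response(response: dict) -> str:
    text = "".join(
        part["text"]
        for part in response["message"]["content"]
        if part.get("type") == "text"
    )
    usage = response["usage"]
    return (
        f"{text!r} "
        f"({usage['inputTokens']} in / {usage['outputTokens']} out / "
        f"{usage['totalTokens']} total tokens, stop: {response['stop_reason']})"
    )
```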

API Reference


Send an Inference Request

This example shows how to send an inference request to an LLM provider.

Required headers

  • Name
    x-api-key
    Type
    string
    Description

    The API key of your UsageGuard account.

Required fields

  • Name
    model
    Type
    string
    Description

    The model to use for the inference request.

  • Name
    messages
    Type
    array
    Description

    An array of message objects (typically a user message - the prompt - followed by an assistant message to include history if applicable).

  • Name
    parameters
    Type
    object
    Description

    An object of parameters for the inference request.

Optional fields (set inside the parameters object)

  • Name
    maxTokens
    Type
    number
    Description

    The maximum number of tokens to generate.

  • Name
    temperature
    Type
    number
    Description

    The sampling temperature to use.

  • Name
    topP
    Type
    number
    Description

    The cumulative probability for nucleus sampling.

  • Name
    frequencyPenalty
    Type
    number
    Description

    The penalty for repeated tokens.

  • Name
    presencePenalty
    Type
    number
    Description

    The penalty for new tokens.

Request

POST
/v1/inference/chat
POST /v1/inference/chat HTTP/1.1
Host: api.usageguard.com
x-api-key: Your_UsageGuard_API_key
x-connection-id: Your_Connection_ID
Content-Type: application/json

{
  "model": "llama3:latest",
  "messages": [
    {
      "role": "user",
      "content": "Translate the following English text to French: 'Hello, how are you?'"
    }
  ],
  "parameters": {
    "maxTokens": 60,
    "temperature": 0.7
  }
}
200 OK
{
  "schema": "inference.response.v1",
  "message": {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "Bonjour, comment ça va?"
      }
    ]
  },
  "model": "meta.llama3-8b-instruct-v1:0",
  "usage": {
    "inputTokens": 30,
    "outputTokens": 88,
    "totalTokens": 118
  },
  "stop_reason": "end_turn",
  "created": 1721205824
}
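The same request can be sent from code. The sketch below uses Python's requests library; the API key and connection ID are placeholders you must replace with your own values, and the printed access path assumes the response shape shown above.

```python
import requests

# Placeholders: replace with your own UsageGuard API key and connection ID.
API_KEY = "Your_UsageGuard_API_key"
CONNECTION_ID = "Your_Connection_ID"

resp = requests.post(
    "https://api.usageguard.com/v1/inference/chat",
    headers={
        "x-api-key": API_KEY,
        "x-connection-id": CONNECTION_ID,
        "Content-Type": "application/json",
    },
    json={
        "model": "llama3:latest",
        "messages": [
            {
                "role": "user",
                "content": "Translate the following English text to French: 'Hello, how are you?'",
            }
        ],
        "parameters": {"maxTokens": 60, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()
print(data["message"]["content"][0]["text"])  # e.g. "Bonjour, comment ça va?"
```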

Streaming Response Example

This example shows how to send a chat completion request to the Llama3 model and handle a streaming response.

Required headers

  • Name
    x-api-key
    Type
    string
    Description

    The API key of your UsageGuard account.

  • Name
    Content-Type
    Type
    string
    Description

    application/json

Required fields

  • Name
    model
    Type
    string
    Description

    The model to use for the inference request.

  • Name
    messages
    Type
    array
    Description

    An array of message objects (typically a user message - the prompt - followed by an assistant message to include history if applicable).

  • Name
    parameters
    Type
    object
    Description

    An object of parameters for the inference request.

  • Name
    stream
    Type
    boolean
    Description

    Whether to stream the response. Set inside the parameters object; must be true to receive a streamed response.

Request

POST
/v1/inference/chat
POST /v1/inference/chat HTTP/1.1
Host: api.usageguard.com
x-api-key: Your_UsageGuard_API_key
x-connection-id: Your_Connection_ID
Content-Type: application/json

{
  "model": "llama3:latest",
  "messages": [
    {
      "role": "user",
      "content": "Translate the following English text to French: 'Hello, how are you?'"
    }
  ],
  "parameters": {
    "maxTokens": 60,
    "temperature": 0.7,
    "stream": true
  }
}

Streaming Response

POST
/v1/inference/chat
...

data: {"schema":"inference.response.chunk.v1",
"created":1721208440,
"model_id":"meta.llama3-8b-instruct-v1:0","content":"ça"}

data: {"schema":"inference.response.chunk.v1",
"created":1721208440,
"model_id":"meta.llama3-8b-instruct-v1:0","content":"va"}

data: {"schema":"inference.response.chunk.v1",
"created":1721208440,
"model_id":"meta.llama3-8b-instruct-v1:0","content":"?"}

data: {"schema":"inference.response.usage.chunk.v1","usage":{"inputTokens":40,"outputTokens":131,"totalTokens":171},
"created":1721208440,"model_id":"meta.llama3-8b-instruct-v1:0","content":""}

data: [DONE]
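A minimal sketch of consuming this stream from Python follows. It assumes the server-sent-events format shown above (each chunk on a data: line, terminated by data: [DONE]) and uses the requests library; the API key and connection ID are placeholders you must replace with your own values.

```python
import json
import requests

# Placeholders: replace with your own UsageGuard API key and connection ID.
API_KEY = "Your_UsageGuard_API_key"
CONNECTION_ID = "Your_Connection_ID"

with requests.post(
    "https://api.usageguard.com/v1/inference/chat",
    headers={
        "x-api-key": API_KEY,
        "x-connection-id": CONNECTION_ID,
        "Content-Type": "application/json",
    },
    json={
        "model": "llama3:latest",
        "messages": [
            {
                "role": "user",
                "content": "Translate the following English text to French: 'Hello, how are you?'",
            }
        ],
        "parameters": {"maxTokens": 60, "temperature": 0.7, "stream": True},
    },
    stream=True,  # keep the connection open and read the body incrementally
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue  # skip keep-alive blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end of stream
        chunk = json.loads(payload)
        if chunk["schema"] == "inference.response.chunk.v1":
            print(chunk["content"], end="", flush=True)
        elif chunk["schema"] == "inference.response.usage.chunk.v1":
            print("\nusage:", chunk["usage"])
```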
