Inference API

The Inference API lets you interact with AI models through two main endpoints: /completion and /chat. This unified interface simplifies integration with multiple AI providers while maintaining consistent security and monitoring.

POST /v1/inference/completion

Completion Endpoint

Get a text completion from the AI model. Use this for generating content or completing partial text.

Request Headers

  • x-connection-id (string): Your connection identifier. You can find this in your dashboard under the Connections section.
  • x-api-key (string): Your API key for authentication. You can generate this from your dashboard under the API Keys section.
  • traceparent (string): OpenTelemetry trace parent for distributed tracing.
  • tracestate (string): OpenTelemetry trace state information.

Request Body

  • type (string): Always set to "Inference".
  • version (string): API version (currently "2.0").
  • model (string): The ID of the AI model to use for inference.
  • messages (array): Array of messages in the conversation. Each message has:
    • role: "user" | "assistant" | "system"
    • content: array of content objects with type and text
  • parameters (object):
    • end_user_id (string): Unique identifier for the end user.
    • temperature (number): Sampling temperature (default: 0.7).
    • max_tokens (integer): Maximum number of tokens to generate.
    • top_p (number): Nucleus sampling parameter (default: 0.9).
    • frequency_penalty (number): Frequency penalty (default: 0.0).
    • presence_penalty (number): Presence penalty (default: 0.0).
    • stream (boolean): Whether to stream the response (default: false).
    • include_usage (boolean): Whether to include usage information (default: true).
    • json_schema (object): JSON schema for structured output.
    • stop (string): Stop sequence for generation.
  • session_id (string): Unique identifier for this chat session.
  • tools (array): Array of tools available to the model.
  • auto_continuation (boolean): Whether to enable auto-continuation (default: false). When enabled, UsageGuard appends a continuation call to the model with the tool result.
  • agent_parameters (object): Additional parameters for agent behavior.
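As a sanity check before sending, the body described above can be assembled in Python (a sketch; the helper name and its defaults are illustrative, while the field names and default values come from the reference above):

```python
import json

def build_completion_request(model, prompt, session_id, **params):
    """Assemble a /v1/inference/completion request body.

    Parameter defaults mirror the defaults listed above.
    """
    defaults = {
        "temperature": 0.7,
        "top_p": 0.9,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
        "stream": False,
        "include_usage": True,
    }
    defaults.update(params)  # caller overrides, e.g. max_tokens, end_user_id
    return {
        "type": "Inference",   # always "Inference"
        "version": "2.0",      # current API version
        "model": model,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
        "parameters": defaults,
        "session_id": session_id,
        "auto_continuation": False,
    }

body = build_completion_request(
    "gpt-4", "The capital of France is", "comp_abc123",
    end_user_id="user_123", max_tokens=50,
)
payload = json.dumps(body)  # ready to send as the request body
```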

Request

POST /v1/inference/completion
curl -X POST https://api.usageguard.com/v1/inference/completion \
  -H "x-connection-id: {connection_id}" \
  -H "x-api-key: {api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "Inference",
    "version": "2.0",
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "The capital of France is"
          }
        ]
      }
    ],
    "parameters": {
      "end_user_id": "user_123",
      "temperature": 0.7,
      "max_tokens": 50,
      "top_p": 0.9,
      "frequency_penalty": 0.0,
      "presence_penalty": 0.0,
      "stream": false,
      "include_usage": true
    },
    "session_id": "comp_abc123",
    "auto_continuation": false
  }'

Response

{
  "id": "resp_xyz789",
  "model": "gpt-4",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": " Paris, a city known for its rich history, culture, and iconic landmarks like the Eiffel Tower."
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 15,
    "total_tokens": 20
  }
}
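The generated text and token counts can be pulled out of a response like the one above (a minimal sketch assuming the non-streaming response shape shown here):

```python
import json

# Sample response in the shape documented above.
response_json = """
{
  "id": "resp_xyz789",
  "model": "gpt-4",
  "choices": [
    {"message": {"role": "assistant",
                 "content": [{"type": "text", "text": " Paris."}]}}
  ],
  "usage": {"prompt_tokens": 5, "completion_tokens": 15, "total_tokens": 20}
}
"""

resp = json.loads(response_json)
# Concatenate the text parts of the first choice's message.
text = "".join(part["text"]
               for part in resp["choices"][0]["message"]["content"]
               if part["type"] == "text")
total_tokens = resp["usage"]["total_tokens"]
```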

400: Bad Request

{
  "error": "Bad Request",
  "message": "Invalid request. Verify your model ID, prompt format, and parameters."
}

POST /v1/inference/chat

Chat Endpoint

Send a chat request to the AI model. Use this for conversational interactions where context is important.

Request Headers

  • x-connection-id (string): Your connection identifier. You can find this in your dashboard under the Connections section.
  • x-api-key (string): Your API key for authentication. You can generate this from your dashboard under the API Keys section.
  • traceparent (string): OpenTelemetry trace parent for distributed tracing.
  • tracestate (string): OpenTelemetry trace state information.

Request Body

  • type (string): Always set to "Inference".
  • version (string): API version (currently "2.0").
  • model (string): The ID of the AI model to use for inference.
  • messages (array): Array of messages in the conversation. Each message has:
    • role: "user" | "assistant" | "system"
    • content: array of content objects with type and text
  • parameters (object):
    • end_user_id (string): Unique identifier for the end user.
    • temperature (number): Sampling temperature (default: 0.7).
    • max_tokens (integer): Maximum number of tokens to generate.
    • top_p (number): Nucleus sampling parameter (default: 0.9).
    • frequency_penalty (number): Frequency penalty (default: 0.0).
    • presence_penalty (number): Presence penalty (default: 0.0).
    • stream (boolean): Whether to stream the response (default: false).
    • include_usage (boolean): Whether to include usage information (default: true).
    • json_schema (object): JSON schema for structured output.
    • stop (string): Stop sequence for generation.
  • session_id (string): Unique identifier for this chat session.
  • tools (array): Array of tools available to the model. Each tool has:
    • name (string): The name of the tool.
    • description (string): A description of what the tool does.
    • input_schema (object): JSON schema defining the input parameters for the tool.
  • auto_continuation (boolean): Whether to enable auto-continuation (default: false). When enabled, UsageGuard appends a continuation call to the model with the tool result.
  • agent_parameters (object): Additional parameters for agent behavior.

Request

POST /v1/inference/chat
curl -X POST https://api.usageguard.com/v1/inference/chat \
  -H "x-connection-id: {connection_id}" \
  -H "x-api-key: {api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "Inference",
    "version": "2.0",
    "model": "gpt-4",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a helpful assistant."
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is the capital of France?"
          }
        ]
      }
    ],
    "parameters": {
      "end_user_id": "user_123",
      "temperature": 0.7,
      "max_tokens": 100,
      "top_p": 0.9,
      "frequency_penalty": 0.0,
      "presence_penalty": 0.0,
      "stream": false,
      "include_usage": true
    },
    "session_id": "chat_abc123",
    "tools": [
      {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "input_schema": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "auto_continuation": false
  }'

Response

{
  "id": "resp_xyz789",
  "model": "gpt-4",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "text",
            "text": "The capital of France is Paris."
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 5,
    "total_tokens": 15
  }
}

400: Bad Request

{
  "error": "Bad Request",
  "message": "Invalid request. Check your model ID, message format, and parameters."
}

401: Unauthorized

{
  "error": "Unauthorized",
  "message": "Invalid API key or connection ID."
}
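Using only the standard library, the headers and the 400/401 error bodies above can be handled like this (a sketch; the helper names are illustrative, and no network call happens until `urlopen` is invoked):

```python
import json
import urllib.error
import urllib.request

def make_chat_request(body, connection_id, api_key):
    """Build a POST request for /v1/inference/chat with the required headers."""
    req = urllib.request.Request(
        "https://api.usageguard.com/v1/inference/chat",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
    )
    req.add_header("x-connection-id", connection_id)
    req.add_header("x-api-key", api_key)
    req.add_header("Content-Type", "application/json")
    return req

def send(req):
    """Send the request, mapping 400/401 bodies to Python exceptions."""
    try:
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        detail = json.load(err)  # {"error": ..., "message": ...}
        if err.code == 401:
            raise RuntimeError(
                "Check x-api-key / x-connection-id: " + detail["message"])
        if err.code == 400:
            raise ValueError(detail["message"])
        raise
```

A call site would build the body, then `send(make_chat_request(body, connection_id, api_key))`.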

Tool Requests and Responses

When working with tools in the chat or completion endpoints, the conversation follows a specific pattern for tool calls and responses. Here's how to handle tool interactions:

Tool Call Pattern

The interaction typically follows these steps:

  1. Assistant Tool Call: The model requests to use a tool
  2. Tool Result: The system executes the tool and returns results
  3. Assistant Continuation: The model continues the conversation using the tool results

Here's an example of the complete flow:

// Step 1: Assistant makes a tool call
{
  "role": "assistant",
  "content": "I'll help you find some great restaurants!",
  "tool_calls": [
    {
      "id": "tool_call_001",
      "type": "function",
      "function": {
        "name": "sys_search_documents",
        "arguments": "{\"search_phrases\": [\"Italian restaurants\"], \"limit\": 3}"
      }
    }
  ]
}

// Step 2: System returns tool result
{
  "role": "tool",
  "tool_call_id": "tool_call_001",
  "content": "{ \"SearchResults\": [ { \"restaurant_name\": \"Pasta Paradise\", \"cuisine\": \"Italian\", \"rating\": 4.5, \"address\": \"123 Main St\" } ] }"
}

// Step 3: Assistant continues with response
{
  "role": "assistant",
  "content": "I found a great Italian restaurant for you: 'Pasta Paradise' with a 4.5 rating, located at 123 Main St."
}
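Step 2 of the flow above can be sketched as a small loop that executes each requested tool and builds the matching tool messages (illustrative only; the local `TOOLS` registry and its stub result are hypothetical, while the message shapes follow the examples above):

```python
import json

# Hypothetical local tool implementations, keyed by tool name.
TOOLS = {
    "sys_search_documents": lambda search_phrases, limit=5: {
        "SearchResults": [{"restaurant_name": "Pasta Paradise",
                           "cuisine": "Italian", "rating": 4.5,
                           "address": "123 Main St"}][:limit]
    }
}

def run_tool_calls(assistant_message):
    """Execute each tool call and return the matching 'tool' role messages."""
    tool_messages = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])   # arguments arrive as a JSON string
        result = TOOLS[fn["name"]](**args)
        tool_messages.append({
            "role": "tool",
            "tool_call_id": call["id"],      # must echo the original call id
            "content": json.dumps(result),   # tool results are JSON strings
        })
    return tool_messages

assistant = {
    "role": "assistant",
    "content": "I'll help you find some great restaurants!",
    "tool_calls": [{
        "id": "tool_call_001",
        "type": "function",
        "function": {
            "name": "sys_search_documents",
            "arguments": "{\"search_phrases\": [\"Italian restaurants\"], \"limit\": 3}",
        },
    }],
}
results = run_tool_calls(assistant)
```

The returned messages are appended to the conversation before the continuation request (step 3).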

Complete Request Example

Here's how the entire request would look when making a tool-enabled chat request:

{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "Can you recommend some Italian restaurants?"
    },
    {
      "role": "assistant",
      "content": "I'll help you find some great restaurants!",
      "tool_calls": [
        {
          "id": "tool_call_001",
          "type": "function",
          "function": {
            "name": "sys_search_documents",
            "arguments": "{\"search_phrases\": [\"Italian restaurants\"], \"limit\": 3}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "tool_call_001",
      "content": "{ \"SearchResults\": [ { \"restaurant_name\": \"Pasta Paradise\", \"cuisine\": \"Italian\", \"rating\": 4.5, \"address\": \"123 Main St\" } ] }"
    }
  ],
  "parameters": {
    "end_user_id": "user_123",
    "temperature": 0.7,
    "max_tokens": 100,
    "top_p": 0.9,
    "stream": false,
    "include_usage": true
  },
  "session_id": "chat_abc123",
  "tools": [
    {
      "name": "sys_search_documents",
      "description": "Search for restaurants and dining options in the system",
      "input_schema": {
        "type": "object",
        "properties": {
          "search_phrases": {
            "type": "array",
            "items": {
              "type": "string"
            },
            "description": "Array of search phrases"
          },
          "limit": {
            "type": "integer",
            "description": "Maximum number of results to return"
          }
        },
        "required": ["search_phrases"]
      }
    }
  ],
  "auto_continuation": false
}

Important Notes

  • Each tool call must have a unique tool_call_id
  • The tool response must reference the same tool_call_id as the original call
  • The auto_continuation parameter in your request can be set to true to handle the continuation step automatically if you are using built-in tools (e.g. sys_search_documents) or our experimental external tools support.
  • Tool results should be valid JSON strings that can be parsed by the model
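The rules above can be checked client-side before sending (a sketch; this validator is illustrative and not part of the API):

```python
import json

def validate_tool_messages(messages):
    """Check that tool_call_ids are unique, every tool message references a
    known tool_call_id, and tool results are valid JSON strings."""
    call_ids = set()
    for msg in messages:
        for call in msg.get("tool_calls", []):
            if call["id"] in call_ids:
                raise ValueError("duplicate tool_call_id: " + call["id"])
            call_ids.add(call["id"])
    for msg in messages:
        if msg.get("role") == "tool":
            if msg["tool_call_id"] not in call_ids:
                raise ValueError("unmatched tool_call_id: " + msg["tool_call_id"])
            json.loads(msg["content"])  # must be a valid JSON string
    return True

ok = validate_tool_messages([
    {"role": "assistant",
     "tool_calls": [{"id": "tool_call_001", "type": "function",
                     "function": {"name": "sys_search_documents",
                                  "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "tool_call_001",
     "content": "{\"SearchResults\": []}"},
])
```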
