Evaluation (LLM-as-a-Judge)

LLM-as-a-Judge is a powerful evaluation approach that uses language models to assess the quality and characteristics of AI responses. This method leverages the model's understanding of language, context, and reasoning to provide comprehensive evaluations of AI outputs.

When setting up a connection, you can enable Evaluation mode (Post-processing) to analyze each message turn (the User/AI exchange) for potential issues such as hallucinations, bias, or inconsistencies in the response.
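For illustration only, a connection with evaluation enabled might conceptually be represented as the snippet below. The field names (connection, evaluation, enabled, mode) are hypothetical and are not UsageGuard's actual connection schema; in practice the mode is toggled from the connection setup screen in the dashboard.

{
  "connection": {
    "name": "production-chat",
    "evaluation": {
      "enabled": true,
      "mode": "post-processing"
    }
  }
}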

Standard Evaluation Framework

Out of the box, UsageGuard provides a standard evaluation framework that assesses the following aspects:

  • value (number): Primary evaluation score
  • overall_quality_score (number): Overall quality assessment of the response
  • relevance_score (number): How relevant the response is to the user's query
  • coherence_score (number): How coherent and well-structured the response is
  • factuality_score (number): How factually accurate the response is
  • instruction_adherence_score (number): How well the response follows the given instructions
  • user_query_safety_score (number): Safety assessment of the user's query
  • user_query_safety_flags (array): Specific safety concerns identified in the user's query
  • outcome (string): Overall outcome of the evaluation
  • topic (string): Main topic or subject of the conversation
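As an illustration, an evaluation result built from the fields above might look like the following. The scores, the "pass" outcome label, and the flat JSON layout are assumptions chosen for readability; the exact payload shape shown in your dashboard may differ.

{
  "value": 0.86,
  "overall_quality_score": 0.9,
  "relevance_score": 0.88,
  "coherence_score": 0.92,
  "factuality_score": 0.85,
  "instruction_adherence_score": 0.9,
  "user_query_safety_score": 1,
  "user_query_safety_flags": [],
  "outcome": "pass",
  "topic": "billing support"
}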

Custom Evaluation

You can customize the evaluation framework by providing your own custom evaluation prompt in the Prompts section.

Required Parameters

For any custom evaluation prompt, you will need to define the following parameters:

Inputs

  • System Message (string): The system message used in the conversation
  • User Message (string): The user's input message
  • Model Response (string): The AI model's response to evaluate
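A minimal custom evaluation prompt that wires in these three inputs could look like the sketch below. The {{...}} placeholder syntax is illustrative only; substitute whatever variable format the Prompts section expects.

You are an evaluation model. Assess the assistant's reply for factual accuracy
and adherence to the system instructions.

System Message: {{system_message}}
User Message: {{user_message}}
Model Response: {{model_response}}

Return a single score between 0 and 1 in the "value" field of the evaluation tool.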

Required Tool Input Schema

{
  "type": "object",
  "properties": {
    "value": {
      "type": "number",
      "format": "float",
      "minimum": 0,
      "maximum": 1,
      "description": "Primary numeric score for evaluation (recommended range: 0-1)"
    }
  },
  "required": ["value"]
}

The value field is required for consistent processing. The recommended range is 0-1.
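For example, the smallest payload that satisfies the schema above is a single value score, as shown below. The schema does not declare additionalProperties: false, so returning extra fields alongside value (such as a short rationale) should generally be tolerated, though that is an assumption rather than a documented guarantee.

{
  "value": 0.85
}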

Viewing Evaluation Results

You can review the evaluation results for each message turn in the Sessions section of the dashboard:

  1. Navigate to the Sessions section in your dashboard
  2. Locate the relevant session and trace
  3. Expand the trace to view the detailed evaluation results, including all scores and metrics

The evaluation results are displayed alongside the conversation, making it easy to analyze the quality and characteristics of each AI response.
