Evaluation (LLM-as-a-Judge)

LLM-as-a-Judge is a powerful evaluation approach that uses language models to assess the quality and characteristics of AI responses. This method leverages the model's understanding of language, context, and reasoning to provide comprehensive evaluations of AI outputs.

When setting up a connection, you can enable Evaluation mode (Post-processing) to analyze each message turn (the User/AI exchange) for potential issues such as hallucinations, bias, or inconsistencies in the response.
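For illustration only, a connection with evaluation enabled might conceptually be represented as the snippet below. The field names (connection, evaluation, enabled, mode) are hypothetical and are not UsageGuard's actual connection schema; in practice the mode is toggled from the connection setup screen in the dashboard.

{
  "connection": {
    "name": "production-chat",
    "evaluation": {
      "enabled": true,
      "mode": "post-processing"
    }
  }
}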

Standard Evaluation Framework

Out of the box, UsageGuard provides a standard evaluation framework that assesses the following aspects:

  • value (number): Primary evaluation score
  • overall_quality_score (number): Overall quality assessment of the response
  • relevance_score (number): How relevant the response is to the user's query
  • coherence_score (number): How coherent and well-structured the response is
  • factuality_score (number): How factually accurate the response is
  • instruction_adherence_score (number): How well the response follows the given instructions
  • user_query_safety_score (number): Safety assessment of the user's query
  • user_query_safety_flags (array): Specific safety concerns identified in the user's query
  • outcome (string): Overall outcome of the evaluation
  • topic (string): Main topic or subject of the conversation
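As an illustration, an evaluation result built from the fields above might look like the following. The scores, the "pass" outcome label, and the flat JSON layout are assumptions chosen for readability; the exact payload shape shown in your dashboard may differ.

{
  "value": 0.86,
  "overall_quality_score": 0.9,
  "relevance_score": 0.88,
  "coherence_score": 0.92,
  "factuality_score": 0.85,
  "instruction_adherence_score": 0.9,
  "user_query_safety_score": 1,
  "user_query_safety_flags": [],
  "outcome": "pass",
  "topic": "billing support"
}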

Custom Evaluation

You can customize the evaluation framework by providing your own custom evaluation prompt in the Prompts section.

Required Parameters

For any custom evaluation prompt, you will need to define the following parameters:

Inputs

  • System Message (string): The system message used in the conversation
  • User Message (string): The user's input message
  • Model Response (string): The AI model's response to evaluate
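A minimal custom evaluation prompt that wires in these three inputs could look like the sketch below. The {{...}} placeholder syntax is illustrative only; substitute whatever variable format the Prompts section expects.

You are an evaluation model. Assess the assistant's reply for factual accuracy
and adherence to the system instructions.

System Message: {{system_message}}
User Message: {{user_message}}
Model Response: {{model_response}}

Return a single score between 0 and 1 in the "value" field of the evaluation tool.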

Required Tool Input Schema

{
  "type": "object",
  "properties": {
    "value": {
      "type": "number",
      "format": "float",
      "minimum": 0,
      "maximum": 1,
      "description": "Primary numeric score for evaluation (recommended range: 0-1)"
    }
  },
  "required": ["value"]
}

The value field is required for consistent processing. The recommended range is 0-1.
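For example, the smallest payload that satisfies the schema above is a single value score, as shown below. The schema does not declare additionalProperties: false, so returning extra fields alongside value (such as a short rationale) should generally be tolerated, though that is an assumption rather than a documented guarantee.

{
  "value": 0.85
}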

Viewing Evaluation Results

You can review the evaluation results for each message turn in the Sessions section of the dashboard:

  1. Navigate to the Sessions section in your dashboard
  2. Locate the relevant session and trace
  3. Expand the trace to view the detailed evaluation results, including all scores and metrics

The evaluation results are displayed alongside the conversation, making it easy to analyze the quality and characteristics of each AI response.
