Evaluation (LLM-as-a-Judge)
LLM-as-a-Judge is an evaluation approach that uses a language model to assess the quality and characteristics of AI responses, drawing on the judge model's understanding of language, context, and reasoning to provide comprehensive evaluations.
When setting up a connection, you can enable Evaluation mode (Post-processing) to analyze each message turn (a User/AI exchange) for potential issues such as hallucinations, bias, or inconsistencies in the response.
Standard Evaluation Framework
Out of the box, UsageGuard provides a standard evaluation framework that assesses the following aspects:
| Name | Type | Description |
| --- | --- | --- |
| `value` | number | Primary evaluation score |
| `overall_quality_score` | number | Overall quality assessment of the response |
| `relevance_score` | number | How relevant the response is to the user's query |
| `coherence_score` | number | How coherent and well-structured the response is |
| `factuality_score` | number | How factually accurate the response is |
| `instruction_adherence_score` | number | How well the response follows the given instructions |
| `user_query_safety_score` | number | Safety assessment of the user's query |
| `user_query_safety_flags` | array | Specific safety concerns identified in the user's query |
| `outcome` | string | Overall outcome of the evaluation |
| `topic` | string | Main topic or subject of the conversation |
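For reference, an evaluation result for a single message turn might look like the following. The field names come from the table above; the specific values, the outcome label, and the exact JSON shape are illustrative rather than a guaranteed response format.

```json
{
  "value": 0.87,
  "overall_quality_score": 0.85,
  "relevance_score": 0.92,
  "coherence_score": 0.9,
  "factuality_score": 0.8,
  "instruction_adherence_score": 0.88,
  "user_query_safety_score": 0.98,
  "user_query_safety_flags": [],
  "outcome": "pass",
  "topic": "Billing and subscription plans"
}
```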
Custom Evaluation
You can customize the evaluation framework by providing your own custom evaluation prompt in the Prompts section.
Required Parameters
For any custom evaluation prompt, you will need to define the following parameters:
Inputs
| Name | Type | Description |
| --- | --- | --- |
| System Message | string | The system message used in the conversation |
| User Message | string | The user's input message |
| Model Response | string | The AI model's response to evaluate |
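As a sketch of how these inputs fit together, a custom evaluation prompt might look something like the block below. The `{{...}}` placeholders are illustrative only; use whatever interpolation syntax the Prompts section supports for the System Message, User Message, and Model Response inputs.

```
You are an impartial evaluator. Given the conversation below, rate how well
the model response answers the user while respecting the system message.

System Message:
{{system_message}}

User Message:
{{user_message}}

Model Response:
{{model_response}}

Report your rating by calling the evaluation tool with a "value" between 0 and 1,
where 1 means the response fully and safely addresses the user's request.
```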
Required Tool Input Schema
```json
{
  "type": "object",
  "properties": {
    "value": {
      "type": "number",
      "format": "float",
      "minimum": 0,
      "maximum": 1,
      "description": "Primary numeric score for evaluation (recommended range: 0-1)"
    }
  },
  "required": ["value"]
}
```
The `value` field is required for consistent processing. The recommended range is 0-1.
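If your custom evaluation returns more than a single score, the tool input schema can be extended beyond this required minimum, assuming additional properties are accepted alongside `value`. The `reasoning` and `flags` fields in the sketch below are hypothetical additions, not part of the required schema.

```json
{
  "type": "object",
  "properties": {
    "value": {
      "type": "number",
      "minimum": 0,
      "maximum": 1,
      "description": "Primary numeric score for evaluation"
    },
    "reasoning": {
      "type": "string",
      "description": "Short explanation of the score (illustrative extra field)"
    },
    "flags": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Specific issues detected in the response (illustrative extra field)"
    }
  },
  "required": ["value"]
}
```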
Viewing Evaluation Results
To review the evaluation results for each message turn:
- Navigate to the Sessions section in your dashboard
- Locate the relevant session and trace
- Expand the trace to view the detailed evaluation results, including all scores and metrics
The evaluation results are displayed alongside the conversation, making it easy to analyze the quality and characteristics of each AI response.