Evaluations

Evaluations is in alpha

Evaluations is currently in alpha. To enable it, opt in from your feature previews menu. We'd love to hear your feedback as we develop this feature.

Evaluations automatically assess the quality of your LLM generations using an "LLM-as-a-judge" approach. Each evaluation runs a customizable prompt against your generations and returns a pass/fail result with reasoning.

{/* TODO: Add screenshot of evaluations list */}

Why use evaluations?

  • Monitor output quality at scale – Automatically check if generations are helpful, relevant, or safe without manual review.
  • Detect problematic content – Catch hallucinations, toxicity, or jailbreak attempts before they reach users.
  • Track quality trends – See pass rates across models, prompts, or user segments over time.
  • Debug with reasoning – Each evaluation provides an explanation for its decision, making it easy to understand failures.

How evaluations work

When a generation is captured, PostHog samples it based on your configured rate (0.1% to 100%). If sampled, the generation's input and output are sent to an LLM judge with your evaluation prompt. The judge returns a boolean pass/fail result plus reasoning, which is stored and linked to the original generation.

You can optionally filter which generations get evaluated using property filters. For example, only evaluate generations from production, from a specific model, or above a certain cost threshold.
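
To make the flow concrete, here is a minimal Python sketch of the pipeline described above, with the judge call stubbed out. The function names and data shapes are illustrative assumptions, not PostHog internals.

```python
import random

# Illustrative sketch of the evaluation pipeline. All names here are
# hypothetical; PostHog runs this logic server-side.

def should_evaluate(generation: dict, sampling_rate: float, filters: dict) -> bool:
    """Decide whether a captured generation gets evaluated."""
    # 1. Property filters narrow the candidate set first.
    for key, expected in filters.items():
        if generation.get("properties", {}).get(key) != expected:
            return False
    # 2. The sampling rate (0.1% to 100%) decides how many candidates are judged.
    return random.random() < sampling_rate

def run_evaluation(generation: dict, judge_prompt: str) -> dict:
    """Send the generation's input and output to an LLM judge (stubbed here)."""
    judge_input = (
        f"{judge_prompt}\n\n"
        f"User input: {generation['input']}\n"
        f"Assistant output: {generation['output']}"
    )
    # A real implementation calls an LLM here; we stub the verdict.
    return {"passed": True, "reasoning": "Stubbed judge response."}

generation = {
    "input": "How do I reset my password?",
    "output": "Click 'Forgot password' on the login page.",
    "properties": {"environment": "production", "model": "gpt-4o"},
}

if should_evaluate(generation, sampling_rate=0.05, filters={"environment": "production"}):
    result = run_evaluation(generation, "Is the response relevant to the input?")
    print(result["passed"], "-", result["reasoning"])
```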

Built-in templates

PostHog provides five pre-built evaluation templates to get you started:

| Template | What it checks | Best for |
| --- | --- | --- |
| Relevance | Whether the output addresses the user's input | Customer support bots, Q&A systems |
| Helpfulness | Whether the response is useful and actionable | Chat assistants, productivity tools |
| Jailbreak | Attempts to bypass safety guardrails | Security-sensitive applications |
| Hallucination | Made-up facts or unsupported claims | RAG systems, knowledge bases |
| Toxicity | Harmful, offensive, or inappropriate content | User-facing applications |

Creating an evaluation

  1. Navigate to LLM analytics > Evaluations
  2. Click New evaluation
  3. Choose a template or start from scratch
  4. Configure the evaluation (a sketch of the resulting settings follows this list):
    • Name: A descriptive name for the evaluation
    • Prompt: The instructions for the LLM judge (templates provide sensible defaults)
    • Sampling rate: Percentage of generations to evaluate (0.1% – 100%)
    • Property filters (optional): Narrow which generations to evaluate
  5. Enable the evaluation and click Save
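
As a rough mental model, the settings above amount to a configuration like the following sketch. The field names are hypothetical and do not reflect PostHog's API.

```python
# Hypothetical shape of an evaluation's configuration, mirroring the
# form fields above. Field names are illustrative, not PostHog's schema.
evaluation = {
    "name": "Brand voice check",
    "prompt": "Return true if the response is friendly and jargon-free...",
    "sampling_rate": 0.05,                      # evaluate 5% of matching generations
    "filters": {"environment": "production"},   # optional property filters
    "enabled": True,
}
```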

{/* TODO: Add screenshot of evaluation creation form */}

Viewing results

The Evaluations tab shows all your evaluations with their pass rates and recent activity. Click an evaluation to see its run history, including individual pass/fail results and the reasoning from the LLM judge.

You can also filter generations by evaluation results or create insights based on evaluation data to build quality monitoring dashboards.
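
If you export or query evaluation results yourself, computing a pass rate is straightforward. A minimal sketch, assuming hypothetical result records shaped like the run history entries:

```python
# Hedged sketch: aggregate a pass rate from evaluation results.
# The record shape is an assumption for illustration.
results = [
    {"evaluation": "Relevance", "passed": True},
    {"evaluation": "Relevance", "passed": False},
    {"evaluation": "Relevance", "passed": True},
]

passed = sum(r["passed"] for r in results)
pass_rate = passed / len(results)
print(f"Pass rate: {pass_rate:.0%}")  # Pass rate: 67%
```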

{/* TODO: Add screenshot of evaluation results */}

Writing custom prompts

When creating a custom evaluation, your prompt should instruct the LLM judge to return true (pass) or false (fail) along with reasoning. The judge receives the generation's input and output for context.

Tips for effective evaluation prompts:

  • Be specific about what constitutes a pass or fail
  • Include examples of edge cases when relevant
  • Keep the prompt concise but comprehensive

Example custom prompt:

```text
You are evaluating whether an LLM response follows our brand voice guidelines.
Given the user input and assistant response, determine if the response:
- Uses a friendly, conversational tone
- Avoids corporate jargon
- Addresses the user by name when provided
Return true if the response follows these guidelines, false otherwise.
Explain your reasoning briefly.
```
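
The judge's reply pairs a boolean verdict with its reasoning. Here is a hedged sketch of parsing such a reply, assuming a JSON shape that is an illustration rather than PostHog's actual response format:

```python
import json

# Hedged sketch: one way a judge reply pairing a boolean verdict with
# reasoning might look. The JSON shape is an assumption for illustration.
raw_judge_reply = json.dumps({
    "passed": False,
    "reasoning": "The response uses corporate jargon ('synergize', 'leverage').",
})

verdict = json.loads(raw_judge_reply)
status = "pass" if verdict["passed"] else "fail"
print(f"{status}: {verdict['reasoning']}")
```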

Pricing

Each evaluation run counts as one LLM analytics event toward your quota. Use sampling rates strategically to balance coverage and cost – often 5-10% sampling provides sufficient signal for quality monitoring.
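
A quick worked example of the quota math, using assumed volumes:

```python
# Worked example with assumed numbers: evaluation events consumed per month.
generations_per_month = 100_000
sampling_rate = 0.05  # 5% of generations are sampled for evaluation

evaluation_events = int(generations_per_month * sampling_rate)
print(evaluation_events)  # 5000 LLM analytics events toward the quota
```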
