
Model-based Evaluations in Langfuse

Model-based evaluations are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, LLMs are used to score a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.

There are two ways to run model-based evaluations in Langfuse:

  1. Via the Python SDK or API
  2. Via the Langfuse UI (beta)

Via Python SDK

You can run your own model-based evals on data in Langfuse via the Python SDK. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case.

Popular libraries:

  • OpenAI Evals
  • Langchain Evaluators (Cookbook)
  • RAGAS for RAG applications (Cookbook)
  • UpTrain evals (Cookbook)
  • Whylabs Langkit
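
For illustration, a minimal sketch of this pattern, assuming the v2 Python SDK: fetch recent traces, score each one, and write the result back as a score. The `evaluate_toxicity` function is a hypothetical stand-in for whichever eval library or LLM-as-a-judge call you use.

```python
from langfuse import Langfuse

# Assumes LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY (and optionally
# LANGFUSE_HOST) are set in the environment.
langfuse = Langfuse()

def evaluate_toxicity(text: str) -> float:
    # Hypothetical placeholder: plug in OpenAI Evals, RAGAS, UpTrain,
    # Langkit, or your own LLM-as-a-judge call here.
    return 0.0

# Fetch a batch of recent traces from the project.
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    if trace.output is None:
        continue
    value = evaluate_toxicity(str(trace.output))
    # Attach the result to the trace as a score.
    langfuse.score(
        trace_id=trace.id,
        name="toxicity",
        value=value,
        comment="model-based eval via Python SDK",
    )

# Make sure queued events are sent before the script exits.
langfuse.flush()
```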

Via Langfuse UI (beta)

Where is this feature available?
  • Hobby
  • Pro
  • Team
  • Self Hosted

Ping us via the chat widget if you are interested in joining the private beta.


Create an eval template

First, we need to configure the evaluation template:

  • Select the model and its parameters.
  • Define the evaluation prompt together with the variables that will be inserted into the prompt. Example prompts can be found here.
  • We use function calling to extract the evaluation output. Specify the descriptions for the function parameters score and reasoning.
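
For illustration, the tool definition behind this step might look like the following sketch, written in OpenAI's function-calling format. This is an assumption about the general mechanism, not Langfuse's actual internal schema; the descriptions you enter in the template would populate the description fields below.

```python
# Sketch of a function-calling schema that forces the judge model to
# return a numeric score plus its reasoning.
evaluation_tool = {
    "type": "function",
    "function": {
        "name": "evaluate",
        "description": "Score the submission against the evaluation criteria.",
        "parameters": {
            "type": "object",
            "properties": {
                "reasoning": {
                    "type": "string",
                    "description": "Step-by-step justification for the score.",
                },
                "score": {
                    "type": "number",
                    "description": "Score from 0 (fails the criteria) to 1 (fully meets the criteria).",
                },
            },
            "required": ["reasoning", "score"],
        },
    },
}

# The schema would then be passed to the judge model, e.g. as
# tools=[evaluation_tool] in an OpenAI chat.completions.create call.
```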

Create an eval config

Second, we need to specify when to run the evaluation template we created above.

  • Select the evaluation template to run.
  • When running the evaluations, scores will be attached to the Traces. Specify the score name.
  • Filter which newly ingested traces should be evaluated. (Coming soon: select existing traces)
  • Specify which part of a trace should be inserted into the prompt. Use the Input from the Trace for the query variable and the Output from the Generation for the generation variable. In this example, choose the latest Generation with the name llm-generation.
  • Modify sampling to execute the evaluations on a randomly chosen subset of the traces.
  • Add a delay to the evaluation execution once a trace is ingested to ensure the trace is fully processed before the evaluation is executed.
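
Putting the mapping together, an evaluation prompt wired to these variables might look like the following sketch (the wording and the double-brace placeholder syntax are illustrative, not an official template). At run time, {{query}} is filled with the Input from the Trace and {{generation}} with the Output of the selected Generation.

```
Evaluate the generated answer for hallucinations.

Query: {{query}}
Generated answer: {{generation}}

Return a score of 1 if the answer is fully grounded in the query,
and 0 if it contains fabricated information.
```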

See scores

Upon receiving new traces, navigate to the trace detail view to see the associated scores.
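
Scores can also be retrieved programmatically. A minimal sketch, again assuming the v2 Python SDK, with "your-trace-id" as a placeholder:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch a single trace by id; the response includes its attached scores.
trace = langfuse.fetch_trace("your-trace-id").data

for score in trace.scores:
    print(score.name, score.value, score.comment)
```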

