Model-based Evaluations in Langfuse
Model-based evaluations are a powerful tool to automate the evaluation of LLM applications integrated with Langfuse. With model-based evaluations, LLMs score a specific session, trace, or LLM call in Langfuse on criteria such as correctness, toxicity, or hallucinations.
There are two ways to run model-based evaluations in Langfuse:
Via Python SDK
You can run your own model-based evals on data in Langfuse via the Python SDK. This gives you full flexibility to run various eval libraries on your production data and discover which work well for your use case.
Popular libraries:
- OpenAI Evals
- Langchain Evaluators (Cookbook)
- RAGAS for RAG applications (Cookbook)
- UpTrain evals (Cookbook)
- Whylabs Langkit
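For illustration, a minimal sketch of this workflow using the Langfuse Python SDK (method names such as `fetch_traces` and `score` follow SDK v2; check the SDK reference for your version). The `my_eval` function is a hypothetical stand-in for a call into any of the libraries above:

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* environment variables

def my_eval(input, output):
    """Hypothetical placeholder for an eval library call, e.g. RAGAS or OpenAI Evals."""
    score = 1.0 if output else 0.0  # trivial example: did we get an output at all?
    return score, "non-empty output" if output else "empty output"

# Fetch recent production traces and score each one
for trace in langfuse.fetch_traces(limit=50).data:
    value, reasoning = my_eval(trace.input, trace.output)
    # Attach the result back to the trace as a score
    langfuse.score(
        trace_id=trace.id,
        name="output-present",
        value=value,
        comment=reasoning,
    )
```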
Via Langfuse UI (beta)
Availability by plan:
- Hobby: Private Beta
- Pro: Private Beta
- Team: Private Beta
- Self Hosted: Not Available
Ping us via the chat widget if you are interested in joining the private beta.
Create an eval template
First, we need to specify the evaluation template:
- Select the model and its parameters.
- Define the evaluation prompt together with the variables that will be inserted into the prompt. Example prompts can be found here.
- We use function calling to extract the evaluation output. Specify the descriptions for the function parameters `score` and `reasoning`.
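To make these three steps concrete, here is a hypothetical template expressed as plain Python data; the prompt text, model settings, and parameter descriptions are invented examples, not values prescribed by Langfuse:

```python
# Hypothetical eval template: model, prompt with {{variables}}, and the
# descriptions for the function-calling parameters used to extract output.
eval_template = {
    "model": "gpt-4",
    "temperature": 0,
    "prompt": (
        "You are an expert evaluator. Judge whether the response is "
        "grounded in the provided context.\n\n"
        "Query: {{query}}\n"
        "Response: {{generation}}"
    ),
    # These descriptions guide the model when filling the function parameters.
    "function_parameters": {
        "score": "Score between 0 (hallucinated) and 1 (fully grounded).",
        "reasoning": "One short paragraph explaining the score.",
    },
}
```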

Create an eval config
Second, we need to specify when to run the evaluation template we created above.
- Select the evaluation template to run.
- When the evaluations run, scores are attached to the traces. Specify the score name.
- Filter which newly ingested traces should be evaluated. (Coming soon: select existing traces)
- Specify which part of a trace should be inserted into the prompt. Use the `Input` from the `Trace` for the `query` variable and the `Output` from the `Generation` for the `generation` variable. In this example, choose the latest `Generation` with the name `llm-generation` (see the sketch after this list).
- Modify sampling to execute the evaluations on a randomly chosen subset of the traces.
- Add a delay between trace ingestion and evaluation execution to ensure the trace is fully processed before it is evaluated.
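The variable mapping described above works conceptually like the following sketch; the attribute names are illustrative and do not mirror the Langfuse data model exactly:

```python
# Illustrative only: how the config resolves trace data into prompt variables.
def resolve_variables(trace, prompt_template):
    # `query` <- Input of the Trace
    query = trace.input

    # `generation` <- Output of the latest Generation named "llm-generation"
    generations = [o for o in trace.observations
                   if o.type == "GENERATION" and o.name == "llm-generation"]
    generation = max(generations, key=lambda g: g.start_time).output

    return (prompt_template
            .replace("{{query}}", str(query))
            .replace("{{generation}}", str(generation)))
```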

See scores
Once new traces are ingested and evaluated, navigate to the trace detail view to see the associated scores.
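Scores can also be read programmatically; a minimal sketch assuming the Python SDK's `fetch_trace` method and the v2 response shape (verify field names against the current SDK reference):

```python
from langfuse import Langfuse

langfuse = Langfuse()

# "trace-id-123" is a placeholder; fetch_trace returns the full trace details
trace = langfuse.fetch_trace("trace-id-123").data
for score in trace.scores:
    print(score.name, score.value, score.comment)
```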