The MetricEvaluator class is the main evaluation engine that executes metric computations on your data. It handles hierarchical aggregations, grouping, and filtering, and manages the entire evaluation pipeline using Polars’ lazy evaluation for optimal performance.
MetricEvaluator.evaluate() returns a Polars DataFrame by default. Set verbose=True to include struct columns and diagnostic fields, or collect=False to keep the lazy representation for additional pipeline work.
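As a quick orientation, here is a minimal sketch of the three call forms described above; it assumes an already-configured evaluator such as the one built in the Quick Start section below.

```python
# Minimal sketch of the call forms described above; `evaluator` is assumed to be
# an already-configured MetricEvaluator (see the Quick Start section below).
summary = evaluator.evaluate()                 # compact Polars DataFrame
detailed = evaluator.evaluate(verbose=True)    # adds struct columns and diagnostic fields
pipeline = evaluator.evaluate(collect=False)   # stays lazy for further composition
```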
Evaluator Inputs
| Argument | Type | Notes |
|---|---|---|
| df | pl.DataFrame \| pl.LazyFrame | Source data. The evaluator keeps it lazy internally. |
| metrics | MetricDefine \| list[MetricDefine] | Metric definitions to execute. Use lists to mix different aggregation types. |
| ground_truth | str | Column containing observed values. Defaults to "actual". |
| estimates | str \| list[str] \| dict[str, str] | Model predictions to compare against ground_truth. Dict form lets you control display labels (see the sketch after this table). |
| group_by | list[str] \| dict[str, str] \| None | Optional columns for cohort-level summaries (e.g., treatment, site). |
| subgroup_by | list[str] \| dict[str, str] \| None | Optional stratifiers that fan out into subgroup-specific rows. |
| scope (per metric) | MetricScope \| None | Overrides default grouping for a metric (global, model, group). |
| filter_expr | pl.Expr \| None | Optional Polars filter applied once up front. |
| error_params | dict[str, dict[str, Any]] \| None | Overrides for error expressions registered in MetricRegistry. |
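The dict forms and filter_expr are easiest to see side by side. The sketch below is illustrative only: the display labels and the visit filter are hypothetical choices made for this example, not values used elsewhere on this page.

```python
import polars as pl

from polars_eval_metrics import MetricDefine, MetricEvaluator
from data_generator import generate_sample_data

# Illustrative configuration only: the display labels and the visit filter
# are hypothetical, chosen to show the dict and filter_expr forms.
data = generate_sample_data(n_subjects=6, n_visits=3, n_groups=2)

labelled_eval = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates={"model1": "Model 1", "model2": "Model 2"},  # dict form controls display labels
    group_by={"treatment": "Treatment Arm"},               # mappings work for group_by too
    filter_expr=pl.col("visit_id") > 1,                    # Polars filter applied once up front
)
labelled_eval.evaluate()
```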
Throughout this page we reuse the synthetic dataset produced by generate_sample_data so the examples stay reproducible.
Example Data
```python
import polars as pl

from polars_eval_metrics import MetricDefine, MetricEvaluator, MetricRegistry
from polars_eval_metrics.metric_registry import MetricInfo
from data_generator import generate_sample_data

# Create sample data using shared generator
data = generate_sample_data(n_subjects=6, n_visits=3, n_groups=2)
data
```
shape: (18, 11)

| subject_id | visit_id | treatment | gender | race | region | age_group | actual | model1 | model2 | weight |
|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | str | str | str | str | str | f64 | f64 | f64 | f64 |
| 1 | 1 | "A" | "F" | "White" | "North" | "Young" | 15.0 | 14.4 | 16.0 | 1.1 |
| 1 | 2 | "A" | "F" | "White" | "North" | "Young" | 19.0 | 18.8 | 22.1 | 1.1 |
| 1 | 3 | "A" | "F" | "White" | "North" | "Young" | 23.0 | 23.2 | 21.9 | 1.1 |
| 2 | 1 | "B" | "M" | "Black" | "South" | "Middle" | 18.0 | 18.6 | 18.0 | 1.2 |
| 2 | 2 | "B" | "M" | "Black" | "South" | "Middle" | 22.0 | 23.0 | 24.1 | 1.2 |
| … | … | … | … | … | … | … | … | … | … | … |
| 5 | 2 | "A" | "F" | "White" | "North" | "Middle" | 31.0 | 30.8 | 34.1 | 1.2 |
| 5 | 3 | "A" | "F" | "White" | "North" | "Middle" | 35.0 | 35.2 | 33.9 | 1.2 |
| 6 | 1 | "B" | "M" | "Black" | "South" | "Senior" | 30.0 | 30.6 | 30.0 | 1.0 |
| 6 | 2 | "B" | "M" | "Black" | "South" | "Senior" | 34.0 | 35.0 | 36.1 | 1.0 |
| 6 | 3 | "B" | "M" | "Black" | "South" | "Senior" | 38.0 | 39.4 | 35.9 | 1.0 |
Quick Start
Basic evaluation
# Define metricsmetrics = [ MetricDefine(name="mae", label="Mean Absolute Error"), MetricDefine(name="rmse", label="Root Mean Squared Error"),]# Create evaluator and run evaluationevaluator = MetricEvaluator( df=data, metrics=metrics, ground_truth="actual", estimates=["model1", "model2"],)basic_res = evaluator.evaluate()basic_res
shape: (4, 6)

| estimate | metric | label | value | metric_type | scope |
|---|---|---|---|---|---|
| enum | enum | enum | str | str | str |
| "model1" | "mae" | "Mean Absolute Error" | "1.0" | "across_sample" | null |
| "model2" | "mae" | "Mean Absolute Error" | "1.5" | "across_sample" | null |
| "model1" | "rmse" | "Root Mean Squared Error" | "1.2" | "across_sample" | null |
| "model2" | "rmse" | "Root Mean Squared Error" | "1.8" | "across_sample" | null |
basic_res is a Polars DataFrame. The compact view keeps the core summary columns (metric, estimate, value, and any group labels) while hiding struct payloads and diagnostic fields for readability. Use the options below when you need alternate representations:
Need to stay lazy? Pass collect=False to obtain a LazyFrame for further composition before materialising the result:
```python
basic_lazy = evaluator.evaluate(collect=False)

from polars_eval_metrics.ard import ARD

# Wrap the lazy output in the ARD helper when you need canonical struct columns
ard_view = ARD(basic_lazy)
ard_view.collect().head()
```
The collected ARD view carries the same four metric/estimate rows, now with the canonical struct columns. For example, the model2 RMSE row holds a context struct of {"across_sample", null, "Root Mean Squared Error", "model2"}, a stat struct with the typed value 1.839277, the formatted value "1.8", and empty warning and error lists.
Adding groups and scopes
# Define metricsmetrics = [ MetricDefine(name="n_subject", label ="Number of Subjects", scope ="global"), MetricDefine(name="n_sample", label ="Number of Samples", scope ="group"), MetricDefine(name="pct_sample_with_data", label ="Percent of Samples with Data", scope ="group"), MetricDefine(name="mae", label="MAE"), MetricDefine(name="rmse", label="RMSE"),]# Create evaluator and run evaluationevaluator = MetricEvaluator( df=data, metrics=metrics, ground_truth="actual", estimates=["model1", "model2"], group_by=["treatment"])res = evaluator.evaluate()res
shape: (13, 7)

| estimate | metric | label | value | metric_type | scope | treatment |
|---|---|---|---|---|---|---|
| enum | enum | enum | str | str | str | str |
| null | "n_subject" | "Number of Subjects" | "6" | "across_sample" | "global" | null |
| null | "n_sample" | "Number of Samples" | "9" | "across_sample" | "group" | "A" |
| null | "pct_sample_with_data" | "Percent of Samples with Data" | "100.0" | "across_sample" | "group" | "A" |
| "model1" | "mae" | "MAE" | "1.0" | "across_sample" | null | "A" |
| "model2" | "mae" | "MAE" | "1.7" | "across_sample" | null | "A" |
| … | … | … | … | … | … | … |
| null | "pct_sample_with_data" | "Percent of Samples with Data" | "88.9" | "across_sample" | "group" | "B" |
| "model1" | "mae" | "MAE" | "1.0" | "across_sample" | null | "B" |
| "model2" | "mae" | "MAE" | "1.3" | "across_sample" | null | "B" |
| "model1" | "rmse" | "RMSE" | "1.1" | "across_sample" | null | "B" |
| "model2" | "rmse" | "RMSE" | "1.7" | "across_sample" | null | "B" |
The evaluation output keeps a lightweight value column for quick inspection, but the full detail lives in the stat struct and the companion stat_fmt, warning, and error columns. Use these when you need typed payloads or diagnostics.
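For example, a quick way to surface diagnostics is to run a verbose evaluation and keep only rows whose error list is non-empty. This is a minimal sketch using plain Polars list operations, not a dedicated helper of the library:

```python
# Minimal sketch: pull diagnostic rows out of the verbose output with plain Polars
res_verbose = evaluator.evaluate(verbose=True)
res_verbose.filter(pl.col("error").list.len() > 0).select(
    ["metric", "estimate", "warning", "error"]
)
```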
When a MetricInfo declares value_kind="int", the evaluator stores the integer value under stat.value_int, ready for reuse in downstream calculations:
```python
# Grab integer counts for the subject metrics
res_verbose = evaluator.evaluate(verbose=True)
res_verbose.filter(pl.col("metric") == "n_subject").with_columns(
    pl.col("stat").struct.field("value_int").alias("subject_count")
)
```
shape: (1, 16)

| id | groups | subgroups | estimate | metric | label | value | stat | stat_fmt | context | warning | error | metric_type | scope | treatment | subject_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| null | struct[1] | null | enum | enum | enum | str | struct[7] | str | struct[4] | list[str] | list[str] | str | str | str | i64 |
| null | null | null | null | "n_subject" | "Number of Subjects" | "6" | {"int",6.0,6,null,null,null,null} | "6" | {"across_sample","global","Number of Subjects",null} | [] | [] | "across_sample" | "global" | null | 6 |
Structured payloads and custom formatting
When a metric returns more than a single scalar, surface it as a struct and optionally supply a formatter. The evaluator keeps the struct in stat.value_struct while the formatter drives stat_fmt:
```python
# Register a metric that surfaces a richer payload as a struct
MetricRegistry.register_metric(
    "mae_with_bounds",
    MetricInfo(
        expr=pl.struct(
            [
                pl.col("absolute_error").mean().alias("mean"),
                pl.col("absolute_error").std().alias("sd"),
            ]
        ),
        format="{0[mean]:.1f} +/- {0[sd]:.1f}",
    ),
)

evaluator = MetricEvaluator(
    df=data,
    metrics=MetricDefine(name="mae_with_bounds"),
    ground_truth="actual",
    estimates=["model1"],
)
bounds_res = evaluator.evaluate(verbose=True)
bounds_res.select(["metric", "estimate", "stat_fmt"]).head()

# Inspect the struct payload when needed
bounds_res.select(["metric", "stat"]).head()
```
Both pivot_by_group() and pivot_by_model() reshape the evaluation output into presentation-friendly tables while keeping formatted columns intact:
```python
evaluator.pivot_by_group()
```
shape: (1, 1)

| {"model1","mae_with_bounds"} |
|---|
| str |
| "1.0 +/- 0.7" |
```python
evaluator.pivot_by_model()
```
shape: (1, 3)

| estimate | mae_with_bounds | estimate_label |
|---|---|---|
| str | str | str |
| "model1" | "1.0 +/- 0.7" | "model1" |
Subject-level metrics
Subject-oriented aggregations either keep identifiers for every subject (within_subject) or summarise subject-level results into a single row (across_subject). The evaluator handles the hierarchical grouping and preserves entity identifiers in the id struct.
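The snippet below references within_subject_metrics and across_subject_metrics without defining them in this section. A hypothetical reconstruction, inferred from the output that follows, is sketched here so the example can stand alone; the type argument used to select the aggregation level is an assumption and may be spelled differently in your version of MetricDefine.

```python
# Hypothetical reconstruction of the metric lists used below; the `type`
# argument for choosing the aggregation level is an assumption.
within_subject_metrics = [
    MetricDefine(name="mae", label="MAE per Subject", type="within_subject"),
]
across_subject_metrics = [
    MetricDefine(name="mae:mean", label="Mean of Subject MAEs", type="across_subject"),
]
```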
```python
# Combine the metric lists using list concatenation or unpacking
evaluator = MetricEvaluator(
    df=data,
    metrics=[*within_subject_metrics, *across_subject_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
)
evaluator.evaluate()
```
shape: (14, 7)

| id | estimate | metric | label | value | metric_type | scope |
|---|---|---|---|---|---|---|
| struct[1] | enum | enum | enum | str | str | str |
| {2} | "model1" | "mae" | "MAE per Subject" | "1.0" | "within_subject" | null |
| {3} | "model1" | "mae" | "MAE per Subject" | "2.2" | "within_subject" | null |
| {6} | "model1" | "mae" | "MAE per Subject" | "1.0" | "within_subject" | null |
| {1} | "model1" | "mae" | "MAE per Subject" | "0.3" | "within_subject" | null |
| {5} | "model1" | "mae" | "MAE per Subject" | "0.3" | "within_subject" | null |
| … | … | … | … | … | … | … |
| {6} | "model2" | "mae" | "MAE per Subject" | "1.4" | "within_subject" | null |
| {4} | "model2" | "mae" | "MAE per Subject" | "1.4" | "within_subject" | null |
| {5} | "model2" | "mae" | "MAE per Subject" | "1.7" | "within_subject" | null |
| null | "model1" | "mae:mean" | "Mean of Subject MAEs" | "1.0" | "across_subject" | null |
| null | "model2" | "mae:mean" | "Mean of Subject MAEs" | "1.5" | "across_subject" | null |
Visit-level metrics
Visit metrics mirror the subject patterns but operate on combined subject_id / visit_id keys. Use within_visit to keep per-visit rows and across_visit to summarise the visit distribution.
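No visit-level example appears in this section, but the pattern mirrors the subject-level sketch above. As before, the type argument naming the aggregation level is an assumption about the MetricDefine API rather than a confirmed signature.

```python
# Hypothetical visit-level definitions, mirroring the subject-level pattern above;
# the `type` argument is an assumption about the MetricDefine API.
within_visit_metrics = [
    MetricDefine(name="mae", label="MAE per Visit", type="within_visit"),
]
across_visit_metrics = [
    MetricDefine(name="mae:mean", label="Mean of Visit MAEs", type="across_visit"),
]

evaluator = MetricEvaluator(
    df=data,
    metrics=[*within_visit_metrics, *across_visit_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
)
evaluator.evaluate()
```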
Scopes override the default behaviour of evaluating every metric per estimate and per group: global collapses everything into a single row, group keeps one row per group value, and model isolates each estimate regardless of group columns.
Global scope
Global scope metrics compute a single value across the entire dataset, ignoring model and group distinctions.
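A minimal sketch, reusing the n_subject metric from the grouped example earlier: even with group_by set, the global-scope metric produces a single row for the whole dataset.

```python
# A global-scope metric collapses to a single row even when group_by is set
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="n_subject", label="Number of Subjects", scope="global")],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)
evaluator.evaluate()
```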
Use group_by to partition metrics into cohorts (for example, by treatment arm) and subgroup_by to explode each cohort into additional breakdowns such as gender or race. Both arguments accept either a list of column names or a mapping to display labels.
```python
# Evaluate with subgroup stratification (gender and race already in data)
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates=["model1"],
    group_by=["treatment"],
    subgroup_by=["gender", "race"],
)
evaluator.evaluate()
```
shape: (6, 11)

| subgroup_name | subgroup_value | estimate | metric | label | value | treatment | metric_type | scope | gender | race |
|---|---|---|---|---|---|---|---|---|---|---|
| str | str | enum | enum | enum | str | str | str | str | str | str |
| "gender" | "F" | "model1" | "mae" | "mae" | "1.0" | "A" | "across_sample" | null | "F" | null |
| "race" | "Asian" | "model1" | "mae" | "mae" | "2.2" | "A" | "across_sample" | null | null | "Asian" |
| "race" | "White" | "model1" | "mae" | "mae" | "0.3" | "A" | "across_sample" | null | null | "White" |
| "gender" | "M" | "model1" | "mae" | "mae" | "1.0" | "B" | "across_sample" | null | "M" | null |
| "race" | "Black" | "model1" | "mae" | "mae" | "1.0" | "B" | "across_sample" | null | null | "Black" |
| "race" | "Hispanic" | "model1" | "mae" | "mae" | "1.2" | "B" | "across_sample" | null | null | "Hispanic" |
Filtering and reuse
Large evaluations can be expensive. Two conveniences help keep things fast:
- evaluate(metrics=..., estimates=...) lets you rerun the evaluator on a subset of the originally configured metrics or estimates without rebuilding the instance.
- filter(metrics=..., estimates=...) returns a lightweight evaluator that shares the same lazy frame and cached error columns.
Results are cached by (metric, estimate) combination, so repeating the same call avoids recomputation. Call clear_cache() when the underlying data changes or you want a fresh evaluation.
```python
# Re-evaluate only MAE for model1 using the cached pipeline
evaluator.evaluate(metrics=MetricDefine(name="mae"), estimates="model1")
```
shape: (6, 11)

| subgroup_name | subgroup_value | estimate | metric | label | value | treatment | metric_type | scope | gender | race |
|---|---|---|---|---|---|---|---|---|---|---|
| str | str | enum | enum | enum | str | str | str | str | str | str |
| "gender" | "F" | "model1" | "mae" | "mae" | "1.0" | "A" | "across_sample" | null | "F" | null |
| "race" | "Asian" | "model1" | "mae" | "mae" | "2.2" | "A" | "across_sample" | null | null | "Asian" |
| "race" | "White" | "model1" | "mae" | "mae" | "0.3" | "A" | "across_sample" | null | null | "White" |
| "gender" | "M" | "model1" | "mae" | "mae" | "1.0" | "B" | "across_sample" | null | "M" | null |
| "race" | "Black" | "model1" | "mae" | "mae" | "1.0" | "B" | "across_sample" | null | null | "Black" |
| "race" | "Hispanic" | "model1" | "mae" | "mae" | "1.2" | "B" | "across_sample" | null | null | "Hispanic" |
```python
# Or create a filtered evaluator that shares the same lazy frame and cached error columns
model1_eval = evaluator.filter(estimates="model1")
model1_eval.evaluate()
```
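When the underlying data changes, reset the cache before re-running; a minimal sketch of the clear_cache() call mentioned above:

```python
# Drop cached (metric, estimate) results after the source data changes
evaluator.clear_cache()
fresh_res = evaluator.evaluate()
```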