MetricEvaluator: Evaluation Pipeline

MetricEvaluator

The MetricEvaluator class is the main evaluation engine that executes metric computations on your data. It handles hierarchical aggregations, grouping, and filtering, and manages the entire evaluation pipeline with Polars’ lazy evaluation so queries are optimised and only materialised when results are collected.

MetricEvaluator.evaluate() returns a Polars DataFrame by default. Set verbose=True to include struct columns and diagnostic fields, or collect=False to keep the lazy representation for additional pipeline work.

Evaluator Inputs

Argument Type Notes
df pl.DataFrame | pl.LazyFrame Source data. The evaluator keeps it lazy internally.
metrics MetricDefine | list[MetricDefine] Metric definitions to execute. Use lists to mix different aggregation types.
ground_truth str Column containing observed values. Defaults to "actual".
estimates str | list[str] | dict[str, str] Model predictions to compare against ground_truth. Dict form lets you control display labels.
group_by list[str] | dict[str, str] | None Optional columns for cohort-level summaries (e.g., treatment, site).
subgroup_by list[str] | dict[str, str] | None Optional stratifiers that fan out into subgroup-specific rows.
scope (per metric) MetricScope | None Overrides default grouping for a metric (global, model, group).
filter_expr pl.Expr | None Optional Polars filter applied once up front.
error_params dict[str, dict[str, Any]] | None Overrides for error expressions registered in MetricRegistry.
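As a quick orientation, the richer argument forms can be combined in a single constructor call. A minimal sketch, reusing the data frame built in the Example Data section below (the dict values are display labels and the filter is illustrative):

evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates={"model1": "Model 1", "model2": "Model 2"},  # dict form controls display labels
    group_by={"treatment": "Treatment Arm"},               # mapping form attaches a display label
    filter_expr=pl.col("visit_id") > 1,                    # applied once before any aggregation
)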

Throughout this page we reuse the synthetic dataset produced by generate_sample_data so the examples stay reproducible.

Example Data

import polars as pl
from polars_eval_metrics import MetricDefine, MetricEvaluator, MetricRegistry
from polars_eval_metrics.metric_registry import MetricInfo
from data_generator import generate_sample_data

# Create sample data using shared generator
data = generate_sample_data(n_subjects=6, n_visits=3, n_groups=2)

data
shape: (18, 11)
subject_id visit_id treatment gender race region age_group actual model1 model2 weight
i64 i64 str str str str str f64 f64 f64 f64
1 1 "A" "F" "White" "North" "Young" 15.0 14.4 16.0 1.1
1 2 "A" "F" "White" "North" "Young" 19.0 18.8 22.1 1.1
1 3 "A" "F" "White" "North" "Young" 23.0 23.2 21.9 1.1
2 1 "B" "M" "Black" "South" "Middle" 18.0 18.6 18.0 1.2
2 2 "B" "M" "Black" "South" "Middle" 22.0 23.0 24.1 1.2
5 2 "A" "F" "White" "North" "Middle" 31.0 30.8 34.1 1.2
5 3 "A" "F" "White" "North" "Middle" 35.0 35.2 33.9 1.2
6 1 "B" "M" "Black" "South" "Senior" 30.0 30.6 30.0 1.0
6 2 "B" "M" "Black" "South" "Senior" 34.0 35.0 36.1 1.0
6 3 "B" "M" "Black" "South" "Senior" 38.0 39.4 35.9 1.0

Quick Start

Basic evaluation

# Define metrics
metrics = [
    MetricDefine(name="mae", label="Mean Absolute Error"),
    MetricDefine(name="rmse", label="Root Mean Squared Error"),
]

# Create evaluator and run evaluation
evaluator = MetricEvaluator(
    df=data,
    metrics=metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

basic_res = evaluator.evaluate()
basic_res
shape: (4, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae" "Mean Absolute Error" "1.0" "across_sample" null
"model2" "mae" "Mean Absolute Error" "1.5" "across_sample" null
"model1" "rmse" "Root Mean Squared Error" "1.2" "across_sample" null
"model2" "rmse" "Root Mean Squared Error" "1.8" "across_sample" null

basic_res is a Polars DataFrame. The compact view keeps the core summary columns (metric, estimate, value, and any group labels) while hiding struct payloads and diagnostic fields for readability. Use the options below when you need alternate representations:

# Materialise a verbose view (struct + diagnostic columns)
basic_verbose = evaluator.evaluate(verbose=True)
basic_verbose.columns
['id',
 'groups',
 'subgroups',
 'estimate',
 'metric',
 'label',
 'value',
 'stat',
 'stat_fmt',
 'context',
 'warning',
 'error',
 'metric_type',
 'scope']

Need to stay lazy? Pass collect=False to obtain a LazyFrame for further composition before materialising the result:

basic_lazy = evaluator.evaluate(collect=False)
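
Because collect=False returns a regular Polars LazyFrame, you can keep chaining standard Polars operations before collecting, for example:

# Filter the lazy result to a single metric, then materialise
basic_lazy.filter(pl.col("metric") == "mae").collect()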

from polars_eval_metrics.ard import ARD

# Wrap the lazy output in the ARD helper when you need canonical struct columns
ard_view = ARD(basic_lazy)
ard_view.collect().head()
shape: (4, 11)
id groups subgroups estimate metric label stat stat_fmt warning error context
null null null enum enum enum struct[7] str list[str] list[str] struct[4]
null null null "model2" "mae" "Mean Absolute Error" {"float",1.535294,null,null,null,null,null} "1.5" [] [] {"across_sample",null,"Mean Absolute Error","model2"}
null null null "model1" "mae" "Mean Absolute Error" {"float",1.0,null,null,null,null,null} "1.0" [] [] {"across_sample",null,"Mean Absolute Error","model1"}
null null null "model1" "rmse" "Root Mean Squared Error" {"float",1.220415,null,null,null,null,null} "1.2" [] [] {"across_sample",null,"Root Mean Squared Error","model1"}
null null null "model2" "rmse" "Root Mean Squared Error" {"float",1.839277,null,null,null,null,null} "1.8" [] [] {"across_sample",null,"Root Mean Squared Error","model2"}

Adding groups and scopes

# Define metrics
metrics = [
    MetricDefine(name="n_subject", label = "Number of Subjects", scope = "global"), 
    MetricDefine(name="n_sample", label = "Number of Samples", scope = "group"), 
    MetricDefine(name="pct_sample_with_data", label = "Percent of Samples with Data", scope = "group"),
    MetricDefine(name="mae", label="MAE"),
    MetricDefine(name="rmse", label="RMSE"),
]

# Create evaluator and run evaluation
evaluator = MetricEvaluator(
    df=data,
    metrics=metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"]
)

res = evaluator.evaluate()
res
shape: (13, 7)
estimate metric label value metric_type scope treatment
enum enum enum str str str str
null "n_subject" "Number of Subjects" "6" "across_sample" "global" null
null "n_sample" "Number of Samples" "9" "across_sample" "group" "A"
null "pct_sample_with_data" "Percent of Samples with Data" "100.0" "across_sample" "group" "A"
"model1" "mae" "MAE" "1.0" "across_sample" null "A"
"model2" "mae" "MAE" "1.7" "across_sample" null "A"
null "pct_sample_with_data" "Percent of Samples with Data" "88.9" "across_sample" "group" "B"
"model1" "mae" "MAE" "1.0" "across_sample" null "B"
"model2" "mae" "MAE" "1.3" "across_sample" null "B"
"model1" "rmse" "RMSE" "1.1" "across_sample" null "B"
"model2" "rmse" "RMSE" "1.7" "across_sample" null "B"

The evaluation output keeps a lightweight value column for quick inspection, but the full detail lives in the stat struct and the companion stat_fmt, warning, and error columns. Use these when you need typed payloads or diagnostics.

When a MetricInfo declares value_kind="int", the evaluator stores the integer value under stat.value_int, ready for reuse in downstream calculations:

# Grab integer counts for the subject metrics
res_verbose = evaluator.evaluate(verbose=True)
res_verbose.filter(pl.col("metric") == "n_subject").with_columns(
    pl.col("stat").struct.field("value_int").alias("subject_count")
)
shape: (1, 16)
id groups subgroups estimate metric label value stat stat_fmt context warning error metric_type scope treatment subject_count
null struct[1] null enum enum enum str struct[7] str struct[4] list[str] list[str] str str str i64
null null null null "n_subject" "Number of Subjects" "6" {"int",6.0,6,null,null,null,null} "6" {"across_sample","global","Number of Subjects",null} [] [] "across_sample" "global" null 6
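
The same verbose frame also exposes the diagnostic columns, so warnings and errors can be inspected next to the formatted values:

# Surface formatted values alongside any warnings or errors
res_verbose.select(["metric", "estimate", "stat_fmt", "warning", "error"])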

Structured payloads and custom formatting

When a metric returns more than a single scalar, surface it as a struct and optionally supply a formatter. The evaluator keeps the struct in stat.value_struct while the formatter drives stat_fmt:

# Register a metric that surfaces a richer payload as a struct
MetricRegistry.register_metric(
    "mae_with_bounds",
    MetricInfo(
        expr=pl.struct(
            [
                pl.col("absolute_error").mean().alias("mean"),
                pl.col("absolute_error").std().alias("sd"),
            ]
        ),
        format="{0[mean]:.1f} +/- {0[sd]:.1f}",
    ),
)

evaluator = MetricEvaluator(
    df=data,
    metrics=MetricDefine(name="mae_with_bounds"),
    ground_truth="actual",
    estimates=["model1"],
)

bounds_res = evaluator.evaluate(verbose=True)
bounds_res.select(["metric", "estimate", "stat_fmt"]).head()

# Inspect the struct payload when needed
bounds_res.select(["metric", "stat"]).head()
shape: (1, 2)
metric stat
enum struct[7]
"mae_with_bounds" {"float",{1.0,0.72111},null,null,null,null,"{0[mean]:.1f} +/- {0[sd]:.1f}"}

Pivot helpers

Both pivot_by_group() and pivot_by_model() reshape the evaluation output into presentation-friendly tables while keeping formatted columns intact:

evaluator.pivot_by_group()
shape: (1, 1)
{"model1","mae_with_bounds"}
str
"1.0 +/- 0.7"
evaluator.pivot_by_model()
shape: (1, 3)
estimate mae_with_bounds estimate_label
str str str
"model1" "1.0 +/- 0.7" "model1"

Subject-level metrics

Subject-oriented aggregations either keep identifiers for every subject (within_subject) or summarise subject-level results into a single row (across_subject). The evaluator handles the hierarchical grouping and preserves entity identifiers in the id struct.

Within-subject metrics

within_subject_metrics = MetricDefine(
    name="mae",
    type="within_subject",
    label="MAE per Subject",
)


evaluator = MetricEvaluator(
    df=data,
    metrics=within_subject_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (12, 7)
id estimate metric label value metric_type scope
struct[1] enum enum enum str str str
{1} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{4} "model1" "mae" "MAE per Subject" "1.2" "within_subject" null
{5} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{6} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{3} "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
{5} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
{2} "model2" "mae" "MAE per Subject" "1.1" "within_subject" null
{3} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
{4} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{1} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null

The resulting rows include an id struct with the subject identifiers, which makes it easy to join back to other subject-level metadata:

# Inspect subject identifiers carried in the id struct
evaluator.evaluate().unnest(["id"])
shape: (12, 7)
subject_id estimate metric label value metric_type scope
i64 enum enum enum str str str
2 "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
5 "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
3 "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
4 "model1" "mae" "MAE per Subject" "1.2" "within_subject" null
6 "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
6 "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
4 "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
3 "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
1 "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
2 "model2" "mae" "MAE per Subject" "1.1" "within_subject" null

Across-subject metrics

across_subject_metrics = MetricDefine(
    name="mae:mean",
    type="across_subject",
    label="Mean of Subject MAEs",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=across_subject_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae:mean" "Mean of Subject MAEs" "1.0" "across_subject" null
"model2" "mae:mean" "Mean of Subject MAEs" "1.5" "across_subject" null

Combining subject-level views

# Combine the subject-level metric definitions in a single list
evaluator = MetricEvaluator(
    df=data,
    metrics=[
        within_subject_metrics,
        across_subject_metrics
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (14, 7)
id estimate metric label value metric_type scope
struct[1] enum enum enum str str str
{2} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{3} "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
{6} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{1} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{5} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{6} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{4} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{5} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
null "model1" "mae:mean" "Mean of Subject MAEs" "1.0" "across_subject" null
null "model2" "mae:mean" "Mean of Subject MAEs" "1.5" "across_subject" null

Visit-level metrics

Visit metrics mirror the subject patterns but operate on combined subject_id / visit_id keys. Use within_visit to keep per-visit rows and across_visit to summarize the visit distribution.

Within-visit metrics

within_visit_metrics = MetricDefine(
    name="mae",
    type="within_visit",
    label="MAE per Visit",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=within_visit_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (36, 7)
id estimate metric label value metric_type scope
struct[2] enum enum enum str str str
{6,2} "model1" "mae" "MAE per Visit" "1.0" "within_visit" null
{2,1} "model1" "mae" "MAE per Visit" "0.6" "within_visit" null
{5,3} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{6,3} "model1" "mae" "MAE per Visit" "1.4" "within_visit" null
{1,3} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{1,2} "model2" "mae" "MAE per Visit" "3.1" "within_visit" null
{6,1} "model2" "mae" "MAE per Visit" "0.0" "within_visit" null
{4,2} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
{4,3} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
{2,3} "model2" "mae" "MAE per Visit" null "within_visit" null

Across-visit metrics

across_visit_metrics = MetricDefine(
    name="mae:mean",
    type="across_visit",
    label="Mean of Visit MAEs",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=across_visit_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae:mean" "Mean of Visit MAEs" "1.0" "across_visit" null
"model2" "mae:mean" "Mean of Visit MAEs" "1.5" "across_visit" null

Combining visit-level views

evaluator = MetricEvaluator(
    df=data,
    metrics=[
        within_visit_metrics,
        across_visit_metrics,
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (38, 7)
id estimate metric label value metric_type scope
struct[2] enum enum enum str str str
{5,2} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{3,2} "model1" "mae" "MAE per Visit" "2.2" "within_visit" null
{4,1} "model1" "mae" "MAE per Visit" null "within_visit" null
{2,2} "model1" "mae" "MAE per Visit" "1.0" "within_visit" null
{3,1} "model1" "mae" "MAE per Visit" "1.8" "within_visit" null
{5,1} "model2" "mae" "MAE per Visit" "1.0" "within_visit" null
{5,3} "model2" "mae" "MAE per Visit" "1.1" "within_visit" null
{6,3} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
null "model1" "mae:mean" "Mean of Visit MAEs" "1.0" "across_visit" null
null "model2" "mae:mean" "Mean of Visit MAEs" "1.5" "across_visit" null

Metric scopes

Scopes override the default behaviour that evaluates every metric per estimate and group. global collapses everything into a single row, group keeps one row per group value, and model isolates each estimate regardless of group columns.

Global scope

Global scope metrics compute a single value across the entire dataset, ignoring model and group distinctions.

global_scope_metrics = MetricDefine(
    name="n_subject",
    scope="global",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[
        global_scope_metrics,
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (1, 6)
estimate metric label value metric_type scope
str enum enum str str str
null "n_subject" "n_subject" "6" "across_sample" "global"

Group scope

Group scope metrics compute one value per group, aggregating across all models.

group_scope_metrics = MetricDefine(
    name="n_subject",
    scope="group",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[group_scope_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (2, 7)
estimate metric label value treatment metric_type scope
str enum enum str str str str
null "n_subject" "n_subject" "3" "A" "across_sample" "group"
null "n_subject" "n_subject" "3" "B" "across_sample" "group"

Model scope

Model scope metrics compute one value per model, ignoring group distinctions.

model_scope_metrics = MetricDefine(
    name="n_sample_with_data",
    scope="model",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[model_scope_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "n_sample_with_data" "n_sample_with_data" "17" "across_sample" "model"
"model2" "n_sample_with_data" "n_sample_with_data" "17" "across_sample" "model"

Grouping and stratification

Use group_by to partition metrics into cohorts (for example, by treatment arm) and subgroup_by to explode each cohort into additional breakdowns such as gender or race. Both arguments accept either a list of column names or a mapping to display labels.
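
For example, the mapping form pairs each column with a display label (a minimal sketch; the labels here are illustrative):

evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates=["model1"],
    group_by={"treatment": "Treatment Arm"},
    subgroup_by={"gender": "Gender", "race": "Race"},
)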

Group-by analysis

# Evaluate metrics by treatment group
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae"), MetricDefine(name="rmse")],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (8, 7)
estimate metric label value treatment metric_type scope
enum enum enum str str str str
"model1" "mae" "mae" "1.0" "A" "across_sample" null
"model2" "mae" "mae" "1.7" "A" "across_sample" null
"model1" "rmse" "rmse" "1.3" "A" "across_sample" null
"model2" "rmse" "rmse" "2.0" "A" "across_sample" null
"model1" "mae" "mae" "1.0" "B" "across_sample" null
"model2" "mae" "mae" "1.3" "B" "across_sample" null
"model1" "rmse" "rmse" "1.1" "B" "across_sample" null
"model2" "rmse" "rmse" "1.7" "B" "across_sample" null

Subgroup analysis

# Evaluate with subgroup stratification (gender and race already in data)
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates=["model1"],
    group_by=["treatment"],
    subgroup_by=["gender", "race"],
)

evaluator.evaluate()
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"

Filtering and reuse

Large evaluations can be expensive. Two conveniences help keep things fast:

  • evaluate(metrics=..., estimates=...) lets you rerun the evaluator on a subset of the originally configured metrics or estimates without rebuilding the instance.
  • filter(metrics=..., estimates=...) returns a lightweight evaluator that shares the same lazy frame and cached error columns.

Results are cached by (metric, estimate) combination, so repeating the same call avoids recomputation. Call clear_cache() when the underlying data changes or you want a fresh evaluation.

# Re-evaluate only MAE for model1 using the cached pipeline
evaluator.evaluate(metrics=MetricDefine(name="mae"), estimates="model1")
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"
# Or create a filtered evaluator for model1-only summaries
model1_eval = evaluator.filter(estimates="model1")
model1_eval.evaluate()
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"