MetricEvaluator: Evaluation Pipeline

MetricEvaluator

The MetricEvaluator class is the main evaluation engine that executes metric computations on your data. It handles hierarchical aggregations, grouping, and filtering, and manages the entire evaluation pipeline with Polars’ lazy evaluation so queries are optimised and only materialised when results are collected.

MetricEvaluator.evaluate() returns a Polars DataFrame by default. Set verbose=True to include struct columns and diagnostic fields, or collect=False to keep the lazy representation for additional pipeline work.

Evaluator Inputs

Argument Type Notes
df pl.DataFrame | pl.LazyFrame Source data. The evaluator keeps it lazy internally.
metrics MetricDefine | list[MetricDefine] Metric definitions to execute. Use lists to mix different aggregation types.
ground_truth str Column containing observed values. Defaults to "actual".
estimates str | list[str] | dict[str, str] Model predictions to compare against ground_truth. Dict form lets you control display labels.
group_by list[str] | dict[str, str] | None Optional columns for cohort-level summaries (e.g., treatment, site).
subgroup_by list[str] | dict[str, str] | None Optional stratifiers that fan out into subgroup-specific rows.
scope (per metric) MetricScope | None Overrides default grouping for a metric (global, model, group).
filter_expr pl.Expr | None Optional Polars filter applied once up front.
error_params dict[str, dict[str, Any]] | None Overrides for error expressions registered in MetricRegistry.
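As a quick orientation, the richer argument forms can be combined in a single constructor call. A minimal sketch, reusing the data frame built in the Example Data section below (the dict values are display labels and the filter is illustrative):

evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates={"model1": "Model 1", "model2": "Model 2"},  # dict form controls display labels
    group_by={"treatment": "Treatment Arm"},               # mapping form attaches a display label
    filter_expr=pl.col("visit_id") > 1,                    # applied once before any aggregation
)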

Throughout this page we reuse the synthetic dataset produced by generate_sample_data so the examples stay reproducible.

Example Data

import polars as pl
from polars_eval_metrics import MetricDefine, MetricEvaluator, MetricRegistry
from polars_eval_metrics.metric_registry import MetricInfo
from data_generator import generate_sample_data

# Create sample data using shared generator
data = generate_sample_data(n_subjects=6, n_visits=3, n_groups=2)

data
shape: (18, 11)
subject_id visit_id treatment gender race region age_group actual model1 model2 weight
i64 i64 str str str str str f64 f64 f64 f64
1 1 "A" "F" "White" "North" "Young" 15.0 14.4 16.0 1.1
1 2 "A" "F" "White" "North" "Young" 19.0 18.8 22.1 1.1
1 3 "A" "F" "White" "North" "Young" 23.0 23.2 21.9 1.1
2 1 "B" "M" "Black" "South" "Middle" 18.0 18.6 18.0 1.2
2 2 "B" "M" "Black" "South" "Middle" 22.0 23.0 24.1 1.2
5 2 "A" "F" "White" "North" "Middle" 31.0 30.8 34.1 1.2
5 3 "A" "F" "White" "North" "Middle" 35.0 35.2 33.9 1.2
6 1 "B" "M" "Black" "South" "Senior" 30.0 30.6 30.0 1.0
6 2 "B" "M" "Black" "South" "Senior" 34.0 35.0 36.1 1.0
6 3 "B" "M" "Black" "South" "Senior" 38.0 39.4 35.9 1.0

Quick Start

Basic evaluation

# Define metrics
metrics = [
    MetricDefine(name="mae", label="Mean Absolute Error"),
    MetricDefine(name="rmse", label="Root Mean Squared Error"),
]

# Create evaluator and run evaluation
evaluator = MetricEvaluator(
    df=data,
    metrics=metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

basic_res = evaluator.evaluate()
basic_res
shape: (4, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae" "Mean Absolute Error" "1.0" "across_sample" null
"model2" "mae" "Mean Absolute Error" "1.5" "across_sample" null
"model1" "rmse" "Root Mean Squared Error" "1.2" "across_sample" null
"model2" "rmse" "Root Mean Squared Error" "1.8" "across_sample" null

basic_res is a Polars DataFrame. The compact view keeps the core summary columns (metric, estimate, value, and any group labels) while hiding struct payloads and diagnostic fields for readability. Use the options below when you need alternate representations:

# Materialise a verbose view (struct + diagnostic columns)
basic_verbose = evaluator.evaluate(verbose=True)
basic_verbose.columns
['id',
 'groups',
 'subgroups',
 'estimate',
 'metric',
 'label',
 'value',
 'stat',
 'stat_fmt',
 'context',
 'warning',
 'error',
 'metric_type',
 'scope']

Need to stay lazy? Pass collect=False to obtain a LazyFrame for further composition before materialising the result:

basic_lazy = evaluator.evaluate(collect=False)
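
Because collect=False returns a regular Polars LazyFrame, you can keep chaining standard Polars operations before collecting, for example:

# Filter the lazy result to a single metric, then materialise
basic_lazy.filter(pl.col("metric") == "mae").collect()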

from polars_eval_metrics.ard import ARD

# Wrap the lazy output in the ARD helper when you need canonical struct columns
ard_view = ARD(basic_lazy)
ard_view.collect().head()
shape: (4, 11)
id groups subgroups estimate metric label stat stat_fmt warning error context
null null null enum enum enum struct[7] str list[str] list[str] struct[4]
null null null "model2" "mae" "Mean Absolute Error" {"float",1.535294,null,null,null,null,null} "1.5" [] [] {"across_sample",null,"Mean Absolute Error","model2"}
null null null "model1" "mae" "Mean Absolute Error" {"float",1.0,null,null,null,null,null} "1.0" [] [] {"across_sample",null,"Mean Absolute Error","model1"}
null null null "model1" "rmse" "Root Mean Squared Error" {"float",1.220415,null,null,null,null,null} "1.2" [] [] {"across_sample",null,"Root Mean Squared Error","model1"}
null null null "model2" "rmse" "Root Mean Squared Error" {"float",1.839277,null,null,null,null,null} "1.8" [] [] {"across_sample",null,"Root Mean Squared Error","model2"}

Adding groups and scopes

# Define metrics
metrics = [
    MetricDefine(name="n_subject", label = "Number of Subjects", scope = "global"), 
    MetricDefine(name="n_sample", label = "Number of Samples", scope = "group"), 
    MetricDefine(name="pct_sample_with_data", label = "Percent of Samples with Data", scope = "group"),
    MetricDefine(name="mae", label="MAE"),
    MetricDefine(name="rmse", label="RMSE"),
]

# Create evaluator and run evaluation
evaluator = MetricEvaluator(
    df=data,
    metrics=metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"]
)

res = evaluator.evaluate()
res
shape: (13, 7)
estimate metric label value metric_type scope treatment
enum enum enum str str str str
null "n_subject" "Number of Subjects" "6" "across_sample" "global" null
null "n_sample" "Number of Samples" "9" "across_sample" "group" "A"
null "pct_sample_with_data" "Percent of Samples with Data" "100.0" "across_sample" "group" "A"
"model1" "mae" "MAE" "1.0" "across_sample" null "A"
"model2" "mae" "MAE" "1.7" "across_sample" null "A"
null "pct_sample_with_data" "Percent of Samples with Data" "88.9" "across_sample" "group" "B"
"model1" "mae" "MAE" "1.0" "across_sample" null "B"
"model2" "mae" "MAE" "1.3" "across_sample" null "B"
"model1" "rmse" "RMSE" "1.1" "across_sample" null "B"
"model2" "rmse" "RMSE" "1.7" "across_sample" null "B"

The evaluation output keeps a lightweight value column for quick inspection, but the full detail lives in the stat struct and the companion stat_fmt, warning, and error columns. Use these when you need typed payloads or diagnostics.

When a MetricInfo declares value_kind="int", the evaluator stores the integer value under stat.value_int, ready for reuse in downstream calculations:

# Grab integer counts for the subject metrics
res_verbose = evaluator.evaluate(verbose=True)
res_verbose.filter(pl.col("metric") == "n_subject").with_columns(
    pl.col("stat").struct.field("value_int").alias("subject_count")
)
shape: (1, 16)
id groups subgroups estimate metric label value stat stat_fmt context warning error metric_type scope treatment subject_count
null struct[1] null enum enum enum str struct[7] str struct[4] list[str] list[str] str str str i64
null null null null "n_subject" "Number of Subjects" "6" {"int",6.0,6,null,null,null,null} "6" {"across_sample","global","Number of Subjects",null} [] [] "across_sample" "global" null 6
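
The same verbose frame also exposes the diagnostic columns, so warnings and errors can be inspected next to the formatted values:

# Surface formatted values alongside any warnings or errors
res_verbose.select(["metric", "estimate", "stat_fmt", "warning", "error"])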

Structured payloads and custom formatting

When a metric returns more than a single scalar, surface it as a struct and optionally supply a formatter. The evaluator keeps the struct in stat.value_struct while the formatter drives stat_fmt:

# Register a metric that surfaces a richer payload as a struct
MetricRegistry.register_metric(
    "mae_with_bounds",
    MetricInfo(
        expr=pl.struct(
            [
                pl.col("absolute_error").mean().alias("mean"),
                pl.col("absolute_error").std().alias("sd"),
            ]
        ),
        format="{0[mean]:.1f} +/- {0[sd]:.1f}",
    ),
)

evaluator = MetricEvaluator(
    df=data,
    metrics=MetricDefine(name="mae_with_bounds"),
    ground_truth="actual",
    estimates=["model1"],
)

bounds_res = evaluator.evaluate(verbose=True)
bounds_res.select(["metric", "estimate", "stat_fmt"]).head()

# Inspect the struct payload when needed
bounds_res.select(["metric", "stat"]).head()
shape: (1, 2)
metric stat
enum struct[7]
"mae_with_bounds" {"float",{1.0,0.72111},null,null,null,null,"{0[mean]:.1f} +/- {0[sd]:.1f}"}

Pivot helpers

Both pivot_by_group() and pivot_by_model() reshape the evaluation output into presentation-friendly tables while keeping formatted columns intact:

evaluator.pivot_by_group()
shape: (1, 1)
{"model1","mae_with_bounds"}
str
"1.0 +/- 0.7"
evaluator.pivot_by_model()
shape: (1, 3)
estimate mae_with_bounds estimate_label
str str str
"model1" "1.0 +/- 0.7" "model1"

Subject-level metrics

Subject-oriented aggregations either keep identifiers for every subject (within_subject) or summarise subject-level results into a single row (across_subject). The evaluator handles the hierarchical grouping and preserves entity identifiers in the id struct.

Within-subject metrics

within_subject_metrics = MetricDefine(
    name="mae",
    type="within_subject",
    label="MAE per Subject",
)


evaluator = MetricEvaluator(
    df=data,
    metrics=within_subject_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (12, 7)
id estimate metric label value metric_type scope
struct[1] enum enum enum str str str
{1} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{4} "model1" "mae" "MAE per Subject" "1.2" "within_subject" null
{5} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{6} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{3} "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
{5} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
{2} "model2" "mae" "MAE per Subject" "1.1" "within_subject" null
{3} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
{4} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{1} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null

The resulting rows include an id struct with the subject identifiers, which makes it easy to join back to other subject-level metadata:

# Inspect subject identifiers carried in the id struct
evaluator.evaluate().unnest(["id"])
shape: (12, 7)
subject_id estimate metric label value metric_type scope
i64 enum enum enum str str str
2 "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
5 "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
3 "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
4 "model1" "mae" "MAE per Subject" "1.2" "within_subject" null
6 "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
6 "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
4 "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
3 "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
1 "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
2 "model2" "mae" "MAE per Subject" "1.1" "within_subject" null

Across-subject metrics

across_subject_metrics = MetricDefine(
    name="mae:mean",
    type="across_subject",
    label="Mean of Subject MAEs",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=across_subject_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae:mean" "Mean of Subject MAEs" "1.0" "across_subject" null
"model2" "mae:mean" "Mean of Subject MAEs" "1.5" "across_subject" null

Combining subject-level views

# Combine the subject-level metric definitions in a single list
evaluator = MetricEvaluator(
    df=data,
    metrics=[
        within_subject_metrics,
        across_subject_metrics
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (14, 7)
id estimate metric label value metric_type scope
struct[1] enum enum enum str str str
{2} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{3} "model1" "mae" "MAE per Subject" "2.2" "within_subject" null
{6} "model1" "mae" "MAE per Subject" "1.0" "within_subject" null
{1} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{5} "model1" "mae" "MAE per Subject" "0.3" "within_subject" null
{6} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{4} "model2" "mae" "MAE per Subject" "1.4" "within_subject" null
{5} "model2" "mae" "MAE per Subject" "1.7" "within_subject" null
null "model1" "mae:mean" "Mean of Subject MAEs" "1.0" "across_subject" null
null "model2" "mae:mean" "Mean of Subject MAEs" "1.5" "across_subject" null

Visit-level metrics

Visit metrics mirror the subject patterns but operate on combined subject_id / visit_id keys. Use within_visit to keep per-visit rows and across_visit to summarize the visit distribution.

Within-visit metrics

within_visit_metrics = MetricDefine(
    name="mae",
    type="within_visit",
    label="MAE per Visit",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=within_visit_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (36, 7)
id estimate metric label value metric_type scope
struct[2] enum enum enum str str str
{6,2} "model1" "mae" "MAE per Visit" "1.0" "within_visit" null
{2,1} "model1" "mae" "MAE per Visit" "0.6" "within_visit" null
{5,3} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{6,3} "model1" "mae" "MAE per Visit" "1.4" "within_visit" null
{1,3} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{1,2} "model2" "mae" "MAE per Visit" "3.1" "within_visit" null
{6,1} "model2" "mae" "MAE per Visit" "0.0" "within_visit" null
{4,2} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
{4,3} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
{2,3} "model2" "mae" "MAE per Visit" null "within_visit" null

Across-visit metrics

across_visit_metrics = MetricDefine(
    name="mae:mean",
    type="across_visit",
    label="Mean of Visit MAEs",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=across_visit_metrics,
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "mae:mean" "Mean of Visit MAEs" "1.0" "across_visit" null
"model2" "mae:mean" "Mean of Visit MAEs" "1.5" "across_visit" null

Combining visit-level views

evaluator = MetricEvaluator(
    df=data,
    metrics=[
        within_visit_metrics,
        across_visit_metrics,
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (38, 7)
id estimate metric label value metric_type scope
struct[2] enum enum enum str str str
{5,2} "model1" "mae" "MAE per Visit" "0.2" "within_visit" null
{3,2} "model1" "mae" "MAE per Visit" "2.2" "within_visit" null
{4,1} "model1" "mae" "MAE per Visit" null "within_visit" null
{2,2} "model1" "mae" "MAE per Visit" "1.0" "within_visit" null
{3,1} "model1" "mae" "MAE per Visit" "1.8" "within_visit" null
{5,1} "model2" "mae" "MAE per Visit" "1.0" "within_visit" null
{5,3} "model2" "mae" "MAE per Visit" "1.1" "within_visit" null
{6,3} "model2" "mae" "MAE per Visit" "2.1" "within_visit" null
null "model1" "mae:mean" "Mean of Visit MAEs" "1.0" "across_visit" null
null "model2" "mae:mean" "Mean of Visit MAEs" "1.5" "across_visit" null

Metric scopes

Scopes override the default behaviour that evaluates every metric per estimate and group. global collapses everything into a single row, group keeps one row per group value, and model isolates each estimate regardless of group columns.

Global scope

Global scope metrics compute a single value across the entire dataset, ignoring model and group distinctions.

global_scope_metrics = MetricDefine(
    name="n_subject",
    scope="global",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[
        global_scope_metrics,
    ],
    ground_truth="actual",
    estimates=["model1", "model2"],
)

evaluator.evaluate()
shape: (1, 6)
estimate metric label value metric_type scope
str enum enum str str str
null "n_subject" "n_subject" "6" "across_sample" "global"

Group scope

Group scope metrics compute one value per group, aggregating across all models.

group_scope_metrics = MetricDefine(
    name="n_subject",
    scope="group",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[group_scope_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (2, 7)
estimate metric label value treatment metric_type scope
str enum enum str str str str
null "n_subject" "n_subject" "3" "A" "across_sample" "group"
null "n_subject" "n_subject" "3" "B" "across_sample" "group"

Model scope

Model scope metrics compute one value per model, ignoring group distinctions.

model_scope_metrics = MetricDefine(
    name="n_sample_with_data",
    scope="model",
)

evaluator = MetricEvaluator(
    df=data,
    metrics=[model_scope_metrics],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (2, 6)
estimate metric label value metric_type scope
enum enum enum str str str
"model1" "n_sample_with_data" "n_sample_with_data" "17" "across_sample" "model"
"model2" "n_sample_with_data" "n_sample_with_data" "17" "across_sample" "model"

Grouping and stratification

Use group_by to partition metrics into cohorts (for example, by treatment arm) and subgroup_by to explode each cohort into additional breakdowns such as gender or race. Both arguments accept either a list of column names or a mapping to display labels.
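
For example, the mapping form pairs each column with a display label (a minimal sketch; the labels here are illustrative):

evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates=["model1"],
    group_by={"treatment": "Treatment Arm"},
    subgroup_by={"gender": "Gender", "race": "Race"},
)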

Group-by analysis

# Evaluate metrics by treatment group
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae"), MetricDefine(name="rmse")],
    ground_truth="actual",
    estimates=["model1", "model2"],
    group_by=["treatment"],
)

evaluator.evaluate()
shape: (8, 7)
estimate metric label value treatment metric_type scope
enum enum enum str str str str
"model1" "mae" "mae" "1.0" "A" "across_sample" null
"model2" "mae" "mae" "1.7" "A" "across_sample" null
"model1" "rmse" "rmse" "1.3" "A" "across_sample" null
"model2" "rmse" "rmse" "2.0" "A" "across_sample" null
"model1" "mae" "mae" "1.0" "B" "across_sample" null
"model2" "mae" "mae" "1.3" "B" "across_sample" null
"model1" "rmse" "rmse" "1.1" "B" "across_sample" null
"model2" "rmse" "rmse" "1.7" "B" "across_sample" null

Subgroup analysis

# Evaluate with subgroup stratification (gender and race already in data)
evaluator = MetricEvaluator(
    df=data,
    metrics=[MetricDefine(name="mae")],
    ground_truth="actual",
    estimates=["model1"],
    group_by=["treatment"],
    subgroup_by=["gender", "race"],
)

evaluator.evaluate()
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"

Filtering and reuse

Large evaluations can be expensive. Two conveniences help keep things fast:

  • evaluate(metrics=..., estimates=...) lets you rerun the evaluator on a subset of the originally configured metrics or estimates without rebuilding the instance.
  • filter(metrics=..., estimates=...) returns a lightweight evaluator that shares the same lazy frame and cached error columns.

Results are cached by (metric, estimate) combination, so repeating the same call avoids recomputation. Call clear_cache() when the underlying data changes or you want a fresh evaluation.

# Re-evaluate only MAE for model1 using the cached pipeline
evaluator.evaluate(metrics=MetricDefine(name="mae"), estimates="model1")
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"
# Or create a filtered evaluator for model1-only summaries
model1_eval = evaluator.filter(estimates="model1")
model1_eval.evaluate()
shape: (6, 11)
subgroup_name subgroup_value estimate metric label value treatment metric_type scope gender race
str str enum enum enum str str str str str str
"gender" "F" "model1" "mae" "mae" "1.0" "A" "across_sample" null "F" null
"race" "Asian" "model1" "mae" "mae" "2.2" "A" "across_sample" null null "Asian"
"race" "White" "model1" "mae" "mae" "0.3" "A" "across_sample" null null "White"
"gender" "M" "model1" "mae" "mae" "1.0" "B" "across_sample" null "M" null
"race" "Black" "model1" "mae" "mae" "1.0" "B" "across_sample" null null "Black"
"race" "Hispanic" "model1" "mae" "mae" "1.2" "B" "across_sample" null null "Hispanic"