Visualizing Benchmark Results

Use BenchmarkAggregator to collect test metrics from completed experiments and produce comparison plots and summary tables.

How it works

BenchmarkAggregator is the built-in tool for visualizing results. It wraps a list of Experiment objects and, when .run() is called:

  1. Calls experiment.run() on each experiment to collect (or retrieve from cache) the test-metric dictionary.

  2. Merges each result dict with the experiment config into a single DataFrame.

  3. Maps each loss type to a primary metric and produces:

    • A bar chart (outputs/core/core_bar_chart.png)

    • A results table (outputs/core/core_results_table.csv)

    • A rank table (outputs/core/core_rank_table.csv)

Outputs are organised into three subfolders of outputs/:

  • core/ – Core suite: one dataset per task (e.g. NeuralBench-EEG-Core v1.0 for an EEG run)

  • full/ – Full suite: per-dataset breakdowns plus dataset-level variability (e.g. NeuralBench-EEG-Full v1.0)

  • other/ – data scaling, computational stats, …
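
For orientation, here is a minimal programmatic sketch of the same flow. Only the .run() method and the list-of-Experiment input are documented on this page; the import path and the experiments keyword below are assumptions.

# Minimal sketch: the import path and constructor keyword are assumptions.
from neuralbench import BenchmarkAggregator

experiments = [...]  # previously configured Experiment objects

aggregator = BenchmarkAggregator(experiments=experiments)

# Runs (or loads from cache) each experiment, merges metrics and configs
# into one DataFrame, and writes the chart and tables under outputs/.
aggregator.run()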

Triggering it from the CLI

The simplest way to drive BenchmarkAggregator is the CLI's --plot-cached flag. Once the experiments have been run (and their results cached), re-invoke the CLI with --plot-cached to collect the stored results and generate all outputs without retraining:

# First, run experiments (results are cached automatically)
neuralbench eeg audiovisual_stimulus sleep_stage -m eegnet eegconformer

# Then, re-run with --plot-cached to generate comparison outputs
neuralbench eeg audiovisual_stimulus sleep_stage -m eegnet eegconformer --plot-cached

--plot-cached does not launch any experiments; it only reads the results already persisted by exca and drives BenchmarkAggregator.
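
Once generated, the CSV outputs are ordinary files and can be inspected directly, for example with pandas:

import pandas as pd

# Core-suite output paths as documented above.
results = pd.read_csv("outputs/core/core_results_table.csv")
ranks = pd.read_csv("outputs/core/core_rank_table.csv")
print(results.head())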

Configuration

BenchmarkAggregator has a few configurable attributes:

Field                  Default
-------------------    ----------------------------
max_workers            256
collect_max_workers    32
debug                  False
output_dir             "<neuralbench-repo>/outputs"

Loss-to-metric mapping

The loss_to_metric_mapping attribute determines which metric is used as the primary performance indicator for each task type. This is used for plotting and ranking:

Loss                 Primary metric
-----------------    -----------------------------------------
CrossEntropyLoss     test/bal_acc
BCEWithLogitsLoss    test/f1_score_macro
MSELoss              test/pearsonr
ClipLoss             test/full_retrieval/top5_acc_subject-agg
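
Expressed as a plain Python mapping, the table corresponds to something like the sketch below; the metric names come from the table itself, while the string-keyed dict shape is an assumption:

# Metric names are taken from the table above; whether the keys are loss
# class names (strings) or loss classes themselves is an assumption.
loss_to_metric_mapping = {
    "CrossEntropyLoss": "test/bal_acc",
    "BCEWithLogitsLoss": "test/f1_score_macro",
    "MSELoss": "test/pearsonr",
    "ClipLoss": "test/full_retrieval/top5_acc_subject-agg",
}

# Example lookup for a classification task trained with cross-entropy:
primary_metric = loss_to_metric_mapping["CrossEntropyLoss"]  # "test/bal_acc"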
