Word decoding

Name: word
Category: cognitive decoding
Dataset: Nieuwland2018
Objective: Retrieval
Split: Predefined

Usage

neuralbench eeg word
config.yaml
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

data:
  study:
    source:
      name: Nieuwland2018Large
      query: "site not in ['GLAS', 'LOND']"
    preprocess_text:
      name: TextPreprocessor
      neuro_event_type: Eeg
    split:
      name: SklearnSplit
      split_by: sequence_id
      valid_split_ratio: 0.1
      test_split_ratio: 0.1
      valid_random_state: 33
      test_random_state: 33
  target:
    name: SpacyEmbedding
    aggregation: trigger
    infra:
      cluster: auto
      keep_in_ram: true
      timeout_min: 180
      gpus_per_node: 1
      cpus_per_task: 10
      min_samples_per_job: 16
  trigger_event_type: Word
  start: -0.5
  duration: 3.0
  summary_columns: [text, sequence_id]
brain_model_output_size: &brain_model_output_size 1024
trainer_config.monitor: val/batch_top5_acc
trainer_config.mode: max
loss:
  name: ClipLoss
  norm_kind: y
  temperature: false
  symmetric: false
metrics: !!python/name:neuralbench.defaults.metrics.retrieval_metrics
test_full_retrieval_metrics: !!python/name:neuralbench.defaults.metrics.test_full_retrieval_metrics
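
For reference, below is a minimal PyTorch sketch of a CLIP-style retrieval loss matching the options above (norm_kind: y, temperature: false, symmetric: false). The function name and exact normalization choices are assumptions for illustration, not the library's implementation.

import torch
import torch.nn.functional as F

def clip_style_loss(pred, target):
    # pred:   [batch, dim] brain-model outputs
    # target: [batch, dim] word embeddings
    target = F.normalize(target, dim=-1)                  # norm_kind: y -> only targets are normalized
    logits = pred @ target.t()                            # [batch, batch] similarity matrix
    labels = torch.arange(len(pred), device=pred.device)  # i-th output should retrieve i-th word
    # temperature: false -> logits used as-is; symmetric: false -> single
    # cross-entropy in the brain -> word direction only.
    return F.cross_entropy(logits, labels)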

Description

The word decoding task consists of retrieving individual word stimuli from EEG recordings [dAscoli2025]. In this task, we use the Nieuwland2018 dataset [Nieuwland2018], which contains EEG data recorded at 8 UK laboratories while subjects read 80 sentences presented word by word on a screen in a rapid serial visual presentation (RSVP) paradigm. Word embeddings are extracted using contextualized GPT-2 representations.
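
As a rough illustration, the sketch below extracts word-level contextual embeddings with the Hugging Face transformers library; the gpt2-medium checkpoint (whose 1024-dimensional hidden states happen to match brain_model_output_size above) and the mean-pooling over sub-word tokens are assumptions, and the pipeline's own embedding target may differ.

import torch
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2Model.from_pretrained("gpt2-medium").eval()

sentence = "The children went outside to play."
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]        # [n_tokens, 1024]

# Pool sub-word token states back into one vector per word.
word_ids = enc.word_ids()
n_words = max(i for i in word_ids if i is not None) + 1
word_vectors = torch.stack([
    hidden[[t for t, w in enumerate(word_ids) if w == i]].mean(dim=0)
    for i in range(n_words)
])                                                    # [n_words, 1024]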

We exclude the GLAS (Glasgow, 128-channel BioSemi) and LOND (London, 34-channel) sites because their EEG montages are incompatible with the standard ~64-channel 10-20 systems used by the other 6 sites. Including them inflates the channel dimension to 194 (the union of all unique channel names) with heavy zero-padding, significantly slowing training without improving evaluation quality.
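As a sketch of what the source query does to the channel dimension, the snippet below applies the same filter to a hypothetical recordings table and compares the union of channel names with and without the excluded sites (site codes other than GLAS and LOND, and the channel lists, are made up for illustration).

import pandas as pd

# One row per recording; channel lists are illustrative, not the real montages.
recordings = pd.DataFrame({
    "site": ["BRIS", "GLAS", "LOND", "YORK"],
    "channels": [["Fz", "Cz", "Pz"], ["A1", "A2", "B1"], ["X1", "X2"], ["Fz", "Cz", "Pz"]],
})

kept = recordings.query("site not in ['GLAS', 'LOND']")  # same filter as the config query
union_all = set().union(*recordings["channels"])         # all sites: union of incompatible montages
union_kept = set().union(*kept["channels"])              # remaining sites: a shared layout
print(len(union_all), len(union_kept))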

As in [dAscoli2025], the retrieval set is built from the 250 most frequent words in the test split.
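A minimal sketch of how such a restricted retrieval evaluation can be set up is shown below; the function and variable names are illustrative, and the actual metrics are defined in neuralbench.defaults.metrics.

from collections import Counter
import torch
import torch.nn.functional as F

def build_retrieval_vocab(test_words, size=250):
    # Keep the `size` most frequent words of the test split as retrieval candidates.
    return [w for w, _ in Counter(test_words).most_common(size)]

def top5_accuracy(pred, candidate_embeddings, target_index):
    # pred: [batch, dim] brain-model outputs; candidate_embeddings: [250, dim];
    # target_index: [batch] index of the true word within the candidate set.
    sims = pred @ F.normalize(candidate_embeddings, dim=-1).t()  # targets L2-normalized, as in the loss
    top5 = sims.topk(5, dim=-1).indices                          # [batch, 5]
    return (top5 == target_index[:, None]).any(dim=-1).float().mean()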

References

[dAscoli2025]

d’Ascoli, Stéphane, et al. “Towards decoding individual words from non-invasive brain recordings.” Nature Communications 16.1 (2025): 10521.

[Nieuwland2018]

Nieuwland, Mante S., et al. “Large-scale replication study reveals a limit on probabilistic prediction in language comprehension.” eLife 7 (2018): e33468.