Under review at ICLR 2026

SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

Md Kowsher1,2, Ali O. Polat1, Ehsan Mohammady Ardehaly1, Mehrdad Salehi1,

Zia Ghiasi1, Prasanth Murali1, Chen Chen2

1Meta  2University of Central Florida

Figure 1: (Left) Winning Tickets. In a pretrained network, a \emph{randomly chosen slice} of a layer \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\) acts as a local winning ticket: tuning only that slice lowers the loss while keeping the backbone frozen. A few such slices (row, column, or row-column) selected across layers constitute a \emph{global winning ticket}. (Right) SliceFine. At step \(t\), only a slice of the weight matrix \(W^{(\ell)}\) is updated; all other entries remain fixed. Every \(N\) steps, a new slice at a different position becomes active; the previously active slice keeps its learned update but is frozen. Top: column sweep, where the slice slides across columns. Bottom: row–column alternation, where the slice alternates between a column block and a row block to cover complementary directions; a row sweep likewise slides the slice across rows. This schedule updates only a tiny portion of the model at a time while gradually covering many regions; applying it across several layers yields a global winner.

Abstract

This paper presents a theoretical framework that explains why fine-tuning small, randomly selected subnetworks (slices) within pretrained models is sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance, meaning the eigenspectra of different slices of a weight matrix are remarkably similar; and (2) high task energy, meaning the backbone representations of the pretrained weights retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Building on this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights, introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state-of-the-art PEFT methods across diverse language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.

Why this work matters

What current PEFT misses — and how SliceFine fixes it

Limitations in the wild

Common gaps in existing PEFT

  • Extra parameters: adapters/low-rank modules increase model size & optimizer state.
  • Placement sensitivity: where you insert modules can change results significantly.
  • Static subnet updates: fixed rows/columns or modules can miss useful directions.
  • Memory/latency costs: added layers/factors slow training & inference.
  • Weak theory link: explanations of why small modules work often lack a simple, testable account.
Our approach

How SliceFine addresses these

  • Zero new parameters: update a small slice of W; no adapters needed.
  • Moving slice schedule: block-coordinate descent sweeps rows/cols for coverage.
  • Small rank works: even r=1 is often competitive; complexity O(d × r).
  • Modality-agnostic: consistent gains across LLM, vision, and video backbones.
  • Theory-backed: spectral balance + high task energy explain why slices win.

Result: strong accuracy with lower memory, faster throughput, and compact artifacts.

🔷 Universal Winning–Slice Hypothesis (UWSH)

In a dense, pretrained network, any random slice with sufficient width acts as a local winning ticket: training only that slice while freezing the rest improves downstream performance. Moreover, tuning a small set of such slices across layers can match full fine-tuning accuracy while updating far fewer parameters.
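
To make the hypothesis concrete, here is a minimal sketch of the local winning-ticket test in PyTorch: freeze a toy backbone, make a single randomly placed column slice of one weight matrix trainable via a gradient mask, and check that the loss still falls. This is illustrative only, not the paper's implementation; the model, data, and slice width are placeholders.

Python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "pretrained" backbone; in practice this would be a real pretrained model.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 4))
for p in model.parameters():
    p.requires_grad = False

# Choose one weight matrix W^(l) and a random column slice of width r.
W = model[0].weight                      # shape (d_l, d_{l-1})
r = 4
pos = torch.randint(0, W.shape[1] - r, (1,)).item()
W.requires_grad = True

# Zero gradients outside the slice, so only the slice is ever updated.
mask = torch.zeros_like(W)
mask[:, pos:pos + r] = 1.0
W.register_hook(lambda g: g * mask)

opt = torch.optim.Adam([W], lr=1e-2)     # no weight decay, so frozen entries stay intact
x, y = torch.randn(256, 64), torch.randint(0, 4, (256,))
for step in range(201):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, round(loss.item(), 4))   # loss typically decreases despite tuning only r columns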

Contributions

What we bring

1) Theory

Formalizes spectral balance across weight slices and high task energy in pretrained features, explaining why slices receive non-zero restricted gradients and reduce loss.

2) Method

SliceFine: trains a small, moving row/column slice per layer—no added parameters—covering task-aligned directions over time.

3) Empirics

Competitive or better than strong PEFT baselines across language, vision, and video with reduced memory & faster throughput.

4) Guidance

Simple rules for choosing slice rank and switching interval; rank-1 often suffices, with robustness to slice position.

Empirical evidence
Is any slice a winner?

Performance is largely insensitive to slice selection

We empirically test the Universal Winning–Slice Hypothesis with three complementary experiments, all indicating that downstream performance is largely insensitive to which slice is selected, supporting the winning-slice property.

Figure 2:  Empirical evidence for the robustness of slice selection strategies across tasks. (a) Rank vs. Accuracy: Increasing the slice rank improves accuracy up to a point, after which validation accuracy declines, indicating gradual overfitting. (b) Position vs. Accuracy: Accuracy remains stable across slice positions, within \( \pm 1\% \) of the anchor accuracy. (c) Wanda category ablations: Accuracy is insensitive to whether slices are chosen from most important, less important, mixed, or random weights. (d) LTH comparison: Even “bad” slices perform comparably to the “best” slices, supporting the winner-slice property — pretrained networks contain many capable subnetworks.

Positional robustness. See Figure 2(b).

Using a fixed slice rank of \(5\), we train slices at multiple positions of the weight matrix and measure accuracy. Accuracy remains within \(\pm 1\%\) of the anchor across all positions, showing that both row and column slices contribute comparably to the task-relevant subspace.

Importance robustness (Wanda). See Figure 2(c).

We adopt the Wanda pruning heuristic to rank weights by importance. Given a weight matrix \( W \in \mathbb{R}^{d_\ell \times d_{\ell-1}} \) and activations \( X \in \mathbb{R}^{s \times d_{\ell-1}} \) from a sequence of length \( s \), Wanda defines the importance of entry \((i,j)\) as \( S_{ij} = |W_{ij}| \cdot \lVert X_{\cdot j} \rVert_2 \), where \( \lVert X_{\cdot j} \rVert_2 \) is the \(\ell_2\) norm of the \(j\)-th input feature across the batch. To score a slice, we aggregate over its entries: \( S_{\text{slice}} = \sum_{(i,j)\in \text{slice}} S_{ij} \). We then select slices from the most-important, least-important, mixed, or random categories. Results show nearly identical accuracy across all categories, confirming that slice winners emerge regardless of weight importance.
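
For concreteness, the slice score above can be computed in a few lines. The sketch below assumes PyTorch tensors and a hypothetical helper name wanda_slice_score; it simply implements \( S_{ij} = |W_{ij}| \cdot \lVert X_{\cdot j} \rVert_2 \) and sums over a chosen slice, and is not taken from the paper's codebase.

Python
import torch

def wanda_slice_score(W, X, rows, cols):
    """Sum of Wanda scores S_ij = |W_ij| * ||X_:,j||_2 over a slice."""
    col_norms = X.norm(dim=0)              # ||X_:,j||_2 for each input feature j
    S = W.abs() * col_norms                # elementwise importance, shape (d_l, d_{l-1})
    return S[rows][:, cols].sum().item()

W = torch.randn(768, 768)                  # a weight matrix W^(l)
X = torch.randn(128, 768)                  # activations for a sequence of length 128
# Column slice of width 5 starting at position 100 (all rows, columns 100-104):
print(wanda_slice_score(W, X, rows=slice(None), cols=slice(100, 105)))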

“Good” vs “Bad” slices (LTH view). See Figure 2(d).

Following the Lottery Ticket Hypothesis framework, we extract both “winning” and “losing” sparse subnetworks and use their masks to define slices. Standard LTH seeks a mask \( M \in \{0,1\}^d \) such that the pruned subnetwork matches full-model accuracy: \( \mathcal{L}(f_{\theta \odot M}(x), y) \approx \mathcal{L}(f_{\theta}(x), y) \). Surprisingly, even slices derived from “bad” subnetworks perform comparably to those from “good” ones — reinforcing that pretrained networks contain many capable subnetworks and that every sufficiently wide slice can be a winner.

Efficiency Analysis

Efficiency of SliceFine

Efficiency comparison
Figure 3:  Comparison of PEFT methods on (a) model size, (b) peak memory, (c) throughput, and (d) total training time across ViT, VideoMAE, and RoBERTa backbones. SliceFine consistently shows 3–5% smaller models, 2–4 GB lower peak memory, 15–25% faster throughput, and 40–50% shorter training times.

SliceFine achieves remarkable computational and memory efficiency compared to LoRA-style and adapter-based PEFT methods. Slice training scales as \(O(d_\ell \times r)\) (row) or \(O(d_{\ell-1} \times r)\) (column), introducing zero additional parameters (\#APs = 0). In contrast, LoRA and adapter-based methods add at least \(r(d_\ell+d_{\ell-1})\) new parameters per layer (see Table 1), increasing both space and time complexity. Empirically, SliceFine yields faster iteration rates, smaller peak memory, and shorter wall-clock times across ViT, VideoMAE, and RoBERTa backbones.

| Methods | Space | Time | #TTPs | #APs |
| --- | --- | --- | --- | --- |
| FT | \(O(d_\ell \times d_{\ell-1})\) | \(O(d_\ell \times d_{\ell-1})\) | \(d_\ell \cdot d_{\ell-1}\) | 0 |
| (IA)\(^3\) | \(O(d_k + d_v + d_{ff})\) | \(O(d_k + d_v + d_{ff})\) | \(d_k + d_v + d_{ff}\) | \(d_k + d_v + d_{ff}\) |
| Prompt | \(O(d \times l_p)\) | \(O(d \times l_p)\) | \(l_p \cdot d\) | \(l_p \cdot d\) |
| Prefix | \(O(L \times d \times l_p)\) | \(O(L \times d \times l_p)\) | \(L \cdot l_p \cdot d\) | \(L \cdot l_p \cdot d\) |
| LoRA | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(2 \cdot d \cdot r\) | \(2 \cdot d \cdot r\) |
| LoRA-FA | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(d \cdot r\) | \(2 \cdot d \cdot r\) |
| AdaLoRA | \(O((d_\ell + d_{\ell-1} + r) \times r)\) | \(O((d_\ell + d_{\ell-1} + r) \times r)\) | \(2 \cdot d \cdot r + r^2\) | \(2 \cdot d \cdot r + r^2\) |
| LoHA | \(O(2r \times (d_\ell + d_{\ell-1}))\) | \(O(2r \times (d_\ell + d_{\ell-1}))\) | \(4 \cdot d \cdot r\) | \(4 \cdot d \cdot r\) |
| Propulsion | \(O(d)\) | \(O(d)\) | \(d\) | \(d\) |
| SliceFine (row) | \(O(d_\ell \times r)\) | \(O(d_\ell \times r)\) | \(r \cdot d_\ell\) | 0 |
| SliceFine (column) | \(O(d_{\ell-1} \times r)\) | \(O(d_{\ell-1} \times r)\) | \(r \cdot d_{\ell-1}\) | 0 |

Table 1: Comparison of space/time complexity, total trainable parameters (#TTPs), and additional parameters (#APs) per layer \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\). SliceFine achieves \(O(d_\ell×r)\) or \(O(d_{\ell-1}×r)\) complexity with no additional parameters, unlike other methods that incur higher costs.
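
As a quick sanity check on the per-layer counts in Table 1, the snippet below spells out the arithmetic for a square RoBERTa-base-sized projection (\(d_\ell = d_{\ell-1} = 768\)). The helper names are ours and purely illustrative.

Python
# Per-layer trainable-parameter counts implied by Table 1 (illustrative arithmetic only).
def slicefine_row(d_out, d_in, r):
    return r * d_out                  # row slice of width r; zero parameters are added

def slicefine_col(d_out, d_in, r):
    return r * d_in                   # column slice of width r; zero parameters are added

def lora(d_out, d_in, r):
    return r * (d_out + d_in)         # LoRA's A and B factors; all of these are new parameters

d, r = 768, 5
print("SliceFine (row):   ", slicefine_row(d, d, r))   # 3840 trainable, 0 added
print("SliceFine (column):", slicefine_col(d, d, r))   # 3840 trainable, 0 added
print("LoRA:              ", lora(d, d, r))             # 7680 trainable (= 2*d*r), all added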

Performance results

Main results across language, vision, and video

In the paper's experiments, SliceFine matches or surpasses strong PEFT baselines across language (RoBERTa, LLaMA), vision (ViT), and video (VideoMAE) backbones while updating far fewer parameters. Results are consistent across multiple datasets; gains are strongest at low ranks, with stability across slice positions.

Result tables from the paper:
  • Main language results
  • Vision and video results (VTAB-1K)
  • Commonsense and math reasoning
  • Natural language understanding (NLU)
  • GLUE benchmark
Code Usage

How to use SliceFine in your pipeline

SliceFine updates small row/column slices of the original weight matrices (no new parameters), then optionally merges back into the backbone for inference. Below are minimal snippets for injection, training with a HuggingFace-style trainer, and restoring layers for deployment.

1) Inject SliceFine into your model

Python
from slice import inject_peft, restore_layers

# Insert SliceFine slices into your model (no new params added)
inject_peft(
    model,                 # your HF/torch model
    rank=5,                # slice rank (capacity)
    position=0,            # starting column/row position
    bias=False,            # include bias slices or not
    mode=("row","column"), # ("row",) or ("column",) or both
)
Tip: mode=("row","column") alternates row/column slices across steps (block-coordinate descent). Use ("row",) or ("column",) if you want a single direction.

2) Train with SliceTrainer (HuggingFace style)

Python
from transformers import TrainingArguments
from SliceTrainer import SliceTrainer

training_args = TrainingArguments(
    output_dir="dir",
    learning_rate=3e-4,
    remove_unused_columns=False,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    evaluation_strategy="steps",
    save_strategy="no",
    logging_steps=100,
    save_steps=100,
    eval_steps=100,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    report_to=[],
    fp16=True,
)

trainer = SliceTrainer(
    model=model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    training_args=training_args,
    move_steps=1000,             # switch active slice every N steps
    rank=5,
    learnig_rate_decay=0.0001,   # note the spelling; see the note below
    min_learning_rate=0.00001,
    position=0,
    max_position=768,
    bias=False,
    peft_modes=("row","column"), # or ("row",) / ("column",)
    targets=None,                # or a list of target layer names
    verbose=True,
    tollerance=1,                # note the spelling; see the note below
    rank_decay=1,
    min_rank=1,
)

trainer.run()
Tip: move_steps controls how often the active slice shifts; small ranks (even r=1) often work well.
Note: If your SliceTrainer build uses the spellings learnig_rate_decay / tollerance, keep those names; otherwise consider standardizing to learning_rate_decay / tolerance.
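
If it helps to see the schedule explicitly, the sketch below mimics the behaviour described above: every move_steps optimizer steps the active slice shifts by rank positions and, when both modes are enabled, alternates between row and column blocks. The function active_slice is a hypothetical illustration of the idea, not the SliceTrainer internals.

Python
def active_slice(step, move_steps=1000, rank=5, position=0, max_position=768,
                 peft_modes=("row", "column")):
    """Which slice is trainable at a given optimizer step (illustrative only)."""
    switch = step // move_steps                              # how many times the slice has moved
    start = (position + switch * rank) % (max_position - rank)
    mode = peft_modes[switch % len(peft_modes)]              # alternate row/column if both given
    return mode, start, start + rank

for step in (0, 1000, 2000, 3000):
    print(step, active_slice(step))
# 0 ('row', 0, 5)
# 1000 ('column', 5, 10)
# 2000 ('row', 10, 15)
# 3000 ('column', 15, 20)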

3) Merge slices back for inference

Python
from slice import restore_layers

# After training, fold the learned slices back into the original weights
restore_layers(model)  # model is now slice-free and ready for deployment

Targeting specific layers (optional)

Python
# Example: restrict SliceFine to attention projections only
targets = [
    "encoder.layer.*.attention.self.query",
    "encoder.layer.*.attention.self.key",
    "encoder.layer.*.attention.self.value",
    "encoder.layer.*.attention.output.dense",
]

inject_peft(model, rank=4, position=0, bias=False, mode=("row","column"))
trainer = SliceTrainer(
    model=model,
    train_dataset=...,
    eval_dataset=...,
    data_collator=...,
    compute_metrics=...,
    training_args=training_args,
    move_steps=800,
    rank=4,
    position=0,
    max_position=768,
    peft_modes=("row","column"),
    targets=targets,   # only apply to matched layers
)
trainer.run()
FAQ: Row vs. column? A column slice adjusts how a specific input feature feeds into all outputs; a row slice adjusts a specific output channel. Alternating both tends to cover more task-relevant directions over time.
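
A two-line numerical check of that intuition (plain linear algebra, not SliceFine code): for \(y = Wx\), perturbing a column of \(W\) shifts every output in proportion to one input feature, while perturbing a row moves a single output channel.

Python
import torch

torch.manual_seed(0)
W = torch.randn(4, 6)                    # 4 output channels, 6 input features
x = torch.randn(6)

W_col = W.clone(); W_col[:, 2] += 1.0    # bump column 2 (an "incoming" direction)
print((W_col @ x) - (W @ x))             # every output shifts by x[2]

W_row = W.clone(); W_row[1, :] += 1.0    # bump row 1 (an "outgoing" channel)
print((W_row @ x) - (W @ x))             # only output 1 changes (by x.sum())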
Citation

BibTeX

BibTeX
@misc{kowsher2025slicefineuniversalwinningslicehypothesis,
      title={SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks}, 
      author={Md Kowsher and Ali O. Polat and Ehsan Mohammady Ardehaly and Mehrdad Salehi and Zia Ghiasi and Prasanth Murali and Chen Chen},
      year={2025},
      eprint={2510.08513},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.08513}, 
}