Under review at ICLR 2026

SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

Md Kowsher1,2, Ali O. Polat1, Ehsan Mohammady Ardehaly1, Mehrdad Salehi1,

Zia Ghiasi1, Prasanth Murali1, Chen Chen2

1Meta  2University of Central Florida

Figure 1: (Left) Winning Tickets. In a pretrained network, a \emph{randomly chosen slice} of a layer \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\) acts as a local winning ticket: tuning only that slice lowers the loss while keeping the backbone frozen. A few such slices (row, column, or row-column) selected across layers constitute a \emph{global winning ticket}. (Right) SliceFine. At step \(t\), only a slice of the weight matrix \(W^{(\ell)}\) is updated; all other entries remain fixed. Every \(N\) steps, a new slice at a different position becomes active; the previously active slice keeps its learned update but is frozen. Top: column sweep, where the slice slides across columns. Bottom: row–column alternation, where the slice alternates between a column block and a row block to cover complementary directions; a row sweep likewise slides the slice across rows. This schedule updates only a tiny portion of the model at a time while gradually covering many regions; applying it across several layers yields a global winner.

Abstract

This paper presents a theoretical framework that explains why fine-tuning small, randomly selected subnetworks (slices) within pretrained models is sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property arising from two phenomena: (1) spectral balance, meaning the eigenspectra of different slices of a weight matrix are remarkably similar; and (2) high task energy, meaning the backbone representations of the pretrained weights retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Building on this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights, introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state-of-the-art PEFT methods across diverse language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.

Why this work matters

What current PEFT misses — and how SliceFine fixes it

Limitations in the wild

Common gaps in existing PEFT

  • Extra parameters: adapters/low-rank modules increase model size & optimizer state.
  • Placement sensitivity: where you insert modules can change results significantly.
  • Static subnet updates: fixed rows/columns or modules can miss useful directions.
  • Memory/latency costs: added layers/factors slow training & inference.
  • Weak theory link: explanations of why small modules work often lack a simple, testable account.
Our approach

How SliceFine addresses these

  • Zero new parameters: update a small slice of W; no adapters needed.
  • Moving slice schedule: block-coordinate descent sweeps rows/cols for coverage.
  • Small rank works: even r=1 is often competitive; complexity O(d × r).
  • Modality-agnostic: consistent gains across LLM, vision, and video backbones.
  • Theory-backed: spectral balance + high task energy explain why slices win.

Result: strong accuracy with lower memory, faster throughput, and compact artifacts.

🔷 Universal Winning–Slice Hypothesis (UWSH)

In a dense, pretrained network, any random slice with sufficient width acts as a local winning ticket: training only that slice while freezing the rest improves downstream performance. Moreover, tuning a small set of such slices across layers can match full fine-tuning accuracy while updating far fewer parameters.
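
To make the hypothesis concrete, here is a minimal sketch of the local winning-ticket test in PyTorch: freeze a toy backbone, make a single randomly placed column slice of one weight matrix trainable via a gradient mask, and check that the loss still falls. This is illustrative only, not the paper's implementation; the model, data, and slice width are placeholders.

Python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "pretrained" backbone; in practice this would be a real pretrained model.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 4))
for p in model.parameters():
    p.requires_grad = False

# Choose one weight matrix W^(l) and a random column slice of width r.
W = model[0].weight                      # shape (d_l, d_{l-1})
r = 4
pos = torch.randint(0, W.shape[1] - r, (1,)).item()
W.requires_grad = True

# Zero gradients outside the slice, so only the slice is ever updated.
mask = torch.zeros_like(W)
mask[:, pos:pos + r] = 1.0
W.register_hook(lambda g: g * mask)

opt = torch.optim.Adam([W], lr=1e-2)     # no weight decay, so frozen entries stay intact
x, y = torch.randn(256, 64), torch.randint(0, 4, (256,))
for step in range(201):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, round(loss.item(), 4))   # loss typically decreases despite tuning only r columns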

Contributions

What we bring

1) Theory

Formalizes spectral balance across weight slices and high task energy in pretrained features, explaining why slices receive non-zero restricted gradients and reduce loss.

2) Method

SliceFine: trains a small, moving row/column slice per layer—no added parameters—covering task-aligned directions over time.

3) Empirics

Competitive or better than strong PEFT baselines across language, vision, and video with reduced memory & faster throughput.

4) Guidance

Simple rules for choosing slice rank and switching interval; rank-1 often suffices, with robustness to slice position.

Empirical evidence
Is any slice a winner?

Performance is largely insensitive to slice selection

We empirically test the Universal Winning–Slice Hypothesis with three complementary experiments, all indicating that downstream performance is largely insensitive to which slice is selected, supporting the winning-slice property.

Figure 2:  Empirical evidence for the robustness of slice selection strategies across tasks. (a) Rank vs. Accuracy: Increasing the slice rank improves accuracy up to a point, after which validation accuracy declines, indicating gradual overfitting. (b) Position vs. Accuracy: Accuracy remains stable across slice positions, within \( \pm 1\% \) of the anchor accuracy. (c) Wanda category ablations: Accuracy is insensitive to whether slices are chosen from most important, less important, mixed, or random weights. (d) LTH comparison: Even “bad” slices perform comparably to the “best” slices, supporting the winner-slice property — pretrained networks contain many capable subnetworks.

Positional robustness. See Figure 2(b).

Using a fixed slice rank of \(5\), we train slices at multiple positions of the weight matrix and measure accuracy. Accuracy remains within \(\pm 1\%\) of the anchor across all positions, showing that both row and column slices contribute comparably to the task-relevant subspace.

Importance robustness (Wanda). See Figure 2(c).

We adopt the Wanda pruning heuristic to rank weights by importance. Given a weight matrix \( W \in \mathbb{R}^{d_\ell \times d_{\ell-1}} \) and activations \( X \in \mathbb{R}^{s \times d_{\ell-1}} \) from a sequence of length \( s \), Wanda defines the importance of entry \((i,j)\) as \( S_{ij} = |W_{ij}| \cdot \lVert X_{\cdot j} \rVert_2 \), where \( \lVert X_{\cdot j} \rVert_2 \) is the \(\ell_2\) norm of the \(j\)-th input feature across the batch. To score a slice, we aggregate over its entries: \( S_{\text{slice}} = \sum_{(i,j)\in \text{slice}} S_{ij} \). We then select slices from the most-important, least-important, mixed, or random categories. Results show nearly identical accuracy across all categories, confirming that slice winners emerge regardless of weight importance.
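
For concreteness, the slice score above can be computed in a few lines. The sketch below assumes PyTorch tensors and a hypothetical helper name wanda_slice_score; it simply implements \( S_{ij} = |W_{ij}| \cdot \lVert X_{\cdot j} \rVert_2 \) and sums over a chosen slice, and is not taken from the paper's codebase.

Python
import torch

def wanda_slice_score(W, X, rows, cols):
    """Sum of Wanda scores S_ij = |W_ij| * ||X_:,j||_2 over a slice."""
    col_norms = X.norm(dim=0)              # ||X_:,j||_2 for each input feature j
    S = W.abs() * col_norms                # elementwise importance, shape (d_l, d_{l-1})
    return S[rows][:, cols].sum().item()

W = torch.randn(768, 768)                  # a weight matrix W^(l)
X = torch.randn(128, 768)                  # activations for a sequence of length 128
# Column slice of width 5 starting at position 100 (all rows, columns 100-104):
print(wanda_slice_score(W, X, rows=slice(None), cols=slice(100, 105)))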

“Good” vs “Bad” slices (LTH view). See Figure 2(d).

Following the Lottery Ticket Hypothesis framework, we extract both “winning” and “losing” sparse subnetworks and use their masks to define slices. Standard LTH seeks a mask \( M \in \{0,1\}^d \) such that the pruned subnetwork matches full-model accuracy: \( \mathcal{L}(f_{\theta \odot M}(x), y) \approx \mathcal{L}(f_{\theta}(x), y) \). Surprisingly, even slices derived from “bad” subnetworks perform comparably to those from “good” ones — reinforcing that pretrained networks contain many capable subnetworks and that every sufficiently wide slice can be a winner.

Efficiency Analysis

Efficiency of SliceFine

Efficiency comparison
Figure 3:  Comparison of PEFT methods on (a) model size, (b) peak memory, (c) throughput, and (d) total training time across ViT, VideoMAE, and RoBERTa backbones. SliceFine consistently shows 3–5% smaller models, 2–4 GB lower peak memory, 15–25% faster throughput, and 40–50% shorter training times.

SliceFine achieves remarkable computational and memory efficiency compared to LoRA-style and adapter-based PEFT methods. Slice training scales as \(O(d_\ell \times r)\) (row) or \(O(d_{\ell-1} \times r)\) (column), introducing zero additional parameters (\#APs = 0). In contrast, LoRA and adapter-based methods add at least \(r(d_\ell+d_{\ell-1})\) new parameters per layer (see Table 1), increasing both space and time complexity. Empirically, SliceFine yields faster iteration rates, smaller peak memory, and shorter wall-clock times across ViT, VideoMAE, and RoBERTa backbones.

| Methods | Space | Time | #TTPs | #APs |
| --- | --- | --- | --- | --- |
| FT | \(O(d_\ell \times d_{\ell-1})\) | \(O(d_\ell \times d_{\ell-1})\) | \(d_\ell \cdot d_{\ell-1}\) | 0 |
| (IA)\(^3\) | \(O(d_k + d_v + d_{ff})\) | \(O(d_k + d_v + d_{ff})\) | \(d_k + d_v + d_{ff}\) | \(d_k + d_v + d_{ff}\) |
| Prompt | \(O(d \times l_p)\) | \(O(d \times l_p)\) | \(l_p \cdot d\) | \(l_p \cdot d\) |
| Prefix | \(O(L \times d \times l_p)\) | \(O(L \times d \times l_p)\) | \(L \cdot l_p \cdot d\) | \(L \cdot l_p \cdot d\) |
| LoRA | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(2 \cdot d \cdot r\) | \(2 \cdot d \cdot r\) |
| LoRA-FA | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(O((d_\ell + d_{\ell-1}) \times r)\) | \(d \cdot r\) | \(2 \cdot d \cdot r\) |
| AdaLoRA | \(O((d_\ell + d_{\ell-1} + r) \times r)\) | \(O((d_\ell + d_{\ell-1} + r) \times r)\) | \(2 \cdot d \cdot r + r^2\) | \(2 \cdot d \cdot r + r^2\) |
| LoHA | \(O(2r \times (d_\ell + d_{\ell-1}))\) | \(O(2r \times (d_\ell + d_{\ell-1}))\) | \(4 \cdot d \cdot r\) | \(4 \cdot d \cdot r\) |
| Propulsion | \(O(d)\) | \(O(d)\) | \(d\) | \(d\) |
| SliceFine (row) | \(O(d_\ell \times r)\) | \(O(d_\ell \times r)\) | \(r \cdot d_\ell\) | 0 |
| SliceFine (column) | \(O(d_{\ell-1} \times r)\) | \(O(d_{\ell-1} \times r)\) | \(r \cdot d_{\ell-1}\) | 0 |

Table 1: Comparison of space/time complexity, total trainable parameters (#TTPs), and additional parameters (#APs) per layer \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\). SliceFine achieves \(O(d_\ell×r)\) or \(O(d_{\ell-1}×r)\) complexity with no additional parameters, unlike other methods that incur higher costs.
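
As a quick sanity check on the per-layer counts in Table 1, the snippet below spells out the arithmetic for a square RoBERTa-base-sized projection (\(d_\ell = d_{\ell-1} = 768\)). The helper names are ours and purely illustrative.

Python
# Per-layer trainable-parameter counts implied by Table 1 (illustrative arithmetic only).
def slicefine_row(d_out, d_in, r):
    return r * d_out                  # row slice of width r; zero parameters are added

def slicefine_col(d_out, d_in, r):
    return r * d_in                   # column slice of width r; zero parameters are added

def lora(d_out, d_in, r):
    return r * (d_out + d_in)         # LoRA's A and B factors; all of these are new parameters

d, r = 768, 5
print("SliceFine (row):   ", slicefine_row(d, d, r))   # 3840 trainable, 0 added
print("SliceFine (column):", slicefine_col(d, d, r))   # 3840 trainable, 0 added
print("LoRA:              ", lora(d, d, r))             # 7680 trainable (= 2*d*r), all added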

Performance results

Main results across language, vision, and video

In the paper's experiments, SliceFine matches or surpasses strong PEFT baselines across language (RoBERTa, LLaMA), vision (ViT), and video (VideoMAE) backbones while updating far fewer parameters. Results are consistent across multiple datasets; gains are strongest at low ranks, with stability across slice positions.

Result tables from the paper:
  • Main language results
  • Vision and video results (VTAB-1K)
  • Commonsense and math reasoning
  • Natural language understanding (NLU)
  • GLUE benchmark
Code Usage

How to use SliceFine in your pipeline

SliceFine updates small row/column slices of the original weight matrices (no new parameters), then optionally merges back into the backbone for inference. Below are minimal snippets for injection, training with a HuggingFace-style trainer, and restoring layers for deployment.

1) Inject SliceFine into your model

Python
from slice import inject_peft, restore_layers

# Insert SliceFine slices into your model (no new params added)
inject_peft(
    model,                 # your HF/torch model
    rank=5,                # slice rank (capacity)
    position=0,            # starting column/row position
    bias=False,            # include bias slices or not
    mode=("row","column"), # ("row",) or ("column",) or both
)
Tip: mode=("row","column") alternates row/column slices across steps (block-coordinate descent). Use ("row",) or ("column",) if you want a single direction.

2) Train with SliceTrainer (HuggingFace style)

Python
from transformers import TrainingArguments
from SliceTrainer import SliceTrainer

training_args = TrainingArguments(
    output_dir="dir",
    learning_rate=3e-4,
    remove_unused_columns=False,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    evaluation_strategy="steps",
    save_strategy="no",
    logging_steps=100,
    save_steps=100,
    eval_steps=100,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    report_to=[],
    fp16=True,
)

trainer = SliceTrainer(
    model=model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    training_args=training_args,
    move_steps=1000,             # switch active slice every N steps
    rank=5,
    learnig_rate_decay=0.0001,   # note the spelling; see the note below
    min_learning_rate=0.00001,
    position=0,
    max_position=768,
    bias=False,
    peft_modes=("row","column"), # or ("row",) / ("column",)
    targets=None,                # or a list of target layer names
    verbose=True,
    tollerance=1,                # note the spelling; see the note below
    rank_decay=1,
    min_rank=1,
)

trainer.run()
Tip: move_steps controls how often the active slice shifts; small ranks (even r=1) often work well.
Note: If your SliceTrainer build uses the spellings learnig_rate_decay / tollerance, keep those names; otherwise consider standardizing to learning_rate_decay / tolerance.
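
If it helps to see the schedule explicitly, the sketch below mimics the behaviour described above: every move_steps optimizer steps the active slice shifts by rank positions and, when both modes are enabled, alternates between row and column blocks. The function active_slice is a hypothetical illustration of the idea, not the SliceTrainer internals.

Python
def active_slice(step, move_steps=1000, rank=5, position=0, max_position=768,
                 peft_modes=("row", "column")):
    """Which slice is trainable at a given optimizer step (illustrative only)."""
    switch = step // move_steps                              # how many times the slice has moved
    start = (position + switch * rank) % (max_position - rank)
    mode = peft_modes[switch % len(peft_modes)]              # alternate row/column if both given
    return mode, start, start + rank

for step in (0, 1000, 2000, 3000):
    print(step, active_slice(step))
# 0 ('row', 0, 5)
# 1000 ('column', 5, 10)
# 2000 ('row', 10, 15)
# 3000 ('column', 15, 20)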

3) Merge slices back for inference

Python
from slice import restore_layers

# After training, fold the learned slices back into the original weights
restore_layers(model)  # model is now slice-free and ready for deployment

Targeting specific layers (optional)

Python
# Example: restrict SliceFine to attention projections only
targets = [
    "encoder.layer.*.attention.self.query",
    "encoder.layer.*.attention.self.key",
    "encoder.layer.*.attention.self.value",
    "encoder.layer.*.attention.output.dense",
]

inject_peft(model, rank=4, position=0, bias=False, mode=("row","column"))
trainer = SliceTrainer(
    model=model,
    train_dataset=...,
    eval_dataset=...,
    data_collator=...,
    compute_metrics=...,
    training_args=training_args,
    move_steps=800,
    rank=4,
    position=0,
    max_position=768,
    peft_modes=("row","column"),
    targets=targets,   # only apply to matched layers
)
trainer.run()
FAQ: Row vs. column? A column slice adjusts how a specific input feature feeds into all outputs; a row slice adjusts a specific output channel. Alternating both tends to cover more task-relevant directions over time.
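
A two-line numerical check of that intuition (plain linear algebra, not SliceFine code): for \(y = Wx\), perturbing a column of \(W\) shifts every output in proportion to one input feature, while perturbing a row moves a single output channel.

Python
import torch

torch.manual_seed(0)
W = torch.randn(4, 6)                    # 4 output channels, 6 input features
x = torch.randn(6)

W_col = W.clone(); W_col[:, 2] += 1.0    # bump column 2 (an "incoming" direction)
print((W_col @ x) - (W @ x))             # every output shifts by x[2]

W_row = W.clone(); W_row[1, :] += 1.0    # bump row 1 (an "outgoing" channel)
print((W_row @ x) - (W @ x))             # only output 1 changes (by x.sum())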
Citation

BibTeX

BibTeX
@misc{kowsher2025slicefineuniversalwinningslicehypothesis,
      title={SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks}, 
      author={Md Kowsher and Ali O. Polat and Ehsan Mohammady Ardehaly and Mehrdad Salehi and Zia Ghiasi and Prasanth Murali and Chen Chen},
      year={2025},
      eprint={2510.08513},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.08513}, 
}