This paper presents a theoretical framework that explains why fine-tuning small, randomly selected subnetworks (slices) within pre-trained models is sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning slice property, arising from two phenomena: (1) spectral balance — the eigenspectra of different weight matrix slices are remarkably similar — and (2) high task energy — their backbone representations (pretrained weights) retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Inspired by this, we propose SliceFine, a PEFT method that uses this inherent redundancy by updating only selected slices of the original weights — introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of SOTA PEFT methods across various language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.
Train only small slices of the pretrained weights W; no adapters needed. Rank r=1 is often competitive, with complexity O(d × r). Result: strong accuracy with lower memory, faster throughput, and compact artifacts.
In a dense, pretrained network, any random slice with sufficient width acts as a local winning ticket: training only that slice while freezing the rest improves downstream performance. Moreover, tuning a small set of such slices across layers can match full fine-tuning accuracy while updating far fewer parameters.
- Formalizes spectral balance across weight slices and high task energy in pretrained features, explaining why slices receive non-zero restricted gradients and reduce the loss.
- SliceFine: trains a small, moving row/column slice per layer, with no added parameters, covering task-aligned directions over time (a conceptual sketch follows this list).
- Competitive with or better than strong PEFT baselines across language, vision, and video, with reduced memory and faster throughput.
- Simple rules for choosing slice rank and switching interval; rank 1 often suffices, with robustness to slice position.
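To make the mechanism concrete, here is a minimal conceptual sketch, not the released SliceFine code: only a row slice of an otherwise frozen weight matrix receives gradient, implemented here via a gradient-masking hook. The helper name and the hook-based masking are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_row_slice(linear: nn.Linear, row_start: int, rank: int) -> None:
    """Restrict updates to a `rank`-row slice of the weight by masking its gradient."""
    mask = torch.zeros_like(linear.weight)
    mask[row_start:row_start + rank, :] = 1.0
    # Zero the gradient outside the active slice after every backward pass,
    # so optimizer steps only modify the slice (no new parameters are added).
    linear.weight.register_hook(lambda grad: grad * mask)
    if linear.bias is not None:
        linear.bias.requires_grad_(False)  # keep the bias frozen in this sketch

# Toy usage: only rows 0-1 of this layer's weight receive gradient.
layer = nn.Linear(768, 768)
train_row_slice(layer, row_start=0, rank=2)
loss = layer(torch.randn(4, 768)).pow(2).mean()
loss.backward()
print(layer.weight.grad.abs().sum(dim=1)[:4])  # non-zero for rows 0-1, zero elsewhere
```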
We empirically test the Universal Winning-Slice Hypothesis with three complementary experiments, all indicating that performance is largely insensitive to which slices are selected, supporting the winning-slice property.
Positional robustness. See Fig. Winner Slice Ablation (b).
Using a fixed slice rank of \(5\), we train slices at multiple positions of the weight matrix and measure accuracy. Accuracy remains within \(\pm 1\%\) of the anchor across all positions, showing that both row and column slices contribute comparably to the task-relevant subspace.
Importance robustness (Wanda). See Fig. Winner Slice Ablation (c).
We adopt the Wanda pruning heuristic to rank weights by importance. Given a weight matrix \( W \in \mathbb{R}^{d_\ell \times d_{\ell-1}} \) and activations \( X \in \mathbb{R}^{s \times d_{\ell-1}} \) from a sequence of length \( s \), Wanda defines the importance of entry \((i,j)\) as \( S_{ij} = |W_{ij}| \cdot \lVert X_{\cdot j} \rVert_2 \), where \( \lVert X_{\cdot j} \rVert_2 \) is the \(\ell_2\) norm of the \(j\)-th input feature across the batch. To score a slice, we aggregate over its entries: \( S_{\text{slice}} = \sum_{(i,j)\in \text{slice}} S_{ij} \). We then select slices from the most-important, least-important, mixed, or random categories. Results show nearly identical accuracy across all categories, confirming that slice winners emerge regardless of weight importance.
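As a concrete illustration, the slice score above can be computed in a few lines; `wanda_slice_score` is a hypothetical helper written for this note, not code from Wanda or SliceFine.

```python
import torch

def wanda_slice_score(W: torch.Tensor, X: torch.Tensor,
                      rows: slice, cols: slice) -> float:
    """Aggregate Wanda importance S_ij = |W_ij| * ||X_{.,j}||_2 over a slice.

    W: (d_l, d_{l-1}) weight matrix; X: (s, d_{l-1}) input activations.
    """
    col_norms = X.norm(dim=0)             # ||X_{.,j}||_2 for each input feature j
    S = W.abs() * col_norms.unsqueeze(0)  # elementwise importance, shape (d_l, d_{l-1})
    return S[rows, cols].sum().item()     # sum over the entries of the chosen slice

# Example: score a 5-column slice starting at column 100.
W = torch.randn(768, 768)
X = torch.randn(128, 768)
print(wanda_slice_score(W, X, rows=slice(None), cols=slice(100, 105)))
```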
“Good” vs “Bad” slices (LTH view). See Fig. Winner Slice Ablation (d).
Following the Lottery Ticket Hypothesis framework, we extract both “winning” and “losing” sparse subnetworks and use their masks to define slices. Standard LTH seeks a mask \( M \in \{0,1\}^d \) such that the pruned subnetwork matches full-model accuracy: \( \mathcal{L}(f_{\theta \odot M}(x), y) \approx \mathcal{L}(f_{\theta}(x), y) \). Surprisingly, even slices derived from “bad” subnetworks perform comparably to those from “good” ones — reinforcing that pretrained networks contain many capable subnetworks and that every sufficiently wide slice can be a winner.
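For illustration, one simple way to turn an LTH-style pruning mask into a column slice is to rank columns by how many of their weights survive pruning; this sketch and the helper name are assumptions made for this note, not necessarily the paper's exact procedure.

```python
import torch

def slice_from_mask(M: torch.Tensor, rank: int) -> torch.Tensor:
    """Pick the `rank` columns whose pruning mask retains the most weights.

    M: binary mask of shape (d_l, d_{l-1}) from an LTH-style pruning run
       (1 = kept weight, 0 = pruned). Returns the selected column indices.
    """
    kept_per_col = M.sum(dim=0)                    # surviving weights per column
    return torch.topk(kept_per_col, rank).indices  # "good" slice; negate kept_per_col
                                                   # to pick a "bad" slice instead

# Example: a rank-5 column slice from a random 30%-density mask.
M = (torch.rand(768, 768) < 0.3).float()
print(slice_from_mask(M, rank=5))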
SliceFine achieves remarkable computational and memory efficiency compared to LoRA-style and adapter-based PEFT methods. Slice training scales as \(O(d_\ell \times r)\) (row) or \(O(d_{\ell-1} \times r)\) (column), introducing zero additional parameters (\#APs = 0). In contrast, LoRA and adapter methods require \(2r(d_\ell+d_{\ell-1})\) or higher overhead, increasing both space and time complexity. Empirically, SliceFine yields faster iteration rates, smaller peak memory, and shorter wall-clock times across ViT, VideoMAE, and RoBERTa backbones.
Methods | Space | Time | #TTPs | #APs |
---|---|---|---|---|
FT | O(dℓ×dℓ-1) | O(dℓ×dℓ-1) | dℓ·dℓ-1 | 0 |
(IA)3 | O(dk+dv+dff) | O(dk+dv+dff) | dk+dv+dff | dk+dv+dff |
Prompt | O(dℓ×lp) | O(dℓ×lp) | lp·dℓ | lp·dℓ |
Prefix | O(L×dℓ×lp) | O(L×dℓ×lp) | L·lp·dℓ | L·lp·dℓ |
LoRA | O((dℓ+dℓ-1)×r) | O((dℓ+dℓ-1)×r) | 2·dℓ·r | 2·dℓ·r |
LoRA-FA | O((dℓ+dℓ-1)×r) | O((dℓ+dℓ-1)×r) | d·r | 2·d·r |
AdaLoRA | O((dℓ+dℓ-1+r)×r) | O((dℓ+dℓ-1+r)×r) | 2·d·r+r² | 2·d·r+r² |
LoHA | O(2r×(dℓ+dℓ-1)) | O(2r×(dℓ+dℓ-1)) | 4·dℓ·r | 4·dℓ·r |
Propulsion | O(d) | O(d) | d | d |
SliceFine (row) | O(dℓ×r) | O(dℓ×r) | r·dℓ | 0 |
SliceFine (column) | O(dℓ-1×r) | O(dℓ-1×r) | r·dℓ-1 | 0 |
Table 1: Comparison of space/time complexity, total trainable parameters (#TTPs), and additional parameters (#APs) per layer \(W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}\). SliceFine achieves \(O(d_\ell×r)\) or \(O(d_{\ell-1}×r)\) complexity with no additional parameters, unlike other methods that incur higher costs.
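To ground Table 1, here is a quick per-layer count for a hypothetical 768×768 projection with rank r = 8, using the #TTPs formulas from the table.

```python
# Per-layer trainable-parameter counts from Table 1 (d_l = d_{l-1} = 768, r = 8).
d_l, d_prev, r = 768, 768, 8

full_ft   = d_l * d_prev   # full fine-tuning: 589,824 trainable, 0 added
lora      = 2 * d_l * r    # LoRA: 12,288 trainable, all of them newly added (A and B)
slicefine = r * d_l        # SliceFine (row): 6,144 trainable, 0 added (a slice of W itself)

print(f"Full FT  : {full_ft:>7,} trainable")
print(f"LoRA     : {lora:>7,} trainable (+{lora:,} new parameters)")
print(f"SliceFine: {slicefine:>7,} trainable (+0 new parameters)")
```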
Across the paper's benchmarks, SliceFine matches or surpasses strong PEFT baselines on language (RoBERTa/LLaMA), vision (ViT), and video (VideoMAE) backbones while updating far fewer parameters. Results are consistent across datasets; gains are strongest at low ranks, with stability across slice positions.
SliceFine updates small row/column slices of the original weight matrices (no new parameters), then optionally merges back into the backbone for inference. Below are minimal snippets for injection, training with a HuggingFace-style trainer, and restoring layers for deployment.
```python
from slice import inject_peft, restore_layers

# Insert SliceFine slices into your model (no new params added)
inject_peft(
    model,                   # your HF/torch model
    rank=5,                  # slice rank (capacity)
    position=0,              # starting row/column position
    bias=False,              # include bias slices or not
    mode=("row", "column"),  # ("row",) or ("column",) or both
)
```
mode=("row","column")
alternates row/column slices across steps (block-coordinate descent).
Use ("row",)
or ("column",)
if you want a single direction.
SliceTrainer (HuggingFace style):

```python
from transformers import TrainingArguments
from SliceTrainer import SliceTrainer

training_args = TrainingArguments(
    output_dir="dir",
    learning_rate=3e-4,
    remove_unused_columns=False,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    evaluation_strategy="steps",
    save_strategy="no",
    logging_steps=100,
    save_steps=100,
    eval_steps=100,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    report_to=[],
    fp16=True,
)

trainer = SliceTrainer(
    model=model,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    training_args=training_args,
    move_steps=1000,               # switch active slice every N steps
    rank=5,
    learnig_rate_decay=0.0001,     # (parameter name as spelled in the code)
    min_learning_rate=0.00001,
    position=0,
    max_position=768,
    bias=False,
    peft_modes=("row", "column"),  # or ("row",) / ("column",)
    targets=None,                  # or a list of target layer names
    verbose=True,
    tollerance=1,                  # (parameter name as spelled in the code)
    rank_decay=1,
    min_rank=1,
)

trainer.run()
```
`move_steps` controls how often the active slice shifts; small ranks (even `r=1`) often work well.
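As an illustration of the schedule (an assumption about the behavior, not the trainer's actual internals), the active slice can be thought of as advancing by `rank` and alternating modes every `move_steps` optimizer steps:

```python
def next_slice(step: int, position: int, rank: int,
               max_position: int, move_steps: int,
               modes=("row", "column")) -> tuple[str, int]:
    """Toy schedule: every `move_steps` steps, advance the active slice by `rank`
    (wrapping at `max_position`) and alternate between row and column modes."""
    window = step // move_steps                # how many moves have happened so far
    mode = modes[window % len(modes)]          # alternate row/column per window
    new_position = (position + window * rank) % max_position
    return mode, new_position

# Example: rank-5 slices, switching every 1000 steps over a 768-wide matrix.
for step in (0, 999, 1000, 2500, 5000):
    print(step, next_slice(step, position=0, rank=5,
                           max_position=768, move_steps=1000))
```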
If your implementation uses the parameter names `learnig_rate_decay` / `tollerance`, keep those spellings. If not, consider standardizing to `learning_rate_decay` / `tolerance`.
```python
# After training, fold the learned slices back into the original weights
from slice import restore_layers

restore_layers(model)  # model is now slice-free and ready for deployment
```
```python
# Example: restrict SliceFine to attention projections only
targets = [
    "encoder.layer.*.attention.self.query",
    "encoder.layer.*.attention.self.key",
    "encoder.layer.*.attention.self.value",
    "encoder.layer.*.attention.output.dense",
]

inject_peft(model, rank=4, position=0, bias=False, mode=("row", "column"))

trainer = SliceTrainer(
    model=model,
    train_dataset=...,
    eval_dataset=...,
    data_collator=...,
    compute_metrics=...,
    training_args=training_args,
    move_steps=800,
    rank=4,
    position=0,
    max_position=768,
    peft_modes=("row", "column"),
    targets=targets,  # only apply to matched layers
)

trainer.run()
```
```bibtex
@misc{kowsher2025slicefineuniversalwinningslicehypothesis,
  title={SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks},
  author={Md Kowsher and Ali O. Polat and Ehsan Mohammady Ardehaly and Mehrdad Salehi and Zia Ghiasi and Prasanth Murali and Chen Chen},
  year={2025},
  eprint={2510.08513},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.08513},
}
```