Segment This Thing

Foveated Tokenization for Efficient Point-Prompted Segmentation

Tanner Schmidt
Richard Newcombe
Meta Reality Labs
Published at CVPR 2025

Overview

Segment This Thing is a novel image segmentation model based on the Segment Anything Model (SAM) and designed for efficiency. Whereas prior work has improved the efficiency of SAM by decreasing model size, we keep the model size fixed and instead decrease the input size. This is accomplished with our Foveated Tokenization, a biologically inspired tokenization scheme that extracts high-resolution patches centered on the prompt and progressively downsampled patches from the surrounding image regions.

Figure: grid-based patch tokenization (as used in SAM) vs. foveated tokenization.
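
To make the idea concrete, below is a minimal NumPy sketch of one possible foveation scheme. The grid size, patch size, number of levels, 2x downsampling per level, and average-pool resampler are illustrative assumptions for exposition, not the paper's configuration:

```python
import numpy as np

def foveated_tokenize(image, cx, cy, patch=16, grid=4, levels=3):
    """Toy foveated tokenizer (illustrative settings, not the paper's).

    Level l covers a square window of side grid * patch * 2**l pixels
    centered on the prompt (cx, cy), downsampled by 2**l so that every
    level yields a grid x grid arrangement of (patch x patch) tokens.
    Outer levels skip the central block already covered at finer scale.
    """
    C = image.shape[-1]
    pad = grid * patch * 2 ** (levels - 1)  # keeps all windows in-bounds
    img = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    cx, cy = cx + pad, cy + pad
    tokens = []
    for l in range(levels):
        s = 2 ** l                    # downsampling factor at this level
        half = grid * patch * s // 2  # half of the window side, in pixels
        win = img[cy - half:cy + half, cx - half:cx + half]
        # Average-pool by factor s (a stand-in for anti-aliased resizing).
        win = win.reshape(grid * patch, s, grid * patch, s, C).mean(axis=(1, 3))
        for gy in range(grid):
            for gx in range(grid):
                # The central patches of outer levels duplicate the finer
                # level's window, so drop them.
                inner = (grid // 4 <= gy < 3 * grid // 4
                         and grid // 4 <= gx < 3 * grid // 4)
                if l > 0 and inner:
                    continue
                tokens.append(win[gy * patch:(gy + 1) * patch,
                                  gx * patch:(gx + 1) * patch])
    return np.stack(tokens)  # (num_tokens, patch, patch, C)

image = np.random.rand(512, 512, 3).astype(np.float32)
print(foveated_tokenize(image, cx=256, cy=200).shape)  # (40, 16, 16, 3)
```

With these toy settings, the prompt-centered region is kept at full resolution while each coarser level contributes only peripheral tokens, for 40 tokens in total.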

Foveated tokenization has two key advantages over the patch tokenization used in SAM and other Vision Transformer-based models:

  1. Reduced pixel count: A foveated image as consumed by Segment This Thing requires ~24x fewer pixels to store than the equivalent SAM input, so far less bandwidth is needed to stream images from sensor to compute.
  2. Reduced token count: Segment This Thing likewise operates on a ~24x smaller token set than SAM. Thus, while the model is the same size, it uses an order of magnitude fewer FLOPs owing to the quadratic scaling of attention (see the back-of-the-envelope sketch after this list).
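
To see where the order-of-magnitude FLOP saving comes from, a quick back-of-the-envelope comparison helps. SAM's ViT encoder turns a 1024x1024 input with 16x16 patches into 4096 tokens, so a ~24x smaller token set leaves roughly 170. The Python below is illustrative arithmetic, not measured FLOP counts:

```python
# Back-of-the-envelope token and FLOP comparison (illustrative only).
sam_tokens = (1024 // 16) ** 2   # SAM: 1024x1024 input, 16x16 patches -> 4096
stt_tokens = sam_tokens // 24    # ~24x fewer tokens -> ~170

# Attention compares every token pair, so its cost scales quadratically
# in token count; the per-token MLP cost scales only linearly.
attn_ratio = (sam_tokens / stt_tokens) ** 2
mlp_ratio = sam_tokens / stt_tokens
print(f"tokens: {sam_tokens} -> {stt_tokens}")
print(f"attention FLOPs ~{attn_ratio:.0f}x lower, MLP FLOPs ~{mlp_ratio:.0f}x lower")
```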

Unlike other input-reduction strategies, such as cropping or uniform downsampling, our foveated tokenization preserves full input resolution around the prompt while also providing full image-scale context through the peripheral tokens.

Results

Despite these large efficiency gains, our model remains competitive with SAM across a variety of datasets as measured by the standard mean intersection over union (mIoU) metric. Segment This Thing is also competitive with (and runs at lower latency than) other efficient SAM variants such as EfficientSAM and MobileSAM.
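
For reference, mIoU averages the per-example intersection-over-union between predicted and ground-truth binary masks. A minimal NumPy implementation of the metric (not code from the paper) might look like:

```python
import numpy as np

def mean_iou(preds, gts):
    """Mean intersection-over-union over paired binary masks."""
    ious = []
    for pred, gt in zip(preds, gts):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        # Two empty masks agree perfectly; avoid dividing by zero.
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```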

Figure: bar chart of the main results.

Please refer to the paper for more details!