Segment This Thing is a novel image segmentation model, based on the Segment Anything Model (SAM) and designed for efficiency. Whereas prior work has improved the efficiency of SAM by decreasing model size, we maintain the model size and instead decrease the input size. This is accomplished using our novel Foveated Tokenization, a biologically inspired tokenization scheme that extracts high-resolution patches centered on the prompt and progressively downsampled patches from the surrounding image regions.
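To make the mechanism concrete, here is a minimal NumPy sketch of a foveated tokenizer. It is one plausible reading of the scheme under assumed parameters (16-pixel patches, an 8×8 token grid per level, three pyramid levels); the function name, arguments, and layout are illustrative, not the configuration used by Segment This Thing.

```python
import numpy as np

def foveated_tokenize(image, prompt_xy, patch=16, grid=8, levels=3):
    """Tokenize `image` into a foveated pyramid centered on `prompt_xy`.

    Level 0 covers a small window at full resolution; each subsequent
    level doubles the window's side length and halves its resolution,
    so every level contributes the same (grid x grid) layout of tokens.
    """
    pad = (patch * grid * 2 ** (levels - 1)) // 2  # half the largest window
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    cx = int(prompt_xy[0]) + pad
    cy = int(prompt_xy[1]) + pad
    tokens = []
    for level in range(levels):
        side = patch * grid * 2 ** level          # window grows with level
        x0, y0 = cx - side // 2, cy - side // 2   # corner in padded coords
        window = padded[y0:y0 + side, x0:x0 + side]
        # Nearest-neighbor downsampling by striding keeps the sketch
        # dependency-free; a real tokenizer would resize with anti-aliasing.
        window = window[::2 ** level, ::2 ** level]
        for i in range(grid):
            for j in range(grid):
                tokens.append(window[i * patch:(i + 1) * patch,
                                     j * patch:(j + 1) * patch])
    return np.stack(tokens)  # (levels * grid * grid, patch, patch, C)

# Example: a 1024x1024 RGB image with a prompt left of center.
img = np.zeros((1024, 1024, 3), dtype=np.float32)
tokens = foveated_tokenize(img, (500, 300))
print(tokens.shape)  # (192, 16, 16, 3) with the defaults above
```

Note that in this sketch the coarser levels also re-cover the foveal region; the paper's actual arrangement of peripheral tokens may differ.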
Foveated tokenization has two key advantages over the patch tokenization used in SAM and other Vision Transformer-based models:

- It greatly reduces the number of input tokens, and thus the encoder's computational cost, without shrinking the model itself.
- Unlike other input-reduction strategies, such as cropping or uniform downsampling, it preserves full input resolution around the prompt while still providing full image-scale context through the peripheral tokens.
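For a sense of scale, using the illustrative numbers from the sketch above (not necessarily the paper's token budget): SAM's encoder turns its 1024×1024 input into 64×64 = 4096 patch tokens, whereas a three-level foveated pyramid with an 8×8 token grid per level yields only 3 × 64 = 192 tokens, roughly a 21× reduction in sequence length.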
Despite its vastly improved efficiency, our model remains competitive with SAM across a variety of datasets as measured by the standard mean intersection over union (mIoU) metric. Segment This Thing is also competitive with (and runs at lower latency than) other efficient SAM variants such as EfficientSAM and MobileSAM.
Please refer to the paper for more details!