Segment This Thing, Bit by Bit
Abstract
Inspired by biological vision, where a foveated retina and early neural circuits compress visual input before it leaves the eye, we implement a point-directed segmentation pipeline, based on the "Segment This Thing" architecture, on a custom bit-serial, near-memory SIMD processor array. The hardware is prototyped on a Xilinx KV260 FPGA and consists of 2,304 processing elements with bit-flexible arithmetic. We co-design the model and hardware using a Python instruction-set simulator and verify them against a cycle-accurate RTL co-simulation, which enables rapid iteration without synthesis. Foveated tokenisation compresses the input image into just 40 tokens, achieving an 84% reduction in memory while preserving resolution around the query point. To our knowledge, this is the first implementation of a foveated transformer encoder on a pixel processing device.
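To make the phrase "bit-serial SIMD" concrete, the following is a minimal Python sketch of how an array of processing elements might add two operands one bit at a time. It is illustrative only: the helper names (`to_bit_planes`, `from_bit_planes`, `bit_serial_add`) and the bit-plane data layout are assumptions and do not describe the actual instruction set or memory organisation of the hardware. Only the processing-element count is taken from the text above.

```python
from typing import List

NUM_PES = 2304  # processing-element count quoted in the abstract


def to_bit_planes(values: List[int], width: int) -> List[List[int]]:
    """Split one unsigned integer per PE into `width` bit-planes, LSB first.

    Hypothetical layout: plane b holds bit b of every PE's operand.
    """
    return [[(v >> b) & 1 for v in values] for b in range(width)]


def from_bit_planes(planes: List[List[int]]) -> List[int]:
    """Reassemble one unsigned integer per PE from LSB-first bit-planes."""
    num_pes = len(planes[0])
    return [sum(plane[i] << b for b, plane in enumerate(planes))
            for i in range(num_pes)]


def bit_serial_add(a_planes: List[List[int]],
                   b_planes: List[List[int]]) -> List[List[int]]:
    """Add two operands per PE, consuming one bit-plane per step.

    Every PE applies the same full-adder logic in lockstep (SIMD) and
    keeps its own 1-bit carry between steps.
    """
    carry = [0] * len(a_planes[0])
    out = []
    for a_bits, b_bits in zip(a_planes, b_planes):
        # Sum bit of a full adder, computed for all PEs at once.
        out.append([a ^ b ^ c for a, b, c in zip(a_bits, b_bits, carry)])
        # Carry-out of a full adder, fed into the next bit position.
        carry = [(a & b) | (c & (a ^ b))
                 for a, b, c in zip(a_bits, b_bits, carry)]
    out.append(carry)  # the final carry becomes the extra top bit
    return out


# Example: 4-bit operands, one pair per PE.
a = [i % 16 for i in range(NUM_PES)]
b = [(3 * i) % 16 for i in range(NUM_PES)]
total = from_bit_planes(bit_serial_add(to_bit_planes(a, 4), to_bit_planes(b, 4)))
```

In this sketch the operand width sets the number of steps, so a 4-bit addition takes four passes and an 8-bit addition takes eight. That trade between precision and cycle count is one plausible reading of "bit-flexible arithmetic".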