Sketchformer: Transformer-based Representation for Sketched Structure
This work addresses the need for better sketch representations in computer vision, offering incremental improvements over existing LSTM-based methods such as SketchRNN.
The authors tackled the problem of encoding free-hand sketches as vector sequences by introducing Sketchformer, a transformer-based representation. It achieved state-of-the-art performance in sketch classification and sketch-based image retrieval, and significantly improved reconstruction and interpolation for complex sketches.
Sketchformer is a novel transformer-based representation for encoding free-hand sketches supplied in vector form, i.e. as a sequence of strokes. Sketchformer effectively addresses multiple tasks: sketch classification, sketch-based image retrieval (SBIR), and the reconstruction and interpolation of sketches. We report several variants exploring continuous and tokenized input representations, and contrast their performance. Our learned embedding, driven by a dictionary learning tokenization scheme, yields state-of-the-art performance in classification and image retrieval tasks when compared against baseline representations driven by LSTM sequence-to-sequence architectures: SketchRNN and its derivatives. We show that sketch reconstruction and interpolation are improved significantly by the Sketchformer embedding for complex sketches with longer stroke sequences.
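To make the idea of a tokenized input representation concrete, the following is a minimal sketch of how a dictionary-based tokenizer for stroke sequences might look. It is an illustration under stated assumptions, not the authors' implementation: it assumes a sketch is stored as rows of (dx, dy, pen_lifted), uses plain k-means as the dictionary learner (the text above does not name the algorithm), and introduces a hypothetical stroke-separator token. The function names `build_codebook` and `tokenize` are invented for this example.

```python
import numpy as np

def build_codebook(offsets, k, iters=20, seed=0):
    """Learn a (k, 2) codebook over (dx, dy) pen offsets with plain k-means.

    Assumption: k-means stands in for the dictionary learning step.
    """
    rng = np.random.default_rng(seed)
    # Initialise the code words from k distinct training offsets.
    codebook = offsets[rng.choice(len(offsets), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every offset to every code word: shape (n, k).
        dists = ((offsets[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = offsets[assign == j]
            if len(members) > 0:  # leave empty clusters where they are
                codebook[j] = members.mean(axis=0)
    return codebook

def tokenize(sketch, codebook):
    """Map a sketch of (dx, dy, pen_lifted) rows to a list of integer tokens.

    Ids 0..k-1 index the codebook; id k is a hypothetical separator token
    emitted whenever the pen is lifted at the end of a stroke.
    """
    k = len(codebook)
    tokens = []
    for dx, dy, pen_lifted in sketch:
        dists = ((codebook - np.array([dx, dy])) ** 2).sum(axis=1)
        tokens.append(int(dists.argmin()))
        if pen_lifted:
            tokens.append(k)
    return tokens

# Synthetic data for illustration only: 50 random offsets, 3 pen lifts.
rng = np.random.default_rng(1)
offsets = rng.normal(size=(50, 2))
pen = np.zeros(50)
pen[[9, 29, 49]] = 1.0
sketch = np.column_stack([offsets, pen])

codebook = build_codebook(offsets, k=8)
tokens = tokenize(sketch, codebook)
```

The resulting integer sequence is the kind of input a transformer encoder could consume through an embedding layer, whereas a continuous variant would feed the raw (dx, dy, pen state) values directly.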