This landmark paper presents Masked Auto-Encoders (MAE), which are self-supervised learners for vision tasks.

Major Learning Points

The two core designs (like they claim) are:

(1) an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset ofpatches (without mask tokens - this is important), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.

(2) random patch masking (without replacement) a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Both these enable training large models efficiently (accelerate training - 3× or more) and effectively (improve accuracy).

Avoiding the need for labeled data implies building auto-regressive self-supervised models, which are already popular in the NLP world. Pixels are not the same as words in the semantic hierarchy - random blocks are closer approximations.

Vision Transformers overcomes the obstacles of convolutional neural networks working with regular grids - such that positional embeddings can be inserted into it.

Shifting the mask tokens reduces computation significantly - the mask tokens are introduced after the encoder: it sees only random patches without positional information.

Interesting bits

  • “Simple algorithms that scale well are the core of Deep Learning” - this is a deep deep statement. They go on to say, an autoencoder, a simple self-supervised method similar to techniques in NLP, provides scalable benefits. Self-supervision in vision is next!
  • It is very considerate of the authors to acknowledge that this method can generate inexistent images, and the statistics can reflect biases in the training data. Ethical usage is implied.


Masked Autoencoders on Arxiv

GitHub unofficial implementation