22. ViT 2021

Introduction

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Self-attention-based architectures, in particular the Transformer, have been very successful in NLP tasks.

What about CV tasks such as image recognition or classification?

To apply the same idea, we split the image into patches, turn them into a sequence of linear embeddings, and use that sequence as the input of a Transformer.

A naive application requires every pixel to attend to every other pixel. The cost is then quadratic in the number of pixels, which is too expensive, so some approximation is needed. For example, Parmar et al. 2018 apply self-attention only within a local neighborhood of each query pixel instead of globally (local attention); a related extension uses local multi-head dot-product self-attention to replace convolutions. Sparse Transformers use scalable approximations to global self-attention. Another alternative is to apply attention in blocks of varying size, in the extreme case only along individual axes. What these approaches have in common is that the results are promising, but the trade-off on the hardware side is large.
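To make the quadratic cost concrete, the sketch below counts query-key pairs for one global self-attention layer, once with one token per pixel and once with one token per 16x16 patch. The 224x224 input size and the helper name `attention_pairs` are assumptions chosen for illustration.

```python
def attention_pairs(height, width, patch=1):
    """Return (number of tokens, number of query-key pairs) for global self-attention."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

# One token per pixel vs. one token per 16x16 patch on a 224x224 image.
pixel_tokens, pixel_pairs = attention_pairs(224, 224)
patch_tokens, patch_pairs = attention_pairs(224, 224, patch=16)

print(pixel_tokens, pixel_pairs)
print(patch_tokens, patch_pairs)
print(pixel_pairs // patch_pairs)   # how many times more pairs the pixel-level version needs
```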

Goal: use large-scale pre-training so that a vanilla Transformer can compete with SOTA CNNs. In addition, the approach can be applied to both small-resolution and medium-resolution images.

There are many ways to apply Transformers to CV, depending on the task. Currently the most common use is to take over the embedding and backbone role that CNNs (e.g. ResNet) usually play.

Method

The design stays as close as possible to the original Transformer (2017), so that NLP implementations can be reused and scaled up with little change.

1. Problem 1: the input of a Transformer is a 1D sequence.

We need to turn 2D images into a 1D sequence: the image x ∈ R^(H×W×C) is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (P, P) is the resolution of each image patch and N = HW/P² is the number of patches. N therefore also serves as the effective input sequence length of the Transformer. The Transformer uses a constant latent vector size D through all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection -> the outputs are the patch embeddings.
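The reshape and projection above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the helper name `patchify`, the 224x224x3 input, and the latent size D = 64 are assumptions, and the projection matrix E is random here whereas it would be trained in a real model.

```python
import numpy as np

def patchify(image, patch):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches in raster order."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, P, P, C)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
P, D = 16, 64                                     # D is an assumed latent size
patches = patchify(image, P)                      # (N, P*P*C) = (196, 768)
E = rng.normal(scale=0.02, size=(P * P * 3, D))   # stand-in for the trainable projection
patch_embeddings = patches @ E                    # (196, D)
print(patches.shape, patch_embeddings.shape)
```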

2. Problem 2: classification

As in BERT, this is handled with a [class] token: a learnable embedding is prepended to the sequence of embedded patches (z_0^0 = x_class). Its state at the output of the Transformer encoder (z_L^0) serves as the image representation y. In both pre-training and fine-tuning, a classification head is attached to z_L^0 (implemented as an MLP with one hidden layer at pre-training time and as a single linear layer at fine-tuning time).
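A minimal sketch of the [class] token mechanics, under stated assumptions: the sizes N, D, K are made up, `encoder` is a hypothetical placeholder that only preserves the shape, and the final layer norm that the paper applies to z_L^0 is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 196, 64, 10                        # assumed sizes
patch_embeddings = rng.normal(size=(N, D))   # stand-in for the projected patches
x_class = np.zeros((1, D))                   # learnable [class] embedding (zeros here)

# Prepend the [class] token, so that z0[0] = x_class.
z0 = np.concatenate([x_class, patch_embeddings], axis=0)   # (N + 1, D)

def encoder(z):
    """Hypothetical placeholder for the Transformer encoder; keeps shape (N + 1, D)."""
    return z

zL = encoder(z0)
y = zL[0]                      # image representation read from the [class] position
W_head = np.zeros((D, K))      # single linear layer, as used at fine-tuning time
logits = y @ W_head            # (K,)
print(z0.shape, logits.shape)
```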

3. Problem 3: position embedding

Position embeddings are added to the patch embeddings to retain positional information. Learnable 1D position embeddings are used; 2D-aware position embeddings are another option that was explored. The resulting sequence of embedding vectors is the input to the encoder.
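The sketch below illustrates why the position embedding matters: if two tokens carry identical content, only the added position vector distinguishes them at the encoder input. Sizes and the random initialization of `E_pos` are assumptions; in a real model `E_pos` is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 64                                    # assumed sizes
patch = rng.normal(size=D)
tokens = np.tile(patch, (N + 1, 1))               # extreme case: every token has the same content
E_pos = rng.normal(scale=0.02, size=(N + 1, D))   # one vector per position, raster order

encoder_input = tokens + E_pos                    # elementwise add, shape (N + 1, D)
same_content = np.allclose(tokens[1], tokens[2])
same_input = np.allclose(encoder_input[1], encoder_input[2])
print(same_content, same_input)
```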

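The encoder consists of alternating multiheaded self-attention (MSA) and multi-layer perceptron (MLP) blocks, with Layernorm (LN) applied before every block and a residual connection after every block.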
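For reference, the forward pass can be written compactly as below, using the same symbols as in the three problems above (E is the patch projection, E_pos the position embedding, L the number of layers). This is a restatement of the paper's encoder equations as I recall them, written in LaTeX.

```latex
\begin{aligned}
z_0 &= [x_{\text{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos},
  & E \in \mathbb{R}^{(P^2 \cdot C) \times D},\ E_{pos} \in \mathbb{R}^{(N+1) \times D} \\
z'_\ell &= \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, & \ell = 1 \dots L \\
z_\ell &= \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, & \ell = 1 \dots L \\
y &= \mathrm{LN}(z_L^0)
\end{aligned}
```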
Inductive bias: ViT has much less image-specific inductive bias than CNNs. In CNNs, locality, the 2D neighborhood structure, and translation equivariance are built into every layer of the whole model. In ViT, only the MLP layers are local, while the self-attention layers are global. The 2D neighborhood structure is used very sparingly: at the input, when the image is cut into patches, and at fine-tuning time, when the position embeddings are adjusted for a different resolution. Apart from that, the position embeddings carry no information about the 2D positions of the patches at initialization, and all spatial relations between patches have to be learned from scratch.

Hybrid architecture: as an alternative to raw image patches, the input sequence can be formed from the feature maps of a CNN. In this hybrid model, the patch embedding projection E is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means the input sequence is obtained by flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and the position embeddings are then added in the same way as before.
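The 1x1-patch case can be sketched as a plain reshape followed by a projection. The feature-map shape (14, 14, 1024) and the latent size D are assumptions, and the random array only stands in for the output of a real CNN.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c_feat, D = 14, 14, 1024, 64               # assumed feature-map shape and latent size
feature_map = rng.normal(size=(h, w, c_feat))    # stand-in for a CNN feature map

# 1x1 "patches": every spatial location of the feature map becomes one token.
tokens = feature_map.reshape(h * w, c_feat)      # (196, 1024), raster order
E = rng.normal(scale=0.02, size=(c_feat, D))     # patch embedding projection
sequence = tokens @ E                            # (196, D), ready for [class] token and position embeddings
print(tokens.shape, sequence.shape)
```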

Fine-tuning and higher resolution
Typically, ViT is pre-trained on a large dataset and fine-tuned to (smaller) downstream tasks. To do so, the pre-trained prediction head is removed and a zero-initialized D×K feedforward layer is attached (K is the number of downstream classes). It is often beneficial to fine-tune at a higher resolution than the one used for pre-training. When feeding higher-resolution images, the patch size is kept the same, which results in a larger effective sequence length. ViT can handle arbitrary sequence lengths (up to memory constraints), but the pre-trained position embeddings may then no longer be meaningful. So the pre-trained position embeddings are 2D-interpolated according to their location in the original image. Note that this resolution adjustment and the patch extraction are the only points at which an inductive bias about the 2D structure of images is injected into ViT.
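One way to picture the 2D interpolation is the NumPy sketch below. It is an assumption-laden illustration: the helper name `resize_pos_embed`, the bilinear scheme with aligned corners, and the 224 -> 384 resolution change are my choices, and a real implementation may use a different interpolation routine. The [class] position is kept as-is and only the patch grid is resized.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings from old_grid^2 to new_grid^2 positions.

    pos_embed has shape (1 + old_grid**2, D); row 0 belongs to the [class] token.
    """
    cls_tok, grid = pos_embed[:1], pos_embed[1:]
    d = grid.shape[1]
    grid = grid.reshape(old_grid, old_grid, d)
    # Sample coordinates in the old grid; first and last positions coincide.
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo
    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_tok, out.reshape(new_grid * new_grid, d)], axis=0)

rng = np.random.default_rng(0)
D, K = 64, 10                                                # assumed sizes
pos_embed = rng.normal(scale=0.02, size=(1 + 14 * 14, D))    # 224 / 16 = 14 patches per side
new_pos_embed = resize_pos_embed(pos_embed, 14, 24)          # 384 / 16 = 24 patches per side
head = np.zeros((D, K))                                      # zero-initialized D x K layer
print(new_pos_embed.shape)
```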

Experiments

The experiments suggest that self-supervised ViT holds the most promise.

Model variant: ViT-L/16 denotes the "Large" variant with a 16x16 input patch size.

Few-shot accuracies are obtained by solving a least-squares regression problem that maps the (frozen) representation of a subset of training images to {-1, 1}^K target vectors.
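The regression above has a closed-form solution, which the sketch below illustrates on toy data. The helper name `fit_linear_probe`, the small L2 term `l2`, and the synthetic features are all assumptions; in practice the features would be the frozen ViT representations.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Closed-form least squares (with a small L2 term) from features to {-1, 1}^K targets."""
    targets = -np.ones((len(labels), num_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    d = features.shape[1]
    gram = features.T @ features + l2 * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)   # weight matrix of shape (d, K)

rng = np.random.default_rng(0)
K, d, shots = 3, 64, 5                                   # toy sizes
centers = rng.normal(scale=5.0, size=(K, d))             # toy stand-in for class-wise features
labels = np.repeat(np.arange(K), shots)
features = centers[labels] + rng.normal(size=(K * shots, d))

W = fit_linear_probe(features, labels, K)
pred = (features @ W).argmax(axis=1)                     # predicted class = largest regression output
print((pred == labels).mean())
```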

Pre-training data requirement:

Pre-train ViT on datasets of increasing size: ImageNet, ImageNet-21k, JFT-300M

To boost performance on the smaller datasets, three basic regularization parameters are tuned:

- weight decay

- dropout

- label smoothing
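Of the three regularizers listed above, label smoothing is the least self-explanatory, so here is a minimal sketch of it. The convention used (spreading the smoothing mass uniformly over all K classes) and the default value 0.1 are assumptions, not settings taken from the notes.

```python
import numpy as np

def smoothed_cross_entropy(logits, label, smoothing=0.1):
    """Cross-entropy against a one-hot target mixed with a uniform distribution."""
    k = logits.shape[-1]
    target = np.full(k, smoothing / k)
    target[label] += 1.0 - smoothing
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

confident = np.array([5.0, 0.0, 0.0])
hard_loss = smoothed_cross_entropy(confident, 0, smoothing=0.0)
soft_loss = smoothed_cross_entropy(confident, 0, smoothing=0.1)
print(hard_loss, soft_loss)   # the smoothed loss penalizes over-confident logits more
```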

Scaling Study:

Inspecting ViT:

- The first layer linearly projects the flattened patches into a lower-dimensional space.
- A learned position embedding is added to the patch representations -> a row-column structure appears: patches in the same row/column have similar embeddings -> a sinusoidal structure is sometimes apparent for larger grids. The position embeddings learn to represent the 2D image topology, which explains why hand-crafted 2D-aware embeddings do not improve the results.

- Self-attention allows ViT to integrate information across the entire image, even in the lowest layers. The "attention distance" computed from the attention weights is analogous to the receptive field size in CNNs. Some heads already attend to most of the image in the lowest layers, which shows that the ability to integrate information globally is used. Other heads show highly localized attention; this localized attention is less pronounced in hybrid models (ResNet before the Transformer). Further, the attention distance increases with network depth. Overall, the model attends to image regions that are semantically relevant for classification.
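The attention distance mentioned in the last item can be sketched as an attention-weighted average of pixel distances between query and key patches. The helper name `mean_attention_distance` is hypothetical, and the [class] token is assumed to have been removed from the attention matrix beforehand.

```python
import numpy as np

def mean_attention_distance(attn, grid, patch):
    """attn: (N, N) attention weights of one head over N = grid*grid patches (rows sum to 1)."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([rows, cols], axis=1) * patch                     # patch positions in pixels
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)   # (N, N) pairwise distances
    return float((attn * dist).sum(axis=1).mean())

local_head = np.eye(4)                  # every patch attends only to itself
global_head = np.full((4, 4), 0.25)     # every patch attends uniformly to all patches
print(mean_attention_distance(local_head, grid=2, patch=16))
print(mean_attention_distance(global_head, grid=2, patch=16))
```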

On CNN and Transformer backbones

CNN models focus on local (pixel-level) features of the image, so they are easy to retrain on small and medium datasets. Transformers focus more on global features and relationships, so they need a large amount of data. This is why models that use a Transformer today mostly start from pre-trained weights instead of being trained from scratch.

Appendix

Training

Strong regularization is key when training from scratch (weight decay, dropout, label smoothing). Dropout is applied after every dense layer except the qkv-projections, and directly after adding the positional embeddings to the patch embeddings. Hybrid models are trained with the exact setup of their ViT counterparts. Finally, all training is done at resolution 224.

Positional embedding
Depending on the case, spatial information can be encoded in different ways:

- No positional information: the input is treated as a bag of patches
- 1D position embedding: the input is treated as a sequence of patches in raster order

- 2D position embedding: the input is treated as a 2D grid of patches. An X-embedding and a Y-embedding, each of size D/2, are concatenated to get the final positional embedding

- Relative positional embedding: the relative distance between patches is used instead of the absolute position. With 1D relative attention (relative distance between every pair of patches), each query-key/value pair has an offset p_q - p_k, and each offset is associated with an embedding. An extra attention is then computed in the usual way, except that the relative positional embeddings are used as keys. The logits from this relative attention serve as a bias term and are added to the logits of the main (content-based) attention before the softmax
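The 2D and relative variants from the list above can be sketched as follows. All sizes are toy values, the embeddings are random stand-ins for learned parameters, and the exact scaling of the relative logits is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
grid, D = 4, 8                                   # assumed toy sizes
N = grid * grid
rows, cols = np.divmod(np.arange(N), grid)       # raster index -> (row, column)

# 2D position embedding: X- and Y-embeddings of size D/2, concatenated.
x_embed = rng.normal(scale=0.02, size=(grid, D // 2))
y_embed = rng.normal(scale=0.02, size=(grid, D // 2))
pos_2d = np.concatenate([x_embed[cols], y_embed[rows]], axis=1)     # (N, D)

# 1D relative attention: one embedding per offset p_q - p_k in [-(N-1), N-1].
q = rng.normal(size=(N, D))
k = rng.normal(size=(N, D))
rel_embed = rng.normal(scale=0.02, size=(2 * N - 1, D))
offsets = np.arange(N)[:, None] - np.arange(N)[None, :]             # (N, N)
rel_logits = np.einsum("qd,qkd->qk", q, rel_embed[offsets + N - 1]) # query vs. relative embedding
logits = (q @ k.T + rel_logits) / np.sqrt(D)     # bias added to content logits before the softmax
logits -= logits.max(axis=1, keepdims=True)
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(pos_2d.shape, attn.shape)
```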

For 1D and 2D positional embeddings, there are three ways to add them:

- Add them to the input after preprocessing and before feeding the inputs to the Transformer model (the usual choice)

- Learn and add them to the input at the beginning of each layer

- Add a learned positional embedding to the input at the beginning of each layer (shared between layers)

Axial Attention - AxialResNet 2019
An effective technique for running self-attention on large inputs (multidimensional tensors). The idea is to perform multiple attention operations, each along a single axis of the input tensor, instead of applying 1D attention to the flattened version of the input.

AxialResNet replaces all convolutions with kernel size 3x3 with axial self-attention (a row attention and a column attention, augmented by relative positional encoding)

It performs better, but it is also more resource-hungry
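The axial idea can be sketched as two 1D attention passes over a (H, W, D) tensor. This is a simplified illustration under stated assumptions: single head, identity query/key/value projections, and no relative positional encoding, so it is not the AxialResNet block itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x):
    """Single-head self-attention over the second-to-last axis of x, shape (..., L, D).

    Identity q/k/v projections keep the sketch short.
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])   # (..., L, L)
    return softmax(scores) @ x

def axial_attention(x):
    """x: (H, W, D). Row attention (along W), then column attention (along H)."""
    x = attention_1d(x)                                          # each row attends over its W positions
    x = np.swapaxes(attention_1d(np.swapaxes(x, 0, 1)), 0, 1)    # each column attends over its H positions
    return x

rng = np.random.default_rng(0)
H, W, D = 6, 5, 8                       # assumed toy sizes
x = rng.normal(size=(H, W, D))
out = axial_attention(x)
full_pairs = (H * W) ** 2               # pairs for attention over the flattened input
axial_pairs = H * W * (H + W)           # pairs for one row pass plus one column pass
print(out.shape, full_pairs, axial_pairs)
```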

Implementation and more information

(Tutorial + Baseline) PyTorch

https://www.kaggle.com/code/abhinand05/vision-transformer-vit-tutorial-baseline

Vision Transformer architecture:

- Only the Transformer encoder is used; the biggest difference lies in how the images are fed in

- The image is broken down into patches (16x16 or 32x32). The more patches there are, the simpler training gets, because the patches themselves are smaller -> "An Image is Worth 16x16 Words"

- The patches are unrolled (flattened) and sent on for further processing in the network

- Another difference from ordinary NNs is that a positional embedding vector is always added to the input. The positional embeddings are learnable, so no hard-coded vectors need to be fed in

- There is also a special token, as in BERT

- The rest is the same as the Transformer

From scratch, TensorFlow

https://www.kaggle.com/code/raufmomin/vision-transformer-vit-from-scratch

References

- https://arxiv.org/pdf/2010.11929v2.pdf

- https://www.kaggle.com/code/abhinand05/vision-transformer-vit-tutorial-baseline

- https://github.com/google-research/vision_transformer

- https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/vit

- https://www.kaggle.com/code/raufmomin/vision-transformer-vit-from-scratch