fbpx
Sigmoidal
  • Home
  • LinkedIn
  • About me
  • Contact
No Result
View All Result
  • Português
  • Home
  • LinkedIn
  • About me
  • Contact
No Result
View All Result
Sigmoidal
No Result
View All Result

Vision Transformer (ViT): Python Implementation

Carlos Melo by Carlos Melo
March 23, 2026
in Computer Vision, Deep Learning, Posts, Tutoriais
0
121
VIEWS
Share on LinkedInShare on FacebookShare on Whatsapp

What if I told you that a model originally created to translate text can analyze images with performance comparable to — and in some cases better than — Convolutional Neural Networks?

That’s exactly what the Vision Transformer (ViT) proposes. And the idea behind it is so elegant it fits in a single sentence: Cut the image into pieces, treat each piece as a “word,” and let the Transformer do the rest.

From CNNs to the Vision Transformer (ViT)
Evolution of computer vision: from pure CNNs to the Vision Transformer (ViT).

The Vision Transformer (ViT) was introduced by Dosovitskiy et al. in the paper An Image is Worth 16×16 Words, published in 2020, and quickly established itself as one of the most influential architectures in modern computer vision.

In this article, I’ll walk you through every component of the ViT architecture, from patch creation to the final classification step. At the end, we’ll put it all into practice using a pre-trained model to classify a real image and visualize the attention maps, understanding how the model distributes its focus across the image.

💡 Notebook: Vision Transformer ViT

 

What Is the Vision Transformer?

Until 2020, the world of computer vision was dominated by Convolutional Neural Networks (CNNs). From LeNet-5 in the 90s to ResNet in 2015, the paradigm was always the same: local filters sliding across the image, extracting hierarchical patterns.

The problem with CNNs is that they see locally. Each filter looks at only a small region of the image at a time. For distant information to communicate, you need to stack many layers.

ViT proposed something radical: throw away convolutions entirely and use only Transformers. The same architecture that revolutionized natural language processing (GPT, BERT) was applied directly to images.

The result? When trained with enough data, ViT outperformed the best CNNs of the time.

Patches: Cutting the Image into Pieces

The first question is: how do you feed an image into a Transformer? The Transformer works with sequences of tokens (like words in a sentence). An image is a grid of pixels, not a sequence.

ViT’s solution is simple and ingenious: divide the image into a grid of squares called patches. Each patch is a piece of the image, like a puzzle piece.

A 224×224 pixel image divided into 16×16 patches produces a 14×14 grid, that is, 196 patches. Each patch captures a local region of the image: an eye, a piece of the background, part of a paw.

Image divided into 196 patches of 16x16

Think of it this way: if a text Transformer processes a sequence of words, ViT processes a sequence of patches. Each patch is a “visual token.”

# Split a 224x224 image into 16x16 patches
patch_size = 16
img_size = 224
n_patches_side = img_size // patch_size
n_patches_total = n_patches_side ** 2

print(f"Grid: {n_patches_side}x{n_patches_side} = {n_patches_total} patches")
# Grid: 14x14 = 196 patches

Looking at the patches individually, we can see that each one captures a different region of the photo: some contain parts of the cat, others pieces of the background or branches.

First 14 patches extracted from the image

Patch Embedding: From Pixels to Vectors

Each patch is a pixel matrix (16×16×3 for RGB images). But the Transformer works with fixed-dimension vectors. We need to convert each patch into a numerical vector (this is the patch embedding).

In practice, ViT uses a convolution with kernel and stride equal to the patch size. This operation does two things at once: it divides the image into patches and projects each patch into a vector of dimension D.

# Conv2d with kernel=16 and stride=16: splits and projects in one operation
embed_dim = 768  # ViT-Base dimension
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

# Input: (1, 3, 224, 224)
# After convolution: (1, 768, 14, 14)
# Reshaped: (1, 196, 768)

The result is a sequence of 196 vectors, each with 768 dimensions. Each 16×16×3 pixel patch (768 values) became a 768-dimensional vector — a compact numerical representation of that image region.

CLS Token and Positional Embedding

Before entering the Transformer, two ingredients are added to the sequence.

The CLS token (from classification) is a special vector inserted at the beginning of the sequence. It acts as an “observer” that doesn’t belong to any specific patch. As it passes through all the attention layers, the CLS token accumulates information from every patch. In the end, it’s the one we use to classify the image.

The Positional Embedding solves a fundamental problem: the Transformer has no notion of order. Without it, the model wouldn’t know that the patch in the top-left corner is different from the one in the bottom-right corner. We add a position vector to each token, giving the model the information of “where” each patch was in the original image.

The resulting sequence has 197 tokens (1 CLS + 196 patches), each with 768 dimensions. This is the Transformer’s input.

Self-Attention: The Central Piece

The self-attention mechanism is what makes the Transformer so powerful. It allows each token to “look at” every other token and decide who to pay attention to.

The most useful analogy is that of a classroom. Imagine each patch is a student. Each student generates three things:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I have to offer?”
  • Value (V): “What information do I carry?”

Attention is calculated by comparing each token’s Query with the Keys of all other tokens. The more compatible Q and K are, the more attention one token pays to the other. The result is a weighted average of the Values.

    \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

The division by \sqrt{d_k} prevents values from becoming too large before the softmax. It’s a simple detail, but crucial for training stability.

In practice, ViT doesn’t use a single attention — it uses Multi-Head Attention. The idea is to split the embedding into multiple “heads” that operate in parallel, each learning to attend to different aspects: one head might focus on colors, another on shapes, another on textures. The outputs are concatenated at the end.

The Transformer Block

Each Transformer block combines two operations, both with residual connections and Layer Normalization:

  1. Multi-Head Self-Attention — tokens communicate with each other
  2. Feed-Forward Network (FFN) — each token is individually processed by two linear layers with GELU activation

The structure is:

tokens → LayerNorm → Multi-Head Attention → (+residual) → LayerNorm → FFN → (+residual)

ViT-Base stacks 12 of these blocks in sequence. With each layer, the tokens become richer in contextual information. The CLS token, which started with no information, progressively “listens” to all patches and builds a global representation of the image.

The residual connections (adding the input to the output) facilitate gradient flow during training and allow the model to stack many layers without degrading the signal.

In Practice: Classifying a Cat Photo

The truth is, in day-to-day work, nobody implements ViT from scratch. We use pre-trained models and run inference or fine-tuning. torchvision already includes ViT-B/16 pre-trained on ImageNet with 1,000 classes.

Let’s load the model and classify a real cat photo:

from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load pre-trained model on ImageNet
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.eval()

# Preprocess and classify
preprocess = weights.transforms()
img_input = preprocess(img).unsqueeze(0)

with torch.no_grad():
    output = model(img_input)
    probabilities = F.softmax(output[0], dim=0)

# Top-5 predictions
top5_prob, top5_idx = torch.topk(probabilities, 5)
categories = weights.meta["categories"]

for i in range(5):
    print(f"  {categories[top5_idx[i]]:30s} {top5_prob[i].item():.1%}")

The result:

  tiger cat                      66.2%
  tabby                          16.1%
  Egyptian cat                    8.3%
  lynx                            0.2%
  tiger                           0.1%

The model classified the image as tiger cat with 66.2% confidence, followed by tabby (16.1%) and Egyptian cat (8.3%). The top three classes are all domestic cat categories — the model nailed it. All of this without any training, using only pre-trained ImageNet weights.

Visualizing Attention: Where Is the ViT Looking?

One of the great advantages of the Vision Transformer is its interpretability. We can visualize which regions of the image were most important for the model’s decision.

To do this, we use a technique called attention rollout. Instead of looking at the attention of a single layer (which tends to be sparse and hard to interpret), rollout accumulates the attention weights from all 12 layers, taking residual connections into account.

The result is a map that shows, in aggregate, where the CLS token concentrated its attention throughout the entire network.

# Collect attention from each block and accumulate via rollout
rollout = torch.eye(197).unsqueeze(0)  # identity matrix
for attn in all_attn:
    attn_with_residual = attn + torch.eye(197).unsqueeze(0)
    attn_with_residual = attn_with_residual / attn_with_residual.sum(dim=-1, keepdim=True)
    rollout = attn_with_residual @ rollout

# Extract attention from CLS to the 196 patches
cls_attn = rollout[0, 0, 1:].reshape(14, 14).numpy()
ViT attention map overlaid on the original image

The brighter regions in the map indicate where the model concentrated more attention to classify the image. Notice how the attention focuses on the cat’s body and head, ignoring the background with branches and the sky. The ViT learned, without explicit supervision, to focus on the most discriminative regions of the image — exactly what a human would do.

ViT vs CNN: When to Use Each?

CNNs process images with local filters and build spatial hierarchies layer by layer. They are efficient and work well even with limited data. ViT, on the other hand, processes the image globally from the very first layer.

In other words, each patch can relate to any other patch, regardless of distance.

This global capability comes at a cost: ViT needs much more data to learn good representations from scratch. In the original paper, ViT only outperformed CNNs when pre-trained on massive datasets like JFT-300M (300 million images). For smaller datasets, CNNs still hold the advantage.

In practice, the most common strategy is to use transfer learning: start from a ViT pre-trained on a large dataset and fine-tune it for your specific task.

Key Takeaways

  • Image as a sequence of patches: ViT splits the image into 16×16 pieces and treats each one as a “token,” applying the same Transformer architecture used in NLP — with no convolutions whatsoever.
  • Patch embedding via convolution: a Conv2d with kernel and stride equal to the patch size splits the image and projects each patch into a 768-dimensional vector in a single operation.
  • CLS token as a global observer: a special token inserted at the beginning of the sequence accumulates information from all patches via attention, serving as the image’s global representation for classification.
  • Self-attention enables global communication: unlike CNNs that see locally, each patch in ViT can directly relate to any other patch in the image, capturing long-range dependencies from the very first layer.
  • In practice, use pre-trained models: torchvision’s ViT-B/16 classified our cat photo as tiger cat with 66% confidence, and the attention map confirmed the model focuses on the animal while ignoring the background — all with direct inference, no additional training required.
ShareShare2Send
Previous Post

Grad-CAM: Visualizing What a Neural Network Sees

Carlos Melo

Carlos Melo

Computer Vision Engineer with a degree in Aeronautical Sciences from the Air Force Academy (AFA), Master in Aerospace Engineering from the Technological Institute of Aeronautics (ITA), and founder of Sigmoidal.

Related Posts

Computer Vision

Grad-CAM: Visualizing What a Neural Network Sees

by Carlos Melo
March 10, 2026
Blog

What is Sampling and Quantization in Image Processing

by Carlos Melo
June 20, 2025
Como equalizar histograma de imagens com OpenCV e Python
Computer Vision

Histogram Equalization with OpenCV and Python

by Carlos Melo
July 16, 2024
How to Train YOLOv9 on Custom Dataset
Computer Vision

How to Train YOLOv9 on Custom Dataset – A Complete Tutorial

by Carlos Melo
February 29, 2024
YOLOv9 para detecção de Objetos
Blog

YOLOv9: A Step-by-Step Tutorial for Object Detection

by Carlos Melo
February 26, 2024

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

  • Trending
  • Comments
  • Latest
Estimativa de Pose Humana com MediaPipe

Real-time Human Pose Estimation using MediaPipe

September 11, 2023
ORB-SLAM 3: A Tool for 3D Mapping and Localization

ORB-SLAM 3: A Tool for 3D Mapping and Localization

April 10, 2023

Build a Surveillance System with Computer Vision and Deep Learning

1
ORB-SLAM 3: A Tool for 3D Mapping and Localization

ORB-SLAM 3: A Tool for 3D Mapping and Localization

1
Point Cloud Processing with Open3D and Python

Point Cloud Processing with Open3D and Python

1

Fundamentals of Image Formation

0
ViT Visual Transformer

Vision Transformer (ViT): Python Implementation

March 23, 2026

Grad-CAM: Visualizing What a Neural Network Sees

March 10, 2026

What is Sampling and Quantization in Image Processing

June 20, 2025
Como equalizar histograma de imagens com OpenCV e Python

Histogram Equalization with OpenCV and Python

July 16, 2024
Instagram Youtube LinkedIn Twitter
Sigmoidal

O melhor conteúdo técnico de Data Science, com projetos práticos e exemplos do mundo real.

Seguir no Instagram

Categories

  • Aerospace Engineering
  • Blog
  • Carreira
  • Computer Vision
  • Data Science
  • Deep Learning
  • Featured
  • Iniciantes
  • Machine Learning
  • Posts
  • Tutoriais

Navegar por Tags

3d 3d machine learning 3d vision bayer filter camera calibration career clahe computer vision custom dataset data science deep learning depth anything depth estimation digital image processing estimativa de pose grad-cam histogram histogram equalization image formation job lens lente machine learning machine learning engineering nasa object detection open3d opencv python pytorch quantization redes neurais resnet roboflow rocket sampling space tensorflow transformer tutorial vision-transformer visão computacional vit yolov8 yolov9

© 2024 Sigmoidal - Aprenda Data Science, Visão Computacional e Python na prática.

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In

Add New Playlist

No Result
View All Result
  • Home
  • Pós-Graduação
  • Blog
  • Sobre Mim
  • Contato
  • Português

© 2024 Sigmoidal - Aprenda Data Science, Visão Computacional e Python na prática.