On the speed of ViTs and CNNs
Context
Computer vision is now powered by two workhorse architectures:
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
CNNs slide a feature extractor (stack of convolutions) over the image
to get the final, usually lower-resolution, feature map on which the task is performed.
ViTs on the other hand cut t...
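
To make the "usually lower-resolution" feature map of a CNN concrete, here is a minimal sketch in plain Python. It applies the standard convolution output-size formula, floor((size + 2 * padding - kernel) / stride) + 1, layer by layer. The 224-pixel input and the layer stack `HYPOTHETICAL_STACK` are assumptions chosen for illustration; they are not taken from the article or from any specific model.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size (height or width) after one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack of (kernel, stride, padding) tuples, for illustration only.
HYPOTHETICAL_STACK = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]


def feature_map_sizes(input_size, stack):
    """Return the spatial size after each layer, starting with the input size."""
    sizes = [input_size]
    for kernel, stride, padding in stack:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


if __name__ == "__main__":
    # Each stride-2 layer roughly halves the resolution.
    print(feature_map_sizes(224, HYPOTHETICAL_STACK))
```

Under these assumptions, a 224x224 image shrinks to 112, 56, 28, 14 and finally 7 pixels per side, so the task would be performed on a 7x7 feature map.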
Read more at lucasb.eyer.be