ViTs Scale Efficiently to 1024x1024px², Often Outperforming CNNs; High Resolutions Unnecessary for Most AI Vision Tasks

On the speed of ViTs and CNNs

Context

Computer vision is now powered by two workhorse architectures: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs slide a feature extractor (a stack of convolutions) over the image to produce the final, usually lower-resolution, feature map on which the task is performed. ViTs, on the other hand, cut t...