Getting Started with Computer Vision: From Pixels to Predictions
2024-11-10
A practical introduction to computer vision — what it is, how convolutional neural networks see images, and how to build your first image classifier in Python.
Computer vision is one of the most exciting areas in machine learning today. At its core, it is about teaching machines to see: to extract meaningful information from images and video as effortlessly as the human visual system does. In this article I want to break down the fundamentals and show you how to build something real.
What is Computer Vision?
Computer vision is a field of artificial intelligence that enables computers to interpret and understand visual information from the world. This includes tasks like:
- Image classification — "Is this a cat or a dog?"
- Object detection — "Where exactly is the cat in this image?"
- Semantic segmentation — "Label every single pixel in this image."
- Pose estimation — "What position is this person's body in?"
It has applications in self-driving cars, medical imaging, facial recognition, augmented reality, and much more.
How Machines See Images
To a computer, an image is just a grid of numbers. A grayscale 28×28 image is a matrix of 784 numbers, each between 0 and 255, representing the brightness of a pixel. A color image adds three channels (Red, Green, Blue), turning it into a 3D tensor of shape (height, width, 3).
The challenge is that these raw numbers are not very meaningful on their own. Shifting the image by one pixel changes hundreds of values — but the content is the same.
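You can see this fragility with a few lines of NumPy. A toy 6×6 "image" stands in here for a real photo (the values are illustrative):

```python
import numpy as np

# A tiny "image": a 6x6 grayscale grid with a bright 2x2 square.
img = np.zeros((6, 6), dtype=np.uint8)
img[2:4, 2:4] = 255

# Shift the whole image one pixel to the right.
shifted = np.roll(img, shift=1, axis=1)

# The content is the same square, but the raw numbers changed.
changed = int(np.sum(img != shifted))
print(changed)  # 4 pixel values differ, yet a human sees the "same" image
```

On a full 28×28 digit or a 224×224 photo, the same one-pixel shift changes hundreds or thousands of values, which is exactly why comparing raw pixels is a poor way to recognize content.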
Convolutional Neural Networks (CNNs)
CNNs solved this problem. Instead of connecting every pixel to every neuron (which is prohibitively expensive for large images), CNNs use convolutional filters: small matrices that slide across the image and detect local patterns like edges, curves, and textures.
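To demystify what a filter does, here is the sliding-window operation written out by hand in NumPy. The kernel values are illustrative (a Sobel-like vertical-edge detector), and as in deep-learning frameworks, "convolution" here is really cross-correlation:

```python
import numpy as np

# A 3x3 vertical-edge filter: responds where brightness changes
# left-to-right, the kind of pattern early CNN layers learn.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

# Image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 2:] = 1.0

# Slide the kernel over every valid 3x3 patch ("valid" convolution).
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        patch = img[i:i+3, j:j+3]
        out[i, j] = np.sum(patch * kernel)

print(out)  # large values where the kernel straddles the edge, 0 elsewhere
```

An `nn.Conv2d` layer does exactly this, except it learns the kernel values from data and applies many kernels at once.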
A simple CNN layer looks like this in PyTorch:
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(32 * 16 * 16, 10)  # 10 classes; 16x16 spatial size after pooling a 32x32 input

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = x.view(x.size(0), -1)  # flatten to (batch, 32*16*16)
        return self.fc(x)
Each deeper layer learns more abstract features — early layers detect edges, later layers detect eyes, wheels, or faces.
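One detail worth sanity-checking is the `32 * 16 * 16` in the final `Linear` layer. It follows from the standard output-size formula for convolution and pooling layers, which you can verify in plain Python:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

# CIFAR-10 images are 32x32. conv1 uses kernel_size=3, padding=1,
# stride=1, so the spatial size is preserved:
after_conv = conv2d_out(32, kernel=3, stride=1, padding=1)

# MaxPool2d(kernel_size=2, stride=2) halves it:
after_pool = conv2d_out(after_conv, kernel=2, stride=2)

print(after_conv, after_pool)  # 32 16
# Hence the flatten size: 32 channels * 16 * 16 features into the Linear layer.
```

If you change the input resolution or add more conv/pool layers, this arithmetic is the first thing to redo, or the `Linear` layer's input size will no longer match.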
Building Your First Classifier
Let's classify images from the CIFAR-10 dataset (10 classes: airplane, automobile, bird, cat, etc.) using PyTorch.
1. Load the data
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True)
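A quick note on the `Normalize` transform above: with mean and std both set to 0.5 per channel, it maps the `[0, 1]` pixel values produced by `ToTensor` to `[-1, 1]`, which keeps the inputs centered around zero for training:

```python
# Normalize applies (x - mean) / std per channel.
def normalize(x, mean=0.5, std=0.5):
    return (x - mean) / std

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```

In practice you would often use the dataset's actual per-channel mean and std instead of 0.5, but the round numbers work fine for a first model.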
2. Train the model
import torch
import torch.optim as optim

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    for images, labels in trainloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1} complete")
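Under the hood, `nn.CrossEntropyLoss` applies softmax to the raw logits and takes the negative log of the probability assigned to the true class. A worked example with three classes (the logit values are illustrative):

```python
import math

# Raw model outputs (logits) for one sample; true class index is 0.
logits = [2.0, 1.0, 0.1]

# Softmax: exponentiate, then normalize to probabilities.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

# Cross-entropy: negative log-probability of the true class.
loss = -math.log(probs[0])
print(round(loss, 3))  # ~0.417: low-ish, since class 0 already has the largest logit
```

This is why the model's `forward` returns raw logits with no softmax at the end: the loss function handles that step itself, in a more numerically stable way.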
3. Evaluate
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False)

correct = 0
total = 0
with torch.no_grad():
    for images, labels in testloader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total:.1f}%")
A simple model like this reaches roughly 65–70% accuracy on CIFAR-10. Using a deeper architecture like ResNet or VGG pushes it above 90%.
Transfer Learning: The Shortcut
Training from scratch requires a lot of data and compute. Transfer learning lets you take a model pre-trained on a huge dataset (like ImageNet with 1.2 million images) and fine-tune it on your own smaller dataset.
import torchvision.models as models

model = models.resnet18(pretrained=True)  # newer torchvision: weights=models.ResNet18_Weights.DEFAULT

# Freeze all layers except the final classifier
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for our number of classes
model.fc = nn.Linear(model.fc.in_features, num_classes)  # num_classes: your dataset's class count
This is how most real-world computer vision projects work — you almost never train a full network from scratch.
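Once most layers are frozen, a common pattern is to hand the optimizer only the parameters that still require gradients. The sketch below uses a tiny stand-in `nn.Sequential` backbone (hypothetical, chosen so nothing needs to be downloaded); the steps are identical with `resnet18`:

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in "backbone": two linear layers playing the role of a pretrained network.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

# Freeze everything, as in the transfer-learning recipe above.
for param in model.parameters():
    param.requires_grad = False

# Replace the "head" with a fresh, trainable layer.
model[2] = nn.Linear(16, 4)

# Only the new head requires gradients, so the optimizer sees just those params.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-3)

print(sum(p.numel() for p in trainable))  # 16*4 weights + 4 biases = 68
```

Filtering the parameter list this way keeps the optimizer from carrying state for millions of frozen weights, which saves memory and makes it explicit what is actually being fine-tuned.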
What's Next?
Computer vision is a huge field. Once you are comfortable with image classification, explore:
- YOLO / Faster R-CNN for real-time object detection
- U-Net for medical image segmentation
- Vision Transformers (ViT) for state-of-the-art image understanding
- OpenCV for classical computer vision without deep learning
The best way to learn is to pick a project that excites you. Build a license plate detector, a plant disease classifier, or a gesture-controlled interface. The fundamentals carry across all of them.
Happy building!