Understand, visualize, and remember how neural networks work. Built for quick reference.
The definition, why it matters, who invented it, and how it evolved — all in a format that sticks.
ChatGPT, Claude, Siri, Alexa, Google Translate — all powered by neural networks under the hood.
Face recognition, self-driving cars, medical scan diagnosis — all learned by neural networks from images.
AlphaFold predicted 200M+ protein structures using neural networks — a problem that stumped science for 50 years.
DALL-E, Midjourney, Sora — generating images, music, and video from text descriptions.
Understanding NNs is one of the most valuable skills in AI/ML engineering, data science, and product management.
Built the Perceptron — the first neural network that could learn. A single neuron that started it all.
Pioneered backpropagation (1986) and deep learning. Made training deep networks practical. Nobel Prize 2024.
Invented Convolutional Neural Networks (CNNs) in 1989. Made computers see. Now Chief AI Scientist at Meta.
Invented LSTM (1997) — giving neural networks long-term memory. Essential for speech and language.
Published "Attention Is All You Need" — the Transformer paper that launched the entire Generative AI revolution.
Pioneered word embeddings and generative models. With Hinton & LeCun, won the 2018 Turing Award — the "Nobel of CS."
Dendrites receive signals from other neurons.
Cell body (soma) processes the signals.
Axon transmits the output signal.
Synapse connects to the next neuron with variable strength.
Inputs (x) receive data from previous layer.
Weighted sum + bias processes the signals.
Activation function decides the output.
Weights (w) connect to next neuron with learned strength.
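The four steps above map directly onto a few lines of code. A minimal sketch of one artificial neuron — sigmoid is chosen here purely as an example activation:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum + bias, then an activation."""
    z = np.dot(w, x) + b              # weighted sum of inputs, plus bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid squashes the result into (0, 1)

# Three inputs, three (illustrative) learned weights, one bias
out = neuron(np.array([1.0, 0.5, -1.0]),
             np.array([0.2, -0.4, 0.1]),
             b=0.3)
```

Everything a network "knows" lives in `w` and `b` — training just nudges those numbers.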
Neural networks are easiest to remember through real-world analogies.
A neural network is like a factory with assembly lines. Raw materials (data) enter, pass through departments (layers), where workers (neurons) each do one small job, and a finished product (prediction) comes out the other end.
Training a neural network is like perfecting a recipe. You start with random ingredient ratios (weights), taste the result (loss), and adjust each ingredient a little (gradient descent) until it tastes perfect.
Each neuron is like a post office sorter. Letters (signals) arrive, the sorter weighs each letter's importance (weights), adds up the scores, and forwards only important enough letters (activation threshold) to the next desk.
Watch data flow through a live network. Toggle layers and hit "Fire" to see signals propagate.
Click any step to expand the memory tip.
Each function shapes the output differently. Hover to see the curve live.
Each architecture is designed for a specific kind of data.
Data flows one direction — input → hidden → output. No loops, no memory. The simplest architecture and the building block for everything else.
Uses sliding filters (kernels) to detect patterns — edges, textures, shapes. Pools to compress. Thinks "spatially" about nearby pixels.
Has loops — output feeds back as input. Remembers previous steps. LSTM/GRU add gates to control what to remember and forget over long sequences.
Uses self-attention to weigh every token against every other token simultaneously. No recurrence needed. Parallelizable. Powers GPT, BERT, Claude, and modern LLMs.
GANs pit a generator vs. discriminator. VAEs learn compressed representations. Diffusion models denoise random noise into images. All create new content.
Operates on graph-structured data — nodes and edges. Learns from relationships. Each node aggregates info from its neighbors to update its representation.
Pin-worthy quick reference cards.
Click each architecture to see how data flows through it, what makes it unique, and when to use it.
Every neuron in layer N connects to every neuron in layer N+1 (fully connected). Data moves strictly left-to-right. No loops, no skipping layers, no memory of previous inputs. Each input is treated as completely independent.
No cycles: information never flows backward during inference.
No memory: the network has zero knowledge of what it processed before.
Fully connected: every neuron talks to every neuron in the next layer.
Fixed input size: always expects the same number of features.
"FNN = A toll road with no U-turns." Cars (data) enter, pay tolls at each booth (layer), and exit. They can't reverse. Each car is processed alone — the booth doesn't remember the last car.
A small filter/kernel (e.g., 3×3) slides across the input. At each position, it computes a dot product — producing a feature map that highlights where a pattern was found. Pooling then shrinks the map (e.g., taking the max of each 2×2 block), reducing size while keeping the important info. Multiple conv+pool stages stack up, each detecting more complex patterns. Finally, the output flattens into a regular FNN for classification.
NOT fully connected: each neuron only sees a small local region (receptive field), not the entire input.
Weight sharing: the same filter slides everywhere — so the same "edge detector" works whether the edge is top-left or bottom-right.
Spatial awareness: preserves the 2D structure of images. FNN would flatten an image into a 1D list, losing all spatial relationships.
Far fewer parameters: a 3×3 filter has just 9 weights, reused across the entire image.
"CNN = A detective with a magnifying glass." Instead of staring at the whole crime scene (image) at once like FNN would, CNN slowly scans small patches — first finding fingerprints (edges), then faces (shapes), then the whole story (objects). Same magnifying glass everywhere = weight sharing.
At each time step, the neuron receives TWO inputs: the current data (e.g., today's word) AND the hidden state from the previous step (a summary of everything it's seen so far). It combines both, produces an output, and passes an updated hidden state to the next step. This loop is the core difference — data literally flows in a circle.
Has loops: output feeds back as input — FNN has zero loops.
Has memory: hidden state carries information across time steps.
Variable input length: can process sequences of any length (sentences, time series) — FNN needs fixed-size input.
Same weights reused: the same neuron processes step 1, step 2, step 3... unlike FNN where each layer has its own weights.
Basic RNNs forget long-ago steps because gradients shrink during backprop through time. LSTM adds gates (forget, input, output) that control what to keep/discard. GRU is a simpler 2-gate version. Think of LSTM gates as a diary with a lock — you choose what to write, what to erase, and what to share.
"RNN = Reading a book vs. seeing a photo." FNN sees a photo (one snapshot, no sequence). RNN reads a book — each word makes sense because you remember what came before. LSTM is reading with highlighters — you mark important passages so you don't forget them 200 pages later.
The Generator (G) takes random noise and tries to create realistic data (e.g., a face). The Discriminator (D) receives both real images AND generator's fakes, and outputs "real or fake?". Both are trained simultaneously: G learns to fool D, D learns to catch G. This adversarial loop is what makes GANs unique — it's not one network, it's a competition between two.
Two networks, not one: FNN/CNN/RNN are single networks. GAN is a two-player game.
No labeled data needed: the Discriminator creates its own training signal.
Generative, not discriminative: FNN/CNN/RNN classify or predict. GAN creates new data from scratch.
Adversarial loss: instead of minimizing a fixed loss, both networks chase a moving target (each other).
"GAN = An art forger vs. a museum curator." The forger (Generator) paints fake masterpieces. The curator (Discriminator) examines each painting — "Real Monet or fake?" Over time, the forger gets so good that even the curator can't tell. The key insight: the forger never sees real paintings — only learns from the curator's feedback!
Input tokens are embedded + given positional encoding (since there are no loops to track order). Then Self-Attention lets every token look at every other token simultaneously, computing relevance scores. This is done multiple times in parallel via Multi-Head Attention. The result passes through a feedforward network. This Attention+FFN block repeats N times (the "layers"). An Encoder stack processes input; a Decoder stack generates output autoregressively.
vs. FNN: Transformer has attention — dynamically chooses what to focus on. FNN applies fixed weights blindly.
vs. CNN: CNN only sees local patches. Transformer sees the entire sequence at once — no locality constraint.
vs. RNN: RNN processes sequentially (slow). Transformer processes all positions in parallel (fast). No recurrent loops, no vanishing gradients.
vs. GAN: Transformer is a single architecture for understanding/generating. GAN is two competing networks. (Though GANs can use Transformers inside!)
"Transformer = A round-table conference." Every person (token) can hear and ask questions of every other person simultaneously. They each decide who's most relevant to listen to (attention). RNN is like a phone chain — each person calls the next. CNN is like everyone reading their own section of the newspaper. The Transformer puts everyone in one room.
| Dimension | FNN | CNN | RNN | GAN | Transformer |
|---|---|---|---|---|---|
| Data Flow | One direction → | One direction → (local scan) | Forward + loops ↺ | Two competing ⇄ | All-to-all attention (parallel) ⇉ |
| Memory | None | None | Hidden state | None | Attention = dynamic, context-based memory |
| Connection | Every-to-every | Local filter window | Same weights across steps | G+D separate | Every token attends to every token |
| Input Type | Fixed flat vector | 2D/3D spatial | Variable sequences | Random noise | Variable sequences (+ images via ViT) |
| Weight Sharing | No | Yes (filter reuse) | Yes (cell reuse) | No | Yes (attention heads shared across positions) |
| Purpose | Classify / regress | Spatial patterns | Sequential deps | Generate data | Universal: NLP, vision, audio, multi-modal |
| # of Networks | 1 | 1 | 1 | 2 | 1 (or Encoder+Decoder pair) |
| Parallelizable | Yes | Yes | No (sequential) | Partially | Yes — fully parallel (key advantage) |
| Long-Range Deps | No | Limited (receptive field) | Struggles (vanishing grad) | N/A | Excellent — attends to any distance |
| Positional Info | None | Inherent (spatial grid) | Inherent (step order) | N/A | Must be added (positional encoding) |
| Key Innovation | Universal approx | Filters + pooling | Recurrent loop | Adversarial | Self-attention + multi-head + pos encoding |
| Analogy | Toll road | Magnifying glass | Reading a book | Forger vs. Curator | Round-table conference |
| Best For | Tabular data | Images, video | Time series, audio | Image generation | LLMs, translation, vision, everything modern |
| Weakness | No sequence/spatial | No sequences | Slow, vanishing grad | Mode collapse | O(n²) memory for long sequences |
| Modern Status | Tabular / final layers | Dominant for vision | Replaced by Transformers | Rivaled by Diffusion | Dominant everywhere — powers all LLMs |
The 2017 paper "Attention Is All You Need" gave birth to GPT, BERT, Claude, and the entire Generative AI revolution. Let's understand it piece by piece.
Think of it like a library. You walk in with a Question (Q). Each book has a title on the spine — that's the Key (K). You compare your question against all the titles. When you find a match, you pull the book off the shelf and read its content — that's the Value (V). Self-attention does this for every word against every word, simultaneously.
The dot products of Q and K can grow very large as the dimension increases. Large values push softmax into regions with tiny gradients (near 0 or 1), making learning difficult. Dividing by √dₖ keeps the scores in a range where softmax behaves well. It's a simple but critical numerical stability trick.
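You can see the stability trick numerically. A sketch with made-up random vectors at a typical dimension (dₖ = 512): unscaled dot products saturate the softmax into a near one-hot, while scaled ones keep it spread out:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q, k = rng.normal(size=(8, d_k)), rng.normal(size=(8, d_k))

raw = q @ k.T                 # dot products have std ~ sqrt(d_k) ~ 22.6
scaled = raw / np.sqrt(d_k)   # back to std ~ 1

p_raw = softmax(raw[0])       # one token's attention over 8 candidates
p_scaled = softmax(scaled[0])
```

`p_raw` piles nearly all its mass on one entry (tiny gradients everywhere else); `p_scaled` distributes attention — which is exactly why the √dₖ divisor is in the formula.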
Sees: all tokens at once (bidirectional).
Good at: understanding — classification, search, NER, Q&A extraction.
Limitation: can't generate text autoregressively.
Sees: only past tokens (causal mask).
Good at: generation — writing, code, chat, reasoning.
Insight: it turns out that, with enough scale, decoders become great at understanding too!
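The encoder/decoder difference above is literally one mask matrix. A sketch for a 4-token sequence — 0 means "may attend", −∞ means "blocked" (added to attention scores before the softmax):

```python
import numpy as np

n = 4
# Encoder (bidirectional): every token sees every token.
encoder_mask = np.zeros((n, n))
# Decoder (causal): lower triangle only — each token sees itself and the past.
decoder_mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

# Token at position 2 can attend to positions 0..2 but not to position 3:
visible = np.isfinite(decoder_mask[2])
```

This is all it takes to turn a bidirectional understander into an autoregressive generator: blocked positions get score −∞, so the softmax assigns them exactly zero attention.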
17 questions covering neural networks and Transformers.