Published: July 6, 2025
Optimizers determine how neural networks learn by updating parameters to minimize loss. The choice of optimizer significantly affects training speed and final performance.
Watch how different optimizers navigate toward the minimum of a simple quadratic function:
(Interactive demo: SGD, Momentum, and Adam all start at a loss of 9.0 and take up to 100 steps toward the minimum.)
The simplest optimizer: each parameter is moved in the direction opposite its gradient, with a step size proportional to the gradient's magnitude and the learning rate.
Key Formula:
θ = θ - α * ∇f(θ)
Where α is the learning rate and ∇f(θ) is the gradient
class SGD:
    def __init__(self, learning_rate=0.01):
        self.lr = learning_rate

    def update(self, params, gradients):
        # Step each parameter opposite its gradient, scaled by the learning rate
        for i in range(len(params)):
            params[i] -= self.lr * gradients[i]
        return params
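To see the rule in action, here is a minimal sketch that runs the class above on a quadratic. The objective f(θ) = θ², the starting point θ = 3, and the learning rate of 0.1 are illustrative assumptions, not settings taken from the demo.

# Minimal sketch: minimize an assumed quadratic f(theta) = theta^2
# with the SGD class above, starting from theta = 3.0 (loss 9.0).
def loss(theta):
    return theta ** 2

def grad(theta):
    return 2 * theta

optimizer = SGD(learning_rate=0.1)
params = [3.0]
for step in range(5):
    params = optimizer.update(params, [grad(params[0])])
    print(f"step {step + 1}: theta = {params[0]:.4f}, loss = {loss(params[0]):.4f}")

# Each step multiplies theta by (1 - 2 * lr) = 0.8, so the loss
# decays geometrically toward the minimum at theta = 0.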
Adds momentum to SGD by accumulating an exponentially decaying sum of past gradients. This accelerates convergence along consistent directions and damps oscillations.
Key Formulas:
v = β * v + ∇f(θ)
θ = θ - α * v
Where β is the momentum coefficient (typically 0.9)
class SGDMomentum:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.lr = learning_rate
        self.momentum = momentum
        self.velocity = None

    def update(self, params, gradients):
        # Lazily create one velocity slot per parameter
        if self.velocity is None:
            self.velocity = [0] * len(params)
        for i in range(len(params)):
            # Accumulate an exponentially decaying sum of past gradients
            self.velocity[i] = (self.momentum * self.velocity[i] +
                                gradients[i])
            params[i] -= self.lr * self.velocity[i]
        return params
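A minimal sketch on the same assumed quadratic and starting point as the SGD example makes the velocity term visible:

# Minimal sketch on the assumed quadratic f(theta) = theta^2, starting at
# theta = 3.0. The velocity builds up over the first steps, and the
# accumulated momentum carries theta past the minimum before it eventually
# swings back, which plain SGD never does on this convex problem.
def grad(theta):
    return 2 * theta

optimizer = SGDMomentum(learning_rate=0.1, momentum=0.9)
params = [3.0]
for step in range(6):
    params = optimizer.update(params, [grad(params[0])])
    print(f"step {step + 1}: velocity = {optimizer.velocity[0]:+.3f}, "
          f"theta = {params[0]:+.4f}")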
Combines momentum with adaptive learning rates: each parameter gets its own effective step size, derived from running estimates of the first and second moments of its gradient.
Key Formulas:
m = β₁ * m + (1-β₁) * ∇f(θ)
v = β₂ * v + (1-β₂) * ∇f(θ)²
m̂ = m / (1-β₁ᵗ), v̂ = v / (1-β₂ᵗ)
θ = θ - α * m̂ / (√v̂ + ε)
Default: β₁=0.9, β₂=0.999, ε=1e-8
import math

class Adam:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = 1e-8
        self.m = None  # First moment (running mean of gradients)
        self.v = None  # Second moment (running mean of squared gradients)
        self.t = 0     # Time step

    def update(self, params, gradients):
        if self.m is None:
            self.m = [0] * len(params)
            self.v = [0] * len(params)
        self.t += 1
        for i in range(len(params)):
            # Update biased moment estimates
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * gradients[i]
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * gradients[i] ** 2
            # Bias correction (moments are initialized at zero)
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            # Per-parameter adaptive step
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.epsilon)
        return params
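To tie the three classes back to the interactive demo, the sketch below runs each of them on an assumed quadratic f(θ) = θ², starting from θ = 3 (loss 9.0), for 100 steps. The objective, the starting point, and the use of each class's default learning rate are assumptions chosen for illustration, not the demo's actual settings.

# Comparison sketch on an assumed quadratic f(theta) = theta^2, starting at
# theta = 3.0 (loss 9.0). Learning rates are the classes' defaults above,
# an illustrative assumption rather than the demo's settings.
def grad(theta):
    return 2 * theta

for name, opt in [("SGD", SGD(learning_rate=0.01)),
                  ("Momentum", SGDMomentum(learning_rate=0.01, momentum=0.9)),
                  ("Adam", Adam(learning_rate=0.001))]:
    params = [3.0]
    for _ in range(100):
        params = opt.update(params, [grad(params[0])])
    print(f"{name:8s} loss after 100 steps: {params[0] ** 2:.4f}")

# Note: Adam's per-step movement is bounded by roughly its learning rate, so
# with the tiny default of 0.001 it barely moves on this 1-D toy problem.
# Its adaptive scaling pays off on larger, poorly conditioned problems.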
Optimizer | Memory | Convergence | Hyperparameters | Best Use Case
---|---|---|---|---
SGD | Minimal | Slow but stable | Learning rate only | Large datasets, limited memory |
Momentum | Low | Faster than SGD | LR + momentum | Non-convex problems |
Adam | Higher | Fast, adaptive | LR + 2 betas | Deep learning, general purpose |
import torch.optim as optim
# SGD
optimizer = optim.SGD(model.parameters(), lr=0.01)
# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Adam
optimizer = optim.Adam(model.parameters(), lr=0.001)
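Whichever optimizer you choose, the PyTorch training step looks the same. Below is a minimal sketch of that loop; the Linear model, the random batch, and the MSE loss are placeholders picked for illustration, not something from this post.

import torch
import torch.nn as nn
import torch.optim as optim

# Minimal sketch of a training loop; the model, data, and loss are
# placeholder assumptions chosen for illustration.
model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

x = torch.randn(32, 10)  # dummy inputs
y = torch.randn(32, 1)   # dummy targets

for epoch in range(5):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backpropagate to fill .grad
    optimizer.step()               # apply the optimizer's update rule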
from tensorflow.keras.optimizers import SGD, Adam
# SGD
optimizer = SGD(learning_rate=0.01)
# SGD with momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
# Adam
optimizer = Adam(learning_rate=0.001)
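In Keras, the optimizer is handed to model.compile and fit drives the updates. A minimal sketch, where the one-layer model and the random data are placeholder assumptions:

import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

# Minimal sketch: the one-layer model and the random data are placeholder
# assumptions chosen for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=Adam(learning_rate=0.001), loss="mse")

x = np.random.randn(32, 10).astype("float32")  # dummy inputs
y = np.random.randn(32, 1).astype("float32")   # dummy targets
model.fit(x, y, epochs=5, verbose=0)           # each epoch runs Adam updates on the batch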