Why Gradient Descent Became Stochastic
A step-by-step journey from calculus-based optimization to Stochastic Gradient Descent
Nikhil Dasari, May 29, 2026, 19 min read
Photo by Sami TÜRK
In this blog post, we are going to discuss not only how but also why gradient descent and stochastic gradient descent are used.
We already know about linear regression, and recently I wrote about it in the context of vectors and projections.
Now, we will try to understand gradient descent with the help of a linear regression problem.
But before that, I just want to briefly recall what we already know about linear regression and the math behind it, so that anyone starting out finds it easy to follow.
If you already know the basic math behind linear regression, then you can directly start from the section titled Why Do We Need Gradient Descent?
Let’s say we have just started our machine learning journey, and the first thing we did was implement a linear regression model in Python.
We implemented it successfully and got the best values for the slope and intercept.
Now we have a question: What’s actually happening behind this algorithm?
We want to understand the math behind it.
Linear Regression Recap
For that, let’s consider this data.
Image by Author
Now, we want to understand the math behind the algorithm.
Image by Author
We come across these formulas for the slope and intercept:
$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
Now, we use these formulas to calculate the slope and intercept of the Simple Linear Regression equation:
$$\hat{y} = \beta_0 + \beta_1 x$$
The dataset is:
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
Compute the mean of x:
$$\bar{x} = \frac{1.2+1.4+1.6+2.1+2.3+3.0+3.1+3.3+3.3+3.8}{10} = \frac{25.1}{10} = 2.51$$
Compute the mean of y:
$$\bar{y} = \frac{39344+46206+37732+43526+39892+56643+60151+54446+64446+57190}{10} = \frac{499576}{10} = 49957.6$$
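Now compute the numerator of the slope formula:
$$\sum (x_i - \bar{x})(y_i - \bar{y})$$
After substitution and calculation:
$$\sum (x_i - \bar{x})(y_i - \bar{y}) = 67555.54$$
Now compute the denominator:
$$\sum (x_i - \bar{x})^2$$
After calculation:
$$\sum (x_i - \bar{x})^2 = 7.489$$
Now compute the slope:
$$\beta_1 = \frac{67555.54}{7.489} \approx 9020.64$$
Now compute the intercept, carrying the unrounded slope 9020.6356 to avoid rounding error:
$$\beta_0 = 49957.6 - (9020.6356)(2.51) \approx 27315.80$$
Therefore:
$$\beta_0 \approx 27315.80, \quad \beta_1 \approx 9020.64$$
Final regression equation:
$$\hat{y} = 27315.80 + 9020.64x$$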
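As a quick cross-check, the same calculation can be sketched in plain Python. This is a minimal illustration rather than the author's code: the function name `fit_simple_linear_regression` and the variable names are assumptions made for this example, and it uses only the ten data points listed above.

```python
# The ten (x, y) pairs listed in the article.
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) from the closed-form slope and intercept formulas.

    Hypothetical helper written for this example.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator of the slope formula: sum of (x_i - x_mean) * (y_i - y_mean).
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    # Denominator of the slope formula: sum of (x_i - x_mean)^2.
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


beta0, beta1 = fit_simple_linear_regression(x, y)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")
```

Because the code keeps full floating-point precision for the slope, the intercept it prints is free of the small rounding error that hand calculation with a rounded slope can introduce.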
We got the values using the formulas, but we are not satisfied and want to go deeper.
Now our goal is to learn how we got these formulas.
To understand that, we will look at a 3D bowl-shaped surface. We get this surface when we plot the mean squared error (MSE) for every possible combination of $\beta_0$ and $\beta_1$.
Image by Author
Looking at the surface, we can see that we want the mean squared error to be as low as possible, and that it reaches its minimum at the bottom of the bowl, where the gradient becomes zero.
We already know that to find the slope of any curve, we need differentiation.
So we differentiate the loss function, since the bowl is the 3D representation of it. Here the loss depends on two variables, $\beta_0$ and $\beta_1$.
Therefore, we take the partial derivative with respect to each one, set it to zero, and solve to get the formulas for the slope and intercept.
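To get a feel for the bowl without a plot, here is a small sketch that evaluates the MSE on a coarse grid of $(\beta_0, \beta_1)$ values and reports the grid point with the lowest loss. The grid ranges and step sizes are arbitrary choices made for illustration, not values from the article.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(beta0, beta1):
    """Mean squared error of the line y_hat = beta0 + beta1 * x on the dataset."""
    n = len(x)
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)) / n


# Coarse grid over the parameter space (illustrative ranges, chosen by hand).
beta0_grid = [20000 + 2500 * i for i in range(7)]  # 20000, 22500, ..., 35000
beta1_grid = [6000 + 1000 * j for j in range(7)]   # 6000, 7000, ..., 12000

# Height of the bowl at every grid point.
surface = {(b0, b1): mse(b0, b1) for b0 in beta0_grid for b1 in beta1_grid}

# The lowest point on the grid sits near the bottom of the bowl.
best_point = min(surface, key=surface.get)
print("lowest grid point:", best_point, "MSE:", round(surface[best_point]))
```

A grid search like this only approximates the minimum and gets expensive quickly as the grid gets finer, which is one reason to look for the exact minimum with calculus instead.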
Deriving the Formulas for Slope and Intercept
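Start with the Mean Squared Error (MSE) loss function:
$$\mathrm{MSE}(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Rearrange the inner expression:
$$= \frac{1}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$
Now take the partial derivative with respect to $\beta_0$:
$$\frac{\partial \mathrm{MSE}}{\partial \beta_0} = \frac{\partial}{\partial \beta_0}\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right)$$
Take the constant outside:
$$= \frac{1}{n}\,\frac{\partial}{\partial \beta_0}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$
Move the derivative inside the summation:
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial \beta_0}(y_i - \beta_0 - \beta_1 x_i)^2$$
Apply the chain rule:
$$= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)\cdot\frac{\partial}{\partial \beta_0}(y_i - \beta_0 - \beta_1 x_i)$$
Apply the derivative rules:
$$\frac{\partial}{\partial \beta_0}(y_i) = 0, \quad \frac{\partial}{\partial \beta_0}(-\beta_0) = -1, \quad \frac{\partial}{\partial \beta_0}(-\beta_1 x_i) = 0$$
So the inner derivative becomes:
$$\frac{\partial}{\partial \beta_0}(y_i - \beta_0 - \beta_1 x_i) = -1$$
Substitute back:
$$\frac{\partial \mathrm{MSE}}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1)$$
Simplify:
$$= -\frac{2}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)$$
Set the derivative equal to zero:
$$-\frac{2}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$$
Multiply both sides by $-\frac{n}{2}$:
$$\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0$$
Expand:
$$\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1\sum_{i=1}^{n} x_i = 0$$
Rearrange:
$$n\beta_0 = \sum_{i=1}^{n} y_i - \beta_1\sum_{i=1}^{n} x_i$$
Divide by $n$:
$$\beta_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \beta_1\,\frac{1}{n}\sum_{i=1}^{n} x_i$$
Using the means:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
Final intercept formula:
$$\beta_0 = \bar{y} - \beta_1\bar{x}$$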
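Now take the partial derivative with respect to $\beta_1$:
$$\frac{\partial \mathrm{MSE}}{\partial \beta_1} = \frac{\partial}{\partial \beta_1}\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right)$$
Take the constant outside:
$$= \frac{1}{n}\,\frac{\partial}{\partial \beta_1}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$
Move the derivative inside the summation:
$$= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial \beta_1}(y_i - \beta_0 - \beta_1 x_i)^2$$
Apply the chain rule:
$$= \frac{1}{n}\sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)\cdot\frac{\partial}{\partial \beta_1}(y_i - \beta_0 - \beta_1 x_i)$$
Apply the derivative rules:
$$\frac{\partial}{\partial \beta_1}(y_i) = 0, \quad \frac{\partial}{\partial \beta_1}(-\beta_0) = 0, \quad \frac{\partial}{\partial \beta_1}(-\beta_1 x_i) = -x_i$$
So the inner derivative becomes:
$$\frac{\partial}{\partial \beta_1}(y_i - \beta_0 - \beta_1 x_i) = -x_i$$
Substitute back:
$$\frac{\partial \mathrm{MSE}}{\partial \beta_1} = \frac{1}{n}\sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i)$$
Simplify:
$$= -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i)$$
Set the derivative equal to zero:
$$-\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0$$
Multiply both sides by $-\frac{n}{2}$:
$$\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i) = 0$$
Expand:
$$\sum_{i=1}^{n} x_i y_i - \beta_0\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2 = 0$$
Substitute $\beta_0 = \bar{y} - \beta_1\bar{x}$ into the equation:
$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \beta_1\bar{x})\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2 = 0$$
Expand:
$$\sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i + \beta_1\bar{x}\sum_{i=1}^{n} x_i - \beta_1\sum_{i=1}^{n} x_i^2 = 0$$
Since:
$$\sum_{i=1}^{n} x_i = n\bar{x}$$
Substitute:
$$\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} + \beta_1 n\bar{x}^2 - \beta_1\sum_{i=1}^{n} x_i^2 = 0$$
Group the $\beta_1$ terms:
$$\beta_1\left(n\bar{x}^2 - \sum_{i=1}^{n} x_i^2\right) = n\bar{x}\bar{y} - \sum_{i=1}^{n} x_i y_i$$
Multiply both sides by $-1$:
$$\beta_1\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$
Final slope formula:
$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
Equivalent covariance form:
$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Finally, substitute the computed value of $\beta_1$ into the intercept equation:
$$\beta_0 = \bar{y} - \beta_1\bar{x}$$
Thus, the final regression equation becomes:
$$\hat{y} = \beta_0 + \beta_1 x$$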
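The derivation can be sanity-checked numerically. The sketch below, written for this example with hypothetical names such as `mse_gradient`, evaluates both partial derivatives at the closed-form solution and confirms that the two forms of the slope formula agree. Both derivatives should come out as zero up to floating-point error.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse_gradient(beta0, beta1):
    """Partial derivatives of MSE with respect to beta0 and beta1."""
    residuals = [yi - beta0 - beta1 * xi for xi, yi in zip(x, y)]
    d_beta0 = -2 / n * sum(residuals)
    d_beta1 = -2 / n * sum(xi * ri for xi, ri in zip(x, residuals))
    return d_beta0, d_beta1


x_mean = sum(x) / n
y_mean = sum(y) / n

# Slope from the "sum of products" form of the formula.
slope_raw = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean) / (
    sum(xi ** 2 for xi in x) - n * x_mean ** 2
)
# Slope from the equivalent covariance form.
slope_cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope_cov * x_mean

print("slopes agree:", abs(slope_raw - slope_cov) < 1e-6)
print("gradient at the solution:", mse_gradient(intercept, slope_cov))
```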
Now, we learned how we got the formulas for the slope and intercept.
But one thing we need to consider here is that we derived these formulas for a case where we only have one feature, and even for one feature, we can see how complex the math was.
What if we have more than one feature, as most real-world datasets do?
The math becomes more complex, and this is where we use the matrix form to represent the equations. Using matrix notation, we can derive the normal equation, which generalizes to any number of features.
Deriving the Normal Equation
In Simple Linear Regression, we derived one intercept and one slope:
$$\hat{y} = \beta_0 + \beta_1 x$$
However, real-world problems usually contain multiple features.
For example:
- years of experience
- education level
- age
In such cases, Linear Regression becomes:
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_p x_p$$
where $\beta_0$ is the intercept and $\beta_1, \beta_2, \beta_3, \ldots, \beta_p$ are the slopes for the different features.
As the number of features increases, solving separate equations for every parameter becomes difficult.
To solve this easily, Linear Regression is rewritten using matrix notation.
Suppose we have n observations and p features.
First define the target vector:
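$$Y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}$$
Now define the feature matrix.
The first column contains only 1s to represent the intercept term.
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ 1 & x_{31} & x_{32} & \cdots & x_{3p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Now define the parameter vector:
$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$
Using matrix multiplication:
$$X\beta = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ 1 & x_{31} & x_{32} & \cdots & x_{3p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$
Performing the multiplication:
$$= \begin{bmatrix} \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} \\ \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} \\ \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \cdots + \beta_p x_{3p} \\ \vdots \\ \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} \end{bmatrix}$$
This gives the prediction vector:
$$\hat{Y} = X\beta$$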
Now define the residual vector.
Residuals are the differences between actual and predicted values:
$$Y - \hat{Y}$$
Substituting $\hat{Y} = X\beta$:
$$Y - X\beta$$
The Mean Squared Error (MSE) becomes:
$$\mathrm{MSE} = \frac{1}{n}(Y - X\beta)^T (Y - X\beta)$$
The transpose is required because $(Y - X\beta)$ is a column vector.
Multiplying by its transpose converts the expression into a scalar sum of squared residuals.
Now expand the expression:
$$\mathrm{MSE} = \frac{1}{n}(Y - X\beta)^T (Y - X\beta) = \frac{1}{n}\left(Y^T Y - Y^T X\beta - (X\beta)^T Y + (X\beta)^T X\beta\right)$$
Using the transpose property:
$$(X\beta)^T = \beta^T X^T$$
Substitute into the equation:
$$\mathrm{MSE} = \frac{1}{n}\left(Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta\right)$$
Notice that $Y^T X\beta$ is a scalar.
Scalars are equal to their transpose.
Therefore:
$$Y^T X\beta = \beta^T X^T Y$$
So the middle two terms combine:
$$\mathrm{MSE} = \frac{1}{n}\left(Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta\right)$$
To minimize MSE, take the derivative with respect to $\beta$.
The derivative of $Y^T Y$ is zero because it does not contain $\beta$.
The derivative of $-2\beta^T X^T Y$ is $-2X^T Y$.
The derivative of $\beta^T X^T X\beta$ is $2X^T X\beta$, because $X^T X$ is symmetric.
Therefore:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \frac{1}{n}\left(-2X^T Y + 2X^T X\beta\right)$$
Simplify:
$$= -\frac{2}{n}X^T Y + \frac{2}{n}X^T X\beta$$
Set the derivative equal to zero for minimization:
$$-\frac{2}{n}X^T Y + \frac{2}{n}X^T X\beta = 0$$
Multiply both sides by $\frac{n}{2}$:
$$-X^T Y + X^T X\beta = 0$$
Rearrange:
$$X^T X\beta = X^T Y$$
Now multiply both sides on the left by $(X^T X)^{-1}$, assuming $X^T X$ is invertible:
$$(X^T X)^{-1} X^T X\beta = (X^T X)^{-1} X^T Y$$
Using the identity matrix property:
$$(X^T X)^{-1}(X^T X) = I$$
we get:
$$I\beta = (X^T X)^{-1} X^T Y$$
Since $I\beta = \beta$, the final Normal Equation becomes:
$$\beta = (X^T X)^{-1} X^T Y$$
This equation simultaneously computes the intercept and all slopes, that is, the optimal parameters that minimize the Mean Squared Error.
In general, the normal equation is derived by minimizing the RSS (Residual Sum of Squares). However, since MSE is simply RSS divided by the number of observations, minimizing MSE also produces the same normal equation.
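In code, the normal equation might be sketched as follows. This is an illustration in plain Python, not a production solver. The helper names `transpose`, `matmul`, `solve` and `normal_equation` are hypothetical and written for this example. As a deliberate substitution, the sketch solves the linear system $X^T X\beta = X^T Y$ by Gauss-Jordan elimination instead of forming the inverse explicitly. This gives the same $\beta$ whenever $X^T X$ is invertible.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(a)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix [a | b]
    for col in range(size):
        # Use the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def normal_equation(features, targets):
    """features: list of rows without the intercept column; targets: list of y values."""
    X = [[1.0] + list(row) for row in features]  # prepend the column of ones
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(v * t for v, t in zip(col, targets)) for col in Xt]
    return solve(XtX, Xty)


x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]

beta = normal_equation([[xi] for xi in x], y)
print("beta0, beta1 =", [round(b, 2) for b in beta])
```

The same function accepts rows with several features, since nothing in it depends on there being only one column.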
Now we have the normal equation. Let’s solve for the slope and intercept once again using this equation.
Solving for Slope and Intercept Using the Normal Equation
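The matrix form of Linear Regression is:
$$\beta = (X^T X)^{-1} X^T Y$$
Construct the feature matrix.
The first column contains 1s for the intercept term.
$$X = \begin{bmatrix} 1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\ 1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8 \end{bmatrix}$$
Construct the target vector:
$$Y = \begin{bmatrix} 39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\ 56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190 \end{bmatrix}$$
Parameter vector:
$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
Now compute the transpose:
$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8 \end{bmatrix}$$
Compute $X^T X$, whose entries are $n$, $\sum x_i$ and $\sum x_i^2$:
$$X^T X = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix}$$
Now compute the inverse. The determinant is $10(70.49) - (25.1)^2 = 74.89$, so:
$$(X^T X)^{-1} = \frac{1}{74.89}\begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix} \approx \begin{bmatrix} 0.9412 & -0.3352 \\ -0.3352 & 0.1335 \end{bmatrix}$$
Now compute $X^T Y$, whose entries are $\sum y_i$ and $\sum x_i y_i$:
$$X^T Y = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix}$$
Substitute into the Normal Equation, carrying the unrounded inverse through the multiplication:
$$\beta = \frac{1}{74.89}\begin{bmatrix} 70.49 & -25.1 \\ -25.1 & 10 \end{bmatrix}\begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix}$$
After multiplication:
$$\beta \approx \begin{bmatrix} 27315.80 \\ 9020.64 \end{bmatrix}$$
Therefore:
$$\beta_0 \approx 27315.80, \quad \beta_1 \approx 9020.64$$
Final regression equation:
$$\hat{y} = 27315.80 + 9020.64x$$
Why Do We Need Gradient Descent?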
Now that we have the normal equation for linear regression, we might think that we can always solve for the optimal parameters directly, even when we have many features.
But this method works well only for small or medium-sized datasets. When we have very large datasets, solving the normal equation becomes computationally expensive.
Let’s look at the normal equation:
$$\beta = (X^T X)^{-1} X^T y$$
The equation contains a matrix inverse, $(X^T X)^{-1}$, and this is where solving for the parameters becomes computationally expensive, because the cost of the inverse grows quickly as the number of features increases.
In the real world, we often have thousands of features and millions of data points.
In such cases, solving the normal equation becomes slow and requires a lot of computational power.
This is where gradient descent is used: instead of solving for the solution directly, we move toward the optimal solution step by step.
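To make "computationally expensive" a little more concrete, here is a rough operation-count comparison. It assumes a dense feature matrix with $n$ rows and $p$ features and a textbook direct solver. The constants depend on the implementation, so these are order-of-magnitude estimates rather than benchmarks.

$$
\begin{aligned}
\text{forming } X^T X &: \; O(n p^2) \\
\text{inverting } X^T X \text{ (or solving } X^T X\beta = X^T y\text{)} &: \; O(p^3) \\
\text{one full-dataset gradient } \tfrac{2}{n}X^T(X\beta - y) &: \; O(n p)
\end{aligned}
$$

Under these assumptions, the direct solution pays a cost that is cubic in the number of features, while a single gradient step is only linear in both $n$ and $p$.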
Now, to understand how gradient descent works, let’s look at the math behind it.
The Math Behind Gradient Descent
When we were deriving the normal equation, we arrived at this expression for the gradient:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \frac{2}{n} X^T (X\beta - y)$$
This equation represents the gradient (slope) of the bowl-shaped loss surface.
There, we set it equal to zero and solved further to get the normal equation, which gives the optimal solution directly.
In gradient descent, we stop at this equation instead and initialize $\beta$ with some starting values. Using these values, we calculate the gradient and move toward the minimum loss step by step.
Let’s assume we initialize $\beta_0 = 2$ and $\beta_1 = 5$:
$$\beta^{(0)} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
These are just the starting values from which Gradient Descent begins searching for the minimum loss.
Next, we calculate the slope of the bowl at this point by substituting these values into the gradient equation, written in expanded form:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = -\frac{2}{n} X^T y + \frac{2}{n} X^T X\beta$$
Now let’s construct the feature matrix.
Since we have one feature, the matrix $X$ becomes:
$$X = \begin{bmatrix} 1 & 1.2 \\ 1 & 1.4 \\ 1 & 1.6 \\ 1 & 2.1 \\ 1 & 2.3 \\ 1 & 3.0 \\ 1 & 3.1 \\ 1 & 3.3 \\ 1 & 3.3 \\ 1 & 3.8 \end{bmatrix}$$
The first column contains ones for the intercept term.
Now calculate $X^T$:
$$X^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8 \end{bmatrix}$$
Now calculate $X^T X$:
$$X^T X = \begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix}$$
Next, let the target vector be:
$$y = \begin{bmatrix} 39344 \\ 46206 \\ 37732 \\ 43526 \\ 39892 \\ 56643 \\ 60151 \\ 54446 \\ 64446 \\ 57190 \end{bmatrix}$$
Now calculate $X^T y$:
$$X^T y = \begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix}$$
Our dataset contains $n = 10$ observations.
Now substitute all the values into the gradient equation:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = -\frac{2}{10}\begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} + \frac{2}{10}\begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix}\begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
First, calculate the matrix multiplication:
$$\begin{bmatrix} 10 & 25.1 \\ 25.1 & 70.49 \end{bmatrix}\begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} (10)(2) + (25.1)(5) \\ (25.1)(2) + (70.49)(5) \end{bmatrix} = \begin{bmatrix} 20 + 125.5 \\ 50.2 + 352.45 \end{bmatrix} = \begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix}$$
Now multiply by $\frac{2}{10}$:
$$\frac{2}{10}\begin{bmatrix} 145.5 \\ 402.65 \end{bmatrix} = \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix}$$
Next, calculate:
$$-\frac{2}{10}\begin{bmatrix} 499576 \\ 1321491.3 \end{bmatrix} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix}$$
Now substitute everything back:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \begin{bmatrix} -99915.2 \\ -264298.26 \end{bmatrix} + \begin{bmatrix} 29.1 \\ 80.53 \end{bmatrix}$$
Finally:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix}$$
This gradient represents the slope of the bowl-shaped MSE loss surface at the current parameter values.
Here, $-99886.1$ is the slope with respect to $\beta_0$, and $-264217.73$ is the slope with respect to $\beta_1$.
Since both values are negative, the loss decreases as $\beta_0$ and $\beta_1$ increase, so Gradient Descent moves both parameters toward the right to reduce the loss.
Now, instead of directly solving for the optimal parameters mathematically, Gradient Descent gradually updates the parameter values step by step until it reaches the minimum point of the bowl-shaped loss surface.
This update is performed using the Gradient Descent update equation:
$$\beta := \beta - \alpha\,\frac{\partial \mathrm{MSE}}{\partial \beta}$$
Here, $\beta$ represents the current parameter values, $\frac{\partial \mathrm{MSE}}{\partial \beta}$ is the gradient of the loss at the current point, and $\alpha$ is the learning rate, which controls how large each update step is.
The gradient tells us the direction in which the loss increases the fastest.
Therefore, to reduce the loss, we move in the opposite direction of the gradient, which is why the update equation subtracts it.
If a gradient component is positive, Gradient Descent moves that parameter toward the left (decreases it).
If a gradient component is negative, Gradient Descent moves that parameter toward the right (increases it).
By repeatedly calculating gradients and updating parameters, Gradient Descent gradually moves toward the minimum point of the bowl-shaped loss surface.
After each update, the entire process is repeated until the loss stops decreasing and the model reaches the optimal parameters.
Notice that there is no matrix inverse involved.
Learning Rate
One important thing we need to understand here is the learning rate.
Let’s assume $\alpha = 0.01$, and the calculated gradient is:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix}$$
Now substitute these values into the update equation:
$$\beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 0.01\begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix}$$
First, multiply the learning rate with the gradient:
$$0.01\begin{bmatrix} -99886.1 \\ -264217.73 \end{bmatrix} = \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix}$$
Now substitute back:
$$\beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - \begin{bmatrix} -998.861 \\ -2642.1773 \end{bmatrix} = \begin{bmatrix} 2 + 998.861 \\ 5 + 2642.1773 \end{bmatrix}$$
Finally:
$$\beta = \begin{bmatrix} 1000.861 \\ 2647.1773 \end{bmatrix}$$
After one iteration of Gradient Descent, $\beta_0$ changed from 2 to 1000.861, and $\beta_1$ changed from 5 to 2647.1773.
These updated parameter values move us closer to the minimum point of the bowl-shaped MSE loss surface.
Now, using these updated values, the entire process is repeated:
Predictions → Residuals → Loss → Gradient → Parameter Update
This iterative process continues until the loss stops decreasing and the model reaches the optimal parameters.
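The whole loop can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the function names `batch_gradient` and `gradient_descent` and the step count of 20000 are choices made for this example, not values from the article. The gradient is written out component by component, which is equivalent to the matrix form $\frac{2}{n}X^T(X\beta - y)$ for a single feature.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def batch_gradient(beta0, beta1):
    """Gradient of the MSE computed over the full dataset."""
    errors = [(beta0 + beta1 * xi) - yi for xi, yi in zip(x, y)]  # X beta - y
    g0 = 2 / n * sum(errors)
    g1 = 2 / n * sum(e * xi for e, xi in zip(errors, x))
    return g0, g1


def gradient_descent(beta0, beta1, learning_rate, n_steps):
    """Repeat the update beta := beta - learning_rate * gradient."""
    for _ in range(n_steps):
        g0, g1 = batch_gradient(beta0, beta1)
        beta0 -= learning_rate * g0
        beta1 -= learning_rate * g1
    return beta0, beta1


# Gradient at the starting point used in the worked example.
g0, g1 = batch_gradient(2.0, 5.0)
print("gradient at (2, 5):", round(g0, 2), round(g1, 2))

# Many small steps with the same learning rate as the worked example.
beta0, beta1 = gradient_descent(2.0, 5.0, learning_rate=0.01, n_steps=20000)
print("after 20000 steps:", round(beta0, 2), round(beta1, 2))
```

With enough steps, the parameters settle at the same values that the closed-form solution gives, without any matrix inverse being computed.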
Now let’s understand why choosing the learning rate is very important.
If the learning rate is very small, for example $\alpha = 0.000001$, then the updates become extremely small.
As a result, learning is very slow, and Gradient Descent may require thousands of iterations or more to reach the minimum point.
On the other hand, if the learning rate is very large, for example $\alpha = 10$, then the updates become extremely large.
As a result, Gradient Descent may overshoot the minimum point repeatedly and fail to reach the solution.
Therefore, choosing a proper learning rate is very important for efficient optimization.
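The effect is easy to see by running a few steps with each of the three learning rates mentioned above. The sketch below is an illustration only: the helper name `run` and the choice of five steps are assumptions made for this example.

```python
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
n = len(x)


def mse(beta0, beta1):
    """Mean squared error on the full dataset."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)) / n


def batch_gradient(beta0, beta1):
    """Gradient of the MSE computed over the full dataset."""
    errors = [(beta0 + beta1 * xi) - yi for xi, yi in zip(x, y)]
    return 2 / n * sum(errors), 2 / n * sum(e * xi for e, xi in zip(errors, x))


def run(learning_rate, n_steps):
    """Run batch gradient descent from (2, 5) and return the final MSE."""
    beta0, beta1 = 2.0, 5.0
    for _ in range(n_steps):
        g0, g1 = batch_gradient(beta0, beta1)
        beta0 -= learning_rate * g0
        beta1 -= learning_rate * g1
    return mse(beta0, beta1)


start_loss = mse(2.0, 5.0)
print(f"starting MSE: {start_loss:.3e}")
for lr in (0.000001, 0.01, 10):
    print(f"learning rate {lr}: MSE after 5 steps = {run(lr, 5):.3e}")
```

On this dataset, the tiny learning rate barely changes the loss, the moderate one reduces it substantially, and the large one makes the loss grow instead of shrink.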
GIF by Author
Stochastic Gradient Descent
Now we have an idea about what gradient descent actually is.
In this method, we used the entire dataset to calculate the gradients before each parameter update.
This approach is called batch gradient descent because it uses the entire dataset for every update step, and it can become slow for very large datasets.
Now imagine a dataset containing millions of data points.
For every single update step, Gradient Descent would need to:
- process the entire dataset
- calculate the loss
- calculate the gradients
and then finally update the parameters.
This repeated computation is expensive and time-consuming.
This is where Stochastic Gradient Descent (SGD) comes into the picture.
Instead of calculating gradients using the entire dataset, SGD randomly selects only one observation at a time and immediately updates the parameters.
The update equation still remains the same:
$$\beta := \beta - \alpha\,\frac{\partial \mathrm{MSE}}{\partial \beta}$$
The only difference is that the gradient is now calculated using a single observation instead of the entire dataset.
We can understand this by using one data point from our dataset.
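The parameter values are:
$$\beta^{(0)} = \begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
and the learning rate is:
$$\alpha = 0.01$$
Now let’s say SGD randomly selected the following training example from our dataset:
$$(x, y) = (3.0, 56643)$$
For this single observation:
$$X = \begin{bmatrix} 1 & 3.0 \end{bmatrix}$$
and
$$y = \begin{bmatrix} 56643 \end{bmatrix}$$
Now calculate:
$$X^T = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix}$$
Next calculate:
$$X^T X = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix}\begin{bmatrix} 1 & 3.0 \end{bmatrix} = \begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix}$$
Now calculate:
$$X^T y = \begin{bmatrix} 1 \\ 3.0 \end{bmatrix}\begin{bmatrix} 56643 \end{bmatrix} = \begin{bmatrix} 56643 \\ 169929 \end{bmatrix}$$
Since SGD is using only one observation:
$$n = 1$$
Now substitute everything into the gradient equation:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = -\frac{2}{n} X^T y + \frac{2}{n} X^T X\beta$$
Substituting:
$$= -\frac{2}{1}\begin{bmatrix} 56643 \\ 169929 \end{bmatrix} + \frac{2}{1}\begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix}\begin{bmatrix} 2 \\ 5 \end{bmatrix}$$
First calculate the matrix multiplication:
$$\begin{bmatrix} 1 & 3.0 \\ 3.0 & 9.0 \end{bmatrix}\begin{bmatrix} 2 \\ 5 \end{bmatrix} = \begin{bmatrix} (1)(2) + (3.0)(5) \\ (3.0)(2) + (9.0)(5) \end{bmatrix} = \begin{bmatrix} 2 + 15 \\ 6 + 45 \end{bmatrix} = \begin{bmatrix} 17 \\ 51 \end{bmatrix}$$
Now multiply by $\frac{2}{1}$:
$$\frac{2}{1}\begin{bmatrix} 17 \\ 51 \end{bmatrix} = \begin{bmatrix} 34 \\ 102 \end{bmatrix}$$
Now calculate:
$$-\frac{2}{1}\begin{bmatrix} 56643 \\ 169929 \end{bmatrix} = \begin{bmatrix} -113286 \\ -339858 \end{bmatrix}$$
Now substitute everything back:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \begin{bmatrix} -113286 \\ -339858 \end{bmatrix} + \begin{bmatrix} 34 \\ 102 \end{bmatrix}$$
Finally:
$$\frac{\partial \mathrm{MSE}}{\partial \beta} = \begin{bmatrix} -113252 \\ -339756 \end{bmatrix}$$
This gradient represents the slope of the bowl-shaped loss surface for this single training example.
Now update the parameters using:
$$\beta := \beta - \alpha\,\frac{\partial \mathrm{MSE}}{\partial \beta}$$
Substituting the values:
$$\beta = \begin{bmatrix} 2 \\ 5 \end{bmatrix} - 0.01\begin{bmatrix} -113252 \\ -339756 \end{bmatrix}$$
First multiply the learning rate:
$$= \begin{bmatrix} 2 \\ 5 \end{bmatrix} - \begin{bmatrix} -1132.52 \\ -3397.56 \end{bmatrix}$$
Now subtract:
$$= \begin{bmatrix} 2 + 1132.52 \\ 5 + 3397.56 \end{bmatrix}$$
Finally:
$$\beta = \begin{bmatrix} 1134.52 \\ 3402.56 \end{bmatrix}$$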
After processing just one observation, the parameters are updated immediately.
SGD then randomly selects another observation from the dataset and repeats the same process.
Unlike batch gradient descent, which processes the entire dataset before each update, SGD updates the parameters after every single training example.
Because each update is so cheap, SGD can make many updates in the time batch gradient descent takes to make one, so on large datasets it usually makes progress toward the solution faster.
We can see how simple the calculation becomes when using just one observation.
SGD keeps updating the parameters using different training examples until the loss reaches a minimum or stops changing significantly.
The trade-off is that each single-observation gradient is only a noisy estimate of the full gradient, so the path toward the minimum point is noisy and zig-zags.
Even so, the low cost per update makes SGD highly useful for modern machine learning and deep learning problems involving very large datasets.
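A minimal SGD loop might look like the sketch below. The names `single_sample_gradient` and `sgd`, the fixed seed and the step count are assumptions made for this example. It uses a constant learning rate for simplicity, so the parameters end up hovering near the minimum rather than settling exactly on it.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(beta0, beta1):
    """Mean squared error on the full dataset, used here only to monitor progress."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


def single_sample_gradient(beta0, beta1, xi, yi):
    """Gradient of the squared error for one observation (the n = 1 case)."""
    error = (beta0 + beta1 * xi) - yi
    return 2 * error, 2 * error * xi


def sgd(beta0, beta1, learning_rate, n_steps, seed=0):
    """Stochastic gradient descent: one random observation per update."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for _ in range(n_steps):
        i = rng.randrange(len(x))  # pick one observation at random
        g0, g1 = single_sample_gradient(beta0, beta1, x[i], y[i])
        beta0 -= learning_rate * g0  # update immediately, before seeing other data
        beta1 -= learning_rate * g1
    return beta0, beta1


# The single-observation gradient from the worked example above.
print("gradient for (3.0, 56643):", single_sample_gradient(2.0, 5.0, 3.0, 56643))

b0, b1 = sgd(2.0, 5.0, learning_rate=0.01, n_steps=20000)
print("starting MSE:", round(mse(2.0, 5.0)))
print("MSE after SGD:", round(mse(b0, b1)))
```

In practice the learning rate is often reduced over time so that the noise in the updates shrinks as the parameters approach the minimum. That refinement is left out here to keep the sketch close to the update rule shown above.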
Conclusion
Now we have an idea of both gradient descent and stochastic gradient descent.
First, we derived the normal equation, and then we learned that the inverse matrix calculation becomes computationally expensive and memory usage becomes high for large datasets.
To solve this problem, we used gradient descent, which is not limited to linear regression but is also used in many machine learning and deep learning algorithms.
Next, we learned that even the first method of gradient descent that we used, called batch gradient descent, can become slow for very large datasets because it uses the entire dataset before updating parameters.
This led us to stochastic gradient descent (SGD), which updates the parameters using one training example at a time and works faster than batch gradient descent for large datasets.
We also have another variation of gradient descent called mini-batch gradient descent, in which we use a small batch of training examples from the dataset, such as 32 or 64 rows, before updating the parameters.
In this way, it becomes faster than batch gradient descent and more stable than stochastic gradient descent.
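For completeness, mini-batch gradient descent can be sketched the same way. The function names and the batch size of 4 are assumptions made for this example. The batch size is small only because the dataset here has ten rows, whereas sizes such as 32 or 64 are more typical on real data.

```python
import random

x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]


def mse(beta0, beta1):
    """Mean squared error on the full dataset, used here only to monitor progress."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


def minibatch_gradient(beta0, beta1, indices):
    """Average MSE gradient over the observations listed in `indices`."""
    m = len(indices)
    errors = [(beta0 + beta1 * x[i]) - y[i] for i in indices]
    g0 = 2 / m * sum(errors)
    g1 = 2 / m * sum(e * x[i] for e, i in zip(errors, indices))
    return g0, g1


def minibatch_gd(beta0, beta1, learning_rate, batch_size, n_epochs, seed=0):
    """Mini-batch gradient descent: one update per small batch of observations."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    order = list(range(len(x)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # new random order for every pass over the data
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            g0, g1 = minibatch_gradient(beta0, beta1, batch)
            beta0 -= learning_rate * g0
            beta1 -= learning_rate * g1
    return beta0, beta1


b0, b1 = minibatch_gd(2.0, 5.0, learning_rate=0.01, batch_size=4, n_epochs=5000)
print("starting MSE:", round(mse(2.0, 5.0)))
print("MSE after mini-batch GD:", round(mse(b0, b1)))
```

Setting `batch_size` to 1 gives stochastic gradient descent, and setting it to the dataset size gives batch gradient descent, so the three variants differ only in how many rows feed each update.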
Even though linear regression has a closed-form solution, we often prefer to use gradient descent when working with large datasets containing millions of observations because the normal equation becomes computationally expensive and impractical.
In deep learning, however, closed-form solutions usually do not exist, which makes optimization algorithms like gradient descent even more important.
Dataset License
The dataset used in this blog is the Salary dataset.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
I hope you now have a better understanding of what gradient descent and stochastic gradient descent actually are.
If you’d like to read more of my writing, you can also find it on Medium and LinkedIn.
I recently wrote a detailed breakdown of Lasso Regression from a geometric and intuitive perspective.
You can read it here.
Thanks for reading!