

Fundamentals of Multivariate Calculus for Data Science and Machine Learning


Photo Source — https://i.pinimg.com/originals/12/10/17/121017deafcab3026b8fba0a9bce9b68.jpg
Multivariate calculus is used throughout the machine learning and data science ecosystem, so a first-principles understanding of it is incredibly useful when you are working through the complex math behind an ML algorithm.
To start with, as soon as you need to implement multivariate linear regression, you hit multivariate calculus: it is what you use to derive the gradient of a system of multivariate linear equations, i.e. the derivative of a matrix expression. Beyond that, optimizing a neural network's cost function, the many variations of the gradient descent algorithm, and a plethora of other optimization problems are all handled with multivariate calculus.
So in this blog, I shall go over the fundamental concepts of multivariate calculus that we need in order to understand many of the mechanisms of machine learning and data science.
Differentiation of Function of Several Variables
The derivative of a one-variable function measures its rate of change. We already know the first-principles definition of the derivative of a single-variable function: let D ⊆ R and let c be an interior point of D, that is, (c − r, c + r) ⊆ D for some r > 0. A function f : D → R is said to be differentiable at c if the limit

lim (x→c) [f(x) − f(c)] / (x − c)

exists. In this case the value of the limit is denoted by f′(c) and is called the derivative of f at c.
Now we see how a two-variable function has two rates of change: one as x changes (with y held constant) and one as y changes (with x held constant).
In general, if f is a function of two variables x and y, suppose we let only x vary while keeping y fixed, say y = b, where b is a constant. Then we are really considering a function of a single variable x.
We study the influence of x and y separately on the value of the function f (x, y) by holding one fixed and letting the other vary. This leads to the following definitions of Partial Derivatives of function f with respect to x and y.
For all points at which the limits exist, we define the partial derivatives at the point (a, b) by

fx(a, b) = lim (h→0) [f(a + h, b) − f(a, b)] / h
Similarly, the partial derivative of f with respect to y at (a, b) is obtained by keeping x fixed (x = a) and finding the ordinary derivative at b of the function:

fy(a, b) = lim (h→0) [f(a, b + h) − f(a, b)] / h
So, if f is a function of two variables, its partial derivatives are the functions fx and fy defined by

fx(x, y) = lim (h→0) [f(x + h, y) − f(x, y)] / h
fy(x, y) = lim (h→0) [f(x, y + h) − f(x, y)] / h
Extending the above, if f is a function of three variables x, y, and z, then its partial derivative with respect to x is defined as

fx(x, y, z) = lim (h→0) [f(x + h, y, z) − f(x, y, z)] / h
and it is found by treating y and z as constants and differentiating f(x, y, z) with respect to x. If w = f(x, y, z), then fx = ∂w/∂x can be interpreted as the rate of change of w with respect to x when y and z are held fixed. But we can't interpret it geometrically, because the graph of f lies in four-dimensional space.
In general, if u is a function of n variables, u = f(x1, x2, …, xn), its partial derivative with respect to the i-th variable xi is

∂u/∂xi = lim (h→0) [f(x1, …, xi + h, …, xn) − f(x1, …, xn)] / h
Quick exercise — calculating Partial Derivative of a Multivariate Function
Find ∂z/∂x and ∂z/∂y if z is defined implicitly as a function of x and y by the equation
To find ∂z/∂x, first we differentiate implicitly with respect to x, treating y as a constant:
Solving this equation for ∂z/∂x, we get
Similarly, differentiating with respect to y, i.e. ∂z/∂y, treating x as a constant, gives
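The specific equation in this exercise was shown as an image above, so as a stand-in here is a minimal sympy sketch, assuming the hypothetical implicit equation x³ + y³ + z³ + 6xyz = 1 (my choice for illustration, not necessarily the one used above); sympy's idiff does the implicit differentiation for us:

import sympy as sp

x, y, z = sp.symbols('x y z')
# Hypothetical implicit equation F(x, y, z) = 0
F = x**3 + y**3 + z**3 + 6*x*y*z - 1

# idiff treats z as an implicit function of x (then of y) and solves for the partial.
dz_dx = sp.idiff(F, z, x)
dz_dy = sp.idiff(F, z, y)
print(sp.simplify(dz_dx))  # -(x**2 + 2*y*z)/(z**2 + 2*x*y), up to equivalent form
print(sp.simplify(dz_dy))  # -(y**2 + 2*x*z)/(z**2 + 2*x*y), up to equivalent form

The same pattern works for any F(x, y, z) = 0: differentiate implicitly, then solve for the partial derivative you want.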
Geometry of Multivariate Function and Three-dimensional space
First, a quick refresher on plain-vanilla two-dimensional space, denoted by R², is the familiar Cartesian plane. If we construct two perpendicular lines (the x- and y-coordinate axes), set the origin as the point of intersection of the axes, and establish numerical scales on these lines, then we may locate a point in R² by giving an ordered pair of numbers (x, y), the coordinates of the point. Note that the coordinate axes divide the plane into four quadrants.
For multivariable calculus, we need some basic understanding of three-dimensional space.
Three-dimensional space, denoted R³, requires three mutually perpendicular
coordinate axes (called the x-, y- and z-axes) that meet in a single point (called the origin) in order to locate an arbitrary point. Analogous to the case of R², if we establish scales on the axes, then we can locate a point in R³ by giving an ordered triple of numbers (x, y, z). The coordinate axes divide three-dimensional space into eight octants. It takes some practice to get your sense of perspective correct when sketching points in R³.
Plotting a multivariate function in three-dimensional space
Coordinate axes in three-dimensional space
Imagine three coordinate axes meeting at the origin (0, 0, 0): a vertical axis (z), and two horizontal axes at right angles to each other (x and y). The xy-plane is horizontal, while the z-axis extends vertically above and below the plane. We generally use right-handed axes: if you curl the fingers of your right hand from the positive x-axis to the positive y-axis, your thumb points along the positive z-axis.
The x-, y-, and z-axes in R³ are always drawn in a right-handed configuration.
We identify a point in 3-space by giving its coordinates (x, y, z) with respect to these axes.
We may imagine a picture of a three-dimensional coordinate system in terms of a room. The origin is one of the corners at floor level where two walls meet the floor. The z-axis is the vertical intersection of the two walls; the x- and the y-axis are the intersections of each wall with the floor. Points with
negative coordinates lie behind a wall in the next room or below the floor.
Now a quick example: let's see what the graphs of the equations z = 0, z = 3, and z = −1 look like.
The planes z = −1, z = 0, and z = 3
For the graph of z = 0, we visualize the set of points whose z-coordinate is zero. Every such point is at the same vertical level as the origin, i.e. it lies in the horizontal plane containing the origin. So the graph of z = 0 is the middle plane in the figure above.
The graph of z = 3 is a plane parallel to the graph of z = 0, but three units above it. The graph of z = −1 is a plane parallel to the graph of z = 0, but one unit below it.
The plane z = 0 contains the x- and the y-coordinate axes, and is called the xy-plane. There are two other coordinate planes. The yz-plane contains both the y- and the z-axis, and the xz-plane contains the x- and the z-axis.
The three coordinate planes
Graph of Univariate Function vs Multivariate Function
If f is a scalar-valued function of a single variable, f : R → R (the notation R stands for the real numbers; similarly, R² denotes pairs of real numbers, i.e. the two-dimensional coordinate system), then the graph of f is the set of points (x, f(x)) for all x in the domain of f. We call this the graph of y = f(x), since the points lie in the xy-plane. When plotted, the points typically form a curve, such as the graph of f(x) = x² shown below.
But, the graphs of functions of two or more variables are examples of surfaces. That is, a set of points (x,y,z) that satisfy an equation relating all three variables is often a surface.
Now, we define the graph of a scalar-valued function of two variables, f : R² → R, in the same way. The graph is the set of points (x, y, f(x, y)) for all (x, y) in the domain of f. We often call this the graph of z = f(x, y), since the points lie in xyz-space (instead of only the xy-plane). The graph of such an f(x, y) is a surface.
Let's plot an example. You saw above that the graph of the single-variable function y = x² is a parabola. Now extend it to a multivariate function:
f(x,y)= x² + y²
Its graph is something called a paraboloid, a type of quadric surface.
Made with 3D-Surface Graphing Online Tool — https://academo.org/demos/3d-surface-plotter/
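If you want to reproduce this plot locally rather than with the online tool, here is a minimal matplotlib sketch (assuming numpy and matplotlib are installed) that graphs the same surface:

import numpy as np
import matplotlib.pyplot as plt

# Sample a grid in the xy-plane and evaluate f(x, y) = x^2 + y^2 on it.
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # 3D axes from matplotlib's mplot3d toolkit
ax.plot_surface(X, Y, Z, cmap='viridis')
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_zlabel('z')
plt.show()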
Geometric Interpretation of Partial Derivatives of a Multivariate Function
Take a look at the following graph of a surface
To understand the geometric interpretation of partial derivatives, we recall that the equation z =f (x,y) represents the surface S (which is the graph of f ).
If f(a, b) = c, then the point P(a, b, c) lies on S. By fixing y = b, we restrict our attention to the curve C1 in which the vertical plane y = b intersects S. (In other words, C1 is the trace of S in the plane y = b.) Likewise, the vertical plane x = a intersects S in a curve C2. Both of the curves C1 and C2 pass through the point P.
Now, note that the curve C1 is the graph of the function g(x) =f(x, b), so the slope of its tangent T1 at P is g′(a) = fx(a, b).
The curve C2 is the graph of the function G(y) = f(a, y), so the slope of its tangent T2 at P is G′(b) = fy(a,b).
Thus the partial derivatives fx(a, b) and fy(a, b) can be interpreted geometrically as the slopes of the tangent lines at P(a, b, c) to the traces C1 and C2 of S in the planes y = b and x = a.
Partial derivatives can also be interpreted as rates of change. If z = f(x, y), then ∂z/∂x represents the rate of change of z with respect to x when y is fixed. Similarly, ∂z/∂y represents the rate of change of z with respect to y when x is fixed.
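To make the rate-of-change reading concrete, here is a tiny finite-difference sketch, assuming the example function f(x, y) = x² + y² (my stand-in, not a function from the figures): hold one variable fixed and nudge the other.

def f(x, y):
    return x**2 + y**2   # assumed example function

def partial_x(f, x, y, h=1e-6):
    # Nudge x while y stays fixed: approximates ∂f/∂x.
    return (f(x + h, y) - f(x, y)) / h

def partial_y(f, x, y, h=1e-6):
    # Nudge y while x stays fixed: approximates ∂f/∂y.
    return (f(x, y + h) - f(x, y)) / h

print(partial_x(f, 1.0, 2.0))  # ≈ 2.0, since ∂f/∂x = 2x
print(partial_y(f, 1.0, 2.0))  # ≈ 4.0, since ∂f/∂y = 2y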
So, to see a use case of the above interpretation, let's take a quick look at the graph below of a surface z = f(x, y) and try to work out whether each partial derivative is positive or negative at the marked points.
The positive x-axis points out of the page. So imagine heading off in this direction from the point marked P: we descend steeply, so the partial derivative with respect to x is negative at P, with quite a large absolute value. The same is true for the partial derivative with respect to y at P, since there is also a steep descent in the positive y-direction.
At the point marked Q, heading in the positive x-direction results in a gentle descent, whereas heading in the positive y-direction results in a gentle ascent.
Thus, the partial derivative fx at Q is negative but small (that is, near zero), and the partial derivative fy is positive but small.
Basic Rules of Partial Differentiation
In the multivariate case, the basic differentiation rules that we know from high-school mathematics (e.g., sum rule, product rule, chain rule) still apply. However, when we compute derivatives with respect to vectors x we need to pay attention: Our gradients now involve vectors and matrices, and matrix multiplication is not commutative , i.e., the order matters.
Product and Sum Rule

Sum rule: ∂/∂x [f(x, y) + g(x, y)] = ∂f/∂x + ∂g/∂x
Product rule: ∂/∂x [f(x, y) g(x, y)] = (∂f/∂x) g(x, y) + f(x, y) (∂g/∂x)
Chain rule: d/dx f(g(x)) = f′(g(x)) · g′(x) (treated in detail in the next section)
Total Derivative of a Multivariate Function
The total derivative of a multivariable function, each of whose variables is itself a function of another argument, is the derivative of the function with respect to that argument. It is equal to the sum of the partial derivatives with respect to each variable, each multiplied by the derivative of that variable with respect to the independent argument.
In general mathematical form, the total differential for three or more variables is defined as below. For a function z = f(x, y, …, u) the total differential is

dz = (∂z/∂x) dx + (∂z/∂y) dy + … + (∂z/∂u) du
For example, let w = f(x, y, z) be a continuous function of the variables x, y, z, with continuous partial derivatives ∂w/∂x, ∂w/∂y, ∂w/∂z. And assume x, y, z are differentiable functions x = x(t), y = y(t), z = z(t) of a variable t. Then the total derivative of w with respect to t is given by

dw/dt = (∂w/∂x)(dx/dt) + (∂w/∂y)(dy/dt) + (∂w/∂z)(dz/dt)
Let's see a quick example — Find the total differential of the following function
w = x³yz + xy + z + 3 at a point (1,2,3)
The total differential at the point (x0, y0, z0) is

dw = wx(x0, y0, z0) dx + wy(x0, y0, z0) dy + wz(x0, y0, z0) dz
Substituting the x, y, z values for the point (1, 2, 3), we get wx(1, 2, 3) = 20, wy(1, 2, 3) = 4, wz(1, 2, 3) = 3. So the final answer is

dw = 20 dx + 4 dy + 3 dz
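As a quick check of those numbers, here is a short sympy sketch that recomputes the three partials of w = x³yz + xy + z + 3 and evaluates them at (1, 2, 3):

import sympy as sp

x, y, z = sp.symbols('x y z')
w = x**3*y*z + x*y + z + 3

# Partial derivatives of w, evaluated at the point (1, 2, 3).
partials = [sp.diff(w, var) for var in (x, y, z)]
point = {x: 1, y: 2, z: 3}
print([p.subs(point) for p in partials])  # [20, 4, 3]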
Chain Rule for Univariate and Multivariate Functions
First, the case when the Inner function is Univariate
We have already seen above an application of the chain rule for the derivative of a composite of two functions.
For a simple composite function

h(x) = f(g(x))

the derivative is

h′(x) = f′(g(x)) · g′(x)
The chain rule says that we should take the derivative of the outside function, keep the inside function untouched, and then multiply everything by the derivative of the inside function.
In Leibniz notation, this reads dh/dx = (df/dg) · (dg/dx). In the above equation, both f and g are functions of one variable.
The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into subexpressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple subexpression in isolation yet still combine the intermediate results to get the correct overall result.
Consider a function f : R² → R of two variables x1, x2, where x1(t) and x2(t) are themselves functions of t. To compute the gradient of f with respect to t, we have the chain rule for multivariate functions:

df/dt = (∂f/∂x1)(dx1/dt) + (∂f/∂x2)(dx2/dt)

Chain rule for multivariate functions

where d denotes the total derivative and ∂ partial derivatives.
An example of the above:
Consider the following function
where x1 = sin t and x2 = cos t. Then the corresponding derivative of f with respect to t is the following:
derivative of f with respect to t.
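The function f itself appeared as an image above, so as a stand-in assume f(x1, x2) = x1² + 2x2 (a common textbook choice, not necessarily the one used here); the sketch below lets sympy apply the chain rule for us:

import sympy as sp

t = sp.symbols('t')
x1, x2 = sp.sin(t), sp.cos(t)   # the inner univariate functions
f = x1**2 + 2*x2                # assumed outer function

# diff applies the chain rule: df/dt = 2*x1*(dx1/dt) + 2*(dx2/dt)
print(sp.diff(f, t))  # 2*sin(t)*cos(t) - 2*sin(t)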
And now the case when the inner function is a multivariate function.
That is, f(x1, x2) is a function of x1 and x2, where x1(s, t) and x2(s, t) are themselves functions of two variables s and t, so the function we want to differentiate is
f(x1(s, t), x2(s, t))
In this case, the chain rule yields the following partial derivatives:

∂f/∂s = (∂f/∂x1)(∂x1/∂s) + (∂f/∂x2)(∂x2/∂s)
∂f/∂t = (∂f/∂x1)(∂x1/∂t) + (∂f/∂x2)(∂x2/∂t)

Note that this set involves a total of six partial derivatives.
and the gradient is obtained by the matrix multiplication

df/d(s, t) = [∂f/∂x1  ∂f/∂x2] · [[∂x1/∂s, ∂x1/∂t], [∂x2/∂s, ∂x2/∂t]]
Let's see an example of this, i.e. where the inner functions have two independent variables.
Derive ∂z/∂u and ∂z/∂v using the following functions:
where the inner functions are
So following our earlier rule, to implement the chain rule for two variables, we need six partial derivatives — ∂z/∂x, ∂z/∂y, ∂x/∂u, ∂x/∂v, ∂y/∂u, and ∂y/∂v
So we just plug these values into our partial derivative rule:
Next, we substitute x(u,v)=3u+2v and y(u,v)=4u−v
Similarly, repeat the above steps for ∂z/∂v
So we get the final partial derivatives of the above problem as below.
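Here is the same computation in sympy. The outer function z was shown as an image above, so assume z = 3x² − 2xy + y² purely for illustration, with the given inner functions x = 3u + 2v and y = 4u − v:

import sympy as sp

u, v = sp.symbols('u v')
x = 3*u + 2*v               # inner functions from the example
y = 4*u - v
z = 3*x**2 - 2*x*y + y**2   # assumed outer function

# Differentiating the composition makes sympy apply the two-variable chain rule.
print(sp.expand(sp.diff(z, u)))  # ∂z/∂u
print(sp.expand(sp.diff(z, v)))  # ∂z/∂v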
Gradient Descent (GD) — the famous algorithm where multivariate calculus is required
Without going into much detail of GD, as we know, like the derivative, the gradient represents the slope of a function.
The gradient points in the direction of the largest rate of increase of the function, and its magnitude is the slope in that direction. And by the GD-algorithm, if we are at a particular value of θ (representing coefficients of the Linear Regression function) and if we want to move to a new value of θ such that the new loss is less than the current loss then we should move in the direction opposite to the gradient.
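To make that update rule concrete, here is a minimal gradient-descent sketch, assuming the toy loss f(θ) = θ1² + θ2² (a stand-in, not a real regression loss), whose gradient is simply 2θ:

import numpy as np

theta = np.array([3.0, -4.0])   # initial parameter values
lr = 0.1                        # learning rate

for _ in range(100):
    grad = 2 * theta            # gradient of f(theta) = theta_1**2 + theta_2**2
    theta = theta - lr * grad   # step opposite to the gradient

print(theta)  # approaches the minimizer [0, 0]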
To understand and implement these gradients we need to return to partial derivatives (of the cost function), which we can reorganize into a row (i.e. horizontal) vector as below:
∇f(x, y) = [∂f/∂x, ∂f/∂y]

The above is what we call the gradient of f(x, y), or ∇f(x, y).
That is we find the gradient of the function f with respect to x by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives.
So what it means is, if we have the gradient for function f(x,y) this is the same as writing the partial derivative of function f with respect to x and the partial derivative with respect to y:
The general mathematical form is as below. For a function f(x) of n variables x1, …, xn, we define the partial derivatives as

∂f/∂xi = lim (h→0) [f(x1, …, xi + h, …, xn) − f(x)] / h, for i = 1, …, n

and collect them in the row vector

∇f = df/dx = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn]

where n is the number of variables.
Let's see an example. For f(x, y) = (x + 2y³)², we obtain the two partial derivatives below using the chain rule:

∂f/∂x = 2(x + 2y³)
∂f/∂y = 2(x + 2y³) · 6y² = 12(x + 2y³)y²
And so we get the gradient ∇f(x, y) of the function by organizing these partials into a horizontal (row) vector:

∇f(x, y) = [2(x + 2y³), 12(x + 2y³)y²]

Gradient of f(x, y)
2(x + 2y³) is the change in f(x, y) with respect to a change in x, while 12(x + 2y³)y² is the change in f(x, y) with respect to a change in y.
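A quick sympy check of these two partials:

import sympy as sp

x, y = sp.symbols('x y')
f = (x + 2*y**3)**2

print(sp.factor(sp.diff(f, x)))  # 2*(x + 2*y**3)
print(sp.factor(sp.diff(f, y)))  # 12*y**2*(x + 2*y**3)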
Difference between a Gradient and a Derivative
By now we have seen, the gradient holds all the partial derivatives of a multivariable function. Before we dive deep into Gradient-Descent algorithm below, let’s explore the difference between Derivative and Gradient.
The derivative is a number: it shows the rate of change as the input of our function moves in a particular direction. We can visualize the derivative as the slope of the function along that direction on a graph. We use the letter d to denote the derivative.
The gradient, on the other hand, is a vector that points in the direction of steepest ascent (the greatest upward slope), and whose length is the directional derivative in that direction.
We use the symbol ∇ to denote the gradient.
So why the gradient is a vector
The regular derivative gives us the rate of change of a single variable. For example, df/dx tells us how much the function f(x) changes for a change in x. But if a function takes multiple variables, such as x and y, it will have multiple derivatives: the value of the function will change when we change x (df/dx) and also when we change y (df/dy).
So, now we can represent these multiple rates of change in a vector, with one component for each derivative. Thus, a function that takes 3 variables will have a gradient with 3 components each represented by a partial derivative. And just like the regular derivative, the gradient points in the direction of the greatest increase. However, now that we have multiple directions to consider (x, y, and z), the direction of greatest increase is no longer simply “forward” or “backward” along the x-axis, like it is with functions of a single variable.
If we have two variables, then our 2-component gradient can specify any direction on a plane. Likewise, with 3 variables, the gradient can specify any direction in 3D space along which to move to increase our function.
Now the definition of the gradient below will make sense. The gradient of a real-valued function of N variables, f : R^N → R, is a row vector, represented by a row matrix:

∇f = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xN]

where N is the number of variables.
An example: for the following function
the partial derivatives (i.e., the derivatives of f with respect to x1 and x2) are
and the gradient is then a matrix with one row (or a row-matrix as it is called)
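The example function here appeared as an image above; assuming f(x1, x2) = x1²x2 + x1x2³ as a stand-in, the row-matrix gradient can be computed as a 1×2 Jacobian:

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + x1 * x2**3   # assumed example function

# The gradient of a scalar function is its 1 x N Jacobian, a row matrix.
grad = sp.Matrix([f]).jacobian([x1, x2])
print(grad)  # Matrix([[2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]])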
Thank you for reading…
