Assorted notes on important concepts

I'm writing these notes while going through the book (Calculus of Several Variables by Adams). I'm including some notes from class as well. These notes are a method of study for me, but hopefully others find them useful as well. I've tried to make them as clear as possible, adding some insights that have helped my intuition while working through the course. There may be a bit of extraneous information in order to give some relevant background for the material too - I'll try to point out material that isn't necessary for the course, but helpful nonetheless.

I'm not covering the basics of partial differentiation (the principle of holding variables besides the variable of differentiation constant, etc) but the notes begin with the chain rule for multiple variables.

If you see any corrections that need to be made, or anything important that I've omitted, feel free to make an account (it's easy) and edit as you see fit. If you'd like to contact me about these notes my McGill email is matthew.wetmore.

1 The chain rule

1.1 The chain rule and function composition (Calc 1 review)

As you almost certainly remember from your earlier adventures in calculus, there is a specific procedure for differentiating compositions of functions. If you don't recognize the term composition, or you forget the notation, here is a quick review:

Say we have two functions, $f$ and $g$, both of a single variable. We can compose these functions by making one the argument of the other - that is, by taking $f(g(x))$. There is a special notation for this (because math loves having notation for everything): we can write $f(g(x))$ as $(f \circ g)(x)$. In this notation, the function to the right of the $\circ$ is always the argument of the function to the left of the $\circ$. When we compose two functions, we are taking the results of the inner function and passing those results to the outer function.

Function composition comes up all the time - the expression $\sin(x^2)$ is the composition of the functions $\sin x$ and $x^2$, for example. If $f(x) = \sin x$ and $g(x) = x^2$, we can write it as $f(g(x)) = (f \circ g)(x)$. So naturally, we need to know how to differentiate compositions of functions, since they occur so often. The chain rule tells us how to do this - if $z = f(u)$ and $u = g(x)$, we have $z(x) = (f \circ g)(x)$. Thus:

$$\frac{dz}{dx} = f'(u) \cdot g'(x) = \frac{dz}{du} \cdot \frac{du}{dx}$$
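
As a quick worked instance of this formula, take the $\sin(x^2)$ example from above, with $z = \sin u$ and $u = x^2$:

$$\frac{dz}{dx} = \frac{dz}{du} \cdot \frac{du}{dx} = \cos(u) \cdot 2x = 2x\cos(x^2)$$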

1.2 Multi-variable chain rule

Now let's extend this to functions of multiple variables. What if the inner function $g$ above is actually a function of more than one variable? Given $z = f(u)$ and $u = g(x,y)$, we have $z(x,y) = (f \circ g)(x,y)$. We want the rate of change of $z$ with respect to each of the independent variables $x$ and $y$, so there are two partial derivatives we want.

When we take a partial derivative of a function with respect to some variable (let's say $x$), we just differentiate using $x$ as our variable of differentiation, counting all other variables as constant. Therefore, in this case we can use the same principle as the chain rule with one variable. So our partial derivatives are:

$$\frac{\partial z}{\partial x} = f'(u) \cdot g_x(x,y) = \frac{dz}{du} \cdot \frac{\partial u}{\partial x} \\ \frac{\partial z}{\partial y} = f'(u) \cdot g_y(x,y) = \frac{dz}{du} \cdot \frac{\partial u}{\partial y}$$

More generally, if $g$ is a function of $n$ independent variables (so $u = g(x_1, \ldots, x_n)$), then:

$$\frac{\partial z}{\partial x_k} = f'(u) \cdot g_k(x_1, \ldots, x_n) = \frac{dz}{du} \cdot \frac{\partial u}{\partial x_k}$$

Where the notation $g_k(x_1, \ldots, x_n)$ means the partial derivative of $g$ with respect to the $k$-th variable, $x_k$.

This is a basic transition from the single-variable to the multi-variable case, so it doesn't take much thought. However, what if our outer function $f$ is of multiple variables, each one a function of more independent variables? For instance, consider $z = f(x, y)$ where $x = x(t)$ and $y = y(t)$ (both are functions of the independent variable $t$). We thus have $z = f(x(t), y(t))$, which is a more complex composition than before.

If we want to find the rate of change of $z$ with respect to $t$, the chain rule will be a bit trickier, but it's still pretty easy and follows from the same logic as before. Instead of just finding $f'(u)$, we need $f_x(x,y)$ and $f_y(x,y)$. So where before we had the single product $f'(u) \cdot g'(x)$, now we have two products, $f_x(x,y) \cdot x'(t)$ and $f_y(x,y) \cdot y'(t)$, and we add them together. An easier notation to remember is:

$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
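
If you'd like to convince yourself numerically, here is a minimal Python sketch that compares this formula against a finite-difference estimate. The particular functions ($z = x^2 y$, $x = \cos t$, $y = \sin t$) are hypothetical choices made only for illustration; they don't come from the book.

```python
import math

def f(x, y):
    # Hypothetical outer function chosen for illustration: z = x^2 * y
    return x**2 * y

def x_of(t):
    return math.cos(t)

def y_of(t):
    return math.sin(t)

def dz_dt_chain(t):
    """dz/dt from the chain rule: f_x * x'(t) + f_y * y'(t)."""
    x, y = x_of(t), y_of(t)
    f_x = 2 * x * y        # partial of x^2*y with respect to x
    f_y = x**2             # partial of x^2*y with respect to y
    dx_dt = -math.sin(t)
    dy_dt = math.cos(t)
    return f_x * dx_dt + f_y * dy_dt

def dz_dt_numeric(t, h=1e-6):
    """Central-difference estimate of dz/dt, for comparison."""
    z = lambda s: f(x_of(s), y_of(s))
    return (z(t + h) - z(t - h)) / (2 * h)

t0 = 0.7
print(dz_dt_chain(t0), dz_dt_numeric(t0))
```

The two printed numbers should agree to several decimal places.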

So we've covered almost all cases of function compositions, but if you're observant you've probably noticed that we're missing a procedure to differentiate a function of multiple variables, each variable itself a function of multiple independent variables. That is, we don't know how to differentiate $z = f(x(s,t), y(s,t))$. Using what we've learned, the missing procedure should be pretty obvious - if we want the derivative of $f$ with respect to $s$, we differentiate using the method above, holding $t$ constant (by the rules of partial differentiation). The process is similar if we want the derivative of $f$ with respect to $t$.

Therefore:

$$\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x} \cdot \frac{\partial x}{\partial t} + \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial t} \\ \frac{\partial z}{\partial s} = \frac{\partial z}{\partial x} \cdot \frac{\partial x}{\partial s} + \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial s}$$

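Note: for the chain rule to apply, it is not quite enough for the partial derivatives it requires to exist; the outer function $f$ should also be differentiable at the point in question, which is guaranteed if its first partial derivatives are continuous there. In our problems, this is pretty much always the case.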
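
Here is a small worked example of the two-parameter case. The functions are chosen purely for illustration: $z = x^2 + y^2$ with $x = st$ and $y = s + t$.

$$\frac{\partial z}{\partial s} = 2x \cdot t + 2y \cdot 1 = 2st^2 + 2(s+t) \\ \frac{\partial z}{\partial t} = 2x \cdot s + 2y \cdot 1 = 2s^2t + 2(s+t)$$

You can check this by substituting first, which gives $z = s^2t^2 + (s+t)^2$, and differentiating directly.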

1.3 Function-dependency tree

You may have noticed a pattern when applying the chain rule to functions of multiple variables - in each case, the resulting expression for the derivative is a sum of products of derivatives. Also, even though you can't cancel the numerators and denominators like you can with normal fractions - that is:

$$\text{Even though } \frac{a}{b} \cdot \frac{b}{c} = \frac{a}{c}, \quad \frac{\partial z}{\partial x} \cdot \frac{\partial x}{\partial t} \neq \frac{\partial z}{\partial t}$$

you might have noticed that if you could do that, each term in the application of the chain rule to obtain something like $\partial z / \partial t$ would be $\partial z / \partial t$.

These observations can help you verify that your application of the chain rule is correct. They are also encoded, in a way, by the following visualization of a function and the variables it depends on.

The tree for the function we worked with before, $f(x(s,t), y(s,t))$, has $f$ at the top, $x$ and $y$ directly below it, and $s$ and $t$ below each of $x$ and $y$. Some quick terminology for trees: an edge is a link from one variable to another. In this diagram, an edge coming out of the bottom of a variable and pointing to some other variable means the first variable depends on the second. So it's easy to see that $x$ and $y$ are dependent on the independent variables $s$ and $t$.

In order to find the derivative of the main function $f$ with respect to one of the independent variables, we just need to follow every possible path from $f$ to the independent variable in question. Remember how applying the chain rule to a function like this gives us the sum of some terms? Each term corresponds to one path to the differentiation variable, and is the product of the partial derivatives along the edges of that path.

This method of visualizing a function should make understanding the chain rule easier and more intuitive.
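
The path-following idea can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names `paths` and `chain_rule_terms` are my own inventions, and the output writes every derivative as `da/db` even where a partial derivative is meant.

```python
# Dependency tree for z = f(x(s,t), y(s,t)): each variable maps to the
# variables it depends on directly. Independent variables map to nothing.
tree = {
    "z": ["x", "y"],
    "x": ["s", "t"],
    "y": ["s", "t"],
    "s": [],
    "t": [],
}

def paths(tree, start, target):
    """Return every path (list of variable names) from start down to target."""
    if start == target:
        return [[start]]
    found = []
    for child in tree[start]:
        for tail in paths(tree, child, target):
            found.append([start] + tail)
    return found

def chain_rule_terms(tree, start, target):
    """Turn each path into a product of derivatives, one factor per edge."""
    terms = []
    for path in paths(tree, start, target):
        factors = [f"d{a}/d{b}" for a, b in zip(path, path[1:])]
        terms.append(" * ".join(factors))
    return " + ".join(terms)

print(chain_rule_terms(tree, "z", "t"))
# prints: dz/dx * dx/dt + dz/dy * dy/dt
```

Each path contributes one term, and the factors in a term are the edges along that path, which matches the sum-of-products pattern described above.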

2 Functions from n-space to m-space

2.1 Motivation

When we have a function $f(x_1, \ldots, x_n) = w$, we are taking a point in $\mathbb{R}^n$ and assigning it a unique value $w$ in $\mathbb{R}$. This is a simple transformation, and we know that we can find various properties about the function (such as rate of change) by taking partial derivatives $f_1, \ldots, f_n$ (where $f_k$ refers to the partial derivative with respect to the $k$-th independent variable).

Let's extend this a bit further. Let's say we want to make a transformation that acts on a point in $\mathbb{R}^n$, giving us a point in $\mathbb{R}^m$. If we needed one function to map $\mathbb{R}^n \rightarrow \mathbb{R}$, it makes sense that we need $m$ functions to map a point in $\mathbb{R}^n \rightarrow \mathbb{R}^m$. So we define a vector $\mathbf{f}$ of $m$ functions of $n$ variables. We can then write our transformation in a nice vector form:

$$\mathbf{y} = \mathbf{f}(\mathbf{x})$$

Where $\mathbf{y}$ is the point in $\mathbb{R}^m$ obtained from transforming $\mathbf{x}$.

At this point you're probably asking "yeah, but how do we find the rate of change for this transformation?" Through partial derivatives of course! When we had a transformation from $n$-space to $\mathbb{R}$, there were $n$ possible partials, because we had a single function of $n$ variables. Now we have $m$ functions of $n$ variables, so there are $n \times m$ partials, of the form:

$$\frac{\partial y_i}{\partial x_j} \text{ for } 1 \leq i \leq m, \; 1 \leq j \leq n$$

This is where the Jacobian matrix comes in.

2.2 Jacobian matrix

The fact that we have partials of the form given above means that we can put them into a matrix that looks like:

$$\left( \begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array} \right)$$

We'll call this $D\mathbf{f}(\mathbf{x})$, or the Jacobian matrix of the function $\mathbf{f}$. As you might remember from linear algebra, every linear transformation can be represented with a matrix - in this case, the linear transformation represented by the Jacobian of a function $\mathbf{f}$ is called the derivative of $\mathbf{f}$ (see footnote 1). A nice fact about Jacobians (that I won't demonstrate here, but you can confirm it if you're that kind of person) is that the Jacobian matrix for the composition of two functions is the matrix product of the Jacobians of the individual functions, with the outer function's Jacobian evaluated at the output of the inner function: $D(\mathbf{f} \circ \mathbf{g})(\mathbf{x}) = D\mathbf{f}(\mathbf{g}(\mathbf{x})) \, D\mathbf{g}(\mathbf{x})$.

For a simple single-variable function $y = f(x)$, we can define a differential $dy$ with $dy = f'(x)dx$, which approximates the change in $y$ corresponding to some change $dx$ in $x$. We can extend this to our multivariable function. Our function $\mathbf{f}$ operates on the vector $\mathbf{x}$ consisting of $n$ independent variables $x_i$, so if we have a vector $d\mathbf{x}$, where $dx_i$ corresponds to the change in $x_i$, then we can obtain $d\mathbf{y}$ with $d\mathbf{y} = D\mathbf{f}(\mathbf{x})d\mathbf{x}$. This is a direct analogue to the single-variable case.
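
To make the relation $d\mathbf{y} = D\mathbf{f}(\mathbf{x})d\mathbf{x}$ concrete, here is a rough Python sketch that approximates a Jacobian with finite differences and uses it to estimate a small change in the output. The map `f` from $\mathbb{R}^2$ to $\mathbb{R}^3$ is a hypothetical example, and `jacobian` and `mat_vec` are helper names I made up for this illustration.

```python
def f(x):
    # Hypothetical map from R^2 to R^3, chosen only for illustration.
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1**2]

def jacobian(func, x, h=1e-6):
    """Approximate the m x n Jacobian of func at x by central differences."""
    m = len(func(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        f_plus, f_minus = func(forward), func(backward)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

def mat_vec(J, v):
    """Multiply matrix J by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]

x0 = [1.0, 2.0]
dx = [0.01, -0.02]
J = jacobian(f, x0)
dy = mat_vec(J, dx)                       # linear estimate of the change
actual = [a - b for a, b in zip(f([x0[0] + dx[0], x0[1] + dx[1]]), f(x0))]
print(J)
print(dy, actual)
```

The rows of `J` correspond to the $m = 3$ output functions and the columns to the $n = 2$ inputs, and `dy` should be close to (but not exactly) the actual change, since the differential is only a linear approximation.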

When we have a function $f(x,y)$, the partial derivatives $f_x$ and $f_y$ represent the rate of change of the function in the direction of the $x$ and $y$ axes, respectively. Notice that these directions are in the domain. However, it's possible to find the rate of change in more than just the basic directions of the axes. Finding such rates of change is one motivation for this section.

To accomplish this, we first need to introduce a few new concepts. First, let's look at the del operator. Del is a vector operator:

$$\nabla = \frac{\partial}{\partial x}\mathbf{i} + \frac{\partial}{\partial y}\mathbf{j} = \left\langle\frac{\partial}{\partial x} , \frac{\partial}{\partial y} \right\rangle$$

As an operator, we can apply del to a scalar function, which, by distributivity, will give us a vector function. Observe:

$$\nabla f(x,y) = \left(\frac{\partial}{\partial x}\mathbf{i} + \frac{\partial}{\partial y}\mathbf{j}\right)f = \frac{\partial f}{\partial x}\mathbf{i} + \frac{\partial f}{\partial y}\mathbf{j}$$

Applying del to a scalar function gives us the gradient of that function, $\nabla f$.

So the gradient of some scalar function is a vector function - what does the vector it produces at some point represent? Well, if the function $f$ is differentiable at $(a,b)$ and $\nabla f(a,b) \neq \mathbf{0}$, the vector $\nabla f(a,b)$ points in the direction in which $f$ increases most rapidly. It is also perpendicular to the level curve of $f$ that passes through $(a,b)$.

3.1 Directional derivative

Now that we're equipped with the tools to do so, let's talk about finding the derivative in some direction in the domain - the aptly-named directional derivative. We started with gradient in two dimensions, which applies to a function of two variables, $z=f(x,y)$. This function can be pictured as producing a three-dimensional surface, with a domain in the $xy$ plane. Consider some unit vector in this domain - it points in some direction. We can find the rate of change of the surface created by the function in this direction - just take the dot product of the directional vector and the gradient. So, if $\mathbf{\hat u}$ is a unit vector in the domain, the directional derivative at $(a,b)$ is given by $D_{\mathbf{\hat u}}f(a,b) = \mathbf{\hat u} \cdot \nabla f(a,b)$.
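
As a small worked example (the function, point, and direction are chosen here purely for illustration), take $f(x,y) = x^2 + y^2$ at $(1,2)$ in the direction of the unit vector $\mathbf{\hat u} = \langle 3/5, 4/5 \rangle$:

$$\nabla f(1,2) = \langle 2, 4 \rangle, \qquad D_{\mathbf{\hat u}}f(1,2) = \left\langle \tfrac{3}{5}, \tfrac{4}{5} \right\rangle \cdot \langle 2, 4 \rangle = \tfrac{6}{5} + \tfrac{16}{5} = \tfrac{22}{5}$$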

3.2 Geometric properties of gradient and directional derivative

At some point $(a,b)$:

• The direction of $\nabla f(a,b)$ is the direction that $f$ increases most rapidly, at a rate of $|\nabla f(a,b)|$

• The direction of $-\nabla f(a,b)$ is the direction that $f$ decreases most rapidly, at a rate of $|\nabla f(a,b)|$

• The directions perpendicular to the direction of $\nabla f(a,b)$ have a rate of change of 0. This follows from the fact that gradient is perpendicular to the level curve.
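
These three properties are easy to check numerically. The sketch below assumes a hypothetical function $f(x,y) = x^2 + 3y^2$ and the point $(1,1)$; it samples many unit vectors and compares the resulting directional derivatives with $|\nabla f|$.

```python
import math

def grad_f(x, y):
    # Gradient of the hypothetical example f(x, y) = x^2 + 3*y^2.
    return (2 * x, 6 * y)

def directional_derivative(grad, angle):
    """Dot product of the gradient with the unit vector at the given angle."""
    u = (math.cos(angle), math.sin(angle))
    return grad[0] * u[0] + grad[1] * u[1]

g = grad_f(1.0, 1.0)                       # (2, 6)
grad_angle = math.atan2(g[1], g[0])        # direction of the gradient
grad_norm = math.hypot(g[0], g[1])

# Sample many directions and find the largest rate of change.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best = max(directional_derivative(g, a) for a in angles)

print(best, grad_norm)                                      # nearly equal
print(directional_derivative(g, grad_angle))                # along the gradient
print(directional_derivative(g, grad_angle + math.pi))      # along -gradient
print(directional_derivative(g, grad_angle + math.pi / 2))  # perpendicular
```

The largest sampled rate should come out just under $|\nabla f|$, the rate along $-\nabla f$ should be its negative, and the perpendicular rate should be zero up to rounding.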

The definition of gradient extends easily to more than 2 dimensions:

$$\nabla = \frac{\partial}{\partial x}\mathbf{i} + \frac{\partial}{\partial y}\mathbf{j} + \frac{\partial}{\partial z}\mathbf{k} = \left\langle\frac{\partial}{\partial x} , \frac{\partial}{\partial y} , \frac{\partial}{\partial z}\right\rangle$$

The calculation of the directional derivative is also the same; however, the unit vector that denotes the direction now has 3 components instead of 2. The main change that you should be cognizant of is the geometric interpretation in more than 2 dimensions.

Just as a surface defined by $z = f(x,y)$ has level curves where $f(x,y) = k$ for some constant $k$, a function of the form $f(x,y,z)$ has level surfaces defined similarly. And just as the gradient of a function with a 2D domain points perpendicular to the level curve of the function, the gradient of a function like $f(x,y,z)$ is perpendicular to the level surface of $f$ at whatever point we are considering. This allows us to find tangent planes easily - say we have a surface defined by an equation like $x^2 + y^2 + z^2 = k$ (this one is a sphere). This surface is a level surface of $F(x,y,z) = x^2 + y^2 + z^2$, so the gradient $\nabla F$ at a point $p_0 = (a,b,c)$ on the surface will be a vector normal to the surface at $p_0$. This vector and $p_0$ are enough to define the tangent plane to the surface at $p_0$.
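
Carrying this out for the sphere, writing $F(x,y,z) = x^2 + y^2 + z^2$ (the name $F$ is just a label for the left-hand side), the normal vector and tangent plane at $p_0 = (a,b,c)$ are:

$$\nabla F(a,b,c) = \langle 2a, 2b, 2c \rangle \\ 2a(x - a) + 2b(y - b) + 2c(z - c) = 0 \quad \Longleftrightarrow \quad ax + by + cz = k$$

The simplification on the right uses the fact that $p_0$ lies on the sphere, so $a^2 + b^2 + c^2 = k$.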

The gradient still represents the direction of greatest rate of change, as before.

5 Implicit functions

If the term "implicit function" is confusing to you (as it was to me), I'd recommend reading this, which is short and makes it an understandable concept.

Anyway, given some equation, what we are looking to do in this chapter is find out whether or not we can implicitly define one of the variables as a function of the others. For instance, if we have the classic equation for a unit sphere, $x^2 + y^2 + z^2 = 1$, can we define $z$ as a function of $x$ and $y$? If so, where can we do that?

Let's look at a more general case. Given the equation $F(x,y,z) = 0$, and a point $p_0$ that satisfies the equation, can we define a variable, say $z$, as a function of the other variables? If we can, and assuming first-order partials exist (which is pretty much always the case it seems), then at $p_0$, we should be able to solve for $z_x$ and $z_y$. The notation used for this is:

$$\frac{\partial z}{\partial x}\bigg|_{p_0}$$

In order to solve for such derivatives, we need to use implicit differentiation, a term you probably remember from Calc 1 or whatever you took to get into this course, but may not remember what it actually means. It's pretty simple - you assume that some variables are functions of other variables, and differentiate both sides of the equation accordingly. So in this case, we're assuming $z$ is a function $z(x,y)$, so we need to use the chain rule when we differentiate it.

So let's say we're looking for $z_x$, given our equation $F(x,y,z) = 0$. To do so, we first differentiate both sides with respect to $x$, giving us $F_x(x,y,z) + F_z(x,y,z)\cdot (\partial z / \partial x) = 0$. Solving for what we want, we obtain:

$$\frac{\partial z}{\partial x}\bigg|_{p_0} = -\frac{F_x(p_0)}{F_z(p_0)}$$

We can find $(\partial z / \partial y)$ similarly. For these partials to exist, the denominator $F_z(p_0)$ must of course be nonzero. What does it mean if it is zero? Since $F_z$ is the $z$-component of $\nabla F$, which is normal to the level surface at $p_0$ (remember when we talked about level surfaces here?), $F_z(p_0) = 0$ means the normal vector is horizontal - implying the level surface itself is vertical at $p_0$. This, of course, makes sense - for $z(x,y)$ to be a proper function it must pass the vertical line test, and if the level surface is vertical at some point, we generally can't pick out a single value of $z$ for each $x$ and $y$ near that point.

Similarly, in order to define $x = x(y,z)$ near some point $p_0$ in this way, we need $F_x(p_0) \neq 0$, and for $y = y(x,z)$ we need $F_y(p_0) \neq 0$.

One caveat when implicitly differentiating - remember to apply the chain rule! For instance, if we want to find $(\partial z / \partial x)$ for $x^2 + y^2 + z^2 = 1$, we want to differentiate the equation with respect to $x$ to start. It is very important that we remember that in this case, $z = z(x,y)$. So we get:

$$2x + 2z\cdot\frac{\partial z}{\partial x} = 0$$

Note the chain rule application to $z^2$. If we didn't do this, not only would we be wrong - we'd have no $(\partial z / \partial x)$ to work with in the first place.
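
Solving the equation above gives $\partial z / \partial x = -x/z$. Since the sphere can also be solved explicitly for $z$ on the upper hemisphere, we can sanity-check the implicit result numerically. This is a minimal sketch; the sample point $(0.3, 0.4)$ is an arbitrary choice.

```python
import math

def z_upper(x, y):
    """Upper hemisphere of x^2 + y^2 + z^2 = 1, solved explicitly for z."""
    return math.sqrt(1 - x**2 - y**2)

def dz_dx_implicit(x, y, z):
    """From 2x + 2z * dz/dx = 0, i.e. dz/dx = -F_x / F_z = -x / z."""
    return -x / z

def dz_dx_numeric(x, y, h=1e-6):
    """Central-difference estimate using the explicit formula for z."""
    return (z_upper(x + h, y) - z_upper(x - h, y)) / (2 * h)

x0, y0 = 0.3, 0.4
z0 = z_upper(x0, y0)
print(dz_dx_implicit(x0, y0, z0), dz_dx_numeric(x0, y0))
```

Note that the implicit formula breaks down as $z \to 0$ (the equator), which is exactly where $F_z = 2z$ vanishes.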

5.1 Notes on notation

When we are given a single equation like $F(x,y,z) = 0$ and we want to find $(\partial x / \partial z)$ we know that we can assume $x$ is a function of the remaining variables, $y$ and $z$. So when we differentiate, we know that we are differentiating with respect to $z$ and holding the final variable $y$ constant.

But what do we do when given two equations? If we are given, say:

$$F(x,y,z,w) = 0 \\ G(x,y,z,w) = 0$$

and we want to find $(\partial x / \partial z)$, then we know, first of all, that two of the variables are dependent variables, with the remaining two variables independent. This follows from the fact that we have two equations, so we can solve for two variables, much as we would for a linear system (see footnote 2). We know that $x$ is one of the dependent variables, because we're looking to differentiate it with respect to $z$. But which of the remaining variables is independent, and which is dependent? This is important to know, since we hold the other independent variable constant when differentiating, and we need to differentiate the other dependent variable with respect to $z$ as well.

In order to avoid this problem, whenever such an ambiguity would arise, we use the following notation to state which variable is independent:

$$\left(\frac{\partial x}{\partial z}\right)_w$$

In this case, $w$ is the other independent variable (the one held constant), and the remaining variable $y$ is a function $y(z,w)$.

5.2 Jacobian determinant

Alright now it's time for some magic. Remember the Jacobian matrix? Well it turns out that if you take the determinant of this matrix you obtain an interesting relation to the partial derivatives of implicit functions we're covering right now. Of course, the matrix must be square to yield a determinant. What do we remember about the Jacobian matrix? Well, if we have $n$ functions of $m$ variables, we can create an $n \times m$ matrix that holds the derivative of each function with respect to each variable.

Thus if we have $n$ functions of $n$ variables, we can define a square matrix with those properties. Then we can take its determinant. We'll look at 2 functions of 2 variables right now so I don't need to write out a ton of stuff, but the idea generalizes in the obvious way. Anyway, we use the following notation for this Jacobian determinant:

$$\frac{\partial (F, G)}{\partial (x, y)} = \left| \begin{array}{cc} \frac{\partial F}{\partial x} & \frac{\partial F}{\partial y} \\ \frac{\partial G}{\partial x} & \frac{\partial G}{\partial y} \end{array} \right|$$

Of course, you can use the transpose of this matrix if you want, since that doesn't affect the determinant.

Okay so now let's see how this relates to finding some partial derivative when working with implicit functions. In the book they simply state the following relation:

$$\left(\frac{\partial x}{\partial z}\right)_w = -\frac{\frac{\partial (F, G)}{\partial (z, y)}}{\frac{\partial (F, G)}{\partial (x, y)}}$$

Notice that the denominator is just the Jacobian determinant from above - it's the Jacobian for the two functions $F$ and $G$ with respect to the two dependent variables $x$ and $y$. The numerator is roughly the same - the only difference is that we replaced the dependent variable we are differentiating with the independent variable we're differentiating with respect to. That's pretty much all you need to know to make use of this useful relation.

If you're wondering why we can do this, there is a pretty good explanation here. However, the final step in this explanation, where they go from the 4 equations directly to Cramer's rule, might be confusing (I don't think it's exceptionally clear myself). So I'll try to explain the step they're missing. In the example, they are trying to find $(\partial x / \partial u)$ (they have independent variables $u$ and $v$ instead of $z$ and $w$, but the principle is the same). So out of the 4 equations they give, we are concerned with the two that contain terms for $(\partial x / \partial u)$:

$$\frac{\partial F}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial F}{\partial y}\frac{\partial y}{\partial u} = -\frac{\partial F}{\partial u} \\ \frac{\partial G}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial G}{\partial y}\frac{\partial y}{\partial u} = -\frac{\partial G}{\partial u}$$

Now notice that we can write this in matrix form $A\mathbf{x} = \mathbf{b}$:

$$\left( \begin{array}{cc} \frac{\partial F}{\partial x} & \frac{\partial F}{\partial y} \\ \frac{\partial G}{\partial x} & \frac{\partial G}{\partial y} \end{array} \right) \left( \begin{array}{c} \frac{\partial x}{\partial u} \\ \frac{\partial y}{\partial u} \end{array} \right) = -\left( \begin{array}{c} \frac{\partial F}{\partial u} \\ \frac{\partial G}{\partial u} \end{array} \right)$$

Now the application of Cramer's rule should be clearer. We're looking for $(\partial x / \partial u)$, which is the first element in $\mathbf{x}$. So according to Cramer's rule, we evaluate a fraction where the denominator is the determinant of $A$ (notice $A$ is the Jacobian matrix) and the numerator is the same determinant, except the first column is replaced by $\mathbf{b}$. Pulling the minus sign in $\mathbf{b}$ out of the determinant, this gives us:

$$-\frac{\frac{\partial (F, G)}{\partial (u, y)}}{\frac{\partial (F, G)}{\partial (x, y)}}$$

Just like we learned.
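
Here is a quick numerical check of the formula, as a sketch under stated assumptions. The system $F = xy - u = 0$, $G = x/y - v = 0$ is a hypothetical example picked because it can also be solved by hand (for positive values, $x = \sqrt{uv}$ and $y = \sqrt{u/v}$), and `det2` and `dx_du_jacobian` are names invented for this illustration.

```python
import math

# Hypothetical system chosen so it can also be solved by hand:
#   F(x, y, u, v) = x*y - u = 0
#   G(x, y, u, v) = x/y - v = 0
# For positive values this gives x = sqrt(u*v), y = sqrt(u/v).

def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

def dx_du_jacobian(x, y):
    """(dx/du)_v = - d(F,G)/d(u,y) divided by d(F,G)/d(x,y)."""
    F_x, F_y, F_u = y, x, -1.0
    G_x, G_y, G_u = 1.0 / y, -x / y**2, 0.0
    numerator = det2(F_u, F_y, G_u, G_y)     # x column replaced by u column
    denominator = det2(F_x, F_y, G_x, G_y)   # Jacobian w.r.t. dependent x, y
    return -numerator / denominator

def x_explicit(u, v):
    return math.sqrt(u * v)

u0, v0 = 2.0, 3.0
x0, y0 = math.sqrt(u0 * v0), math.sqrt(u0 / v0)
h = 1e-6
numeric = (x_explicit(u0 + h, v0) - x_explicit(u0 - h, v0)) / (2 * h)
print(dx_du_jacobian(x0, y0), numeric)
```

Both numbers should match $\tfrac{1}{2}\sqrt{v/u}$, which is what you get by differentiating $x = \sqrt{uv}$ directly.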

6 Extreme value problems

These are pretty straightforward so I won't spend much time on them. They're also review from Calc 3. You probably remember finding extreme values of single-variable functions from Calc 1 - you check places where $f'(x) = 0$ (critical points), $f'(x)$ doesn't exist (singular points), and finally you check the edges of the function's domain.

The multivariable situation is similar, but instead of the derivative, we of course apply the gradient. The necessary conditions for extreme values in the two variable case are analogous to those for a function of a single variable: a critical point exists where $\nabla f(x,y) = \mathbf{0}$, a singular point exists where $\nabla f(x,y)$ does not exist, and there might also be extreme values at boundary points on the edge of the function's domain.

However, at a critical point there are 3 possibilities for the "shape" of the function. The point could be a local maximum, a local minimum, or a saddle point. How can we tell? In this situation we apply the second derivative test. We'll look at this for a function $f$ of 2 variables. Suppose we have a critical point $(a, b)$ and we want to tell if it is a local maximum, minimum, or a saddle point. Assuming that the second partial derivatives of $f$ are all continuous, we define $A = f_{11}(a,b), \quad B = f_{12}(a,b) = f_{21}(a,b), \quad C = f_{22}(a,b)$ so that:

• if $B^2 - AC < 0$, $A > 0$ then $(a,b)$ is a local minimum
• if $B^2 - AC < 0$, $A < 0$ then $(a,b)$ is a local maximum
• if $B^2 - AC > 0$, then $(a,b)$ is a saddle point
• if $B^2 - AC = 0$, this test is inconclusive
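
The test is mechanical enough to write down as a tiny function. This is just a sketch (the function name is my own), using the sign conventions $B^2 - AC < 0$ with $A > 0$ for a minimum, $B^2 - AC < 0$ with $A < 0$ for a maximum, and $B^2 - AC > 0$ for a saddle. The sample functions in the comments are standard textbook shapes.

```python
def classify_critical_point(A, B, C):
    """Second derivative test with A = f_11, B = f_12, C = f_22 at the point."""
    disc = B**2 - A * C
    if disc < 0 and A > 0:
        return "local minimum"
    if disc < 0 and A < 0:
        return "local maximum"
    if disc > 0:
        return "saddle point"
    return "inconclusive"

# f(x, y) = x^2 + y^2 at (0, 0): A = 2, B = 0, C = 2
print(classify_critical_point(2, 0, 2))    # local minimum
# f(x, y) = -x^2 - y^2 at (0, 0): A = -2, B = 0, C = -2
print(classify_critical_point(-2, 0, -2))  # local maximum
# f(x, y) = x^2 - y^2 at (0, 0): A = 2, B = 0, C = -2
print(classify_critical_point(2, 0, -2))   # saddle point
# f(x, y) = x^4 + y^4 at (0, 0): A = B = C = 0
print(classify_critical_point(0, 0, 0))    # inconclusive
```

The last case shows why "inconclusive" really means inconclusive: $x^4 + y^4$ does have a minimum at the origin, but the test can't see it.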

7 Lagrange multipliers

The method of applying Lagrange multipliers is also review from Calc 3, and also pretty easy. We apply this method whenever we want to find the extreme values of some function that is subject to one or more constraints. Constraints can be either equations or inequalities.

I won't go into the theory behind Lagrange multipliers, but the basics follow. First of all, it's important to understand that the ability to apply Lagrange multipliers to a problem does not necessarily mean that a solution to the problem exists. Anyway, let's look at the method of Lagrange multipliers for two variables first:

Let's say we have some function $f(x,y)$ we want to maximize, subject to the constraint $g(x,y) = C$. In order to solve this, we can solve the following system of equations:

$$\nabla f(x,y) = \lambda \nabla g(x,y) \\ g(x,y) = C$$

If we expand the gradients we end up with a system of 3 equations in 3 unknowns ($x, y, \lambda$), which we can usually solve. The point(s) $(x,y)$ that form the solution are candidates for extreme values of $f$ on the constraint, not necessarily the maximum. You need to figure out which point in the solution set answers whatever problem you're setting out to solve.

Oh and if you were wondering what a "Lagrange multiplier" is, I'll let you in on a little secret - it's the $\lambda$.
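
For a concrete (and deliberately simple) example of the one-constraint system, chosen just for illustration, suppose we want to maximize $f(x,y) = xy$ subject to $x + y = 2$. Here $g(x,y) = x + y$, so $\nabla f = \langle y, x \rangle$ and $\nabla g = \langle 1, 1 \rangle$, and the system becomes:

$$y = \lambda \\ x = \lambda \\ x + y = 2$$

The first two equations give $x = y = \lambda$, and the constraint then forces $\lambda = 1$. So the only candidate is $(1,1)$, where $f = 1$. Substituting $y = 2 - x$ gives $f = x(2 - x)$, which confirms that this is indeed the maximum on the line.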

If we have more than one constraint, we can still solve this similarly. Consider the problem of maximizing or minimizing the function $f(x,y)$ subject to the constraints $g(x,y) = C$ and $h(x,y) = K$. Our system of equations to solve this is:

$$\nabla f(x,y) = \lambda \nabla g(x,y) + \mu \nabla h(x,y) \\ g(x,y) = C \\ h(x,y) = K$$

This can be harder though.

The systems of equations above apply to functions of more than two variables in the most obvious way possible. That is, if we have a function $f(x,y,z)$ we want to maximize/minimize subject to the constraint $g(x,y,z) = C$ we have the following system of equations:

$$\nabla f(x,y,z) = \lambda \nabla g(x,y,z) \\ g(x,y,z) = C$$

This method of solving constrained maximum/minimum problems can miss solutions at points where the gradient of the constraint function vanishes or where one of the functions is not smooth. For instance, try minimizing $f(x,y) = y$ subject to $g(x,y) = y^3-x^2=0$. Go ahead, I dare you. You'll find that the system of equations has no solution, even though the minimum clearly occurs at $(0,0)$. This is because the constraint curve is not smooth at the solution point $(0,0)$ (it has a cusp there): at that point $\nabla g = \mathbf{0}$, so no value of $\lambda$ can make $\lambda \nabla g$ equal to $\nabla f$.
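
To see the failure concretely, here are the equations the method produces for this example, using $\nabla f = \langle 0, 1 \rangle$ and $\nabla g = \langle -2x, 3y^2 \rangle$:

$$0 = -2\lambda x \\ 1 = 3\lambda y^2 \\ y^3 - x^2 = 0$$

The second equation forces $\lambda \neq 0$ and $y \neq 0$, so the first gives $x = 0$, and then the constraint gives $y = 0$, a contradiction. On the other hand, $y^3 = x^2 \geq 0$ means $y \geq 0$ everywhere on the constraint, so the minimum value $y = 0$ really is attained at $(0,0)$.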

If you want some problems to practice on, as usual our friend Paul has us covered.

11 Vector calculus

1. This makes sense, as a linear transformation is a function such that $f(x+y)=f(x)+f(y)$ and $f(\alpha x) = \alpha f(x)$. The operation of differentiation obeys these conditions, so it's a linear function (well, more specifically a linear operator, but it can still be represented with a matrix)
2. Think back to linear algebra (as much as that might pain you). When we have a linear system of $n$ equations, we have enough information to find $n$ variables. If we have more than $n$ variables in the system, we have infinitely many solutions, so we write the $n$ variables in terms of the remaining free variables. Here is a good explanation of this.