Derivatives in Theano
For instance, we can compute the gradient of y = x ** 2 with respect to x. Here is the code to compute this gradient:
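(A minimal sketch of that computation, assuming the example expression is y = x ** 2; the exact string printed by pp may differ slightly between Theano versions.)
- >>> import numpy
- >>> import theano
- >>> import theano.tensor as T
- >>> from theano import pp
- >>> x = T.dscalar('x')
- >>> y = x ** 2
- >>> gy = T.grad(y, x)
- >>> pp(gy)  # print out the gradient prior to optimization
- '((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
- >>> f = theano.function([x], gy)
- >>> f(4)
- array(8.0)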
In this example, we can see from pp(gy) that we are computing the correct symbolic gradient. fill((x ** 2), 1.0) means to make a matrix of the same shape as x ** 2 and fill it with 1.0.
Note
The optimizer simplifies the symbolic gradient expression. You can see this by digging inside the internal properties of the compiled function.
- pp(f.maker.fgraph.outputs[0])
- '(2.0 * x)'
After optimization there is only one Apply node left in the graph, which doubles the input.
We can also compute the gradient of complex expressions such as the logistic function defined above. It turns out that the derivative of the logistic is: ds(x)/dx = s(x) * (1 - s(x)).
A plot of the gradient of the logistic function, with x on the x-axis and ds(x)/dx on the y-axis.
- >>> x = T.dmatrix('x')
- >>> s = T.sum(1 / (1 + T.exp(-x)))
- >>> gs = T.grad(s, x)
- >>> dlogistic = theano.function([x], gs)
- >>> dlogistic([[0, 1], [-1, -2]])
- array([[ 0.25 , 0.19661193],
- [ 0.19661193, 0.10499359]])
In general, for any scalar expression s, T.grad(s, w) provides the Theano expression for computing the gradient ds/dw. In this way Theano can be used for doing efficient symbolic differentiation (as the expression returned by T.grad will be optimized during compilation), even for functions with many inputs. (See automatic differentiation for a description of symbolic differentiation.)
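As an illustrative sketch (these variables and values are our own, not part of the original example), grad also accepts a list of inputs and returns one gradient expression per input:
- >>> a = T.dscalar('a')
- >>> b = T.dscalar('b')
- >>> cost = a * b + a ** 2
- >>> ga, gb = T.grad(cost, [a, b])  # one gradient expression per input
- >>> g = theano.function([a, b], [ga, gb])
- >>> g(2, 3)  # d(cost)/da = b + 2*a = 7, d(cost)/db = a = 2
- [array(7.0), array(2.0)]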
Note
Additional information on the inner workings of differentiation may also be found in the more advanced tutorial.
Computing the Jacobian
In Theano’s parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs. (This is a generalization of the so-called Jacobian matrix in Mathematics.) Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following text explains how to do it manually.
In order to manually compute the Jacobian of some function y with respect to some parameter x we need to use scan. What we do is to loop over the entries in y and compute the gradient of y[i] with respect to x.
Note
scan is a generic op in Theano that allows writing all kinds of recurrent equations in a symbolic manner. While creating symbolic loops (and optimizing them for performance) is a hard task, effort is being made to improve the performance of scan. We shall return to scan later in this tutorial.
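A sketch of the manual computation, assuming the example expression is y = x ** 2:
- >>> x = T.dvector('x')
- >>> y = x ** 2
- >>> J, updates = theano.scan(lambda i, y, x: T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y, x])
- >>> f = theano.function([x], J, updates=updates)
- >>> f([4, 4])
- array([[ 8.,  0.],
- [ 0.,  8.]])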
What we do in this code is to generate a sequence of ints from 0 to y.shape[0] using T.arange. Then we loop through this sequence, and at each step, we compute the gradient of element y[i] with respect to x. scan automatically concatenates all these rows, generating a matrix which corresponds to the Jacobian.
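For comparison, here is a minimal sketch using the built-in macro (this example is ours, assuming theano.gradient.jacobian is available in your Theano version); it should produce the same matrix:
- >>> from theano.gradient import jacobian
- >>> x = T.dvector('x')
- >>> y = x ** 2
- >>> J = jacobian(y, x)  # the macro builds the scan loop for us
- >>> f = theano.function([x], J)
- >>> f([4, 4])
- array([[ 8.,  0.],
- [ 0.,  8.]])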
Note
There are some pitfalls to be aware of regarding T.grad. One of them is that you cannot re-write the above expression of the Jacobian as theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x), even though from the documentation of scan this seems possible. The reason is that y_i will not be a function of x anymore, while y[i] still is.
Computing the Hessian
In Theano, the term Hessian has the usual mathematical meaning: it is the matrix comprising the second-order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The following text explains how to do it manually.
- >>> x = T.dvector('x')
- >>> y = x ** 2
- >>> cost = y.sum()
- >>> gy = T.grad(cost, x)
- >>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
- >>> f = theano.function([x], H, updates=updates)
- >>> f([4, 4])
- array([[ 2., 0.],
- [ 0., 2.]])
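For comparison, a minimal sketch using the built-in macro (this example is ours, assuming theano.gradient.hessian is available in your Theano version):
- >>> from theano.gradient import hessian
- >>> x = T.dvector('x')
- >>> cost = T.sum(x ** 2)
- >>> H = hessian(cost, x)  # the macro builds the scan loop for us
- >>> f = theano.function([x], H)
- >>> f([4, 4])
- array([[ 2.,  0.],
- [ 0.,  2.]])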
Jacobian times a Vector
Sometimes we can express the algorithm in terms of Jacobians times vectors,or vectors times Jacobians. Compared to evaluating the Jacobian and thendoing the product, there are methods that compute the desired results whileavoiding actual evaluation of the Jacobian. This can bring about significantperformance gains. A description of one such algorithm can be found here:
- Barak A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 1994
While in principle we would want Theano to identify these patterns automatically for us, in practice, implementing such optimizations in a generic manner is extremely difficult. Therefore, we provide special functions dedicated to these tasks.
R-operator
The R operator is built to evaluate the product between a Jacobian and a vector, namely (df(x)/dx) * v. The formulation can be extended even for x being a matrix, or a tensor in general, in which case the Jacobian also becomes a tensor and the product becomes some kind of tensor product. Because in practice we end up needing to compute such expressions in terms of weight matrices, Theano supports this more generic form of the operation. In order to evaluate the R-operation of expression y, with respect to x, multiplying the Jacobian by v, you need to do something similar to this:
- >>> W = T.dmatrix('W')
- >>> V = T.dmatrix('V')
- >>> x = T.dvector('x')
- >>> y = T.dot(x, W)
- >>> JV = T.Rop(y, W, V)
- >>> f = theano.function([W, V, x], JV)
- >>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])
- array([ 2., 2.])
Note
The list of Ops that implement Rop can be found in the Theano documentation.
L-operator
Similar to the R-operator, the L-operator computes a row vector times the Jacobian. The mathematical formula would be v * (df(x)/dx). The L-operator is also supported for generic tensors (not only for vectors). Similarly, it can be implemented as follows:
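(A sketch mirroring the R-operator example above; note that v here has the shape of the output y.)
- >>> W = T.dmatrix('W')
- >>> v = T.dvector('v')
- >>> x = T.dvector('x')
- >>> y = T.dot(x, W)
- >>> VJ = T.Lop(y, W, v)
- >>> f = theano.function([v, x], VJ)
- >>> f([2, 2], [0, 1])
- array([[ 0.,  0.],
- [ 2.,  2.]])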
Note
v, the point of evaluation, differs between the L-operator and the R-operator: for the L-operator it must have the same shape as the output, whereas for the R-operator it must have the same shape as the input parameter.
Hessian times a Vector
If you need to compute the Hessian times a vector, you can make use of the above-defined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though these options might exhibit different performance. Hence, we suggest profiling the methods before using either one of the two:
- >>> x = T.dvector('x')
- >>> v = T.dvector('v')
- >>> y = T.sum(x ** 2)
- >>> gy = T.grad(y, x)
- >>> vH = T.grad(T.sum(gy * v), x)
- >>> f = theano.function([x, v], vH)
- >>> f([4, 4], [2, 2])
- array([ 4., 4.])
or, making use of the R-operator:
- >>> x = T.dvector('x')
- >>> v = T.dvector('v')
- >>> y = T.sum(x ** 2)
- >>> gy = T.grad(y, x)
- >>> Hv = T.Rop(gy, x, v)
- >>> f = theano.function([x, v], Hv)
- >>> f([4, 4], [2, 2])
- array([ 4., 4.])
Final Pointers
- The grad function works symbolically: it receives and returns Theano variables.
- grad can be compared to a macro since it can be applied repeatedly.
- Scalar costs only can be directly handled by grad. Arrays are handled through repeated applications.
- Built-in functions allow efficient computation of vector times Jacobian and vector times Hessian.
- Work is in progress on the optimizations required to efficiently compute the full Jacobian and the Hessian matrix, as well as the Jacobian times vector.