Making arithmetic Ops on double

    An Op is any object which inherits from gof.Op. It has to define the following methods.

    • make_node(*inputs)
    • This method is responsible for creating output Variables of a suitable symbolic Type to serve as the outputs of this Op’s application. The Variables found in *inputs must be operated on using Theano’s symbolic language to compute the symbolic output Variables. This method should put these outputs into an Apply instance, and return the Apply instance.

    This method creates an Apply node representing the application of the Op on the inputs provided. If the Op cannot be applied to these inputs, it must raise an appropriate exception.

    The inputs of the Apply instance returned by this call must be ordered correctly: a subsequent self.make_node(*apply.inputs) must produce something equivalent to the first apply.

    • perform(node, inputs, output_storage)
    • This method computes the function associated with this Op. node is an Apply node created by the Op’s make_node method. inputs is a list of references to data to operate on using non-symbolic statements (i.e., statements in Python, Numpy). output_storage is a list of storage cells where the variables of the computation must be put.

    More specifically:

    This method must be determined by the inputs. That is to say, if it is evaluated once on inputs A and returned B, then whenever inputs C, equal to A, are presented again, outputs equal to B must be returned again.

    You must be careful about aliasing outputs to inputs, and making modifications to any of the inputs. See Views and inplace operations before writing a perform implementation that does either of these things.

    Instead of (or in addition to) perform(), you can also provide a C implementation of the computation. For more details, refer to the documentation for Op.

    • __eq__(other)
    • other is also an Op.

    Returning True here is a promise to the optimization system that the other Op will produce exactly the same graph effects (from perform) as this one, given identical inputs. This means it will produce the same output values, it will destroy the same inputs (same destroy_map), and will alias outputs to the same inputs (same view_map). For more details, see Views and inplace operations.

    • __hash__()
    • If two Op instances compare equal, then they must return the same hash value.

    Equally important, this hash value must not change during the lifetime of self. Op instances should be immutable in this sense.

    Optional methods or attributes

    • __props__
    • Default: Undefined

    Must be a tuple. Lists the names of the attributes which influence the computation performed. This will also enable the automatic generation of appropriate __eq__, __hash__ and __str__ methods. It should be set to () if you have no attributes that are relevant to the computation, so that those methods are still generated.

    New in version 0.7.
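    For instance (an illustrative sketch, not from the original text; the class and attribute names are hypothetical), an op whose computation depends only on an axis attribute could declare:

        from theano import gof

        class SumAlongAxis(gof.Op):
            # "axis" is the only attribute that influences the computation,
            # so listing it in __props__ lets Theano generate __eq__,
            # __hash__ and __str__ automatically.
            __props__ = ("axis",)

            def __init__(self, axis):
                self.axis = axis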

    • default_output
    • Default: None

    If this member variable is an integer, then the default implementation of __call__ will return node.outputs[self.default_output], where node was returned by make_node. Otherwise, the entire list of outputs will be returned, unless it has length 1, in which case the single element will be returned by itself.

    • make_thunk(node, storage_map, compute_map, no_recycling, impl=None)
    • This function must return a thunk, that is, a zero-argument function that encapsulates the computation to be performed by this op on the arguments of the node.

    Parameters:

    • node – Apply instance. The node for which a thunk is requested.
    • storage_map – dict of lists. This maps variables to one-element lists holding the variable’s current value. The one-element list acts as a pointer to the value and allows sharing that “pointer” with other nodes and instances.
    • compute_map – dict of lists. This maps variables to one-element lists holding booleans. If the value is 0 then the variable has not been computed and the value should not be considered valid. If the value is 1 the variable has been computed and the value is valid. If the value is 2 the variable has been garbage-collected and is no longer valid, but shouldn’t be required anymore for this call.
    • no_recycling – WRITEME
    • impl – None, ‘c’ or ‘py’. Which implementation to use.

    The returned function must ensure that it sets the computed variables as computed in the compute_map.

    Defining this function removes the requirement for perform() or C code, as you will define the thunk for the computation yourself.

    • __call__(*inputs, **kwargs)
    • By default this is a convenience function which calls make_node() with the supplied arguments and returns the result indexed by default_output. This can be overridden by subclasses to do anything else, but must return either a Theano Variable or a list of Variables.

    If you feel the need to override __call__ to change the graph based on the arguments, you should instead create a function that will use your Op and build the graphs that you want, and call that instead of the Op instance directly.

    • infer_shape(node, shapes)
    • This function is needed for shape optimization. shapes is a list with one tuple for each input of the Apply node (which corresponds to the inputs of the op). Each tuple contains as many elements as the number of dimensions of the corresponding input. The value of each element is the shape (number of items) along the corresponding dimension of that specific input.

    While this might sound complicated, it is nothing more than the shape of each input as symbolic variables (one per dimension).

    The function should return a list with one tuple for each output. Each tuple should contain the corresponding output’s computed shape.

    Implementing this method will allow Theano to compute the output’s shape without computing the output itself, potentially sparing you a costly recomputation.
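    As a sketch (not from the original text), an elementwise op whose single output always has the same shape as its first input could implement it as:

        def infer_shape(self, node, input_shapes):
            # input_shapes holds one tuple of symbolic dimensions per input.
            # The output of an elementwise op has the same shape as its first
            # input, so return that tuple in a list (one entry per output).
            return [input_shapes[0]]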

    • flops(inputs, outputs)
    • It is only used by the memory profiler to print more information: it makes the profiler report the mega flops and giga flops per second for each apply node. It takes as inputs two lists: one for the inputs and one for the outputs. They contain tuples that are the shapes of the corresponding inputs/outputs.
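    For example (an illustrative sketch, not from the original text), a matrix-matrix multiplication op could count two floating-point operations per multiply-accumulate:

        def flops(self, inputs, outputs):
            # inputs contains the shapes of the two operands: (M, K) and (K, N).
            (m, k), (_, n) = inputs
            # One multiplication and one addition per output element and per
            # step along the inner dimension.
            return 2 * m * k * n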
    • __str__()
    • This allows you to specify a more informative string representation of your Op. If an Op has parameters, it is highly recommended to have the __str__ method include the name of the op and the Op’s parameters’ values.

    Note

    If you set __props__, this will be automatically generated. You can still override it for custom output.

    • do_constant_folding(node)
    • Default: Return True

    As done in the Alloc op, you can return False only in some cases by analyzing the graph from the node parameter.

    • debug_perform(node, inputs, output_storage)
    • Undefined by default.

    If you define this function then it will be used instead of C code or perform() to do the computation while debugging (currently DebugMode, but others may also use it in the future). It has the same signature and contract as perform().

    This enables ops whose normal behaviour causes trouble with DebugMode to adopt a different one when run under that mode. If your op doesn’t have any problems, don’t implement this.

    If you want your op to work with gradient.grad() you also need to implement the functions described below.

    These are the functions required to work with gradient.grad().

    • grad(inputs, output_gradients)

    If the output is not differentiable with respect to an input then this method should be defined to return a variable of type NullType for that input. Likewise, if you have not implemented the grad computation for some input, you may return a variable of type NullType for that input. theano.gradient contains convenience methods that can construct the variable for you: theano.gradient.grad_undefined() and theano.gradient.grad_not_implemented(), respectively.

    If an element of output_gradient is of type theano.gradient.DisconnectedType, it means that the cost is not a function of this output. If any of the op’s inputs participate in the computation of only disconnected outputs, then Op.grad should return DisconnectedType variables for those inputs.

    If the grad method is not defined, then Theano assumes it has been forgotten. Symbolic differentiation will fail on a graph that includes this Op.

    It must be understood that the Op’s grad method is not meant to return the gradient of the Op’s output. theano.tensor.grad computes gradients; Op.grad is a helper function that computes terms that appear in gradients.

    If an Op has a single vector-valued output y and a single vector-valued input x, then the grad method will be passed x and a second vector z. Define J to be the Jacobian of y with respect to x. The Op’s grad method should return dot(J.T, z). When theano.tensor.grad calls the grad method, it will set z to be the gradient of the cost C with respect to y. If this op is the only op that acts on x, then dot(J.T, z) is the gradient of C with respect to x. If there are other ops that act on x, theano.tensor.grad will have to add up the terms of x’s gradient contributed by the other op’s grad method.

    In practice, an op’s input and output are rarely implemented as single vectors. Even if an op’s output consists of a list containing a scalar, a sparse matrix, and a 4D tensor, you can think of these objects as being formed by rearranging a vector. Likewise for the input. In this view, the values computed by the grad method still represent a Jacobian-vector product.

    In practice, it is probably not a good idea to explicitly construct the Jacobian, which might be very large and very sparse. However, the returned value should be equal to the Jacobian-vector product.

    So long as you implement this product correctly, you need not understand what theano.tensor.grad is doing, but for the curious the mathematical justification is as follows:

    In essence, the grad method must simply implement, through symbolic Variables and operations, the chain rule of differential calculus. The chain rule is the mathematical procedure that allows one to calculate the total derivative of the final scalar symbolic Variable C with respect to a primitive symbolic Variable x found in the list inputs. The grad method does this using output_gradients, which provides the total derivative \frac{d C}{d f} of C with respect to a symbolic Variable f that is returned by the Op (this is provided in output_gradients), as well as the knowledge of the total derivative \frac{d f}{d x} of the latter with respect to the primitive Variable (this has to be computed).

    In mathematics, the total derivative of a scalar variable (C) with respect to a vector of scalar variables (x), i.e. the gradient, is customarily represented as the row vector of the partial derivatives, whereas the total derivative of a vector of scalar variables (f) with respect to another (x) is customarily represented by the matrix of the partial derivatives, i.e. the Jacobian matrix. In this convenient setting, the chain rule instructs that the gradient of the final scalar variable C with respect to the primitive scalar variables in x through those in f is simply given by the matrix product: \frac{d C}{d x} = \frac{d C}{d f} \cdot \frac{d f}{d x}.

    Here, the chain rule must be implemented in a similar but slightly more complex setting: Theano provides in the list output_gradients one gradient for each of the Variables returned by the Op. Where f is one such particular Variable, the corresponding gradient found in output_gradients and representing \frac{d C}{d f} is provided with a shape similar to f and thus not necessarily as a row vector of scalars. Furthermore, for each Variable x of the Op’s list of input variables inputs, the returned gradient representing \frac{d C}{d x} must have a shape similar to that of Variable x.

    If the output list of the op is [f_1, ..., f_n], then the list output_gradients is [grad_{f_1}(C), grad_{f_2}(C), ..., grad_{f_n}(C)]. If inputs consists of the list [x_1, ..., x_m], then Op.grad should return the list [grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)], where (grad_{y}(Z))_i = \frac{\partial Z}{\partial y_i} (and i can stand for multiple dimensions).

    In other words, grad() does not return \frac{d f_i}{d x_j}, but instead the appropriate dot product specified by the chain rule: \frac{d C}{d x_j} = \frac{d C}{d f} \cdot \frac{d f}{d x_j}. Both the partial differentiation and the multiplication have to be performed by grad().
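    As a concrete sketch (illustrative, not from the original text), an elementwise multiplication op computing z = x * y would implement this chain-rule term as follows, where gz stands for \frac{d C}{d z}:

        def grad(self, inputs, output_gradients):
            x, y = inputs
            gz, = output_gradients
            # For z = x * y, the Jacobian-vector products are
            # gz * dz/dx = gz * y and gz * dz/dy = gz * x,
            # one term per input, in the same order as inputs.
            return [gz * y, gz * x]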

    Theano currently imposes the following constraints on the values returned by the grad method:

    • They must be Variable instances.
    • When they are types that have dtypes, they must never have an integer dtype. The output gradients passed to Op.grad will also obey these constraints.

    Integers are a tricky subject. Integers are the main reason for having DisconnectedType, NullType or zero gradient. When you have an integer as an argument to your grad method, recall the definition of a derivative to help you decide what value to return:

    \frac{d f}{d x} = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon.

    Suppose your function f has an integer-valued output. For most functions you’re likely to implement in theano, this means your gradient should be zero, because f(x+epsilon) = f(x) for almost all x. (The only other option is that the gradient could be undefined, if your function is discontinuous everywhere, like the rational indicator function.)

    Suppose your function f has an integer-valued input. This is a little trickier, because you need to think about what you mean mathematically when you make a variable integer-valued in theano. Most of the time in machine learning we mean “f is a function of a real-valued x, but we are only going to pass in integer values of x”. In this case, f(x+epsilon) exists, so the gradient through f should be the same whether x is an integer or a floating point variable. Sometimes what we mean is “f is a function of an integer-valued x, and f is only defined where x is an integer.” Since f(x+epsilon) doesn’t exist, the gradient is undefined. Finally, many times in theano, integer-valued inputs don’t actually affect the elements of the output, only its shape.

    If your function f has both an integer-valued input and an integer-valued output, then both rules have to be combined:

    • If f is defined at (x+epsilon), then the input gradient is defined. Since f(x+epsilon) would be equal to f(x) almost everywhere, the gradient should be 0 (first rule).
    • If f is only defined where x is an integer, then the gradient is undefined, regardless of what the gradient with respect to the output is. Examples:

      • f(x,y) = dot product between x and y. x and y are integers.
      • Since the output is also an integer, f is a step function. Its gradient is zero almost everywhere, so Op.grad should return zeros in the shape of x and y.
      • f(x,y) = dot product between x and y. x is floating point and y is an integer.
      • In this case the output is floating point. It doesn’t matter that y is an integer. We consider f to still be defined at f(x,y+epsilon). The gradient is exactly the same as if y were floating point.
      • f(x,y) = argmax of x along axis y.
      • The gradient with respect to y is undefined, because f(x,y) is not defined for floating point y. How could you take an argmax along a fractional axis? The gradient with respect to x is 0, because f(x+epsilon, y) = f(x) almost everywhere.
      • f(x,y) = a vector with y elements, each of which takes on the value x
      • The grad method should return DisconnectedType()() for y, because the elements of f don’t depend on y. Only the shape of f depends on y. You probably also want to implement a connection_pattern method to encode this.
      • f(x) = int(x) converts float x into an int. g(y) = float(y) converts an integer y into a float.
      • If the final cost C = 0.5 * g(y) = 0.5 * g(f(x)), then the gradient with respect to y will be 0.5, even if y is an integer. However, the gradient with respect to x will be 0, because the output of f is integer-valued.
    • connection_pattern(node)
    • Sometimes needed for proper operation of gradient.grad().

    Returns a list of lists of bools.

    Op.connection_pattern[input_idx][output_idx] is true if the elements of inputs[input_idx] have an effect on the elements of outputs[output_idx].

    The node parameter is needed to determine the number of inputs. Some ops such as Subtensor take a variable number of inputs.

    If no connection_pattern is specified, gradient.grad will assume that all inputs have some elements connected to some elements of all outputs.

    This method conveys two pieces of information that are otherwise not part of the theano graph:

    • Which of the op’s inputs are truly ancestors of each of the op’s outputs. Suppose an op has two inputs, x and y, and outputs f(x) and g(y). y is not really an ancestor of f, but it appears to be so in the theano graph.
    • Whether the elements of each output actually depend on the elements of each input (an input may, for example, only determine the shape of an output, as in the f(x,y) example above).

    Failing to implement this method for an op that needs it can result in two kinds of incorrect behavior:

    • gradient.grad erroneously raising a TypeError reporting that a gradient is undefined.
    • gradient.grad failing to raise a ValueError reporting that an input is disconnected.

    Even if connection_pattern is not implemented correctly, if gradient.grad returns an expression, that expression will be numerically correct.
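    As an illustrative sketch (not from the original text), an op with inputs x and y and outputs f(x) and g(y) would declare that each input feeds only one output:

        def connection_pattern(self, node):
            # connection_pattern[input_idx][output_idx]:
            # input x (index 0) only affects output 0, input y (index 1)
            # only affects output 1.
            return [[True, False],
                    [False, True]]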
    • R_op(inputs, eval_points)
    • Optional, to work with gradient.R_op().

    This function implements the application of the R-operator on the function represented by your op. Let us assume that function is f, with input x; applying the R-operator means computing the Jacobian of f and right-multiplying it by v, the evaluation point, namely: \frac{\partial f}{\partial x} v.

    inputs are the symbolic variables corresponding to the value of the input where you want to evaluate the jacobian, and eval_points are the symbolic variables corresponding to the value you want to right-multiply the jacobian with.

    The same conventions as for the grad method hold. If your op is not differentiable, you can return None. Note that in contrast to grad(), for R_op() you need to return the same number of outputs as there are outputs of the op. You can think of it in the following terms. You have all your inputs concatenated into a single vector x. You do the same with the evaluation points (which are as many as the inputs and of the same shape) and obtain another vector v. For each output, you reshape it into a vector, compute the Jacobian of that vector with respect to x and multiply it by v. As a last step you reshape each of the vectors you obtained for each output (which have the same shapes as the outputs) back to their corresponding shapes and return them as the output of the method.
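    For instance (an illustrative sketch, not from the original text), an elementwise multiplication op z = x * y has Jacobian-times-vector u * y + x * v along the direction (u, v), so its R_op could look like:

        def R_op(self, inputs, eval_points):
            x, y = inputs
            u, v = eval_points
            # Evaluation points that are not requested arrive as None.
            if u is None or v is None:
                return [None]
            # Directional derivative of z = x * y along (u, v).
            return [u * y + x * v]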

    See the List of Ops with R_op support.

    Defining an Op: mul

    We’ll define multiplication as a binary operation, even though a multiplication Op could take an arbitrary number of arguments.

    First, we’ll instantiate a mul Op:
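    The original listing is omitted here; a minimal sketch consistent with what follows (it assumes the double Type from the previous section and the plain gof.Op pattern used below):

        from theano import gof

        # Create a bare Op instance; make_node and perform are attached to it
        # in the next two subsections.
        mul = gof.Op()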

    make_node

    This function must take as many arguments as the operation we are defining is supposed to take as inputs; in this example that would be two. This function ensures that both inputs have the double type. Since multiplying two doubles yields a double, this function makes an Apply node with an output Variable of type double.

        def make_node(x, y):
            if x.type != double or y.type != double:
                raise TypeError('mul only works on doubles')
            return gof.Apply(mul, [x, y], [double()])
        mul.make_node = make_node

    The first two lines make sure that both inputs are Variables of the double type that we created in the previous section. We would not want to multiply two arbitrary types; it would not make much sense (and we’d be screwed when we implement this in C!).

    The last line is the meat of the definition. There we create an Apply node representing the application of Op mul to inputs x and y, giving a Variable instance of type double as the output.

    Note

    Theano relies on the fact that if you call the make_node method of Apply’s first argument on the inputs passed as the Apply’s second argument, the call will not fail and the returned Apply instance will be equivalent. This is how graphs are copied.

    perform

    This code actually computes the function. In our example, the data in inputs will be instances of Python’s built-in type float because this is the type that double.filter() will always return, per our own definition. output_storage will contain a single storage cell for the multiplication’s variable.
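    A minimal sketch of that perform, consistent with the description above (the names x, y and z match the text below):

        def perform(node, inputs, output_storage):
            x, y = inputs[0], inputs[1]
            z = output_storage[0]
            # Store the product in the single output storage cell.
            z[0] = x * y
        mul.perform = perform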

    Here, z is a list of one element. By default, z == [None].

    Note

    It is possible that z does not contain None. If it contains anything else, Theano guarantees that whatever it contains is what perform put there the last time it was called with this particular storage. Furthermore, Theano gives you permission to do whatever you want with z‘s contents, chiefly reusing it or the memory allocated for it. More information can be found in the Op documentation.

    Warning

    We gave z the Theano type double in make_node, which means that a Python float must be put there. You should not put, say, an int in z[0] because Theano assumes Ops handle typing properly.

    In the following code, we use our new Op:

        >>> import theano
        >>> x, y = double('x'), double('y')
        >>> z = mul(x, y)
        >>> f = theano.function([x, y], z)
        >>> f(5, 6)
        30.0
        >>> f(5.6, 6.7)
        37.519999999999996

    Note that there is an implicit call to double.filter() on each argument, so if we give integers as inputs they are magically cast to the right type. Now, what if we try this?
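    The failing call is not shown here; a plausible sketch, given the make_node above (the exact error text may differ):

        >>> x = double('x')
        >>> z = mul(x, 2)
        Traceback (most recent call last):
          ...
        AttributeError: 'int' object has no attribute 'type'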

    Well, OK. We’d like our Op to be a bit more flexible. This can be done by modifying make_node to accept Python int or float as x and/or y:

        def make_node(x, y):
            if isinstance(x, (int, float)):
                x = gof.Constant(double, x)
            if isinstance(y, (int, float)):
                y = gof.Constant(double, y)
            if x.type != double or y.type != double:
                raise TypeError('mul only works on doubles')
            return gof.Apply(mul, [x, y], [double()])
        mul.make_node = make_node

    Whenever we pass a Python int or float instead of a Variable as x or y, make_node will convert it to a Constant for us. gof.Constant is a Variable whose value we know statically.

    Now the code works the way we want it to.

    Note

    Most Theano Ops follow this convention of up-casting literal make_node arguments to Constants. This makes typing expressions more natural. If you do not want a constant somewhere in your graph, you have to pass a Variable (like double('x') here).

    Final version

    The above example is pedagogical. When you define other basic arithmetic operations add, sub and div, code for make_node can be shared between these Ops. Here is a revised implementation of these four arithmetic operators:

        from theano import gof

        class BinaryDoubleOp(gof.Op):

            __props__ = ("name", "fn")

            def __init__(self, name, fn):
                self.name = name
                self.fn = fn

            def make_node(self, x, y):
                if isinstance(x, (int, float)):
                    x = gof.Constant(double, x)
                if isinstance(y, (int, float)):
                    y = gof.Constant(double, y)
                if x.type != double or y.type != double:
                    raise TypeError('%s only works on doubles' % self.name)
                return gof.Apply(self, [x, y], [double()])

            def perform(self, node, inp, out):
                x, y = inp
                z, = out
                z[0] = self.fn(x, y)

            def __str__(self):
                return self.name

        add = BinaryDoubleOp(name='add',
                             fn=lambda x, y: x + y)

        sub = BinaryDoubleOp(name='sub',
                             fn=lambda x, y: x - y)

        mul = BinaryDoubleOp(name='mul',
                             fn=lambda x, y: x * y)

        div = BinaryDoubleOp(name='div',
                             fn=lambda x, y: x / y)

    Instead of working directly on an instance of Op, we create a subclass of Op that we can parametrize. All the operations we define are binary. They all work on two inputs with type double. They all return a single Variable of type double. Therefore, make_node does the same thing for all these operations, except for the Op reference self passed as first argument to Apply. We define perform using the function passed in the constructor.
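    As a quick check (a sketch, assuming theano is imported and the double Type is in scope as before), the new Ops compose just like mul did earlier:

        >>> x, y = double('x'), double('y')
        >>> f = theano.function([x, y], add(mul(x, y), sub(x, 2)))
        >>> f(3.0, 4.0)
        13.0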

    This design is a flexible way to define basic operations without duplicating code. The same way a Type subclass represents a set of structurally similar types (see previous section), an Op subclass represents a set of structurally similar operations: operations that have the same input/output types, operations that only differ in one small detail, etc. If you see common patterns in several Ops that you want to define, it can be a good idea to abstract out what you can. Remember that an Op is just an object which satisfies the contract described above on this page and that you should use all the tools at your disposal to create these objects as efficiently as possible.