1.5. Stochastic Gradient Descent

SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.

The advantages of Stochastic Gradient Descent are:

The disadvantages of Stochastic Gradient Descent include:

SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
SGD is sensitive to feature scaling.
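Because of this sensitivity to feature scaling, features are commonly standardized before fitting. The following is a minimal sketch, assuming scikit-learn's StandardScaler and make_pipeline are available; the toy data are invented for illustration only.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: the second feature is on a much larger scale.
X = [[0.0, 1000.0], [1.0, 3000.0], [2.0, 2000.0], [3.0, 4000.0]]
y = [0, 0, 1, 1]

# Standardize each feature to zero mean and unit variance, then fit SGD.
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```

Putting the scaler inside the pipeline means the same scaling is applied automatically at prediction time.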

1.5.1. Classification

Warning

Make sure you permute (shuffle) your training data before fitting the model, or use shuffle=True to shuffle after each iteration.

The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification.

Like other classifiers, SGD has to be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of size [n_samples] holding the target values (class labels) for the training samples:
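A minimal sketch of such a fit, assuming a tiny two-sample dataset chosen only for illustration; the hyperparameter values (max_iter, tol, random_state) are assumptions, not recommendations.

```python
from sklearn.linear_model import SGDClassifier

# Toy training data (illustrative): two samples with two features each.
X = [[0.0, 0.0], [1.0, 1.0]]
y = [0, 1]

# Hinge loss with an L2 penalty gives a linear SVM trained by SGD.
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000, tol=1e-3,
                    random_state=0)
clf.fit(X, y)
```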

After being fitted, the model can then be used to predict new values:

>>> clf.predict([[2., 2.]])
array([1])

SGD fits a linear model to the training data. The member coef_ holds the model parameters.

Member intercept_ holds the intercept (aka offset or bias):

>>> clf.intercept_
array([-9.9...])

Whether or not the model should use an intercept, i.e. a biased hyperplane, is controlled by the parameter fit_intercept.

To get the signed distance to the hyperplane use SGDClassifier.decision_function:
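As a hedged illustration, the sketch below fits a binary model on assumed toy data and checks that decision_function agrees with the linear score computed by hand from coef_ and intercept_. The score is w . x + b, which is proportional to the signed distance to the hyperplane.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data, invented for illustration.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1])
clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3,
                    random_state=0).fit(X, y)

X_new = np.array([[2.0, 2.0], [-1.0, -1.0]])
scores = clf.decision_function(X_new)

# For a binary linear model the score is w . x + b.
manual = X_new @ clf.coef_.ravel() + clf.intercept_[0]
```

A positive score corresponds to the class stored in clf.classes_[1], a negative score to clf.classes_[0].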

The concrete loss function can be set via the loss parameter. SGDClassifier supports the following loss functions:

loss="hinge": (soft-margin) linear Support Vector Machine,
loss="modified_huber": smoothed hinge loss,
loss="log": logistic regression,
and all regression losses below.

The first two loss functions are lazy: they only update the model parameters if an example violates the margin constraint, which makes training very efficient and may result in sparser models, even when an L2 penalty is used.

Using loss="log" or loss="modified_huber" enables the predict_proba method, which gives a vector of probability estimates $P(y|x)$ per sample $x$:

>>> clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
>>> clf.predict_proba([[1., 1.]])
array([[0.00..., 0.99...]])

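The concrete penalty can be set via the penalty parameter. SGD supports the following penalties:

penalty="l2": L2 norm penalty on coef_.
penalty="l1": L1 norm penalty on coef_.
penalty="elasticnet": convex combination of L2 and L1; (1 - l1_ratio) * L2 + l1_ratio * L1.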

The default setting is penalty="l2". The L1 penalty leads to sparse solutions, driving most coefficients to zero. The Elastic Net solves some deficiencies of the L1 penalty in the presence of highly correlated attributes. The parameter l1_ratio controls the convex combination of L1 and L2 penalty.

SGDClassifier supports multi-class classification by combining multiple binary classifiers in a “one versus all” (OVA) scheme. For each of the $K$ classes, a binary classifier is learned that discriminates between that class and all other classes. At testing time, we compute the confidence score (i.e. the signed distance to the hyperplane) for each classifier and choose the class with the highest confidence. The figure below illustrates the OVA approach on the iris dataset. The dashed lines represent the three OVA classifiers; the background colors show the decision surface induced by the three classifiers.

In the case of multi-class classification, coef_ is a two-dimensional array of shape [n_classes, n_features] and intercept_ is a one-dimensional array of shape [n_classes]. The i-th row of coef_ holds the weight vector of the OVA classifier for the i-th class; classes are indexed in ascending order (see attribute classes_). Note that, in principle, since they allow the creation of a probability model, loss="log" and loss="modified_huber" are more suitable for one-vs-all classification.
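The shapes described above can be checked on the iris dataset. This is a small sketch, assuming default SGDClassifier settings apart from max_iter, tol and random_state; it also verifies that the predicted class is the one whose binary classifier returns the highest score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

# One weight vector and one intercept per class.
print(clf.coef_.shape)       # (3, 4)
print(clf.intercept_.shape)  # (3,)

# The predicted class is the one with the highest OVA confidence score.
scores = clf.decision_function(X)
predicted = clf.classes_[np.argmax(scores, axis=1)]
```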

SGDClassifier supports both weighted classes and weighted instances via the parameters class_weight and sample_weight. See the examples below and the docstring of SGDClassifier.fit for further information.
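The sketch below shows both kinds of weighting on invented, imbalanced data. It assumes a scikit-learn version in which class_weight is passed to the constructor and sample_weight is passed to fit; the weight values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1.
X = np.vstack([rng.randn(90, 2), rng.randn(10, 2) + 2.0])
y = np.array([0] * 90 + [1] * 10)

# Weighted classes: errors on class 1 count ten times more.
clf_cw = SGDClassifier(class_weight={0: 1.0, 1: 10.0}, max_iter=1000,
                       tol=1e-3, random_state=0)
clf_cw.fit(X, y)

# Weighted instances: one weight per training sample, passed to fit.
sample_weight = np.where(y == 1, 10.0, 1.0)
clf_sw = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf_sw.fit(X, y, sample_weight=sample_weight)
```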

Examples:

SGD: Maximum margin separating hyperplane,
Comparing various online solvers
(See the Note)

SGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting average=True. ASGD works by averaging the coefficients of the plain SGD over each iteration over a sample. When using ASGD, the learning rate can be larger and even constant, which on some datasets speeds up training.
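A minimal sketch of averaged SGD with a constant learning rate, assuming synthetic data from make_classification; the value eta0=0.01 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# With average=True, coef_ and intercept_ hold the averaged weights.
asgd = SGDClassifier(average=True, learning_rate="constant", eta0=0.01,
                     max_iter=1000, tol=1e-3, random_state=0)
asgd.fit(X, y)
accuracy = asgd.score(X, y)
```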

For classification with a logistic loss, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in LogisticRegression.

1.5.2. Regression

The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.

The concrete loss function can be set via the loss parameter. SGDRegressor supports the following loss functions:

loss="squared_loss": Ordinary least squares,
loss="huber": Huber loss for robust regression,
loss="epsilon_insensitive": linear Support Vector Regression.

The Huber and epsilon-insensitive loss functions can be used for robust regression. The width of the insensitive region has to be specified via the parameter epsilon. This parameter depends on the scale of the target variables.
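The following sketch fits both robust losses on synthetic data. The data-generating coefficients and the epsilon values are assumptions made for illustration; in practice epsilon should be chosen relative to the scale of y.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
# Hypothetical linear target with small noise.
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)

# epsilon is the width of the insensitive region, in the units of y.
svr = SGDRegressor(loss="epsilon_insensitive", epsilon=0.1, max_iter=1000,
                   tol=1e-3, random_state=0).fit(X, y)

# For the Huber loss, epsilon is the threshold beyond which errors are
# penalized linearly instead of quadratically.
huber = SGDRegressor(loss="huber", epsilon=1.0, max_iter=1000,
                     tol=1e-3, random_state=0).fit(X, y)
```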

SGDRegressor supports averaged SGD, as SGDClassifier does. Averaging can be enabled by setting average=True.

For regression with a squared loss and an L2 penalty, another variant of SGD with an averaging strategy is available with the Stochastic Average Gradient (SAG) algorithm, available as a solver in Ridge.

1.5.3. Stochastic Gradient Descent for sparse data

Note

The sparse implementation produces slightly different results than the dense implementation due to a shrunk learning rate for the intercept.

There is built-in support for sparse data given in any matrix in a format supported by scipy.sparse. For maximum efficiency, however, use the CSR matrix format as defined in scipy.sparse.csr_matrix.
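A short sketch of fitting on CSR input, assuming a tiny hand-written matrix; real sparse data would typically come from a vectorizer or a similar feature extractor.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

# Hypothetical design matrix, mostly zeros, stored in CSR format.
dense = np.array([[0.0, 0.0, 1.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 0.0, 4.0]])
X_sparse = csr_matrix(dense)
y = np.array([0, 1, 1, 0])

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X_sparse, y)
predictions = clf.predict(X_sparse)
```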

Examples:

Classification of text documents using sparse features

1.5.4. Complexity

The major advantage of SGD is its efficiency, which is basically linear in the number of training examples. If X is a matrix of size (n, p), training has a cost of $O(k n \bar{p})$, where k is the number of iterations (epochs) and $\bar{p}$ is the average number of non-zero attributes per sample.

Recent theoretical results, however, show that the runtime to get some desired optimization accuracy does not increase as the training set size increases.

1.5.5. Stopping criterion

The classes SGDClassifier and SGDRegressor provide two criteria to stop the algorithm when a given level of convergence is reached:

With early_stopping=True, the input data is split into a training set and a validation set. The model is then fitted on the training set, and the stopping criterion is based on the prediction score computed on the validation set. The size of the validation set can be changed with the parameter validation_fraction.
With early_stopping=False, the model is fitted on the entire input data and the stopping criterion is based on the objective function computed on the input data.

In both cases, the criterion is evaluated once per epoch, and the algorithm stops when the criterion does not improve n_iter_no_change times in a row. The improvement is evaluated with a tolerance tol, and the algorithm stops in any case after a maximum number of iterations max_iter.
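The parameters above can be combined as in the following sketch, which assumes synthetic data from make_classification and arbitrary illustrative values for each setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SGDClassifier(
    early_stopping=True,      # hold out part of the data for validation
    validation_fraction=0.2,  # fraction of the data held out
    n_iter_no_change=5,       # epochs without improvement before stopping
    tol=1e-3,                 # minimum improvement that counts
    max_iter=1000,            # hard cap on the number of epochs
    random_state=0,
)
clf.fit(X, y)
epochs_run = clf.n_iter_  # number of epochs actually performed
```

Comparing epochs_run with max_iter shows whether training stopped because of the criterion or because the cap was hit.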

1.5.6. Tips on Practical Use

References:

Y. LeCun, L. Bottou, G. Orr, K. Müller - In Neural Networks: Tricks of the Trade 1998.

1.5.7. Mathematical formulation

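Given a set of training examples $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in \mathbf{R}^m$ and $y_i \in \{-1, 1\}$, our goal is to learn a linear scoring function $f(x) = w^T x + b$ with model parameters $w \in \mathbf{R}^m$ and intercept $b \in \mathbf{R}$. In order to make predictions, we simply look at the sign of $f(x)$. A common choice to find the model parameters is by minimizing the regularized training error given by

$$E(w, b) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \alpha R(w)$$

where $L$ is a loss function that measures model (mis)fit and $R$ is a regularization term (aka penalty) that penalizes model complexity; $\alpha > 0$ is a non-negative hyperparameter.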
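To make the objective concrete, the sketch below evaluates the regularized training error, i.e. the average loss over the training examples plus alpha times the penalty, in plain Python. The function name regularized_error and the toy numbers are hypothetical; the hinge loss and L2 penalty are just one possible choice.

```python
def regularized_error(w, b, X, y, loss, penalty, alpha):
    """E(w, b) = (1/n) * sum_i L(y_i, f(x_i)) + alpha * R(w)."""
    n = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Linear scoring function f(x) = w . x + b
        f_x = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        total += loss(y_i, f_x)
    return total / n + alpha * penalty(w)


def hinge(y_i, p):
    return max(0.0, 1.0 - y_i * p)


def l2(w):
    return 0.5 * sum(w_j ** 2 for w_j in w)


# Illustrative data with labels in {-1, +1}.
X = [[1.0, 2.0], [-1.0, -1.0]]
y = [1, -1]
E = regularized_error([0.5, 0.5], 0.0, X, y, hinge, l2, alpha=0.1)
print(E)  # both margins are satisfied, so only the penalty term remains
```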

Different choices for $L$ entail different classifiers such as

Hinge: (soft-margin) Support Vector Machines.
Log: Logistic Regression.
Least-Squares: Ridge Regression.
Epsilon-Insensitive: (soft-margin) Support Vector Regression.
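The four losses listed above can be written down directly. This is a sketch using their usual textbook definitions, with y the target and p = f(x) the model output; the function names are illustrative and are not scikit-learn API.

```python
import math


def hinge(y, p):
    # y in {-1, +1}; zero once the margin y * p reaches 1
    return max(0.0, 1.0 - y * p)


def log_loss(y, p):
    # y in {-1, +1}; logistic regression loss
    return math.log(1.0 + math.exp(-y * p))


def squared(y, p):
    # least-squares loss for a real-valued target
    return 0.5 * (y - p) ** 2


def epsilon_insensitive(y, p, epsilon=0.1):
    # errors smaller than epsilon are ignored
    return max(0.0, abs(y - p) - epsilon)
```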

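Popular choices for the regularization term $R$ include:

L2 norm: $R(w) := \frac{1}{2} \sum_{j=1}^{m} w_j^2$,
L1 norm: $R(w) := \sum_{j=1}^{m} |w_j|$, which leads to sparse solutions.
Elastic Net: $R(w) := \frac{\rho}{2} \sum_{j=1}^{m} w_j^2 + (1 - \rho) \sum_{j=1}^{m} |w_j|$, a convex combination of L2 and L1, where $\rho$ is given by 1 - l1_ratio.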
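As a sketch, the three penalties can be computed as follows, assuming the usual definitions: half the sum of squared weights for L2, the sum of absolute weights for L1, and for the Elastic Net a mix in which the L2 term is weighted by 1 - l1_ratio and the L1 term by l1_ratio. The function names are illustrative.

```python
def l2_penalty(w):
    return 0.5 * sum(w_j ** 2 for w_j in w)


def l1_penalty(w):
    return sum(abs(w_j) for w_j in w)


def elastic_net_penalty(w, l1_ratio=0.15):
    rho = 1.0 - l1_ratio  # weight on the L2 term
    return rho * l2_penalty(w) + (1.0 - rho) * l1_penalty(w)


w = [3.0, -4.0]
print(l2_penalty(w))                # 12.5
print(l1_penalty(w))                # 7.0
print(elastic_net_penalty(w, 0.5))  # 9.75
```

Setting l1_ratio=0 recovers the L2 penalty and l1_ratio=1 recovers the L1 penalty.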

The figure below shows the contours of the different regularization terms in the parameter space when $R(w) = 1$.

Stochastic gradient descent is an optimization method for unconstrained optimization problems. In contrast to (batch) gradient descent, SGD approximates the true gradient of $E(w, b)$ by considering a single training example at a time.

The class SGDClassifier implements a first-order SGD learning routine. The algorithm iterates over the training examples and for each example updates the model parameters according to the update rule given by

$$w \leftarrow w - \eta \left( \alpha \frac{\partial R(w)}{\partial w} + \frac{\partial L(y_i, w^T x_i + b)}{\partial w} \right)$$

where $\eta$ is the learning rate which controls the step-size in the parameter space. The intercept $b$ is updated similarly but without regularization.
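The per-example update can be sketched in a few lines of plain Python. This is an illustrative toy for the hinge loss with an L2 penalty and a constant learning rate, not the library's Cython implementation; the function name sgd_step and the data are hypothetical.

```python
def sgd_step(w, b, x_i, y_i, eta, alpha):
    """One SGD update for hinge loss with an L2 penalty (y_i in {-1, +1})."""
    margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
    # Subgradient of the hinge loss w.r.t. f(x): -y_i if the margin is
    # violated, otherwise 0.
    dloss = -y_i if margin < 1.0 else 0.0
    # w <- w - eta * (alpha * dR/dw + dL/dw), with dR/dw = w for L2.
    w = [w_j - eta * (alpha * w_j + dloss * x_j) for w_j, x_j in zip(w, x_i)]
    # The intercept uses the same loss gradient but no regularization.
    b = b - eta * dloss
    return w, b


w, b = [0.0, 0.0], 0.0
data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1)]
for epoch in range(20):
    for x_i, y_i in data:
        w, b = sgd_step(w, b, x_i, y_i, eta=0.1, alpha=0.01)


def score(x):
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
```

After training, the sign of score(x) gives the predicted label for each toy example.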

The learning rate $\eta$ can be either constant or gradually decaying. For classification, the default learning rate schedule (learning_rate='optimal') is given by

$$\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}$$

where $t$ is the time step (there are a total of n_samples * n_iter time steps) and $t_0$ is determined based on a heuristic proposed by Léon Bottou such that the expected initial updates are comparable with the expected size of the weights (this assumes that the norm of the training samples is approx. 1). The exact definition can be found in _init_t in BaseSGD.
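A small sketch of this schedule, in which the learning rate at step t is 1 / (alpha * (t0 + t)). Here t0 is simply passed in as a given constant; the heuristic that computes it is not reproduced, and the values used are arbitrary.

```python
def optimal_learning_rate(t, alpha, t0):
    """eta(t) = 1 / (alpha * (t0 + t)); t0 is treated as a given constant."""
    return 1.0 / (alpha * (t0 + t))


# The rate shrinks as the time step grows.
rates = [optimal_learning_rate(t, alpha=1e-4, t0=100.0) for t in range(1, 6)]
```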

For regression, the default learning rate schedule is inverse scaling (learning_rate='invscaling'), given by

$$\eta^{(t)} = \frac{eta_0}{t^{power\_t}}$$

where $eta_0$ and $power\_t$ are hyperparameters chosen by the user via eta0 and power_t, respectively.
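The inverse scaling schedule divides eta0 by the time step raised to power_t. A minimal sketch follows; the default values shown are assumptions chosen for illustration.

```python
def invscaling_learning_rate(t, eta0=0.01, power_t=0.25):
    """eta(t) = eta0 / t ** power_t, for time steps t >= 1."""
    return eta0 / (t ** power_t)


first = invscaling_learning_rate(1)       # equals eta0
sixteenth = invscaling_learning_rate(16)  # 16 ** 0.25 == 2, so eta0 / 2
```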

For a constant learning rate use learning_rate='constant' and use eta0 to specify the learning rate.

For an adaptively decreasing learning rate, use learning_rate='adaptive' and use eta0 to specify the starting learning rate. When the stopping criterion is reached, the learning rate is divided by 5, and the algorithm does not stop. The algorithm stops when the learning rate goes below 1e-6.
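The adaptive rule can be simulated without any training. The sketch below only tracks the learning rate: each call to the loop body stands for one occasion on which the stopping criterion is reached. The function name and the starting value are hypothetical.

```python
def adaptive_schedule(eta0, max_stalls, factor=5.0, min_eta=1e-6):
    """Return the learning rates used until the rate drops below min_eta."""
    eta = eta0
    history = [eta]
    for _ in range(max_stalls):
        # The stopping criterion was reached: shrink the rate instead of stopping.
        eta /= factor
        if eta < min_eta:
            break  # training would stop here
        history.append(eta)
    return history


history = adaptive_schedule(eta0=0.1, max_stalls=20)
print(len(history))  # 8 distinct rates before falling below 1e-6
```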

The model parameters can be accessed through the members coef_ and intercept_: coef_ holds the weights $w$ and intercept_ holds the intercept $b$.

References:

T. Zhang - In Proceedings of ICML '04.
“Regularization and variable selection via the elastic net” H. Zou, T. Hastie - Journal of the Royal Statistical Society Series B, 67 (2), 301-320.
Xu, Wei

1.5.8. Implementation details

The implementation of SGD is influenced by the Stochastic Gradient SVM of Léon Bottou. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector, which allows an efficient weight update in the case of L2 regularization. In the case of sparse feature vectors, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently. Training examples are picked up sequentially and the learning rate is lowered after each observed example. We adopted the learning rate schedule from Shalev-Shwartz et al. 2007. For multi-class classification, a “one versus all” approach is used. We use the truncated gradient algorithm proposed by Tsuruoka et al. 2009 for L1 regularization (and the Elastic Net). The code is written in Cython.

References:

“Stochastic Gradient Descent” L. Bottou - Website, 2010.
L. Bottou - Website, 2011.
“Pegasos: Primal estimated sub-gradient solver for svm” S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07.