We’ll be using the SciKit-Learn Laboratory (or SKLL) package for this. If you’re not using the Data Science Toolbox, you can install SKLL using pip:
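$ pip install skll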
If you’re running Python 2.7, you also need to install the following packages:
$ pip install configparser futures logutils
SKLL expects the train and test data to have the same filenames, located in separate directories. In this example, however, we’re going to use cross-validation, which means we only need to specify a training data set. Cross-validation is a technique that splits the whole data set into a number of subsets, called folds (usually five or ten). The model is then trained and evaluated multiple times, each time holding out a different fold as the test set.
$ mkdir train
$ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/0,/id,/' > train/features.csv
Here, nl numbers each line starting at zero, so that every row gets a unique identifier, and sed renames the header of this new column to id.
Create a configuration file called predict-quality.cfg:
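The exact contents depend on your SKLL version, so treat the following as a minimal sketch. The experiment name, feature set, label column, and learners are taken from this example (they match the output filenames shown below); the Tuning settings and the train_location key (spelled train_directory in newer SKLL releases) are assumptions:

[General]
experiment_name = Wine
task = cross_validate

[Input]
train_location = train
featuresets = [["features.csv"]]
learners = ["LinearRegression", "RandomForestRegressor", "GradientBoostingRegressor"]
label_col = quality

[Tuning]
grid_search = false
objective = pearson

[Output]
log = output
results = output
predictions = output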
We run the experiment using the run_experiment command-line tool [cite:run_experiment]:
$ run_experiment -l predict-quality.cfg
The -l command-line argument indicates that we’re running in local mode. SKLL also offers the possibility to run experiments on clusters. The time it takes to run the experiment depends on the complexity of the chosen algorithms.
$ cd output
$ ls -1
Wine_features.csv_GradientBoostingRegressor.log
Wine_features.csv_GradientBoostingRegressor.predictions
Wine_features.csv_GradientBoostingRegressor.results
Wine_features.csv_GradientBoostingRegressor.results.json
Wine_features.csv_LinearRegression.log
Wine_features.csv_LinearRegression.predictions
Wine_features.csv_LinearRegression.results
Wine_features.csv_LinearRegression.results.json
Wine_features.csv_RandomForestRegressor.log
Wine_features.csv_RandomForestRegressor.predictions
Wine_features.csv_RandomForestRegressor.results
Wine_features.csv_RandomForestRegressor.results.json
Wine_summary.tsv
SKLL generates four files for each learner: one log, two with results, and one with predictions. Moreover, SKLL generates a summary file, which contains a lot of information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:
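One way to run it is with csvsql from csvkit. This is a sketch: it assumes the summary file is named Wine_summary.tsv and contains learner_name, fold, and pearson columns; check the header of your summary file, as these names may differ between SKLL versions:

$ < Wine_summary.tsv csvsql --tabs --query "SELECT learner_name, pearson "\
> "FROM stdin WHERE fold = 'average' ORDER BY pearson DESC" | csvlook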
The relevant column here is pearson, which contains the Pearson correlation coefficient. This is a value between -1 and 1 that indicates how strongly the predicted quality scores correlate with the true ones. Let’s paste all the predictions back into the data set:
$ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}"\
> ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
> RandomForestRegressor GradientBoostingRegressor LinearRegression
$ csvstack *Regres* -n learner --filenames > predictions.csv
And create a plot using Rio, piping the resulting image to display:
$ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), '\
> 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + '\
> 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + '\
> 'labs(x="quality", y="predicted quality")' | display