• The data type in Python terminology (see Table 7-1 for a comparison between Python and SQL data types).
    • Whether it has any missing values (nulls).
    • The number of unique values.
    • Various descriptive statistics (maximum, minimum, sum, mean, standard deviation, and median) for those features for which it is appropriate.

    We invoke csvstat as follows:

    This produces very verbose output. For more concise output, specify one of the statistics arguments:

    • --max (maximum)
    • --min (minimum)
    • --sum (sum)
    • --mean (mean)
    • --median (median)
    • --stdev (standard deviation)
    • --nulls (whether column contains nulls)
    • --unique (unique values)
    • --freq (frequent values)
    • --len (max value length)

    For example:

    $ csvstat data/datatypes.csv --nulls
      1. a: False
      2. b: True
      3. c: False
      4. d: False
      5. e: True
      6. f: True
      7. g: True
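
To make the idea behind the nulls check concrete, here is a minimal sketch in plain awk. The function name column_has_nulls is hypothetical, and the sketch only treats empty fields as missing. It assumes a simple CSV file with a header row and no quoted commas or embedded newlines, so csvstat's own rules may well differ.

```shell
# Hypothetical helper: report, per column, whether any data row is empty there.
# Assumes a header row and no quoted commas or embedded newlines.
column_has_nulls() {
  awk -F',' '
    # First line: remember the number of columns and their names.
    NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
    # Data lines: flag every column whose field is empty.
    { for (i = 1; i <= n; i++) if ($i == "") empty[i] = 1 }
    END {
      for (i = 1; i <= n; i++)
        printf "%d. %s: %s\n", i, name[i], (empty[i] ? "True" : "False")
    }
  ' "$1"
}
```

Calling column_has_nulls on a file prints one line per column in the same "index. name: True/False" layout as the example above.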

    You can select a subset of features with the -c command-line argument. This accepts both integers and column names:

    $ csvstat data/investments2.csv -c 2,13,19,24
      2. company_name
            <type 'unicode'>
            Nulls: True
            Unique values: 27324
            5 most frequent values:
                    Aviir: 13
                    Galectin Therapeutics: 12
                    Rostima: 12
                    Facebook: 11
            Max length: 66
     13. investor_country_code
            <type 'unicode'>
            Nulls: True
            Unique values: 111
            5 most frequent values:
                    USA: 20806
                    DEU: 946
                    CAN: 893
                    FRA: 737
            Max length: 15
     19. funding_round_code
            <type 'unicode'>
            Nulls: True
            Unique values: 15
            5 most frequent values:
                    a: 7529
                    b: 4776
                    c: 2452
                    d: 1042
                    e: 384
            Max length: 10
     24. raised_amount_usd
            <type 'int'>
            Nulls: True
            Min: 0
            Max: 3200000000
            Sum: 359891203117
            Mean: 10370010.1748
            Median: 3250000
            Standard Deviation: 38513119.1802
            Unique values: 6143
            5 most frequent values:
                    1000000: 1074
                    5000000: 1066
                    2000000: 875
                    3000000: 820
    Row count: 41799

    Please note that csvstat, just like csvsql, employs heuristics to determine the data type, and therefore may not always get it right. We encourage you to always do a manual inspection as discussed in the previous subsection. Moreover, even a correctly detected type, such as character string or integer, says nothing about how the feature should be used.

    As a nice extra, csvstat outputs, at the very end, the number of data points (rows). Newlines and commas inside values are handled correctly. To see only the relevant line, we can use tail:

    $ csvstat data/iris.csv | tail -n 1

    If you only want to see the actual number of data points, you can use, for example, the following expression to extract the number:

    $ csvstat data/iris.csv | sed -rne '${s/^([^:]+): ([0-9]+)$/\2/;p}'
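
As an alternative that does not rely on the -r option of sed, the same extraction can be sketched with awk. The function name row_count is hypothetical, and the sketch assumes that the output contains a line of the form "Row count: N", as in the investments example earlier.

```shell
# Hypothetical helper: read csvstat-style output on stdin and print the
# number that follows "Row count: ". Assumes that line looks like the
# one shown in the earlier example output.
row_count() {
  awk -F': ' '/^[[:space:]]*Row count: / { print $2 }'
}
```

It would be used as a filter, for example: csvstat data/iris.csv | row_count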

    7.3.2 Using R from the Command Line using Rio

    R is a very powerful statistical software package to analyze data and create visualizations. It’s an interpreted programming language, has an extensive collection of packages, and offers its own REPL (Read-Eval-Print-Loop), which allows you, similar to the command line, to play with your data. Unfortunately, R is quite separated from the command line. Once you start it, you’re in a separate environment. R doesn’t really play well with the command line because you cannot pipe any data into it and it also doesn’t support any one-liners that you can specify.

    For example, imagine that you have a CSV file called tips.csv, and you would like to compute the tip percentage and save the result. To accomplish this in R, you would first start up R:

    And then run the following commands:

    > tips <- read.csv('tips.csv', header = TRUE, sep = ',', stringsAsFactors = FALSE)
    > tips.percent <- tips$tip / tips$bill * 100
    > cat(tips.percent, sep = '\n', file = 'percent.csv')
    > q("no")

    Afterwards, you can continue with the saved file percent.csv on the command line. Note that there is only one command that is associated with what we want to accomplish specifically. The other commands are necessary boilerplate. Typing in this boilerplate in order to accomplish something simple is cumbersome and breaks your workflow. Sometimes, you only want to do one or two things at a time to your data. Wouldn’t it be great if we could harness the power of R and be able to use it from the command line?

    This is where Rio comes in. The name Rio stands for R input/output, because it enables you to use R as a filter on the command line. You simply pipe CSV data into Rio and you specify the R commands that you want to run on it. Let’s perform the same task as before, but now using Rio:

    $ < data/tips.csv Rio -e 'df$tip / df$bill * 100' | head -n 10

    Rio can execute multiple R commands that are separated by semicolons. So, if you wanted to add a column called percent to the input data, you could do the following:

    $ < data/tips.csv Rio -e 'df$percent <- df$tip / df$bill * 100; df' | head

    These small one-liners are possible because Rio takes care of all the boilerplate. Being able to use the command line for this and capture the power of R into a one-liner is fantastic, especially if you want to keep on working on the command line. Rio assumes that the input data is in CSV format with a header. (By specifying the -n command-line argument Rio does not consider the first row to be the header and creates default column names.) Behind the scenes, Rio writes the piped data to a temporary CSV file and creates a script that:

    • Imports required libraries.
    • Loads the CSV file as a data frame.
    • Generates a ggplot2 object if needed (more on this in the next section).
    • Runs the specified commands.
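
The steps above can be imitated with a few lines of shell. This is only a sketch of the idea, not Rio's actual source: the function names make_r_script and r_filter are hypothetical, library imports and the ggplot2 object are left out, and the result is printed in R's default format rather than converted to CSV.

```shell
# Hypothetical helper (not Rio's real source): print an R script that loads
# the CSV file at path $1 into a data frame named df and then evaluates the
# R commands given in $2.
make_r_script() {
  csv_path=$1
  commands=$2
  cat <<EOF
df <- read.csv('${csv_path}', header = TRUE, stringsAsFactors = FALSE)
${commands}
EOF
}

# Hypothetical wrapper: save stdin to a temporary CSV file, generate the
# script, run it with Rscript (requires R), and clean up afterwards.
r_filter() {
  tmp_csv=$(mktemp)
  tmp_script=$(mktemp)
  cat > "$tmp_csv"
  make_r_script "$tmp_csv" "$1" > "$tmp_script"
  Rscript "$tmp_script"
  status=$?
  rm -f "$tmp_csv" "$tmp_script"
  return "$status"
}
```

With R installed, a call such as r_filter 'mean(df$tip)' with data/tips.csv redirected to standard input would evaluate the expression against the piped data.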

    So now, if you wanted to do one or two things to your data set with R, you can specify it as a one-liner, and keep on working on the command line. All the knowledge that you already have about R can now be used from the command line. With Rio, you can even create sophisticated visualizations, as you will see later in this chapter.

    Rio doesn’t have to be used as a filter, meaning the output doesn’t have to be in CSV format per se. You can compute, for example, the mean, standard deviation, and sum of a feature:

    $ < data/iris.csv Rio -e 'mean(df$sepal_length)'
    $ < data/iris.csv Rio -e 'sd(df$sepal_length)'
    $ < data/iris.csv Rio -e 'sum(df$sepal_length)'

    If we wanted to compute the five summary statistics, we would do:

    You can also compute the skewness (symmetry of the distribution) and kurtosis (peakedness of the distribution), but then you need to have the moments package installed:

    $ < data/iris.csv Rio -e 'skewness(df$sepal_length)'
    $ < data/iris.csv Rio -e 'kurtosis(df$petal_width)'

    Correlation between two features:

    $ < tips.csv Rio -e 'cor(df$bill, df$tip)'
    0.6757341

    Or a correlation matrix:

    $ < data/tips.csv csvcut -c bill,tip | Rio -f cor | csvlook
    |--------------------+--------------------|
    |  bill              | tip                |
    |--------------------+--------------------|
    |  1                 | 0.675734109211365  |
    |  0.675734109211365 | 1                  |
    |--------------------+--------------------|

    Note that with the command-line argument -f, we can specify the function to apply to the data frame df. In this case, it is the same as -e 'cor(df)'.
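
For intuition about what the correlation measures, the Pearson coefficient (the default method of R's cor) can also be sketched directly in awk from running sums. The function name pearson is hypothetical, and the sketch assumes a header row, numeric columns, and no quoted commas.

```shell
# Hypothetical helper: Pearson correlation between two numeric columns of
# the CSV file $1, given as 1-based column indices $2 and $3.
# Assumes a header row and no quoted commas.
pearson() {
  awk -F',' -v a="$2" -v b="$3" '
    NR > 1 {
      x = $a; y = $b; n++
      sx += x; sy += y                           # running sums
      sxx += x * x; syy += y * y; sxy += x * y   # sums of squares and products
    }
    END {
      num = n * sxy - sx * sy
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      if (den == 0) { print "NaN"; exit }        # a constant column has no correlation
      printf "%.4f\n", num / den
    }
  ' "$1"
}
```

Assuming that bill and tip are the first two columns of the file, pearson data/tips.csv 1 2 would compute the same quantity as the Rio one-liner, rounded to four decimals.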

    You can even create a stem plot (Tukey 1977) using Rio: