Chapter 5 Scrubbing Data

    The data we obtained in Chapter 3 can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. Since most command-line tools operate on one format only, it is worthwhile to be able to convert data from one format to another.

    Once our data is in the format we want, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kinds of operations, as there exist many powerful command-line tools that are optimized for handling large amounts of data. Tools that we’ll discuss in this chapter include classic ones such as: (Ihnat, MacKenzie, and Meyering ) and sed (Fenlason et al. 2012), and newer ones such as (Dolan ) and csvgrep (Groskopf 2014).
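
    To make filtering and replacing concrete, here is a minimal sketch that uses only grep and sed on a tiny CSV snippet. The sample data and the variable names (data, filtered, scrubbed) are invented for illustration and do not come from the chapter's datasets. Merging is left out to keep the sketch short.

```shell
# Hypothetical sample data, invented for this sketch.
data='name,city,age
alice,NYC,31
bob,Boston,27
carol,NYC,45'

# Filter: keep the header line plus the rows whose city field is NYC.
filtered=$(printf '%s\n' "$data" | grep -E '^name,|,NYC,')

# Replace: spell out the abbreviation with sed.
scrubbed=$(printf '%s\n' "$filtered" | sed 's/,NYC,/,New York,/')

printf '%s\n' "$scrubbed"
```

    Matching on the literal text ,NYC, is a shortcut that works here only because the sample has no quoted fields or embedded commas. Real CSV files often do, which is one reason CSV-aware tools such as csvgrep are worth knowing.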

    If your data requires more functionality than is offered by (a combination of) these command-line tools, you can use csvsql. This is a new command-line tool that allows you to perform SQL queries directly on CSV files. And remember, if after reading this chapter you still need more flexibility, you’re free to use R, Python, or whatever programming language you prefer.