In Section 4.4, we saw that select
can take one or more source columns and put it into one or more target columns source => target
. Also to recap here’s an example: select(df, :name => :people_names)
.
In this section, we discuss how to transform variables, that is, how to modify data. In DataFrames.jl
, the syntax is source => transformation => target
.
Like before, we use the grades_2020
dataset:
Suppose we want to increase all the grades in grades_2020
by 1. First, we define a function that takes as argument a vector of data and returns all of its elements increased by 1. Then we use the transform
function from DataFrames.jl
that, like all native DataFrames.jl
’s functions, takes a DataFrame
as first argument followed by the transformation syntax:
transform(grades_2020(), :grade_2020 => plus_one)
name | grade_2020 | grade_2020_plus_one |
---|---|---|
Sally | 1.0 | 2.0 |
Bob | 5.0 | 6.0 |
Alice | 8.5 | 9.5 |
Hank | 4.0 | 5.0 |
Like we said above, the DataFrames.jl
minilanguage is always source => transformation => target
. So, if we want to keep the naming of the target
column in the output, we can do:
transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)
We can also use the keyword argument renamecols=false
:
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
The same transformation can also be written with select
as follows:
select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)
where the :
means “select all the columns” as described in Section . Alternatively, you can also use Julia’s broadcasting and modify the column by accessing it with df.grade_2020
:
df.grade_2020 = plus_one.(df.grade_2020)
df
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
To show how to transform two columns at the same time, we use the left joined data from Section :
With this, we can add a column saying whether someone was approved by the criterion that one of their grades was above 5.5:
pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
transform(leftjoined, [:grade_2020, :grade_2021] => pass; renamecols=false)
name | grade_2020 | grade_2021 | grade_2020_grade_2021 |
---|---|---|---|
Sally | 1.0 | 9.5 | true |
Hank | 4.0 | 6.0 | true |
Bob | 5.0 | missing | missing |
Alice | 8.5 | missing | true |
We can clean up the outcome and put the logic in a function to get a list of all the approved students:
function only_pass()
pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
leftjoined = transform(leftjoined, [:grade_2020, :grade_2021] => pass => :pass)
passed = subset(leftjoined, :pass; skipmissing=true)
return passed.name