Introduction to DataFrames

05_columns

DataFrame v1.2, Julia 1.6.1

In [17]:
using DataFrames, Random, Pipe, Chain

Manipulating columns of a DataFrame

Renaming columns

In [27]:
x = DataFrame(rand(Bool,3,4), :auto)
Out[27]:

3 rows × 4 columns

x1x2x3x4
BoolBoolBoolBool
10010
20101
31010
In [28]:
randstring(12)
Out[28]:
"oqdI9puUHCrm"
In [29]:
rename(x, :x1 => :A)
Out[29]:

3 rows × 4 columns

Ax2x3x4
BoolBoolBoolBool
10010
20101
31010
In [34]:
rename!(c->c^2,x)
Out[34]:

3 rows × 4 columns

x1x1x2x2x3x3x4x4
BoolBoolBoolBool
10010
20101
31010
In [35]:
rename(x,3=>:third)
Out[35]:

3 rows × 4 columns

x1x1x2x2thirdx4x4
BoolBoolBoolBool
10010
20101
31010
In [38]:
rename(x,[1,2].=>[:xx1,:xx2])
Out[38]:

3 rows × 4 columns

xx1xx2x3x3x4x4
BoolBoolBoolBool
10010
20101
31010
In [40]:
rename!(x,[:a,:b,:c,:d])
Out[40]:

3 rows × 4 columns

abcd
BoolBoolBoolBool
10010
20101
31010
In [41]:
rename!(x,string.('a':'d'))
Out[41]:

3 rows × 4 columns

abcd
BoolBoolBoolBool
10010
20101
31010
In [43]:
rename!(x,"a"=>"d","d"=>"a")
Out[43]:

3 rows × 4 columns

dbca
BoolBoolBoolBool
10010
20101
31010
In [44]:
rename(x,fill(:a,4))
ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] rename!(x::DataFrames.Index, nms::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/index.jl:51
 [2] #rename!#55
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:177 [inlined]
 [3] rename(df::DataFrame, vals::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:301
 [4] rename(df::DataFrame, vals::Vector{Symbol})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:301
 [5] top-level scope
   @ In[44]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [45]:
rename(x, fill(:a,4),makeunique=true)
Out[45]:

3 rows × 4 columns

aa_1a_2a_3
BoolBoolBoolBool
10010
20101
31010

Reordering columns

We can reorder the names(x) vector as needed, creating a new DataFrame

In [49]:
x
Out[49]:

3 rows × 4 columns

dbca
BoolBoolBoolBool
10010
20101
31010
In [53]:
Random.seed!(1234)
@pipe x |> names |> shuffle |> x[:, _]
Out[53]:

3 rows × 4 columns

bacd
BoolBoolBoolBool
10010
21100
30011
In [54]:
x
Out[54]:

3 rows × 4 columns

dbca
BoolBoolBoolBool
10010
20101
31010
In [55]:
select!(x,4:-1:1)
x
Out[55]:

3 rows × 4 columns

acbd
BoolBoolBoolBool
10100
21010
30101

Merging/adding columns

In [57]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4], :auto)
Out[57]:

3 rows × 4 columns

x1x2x3x4
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)
2(2, 1)(2, 2)(2, 3)(2, 4)
3(3, 1)(3, 2)(3, 3)(3, 4)

With hcat we can merge two DataFrames. Also [x,y] syntax is supported but only when DataFrames have unique column names.

In [58]:
hcat(x,x,makeunique=true)
Out[58]:

3 rows × 8 columns

x1x2x3x4x1_1x2_1x3_1x4_1
Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 1)(1, 2)(1, 3)(1, 4)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 1)(2, 2)(2, 3)(2, 4)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 1)(3, 2)(3, 3)(3, 4)

You can append a vector to a data frame with the following syntax.

In [59]:
y = [x DataFrame(A=[1,2,3])]
Out[59]:

3 rows × 5 columns

x1x2x3x4A
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3
In [87]:
y=[DataFrame(A=1:3) x]
Out[87]:

3 rows × 5 columns

Ax1x2x3x4
Int64Tuple…Tuple…Tuple…Tuple…
11(1, 1)(1, 2)(1, 3)(1, 4)
22(2, 1)(2, 2)(2, 3)(2, 4)
33(3, 1)(3, 2)(3, 3)(3, 4)

A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

In [88]:
using BenchmarkTools
In [89]:
@btime [$x[!,1:2] DataFrame(A=[1,2,3]) $x[!,3:4]]
  4.710 μs (87 allocations: 8.08 KiB)
Out[89]:

3 rows × 5 columns

x1x2Ax3x4
Tuple…Tuple…Int64Tuple…Tuple…
1(1, 1)(1, 2)1(1, 3)(1, 4)
2(2, 1)(2, 2)2(2, 3)(2, 4)
3(3, 1)(3, 2)3(3, 3)(3, 4)
In [90]:
@time [x[!,1:2] DataFrame(A=[1,2,3]) x[!,3:4]]
  0.000027 seconds (91 allocations: 8.203 KiB)
Out[90]:

3 rows × 5 columns

x1x2Ax3x4
Tuple…Tuple…Int64Tuple…Tuple…
1(1, 1)(1, 2)1(1, 3)(1, 4)
2(2, 1)(2, 2)2(2, 3)(2, 4)
3(3, 1)(3, 2)3(3, 3)(3, 4)

We could also do this with a specialized in place method insertcols!. Let's add ':newcol' to the DataFrame y.

In [91]:
insertcols!(y,2,"newcol"=>[1,2,3])
Out[91]:

3 rows × 6 columns

Anewcolx1x2x3x4
Int64Int64Tuple…Tuple…Tuple…Tuple…
111(1, 1)(1, 2)(1, 3)(1, 4)
222(2, 1)(2, 2)(2, 3)(2, 4)
333(3, 1)(3, 2)(3, 3)(3, 4)

If you want to insert the same column name several times makeunique=true is needed as usual.

In [92]:
insertcols!(y,2,:newcol => [1,2,3], makeunique=true)
Out[92]:

3 rows × 7 columns

Anewcol_1newcolx1x2x3x4
Int64Int64Int64Tuple…Tuple…Tuple…Tuple…
1111(1, 1)(1, 2)(1, 3)(1, 4)
2222(2, 1)(2, 2)(2, 3)(2, 4)
3333(3, 1)(3, 2)(3, 3)(3, 4)

We can see how much faster it is to insert a column with insertcol! than with hcat using @btime(note that we use here a Pair notation as example)

In [93]:
@btime insertcols!(copy($x), 3, :A=>[1,2,3])
  1.149 μs (19 allocations: 2.17 KiB)
Out[93]:

3 rows × 5 columns

x1x2Ax3x4
Tuple…Tuple…Int64Tuple…Tuple…
1(1, 1)(1, 2)1(1, 3)(1, 4)
2(2, 1)(2, 2)2(2, 3)(2, 4)
3(3, 1)(3, 2)3(3, 3)(3, 4)
In [95]:
insertcols!(x,:A=>[1,2,3])
Out[95]:

3 rows × 5 columns

x1x2x3x4A
Tuple…Tuple…Tuple…Tuple…Int64
1(1, 1)(1, 2)(1, 3)(1, 4)1
2(2, 1)(2, 2)(2, 3)(2, 4)2
3(3, 1)(3, 2)(3, 3)(3, 4)3
In [98]:
insertcols!(x,1,:B=>[1,2,3])
Out[98]:

3 rows × 6 columns

Bx1x2x3x4A
Int64Tuple…Tuple…Tuple…Tuple…Int64
11(1, 1)(1, 2)(1, 3)(1, 4)1
22(2, 1)(2, 2)(2, 3)(2, 4)2
33(3, 1)(3, 2)(3, 3)(3, 4)3

Note that insertcols! can be used to insert several columns to a data frame at once and that it performs broadcasting if needed:

In [124]:
@time df = DataFrame(:a=>[1,2,3])
  0.000017 seconds (18 allocations: 1.828 KiB)
Out[124]:

3 rows × 1 columns

a
Int64
11
22
33
In [125]:
insertcols!(df,:b=>"x",:c=>'a',:d=>Ref([1,2,3]))
Out[125]:

3 rows × 4 columns

abcd
Int64StringCharArray…
11xa[1, 2, 3]
22xa[1, 2, 3]
33xa[1, 2, 3]

Interestingly we can emulate hcat mutating the data frame in-place using insertcols!:

In [126]:
df1 = DataFrame(:a=>[1,2])
Out[126]:

2 rows × 1 columns

a
Int64
11
22
In [127]:
df2 = DataFrame(:b=>[2,3],:c=>[3,4])
Out[127]:

2 rows × 2 columns

bc
Int64Int64
123
234
In [128]:
hcat(df1,df2)
Out[128]:

2 rows × 3 columns

abc
Int64Int64Int64
1123
2234
In [129]:
df1 #df1 is not touched
Out[129]:

2 rows × 1 columns

a
Int64
11
22
In [149]:
@pipe eachcol(df2) |> pairs
Out[149]:
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Vector{Symbol})(...):
  :b => [2, 3]
  :c => [3, 4]
In [138]:
@pipe eachcol(df2) |> pairs |> insertcols!(df1,_...)
Out[138]:

2 rows × 3 columns

abc
Int64Int64Int64
1123
2234
In [151]:
df1 # now we have changed df1
Out[151]:

2 rows × 3 columns

abc
Int64Int64Int64
1123
2234

Subsetting/removing columns

Let's create a new DataFrame x and show a few way to create DataFrames with a subset of x's columns.

In [152]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)
Out[152]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)
In [180]:
@btime $x[!,[1,2,4,5]]
  999.100 ns (14 allocations: 1.59 KiB)
Out[180]:

3 rows × 4 columns

x1x2x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 4)(3, 5)
In [183]:
@btime $x[:,[1,2,4,5]] # use ! instead of : for non-copying operation
  1.138 μs (18 allocations: 2.09 KiB)
Out[183]:

3 rows × 4 columns

x1x2x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 4)(3, 5)
In [185]:
x[:,[:x1,:x4]]
Out[185]:

3 rows × 2 columns

x1x4
Tuple…Tuple…
1(1, 1)(1, 4)
2(2, 1)(2, 4)
3(3, 1)(3, 4)

We can also choose to keep or exclude columns by bool (we need a vector whose length is the number of columns in the original DataFrame).

In [186]:
x[:,[true,false,true,false,true]]
Out[186]:

3 rows × 3 columns

x1x3x5
Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 5)
2(2, 1)(2, 3)(2, 5)
3(3, 1)(3, 3)(3, 5)
In [194]:
x[:,[:x1]]
Out[194]:

3 rows × 1 columns

x1
Tuple…
1(1, 1)
2(2, 1)
3(3, 1)
In [196]:
x[:,:x1]
Out[196]:
3-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 1)
 (3, 1)
In [197]:
typeof(x[:,[:x1]]),typeof(x[:,:x1])
Out[197]:
(DataFrame, Vector{Tuple{Int64, Int64}})
In [202]:
x[!,:x1] === x.x1,x[:,:x1] === x.x1
Out[202]:
(true, false)
In [203]:
x[!,1] === x[!,:x1]
Out[203]:
true
In [207]:
x[!,[1]] == x[!,1],x[!,[1]] === x[!,1]
Out[207]:
(false, false)
In [212]:
typeof.([x[!,[1]] , x[!,1]])
Out[212]:
2-element Vector{DataType}:
 DataFrame
 Vector{Tuple{Int64, Int64}} (alias for Array{Tuple{Int64, Int64}, 1})

You can alse user Regex, All, Between and Not from IvertedIndies.jl for column selection:

In [219]:
x[!,r"[123]"]
Out[219]:

3 rows × 3 columns

x1x2x3
Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)
2(2, 1)(2, 2)(2, 3)
3(3, 1)(3, 2)(3, 3)
In [220]:
x[!,Not(1)]
Out[220]:

3 rows × 4 columns

x2x3x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 3)(3, 4)(3, 5)
In [221]:
x[!,Between(:x2,:x4)]
Out[221]:

3 rows × 3 columns

x2x3x4
Tuple…Tuple…Tuple…
1(1, 2)(1, 3)(1, 4)
2(2, 2)(2, 3)(2, 4)
3(3, 2)(3, 3)(3, 4)
In [222]:
x[!,Between(2,4)]
Out[222]:

3 rows × 3 columns

x2x3x4
Tuple…Tuple…Tuple…
1(1, 2)(1, 3)(1, 4)
2(2, 2)(2, 3)(2, 4)
3(3, 2)(3, 3)(3, 4)
In [223]:
x[!,Cols(:x1,Between(:x3,:x5))]
Out[223]:

3 rows × 4 columns

x1x3x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 3)(3, 4)(3, 5)
In [224]:
select(x,:x1,Between(:x3,:x5),copycols=false) # the same as above
Out[224]:

3 rows × 4 columns

x1x3x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 3)(3, 4)(3, 5)
In [225]:
df = copy(x)
Out[225]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)
In [226]:
df2 = select(df,[1,2])
Out[226]:

3 rows × 2 columns

x1x2
Tuple…Tuple…
1(1, 1)(1, 2)
2(2, 1)(2, 2)
3(3, 1)(3, 2)
In [227]:
select(df,Not([1,2]))
Out[227]:

3 rows × 3 columns

x3x4x5
Tuple…Tuple…Tuple…
1(1, 3)(1, 4)(1, 5)
2(2, 3)(2, 4)(2, 5)
3(3, 3)(3, 4)(3, 5)

By default `select` copies colums

In [239]:
df2[!,1] == df[!,1],df2[!,1] === df[!,1]
Out[239]:
(true, false)

This can be using `copycols=false` keyword argument

In [240]:
df2 = select(df,[1,2],copycols=false)
Out[240]:

3 rows × 2 columns

x1x2
Tuple…Tuple…
1(1, 1)(1, 2)
2(2, 1)(2, 2)
3(3, 1)(3, 2)
In [241]:
df2[!,1] === df[!,1]
Out[241]:
true
In [242]:
df
Out[242]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

using select! will modify the source data frame

In [243]:
select!(df,[1,2])
Out[243]:

3 rows × 2 columns

x1x2
Tuple…Tuple…
1(1, 1)(1, 2)
2(2, 1)(2, 2)
3(3, 1)(3, 2)
In [247]:
df == df2
Out[247]:
true

Here we create a copy of x and delete the 3rd column from the copy with select! and Not

In [248]:
z = copy(x)
select!(z,Not(3))
Out[248]:

3 rows × 4 columns

x1x2x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 4)(3, 5)

alternativerly we can archieve the same by using select function

In [249]:
select(x, Not(3))
Out[249]:

3 rows × 4 columns

x1x2x4x5
Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 4)(3, 5)

x stays unchanged

In [250]:
x
Out[250]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

Views

Note, that you can also create a view of a DataFrame when we want a sebset of its columns:

In [251]:
@btime x[:,[1,3,5]]
  979.900 ns (18 allocations: 2.00 KiB)
Out[251]:

3 rows × 3 columns

x1x3x5
Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 5)
2(2, 1)(2, 3)(2, 5)
3(3, 1)(3, 3)(3, 5)
In [252]:
@btime @view x[:,[1,3,5]]
  239.706 ns (3 allocations: 320 bytes)
Out[252]:

3 rows × 3 columns

x1x3x5
Tuple…Tuple…Tuple…
1(1, 1)(1, 3)(1, 5)
2(2, 1)(2, 3)(2, 5)
3(3, 1)(3, 3)(3, 5)

(now creation of the view is slow, but in the coming release of the DataFeames.jl package it will become significantly faster)

Modify column by name

In [253]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)
Out[253]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

With the following syntax, the existing column is modified without performing any copying (this is discouraged as it creates column alias)

In [255]:
x[!,:x1] = x[!,:x2]
x
Out[255]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 2)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)
In [256]:
x[1,:x2] = (11,2)
Out[256]:
(11, 2)
In [258]:
# x2의 첫번째 값을 변경한것이 x1의 첫번째 값도 같이 변경됨, 
# 위에서 복사시에 x2의 번지를 참조 하기 때문에 조심해야 함
x
Out[258]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(11, 2)(11, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)
In [259]:
x[1,:x2] = (1,2)
Out[259]:
(1, 2)

this syntax is safer

In [260]:
x[!,:x1] = x[:,:x2]
Out[260]:
3-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (2, 2)
 (3, 2)
In [261]:
x
Out[261]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 2)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)
In [262]:
x[1,:x2] = (11,2)
Out[262]:
(11, 2)
In [264]:
# x2의 첫번째 값을 변경 했지만  x1의 값은 바뀌지 않음
x
Out[264]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 2)(11, 2)(1, 3)(1, 4)(1, 5)
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)

We can also use the following syntax to add a new column at the end of a DataFrame

In [265]:
x[!,:A] = [1,2,3]
x
Out[265]:

3 rows × 6 columns

x1x2x3x4x5A
Tuple…Tuple…Tuple…Tuple…Tuple…Int64
1(1, 2)(11, 2)(1, 3)(1, 4)(1, 5)1
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)2
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)3

A new column name will be added to our DataFrame with the following syntax as well.

In [266]:
x.B = 11:13
x
Out[266]:

3 rows × 7 columns

x1x2x3x4x5AB
Tuple…Tuple…Tuple…Tuple…Tuple…Int64Int64
1(1, 2)(11, 2)(1, 3)(1, 4)(1, 5)111
2(2, 2)(2, 2)(2, 3)(2, 4)(2, 5)212
3(3, 2)(3, 2)(3, 3)(3, 4)(3, 5)313

Find column name

In [267]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)
Out[267]:

3 rows × 5 columns

x1x2x3x4x5
Tuple…Tuple…Tuple…Tuple…Tuple…
1(1, 1)(1, 2)(1, 3)(1, 4)(1, 5)
2(2, 1)(2, 2)(2, 3)(2, 4)(2, 5)
3(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)

We can check if a column with a given name exists via

In [270]:
hasproperty(x,:x1)
Out[270]:
true

and determine its index via

In [272]:
columnindex(x,:x2)
Out[272]:
2

Advanced ways of column selection

these are most useful for non-standard column names(e.g. containing spaces)

In [281]:
df = DataFrame()
df.x1 = 1:3
df."column 2" = 4:6
df[!,"column 3"] = 7:9
df
Out[281]:

3 rows × 3 columns

x1column 2column 3
Int64Int64Int64
1147
2258
3369
In [282]:
select!(df,Not("column 3"))
Out[282]:

3 rows × 2 columns

x1column 2
Int64Int64
114
225
336
In [283]:
df
Out[283]:

3 rows × 2 columns

x1column 2
Int64Int64
114
225
336
In [284]:
df."column 2"
Out[284]:
3-element Vector{Int64}:
 4
 5
 6
In [285]:
df[:,"column 2"]
Out[285]:
3-element Vector{Int64}:
 4
 5
 6

or you can interpolate column name using :() syntax

In [286]:
for n in names(df)
  println(n,"\n",df.:($n),"\n")
end
x1
[1, 2, 3]

column 2
[4, 5, 6]

Working on a collection of columns

When using eachcol of a data frame the resulting object retains reference to its parent and e.g. can be queried with getproperty

In [287]:
df = DataFrame(reshape(1:12,3,4), :auto)
Out[287]:

3 rows × 4 columns

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
In [288]:
ec_df = eachcol(df)
Out[288]:

3×4 DataFrameColumns

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
In [292]:
df[:,1]
Out[292]:
3-element Vector{Int64}:
 1
 2
 3
In [293]:
ec_df[1]
Out[293]:
3-element Vector{Int64}:
 1
 2
 3
In [294]:
ec_df.x1
Out[294]:
3-element Vector{Int64}:
 1
 2
 3

Transforming columns

The general rules are the following

  • select and transform always return the number of rows equal to the source data frame, while combine returns any number of rows(combine is allowed to combine rows of the source data frame)
  • transform retains columns from the old data frame
  • select! and transform! are in-place versions of select and transform
In [295]:
df = DataFrame(reshape(1:12,3,4),:auto)
Out[295]:

3 rows × 4 columns

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912

Here we add a new column :res that is a sum of columns :x1 and :x2. Ageneral syntax of transformations of this kind is:

`source_columns => function_to_apply => target_column_name`

then function_to_apply gets columns selected by source_column as positional arguments.

In [335]:
@btime transform(df,[:x1,:x2]=>(+)=>:res)
  28.484 μs (154 allocations: 8.84 KiB)
Out[335]:

3 rows × 5 columns

x1x2x3x4res
Int64Int64Int64Int64Int64
1147105
2258117
3369129

transform의 역할을 select를 통해 동일하게 규현 할 수 있음

In [338]:
@btime select(df,[1:ncol(df)]...,[:x1,:x2]=>(+)=>:res)
  29.593 μs (147 allocations: 8.59 KiB)
Out[338]:

3 rows × 5 columns

x1x2x3x4res
Int64Int64Int64Int64Int64
1147105
2258117
3369129

One can omit passing target_column_name in which case it is automatically generated

In [339]:
using Statistics
In [340]:
combine(df,[:x1,:x2] => Statistics.cor)
Out[340]:

1 rows × 1 columns

x1_x2_cor
Float64
11.0

Note that combine allowed the number of columns in the resulting data frame to be changed. If we used select instead it would automatically broadcast the return value to match the number of rows of the source:

In [342]:
#x1,x2 전체 컬럼에 대해 cor를 계산한 결과를 각 row에 값을 복사 한다.
select(df,[:x1,:x2]=>Statistics.cor)
Out[342]:

3 rows × 1 columns

x1_x2_cor
Float64
11.0
21.0
31.0

If you want to apply some function on each row of the source wrap it in ByRow:

In [372]:
@btime select(df, :x1,:x2,[:x1,:x2]=>ByRow(string))
  49.138 μs (238 allocations: 12.59 KiB)
Out[372]:

3 rows × 3 columns

x1x2x1_x2_string
Int64Int64String
11414
22525
33636
In [373]:
@btime select(df, :x1,:x2,AsTable([:x1,:x2])=>ByRow(x->string(x.x1,x.x2)))
  52.774 μs (262 allocations: 14.30 KiB)
Out[373]:

3 rows × 3 columns

x1x2x1_x2_function
Int64Int64String
11414
22525
33636
In [387]:
select(df, :x1,:x2,AsTable([:x1,:x2])=> (x->string(x.x1,x.x2)))
Out[387]:

3 rows × 3 columns

x1x2x1_x2_function
Int64Int64String
114[1, 2, 3][4, 5, 6]
225[1, 2, 3][4, 5, 6]
336[1, 2, 3][4, 5, 6]

Also if toy want columns to be passed as NamedTuple to a function (instead of being positional arguments) wrap them in AsTable:

In [355]:
@btime select(df, :x1,:x2, AsTable([:x1, :x2]) => (x->x.x1 + x.x2) => :x3 )
  52.082 μs (244 allocations: 13.77 KiB)
Out[355]:

3 rows × 3 columns

x1x2x3
Int64Int64Int64
1145
2257
3369
In [356]:
@btime select(df, :x1,:x2, [:x1, :x2] => ByRow(+) => :x3)
  48.071 μs (219 allocations: 12.02 KiB)
Out[356]:

3 rows × 3 columns

x1x2x3
Int64Int64Int64
1145
2257
3369

For simplicity (as this functionality is often needed) there is a special treat ment of nrow function that can be just passed as a transformation (without specifying of column selector):

In [389]:
select(df, :x1, nrow=>"number_of_rows")
Out[389]:

3 rows × 2 columns

x1number_of_rows
Int64Int64
113
223
333

(note that in select the number of rows is automatically broadcasted to match the number of rows of the source data frame)

cumsum example

In [394]:
select(df, :x1, [:x1]=>cumsum)
Out[394]:

3 rows × 2 columns

x1x1_cumsum
Int64Int64
111
223
336

Finally you can conveninently create multiple columns with one function, e.g.:

In [411]:
@btime select(df, :x1,:x1=>ByRow(x->[x ^ 2, x ^ 3]) => ["x1²","x1³"])
  54.415 μs (229 allocations: 13.78 KiB)
Out[411]:

3 rows × 3 columns

x1x1²x1³
Int64Int64Int64
1111
2248
33927
In [410]:
@btime select(df, :x1,:x1=>(x->DataFrame("x1²"=>x .^ 2,"x1³"=>x .^ 3))=>AsTable)
  40.864 μs (189 allocations: 11.31 KiB)
Out[410]:

3 rows × 3 columns

x1x1²x1³
Int64Int64Int64
1111
2248
33927
In [418]:
@pipe DataFrame(:x1=>[1,2,3], :x2=>[3,4,5]) |> AsTable
Out[418]:
AsTable(3×2 DataFrame
 Row  x1     x2    
      Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     3      5)
In [ ]: