Introduction to DataFrames¶

DataFrame v1.2, Julia 1.6.1

using DataFrames, Random, Pipe, Chain

Manipulating columns of a DataFrame¶

Renaming columns¶

x = DataFrame(rand(Bool,3,4), :auto)

randstring(12)

"oqdI9puUHCrm"

rename(x, :x1 => :A)

rename!(c->c^2,x)

rename(x,3=>:third)

rename(x,[1,2].=>[:xx1,:xx2])

rename!(x,[:a,:b,:c,:d])

rename!(x,string.('a':'d'))

rename!(x,"a"=>"d","d"=>"a")

rename(x,fill(:a,4))

ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] rename!(x::DataFrames.Index, nms::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/other/index.jl:51
 [2] #rename!#55
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:177 [inlined]
 [3] rename(df::DataFrame, vals::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:301
 [4] rename(df::DataFrame, vals::Vector{Symbol})
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:301
 [5] top-level scope
   @ In[44]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

rename(x, fill(:a,4),makeunique=true)

Reordering columns¶

We can reorder the names(x) vector as needed, creating a new DataFrame

x

Random.seed!(1234)
@pipe x |> names |> shuffle |> x[:, _]

x

select!(x,4:-1:1)
x

Merging/adding columns¶

x = DataFrame([(i,j) for i in 1:3, j in 1:4], :auto)

With hcat we can merge two DataFrames. Also [x,y] syntax is supported but only when DataFrames have unique column names.

hcat(x,x,makeunique=true)

You can append a vector to a data frame with the following syntax.

y = [x DataFrame(A=[1,2,3])]

y=[DataFrame(A=1:3) x]

A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

using BenchmarkTools

@btime [$x[!,1:2] DataFrame(A=[1,2,3]) $x[!,3:4]]

  4.710 μs (87 allocations: 8.08 KiB)

@time [x[!,1:2] DataFrame(A=[1,2,3]) x[!,3:4]]

  0.000027 seconds (91 allocations: 8.203 KiB)

We could also do this with a specialized in place method insertcols!. Let's add ':newcol' to the DataFrame y.

insertcols!(y,2,"newcol"=>[1,2,3])

If you want to insert the same column name several times makeunique=true is needed as usual.

insertcols!(y,2,:newcol => [1,2,3], makeunique=true)

We can see how much faster it is to insert a column with insertcol! than with hcat using @btime(note that we use here a Pair notation as example)

@btime insertcols!(copy($x), 3, :A=>[1,2,3])

  1.149 μs (19 allocations: 2.17 KiB)

insertcols!(x,:A=>[1,2,3])

insertcols!(x,1,:B=>[1,2,3])

Note that insertcols! can be used to insert several columns to a data frame at once and that it performs broadcasting if needed:

@time df = DataFrame(:a=>[1,2,3])

  0.000017 seconds (18 allocations: 1.828 KiB)

insertcols!(df,:b=>"x",:c=>'a',:d=>Ref([1,2,3]))

Interestingly we can emulate hcat mutating the data frame in-place using insertcols!:

df1 = DataFrame(:a=>[1,2])

df2 = DataFrame(:b=>[2,3],:c=>[3,4])

hcat(df1,df2)

df1 #df1 is not touched

@pipe eachcol(df2) |> pairs

Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Vector{Symbol})(...):
  :b => [2, 3]
  :c => [3, 4]

@pipe eachcol(df2) |> pairs |> insertcols!(df1,_...)

df1 # now we have changed df1

Subsetting/removing columns¶

Let's create a new DataFrame x and show a few way to create DataFrames with a subset of x's columns.

x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)

@btime $x[!,[1,2,4,5]]

  999.100 ns (14 allocations: 1.59 KiB)

@btime $x[:,[1,2,4,5]] # use ! instead of : for non-copying operation

  1.138 μs (18 allocations: 2.09 KiB)

x[:,[:x1,:x4]]

We can also choose to keep or exclude columns by bool (we need a vector whose length is the number of columns in the original DataFrame).

x[:,[true,false,true,false,true]]

x[:,[:x1]]

x[:,:x1]

3-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 1)
 (3, 1)

typeof(x[:,[:x1]]),typeof(x[:,:x1])

(DataFrame, Vector{Tuple{Int64, Int64}})

x[!,:x1] === x.x1,x[:,:x1] === x.x1

(true, false)

x[!,1] === x[!,:x1]

true

x[!,[1]] == x[!,1],x[!,[1]] === x[!,1]

(false, false)

typeof.([x[!,[1]] , x[!,1]])

2-element Vector{DataType}:
 DataFrame
 Vector{Tuple{Int64, Int64}} (alias for Array{Tuple{Int64, Int64}, 1})

You can alse user Regex, All, Between and Not from IvertedIndies.jl for column selection:

x[!,r"[123]"]

x[!,Not(1)]

x[!,Between(:x2,:x4)]

x[!,Between(2,4)]

x[!,Cols(:x1,Between(:x3,:x5))]

select(x,:x1,Between(:x3,:x5),copycols=false) # the same as above

df = copy(x)

df2 = select(df,[1,2])

select(df,Not([1,2]))

By default `select` copies colums

df2[!,1] == df[!,1],df2[!,1] === df[!,1]

(true, false)

This can be using `copycols=false` keyword argument

df2 = select(df,[1,2],copycols=false)

df2[!,1] === df[!,1]

true

df

using select! will modify the source data frame

select!(df,[1,2])

df == df2

true

Here we create a copy of x and delete the 3rd column from the copy with select! and Not

z = copy(x)
select!(z,Not(3))

alternativerly we can archieve the same by using select function

select(x, Not(3))

x stays unchanged

x

Views¶

Note, that you can also create a view of a DataFrame when we want a sebset of its columns:

@btime x[:,[1,3,5]]

  979.900 ns (18 allocations: 2.00 KiB)

@btime @view x[:,[1,3,5]]

  239.706 ns (3 allocations: 320 bytes)

(now creation of the view is slow, but in the coming release of the DataFeames.jl package it will become significantly faster)

Modify column by name¶

x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)

With the following syntax, the existing column is modified without performing any copying (this is discouraged as it creates column alias)

x[!,:x1] = x[!,:x2]
x

x[1,:x2] = (11,2)

(11, 2)

# x2의 첫번째 값을 변경한것이 x1의 첫번째 값도 같이 변경됨, 
# 위에서 복사시에 x2의 번지를 참조 하기 때문에 조심해야 함
x

x[1,:x2] = (1,2)

(1, 2)

this syntax is safer

x[!,:x1] = x[:,:x2]

3-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (2, 2)
 (3, 2)

x

x[1,:x2] = (11,2)

(11, 2)

# x2의 첫번째 값을 변경 했지만  x1의 값은 바뀌지 않음
x

We can also use the following syntax to add a new column at the end of a DataFrame

x[!,:A] = [1,2,3]
x

A new column name will be added to our DataFrame with the following syntax as well.

x.B = 11:13
x

Find column name¶

x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)

We can check if a column with a given name exists via

hasproperty(x,:x1)

true

and determine its index via

columnindex(x,:x2)

2

Advanced ways of column selection¶

these are most useful for non-standard column names(e.g. containing spaces)

df = DataFrame()
df.x1 = 1:3
df."column 2" = 4:6
df[!,"column 3"] = 7:9
df

select!(df,Not("column 3"))

df

df."column 2"

3-element Vector{Int64}:
 4
 5
 6

df[:,"column 2"]

3-element Vector{Int64}:
 4
 5
 6

or you can interpolate column name using :() syntax

for n in names(df)
  println(n,"\n",df.:($n),"\n")
end

x1
[1, 2, 3]

column 2
[4, 5, 6]

Working on a collection of columns¶

When using eachcol of a data frame the resulting object retains reference to its parent and e.g. can be queried with getproperty

df = DataFrame(reshape(1:12,3,4), :auto)

ec_df = eachcol(df)

df[:,1]

3-element Vector{Int64}:
 1
 2
 3

ec_df[1]

3-element Vector{Int64}:
 1
 2
 3

ec_df.x1

3-element Vector{Int64}:
 1
 2
 3

Transforming columns¶

The general rules are the following

select and transform always return the number of rows equal to the source data frame, while combine returns any number of rows(combine is allowed to combine rows of the source data frame)
transform retains columns from the old data frame
select! and transform! are in-place versions of select and transform

df = DataFrame(reshape(1:12,3,4),:auto)

Here we add a new column :res that is a sum of columns :x1 and :x2. Ageneral syntax of transformations of this kind is:

`source_columns => function_to_apply => target_column_name`

then function_to_apply gets columns selected by source_column as positional arguments.

@btime transform(df,[:x1,:x2]=>(+)=>:res)

  28.484 μs (154 allocations: 8.84 KiB)

transform의 역할을 select를 통해 동일하게 규현 할 수 있음

@btime select(df,[1:ncol(df)]...,[:x1,:x2]=>(+)=>:res)

  29.593 μs (147 allocations: 8.59 KiB)

One can omit passing target_column_name in which case it is automatically generated

using Statistics

combine(df,[:x1,:x2] => Statistics.cor)

Note that combine allowed the number of columns in the resulting data frame to be changed. If we used select instead it would automatically broadcast the return value to match the number of rows of the source:

#x1,x2 전체 컬럼에 대해 cor를 계산한 결과를 각 row에 값을 복사 한다.
select(df,[:x1,:x2]=>Statistics.cor)

If you want to apply some function on each row of the source wrap it in ByRow:

@btime select(df, :x1,:x2,[:x1,:x2]=>ByRow(string))

  49.138 μs (238 allocations: 12.59 KiB)

@btime select(df, :x1,:x2,AsTable([:x1,:x2])=>ByRow(x->string(x.x1,x.x2)))

  52.774 μs (262 allocations: 14.30 KiB)

select(df, :x1,:x2,AsTable([:x1,:x2])=> (x->string(x.x1,x.x2)))

Also if toy want columns to be passed as NamedTuple to a function (instead of being positional arguments) wrap them in AsTable:

@btime select(df, :x1,:x2, AsTable([:x1, :x2]) => (x->x.x1 + x.x2) => :x3 )

  52.082 μs (244 allocations: 13.77 KiB)

@btime select(df, :x1,:x2, [:x1, :x2] => ByRow(+) => :x3)

  48.071 μs (219 allocations: 12.02 KiB)

For simplicity (as this functionality is often needed) there is a special treat ment of nrow function that can be just passed as a transformation (without specifying of column selector):

select(df, :x1, nrow=>"number_of_rows")

(note that in select the number of rows is automatically broadcasted to match the number of rows of the source data frame)

cumsum example

select(df, :x1, [:x1]=>cumsum)

Finally you can conveninently create multiple columns with one function, e.g.:

@btime select(df, :x1,:x1=>ByRow(x->[x ^ 2, x ^ 3]) => ["x1²","x1³"])

  54.415 μs (229 allocations: 13.78 KiB)

@btime select(df, :x1,:x1=>(x->DataFrame("x1²"=>x .^ 2,"x1³"=>x .^ 3))=>AsTable)

  40.864 μs (189 allocations: 11.31 KiB)

@pipe DataFrame(:x1=>[1,2,3], :x2=>[3,4,5]) |> AsTable

AsTable(3×2 DataFrame
 Row │ x1     x2    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     3      5)

	x1	x2	x3	x4	x1_1	x2_1	x3_1	x4_1
	Tuple…	Tuple…	Tuple…	Tuple…	Tuple…	Tuple…	Tuple…	Tuple…
1	(1, 1)	(1, 2)	(1, 3)	(1, 4)	(1, 1)	(1, 2)	(1, 3)	(1, 4)
2	(2, 1)	(2, 2)	(2, 3)	(2, 4)	(2, 1)	(2, 2)	(2, 3)	(2, 4)
3	(3, 1)	(3, 2)	(3, 3)	(3, 4)	(3, 1)	(3, 2)	(3, 3)	(3, 4)

	x1	x2	x3	x4	x5
	Tuple…	Tuple…	Tuple…	Tuple…	Tuple…
1	(1, 1)	(1, 2)	(1, 3)	(1, 4)	(1, 5)
2	(2, 1)	(2, 2)	(2, 3)	(2, 4)	(2, 5)
3	(3, 1)	(3, 2)	(3, 3)	(3, 4)	(3, 5)

	x1	x3	x5
	Tuple…	Tuple…	Tuple…
1	(1, 1)	(1, 3)	(1, 5)
2	(2, 1)	(2, 3)	(2, 5)
3	(3, 1)	(3, 3)	(3, 5)

	x2	x3	x4	x5
	Tuple…	Tuple…	Tuple…	Tuple…
1	(1, 2)	(1, 3)	(1, 4)	(1, 5)
2	(2, 2)	(2, 3)	(2, 4)	(2, 5)
3	(3, 2)	(3, 3)	(3, 4)	(3, 5)

	x2	x3	x4
	Tuple…	Tuple…	Tuple…
1	(1, 2)	(1, 3)	(1, 4)
2	(2, 2)	(2, 3)	(2, 4)
3	(3, 2)	(3, 3)	(3, 4)

	A	x1	x2	x3	x4
	Int64	Tuple…	Tuple…	Tuple…	Tuple…
1	1	(1, 1)	(1, 2)	(1, 3)	(1, 4)
2	2	(2, 1)	(2, 2)	(2, 3)	(2, 4)
3	3	(3, 1)	(3, 2)	(3, 3)	(3, 4)

	a	b	c	d
	Int64	String	Char	Array…
1	1	x	a	[1, 2, 3]
2	2	x	a	[1, 2, 3]
3	3	x	a	[1, 2, 3]

	x1	x2	x3	x4
	Bool	Bool	Bool	Bool
1	0	0	1	0
2	0	1	0	1
3	1	0	1	0

	A	x2	x3	x4
	Bool	Bool	Bool	Bool
1	0	0	1	0
2	0	1	0	1
3	1	0	1	0

	x1x1	x2x2	x3x3	x4x4
	Bool	Bool	Bool	Bool
1	0	0	1	0
2	0	1	0	1
3	1	0	1	0

	x1x1	x2x2	third	x4x4
	Bool	Bool	Bool	Bool
1	0	0	1	0
2	0	1	0	1
3	1	0	1	0