Introduction to DataFrames

DataFrames v1.2, Julia 1.6.1

using DataFrames
using Statistics
using Random
using BenchmarkTools

Random.seed!(1)

MersenneTwister(1)

Manipulating rows of DataFrame

Selecting rows

df = DataFrame(rand(4,5), :auto)

using : as row selector will copy columns

df[:,:]

this is the same as

copy(df)

you can get a subset of rows of a data frame without copying by using view, which returns a SubDataFrame

sdf = view(df,1:3, 1:3)

you still have a detailed reference to the parent

parent(sdf)

parentindices(sdf)

(1:3, 1:3)
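To make the copy-versus-view distinction concrete, here is a minimal sketch on a small hand-made data frame (the column names a and b and their values are chosen only for illustration). Writing through the copy should leave the original alone, while writing through the view changes the parent.

```julia
using DataFrames

# hypothetical example data, not the random df used above
df = DataFrame(a = [1, 2, 3, 4], b = [5, 6, 7, 8])

cp = df[:, :]              # copies the columns
sdf = view(df, 1:3, 1:2)   # SubDataFrame, no copy

cp[1, :a] = 100            # only the copy changes
sdf[2, :a] = -1            # writes through to df

(df.a, cp.a)
```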

selecting a single row returns a DataFrameRow object, which is also a view: changing a value in the returned row changes the value in the original df

dfr = df[3,:]

size(dfr)

(5,)

# check that it is a view: both rows refer to the same row of the same parent
dfr === df[3,:]

true

# changing a value in dfr changes df
dfr[1] = 100
df

selecting two or more rows returns a copy

dfr2 = df[3:4,:]

# selecting two or more rows returns a copy,
# not a view
dfr2 === df[3:4,:]

false

# dfr2 was changed, but df is unchanged
dfr2[1,1]=200
df

parent(dfr)

parentindices(dfr)

(3, Base.OneTo(5))

rownumber(dfr)

3

ncol(dfr)

MethodError: no method matching ncol(::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
  ncol(::DataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:420
  ncol(::SubDataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:154

Stacktrace:
 [1] top-level scope
   @ In[60]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

nrow(dfr)

MethodError: no method matching nrow(::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
  nrow(::DataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:419
  nrow(::SubDataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:153

Stacktrace:
 [1] top-level scope
   @ In[61]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
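As the errors above show, ncol and nrow have no methods for a DataFrameRow in this version. One possible workaround, sketched here on a hypothetical three-column frame, is to use length or names on the row itself, or to query its parent.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9])
dfr = df[2, :]

n_in_row = length(dfr)            # number of columns in the row
cols_in_row = names(dfr)          # column names as strings
n_in_parent = ncol(parent(dfr))   # columns of the parent data frame
```

Note that the parent can have more columns than the row if the row was created with a column subset.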

df[!,:Z] .= 1

4-element Vector{Int64}:
 1
 1
 1
 1

df

Earlier we used : for column selection in a view (SubDataFrame and DataFrameRow). In this case the view will have all columns of the parent, including columns added after the view was created.

# the Z column added to df is reflected in dfr (because it is a view)
dfr

parent(dfr)

# Z was added, so the column count grew from 5 to 6
parentindices(dfr)

(3, Base.OneTo(6))

rownumber(dfr)

3
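The behaviour described above depends on the column selector used when the view was created. The following sketch (with made-up columns a, b and z) contrasts a view created with : against one created with an explicit column range; only the former is expected to pick up a column added later.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

all_cols = view(df, 1:2, :)      # created with : as column selector
fixed_cols = view(df, 1:2, 1:2)  # created with an explicit column range

df[!, :z] .= 0                   # mutate the parent by adding a column

(names(all_cols), names(fixed_cols))
```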

Note that parent and parentindices refer to the true source of data for a DataFrameRow, while rownumber refers to the row number in the direct object that was used to create the DataFrameRow

df = DataFrame(a=1:4)

dvf = view(df,[3,2],:)

typeof(dvf)

SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}

typeof(df[!,[1]])

DataFrame

# take a view into dvf; dvf is itself a view of df
dfr = dvf[2,:]

typeof(dfr)

DataFrameRow{DataFrame, DataFrames.Index}

# dfr is a view of dvf and dvf is a view of df; parent points to the original DataFrame behind the views
parent(dfr)

parentindices(dfr)

(2, Base.OneTo(1))

rownumber(dfr)

2
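In the example above the parent row index and rownumber happen to coincide. A small variation, with an illustrative row order of [4, 2, 1], makes the difference between the two visible.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(a = [10, 20, 30, 40])
dvf = view(df, [4, 2, 1], :)   # view with reordered rows
dfr = dvf[1, :]                # first row of the view, i.e. row 4 of df

row_in_view = rownumber(dfr)           # position in dvf
row_in_parent = parentindices(dfr)[1]  # position in df
```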

Reordering rows

We create a random data frame (and hope that x.x is not sorted, which is quite likely with 12 rows)

x = DataFrame(id=1:12, x=rand(12), y = [zeros(6); ones(6)])

[zeros(6)..., ones(6)...]

12-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

check if a DataFrame or a subset of its columns is sorted

issorted(x)

true

issorted(select(x,2,1,3)),issorted(select(x,3)),issorted(select(x,3,1,2)),
issorted(select(x,3,2,1))

(false, true, true, false)

issorted(x,:x)

false

sort!(x,:x)
x

now we create a new DataFrame

y = sort(x,:id)

here we sort by two columns, first is decreasing, second is increasing

@time sort(x,[:y,:x],rev=[true,false])

  0.000057 seconds (47 allocations: 4.109 KiB)

@time sort(x,[order(:y,rev=true),:x]) # the same as above

  0.000077 seconds (50 allocations: 4.031 KiB)
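The equivalence of the two calls above can be checked on a tiny frame whose expected order is easy to work out by hand. The columns g and v below are illustrative stand-ins for y and x.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(g = [1, 2, 1, 2], v = [0.3, 0.1, 0.2, 0.4])

s1 = sort(df, [:g, :v], rev = [true, false])   # g decreasing, v increasing
s2 = sort(df, [order(:g, rev = true), :v])     # same ordering via order

s1 == s2
```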

now we try some more fancy sorting stuff

# sort by x in reverse order
sort(x, [order(:y,rev=true), order(:x,by=v->-v)])

# x is sorted by the value of cos(pi/(1-0.19)*v)
sort(x, [order(:y,rev=true), order(:x,by=v->cos(pi/(1-0.19)*v))])

this is how you can reorder rows (here randomly); the index must cover all rows, otherwise the remaining rows are dropped

x[shuffle(1:nrow(x)),:]

it is also easy to swap rows using broadcasted assignment

sort!(x,:id)

x[[1,10],:] .= x[[10,1],:]
x
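Both operations above, permuting rows with an index vector and swapping rows with broadcasted assignment, can be sketched on a small frame. The seed 42 and the column names are arbitrary choices made for this example.

```julia
using DataFrames
using Random

# hypothetical example data
df = DataFrame(id = [1, 2, 3, 4, 5], v = ['a', 'b', 'c', 'd', 'e'])

# a permutation of all row indices keeps every row
perm = shuffle(MersenneTwister(42), 1:nrow(df))
shuffled = df[perm, :]
restored = sort(shuffled, :id)

# swap the first and last rows in place
df[[1, 5], :] .= df[[5, 1], :]
df.id
```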

Merging/adding rows

x = DataFrame(rand(3,5),:auto)

merge by rows: data frames must have the same column names; [x;x;x] is the same as vcat

@btime [x;x;x]

  19.096 μs (116 allocations: 8.11 KiB)

@btime vcat(x,x,x)

  18.754 μs (116 allocations: 8.11 KiB)

you can efficiently vcat a vector of DataFrames using reduce

@btime reduce(vcat,[x,x,x])

  19.256 μs (114 allocations: 8.17 KiB)

create y with the column names in reverse order

y = x[:,reverse(names(x))]

`vcat` is still possible as it does column name matching

vcat(x,y)

but column names must still match

vcat(x, y[:,1:3])

ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

Stacktrace:
 [1] _vcat(dfs::Vector{AbstractDataFrame}; cols::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1757
 [2] #reduce#124
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1677 [inlined]
 [3] #vcat#123
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1595 [inlined]
 [4] vcat(::DataFrame, ::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1595
 [5] top-level scope
   @ In[173]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

vcat(x,y[:,1:3],cols=:intersect)

vcat(x,y[:,1:3],cols=:union)

vcat(x,y[:,1:3],cols=[:x1,:x5])
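The effect of the cols keyword is easier to see on two small frames with partly overlapping columns. This is only a sketch; the frames a and b and their contents are invented for the example.

```julia
using DataFrames

# hypothetical example data with one shared column, x2
a = DataFrame(x1 = [1, 2], x2 = [3, 4])
b = DataFrame(x2 = [5], x3 = [6])

common = vcat(a, b, cols = :intersect)   # keep only columns present in both
everything = vcat(a, b, cols = :union)   # keep all columns, fill with missing

(names(common), names(everything))
```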

append!(x,x)
x

here column names must match exactly unless cols keyword argument is passed

append!(x,y)
x

standard repeat function works on rows; also inner and outer keyword arguments are accepted

repeat(x,2)

push! adds one row to x at the end; one must pass the correct number of values unless the cols keyword argument is passed

push!(x,1:5)
x

also works with dictionaries

push!(x,Dict(:x1=>11, :x2=>12, :x3=>13, :x4=>14, :x5=>15))

and NamedTuples via name matching

push!(x,(x2=2,x1=1,x4=4,x3=3,x5=5))

and DataFrameRow also via name matching

push!(x,x[1,:])

Please consult the documentation of push!, append! and vcat for allowed values of the cols keyword argument. This keyword argument governs the way these functions perform column matching of passed arguments. Also, append! and push! support a promote keyword argument that decides if column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

source = [(a=1, b=2),(a=missing,b=10,c=20),(b="s",c=1,d=1)]

3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)

df = DataFrame()

for row in source
  push!(df,row,cols=:union) # if cols is :union then promote is true by default
end

df

and we see that push! dynamically added columns as needed and updated their element types
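A smaller sketch of the same mechanism, on a hypothetical one-column frame, shows both effects at once: an existing Int column is promoted to hold a Float64 value, and a new column is created with missing for the earlier rows.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(a = [1, 2])

# with cols = :union, promotion is allowed by default
push!(df, (a = 3.5, b = "x"), cols = :union)

(eltype(df.a), df.b)
```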

Subsetting/removing rows

x = DataFrame(id=1:10, val='a':'j')

by using indexing

x[1:2,:]

x[1:2,:] |> typeof

DataFrame

a single row selection creates a `DataFrameRow`

x[1,:]

but this is a DataFrame

x[1:1,:]

x[1:1,:] |> typeof

DataFrame

the same as a view

v1 = view(x,1:2,:)

v1 |> typeof

SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}

selects columns 1 and 2

v2 = view(x,:,1:2)

v2 |> typeof

SubDataFrame{DataFrame, DataFrames.SubIndex{DataFrames.Index, UnitRange{Int64}, UnitRange{Int64}}, Base.OneTo{Int64}}

indexing by Bool, exact length match is required

# pick only the odd-numbered rows
df1 = x[repeat([true,false],5),:]

df1 |> typeof

DataFrame

alternatively we can also create a view

v3 = view(x, repeat([true,false],5),:)
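Instead of writing the Bool vector by hand, it can be computed from a column by broadcasting, which guarantees that its length matches the number of rows. A minimal sketch on made-up data:

```julia
using DataFrames

# hypothetical example data
df = DataFrame(id = [1, 2, 3, 4, 5, 6], val = ['a', 'b', 'c', 'd', 'e', 'f'])

mask = isodd.(df.id)           # one Bool per row
odd_copy = df[mask, :]         # new DataFrame
odd_view = view(df, mask, :)   # SubDataFrame over the same rows

odd_copy.id
```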

we can delete one row in place

x

delete!(x,7)
x

delete!(x, 6:7)
x

you can also create a new DataFrame when deleting rows using Not indexing

x[Not(1:2),:]

x
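The difference between the two styles is that Not indexing leaves the original untouched. A short sketch with an illustrative id column:

```julia
using DataFrames

# hypothetical example data
df = DataFrame(id = [1, 2, 3, 4, 5])

kept = df[Not([2, 4]), :]   # new DataFrame without rows 2 and 4

(kept.id, nrow(df))
```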

now we move to row filtering

x = DataFrame([1:4, 2:5, 3:6],:auto)

create a new DataFrame where filtering function operates on DataFrameRow

f1 = filter(r->r.x1 > 2.5,x)

f1 |> typeof

DataFrame

applying the view keyword argument to filter

f2 = filter(r->r.x1 > 2.5,x, view=true) # the same but as a view

f2 |> typeof

SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}

or

f2 = filter(:x1 => >(2.5),x)

f2 |> typeof

DataFrame

in place modification of x, an example with do-block syntax

f4 = filter!(x) do r
  if r.x1 > 2.5
    return r.x2 < 4.5
  end
  r.x3 < 3.5
end

f4 |> typeof

DataFrame
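The filtering styles above can be compared side by side. The sketch below uses a hypothetical two-column frame and also shows a predicate over two columns passed as positional arguments, which can replace a do-block that reads several fields of the row.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(x1 = [1, 2, 3, 4], x2 = [2, 3, 4, 5])

by_row = filter(r -> r.x1 > 2.5, df)   # predicate receives a DataFrameRow
by_col = filter(:x1 => >(2.5), df)     # predicate receives one value of x1
two_cols = filter([:x1, :x2] => (p, q) -> p > 2.5 && q < 4.5, df)

by_row == by_col
```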

A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

df = DataFrame(x=1:12, y=mod1.(1:12,4))

We select rows for which column y has value 1 or 4

@btime filter(row->row.y in [1,4],df)

  3.044 μs (28 allocations: 2.55 KiB)

@btime filter(:y => in([1,4]),df)

  1.489 μs (19 allocations: 1.53 KiB)

@btime df[in.(df.y,Ref([1,4])),:]

  1.497 μs (17 allocations: 1.50 KiB)
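The three variants differ in speed but should select the same rows. A quick consistency check on the same data, without the benchmarking macro:

```julia
using DataFrames

df = DataFrame(x = 1:12, y = mod1.(1:12, 4))

r1 = filter(row -> row.y in [1, 4], df)   # row-wise predicate
r2 = filter(:y => in([1, 4]), df)         # column => predicate
r3 = df[in.(df.y, Ref([1, 4])), :]        # Bool mask built by broadcasting

r1 == r2 == r3
```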

DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:

x = DataFrame([1:4, 2:5, 3:6], :auto)

s1 = subset(x, :x1=>x->x .< mean(x), :x2 => ByRow(<(4)))

s1 |> typeof

DataFrame
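Because subset passes whole columns to each function, its result can be reproduced with an explicit Bool mask that combines the conditions with .&. The sketch below assumes the same kind of data as above, with illustrative names.

```julia
using DataFrames
using Statistics

# hypothetical example data
df = DataFrame(x1 = [1, 2, 3, 4], x2 = [2, 3, 4, 5])

# first condition sees the whole column, second is applied row by row
s = subset(df, :x1 => c -> c .< mean(c), :x2 => ByRow(<(4)))

# the same selection written as an explicit mask
mask = (df.x1 .< mean(df.x1)) .& (df.x2 .< 4)
s == df[mask, :]
```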

Deduplicating

x = DataFrame(A=[1,2], B=["x","y"])
append!(x,x)
x.C = 1:4
x

get first unique rows for given index

unique(x)

unique(x,[1,2])

get indicators of non-unique rows

nonunique(x,:A)

4-element Vector{Bool}:
 0
 0
 1
 1

x

modify x in place

unique!(x, :B)

x
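The same calls on a frame written out explicitly, so that the expected results can be read off directly. The data below reproduces the shape of x before unique! was applied, under different variable names.

```julia
using DataFrames

# hypothetical example data: A and B repeat, C makes every full row distinct
df = DataFrame(A = [1, 2, 1, 2], B = ["x", "y", "x", "y"], C = [1, 2, 3, 4])

all_cols = unique(df)             # no full-row duplicates, nothing removed
by_ab = unique(df, [:A, :B])      # first row for each (A, B) pair
dup_flags = nonunique(df, :A)     # true where A was seen in an earlier row

(nrow(all_cols), by_ab.C, dup_flags)
```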

Extracting one row from a `DataFrame` into standard collections

x = DataFrame(x=[1,missing,2], y=["a","b",missing],z=[true,false,true])

cols = [:y,:z]

2-element Vector{Symbol}:
 :y
 :z

you can use a conversion to a Vector or an Array

Vector(x[1,cols])

2-element Vector{Any}:
     "a"
 true

now you will get a vector of vectors

xx = DataFrame(rand(2,3),:auto)
axes(xx,1),axes(xx,2),[axes(xx,1)...],[axes(xx,2)...]

(Base.OneTo(2), Base.OneTo(3), [1, 2], [1, 2, 3])

# axes(x,1) : row
# axes(x,2) : column
[Vector(x[i,cols]) for i in axes(x,1)]

3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

it is easy to convert a DataFrameRow into a NamedTuple

typeof(x[1,cols])

DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}

c1 = copy(x[1,cols])

NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))

n1 = NamedTuple(x[1,cols])

NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))

c1 == n1, c1===n1

(true, true)

c1.y

"a"

or a Tuple

Tuple(x[1,cols])

("a", true)
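Unlike the DataFrameRow itself, the converted collections are detached from the data frame. A minimal sketch on made-up data, showing that a later change to the parent is visible through the row view but not through the NamedTuple:

```julia
using DataFrames

# hypothetical example data
df = DataFrame(x = [1, 2], y = ["a", "b"])
row = df[1, :]

nt = NamedTuple(row)   # detached snapshot with names
tp = Tuple(row)        # detached snapshot without names
vc = Vector(row)       # Vector{Any} because the columns have different types

df[1, :x] = 99         # mutate the parent afterwards

(row.x, nt.x)
```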

Working with a collection of rows of a data frame

You can use eachrow to get a vector-like collection of DataFrameRows

df = DataFrame(reshape(1:12,3,4),:auto)

er_df = eachrow(df)

er_df |> typeof

DataFrames.DataFrameRows{DataFrame}

er_df[1]

er_df[1][2] === er_df[1].x2

true

last(er_df)

er_df[end]

As a DataFrameRows object keeps a connection to the parent data frame, you can get the columns of the parent using getproperty

er_df.x1

3-element Vector{Int64}:
 1
 2
 3

er_df.x1 === df.x1

true
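A typical use of eachrow is a loop or comprehension over rows. Since each element is a DataFrameRow, that is, a view, assignments inside the loop write back to the parent. The sketch below assumes a small hand-made frame.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6])

# read access: one value per row
totals = [r.x1 + r.x2 for r in eachrow(df)]

# write access: each r is a view, so df is modified
for r in eachrow(df)
    r.x2 = r.x1 * 10
end

(totals, df.x2)
```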

Flattening a data frame

Occasionally you have a data frame whose one column is a vector of collections. You can expand (flatten) such a column using the flatten function

df = DataFrame(a='a':'c', b=[[1,2,3],[4,5],6])

f1 = flatten(df, :b)

f1 |> typeof

DataFrame
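What flatten does to the other columns is easiest to see on a two-row example: values of the remaining columns are repeated once per element of the flattened collection. The id and tags columns are hypothetical.

```julia
using DataFrames

# hypothetical example data
df = DataFrame(id = [1, 2], tags = [["x", "y"], ["z"]])

flat = flatten(df, :tags)   # one output row per element of tags

(flat.id, flat.tags)
```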

Only one row

only from Julia Base is also supported in DataFrames.jl; it succeeds if the data frame has exactly one row, in which case that row is returned.

df = DataFrame(a=1)

only(df)

df2 = repeat(df,2)

only(df2)

ArgumentError: data frame must contain exactly 1 row

Stacktrace:
 [1] only(df::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:455
 [2] top-level scope
   @ In[351]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
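When the number of rows is not known in advance, the error can be handled with try/catch. This sketch mirrors the example above, with illustrative variable names.

```julia
using DataFrames

one_row = DataFrame(a = 1)
r = only(one_row)             # the single row of the data frame

two_rows = repeat(one_row, 2)
err = try
    only(two_rows)
    nothing
catch e
    e                         # keep the exception for inspection
end

err isa ArgumentError
```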

	x1	x2	x3	x4	x5
	Float64	Float64	Float64	Float64	Float64
1	0.0769509	0.751313	0.0856352	0.111981	0.455692
2	0.640396	0.644883	0.553206	0.976312	0.279395
3	0.873544	0.0778264	0.46335	0.0516146	0.178246
4	0.278582	0.848185	0.185821	0.53803	0.548983

	x1	x2	x3
	Float64	Float64	Float64
1	0.0769509	0.751313	0.0856352
2	0.640396	0.644883	0.553206
3	0.873544	0.0778264	0.46335

	id	x	y
	Int64	Float64	Float64
1	1	0.526443	0.0
2	2	0.465019	0.0
3	3	0.275519	0.0
4	4	0.461823	0.0
5	5	0.951861	0.0
6	6	0.288737	0.0
7	7	0.661232	1.0
8	8	0.194568	1.0
9	9	0.393193	1.0
10	10	0.990741	1.0
11	11	0.550334	1.0
12	12	0.580782	1.0

	x1	x2	x3	x4	x5
	Float64?	Float64?	Float64	Float64	Float64
1	0.309144	0.230063	0.762276	0.456446	0.114529
2	0.170391	0.0929292	0.339081	0.739918	0.748928
3	0.147162	0.681415	0.138763	0.816004	0.878108
4	missing	missing	0.762276	0.456446	0.114529
5	missing	missing	0.339081	0.739918	0.748928
6	missing	missing	0.138763	0.816004	0.878108

	a	b	c	d
	Int64?	Any	Int64?	Int64?
1	1	2	missing	missing
2	missing	10	20	missing
3	missing	s	1	1