Introduction to DataFrames

06_rows

DataFrame v1.2, Julia 1.6.1

In [166]:
using DataFrames
using Statistics
using Random
using BenchmarkTools
In [2]:
Random.seed!(1)
Out[2]:
MersenneTwister(1)

Manipulating rows of DataFrame

Selecting rows

In [23]:
df = DataFrame(rand(4,5), :auto)
Out[23]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
30.8735440.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983

using : as row selector will copy colums

In [24]:
df[:,:]
Out[24]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
30.8735440.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983

this is the same as

In [25]:
copy(df)
Out[25]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
30.8735440.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983

you can get a subset of rows of a data frame without copying using view to get a SubDataFrame

In [26]:
sdf = view(df,1:3, 1:3)
Out[26]:

3 rows × 3 columns

x1x2x3
Float64Float64Float64
10.07695090.7513130.0856352
20.6403960.6448830.553206
30.8735440.07782640.46335

you still have a detailed reference to the parent

In [27]:
parent(sdf)
Out[27]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
30.8735440.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983
In [28]:
parentindices(sdf)
Out[28]:
(1:3, 1:3)

selecting a single row returns a DataFrameRow object which is also a view

하나의 row를 선택하는 경우 view를 돌려 준다. 리턴된 row 의 값을 변경하면 원래 df의 값이 변경된다.

In [29]:
dfr = df[3,:]
Out[29]:

DataFrameRow (5 columns)

x1x2x3x4x5
Float64Float64Float64Float64Float64
30.8735440.07782640.463350.05161460.178246
In [30]:
size(dfr)
Out[30]:
(5,)
In [34]:
# view임을 확인
dfr === df[3,:]
Out[34]:
true
In [36]:
# dfr 값 변경시 df 값이 바뀜
dfr[1] = 100
df
Out[36]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
3100.00.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983

두개이상의 row를 선택하는 경우 복사 값을 돌려 준다

In [37]:
dfr2 = df[3:4,:]
Out[37]:

2 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
1100.00.07782640.463350.05161460.178246
20.2785820.8481850.1858210.538030.548983
In [38]:
# 두개 이상의 row 선택시 복사 값을 돌려 준다.
# view가 아님
dfr2 === df[3:4,:]
Out[38]:
false
In [40]:
# dfr2 값이 변경되었지만 df값은 변경 되지 않음
dfr2[1,1]=200
df
Out[40]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
3100.00.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983
In [51]:
parent(dfr)
Out[51]:

4 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.07695090.7513130.08563520.1119810.455692
20.6403960.6448830.5532060.9763120.279395
3100.00.07782640.463350.05161460.178246
40.2785820.8481850.1858210.538030.548983
In [52]:
parentindices(dfr)
Out[52]:
(3, Base.OneTo(5))
In [57]:
rownumber(dfr)
Out[57]:
3
In [60]:
ncol(dfr)
MethodError: no method matching ncol(::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
  ncol(::DataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:420
  ncol(::SubDataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:154

Stacktrace:
 [1] top-level scope
   @ In[60]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [61]:
nrow(dfr)
MethodError: no method matching nrow(::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
  nrow(::DataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/dataframe/dataframe.jl:419
  nrow(::SubDataFrame) at /home/shpark/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:153

Stacktrace:
 [1] top-level scope
   @ In[61]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [63]:
df[!,:Z] .= 1
Out[63]:
4-element Vector{Int64}:
 1
 1
 1
 1
In [64]:
df
Out[64]:

4 rows × 6 columns

x1x2x3x4x5Z
Float64Float64Float64Float64Float64Int64
10.07695090.7513130.08563520.1119810.4556921
20.6403960.6448830.5532060.9763120.2793951
3100.00.07782640.463350.05161460.1782461
40.2785820.8481850.1858210.538030.5489831

Earlier we used : for column selection in a view (SubDataFrame and DataFrameRow). In this case a view will have all columns of the parent after the parent is mutated

In [67]:
# df의row에 추가된 Z column 내역이 dfr에 반영됨 (view이기 때문)
dfr
Out[67]:

DataFrameRow (6 columns)

x1x2x3x4x5Z
Float64Float64Float64Float64Float64Int64
3100.00.07782640.463350.05161460.1782461
In [68]:
parent(dfr)
Out[68]:

4 rows × 6 columns

x1x2x3x4x5Z
Float64Float64Float64Float64Float64Int64
10.07695090.7513130.08563520.1119810.4556921
20.6403960.6448830.5532060.9763120.2793951
3100.00.07782640.463350.05161460.1782461
40.2785820.8481850.1858210.538030.5489831
In [70]:
# Z가 추가 되어 기존 5에서 6으로 증가 됨
parentindices(dfr)
Out[70]:
(3, Base.OneTo(6))
In [71]:
rownumber(dfr)
Out[71]:
3

Note that parent and parentindices refer to the true source of data for DataFrameRow and rownumber refer to row number in the direct object that was used to create DataFrameRow

In [72]:
df = DataFrame(a=1:4)
Out[72]:

4 rows × 1 columns

a
Int64
11
22
33
44
In [73]:
dvf = view(df,[3,2],:)
Out[73]:

2 rows × 1 columns

a
Int64
13
22
In [74]:
typeof(dvf)
Out[74]:
SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}
In [82]:
typeof(df[!,[1]])
Out[82]:
DataFrame
In [92]:
# dvf의 뷰를 가진다, dvf는 df의 뷰
dfr = dvf[2,:]
Out[92]:

DataFrameRow (1 columns)

a
Int64
22
In [93]:
typeof(dfr)
Out[93]:
DataFrameRow{DataFrame, DataFrames.Index}
In [94]:
# dfr은 dfv의 뷰이고 dfv는  df의 뷰로써 parent는 뷰의 원본 DataFrame을 가리킨다
parent(dfr)
Out[94]:

4 rows × 1 columns

a
Int64
11
22
33
44
In [95]:
parentindices(dfr)
Out[95]:
(2, Base.OneTo(1))
In [96]:
rownumber(dfr)
Out[96]:
2

Reordering rows

We create some random data frame (and hope that x.x is not sorted:), which is quite likely with 12 rows)

In [99]:
x = DataFrame(id=1:12, x=rand(12), y = [zeros(6); ones(6)])
Out[99]:

12 rows × 3 columns

idxy
Int64Float64Float64
110.5264430.0
220.4650190.0
330.2755190.0
440.4618230.0
550.9518610.0
660.2887370.0
770.6612321.0
880.1945681.0
990.3931931.0
10100.9907411.0
11110.5503341.0
12120.5807821.0
In [105]:
[zeros(6)..., ones(6)...]
Out[105]:
12-element Vector{Float64}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

check if a DataFrame or a subset of its column is sorted

In [106]:
issorted(x)
Out[106]:
true
In [119]:
issorted(select(x,2,1,3)),issorted(select(x,3)),issorted(select(x,3,1,2)),
issorted(select(x,3,2,1))
Out[119]:
(false, true, true, false)
In [120]:
issorted(x,:x)
Out[120]:
false
In [122]:
sort!(x,:x)
x
Out[122]:

12 rows × 3 columns

idxy
Int64Float64Float64
180.1945681.0
230.2755190.0
360.2887370.0
490.3931931.0
540.4618230.0
620.4650190.0
710.5264430.0
8110.5503341.0
9120.5807821.0
1070.6612321.0
1150.9518610.0
12100.9907411.0

now we create a new DataFrame

In [123]:
y = sort(x,:id)
Out[123]:

12 rows × 3 columns

idxy
Int64Float64Float64
110.5264430.0
220.4650190.0
330.2755190.0
440.4618230.0
550.9518610.0
660.2887370.0
770.6612321.0
880.1945681.0
990.3931931.0
10100.9907411.0
11110.5503341.0
12120.5807821.0

here we sort by two columns, first is decreasing, second is increasing

In [125]:
@time sort(x,[:y,:x],rev=[true,false])
  0.000057 seconds (47 allocations: 4.109 KiB)
Out[125]:

12 rows × 3 columns

idxy
Int64Float64Float64
180.1945681.0
290.3931931.0
3110.5503341.0
4120.5807821.0
570.6612321.0
6100.9907411.0
730.2755190.0
860.2887370.0
940.4618230.0
1020.4650190.0
1110.5264430.0
1250.9518610.0
In [127]:
@time sort(x,[order(:y,rev=true),:x]) # the same as above
  0.000077 seconds (50 allocations: 4.031 KiB)
Out[127]:

12 rows × 3 columns

idxy
Int64Float64Float64
180.1945681.0
290.3931931.0
3110.5503341.0
4120.5807821.0
570.6612321.0
6100.9907411.0
730.2755190.0
860.2887370.0
940.4618230.0
1020.4650190.0
1110.5264430.0
1250.9518610.0

now we try some more fancy sorting stuff

In [146]:
# x값의 역순으로 정렬
sort(x, [order(:y,rev=true), order(:x,by=v->-v)])
Out[146]:

12 rows × 3 columns

idxy
Int64Float64Float64
1100.9907411.0
270.6612321.0
3120.5807821.0
4110.5503341.0
590.3931931.0
680.1945681.0
750.9518610.0
810.5264430.0
920.4650190.0
1040.4618230.0
1160.2887370.0
1230.2755190.0
In [144]:
# x 값은 cos(pi/(1-0.19)*v) 계산 결과에 따라 sorting 된다.
sort(x, [order(:y,rev=true), order(:x,by=v->cos(pi/(1-0.19)*v))])
Out[144]:

12 rows × 3 columns

idxy
Int64Float64Float64
170.6612321.0
2100.9907411.0
3120.5807821.0
4110.5503341.0
590.3931931.0
680.1945681.0
750.9518610.0
810.5264430.0
920.4650190.0
1040.4618230.0
1160.2887370.0
1230.2755190.0

this is how you can reorder rows(here randomly)

In [147]:
x[shuffle(1:10),:]
Out[147]:

10 rows × 3 columns

idxy
Int64Float64Float64
1120.5807821.0
240.4618230.0
330.2755190.0
470.6612321.0
580.1945681.0
690.3931931.0
760.2887370.0
810.5264430.0
9110.5503341.0
1020.4650190.0

it is also easy to swap rows using broadcasted assignment

In [148]:
sort!(x,:id)
Out[148]:

12 rows × 3 columns

idxy
Int64Float64Float64
110.5264430.0
220.4650190.0
330.2755190.0
440.4618230.0
550.9518610.0
660.2887370.0
770.6612321.0
880.1945681.0
990.3931931.0
10100.9907411.0
11110.5503341.0
12120.5807821.0
In [153]:
x[[1,10],:] .= x[[10,1],:]
x
Out[153]:

12 rows × 3 columns

idxy
Int64Float64Float64
1100.9907411.0
220.4650190.0
330.2755190.0
440.4618230.0
550.9518610.0
660.2887370.0
770.6612321.0
880.1945681.0
990.3931931.0
1010.5264430.0
11110.5503341.0
12120.5807821.0

Merging/adding rows

In [154]:
x = DataFrame(rand(3,5),:auto)
Out[154]:

3 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108

merge by rows - data frames must have the same column names; the same is vcat

In [167]:
@btime [x;x;x]
  19.096 μs (116 allocations: 8.11 KiB)
Out[167]:

9 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
In [168]:
@btime vcat(x,x,x)
  18.754 μs (116 allocations: 8.11 KiB)
Out[168]:

9 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108

you can efficiently vcat a vector of DataFrames using reduce

In [169]:
@btime reduce(vcat,[x,x,x])
  19.256 μs (114 allocations: 8.17 KiB)
Out[169]:

9 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108

get y with other order of names

In [170]:
y = x[:,reverse(names(x))]
Out[170]:

3 rows × 5 columns

x5x4x3x2x1
Float64Float64Float64Float64Float64
10.1145290.4564460.7622760.2300630.309144
20.7489280.7399180.3390810.09292920.170391
30.8781080.8160040.1387630.6814150.147162

`vcat` is still possible as it does dolumn name matching

In [172]:
vcat(x,y)
Out[172]:

6 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108

but column names must still match

In [173]:
vcat(x, y[:,1:3])
ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

Stacktrace:
 [1] _vcat(dfs::Vector{AbstractDataFrame}; cols::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1757
 [2] #reduce#124
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1677 [inlined]
 [3] #vcat#123
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1595 [inlined]
 [4] vcat(::DataFrame, ::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:1595
 [5] top-level scope
   @ In[173]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [174]:
vcat(x,y[:,1:3],cols=:intersect)
Out[174]:

6 rows × 3 columns

x3x4x5
Float64Float64Float64
10.7622760.4564460.114529
20.3390810.7399180.748928
30.1387630.8160040.878108
40.7622760.4564460.114529
50.3390810.7399180.748928
60.1387630.8160040.878108
In [175]:
vcat(x,y[:,1:3],cols=:union)
Out[175]:

6 rows × 5 columns

x1x2x3x4x5
Float64?Float64?Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
4missingmissing0.7622760.4564460.114529
5missingmissing0.3390810.7399180.748928
6missingmissing0.1387630.8160040.878108
In [176]:
vcat(x,y[:,1:3],cols=[:x1,:x5])
Out[176]:

6 rows × 2 columns

x1x5
Float64?Float64
10.3091440.114529
20.1703910.748928
30.1471620.878108
4missing0.114529
5missing0.748928
6missing0.878108
In [177]:
append!(x,x)
x
Out[177]:

6 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108

here column names must match exactly unless cols keyword argument is passed

In [178]:
append!(x,y)
x
Out[178]:

9 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108

standard repeat function works on rowsl also inner and outer keyword arguments are accepted

In [180]:
repeat(x,2)
Out[180]:

18 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
100.3091440.2300630.7622760.4564460.114529
110.1703910.09292920.3390810.7399180.748928
120.1471620.6814150.1387630.8160040.878108
130.3091440.2300630.7622760.4564460.114529
140.1703910.09292920.3390810.7399180.748928
150.1471620.6814150.1387630.8160040.878108
160.3091440.2300630.7622760.4564460.114529
170.1703910.09292920.3390810.7399180.748928
180.1471620.6814150.1387630.8160040.878108

push! add one row to x at the end, one must pass a correct number of values unless cols keyword argument is passed

In [181]:
push!(x,1:5)
x
Out[181]:

10 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
101.02.03.04.05.0

also works with dictionaries

In [182]:
push!(x,Dict(:x1=>11, :x2=>12, :x3=>13, :x4=>14, :x5=>15))
Out[182]:

11 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
101.02.03.04.05.0
1111.012.013.014.015.0

and NamedTuples via name matching

In [183]:
push!(x,(x2=2,x1=1,x4=4,x3=3,x5=5))
Out[183]:

12 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
101.02.03.04.05.0
1111.012.013.014.015.0
121.02.03.04.05.0

and DataFrameRow also via name matching

In [185]:
push!(x,x[1,:])
Out[185]:

13 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.3091440.2300630.7622760.4564460.114529
20.1703910.09292920.3390810.7399180.748928
30.1471620.6814150.1387630.8160040.878108
40.3091440.2300630.7622760.4564460.114529
50.1703910.09292920.3390810.7399180.748928
60.1471620.6814150.1387630.8160040.878108
70.3091440.2300630.7622760.4564460.114529
80.1703910.09292920.3390810.7399180.748928
90.1471620.6814150.1387630.8160040.878108
101.02.03.04.05.0
1111.012.013.014.015.0
121.02.03.04.05.0
130.3091440.2300630.7622760.4564460.114529

Please consult the documentation of push!,append! and vcat for allowed values of cols keyword argument. This keyword argument governs the way these functions perform column matching of passed arguments. Also append! and push! support a promote keyword argument that decides if column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

In [188]:
source = [(a=1, b=2),(a=missing,b=10,c=20),(b="s",c=1,d=1)]
Out[188]:
3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)
In [189]:
df = DataFrame()
Out[189]:

0 rows × 0 columns

In [190]:
for row in source
  push!(df,row,cols=:union) # if cols is :union then promote is true by default
end
In [191]:
df
Out[191]:

3 rows × 4 columns

abcd
Int64?AnyInt64?Int64?
112missingmissing
2missing1020missing
3missings11

and we see that push! dynamically added columns as needed and updated their element types

Subsetting/removing rows

In [192]:
x = DataFrame(id=1:10, val='a':'j')
Out[192]:

10 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j

by using indexing

In [193]:
x[1:2,:]
Out[193]:

2 rows × 2 columns

idval
Int64Char
11a
22b
In [194]:
x[1:2,:] |> typeof
Out[194]:
DataFrame

a single row selection creates a `DataFrameRow`

In [195]:
x[1,:]
Out[195]:

DataFrameRow (2 columns)

idval
Int64Char
11a

but this is a DataFrame

In [196]:
x[1:1,:]
Out[196]:

1 rows × 2 columns

idval
Int64Char
11a
In [197]:
x[1:1,:] |> typeof
Out[197]:
DataFrame

the same a view

In [198]:
v1 = view(x,1:2,:)
Out[198]:

2 rows × 2 columns

idval
Int64Char
11a
22b
In [200]:
v1 |> typeof
Out[200]:
SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}

selects column 1 and 2

In [201]:
v2 = view(x,:,1:2)
Out[201]:

10 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j
In [202]:
v2 |> typeof
Out[202]:
SubDataFrame{DataFrame, DataFrames.SubIndex{DataFrames.Index, UnitRange{Int64}, UnitRange{Int64}}, Base.OneTo{Int64}}

indexing by Bool, exact length math is required

In [208]:
# 홀수 번째 row만 뽑아 내기
df1 = x[repeat([true,false],5),:]
Out[208]:

5 rows × 2 columns

idval
Int64Char
11a
23c
35e
47g
59i
In [209]:
df1 |> typeof
Out[209]:
DataFrame

alternatively we can also create a view

In [210]:
v3 = view(x, repeat([true,false],5),:)
Out[210]:

5 rows × 2 columns

idval
Int64Char
11a
23c
35e
47g
59i

we can delete one row in place

In [212]:
x
Out[212]:

10 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j
In [213]:
delete!(x,7)
x
Out[213]:

9 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
66f
78h
89i
910j
or a collection of rows, also in place
In [214]:
delete!(x, 6:7)
x
Out[214]:

7 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
69i
710j

you can also create a new DataFrame when deleting rows using Not indexing

In [215]:
x[Not(1:2),:]
Out[215]:

5 rows × 2 columns

idval
Int64Char
13c
24d
35e
49i
510j
In [216]:
x
Out[216]:

7 rows × 2 columns

idval
Int64Char
11a
22b
33c
44d
55e
69i
710j

now we move to row filtering

In [238]:
x = DataFrame([1:4, 2:5, 3:6],:auto)
Out[238]:

4 rows × 3 columns

x1x2x3
Int64Int64Int64
1123
2234
3345
4456

create a new DataFrame where filtering function operates on DataFrameRow

In [240]:
f1 = filter(r->r.x1 > 2.5,x)
Out[240]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1345
2456
In [241]:
f1 |> typeof
Out[241]:
DataFrame

filter에 view 파라미터 적용

In [242]:
f2 = filter(r->r.x1 > 2.5,x, view=true) # the same but as a view
Out[242]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1345
2456
In [243]:
f2 |> typeof
Out[243]:
SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}

or

In [245]:
f2 = filter(:x1 => >(2.5),x)
Out[245]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1345
2456
In [246]:
f2 |> typeof
Out[246]:
DataFrame

in place modification of x, an example with do-block syntax

In [250]:
f4 = filter!(x) do r
  if r.x1 > 2.5
    return r.x2 < 4.5
  end
  r.x3 < 3.5
end
Out[250]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1123
2345
In [251]:
f4 |> typeof
Out[251]:
DataFrame

A common operation is selection of rows for which a value in a column is contained in a given set, Here are a few ways in which you can archive this.

In [252]:
df = DataFrame(x=1:12, y=mod1.(1:12,4))
Out[252]:

12 rows × 2 columns

xy
Int64Int64
111
222
333
444
551
662
773
884
991
10102
11113
12124

We select rows for which column y has value 1 or 4

In [265]:
@btime filter(row->row.y in [1,4],df)
  3.044 μs (28 allocations: 2.55 KiB)
Out[265]:

6 rows × 2 columns

xy
Int64Int64
111
244
351
484
591
6124
In [274]:
@btime filter(:y => in([1,4]),df)
  1.489 μs (19 allocations: 1.53 KiB)
Out[274]:

6 rows × 2 columns

xy
Int64Int64
111
244
351
484
591
6124
In [277]:
@btime df[in.(df.y,Ref([1,4])),:]
  1.497 μs (17 allocations: 1.50 KiB)
Out[277]:

6 rows × 2 columns

xy
Int64Int64
111
244
351
484
591
6124

DataFrames.jl also providers a subset function that works on whole columns and allows for multiple conditions:

In [281]:
x = DataFrame([1:4, 2:5, 3:6], :auto)
Out[281]:

4 rows × 3 columns

x1x2x3
Int64Int64Int64
1123
2234
3345
4456
In [286]:
s1 = subset(x, :x1=>x->x .< mean(x), :x2 => ByRow(<(4)))
Out[286]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1123
2234
In [287]:
s1 |> typeof
Out[287]:
DataFrame

중복제거(Deduplicating)

In [288]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x,x)
x.C = 1:4
x
Out[288]:

4 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2
31x3
42y4

get first unique rows for geive index

In [289]:
unique(x)
Out[289]:

4 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2
31x3
42y4
In [290]:
unique(x,[1,2])
Out[290]:

2 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2

get indicators of non-unique rows

In [291]:
nonunique(x,:A)
Out[291]:
4-element Vector{Bool}:
 0
 0
 1
 1
In [292]:
x
Out[292]:

4 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2
31x3
42y4

modify x in place

In [293]:
unique!(x, :B)
Out[293]:

2 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2
In [294]:
x
Out[294]:

2 rows × 3 columns

ABC
Int64StringInt64
11x1
22y2

Extracting one row from a DataFrame into standard collections

In [295]:
x = DataFrame(x=[1,missing,2], y=["a","b",missing],z=[true,false,true])
Out[295]:

3 rows × 3 columns

xyz
Int64?String?Bool
11a1
2missingb0
32missing1
In [296]:
cols = [:y,:z]
Out[296]:
2-element Vector{Symbol}:
 :y
 :z

you can use a conversion to a Vector or an Array

In [298]:
Vector(x[1,cols])
Out[298]:
2-element Vector{Any}:
     "a"
 true

nwo you will get a vector of vectors

In [314]:
xx = DataFrame(rand(2,3),:auto)
axes(xx,1),axes(xx,2),[axes(xx,1)...],[axes(xx,2)...]
Out[314]:
(Base.OneTo(2), Base.OneTo(3), [1, 2], [1, 2, 3])
In [317]:
# axes(x,1) : row
# axes(x,2) : column
[Vector(x[i,cols]) for i in axes(x,1)]
Out[317]:
3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

it is easy to convert a DataFrameRow into a NamedTuple

In [318]:
typeof(x[1,cols])
Out[318]:
DataFrameRow{DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}}
In [319]:
c1 = copy(x[1,cols])
Out[319]:
NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))
In [324]:
n1 = NamedTuple(x[1,cols])
Out[324]:
NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))
In [326]:
c1 == n1, c1===n1
Out[326]:
(true, true)
In [320]:
c1.y
Out[320]:
"a"

or a Tuple

In [321]:
Tuple(x[1,cols])
Out[321]:
("a", true)

Working with a collection of rows of a data frame

You can use eachrow to get a vector-like collection of DataFrameRows

In [327]:
df = DataFrame(reshape(1:12,3,4),:auto)
Out[327]:

3 rows × 4 columns

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
In [328]:
er_df = eachrow(df)
Out[328]:

3×4 DataFrameRows

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
In [329]:
er_df |> typeof
Out[329]:
DataFrames.DataFrameRows{DataFrame}
In [330]:
er_df[1]
Out[330]:

DataFrameRow (4 columns)

x1x2x3x4
Int64Int64Int64Int64
114710
In [333]:
er_df[1][2] === er_df[1].x2
Out[333]:
true
In [334]:
last(er_df)
Out[334]:

DataFrameRow (4 columns)

x1x2x3x4
Int64Int64Int64Int64
336912
In [335]:
er_df[end]
Out[335]:

DataFrameRow (4 columns)

x1x2x3x4
Int64Int64Int64Int64
336912

As DataFrameRows objects keeps connection to the parent data frame you can get the columns of parent using getproperty

In [336]:
er_df.x1
Out[336]:
3-element Vector{Int64}:
 1
 2
 3
In [337]:
er_df.x1 === df.x1
Out[337]:
true

Flattening a data frame

Occasionally you have a data frame whose one column is a vector of collections. You can expand(flatten) such a column using the flatten function

In [338]:
df = DataFrame(a='a':'c', b=[[1,2,3],[4,5],6])
Out[338]:

3 rows × 2 columns

ab
CharAny
1a[1, 2, 3]
2b[4, 5]
3c6
In [339]:
f1 = flatten(df, :b)
Out[339]:

6 rows × 2 columns

ab
CharInt64
1a1
2a2
3a3
4b4
5b5
6c6
In [340]:
f1 |> typeof
Out[340]:
DataFrame

Only one row

only from Julia Base is also supported in DataFrames.jl and succeeds if the data frame has only one row, in which case it is returned.

In [348]:
df = DataFrame(a=1)
Out[348]:

1 rows × 1 columns

a
Int64
11
In [349]:
only(df)
Out[349]:

DataFrameRow (1 columns)

a
Int64
11
In [350]:
df2 = repeat(df,2)
Out[350]:

2 rows × 1 columns

a
Int64
11
21
In [351]:
only(df2)
ArgumentError: data frame must contain exactly 1 row

Stacktrace:
 [1] only(df::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/abstractdataframe.jl:455
 [2] top-level scope
   @ In[351]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [ ]: