Introduction to DataFrames

10_transforms

DataFrame v1.2, Julia 1.6.1

In [46]:
using DataFrames, Pipe, BenchmarkTools

Split-apply-combine

Grouping a data frame

In [3]:
x = DataFrame(id=[1,2,3,4,1,2,3,4],id2=[1,2,1,2,1,2,1,2], v=rand(8))
Out[3]:

8 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2220.837793
3310.222869
4420.40303
5110.415974
6220.907121
7310.0626348
8420.513873
In [8]:
groupby(x,:id)
Out[8]:

GroupedDataFrame with 4 groups based on key: id

First Group (2 rows): id = 1

idid2v
Int64Int64Float64
1110.670434
2110.415974

Last Group (2 rows): id = 4

idid2v
Int64Int64Float64
1420.40303
2420.513873
In [9]:
groupby(x,[])
Out[9]:

GroupedDataFrame with 1 group based on key:

First Group (8 rows):

idid2v
Int64Int64Float64
1110.670434
2220.837793
3310.222869
4420.40303
5110.415974
6220.907121
7310.0626348
8420.513873
In [10]:
gx2 = groupby(x,[:id,:id2])
Out[10]:

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

idid2v
Int64Int64Float64
1110.670434
2110.415974

Last Group (2 rows): id = 4, id2 = 2

idid2v
Int64Int64Float64
1420.40303
2420.513873
In [20]:
p1 = parent(gx2) # get the parent DataFrame
Out[20]:

8 rows × 3 columns

idid2v
Int64Int64Float64
11100.670434
2220.837793
3310.222869
4420.40303
5110.415974
6220.907121
7310.0626348
8420.513873

parent는 parent의 주소를 돌려 준다.

In [17]:
p1 === x
Out[17]:
true
In [23]:
# back to the DataFrame, but in a different order of rows than the original
vcat(gx2...)
Out[23]:

8 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974
3220.837793
4220.907121
5310.222869
6310.0626348
7420.40303
8420.513873
In [26]:
DataFrame(gx2) # the same
Out[26]:

8 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974
3220.837793
4220.907121
5310.222869
6310.0626348
7420.40303
8420.513873
In [28]:
# drop grouping columns when creating a data frame
DataFrame(gx2,keepkeys=false)
Out[28]:

8 rows × 1 columns

v
Float64
10.670434
20.415974
30.837793
40.907121
50.222869
60.0626348
70.40303
80.513873
In [29]:
# vector of names of grouping variables
groupcols(gx2)
Out[29]:
2-element Vector{Symbol}:
 :id
 :id2
In [36]:
valuecols(gx2) # and non-grouping variables
Out[36]:
1-element Vector{Symbol}:
 :v
In [37]:
groupindices(gx2) # group indices in parent(gx2)
Out[37]:
8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4
In [38]:
kgx2 = keys(gx2)
Out[38]:
4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

you can index into a GroupDataFrame like to a vector or to a dictionary. The second form acceps GroupKey,NameTuple or Tuple

In [39]:
gx2
Out[39]:

GroupedDataFrame with 4 groups based on keys: id, id2

First Group (2 rows): id = 1, id2 = 1

idid2v
Int64Int64Float64
1110.670434
2110.415974

Last Group (2 rows): id = 4, id2 = 2

idid2v
Int64Int64Float64
1420.40303
2420.513873
In [40]:
k = keys(gx2)[1]
Out[40]:
GroupKey: (id = 1, id2 = 1)
In [41]:
ntk = NamedTuple(k)
Out[41]:
(id = 1, id2 = 1)
In [42]:
tk = Tuple(k)
Out[42]:
(1, 1)

the operations below produce the same result and are fast

In [49]:
@btime gx2[1]
  325.930 ns (3 allocations: 192 bytes)
Out[49]:

2 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974
In [50]:
@btime gx2[k]
  383.871 ns (3 allocations: 192 bytes)
Out[50]:

2 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974
In [52]:
@btime gx2[ntk]
  545.723 ns (3 allocations: 192 bytes)
Out[52]:

2 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974
In [54]:
@btime gx2[tk]
  483.284 ns (3 allocations: 192 bytes)
Out[54]:

2 rows × 3 columns

idid2v
Int64Int64Float64
1110.670434
2110.415974

handling missing values

In [55]:
x = DataFrame(id = [missing,5,1,3,missing],x=1:5)
Out[55]:

5 rows × 2 columns

idx
Int64?Int64
1missing1
252
313
434
5missing5
In [56]:
# by default group include missing values and are not sorted
groupby(x,:id)
Out[56]:

GroupedDataFrame with 4 groups based on key: id

First Group (1 row): id = 1

idx
Int64?Int64
113

Last Group (2 rows): id = missing

idx
Int64?Int64
1missing1
2missing5
In [59]:
groupby(x,:id,sort=true,skipmissing=true) # but we can change it
Out[59]:

GroupedDataFrame with 3 groups based on key: id

First Group (1 row): id = 1

idx
Int64?Int64
113

Last Group (1 row): id = 5

idx
Int64?Int64
152

Performing transformations by group using combine,select,select!,transform and transform!

In [60]:
using Statistics
using Chain

Reduce the number of rows in the output

In [61]:
ENV["LINES"] = 15
Out[61]:
15
In [62]:
x = DataFrame(id=rand('a':'d',100), v=rand(100))
Out[62]:

100 rows × 2 columns

idv
CharFloat64
1b0.778459
2c0.965702
3a0.714609
4a0.883121
5c0.542032
6d0.115543
7b0.458695
8c0.277306
9b0.18459
10b0.565907
11a0.390352
12a0.670578
13d0.344171
14b0.462482
15c0.185975

Apply a function to each group of data frame

combine keeps as many rows as are returned from the function

In [63]:
@chain x begin 
  groupby(:id)
  combine(:v=>mean)
end
Out[63]:

4 rows × 2 columns

idv_mean
CharFloat64
1b0.619295
2c0.538388
3a0.52231
4d0.54441
In [69]:
x.id2 = axes(x,1)
x
Out[69]:

100 rows × 3 columns

idvid2
CharFloat64Int64
1b0.7784591
2c0.9657022
3a0.7146093
4a0.8831214
5c0.5420325
6d0.1155436
7b0.4586957
8c0.2773068
9b0.184599
10b0.56590710
11a0.39035211
12a0.67057812
13d0.34417113
14b0.46248214
15c0.18597515
In [71]:
axes(x)
Out[71]:
(Base.OneTo(100), Base.OneTo(3))
In [77]:
# select and transform keep as many rows as are in the source data frame 
# and in correct order additionally transform keeps all columns from the source
@chain x begin
  groupby(:id)
  transform(:v=>mean)
end
Out[77]:

100 rows × 4 columns

idvid2v_mean
CharFloat64Int64Float64
1b0.77845910.619295
2c0.96570220.538388
3a0.71460930.52231
4a0.88312140.52231
5c0.54203250.538388
6d0.11554360.54441
7b0.45869570.619295
8c0.27730680.538388
9b0.1845990.619295
10b0.565907100.619295
11a0.390352110.52231
12a0.670578120.52231
13d0.344171130.54441
14b0.462482140.619295
15c0.185975150.538388
In [124]:
@pipe x |> groupby(_,:id)
Out[124]:

GroupedDataFrame with 4 groups based on key: id

First Group (23 rows): id = 'b'

idvid2
CharFloat64Int64
1b0.7784591
2b0.4586957
3b0.184599
4b0.56590710
5b0.46248214
6b0.33585717
7b0.75800122
8b0.91636629
9b0.42863631
10b0.61014532
11b0.80862136
12b0.86551748
13b0.93838252
14b0.1782465
15b0.14140371

Last Group (31 rows): id = 'd'

idvid2
CharFloat64Int64
1d0.1155436
2d0.34417113
3d0.16946818
4d0.58312621
5d0.19065725
6d0.92186626
7d0.81291930
8d0.16900134
9d0.75505739
10d0.27718240
11d0.20082746
12d0.79275749
13d0.96187850
14d0.92098851
15d0.28209453
In [79]:
# note that combine reorders rows by group of GroupedDataFrame
@chain x begin
  groupby(:id)
  combine(:id2,:v=>mean)
end
Out[79]:

100 rows × 3 columns

idid2v_mean
CharInt64Float64
1b10.619295
2b70.619295
3b90.619295
4b100.619295
5b140.619295
6b170.619295
7b220.619295
8b290.619295
9b310.619295
10b320.619295
11b360.619295
12b480.619295
13b520.619295
14b650.619295
15b710.619295
In [125]:
# we give a custom name for the result column
@chain x begin
  groupby(:id)
  combine(:v=>mean=>:res)
end
Out[125]:

4 rows × 2 columns

idres
CharFloat64
1b0.619295
2c0.538388
3a0.52231
4d0.54441
In [150]:
# you can have multiple operations
@chain x begin
  groupby(:id)
  combine(:v=>mean=>:res1, :v=>sum=>:res2, nrow => :n, ncol)
end
Out[150]:

4 rows × 5 columns

idres1res2nx1
CharFloat64Float64Int64Int64
1b0.61929514.2438233
2c0.53838812.9213243
3a0.5223111.4908223
4d0.5444116.8767313
In [151]:
combine(groupby(x,:id)) do sdf
  n = nrow(sdf)
  n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end
Out[151]:

1 rows × 2 columns

idn
CharInt64
1d31
In [182]:
df = DataFrame(id=[1,1,2,2],val=[1,2,3,4])
Out[182]:

4 rows × 2 columns

idval
Int64Int64
111
212
323
424
In [183]:
@chain df begin
  groupby(:id)
  combine(:val=>(x->[x])=>AsTable)
end
Out[183]:

2 rows × 3 columns

idx1x2
Int64Int64Int64
1112
2234
In [184]:
@chain df begin
  groupby(:id)
  combine(:val=>(x->[x]) => [:c1,:c2])
end
Out[184]:

2 rows × 3 columns

idc1c2
Int64Int64Int64
1112
2234
In [185]:
df = DataFrame(a=[(p=1,q=2),(p=3,q=4)])
Out[185]:

2 rows × 1 columns

a
NamedTu…
1(p = 1, q = 2)
2(p = 3, q = 4)
In [187]:
df = DataFrame(a=[[1,2],[3,4]])
Out[187]:

2 rows × 1 columns

a
Array…
1[1, 2]
2[3, 4]
In [188]:
select(df, :a)
Out[188]:

2 rows × 1 columns

a
Array…
1[1, 2]
2[3, 4]
In [189]:
select(df, :a=>AsTable) # Automatic column names generated
Out[189]:

2 rows × 2 columns

x1x2
Int64Int64
112
234
In [190]:
select(df,:a=>[:C1,:C2])
Out[190]:

2 rows × 2 columns

C1C2
Int64Int64
112
234

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

In [191]:
df = DataFrame(id=repeat(1:10,10),x1=1:100,x2=101:200)
Out[191]:

100 rows × 3 columns

idx1x2
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
11111111
12212112
13313113
14414114
15515115
In [197]:
groupby(df,:id)
Out[197]:

GroupedDataFrame with 10 groups based on key: id

First Group (10 rows): id = 1

idx1x2
Int64Int64Int64
111101
2111111
3121121
4131131
5141141
6151151
7161161
8171171
9181181
10191191

Last Group (10 rows): id = 10

idx1x2
Int64Int64Int64
11010110
21020120
31030130
41040140
51050150
61060160
71070170
81080180
91090190
1010100200
In [193]:
@chain df begin
  groupby(:id)
  combine([:x1,:x2] .=> minimum)
end
Out[193]:

10 rows × 3 columns

idx1_minimumx2_minimum
Int64Int64Int64
111101
222102
333103
444104
555105
666106
777107
888108
999109
101010110
In [195]:
@chain df begin
  groupby(:id)
  combine([:x1,:x2] .=>[minimum,maximum])
end
Out[195]:

10 rows × 3 columns

idx1_minimumx2_maximum
Int64Int64Int64
111191
222192
333193
444194
555195
666196
777197
888198
999199
101010200

Aggregation of a data frame using mapcols

In [280]:
x = DataFrame(rand(11,10),:auto)
Out[280]:

11 rows × 10 columns (omitted printing of 2 columns)

x1x2x3x4x5x6x7x8
Float64Float64Float64Float64Float64Float64Float64Float64
10.9647450.6476740.4159170.1894790.3350550.5938050.1536370.418127
20.7003040.7717620.230390.09824040.9366950.371230.1720740.927579
30.3006170.9243640.327710.3237170.02038390.1340280.5145710.931041
40.1298590.9678220.4291380.4117110.3832030.9325310.8334550.114311
50.1436950.2651490.9323980.3294620.5993050.9640550.1511350.0878731
60.7666320.428770.9099380.625830.4112640.2507530.1237650.497648
70.853370.8123040.7726880.4007190.5809510.6031070.624720.339668
80.7905910.9937340.1639440.05582070.2406770.2763010.4836740.929667
90.2868510.6167110.6319630.4988660.2360670.1816570.8521930.546293
100.8848250.929060.2171420.2567270.9087380.213790.5876120.240607
110.2023990.08711880.03727640.3304590.807850.1648570.008414670.904867
In [281]:
mapcols(mean,x)
Out[281]:

1 rows × 10 columns (omitted printing of 1 columns)

x1x2x3x4x5x6x7x8x9
Float64Float64Float64Float64Float64Float64Float64Float64Float64
10.5476260.676770.4607730.3200940.4963810.426010.4095680.5397890.46261

Mapping rows and columns using eachcol and eachrow

In [282]:
map(mean,eachcol(x)) # map a function over each column and return a vector
Out[282]:
10-element Vector{Float64}:
 0.5476262044989846
 0.6767699633928886
 0.460773140618106
 0.32009375238924265
 0.4963808937498881
 0.42601035608155713
 0.40956819272961004
 0.5397891681041568
 0.46260956285273386
 0.32084170235808185
In [283]:
# an iteration returns a Pair with column name and values
foreach(c->println(c[1], ": ",mean(c[2])),pairs(eachcol(x)))
x1: 0.5476262044989846
x2: 0.6767699633928886
x3: 0.460773140618106
x4: 0.32009375238924265
x5: 0.4963808937498881
x6: 0.42601035608155713
x7: 0.40956819272961004
x8: 0.5397891681041568
x9: 0.46260956285273386
x10: 0.32084170235808185
In [284]:
keys(pairs(eachcol(x)))
Out[284]:
10-element Vector{Symbol}:
 :x1
 :x2
 :x3
 :x4
 :x5
 :x6
 :x7
 :x8
 :x9
 :x10
In [285]:
values(pairs(eachcol(x)))
Out[285]:

11×10 DataFrameColumns

x1x2x3x4x5x6x7x8
Float64Float64Float64Float64Float64Float64Float64Float64
10.9647450.6476740.4159170.1894790.3350550.5938050.1536370.418127
20.7003040.7717620.230390.09824040.9366950.371230.1720740.927579
30.3006170.9243640.327710.3237170.02038390.1340280.5145710.931041
40.1298590.9678220.4291380.4117110.3832030.9325310.8334550.114311
50.1436950.2651490.9323980.3294620.5993050.9640550.1511350.0878731
60.7666320.428770.9099380.625830.4112640.2507530.1237650.497648
70.853370.8123040.7726880.4007190.5809510.6031070.624720.339668
80.7905910.9937340.1639440.05582070.2406770.2763010.4836740.929667
90.2868510.6167110.6319630.4988660.2360670.1816570.8521930.546293
100.8848250.929060.2171420.2567270.9087380.213790.5876120.240607
110.2023990.08711880.03727640.3304590.807850.1648570.008414670.904867

now the returned value is DataFrameRow which works as NamedTuple but is a view to a parent DataFrame

In [286]:
map(r->r.x1/r.x2,eachrow(x))
Out[286]:
11-element Vector{Float64}:
 1.4895537041964226
 0.9074093264118998
 0.3252145507900127
 0.13417674795899745
 0.5419395347592332
 1.787979457065716
 1.0505552785308994
 0.7955757198996063
 0.46512965845343396
 0.952387303771245
 2.3232570241282002
In [288]:
map(c->mean(c),eachcol(x))
Out[288]:
10-element Vector{Float64}:
 0.5476262044989846
 0.6767699633928886
 0.460773140618106
 0.32009375238924265
 0.4963808937498881
 0.42601035608155713
 0.40956819272961004
 0.5397891681041568
 0.46260956285273386
 0.32084170235808185

It prints like a data frame, only caption is different so that you know the type of the object

In [289]:
er = eachrow(x)
Out[289]:

11×10 DataFrameRows

x1x2x3x4x5x6x7x8
Float64Float64Float64Float64Float64Float64Float64Float64
10.9647450.6476740.4159170.1894790.3350550.5938050.1536370.418127
20.7003040.7717620.230390.09824040.9366950.371230.1720740.927579
30.3006170.9243640.327710.3237170.02038390.1340280.5145710.931041
40.1298590.9678220.4291380.4117110.3832030.9325310.8334550.114311
50.1436950.2651490.9323980.3294620.5993050.9640550.1511350.0878731
60.7666320.428770.9099380.625830.4112640.2507530.1237650.497648
70.853370.8123040.7726880.4007190.5809510.6031070.624720.339668
80.7905910.9937340.1639440.05582070.2406770.2763010.4836740.929667
90.2868510.6167110.6319630.4988660.2360670.1816570.8521930.546293
100.8848250.929060.2171420.2567270.9087380.213790.5876120.240607
110.2023990.08711880.03727640.3304590.807850.1648570.008414670.904867
In [290]:
# you can access columns of parent data frame directly
er.x1
Out[290]:
11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687

It prints like a data frame, only the caption is different so that you know the type of the object

In [291]:
ec = eachcol(x)
Out[291]:

11×10 DataFrameColumns

x1x2x3x4x5x6x7x8
Float64Float64Float64Float64Float64Float64Float64Float64
10.9647450.6476740.4159170.1894790.3350550.5938050.1536370.418127
20.7003040.7717620.230390.09824040.9366950.371230.1720740.927579
30.3006170.9243640.327710.3237170.02038390.1340280.5145710.931041
40.1298590.9678220.4291380.4117110.3832030.9325310.8334550.114311
50.1436950.2651490.9323980.3294620.5993050.9640550.1511350.0878731
60.7666320.428770.9099380.625830.4112640.2507530.1237650.497648
70.853370.8123040.7726880.4007190.5809510.6031070.624720.339668
80.7905910.9937340.1639440.05582070.2406770.2763010.4836740.929667
90.2868510.6167110.6319630.4988660.2360670.1816570.8521930.546293
100.8848250.929060.2171420.2567270.9087380.213790.5876120.240607
110.2023990.08711880.03727640.3304590.807850.1648570.008414670.904867
In [292]:
ec.x1
Out[292]:
11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687
In [293]:
er[1]
Out[293]:

DataFrameRow (10 columns)

x1x2x3x4x5x6x7x8
Float64Float64Float64Float64Float64Float64Float64Float64
10.9647450.6476740.4159170.1894790.3350550.5938050.1536370.418127
In [294]:
ec[1]
Out[294]:
11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687

Transposing

you can transpose a data frame using permuteddims:

In [295]:
df = DataFrame(reshape(1:12,3,4),:auto)
Out[295]:

3 rows × 4 columns

x1x2x3x4
Int64Int64Int64Int64
114710
225811
336912
In [297]:
df.names=["a","b","c"]
Out[297]:
3-element Vector{String}:
 "a"
 "b"
 "c"
In [298]:
df
Out[298]:

3 rows × 5 columns

x1x2x3x4names
Int64Int64Int64Int64String
114710a
225811b
336912c
In [299]:
permutedims(df,:names)
Out[299]:

4 rows × 4 columns

namesabc
StringInt64Int64Int64
1x1123
2x2456
3x3789
4x4101112
In [ ]: