Introduction to DataFrames

DataFrames v1.2, Julia 1.6.1

using DataFrames, Pipe, BenchmarkTools

Split-apply-combine

Grouping a data frame

x = DataFrame(id=[1,2,3,4,1,2,3,4],id2=[1,2,1,2,1,2,1,2], v=rand(8))

groupby(x,:id)

groupby(x,[])

gx2 = groupby(x,[:id,:id2])

p1 = parent(gx2) # get the parent DataFrame

`parent` returns a reference to the parent data frame itself, not a copy.

p1 === x

true
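The identity check above can be made concrete with a small sketch. The data frame `df` and its values are made up for illustration; the point is only that `parent` hands back the very same object, so a change made through it is visible in the original.

```julia
using DataFrames

# hypothetical toy data, not the `x` used above
df = DataFrame(id=[1, 2, 1, 2], v=[10, 20, 30, 40])
gdf = groupby(df, :id)

p = parent(gdf)      # no copy is made
same = p === df      # both names refer to one object

p.v[1] = 99          # mutate a non-grouping column through `p`
changed = df.v[1]    # the original data frame sees the change
```

Only a non-grouping column is mutated here; changing a grouping column of the parent would leave the existing `GroupedDataFrame` out of date.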

# back to the DataFrame, but in a different order of rows than the original
vcat(gx2...)

DataFrame(gx2) # the same

# drop grouping columns when creating a data frame
DataFrame(gx2,keepkeys=false)

# vector of names of grouping variables
groupcols(gx2)

2-element Vector{Symbol}:
 :id
 :id2

valuecols(gx2) # and non-grouping variables

1-element Vector{Symbol}:
 :v

groupindices(gx2) # group indices in parent(gx2)

8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4

kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a GroupedDataFrame like a vector or like a dictionary. The second form accepts a GroupKey, a NamedTuple, or a Tuple.

gx2

k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

ntk = NamedTuple(k)

(id = 1, id2 = 1)

tk = Tuple(k)

(1, 1)

The operations below produce the same result and are fast:

@btime gx2[1]

  325.930 ns (3 allocations: 192 bytes)

@btime gx2[k]

  383.871 ns (3 allocations: 192 bytes)

@btime gx2[ntk]

  545.723 ns (3 allocations: 192 bytes)

@btime gx2[tk]

  483.284 ns (3 allocations: 192 bytes)
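To check that the four indexing forms really select the same group, here is a minimal sketch on made-up data (the data frame `df` and its values are assumptions, chosen so the result is easy to verify by eye).

```julia
using DataFrames

# hypothetical toy data with two grouping columns
df = DataFrame(id=[1, 2, 1, 2], id2=[1, 2, 1, 2], v=[0.1, 0.2, 0.3, 0.4])
gdf = groupby(df, [:id, :id2], sort=true)

k = keys(gdf)[1]                  # GroupKey of the first group

by_position   = gdf[1]            # vector-style indexing
by_groupkey   = gdf[k]            # dictionary-style, GroupKey
by_namedtuple = gdf[(id=1, id2=1)]
by_tuple      = gdf[(1, 1)]

# all four are SubDataFrames holding the rows where id == 1 and id2 == 1
all_equal = by_position == by_groupkey == by_namedtuple == by_tuple
```

Each result is a `SubDataFrame`, that is, a view into `df` rather than a copy.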

Handling missing values

x = DataFrame(id = [missing,5,1,3,missing],x=1:5)

# by default groups include missing values and are not sorted
groupby(x,:id)

groupby(x,:id,sort=true,skipmissing=true) # but we can change it
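The effect of the two keyword arguments can be inspected directly. The sketch below reuses the same data as above under the name `df`; the variable names `g_default` and `g_clean` are introduced here for illustration only.

```julia
using DataFrames

df = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

g_default = groupby(df, :id)                              # missing forms its own group
g_clean   = groupby(df, :id, sort=true, skipmissing=true) # missing rows are dropped

n_default = length(g_default)
n_clean   = length(g_clean)

# with sort=true the group keys come out in ascending order
sorted_ids = [k.id for k in keys(g_clean)]

# rows that belong to no group get a `missing` group index
idx = groupindices(g_clean)
```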

Performing transformations by group using `combine`, `select`, `select!`, `transform` and `transform!`

using Statistics
using Chain

Reduce the number of rows in the output

ENV["LINES"] = 15

15

x = DataFrame(id=rand('a':'d',100), v=rand(100))

Apply a function to each group of a data frame

`combine` keeps as many rows as are returned from the function.

@chain x begin 
  groupby(:id)
  combine(:v=>mean)
end

x.id2 = axes(x,1)
x

axes(x)

(Base.OneTo(100), Base.OneTo(3))

# select and transform keep as many rows as are in the source data frame,
# in the original order; additionally, transform keeps all columns from the source
@chain x begin
  groupby(:id)
  transform(:v=>mean)
end
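The difference in output shape between `combine`, `select`, and `transform` is easiest to see on a tiny example. The data below is made up, and `sort=true` is passed only to make the group order predictable in this sketch.

```julia
using DataFrames, Statistics

# hypothetical toy data: group 'a' has mean 4.0, group 'b' has mean 3.0
df = DataFrame(id=['a', 'b', 'a', 'b', 'a'], v=[1.0, 2.0, 3.0, 4.0, 8.0])
gdf = groupby(df, :id, sort=true)

c = combine(gdf, :v => mean)    # one row per group
s = select(gdf, :v => mean)     # one row per source row, keys + result only
t = transform(gdf, :v => mean)  # one row per source row, all source columns kept
```

In `s` and `t` the group mean is repeated for every row of the group, in the row order of `df`.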

@pipe x |> groupby(_,:id)

# note that combine reorders rows by group of GroupedDataFrame
@chain x begin
  groupby(:id)
  combine(:id2,:v=>mean)
end

# we give a custom name for the result column
@chain x begin
  groupby(:id)
  combine(:v=>mean=>:res)
end

# you can have multiple operations
@chain x begin
  groupby(:id)
  combine(:v=>mean=>:res1, :v=>sum=>:res2, nrow => :n, ncol)
end

combine(groupby(x,:id)) do sdf
  n = nrow(sdf)
  n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end
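The same pattern on a small, deterministic data frame shows how returning an empty `DataFrame()` removes a group from the result. The data and the threshold of 2 rows are assumptions chosen for illustration.

```julia
using DataFrames

# hypothetical toy data: group 1 has three rows, group 2 has one
df = DataFrame(id=[1, 1, 1, 2], v=1:4)

res = combine(groupby(df, :id, sort=true)) do sdf
    n = nrow(sdf)
    # groups with fewer than 2 rows contribute no rows to the output
    n < 2 ? DataFrame() : DataFrame(n=n)
end
```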

df = DataFrame(id=[1,1,2,2],val=[1,2,3,4])

@chain df begin
  groupby(:id)
  combine(:val=>(x->[x])=>AsTable)
end

@chain df begin
  groupby(:id)
  combine(:val=>(x->[x]) => [:c1,:c2])
end

df = DataFrame(a=[(p=1,q=2),(p=3,q=4)])

df = DataFrame(a=[[1,2],[3,4]])

select(df, :a)

select(df, :a=>AsTable) # column names are generated automatically

select(df,:a=>[:C1,:C2])
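A short sketch of how a column of containers is expanded into several columns, covering both the named-tuple and the vector case defined above. The variable names `df_nt`, `df_vec`, and the result names are introduced here for illustration.

```julia
using DataFrames

df_nt  = DataFrame(a=[(p=1, q=2), (p=3, q=4)])   # column of named tuples
df_vec = DataFrame(a=[[1, 2], [3, 4]])           # column of vectors

from_nt = select(df_nt, :a => AsTable)           # field names become column names
auto    = select(df_vec, :a => AsTable)          # vectors have no names: x1, x2
named   = select(df_vec, :a => [:C1, :C2])       # explicit target names
```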

Finally, observe that one can conveniently apply multiple transformations using broadcasting:

df = DataFrame(id=repeat(1:10,10),x1=1:100,x2=101:200)

groupby(df,:id)

@chain df begin
  groupby(:id)
  combine([:x1,:x2] .=> minimum)
end

@chain df begin
  groupby(:id)
  combine([:x1,:x2] .=>[minimum,maximum])
end
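Note that broadcasting a vector of columns against a vector of functions pairs them element by element. If every function should be applied to every column, a one-row matrix of functions can be used instead. The sketch below uses a smaller made-up data frame so the results can be checked by hand.

```julia
using DataFrames

# hypothetical toy data
df = DataFrame(id=repeat(1:2, 3), x1=1:6, x2=11:16)
gdf = groupby(df, :id, sort=true)

# vector .=> vector: x1 => minimum and x2 => maximum only
elementwise = combine(gdf, [:x1, :x2] .=> [minimum, maximum])

# vector .=> row matrix: all four column/function combinations
allpairs = combine(gdf, [:x1, :x2] .=> [minimum maximum])
```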

Aggregation of a data frame using `mapcols`

x = DataFrame(rand(11,10),:auto)

mapcols(mean,x)
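Unlike `map` over `eachcol`, `mapcols` returns a data frame. A minimal sketch with made-up values:

```julia
using DataFrames, Statistics

# hypothetical toy data
df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 5.0, 6.0])

m = mapcols(mean, df)                 # scalar per column: a one-row data frame
d = mapcols(col -> col .* 2, df)      # vector per column: same shape as `df`
```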

Mapping rows and columns using `eachcol` and `eachrow`

map(mean,eachcol(x)) # map a function over each column and return a vector

10-element Vector{Float64}:
 0.5476262044989846
 0.6767699633928886
 0.460773140618106
 0.32009375238924265
 0.4963808937498881
 0.42601035608155713
 0.40956819272961004
 0.5397891681041568
 0.46260956285273386
 0.32084170235808185

# an iteration returns a Pair with column name and values
foreach(c->println(c[1], ": ",mean(c[2])),pairs(eachcol(x)))

x1: 0.5476262044989846
x2: 0.6767699633928886
x3: 0.460773140618106
x4: 0.32009375238924265
x5: 0.4963808937498881
x6: 0.42601035608155713
x7: 0.40956819272961004
x8: 0.5397891681041568
x9: 0.46260956285273386
x10: 0.32084170235808185

keys(pairs(eachcol(x)))

10-element Vector{Symbol}:
 :x1
 :x2
 :x3
 :x4
 :x5
 :x6
 :x7
 :x8
 :x9
 :x10

values(pairs(eachcol(x)))

Now the returned value is a DataFrameRow, which works like a NamedTuple but is a view into the parent DataFrame.

map(r->r.x1/r.x2,eachrow(x))

11-element Vector{Float64}:
 1.4895537041964226
 0.9074093264118998
 0.3252145507900127
 0.13417674795899745
 0.5419395347592332
 1.787979457065716
 1.0505552785308994
 0.7955757198996063
 0.46512965845343396
 0.952387303771245
 2.3232570241282002
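Because a `DataFrameRow` is a view, writing to it changes the parent data frame, while converting it to a `NamedTuple` gives a detached copy. A minimal sketch on made-up data:

```julia
using DataFrames

# hypothetical toy data
df = DataFrame(x1=[1.0, 2.0], x2=[4.0, 8.0])

row = first(eachrow(df))   # DataFrameRow, a view into `df`
row.x1 = 10.0              # writes through to the parent

nt = NamedTuple(df[2, :])  # plain NamedTuple, independent of `df`

ratios = map(r -> r.x1 / r.x2, eachrow(df))
```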

map(c->mean(c),eachcol(x))

10-element Vector{Float64}:
 0.5476262044989846
 0.6767699633928886
 0.460773140618106
 0.32009375238924265
 0.4963808937498881
 0.42601035608155713
 0.40956819272961004
 0.5397891681041568
 0.46260956285273386
 0.32084170235808185

It prints like a data frame, only the caption is different so that you know the type of the object

er = eachrow(x)

# you can access columns of parent data frame directly
er.x1

11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687

It prints like a data frame, only the caption is different so that you know the type of the object

ec = eachcol(x)

ec.x1

11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687

er[1]

ec[1]

11-element Vector{Float64}:
 0.9647450497886172
 0.7003041920539808
 0.3006167230205241
 0.12985920072205337
 0.14369459782730165
 0.7666322788507816
 0.8533699765785863
 0.7905908940520163
 0.28685063623595863
 0.8848252530238048
 0.20239944733520687

Transposing

You can transpose a data frame using `permutedims`:

df = DataFrame(reshape(1:12,3,4),:auto)

df.names=["a","b","c"]

3-element Vector{String}:
 "a"
 "b"
 "c"

df

permutedims(df,:names)
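A sketch of what the call above produces, using the same construction: the values of the `names` column become the new column names, and the old column names are stored in the first column of the result.

```julia
using DataFrames

df = DataFrame(reshape(1:12, 3, 4), :auto)   # columns x1..x4, three rows
df.names = ["a", "b", "c"]

t = permutedims(df, :names)

new_columns = names(t)    # first column keeps the name "names"
old_columns = t.names     # former column names, now stored as data
first_row   = t.a         # former row "a", now a column
```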

	id	id2	v
	Int64	Int64	Float64
1	1	1	0.670434
2	2	2	0.837793
3	3	1	0.222869
4	4	2	0.40303
5	1	1	0.415974
6	2	2	0.907121
7	3	1	0.0626348
8	4	2	0.513873

	id	id2	v
	Int64	Int64	Float64
1	1	10	0.670434
2	2	2	0.837793
3	3	1	0.222869
4	4	2	0.40303
5	1	1	0.415974
6	2	2	0.907121
7	3	1	0.0626348
8	4	2	0.513873

	id	id2	v
	Int64	Int64	Float64
1	1	1	0.670434
2	1	1	0.415974
3	2	2	0.837793
4	2	2	0.907121
5	3	1	0.222869
6	3	1	0.0626348
7	4	2	0.40303
8	4	2	0.513873

	id	v
	Char	Float64
1	b	0.778459
2	c	0.965702
3	a	0.714609
4	a	0.883121
5	c	0.542032
6	d	0.115543
7	b	0.458695
8	c	0.277306
9	b	0.18459
10	b	0.565907
11	a	0.390352
12	a	0.670578
13	d	0.344171
14	b	0.462482
15	c	0.185975
⋮	⋮	⋮

	id	x
	Int64?	Int64
1	missing	1
2	5	2
3	1	3
4	3	4
5	missing	5

	id	x
	Int64?	Int64
1	1	3

	id	x
	Int64?	Int64
1	missing	1
2	missing	5
