Introduction to DataFrames

13_extras

DataFrame v1.2, Julia 1.6.1

Extras - selected functionalities of selected packages

In [1]:
using DataFrames
In [97]:
using Pipe
In [2]:
using FreqTables
In [3]:
using CategoricalArrays
In [4]:
using BenchmarkTools
In [5]:
ENV["LINES"] = 15
Out[5]:
15
In [6]:
df = DataFrame(a=rand('a':'d',1000), b=rand(["x","y","z"],1000))
# observe that dimensions are sorted if possible
ft = freqtable(df, :a, :b)
Out[6]:
4×3 Named Matrix{Int64}
a ╲ b │  x   y   z
──────┼───────────
'a'   │ 78  86  80
'b'   │ 87  87  92
'c'   │ 79  86  84
'd'   │ 80  79  82
In [7]:
# you can index the result using number of names
ft[1,1], ft['b',"z"]
Out[7]:
(78, 92)
In [8]:
# getting proportions - 1 means we want to calculate them in rows
# (first dimension)
pr = prop(ft, margins=1)
Out[8]:
4×3 Named Matrix{Float64}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.319672  0.352459  0.327869
'b'   │ 0.327068  0.327068  0.345865
'c'   │ 0.317269  0.345382  0.337349
'd'   │  0.33195  0.327801  0.340249
In [9]:
pr[1,:] |> sum,pr[4,:] |> sum
Out[9]:
(1.0, 1.0)
In [10]:
# and columns are normalized to 1.0 now
pc = prop(ft, margins=2)
Out[10]:
4×3 Named Matrix{Float64}
a ╲ b │        x         y         z
──────┼─────────────────────────────
'a'   │ 0.240741  0.254438  0.236686
'b'   │ 0.268519  0.257396  0.272189
'c'   │ 0.243827  0.254438  0.248521
'd'   │ 0.246914  0.233728  0.242604
In [11]:
pc[:,"x"] |> sum,pc[:,"y"] |> sum
Out[11]:
(1.0, 1.0)
In [12]:
p = prop(ft)
Out[12]:
4×3 Named Matrix{Float64}
a ╲ b │     x      y      z
──────┼────────────────────
'a'   │ 0.078  0.086   0.08
'b'   │ 0.087  0.087  0.092
'c'   │ 0.079  0.086  0.084
'd'   │  0.08  0.079  0.082
In [13]:
p |> sum
Out[13]:
0.9999999999999998
In [14]:
x = categorical(rand(1:3,10))
Out[14]:
10-element CategoricalArray{Int64,1,UInt32}:
 2
 2
 1
 2
 1
 1
 3
 3
 2
 1
In [15]:
# rerodering levels and adding an extra level
levels!(x,[3,1,2,4])

# order is preserved and not-used level is shown
freqtable(x)
Out[15]:
4-element Named Vector{Int64}
Dim1                              │ 
──────────────────────────────────┼──
CategoricalValue{Int64, UInt32} 3 │ 2
CategoricalValue{Int64, UInt32} 1 │ 4
CategoricalValue{Int64, UInt32} 2 │ 4
CategoricalValue{Int64, UInt32} 4 │ 0
In [16]:
# by default missings are listed
freqtable([1,1,2,3,missing])
Out[16]:
4-element Named Vector{Int64}
Dim1    │ 
────────┼──
1       │ 2
2       │ 1
3       │ 1
missing │ 1
In [17]:
# but we can skip them
freqtable([1,1,2,3,missing],skipmissing=true)
Out[17]:
3-element Named Vector{Int64}
Dim1  │ 
──────┼──
1     │ 2
2     │ 1
3     │ 1
In [18]:
df = DataFrame(a=rand(3:4,1_000),b=rand(5:6,1_000))
# now dimensions are numbers
ft = freqtable(df, :a, :b)
Out[18]:
2×2 Named Matrix{Int64}
a ╲ b │   5    6
──────┼─────────
3     │ 265  249
4     │ 242  244
In [19]:
# this is an error - standard array indexing takes precedence
ft[3,5]
BoundsError: attempt to access 2×2 Matrix{Int64} at index [3, 5]

Stacktrace:
 [1] getindex
   @ ./array.jl:802 [inlined]
 [2] getindex(::NamedArrays.NamedMatrix{Int64, Matrix{Int64}, Tuple{OrderedCollections.OrderedDict{Int64, Int64}, OrderedCollections.OrderedDict{Int64, Int64}}}, ::Int64, ::Int64)
   @ NamedArrays ~/.julia/packages/NamedArrays/TuJLn/src/index.jl:17
 [3] top-level scope
   @ In[19]:2
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [20]:
# you have to use Name() wrapper
ft[FreqTables.Name(3),FreqTables.Name(5)]
Out[20]:
265

DataFramesMeta.jl - working on DataFrame

DataFramesMeta.jl provides a more terse syntax due to the benefits of metaprogramming.

In [21]:
using DataFramesMeta
In [22]:
df = DataFrame(x=1:8, y='a':'h',z=repeat([true,false],outer=4))
Out[22]:

8 rows × 3 columns

xyz
Int64CharBool
11a1
22b0
33c1
44d0
55e1
66f0
77g1
88h0
In [23]:
# expressions with columns of DataFrame
@btime DataFramesMeta.@with($df, :x + :z)
  653.782 ns (7 allocations: 576 bytes)
Out[23]:
8-element Vector{Int64}:
 2
 2
 4
 4
 6
 6
 8
 8
In [24]:
@btime +($df.x,df.z)
  134.350 ns (1 allocation: 144 bytes)
Out[24]:
8-element Vector{Int64}:
 2
 2
 4
 4
 6
 6
 8
 8
In [25]:
# you can define code blocks
@with df begin
  a = :x[:z] # 1,3,5,7
  b = :x[.!:z] # 2,4,6,8
  :y + [a;b] # [a;b] : 1,3,5,7,2,4,8
end
Out[25]:
8-element Vector{Char}:
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'k': ASCII/Unicode U+006B (category Ll: Letter, lowercase)
 'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)
 'j': ASCII/Unicode U+006A (category Ll: Letter, lowercase)
 'm': ASCII/Unicode U+006D (category Ll: Letter, lowercase)
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
In [26]:
# @with creates hard scope so variables do not leak out
a
UndefVarError: a not defined

Stacktrace:
 [1] top-level scope
   @ :0
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [27]:
df2 = DataFrame(a=[:a,:b,:c])
Out[27]:

3 rows × 1 columns

a
Symbol
1a
2b
3c
In [28]:
# sometimes we want to work on a raw Symbol, ^() escapes it
@with(df2, :a .== ^(:a))
Out[28]:
3-element BitVector:
 1
 0
 0
In [29]:
@with(df2, :a .== Symbol("a"))
Out[29]:
3-element BitVector:
 1
 0
 0
In [30]:
x_str = "x"
y_str = "y"
df2 = DataFrame(x=1:3, y=4:6, z=7:9)
Out[30]:

3 rows × 3 columns

xyz
Int64Int64Int64
1147
2258
3369
In [31]:
# cols(expression) selects given columns
@with(df2,cols(x_str) + cols(y_str))
┌ Warning: cols(x) is deprecated, use $x instead
â”” @ DataFramesMeta /home/shpark/.julia/packages/DataFramesMeta/mHJrB/src/parsing.jl:62
┌ Warning: cols(x) is deprecated, use $x instead
â”” @ DataFramesMeta /home/shpark/.julia/packages/DataFramesMeta/mHJrB/src/parsing.jl:62
Out[31]:
3-element Vector{Int64}:
 5
 7
 9
In [32]:
@with(df2,$(x_str) + $(y_str))
Out[32]:
3-element Vector{Int64}:
 5
 7
 9
In [33]:
df
Out[33]:

8 rows × 3 columns

xyz
Int64CharBool
11a1
22b0
33c1
44d0
55e1
66f0
77g1
88h0
In [34]:
# a very useful macro for filtering
DataFramesMeta.@subset(df, :x .< 4, :z .== true)
Out[34]:

2 rows × 3 columns

xyz
Int64CharBool
11a1
23c1
In [35]:
# create a new DataFrame based on the old one
DataFramesMeta.@select(df, :x, y = 2*:x,z=:y)
Out[35]:

8 rows × 3 columns

xyz
Int64Int64Char
112a
224b
336c
448d
5510e
6612f
7714g
8816h
In [36]:
# create a new DataFrame adding columns based on old one
@btime DataFramesMeta.@transform($df,:x = 2*:x, :y=:x)
  49.959 μs (228 allocations: 12.59 KiB)
Out[36]:

8 rows × 3 columns

xyz
Int64Int64Bool
1211
2420
3631
4840
51051
61260
71471
81680
In [37]:
@btime transform($df,:x=>ByRow(x->2*x)=>:x,:x=>:y)
  45.753 μs (214 allocations: 11.95 KiB)
Out[37]:

8 rows × 3 columns

xyz
Int64Int64Bool
1211
2420
3631
4840
51051
61260
71471
81680
In [38]:
# dorting into a new data frame, less powerful than sort, 
# but lightweight
@btime DataFramesMeta.@orderby($df,:z,-:x)
  38.649 μs (183 allocations: 10.83 KiB)
Out[38]:

8 rows × 3 columns

xyz
Int64CharBool
18h0
26f0
34d0
42b0
57g1
65e1
73c1
81a1
In [39]:
@btime sort($df,[:z,:x],rev=[false,true])
  4.551 μs (49 allocations: 3.98 KiB)
Out[39]:

8 rows × 3 columns

xyz
Int64CharBool
18h0
26f0
34d0
42b0
57g1
65e1
73c1
81a1
In [40]:
using Chain
In [41]:
# chainning of operations on DataFrame
@chain df begin
  @subset(:x .< 5)
  @orderby(:z)
  @transform(:x²= :x .^2)
  @select(:z,:x,:x²)
end
Out[41]:

4 rows × 3 columns

zxx²
BoolInt64Int64
1024
20416
3111
4139

DataFramesMeta -working on grouped DataFrame

In [42]:
df = DataFrame(a=1:12, b=repeat('a':'d', outer=3))
Out[42]:

12 rows × 2 columns

ab
Int64Char
11a
22b
33c
44d
55a
66b
77c
88d
99a
1010b
1111c
1212d
In [43]:
g = groupby(df, :b)
Out[43]:

GroupedDataFrame with 4 groups based on key: b

First Group (3 rows): b = 'a'

ab
Int64Char
11a
25a
39a

Last Group (3 rows): b = 'd'

ab
Int64Char
14d
28d
312d
In [44]:
using Statistics
In [45]:
# groupby + combine in one shot
@btime DataFramesMeta.@by(df,:b,:first=first(:a),:last=last(:a),:mean=mean(:a))
  173.221 μs (486 allocations: 30.22 KiB)
Out[45]:

4 rows × 4 columns

bfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0
In [46]:
@btime @chain df begin
  groupby(:b)
  combine(:a=>first=>:first,:a=>last=>:last,:a=>mean=>:mean)
end
  157.729 μs (444 allocations: 28.30 KiB)
Out[46]:

4 rows × 4 columns

bfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0
In [51]:
# the same as by but on grouped DataFrame
@btime @combine($g,:first=first(:a),:last=last(:a), :mean=mean(:a))
  170.003 μs (485 allocations: 29.92 KiB)
Out[51]:

4 rows × 4 columns

bfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0
In [52]:
# similar in DataFrames.jl 
@btime combine($g,:a .=> [first,last, mean] .=> [:first,:last,:mean])
  142.343 μs (411 allocations: 27.36 KiB)
Out[52]:

4 rows × 4 columns

bfirstlastmean
CharInt64Int64Float64
1a195.0
2b2106.0
3c3117.0
4d4128.0
In [114]:
# perform operations within a group and return ungrouped DataFrame
@btime @transform($g, :center=mean(:a), :centered = :a .- mean(:a))
  217.743 μs (622 allocations: 37.77 KiB)
Out[114]:

12 rows × 4 columns

abcentercentered
Int64CharFloat64Float64
11a5.0-4.0
22b6.0-4.0
33c7.0-4.0
44d8.0-4.0
55a5.00.0
66b6.00.0
77c7.00.0
88d8.00.0
99a5.04.0
1010b6.04.0
1111c7.04.0
1212d8.04.0
In [115]:
@btime transform($g,:a.=>[mean,(a-> (a .- mean(a)))] .=> [:center, :centered])
  217.313 μs (592 allocations: 37.11 KiB)
Out[115]:

12 rows × 4 columns

abcentercentered
Int64CharFloat64Float64
11a5.0-4.0
22b6.0-4.0
33c7.0-4.0
44d8.0-4.0
55a5.00.0
66b6.00.0
77c7.00.0
88d8.00.0
99a5.04.0
1010b6.04.0
1111c7.04.0
1212d8.04.0
In [110]:
# this is defined in DataFrames.jl
DataFrame(g)
Out[110]:

12 rows × 2 columns

ab
Int64Char
11a
25a
39a
42b
56b
610b
73c
87c
911c
104d
118d
1212d
In [111]:
# actually this is not the same as DataFrame()
# as it preserves the original row order
@transform(g)
Out[111]:

12 rows × 2 columns

ab
Int64Char
11a
22b
33c
44d
55a
66b
77c
88d
99a
1010b
1111c
1212d

DataFramesMeta - rowwise operations on DataFrame

In [320]:
df = DataFrame(a=1:12, b=repeat(1:4, outer=3))
Out[320]:

12 rows × 2 columns

ab
Int64Int64
111
222
333
444
551
662
773
884
991
10102
11113
12124
In [321]:
# such conditions are often needed but are complex to write
@btime @transform($df, :x=ifelse.((:a .> 6) .& (:b .== 4),"yes","no"))
  38.167 μs (162 allocations: 9.62 KiB)
Out[321]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes
In [322]:
@btime transform($df, [:a,:b]=>
  ByRow((a,b)->ifelse((a > 6) & (b == 4),"yes","no"))=>:x)
  40.235 μs (165 allocations: 9.72 KiB)
Out[322]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes
In [323]:
# one option is to use a function that works on a single observation and 
# broadcast it 
myfunc(a,b) = a > 6 && b == 4 ? "yes" : "no"
@btime @transform($df, :x=myfunc.(:a,:b))
  39.701 μs (165 allocations: 9.72 KiB)
Out[323]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes
In [324]:
# or you can use @eachrow macro that allows you to process DataFrame rowwise
@btime @eachrow $df begin
  @newcol x::Vector{String}
    :x = :a > 6 && :b == 4 ? "yes" : "no"
end
  40.225 μs (171 allocations: 10.16 KiB)
Out[324]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

Very Fast!!!

In [325]:
@btime @chain $df begin
  _.x = ifelse.((_.a .> 6) .&  (_.b .== 4) ,"yes","no")
end
df
  2.396 μs (12 allocations: 592 bytes)
Out[325]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes
In [329]:
@btime transform($df,[:a,:b]=>ByRow((a,b)->ifelse(a > 6 && b == 4,"yes","no"))=>:x)
  40.337 μs (169 allocations: 9.98 KiB)
Out[329]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

You can also use eachrow from DataFrames to perform the same transformation.

Very Fast!!!

In [330]:
df2 = copy(df) 
df2.x = Vector{String}(undef,nrow(df2))
@btime begin
  for row in eachrow($df2)
    row[:x] = row[:a] > 6 && row[:b] == 4 ? "yes" : "no"
  end
end
df2
  1.825 μs (0 allocations: 0 bytes)
Out[330]:

12 rows × 3 columns

abx
Int64Int64String
111no
222no
333no
444no
551no
662no
773no
884yes
991no
10102no
11113no
12124yes

Visualizing data with StatsPlots

In [331]:
using StatsPlots
default(fmt=:png)

We present only a minimal functionality of the package

In [332]:
using Random
Random.seed!(1)
df = DataFrame(x=sort(randn(1_000)), y=randn(1_000),
        z=[fill("b",500);fill("a",500)])
Out[332]:

1,000 rows × 3 columns

xyz
Float64Float64String
1-4.72636-1.6144b
2-3.299870.683288b
3-2.7486-1.95519b
4-2.633991.18536b
5-2.62741.58449b
6-2.62056-0.144271b
7-2.599180.010576b
8-2.549670.019725b
9-2.52371-0.588736b
10-2.46726-0.40365b
11-2.463641.24812b
12-2.38138-0.533295b
13-2.373161.05646b
14-2.289590.626849b
15-2.26709-0.102321b
In [333]:
@df df plot(:x,:y, legend=:topleft,label="y(x)")
Out[333]:
In [334]:
@df df density(:x,label="")
Out[334]:
In [335]:
@df df histogram(:y, label="y")
Out[335]:
In [336]:
@df df boxplot(:z,:x,label="x")
Out[336]:
In [337]:
@df df violin(:z,:y,label="y")
Out[337]:
In [ ]: