Introduction to DataFrames

12_pitfalls

DataFrame v1.2, Julia 1.6.1

Possible pitfalls

Know what is copied when creating a DataFrame

In [20]:
using DataFrames,Pipe, Chain
using BenchmarkTools
In [3]:
x = DataFrame(rand(3,5),:auto)
Out[3]:

3 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.5176070.9968420.6035990.2310840.537229
20.8532280.358960.4736050.05547810.196862
30.8673230.9822730.5151270.8294470.0107014
In [4]:
y = copy(x)
x === y # not the same objwct
Out[4]:
false
In [5]:
y = DataFrame(x)
Out[5]:

3 rows × 5 columns

x1x2x3x4x5
Float64Float64Float64Float64Float64
10.5176070.9968420.6035990.2310840.537229
20.8532280.358960.4736050.05547810.196862
30.8673230.9822730.5151270.8294470.0107014
In [6]:
x === y
Out[6]:
false
In [7]:
any(x[!,i] === y[!,i] for i in ncol(x)) # the columns are also not the same
Out[7]:
false
In [8]:
y = DataFrame(x,copycols=false)
x === y
Out[8]:
false
In [9]:
all(x[!,i] === y[!,i] for i in ncol(x)) # the columns are the same
Out[9]:
true
In [10]:
any(x[!,i] === y[!,i] for i in ncol(x)) # the columns are also not the same
Out[10]:
true
In [11]:
# the same when creating data frame using kwarg syntax
x = 1:3; y = [1,2,3]; df = DataFrame(x=x,y=y)
Out[11]:

3 rows × 2 columns

xy
Int64Int64
111
222
333
In [12]:
y === df.y # different object
Out[12]:
false
In [13]:
typeof(x), typeof(df.x) # range is converted to vector
Out[13]:
(UnitRange{Int64}, Vector{Int64})

Slicing rows always create a copy

In [14]:
y === df[:,:y]
Out[14]:
false

You can avoid copying by using `copycols=false` keyword argument in functions.

In [15]:
df = DataFrame(x=x, y=y, copycols=false)
Out[15]:

3 rows × 2 columns

xy
Int64Int64
111
222
333
In [16]:
y === df.y # now it is the same
Out[16]:
true
In [21]:
@btime select($df,:y)[!,1]
  671.210 ns (17 allocations: 1.70 KiB)
Out[21]:
3-element Vector{Int64}:
 1
 2
 3
In [22]:
@btime @pipe select($df, :y) |> _[!,1]
  671.630 ns (17 allocations: 1.70 KiB)
Out[22]:
3-element Vector{Int64}:
 1
 2
 3
In [24]:
select(df,:y)[!,1] === y # not the same
Out[24]:
false
In [25]:
select(df, :y, copycols=false)[!,1] === y # the same
Out[25]:
true

Do not modify the parent of GroupedDataFrame or view

In [27]:
x = DataFrame(id=repeat([1,2],inner=3),x=1:6)
Out[27]:

6 rows × 2 columns

idx
Int64Int64
111
212
313
424
525
626
In [28]:
x = DataFrame(id=repeat([1,2],outer=3),x=1:6)
Out[28]:

6 rows × 2 columns

idx
Int64Int64
111
222
313
424
515
626
In [29]:
g = groupby(x, :id)
Out[29]:

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 1

idx
Int64Int64
111
213
315

Last Group (3 rows): id = 2

idx
Int64Int64
122
224
326
In [30]:
x[1:3,1] = [2,2,2]
x
Out[30]:

6 rows × 2 columns

idx
Int64Int64
121
222
323
424
515
626

Well - it is wrong now, `g` is only a view

In [31]:
g
Out[31]:

GroupedDataFrame with 2 groups based on key: id

First Group (3 rows): id = 2

idx
Int64Int64
121
223
315

Last Group (3 rows): id = 2

idx
Int64Int64
122
224
326
In [33]:
s = view(x,5:6,:)
Out[33]:

2 rows × 2 columns

idx
Int64Int64
115
226
In [34]:
delete!(x,3:6)
Out[34]:

2 rows × 2 columns

idx
Int64Int64
121
222
In [38]:
s
BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]

Stacktrace:
  [1] throw_boundserror(A::Vector{Int64}, I::Tuple{UnitRange{Int64}})
    @ Base ./abstractarray.jl:651
  [2] checkbounds
    @ ./abstractarray.jl:616 [inlined]
  [3] view(A::Vector{Int64}, I::UnitRange{Int64})
    @ Base ./subarray.jl:177
  [4] view
    @ ~/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:129 [inlined]
  [5] getindex
    @ ~/.julia/packages/DataFrames/vuMM8/src/subdataframe/subdataframe.jl:166 [inlined]
  [6] getindex
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/iteration.jl:200 [inlined]
  [7] iterate(itr::DataFrames.DataFrameColumns{SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}}, i::Int64) (repeats 2 times)
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/iteration.jl:198
  [8] _show(io::IOContext{IOBuffer}, df::SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}}; allrows::Bool, allcols::Bool, rowlabel::Symbol, summary::Bool, eltypes::Bool, rowid::Nothing, truncate::Int64, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/show.jl:167
  [9] #show#692
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/show.jl:348 [inlined]
 [10] show(io::IOContext{IOBuffer}, df::SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}})
    @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/show.jl:348
 [11] #show#707
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/io.jl:138 [inlined]
 [12] show
    @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/io.jl:138 [inlined]
 [13] limitstringmime(mime::MIME{Symbol("text/plain")}, x::SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}})
    @ IJulia ~/.julia/packages/IJulia/e8kqU/src/inline.jl:43
 [14] display_mimestring
    @ ~/.julia/packages/IJulia/e8kqU/src/display.jl:71 [inlined]
 [15] display_dict(x::SubDataFrame{DataFrame, DataFrames.Index, UnitRange{Int64}})
    @ IJulia ~/.julia/packages/IJulia/e8kqU/src/display.jl:102
 [16] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [17] invokelatest
    @ ./essentials.jl:706 [inlined]
 [18] execute_request(socket::ZMQ.Socket, msg::IJulia.Msg)
    @ IJulia ~/.julia/packages/IJulia/e8kqU/src/execute_request.jl:112
 [19] #invokelatest#2
    @ ./essentials.jl:708 [inlined]
 [20] invokelatest
    @ ./essentials.jl:706 [inlined]
 [21] eventloop(socket::ZMQ.Socket)
    @ IJulia ~/.julia/packages/IJulia/e8kqU/src/eventloop.jl:8
 [22] (::IJulia.var"#15#18")()
    @ IJulia ./task.jl:411

Single column selection for DataFrame creates aliases with ! and getproperty syntax and copies with:

In [41]:
x = DataFrame(a=1:3)
x.b = x[!,1] # alias
x.c = x[:,1] # copy
x.d = x[!,1][:] # copy
x.e = copy(x[!,1]) # explicit copy
display(x)
x[1,1] = 100
"x.b column은 x.a의 alias로써 x[1,1]=100 으로 변경된 경우 x.b도 변경됨" |> println
display(x)

3 rows × 5 columns

abcde
Int64Int64Int64Int64Int64
111111
222222
333333

3 rows × 5 columns

abcde
Int64Int64Int64Int64Int64
1100100111
222222
333333
x.b column은 x.a의 alias로써 x[1,1]=100 으로 변경된 경우 x.b도 변경됨

When iterating rows of a data frame use eachrow to avoid compilation cost(wide tables), but Tables.namedtupleiterator for fast execution(tall table)

this table is wide

In [42]:
df1 = DataFrame([rand([1:2,'a':'b',false:true,1.0:2.0]) for i in 1:900], :auto)
Out[42]:

2 rows × 900 columns (omitted printing of 889 columns)

x1x2x3x4x5x6x7x8x9x10x11
Float64Float64Int64Float64CharInt64Int64BoolFloat64CharChar
11.01.011.0a1101.0aa
22.02.022.0b2212.0bb
In [44]:
@time collect(eachrow(df1))
  0.000013 seconds (4 allocations: 256 bytes)
Out[44]:
2-element Vector{DataFrameRow}:
 DataFrameRow
 Row  x1       x2       x3     x4       x5    x6     x7     x8     x9       x      Float64  Float64  Int64  Float64  Char  Int64  Int64  Bool   Float64  C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1.0      1.0      1      1.0  a         1      1  false      1.0  a ⋯
                                                             891 columns omitted
 DataFrameRow
 Row  x1       x2       x3     x4       x5    x6     x7     x8    x9       x1      Float64  Float64  Int64  Float64  Char  Int64  Int64  Bool  Float64  Ch ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   2 │     2.0      2.0      2      2.0  b         2      2  true      2.0  b  ⋯
                                                             891 columns omitted
In [48]:
@time collect(Tables.namedtupleiterator(df1));
  0.004972 seconds (5.28 k allocations: 12.925 MiB)

as you can see the time to compile Tables.namedtupleiterator is very large in this case, and it would get much worse if the table was wider (that is why we include this tip in pitfalls notebook)

the table below is tall

In [49]:
df2 = DataFrame(rand(10^6, 10),:auto)
Out[49]:

1,000,000 rows × 10 columns (omitted printing of 3 columns)

x1x2x3x4x5x6x7
Float64Float64Float64Float64Float64Float64Float64
10.1931080.3222830.2324870.06674510.0994170.09993910.219174
20.09808230.3728840.4279470.7653230.6424460.8981550.412915
30.9119350.6048470.06284060.871180.3330640.943440.636854
40.2081310.9758560.413420.2716230.5495170.4434590.299266
50.7085720.75620.6642320.2605090.8267770.3693610.213487
60.4263920.9614420.07287440.5358040.3746990.7500840.132052
70.8537990.5930910.7240570.5971940.5130780.7026250.374183
80.3387750.7633460.7924930.722920.2832210.1164490.103614
90.2227920.7153210.1359470.2141490.9270370.7304220.660307
100.0442160.1526890.9491240.2666090.9412060.982870.0150329
110.6147810.003730960.6780520.2698920.07226590.2636930.879764
120.5260070.06695360.2900490.394580.4874580.8416160.961109
130.4635430.1950050.9200270.3950630.6611530.6592750.449859
140.4677210.2091890.01561280.8892840.234220.05329210.596516
150.8458520.3381960.006777970.007593560.002055550.1402430.616174
160.5754290.3314340.885570.5981610.785530.06350050.189848
170.5184410.1672930.8024510.07459070.665920.1725040.82874
180.5778980.8320010.2477090.4310770.8722550.6278130.259124
190.02608390.2650030.4574740.9147950.5815290.6270050.105869
200.6884030.8713780.05054630.7298070.9849560.6598020.358253
210.6514550.1776310.9896610.5664730.3863590.3942670.828212
220.9518930.2040180.660190.7873450.2535570.01971260.516339
230.1814790.7478340.7155750.6614840.5571110.7136040.320488
240.3872630.9545480.6449090.621920.5882390.9919440.468822
250.1559320.4407010.2949630.8999410.5230530.2130670.219863
260.7637750.4195950.007344790.04968310.9439620.4453210.307321
270.5397550.7182690.4918470.5154540.829540.7058410.0535696
280.2092540.5390450.8512560.8143850.5469470.4511690.878354
290.9405290.0490920.7791940.05379090.4344730.3199480.287719
300.4760810.3698540.9592690.8158220.1373050.203070.987843
In [50]:
@time map(sum,eachrow(df2))
  2.338953 seconds (60.18 M allocations: 1.061 GiB, 12.63% gc time, 5.80% compilation time)
Out[50]:
1000000-element Vector{Float64}:
 2.704071523015175
 4.411682447713391
 5.921627951565904
 3.643846052647391
 5.511158451988555
 5.238299046906927
 5.361935745226174
 4.07640255971654
 4.188105980469235
 4.420129273862483
 4.577410055812212
 5.077443374765336
 5.199538366982413
 ⋮
 5.143906356321334
 3.1391696688346715
 6.28315535477638
 5.5665712737566295
 3.2598167419720845
 5.491525479182397
 5.781255375508597
 2.4901351267964995
 5.751675463507496
 4.2993837933271
 4.888116627432323
 4.240389765946817
In [52]:
@time map(sum,eachrow(df2))
  1.956384 seconds (59.99 M allocations: 1.050 GiB, 4.18% gc time)
Out[52]:
1000000-element Vector{Float64}:
 2.704071523015175
 4.411682447713391
 5.921627951565904
 3.643846052647391
 5.511158451988555
 5.238299046906927
 5.361935745226174
 4.07640255971654
 4.188105980469235
 4.420129273862483
 4.577410055812212
 5.077443374765336
 5.199538366982413
 ⋮
 5.143906356321334
 3.1391696688346715
 6.28315535477638
 5.5665712737566295
 3.2598167419720845
 5.491525479182397
 5.781255375508597
 2.4901351267964995
 5.751675463507496
 4.2993837933271
 4.888116627432323
 4.240389765946817
In [53]:
@time map(sum,Tables.namedtupleiterator(df2))
  0.280857 seconds (510.88 k allocations: 38.592 MiB, 3.12% gc time, 94.90% compilation time)
Out[53]:
1000000-element Vector{Float64}:
 2.704071523015175
 4.411682447713391
 5.921627951565904
 3.643846052647391
 5.511158451988555
 5.238299046906927
 5.361935745226174
 4.07640255971654
 4.188105980469235
 4.420129273862483
 4.577410055812212
 5.077443374765336
 5.199538366982413
 ⋮
 5.143906356321334
 3.1391696688346715
 6.28315535477638
 5.5665712737566295
 3.2598167419720845
 5.491525479182397
 5.781255375508597
 2.4901351267964995
 5.751675463507496
 4.2993837933271
 4.888116627432323
 4.240389765946817
In [54]:
@time map(sum,Tables.namedtupleiterator(df2))
  0.016954 seconds (17 allocations: 7.631 MiB)
Out[54]:
1000000-element Vector{Float64}:
 2.704071523015175
 4.411682447713391
 5.921627951565904
 3.643846052647391
 5.511158451988555
 5.238299046906927
 5.361935745226174
 4.07640255971654
 4.188105980469235
 4.420129273862483
 4.577410055812212
 5.077443374765336
 5.199538366982413
 ⋮
 5.143906356321334
 3.1391696688346715
 6.28315535477638
 5.5665712737566295
 3.2598167419720845
 5.491525479182397
 5.781255375508597
 2.4901351267964995
 5.751675463507496
 4.2993837933271
 4.888116627432323
 4.240389765946817

as you can see - this time it is much faster to iterate a type stable container

In [ ]: