Introduction to DataFrames

01_constructor

In [1]:
using DataFrames, Random

Constructors and conversion

In [5]:
x = Dict("A"=>[1,2], "B"=>[true,false],"C"=>['a','b'],"fixed"=>Ref([1,1]))
DataFrame(x)
Out[5]:

2 rows × 4 columns

     A      B     C     fixed
     Int64  Bool  Char  Array…
1    1      1     a     [1, 1]
2    2      0     b     [1, 1]
In [17]:
DataFrame([rand(3) for _ in 1:3], :auto)
Out[17]:

3 rows × 3 columns

     x1         x2        x3
     Float64    Float64   Float64
1    0.837954   0.735099  0.759242
2    0.0328772  0.988991  0.384152
3    0.741292   0.943074  0.794541
In [20]:
DataFrame([1,2,3,4]',:auto)
Out[20]:

1 rows × 4 columns

     x1     x2     x3     x4
     Int64  Int64  Int64  Int64
1    1      2      3      4
In [28]:
DataFrame(:x1=>[1,2,3,4])
Out[28]:

4 rows × 1 columns

     x1
     Int64
1    1
2    2
3    3
4    4
In [29]:
DataFrame(rand(3,4),Symbol.('a':'d'))
Out[29]:

3 rows × 4 columns

     a         b          c         d
     Float64   Float64    Float64   Float64
1    0.583038  0.0706006  0.534749  0.52262
2    0.209442  0.963666   0.79157   0.32516
3    0.272997  0.736443   0.751909  0.337337
In [31]:
DataFrame(rand(3,4),string.('a':'d'))
Out[31]:

3 rows × 4 columns

     a         b         c         d
     Float64   Float64   Float64   Float64
1    0.955803  0.989704  0.65235   0.0691728
2    0.431441  0.39455   0.408381  0.843323
3    0.375153  0.373222  0.126727  0.3333
  • Empty DataFrame
In [32]:
DataFrame(A=Int[],B=Float64[],C=String[])
Out[32]:

0 rows × 3 columns

     A      B        C
     Int64  Float64  String
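An empty data frame with typed columns is typically a starting point for appending rows later. The following is a minimal sketch of that pattern using push!; the row values are made up for illustration.

```julia
using DataFrames

# Start from an empty data frame with typed columns, as in the cell above.
df = DataFrame(A=Int[], B=Float64[], C=String[])

# Append rows one at a time. A row can be given as a Tuple (matched by
# position) or as a NamedTuple (matched by column name).
push!(df, (1, 2.5, "first"))
push!(df, (A=2, B=3.5, C="second"))

@show size(df)
@show df.C
```

The column element types fixed at construction are kept, so each pushed value must be convertible to the type of its column.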
In [33]:
x = DataFrame(a=1:2,b='a':'b')
Out[33]:

2 rows × 2 columns

     a      b
     Int64  Char
1    1      a
2    2      b
In [34]:
y = copy(x)
Out[34]:

2 rows × 2 columns

     a      b
     Int64  Char
1    1      a
2    2      b
In [35]:
x == y
Out[35]:
true
In [36]:
x === y
Out[36]:
false
In [37]:
x.a == y.a
Out[37]:
true
In [39]:
x.a === y.a
Out[39]:
false

You can use copycols=false to avoid copying the columns

In [40]:
x = DataFrame(a=1:2,b='a':'b')
Out[40]:

2 rows × 2 columns

     a      b
     Int64  Char
1    1      a
2    2      b
In [41]:
y = DataFrame(x, copycols=false)
Out[41]:

2 rows × 2 columns

     a      b
     Int64  Char
1    1      a
2    2      b
In [42]:
(x===y), (x==y), (x.a==y.a), (x.a === y.a)
Out[42]:
(false, true, true, true)
In [47]:
y.a[1] = 20
Out[47]:
20
In [49]:
y
Out[49]:

2 rows × 2 columns

     a      b
     Int64  Char
1    20     a
2    2      b
In [48]:
x
Out[48]:

2 rows × 2 columns

     a      b
     Int64  Char
1    20     a
2    2      b
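The same copycols keyword also applies when a data frame is built directly from vectors. The sketch below, with an illustrative vector v, contrasts the default (columns are copied) with copycols=false (the vector is reused as the column).

```julia
using DataFrames

v = [1, 2]

copied = DataFrame(a=v)                  # default: the column is a copy of v
shared = DataFrame(a=v; copycols=false)  # the column is v itself

@show copied.a === v
@show shared.a === v

# Mutating the source vector is only visible through the shared column.
v[1] = 100
@show copied.a[1]
@show shared.a[1]
```

Sharing avoids an allocation, at the cost that later changes to v silently change the data frame as well.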

You can create a similar uninitialized DataFrame based on an original one

In [50]:
x = DataFrame(a=1,b=1.0)
Out[50]:

1 rows × 2 columns

     a      b
     Int64  Float64
1    1      1.0
In [51]:
similar(x)
Out[51]:

1 rows × 2 columns

     a                b
     Int64            Float64
1    139985634015456  6.91633e-310
In [53]:
similar(x,0)
Out[53]:

0 rows × 2 columns

     a      b
     Int64  Float64
In [54]:
similar(x,2)
Out[54]:

2 rows × 2 columns

     a                b
     Int64            Float64
1    139985933978992  6.91633e-310
2    139985912091824  6.91633e-310
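The values shown by similar are whatever happened to be in memory, so only the column names and element types carry over. A small sketch that checks this and then fills the columns before use (the fill values are arbitrary):

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
s = similar(x, 3)   # same column names and element types, 3 uninitialized rows

@show names(s) == names(x)
@show nrow(s)
@show eltype.(eachcol(s))

# The contents are arbitrary until assigned, so fill them before reading.
s.a .= 0
s.b .= 0.0
@show s
```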
In [98]:
# [1,1]: first row, first row => the first row is selected twice
# :    => all columns
sdf = view(x,[1,1],:)
Out[98]:

2 rows × 2 columns

     a      b
     Int64  Float64
1    1      1.0
2    1      1.0
In [99]:
typeof(sdf)
Out[99]:
SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}
  • Changing a value through the view changes the value in the original data frame
In [115]:
sdf.a[2] = 2
Out[115]:
2
In [116]:
x
Out[116]:

1 rows × 2 columns

     a      b
     Int64  Float64
1    2      1.0
In [100]:
dfr = x[1,:]
Out[100]:

DataFrameRow (2 columns)

     a      b
     Int64  Float64
1    1      1.0
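Like a SubDataFrame, a DataFrameRow obtained with x[1, :] is a view into its parent, so writes go through to the original data frame. A minimal sketch with illustrative data, also showing that copy detaches the row as a NamedTuple:

```julia
using DataFrames

x = DataFrame(a=[1, 2], b=[1.0, 2.0])

dfr = x[1, :]            # DataFrameRow: a view of row 1 of x
@show parent(dfr) === x

dfr.a = 10               # writes through to the parent data frame
@show x.a

nt = copy(dfr)           # an independent NamedTuple snapshot of the row
@show nt
```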

Conversion to a matrix

In [117]:
x = DataFrame(x=1:2, y=["A","B"])
Out[117]:

2 rows × 2 columns

     x      y
     Int64  String
1    1      A
2    2      B
In [127]:
@time x |> Matrix
  0.000034 seconds (4 allocations: 272 bytes)
Out[127]:
2×2 Matrix{Any}:
 1  "A"
 2  "B"
In [128]:
@time Matrix(x)
  0.000034 seconds (4 allocations: 272 bytes)
Out[128]:
2×2 Matrix{Any}:
 1  "A"
 2  "B"
In [153]:
@time x |> Array
  0.000033 seconds (4 allocations: 272 bytes)
Out[153]:
2×2 Matrix{Any}:
 1  "A"
 2  "B"
In [154]:
@time Array(x)
  0.000037 seconds (4 allocations: 272 bytes)
Out[154]:
2×2 Matrix{Any}:
 1  "A"
 2  "B"
In [155]:
x = DataFrame(x=1:2,y=[missing,"B"])
Out[155]:

2 rows × 2 columns

     x      y
     Int64  String?
1    1      missing
2    2      B
In [156]:
x |> Matrix
Out[156]:
2×2 Matrix{Any}:
 1  missing
 2  "B"
In [158]:
x|>Array
Out[158]:
2×2 Matrix{Any}:
 1  missing
 2  "B"
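The conversions above all produce Matrix{Any} because the columns mix Int64 and String. When the column types can be promoted to a common type, the result has a concrete element type. The sketch below uses made-up data frames to illustrate this, along with requesting an element type explicitly.

```julia
using DataFrames

ints  = DataFrame(a=1:2, b=3:4)              # all integer columns
mixed = DataFrame(a=1:2, b=[0.5, 1.5])       # integers and floats
miss  = DataFrame(a=[1, missing], b=[3, 4])  # one column allows missing

@show eltype(Matrix(ints))
@show eltype(Matrix(mixed))
@show eltype(Matrix(miss))

# The target element type can also be requested explicitly.
as_float = Matrix{Float64}(ints)
@show as_float
```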
In [159]:
x = DataFrame(x=1:2, y=["A","B"])
Out[159]:

2 rows × 2 columns

     x      y
     Int64  String
1    1      A
2    2      B
In [161]:
rt = Tables.rowtable(x)
Out[161]:
2-element Vector{NamedTuple{(:x, :y), Tuple{Int64, String}}}:
 (x = 1, y = "A")
 (x = 2, y = "B")
In [162]:
ct = Tables.columntable(x)
Out[162]:
(x = [1, 2], y = ["A", "B"])
In [163]:
DataFrame(rt)
Out[163]:

2 rows × 2 columns

     x      y
     Int64  String
1    1      A
2    2      B
In [164]:
DataFrame(ct)
Out[164]:

2 rows × 2 columns

     x      y
     Int64  String
1    1      A
2    2      B
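The cells above amount to a round trip through the two Tables.jl representations. The sketch below checks that round trip and, as a further illustration, builds a data frame from a hand-written vector of NamedTuples. It refers to Tables the same way the notebook does, after loading DataFrames.

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

rt = Tables.rowtable(x)      # Vector of NamedTuples, one per row
ct = Tables.columntable(x)   # NamedTuple of Vectors, one per column

@show DataFrame(rt) == x
@show DataFrame(ct) == x

# A plain vector of NamedTuples is itself a valid row table.
df = DataFrame([(x=1, y="A"), (x=2, y="B")])
@show df == x
```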

Iterating a data frame by rows or columns

In [169]:
ec = eachcol(x)
Out[169]:

2×2 DataFrameColumns

     x      y
     Int64  String
1    1      A
2    2      B
  • A DataFrameColumns object behaves like a vector (note, though, that it is not an AbstractVector)
In [170]:
ec isa AbstractVector
Out[170]:
false
In [171]:
isa(ec,AbstractVector)
Out[171]:
false
In [172]:
ec[1]
Out[172]:
2-element Vector{Int64}:
 1
 2
In [174]:
ec["y"]
Out[174]:
2-element Vector{String}:
 "A"
 "B"
In [175]:
ec[:y]
Out[175]:
2-element Vector{String}:
 "A"
 "B"
In [188]:
ec.y
Out[188]:
2-element Vector{String}:
 "A"
 "B"
In [190]:
ec.y[1]
Out[190]:
"A"
In [176]:
er = eachrow(x)
Out[176]:

2×2 DataFrameRows

     x      y
     Int64  String
1    1      A
2    2      B
  • DataFrameRows is an AbstractVector
In [177]:
er isa AbstractVector
Out[177]:
true
In [180]:
er[end]
Out[180]:

DataFrameRow (2 columns)

     x      y
     Int64  String
2    2      B
In [186]:
er.y
Out[186]:
2-element Vector{String}:
 "A"
 "B"
In [187]:
er.y[1]
Out[187]:
"A"
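Besides indexing, both wrappers are meant to be looped over. A short sketch of each, using the same x as above (the labels variable is only illustrative):

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

# Column-wise: pairs gives the column name together with the column vector.
for (name, col) in pairs(eachcol(x))
    println(name, " => ", eltype(col))
end

# Row-wise: each element is a DataFrameRow with property access to the cells.
labels = [string(row.x, row.y) for row in eachrow(x)]
@show labels
```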

Note that data frames, as well as DataFrameColumns and DataFrameRows objects, are not type stable (they do not know the types of their columns). This is useful to avoid compilation cost if you have very wide data frames with heterogeneous column types.

However, often (especially if a data frame is narrow) it is useful to create a lazy iterator that produces NamedTuples for each row of the DataFrame. Its key benefit is that it is type stable, so it is useful when you want to perform some operations quickly on a small subset of columns of a DataFrame. This strategy is often used internally by the DataFrames.jl package:

In [191]:
nti = Tables.namedtupleiterator(x)
Out[191]:
Tables.NamedTupleIterator{Tables.Schema{(:x, :y), Tuple{Int64, String}}, Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}}(Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}((x = [1, 2], y = ["A", "B"]), 2))
In [211]:
for row in enumerate(nti)
  @show row
  @show row[2].y
end
row = (1, (x = 1, y = "A"))
(row[2]).y = "A"
row = (2, (x = 2, y = "B"))
(row[2]).y = "B"
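One common way to benefit from the type-stable iterator is to pass it to a function, so that the loop is compiled for the concrete NamedTuple type. The following is a sketch under that assumption; weighted_sum and the data are hypothetical and not part of DataFrames.jl.

```julia
using DataFrames

df = DataFrame(x=1:3, y=[0.5, 1.5, 2.5])

# Hypothetical helper: inside the function the row type is known to the
# compiler, so field access r.x and r.y needs no dynamic dispatch.
function weighted_sum(rows)
    s = 0.0
    for r in rows
        s += r.x * r.y
    end
    return s
end

total = weighted_sum(Tables.namedtupleiterator(df))
@show total   # 1*0.5 + 2*1.5 + 3*2.5 = 11.0
```

To restrict the work to a subset of columns, the iterator can be created from a selection such as df[!, [:x, :y]] on a wider data frame.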

Handling of duplicate column names

We can pass makeunique=true to allow duplicate column names; the duplicates are made unique by automatically adding a suffix:

In [212]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)
Out[212]:

1 rows × 3 columns

     a      a_2    a_1
     Int64  Int64  Int64
1    1      2      3

Otherwise, duplicates are not allowed.

In [214]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)
ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] make_unique!(names::Vector{Symbol}, src::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/utils.jl:35
 [2] #make_unique#6
   @ ~/.julia/packages/DataFrames/pVFzb/src/other/utils.jl:57 [inlined]
 [3] #Index#9
   @ ~/.julia/packages/DataFrames/pVFzb/src/other/index.jl:27 [inlined]
 [4] DataFrame(::Pair{Symbol, Int64}, ::Vararg{Pair{Symbol, Int64}, N} where N; makeunique::Bool, copycols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/dataframe/dataframe.jl:244
 [5] DataFrame(::Pair{Symbol, Int64}, ::Vararg{Pair{Symbol, Int64}, N} where N)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/dataframe/dataframe.jl:242
 [6] top-level scope
   @ In[214]:1
 [7] eval
   @ ./boot.jl:360 [inlined]
 [8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
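Both behaviours can be checked without reading printed output. A small sketch that inspects the generated names and captures the error raised for duplicates (the err variable is illustrative):

```julia
using DataFrames

# With makeunique=true the second :a is renamed; a_2 is chosen because
# a_1 is already taken by another column.
df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)
@show names(df)

# Without it, constructing a data frame with duplicate names throws.
err = try
    DataFrame(:a => 1, :a => 2)
    nothing
catch e
    e
end
@show err isa ArgumentError
```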

Observe that currently the value nothing is not printed (its cell is left blank) when displaying a DataFrame in Jupyter Notebook:

In [215]:
df = DataFrame(x=[1,nothing], y=[nothing,"a"],z=[missing,"c"])
Out[215]:

2 rows × 3 columns

     x       y       z
     Union…  Union…  String?
1    1               missing
2            a       c
In [216]:
empty!(df)
Out[216]:

0 rows × 3 columns

     x       y       z
     Union…  Union…  String?
In [217]:
df
Out[217]:

0 rows × 3 columns

     x       y       z
     Union…  Union…  String?
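Since blank cells hide where nothing occurs, it can help to inspect the columns programmatically. The sketch below uses a reduced version of the data frame above to show the element types, locate nothing and missing values, and confirm that empty! removes the rows in place while keeping the columns.

```julia
using DataFrames

df = DataFrame(x=[1, nothing], z=[missing, "c"])

x_type = eltype(df.x)   # a Union that includes Nothing
z_type = eltype(df.z)   # a Union that includes Missing
@show x_type z_type

@show isnothing.(df.x)
@show ismissing.(df.z)

empty!(df)              # removes all rows in place; columns are kept
@show size(df)
@show names(df)
```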