Introduction to DataFrames

01_constructor

using DataFrames, Random

Constructors and conversion

x = Dict("A"=>[1,2], "B"=>[true,false],"C"=>['a','b'],"fixed"=>Ref([1,1]))
DataFrame(x)

DataFrame([rand(3) for _ in 1:3], :auto)

DataFrame([1,2,3,4]',:auto)

DataFrame(:x1=>[1,2,3,4])

DataFrame(rand(3,4),Symbol.('a':'d'))

DataFrame(rand(3,4),string.('a':'d'))

Empty DataFrame

DataFrame(A=Int[],B=Float64[],C=String[])

x = DataFrame(a=1:2,b='a':'b')

y = copy(x)

x == y

true

x === y

false

x.a == y.a

true

x.a === y.a

false

You can avoid copying the columns by passing copycols=false

x = DataFrame(a=1:2,b='a':'b')

y = DataFrame(x, copycols=false)

(x===y), (x==y), (x.a==y.a), (x.a === y.a)

(false, true, true, true)

y.a[1] = 20

20

y

x
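The same keyword can be used when a data frame is built directly from existing vectors. The following is a minimal sketch, assuming a plain vector `v` (a hypothetical name that does not appear in the cells above); it contrasts the aliased column with the default copying behavior.

```julia
using DataFrames

v = [1, 2]                                  # hypothetical source vector

aliased = DataFrame(a=v; copycols=false)    # column shares memory with v
copied  = DataFrame(a=v)                    # default copycols=true copies v

aliased.a === v     # the column is the very same vector object
copied.a === v      # the column is a distinct vector

v[1] = 20           # mutate the source vector
aliased.a[1]        # follows the change made through v
copied.a[1]         # keeps the original value
```

Aliasing saves an allocation, but any later mutation of `v` is visible through the data frame, so it is probably best reserved for cases where the source vector is not reused.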

You can create a similar uninitialized DataFrame based on an original one

x = DataFrame(a=1,b=1.0)

similar(x)

similar(x,0)

similar(x,2)
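As a rough illustration of what "similar" preserves, the sketch below checks the column names and element types of the result and then fills it before use. The variable name `s` is an assumption made for this example only.

```julia
using DataFrames

x = DataFrame(a=1, b=1.0)
s = similar(x, 3)                             # three rows, contents uninitialized

names(s) == names(x)                          # same column names
eltype.(eachcol(s)) == eltype.(eachcol(x))    # same element types
nrow(s)                                       # the requested number of rows

# The initial contents are arbitrary, so fill every cell before reading.
s.a .= 0
s.b .= 0.0
```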

# [1,1]: row 1, then row 1 again, so the first row is selected twice
# : selects all columns
sdf = view(x,[1,1],:)

typeof(sdf)

SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}

Changing a value through the view also changes the value in the original data frame

sdf.a[2] = 2

2

x

dfr = x[1,:]
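A single row selected this way is a `DataFrameRow`, which is also a view. The sketch below, using a small two-row data frame made up for the example, shows a write going through to the parent and how `copy` detaches the row.

```julia
using DataFrames

x = DataFrame(a=[1, 2], b=[1.0, 2.0])

dfr = x[1, :]             # DataFrameRow: a view of row 1
dfr.a = 100               # writes through to x
x.a[1]                    # now holds the value written via dfr

parent(dfr) === x         # the row still points at the original data frame

row_copy = copy(dfr)      # copying a DataFrameRow yields a NamedTuple
row_copy isa NamedTuple
```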

Conversion to a matrix

x = DataFrame(x=1:2, y=["A","B"])

@time x |> Matrix

  0.000034 seconds (4 allocations: 272 bytes)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

@time Matrix(x)

  0.000034 seconds (4 allocations: 272 bytes)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

@time x |> Array

  0.000033 seconds (4 allocations: 272 bytes)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

@time Array(x)

  0.000037 seconds (4 allocations: 272 bytes)

2×2 Matrix{Any}:
 1  "A"
 2  "B"

x = DataFrame(x=1:2,y=[missing,"B"])

x |> Matrix

2×2 Matrix{Any}:
 1  missing
 2  "B"

x|>Array

2×2 Matrix{Any}:
 1  missing
 2  "B"
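The `Matrix{Any}` results above come from mixing integer and string columns. As a small sketch (the data frames `mixed` and `numeric` are made up for this example), the element type of the matrix depends on which columns are converted:

```julia
using DataFrames

mixed   = DataFrame(x=1:2, y=["A", "B"])
numeric = DataFrame(x=1:2, y=[1.5, 2.5])

eltype(Matrix(mixed))             # Any: Int64 and String share no narrower type
eltype(Matrix(numeric))           # Float64: the Int64 column is promoted
eltype(Matrix(mixed[:, [:x]]))    # Int64: only the integer column is converted
```

Selecting the columns of interest before converting is one way to avoid an `Any` matrix.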

Conversion to `NamedTuple` related tabular structures

x = DataFrame(x=1:2, y=["A","B"])

rt = Tables.rowtable(x)

2-element Vector{NamedTuple{(:x, :y), Tuple{Int64, String}}}:
 (x = 1, y = "A")
 (x = 2, y = "B")

ct = Tables.columntable(x)

(x = [1, 2], y = ["A", "B"])

DataFrame(rt)

DataFrame(ct)

Iterating data frame by rows or columns

ec = eachcol(x)

A DataFrameColumns object behaves like a vector (note, though, that it is not an AbstractVector)

ec isa AbstractVector

false

isa(ec,AbstractVector)

false
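Even though it is not an `AbstractVector`, the object can still be iterated, measured with `length`, and traversed together with column names through `pairs`. A minimal sketch, with `summary_rows` as a name invented for this example:

```julia
using DataFrames

x = DataFrame(x=1:2, y=["A", "B"])

length(eachcol(x))        # number of columns

# Collect (column name, element type) for every column
summary_rows = [(name, eltype(col)) for (name, col) in pairs(eachcol(x))]
```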

ec[1]

2-element Vector{Int64}:
 1
 2

ec["y"]

2-element Vector{String}:
 "A"
 "B"

ec[:y]

2-element Vector{String}:
 "A"
 "B"

ec.y

2-element Vector{String}:
 "A"
 "B"

ec.y[1]

"A"

er = eachrow(x)

DataFrameRows is an AbstractVector

er isa AbstractVector

true

er[end]

er.y

2-element Vector{String}:
 "A"
 "B"

er.y[1]

"A"

Note that data frames, as well as DataFrameColumns and DataFrameRows objects, are not type stable (they do not know the types of their columns). This is useful to avoid compilation cost if you have very wide data frames with heterogeneous column types.

However, often (especially if a data frame is narrow) it is useful to create a lazy iterator that produces NamedTuples for each row of the DataFrame. Its key benefit is that it is type stable, so it is useful when you want to perform some operations quickly on a small subset of columns of a DataFrame. This strategy is often used internally by the DataFrames.jl package:

nti = Tables.namedtupleiterator(x)

Tables.NamedTupleIterator{Tables.Schema{(:x, :y), Tuple{Int64, String}}, Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}}(Tables.RowIterator{NamedTuple{(:x, :y), Tuple{Vector{Int64}, Vector{String}}}}((x = [1, 2], y = ["A", "B"]), 2))

for row in enumerate(nti)
  @show row
  @show row[2].y
end

row = (1, (x = 1, y = "A"))
(row[2]).y = "A"
row = (2, (x = 2, y = "B"))
(row[2]).y = "B"
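One way to benefit from the type-stable iterator is to pass it to a function, so that the loop is compiled for the concrete row type. The sketch below assumes a hypothetical helper `sum_where` and relies on the `Tables` name being available after `using DataFrames`, as in the cells above.

```julia
using DataFrames

x = DataFrame(x=1:3, y=["A", "B", "A"])

# Hypothetical helper: inside the function, `rows` has a concrete type,
# so field access such as row.y can be resolved at compile time.
function sum_where(rows, label)
    total = 0
    for row in rows
        if row.y == label
            total += row.x
        end
    end
    return total
end

total_a = sum_where(Tables.namedtupleiterator(x), "A")   # sums x over rows with y == "A"
```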

Handling of duplicate column names

We can pass the makeunique keyword argument to allow passing duplicate names (they get deduplicated)

df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

Otherwise, duplicates are not allowed.

df = DataFrame(:a=>1, :a=>2, :a_1=>3)

ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

Stacktrace:
 [1] make_unique!(names::Vector{Symbol}, src::Vector{Symbol}; makeunique::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/other/utils.jl:35
 [2] #make_unique#6
   @ ~/.julia/packages/DataFrames/pVFzb/src/other/utils.jl:57 [inlined]
 [3] #Index#9
   @ ~/.julia/packages/DataFrames/pVFzb/src/other/index.jl:27 [inlined]
 [4] DataFrame(::Pair{Symbol, Int64}, ::Vararg{Pair{Symbol, Int64}, N} where N; makeunique::Bool, copycols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/dataframe/dataframe.jl:244
 [5] DataFrame(::Pair{Symbol, Int64}, ::Vararg{Pair{Symbol, Int64}, N} where N)
   @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/dataframe/dataframe.jl:242
 [6] top-level scope
   @ In[214]:1
 [7] eval
   @ ./boot.jl:360 [inlined]
 [8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
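The two behaviors can be checked programmatically. This is only a sketch: it inspects the deduplicated names and captures the error with `try`/`catch` instead of letting it propagate; `result` is a name chosen for the example.

```julia
using DataFrames

df = DataFrame(:a => 1, :a => 2, :a_1 => 3; makeunique=true)
names(df)                 # three distinct names after deduplication
allunique(names(df))

# Without makeunique=true the constructor throws an ArgumentError
result = try
    DataFrame(:a => 1, :a => 2, :a_1 => 3)
catch err
    err
end
result isa ArgumentError
```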

Observe that currently the value `nothing` is not printed when displaying a DataFrame in Jupyter Notebook:

df = DataFrame(x=[1,nothing], y=[nothing,"a"],z=[missing,"c"])

empty!(df)

df
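Because `nothing` cells are not printed, checking the element type of a column is one way to tell them apart from `missing`. The sketch below also shows one possible conversion with `replace`; this is an illustration, not necessarily the recommended approach, and `x_clean` is a made-up name.

```julia
using DataFrames

df = DataFrame(x=[1, nothing], y=[nothing, "a"], z=[missing, "c"])

eltype(df.x)              # Union{Nothing, Int64}
eltype(df.z)              # Union{Missing, String}

# One possible conversion: swap nothing for missing in a column
x_clean = replace(df.x, nothing => missing)
ismissing(x_clean[2])
```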

	A	B	C	fixed
	Int64	Bool	Char	Array…
1	1	1	a	[1, 1]
2	2	0	b	[1, 1]

	x1	x2	x3
	Float64	Float64	Float64
1	0.837954	0.735099	0.759242
2	0.0328772	0.988991	0.384152
3	0.741292	0.943074	0.794541

	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.583038	0.0706006	0.534749	0.52262
2	0.209442	0.963666	0.79157	0.32516
3	0.272997	0.736443	0.751909	0.337337

	a	b	c	d
	Float64	Float64	Float64	Float64
1	0.955803	0.989704	0.65235	0.0691728
2	0.431441	0.39455	0.408381	0.843323
3	0.375153	0.373222	0.126727	0.3333

	a	b
	Int64	Float64
1	139985933978992	6.91633e-310
2	139985912091824	6.91633e-310

	a	b
	Int64	Float64
1	1	1.0
2	1	1.0

	x	y
	Int64	String
1	1	A
2	2	B

	x	y
	Int64	String?
1	1	missing
2	2	B

	x	y
	Int64	String
1	1	A
2	2	B