using DataFrames
using Statistics
using Random
using BenchmarkTools
Random.seed!(1)
df = DataFrame(rand(4,5), :auto)
using : as row selector will copy colums
df[:,:]
this is the same as
copy(df)
you can get a subset of rows of a data frame without copying using view
to get a SubDataFrame
sdf = view(df,1:3, 1:3)
you still have a detailed reference to the parent
parent(sdf)
parentindices(sdf)
selecting a single row returns a DataFrameRow
object which is also a view
하나의 row를 선택하는 경우 view를 돌려 준다. 리턴된 row 의 값을 변경하면 원래 df의 값이 변경된다.
dfr = df[3,:]
size(dfr)
# view임을 확인
dfr === df[3,:]
# dfr 값 변경시 df 값이 바뀜
dfr[1] = 100
df
두개이상의 row를 선택하는 경우 복사 값을 돌려 준다
dfr2 = df[3:4,:]
# 두개 이상의 row 선택시 복사 값을 돌려 준다.
# view가 아님
dfr2 === df[3:4,:]
# dfr2 값이 변경되었지만 df값은 변경 되지 않음
dfr2[1,1]=200
df
parent(dfr)
parentindices(dfr)
rownumber(dfr)
ncol(dfr)
nrow(dfr)
df[!,:Z] .= 1
df
Earlier we used : for column selection in a view (SubDataFrame
and DataFrameRow
). In this case a view will have all columns of the parent after the parent is mutated
# df의row에 추가된 Z column 내역이 dfr에 반영됨 (view이기 때문)
dfr
parent(dfr)
# Z가 추가 되어 기존 5에서 6으로 증가 됨
parentindices(dfr)
rownumber(dfr)
Note that parent
and parentindices
refer to the true source of data for DataFrameRow
and rownumber
refer to row number in the direct object that was used to create DataFrameRow
df = DataFrame(a=1:4)
dvf = view(df,[3,2],:)
typeof(dvf)
typeof(df[!,[1]])
# dvf의 뷰를 가진다, dvf는 df의 뷰
dfr = dvf[2,:]
typeof(dfr)
# dfr은 dfv의 뷰이고 dfv는 df의 뷰로써 parent는 뷰의 원본 DataFrame을 가리킨다
parent(dfr)
parentindices(dfr)
rownumber(dfr)
We create some random data frame (and hope that x.x is not sorted:), which is quite likely with 12 rows)
x = DataFrame(id=1:12, x=rand(12), y = [zeros(6); ones(6)])
[zeros(6)..., ones(6)...]
check if a DataFrame or a subset of its column is sorted
issorted(x)
issorted(select(x,2,1,3)),issorted(select(x,3)),issorted(select(x,3,1,2)),
issorted(select(x,3,2,1))
issorted(x,:x)
sort!(x,:x)
x
now we create a new DataFrame
y = sort(x,:id)
here we sort by two columns, first is decreasing, second is increasing
@time sort(x,[:y,:x],rev=[true,false])
@time sort(x,[order(:y,rev=true),:x]) # the same as above
now we try some more fancy sorting stuff
# x값의 역순으로 정렬
sort(x, [order(:y,rev=true), order(:x,by=v->-v)])
# x 값은 cos(pi/(1-0.19)*v) 계산 결과에 따라 sorting 된다.
sort(x, [order(:y,rev=true), order(:x,by=v->cos(pi/(1-0.19)*v))])
this is how you can reorder rows(here randomly)
x[shuffle(1:10),:]
it is also easy to swap rows using broadcasted assignment
sort!(x,:id)
x[[1,10],:] .= x[[10,1],:]
x
x = DataFrame(rand(3,5),:auto)
merge by rows - data frames must have the same column names; the same is vcat
@btime [x;x;x]
@btime vcat(x,x,x)
you can efficiently vcat
a vector of DataFrames
using reduce
@btime reduce(vcat,[x,x,x])
get y with other order of names
y = x[:,reverse(names(x))]
`vcat` is still possible as it does dolumn name matching
vcat(x,y)
but column names must still match
vcat(x, y[:,1:3])
vcat(x,y[:,1:3],cols=:intersect)
vcat(x,y[:,1:3],cols=:union)
vcat(x,y[:,1:3],cols=[:x1,:x5])
append!(x,x)
x
here column names must match exactly unless cols
keyword argument is passed
append!(x,y)
x
standard repeat
function works on rowsl also inner
and outer
keyword arguments are accepted
repeat(x,2)
push!
add one row to x
at the end, one must pass a correct number of values unless cols
keyword argument is passed
push!(x,1:5)
x
also works with dictionaries
push!(x,Dict(:x1=>11, :x2=>12, :x3=>13, :x4=>14, :x5=>15))
and NamedTuples
via name matching
push!(x,(x2=2,x1=1,x4=4,x3=3,x5=5))
and DataFrameRow
also via name matching
push!(x,x[1,:])
Please consult the documentation of push!,append!
and vcat
for allowed values of cols
keyword argument. This keyword argument governs the way these functions perform column matching of passed arguments. Also append!
and push!
support a promote
keyword argument that decides if column type promotion is allowed.
Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:
source = [(a=1, b=2),(a=missing,b=10,c=20),(b="s",c=1,d=1)]
df = DataFrame()
for row in source
push!(df,row,cols=:union) # if cols is :union then promote is true by default
end
df
and we see that push!
dynamically added columns as needed and updated their element types
x = DataFrame(id=1:10, val='a':'j')
by using indexing
x[1:2,:]
x[1:2,:] |> typeof
a single row selection creates a `DataFrameRow`
x[1,:]
but this is a DataFrame
x[1:1,:]
x[1:1,:] |> typeof
the same a view
v1 = view(x,1:2,:)
v1 |> typeof
selects column 1 and 2
v2 = view(x,:,1:2)
v2 |> typeof
indexing by Bool
, exact length math is required
# 홀수 번째 row만 뽑아 내기
df1 = x[repeat([true,false],5),:]
df1 |> typeof
alternatively we can also create a view
v3 = view(x, repeat([true,false],5),:)
we can delete one row in place
x
delete!(x,7)
x
delete!(x, 6:7)
x
you can also create a new DataFrame
when deleting rows using Not
indexing
x[Not(1:2),:]
x
now we move to row filtering
x = DataFrame([1:4, 2:5, 3:6],:auto)
create a new DataFrame
where filtering function operates on DataFrameRow
f1 = filter(r->r.x1 > 2.5,x)
f1 |> typeof
filter에 view 파라미터 적용
f2 = filter(r->r.x1 > 2.5,x, view=true) # the same but as a view
f2 |> typeof
or
f2 = filter(:x1 => >(2.5),x)
f2 |> typeof
in place modification of x
, an example with do
-block syntax
f4 = filter!(x) do r
if r.x1 > 2.5
return r.x2 < 4.5
end
r.x3 < 3.5
end
f4 |> typeof
A common operation is selection of rows for which a value in a column is contained in a given set, Here are a few ways in which you can archive this.
df = DataFrame(x=1:12, y=mod1.(1:12,4))
We select rows for which column y has value 1 or 4
@btime filter(row->row.y in [1,4],df)
@btime filter(:y => in([1,4]),df)
@btime df[in.(df.y,Ref([1,4])),:]
DataFrames.jl also providers a subset
function that works on whole columns and allows for multiple conditions:
x = DataFrame([1:4, 2:5, 3:6], :auto)
s1 = subset(x, :x1=>x->x .< mean(x), :x2 => ByRow(<(4)))
s1 |> typeof
x = DataFrame(A=[1,2], B=["x","y"])
append!(x,x)
x.C = 1:4
x
get first unique rows for geive index
unique(x)
unique(x,[1,2])
get indicators of non-unique rows
nonunique(x,:A)
x
modify x
in place
unique!(x, :B)
x
DataFrame
into standard collections¶x = DataFrame(x=[1,missing,2], y=["a","b",missing],z=[true,false,true])
cols = [:y,:z]
you can use a conversion to a Vector
or an Array
Vector(x[1,cols])
nwo you will get a vector of vectors
xx = DataFrame(rand(2,3),:auto)
axes(xx,1),axes(xx,2),[axes(xx,1)...],[axes(xx,2)...]
# axes(x,1) : row
# axes(x,2) : column
[Vector(x[i,cols]) for i in axes(x,1)]
it is easy to convert a DataFrameRow
into a NamedTuple
typeof(x[1,cols])
c1 = copy(x[1,cols])
n1 = NamedTuple(x[1,cols])
c1 == n1, c1===n1
c1.y
or a Tuple
Tuple(x[1,cols])
You can use eachrow
to get a vector-like collection of DataFrameRows
df = DataFrame(reshape(1:12,3,4),:auto)
er_df = eachrow(df)
er_df |> typeof
er_df[1]
er_df[1][2] === er_df[1].x2
last(er_df)
er_df[end]
As DataFrameRows
objects keeps connection to the parent data frame you can get the columns of parent using getproperty
er_df.x1
er_df.x1 === df.x1
Occasionally you have a data frame whose one column is a vector of collections. You can expand(flatten)
such a column using the flatten
function
df = DataFrame(a='a':'c', b=[[1,2,3],[4,5],6])
f1 = flatten(df, :b)
f1 |> typeof
only
from Julia Base is also supported in DataFrames.jl
and succeeds if the data frame has only one row, in which case it is returned.
df = DataFrame(a=1)
only(df)
df2 = repeat(df,2)
only(df2)