using DataFrames,Pipe, Chain
using BenchmarkTools
x = DataFrame(rand(3,5),:auto)
y = copy(x)
x === y # not the same object
y = DataFrame(x)
x === y
any(x[!,i] === y[!,i] for i in 1:ncol(x)) # the columns are also not the same
y = DataFrame(x,copycols=false)
x === y
all(x[!,i] === y[!,i] for i in 1:ncol(x)) # the columns are the same
# the same when creating data frame using kwarg syntax
x = 1:3; y = [1,2,3]; df = DataFrame(x=x,y=y)
y === df.y # different object
typeof(x), typeof(df.x) # range is converted to vector
Slicing rows always creates a copy
y === df[:,:y]
You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it.
df = DataFrame(x=x, y=y, copycols=false)
y === df.y # now it is the same
@btime select($df,:y)[!,1]
@btime @pipe select($df, :y) |> _[!,1]
select(df,:y)[!,1] === y # not the same
select(df, :y, copycols=false)[!,1] === y # the same
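The practical consequence of `copycols=false` is that the data frame and the source vector share memory. A minimal sketch of this, where the names `v`, `df_copy` and `df_alias` are made up for illustration:

```julia
using DataFrames

v = [1, 2, 3]
df_copy = DataFrame(v=v)                    # default: the column is a copy of `v`
df_alias = DataFrame(v=v, copycols=false)   # the column is `v` itself

v[1] = 100                                  # mutate the source vector

df_copy.v[1]     # the copy does not see the change
df_alias.v[1]    # the alias does, because df_alias.v === v
```

Avoiding the copy saves time and memory, but any later change to `v` silently changes `df_alias` as well.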
Do not modify the parent of a `GroupedDataFrame` or a view
x = DataFrame(id=repeat([1,2],inner=3),x=1:6)
x = DataFrame(id=repeat([1,2],outer=3),x=1:6)
g = groupby(x, :id)
x[1:3,1] = [2,2,2]
x
Well, `g` is wrong now: it is only a view of `x`, so its groups were not recomputed after `x` changed.
g
s = view(x,5:6,:)
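deleteat!(x,3:6) # `delete!` is deprecated in recent DataFrames.jl versions
s # the view still refers to rows 5:6, which no longer exist in x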
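One way to avoid the stale grouping is to finish modifying the parent first and only then call `groupby`. The following is a sketch of that ordering, not the only possible remedy; the final check is just an illustration that every row sits in the group matching its key:

```julia
using DataFrames

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
x[1:3, 1] = [2, 2, 2]      # modify the parent first ...
g = groupby(x, :id)        # ... then group, so the groups match the data

# each sub-data-frame contains only rows whose id equals the group key
consistent = all(all(==(key.id), sdf.id) for (key, sdf) in pairs(g))
```

The same reasoning applies to views: create them with `view` after the parent has its final set of rows.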
`DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`
x = DataFrame(a=1:3)
x.b = x[!,1] # alias
x.c = x[:,1] # copy
x.d = x[!,1][:] # copy
x.e = copy(x[!,1]) # explicit copy
display(x)
x[1,1] = 100
"Column x.b is an alias of x.a, so after setting x[1,1] = 100 the value in x.b changes as well" |> println
display(x)
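Instead of inspecting the printed table, aliasing can also be checked directly with `===`. A short sketch that uses only the alias and copy forms shown above:

```julia
using DataFrames

x = DataFrame(a=1:3)
x.b = x[!, 1]     # alias: the same vector is stored under two names
x.c = x[:, 1]     # copy: a new vector with equal contents

is_alias = x.a === x.b
is_copy = x.a !== x.c

x[1, 1] = 100     # writes into the vector shared by x.a and x.b
```

After the assignment `x.b[1]` is `100`, while `x.c[1]` keeps its old value.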
Use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)
This table is wide:
df1 = DataFrame([rand([1:2,'a':'b',false:true,1.0:2.0]) for i in 1:900], :auto)
@time collect(eachrow(df1))
@time collect(Tables.namedtupleiterator(df1));
As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (that is why this tip is included in the pitfalls notebook).
The table below is tall:
df2 = DataFrame(rand(10^6, 10),:auto)
@time map(sum,eachrow(df2))
@time map(sum,eachrow(df2))
@time map(sum,Tables.namedtupleiterator(df2))
@time map(sum,Tables.namedtupleiterator(df2))
As you can see, this time it is much faster to iterate a type-stable container.
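The trade-off follows from what each iterator yields. A small sketch on a hypothetical two-column table `df`, relying on `Tables` being available after `using DataFrames` as in the code above:

```julia
using DataFrames

df = DataFrame(a=1:3, b=[1.0, 2.0, 3.0])

r1 = first(eachrow(df))                     # a DataFrameRow
r2 = first(Tables.namedtupleiterator(df))   # a NamedTuple

# the NamedTuple type records every column name and element type
typeof(r2) == NamedTuple{(:a, :b), Tuple{Int, Float64}}
```

Because column names and types are part of the `NamedTuple` type, the compiler can specialize code that iterates over it, which is fast at run time but has to be compiled for each distinct table schema. A `DataFrameRow` does not carry that information in its type, so it is cheap to compile for but slower to iterate.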