using DataFrames, Random, Pipe, Chain
x = DataFrame(rand(Bool,3,4), :auto)
randstring(12)
rename(x, :x1 => :A)
rename!(c->c^2,x)
rename(x,3=>:third)
rename(x,[1,2].=>[:xx1,:xx2])
rename!(x,[:a,:b,:c,:d])
rename!(x,string.('a':'d'))
rename!(x,"a"=>"d","d"=>"a")
rename(x,fill(:a,4))
rename(x, fill(:a,4),makeunique=true)
We can reorder the names(x)
vector as needed, creating a new DataFrame
x
Random.seed!(1234)
@pipe x |> names |> shuffle |> x[:, _]
x
select!(x,4:-1:1)
x
x = DataFrame([(i,j) for i in 1:3, j in 1:4], :auto)
With hcat
we can merge two DataFrames
. Also [x,y] syntax is supported but only when DataFrames
have unique column names.
hcat(x,x,makeunique=true)
You can append a vector to a data frame with the following syntax.
y = [x DataFrame(A=[1,2,3])]
y=[DataFrame(A=1:3) x]
A column can also be added in the middle. Here a brute-force method is used and a new DataFrame
is created.
using BenchmarkTools
@btime [$x[!,1:2] DataFrame(A=[1,2,3]) $x[!,3:4]]
@time [x[!,1:2] DataFrame(A=[1,2,3]) x[!,3:4]]
We could also do this with a specialized in place method insertcols!
. Let's add ':newcol' to the DataFrame
y.
insertcols!(y,2,"newcol"=>[1,2,3])
If you want to insert the same column name several times makeunique=true
is needed as usual.
insertcols!(y,2,:newcol => [1,2,3], makeunique=true)
We can see how much faster it is to insert a column with insertcol!
than with hcat
using @btime
(note that we use here a Pair
notation as example)
@btime insertcols!(copy($x), 3, :A=>[1,2,3])
insertcols!(x,:A=>[1,2,3])
insertcols!(x,1,:B=>[1,2,3])
Note that insertcols!
can be used to insert several columns to a data frame at once and that it performs broadcasting if needed:
@time df = DataFrame(:a=>[1,2,3])
insertcols!(df,:b=>"x",:c=>'a',:d=>Ref([1,2,3]))
Interestingly we can emulate hcat
mutating the data frame in-place using insertcols!
:
df1 = DataFrame(:a=>[1,2])
df2 = DataFrame(:b=>[2,3],:c=>[3,4])
hcat(df1,df2)
df1 #df1 is not touched
@pipe eachcol(df2) |> pairs
@pipe eachcol(df2) |> pairs |> insertcols!(df1,_...)
df1 # now we have changed df1
Let's create a new DataFrame x
and show a few way to create DataFrames with a subset of x's columns.
x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)
@btime $x[!,[1,2,4,5]]
@btime $x[:,[1,2,4,5]] # use ! instead of : for non-copying operation
x[:,[:x1,:x4]]
We can also choose to keep or exclude columns by bool
(we need a vector whose length is the number of columns in the original DataFrame
).
x[:,[true,false,true,false,true]]
x[:,[:x1]]
x[:,:x1]
typeof(x[:,[:x1]]),typeof(x[:,:x1])
x[!,:x1] === x.x1,x[:,:x1] === x.x1
x[!,1] === x[!,:x1]
x[!,[1]] == x[!,1],x[!,[1]] === x[!,1]
typeof.([x[!,[1]] , x[!,1]])
You can alse user Regex, All, Between and Not from IvertedIndies.jl
for column selection:
x[!,r"[123]"]
x[!,Not(1)]
x[!,Between(:x2,:x4)]
x[!,Between(2,4)]
x[!,Cols(:x1,Between(:x3,:x5))]
select(x,:x1,Between(:x3,:x5),copycols=false) # the same as above
df = copy(x)
df2 = select(df,[1,2])
select(df,Not([1,2]))
By default `select` copies colums
df2[!,1] == df[!,1],df2[!,1] === df[!,1]
This can be using `copycols=false` keyword argument
df2 = select(df,[1,2],copycols=false)
df2[!,1] === df[!,1]
df
using select!
will modify the source data frame
select!(df,[1,2])
df == df2
Here we create a copy of x and delete the 3rd column from the copy with select!
and Not
z = copy(x)
select!(z,Not(3))
alternativerly we can archieve the same by using select
function
select(x, Not(3))
x stays unchanged
x
Note, that you can also create a view of a DataFrame
when we want a sebset of its columns:
@btime x[:,[1,3,5]]
@btime @view x[:,[1,3,5]]
(now creation of the view
is slow, but in the coming release of the DataFeames.jl
package it will become significantly faster)
x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)
With the following syntax, the existing column is modified without performing any copying (this is discouraged as it creates column alias)
x[!,:x1] = x[!,:x2]
x
x[1,:x2] = (11,2)
# x2의 첫번째 값을 변경한것이 x1의 첫번째 값도 같이 변경됨,
# 위에서 복사시에 x2의 번지를 참조 하기 때문에 조심해야 함
x
x[1,:x2] = (1,2)
this syntax is safer
x[!,:x1] = x[:,:x2]
x
x[1,:x2] = (11,2)
# x2의 첫번째 값을 변경 했지만 x1의 값은 바뀌지 않음
x
We can also use the following syntax to add a new column at the end of a DataFrame
x[!,:A] = [1,2,3]
x
A new column name will be added to our DataFrame
with the following syntax as well.
x.B = 11:13
x
x = DataFrame([(i,j) for i in 1:3, j in 1:5],:auto)
We can check if a column with a given name exists via
hasproperty(x,:x1)
and determine its index via
columnindex(x,:x2)
these are most useful for non-standard column names(e.g. containing spaces)
df = DataFrame()
df.x1 = 1:3
df."column 2" = 4:6
df[!,"column 3"] = 7:9
df
select!(df,Not("column 3"))
df
df."column 2"
df[:,"column 2"]
or you can interpolate column name using :() syntax
for n in names(df)
println(n,"\n",df.:($n),"\n")
end
When using eachcol
of a data frame the resulting object retains reference to its parent and e.g. can be queried with getproperty
df = DataFrame(reshape(1:12,3,4), :auto)
ec_df = eachcol(df)
df[:,1]
ec_df[1]
ec_df.x1
The general rules are the following
select
and transform
always return the number of rows equal to the source data frame, while combine
returns any number of rows(combine
is allowed to combine rows of the source data frame)transform
retains columns from the old data frameselect!
and transform!
are in-place versions of select
and transform
df = DataFrame(reshape(1:12,3,4),:auto)
Here we add a new column :res
that is a sum of columns :x1
and :x2
. Ageneral syntax of transformations of this kind is:
`source_columns => function_to_apply => target_column_name`
then function_to_apply
gets columns selected by source_column
as positional arguments.
@btime transform(df,[:x1,:x2]=>(+)=>:res)
transform의 역할을 select를 통해 동일하게 규현 할 수 있음
@btime select(df,[1:ncol(df)]...,[:x1,:x2]=>(+)=>:res)
One can omit passing target_column_name
in which case it is automatically generated
using Statistics
combine(df,[:x1,:x2] => Statistics.cor)
Note that combine
allowed the number of columns in the resulting data frame to be changed. If we used select
instead it would automatically broadcast the return value to match the number of rows of the source:
#x1,x2 전체 컬럼에 대해 cor를 계산한 결과를 각 row에 값을 복사 한다.
select(df,[:x1,:x2]=>Statistics.cor)
If you want to apply some function on each row of the source wrap it in ByRow
:
@btime select(df, :x1,:x2,[:x1,:x2]=>ByRow(string))
@btime select(df, :x1,:x2,AsTable([:x1,:x2])=>ByRow(x->string(x.x1,x.x2)))
select(df, :x1,:x2,AsTable([:x1,:x2])=> (x->string(x.x1,x.x2)))
Also if toy want columns to be passed as NamedTuple
to a function (instead of being positional arguments) wrap them in AsTable
:
@btime select(df, :x1,:x2, AsTable([:x1, :x2]) => (x->x.x1 + x.x2) => :x3 )
@btime select(df, :x1,:x2, [:x1, :x2] => ByRow(+) => :x3)
For simplicity (as this functionality is often needed) there is a special treat ment of nrow
function that can be just passed as a transformation (without specifying of column selector):
select(df, :x1, nrow=>"number_of_rows")
(note that in select
the number of rows is automatically broadcasted to match the number of rows of the source data frame)
cumsum example
select(df, :x1, [:x1]=>cumsum)
Finally you can conveninently create multiple columns with one function, e.g.:
@btime select(df, :x1,:x1=>ByRow(x->[x ^ 2, x ^ 3]) => ["x1²","x1³"])
@btime select(df, :x1,:x1=>(x->DataFrame("x1²"=>x .^ 2,"x1³"=>x .^ 3))=>AsTable)
@pipe DataFrame(:x1=>[1,2,3], :x2=>[3,4,5]) |> AsTable