Introduction to DataFrames

09_reshaping

DataFrame v1.2, Julia 1.6.1

Reshaping DataFrames

Wide to long

In [3]:
using DataFrames,Pipe
In [4]:
x = DataFrame(id=[1,2,3,4], id2=[1,1,2,2], M1 = [11,12,13,14], M2 = [111,112,113,114])
Out[4]:

4 rows × 4 columns

idid2M1M2
Int64Int64Int64Int64
11111111
22112112
33213113
44214114

Stack a data frame x, i.e. convert it from wide to long format. Stack(df, [아래로 펼칠 컬럼], 표시할 컬럼)

  • 아래 예에서는 M1,M2 variable과 value를 아래로 표시하고 추가적으로 표시할 컬럼 id를 지정 했음
In [12]:
stack(x,[:M1,:M2],:id) # first pass measure variables and then id-variable
Out[12]:

8 rows × 3 columns

idvariablevalue
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114

add view=true arguement to make a view; in that case columns of the resulting data frame share memory with columns of the source data frame, so the operation is potentially unsafe.

In [15]:
# optionally you can rename columns
stack(x,["M1","M2"], "id", variable_name="key", value_name="observed", 
  view=true)
Out[15]:

8 rows × 3 columns

idkeyobserved
Int64StringInt64
11M111
22M112
33M113
44M114
51M2111
62M2112
73M2113
84M2114

if second argument is omitted in stack, all other columns are assumed to be the id-variables

In [18]:
stack(x,[:M1,:M2])
Out[18]:

8 rows × 4 columns

idid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
In [21]:
stack(x,Not([:id,:id2]))
Out[21]:

8 rows × 4 columns

idid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
In [22]:
stack(x,Not([1,2]))
Out[22]:

8 rows × 4 columns

idid2variablevalue
Int64Int64StringInt64
111M111
221M112
332M113
442M114
511M2111
621M2112
732M2113
842M2114
In [23]:
x = DataFrame(id=[1,1,1],id2=['a','b','c'],a1=rand(3),a2=rand())
Out[23]:

3 rows × 4 columns

idid2a1a2
Int64CharFloat64Float64
11a0.5615980.736461
21b0.1334630.736461
31c0.4414980.736461
In [25]:
x = DataFrame(id=[1,1,1],id2=['a','b','c'],a1=rand(3),a2=rand(3))
Out[25]:

3 rows × 4 columns

idid2a1a2
Int64CharFloat64Float64
11a0.4899980.824482
21b0.7061250.221454
31c0.2853260.564364

if `stack` is not passed any measure variables by default numeric variables are selected as measures

In [26]:
stack(x)
Out[26]:

6 rows × 4 columns

idid2variablevalue
Int64CharStringFloat64
11aa10.489998
21ba10.706125
31ca10.285326
41aa20.824482
51ba20.221454
61ca20.564364

here all columns are treated as measures:

In [27]:
stack(DataFrame(rand(3,2),:auto))
Out[27]:

6 rows × 2 columns

variablevalue
StringFloat64
1x10.738212
2x10.936584
3x10.374822
4x20.178436
5x20.13522
6x20.93944
In [28]:
df = DataFrame(rand(3,2),:auto)
Out[28]:

3 rows × 2 columns

x1x2
Float64Float64
10.8961340.946699
20.9203830.776507
30.7318290.0190809
In [33]:
df.key = [1,1,1]
Out[33]:
3-element Vector{Int64}:
 1
 1
 1
In [34]:
df
Out[34]:

3 rows × 3 columns

x1x2key
Float64Float64Int64
10.8961340.9466991
20.9203830.7765071
30.7318290.01908091
In [35]:
mdf = stack(df) # duplicates in key are silenetly accepted
Out[35]:

6 rows × 3 columns

keyvariablevalue
Int64StringFloat64
11x10.896134
21x10.920383
31x10.731829
41x20.946699
51x20.776507
61x20.0190809

Long to wide

In [37]:
x = DataFrame(id=[1,1,1],id2='a':'c',a1=rand(3),a2 = rand(3))
Out[37]:

3 rows × 4 columns

idid2a1a2
Int64CharFloat64Float64
11a0.3734280.566582
21b0.2550270.706737
31c0.4326030.0557234
In [38]:
y = stack(x)
Out[38]:

6 rows × 4 columns

idid2variablevalue
Int64CharStringFloat64
11aa10.373428
21ba10.255027
31ca10.432603
41aa20.566582
51ba20.706737
61ca20.0557234
In [40]:
unstack(y,:id2,:variable,:value)
Out[40]:

3 rows × 3 columns

id2a1a2
CharFloat64?Float64?
1a0.3734280.566582
2b0.2550270.706737
3c0.4326030.0557234
In [42]:
unstack(y,:variable,:value) # all other columns are treated as keys
Out[42]:

3 rows × 4 columns

idid2a1a2
Int64CharFloat64?Float64?
11a0.3734280.566582
21b0.2550270.706737
31c0.4326030.0557234

all columns other than named :variable an :value are treated as keys

In [43]:
unstack(y)
Out[43]:

3 rows × 4 columns

idid2a1a2
Int64CharFloat64?Float64?
11a0.3734280.566582
21b0.2550270.706737
31c0.4326030.0557234
In [44]:
# you can rename the unstacked columns
unstack(y,renamecols=n->string("unstacked_",n))
Out[44]:

3 rows × 4 columns

idid2unstacked_a1unstacked_a2
Int64CharFloat64?Float64?
11a0.3734280.566582
21b0.2550270.706737
31c0.4326030.0557234
In [45]:
df = stack(DataFrame(rand(3,2),:auto))
Out[45]:

6 rows × 2 columns

variablevalue
StringFloat64
1x10.257075
2x10.489404
3x10.957308
4x20.496959
5x20.939764
6x20.483036

unable to unstack when no key column is presented

In [49]:
unstack(df)
ArgumentError: No key column found

Stacktrace:
 [1] unstack(df::DataFrame, rowkeys::InvertedIndex{InvertedIndices.TupleVector{Tuple{Int64, Int64}}}, colkey::Int64, value::Int64; renamecols::Function, allowmissing::Bool, allowduplicates::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/reshape.jl:342
 [2] unstack(df::DataFrame, colkey::Symbol, value::Symbol; renamecols::Function, allowmissing::Bool, allowduplicates::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/reshape.jl:355
 [3] #unstack#510
   @ ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/reshape.jl:361 [inlined]
 [4] unstack(df::DataFrame)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/abstractdataframe/reshape.jl:361
 [5] top-level scope
   @ In[49]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [ ]: