Introduction to DataFrames

08_joins

DataFrame v1.2, Julia 1.6.1

In [3]:
using DataFrames

Joining DataFrames

Preparing DataFrames for a join

In [4]:
x = DataFrame(ID=[1,2,3,4,missing],name=["Alice","Bob","Conor","Dave","Zed"])
Out[4]:

5 rows × 2 columns

IDname
Int64?String
11Alice
22Bob
33Conor
44Dave
5missingZed
In [5]:
y = DataFrame(id=[1,2,5,6,missing],age=[21,22,23,24,99])
Out[5]:

5 rows × 2 columns

idage
Int64?Int64
1121
2222
3523
4624
5missing99

Rules for the on keyword argument:

  • a single Symbol or string if joining on one column with the same name, e.g. on=:id
  • a Pair of Symbols or string if joining on one column with different names, e.g. on=:id=>:id2
  • a vector of Symbols or strings if joining on multiple columns with the same name, e.g. on=[:id1, :id2]
  • a vector of Pairs of Symbols or strings if joining on ,ultiple columns with the same name, e.g. on=[:a1=>:a2, :b1=>b2]
  • a vector containing a combination of Symbols or strings or Pair of Symbols or strings, e.g. on=[:a1=>:a2, :b1]

Standard joins: inner, left, outer, semi, anti

missing is not allowed to hoin-on by default

In [7]:
innerjoin(x,y,on=:ID=>:id)
ArgumentError: missing values in key columns are not allowed when matchmissing == :error

Stacktrace:
 [1] DataFrames.DataFrameJoiner(dfl::DataFrame, dfr::DataFrame, on::Pair{Symbol, Symbol}, matchmissing::Symbol, kind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:73
 [2] _join(df1::DataFrame, df2::DataFrame; on::Pair{Symbol, Symbol}, kind::Symbol, makeunique::Bool, indicator::Nothing, validate::Tuple{Bool, Bool}, left_rename::typeof(identity), right_rename::typeof(identity), matchmissing::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:332
 [3] #innerjoin#533
   @ ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:589 [inlined]
 [4] top-level scope
   @ In[7]:1
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

missing 포함 하여 inner join

In [8]:
innerjoin(x,y, on=:ID=>:id, matchmissing=:equal)
Out[8]:

3 rows × 3 columns

IDnameage
Int64?StringInt64
11Alice21
22Bob22
3missingZed99
In [11]:
# 왼쪽을 기준으로 오른쪽 매칭
leftjoin(x,y,on="ID"=>"id", matchmissing=:equal)
Out[11]:

5 rows × 3 columns

IDnameage
Int64?StringInt64?
11Alice21
22Bob22
3missingZed99
43Conormissing
54Davemissing
In [12]:
#오른쪽을 기준으로 왼쪽 매칭
rightjoin(x,y,on=:ID=>:id,matchmissing=:equal)
Out[12]:

5 rows × 3 columns

IDnameage
Int64?String?Int64
11Alice21
22Bob22
3missingZed99
45missing23
56missing24
In [13]:
# 양쪽의 합집합
outerjoin(x,y,on=:ID=>:id,matchmissing=:equal)
Out[13]:

7 rows × 3 columns

IDnameage
Int64?String?Int64?
11Alice21
22Bob22
3missingZed99
43Conormissing
54Davemissing
65missing23
76missing24

semijoin : left join에서 오른쪽 값 생략

In [14]:
semijoin(x,y,on=:ID=>:id,matchmissing=:equal)
Out[14]:

3 rows × 2 columns

IDname
Int64?String
11Alice
22Bob
3missingZed

antijoin 왼쪽기준으로 오른쪽에 없는 것만 표시

In [16]:
antijoin(x,y,on=:ID=>:id, matchmissing=:equal)
Out[16]:

2 rows × 2 columns

IDname
Int64?String
13Conor
24Dave

Cross join

(here no on arquement)

In [20]:
crossjoin(DataFrame(x=[1,2]),DataFrame(y=["a","b","c"]))
Out[20]:

6 rows × 2 columns

xy
Int64String
11a
21b
31c
42a
52b
62c

Complex cases of joins

In [21]:
x = DataFrame(id1=[1,1,2,2,missing,missing],
              id2=[1,11,2,21,missing,99],
              name=["Alice","Bob","Conor","Dave","Zed","Zoe"])
Out[21]:

6 rows × 3 columns

id1id2name
Int64?Int64?String
111Alice
2111Bob
322Conor
4221Dave
5missingmissingZed
6missing99Zoe
In [22]:
y = DataFrame(id1=[1,1,3,3,missing,missing],
              id2=[11,1,31,3,missing,999],
              age=[21,22,23,24,99,100])
Out[22]:

6 rows × 3 columns

id1id2age
Int64?Int64?Int64
111121
21122
333123
43324
5missingmissing99
6missing999100
In [28]:
j1 = innerjoin(x,y, on=[:id1=>:id1,:id2=>:id2], matchmissing=:equal)
Out[28]:

3 rows × 4 columns

id1id2nameage
Int64?Int64?StringInt64
1111Bob21
211Alice22
3missingmissingZed99
In [29]:
j2 = innerjoin(x,y, on=[:id1,:id2], matchmissing=:equal)
Out[29]:

3 rows × 4 columns

id1id2nameage
Int64?Int64?StringInt64
1111Bob21
211Alice22
3missingmissingZed99
In [34]:
isequal(j1,j2)
Out[34]:
true
In [35]:
# with duplicates all combinations are processed
outerjoin(x,y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
Out[35]:

12 rows × 6 columns

id1id2nameid2_1agesource
Int64?Int64?String?Int64?Int64?String
111Alice1121both
2111Bob1121both
311Alice122both
4111Bob122both
5missingmissingZedmissing99both
6missing99Zoemissing99both
7missingmissingZed999100both
8missing99Zoe999100both
922Conormissingmissingleft_only
10221Davemissingmissingleft_only
113missingmissing3123right_only
123missingmissing324right_only
In [36]:
x = DataFrame(id1=1:6, id2=[1,2,1,2,1,2], x1 = 'a':'f')
Out[36]:

6 rows × 3 columns

id1id2x1
Int64Int64Char
111a
222b
331c
442d
551e
662f
In [37]:
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
Out[37]:

6 rows × 3 columns

id1ID2x2
Int64Int64Char
111a
222b
333c
444d
555e
666f
  • [:id1,:id2=>:ID2] : x,y가 공통으로 :id1를 포함고 있어 :id1=>:id1 으로 표현하지 않음
  • x의 :id2와 y의 ID2를 join 하기 위해 :id1=>ID2 로 표현
In [39]:
innerjoin(x,y,on=[:id1,:id2=>:ID2] )
Out[39]:

2 rows × 4 columns

id1id2x1x2
Int64Int64CharChar
111aa
222bb

joining more than two data frames

In [40]:
xs = [DataFrame("id"=>1:6, "v$i"=>((1:6) .+ 10i)) for i in 1:5]
Out[40]:
5-element Vector{DataFrame}:
 6×2 DataFrame
 Row  id     v1    
      Int64  Int64 
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15
   6 │     6     16
 6×2 DataFrame
 Row  id     v2    
      Int64  Int64 
─────┼──────────────
   1 │     1     21
   2 │     2     22
   3 │     3     23
   4 │     4     24
   5 │     5     25
   6 │     6     26
 6×2 DataFrame
 Row  id     v3    
      Int64  Int64 
─────┼──────────────
   1 │     1     31
   2 │     2     32
   3 │     3     33
   4 │     4     34
   5 │     5     35
   6 │     6     36
 6×2 DataFrame
 Row  id     v4    
      Int64  Int64 
─────┼──────────────
   1 │     1     41
   2 │     2     42
   3 │     3     43
   4 │     4     44
   5 │     5     45
   6 │     6     46
 6×2 DataFrame
 Row  id     v5    
      Int64  Int64 
─────┼──────────────
   1 │     1     51
   2 │     2     52
   3 │     3     53
   4 │     4     54
   5 │     5     55
   6 │     6     56
In [46]:
innerjoin(xs...,on=:id) # also for outerjoin and crossjoin
Out[46]:

6 rows × 6 columns

idv1v2v3v4v5
Int64Int64Int64Int64Int64Int64
111121314151
221222324252
331323334353
441424344454
551525354555
661626364656

matchmissing keyword argument

In general you have three options how missing values are handled in joins that are handled by matchmissing keyword argument value as follows:

  • :error : throw an error if missings are encountered(this is default)
  • :equal : assume missing values are equal to themselves
  • :notequal : assume missing values are not equal to themselves (not available for `outerjoin`)

Here are some examples comparing the options.

In [47]:
df1 = DataFrame(id=[1,2,missing],x=1:3)
Out[47]:

3 rows × 2 columns

idx
Int64?Int64
111
222
3missing3
In [48]:
df2 = DataFrame(id=[1,missing,3], y = 1:3)
Out[48]:

3 rows × 2 columns

idy
Int64?Int64
111
2missing2
333
In [49]:
innerjoin(df1,df2,on=:id)
ArgumentError: missing values in key columns are not allowed when matchmissing == :error

Stacktrace:
 [1] DataFrames.DataFrameJoiner(dfl::DataFrame, dfr::DataFrame, on::Symbol, matchmissing::Symbol, kind::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:73
 [2] _join(df1::DataFrame, df2::DataFrame; on::Symbol, kind::Symbol, makeunique::Bool, indicator::Nothing, validate::Tuple{Bool, Bool}, left_rename::typeof(identity), right_rename::typeof(identity), matchmissing::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:332
 [3] #innerjoin#533
   @ ~/.julia/packages/DataFrames/vuMM8/src/join/composer.jl:589 [inlined]
 [4] top-level scope
   @ In[49]:1
 [5] eval
   @ ./boot.jl:360 [inlined]
 [6] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [50]:
innerjoin(df1,df2,on=:id,matchmissing=:equal)
Out[50]:

2 rows × 3 columns

idxy
Int64?Int64Int64
1111
2missing32
In [52]:
innerjoin(df1,df2,on=:id, matchmissing=:notequal)
Out[52]:

1 rows × 3 columns

idxy
Int64?Int64Int64
1111
In [ ]: