using DataFrames
x = DataFrame(ID=[1,2,3,4,missing],name=["Alice","Bob","Conor","Dave","Zed"])
y = DataFrame(id=[1,2,5,6,missing],age=[21,22,23,24,99])
Rules for the on
keyword argument:
Symbol
or string if joining on one column with the same name, e.g. on=:id
Pair
of Symbols
or string if joining on one column with different names, e.g. on=:id=>:id2
Symbols
or strings if joining on multiple columns with the same name, e.g. on=[:id1, :id2]
Pairs
of Symbols
or strings if joining on ,ultiple columns with the same name, e.g. on=[:a1=>:a2, :b1=>b2]
Symbols
or strings or Pair
of Symbols
or strings, e.g. on=[:a1=>:a2, :b1]
missing is not allowed to hoin-on by default
innerjoin(x,y,on=:ID=>:id)
missing 포함 하여 inner join
innerjoin(x,y, on=:ID=>:id, matchmissing=:equal)
# 왼쪽을 기준으로 오른쪽 매칭
leftjoin(x,y,on="ID"=>"id", matchmissing=:equal)
#오른쪽을 기준으로 왼쪽 매칭
rightjoin(x,y,on=:ID=>:id,matchmissing=:equal)
# 양쪽의 합집합
outerjoin(x,y,on=:ID=>:id,matchmissing=:equal)
semijoin : left join에서 오른쪽 값 생략
semijoin(x,y,on=:ID=>:id,matchmissing=:equal)
antijoin 왼쪽기준으로 오른쪽에 없는 것만 표시
antijoin(x,y,on=:ID=>:id, matchmissing=:equal)
(here no on
arquement)
crossjoin(DataFrame(x=[1,2]),DataFrame(y=["a","b","c"]))
x = DataFrame(id1=[1,1,2,2,missing,missing],
id2=[1,11,2,21,missing,99],
name=["Alice","Bob","Conor","Dave","Zed","Zoe"])
y = DataFrame(id1=[1,1,3,3,missing,missing],
id2=[11,1,31,3,missing,999],
age=[21,22,23,24,99,100])
j1 = innerjoin(x,y, on=[:id1=>:id1,:id2=>:id2], matchmissing=:equal)
j2 = innerjoin(x,y, on=[:id1,:id2], matchmissing=:equal)
isequal(j1,j2)
# with duplicates all combinations are processed
outerjoin(x,y, on=:id1, makeunique=true, indicator=:source, matchmissing=:equal)
x = DataFrame(id1=1:6, id2=[1,2,1,2,1,2], x1 = 'a':'f')
y = DataFrame(id1=1:6, ID2=1:6, x2='a':'f')
[:id1,:id2=>:ID2]
: x,y가 공통으로 :id1를 포함고 있어 :id1=>:id1
으로 표현하지 않음innerjoin(x,y,on=[:id1,:id2=>:ID2] )
joining more than two data frames
xs = [DataFrame("id"=>1:6, "v$i"=>((1:6) .+ 10i)) for i in 1:5]
innerjoin(xs...,on=:id) # also for outerjoin and crossjoin
In general you have three options how missing
values are handled in joins that are handled by matchmissing
keyword argument value as follows:
:error
: throw an error if missings are encountered(this is default):equal
: assume missing
values are equal to themselves:notequal
: assume missing
values are not equal to themselves (not available for `outerjoin`)Here are some examples comparing the options.
df1 = DataFrame(id=[1,2,missing],x=1:3)
df2 = DataFrame(id=[1,missing,3], y = 1:3)
innerjoin(df1,df2,on=:id)
innerjoin(df1,df2,on=:id,matchmissing=:equal)
innerjoin(df1,df2,on=:id, matchmissing=:notequal)