Introduction to DataFrames

03_missingvalues

Handling missing values

In [17]:
using Pipe
In [1]:
missing, typeof(missing)
Out[1]:
(missing, Missing)

Arrays automatically create an appropriate union type

In [2]:
x = [1,2,missing,3]
Out[2]:
4-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 3
In [6]:
1 |>ismissing, missing |> ismissing, x|>ismissing, x .|> ismissing
Out[6]:
(false, true, false, Bool[0, 0, 1, 0])
In [60]:
x |> eltype, x |> eltype |> nonmissingtype
Out[60]:
(Union{Missing, Int64}, Int64)

missing comparisons produce missing

In [9]:
missing === missing, missing == missing, missing != missing, missing < missing
Out[9]:
(true, missing, missing, missing)
In [19]:
@pipe missing |> isequal(_,missing)
Out[19]:
true
In [21]:
isequal(missing,missing)
Out[21]:
true
In [30]:
1 == missing, 1 != missing, 1 < missing
Out[30]:
(missing, missing, missing)

missing is considered greater than any numeric value

In the next few examples, we see that many (not all) functions handle missing

In [33]:
map(x->x(missing),[sin,cos,zero,sqrt]) # part 1
Out[33]:
4-element Vector{Missing}:
 missing
 missing
 missing
 missing
In [34]:
map(x->x(missing,1),[+,-,*,/,div]) # part 2
Out[34]:
5-element Vector{Missing}:
 missing
 missing
 missing
 missing
 missing
In [35]:
using Statistics # needed for mean
In [36]:
map(x->x([1,2,missing]),[minimum,maximum,extrema,mean,float]) # part 3
Out[36]:
5-element Vector{Any}:
 missing
 missing
 (missing, missing)
 missing
 Union{Missing, Float64}[1.0, 2.0, missing]
In [40]:
[1,missing,2,missing] |> skipmissing |> collect
Out[40]:
2-element Vector{Int64}:
 1
 2
In [49]:
@time @pipe [1.0,missing,2.0,missing] |> replace(_,missing=>NaN)
  0.000007 seconds (6 allocations: 352 bytes)
Out[49]:
4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

Another way to do this,

In [48]:
# 첫번째 파라미터가 missing이 아니면 첫번째를 리턴하고 missing이면 
# 두번째 파라미터를 리턴 한다.
@time @pipe [1.0, missing, 2.0, missing] .|> coalesce(_,NaN)
  0.109858 seconds (149.50 k allocations: 9.213 MiB, 99.38% compilation time)
Out[48]:
4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

You can also use recode from CategoricalArrays.jl if you have a default output value.

In [51]:
using CategoricalArrays
In [55]:
@pipe [1.0,missing,2.0,missing] |> recode(_,0,missing=>1)
Out[55]:
4-element Vector{Int64}:
 0
 1
 0
 1
In [57]:
using DataFrames
In [58]:
df = DataFrame(a=[1,2,missing],b=["a","b",missing])
Out[58]:

3 rows × 2 columns

ab
Int64?String?
11a
22b
3missingmissing
In [61]:
replace!(df.a,missing=>100)
Out[61]:
3-element Vector{Union{Missing, Int64}}:
   1
   2
 100
In [63]:
df.b = @pipe df.b .|> coalesce(_,100)
Out[63]:
3-element Vector{Any}:
    "a"
    "b"
 100
In [64]:
df
Out[64]:

3 rows × 2 columns

ab
Int64?Any
11a
22b
3100100

You can use unique or levels to get unique values with or without missings, repectivery.

In [77]:
[1,missing,2,missing] |> unique
Out[77]:
3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2
In [78]:
[1,missing,2,missing] |> levels
Out[78]:
2-element Vector{Int64}:
 1
 2
In [88]:
x = [1,2,3]
y = allowmissing(x)
Out[88]:
3-element Vector{Union{Missing, Int64}}:
 1
 2
 3
In [90]:
push!(y,missing)
Out[90]:
4-element Vector{Union{Missing, Int64}}:
 1
 2
 3
  missing
In [91]:
x = [1,2,3]
y = allowmissing(x)
z = disallowmissing(y)
Out[91]:
3-element Vector{Int64}:
 1
 2
 3
In [92]:
push!(z,missing)
MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{S}, ::CategoricalValue) where S<:Union{AbstractChar, AbstractString, Number} at /home/shpark/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:92
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  ...

Stacktrace:
 [1] push!(a::Vector{Int64}, item::Missing)
   @ Base ./array.jl:928
 [2] top-level scope
   @ In[92]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

disallowmissing has error keyword argument that can be used to decide how it should behave when it encounters a column that actually contains a missing value

In [110]:
@time df = allowmissing(DataFrame(ones(2,3),:auto))
  0.000050 seconds (51 allocations: 4.766 KiB)
Out[110]:

2 rows × 3 columns

x1x2x3
Float64?Float64?Float64?
11.01.01.0
21.01.01.0
In [111]:
@time df = @pipe ones(2,3) |> DataFrame(_,:auto) |> allowmissing
  0.000053 seconds (51 allocations: 4.766 KiB)
Out[111]:

2 rows × 3 columns

x1x2x3
Float64?Float64?Float64?
11.01.01.0
21.01.01.0
In [112]:
@time df = (allowmissing  DataFrame)(ones(2,3),:auto)
  0.000058 seconds (52 allocations: 4.797 KiB)
Out[112]:

2 rows × 3 columns

x1x2x3
Float64?Float64?Float64?
11.01.01.0
21.01.01.0
In [113]:
df[1,1] = missing
Out[113]:
missing
In [114]:
disallowmissing(df) # an error is thrown
MethodError: Cannot `convert` an object of type Missing to an object of type Float64
Closest candidates are:
  convert(::Type{S}, ::CategoricalValue) where S<:Union{AbstractChar, AbstractString, Number} at /home/shpark/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:92
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...

Stacktrace:
  [1] setindex!(A::Vector{Float64}, x::Missing, i1::Int64)
    @ Base ./array.jl:839
  [2] _unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::Vector{Union{Missing, Float64}}, soffs::Int64, n::Int64)
    @ Base ./array.jl:235
  [3] unsafe_copyto!
    @ ./array.jl:289 [inlined]
  [4] _copyto_impl!
    @ ./array.jl:313 [inlined]
  [5] copyto!
    @ ./array.jl:299 [inlined]
  [6] copyto!
    @ ./array.jl:325 [inlined]
  [7] copyto_axcheck!
    @ ./abstractarray.jl:1056 [inlined]
  [8] AbstractVector{Float64}(A::Vector{Union{Missing, Float64}})
    @ Base ./array.jl:541
  [9] AbstractArray
    @ ./boot.jl:475 [inlined]
 [10] convert
    @ ./abstractarray.jl:15 [inlined]
 [11] disallowmissing(x::Vector{Union{Missing, Float64}})
    @ Missings ~/.julia/packages/Missings/hn4Ye/src/Missings.jl:50
 [12] disallowmissing(df::DataFrame, cols::Colon; error::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/abstractdataframe/abstractdataframe.jl:1982
 [13] disallowmissing (repeats 2 times)
    @ ~/.julia/packages/DataFrames/pVFzb/src/abstractdataframe/abstractdataframe.jl:1974 [inlined]
 [14] top-level scope
    @ In[114]:1
 [15] eval
    @ ./boot.jl:360 [inlined]
 [16] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094
In [116]:
# column :x1 is left untouched as it contains missing
disallowmissing(df, error=false)
Out[116]:

2 rows × 3 columns

x1x2x3
Float64?Float64Float64
1missing1.01.0
21.01.01.0

In this next example,we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in column 1 and 3, the types of those columns become Union{Int64, Missing}.

In [117]:
x = DataFrame(rand(Int,2,3),:auto)
Out[117]:

2 rows × 3 columns

x1x2x3
Int64Int64Int64
1-20900804536073797819709387473001176741287466900659119197
2-1577814882947858749-6673046903430235044-4206324573690547144
In [129]:
@pipe x |> eachcol .|> eltype |> println("Before : ",_)
@pipe x |> allowmissing!(_,1) # make first column accept missings
@pipe x |> allowmissing!(_,:x3) # make :x3 column accept missings
@pipe x |> eachcol .|> eltype |> println("After : ",_)
Before : DataType[Int64, Int64, Int64]
After : Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we'll use completecase to find all the rows of a DataFrame that have complete data.

In [130]:
x = DataFrame(A=[1,missing,3,4], B=["A","B",missing,"C"])
Out[130]:

4 rows × 2 columns

AB
Int64?String?
11A
2missingB
33missing
44C
In [131]:
@pipe x |> completecases |> println("Complete cases:\n",_)
Complete cases:
Bool[1, 0, 0, 1]

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame and either create a new DataFrame or mutable the original in-place.

In [132]:
y = x |> dropmissing
x |> dropmissing!
;
In [133]:
x
Out[133]:

2 rows × 2 columns

AB
Int64String
11A
24C
In [134]:
y
Out[134]:

2 rows × 2 columns

AB
Int64String
11A
24C
In [135]:
x |> describe
Out[135]:

2 rows × 7 columns

variablemeanminmedianmaxnmissingeltype
SymbolUnion…AnyUnion…AnyInt64DataType
1A2.512.540Int64
2BAC0String

Alternatively you can pass disallowmissing keyword argument to dropmissing and dropmissing!

In [140]:
x = DataFrame(A=[1,missing,3,4],B=["A","B",missing,"C"])
Out[140]:

4 rows × 2 columns

AB
Int64?String?
11A
2missingB
33missing
44C
In [141]:
@pipe x |> dropmissing!(_,disallowmissing=false)
Out[141]:

2 rows × 2 columns

AB
Int64?String?
11A
24C

Making functions missing-aware

If we have a function that does not handle missing values we can wrap it using passmissing function so that if any of its positional arguments is missing we will get a missing value in return. In the example below we change how string function behaves:

missing을 취급할 수 없는 함수를 wrap하여 입력값중에 missing이 있는 경우 missing을 리턴하도록 처리 할 수 있는 passmissing을 제공한다. 예를 들어 string의 경우 아래 처럼 재대로 missing을 처리 하지 못하는데 passmissing를 이용하여 string 함수를 wrap하여 missing을 처리 할 수 있게 한다.

In [142]:
string(missing)
Out[142]:
"missing"
In [176]:
@time string(missing," ", missing)
  0.000008 seconds (3 allocations: 176 bytes)
Out[176]:
"missing missing"
In [177]:
@time @pipe (missing, " ", missing)... |> string
  0.000008 seconds (3 allocations: 176 bytes)
Out[177]:
"missing missing"
In [178]:
@time @pipe (1,2,3)...|>string
  0.000009 seconds (9 allocations: 464 bytes)
Out[178]:
"123"
In [179]:
@time string(1,2,3)
  0.000010 seconds (9 allocations: 464 bytes)
Out[179]:
"123"
In [184]:
lift_string = passmissing(string)
Out[184]:
(::Missings.PassMissing{typeof(string)}) (generic function with 2 methods)
In [185]:
missing |> lift_string
Out[185]:
missing
In [187]:
@pipe (missing," ",missing)... |> lift_string
Out[187]:
missing
In [188]:
lift_string(1,2,3)
Out[188]:
"123"

Aggregating rows containing missing values

In [197]:
df = DataFrame(a=[1,missing,missing], b=[1,2,missing])
Out[197]:

3 rows × 2 columns

ab
Int64?Int64?
111
2missing2
3missingmissing

If we just sum on the rows we get two missing entries:

In [199]:
@pipe df |> eachrow .|> sum
Out[199]:
3-element Vector{Union{Missing, Int64}}:
 2
  missing
  missing

One can apply skipmissing on the rows to avoid this problem:

In [205]:
@pipe df |> eachrow .|> skipmissing .|> sum
ArgumentError: reducing over an empty collection is not allowed

Stacktrace:
  [1] _empty_reduce_error()
    @ Base ./reduce.jl:299
  [2] reduce_empty_iter(op::Base.BottomRF{typeof(Base.add_sum)}, itr::Base.SkipMissing{DataFrameRow{DataFrame, DataFrames.Index}}, #unused#::Base.EltypeUnknown)
    @ Base ./reduce.jl:356
  [3] reduce_empty_iter
    @ ./reduce.jl:354 [inlined]
  [4] foldl_impl
    @ ./reduce.jl:49 [inlined]
  [5] mapfoldl_impl
    @ ./reduce.jl:44 [inlined]
  [6] #mapfoldl#214
    @ ./reduce.jl:160 [inlined]
  [7] mapfoldl
    @ ./reduce.jl:160 [inlined]
  [8] #mapreduce#218
    @ ./reduce.jl:287 [inlined]
  [9] mapreduce
    @ ./reduce.jl:287 [inlined]
 [10] #sum#221
    @ ./reduce.jl:501 [inlined]
 [11] sum
    @ ./reduce.jl:501 [inlined]
 [12] #sum#222
    @ ./reduce.jl:528 [inlined]
 [13] sum
    @ ./reduce.jl:528 [inlined]
 [14] #54
    @ ~/.julia/packages/Pipe/5PIGG/src/Pipe.jl:61 [inlined]
 [15] |>
    @ ./operators.jl:858 [inlined]
 [16] _broadcast_getindex_evalf
    @ ./broadcast.jl:648 [inlined]
 [17] _broadcast_getindex
    @ ./broadcast.jl:621 [inlined]
 [18] getindex
    @ ./broadcast.jl:575 [inlined]
 [19] copyto_nonleaf!(dest::Vector{Int64}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(|>), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{Base.Broadcast.Extruded{DataFrames.DataFrameRows{DataFrame}, Tuple{Bool}, Tuple{Int64}}, Base.RefValue{var"#53#55"}}}, Base.RefValue{var"#54#56"}}}, iter::Base.OneTo{Int64}, state::Int64, count::Int64)
    @ Base.Broadcast ./broadcast.jl:1078
 [20] copy
    @ ./broadcast.jl:930 [inlined]
 [21] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{DataFrames.DataFrameRows{DataFrame}, Base.RefValue{var"#53#55"}}}, Base.RefValue{var"#54#56"}}})
    @ Base.Broadcast ./broadcast.jl:883
 [22] top-level scope
    @ In[205]:1
 [23] eval
    @ ./boot.jl:360 [inlined]
 [24] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

However, we get an error. The problem is that the last row of df contains only missing values, and since eachrow is type unstable the eltype of the result of skipmissing is unknown

df의 마지막 row는 missing값만 두개가 있어,각 eachrow는 type unstable이기 때문에 skipmissing의 결과의 eltype이 unkown이다(그래서 Any 로 표시 된다)

In [207]:
@pipe df |> eachrow(_)[end] |> skipmissing
Out[207]:
skipmissing(DataFrameRow
 Row  a        b       
      Int64?   Int64?  
─────┼──────────────────
   3 │ missing  missing )
In [208]:
@pipe df |> eachrow(_)[end] |> skipmissing |> collect
Out[208]:
Any[]

마지막 row를 제외하면 잘 나오는것을 확인 할 수 있다.

In [206]:
@pipe df[1:2,:] |> eachrow .|> skipmissing .|> sum
Out[206]:
2-element Vector{Int64}:
 2
 2

In such case it is useful to switch to Tables.namedtupleiterator which is type stable as discussed in 01_constructors.ipynb notebook

In [210]:
@pipe df |> Tables.namedtupleiterator |> collect
Out[210]:
3-element Vector{NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}}:
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((1, 1))
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((missing, 2))
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((missing, missing))
In [211]:
@pipe df |> Tables.namedtupleiterator .|> skipmissing .|> sum
Out[211]:
3-element Vector{Int64}:
 2
 2
 0
In [ ]: