Introduction to DataFrames¶

Handling missing values¶

using Pipe

missing, typeof(missing)

(missing, Missing)

Arrays automatically create an appropriate union type

x = [1,2,missing,3]

4-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
 3

1 |>ismissing, missing |> ismissing, x|>ismissing, x .|> ismissing

(false, true, false, Bool[0, 0, 1, 0])

x |> eltype, x |> eltype |> nonmissingtype

(Union{Missing, Int64}, Int64)

Comparisons involving missing produce missing (three-valued logic); only the identity check === and isequal return a Bool.

missing === missing, missing == missing, missing != missing, missing < missing

(true, missing, missing, missing)

@pipe missing |> isequal(_,missing)

true

isequal(missing,missing)

true

1 == missing, 1 != missing, 1 < missing

(missing, missing, missing)
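
Because == can return missing, using it directly as an `if` condition would throw a TypeError. The following is a minimal sketch of two common ways to get a definite Bool, plus the three-valued behavior of the logical operators; the variable names are illustrative only.

```julia
a, b = 1, missing

# `a == b` evaluates to `missing`, not `false`.
cmp = a == b

# Option 1: collapse `missing` to a definite Bool with `coalesce`.
eq_or_false = coalesce(a == b, false)

# Option 2: use `isequal`, which always returns a Bool.
same = isequal(a, b)

# Logical operators follow three-valued logic: the result is only
# `missing` when the known operand does not decide the answer.
t = true | missing    # true
f = false & missing   # false
m = true & missing    # missing
```

Which option fits depends on whether an unknown comparison should count as "not equal" (coalesce) or whether missing should be treated as an ordinary value (isequal).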

Under isless, which sort uses by default, missing is considered greater than any numeric value (unlike <, which returns missing as shown above).
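
A short sketch of this ordering, using only Base functions; the variable names are illustrative.

```julia
# `isless` defines a total order in which `missing` comes last.
lt1 = isless(1, missing)      # true
lt2 = isless(missing, 1)      # false
lt3 = isless(Inf, missing)    # true, even for infinity

# `sort` relies on `isless`, so missing values end up at the end.
sorted = sort([3, missing, 1])
```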

In the next few examples, we see that many (not all) functions handle `missing`¶

map(x->x(missing),[sin,cos,zero,sqrt]) # part 1

4-element Vector{Missing}:
 missing
 missing
 missing
 missing

map(x->x(missing,1),[+,-,*,/,div]) # part 2

5-element Vector{Missing}:
 missing
 missing
 missing
 missing
 missing

using Statistics # needed for mean

map(x->x([1,2,missing]),[minimum,maximum,extrema,mean,float]) # part 3

5-element Vector{Any}:
 missing
 missing
 (missing, missing)
 missing
 Union{Missing, Float64}[1.0, 2.0, missing]

[1,missing,2,missing] |> skipmissing |> collect

2-element Vector{Int64}:
 1
 2
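
skipmissing returns a lazy iterator, so reductions can consume it directly without the intermediate collect. A minimal sketch (variable names are illustrative):

```julia
using Statistics  # for mean

x = [1, missing, 2, missing]

total = sum(skipmissing(x))       # sums only the non-missing entries
avg = mean(skipmissing(x))        # mean of the non-missing entries
n_missing = count(ismissing, x)   # how many entries were skipped
```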

@time @pipe [1.0,missing,2.0,missing] |> replace(_,missing=>NaN)

  0.000007 seconds (6 allocations: 352 bytes)

4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN

Another way to do this is with coalesce:

# coalesce returns its first argument unless it is missing,
# in which case it returns the second argument.
@time @pipe [1.0, missing, 2.0, missing] .|> coalesce(_,NaN)

  0.109858 seconds (149.50 k allocations: 9.213 MiB, 99.38% compilation time)

4-element Vector{Float64}:
   1.0
 NaN
   2.0
 NaN
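
coalesce accepts any number of arguments and returns the first one that is not missing, so it can also fill one vector from another before falling back to a constant. A small sketch with made-up data (the names primary and backup are hypothetical):

```julia
primary = [1, missing, missing]
backup = [10, 20, missing]

# Element-wise: take `primary` if present, else `backup`, else 0.
filled = coalesce.(primary, backup, 0)
```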

You can also use recode from CategoricalArrays.jl if you have a default output value.

using CategoricalArrays

@pipe [1.0,missing,2.0,missing] |> recode(_,0,missing=>1)

4-element Vector{Int64}:
 0
 1
 0
 1

using DataFrames

df = DataFrame(a=[1,2,missing],b=["a","b",missing])

replace!(df.a,missing=>100)

3-element Vector{Union{Missing, Int64}}:
   1
   2
 100

df.b = @pipe df.b .|> coalesce(_,100)

3-element Vector{Any}:
    "a"
    "b"
 100

df

You can use unique or levels to get unique values with or without missings, respectively.

[1,missing,2,missing] |> unique

3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2

[1,missing,2,missing] |> levels

2-element Vector{Int64}:
 1
 2

x = [1,2,3]
y = allowmissing(x)

3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

push!(y,missing)

4-element Vector{Union{Missing, Int64}}:
 1
 2
 3
  missing

x = [1,2,3]
y = allowmissing(x)
z = disallowmissing(y)

3-element Vector{Int64}:
 1
 2
 3

push!(z,missing)

MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{S}, ::CategoricalValue) where S<:Union{AbstractChar, AbstractString, Number} at /home/shpark/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:92
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  ...

Stacktrace:
 [1] push!(a::Vector{Int64}, item::Missing)
   @ Base ./array.jl:928
 [2] top-level scope
   @ In[92]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

disallowmissing has an error keyword argument that controls how it behaves when it encounters a column that actually contains a missing value: with error=true (the default) it throws, and with error=false it leaves such columns unchanged.

@time df = allowmissing(DataFrame(ones(2,3),:auto))

  0.000050 seconds (51 allocations: 4.766 KiB)

@time df = @pipe ones(2,3) |> DataFrame(_,:auto) |> allowmissing

  0.000053 seconds (51 allocations: 4.766 KiB)

@time df = (allowmissing ∘ DataFrame)(ones(2,3),:auto)

  0.000058 seconds (52 allocations: 4.797 KiB)

df[1,1] = missing

missing

disallowmissing(df) # an error is thrown

MethodError: Cannot `convert` an object of type Missing to an object of type Float64
Closest candidates are:
  convert(::Type{S}, ::CategoricalValue) where S<:Union{AbstractChar, AbstractString, Number} at /home/shpark/.julia/packages/CategoricalArrays/rDwMt/src/value.jl:92
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...

Stacktrace:
  [1] setindex!(A::Vector{Float64}, x::Missing, i1::Int64)
    @ Base ./array.jl:839
  [2] _unsafe_copyto!(dest::Vector{Float64}, doffs::Int64, src::Vector{Union{Missing, Float64}}, soffs::Int64, n::Int64)
    @ Base ./array.jl:235
  [3] unsafe_copyto!
    @ ./array.jl:289 [inlined]
  [4] _copyto_impl!
    @ ./array.jl:313 [inlined]
  [5] copyto!
    @ ./array.jl:299 [inlined]
  [6] copyto!
    @ ./array.jl:325 [inlined]
  [7] copyto_axcheck!
    @ ./abstractarray.jl:1056 [inlined]
  [8] AbstractVector{Float64}(A::Vector{Union{Missing, Float64}})
    @ Base ./array.jl:541
  [9] AbstractArray
    @ ./boot.jl:475 [inlined]
 [10] convert
    @ ./abstractarray.jl:15 [inlined]
 [11] disallowmissing(x::Vector{Union{Missing, Float64}})
    @ Missings ~/.julia/packages/Missings/hn4Ye/src/Missings.jl:50
 [12] disallowmissing(df::DataFrame, cols::Colon; error::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/pVFzb/src/abstractdataframe/abstractdataframe.jl:1982
 [13] disallowmissing (repeats 2 times)
    @ ~/.julia/packages/DataFrames/pVFzb/src/abstractdataframe/abstractdataframe.jl:1974 [inlined]
 [14] top-level scope
    @ In[114]:1
 [15] eval
    @ ./boot.jl:360 [inlined]
 [16] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

# column :x1 is left untouched as it contains missing
disallowmissing(df, error=false)

In this next example, we show that the type of each column in x is initially Int64. After using allowmissing! to accept missing values in columns 1 and 3, the types of those columns become Union{Missing, Int64}.

x = DataFrame(rand(Int,2,3),:auto)

@pipe x |> eachcol .|> eltype |> println("Before : ",_)
@pipe x |> allowmissing!(_,1) # make first column accept missings
@pipe x |> allowmissing!(_,:x3) # make :x3 column accept missings
@pipe x |> eachcol .|> eltype |> println("After : ",_)

Before : DataType[Int64, Int64, Int64]
After : Type[Union{Missing, Int64}, Int64, Union{Missing, Int64}]

In this next example, we'll use completecases to find all the rows of a DataFrame that have complete data.

x = DataFrame(A=[1,missing,3,4], B=["A","B",missing,"C"])

@pipe x |> completecases |> println("Complete cases:\n",_)

Complete cases:
Bool[1, 0, 0, 1]
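
completecases also accepts a column selector as a second argument, and its result can be used directly as a row index. The sketch below rebuilds the same x so that it runs on its own; the names mask_A and complete_rows are illustrative.

```julia
using DataFrames

x = DataFrame(A=[1,missing,3,4], B=["A","B",missing,"C"])

# Only check column :A for missing values.
mask_A = completecases(x, :A)

# Use the full mask to select rows with no missing value in any column.
complete_rows = x[completecases(x), :]
```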

We can use dropmissing or dropmissing! to remove the rows with incomplete data from a DataFrame, either creating a new DataFrame or mutating the original in place.

y = x |> dropmissing
x |> dropmissing!
;

x

y

x |> describe

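By default, dropmissing and dropmissing! also convert the affected columns so that they no longer allow missing values. Pass the disallowmissing=false keyword argument to keep the original column types: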

x = DataFrame(A=[1,missing,3,4],B=["A","B",missing,"C"])

@pipe x |> dropmissing!(_,disallowmissing=false)
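
The effect of the keyword is easiest to see in the column element types. A minimal sketch using the non-mutating dropmissing on the same data (the names strict and loose are illustrative):

```julia
using DataFrames

x = DataFrame(A=[1,missing,3,4], B=["A","B",missing,"C"])

strict = dropmissing(x)                        # default: disallowmissing=true
loose = dropmissing(x, disallowmissing=false)  # keep Union{Missing, T} columns

strict_type = eltype(strict.A)   # no longer allows missing
loose_type = eltype(loose.A)     # still allows missing
```

Keeping the Union type can be convenient when missing values will be pushed into the column again later.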

Making functions `missing`-aware¶

If a function does not handle missing values, we can wrap it with passmissing: the wrapped function returns missing whenever any of its positional arguments is missing, and otherwise calls the original function. For example, string does not propagate missing (it returns the text "missing"), so below we wrap it to change that behavior:

string(missing)

"missing"

@time string(missing," ", missing)

  0.000008 seconds (3 allocations: 176 bytes)

"missing missing"

@time @pipe (missing, " ", missing)... |> string

  0.000008 seconds (3 allocations: 176 bytes)

"missing missing"

@time @pipe (1,2,3)...|>string

  0.000009 seconds (9 allocations: 464 bytes)

"123"

@time string(1,2,3)

  0.000010 seconds (9 allocations: 464 bytes)

"123"

lift_string = passmissing(string)

(::Missings.PassMissing{typeof(string)}) (generic function with 2 methods)

missing |> lift_string

missing

@pipe (missing," ",missing)... |> lift_string

missing
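
A wrapped function can also be broadcast over a vector that contains missing values. The sketch below uses uppercase, which has no method for missing on its own; the names labels and shout are hypothetical.

```julia
using Missings  # provides passmissing (also re-exported by DataFrames)

labels = ["alice", missing, "bob"]

# uppercase(missing) would throw a MethodError; the wrapper returns missing instead.
shout = passmissing(uppercase)
result = shout.(labels)
```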

lift_string(1,2,3)

"123"

Aggregating rows containing missing values¶

df = DataFrame(a=[1,missing,missing], b=[1,2,missing])

If we just sum on the rows we get two missing entries:

@pipe df |> eachrow .|> sum

3-element Vector{Union{Missing, Int64}}:
 2
  missing
  missing

One can apply skipmissing on the rows to avoid this problem:

@pipe df |> eachrow .|> skipmissing .|> sum

ArgumentError: reducing over an empty collection is not allowed

Stacktrace:
  [1] _empty_reduce_error()
    @ Base ./reduce.jl:299
  [2] reduce_empty_iter(op::Base.BottomRF{typeof(Base.add_sum)}, itr::Base.SkipMissing{DataFrameRow{DataFrame, DataFrames.Index}}, #unused#::Base.EltypeUnknown)
    @ Base ./reduce.jl:356
  [3] reduce_empty_iter
    @ ./reduce.jl:354 [inlined]
  [4] foldl_impl
    @ ./reduce.jl:49 [inlined]
  [5] mapfoldl_impl
    @ ./reduce.jl:44 [inlined]
  [6] #mapfoldl#214
    @ ./reduce.jl:160 [inlined]
  [7] mapfoldl
    @ ./reduce.jl:160 [inlined]
  [8] #mapreduce#218
    @ ./reduce.jl:287 [inlined]
  [9] mapreduce
    @ ./reduce.jl:287 [inlined]
 [10] #sum#221
    @ ./reduce.jl:501 [inlined]
 [11] sum
    @ ./reduce.jl:501 [inlined]
 [12] #sum#222
    @ ./reduce.jl:528 [inlined]
 [13] sum
    @ ./reduce.jl:528 [inlined]
 [14] #54
    @ ~/.julia/packages/Pipe/5PIGG/src/Pipe.jl:61 [inlined]
 [15] |>
    @ ./operators.jl:858 [inlined]
 [16] _broadcast_getindex_evalf
    @ ./broadcast.jl:648 [inlined]
 [17] _broadcast_getindex
    @ ./broadcast.jl:621 [inlined]
 [18] getindex
    @ ./broadcast.jl:575 [inlined]
 [19] copyto_nonleaf!(dest::Vector{Int64}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(|>), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{Base.Broadcast.Extruded{DataFrames.DataFrameRows{DataFrame}, Tuple{Bool}, Tuple{Int64}}, Base.RefValue{var"#53#55"}}}, Base.RefValue{var"#54#56"}}}, iter::Base.OneTo{Int64}, state::Int64, count::Int64)
    @ Base.Broadcast ./broadcast.jl:1078
 [20] copy
    @ ./broadcast.jl:930 [inlined]
 [21] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(|>), Tuple{DataFrames.DataFrameRows{DataFrame}, Base.RefValue{var"#53#55"}}}, Base.RefValue{var"#54#56"}}})
    @ Base.Broadcast ./broadcast.jl:883
 [22] top-level scope
    @ In[205]:1
 [23] eval
    @ ./boot.jl:360 [inlined]
 [24] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

However, we get an error. The last row of df contains only missing values, and because eachrow is type unstable, the eltype of the skipmissing result is unknown (it is reported as Any). With no elements and no known element type, sum cannot choose a zero value to return.

@pipe df |> eachrow(_)[end] |> skipmissing

skipmissing(DataFrameRow
 Row │ a        b       
     │ Int64?   Int64?  
─────┼──────────────────
   3 │ missing  missing )

@pipe df |> eachrow(_)[end] |> skipmissing |> collect

Any[]

If we exclude the last row, the sums are computed as expected.

@pipe df[1:2,:] |> eachrow .|> skipmissing .|> sum

2-element Vector{Int64}:
 2
 2

In such a case it is useful to switch to Tables.namedtupleiterator, which is type stable, as discussed in the 01_constructors.ipynb notebook.

@pipe df |> Tables.namedtupleiterator |> collect

3-element Vector{NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}}:
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((1, 1))
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((missing, 2))
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, Int64}}}((missing, missing))

@pipe df |> Tables.namedtupleiterator .|> skipmissing .|> sum

3-element Vector{Int64}:
 2
 2
 0
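
Another option, sketched here under the assumption of Julia 1.6 or later (where sum accepts an init keyword), is to keep eachrow and supply the starting value explicitly so that an empty row has a defined result. The name row_sums is illustrative.

```julia
using DataFrames

df = DataFrame(a=[1,missing,missing], b=[1,2,missing])

# `init=0` gives `sum` a value to return when a row has no non-missing entries.
row_sums = [sum(skipmissing(row); init=0) for row in eachrow(df)]
```

This avoids the error but still iterates over type-unstable rows, so Tables.namedtupleiterator is likely the better choice for large tables.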

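Rendered DataFrame outputs for cells above:

df = DataFrame(a=[1,2,missing],b=["a","b",missing])

3×2 DataFrame
 Row │ a        b
     │ Int64?   String?
─────┼──────────────────
   1 │       1  a
   2 │       2  b
   3 │ missing  missing

x = DataFrame(rand(Int,2,3),:auto)

2×3 DataFrame
 Row │ x1                    x2                    x3
     │ Int64                 Int64                 Int64
─────┼────────────────────────────────────────────────────────────────
   1 │ -2090080453607379781    970938747300117674   1287466900659119197
   2 │ -1577814882947858749  -6673046903430235044  -4206324573690547144

x = DataFrame(A=[1,missing,3,4], B=["A","B",missing,"C"]) (shown once here; the same output appears for both cells that construct x)

4×2 DataFrame
 Row │ A        B
     │ Int64?   String?
─────┼──────────────────
   1 │       1  A
   2 │ missing  B
   3 │       3  missing
   4 │       4  C

df = DataFrame(a=[1,missing,missing], b=[1,2,missing])

3×2 DataFrame
 Row │ a        b
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1        1
   2 │ missing        2
   3 │ missing  missing

x |> describe (after dropmissing!)

2×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ A         2.5     1    2.5     4           0  Int64
   2 │ B                 A            C           0  String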
