Introduction to DataFrames

04_loadsave

DataFrames.jl v1.2, Julia 1.6.1

Load and save DataFrames

Here we'll use CSV.jl to read and write CSV files; Arrow.jl, JLSO.jl, and the Serialization standard library, which let us work with binary formats; and JSONTables.jl for JSON interaction. Finally, we consider the custom JDF.jl format. We also show how to handle gzip-compressed files with CodecZlib.jl and zip archives with ZipFile.jl.

In [2]:
using DataFrames
using Pipe
using Arrow
using CSV
using Serialization
using JLSO # JLSO is a storage container for serialized Julia objects
using JSONTables
using CodecZlib
using ZipFile
using JDF # DataFrame Load/Save Package
using StatsPlots # for charts
using Mmap # for memory-mapping compressed files
In [3]:
x = DataFrame(A=[true, false, true], B=[1,2,missing],
              C=[missing, "b", "c"], D=['a',missing,'c'])
Out[3]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [4]:
@pipe x |> eachcol .|> eltype
Out[4]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

CSV.jl

Let's use CSV.jl to save x to disk; make sure that x1.csv does not conflict with a file already in your working directory.

In [10]:
CSV.write("x1.csv",x)
Out[10]:
"x1.csv"
In [13]:
read("x1.csv") |> print
UInt8[0x41, 0x2c, 0x42, 0x2c, 0x43, 0x2c, 0x44, 0x0a, 0x74, 0x72, 0x75, 0x65, 0x2c, 0x31, 0x2c, 0x2c, 0x61, 0x0a, 0x66, 0x61, 0x6c, 0x73, 0x65, 0x2c, 0x32, 0x2c, 0x62, 0x2c, 0x0a, 0x74, 0x72, 0x75, 0x65, 0x2c, 0x2c, 0x63, 0x2c, 0x63, 0x0a]
In [14]:
read("x1.csv",String) |> print
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
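
CSV.write and CSV.read accept many keyword options controlling output and parsing. As a hedged aside (the file name x1_semi.csv and the delimiter choice are ours), here is a minimal sketch of round-tripping the same data frame with a non-default delimiter:

# a sketch: write and read back using a semicolon delimiter
CSV.write("x1_semi.csv", x; delim=';')
CSV.read("x1_semi.csv", DataFrame; delim=';')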
In [16]:
y = CSV.read("x1.csv",DataFrame)
Out[16]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c

When the DataFrame was loaded back from CSV, the type of column :D changed from Char? to String? (a sketch of restoring the original type follows after the eltype check below).

In [17]:
@pipe y |> eachcol .|> eltype
Out[17]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
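
A minimal sketch, using only Base functions, of restoring the original Char? element type after reading; only extracts the single character of each one-character string:

# convert the String? column back to Char?; missing values are preserved
y.D = [ismissing(v) ? missing : only(v) for v in y.D]
eltype(y.D) # Union{Missing, Char}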

Serialization, JDF.jl and JLSO.jl

The first serialization option is to use Serialization.serialize. Note that files written this way may fail to load across different Julia versions or under a different system image.

In [18]:
y = open("x.bin","w") do io
  serialize(io,x)
end

If DataFrames.jl is not loaded in the session doing the reading, the deserialized object will not be recognized as a DataFrame.

In [20]:
# the result is identical to the original data; check column :D
y = open(deserialize,"x.bin")
Out[20]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [21]:
@pipe y |> eachcol .|> eltype
Out[21]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

JDF.jl

JDF.jl is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the JDF.save function (savejdf is an older name for the same operation).

In [22]:
JDF.save("x.jdf",x)
Out[22]:
JDFFile{String}("x.jdf")
In [23]:
x_loaded = JDF.load("x.jdf") |> DataFrame
Out[23]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [24]:
isequal(x_loaded,x)
Out[24]:
true
In [27]:
x_loaded2 = JDF.loadjdf("x.jdf") |> DataFrame # loadjdf is the older name for JDF.load
Out[27]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [28]:
isequal(x_loaded2,x)
Out[28]:
true

JDF.jl also offers the ability to load only certain columns from disk, which helps when working with large files:

In [31]:
# set up a JDFFile, an on-disk representation of `x` backed by JDF.jl
x_ondisk = jdf"x.jdf"
Out[31]:
JDFFile{String}("x.jdf")

We can see all the column names of x without loading it into memory:

In [32]:
names(x_ondisk)
Out[32]:
4-element Vector{Symbol}:
 :A
 :B
 :C
 :D

Below is an example of how to load only columns :A and :D:

In [33]:
xd = JDF.load(x_ondisk; cols = ["A", "D"]) |> DataFrame
Out[33]:

3 rows × 2 columns

 Row │ A     D
     │ Bool  Char?
─────┼───────────────
   1 │ 1     a
   2 │ 0     missing
   3 │ 1     c

JLSO.jl

Another way to perform serialization is by using the JLSO.jl library.

In [34]:
JLSO.save("x.jlso", :data=>x)
In [35]:
y = JLSO.load("x.jlso")[:data]
Out[35]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [38]:
@pipe y |> eachcol .|> eltype
Out[38]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}
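
A convenience of JLSO.jl is that one file can store several labelled objects, and JLSO.load returns a Dict keyed by those labels. A minimal sketch (the file name xmeta.jlso and the :note key are ours):

# a sketch: store several labelled objects in one JLSO file
JLSO.save("xmeta.jlso", :data => x, :note => "saved from 04_loadsave")
d = JLSO.load("xmeta.jlso")
isequal(d[:data], x) # true, provided DataFrames.jl is loaded when reading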

JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in either a row-oriented or a column-oriented layout. We present both options below.

In [45]:
open(io->arraytable(io,x),"x1.json","w")
Out[45]:
106
In [46]:
open(io->objecttable(io,x),"x2.json","w")
Out[46]:
76
In [47]:
JSONTables.read("x1.json",String) |> print
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
In [48]:
JSONTables.read("x2.json",String) |> print
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
In [49]:
y1 = open(jsontable,"x1.json") |> DataFrame
Out[49]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [50]:
open(jsontable,"x2.json") |> DataFrame
Out[50]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
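
jsontable also accepts a JSON string directly, so you can build a DataFrame from in-memory JSON without touching the disk. A minimal sketch (the literal JSON is ours):

# a sketch: build a table from an in-memory JSON string
jsontable("""{"a": [1, 2], "b": ["x", "y"]}""") |> DataFrame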

Arrow.jl

Finally, we use the Apache Arrow format, which, in particular, allows for data interchange with R or Python.

In [52]:
Arrow.write("x.arrow",x)
Out[52]:
"x.arrow"
In [53]:
y = Arrow.Table("x.arrow") |> DataFrame
Out[53]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c

Note that the columns of y are immutable:

In [54]:
y.A[1] = false
ReadOnlyMemoryError()

Stacktrace:
 [1] setindex!
   @ ./array.jl:839 [inlined]
 [2] setindex!(p::Arrow.BoolVector{Bool}, v::Bool, i::Int64)
   @ Arrow ~/.julia/packages/Arrow/PQ5Mm/src/arraytypes/bool.jl:49
 [3] top-level scope
   @ In[54]:1
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [55]:
y.A
Out[55]:
3-element Arrow.BoolVector{Bool}:
 1
 0
 1
In [56]:
y.B
Out[56]:
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
 1
 2
  missing

You can get standard Julia Base vectors by copying the data frame:

In [57]:
y2 = copy(y)
Out[57]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [58]:
y2.A
Out[58]:
3-element Vector{Bool}:
 1
 0
 1
In [59]:
y2.A[2]=true
Out[59]:
true
In [60]:
y2
Out[60]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 1     2        b        missing
   3 │ 1     missing  c        c
In [61]:
y2.B
Out[61]:
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
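
Arrow.write can also compress the data as it is written via its compress keyword. A minimal sketch, assuming your Arrow.jl version supports this option (the file name x_zstd.arrow is ours):

# a sketch: write with zstd compression; decompression is transparent on read
Arrow.write("x_zstd.arrow", x; compress=:zstd)
Arrow.Table("x_zstd.arrow") |> DataFrame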

Basic benchmarking

Next, we'll create some files, so be careful that you don't already have these files in your working directory!

In particular, we'll time how long it takes us to write a DataFrame with 10^5 rows and 10^3 columns.

In [62]:
bigdf = DataFrame(rand(Bool, 10^5, 1000), :auto)
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf) # @time prints the timing; @elapsed captures it
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
First run
CSV.jl
 84.237866 seconds (598.99 M allocations: 17.922 GiB, 4.06% gc time, 8.20% compilation time)
Serialization
  3.466752 seconds (424.62 k allocations: 21.350 MiB, 1.26% compilation time)
JDF.jl
  0.225190 seconds (69.83 k allocations: 323.739 MiB, 11.43% gc time, 19.34% compilation time)
JLSO.jl
 15.372842 seconds (359.31 k allocations: 49.876 MiB)
Arrow.jl
 10.408485 seconds (9.50 M allocations: 647.830 MiB, 1.56% gc time, 85.10% compilation time)
JSONTables.jl arraytable
168.068537 seconds (2.30 G allocations: 54.625 GiB, 6.72% gc time, 0.29% compilation time)
JSONTables.jl objecttable
  3.442377 seconds (321.70 k allocations: 3.087 GiB, 4.75% gc time, 8.00% compilation time)
Second run
CSV.jl
 59.757214 seconds (597.48 M allocations: 17.848 GiB, 5.20% gc time)
Serialization
  3.353829 seconds (106.79 k allocations: 2.320 MiB, 0.54% compilation time)
JDF.jl
  0.270219 seconds (33.76 k allocations: 321.691 MiB)
JLSO.jl
 10.432732 seconds (156.97 k allocations: 38.690 MiB)
Arrow.jl
  0.903048 seconds (56.74 k allocations: 21.216 MiB)
JSONTables.jl arraytable
216.991245 seconds (2.30 G allocations: 54.590 GiB, 6.28% gc time, 0.01% compilation time)
JSONTables.jl objecttable
  4.363861 seconds (18.42 k allocations: 3.071 GiB, 5.29% gc time, 0.48% compilation time)
Out[62]:
4.36438275
In [63]:
groupedbar(
    # Exclude JSONTables.jl arraytable due to its much longer timing
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
            inner = 2),
    [csvwrite1, csvwrite2, serializewrite1, serializewrite2, jdfwrite1, jdfwrite2,
     jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
    group = repeat(["1st", "2nd"], outer = 6),
    ylab = "Seconds",
    title = "Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
Out[63]:
In [64]:
data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file = data_files, size = getfield.(stat.(data_files), :size))
append!(df, DataFrame(file = "bigdf.jdf", size=reduce((x,y)->x+y.size,
                                                      stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
                                                      init=0)))
sort!(df, :size)
Out[64]:

6 rows × 2 columns

 Row │ file         size
     │ String       Int64
─────┼─────────────────────────
   1 │ bigdf.arrow    16151882
   2 │ bigdf.bin      51771653
   3 │ bigdf.jdf     100769814
   4 │ bigdf1.csv    550808398
   5 │ bigdf2.json   550812399
   6 │ bigdf1.json  1240303506
In [65]:
@df df plot(:file, :size/1024^2, seriestype=:bar, title = "Format File Size (MB)", label="Size", ylab="MB")
Out[65]:
In [68]:
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
  0.903796 seconds (88.21 k allocations: 295.137 MiB)
Serialization
  5.771103 seconds (99.30 M allocations: 1.575 GiB, 1.37% gc time)
JDF.jl
  0.141858 seconds (78.33 k allocations: 363.171 MiB)
JLSO.jl
  4.253081 seconds (99.42 M allocations: 1.594 GiB, 14.31% gc time)
Arrow.jl
  0.006889 seconds (68.53 k allocations: 3.605 MiB)
  0.978933 seconds (103.50 k allocations: 101.361 MiB)
JSONTables.jl arraytable
108.807965 seconds (2.40 M allocations: 13.538 GiB, 1.40% gc time)
JSONTables.jl objecttable
  6.051225 seconds (9.07 k allocations: 2.900 GiB, 12.34% gc time)
Second run
CSV.jl
  0.844293 seconds (88.22 k allocations: 295.137 MiB)
Serialization
  5.774235 seconds (99.30 M allocations: 1.575 GiB, 1.41% gc time)
JDF.jl
  0.125702 seconds (78.33 k allocations: 363.171 MiB)
JLSO.jl
  4.227428 seconds (99.42 M allocations: 1.594 GiB, 14.12% gc time)
Arrow.jl
  0.006417 seconds (68.53 k allocations: 3.605 MiB)
  0.980623 seconds (103.50 k allocations: 101.361 MiB)
JSONTables.jl arraytable
108.512410 seconds (2.40 M allocations: 13.538 GiB, 1.40% gc time)
JSONTables.jl objecttable
  6.138250 seconds (9.07 k allocations: 2.900 GiB, 12.19% gc time)
In [69]:
# Exclude JSONTables.jl arraytable due to its much longer timing
groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", #"JSON\narraytable",
            "JSON\nobjecttable"], inner = 2),
    [csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
     arrowread1, arrowread2, arrowread1+arrowread1copy, arrowread2+arrowread2copy,
     # jsontablesaread1, jsontablesaread2,
     jsontablesoread1, jsontablesoread2],    
    group = repeat(["1st", "2nd"], outer = 7),
    ylab = "Seconds",
    title = "Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
Out[69]:

Using gzip compression

A common user requirement is to be able to load and save CSV files that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl. The same pattern applies to JSON produced by JSONTables.jl; we sketch that variant after the CSV example.

Again, make sure that you do not have a file named df_compress_test.csv.gz in your working directory.

We first generate a random data frame:

In [70]:
df = DataFrame(rand(1:10,10,1000),:auto)
Out[70]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [74]:
# GzipCompressorStream comes from CodecZlib
open("df_compress_test.csv.gz","w") do io
  stream = GzipCompressorStream(io)
  CSV.write(stream,df)
  close(stream)
end
In [82]:
@time df2 = @pipe "df_compress_test.csv.gz" |> Mmap.mmap |>
            transcode(GzipDecompressor,_) |> CSV.File |>
            DataFrame
  0.008342 seconds (22.11 k allocations: 1.536 MiB)
Out[82]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [83]:
using Chain
In [84]:
@time df3 = @chain "df_compress_test.csv.gz" begin
  Mmap.mmap
  transcode(GzipDecompressor,_)
  CSV.File 
  DataFrame
end
  0.008326 seconds (22.11 k allocations: 1.536 MiB)
Out[84]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [85]:
df == df2 == df3
Out[85]:
true
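
As promised above, the same compression pattern applies to JSON written by JSONTables.jl. A minimal sketch (the file name df_compress_test.json.gz and the df4 variable are ours):

# a sketch: gzip-compress column-oriented JSON on write
open("df_compress_test.json.gz", "w") do io
  stream = GzipCompressorStream(io)
  objecttable(stream, df)
  close(stream)
end

# and decompress on read
df4 = @pipe "df_compress_test.json.gz" |> Mmap.mmap |>
            transcode(GzipDecompressor, _) |> String |> jsontable |> DataFrame
isequal(df4, df) # true if the round trip preserved the data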

Using zip files

Sometimes you may have files compressed inside a zip file.

In such a situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files.

Here we first create a ZIP file and then read its contents back into a DataFrame.

In [86]:
df1 = DataFrame(rand(1:10,3,4), :auto)
Out[86]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 6      10     6      5
   2 │ 8      8      4      1
   3 │ 5      9      6      2
In [87]:
df2 = DataFrame(rand(1:10,3,4),:auto)
Out[87]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 1      4      5      5
   2 │ 3      10     2      9
   3 │ 3      7      5      1

We also show yet another way to write a DataFrame to a CSV file, using sprint with the "text/csv" MIME show method:

In [89]:
# write a CSV file into the zip file
w = ZipFile.Writer("x.zip")

f1 = ZipFile.addfile(w,"x1.csv")
write(f1,sprint(show,"text/csv",df1))

# write a second CSV file into zip file
f2 = ZipFile.addfile(w,"x2.csv",method=ZipFile.Deflate)
write(f2,sprint(show, "text/csv", df2))

close(w)

Now we read back the CSV files we have written:

In [90]:
z = ZipFile.Reader("x.zip")
Out[90]:
ZipFile.Reader for IOStream(<file x.zip>) containing 2 files:

uncompressedsize method  mtime            name
----------------------------------------------
              45 Store   2021-08-13 01-01 x1.csv
              45 Deflate 2021-08-13 01-01 x2.csv
In [91]:
# find the index of the file called x1.csv
index_xcsv = findfirst(x->x.name == "x1.csv",z.files)
# read the x1.csv file from the zip archive
df1_2 = CSV.read(read(z.files[index_xcsv]),DataFrame)
Out[91]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 6      10     6      5
   2 │ 8      8      4      1
   3 │ 5      9      6      2
In [92]:
df1_2 == df1
Out[92]:
true
In [93]:
# find the index of the file called x2.csv
index_xcsv = findfirst(x->x.name == "x2.csv",z.files)
# read the x2.csv file from the zip archive
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Out[93]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 1      4      5      5
   2 │ 3      10     2      9
   3 │ 3      7      5      1
In [94]:
df2_2 == df2
Out[94]:
true
In [95]:
close(z)
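
If you want every CSV inside the archive at once, you can loop over z.files. A minimal sketch (the dfs name is ours):

# a sketch: read all CSV files in the archive into a Dict keyed by file name
z = ZipFile.Reader("x.zip")
dfs = Dict(f.name => CSV.read(read(f), DataFrame) for f in z.files)
close(z)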

Finally, let's clean up. Do not run the next cells unless you are sure that they will not erase your important files!

In [120]:
@chain ["x1.csv", "x.bin", "x.jlso", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf.bin", "bigdf.jlso", "bigdf1.json", "bigdf2.json", 
             "x.zip"] begin
  # try를 사용하여 에러 무시
  foreach(x->try rm(x) catch end,_)
end
In [119]:
@chain ["bigdf.arrow","x.arrow","df_compress_test.csv.gz"] begin
  # try를 사용하여 에러 무시
  foreach(x->try rm(x,recursive=true) catch end,_)
end