Introduction to DataFrames

04_loadsave

DataFrames.jl v1.2, Julia 1.6.1

Load and save DataFrames

Here we'll use CSV.jl to read and write CSV files; Arrow.jl, JLSO.jl, and the Serialization standard library, which let us work with binary formats; and JSONTables.jl for JSON interaction. Finally, we consider the custom JDF.jl format. We also show how to handle gzip-compressed files with CodecZlib.jl and zip archives with ZipFile.jl.

In [2]:
using DataFrames
using Pipe
using Arrow
using CSV
using Serialization
using JLSO # JLSO is a storage container for serialized Julia objects
using JSONTables
using CodecZlib
using ZipFile
using JDF # DataFrame Load/Save Package
using StatsPlots # for charts
using Mmap # for memory-mapping compressed files
In [3]:
x = DataFrame(A=[true, false, true], B=[1,2,missing],
              C=[missing, "b", "c"], D=['a',missing,'c'])
Out[3]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [4]:
@pipe x |> eachcol .|> eltype
Out[4]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

CSV.jl

Let's use CSV.jl to save x to disk; make sure that x1.csv does not conflict with a file already in your working directory.

In [10]:
CSV.write("x1.csv",x)
Out[10]:
"x1.csv"
In [13]:
read("x1.csv") |> print
UInt8[0x41, 0x2c, 0x42, 0x2c, 0x43, 0x2c, 0x44, 0x0a, 0x74, 0x72, 0x75, 0x65, 0x2c, 0x31, 0x2c, 0x2c, 0x61, 0x0a, 0x66, 0x61, 0x6c, 0x73, 0x65, 0x2c, 0x32, 0x2c, 0x62, 0x2c, 0x0a, 0x74, 0x72, 0x75, 0x65, 0x2c, 0x2c, 0x63, 0x2c, 0x63, 0x0a]
In [14]:
read("x1.csv",String) |> print
A,B,C,D
true,1,,a
false,2,b,
true,,c,c
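
CSV.write and CSV.read accept many keyword options controlling output and parsing. As a hedged aside (the file name x1_semi.csv and the delimiter choice are ours), here is a minimal sketch of round-tripping the same data frame with a non-default delimiter:

# a sketch: write and read back using a semicolon delimiter
CSV.write("x1_semi.csv", x; delim=';')
CSV.read("x1_semi.csv", DataFrame; delim=';')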
In [16]:
y = CSV.read("x1.csv",DataFrame)
Out[16]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c

When the DataFrame was loaded back from CSV, the type of column :D changed from Char? to String? (a sketch of restoring the original type follows after the eltype check below).

In [17]:
@pipe y |> eachcol .|> eltype
Out[17]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, String}
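
A minimal sketch, using only Base functions, of restoring the original Char? element type after reading; only extracts the single character of each one-character string:

# convert the String? column back to Char?; missing values are preserved
y.D = [ismissing(v) ? missing : only(v) for v in y.D]
eltype(y.D) # Union{Missing, Char}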

Serialization, JDF.jl and JLSO.jl

The first serialization option is to use Serialization.serialize. Note that files written this way may fail to load across different Julia versions or under a different system image.

In [18]:
y = open("x.bin","w") do io
  serialize(io,x)
end

If DataFrames.jl is not loaded in the session doing the reading, the deserialized object will not be recognized as a DataFrame.

In [20]:
# the result is identical to the original data; check column :D
y = open(deserialize,"x.bin")
Out[20]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [21]:
@pipe y |> eachcol .|> eltype
Out[21]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}

JDF.jl

JDF.jl is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the JDF.save function (savejdf is an older name for the same operation).

In [22]:
JDF.save("x.jdf",x)
Out[22]:
JDFFile{String}("x.jdf")
In [23]:
x_loaded = JDF.load("x.jdf") |> DataFrame
Out[23]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [24]:
isequal(x_loaded,x)
Out[24]:
true
In [27]:
x_loaded2 = JDF.loadjdf("x.jdf") |> DataFrame # loadjdf is the older name for JDF.load
Out[27]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [28]:
isequal(x_loaded2,x)
Out[28]:
true

JDF.jl also offers the ability to load only certain columns from disk, which helps when working with large files:

In [31]:
# set up a JDFFile, an on-disk representation of `x` backed by JDF.jl
x_ondisk = jdf"x.jdf"
Out[31]:
JDFFile{String}("x.jdf")

We can see all the column names of x without loading it into memory:

In [32]:
names(x_ondisk)
Out[32]:
4-element Vector{Symbol}:
 :A
 :B
 :C
 :D

Below is an example of how to load only columns :A and :D:

In [33]:
xd = JDF.load(x_ondisk; cols = ["A", "D"]) |> DataFrame
Out[33]:

3 rows × 2 columns

 Row │ A     D
     │ Bool  Char?
─────┼───────────────
   1 │ 1     a
   2 │ 0     missing
   3 │ 1     c

JLSO.jl

Another way to perform serialization is by using the JLSO.jl library.

In [34]:
JLSO.save("x.jlso", :data=>x)
In [35]:
y = JLSO.load("x.jlso")[:data]
Out[35]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [38]:
@pipe y |> eachcol .|> eltype
Out[38]:
4-element Vector{Type}:
 Bool
 Union{Missing, Int64}
 Union{Missing, String}
 Union{Missing, Char}
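
A convenience of JLSO.jl is that one file can store several labelled objects, and JLSO.load returns a Dict keyed by those labels. A minimal sketch (the file name xmeta.jlso and the :note key are ours):

# a sketch: store several labelled objects in one JLSO file
JLSO.save("xmeta.jlso", :data => x, :note => "saved from 04_loadsave")
d = JLSO.load("xmeta.jlso")
isequal(d[:data], x) # true, provided DataFrames.jl is loaded when reading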

JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in either a row-oriented or a column-oriented layout. We present both options below.

In [45]:
open(io->arraytable(io,x),"x1.json","w")
Out[45]:
106
In [46]:
open(io->objecttable(io,x),"x2.json","w")
Out[46]:
76
In [47]:
JSONTables.read("x1.json",String) |> print
[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]
In [48]:
JSONTables.read("x2.json",String) |> print
{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
In [49]:
y1 = open(jsontable,"x1.json") |> DataFrame
Out[49]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [50]:
open(jsontable,"x2.json") |> DataFrame
Out[50]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  String?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
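
jsontable also accepts a JSON string directly, so you can build a DataFrame from in-memory JSON without touching the disk. A minimal sketch (the literal JSON is ours):

# a sketch: build a table from an in-memory JSON string
jsontable("""{"a": [1, 2], "b": ["x", "y"]}""") |> DataFrame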

Arrow.jl

Finally, we use the Apache Arrow format, which, in particular, allows for data interchange with R or Python.

In [52]:
Arrow.write("x.arrow",x)
Out[52]:
"x.arrow"
In [53]:
y = Arrow.Table("x.arrow") |> DataFrame
Out[53]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c

Note that the columns of y are immutable:

In [54]:
y.A[1] = false
ReadOnlyMemoryError()

Stacktrace:
 [1] setindex!
   @ ./array.jl:839 [inlined]
 [2] setindex!(p::Arrow.BoolVector{Bool}, v::Bool, i::Int64)
   @ Arrow ~/.julia/packages/Arrow/PQ5Mm/src/arraytypes/bool.jl:49
 [3] top-level scope
   @ In[54]:1
 [4] eval
   @ ./boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [55]:
y.A
Out[55]:
3-element Arrow.BoolVector{Bool}:
 1
 0
 1
In [56]:
y.B
Out[56]:
3-element Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}:
 1
 2
  missing

You can get standard Julia Base vectors by copying the data frame:

In [57]:
y2 = copy(y)
Out[57]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 0     2        b        missing
   3 │ 1     missing  c        c
In [58]:
y2.A
Out[58]:
3-element Vector{Bool}:
 1
 0
 1
In [59]:
y2.A[2]=true
Out[59]:
true
In [60]:
y2
Out[60]:

3 rows × 4 columns

 Row │ A     B        C        D
     │ Bool  Int64?   String?  Char?
─────┼─────────────────────────────────
   1 │ 1     1        missing  a
   2 │ 1     2        b        missing
   3 │ 1     missing  c        c
In [61]:
y2.B
Out[61]:
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
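
Arrow.write can also compress the data as it is written via its compress keyword. A minimal sketch, assuming your Arrow.jl version supports this option (the file name x_zstd.arrow is ours):

# a sketch: write with zstd compression; decompression is transparent on read
Arrow.write("x_zstd.arrow", x; compress=:zstd)
Arrow.Table("x_zstd.arrow") |> DataFrame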

Basic benchmarking

Next, we'll create some files, so be careful that you don't already have these files in your working directory!

In particular, we'll time how long it takes us to write a DataFrame with 10^5 rows and 10^3 columns.

In [62]:
bigdf = DataFrame(rand(Bool, 10^5, 1000), :auto)
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
println("CSV.jl")
csvwrite1 = @elapsed @time CSV.write("bigdf1.csv", bigdf) # @time prints the timing; @elapsed captures it
println("Serialization")
serializewrite1 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite1 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite1 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite1 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite1 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite1 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
csvwrite2 = @elapsed @time CSV.write("bigdf1.csv", bigdf)
println("Serialization")
serializewrite2 = @elapsed @time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JDF.jl")
jdfwrite2 = @elapsed @time JDF.save("bigdf.jdf", bigdf)
println("JLSO.jl")
jlsowrite2 = @elapsed @time JLSO.save("bigdf.jlso", :data => bigdf)
println("Arrow.jl")
arrowwrite2 = @elapsed @time Arrow.write("bigdf.arrow", bigdf)
println("JSONTables.jl arraytable")
jsontablesawrite2 = @elapsed @time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
jsontablesowrite2 = @elapsed @time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
First run
CSV.jl
 84.237866 seconds (598.99 M allocations: 17.922 GiB, 4.06% gc time, 8.20% compilation time)
Serialization
  3.466752 seconds (424.62 k allocations: 21.350 MiB, 1.26% compilation time)
JDF.jl
  0.225190 seconds (69.83 k allocations: 323.739 MiB, 11.43% gc time, 19.34% compilation time)
JLSO.jl
 15.372842 seconds (359.31 k allocations: 49.876 MiB)
Arrow.jl
 10.408485 seconds (9.50 M allocations: 647.830 MiB, 1.56% gc time, 85.10% compilation time)
JSONTables.jl arraytable
168.068537 seconds (2.30 G allocations: 54.625 GiB, 6.72% gc time, 0.29% compilation time)
JSONTables.jl objecttable
  3.442377 seconds (321.70 k allocations: 3.087 GiB, 4.75% gc time, 8.00% compilation time)
Second run
CSV.jl
 59.757214 seconds (597.48 M allocations: 17.848 GiB, 5.20% gc time)
Serialization
  3.353829 seconds (106.79 k allocations: 2.320 MiB, 0.54% compilation time)
JDF.jl
  0.270219 seconds (33.76 k allocations: 321.691 MiB)
JLSO.jl
 10.432732 seconds (156.97 k allocations: 38.690 MiB)
Arrow.jl
  0.903048 seconds (56.74 k allocations: 21.216 MiB)
JSONTables.jl arraytable
216.991245 seconds (2.30 G allocations: 54.590 GiB, 6.28% gc time, 0.01% compilation time)
JSONTables.jl objecttable
  4.363861 seconds (18.42 k allocations: 3.071 GiB, 5.29% gc time, 0.48% compilation time)
Out[62]:
4.36438275
In [63]:
groupedbar(
    # Exclude JSONTables.jl arraytable due to its much longer timing
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "JSONTables.jl\nobjecttable"],
            inner = 2),
    [csvwrite1, csvwrite2, serializewrite1, serializewrite2, jdfwrite1, jdfwrite2,
     jlsowrite1, jlsowrite2, arrowwrite1, arrowwrite2, jsontablesowrite1, jsontablesowrite2],
    group = repeat(["1st", "2nd"], outer = 6),
    ylab = "Seconds",
    title = "Write Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
Out[63]:
In [64]:
data_files = ["bigdf1.csv", "bigdf.bin", "bigdf.arrow", "bigdf1.json", "bigdf2.json"]
df = DataFrame(file = data_files, size = getfield.(stat.(data_files), :size))
append!(df, DataFrame(file = "bigdf.jdf", size=reduce((x,y)->x+y.size,
                                                      stat.(joinpath.("bigdf.jdf", readdir("bigdf.jdf"))),
                                                      init=0)))
sort!(df, :size)
Out[64]:

6 rows × 2 columns

 Row │ file         size
     │ String       Int64
─────┼─────────────────────────
   1 │ bigdf.arrow    16151882
   2 │ bigdf.bin      51771653
   3 │ bigdf.jdf     100769814
   4 │ bigdf1.csv    550808398
   5 │ bigdf2.json   550812399
   6 │ bigdf1.json  1240303506
In [65]:
@df df plot(:file, :size/1024^2, seriestype=:bar, title = "Format File Size (MB)", label="Size", ylab="MB")
Out[65]:
In [68]:
println("First run")
println("CSV.jl")
csvread1 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread1 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread1 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread1 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread1 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread1copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread1 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread1 = @elapsed @time open(jsontable, "bigdf2.json")
println("Second run")
csvread2 = @elapsed @time CSV.read("bigdf1.csv", DataFrame)
println("Serialization")
serializeread2 = @elapsed @time open(deserialize, "bigdf.bin")
println("JDF.jl")
jdfread2 = @elapsed @time JDF.load("bigdf.jdf") |> DataFrame
println("JLSO.jl")
jlsoread2 = @elapsed @time JLSO.load("bigdf.jlso")
println("Arrow.jl")
arrowread2 = @elapsed @time df_tmp = Arrow.Table("bigdf.arrow") |> DataFrame
arrowread2copy = @elapsed @time copy(df_tmp)
println("JSONTables.jl arraytable")
jsontablesaread2 = @elapsed @time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
jsontablesoread2 = @elapsed @time open(jsontable, "bigdf2.json");
First run
CSV.jl
  0.903796 seconds (88.21 k allocations: 295.137 MiB)
Serialization
  5.771103 seconds (99.30 M allocations: 1.575 GiB, 1.37% gc time)
JDF.jl
  0.141858 seconds (78.33 k allocations: 363.171 MiB)
JLSO.jl
  4.253081 seconds (99.42 M allocations: 1.594 GiB, 14.31% gc time)
Arrow.jl
  0.006889 seconds (68.53 k allocations: 3.605 MiB)
  0.978933 seconds (103.50 k allocations: 101.361 MiB)
JSONTables.jl arraytable
108.807965 seconds (2.40 M allocations: 13.538 GiB, 1.40% gc time)
JSONTables.jl objecttable
  6.051225 seconds (9.07 k allocations: 2.900 GiB, 12.34% gc time)
Second run
CSV.jl
  0.844293 seconds (88.22 k allocations: 295.137 MiB)
Serialization
  5.774235 seconds (99.30 M allocations: 1.575 GiB, 1.41% gc time)
JDF.jl
  0.125702 seconds (78.33 k allocations: 363.171 MiB)
JLSO.jl
  4.227428 seconds (99.42 M allocations: 1.594 GiB, 14.12% gc time)
Arrow.jl
  0.006417 seconds (68.53 k allocations: 3.605 MiB)
  0.980623 seconds (103.50 k allocations: 101.361 MiB)
JSONTables.jl arraytable
108.512410 seconds (2.40 M allocations: 13.538 GiB, 1.40% gc time)
JSONTables.jl objecttable
  6.138250 seconds (9.07 k allocations: 2.900 GiB, 12.19% gc time)
In [69]:
# Exclude JSONTables.jl arraytable due to its much longer timing
groupedbar(
    repeat(["CSV.jl", "Serialization", "JDF.jl", "JLSO.jl", "Arrow.jl", "Arrow.jl\ncopy", #"JSON\narraytable",
            "JSON\nobjecttable"], inner = 2),
    [csvread1, csvread2, serializeread1, serializeread2, jdfread1, jdfread2, jlsoread1, jlsoread2,
     arrowread1, arrowread2, arrowread1+arrowread1copy, arrowread2+arrowread2copy,
     # jsontablesaread1, jsontablesaread2,
     jsontablesoread1, jsontablesoread2],    
    group = repeat(["1st", "2nd"], outer = 7),
    ylab = "Seconds",
    title = "Read Performance\nDataFrame: bigdf\nSize: $(size(bigdf))"
)
Out[69]:

Using gzip compression

A common user requirement is to be able to load and save CSV files that are compressed using gzip. Below we show how this can be accomplished using CodecZlib.jl. The same pattern applies to JSON produced by JSONTables.jl; we sketch that variant after the CSV example.

Again, make sure that you do not have a file named df_compress_test.csv.gz in your working directory.

We first generate a random data frame:

In [70]:
df = DataFrame(rand(1:10,10,1000),:auto)
Out[70]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [74]:
# GzipCompressorStream comes from CodecZlib
open("df_compress_test.csv.gz","w") do io
  stream = GzipCompressorStream(io)
  CSV.write(stream,df)
  close(stream)
end
In [82]:
@time df2 = @pipe "df_compress_test.csv.gz" |> Mmap.mmap |>
            transcode(GzipDecompressor,_) |> CSV.File |>
            DataFrame
  0.008342 seconds (22.11 k allocations: 1.536 MiB)
Out[82]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [83]:
using Chain
In [84]:
@time df3 = @chain "df_compress_test.csv.gz" begin
  Mmap.mmap
  transcode(GzipDecompressor,_)
  CSV.File 
  DataFrame
end
  0.008326 seconds (22.11 k allocations: 1.536 MiB)
Out[84]:

10 rows × 1,000 columns (omitted printing of 988 columns)

 Row │ x1     x2     x3     x4     x5     x6     x7     x8     x9     x10    x11    x12
     │ Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ 2      4      2      5      9      7      8      7      2      4      8      2
   2 │ 10     9      9      2      1      5      1      10     4      1      8      5
   3 │ 5      1      5      8      4      3      3      9      3      3      4      4
   4 │ 9      3      2      3      6      4      7      10     10     10     1      6
   5 │ 8      4      4      2      1      8      1      7      1      2      10     2
   6 │ 6      1      4      3      6      4      10     7      9      8      3      2
   7 │ 10     6      6      7      10     7      5      5      3      7      9      3
   8 │ 9      4      3      7      5      7      3      10     10     7      10     4
   9 │ 2      2      1      2      8      1      1      6      5      4      4      3
  10 │ 5      10     4      2      1      6      2      2      8      3      2      8
In [85]:
df == df2 == df3
Out[85]:
true
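
As promised above, the same compression pattern applies to JSON written by JSONTables.jl. A minimal sketch (the file name df_compress_test.json.gz and the df4 variable are ours):

# a sketch: gzip-compress column-oriented JSON on write
open("df_compress_test.json.gz", "w") do io
  stream = GzipCompressorStream(io)
  objecttable(stream, df)
  close(stream)
end

# and decompress on read
df4 = @pipe "df_compress_test.json.gz" |> Mmap.mmap |>
            transcode(GzipDecompressor, _) |> String |> jsontable |> DataFrame
isequal(df4, df) # true if the round trip preserved the data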

Using zip files

Sometimes you may have files compressed inside a zip file.

In such a situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files.

Here we first create a ZIP file and then read its contents back into a DataFrame.

In [86]:
df1 = DataFrame(rand(1:10,3,4), :auto)
Out[86]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 6      10     6      5
   2 │ 8      8      4      1
   3 │ 5      9      6      2
In [87]:
df2 = DataFrame(rand(1:10,3,4),:auto)
Out[87]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 1      4      5      5
   2 │ 3      10     2      9
   3 │ 3      7      5      1

We also show yet another way to write a DataFrame to a CSV file, using sprint with the "text/csv" MIME show method:

In [89]:
# write a CSV file into the zip file
w = ZipFile.Writer("x.zip")

f1 = ZipFile.addfile(w,"x1.csv")
write(f1,sprint(show,"text/csv",df1))

# write a second CSV file into zip file
f2 = ZipFile.addfile(w,"x2.csv",method=ZipFile.Deflate)
write(f2,sprint(show, "text/csv", df2))

close(w)

Now we read back the CSV files we have written:

In [90]:
z = ZipFile.Reader("x.zip")
Out[90]:
ZipFile.Reader for IOStream(<file x.zip>) containing 2 files:

uncompressedsize method  mtime            name
----------------------------------------------
              45 Store   2021-08-13 01-01 x1.csv
              45 Deflate 2021-08-13 01-01 x2.csv
In [91]:
# find the index of the file called x1.csv
index_xcsv = findfirst(x->x.name == "x1.csv",z.files)
# read the x1.csv file from the zip archive
df1_2 = CSV.read(read(z.files[index_xcsv]),DataFrame)
Out[91]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 6      10     6      5
   2 │ 8      8      4      1
   3 │ 5      9      6      2
In [92]:
df1_2 == df1
Out[92]:
true
In [93]:
# find the index of the file called x2.csv
index_xcsv = findfirst(x->x.name == "x2.csv",z.files)
# read the x2.csv file from the zip archive
df2_2 = CSV.read(read(z.files[index_xcsv]), DataFrame)
Out[93]:

3 rows × 4 columns

 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼──────────────────────────────
   1 │ 1      4      5      5
   2 │ 3      10     2      9
   3 │ 3      7      5      1
In [94]:
df2_2 == df2
Out[94]:
true
In [95]:
close(z)
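
If you want every CSV inside the archive at once, you can loop over z.files. A minimal sketch (the dfs name is ours):

# a sketch: read all CSV files in the archive into a Dict keyed by file name
z = ZipFile.Reader("x.zip")
dfs = Dict(f.name => CSV.read(read(f), DataFrame) for f in z.files)
close(z)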

Finally, let's clean up. Do not run the next cells unless you are sure that they will not erase your important files!

In [120]:
@chain ["x1.csv", "x.bin", "x.jlso", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf.bin", "bigdf.jlso", "bigdf1.json", "bigdf2.json", 
             "x.zip"] begin
  # try를 사용하여 에러 무시
  foreach(x->try rm(x) catch end,_)
end
In [119]:
@chain ["bigdf.arrow","x.arrow","df_compress_test.csv.gz"] begin
  # try를 사용하여 에러 무시
  foreach(x->try rm(x,recursive=true) catch end,_)
end