환경 : Julia v1.8.3, DuckDB v0.6.0, DataFrames v1.3.6
KOSPI DATA (2012 ~ 2022) :
KOSPI관련 분석 내용은 아래 링크를 참조
KOSPI 수익률의 표준편차는? #1 https://sine-qua-none.tistory.com/14
KOSPI수익률의 표준편차는? #2 https://sine-qua-none.tistory.com/15
KOSPI 수익률의 표준편차는? #3 (로그수익률) https://sine-qua-none.tistory.com/15
KOSPI 수익률의 표준편차는? #4(로그수익률) https://sine-qua-none.tistory.com/20
KOSPI수익률의 분포는 어떤 모양일까? #1 https://sine-qua-none.tistory.com/24
KOSPI수익률의 분포는 어떤 모양일까? #2 https://sine-qua-none.tistory.com/25
DuckDB와 DataFrames를 같이 사용하는 방법을 포함함
using DataFrames
using CSV
using Statistics
using Pipe
using Chain
using Dates
using DuckDB
using BenchmarkTools
using Plots
using LsqFit
using Distributions
using StatsPlots
import MathTeXEngine as ML
using PrettyPrint:pprintln
# ENV["LINES"] and ENV["COLUMNS"]
ENV["COLUMNS"] = 120
DP = display
# 챠트 폰트
# default(fontfamily="")
display (generic function with 32 methods)
DuckDB 사용법¶
data_file = "./data/kospi_index_2012_2022.csv"
"./data/kospi_index_2012_2022.csv"
df = CSV.File(data_file) |> DataFrame
DP(first(df,3))
@chain df begin
transform!(
[:open,:close,:high,:low] .=> x->parse.(Float64,replace.(x,","=>"")) ,
[:volume] .=> x-> begin
# 마지막 1개의 chracter이므로 문자열 비교가 아닌 chracter 비교
map(y-> y[end] == 'K' ? round(Int,parse(Float64,y[1:end-1]) * 1000) :
round(Int,parse(Float64,y[1:end-1]) * 10000),x)
end,
renamecols=false)
end
df
3 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date | String15 | String15 | String15 | String15 | String7 | Float64 | |
1 | 2022-12-27 | 2,330.39 | 2,327.52 | 2,335.99 | 2,322.27 | 154.50K | 0.0057 |
2 | 2022-12-26 | 2,317.14 | 2,312.54 | 2,321.92 | 2,304.20 | 427.84K | 0.0015 |
3 | 2022-12-23 | 2,313.69 | 2,325.86 | 2,333.08 | 2,311.90 | 366.99M | -0.0183 |
2,708 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 2022-12-27 | 2330.39 | 2327.52 | 2335.99 | 2322.27 | 154500 | 0.0057 |
2 | 2022-12-26 | 2317.14 | 2312.54 | 2321.92 | 2304.2 | 427840 | 0.0015 |
3 | 2022-12-23 | 2313.69 | 2325.86 | 2333.08 | 2311.9 | 3669900 | -0.0183 |
4 | 2022-12-22 | 2356.73 | 2340.0 | 2356.73 | 2335.75 | 5530500 | 0.0119 |
5 | 2022-12-21 | 2328.95 | 2346.39 | 2347.0 | 2325.78 | 3297200 | -0.0019 |
6 | 2022-12-20 | 2333.29 | 2344.73 | 2353.86 | 2324.66 | 3587700 | -0.008 |
7 | 2022-12-19 | 2352.17 | 2350.78 | 2358.76 | 2342.28 | 3239900 | -0.0033 |
8 | 2022-12-16 | 2360.02 | 2329.75 | 2360.44 | 2326.83 | 4141700 | -0.0004 |
9 | 2022-12-15 | 2360.97 | 2383.83 | 2392.11 | 2360.95 | 3753900 | -0.016 |
10 | 2022-12-14 | 2399.25 | 2380.81 | 2400.18 | 2379.44 | 4098000 | 0.0113 |
11 | 2022-12-13 | 2372.4 | 2385.05 | 2388.27 | 2364.87 | 4475600 | -0.0003 |
12 | 2022-12-12 | 2373.02 | 2373.58 | 2381.87 | 2368.47 | 4992200 | -0.0067 |
13 | 2022-12-09 | 2389.04 | 2382.73 | 2390.08 | 2367.25 | 3156700 | 0.0076 |
14 | 2022-12-08 | 2371.08 | 2386.9 | 2387.95 | 2357.2 | 3519400 | -0.0049 |
15 | 2022-12-07 | 2382.81 | 2385.87 | 2394.88 | 2377.98 | 3325900 | -0.0043 |
16 | 2022-12-06 | 2393.16 | 2397.7 | 2416.88 | 2390.2 | 3686400 | -0.0108 |
17 | 2022-12-05 | 2419.32 | 2442.17 | 2442.22 | 2413.05 | 3855900 | -0.0062 |
18 | 2022-12-02 | 2434.33 | 2471.5 | 2471.5 | 2434.33 | 4688500 | -0.0184 |
19 | 2022-12-01 | 2479.84 | 2501.43 | 2501.43 | 2474.33 | 4913800 | 0.003 |
20 | 2022-11-30 | 2472.53 | 2424.44 | 2472.53 | 2421.42 | 5941300 | 0.0161 |
21 | 2022-11-29 | 2433.39 | 2405.54 | 2433.87 | 2401.1 | 5501700 | 0.0104 |
22 | 2022-11-28 | 2408.27 | 2425.05 | 2425.65 | 2401.95 | 3505000 | -0.0121 |
23 | 2022-11-25 | 2437.86 | 2442.21 | 2449.66 | 2433.57 | 4383900 | -0.0014 |
24 | 2022-11-24 | 2441.33 | 2437.51 | 2441.33 | 2428.12 | 4811800 | 0.0096 |
25 | 2022-11-23 | 2418.01 | 2424.36 | 2427.28 | 2407.58 | 4747900 | 0.0053 |
26 | 2022-11-22 | 2405.27 | 2405.96 | 2426.4 | 2401.5 | 5391700 | -0.0059 |
27 | 2022-11-21 | 2419.5 | 2446.05 | 2448.14 | 2409.36 | 5445300 | -0.0102 |
28 | 2022-11-18 | 2444.48 | 2448.13 | 2471.87 | 2442.87 | 5895900 | 0.0006 |
29 | 2022-11-17 | 2442.9 | 2466.5 | 2467.39 | 2442.9 | 9145200 | -0.0139 |
30 | 2022-11-16 | 2477.45 | 2487.0 | 2487.0 | 2446.79 | 6578300 | -0.0012 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
describe(df,:eltype)
7 rows × 2 columns
variable | eltype | |
---|---|---|
Symbol | DataType | |
1 | date | Date |
2 | close | Float64 |
3 | open | Float64 |
4 | high | Float64 |
5 | low | Float64 |
6 | volume | Int64 |
7 | return | Float64 |
# @btime
df[ DateTime("2022-04-29") .>= df.date .>= DateTime("2022-04-18"),:]
10 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 2022-04-29 | 2695.05 | 2669.18 | 2696.1 | 2664.06 | 962400 | 0.0103 |
2 | 2022-04-28 | 2667.49 | 2656.54 | 2667.49 | 2638.37 | 8991700 | 0.0108 |
3 | 2022-04-27 | 2639.06 | 2630.58 | 2641.82 | 2615.5 | 10700 | -0.011 |
4 | 2022-04-26 | 2668.31 | 2674.0 | 2678.05 | 2663.83 | 8899400 | 0.0042 |
5 | 2022-04-25 | 2657.13 | 2676.67 | 2680.35 | 2657.04 | 988350 | -0.0176 |
6 | 2022-04-22 | 2704.71 | 2704.72 | 2708.58 | 2690.48 | 12400 | -0.0086 |
7 | 2022-04-21 | 2728.21 | 2725.71 | 2737.54 | 2725.04 | 10000 | 0.0035 |
8 | 2022-04-20 | 2718.69 | 2718.49 | 2724.46 | 2702.84 | 17400 | -0.0001 |
9 | 2022-04-19 | 2718.89 | 2707.76 | 2723.98 | 2705.32 | 12600 | 0.0095 |
10 | 2022-04-18 | 2693.21 | 2685.04 | 2701.11 | 2681.37 | 10200 | -0.0011 |
# @btime
filter(:date => x-> DateTime("2022-04-29") >= x >= DateTime("2022-04-18"),df)
10 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 2022-04-29 | 2695.05 | 2669.18 | 2696.1 | 2664.06 | 962400 | 0.0103 |
2 | 2022-04-28 | 2667.49 | 2656.54 | 2667.49 | 2638.37 | 8991700 | 0.0108 |
3 | 2022-04-27 | 2639.06 | 2630.58 | 2641.82 | 2615.5 | 10700 | -0.011 |
4 | 2022-04-26 | 2668.31 | 2674.0 | 2678.05 | 2663.83 | 8899400 | 0.0042 |
5 | 2022-04-25 | 2657.13 | 2676.67 | 2680.35 | 2657.04 | 988350 | -0.0176 |
6 | 2022-04-22 | 2704.71 | 2704.72 | 2708.58 | 2690.48 | 12400 | -0.0086 |
7 | 2022-04-21 | 2728.21 | 2725.71 | 2737.54 | 2725.04 | 10000 | 0.0035 |
8 | 2022-04-20 | 2718.69 | 2718.49 | 2724.46 | 2702.84 | 17400 | -0.0001 |
9 | 2022-04-19 | 2718.89 | 2707.76 | 2723.98 | 2705.32 | 12600 | 0.0095 |
10 | 2022-04-18 | 2693.21 | 2685.04 | 2701.11 | 2681.37 | 10200 | -0.0011 |
# @btime
df2 = @chain df begin
filter(:date => >=(DateTime("2022-04-18")),_)
filter(:date => <=(DateTime("2022-04-29")),_)
end
10 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 2022-04-29 | 2695.05 | 2669.18 | 2696.1 | 2664.06 | 962400 | 0.0103 |
2 | 2022-04-28 | 2667.49 | 2656.54 | 2667.49 | 2638.37 | 8991700 | 0.0108 |
3 | 2022-04-27 | 2639.06 | 2630.58 | 2641.82 | 2615.5 | 10700 | -0.011 |
4 | 2022-04-26 | 2668.31 | 2674.0 | 2678.05 | 2663.83 | 8899400 | 0.0042 |
5 | 2022-04-25 | 2657.13 | 2676.67 | 2680.35 | 2657.04 | 988350 | -0.0176 |
6 | 2022-04-22 | 2704.71 | 2704.72 | 2708.58 | 2690.48 | 12400 | -0.0086 |
7 | 2022-04-21 | 2728.21 | 2725.71 | 2737.54 | 2725.04 | 10000 | 0.0035 |
8 | 2022-04-20 | 2718.69 | 2718.49 | 2724.46 | 2702.84 | 17400 | -0.0001 |
9 | 2022-04-19 | 2718.89 | 2707.76 | 2723.98 | 2705.32 | 12600 | 0.0095 |
10 | 2022-04-18 | 2693.21 | 2685.04 | 2701.11 | 2681.37 | 10200 | -0.0011 |
for r in eachrow(df[1:5,:])
@show r.open, r.close, r.high, r.low, r.volume, r.return
end
(r.open, r.close, r.high, r.low, r.volume, r.return) = (2327.52, 2330.39, 2335.99, 2322.27, 154500, 0.0057) (r.open, r.close, r.high, r.low, r.volume, r.return) = (2312.54, 2317.14, 2321.92, 2304.2, 427840, 0.0015) (r.open, r.close, r.high, r.low, r.volume, r.return) = (2325.86, 2313.69, 2333.08, 2311.9, 3669900, -0.0183) (r.open, r.close, r.high, r.low, r.volume, r.return) = (2340.0, 2356.73, 2356.73, 2335.75, 5530500, 0.0119) (r.open, r.close, r.high, r.low, r.volume, r.return) = (2346.39, 2328.95, 2347.0, 2325.78, 3297200, -0.0019)
DuckDB¶
# db = DuckDB.open(":memory:")
# con = DuckDB.connect(db)
con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM '$(data_file)'
LIMIT 100
""")
7×6 DataFrame Row │ column_name column_type null key default extra │ String? String? String? String? String? String? ─────┼────────────────────────────────────────────────────────────── 1 │ date DATE YES missing missing missing 2 │ close VARCHAR YES missing missing missing 3 │ open VARCHAR YES missing missing missing 4 │ high VARCHAR YES missing missing missing 5 │ low VARCHAR YES missing missing missing 6 │ volume VARCHAR YES missing missing missing 7 │ return VARCHAR YES missing missing missing
DataFrame에서 table을 생성하고 데이터를 입력¶
# df 를 mydf로 등록
DuckDB.register_data_frame(con, df, "mydf")
# create the table "kospi" from the DataFrame "df"
DBInterface.execute(con,"CREATE TABLE kospi AS SELECT * FROM mydf")
# # insert into the table "my_table" from the DataFrame "my_df"
# DBInterface.execute(con, "INSERT INTO kospi SELECT * FROM df")
1×1 DataFrame Row │ Count │ Int64? ─────┼──────── 1 │ 2708
DBInterface.execute(con,
"""
SELECT * FROM kospi
""")
2708×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ──────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-12-27 2330.39 2327.52 2335.99 2322.27 154500 0.0057 2 │ 2022-12-26 2317.14 2312.54 2321.92 2304.2 427840 0.0015 3 │ 2022-12-23 2313.69 2325.86 2333.08 2311.9 3669900 -0.0183 4 │ 2022-12-22 2356.73 2340.0 2356.73 2335.75 5530500 0.0119 5 │ 2022-12-21 2328.95 2346.39 2347.0 2325.78 3297200 -0.0019 6 │ 2022-12-20 2333.29 2344.73 2353.86 2324.66 3587700 -0.008 7 │ 2022-12-19 2352.17 2350.78 2358.76 2342.28 3239900 -0.0033 8 │ 2022-12-16 2360.02 2329.75 2360.44 2326.83 4141700 -0.0004 9 │ 2022-12-15 2360.97 2383.83 2392.11 2360.95 3753900 -0.016 10 │ 2022-12-14 2399.25 2380.81 2400.18 2379.44 4098000 0.0113 11 │ 2022-12-13 2372.4 2385.05 2388.27 2364.87 4475600 -0.0003 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 2699 │ 2012-01-13 1875.68 1865.72 1881.84 1861.53 4750300 0.006 2700 │ 2012-01-12 1864.57 1853.1 1864.57 1837.96 4763400 0.0103 2701 │ 2012-01-11 1845.55 1857.54 1857.82 1843.12 4479600 -0.0041 2702 │ 2012-01-10 1853.22 1841.13 1859.55 1840.47 3886400 0.0146 2703 │ 2012-01-09 1826.49 1832.04 1832.04 1810.48 3955100 -0.009 2704 │ 2012-01-06 1843.14 1867.12 1867.12 1824.29 4726800 -0.0111 2705 │ 2012-01-05 1863.74 1869.42 1875.97 1860.57 5337700 -0.0013 2706 │ 2012-01-04 1866.22 1883.48 1884.69 1866.22 4945000 -0.0049 2707 │ 2012-01-03 1875.41 1846.56 1875.41 1846.56 4434700 0.0269 2708 │ 2012-01-02 1826.37 1831.69 1837.81 1814.55 3233600 0.0003 2687 rows omitted
stmt = DBInterface.prepare(con,
"""
SELECT * FROM kospi
WHERE date BETWEEN '2022-04-18' AND '2022-04-29'
"""
)
DuckDB.Stmt(DuckDB.Connection(":memory:"), Ptr{Nothing} @0x000000000dab2230, " SELECT * FROM kospi\n WHERE date BETWEEN '2022-04-18' AND '2022-04-29'\n")
# @btime
DBInterface.execute(stmt)
10×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ─────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-04-29 2695.05 2669.18 2696.1 2664.06 962400 0.0103 2 │ 2022-04-28 2667.49 2656.54 2667.49 2638.37 8991700 0.0108 3 │ 2022-04-27 2639.06 2630.58 2641.82 2615.5 10700 -0.011 4 │ 2022-04-26 2668.31 2674.0 2678.05 2663.83 8899400 0.0042 5 │ 2022-04-25 2657.13 2676.67 2680.35 2657.04 988350 -0.0176 6 │ 2022-04-22 2704.71 2704.72 2708.58 2690.48 12400 -0.0086 7 │ 2022-04-21 2728.21 2725.71 2737.54 2725.04 10000 0.0035 8 │ 2022-04-20 2718.69 2718.49 2724.46 2702.84 17400 -0.0001 9 │ 2022-04-19 2718.89 2707.76 2723.98 2705.32 12600 0.0095 10 │ 2022-04-18 2693.21 2685.04 2701.11 2681.37 10200 -0.0011
tbl = DBInterface.execute(stmt)
10×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ─────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-04-29 2695.05 2669.18 2696.1 2664.06 962400 0.0103 2 │ 2022-04-28 2667.49 2656.54 2667.49 2638.37 8991700 0.0108 3 │ 2022-04-27 2639.06 2630.58 2641.82 2615.5 10700 -0.011 4 │ 2022-04-26 2668.31 2674.0 2678.05 2663.83 8899400 0.0042 5 │ 2022-04-25 2657.13 2676.67 2680.35 2657.04 988350 -0.0176 6 │ 2022-04-22 2704.71 2704.72 2708.58 2690.48 12400 -0.0086 7 │ 2022-04-21 2728.21 2725.71 2737.54 2725.04 10000 0.0035 8 │ 2022-04-20 2718.69 2718.49 2724.46 2702.84 17400 -0.0001 9 │ 2022-04-19 2718.89 2707.76 2723.98 2705.32 12600 0.0095 10 │ 2022-04-18 2693.21 2685.04 2701.11 2681.37 10200 -0.0011
DuckDB.toDataFrame(tbl)
# tbl.df
10 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date? | Float64? | Float64? | Float64? | Float64? | Int64? | Float64? | |
1 | 2022-04-29 | 2695.05 | 2669.18 | 2696.1 | 2664.06 | 962400 | 0.0103 |
2 | 2022-04-28 | 2667.49 | 2656.54 | 2667.49 | 2638.37 | 8991700 | 0.0108 |
3 | 2022-04-27 | 2639.06 | 2630.58 | 2641.82 | 2615.5 | 10700 | -0.011 |
4 | 2022-04-26 | 2668.31 | 2674.0 | 2678.05 | 2663.83 | 8899400 | 0.0042 |
5 | 2022-04-25 | 2657.13 | 2676.67 | 2680.35 | 2657.04 | 988350 | -0.0176 |
6 | 2022-04-22 | 2704.71 | 2704.72 | 2708.58 | 2690.48 | 12400 | -0.0086 |
7 | 2022-04-21 | 2728.21 | 2725.71 | 2737.54 | 2725.04 | 10000 | 0.0035 |
8 | 2022-04-20 | 2718.69 | 2718.49 | 2724.46 | 2702.84 | 17400 | -0.0001 |
9 | 2022-04-19 | 2718.89 | 2707.76 | 2723.98 | 2705.32 | 12600 | 0.0095 |
10 | 2022-04-18 | 2693.21 | 2685.04 | 2701.11 | 2681.37 | 10200 | -0.0011 |
DBInterface.close!(stmt)
# DuckDB.appendDataFrame(df,con,"kospi")
DBInterface.execute(con,
"""
INSERT INTO kospi
VALUES('2022-12-28', 10.0, 11.0, 12.0, 13.0, 10000, 0.1234)
""")
1×1 DataFrame Row │ Count │ Int64? ─────┼──────── 1 │ 1
DBInterface.execute(con, "SELECT * FROM kospi")
2709×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ──────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-12-27 2330.39 2327.52 2335.99 2322.27 154500 0.0057 2 │ 2022-12-26 2317.14 2312.54 2321.92 2304.2 427840 0.0015 3 │ 2022-12-23 2313.69 2325.86 2333.08 2311.9 3669900 -0.0183 4 │ 2022-12-22 2356.73 2340.0 2356.73 2335.75 5530500 0.0119 5 │ 2022-12-21 2328.95 2346.39 2347.0 2325.78 3297200 -0.0019 6 │ 2022-12-20 2333.29 2344.73 2353.86 2324.66 3587700 -0.008 7 │ 2022-12-19 2352.17 2350.78 2358.76 2342.28 3239900 -0.0033 8 │ 2022-12-16 2360.02 2329.75 2360.44 2326.83 4141700 -0.0004 9 │ 2022-12-15 2360.97 2383.83 2392.11 2360.95 3753900 -0.016 10 │ 2022-12-14 2399.25 2380.81 2400.18 2379.44 4098000 0.0113 11 │ 2022-12-13 2372.4 2385.05 2388.27 2364.87 4475600 -0.0003 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 2700 │ 2012-01-12 1864.57 1853.1 1864.57 1837.96 4763400 0.0103 2701 │ 2012-01-11 1845.55 1857.54 1857.82 1843.12 4479600 -0.0041 2702 │ 2012-01-10 1853.22 1841.13 1859.55 1840.47 3886400 0.0146 2703 │ 2012-01-09 1826.49 1832.04 1832.04 1810.48 3955100 -0.009 2704 │ 2012-01-06 1843.14 1867.12 1867.12 1824.29 4726800 -0.0111 2705 │ 2012-01-05 1863.74 1869.42 1875.97 1860.57 5337700 -0.0013 2706 │ 2012-01-04 1866.22 1883.48 1884.69 1866.22 4945000 -0.0049 2707 │ 2012-01-03 1875.41 1846.56 1875.41 1846.56 4434700 0.0269 2708 │ 2012-01-02 1826.37 1831.69 1837.81 1814.55 3233600 0.0003 2709 │ 2022-12-28 10.0 11.0 12.0 13.0 10000 0.1234 2688 rows omitted
DBInterface.execute(con, "DELETE FROM kospi WHERE date = '2022-12-28'")
1×1 DataFrame Row │ Count │ Int64? ─────┼──────── 1 │ 1
DBInterface.execute(con, "SELECT * FROM kospi")
2708×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ──────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-12-27 2330.39 2327.52 2335.99 2322.27 154500 0.0057 2 │ 2022-12-26 2317.14 2312.54 2321.92 2304.2 427840 0.0015 3 │ 2022-12-23 2313.69 2325.86 2333.08 2311.9 3669900 -0.0183 4 │ 2022-12-22 2356.73 2340.0 2356.73 2335.75 5530500 0.0119 5 │ 2022-12-21 2328.95 2346.39 2347.0 2325.78 3297200 -0.0019 6 │ 2022-12-20 2333.29 2344.73 2353.86 2324.66 3587700 -0.008 7 │ 2022-12-19 2352.17 2350.78 2358.76 2342.28 3239900 -0.0033 8 │ 2022-12-16 2360.02 2329.75 2360.44 2326.83 4141700 -0.0004 9 │ 2022-12-15 2360.97 2383.83 2392.11 2360.95 3753900 -0.016 10 │ 2022-12-14 2399.25 2380.81 2400.18 2379.44 4098000 0.0113 11 │ 2022-12-13 2372.4 2385.05 2388.27 2364.87 4475600 -0.0003 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 2699 │ 2012-01-13 1875.68 1865.72 1881.84 1861.53 4750300 0.006 2700 │ 2012-01-12 1864.57 1853.1 1864.57 1837.96 4763400 0.0103 2701 │ 2012-01-11 1845.55 1857.54 1857.82 1843.12 4479600 -0.0041 2702 │ 2012-01-10 1853.22 1841.13 1859.55 1840.47 3886400 0.0146 2703 │ 2012-01-09 1826.49 1832.04 1832.04 1810.48 3955100 -0.009 2704 │ 2012-01-06 1843.14 1867.12 1867.12 1824.29 4726800 -0.0111 2705 │ 2012-01-05 1863.74 1869.42 1875.97 1860.57 5337700 -0.0013 2706 │ 2012-01-04 1866.22 1883.48 1884.69 1866.22 4945000 -0.0049 2707 │ 2012-01-03 1875.41 1846.56 1875.41 1846.56 4434700 0.0269 2708 │ 2012-01-02 1826.37 1831.69 1837.81 1814.55 3233600 0.0003 2687 rows omitted
DBInterface.execute(con, "COPY (SELECT * FROM kospi) TO './data/kospi.parquet' (FORMAT 'parquet');")
1×1 DataFrame Row │ Count │ Int64? ─────┼──────── 1 │ 2708
DBInterface.close!(con)
Load parquet using DuckDB¶
con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
DBInterface.execute(con,
"""
SELECT *
FROM "./data/kospi.parquet"
""")
2708×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ──────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-12-27 2330.39 2327.52 2335.99 2322.27 154500 0.0057 2 │ 2022-12-26 2317.14 2312.54 2321.92 2304.2 427840 0.0015 3 │ 2022-12-23 2313.69 2325.86 2333.08 2311.9 3669900 -0.0183 4 │ 2022-12-22 2356.73 2340.0 2356.73 2335.75 5530500 0.0119 5 │ 2022-12-21 2328.95 2346.39 2347.0 2325.78 3297200 -0.0019 6 │ 2022-12-20 2333.29 2344.73 2353.86 2324.66 3587700 -0.008 7 │ 2022-12-19 2352.17 2350.78 2358.76 2342.28 3239900 -0.0033 8 │ 2022-12-16 2360.02 2329.75 2360.44 2326.83 4141700 -0.0004 9 │ 2022-12-15 2360.97 2383.83 2392.11 2360.95 3753900 -0.016 10 │ 2022-12-14 2399.25 2380.81 2400.18 2379.44 4098000 0.0113 11 │ 2022-12-13 2372.4 2385.05 2388.27 2364.87 4475600 -0.0003 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 2699 │ 2012-01-13 1875.68 1865.72 1881.84 1861.53 4750300 0.006 2700 │ 2012-01-12 1864.57 1853.1 1864.57 1837.96 4763400 0.0103 2701 │ 2012-01-11 1845.55 1857.54 1857.82 1843.12 4479600 -0.0041 2702 │ 2012-01-10 1853.22 1841.13 1859.55 1840.47 3886400 0.0146 2703 │ 2012-01-09 1826.49 1832.04 1832.04 1810.48 3955100 -0.009 2704 │ 2012-01-06 1843.14 1867.12 1867.12 1824.29 4726800 -0.0111 2705 │ 2012-01-05 1863.74 1869.42 1875.97 1860.57 5337700 -0.0013 2706 │ 2012-01-04 1866.22 1883.48 1884.69 1866.22 4945000 -0.0049 2707 │ 2012-01-03 1875.41 1846.56 1875.41 1846.56 4434700 0.0269 2708 │ 2012-01-02 1826.37 1831.69 1837.81 1814.55 3233600 0.0003 2687 rows omitted
DBInterface.close!(con)
con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
DBInterface.execute(con,
"""CREATE TABLE kospi AS SELECT * FROM "./data/kospi.parquet" """
)
1×1 DataFrame Row │ Count │ Int64? ─────┼──────── 1 │ 2708
tbl = DBInterface.execute(con,
"""
SELECT *
FROM kospi
""")
2708×7 DataFrame Row │ date close open high low volume return │ Date? Float64? Float64? Float64? Float64? Int64? Float64? ──────┼─────────────────────────────────────────────────────────────────────── 1 │ 2022-12-27 2330.39 2327.52 2335.99 2322.27 154500 0.0057 2 │ 2022-12-26 2317.14 2312.54 2321.92 2304.2 427840 0.0015 3 │ 2022-12-23 2313.69 2325.86 2333.08 2311.9 3669900 -0.0183 4 │ 2022-12-22 2356.73 2340.0 2356.73 2335.75 5530500 0.0119 5 │ 2022-12-21 2328.95 2346.39 2347.0 2325.78 3297200 -0.0019 6 │ 2022-12-20 2333.29 2344.73 2353.86 2324.66 3587700 -0.008 7 │ 2022-12-19 2352.17 2350.78 2358.76 2342.28 3239900 -0.0033 8 │ 2022-12-16 2360.02 2329.75 2360.44 2326.83 4141700 -0.0004 9 │ 2022-12-15 2360.97 2383.83 2392.11 2360.95 3753900 -0.016 10 │ 2022-12-14 2399.25 2380.81 2400.18 2379.44 4098000 0.0113 11 │ 2022-12-13 2372.4 2385.05 2388.27 2364.87 4475600 -0.0003 ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ 2699 │ 2012-01-13 1875.68 1865.72 1881.84 1861.53 4750300 0.006 2700 │ 2012-01-12 1864.57 1853.1 1864.57 1837.96 4763400 0.0103 2701 │ 2012-01-11 1845.55 1857.54 1857.82 1843.12 4479600 -0.0041 2702 │ 2012-01-10 1853.22 1841.13 1859.55 1840.47 3886400 0.0146 2703 │ 2012-01-09 1826.49 1832.04 1832.04 1810.48 3955100 -0.009 2704 │ 2012-01-06 1843.14 1867.12 1867.12 1824.29 4726800 -0.0111 2705 │ 2012-01-05 1863.74 1869.42 1875.97 1860.57 5337700 -0.0013 2706 │ 2012-01-04 1866.22 1883.48 1884.69 1866.22 4945000 -0.0049 2707 │ 2012-01-03 1875.41 1846.56 1875.41 1846.56 4434700 0.0269 2708 │ 2012-01-02 1826.37 1831.69 1837.81 1814.55 3233600 0.0003 2687 rows omitted
df = DuckDB.toDataFrame(tbl)
df[1:3,:]
3 rows × 7 columns
date | close | open | high | low | volume | return | |
---|---|---|---|---|---|---|---|
Date? | Float64? | Float64? | Float64? | Float64? | Int64? | Float64? | |
1 | 2022-12-27 | 2330.39 | 2327.52 | 2335.99 | 2322.27 | 154500 | 0.0057 |
2 | 2022-12-26 | 2317.14 | 2312.54 | 2321.92 | 2304.2 | 427840 | 0.0015 |
3 | 2022-12-23 | 2313.69 | 2325.86 | 2333.08 | 2311.9 | 3669900 | -0.0183 |
DataFrame에서 특정 column 삭제 방법¶
select!(df,Not([:r02]))
시간에 따른 수익률¶
T = 30
select!(df,1:7)
for t in 1:T
df[!,Symbol("r"*lpad(t,3,"0"))] = map(i->try log(df.close[i]/df.close[i+t]) catch e missing end, 1:size(df,1))
end
describe(df,:eltype)
dropmissing!(df)
last(df,10)
10 rows × 37 columns (omitted printing of 26 columns)
date | close | open | high | low | volume | return | r001 | r002 | r003 | r004 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | 2012-02-28 | 2003.69 | 1995.93 | 2006.26 | 1992.66 | 6079300 | 0.0063 | 0.0062731 | -0.00805257 | -0.00204911 | -0.0123801 |
2 | 2012-02-27 | 1991.16 | 2013.61 | 2013.69 | 1987.74 | 6360100 | -0.0142 | -0.0143257 | -0.00832221 | -0.0186532 | -0.0164769 |
3 | 2012-02-24 | 2019.89 | 2010.6 | 2019.9 | 1998.39 | 7179100 | 0.006 | 0.00600346 | -0.00432749 | -0.00215127 | -0.00247726 |
4 | 2012-02-23 | 2007.8 | 2016.46 | 2016.46 | 2001.25 | 6901200 | -0.0103 | -0.010331 | -0.00815473 | -0.00848072 | -0.00777426 |
5 | 2012-02-22 | 2028.65 | 2018.07 | 2028.93 | 2015.3 | 8837400 | 0.0022 | 0.00217623 | 0.00185023 | 0.00255669 | 0.0154992 |
6 | 2012-02-21 | 2024.24 | 2022.05 | 2030.46 | 2007.46 | 7090200 | -0.0003 | -0.000325995 | 0.000380462 | 0.013323 | -0.000533391 |
7 | 2012-02-20 | 2024.9 | 2041.51 | 2047.43 | 2024.59 | 7339200 | 0.0007 | 0.000706457 | 0.0136489 | -0.000207396 | 0.011054 |
8 | 2012-02-17 | 2023.47 | 2025.56 | 2031.4 | 2019.79 | 7310900 | 0.013 | 0.0129425 | -0.000913853 | 0.0103475 | 0.00880079 |
9 | 2012-02-16 | 1997.45 | 2005.37 | 2012.44 | 1995.99 | 6966900 | -0.0138 | -0.0138563 | -0.00259494 | -0.0041417 | 0.00187414 |
10 | 2012-02-15 | 2025.32 | 2012.0 | 2026.75 | 2009.68 | 6814700 | 0.0113 | 0.0112614 | 0.00971464 | 0.0157305 | 0.00529712 |
DuckDB.register_data_frame(con,df,"mydf")
df2 = DBInterface.execute(con,
"""
SELECT * from mydf
--WHERE date BETWEEN '2022-04-14' AND '2022-04-29'
"""
).df[1:3,:]
3 rows × 37 columns (omitted printing of 26 columns)
date | close | open | high | low | volume | return | r001 | r002 | r003 | r004 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Date? | Float64? | Float64? | Float64? | Float64? | Int64? | Float64? | Float64? | Float64? | Float64? | Float64? | |
1 | 2022-12-27 | 2330.39 | 2327.52 | 2335.99 | 2322.27 | 154500 | 0.0057 | 0.00570197 | 0.00719198 | -0.0112394 | 0.000618113 |
2 | 2022-12-26 | 2317.14 | 2312.54 | 2321.92 | 2304.2 | 427840 | 0.0015 | 0.00149001 | -0.0169414 | -0.00508386 | -0.00694562 |
3 | 2022-12-23 | 2313.69 | 2325.86 | 2333.08 | 2311.9 | 3669900 | -0.0183 | -0.0184314 | -0.00657387 | -0.00843564 | -0.0164947 |
수익률의 분포¶
시간에 따른 수익률의 분포$f(t)$는 $N(0,\sigma_{1}\sqrt{t})$ 인 정규분포를 따른다고 모델링 하고 맞는지 살펴보자\ 여기서 $\sigma_{1}은 1일 수익률의 표준편차$
시간에 따른 수익률의 표준편차¶
stds = Vector{Float64}()
for t in 1:T
s = Symbol("r"*lpad(t,3,"0"))
push!(stds,std(df[!,s]))
end
stds
30-element Vector{Float64}: 0.009813302121263563 0.013847351387053404 0.0174165025070727 0.020453249357504526 0.0230023916814406 0.02526228799874069 0.02717301743233356 0.029025990272606158 0.030760782701878524 0.03249886065505094 0.034130906378785214 0.03561940741563499 0.03707881163100406 ⋮ 0.04535414998918937 0.046623502858969865 0.047868617940500145 0.048983952678823266 0.05005135047409256 0.05110756129799725 0.052135935836887436 0.05314286409024463 0.054006589724114915 0.05484988920626446 0.0556720124490399 0.056494178052715074
@. model(x,p) = p[1]*x^p[2]
model (generic function with 1 method)
p0 = [0.5, 0.5]
fit = curve_fit(model,1:T,stds,p0; autodiff=:finiteforward)
# fit = curve_fit(model,1:T,stds,p0)
LsqFit.LsqFitResult{Vector{Float64}, Vector{Float64}, Matrix{Float64}, Vector{Float64}}([0.010013847924725537, 0.5115770633220432], [0.00020054580346197393, 0.0004284676083854914, 0.00014999919202853332, -0.00010153146419331419, -0.00018962210395919235, -0.00021934741041411063, -7.52341438251744e-5, -1.2423951734650412e-5, 5.4742385395479404e-5, 2.319996748500336e-5 … -0.0003331191234950681, -0.000303608008635492, -0.0002513073202101909, -0.0002113557730322041, -0.00016565830648194307, -0.00011930178976762157, 5.06495982140337e-5, 0.00022249107825247916, 0.0003979499521817101, 0.0005567016369432201], [1.0 0.0; 1.4256077287718654 0.009895243914797902; … ; 5.59924245160073 0.188804155215621; 5.697198531124741 0.19404130754992366], true, Float64[])
param = round.(fit.param,digits=6)
2-element Vector{Float64}: 0.010014 0.511577
Latex string내에 변수사용시 %
를 붙임¶
예) L"f(x) = %$(param[1])x^{%$(param[2])}"
# stderror(fit; atol, rtol)
#marker=:circle
scatter(1:T,stds,label="std",legend=:bottomright,markersize =1)
plot!(1:T,model(1:T,fit.param),label=ML.L"fit:f(x) = %$(param[1])x^{%$(param[2])}")
ML.L"f(x) = %$(param[1])x^{%$(param[2])}"
Coefficient of determination(결정계수) : $R^2$¶
- https://www.youtube.com/watch?v=lng4ZgConCM
- https://loadtoexcelmaster.tistory.com/entry/%EC%97%91%EC%85%80%EC%97%90%EC%84%9C-%EA%B2%B0%EC%A0%95%EA%B3%84%EC%88%98coefficeint-of-determination-R%C2%B2-%EA%B5%AC%ED%95%98%EA%B8%B0
- $R^2$이 1에 가까워 질 수록 fitting이 잘된것임 (0 ~ 1)범위를 가짐
- 선형회귀모델에서 선형으로 잘 피팅되었는지 나타냄
- residuals(잔차) = 샘플값 – 샘플평균 = $X – \bar{X}$ $$ y_i : 관찰값\\ \bar{y} : 관찰값 평균\\ m_i = m(x_i) : 회귀선\\ m(x_i) = a x_i + b\\ SE_{m} = \sum_i \left(y_i – m_i\right)^2\\ SE_{\bar{y}} = \sum_i \left(y_i – \bar{y}\right)^2\\ R^2 = 1 – \frac{SE_{m}}{SE_{\bar{y}}}\\ R^2 = 1 – \frac{MSE(regiduals)}{Var(y_i)}\\ $$
- MSE vs $R^2$ : https://vitalflux.com/mean-square-error-r-squared-which-one-to-use/
R²= 1-mse(fit)/var(stds)
0.9997133613902346
# coefficient
@show coef(fit)
# degree of freedom
@show dof(fit)
# 각 파라미터 p[1], p[2] 의 표준오차
@show stderror(fit)
coef(fit) = [0.010013847924725537, 0.5115770633220432] dof(fit) = 28 stderror(fit) = [6.112707838441906e-5, 0.002058808906591578]
2-element Vector{Float64}: 6.112707838441906e-5 0.002058808906591578
stderror(fit) 각 파라미터의 표준오차¶
- 예) p[1], p[2] 즉 a, b 값의 표준오차
- 표준오차 : https://ballpen.blog/%ED%91%9C%EC%A4%80-%EC%98%A4%EC%B0%A8-%EA%B0%9C%EB%85%90-%EA%B3%84%EC%82%B0-%EB%B0%A9%EB%B2%95/
- 표준오차 예제 : https://ballpen.blog/%ED%91%9C%EC%A4%80-%EC%98%A4%EC%B0%A8-%EA%B3%84%EC%82%B0-%EC%98%88%EC%A0%9C/
- 오차의 종류 : https://ballpen.blog/%EC%98%A4%EC%B0%A8-%EA%B3%84%ED%86%B5%EC%98%A4%EC%B0%A8-%EA%B3%BC%EC%8B%A4%EC%98%A4%EC%B0%A8-%EC%9A%B0%EC%97%B0%EC%98%A4%EC%B0%A8%EC%9D%98-%EA%B0%9C%EB%85%90%EA%B3%BC-%EC%98%88%EC%8B%9C/
- 표준 오차 개념을 이해하는 것은 측정값의 정확도를 올바르게 표기하는데 있어 아주 중요
- 정확도 : 측정값들이 한쪽으로 몰리는 일이 적은 정도
- 정밀도 : 측정값들의 퍼짐이 좁은 정도
- 어느 측정값을 표기하기 위해서는 정밀도 뿐만 아니라 오차의 범위도 제시할 필요가 있습니다. 왜냐하면 정확한 참값을 모르기 때문입니다. 이러한 배경에서 출발한 것이 표준오차가 되겠습니다.
- 표본 평균의 평균 : 계통오차를 제거하여 참값을 도출하는 방법
- 표준오차(Standard Error of Mean-SEM, 평균오차) : 표본 평균에 대한 표준편차
- 표본 평균의 평균을 참값으로 간주하고 측정한 측정값들의 표준편차
표본 평균의 평균은 n개의 표본으로부터 구한 평균값이기 때문에 각 표본이 갖는 계통오차들이 서로 상쇄되어 제거된 것으로 볼 수 있어 참값으로 간주
- 상대표준편차($\%RSD$) = $\frac{표준편차}{평균값}\cdot 100$
MSE vs R-Squared($R^2$) https://vitalflux.com/mean-square-error-r-squared-which-one-to-use/
stderror(fit)
2-element Vector{Float64}: 6.112707838441906e-5 0.002058808906591578
ML.L"
t일 수익률의 편차 = a \cdot t^b \;\;\text{로 fitting}\\
f(t) \approx %$(param[1])t^{%$(param[2])}\\
stds[1]=%$(stds[1])\\
a = %$(param[1]) \approx 1일수익율편차 = \sigma = %$(stds[1])\\
b = %$(param[2])\approx \frac{1}{2} \\
t일 수익률편차 = 1일수익률편차 \cdot \sqrt{t}\\
t일 수익률편차(t)= f(t) = \sigma \cdot \sqrt{t}\\
1일수익률편차=f(1) = \sigma\\
"
수익률의 분포¶
# 1일차 수익률에서 표준편차 σ₁을 구함
dist = Distributions.fit(Normal,df.r001)
Normal{Float64}(μ=5.659795572039397e-5, σ=0.009811469742922279)
σ₁ = dist.σ
0.009811469742922279
t_day_return(t) = df[!,"r"*lpad(t,3,"0")]
t_day_return (generic function with 1 method)
t_return_dist(t,xlim=nothing,ylim=nothing) = begin
args =[]
!isnothing(xlim) && push!(args,(:xlim => xlim));
!isnothing(ylim) && push!(args,(:ylim => ylim));
dist = Distributions.fit(Normal,t_day_return(t))
histogram!(t_day_return(t);norm=true,label="$(t) day return observed",alpha=0.3,args...,
xlabel="Rate of return",ylabel="Probability Density")
StatsPlots.plot!(dist,lw=2,label="N($(round(dist.μ,digits=5)),$(round(dist.σ,digits=5)))")
StatsPlots.plot!(Normal(0,σ₁*√t),lw=2,label="Theory : N(0,$(round(σ₁*√t,digits=5)))")
end
t_return_dist (generic function with 3 methods)
# 이전 plot 연결하지 않고 처음 부터
plot()
t_return_dist(1)
# 이전 plot 연결하지 않고 처음 부터
plot()
t_return_dist(30)