tutorial_bayesian_logistic_regression_01

Bayesian Logistic Regression¶

이 예에서는 RDatasets 패키지에 있는 합성 데이터 집합을 사용하여 채무 불이행 가능성을 예측할 것입니다. 이 데이터 집합인 Defaults는 R의 ISLR 패키지에서 가져온 것으로, 대출자에 대한 정보를 포함합니다.

https://daviddalpiaz.github.io/r4sl/logistic-regression.html

https://www.bijenpatel.com/guide/islr/classification/

In [1]:

using Turing
using Distributions
using RDatasets
using MCMCChains
using Plots, StatsPlots
using StatsFuns: logistic
# Functionality for splitting and normalizing the data
using MLDataUtils: shuffleobs, stratifiedobs, rescale!
using KernelDensity

In [2]:

using Random
Random.seed!(1234)

Out[2]:

TaskLocalRNG()

Data Cleaning & Setup¶

In [3]:

# Import the "Default" dataset.
# R ISLR 패키지에서 Default(채무불이행) 데이터를 가져온다.
data = RDatasets.dataset("ISLR","Default");

In [4]:

first(data,5)

Out[4]:

5×4 DataFrame

Row	Default	Student	Balance	Income
	Cat…	Cat…	Float64	Float64
1	No	No	729.526	44361.6
2	No	Yes	817.18	12106.1
3	No	No	1073.55	31767.1
4	No	No	529.251	35704.5
5	No	No	785.656	38463.5

대부분의 머신 러닝 프로세스는 데이터를 정리하는 데 약간의 노력이 필요하며, 이 과정도 다르지 않습니다. "예" 또는 "아니오"로 표시되는 Default 및 Student 열을 1과 0으로 변환해야 합니다. 그 후에는 기존의 단어 기반 열을 제거할 것입니다.

In [5]:

# Convert "Default" and "Student" to numeric values
data[!, :DefaultNum] = [ifelse(r.Default=="Yes",1.0, 0.0) for r in eachrow(data) ]
data[!, :StudentNum] = [ifelse(r.Student=="Yes",1.0, 0.0) for r in eachrow(data) ]
# Delete the old columns which say "Yes" and "No"
select!(data, Not([:Default, :Student]))
# Show the first 5 rows of our edited dataset
first(data,5)

Out[5]:

5×4 DataFrame

Row	Balance	Income	DefaultNum	StudentNum
	Float64	Float64	Float64	Float64
1	729.526	44361.6	0.0	0.0
2	817.18	12106.1	0.0	1.0
3	1073.55	31767.1	0.0	0.0
4	529.251	35704.5	0.0	0.0
5	785.656	38463.5	0.0	0.0

학습용, 테스트용 데이터 분리
정리를 마친 후에는 데이터 집합을 학습 및 테스트 집합으로 나누고 데이터에서 레이블을 분리할 차례입니다. 데이터를 train과 test의 두 부분으로 분리합니다. at = 0.05 인수를 수정하여 더 높은 분할 비율(또는 더 낮은 비율)을 사용할 수 있습니다. 작은 샘플 크기로 베이지안 추론의 힘을 보여주기 위해 5%의 샘플만 사용하는 것을 강조했습니다.

In [6]:

function split_data(df, target; at=0.70)
    shuffled = shuffle(df)
    return trainset, testset = stratifiedobs(row -> row[target], shuffled; p=at)
end
target = :DefaultNum
# 학습, 테스트 데이터셋 분리
trainset, testset = split_data(data, target; at = 0.05)
first(trainset,5) |> display
first(testset,5) |> display

5×4 DataFrame

Row	Balance	Income	DefaultNum	StudentNum
	Float64	Float64	Float64	Float64
1	705.93	19891.3	0.0	1.0
2	1738.11	38821.6	0.0	0.0
3	107.707	42200.0	0.0	0.0
4	865.566	16643.2	0.0	1.0
5	1168.87	18113.3	0.0	1.0

5×4 DataFrame

Row	Balance	Income	DefaultNum	StudentNum
	Float64	Float64	Float64	Float64
1	282.91	35445.9	0.0	0.0
2	262.599	48762.3	0.0	0.0
3	0.0	39576.4	0.0	0.0
4	692.156	39199.4	0.0	0.0
5	1203.74	40969.1	0.0	0.0

정규화
각 열에서 평균을 빼고 표준 편차로 나누어 0을 중심으로 변수의 크기를 다시 조정해야 합니다. 이 단계가 없으면 튜링의 샘플러는 매개변수 추정치 검색을 시작할 위치를 찾는 데 어려움을 겪을 것입니다. 이를 위해 관찰을 손쉽게 섞고 계층화된 분할을 수행하여 대표적인 테스트 집합을 얻을 수 있는 MLDataUtils를 활용합니다.

rescale! function 설명
예) rescale!(trainset[!, feature]; obsdim=1)

첫 번째 인자: trainset[!, feature]는 trainset 데이터 프레임에서 feature 열(특징)을 나타냅니다. !는 모든 행을 선택하라는 의미입니다.

obsdim: 이 매개변수는 관찰치의 차원을 나타냅니다. 기본적으로, obsdim=1은 행을 관찰치로 취급하고 각 행마다 스케일링을 적용합니다. 반면, obsdim=2는 열을 관찰치로 취급하고 각 열마다 스케일링을 적용합니다.

rescale! 함수는 주어진 데이터를 인플레이스로 변경하며, 이는 원래의 데이터를 변경한다는 것을 의미합니다 (! 기호가 함수 이름의 끝에 붙어있기 때문에). 함수는 주어진 데이터를 0과 1 사이의 값으로 스케일링합니다.

rescale 함수는 zscore를 만들고 그 때 사용된 평균과 표준편차를 리턴한다. $$ \mu_{\text{trainset}} = mean(\text{trainset})\\ \sigma_{\text{trainset}}= std(\text{trainset})\\ zscore_{\text{trainset}} = \frac{\text{trainset} – \mu_{\text{trainset}}}{\sigma_{\text{trainset}}}\\ \;\\ \;\\ zscore_{\text{testset}} = \frac{\text{testset} – \mu_{\text{trainset}}}{\sigma_{\text{trainset}}} $$

중요사항

정규화 즉 zscore 계산시에 testset에 적용하는 평균과 표준편차는 trainset에서 사용한 평균과 표준편차를 사용해야 한다. 동일한 정규화 적용

In [7]:

numerics = [:Balance, :Income]
# 정규화 (Normalize)
for feature in numerics
    # trainset을 정규화(zscore)화 하고
    # 이 때 사용된 평균과 표준편차를 가져온다
    μ, σ = rescale!(trainset[!, feature]; obsdim=1)
    # trainset에서 사용한 μ, σ를 사용하여 testset을 정규화 한다.
    rescale!(testset[!, feature], μ,σ; obsdim=1)
end
first(trainset,5) |> display
first(testset,5) |> display

5×4 DataFrame

Row	Balance	Income	DefaultNum	StudentNum
	Float64	Float64	Float64	Float64
1	-0.290317	-1.07397	0.0	1.0
2	1.72611	0.396591	0.0	0.0
3	-1.45898	0.65903	0.0	0.0
4	0.0215398	-1.32629	0.0	1.0
5	0.614059	-1.21208	0.0	1.0

5×4 DataFrame

Row	Balance	Income	DefaultNum	StudentNum
	Float64	Float64	Float64	Float64
1	-1.11671	0.134357	0.0	0.0
2	-1.15639	1.16881	0.0	0.0
3	-1.66939	0.455227	0.0	0.0
4	-0.317225	0.425942	0.0	0.0
5	0.682175	0.563412	0.0	0.0

Turing에 사용할 데이터 포멧으로 변경

Turing은 DataFrame포맷의 데이터가 아닌 Matrix 포맷의 데이터를 사용한다.

In [8]:

features = [:StudentNum, :Balance, :Income]
train = Matrix(trainset[:, features])
test = Matrix(testset[:, features])
train_label = trainset[:, target]
test_label = testset[:,target]
train |> display

500×3 Matrix{Float64}:
 1.0  -0.290317   -1.07397
 0.0   1.72611     0.396591
 0.0  -1.45898     0.65903
 1.0   0.0215398  -1.32629
 1.0   0.614059   -1.21208
 0.0   1.68151     2.05248
 0.0   1.28278     0.169574
 0.0  -1.23139     1.35815
 0.0  -0.201578   -0.50778
 1.0  -0.503496   -1.26771
 0.0   0.952375    1.37702
 0.0   1.01372     1.19518
 1.0   1.203      -1.09293
 ⋮                
 0.0   0.556818    0.205647
 0.0   0.333643    0.813972
 0.0   0.123881   -0.181476
 1.0   0.07578    -1.69431
 1.0   2.96144    -0.73522
 0.0  -1.66939     0.688382
 1.0   2.12257    -1.47288
 0.0   0.292346    0.265388
 0.0  -0.0250153   0.334578
 0.0  -1.314       0.311422
 0.0  -0.780992    0.920992
 0.0  -0.0211741   1.33423

Model Declaration¶

마지막으로 모델을 정의할 수 있습니다.

logistic_regression은 네 개의 인수를 받습니다:

x는 독립 변수 집합입니다;
y는 예측하고자 하는 요소입니다;
n은 우리가 가진 관찰 데이터의 수입니다.
σ는 선험적 경험으로 가정하려는 표준 편차입니다.

모델 내에서 4개의 계수(intercept, student, balance, income)를 생성하고 사전분포로 각각 평균이 0이고 표준 편차가 σ인 정규분포를 따른다고 가정 . 이 네 가지 계수의 값을 찾아 주어진 y를 예측하고자 합니다.

for 블록은 로지스틱 함수인 변수 v를 생성합니다. 그런 다음 실제 레이블인 y[i]가 주어졌을 때 p를 계산하여 레이블이 y[i]일 확률을 계산한다.

$\beta_0$ : intercept
$\beta_1$ : student
$\beta_2$ : balance
$\beta_3$ : income
$i$ : i번째 데이터
$\hat{y}_i$ : i번째 데이터의 레이블이 $y_i$일 확률, p

$ z_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3}\\ \hat{y}_i = \frac{1}{1+exp(-z_i)} $

In [9]:

# Bayesian logistic regression (LR)
@model function logistic_regression(x,y,n,σ)
    # prior
    intercept ~ Normal(0, σ)
    
    student ~ Normal(0, σ)
    balance ~ Normal(0, σ)
    income ~ Normal(0,σ)
    
    for i in 1:n
        p = logistic(intercept + student*x[i,1] + balance * x[i,2] + income * x[i,3])
        y[i] ~ Bernoulli(p)
    end
end

Out[9]:

logistic_regression (generic function with 2 methods)

Sampling¶

이제 샘플러를 실행할 수 있습니다. 이번에는 NUTS를 사용하여 사후분포에서 샘플링하겠습니다.

In [10]:

# Retrieve the number of observations.
n,_ = size(train)
# Sample using NUTS.
m = logistic_regression(train,train_label,n,1)
# 3개의 chain을 생성
chains = sample(m, NUTS(), MCMCThreads(), 1_500, 3)

┌ Info: Found initial step size
└   ϵ = 0.4
┌ Info: Found initial step size
└   ϵ = 0.4
┌ Info: Found initial step size
└   ϵ = 0.4

Out[10]:

Chains MCMC chain (1500×16×3 Array{Float64, 3}):
Iterations        = 751:1:2250
Number of chains  = 3
Samples per chain = 1500
Wall duration     = 14.81 seconds
Compute duration  = 44.02 seconds
parameters        = intercept, student, balance, income
internals         = lp, n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size
Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64   ⋯
   intercept   -4.4444    0.4241    0.0088   2378.9996   2346.6717    1.0013   ⋯
     student    0.0777    0.6449    0.0136   2275.4180   2462.2562    1.0004   ⋯
     balance    1.7781    0.2978    0.0064   2207.9511   2174.0305    1.0013   ⋯
      income    0.0860    0.3172    0.0062   2651.0232   2863.7982    1.0002   ⋯
                                                                1 column omitted
Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 
   intercept   -5.3339   -4.7181   -4.4209   -4.1511   -3.6788
     student   -1.2055   -0.3534    0.0876    0.4956    1.3324
     balance    1.2019    1.5822    1.7677    1.9732    2.3863
      income   -0.5490   -0.1236    0.0917    0.3087    0.6845

In [11]:

let
    mean_params = mean(chains)
    @assert mean_params[:student, :mean] < 0.1
    @assert mean_params[:balance, :mean] > 1
end

여러 체인을 실행했으므로 각 체인이 비슷한 지점을 중심으로 수렴하는지 확인하기 위해 스팟 체크를 수행할 수 있습니다.

In [12]:

plot(chains)

Out[12]:

In [13]:

density(chains[:lp], legend=false)

Out[13]:

code 설명

mean_params = mapreduce(hcat, mean(chains; append_chains=false)) do df
    return df[:, :mean]
end

chains에서 평균 파라미터 값을 추출하는 것으로 보입니다. 아래 코드의 각 부분에 대해 설명드리겠습니다.

mean(chain; append_chains=false):

mean 함수는 chains 의 각 파라미터에 대한 평균 값을 계산합니다.
append_chains=false 는 chains 에 여러 개의 연결된 체인이 있을 경우 이를 분리한 상태로 유지하라는 옵션입니다.

mapreduce(hcat, mean(chain; append_chains=false)):

mapreduce 는 두 개의 함수와 하나의 컬렉션을 인수로 받아 작업을 수행합니다.
첫 번째 함수는 컬렉션의 각 요소에 적용되며, 두 번째 함수는 첫 번째 함수의 결과를 결합합니다.
여기서는 hcat 이 결합 함수로 사용되었는데, 이는 수평적으로 데이터를 결합하라는 의미입니다.

do df:

do 문법은 Julia에서 익명 함수를 표현하는 데 사용됩니다. 이 구문은 mapreduce 의 첫 번째 함수에 대한 정의를 시작합니다.

return df[:, :mean]:

이 부분은 df 라는 이름의 데이터 프레임에서 mean 이라는 열만 추출하여 반환합니다.

결과적으로, mean_params 는 chains 의 각 파라미터에 대한 평균 값을 수평적으로 연결한 것입니다.

In [14]:

let
    mean(chains) |> display
#     mean(chains; append_chains=false) |> display
#     mean(chains; append_chains=false)[1] |> display
#     mean(chains; append_chains=false)|>DataFrame |> display
    
    # mapreduce(df->df[:,:mean],hcat, mean(chains; append_chains=false))
    # 와 동일
    mean_params = mapreduce(hcat, mean(chains; append_chains=false)) do df
        return df[:, :mean]
    end
    
    for i in (2,3)
        @assert mean_params[:,i] != mean_params[:,1]        
        @assert isapprox(mean_params[:,i], mean_params[:,1]; rtol=5e-2)
    end
end

Mean
  parameters      mean 
      Symbol   Float64 
   intercept   -4.4444
     student    0.0777
     balance    1.7781
      income    0.0860

MCMCChains의 corner 함수를 사용하여 로지스틱 회귀의 다양한 매개변수 분포를 표시할 수 있습니다.

In [15]:

let
    I = [:student, :balance, :income]
    # Use the corner function. Requires StatsPlots and MCMCChains
    corner(chains,I, norm=true)
end

Out[15]:

다행히 corner 플롯은 각 매개 변수에 대한 단모수 분포를 보여 주므로 각 매개 변수의 샘플링된 값의 평균을 사용하여 예측을 위한 모델을 추정하는 것은 간단합니다.

Making Predictions¶

모델이 실제로 채무 불이행 가능성을 얼마나 잘 예측하는지 테스트하려면 어떻게 해야 할까요? 앞서 만든 test 개체를 가져와 샘플링 중에 계산된 평균 매개 변수를 통해 실행하는 예측 함수를 구축해야 합니다.

아래 prediction 함수는 Matrix와 Chain 객체를 사용합니다. 이 함수는 각 매개변수의 샘플링된 값의 평균을 취하고 테스트 세트의 모든 요소에 대해 이 평균값을 사용하여 로지스틱 함수를 다시 실행합니다.

채무불이행이 여부를 판단하기 위해 PDF와 CDF 살펴 보기¶

채무불이행 가능성 분포와 채무불이행 여부를 판단하기 위한 threshold 결정

y : 채무불이행 확률

아래 채무불이행(Default)가능성의 확률분포 보면 채무불이행(Default) 가능성이 아주 낮은 것을 알 수 있다.

In [16]:

function get_logistic_info(x::Matrix, chains)
    # Pull the means from each parameter's sampled values in the chain
    intercept = mean(chains[:intercept])
    student = mean(chains[:student])
    balance = mean(chains[:balance])
    income = mean(chains[:income])
    
    # Retrieve the number of rows
    n, _ = size(x)
    
    # Generate a vector to store our predictions
    y = Vector{Float64}(undef,n)
    z = Vector{Float64}(undef,n)
    # Calculate the logistic function for each element in the test set.
    for i in 1:n
        # Default probability
        z[i] = intercept + student * x[i,1] + balance * x[i, 2] + income * x[i,3]
        y[i] = logistic(z[i])
    end
    return y,z
end

Out[16]:

get_logistic_info (generic function with 1 method)

In [17]:

let
    y,z = get_logistic_info(train, chains)
    density(y) |> display
    density_of_y = kde(y)
    y_max_of_density = density_of_y.x[argmax(density_of_y.density)]
    println("y* = argmax(KDE) = $y_max_of_density")
    plot(density_of_y, title="Probability Density(PDF) of 'Default Probability'", xlabel="y", ylabel="Probability Density",
        legend=false)
    # kde 에서 pdf 구하기
    f_pdf(x) = pdf(density_of_y,x)
    plot!(density_of_y.x, f_pdf(density_of_y.x)) |> display
    # kde에서 cds 구하기
    f_cdf = cumsum(density_of_y.density) *  step(density_of_y.x)
    plot(density_of_y.x, f_cdf, title="Probability(CDF) of 'Default Probability'", xlabel="y", ylabel="Probability",
        legend=false) |> display
end

y* = argmax(KDE) = 0.005340085742365299

ROC(Receiver Operating Characterisitic-수신자 조작 특성) 곡선을 통한 최적화된 분류 임계값(threshold) 구하기¶

ROC (Receiver Operating Characteristic) Curve 완벽 정리!

ROC curve, AUC

ROC Curve(ROC 커브)란? 머신러닝 모형 평가

이진 분류 문제에서 분류기의 성능을 평가 하는 데 사용되는 그래픽 도구입니다. ROC 곡선은 다양한 임계값(threshold)에 대한 참 양성률(True Positive Rate, TPR)과 거짓 양성률(False Positive Rate, FPR)을 플롯하여 분류기의 성능을 시각화합니다.

True Positive Rate(TPR): 이것은 또한 민감도(Sensitivity)라고도 합니다. TPR은 실제 양성 중 모델이 얼마나 많은 양성을 양성으로 올바르게 분류 했는지 나타냅니다.

$ True Positive Rate =\frac{\text{실제양성을 모델이 양성이라고 판단한 수}}{\text{실제양성의 수}}\\ TPR = \frac{TP}{TP + FN} $
False Positive Rate(FPR): 이것은 1에서 특이도(Specificity)를 뺀 값 입니다. FPR은 실제 음성 중 모델이 얼마나 많은 음성이 잘못된 양성으로 분류 했는지 나타냅니다.

$ False Positive Rate =\frac{\text{실제음성을 모델이 양성이라고 판단한 수}}{\text{실제음성의 수}}\\ FPR = \frac{FP}{FP+TN} $

ROC 곡선의 특징:

ROC 곡선은 y축에 TPR을, x축에 FPR을 가지고 있습니다.
완벽한 분류기는 TPR이 1이고 FPR이 0인 지점, 즉 y축에 가까운 지점을 통과하는 ROC 곡선을 가질 것입니다.
무작위로 예측하는 분류기는 대각선을 따라 ROC 곡선을 가질 것입니다. 이 대각선 아래의 영역은 분류기의 성능을 나타내는 AUC(Area Under the Curve) 값으로, 0.5보다 큰 값이면 분류기가 무작위 예측보다 나은 성능을 보이는 것으로 간주됩니다.

AUC는 ROC 곡선 아래의 영역을 나타내며, 분류기의 전반적인 성능을 측정하는 지표로 사용됩니다. AUC 값이 1에 가까울수록 분류기의 성능이 좋다고 판단하며, 0.5에 가까울수록 성능이 무작위 예측 수준이라고 판단합니다.

ROC 곡선은 다양한 임계값에서 분류기의 성능을 평가하므로, 특정 비즈니스 문제나 비용 구조에 따라 최적의 임계값을 선택하는 데 도움을 줍니다.

FPR이 중요한 이유?¶

FPR(False Positive Rate)의 중요성을 이해하기 위해서는 먼저 "False Positive"의 비용과 그에 따른 영향을 고려해야 합니다.

비용과 위험:

실제로는 음성인데 양성으로 잘못 판단하는 경우가 발생하면, 그 결과로 인한 비용이나 위험이 발생할 수 있습니다. 예를 들어, 의료 진단에서 건강한 사람을 질병이 있다고 잘못 진단하면 불필요한 검사나 치료를 받게 되어 비용과 불편함, 심지어 부작용까지 초래할 수 있습니다.
리소스의 낭비:

실제로는 문제가 없는데 문제가 있다고 판단하면, 해당 문제를 해결하기 위한 리소스(시간, 노력, 자금 등)가 낭비될 수 있습니다.
사용자의 신뢰도 하락:

모델이 지속적으로 False Positive를 발생시키면 사용자의 신뢰도가 떨어질 수 있습니다. 사용자는 모델의 예측을 신뢰하기 어려워 할 것입니다.
임계값 조정의 필요성:

FPR을 모니터링하면 모델의 임계값을 조정하여 False Positive를 줄이는 방향으로 최적화할 수 있습니다.
전반적인 성능 평가:

TPR(민감도)만 높다고 해서 모델이 좋은 것은 아닙니다. FPR이 높으면 그만큼 잘못된 예측도 많이 한다는 의미입니다. 따라서 TPR과 FPR을 함께 고려하여 모델의 전반적인 성능을 평가해야 합니다.

결론적으로, FPR은 모델의 잘못된 예측의 비율을 나타내므로, 이를 최소화하는 것은 중요한 목표입니다. FPR을 모니터링하면 모델의 성능을 개선하고 사용자의 신뢰도를 유지하는 데 도움이 됩니다.

AUC(Area Under the Curve)가 중요한 이유?¶

AUC는 ROC 곡선의 아래 영역을 나타냅니다. ROC 곡선은 TPR과 FPR을 각각 y축과 x축으로 사용하여 그린 그래프 입니다.
AUC 값이 1에 가까울수록 좋은 모델을 나타냅니다. AUC가 0.5인 경우는 무작위 선택과 같은 성능을 나타냅니다.

AUC가 크다는 것은 모델이 양성 클래스와 음성 클래스를 잘 구분한다는 것을 의미 합니다. 이는 모델이 높은 TPR값을 가질 때 낮은 FPR 값을 가지는 경우, 즉 ROC 곡선이 y축에 가깝게 그려지는 경우를 의미 합니다.

FPR이 커지는 것 자체가 모델의 성능이 좋다는 것을 의미 하는 것은 아닙니다. 중요한 것은 FPR이 어느 정도 일 때 TPR이 얼마나 높은지를 보는 것입니다. 즉, FPR이 증가함에 따라 TPR이 급격하게 증가하면 그만큼 모델의 성능이 좋다는 것을 의미 합니다.

참조 : 공돌이의 수학정리노트 – ROC curve

ROC Curve를 사용하여 최적의 threshold 찾기¶

일반적인 방법 중 하나는 Youden’s J 통계량을 사용하는 것입니다. Youden’s J 통계량은 다음과 같이 계산 됩니다.

$ J = Sensitivity + Specificity – 1 \\ J = TPR – FPR $

roc_curve 호출시 fpr,tpr,thresholds array가 리턴된다.

이때 J가 최대 되는 index를 찾는다.

index = argmax(J) = argmax(tpr – fpr)

optmal_threshold = thresholds[index]

이 optimal_threshold 값을 test set 분류시 threshold로 사용한다.

Plot roc curve¶

In [18]:

# return : optimal_threshold
function plot_roc_curve(fpr,tpr,thresholds,optimal_auc)
    J_values = tpr - fpr
    idx = argmax(J_values)
    optimal_threshold = thresholds[idx]    
    plot(fpr,tpr, title="ROC Curve", 
        label="ROC Curve(AUC=$(round(optimal_auc,digits=2)))",
        xlabel = "False Positive Rate(FPR)",
        ylabel = "True Positive Rate(TPR)")
    scatter!([fpr[idx]],[tpr[idx]], color=:red,
        label="Optimal threshold=$(round(optimal_threshold,digits=4))\nSensitivity=$(round(tpr[idx],digits=2))\nSpecificity=$(round(1 - fpr[idx],digits=2))")
    plot!([0, 1], [0, 1], linestyle=:dash, color=:black, label="Random Classifier")|>display    
    optimal_threshold
end

Out[18]:

plot_roc_curve (generic function with 1 method)

ROC Curve of Train – Python from sklearn¶

In [19]:

using PyCall
optimal_threshold = let    
    y,z = get_logistic_info(train, chains)
    y_true = Int.(train_label)
    
    sklearn = pyimport("sklearn.metrics")
    fpr, tpr, thresholds = sklearn.roc_curve(y_true,y)    
    optimal_auc = sklearn.auc(fpr, tpr)   
    
    optimal_threshold = plot_roc_curve(fpr,tpr,thresholds,optimal_auc)    
    
    optimal_threshold
end;

ROC Curve of Train – MLJBase¶

In [20]:

let
    using MLJBase
    using CategoricalArrays
    y,z = get_logistic_info(train, chains)
    y_true = Int.(train_label)
    
    # 채무불이행 발생: 1, 채무불이행 발생안함: 0
    classes = [1,0]
    pool = CategoricalArray(classes)
    # UnivariateFinite에 값을 설정하다.
    # 1 => 채무불이행확률, 0 => 1-채무불이행확률
    y_pred = map(s->UnivariateFinite(classes,[s,1-s], pool = pool), y)    
    # roc_curve function에 입력되는 값이 python sklearn의 roc_curve에
    # 입력되는 값과 순서가 반대 즉, 예측값, 실제값
    fpr, tpr, thresholds = MLJBase.roc_curve(y_pred,y_true)
    optimal_auc = MLJBase.auc(y_pred, y_true)
    
    optimal_threshold = plot_roc_curve(fpr,tpr,thresholds,optimal_auc)
end

Out[20]:

0.11676636265864587

In [21]:

function prediction(x::Matrix, chains, threshold)
    y,z = get_logistic_info(x, chains)
    v = ifelse.(y .> threshold,1,0)
    v
end

Out[21]:

prediction (generic function with 1 method)

어떻게 했는지 살펴봅시다! 예측 함수를 통해 테스트 행렬을 실행하고 예측에 대한 평균 제곱 오차(MSE)를 계산합니다. threshold 변수는 예측의 민감도를 설정합니다. 예를 들어 임계값이 0.07이면 0.07보다 큰 예측 값에 대해 채무불이행(1)을 예측하고 0.07보다 작은 경우 채무불이행 없음(0)을 예측함

여기서는 ROC Curve를 통해 최적의 train set의 최적 threshold를 찾고 test set의 threshold로 설정 한다.

더 중요한 것은 채무불이행을 정확히 예측한 비율이 몇 퍼센트인지 확인하는 것입니다. 아래 코드는 단순히 채무불이행과 예측을 계산하여 결과를 표시합니다.

In [22]:

# Set the prediction threshold
threshold = optimal_threshold
# Make the predictions.
predictions = prediction(test, chains, threshold)
# Calculate MSE for our test set.
loss = sum((predictions - test_label) .^ 2) / length(test_label)
println("loss = $loss")
defaults = sum(test_label)
not_defaults = length(test_label) - defaults
predicted_defaults = sum(test_label .== predictions .== 1)
predicted_not_defaults = sum(test_label .== predictions .== 0)
println("Optimal threshold : $threshold")
println("Default: $defaults
    Predictions: $predicted_defaults
    Percentage defaults correct $(predicted_defaults/defaults)")
println("Not defaults: $not_defaults
    Prediction: $predicted_not_defaults
    Percentage non-defaults correct $(predicted_not_defaults/not_defaults)")

loss = 0.06168421052631579
Optimal threshold : 0.1196338248440811
Default: 316.0
    Predictions: 221
    Percentage defaults correct 0.6993670886075949
Not defaults: 9184.0
    Prediction: 8693
    Percentage non-defaults correct 0.9465374564459931

위는 threshold를 0.06로 설정하면 상당 부분의 채무불이행을 정확하게 예측하고 대부분 채무불이행이이 아닌 항목을 정확하게 식별한다는 것을 보여줍니다. 이는 threshold 선택에 상당히 민감하므로 임계값을 실험해 볼 수 있습니다.

이 튜토리얼에서는 Turing을 사용하여 베이지안 로지스틱 회귀를 수행하는 방법을 보여드렸습니다.

ROC Curve of Test – Python from sklearn¶

In [23]:

best_threshold = let    
    y,z = get_logistic_info(test, chains)
    y_true = Int.(test_label)
    
    sklearn = pyimport("sklearn.metrics")
    fpr, tpr, thresholds = sklearn.roc_curve(y_true,y)
    optimal_auc = sklearn.auc(fpr, tpr)
    
    optimal_threshold = plot_roc_curve(fpr,tpr,thresholds,optimal_auc)
end;

ROC Curve of Test- MLJBase¶

In [24]:

let
    using MLJBase
    using CategoricalArrays
    y,z = get_logistic_info(test, chains)
    y_true = Int.(test_label)
    
    classes = [1,0]
    pool = CategoricalArray(classes)
    y_pred = map(s->UnivariateFinite(classes,[s,1-s], pool = pool), y)
    fpr, tpr, thresholds = MLJBase.roc_curve(y_pred,y_true)
    optimal_auc = MLJBase.auc(y_pred, y_true)
    
    optimal_threshold = plot_roc_curve(fpr,tpr,thresholds,optimal_auc) 
end

Out[24]:

0.05561593166405293

부록 : 저수준에서 ROC Curve 그려 보기¶

In [25]:

let
    function calculate_roc(y_pred,y_true)
        
        thresholds = sort(unique(y_pred), rev=true)
        push!(thresholds, 0.0)  # 마지막 임계값 추가
        tpr = Float64[]
        fpr = Float64[]
        for thresh in thresholds
            tp = sum((y_pred .>= thresh) .& (y_true .== 1))
            fp = sum((y_pred .>= thresh) .& (y_true .== 0))
            tn = sum((y_pred .< thresh) .& (y_true .== 0))
            fn = sum((y_pred .< thresh) .& (y_true .== 1))
            push!(tpr, tp / (tp + fn))
            push!(fpr, fp / (fp + tn))
        end
        return fpr, tpr, thresholds
    end
    y_pred,_ = get_logistic_info(test, chains)    
    y_true = test_label    
    fpr, tpr, thresholds = calculate_roc(y_pred, y_true)
    # 최적의 threshold 구하기
    J = tpr - fpr
    idx = argmax(J)    
    optimal_threshold = thresholds[idx]
    println("Optimal threshold = $optimal_threshold")
    
    # ROC 곡선 그리기
    plot(fpr, tpr, label="ROC Curve", xlabel="False Positive Rate", ylabel="True Positive Rate", legend=:bottomright)
    scatter!([fpr[idx]],[tpr[idx]], makercolor=:red,
        label="Optimal threshold=$(round(optimal_threshold,digits=4))\nSensitivity=$(round(tpr[idx],digits=2))\nSpecificity=$(round(1 - fpr[idx],digits=2))")
    plot!([0, 1], [0, 1], linestyle=:dash, label="Random Classifier")
    
end

Optimal threshold = 0.0556546851493293

Out[25]:

In [ ]:

[Julia, Turing] Tutorial – Bayesian Logistic Regression

Bayesian Logistic Regression¶

Data Cleaning & Setup¶

Model Declaration¶

Sampling¶

Making Predictions¶

채무불이행이 여부를 판단하기 위해 PDF와 CDF 살펴 보기¶

ROC(Receiver Operating Characterisitic-수신자 조작 특성) 곡선을 통한 최적화된 분류 임계값(threshold) 구하기¶

FPR이 중요한 이유?¶

AUC(Area Under the Curve)가 중요한 이유?¶

ROC Curve를 사용하여 최적의 threshold 찾기¶

Plot roc curve¶

ROC Curve of Train – Python from sklearn¶

ROC Curve of Train – MLJBase¶

ROC Curve of Test – Python from sklearn¶

ROC Curve of Test- MLJBase¶

부록 : 저수준에서 ROC Curve 그려 보기¶

댓글 달기 댓글 취소