๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

CS/์ธ๊ณต์ง€๋Šฅ

[24-2] ๐Ÿ‘พ ๊ธฐ๊ณ„ํ•™์Šต(ML) ํ”„๋กœ์ ํŠธ : ์™ธ๊ณ„ ํ–‰์„ฑ ์ฐพ๊ธฐ ๐Ÿ‘ฝ


 

 

๊ฐœ์š”

๋•Œ๋Š” ๋ฐ”์•ผํ๋กœ 11.4 ML ํ”„๋กœ์ ํŠธ ๊ณต์ง€๊ฐ€ ์˜ฌ๋ผ์™”๋‹ค! 

๋ฌด๋ ค ํฌ์Šคํ„ฐ ์„ธ์…˜์œผ๋กœ ์ง„ํ–‰๋˜๋Š” ํ”„๋กœ์ ํŠธ์˜€๊ธฐ ๋•Œ๋ฌธ์— ์—ด์‹ฌํžˆ ์ค€๋น„ํ•ด์„œ ๋ฉ‹์ง„ ๋ฐœํ‘œ๋ฅผ ํ•˜๊ณ  ์‹ถ์€ ์š•์‹ฌ์ด ์žˆ์—ˆ๋‹ค!

 

 

 

๋ฐฐ์šด ๊ฒƒ๋“ค์„ ์ข…ํ•ฉํ•ด์„œ ML ์„ ํ™œ์šฉํ•œ ํ”„๋กœ์ ํŠธ๋ฅผ ์ง„ํ–‰ํ•˜๋ฉด ๋˜๊ณ ,

๋ฐฐ์šด ๊ฒƒ์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ์ œ์•ˆ์„ ์—ฐ๊ตฌ(Research) ํ•˜๊ณ , ํ˜„์กดํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํฅ๋ฏธ๋กœ์šด ๋ฌธ์ œ์— ์ ์šฉ์‹œ์ผœ๋ณธ๋‹ค. (Development)

๊ทธ๋ฆฌ๊ณ  ๋งˆ์ง€๋ง‰์œผ๋กœ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค๋„ ์ ์šฉํ•ด๊ฐ€๋ฉด์„œ ๋‹ค์–‘ํ•œ ์„ฑ๋Šฅ ์ธก์ • ์ง€ํ‘œ(performance metrics) ์— ๋Œ€ํ•˜์—ฌ ํ™•์žฅ ๋น„๊ต ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. (Evaluation)

 

 

์ฃผ์ œ ์„ ์ •

 

๊ฐ€์ด๋“œ๋ผ์ธ ๊ต์•ˆ์— ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ”„๋กœ์ ํŠธ ์˜ˆ์‹œ๋“ค์„ ์ฃผ์…จ๋‹ค.

  • Handwritten Alphabet/Digit Recognition
  • Predictive Stock/House Price Modeling
  • Image Classification with CNNs
  • Recommendation System
  • Face Recognition
  • Traffic Flow Prediction
  • Chatbot Development or LLM application
  • Object Detection in Images
  • Human Activity Recognition

๊ทผ๋ฐ ๊ผญ ์—ฌ๊ธฐ์„œ ๊ณจ๋ผ์•ผํ•˜๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ, ๋‹ค๋ฅธ ํฅ๋ฏธ๋กœ์šด ์ฃผ์ œ๋ฅผ ์„ ํƒํ•ด๋„ ๋œ๋‹ค๊ณ  ํ•˜์…”์„œ ๋‚˜๋Š” novelty ์ ์ˆ˜๋ฅผ ๋” ๋ฐ›๊ณ  ์‹ถ์–ด์„œ

ํฅ๋ฏธ๋กœ์›Œ ๋ณด์ด๋Š” ์ƒˆ๋กœ์šด ์ฃผ์ œ๋ฅผ ์ •ํ–ˆ๋‹ค. <๋ณ„๋น›์˜ ์ฃผ๊ธฐ์ ์ธ ๋ณ€ํ™” ๋ถ„์„์„ ํ†ตํ•œ ๐Ÿ‘ฝ์™ธ๊ณ„ํ–‰์„ฑ์ฐพ๊ธฐ> ์ด๋‹ค. 

 

 

์–ด์ฉŒ๋‹ค๊ฐ€ ์ด๋Ÿฐ ์ฃผ์ œ๋ฅผ ๊ณจ๋ž๋ƒ๊ณ  ํ•œ๋‹ค๋ฉด,, ์‚ฌ์‹ค ์ง€๋‚œ 10์›”์— ๋ฐ์ดํ„ฐ ๋ถ„์„์— ใ„ท ์ž๋„ ๋ชจ๋ฅด๋ฉด์„œ

์บ๊ธ€์ด๋ผ๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„ ๋Œ€ํšŒ ๋ชจ์ง‘ ๊ณต๊ณ ๋ฅผ ๋ณด๊ณ  ํฅ๋ฏธ๋กœ์›Œ ๋ณด์ธ๋‹ค๋Š” ์ด์œ ๋กœ ๋ฌด๋ชจํ•œ ๋„์ „์„ ํ–ˆ์—ˆ๋‹ค.

 

์•„๋ž˜๋Š” ๋ฐ”๋กœ ๊ทธ ๋ฌธ์ œ์˜ ์บ๊ธ€ ๋Œ€ํšŒ์ด๋‹ค.. ๐Ÿ˜„๐Ÿ˜ƒ

 

https://www.kaggle.com/competitions/ariel-data-challenge-2024/overview

 

NeurIPS - Ariel Data Challenge 2024: Derive exoplanet signals from Ariel's optical instruments (www.kaggle.com)

 

 

์™ธ๊ณ„ ํ–‰์„ฑ์˜ ๋Œ€๊ธฐ๋ฅผ ๋ถ„์„ํ•˜๊ธฐ ์œ„ํ•œ multimodal supervised learning task ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  

๊ทธ ๊ณผ์ •์—์„œ ๊ด€์ธก์žฅ๋น„์— ์˜ํ•œ jitter noise ์™œ๊ณก์„ ์ œ๊ฑฐํ•˜๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ธ ๋Œ€ํšŒ์˜€๋‹ค. 

 

ํ˜„๋Œ€

 

 

๊ทธ๋Ÿฐ๋ฐ ์•„๋ฌด๋ž˜๋„ ๊ฒฝํ—˜์ด ํ„ฐ๋ฌด๋‹ˆ ์—†์ด ๋ชจ์ž๋ผ๋‹ค๋ณด๋‹ˆ ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•ด์•ผํ• ์ง€์กฐ์ฐจ ์ž˜ ๊ฐ์ด ์žกํžˆ์ง€ ์•Š์•˜๊ณ  ์˜จ๊ฐ– ์ˆ˜ํ•™ ๊ณต์‹๋“ค๋งŒ ๋‚œ๋ฌดํ•˜๋Š” ์ฒœ์ฒด๊ด€๋ จ ๋…ผ๋ฌธ๋“ค์„ ๋ณด๋ฉฐ ์ขŒ์ ˆํ•˜๊ณ  ํฌ๊ธฐํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค.. ใ…Žใ…Ž ใ…Ž ( ์•„๋ฌด๋ž˜๋„ ํ˜„๋Œ€ ์ฒœ๋ฌธํ•™์—์„œ ๊ฐ€์žฅ ์–ด๋ ค์šด ๋ฐ์ดํ„ฐ ๋ถ„์„ ๋ฌธ์ œ ์ค‘ ํ•˜๋‚˜๋ผ๊ณ  ํ•˜๋‹ˆ๊นŒ..^^ )

 

๋น„๋ก ๊ทธ ๋•Œ ์‹œ๋„ํ–ˆ๋˜ ์ฃผ์ œ๋Š” ์‹คํŒจํ–ˆ์ง€๋งŒ, ์œ ์‚ฌํ•˜์ง€๋งŒ ์กฐ๊ธˆ ๋‚œ์ด๋„๋ฅผ ๋‚ฎ์ถฐ์„œ

๋ณ„๋น›์˜ ์ฃผ๊ธฐ์ ์ธ ๋ณ€ํ™” ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์™ธ๊ณ„ํ–‰์„ฑ์˜ ์กด์žฌ๋ฅผ ์˜ˆ์ธก(Classification)ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ด๋ณด๋ฉด ์–ด๋–จ๊นŒํ•˜๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ๋‹ค!

 

 

๋ณ„๋น›์˜ ์ฃผ๊ธฐ์ ์ธ ๋ณ€ํ™”๋ฅผ ๊ฐ€์ง€๊ณ  ์–ด๋–ป๊ฒŒ ์™ธ๊ณ„ํ–‰์„ฑ์˜ ์กด์žฌ๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์–ด ? ๋ผ๊ณ   ๋ฌป๋Š”๋‹ค๋ฉด, 

 

ํ–‰์„ฑ์ด ๋ณ„ ์•ž์„ ์ง€๋‚˜๊ฐ€๋ฉด, ๋ณ„๋น›์˜ ์ผ๋ถ€๊ฐ€ ๊ฐ€๋ ค์ ธ์„œ ๋ฐ๊ธฐ๊ฐ€ ์•ฝ๊ฐ„ ๊ฐ์†Œํ•œ๋‹ค. ์ฆ‰, ๋ชจํ–‰์„ฑ์ด ์™ธ๊ณ„ํ–‰์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉด ๊ทธ ์™ธ๊ณ„ํ–‰์„ฑ์ด ๋ชจํ–‰์„ฑ ์ค‘์‹ฌ์œผ๋กœ ๊ณต์ „ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ด€์ธก ํ•˜๋Š” ์ž…์žฅ์—์„œ๋Š” ์ฃผ๊ธฐ์ ์œผ๋กœ ๊ฐ์ง€ ๊ฐ€๋Šฅํ•œ ๋ถ€๋ถ„์„ ๋ง‰์•„ ๋น›์˜ ๋ฐ๊ธฐ๊ฐ€ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ๊ด€์ธกํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

์ด๋Ÿฌํ•œ ๋ณ„์˜ ์ฃผ๊ธฐ์ ์ธ ๋ฐ๊ธฐ ๋ณ€ํ™”์˜ ํŒจํ„ด์„ ๊ฐ์ง€ํ•˜์—ฌ ์ตœ์ข…์ ์œผ๋กœ ์™ธ๊ณ„ํ–‰์„ฑ์ด ์กด์žฌํ•˜๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ๊ด€์ธกํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์ด ์šฐ๋ฆฌ์˜ ๋ชฉ์ ์ด๋‹ค. 

 

๊ทธ๋ ‡๊ฒŒ ๋‚˜๋ž‘ ๋‚ด ํŒ€์›์€ "์™ธ๊ณ„ ํ–‰์„ฑ์„ ์ฐพ๊ธฐ ์œ„ํ•œ ์šฐ์ฃผ ๋กœ์˜ ์—ฌํ–‰" ์„ ๋– ๋‚ฌ๋‹ค! ๋ ˆ์ธ  ๊ธฐ๋ฆฟ ๐Ÿ˜Ž

 

์™ธ๊ณ„ ํ–‰์„ฑ ์ฐพ๊ธฐ

 

๋ฐ์ดํ„ฐ์…‹

NASA ์ผ€ํ”Œ๋Ÿฌ ์šฐ์ฃผ๋ง์›๊ฒฝ์—์„œ ๊ฐ€์ ธ์˜จ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์—์„œ 5,050 ๊ฐœ๋Š” ์™ธ๊ณ„ํ–‰์„ฑ์ด ์—†๊ณ  ์˜ค์ง 37๊ฐœ๋งŒ ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์กด์žฌํ•œ๋‹ค.

 

 

 

 

๋น›์˜ ์ฃผ๊ธฐ์ ์ธ ๋ณ€ํ™”๋ฅผ ๊ด€์ฐฐํ•ด์„œ ์ฃผ๊ธฐ์„ฑ์„ ์ฐพ์œผ๋ ค๋ฉด ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๊ฐ€ ํ•„์š”ํ•˜๋‹ค. 

ํƒœ์–‘๊ณ„ ์•ˆ์— ์žˆ๋Š” ํ–‰์„ฑ์˜ ๊ณต์ „์€ 88๋…„ ~ 165๋…„๊นŒ์ง€ ๋‹ค๋ฅด๋‹ค. 

๊ทธ๋ž˜ํ”„ ์‹œ๊ฐํ™” 

์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์…‹์—์„œ ํ†ต์ฐฐ์„ ์–ป๊ธฐ ์œ„ํ•ด์„œ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค๋ณด์ž. ๊ฐ€๋กœ์ถ•์€ ๊ด€์ธก ํšŸ์ˆ˜์ด๊ณ  ์„ธ๋กœ์ถ•์ด ๊ด‘์†(Light Flux)์ด๋‹ค. 

 

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()

# df is the exoplanets.csv DataFrame loaded above; the first column is the label, the rest are flux values
X = df.iloc[:,1:]
y = df.iloc[:,0] - 1   # shift labels to 0 (no exoplanet) / 1 (exoplanet)

def light_plot(index):
    # Plot the light curve of the star in the given row
    y_vals = X.iloc[index]
    x_vals = np.arange(len(y_vals))
    plt.figure(figsize=(15,8))
    plt.xlabel('Number of Observations')
    plt.ylabel('Light Flux')
    plt.title('Light Plot ' + str(index), size=15)
    plt.plot(x_vals, y_vals)
    plt.show()
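For example (assuming, as the balanced 74-row subset used later suggests, that the 37 exoplanet-hosting stars come first in the file):

light_plot(0)    # one of the first rows: labelled as hosting an exoplanet
light_plot(37)   # a later row: labelled as having no exoplanet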

 

๋ถ„๋ช… ๋ฐ์ดํ„ฐ์— ์ฃผ๊ธฐ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๋ณ„์˜ ๋ฐ๊ธฐ ์ฆ๊ฐ์ด ์žˆ๋‹ค. 

 

์–ด๋–ค ํ–‰์„ฑ์€ ์ฃผ๊ธฐ์ ์œผ๋กœ ๋น›์ด ๊ฐ์†Œํ•œ๋‹ค. ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์žˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๋‹ค. 

 

 

์™ธ๊ณ„ ํ–‰์„ฑ์ด ์žˆ๋Š” ๊ฒฝ์šฐ

 

 ๋ฐ˜๋ฉด, ์–ด๋–ค ํ–‰์„ฑ์€ ๋งค์šฐ ์ž ์ž ํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

์™ธ๊ณ„ ํ–‰์„ฑ์ด ์—†๋Š” ๊ฒฝ์šฐ

 

 

ํ•˜์ง€๋งŒ ๊ทธ๋ž˜ํ”„๋งŒ์œผ๋กœ๋Š” ์™ธ๊ณ„ํ–‰์„ฑ ์กด์žฌ ์—ฌ๋ถ€๋ฅผ ํŒ๋ณ„ํ•˜๊ธฐ์—๋Š” ํ™•์‹ค์น˜ ์•Š๋‹ค.

 

์ด ๋ฐ์ดํ„ฐ์…‹์€ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์ด์ง€๋งŒ, ๋‹ค์Œ ๋ฒˆ์˜ ๊ด‘์†์„ ์˜ˆ์ธกํ•˜๋Š” ํšŒ๊ท€ ๋ฌธ์ œ๊ฐ€ ์•„๋‹Œ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ ๋ณ„์„ ๋ถ„๋ฅ˜(Classificatoin)ํ•˜๋Š” ๋ฌธ์ œ์ด๋‹ค. 

 

 

๋ชจ๋ธ ์„ ์ •

๋ฌธ์ œ ํ•ด๊ฒฐ์„ ์œ„ํ•˜์—ฌ ์–ด๋–ค ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด์•ผํ• ๊นŒ? ์—ฐ๊ด€์„ฑ์ด ๋†’์•„๋ณด์ด๋Š” ๋ชจ๋ธ๊ตฐ๋“ค์„ ์กฐ์‚ฌํ•ด๋ณด์•˜๋‹ค.

 

1. ์ „ํ†ต์ ์ธ ์‹œ๊ณ„์—ด ๋ถ„์„ ๋ชจ๋ธ

 

ARIMA (AutoRegressive Integrated Moving Average)

 

  • ๋ฐ์ดํ„ฐ๊ฐ€ ๋น„๊ต์  ๋‹จ์ˆœํ•˜๊ณ , ๋…ธ์ด์ฆˆ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ ์ ํ•ฉ
  • ๋ณ„๋น›์˜ ๋ฐ๊ธฐ ๋ฐ์ดํ„ฐ์—์„œ ์ฃผ๊ธฐ์ ์ธ ๋ณ€ํ™”๋ฅผ ๊ฐ์ง€ํ•˜๊ณ  ์ด์ƒ๊ฐ’(Transit)์„ ํƒ์ง€
  • ํ•œ๊ณ„: ๋ณต์žกํ•œ ๋น„์„ ํ˜• ํŒจํ„ด์ด๋‚˜ ๋‹ค์ค‘ ์ฃผ๊ธฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐ๋Š” ๋ถ€์กฑ.

 

 

2. ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

 

1) ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ (Random Forest)

  • ๋ณ„๋น› ๋ฐ์ดํ„ฐ์˜ ํŠน์ง•(๋ฐ๊ธฐ ๋ณ€ํ™”์˜ ํฌ๊ธฐ, ์ฃผ๊ธฐ, ๋…ธ์ด์ฆˆ ๋“ฑ)์„ ์ถ”์ถœํ•œ ๋’ค ์ด๋ฅผ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋กœ ์ ‘๊ทผ
  • ๋ฐ์ดํ„ฐ๊ฐ€ ๋น„๊ต์  ์ž‘๊ฑฐ๋‚˜ ์„ค๋ช… ๊ฐ€๋Šฅํ•œ ๋ชจ๋ธ์ด ํ•„์š”ํ•  ๋•Œ ์ ํ•ฉ
  • ๋‹จ์ : ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ์ž์ฒด๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์Œ(ํŠน์ง• ์ถ”์ถœ ๊ณผ์ • ํ•„์š”)

 

2) Gradient boosting family (e.g., XGBoost, LightGBM)

  • Similar to random forests but learns more complex interactions in the data
  • After feature extraction, well suited to classifying whether a brightness pattern is caused by a planet or not

 

3. ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

 

1) ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง (RNN, LSTM, GRU)

  • ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์˜ ํŒจํ„ด์„ ํ•™์Šตํ•˜๋Š” ๋ฐ ํŠนํ™”๋œ ๋ชจ๋ธ
  • LSTM(Long Short-Term Memory)๊ณผ GRU(Gated Recurrent Unit)๋Š” ๊ธด ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์–ด ์ฃผ๊ธฐ์ ์ด๊ณ  ๋ณต์žกํ•œ ๋ฐ๊ธฐ ๋ณ€ํ™”๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ ์œ ์šฉ
  • ์žฅ์ : ๋ฐ์ดํ„ฐ์˜ ์‹œ๊ฐ„์  ์ข…์†์„ฑ์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šต
  • ๋‹จ์ : ํ•™์Šต์— ๋งŽ์€ ๋ฐ์ดํ„ฐ์™€ ์‹œ๊ฐ„์ด ํ•„์š”

 

2) 1D ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง (1D CNN)

  • ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์—์„œ ์ง€์—ญ์  ํŒจํ„ด(ํŠน์ • ๋ฐ๊ธฐ ๋ณ€ํ™” ํŒจํ„ด)์„ ํ•™์Šต
  • ๋ณ„๋น›์˜ ๋ฐ๊ธฐ ๋ณ€ํ™”์—์„œ ํ–‰์„ฑ์˜ Transit ์‹ ํ˜ธ๋ฅผ ๊ฐ์ง€ํ•˜๋Š” ๋ฐ ์ž์ฃผ ์‚ฌ์šฉ

 

3) ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ชจ๋ธ (CNN + LSTM)

  • CNN์œผ๋กœ ์ง€์—ญ์  ํŒจํ„ด(Transit ์‹ ํ˜ธ)์„ ํ•™์Šตํ•˜๊ณ , LSTM์œผ๋กœ ์ „์ฒด ์‹œ๊ณ„์—ด ํŒจํ„ด(์ฃผ๊ธฐ์  ๋ณ€ํ™”)์„ ํ•™์Šต

 

์‚ฌ์‹ค XGBoost ๊ฐ€ ๋Œ€๋ถ€๋ถ„์˜ ์บ๊ธ€ ๋Œ€ํšŒ์—์„œ ์šฐ์Šน์„ ์ฐจ์ง€ํ• ๋งŒํผ ์„ฑ๋Šฅ์ด ์ข‹์€ ๋ชจ๋ธ๋กœ ์•Œ๋ ค์ ธ์žˆ๋‹ค.

๊ทธ๋ž˜์„œ XGBClassifier ๋ฅผ ์‚ฌ์šฉํ•ด ๋จผ์ € ๋ถ„๋ฅ˜ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ด๋ณผ ๊ฒƒ์ด๋‹ค. 

 

 

๋ฐ์ดํ„ฐ ์ค€๋น„

 

์ดˆ๊ธฐ XGBClassifier

 

# XGBRegressor๋ฅผ ์ž„ํฌํŠธํ•ฉ๋‹ˆ๋‹ค.
from xgboost import XGBClassifier

# accuracy_score๋ฅผ ์ž„ํฌํŠธํ•ฉ๋‹ˆ๋‹ค.
from sklearn.metrics import accuracy_score

# train_test_split๋ฅผ ์ž„ํฌํŠธํ•ฉ๋‹ˆ๋‹ค.
from sklearn.model_selection import train_test_split

# ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

 

# XGBClassifier๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
model = XGBClassifier(booster='gbtree')

# ํ›ˆ๋ จ ์„ธํŠธ๋กœ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.
model.fit(X_train, y_train)

# ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋Œ€ํ•œ ์˜ˆ์ธก์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
y_pred = model.predict(X_test)

score = accuracy_score(y_pred, y_test)

print('์ ์ˆ˜: ' + str(score))

 

์ ์ˆ˜: 0.89

 

 

์ด ๋ฐ์ดํ„ฐ์…‹์—์„œ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ ๋ณ„์€ 37 / ( 5,050 + 37 )  %  ๋ฟ์ด๋‹ค. ์ฆ‰ 10% ๋„ ์ฑ„ ๋˜์ง€ ์•Š๋Š”๋‹ค.

๋”ฐ๋ผ์„œ ๋ฌด์กฐ๊ฑด ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์—†๋‹ค๊ณ  ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์ด ์žˆ๋‹ค๋ฉด ์ด ๋ชจ๋ธ์ด ๋” ๋‚ซ๋‹ค๊ณ  ๋งํ•˜๊ธฐ ์–ด๋ ค์šธ ๊ฒƒ์ด๋‹ค. 

 

๋ถˆ๊ท ํ˜•ํ•œ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” ์ •ํ™•๋„๋กœ๋Š” ์ถฉ๋ถ„ํ•˜์ง€ ์•Š๋‹ค๋Š” ๊นจ๋‹ฌ์Œ์„ ์–ป๋Š”๋‹ค.

 

 

์—ฌ๋Ÿฌ๊ฐ€์ง€ ํ‰๊ฐ€์ง€ํ‘œ

 

์˜ค์ฐจํ–‰๋ ฌ ๋ถ„์„

๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ํ‘œ๋กœ, ์‹ค์ œ ๊ฐ’๊ณผ ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๊ฐ’ ๊ฐ„์˜ ๋น„๊ต๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋งŒ๋“ค์–ด์ง„๋‹ค.

์–ด๋–ค ์˜ˆ์ธก์ด ์ •ํ™•ํ•˜๊ณ  ์–ด๋–ค ์˜ˆ์ธก์ด ํ‹€๋ ธ๋Š”์ง€์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜๋ฏ€๋กœ ๋ถˆ๊ท ํ˜•ํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ๋ถ„์„ํ•˜๋Š” ๋ฐ ์ด์ƒ์ ์ด๋‹ค. 

 

 

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[86,  2],
       [ 9,  3]])

 

 

 

์šฐ๋ฆฌ๋Š” ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•˜์—ฌ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ ๋ณ„์„ ๋ชจ๋‘ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค. ๊ฐ€๋Šฅํ•œ ๋งŽ์€ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ์ฐพ๋Š” ๊ฒƒ์ด ์ข‹์œผ๋ฏ€๋กœ ์žฌํ˜„์œจ์„ ์ค‘์š”ํ•˜๊ฒŒ ๋ณด์ž. ์žฌํ˜„์œจ์€ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ ํ–‰์„ฑ ์ค‘์—์„œ ์‹ค์ œ๋กœ ๋ช‡๊ฐœ์˜ ํ–‰์„ฑ์„ ์ฐพ์•˜๋Š”์ง€๋ฅผ ์•Œ๋ ค์ค€๋‹ค.

 

 

 

- ์ •๋ฐ€๋„ : 3 / (3 + 2) = 60%  ํ™•๋ฅ ๋กœ ์™ธ๊ณ„ํ–‰์„ฑ์ด๋ผ๊ณ  ์˜ˆ์ธกํ•œ ๊ฒƒ์ด ์‹ค์ œ๋กœ ์ •๋‹ต์ด๋‹ค. 

- ์žฌํ˜„์œจ (Recall) : 3 / ( 3 + 9) = 25% ํ™•๋ฅ ๋กœ ์–‘์„ฑ ์ƒ˜ํ”Œ์„ ์ฐพ๋Š”๋‹ค. 

- ์ •ํ™•๋„ : 89 / 100 = 89% ํ™•๋ฅ ๋กœ ์ถ”์ธกํ•œ ๋‹ต์ด ์ •๋‹ต์ด๋‹ค.

 

classification_report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

 

              precision    recall  f1-score   support

           0       0.91      0.98      0.94        88
           1       0.60      0.25      0.35        12

    accuracy                           0.89       100
   macro avg       0.75      0.61      0.65       100
weighted avg       0.87      0.89      0.87       100

 

[0] ์€ ์Œ์„ฑ ์ƒ˜ํ”Œ์— ๋Œ€ํ•œ precison, recall ์ด๊ณ  [1]์€ ์–‘์„ฑ ์ƒ˜ํ”Œ์— ๋Œ€ํ•œ  precison, recall ์ด๋‹ค.

 

F1 ์ ์ˆ˜๋Š” ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์˜ ์กฐํ™” ํ‰๊ท ์ด๋‹ค. ์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ์˜ ๋ถ„๋ชจ๊ฐ€ ๋‹ค๋ฅด๋ฏ€๋กœ ์ด๋ฅผ ๋™์ผํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ•œ๋‹ค.

์ •๋ฐ€๋„์™€ ์žฌํ˜„์œจ ๋ชจ๋‘ ์ค‘์š”ํ•  ๋•Œ๋Š” F1 ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ข‹๋‹ค. 0 ~ 1 ์‚ฌ์ด ๊ฐ’์ด๋ฉฐ 1์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฐ’์ด๋‹ค.

 

ROC ๊ณก์„ 

 

๋ถˆ๊ท ํ˜• ๋ฐ์ดํ„ฐ ๋ฆฌ์ƒ˜ํ”Œ๋ง 

 

์•ž์„  classification_report ์—์„œ ๊ต‰์žฅํžˆ ๋‚ฎ์€ ์žฌํ˜„์œจ ์ ์ˆ˜๋ฅผ ๊ธฐ๋กํ•˜๋Š” ๊ฒƒ์„ ๋ณด์•˜๋‹ค.

๊ทธ ์ด์œ ๋Š” ์• ์ดˆ์— ์–‘์„ฑ ์ƒ˜ํ”Œ ๊ฐœ์ˆ˜๊ฐ€ ์Œ์„ฑ ์ƒ˜ํ”Œ๋ณด๋‹ค ํ„ฑ์—†์ด ๋ชจ์ž๋ž๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

 

  • ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ž€, ๋ชฉํ‘œ ๋ณ€์ˆ˜(Target)์˜ ํด๋ž˜์Šค ๋น„์œจ์ด ํฌ๊ฒŒ ์ฐจ์ด ๋‚˜๋Š” ๊ฒฝ์šฐ๋ฅผ ๋งํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ฐ์ดํ„ฐ์—์„œ Positive ํด๋ž˜์Šค(1)๊ฐ€ 10%, Negative ํด๋ž˜์Šค(0)๊ฐ€ 90%๋ผ๋ฉด ๋ชจ๋ธ์€ Negative ํด๋ž˜์Šค์— ์น˜์šฐ์นœ ์˜ˆ์ธก์„ ํ•˜๊ฒŒ ๋  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๋‹ค.
  • scale_pos_weight๋Š” ์ด๋Ÿฌํ•œ ์ƒํ™ฉ์„ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•ด Positive ํด๋ž˜์Šค์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ, Positive ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ•™์Šต์˜ ์ค‘์š”์„ฑ์„ ๋†’์ด๋Š” ์—ญํ• 

 

 

์˜ˆ๋ฅผ ๋“ค์–ด์„œ ์–‘์„ฑ ์ƒ˜ํ”Œ์ด ๋‘๊ฐœ ๋ฟ์ด๋ผ๋ฉด, ๊ทธ ์ค‘ ํ•˜๋‚˜๋งŒ ์ œ๋Œ€๋กœ ๋ชป๋งž์ถฐ๋„ 50% ์žฌํ˜„์œจ์„ ๊ธฐ๋กํ•œ๋‹ค.

 

๋”ฐ๋ผ์„œ ๋‚ฎ์€ ์žฌํ˜„์œจ ์ ์ˆ˜๋ฅผ ๋งŒ๋“œ๋Š”๋ฐ ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ๊ณ ์น˜๊ธฐ ์œ„ํ•ด ๋ถˆ๊ท ํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•  ๊ฒƒ์ด๋‹ค.

๋‹ค์ˆ˜ ํด๋ž˜์Šค์˜ ์ƒ˜ํ”Œ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์–ธ๋”์ƒ˜ํ”Œ๋ง ํ•˜๊ฑฐ๋‚˜ ์†Œ์ˆ˜ ํด๋ž˜์Šค ์ƒ˜ํ”Œ์„ ๋Š˜๋ฆฌ๊ธฐ ์œ„ํ•ด ์˜ค๋ฒ„์ƒ˜ํ”Œ๋งํ•  ์ˆ˜ ์žˆ๋‹ค.

 

์–ธ๋”์ƒ˜ํ”Œ๋ง

 

๋ฐ์ดํ„ฐ์—์„œ ๋‹ค์ˆ˜ ํด๋ž˜์Šค(majority class)์˜ ์ƒ˜ํ”Œ์„ ์ค„์—ฌ์„œ ์†Œ์ˆ˜ ํด๋ž˜์Šค(minority class)์™€ ๊ท ํ˜•์„ ๋งž์ถ”๋Š” ๋ฐฉ๋ฒ•.

ํด๋ž˜์Šค ๋ถ„ํฌ๋ฅผ ๊ท ํ˜• ์žˆ๊ฒŒ ๋งŒ๋“ค์–ด, ๋ชจ๋ธ์ด ๋‹ค์ˆ˜ ํด๋ž˜์Šค์— ์น˜์šฐ์น˜์ง€ ์•Š๋„๋ก ํ•™์Šต์„ ์œ ๋„ํ•œ๋‹ค.

 

๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋ฅผ ์ค„์ด๊ธฐ ๋•Œ๋ฌธ์— ์ฒ˜๋ฆฌ ์†๋„๊ฐ€ ๋นจ๋ผ์ง€๋ฉฐ, ๊ฐ„๋‹จํ•˜๊ณ  ๊ตฌํ˜„์ด ์šฉ์ดํ•˜๋‹ค.

๊ทธ๋Ÿฌ๋‚˜, ๋‹ค์ˆ˜ ํด๋ž˜์Šค์˜ ์ •๋ณด๋ฅผ ์†์‹คํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ๋‹ค.

 

์–ธ๋”์ƒ˜ํ”Œ๋ง ๊ธฐ๋ฒ•

  • ๋žœ๋ค ์–ธ๋”์ƒ˜ํ”Œ๋ง(Random Under-sampling): ๋‹ค์ˆ˜ ํด๋ž˜์Šค ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ์ œ๊ฑฐ.
  • ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ธฐ๋ฐ˜ ์ƒ˜ํ”Œ๋ง: ํด๋Ÿฌ์Šคํ„ฐ๋ง์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃนํ™”ํ•˜๊ณ , ๊ฐ ๊ทธ๋ฃน์—์„œ ๋Œ€ํ‘œ ์ƒ˜ํ”Œ์„ ์„ ํƒ.

 

xgb_clf ํ•จ์ˆ˜๋Š” ์–ธ๋”์ƒ˜ํ”Œ๋ง ๊ฒฐ๊ณผ๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ๋„๋ก ์žฌํ˜„์œจ ์ ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

import pandas as pd
from sklearn.metrics import recall_score, confusion_matrix, classification_report

def xgb_clf(model, nrows):

    # Read only the first nrows rows of the dataset
    df = pd.read_csv('exoplanets.csv', nrows=nrows)

    # Split the data into X and y
    X = df.iloc[:,1:]
    y = df.iloc[:,0] - 1

    # Split the data into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

    # Train the model on the training set
    model.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(X_test)

    score = recall_score(y_test, y_pred)
    
    print(confusion_matrix(y_test, y_pred))
    
    print(classification_report(y_test, y_pred))
        
    return score

 

nrows ๋ฅผ ๋ฐ”๊พธ์–ด ๊ฐ€๋ฉด์„œ ์žฌํ˜„์œจ ์ ์ˆ˜๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด์ž.

xgb_clf(XGBClassifier(), nrows=800)

 

[[189   1]
 [  9   1]]
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       190
           1       0.50      0.10      0.17        10

    accuracy                           0.95       200
   macro avg       0.73      0.55      0.57       200
weighted avg       0.93      0.95      0.93       200

0.1

 

์™ธ๊ณ„ ํ–‰์„ฑ์ด ์—†๋Š” ๋ณ„์˜ ์žฌํ˜„์œจ์€ ๊ฑฐ์˜ ์™„๋ฒฝํ•˜์ง€๋งŒ, ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์žˆ๋Š” ๋ณ„์˜ ์žฌํ˜„์œจ์€ 10%์— ๋ถˆ๊ณผํ•˜๋‹ค.

 

๋งŒ์ผ ์™ธ๊ณ„ ํ–‰์„ฑ์„ ์ง€๋‹Œ ๋ณ„๊ณผ ์—†๋Š” ๋ณ„์˜ ๊ฐœ์ˆ˜๋ฅผ 37๊ฐœ๋กœ ๋™์ผํ•˜๊ฒŒ ๋งž์ถ”๋ฉด ๊ท ํ˜•์ด ๋งž๋Š”๋‹ค.

xgb_clf(XGBClassifier(), nrows=74)

 

[[6 2]
 [5 6]]
              precision    recall  f1-score   support

           0       0.55      0.75      0.63         8
           1       0.75      0.55      0.63        11

    accuracy                           0.63        19
   macro avg       0.65      0.65      0.63        19
weighted avg       0.66      0.63      0.63        19

0.5454545454545454

 

์˜ค๋ฒ„์ƒ˜ํ”Œ๋ง

 

๋ฐ์ดํ„ฐ์—์„œ ์†Œ์ˆ˜ ํด๋ž˜์Šค(minority class)์˜ ์ƒ˜ํ”Œ์„ ๋Š˜๋ ค์„œ ๋‹ค์ˆ˜ ํด๋ž˜์Šค์™€ ๊ท ํ˜•์„ ๋งž์ถ”๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

์†Œ์ˆ˜ ํด๋ž˜์Šค ๋ฐ์ดํ„ฐ๋ฅผ ๋Š˜๋ ค์„œ ๋ชจ๋ธ์ด ์†Œ์ˆ˜ ํด๋ž˜์Šค์— ๋Œ€ํ•ด ๋” ์ž˜ ํ•™์Šตํ•˜๋„๋ก ์œ ๋„ํ•œ๋‹ค.

 

๊ธฐ์กด ๋ฐ์ดํ„ฐ๋ฅผ ๋ณด์กดํ•˜๋ฉด์„œ ๋ชจ๋ธ์ด ์†Œ์ˆ˜ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ•™์Šต์„ ๊ฐ•ํ™”ํ•  ์ˆ˜ ์žˆ์œผ๋‚˜, ๊ณผ์ ํ•ฉ(overfitting)์˜ ์œ„ํ—˜์ด ์žˆ๋‹ค.

 

With nrows = 400, the negative:positive ratio is about 10:1, so to balance the classes we increase the positive samples tenfold.

 

์ด๋ฅผ ์œ„ํ•œ ์ „๋žต์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

  • ์–‘์„ฑ ํด๋ž˜์Šค ์ƒ˜ํ”Œ์„ ์•„ํ™‰๋ฒˆ ๋ณต์‚ฌํ•œ ์ƒˆ๋กœ์šด df ๋ฅผ ๋งŒ๋“ ๋‹ค.
  • ์ƒˆ๋กœ์šด df ์™€ ์›๋ณธ df ๋ฅผ ํ•ฉ์ณ์„œ 1:1 ๋น„์œจ์„ ๋งŒ๋“ ๋‹ค.

์—ฌ๊ธฐ์„œ ์ฃผ์˜ํ•  ์ ์ด ์žˆ๋Š”๋ฐ, ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ„๊ธฐ ์ „์— ๋ฆฌ์ƒ˜ํ”Œ๋งํ•˜๋ฉด ์žฌํ˜„์œจ ์ ์ˆ˜๊ฐ€ ๋ถ€ํ’€๋ ค์ง„๋‹ค.

์™œ์ผ๊นŒ? ๋‚˜๋ˆ„๊ธฐ์ „์— ์˜ค๋ฒ„ ์ƒ˜ํ”Œ๋งํ•˜๊ณ  ์ด๋ฅผ ํ›ˆ๋ จ/ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ„๋ฉด ๋ณต์‚ฌ๋ณธ์ด ๋‘ ์„ธํŠธ ๋ชจ๋‘์— ๋“ค์–ด๊ฐˆ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. 

๋”ฐ๋ผ์„œ ์ด๋ฏธ ํ•™์Šตํ•œ ๋™์ผํ•œ ์ƒ˜ํ”Œ์— ๋Œ€ํ•ด ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ•˜๋ฏ€๋กœ ์ œ๋Œ€๋กœ๋œ Test ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ๋‹ค๊ณ  ๋ณด๊ธฐ ์–ด๋ ต๊ฒŒ ๋œ๋‹ค.

 

์ ์ ˆํ•œ ๋ฐฉ๋ฒ•์€ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ๋จผ์ € ๋‚˜๋ˆ„๊ณ  ๊ทธ๋‹ค์Œ ๋ฆฌ์ƒ˜ํ”Œ๋ง์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. 

 

 

1. x_train ๊ณผ y_train ์„ ํ•ฉ์นœ๋‹ค.

๋‘ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ์ธ๋ฑ์Šค ๊ฐ’์ด ๊ฐ™์€ ํ–‰๋ผ๋ฆฌ ๋ณ‘ํ•ฉ๋œ๋‹ค.

df_train = pd.merge(y_train, X_train, left_index=True, right_index=True)

 

 

2. Build a new DataFrame, newdf, with np.repeat().

- Convert the positive samples to a NumPy array.

- Specify 9 as the number of copies.

- Specify axis=0 so the copies are stacked row-wise.

- Copy over the column names and concatenate the DataFrames.

newdf = pd.DataFrame(np.repeat(df_train[df_train['LABEL']==1].values,
                               9,axis=0))
newdf.columns = df_train.columns
df_train_resample = pd.concat([df_train, newdf])

df_train_resample['LABEL'].value_counts()
0.0    275
1.0    250
Name: LABEL, dtype: int64

 

์–‘์„ฑ ์ƒ˜ํ”Œ๊ณผ ์Œ์„ฑ ์ƒ˜ํ”Œ ์‚ฌ์ด ๊ท ํ˜•์ด ๋งž๋„๋ก ๋ฐ์ดํ„ฐ๊ฐ€ ์ฆ๊ฐ•๋œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค!

 

3. ๋ฆฌ์ƒ˜ํ”Œ๋ง ๋œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ X์™€ y๋กœ ๋‚˜๋ˆˆ๋‹ค.

X_train_resample = df_train_resample.iloc[:,1:]
y_train_resample = df_train_resample.iloc[:,0]

 

 

4. ๋ชจ๋ธ ํ›ˆ๋ จ

 

# XGBClassifier๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
model = XGBClassifier()

# ํ›ˆ๋ จ ์„ธํŠธ๋กœ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.
model.fit(X_train_resample, y_train_resample)

# ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋Œ€ํ•ด ์˜ˆ์ธก์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
y_pred = model.predict(X_test)

score = recall_score(y_test, y_pred)

 

5. ์˜ค์ฐจ ํ–‰๋ ฌ๊ณผ ๋ถ„๋ฅ˜ ๋ฆฌํฌํŠธ ์ถœ๋ ฅ

 

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test, y_pred))

print(score)
[[86  2]
 [ 8  4]]
              precision    recall  f1-score   support

           0       0.91      0.98      0.95        88
           1       0.67      0.33      0.44        12

    accuracy                           0.90       100
   macro avg       0.79      0.66      0.69       100
weighted avg       0.89      0.90      0.88       100

0.3333333333333333

 

์‹œ์ž‘ํ•  ๋•Œ ๋งŒ๋“  ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ค๋ฒ„์ƒ˜ํ”Œ๋ง์œผ๋กœ 33.3% ์žฌํ˜„์œจ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.

์—ฌ์ „ํžˆ ๋‚ฎ์ง€๋งŒ ์ด์ „์— ์–ป์€ 17%๋ณด๋‹ค๋Š” ๋‘๋ฐฐ ๋†’์€ ์ ์ˆ˜์ด๋‹ค.

 

๋ฆฌ์ƒ˜ํ”Œ๋ง์œผ๋กœ๋Š” ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ์˜ฌ๋ผ๊ฐ€์ง€ ์•Š์•˜์œผ๋ฏ€๋กœ XGBoost์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํŠœ๋‹ํ•ด๋ณผ ์ฐจ๋ก€์ด๋‹ค.

 

XGBClassifier ํŠœ๋‹

 

๊ฐ€๋Šฅํ•œ ์ตœ์ƒ์˜ ์žฌํ˜„์œจ ์ ์ˆ˜๋ฅผ ์–ป๋„๋ก ํŠœ๋‹ํ•  ๊ฒƒ์ด๋‹ค.

 

1. Adjust the class weight with the scale_pos_weight parameter, and

2. use a grid search to find the best combination of hyperparameters.

 

๊ฐ€์ค‘์น˜ ์กฐ์ •ํ•˜๊ธฐ

 

scale_pos_weight์˜ ๊ฐ’์€ ๋‹ค์ˆ˜ ํด๋ž˜์Šค์™€ ์†Œ์ˆ˜ ํด๋ž˜์Šค ๊ฐ„์˜ ๋น„์œจ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ค์ •๋œ๋‹ค.

 

 

# ๋ฐ์ดํ„ฐ๋ฅผ X์™€ y๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
X = df.iloc[:,1:]
y = df.iloc[:,0]

# ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
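Rather than hard-coding the weight, the ratio can be checked directly from the training labels:

# Negative-to-positive ratio of the training labels
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
print(neg / pos)   # about 10 on the 400-row subset; about 137 (5050/37) on the full dataset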

 

 

model = XGBClassifier(scale_pos_weight=10)

model.fit(X_train, y_train)

# ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋Œ€ํ•œ ์˜ˆ์ธก์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
y_pred = model.predict(X_test)

score = recall_score(y_test, y_pred)

print(confusion_matrix(y_test, y_pred))

print(classification_report(y_test, y_pred))

print(score)

 

์žฌํ˜„์œจ:  [0.10526316 0.27777778]
์žฌํ˜„์œจ ํ‰๊ท :  0.1915204678362573

 

์ด ๊ฒฐ๊ณผ๋Š” ๋ฆฌ์ƒ˜ํ”Œ๋ง์œผ๋กœ ์–ป์€ ๊ฒƒ๊ณผ ๋™์ผํ•˜๋‹ค. 

์ง์ ‘ ๊ตฌํ˜„ํ•œ ์˜ค๋ฒ„ ์ƒ˜ํ”Œ๋ง ๋ฐฉ๋ฒ•์€  scale_pos_weight ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋งŒ๋“  XGBClassifier์™€ ๋™์ผํ•œ ์˜ˆ์ธก์„ ๋งŒ๋“ ๋‹ค.

 

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ( grid_search )

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ•  ๋•Œ์—๋Š” ๋žœ๋ค or ๊ทธ๋ฆฌ๋“œ ์„œ์น˜๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ํ‘œ์ค€์ด๋‹ค. ๋‘ ํด๋ž˜์Šค ๋ชจ๋‘ ๋‘ ๊ฐœ ์ด์ƒ์˜ ํด๋“œ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์—ฌ๋Ÿฌ ํด๋“œ๋ฅผ ํ…Œ์ŠคํŠธ ํ•˜๋Š” ๊ฒƒ์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋ฏ€๋กœ ๋‘๊ฐœ์˜ ํด๋“œ๋งŒ ์‚ฌ์šฉํ•œ๋‹ค. ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•˜์—ฌ StratifiedFold ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๊ถŒ์žฅ๋œ๋‹ค.

 

๐Ÿ“Œ StratifiedFold ๋ž€?

 

StratifiedKFold๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ต์ฐจ ๊ฒ€์ฆ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ, ํด๋ž˜์Šค ๋น„์œจ์„ ๊ท ํ˜• ์žˆ๊ฒŒ ์œ ์ง€ํ•˜๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ๊ฐ ํด๋“œ(Fold) ๋‚ด์—์„œ ์›๋ž˜ ๋ฐ์ดํ„ฐ์˜ ํด๋ž˜์Šค ๋น„์œจ์„ ์œ ์ง€ํ•˜๋„๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ Positive:Negative ๋น„์œจ์ด 1:9๋ผ๋ฉด, ๊ฐ ํด๋“œ์—์„œ๋„ ์ด ๋น„์œจ์ด ์œ ์ง€๋œ๋‹ค.

 

 

๊ธฐ์ค€ ๋ชจ๋ธ

k-fold ๊ต์ฐจ ๊ฒ€์ฆ์œผ๋กœ ๊ธฐ์ค€ ๋ชจ๋ธ์„ ๋งŒ๋“ ๋‹ค.

 

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, cross_val_score

kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=2)

model = XGBClassifier(scale_pos_weight=10)

# Compute the cross-validation scores
scores = cross_val_score(model, X, y, cv=kfold, scoring='recall')

# Print the recall scores
print('Recall: ', scores)

# Print the mean recall
print('Mean recall: ', scores.mean())
Recall:  [0.10526316 0.27777778]
Mean recall:  0.1915204678362573

 

 

๊ต์ฐจ ๊ฒ€์ฆ์„ ํ•˜๋‹ˆ๊นŒ ์„ฑ๋Šฅ์ด ๋”์šฑ ๋‚˜๋น ์กŒ๋‹ค..ใ…Žใ…Ž ์•„๋ฌด๋ž˜๋„ ์–‘์„ฑ ์ƒ˜ํ”Œ์ด ์ ์„ ๋•Œ์—๋Š” ์–ด๋–ค ์ƒ˜ํ”Œ์ด ํ›ˆ๋ จ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ํฌํ•จ๋˜๋Š”์ง€๊ฐ€ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค ๊ฒƒ์ด๋‹ค. 

 

grid_search

def grid_search(params, random=False, X=X, y=y, 
                model=XGBClassifier(scale_pos_weight=10, random_state=2)): 
    
    xgb = model
    
    if random:
        grid = RandomizedSearchCV(xgb, params, cv=kfold, n_jobs=-1, 
                                  random_state=2, scoring='recall')
    else:
        # Initialize the grid search object
        grid = GridSearchCV(xgb, params, cv=kfold, n_jobs=-1, scoring='recall')
    
    # Fit on X and y
    grid.fit(X, y)

    # Extract the best parameters
    best_params = grid.best_params_

    # Print the best parameters
    print("Best parameters:", best_params)
    
    # Extract the best score
    best_score = grid.best_score_

    # Print the best score
    print("Best score: {:.5f}".format(best_score))

 

 

๊ทธ๋ฆฌ๊ณ  ๋…ธ๊ฐ€๋‹ค๋ฅผ ๋Œ๋ฆฐ๋‹ค..

 

 

์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜: {'gamma': 0.025, 'learning_rate': 0.001, 'max_depth': 2}
์ตœ์ƒ์˜ ์ ์ˆ˜: 0.53509

 

์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜: {'subsample': 0.3, 'colsample_bytree': 0.7, 'colsample_bynode': 0.7, 'colsample_bylevel': 1}
์ตœ์ƒ์˜ ์ ์ˆ˜: 0.37865

 

๊ท ํ˜• ์žกํžŒ ์„œ๋ธŒ์…‹ 

 

74๊ฐœ ์ƒ˜ํ”Œ๋กœ ์ด๋ฃจ์–ด์ง„ ๊ท ํ˜•์žกํžŒ ์„œ๋ธŒ์…‹์€ ์ตœ์†Œํ•œ ์–‘์˜ ๋ฐ์ดํ„ฐ์ด๊ธฐ์— ํ…Œ์ŠคํŠธํ•˜๊ธฐ๋„ ๋น ๋ฅด๋‹ค.

์ด ์„œ๋ธŒ์…‹์— ๊ทธ๋ฆฌ๋“œ ์„œ์น˜๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฐพ๋Š”๋‹ค. 

 

X_short = X.iloc[:74, :]
y_short = y.iloc[:74]
grid_search(params={'max_depth':[1, 2, 3], 
                    'colsample_bynode':[0.5, 0.75, 1]}, 
            X=X_short, y=y_short, 
            model=XGBClassifier(random_state=2))
์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜: {'colsample_bynode': 0.5, 'max_depth': 1}
์ตœ์ƒ์˜ ์ ์ˆ˜: 0.65205

 

์ „์ฒด ๋ฐ์ดํ„ฐ๋กœ ํŠœ๋‹ํ•˜๊ธฐ

 

์ „์ฒด ๋ฐ์ดํ„ฐ๋กœ grid_search() ๋ฅผ ํ˜ธ์ถœํ•˜๋ฉด ๊ต‰์žฅํžˆ ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค.

 

 

์Œ์„ฑ ํด๋ž˜์Šค ๊ฐœ์ˆ˜ / ์–‘์„ฑ ํด๋ž˜์Šค ๋กœ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํ•ด๋‹น ๊ฐ€์ค‘์น˜๋ฅผ scale_pos_weight ์— ์ ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๋งŒ๋“ ๋‹ค. 

# scale_pos_weight = negatives / positives
weight = int(5050/37)

model = XGBClassifier(scale_pos_weight=weight)

# Compute the cross-validation scores (X_all and y_all hold the full 5,087-row dataset)
scores = cross_val_score(model, X_all, y_all, cv=kfold, scoring='recall')

# Print the recall scores
print('Recall:', scores)

# Print the mean recall
print('Mean recall:', scores.mean())

 

 

์ ์ˆ˜๊ฐ€ ์•„์ฃผ ์ข‹์ง€ ์•Š๋‹ค.

์žฌํ˜„์œจ: [0.10526316 0.        ]
์žฌํ˜„์œจ ํ‰๊ท : 0.05263157894736842

 

 

์ง€๊ธˆ๊นŒ์ง€ ์ œ์ผ ์ข‹์•˜๋˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ•ด๋ณธ๋‹ค. 

์ด ์ ์ˆ˜๋Š” ์•ž์„œ ์–ธ๋”์ƒ˜ํ”Œ๋งํ•œ ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฐ๊ณผ๋งŒํผ์€ ์•„๋‹ˆ์ง€๋งŒ ๋” ๋‚˜์•„์กŒ๋‹ค. 

 

 

 

์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ ์ ์ˆ˜๊ฐ€ ๋‚ฎ๊ณ  ์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ ธ์œผ๋‹ˆ "์™ธ๊ณ„ ํ–‰์„ฑ ๋ฐ์ดํ„ฐ์…‹์˜ ์ž‘์€ ์„œ๋ธŒ์…‹์—์„œ ML ๋ชจ๋ธ์ด ๋” ์ž˜ ๋™์ž‘ํ• ๊นŒ์š”?" ๋ผ๋Š” ์งˆ๋ฌธ์ด ์ƒ๊ธด๋‹ค.

 

๊ฒฐ๊ณผ ํ†ตํ•ฉ

์ง€๊ธˆ๊นŒ์ง€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์„œ๋ธŒ์…‹์„ ์‹œ๋„ํ•ด๋ณด์•˜๋‹ค.

 

  • 5050๊ฐœ ์ƒ˜ํ”Œ -> ์•ฝ 54% ์žฌํ˜„์œจ
  • 400๊ฐœ ์ƒ˜ํ”Œ -> ์•ฝ 54% ์žฌํ˜„์œจ
  • 74๊ฐœ ์ƒ˜ํ”Œ -> ์•ฝ 68% ์žฌํ˜„์œจ

๊ฐ€์žฅ ์ข‹์€ ์ ์ˆ˜๋ฅผ ๋‚ธ ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” learning_rate = 0.001, max_depth=2, colsample_bynode=0.5 ์ด๋‹ค.

 

์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ 37๊ฐœ์˜ ๋ณ„์„ ๋ชจ๋‘ ํฌํ•จํ•ด ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•œ๋‹ค. ์ด๋Š” ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋ชจ๋ธ ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•œ ์ƒ˜ํ”Œ์ด ํฌํ•จ๋œ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค.

์ผ๋ฐ˜์ ์œผ๋กœ ์ด๋Š” ์ข‹์€ ์ƒ๊ฐ์ด ์•„๋‹ˆ์ง€๋งŒ, ์ด ์˜ˆ์ œ์—์„œ๋Š” ์–‘์„ฑ ํด๋ž˜์Šค๊ฐ€ ๋งค์šฐ ์ ๊ธฐ ๋•Œ๋ฌธ์— ์ด์ „์— ๋ณธ ์  ์—†๋Š” ์–‘์„ฑ ์ƒ˜ํ”Œ๋กœ ์ด๋ฃจ์–ด์ง„ ๋” ์ž‘์€ ์„œ๋ธŒ์…‹์„ ํ…Œ์ŠคํŠธ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ณด๋Š”๋ฐ ๋„์›€์ด ๋  ์ˆ˜ ์žˆ๋‹ค.

 

def final_model(X, y, model):
    model.fit(X, y)
    y_pred = model.predict(X_all)
    score = recall_score(y_all, y_pred)
    print(score)
    print(confusion_matrix(y_all, y_pred))
    print(classification_report(y_all, y_pred))

 

 

74๊ฐœ ์ƒ˜ํ”Œ

final_model(X_short, y_short, 
            XGBClassifier(max_depth=2, colsample_bynode=0.5, 
                          random_state=2))

 

1.0
[[3588 1462]
 [   0   37]]
              precision    recall  f1-score   support

           0       1.00      0.71      0.83      5050
           1       0.02      1.00      0.05        37

    accuracy                           0.71      5087
   macro avg       0.51      0.86      0.44      5087
weighted avg       0.99      0.71      0.83      5087

 

์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ 37๊ฐœ์˜ ๋ณ„์„ ๋ชจ๋‘ ์™„๋ฒฝํ•˜๊ฒŒ ๋ถ„๋ฅ˜ํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์—†๋Š” 0.29 ์˜ ์ƒ˜ํ”Œ(1,462) ๋ฅผ ์ž˜๋ชป ๋ถ„๋ฅ˜ํ–ˆ๋‹ค.

๊ฒŒ๋‹ค๊ฐ€ ์ •๋ฐ€๋„๋Š” 2%์ด๋ฉฐ, F1 ์ ์ˆ˜๋Š” 5%์ด๋‹ค. ์žฌํ˜„์œจ๋งŒ ํŠœ๋‹ํ•  ๋•Œ์—๋Š” ์ง€๋‚˜์น˜๊ฒŒ ๋‚ฎ์€ ์ •๋ฐ€๋„์™€ F1 ์ ์ˆ˜๊ฐ€ ์œ„ํ—˜ ์š”์ธ์ด๋‹ค.

 

 

400๊ฐœ ์ƒ˜ํ”Œ

final_model(X, y, 
            XGBClassifier(max_depth=2, colsample_bynode=0.5, 
                          scale_pos_weight=10, random_state=2))

 

1.0
[[4897  153]
 [   0   37]]
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      5050
           1       0.19      1.00      0.33        37

    accuracy                           0.97      5087
   macro avg       0.60      0.98      0.66      5087
weighted avg       0.99      0.97      0.98      5087

 

์—ฌ๊ธฐ๋„ ์žฌํ˜„์œจ 100%๋ฅผ ๋‹ฌ์„ฑํ•˜์ง€๋งŒ,

์™ธ๊ณ„ ํ–‰์„ฑ์ด ์—†๋Š” ๊ฒฝ์šฐ์˜ ์žฌํ˜„์œจ์ด 0.19๋กœ 149๊ฐœ์˜ ๋ณ„์„ ์ž˜๋ชป ๋ถ„๋ฅ˜ํ•œ๋‹ค.

์ด ๊ฒฝ์šฐ ์™ธ๊ณ„ ํ–‰์„ฑ์ด ์žˆ๋Š” ๋ณ„ 37๊ฐœ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ 190๊ฐœ์˜ ๋ณ„์„ ๋ถ„์„ํ•ด์•ผํ•œ๋‹ค.

 

5,050 ์ƒ˜ํ”Œ

final_model(X_all, y_all, 
            XGBClassifier(max_depth=2, colsample_bynode=0.5, 
                          scale_pos_weight=weight, random_state=2))
1.0
[[5050    0]
 [   0   37]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5050
           1       1.00      1.00      1.00        37

    accuracy                           1.00      5087
   macro avg       1.00      1.00      1.00      5087
weighted avg       1.00      1.00      1.00      5087

 

๋ชจ๋“  ์˜ˆ์ธก, ์žฌํ˜„์œจ, ์ •๋ฐ€๋„๊ฐ€ 100%๋กœ ์™„๋ฒฝํ•˜๋‹ค. 

 

ํ•˜์ง€๋งŒ ์œ ๋…ํ•ด์•ผํ•  ์ ์€, ์›๋ž˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ ๋ชจ๋ธ์ด ๋ณธ ์  ์—†๋Š” ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ํ•„์ˆ˜์ ์ด์ง€๋งŒ ์—ฌ๊ธฐ์—์„œ๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ์ด ์ ์ˆ˜๋ฅผ ์–ป์—ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ชจ๋ธ์ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ ํ•™์Šตํ–ˆ๋”๋ผ๋„ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์— ์ž˜ ์ผ๋ฐ˜ํ™”๋  ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ๋‹ค.

 

๋ฐ์ดํ„ฐ์— ์žˆ๋Š” ๋ฏธ๋ฌ˜ํ•œ ํŒจํ„ด์„ ์žก์•„๋‚ด๋ ค๋ฉด ๋” ๋งŽ์€ ํŠธ๋ฆฌ์™€ ๋” ๋งŽ์€ ํŠœ๋‹์ด ํ•„์š”ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

๊ฒฐ๊ณผ ๋ถ„์„

์ •๋ฐ€๋„๋ฅผ ์‚ฌ์šฉํ•œ ์‚ฌ์šฉ์ž๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ 50~70%๋ฅผ ๋‹ฌ์„ฑํ–ˆ๊ณ , ์žฌํ˜„์œจ์„ ์‚ฌ์šฉํ•œ ์‚ฌ์šฉ์ž๋Š” 60~100%๋ฅผ ๋‹ฌ์„ฑํ–ˆ๋‹ค. 

 

๋ถˆ๊ท ํ˜•ํ•œ ๋ฐ์ดํ„ฐ์˜ ํ•œ๊ณ„๋ฅผ ์•Œ๊ณ  ์žˆ๋‹ค๋ฉด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์€ ์ตœ๋Œ€ 70% ์žฌํ˜„์œจ์ด๋ฉฐ, ์™ธ๊ณ„ ํ–‰์„ฑ์„ ๊ฐ€์ง„ 37๊ฐœ์˜ ๋ณ„๋กœ๋Š” ์ƒ๋ช…์ฒด๋‚˜ ๋‹ค๋ฅธ ํ–‰์„ฑ์„ ์ฐพ๊ธฐ ์œ„ํ•ด ๊ฐ•๋ ฅํ•œ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ์—๋Š” ์ถฉ๋ถ„ํ•˜์ง€ ์•Š๋‹ค.