๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

CS/์ธ๊ณต์ง€๋Šฅ

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ๋ฅผ ์‚ฌ์šฉํ•œ ์ŠคํŒธ ๋ฉ”์ผ ๋ถ„๋ฅ˜๊ธฐ (Spam Classification via Naïve Bayes)

728x90

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋ž€?

ํ™•๋ฅ  ์ด๋ก ์„ ๋ฐ”ํƒ•์œผ๋กœ ํ•œ ์ง€๋„ ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜๋กœ, ์ฃผ๋กœ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜, ์ŠคํŒธ ํ•„ํ„ฐ๋ง, ๊ฐ์„ฑ ๋ถ„์„๊ณผ ๊ฐ™์€ ๋ฌธ์ œ์— ๋„๋ฆฌ ์‚ฌ์šฉ๋œ๋‹ค.

๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฉฐ, ๋ฐ์ดํ„ฐ์˜ ๊ฐ ํŠน์ง•๋“ค์ด ๋…๋ฆฝ์ ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๋Š” ๋‚˜์ด๋ธŒ(naïve)ํ•œ ํŠน์ง•์„ ์ง€๋‹Œ๋‹ค.

 

๋ฒ ์ด์ฆˆ ์ •๋ฆฌ(Bayes' Theorem)

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋Š” ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ์— ๋”ฐ๋ผ ์ž‘๋™ํ•œ๋‹ค. ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋Š” ์–ด๋–ค ์‚ฌ๊ฑด A์™€ B๊ฐ€ ์žˆ์„ ๋•Œ, ์‚ฌ๊ฑด B๊ฐ€ ์ผ์–ด๋‚ฌ์„ ๋•Œ ์‚ฌ๊ฑด A๊ฐ€ ์ผ์–ด๋‚  ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ์ œ๊ณตํ•œ๋‹ค.

 

 

 

 

 

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ์˜ ์ค‘์š”ํ•œ ๊ฐ€์ •์€ ๋ชจ๋“  ํŠน์ง•๋“ค์ด ์„œ๋กœ ๋…๋ฆฝ์ ์ด๋ผ๋Š” ๊ฒƒ์ด๋‹ค. ์ฆ‰, ํŠน์ • ๋ฐ์ดํ„ฐ์˜ ํ•œ ํŠน์ง•์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋‹ค๋ฅธ ํŠน์ง•๋“ค์ด ์ด์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. ํ˜„์‹ค์—์„œ๋Š” ์ด ๊ฐ€์ •์ด ํ•ญ์ƒ ๋งž์ง€ ์•Š์ง€๋งŒ, ์ด ๊ฐ€์ • ๋•๋ถ„์— ๊ณ„์‚ฐ์ด ๋งค์šฐ ๋‹จ์ˆœํ•ด์ง€๋ฉฐ, ์‹ค์ œ๋กœ๋„ ๊ฝค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.

๊ธฐ๋ณธ ์•„์ด๋””์–ด

์ด๋Ÿฌํ•œ ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๊ฐ€ ์ŠคํŒธ ๋ฉ”์ผ ๋ถ„๋ฅ˜์™€ ์–ด๋–ค ๊ด€๋ จ์ด ์žˆ์„๊นŒ?

 

์šฐ๋ฆฌ๋Š” ์ฃผ์–ด์ง„ ๋ฉ”์ผ์˜ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ํŠน์ง•๋“ค ( X1=x1, ... , Xj=xj ) ์„ ํ†ตํ•ด ์ด ๋ฉ”์ผ์ด ์ŠคํŒธ์ธ์ง€(y=true) ๊ทธ๋ ‡์ง€ ์•Š์€์ง€(y=false) ๊ตฌ๋ถ„ํ•ด๋‚ด์•ผํ•œ๋‹ค. ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ํ™œ์šฉํ•˜๋ฉด ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ(ํŠน์ง•๋“ค)๊ฐ€ ํŠน์ • ํด๋ž˜์Šค์— ์†ํ•  ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ์ค‘ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์ŠคํŒธ ์ด๋ฉ”์ผ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ ์ด๋ฉ”์ผ์˜ ๊ฐ ๋‹จ์–ด(ํŠน์ง•)๋ฅผ ๋ถ„์„ํ•ด ๊ทธ ์ด๋ฉ”์ผ์ด ์ŠคํŒธ์ผ ํ™•๋ฅ ๊ณผ ์ŠคํŒธ์ด ์•„๋‹ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•œ ํ›„, ๋” ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ์˜ ์ค‘์š”ํ•œ ๊ฐ€์ •์€ ๋ชจ๋“  ํŠน์ง•๋“ค์ด ์„œ๋กœ ๋…๋ฆฝ์ ์ด๋ผ๋Š” ๊ฒƒ์ด๋‹ค. ์ฆ‰, ํŠน์ • ๋ฐ์ดํ„ฐ์˜ ํ•œ ํŠน์ง•์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋‹ค๋ฅธ ํŠน์ง•๋“ค์ด ์ด์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. ์ฆ‰, P(X1,X2) = P(X1)P(X2) ๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค๊ณ  ์ „์ œํ•œ๋‹ค. 

 

๊ฒฐ๊ตญ, ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ๋Š” ๊ฐ ํด๋ž˜์Šค y์— ๋Œ€ํ•ด ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์˜ ํŠน์ง• ๋ฒกํ„ฐ x1,x2,...,xn ์ด ๋ฐœ์ƒํ•  ํ™•๋ฅ ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐํ•ด๋‚ธ๋‹ค.

 

 

๋™์ผํ•œ feature ์— ๋Œ€ํ•ด์„œ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋ฏ€๋กœ ์–ด์งœํ”ผ ๋ถ„๋ชจ๋Š” ๊ฐ™์„ ๊ฒƒ์ด๊ธฐ์— ๊ณ„์‚ฐํ•˜์—ฌ

ํด๋ž˜์Šค ํ™•๋ฅ ์„ ๋น„๊ตํ•  ๋•Œ์—๋Š” ํฌ๊ฒŒ ์˜๋ฏธ๊ฐ€ ์—†๊ธฐ์— ๋ถ„๋ชจ๋Š” ๋ฒ„๋ฆฌ๊ณ  ๋ถ„์ž๊ฐ’๋งŒ ๊ณ„์‚ฐํ•œ๋‹ค.

๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ(Laplace Smoothing)

ํŠน์ • ํด๋ž˜์Šค์—์„œ ์–ด๋–ค ํŠน์ง•์ด ํ•œ ๋ฒˆ๋„ ๋“ฑ์žฅํ•˜์ง€ ์•Š์•˜์„ ๊ฒฝ์šฐ, ํ•ด๋‹น ํ™•๋ฅ  P(xi | y) ๋Š” 0์ด๋œ๋‹ค. 

์ด๋Š” ํ•ด๋‹น ๋‹จ์–ด(ํŠน์ง•)์ด ํฌํ•จ๋˜์—ˆ๋‹ค๋Š” ์ด์œ ๋งŒ์œผ๋กœ ์ „์ฒด ํ™•๋ฅ ์„ 0์œผ๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆฌ๋Š” ๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ์„ ์ ์šฉํ•˜๋ฉฐ, ๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ์€ ํ•œ ํด๋ž˜์Šค์— ์†ํ•˜๋Š” ์–ด๋–ค ํŠน์„ฑ์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜์— 1์„ ๋”ํ•ด

ํ™•๋ฅ ์ด 0์ด ๋˜์ง€ ์•Š๋„๋ก ์กฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

 

์Šค๋ฌด๋”ฉ์ด ์ ์šฉ๋œ ํ™•๋ฅ  ๊ณ„์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

 

 

์—ฌ๊ธฐ์„œ V๋Š” ์ „์ฒด ํŠน์„ฑ์˜ ๊ฐœ์ˆ˜๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์ฆ‰, x1, ... , xi ๊นŒ์ง€์˜ ์ฃผ์–ด์ง„ ๋ชจ๋“  ํŠน์„ฑ๋“ค์— ๋Œ€ํ•˜์—ฌ ์ƒ˜ํ”Œ์„ ํ•˜๋‚˜์”ฉ ๋”ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฏ€๋กœ,

๋”ํ•ด์ค€ ์ƒ˜ํ”Œ์˜ ๊ฐœ์ˆ˜๋งŒํผ ์ „์ฒด ์ƒ˜ํ”Œ์˜ ๊ฐœ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋Š” ์›๋ฆฌ์ด๋‹ค.

 

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ์˜ ๋‹จ๊ณ„

1. ํ›ˆ๋ จ

 

- ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ๊ฐ ํด๋ž˜์Šค(์˜ˆ: ์ŠคํŒธ/๋น„์ŠคํŒธ)์— ๋Œ€ํ•œ ์‚ฌ์ „ ํ™•๋ฅ  P(Class) ๋ฅผ ๊ณ„์‚ฐ

- ๊ฐ ํด๋ž˜์Šค๋ณ„๋กœ ๊ฐ ํŠน์ง•์ด ๋“ฑ์žฅํ•  ํ™•๋ฅ  P(FeatureโˆฃClass) ์„ ๊ณ„์‚ฐ

 

2. ์˜ˆ์ธก 

 

- ์ฃผ์–ด์ง„ ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ(์˜ˆ: ์ด๋ฉ”์ผ)์— ๋Œ€ํ•ด ๊ฐ ํด๋ž˜์Šค์— ์†ํ•  ํ™•๋ฅ ์„ ๊ณ„์‚ฐ

=> ์ฆ‰, ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ํด๋ž˜์Šค์— ์†ํ•  ์‚ฌํ›„ ํ™•๋ฅ  P(ClassโˆฃFeatures)๋ฅผ ๊ณ„์‚ฐ

- ๊ฐ€์žฅ ๋†’์€ ์‚ฌํ›„ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ์„ ํƒํ•˜์—ฌ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธก

 

P(Class | Features) ๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋‹ค์Œ ๊ณต์‹์„ ํ™œ์šฉํ•ด์ฃผ๋ฉด ๋œ๋‹ค. 

 

 

ํŠน์ง• ๋ฒกํ„ฐ์˜ ๊ฐ ํŠน์ง•์— ๋Œ€ํ•œ ํ™•๋ฅ ์„ ๊ณฑํ•˜๋Š” ๊ณผ์ •์—์„œ ๋งค์šฐ ์ž‘์€ ๊ฐ’๋“ค์ด ๊ณ„์† ๊ณฑํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ’์ด ๋„ˆ๋ฌด ์ž‘์•„์ ธ ์–ธ๋”ํ”Œ๋กœ์šฐ(underflow) ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋กœ๊ทธ ๋ณ€ํ™˜์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณฑ์…ˆ์„ ๋ง์…ˆ์œผ๋กœ ๋ณ€ํ™˜ํ•  ๊ฒƒ์ด๋‹ค. 

 

 

๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ ์šฉ ์˜ˆ์‹œ

 

์˜ˆ๋ฅผ ๋“ค์–ด, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐ์ดํ„ฐ์…‹์ด ์กด์žฌํ•œ๋‹ค๊ณ  ํ•ด๋ณด์ž. ๊ทธ๋ฆฌ๊ณ  ์šฐ์ธก๊ณผ ๊ฐ™์€ Test Sample ์— ๋Œ€ํ•˜์—ฌ

์ข‹์•„ํ• ์ง€(Liked) ์•„๋‹์ง€(Not Liked) ๋ถ„๋ฅ˜ํ•ด์•ผํ•œ๋‹ค.

 

 

์ฆ‰, ์šฐ๋ฆฌ๋Š” P(Like | Type=Comedy, Length=Medium ) ์ผ ํ™•๋ฅ ๊ณผ P( Not Liked | Type=Comedy, Length=Medium ) ์ผ ํ™•๋ฅ ์„ ๊ฐ๊ฐ ๊ตฌํ•˜์—ฌ ๋” ๋†’์€ ํ™•๋ฅ  ์ง€๋‹Œ ํด๋ž˜์Šค๋กœ ํ•ด๋‹น ๋ฉ”์ผ์„ ๋ถ„๋ฅ˜ํ•˜๊ฒŒ ๋œ๋‹ค. 

 

์ด ๋•Œ, ์•ž์„œ ๋ฐฐ์› ๋˜ ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ๋‹ค. 

 

์‹ค์ˆ˜๋กœ Liked ๋ฅผ Spam ์œผ๋กœ ์ž˜๋ชป ์ ์—ˆ๋Š”๋ฐ ์–‘ํ•ด ๋ถ€ํƒ๋“œ๋ ค์š”..ใ…Žใ…Žใ…Ž

 

Not-Like ์ผ ๊ฒฝ์šฐ๋„ ๋™์ผํ•œ ๊ณต์‹์„ ์จ์„œ ๊ตฌํ•ด์ฃผ๋ฉด ๋œ๋‹ค. 

 

์ด ์ˆ˜์‹์˜ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ฏธ๋ฆฌ ๋‹ค์Œ์˜ ๊ฐ’๋“ค์„ ๊ณ„์‚ฐํ•ด๋‘˜ ๊ฒƒ์ด๋‹ค.

 

- ๊ฐ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ์‚ฌ์ „ ํ™•๋ฅ  P(Class) : P(Liked), P(Not Liked)

- ๊ฐ ํด๋ž˜์Šค๋ณ„๋กœ ๊ฐ ํŠน์ง•์ด ๋“ฑ์žฅํ•  ํ™•๋ฅ  P(FeatureโˆฃClass)  : P(Comedy | Liked), .... , P(Long | Not Liked )

 

 

 

๊ทธ๋ฆฌ๊ณ  ์˜ˆ์ธกํ• ๋•Œ์—๋Š” ์ด ๊ฐ’๋“ค์„ ํ™œ์šฉํ•˜์—ฌ ์ฃผ์–ด์ง„ ๊ณต์‹์— ๋Œ€์ž…ํ•ด์ฃผ๊ธฐ๋งŒ ํ•˜๋ฉด ๋œ๋‹ค. 

 

Liked์ผ ํ™•๋ฅ 

 

 

Not Liked์ผ ํ™•๋ฅ 

 

๋”ฐ๋ผ์„œ P(Liked | Feature) = 0.15 > P(Not Liked | Feature) = 0.04 

Liked ํ•  ํ™•๋ฅ  > Not Liked ํ•  ํ™•๋ฅ  ์ด๋ฏ€๋กœ Type=Comedy, Length=Medium  ํŠน์„ฑ์„ ์ง€๋‹Œ ์ƒ˜ํ”Œ์€ Liked๋กœ ๋ถ„๋ฅ˜๋œ๋‹ค.

๋ชฉํ‘œ


์ด์ œ ๋ณธ ์ด๋ก ์„ ๋ฐ”ํƒ•์œผ๋กœ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ(Naïve Bayes) ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ SVM์„ ์‚ฌ์šฉํ•˜์—ฌ ์ŠคํŒธ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ๊ตฌํ˜„ํ•  ๊ฒƒ์ด๋‹ค.

์ด๋ฉ”์ผ์ฃผ์†Œ(EMAILADDR), ์›น์ฃผ์†Œ(HTTPADDR), ํ†ตํ™”(DOLLAR), ์ˆซ์ž๋“ค(NUMBER) ์„ ํŠน๋ณ„ํ•œ ํ† ํฐ์œผ๋กœ ์ •์˜ํ•˜๊ณ  ๋ถ„๋ฅ˜ ๊ณผ์ •์— ์ ์ ˆํ•˜๊ฒŒ ๊ณ ๋ ค๋˜๋„๋ก ํ•  ๊ฒƒ์ด๋‹ค. ๋ณธ ๋ฌธ์ œ์—์„œ ์šฐ๋ฆฌ๋Š” "words" ๋Œ€์‹  "tokens"์ด๋ผ๋Š” ์šฉ์–ด๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ธ๋ฐ, ๊ทธ ์ด์œ ๋Š” ์ผ๋ถ€ ํŠน์ง•๋“ค์ด EMAILADDR๊ณผ ๊ฐ™์€ ํŠน์ˆ˜ ๊ฐ’์— ํ•ด๋‹นํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

๋ฐ์ดํ„ฐ์…‹

 

์šฐ๋ฆฌ๊ฐ€ ๊ณ ๋ คํ•  ํ† ํฐ์˜ ์ง‘ํ•ฉ, ์ฆ‰ ์–ดํœ˜๋Š” ์ค‘๊ฐ„ ๋นˆ๋„์— ์†ํ•˜๋Š” ํ† ํฐ๋“ค๋กœ๋งŒ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์ง๊ด€์ ์œผ๋กœ, ๋„ˆ๋ฌด ์ž์ฃผ ๋ฐœ์ƒํ•˜๊ฑฐ๋‚˜ ๋„ˆ๋ฌด ๋“œ๋ฌผ๊ฒŒ ๋ฐœ์ƒํ•˜๋Š” ํ† ํฐ๋“ค์€ ๋ถ„๋ฅ˜์— ํฐ ์˜๋ฏธ๊ฐ€ ์—†๋‹ค๊ณ  ํŒ๋‹จ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, "the", "and", "of" ๊ฐ™์€ ๋‹จ์–ด๋“ค์€ ๋„ˆ๋ฌด ๋งŽ์€ ์ด๋ฉ”์ผ์—์„œ ๋„ˆ๋ฌด ์ž์ฃผ ๋“ฑ์žฅํ•˜๋ฉฐ, ๊ทธ ์ž์ฒด๋กœ๋Š” ํŠน์ • ์ฝ˜ํ…์ธ ์™€ ๊ด€๋ จ์ด ์—†์œผ๋ฏ€๋กœ ๋ชจ๋ธ๋งํ•  ๊ฐ€์น˜๊ฐ€ ์—†๋‹ค. ๋˜ํ•œ "price", "prices", "priced"์™€ ๊ฐ™์€ ๋‹จ์–ด๋“ค์€ "price"๋กœ ๋Œ€์ฒด๋˜์–ด ๋™์ผํ•œ ๋‹จ์–ด๋กœ ์ฒ˜๋ฆฌ๋˜๋„๋ก ํ•˜์˜€๋‹ค. ์‚ฌ์šฉ๋œ ํ† ํฐ์˜ ๋ชฉ๋ก์€ ์•„๋ž˜์™€ ๊ฐ™์€ TOKENS_LIST ํŒŒ์ผ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

hw2_TOKENS_LIST.txt
0.01MB

1 abil
2 absolut
3 abus
4 access
5 accid
6 accord
7 account
8 accur
9 act
10 action
11 actual
12 ad
13 addition
14 address
15 administr
16 admit
17 adult
18 advanc
19 advantag
20 advert
21 advertis
22 advic
23 ae
24 affect
25 affili
26 afford

 

MATRIX.TRAIN ํŒŒ์ผ์—๋Š” ์ด๋ฏธ ์›์‹œ ์ด๋ฉ”์ผ ๋ฉ”์‹œ์ง€์—์„œ ์ถ”์ถœ๋œ ํŠน์ง• ๋ฒกํ„ฐ๋“ค์ด ํฌํ•จ๋˜์–ด ์žˆ๋‹ค.

์ด๋Š” ์ด๋Š” ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ (Document-Term Matrix, DTM)๋กœ, ํ…์ŠคํŠธ ๋ถ„๋ฅ˜์— ์‚ฌ์šฉ๋œ๋‹ค. ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ์ด๋ž€,

๋‹ค์ˆ˜์˜ ๋ฌธ์„œ์—์„œ ๋“ฑ์žฅํ•˜๋Š” ๊ฐ ๋‹จ์–ด๋“ค์˜ ๋นˆ๋„๋ฅผ ํ–‰๋ ฌ๋กœ ํ‘œํ˜„ํ•œ ๊ฒƒ์„ ๋งํ•˜๋ฉฐ, i๋ฒˆ์งธ ํ–‰์€ i๋ฒˆ์งธ ๋ฌธ์„œ/์ด๋ฉ”์ผ์„ ๋‚˜ํƒ€๋‚ด๊ณ ,

j๋ฒˆ์งธ ์—ด์€ j๋ฒˆ์งธ ๊ณ ์œ  ํ† ํฐ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋”ฐ๋ผ์„œ ์ด ํ–‰๋ ฌ์˜ (i, j) ํ•ญ๋ชฉ์€ i๋ฒˆ์งธ ๋ฌธ์„œ์—์„œ j๋ฒˆ์งธ ํ† ํฐ์ด ๋ฐœ์ƒํ•œ ํšŸ์ˆ˜๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

 

MATRIX.TRAIN

DOC_WORD_MATRIX_TRAIN
200 1448
abil absolut abus access accid accord account accur act action actual ad addition address administr admit adult advanc advantag advert advertis advic ae affect affili afford afraid ag against agenc agent ago agreem air al alert almost alon alt altern although alwai amaz america american amherst anim announc annual anonym answer anti anumb anybodi anymor anyth anytim anywai anywher aol apolog apologi appar appl appli applic appreci approach apr april area aren argu argum arial arm arriv art articl associ assum atheist attach attack attain attempt attent attract auction author auto automat averag awai awar award bachelor background backup bad balanc bank bar base basebal basi basic batf bb beach beat beauti becom bed behalf behavior behind belief believ benefit besid best bet beyond bibl big bigger bike bill billion bit black blank block blood blue board bob bodi boi bold bonu book born boss boston bother bottom bought box brain branch brand break breakthrough brian bring broadcast brother brought browser btw bu budget bui build built bunch burn busi button buyer ca cabl california came campaign canada car card career carefulli carri cartridg cash catalog catch categori caus cb cd celebr cell cent center centuri certain certainli certif certifi challeng chanc channel charg charset cheap check cheer child children chip christ church citi citizen claim class classic clean clear clearli click client clinton clipper clock close club cnn co code collect colleg color com combin comfort command comment commerci commit common commun compani compar compat compet competit compil complet complex complianc compound comprehens comput concept conclusion condition conduct confer confid confidenti configur congress connect consist consolid construct consult consum continu contract contribut control conveni convent convers convinc cool copi copyright coral corpor correctli cost couldn count countri coupl cours court cover coverag crave creat creation credit crime critic cross ct cultur cure current custom customerservic cut daili dailybargainmail damag damn danger data databas dave david de dead deal dealer dear death debt decor dedic deep defens defin definit degre delet deliv deliveri demand depend deposit dept describ descript deserv design desir destruct detect devic di did didn die diet differ digit diploma directli director directori disclaim discount discov discoveri diseas disk displai distanc distribut division doctor docum dod doesn dollar domin don door doubl doubt download dr draw dream drive driver drop drug dvd earli earn earth easi easier easiest easili eat ebai econom ed edg edition edu educ effort eight electr electron elimin els elsewher em email emailaddr employ emsi en encourag encrypt end energi enforc engin england enhanc enjoi enough enterpr entertain entir environ equal equip equiti eric error especi essenti establish et ethic evalu even event ever everyth everywher evid exactli excel exchang excit exclud exclus excus execut exerc exist expand expect expens experi experienc expert expir explain express extend extens extra extrem ey face fact factor fail fair faith fall famili famou fan faq far fast faster fastest fat favorit fax fbi feder fee feedback feel field fight figur file fill final financ financi fine finish fire firm first fit fix fl flame flow fly focu folk font food forc forev forget form format formula fortun forward frame frank free freedom fri friend friendli front frustrat ftp fulli fun function fund futur ga gain game gatewai gave gear gener georg gift girl gmt goal god goe gone got gotten govern graduat grant graphic great greatest greatli ground group grow growth guarante guess gui guid gun hacker hand handl hang happen happi hard hardwar hassl hate haven head health hear heard heart heat hei height hell hello help helvetica hi higher highest highli highlight him himself histori hit hockei hold holidai home homeown honor hope host hot hous hover hp html httpaddr huge human hundr hurri ibm id idea ideal identifi idiot ignor ii iii illeg imag imagin immedi impli import imposs impress inc inch incid incom inconveni incred inde independ indic individu industri info inform ing innoc input instal instant institut instruct insur integr intellig intens interact interfac intern internet interpret interview introduc introductori invest investig investor invit involv island isn iso israel issu iti jack jame januari jesu jew jewish jim job joe john join jon journal judg jump june justic kei keyword kick kid kill kind king kit knew knock knowledg koresh la lab lack land languag larger largest late latest launch law lead leader leagu learn led legal legitim lender length letter level librari licens life light limit line link listen littl live ll lo load loan local lock log logic longer lose loss lost lot love lover low lower lowest luck lucki ma mac machin magazin mail mailbox maintain major man manag manual manufactur map march mark market mass massiv master match matter maxim maximum mayb mb mba mean measur mechan media medic medicin meet member membership memori men mention merchand merchant mere messag met method michael microsoft middl mike mile militari million mime mind mine minimum minor minut miracl mirror miss mistak mit mode model modern modifi moment mondai monei monitor monthli moral morn mortgag mostli mother motiv mount movi mr multi multipl murder muscl music muslim myself nashua nation nationwid natur nb nearli necessari necessarili neighbor neither net network news newsgroup newslett newsread nh nice night nine non none normal north norton noth notic notifi novemb nt number numberd numberk numbermb numbernd numbernumb numberpt numberpx numberrd numberst numberth ny object oblig observ obviou obvious offens offic offici oh ok old older onc onlin oper opinion opportun oppos oprah opt option orbit order organ origin otherw output outsid ov overlook owner pack packag page pai paid pain paper paragraph parent park parti particip particularli partner past pattern paul payment pc peac peopl percentag perfect period perman permiss peter phd phone physic pick pictur piec pill plai plain plan player pleasant pleasur plenti plnumber plu plug pm po polic polici polit pool poor popul popular port posit position post postal potenc potenti pound power practic pre predict premium presid press pressur pretti prevent previous price print printer privaci privat prize pro probabl product profession profil profit program progress prohibit project promis promotion proof proper properli properti protect prove proven public publish pull purchas push qualif qualifi quantiti quick quickli quit quot race radiat radio ram rang rate re reach reader readi real realiti realiz realli reason recal recent recipi recogn record recur red reduc reduct refer referr refin refinanc reflect refund refus regard regardless regist registr regul regular rel relationship releas relev reliabl religi religion rem remark rememb repair repli report repres request research resel reserv resid resolut resourc respect respond respons restrict result retail reveal revers reward rich ride right risk road rock roll rom ron room round routin rule run safe safeti sai sale sampl san satisfact satisfi saturdai saw schedul scheme school sci scienc scientif score screen se seal search season second secret section secur seek self sell sender sens sensit separ seri serif seriou serious serv server servic set setup sex sexual shall shape share ship shoot shop shot shouldn shut sick side sign signal signatur signific similar simpl simpli sincer singl sit site situat six size skeptic skill skin sleep slow smaller smart smith smoke social societi softwar sold solid solution solv somebodi someon someth sometim somewher soon sorri sort soul sound sourc sp space spam spe speak special spend spent sport spot spring st staff stai stand standard star statem station statist statu step steve stick stock stop store stori straight strategi street strength strong strongli student studi stuff stupid style subject submit subscrib subscript success suffer suit summari summer sun sundai super suppli suppos sure surpris survei suspect sweepstak switch system tabl tal talk tape target tax teach team tech technic techniqu technologi tel telephon tell ten tend term test text th thank thecompanystor themselv theori therefor third though thought thousand thread throughout throw ti ticket tim tip tire titl tm todai told toll tom took tool top topic toronto total touch tough toward track trade trademark tradition traffic trail train transact transfer travel treat treatment tri trial trick trip troubl true truli trust truth tv tx type typic ultim un understand unfortun uniqu unit univers unix unlik unlimit unsolicit unsubscrib untitl updat upgrad upon url usa usenet user usual util vacat valet valid valu valuabl vari varieti ve vehicl verif verifi version video virtual viru vision visit vital vnumber voic volum vote waco wait walk wall war warn washington wasn wast watch water weapon wear web websit weekend weekli weight welcom went west whatev wheel whenev white whole why wi width wife will win window wing winner wit woman women won wonder word work world worri wors worth wouldn write wrong wrote www ye yeah year york young younger yourself zip 
0 6 1 25 1 41 1 40 1 260 1 37 1 256 1 26 1 23 1 146 1 18 2 17 1 77 2 64 1 184 1 40 1 9 1 146 1 21 1 6 3 -1
0 78 4 77 1 214 1 18 1 22 5 58 1 27 3 5 1 50 2 2 1 230 1 39 1 40 1 39 1 31 1 47 1 39 2 1 1 19 1 63 1 207 2 103 1 18 1 2 2 7 4 1 4 -1
0 26 1 149 1 197 1 37 2 29 1 130 1 37 1 113 1 75 1 3 1 64 1 5 1 5 1 8 3 24 1 27 1 22 1 39 1 46 1 133 1 34 1 57 1 51 1 124 1 1 1 -1
0 11 3 75 1 39 1 39 1 18 1 61 1 68 1 16 2 21 1 60 1 1 1 9 1 20 2 191 1 2 1 6 1 38 1 50 1 3 1 2 1 11 1 12 1 45 1 16 1 19 1 27 1 18 4 15 1 44 2 4 1 32 1 8 1 56 1 6 1 33 1 49 2 14 1 20 1 20 1 19 1 63 1 -1
0 31 1 47 1 94 1 174 1 41 1 22 2 29 1 19 1 39 1 70 1 144

 

ํ—ค๋”

ํ–‰ ์—ด

ํ† ํฐ ๋‹จ์–ด ๋ชฉ๋ก

์ŠคํŒธ๋ฉ”์ผ์—ฌ๋ถ€(0:false, 1:true) ๋‹จ์–ด์ถœํ˜„๋นˆ๋„์Œ(ํ† ํฐ์˜ ์ธ๋ฑ์Šค, ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜)

 

์ˆœ์œผ๋กœ ๋‚˜์—ด๋˜์–ด ์žˆ๋‹ค. ๋งˆ์ง€๋ง‰ ํ–‰์ด ์˜๋ฏธํ•˜๋Š” ๋ฐ”๋ฅผ ๋‹ค์‹œ ํ•œ๋ฒˆ ์‚ดํŽด๋ณด๋ฉด,

0 6 1 25 1 41 1 40 1 260 1 37 1 256 1 26 1 23 1 146 1 18 2 17 1 77 2 64 1 184 1 40 1 9 1 146 1 21 1 6 3 -1

 

์œ„์™€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๋•Œ, ์ฒซ๋ฒˆ์งธ ์—ด์ธ 0 ์€ ์ŠคํŒธ๋ฉ”์ผ์—ฌ๋ถ€(0=false) ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ๋‚˜๋จธ์ง€ ์—ด๋“ค์— ๋Œ€ํ•ด์„œ๋Š” 

๋‘ ๊ฐ’์”ฉ ๋ฌถ์–ด์„œ ์ฒซ์งธ๋Š” ํ† ํฐ์˜ ์ธ๋ฑ์Šค, ๋‘˜์งธ๋Š” ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๊ฒŒ๋œ๋‹ค.

์ฆ‰, 6 1 -> 6๋ฒˆ ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜๊ฐ€ 1, 25 1 -> 25๋ฒˆ ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜๊ฐ€ 1 ... ์ด๋Ÿฐ์‹์œผ๋กœ ํ•ด์„ํ•˜๋ฉด ๋œ๋‹ค.

๋ชฉํ‘œ

 

- ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ(naïve Bayes classifier)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ŠคํŒธ ๋ถ„๋ฅ˜๊ธฐ(spam classifier)๋ฅผ ๊ตฌํ˜„ํ•œ๋‹ค.

- ์ด๋•Œ, ๋‹คํ•ญ ์‚ฌ๊ฑด ๋ชจ๋ธ(multinomial event model)๊ณผ ๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ(Laplace smoothing)์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค.

 

๊ตฌํ˜„

๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํ•™์Šตํ•œ ๋‹ค์Œ( ์˜ˆ์ธก์— ํ•„์š”ํ•œ ์‚ฌ์ „ ๊ฐ’ ๋งˆ๋ จ ),

์ด ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…Œ์ŠคํŠธ ์„ธํŠธ ๋ฐ์ดํ„ฐ(MATRIX.TEST)๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์™„์„ฑํ•  ๊ฒƒ์ด๋‹ค.

 

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import matplotlib.pyplot as plt
import numpy as np

 

 

ํ›ˆ๋ จ ํŒŒ์ผ ์ฝ๊ณ  ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ ๋งŒ๋“ค๊ธฐ 

def readMatrix(file):
    fd = open(file, 'r') # ํŒŒ์ผ์˜ ์ฒซ ์ค„์€ ํ—ค๋”๋กœ ๋ฌด์‹œ
    rows, cols = [int(s) for s in fd.readline().strip().split()] # ํ–‰์˜ ๊ฐœ์ˆ˜(๋ฌธ์„œ ์ˆ˜)์™€ ์—ด์˜ ๊ฐœ์ˆ˜(ํ† ํฐ ์ˆ˜)
    tokens = fd.readline().strip().split() # ๋ฌธ์„œ์— ํฌํ•จ๋œ ํ† ํฐ(๋‹จ์–ด)๋“ค์˜ ๋ฆฌ์ŠคํŠธ ์ €์žฅ
    matrix = np.zeros((rows, cols)) # ํฌ๊ธฐ๊ฐ€ (rows, cols)์ธ ์˜ํ–‰๋ ฌ(zero matrix)์„ ์ƒ์„ฑ
    Y = [] # ๋ ˆ์ด๋ธ”์„ ์ €์žฅํ•  ๋นˆ ๋ฆฌ์ŠคํŠธ Y๋ฅผ ์ดˆ๊ธฐํ™”
    
    for i, line in enumerate(fd):
        nums = [int(x) for x in line.strip().split()]
        Y.append(nums[0]) # ์ด ๋ฌธ์„œ์˜ ๋ ˆ์ด๋ธ”(1: ์ŠคํŒธ, 0: ์ŠคํŒธ ์•„๋‹˜) ์ถ”๊ฐ€
        kv = np.array(nums[1:])
        
        # ์Šฌ๋ผ์ด์‹ฑ ๊ตฌ๋ฌธ array[start:stop:step] ํ™œ์šฉํ•˜์—ฌ key, value ๊ตฌ๋ถ„
        k = np.cumsum(kv[:-1:2]) # ์ง์ˆ˜ ์ธ๋ฑ์Šค ๋ถ€๋ถ„, ์ฆ‰ ํ† ํฐ์˜ ์ธ๋ฑ์Šค๋ฅผ ์ถ”์ถœ 
        v = kv[1::2] # ํ™€์ˆ˜ ์ธ๋ฑ์Šค ๋ถ€๋ถ„, ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์ถ”์ถœ
        matrix[i, k] = v # ํ˜„์žฌ ๋ฌธ์„œ(i)์—์„œ ์ถ”์ถœํ•œ ํ† ํฐ์˜ ์ธ๋ฑ์Šค(k)์— ํ•ด๋‹นํ•˜๋Š” ์—ด์— ๊ทธ ํ† ํฐ์˜ ๋นˆ๋„์ˆ˜(v)๋ฅผ ์ €์žฅ.
    return matrix, tokens, np.array(Y)

 

 

๊ฒฐ๊ณผ

 

- matrix : i(ํŠน์ • ๋ฌธ์„œ), j(j๋ฒˆ์งธ ์ธ๋ฑ์Šค์˜ ํ† ํฐ ์ถœํ˜„ ๋นˆ๋„) ๋ฅผ ์ €์žฅํ•˜๋Š” ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ

[0] [1] ..  [1447]
0 1 3 1
0 2 5 0

 

- Y : i๋ฒˆ์งธ ๋ฌธ์„œ๊ฐ€ ์ŠคํŒธ๋ฉ”์ผ(1) ์ธ์ง€ ์•„๋‹Œ์ง€(0) ์ €์žฅ๋œ 1์ฐจ์› ๋ฐฐ์—ด ์ƒ์„ฑ

0 1 1 .. 0

 

๋ชจ๋ธ ํ›ˆ๋ จ

 

์ฃผ์–ด์ง„ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜๊ธฐ์— ํ•„์š”ํ•œ ํ™•๋ฅ ๊ฐ’์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด ๋•Œ, ๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ๋„ ์ ์šฉํ•œ๋‹ค.

 

์šฐ๋ฆฌ๊ฐ€ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•ด๋‘์–ด์•ผ ํ•˜๋Š” ๊ฐ’๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

 

- P(Spam), P(Not Spam)

- P( xi | Spam ) i = 1 ~ 1448

- P( xi | Not Spam ) i = 1 ~ 1448

 

def nb_train(matrix, category):
    state = {}
    N = matrix.shape[1]  # ์ „์ฒด ํŠน์„ฑ(ํ† ํฐ)์˜ ๊ฐœ์ˆ˜
  

    # ๊ฐ ํด๋ž˜์Šค์— ์†ํ•˜๋Š” ๋ฌธ์„œ ์ˆ˜
    spam_count = np.sum(category)  # ์ŠคํŒธ ๋ฌธ์„œ ์ˆ˜ (๋ ˆ์ด๋ธ”์ด 1์ธ ๊ฒฝ์šฐ)
    not_spam_count = len(category) - spam_count  # ์ŠคํŒธ์ด ์•„๋‹Œ ๋ฌธ์„œ ์ˆ˜ (๋ ˆ์ด๋ธ”์ด 0์ธ ๊ฒฝ์šฐ)

    # ์‚ฌ์ „ ํ™•๋ฅ  P(spam) ๋ฐ P(not_spam)
    P_spam = spam_count / len(category)
    P_not_spam = not_spam_count / len(category)

    # ์ŠคํŒธ๊ณผ ๋น„์ŠคํŒธ ๋ฌธ์„œ์—์„œ ํ† ํฐ์ด ๋“ฑ์žฅํ•œ ํšŸ์ˆ˜๋ฅผ ๊ตฌํ•จ
    token_counts_spam = np.sum(matrix[category == 1], axis=0)   # ์ŠคํŒธ ๋ฌธ์„œ์—์„œ ๊ฐ ํ† ํฐ์ด ๋“ฑ์žฅํ•œ ํšŸ์ˆ˜ 
    token_counts_not_spam = np.sum(matrix[category == 0], axis=0)  # ๋น„์ŠคํŒธ ๋ฌธ์„œ์—์„œ ๊ฐ ํ† ํฐ์ด ๋“ฑ์žฅํ•œ ํšŸ์ˆ˜

    # ๋ผํ”Œ๋ผ์Šค ์Šค๋ฌด๋”ฉ์„ ์ ์šฉํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  P(token|spam) ๋ฐ P(token|not_spam) ๊ตฌํ•˜๊ธฐ 
    P_token_given_spam = ( token_counts_spam + 1 ) / ( spam_count + N )
    P_token_given_not_spam = ( token_counts_not_spam + 1 ) / ( total_tokens_not_spam )

    # ํ•™์Šต๋œ ์ƒํƒœ๋ฅผ ์ €์žฅ
    state['P_spam'] = P_spam
    state['P_not_spam'] = P_not_spam
    state['P_token_given_spam'] = P_token_given_spam
    state['P_token_given_not_spam'] = P_token_given_not_spam

    return state

 

matrix[category == 1] ๋ฅผ ํ†ตํ•ด ๋ณธ matrix ์—์„œ ์ŠคํŒธ์œผ๋กœ ๋ผ๋ฒจ๋ง ๋œ ๋ฉ”์ผ๋งŒ ํ•„ํ„ฐ๋งํ•œ๋‹ค. 

axis=0, ์ฆ‰ ์—ด ๋ฐฉํ–ฅ์— ๋Œ€ํ•˜์—ฌ ๋ˆ„์ ํ•ฉ์„ ๊ณ„์‚ฐํ•จ์œผ๋กœ์„œ ์ŠคํŒธ ๋ฉ”์ผ ์ค‘์—์„œ ํ•ด๋‹น i๋ฒˆ์งธ ๋‹จ์–ด๊ฐ€ ๋‚˜ํƒ€๋‚˜๋Š” ์ด ๋นˆ๋„๋ฅผ ๊ณ„์‚ฐํ•ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

 

๊ฒฐ๊ณผ 

 

- token_counts_spam : xi   Spam , ์ŠคํŒธ ๋ฉ”์ผ ์ค‘์—์„œ ํŠน์ • ์ธ๋ฑ์Šค์˜ ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•œ ์ด ํšŸ์ˆ˜ 

[0] [1] [2]
0 5 6 7

 

 

- token_counts_not_spam : xi  Not Spam , ์ŠคํŒธ์ด ์•„๋‹Œ ๋ฉ”์ผ ์ค‘์—์„œ ํ•ด๋‹น ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•œ ์ด ํšŸ์ˆ˜ 

[0] [1] [2]
0 13 14  

 

 

-  P_token_given_spam : P( xi | Spam )

[0] [1] [2] [3] ...
0 0.33 0.55 0.03  

 

 

-  P_token_given_spam : P( xi | Not Spam )

[0] [1] [2] [3] ...
0 0.13 0.34 0.17  

 

 

์˜ˆ์ธก 

ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์˜ ํ™•๋ฅ ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๋ฌธ์„œ๊ฐ€ ์ŠคํŒธ์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.

def nb_test(matrix, state):
    output = np.zeros(matrix.shape[0])  # ์˜ˆ์ธก ๊ฒฐ๊ณผ ์ €์žฅ

    # ์‚ฌ์ „ ํ™•๋ฅ ๊ณผ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
    P_spam = state['P_spam']
    P_not_spam = state['P_not_spam']
    P_token_given_spam = state['P_token_given_spam']
    P_token_given_not_spam = state['P_token_given_not_spam']

    # ๊ฐ ๋ฌธ์„œ์— ๋Œ€ํ•ด ์ŠคํŒธ ์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธก
    for i in range(matrix.shape[0]):
        # ๊ฐ ๋ฌธ์„œ์˜ ํ† ํฐ ๋นˆ๋„
        doc = matrix[i]

        # ๋กœ๊ทธ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•˜์—ฌ ์–ธ๋”ํ”Œ๋กœ์šฐ ๋ฐฉ์ง€
        log_prob_spam = np.log(P_spam) + np.sum(np.log(P_token_given_spam) * doc)
        log_prob_not_spam = np.log(P_not_spam) + np.sum(np.log(P_token_given_not_spam) * doc)

        # ๋” ๋†’์€ ๋กœ๊ทธ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋กœ ์˜ˆ์ธก
        if log_prob_spam > log_prob_not_spam:
            output[i] = 1  # ์ŠคํŒธ์œผ๋กœ ๋ถ„๋ฅ˜
        else:
            output[i] = 0  # ์ŠคํŒธ์ด ์•„๋‹Œ ๊ฒƒ์œผ๋กœ ๋ถ„๋ฅ˜

    return output

 

 

์—ฌ๊ธฐ์„œ ํ™•๋ฅ ์„ ๊ตฌํ•  ๋•Œ ์กฐ๊ธˆ ํ˜ผ๋ž€์Šค๋Ÿฌ์šธ ์ˆ˜ ์žˆ๋Š” ์ ์€ ๊ธฐ์กด ์˜ˆ์ œ์™€๋Š” ์กฐ๊ธˆ ๋‹ค๋ฅด๊ฒŒ

๊ฐ ๋ฌธ์„œ์˜ ํŠน์„ฑ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’๋“ค์ด ๋‹จ์ˆœํžˆ ๊ทธ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค, ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋‹ค๋กœ ํŒ๋ณ„๋˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ

ํŠน์„ฑ์ด ๋‚˜ํƒ€๋‚˜๋Š” ํšŸ์ˆ˜๋ฅผ ๊ธฐ๋กํ•˜๊ณ  ์žˆ์—ˆ๋‹ค๋Š” ์ ์ด๋‹ค.

 

์ด๋Ÿฌํ•œ ์ƒํ™ฉ์—์„œ๋Š” ๊ฐ ํ† ํฐ์˜ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ  P(token_i | spam)์„ ํ† ํฐ ๋“ฑ์žฅ ๋นˆ๋„์ˆ˜๋งŒํผ ๊ฑฐ๋“ญ ์ œ๊ณฑํ•ด์ค„ ๊ฒƒ์ด๋‹ค.

 

์˜ˆ๋ฅผ ๋“ค์–ด ์–ด๋–ค ๋ฌธ์„œ์—์„œ a=3, b=4 ์ผ ๋•Œ์˜ ์ŠคํŒธ ํ™•๋ฅ ์„ ๊ตฌํ•œ๋‹ค๋ฉด

P(Spam) * P( a | Spam ) ^ 3 * P( b | Spam )^4 ๋ฅผ ๊ตฌํ•ด์ฃผ๋Š” ์…ˆ์ด๋‹ค.

 

๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” doc, P_token_given_spam ๋ฅผ ๊ฐ€์ง€๊ณ  ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

doc: Matrix[i], i๋ฒˆ์งธ ์ƒ˜ํ”Œ์˜ ๋‹จ์–ด ํ–‰๋ ฌ

[0] [1] [2] ..  [1447]
0 1 3 .. 1

 

 

P_token_given_spam : P( xi | Spam )

[0] [1] [2] [3] ...
0 0.33 0.55 0.03  

 

 

๊ฒฐ๊ณผ : P_token_given_spam ^ doc 

[0] [1] [2] [3] ...
0 0.33^1 0.55^3 0.03^3  

 

 

์ฆ‰, ์ตœ์ข…์ ์œผ๋กœ๋Š” ์ด ํ–‰๋ ฌ์˜ ๊ฐ’๋“ค์„ ๋ชจ๋‘ ๊ณฑํ•ด์„œ P(token_i | spam) ๋ฅผ ๊ตฌํ•ด๋‚ด๊ฒŒ ๋œ๋‹ค.

 

 

 

โš ๏ธ ์ฃผ์˜ํ• ์  โ€ผ๏ธ

์ด๋•Œ ์ฃผ์˜ํ•ด์•ผํ•  ์ ์€ ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ๋ฅผ ๊ธฐ๋ณธ์ ์œผ๋กœ ๊ตฌํ˜„ํ•˜๋ฉด ์ข…์ข… ๊ณ„์‚ฐ๋œ ๊ฐ’์ด 0์ด ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
์ด๋Š”, 1๋ณด๋‹ค ์ž‘์€ ํ™•๋ฅ ๋“ค์ด ๊ณ„์† ๊ณฑํ•ด์ง€๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ž‘์€ ๊ฐ’์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.
ํ‘œ์ค€ ์ปดํ“จํ„ฐ์˜ ์‹ค์ˆ˜ ํ‘œํ˜„์€ ๋„ˆ๋ฌด ์ž‘์€ ๊ฐ’์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†๊ณ , ๋Œ€์‹  ์ด๋ฅผ 0์œผ๋กœ ๋ฐ˜์˜ฌ๋ฆผํ•œ๋‹ค. ( ์–ธ๋”ํ”Œ๋กœ์šฐ(underflow) )
๋„ˆ๋ฌด ์ž‘์€ ์ˆซ์ž๋“ค์„ ๋ช…์‹œ์ ์œผ๋กœ ํ‘œํ˜„ํ•˜์ง€ ์•Š๊ณ  ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ์˜ ์˜ˆ์ธก๋œ ํด๋ž˜์Šค ๋ ˆ์ด๋ธ”์„ ๊ณ„์‚ฐํ•  ๋ฐฉ๋ฒ•์„ ์ฐพ์•„์•ผํ•œ๋‹ค.
๋”ฐ๋ผ์„œ ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋กœ๊ทธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณฑ์…ˆ์„ ๋ง์…ˆ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ณ„์‚ฐํ•  ํ•„์š”๊ฐ€ ์žˆ๋‹ค. 

 

 

์–ธ๋”ํ”Œ๋กœ์šฐ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ๋กœ๊ทธ๋ฅผ ์ ์šฉํ•˜๋ฉด ๋กœ๊ทธ์˜ ์„ฑ์งˆ์— ์˜ํ•˜์—ฌ

๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋นˆ๋„ ํ–‰๋ ฌ์˜ ๊ฐ’์ธ doc ์ด ์ƒ์ˆ˜๋กœ ์•ž์— ๋น ์ ธ๋‚˜์˜ค๊ฒŒ ๋˜๊ณ 

 

 

 

๋กœ๊ทธ๋ฅผ ์ทจํ•ด ํ™•๋ฅ ์„ ์ผ๋ฐ˜ํ™”ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณต์‹์ด ๋‚˜์˜จ๋‹ค. 

 

์ด๋ฅผ ์‹ค์ œ ์ฝ”๋“œ๋กœ ๊ตฌํ˜„ํ•œ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

log_prob_spam = np.log(P_spam) + np.sum(np.log(P_token_given_spam) * doc)

 

 

๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ‰๊ฐ€ 

def evaluate(output, label):
    error = (output != label).sum() * 1. / len(output)
    print('Error: %1.4f'%error)

def main():
    # Please set a training file that you want to use for this run below
    trainMatrix, tokenlist, trainCategory = readMatrix('./data/hw2_MATRIX.TRAIN')
    testMatrix, tokenlist, testCategory = readMatrix('./data/hw2_MATRIX.TEST')

    state = nb_train(trainMatrix, trainCategory)
    output = nb_test(testMatrix, state)

    evaluate(output, testCategory)
    return

if __name__ == '__main__':
    main()

 

 

์‹คํ–‰ ๊ฒฐ๊ณผ

Error: 0.0600

 

0.06 ์ •๋„์˜ ์—๋Ÿฌ์œจ์„ ๋ณด์ด๋ฉฐ, 100๊ฐœ ์ค‘์— 94๊ฐœ๋Š” ์ œ๋Œ€๋กœ ๋ถ„๋ฅ˜ํ•ด๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

 

์–ด๋–ค ํ† ํฐ์ด ์ŠคํŒธ์„ ๊ตฌ๋ณ„ํ•˜๋Š”๋ฐ ๊ฐ€์žฅ ๋ณ€๋ณ„๋ ฅ ์žˆ์„๊นŒ?

 

์ŠคํŒธ๊ณผ ๋น„์ŠคํŒธ์„ ๊ตฌ๋ณ„ํ•˜๋Š” ๋ฐ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ํ† ํฐ 5๊ฐœ๋ฅผ ์ฐพ์„ ๊ฒƒ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๊ฐ ํ† ํฐ์ด ์ŠคํŒธ ํด๋ž˜์Šค์—์„œ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€(ํŠน์ง•์ด ๋˜๋Š”์ง€)๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฐ€์žฅ ํฐ ๊ฐ’์„ ๊ฐ€์ง„ 5๊ฐœ์˜ ํ† ํฐ์„ ์„ ํƒํ•  ๊ฒƒ์ด๋‹ค. 

 

์ด๋ฅผ ์œ„ํ•˜์—ฌ ์ฃผ์–ด์ง„ ์ˆ˜์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋กœ๊ทธ ์šฐ๋„๋น„(Log-likelihood ratio)๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. 

 

 

์œ„์˜ ์ˆ˜์‹์„ ์ฝ”๋“œ์— ๋ฐ˜์˜ํ•˜์—ฌ, ํ† ํฐ์˜ ํ™•๋ฅ ์„ ์ŠคํŒธ๊ณผ ๋น„์ŠคํŒธ์—์„œ ๋น„๊ตํ•˜๋Š” ๋กœ๊ทธ ํ™•๋ฅ ์„ ๊ตฌํ•œ ๋‹ค์Œ, ๊ทธ ๊ฐ’์„ ์ •๋ ฌํ•˜์—ฌ

์ŠคํŒธ์„ ๋ณ€๋ณ„ํ•˜๋Š” ์ƒ์œ„ 5๊ฐœ์˜ ํ† ํฐ์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค.

 

# Spam ๊ณผ Not Spam ์„ ๊ตฌ๋ถ„ํ•˜๋Š” ์ƒ์œ„ n๊ฐœ์˜ ํ† ํฐ ์ฐพ๊ธฐ
def findTopToken(top_n,state,tokens):

    P_token_given_spam = state['P_token_given_spam']
    P_token_given_not_spam = state['P_token_given_not_spam']

    # ๊ฐ ํ† ํฐ์— ๋Œ€ํ•œ ๋กœ๊ทธ ์šฐ๋„๋น„(log-likelihood ratio)๋ฅผ ๊ณ„์‚ฐ
    log_likelihood_ratio = np.log(P_token_given_spam) - np.log(P_token_given_not_spam)

    # ๋กœ๊ทธ ์šฐ๋„๋น„๊ฐ€ ๊ฐ€์žฅ ํฐ ์ƒ์œ„ top_n๊ฐœ์˜ ํ† ํฐ์„ ์ฐพ์Œ
    top_token_indices = np.argsort(log_likelihood_ratio)[-top_n:][::-1]

     # ์ƒ์œ„ top_n๊ฐœ์˜ ํ† ํฐ๊ณผ ๊ทธ ๊ฐ’์„ ์ถœ๋ ฅ
    print(f"== Top {top_n} tokens most indicative of SPAM ==\n")
    for idx in top_token_indices:
        print(f"Token: {tokens[idx]}, Log-Odds Ratio: {log_likelihood_ratio[idx]:.4f}")

 

๊ฒฐ๊ณผ

html, space, unsolicit, eat, vacat ๊ฐ€ ์ŠคํŒธ์„ ๋ณ€๋ณ„ํ•˜๋Š” Top 5 ํ‚ค์›Œ๋“œ๋กœ ์„ ๋ณ„๋˜์—ˆ๋‹ค. 

 

 

ํ•™์Šต ๊ณก์„ (Training Set Size vs Test Error)

๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๊ณ , ๊ทธ์— ๋”ฐ๋ผ ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ์˜ ์˜ค๋ฅ˜์œจ์„ ๊ณ„์‚ฐํ•˜์—ฌ ํ•™์Šต ๊ณก์„ (learning curve)์„ ๊ทธ๋ฆด ๊ฒƒ์ด๋‹ค. ํ›ˆ๋ จ ์„ธํŠธ์˜ ํฌ๊ธฐ๋Š” 50, 100, 200, ... ์ตœ๋Œ€ 1400๊นŒ์ง€ ์ฃผ์–ด์ง€๋ฉฐ, ๊ฐ๊ฐ์˜ ํ›ˆ๋ จ ์„ธํŠธ์— ๋Œ€ํ•ด ํ•™์Šต์„ ์ง„ํ–‰ํ•œ ํ›„ ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ์˜ ์˜ค๋ฅ˜์œจ์„ ๊ณ„์‚ฐํ•œ๋‹ค.

 

    # ํ›ˆ๋ จ ์„ธํŠธ ํฌ๊ธฐ๋ณ„๋กœ ํŒŒ์ผ์„ ์ฝ๊ณ  ํ•™์Šต ํ›„ ์˜ค๋ฅ˜์œจ ๊ณ„์‚ฐ
    training_set_size = [50, 100, 200, 400, 800, 1400]
    test_errors = []
    for size in training_set_size:
        trainMatrix, tokenlist, trainCategory = readMatrix(f'./data/hw2_MATRIX.TRAIN.{size}')
        state = nb_train(trainMatrix, trainCategory)
        output = nb_test(testMatrix, state)
        error = evaluate(output, testCategory)
        test_errors.append(error)

 

def plotLearningCurve(train_sizes, test_errors):

    plt.plot(train_sizes, test_errors, marker='o', color='b')
    plt.title('Learning Curve (Training Set Size vs Test Error)')
    plt.xlabel('Training Set Size')
    plt.ylabel('Test Error')
    plt.grid(True)
    plt.show()

 

 

๊ฒฐ๊ณผ

 

ํ›ˆ๋ จ์„ธํŠธ์˜ ํฌ๊ธฐ๊ฐ€ 400์ผ ๋•Œ ๊ฐ€์žฅ ๋‚ฎ์€ ์˜ค์ฐจ์œจ์„ ๋ณด์ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.