scikit-learn Machine Learning: Naive Bayes for Korean NLP
EthanJ
2022. 11. 21. 14:44

In [1]:
# import library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
1. Loading the Data (Data Collection)
In [2]:
# https://github.com/e9t/nsmc/
file_url = 'https://raw.githubusercontent.com/dev-EthanJ/scikit-learn_Machine_Learning/main/data/ratings.txt'
df = pd.read_csv(file_url, sep='\t', index_col=0)
df.head()
Out[2]:
id | document | label |
---|---|---|
8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 |
8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 |
4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 |
9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 |
10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 |
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 8112052 to 8548411
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   document  199992 non-null  object
 1   label     200000 non-null  int64
dtypes: int64(1), object(1)
memory usage: 4.6+ MB
In [4]:
df[df.document.isnull()]
Out[4]:
id | document | label |
---|---|---|
6369843 | NaN | 1 |
511097 | NaN | 1 |
2172111 | NaN | 1 |
402110 | NaN | 1 |
5942978 | NaN | 0 |
5026896 | NaN | 0 |
1034280 | NaN | 0 |
1034283 | NaN | 0 |
- Only 8 of the 200,000 rows have a missing `document`, which is negligible relative to the dataset size → drop them
In [5]:
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 199992 entries, 8112052 to 8548411
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   document  199992 non-null  object
 1   label     199992 non-null  int64
dtypes: int64(1), object(1)
memory usage: 4.6+ MB
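The decision to drop rather than impute rests on how small the missing share is. A quick arithmetic check, using only the row counts reported by the two `df.info()` calls (the variable names are just for illustration):

```python
# Counts taken from the df.info() outputs before and after dropna().
total_rows = 200_000
non_null_documents = 199_992

missing = total_rows - non_null_documents
missing_ratio = missing / total_rows

print(f"missing rows: {missing}")
print(f"missing ratio: {missing_ratio:.4%}")
```

Since `label` has no missing values here, `df.dropna()` and `df.dropna(subset=['document'])` should remove the same rows; the second form only makes the intent more explicit.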
In [6]:
sample = pd.concat([df.head(1000), df.tail(1000)])
sample.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 8112052 to 8548411
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   document  2000 non-null   object
 1   label     2000 non-null   int64
dtypes: int64(1), object(1)
memory usage: 46.9+ KB
In [7]:
sample.head(10)
Out[7]:
id | document | label |
---|---|---|
8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 |
8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 |
4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 |
9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 |
10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 |
2190435 | 사랑을 해본사람이라면 처음부터 끝까지 웃을수 있는영화 | 1 |
9279041 | 완전 감동입니다 다시봐도 감동 | 1 |
7865729 | 개들의 전쟁2 나오나요? 나오면 1빠로 보고 싶음 | 1 |
7477618 | 굿 | 1 |
9250537 | 바보가 아니라 병 쉰 인듯 | 1 |
2. Data Pre-processing
In [8]:
sample_text = sample.document.iloc[0]
sample_text
Out[8]:
'어릴때보고 지금다시봐도 재밌어요ㅋㅋ'
In [9]:
# https://konlpy.org/ko/latest/index.html
!pip install konlpy --quiet
# KoNLPy supports part-of-speech and morpheme tagging
In [10]:
from konlpy.tag import Okt
okt = Okt()
print(sample_text)
# keep only the nouns
print(okt.nouns(sample_text))
어릴때보고 지금다시봐도 재밌어요ㅋㅋ
['때', '보고', '지금', '다시']
In [11]:
# extract nouns, then keep only words of two or more characters
sample['nouns'] = sample.document.apply(okt.nouns).apply(
    lambda nouns: [n for n in nouns if len(n) >= 2])
sample['nouns']
Out[11]:
id
8112052     [보고, 지금, 다시]
8132799     [디자인, 학생, 외국, 디자이너, 일군, 전통, 통해, 발전, 문화, 산업, 사실...
4655635     [폴리스스토리, 시리즈, 부터, 하나, 최고]
9251303     [연기, 진짜, 생각, 몰입, 진짜, 영화]
10067386    [안개, 밤하늘, 초승달, 영화]
            ...
8963373     [포켓, 몬스터]
3302770     []
5458175     [완전, 사이코, 영화, 마지막, 더욱더, 영화, 린다]
6908648     [라따뚜이, 스머프, 런가]
8548411     [저그, 영차, 영차, 영차]
Name: nouns, Length: 2000, dtype: object
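The length filter can be tried on its own, without installing KoNLPy, by feeding it the noun list that `okt.nouns` printed for the first review in cell In [10]. The helper name `keep_long_nouns` is hypothetical and only mirrors the lambda used above:

```python
def keep_long_nouns(nouns, min_len=2):
    """Return only the nouns with at least `min_len` characters."""
    return [n for n in nouns if len(n) >= min_len]


# Nouns that okt.nouns() printed for the first review in cell In [10].
first_review_nouns = ['때', '보고', '지금', '다시']

filtered = keep_long_nouns(first_review_nouns)
print(filtered)  # ['보고', '지금', '다시']
```

The single-character noun '때' is dropped, which matches the first row of the `nouns` column shown above.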
In [12]:
sample.head()
Out[12]:
id | document | label | nouns |
---|---|---|---|
8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 | [보고, 지금, 다시] |
8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 | [디자인, 학생, 외국, 디자이너, 일군, 전통, 통해, 발전, 문화, 산업, 사실... |
4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 | [폴리스스토리, 시리즈, 부터, 하나, 최고] |
9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 | [연기, 진짜, 생각, 몰입, 진짜, 영화] |
10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 | [안개, 밤하늘, 초승달, 영화] |
In [13]:
df = sample.copy()
3. Training the Model
In [14]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
joined_nouns = df.nouns.apply(" ".join)
cv.fit(joined_nouns)
cv.vocabulary_
Out[14]:
{'보고': 1206, '지금': 2670, '다시': 551, '디자인': 722, '학생': 3120, '외국': 2059, '디자이너': 721, '일군': 2309, '전통': 2479, '통해': 2972, '발전': 1119, '문화': 1011, '산업': 1363, '사실': 1345, '우리나라': 2090, '시절': 1640, '열정': 1964, '노라노': 494, '사람': 1333, '폴리스스토리': 3067, '시리즈': 1628, '부터': 1266, '하나': 3098, '최고': 2814, '연기': 1945, '진짜': 2711, '생각': 1404, '몰입': 964, '영화': 1977, '안개': 1770, '밤하늘': 1124, '초승달': 2806, '사랑': 1336, '라면': 747, '처음': 2772, '완전': 2044, '감동': 52, '전쟁': 2475, '바보': 1067, '나이': 430, '훗날': 3280, '사하나': 1360, '감정': 62, '이해': 2273, '고질': 199, '오페라': 2024, '작품': 2381, '극단': 323, '갈림': 42, '반전': 1105, '평점': 3046, '긴장감': 374, '스릴': 1574, '전장': 2474, '공포': 219, '고시': 189, '소재': 1512, '관련': 236, '단연': 565, '가면': 10, '갈수록': 44, '더욱': 624, '밀회': 1061, '화이팅': 3243, '수작': 1549, '일본': 2315, '마음': 840, '임팩트': 2342, '일품': 2334, '제대로': 2541, '범죄': 1174, '스릴러': 1575, '마디': 827, '징텅': 2728, '교복': 249, '이의': 2256, '볼펜': 1241, '자국': 2350, '역시': 1939, '미처': 1044, '전하': 2485, '형태': 3210, '마지막': 847, '강압': 78, '용서': 2082, '세뇌': 1476, '대한': 615, '비판': 1309, '중세시대': 2655, '명작': 933, '영상': 1971, '존재': 2588, '한번': 3129, '제니퍼': 2540, '코넬': 2888, '아역시절': 1745, '로버트': 790, '드니': 704, '장면': 2403, '정말': 2511, '가슴속': 16, '기억': 365, '수가': 1534, '인간': 2278, '잠재': 2390, '악마': 1765, '성은': 1460, '여러': 1915, '시간': 1617, '공간': 206, '존속': 2587, '다큐': 557, '그것': 300, '재현': 2441, '최고다': 2815, '삼일': 1374, '동안': 683, '틈틈이': 2995, '잠도': 2389, '여운': 1925, '실화': 1685, '충격': 2845, '어디': 1842, '일어나서': 2324, '각심': 35, '그라샴': 313, '농아인': 513, '이정재': 2259, '이범수': 2221, '친구': 2857, '우정': 2102, '매우': 891, '굿굿굿': 282, '또해': 735, '제발': 2547, '제이크': 2552, '질렌할': 2717, '대체': 610, '입가': 2343, '미소': 1037, '샤방샤방했던': 1415, '원표': 2133, '조연': 2571, '이양': 2246, '마치': 849, '바다': 1064, '아쿠아리움': 1762, '느낌': 533, '자녀': 2354, '강추': 81, '정의': 2529, '콜트': 2902, '콜텍': 2901, '노동자': 493, '이야기': 2245, '화보': 3241, '브라질': 1288, '남자': 457, '내내': 468, '여배우': 1920, '도법': 652, '멤버': 921, '모두': 943, '기대': 350, '액션': 1829, '런가': 766, '흥미진진': 3294, 
'워낙': 2119, '격투씬': 151, '그냥': 303, '스마트': 1577, '티비': 2996, '인지도': 2302, '암살': 1811, '하여튼': 3110, '인정': 2298, '파르': 3001, '북한': 1267, '목숨': 961, '대한민국': 616, '그거': 299, '납득': 460, '나불': 425, '거려': 112, '종북': 2600, '박평': 1086, '그대': 307, '시작': 1638, '얼굴': 1873, '주인공': 2627, '매력': 888, '이구': 2204, '용이': 2083, '요즘': 2072, '아이돌': 1750, '배우': 1153, '해먹': 3162, '정도': 2509, '정말재밋': 2513, '봣는대': 1246, '탱고': 2953, '음악': 2186, '평생': 3042, '후회': 3276, '드라마': 705, '미도': 1031, '캐릭터': 2868, '여러가지': 1916, '결말': 156, '이제': 2261, '재미': 2425, '순위': 1561, '신동엽': 1655, '순간': 1556, '캐치': 2872, '감탄': 64, '이영자': 2247, '대박': 599, '이동욱': 2212, '인생': 2290, '전작': 2473, '쓰레기': 1718, '한국영': 3125, '별로': 1201, '다소': 549, '당시': 585, '눈물': 523, '사발': 1342, '신부': 1658, '로서': 794, '명감': 926, '실천': 1682, '제리': 2543, '자신': 2366, '부정': 1263, '로키': 797, '원주율': 2132, '메이커': 915, '추석': 2835, '특선영화': 2991, '가족': 24, '끼리': 416, '보기': 1208, '우리': 2089, '애기': 1819, '선택': 1437, '괜찬': 245, '조합': 2582, '바로': 1066, '대의': 607, '어머니': 1849, '원작': 2127, '드래곤볼': 708, '에볼루션': 1892, '만듬': 862, '예능': 1992, '방학': 1138, '아침': 1761, '채널': 2766, '엔트랩먼트': 1903, '권력': 287, '의리': 2193, '역사': 1937, '초롱': 2801, '팬심': 3020, '내용': 475, '배경음악': 1142, '달달': 570, '중간': 2645, '본방': 1232, '사수': 1344, '스토리': 1597, '요새': 2068, '수백향': 1547, '제목': 2545, '화질': 3247, '가을로': 21, '가을': 20, '때문': 730, '이건': 2201, '벌써': 1172, '퀄리티': 2906, '일단': 2310, '까지봣다': 402, '장국영': 2394, '자살': 2362, '극적': 327, '뀰잼': 414, '예전': 1999, '양심': 1837, '냉장고': 479, '움찔': 2112, '한가지': 3121, '부탁': 1265, '연기자': 1948, '보호': 1226, '폭력': 3060, '걱정': 124, '성추행': 1467, '상가': 1377, '모든': 945, '목격': 958, '증언': 2665, '덕분': 627, '청소년': 2787, '계속': 172, '경찰서': 169, '멋있쪙': 905, '재밋었다': 2433, '세기': 1475, '비디오': 1301, '남기남': 450, '감독': 49, '이름': 2215, '성룡': 1456, '형님': 3205, '마이': 841, '우상': 2093, '당신': 586, '장르': 2401, '이영화': 2248, '리타': 814, '자꾸': 2353, '죄책감': 2609, '도잠': 658, '커서': 2875, '웃음': 2116, '정치': 2533, '묘사': 974, '표현': 3072, '흥행': 3296, '안나': 1773, '절대': 2488, '디테': 
725, '일만': 2314, '봉임': 1244, '한마디': 3128, '달기': 569, '수록': 1544, '살이': 1367, '엇냐': 1882, '성격': 1449, '만점': 867, '다음': 553, '초반': 2802, '설정': 1447, '점차': 2505, '판타지': 3011, '미래': 1033, '현실': 3199, '언론': 1871, '탄압': 2941, '은유': 2181, '오카다': 2020, '예상': 1994, '의외': 2199, '인도영화': 2285, '무엇': 996, '다그': 545, '허정무': 3183, '대신': 605, '장외룡': 2414, '한국': 3123, '고고싱': 174, '어찌': 1863, '보조개': 1222, '메이': 913, '최고봉': 2816, '초딩': 2800, '봣음': 1250, '비포미드나잇': 1311, '심정': 1694, '또한': 734, '한편': 3145, '모습': 953, '절로': 2490, '뭔가': 1019, '스케': 1584, '라디오': 745, '임진강': 2340, '한석규': 3131, '쵝오': 2828, '허니': 3179, '꿀잼': 412, '개꿀잼': 88, '한예슬': 3134, '원래': 2122, '헤어': 3192, '메이크업': 916, '훈남': 3277, '쉐프': 1565, '보구': 1207, '요리': 2067, '시나리오': 1620, '비고': 1296, '잼잼꿀잼': 2444, '잼핵잼잼잼': 2445, '개인': 103, '언제': 1872, '팬텀': 3021, '크리스틴': 2910, '일이': 2326, '전혀': 2486, '처럼': 2770, '남편': 459, '아쉬움': 1739, '알파': 1809, '치노': 2850, '명화': 937, '뮤지컬': 1021, '역대': 1936, '대부': 600, '바람': 1065, '레벨': 768, '명의': 932, '마잭': 845, '마돈나': 826, '엘비스급': 1906, '노래실력': 496, '바비': 1068, '바스코': 1069, '이번': 2220, '결과': 154, '대해': 617, '불만': 1281, '어쩌면': 1860, '거지': 121, '어쩌': 1858, '자격': 2349, '코미디': 2891, '사극': 1329, '장혁': 2420, '명품': 935, '이서': 2234, '마련': 828, '사마': 1338, '그날': 302, '저녁': 2449, '기분': 360, '옛날': 2001, '워낭소리': 2120, '더빙': 623, '자막': 2358, '연말': 1952, '추천': 2839, '사회': 1361, '치부': 2854, '예언': 1997, '엣날': 1910, '제일': 2553, '슈퍼': 1570, '울트라': 2110, '노막': 500, '중동': 2652, '참여': 2758, '얼마나': 1875, '아버지': 1735, '이자': 2257, '무사': 991, '일생': 2323, '의미': 2195, '죽음': 2636, '일지': 2330, '신념': 1653, '진일': 2709, '물결': 1014, '쵝오임': 2829, '다만': 547, '엔딩': 1899, '배심원': 1151, '상대로': 1381, '일장': 2327, '연설': 1955, '작위': 2378, '신파': 1669, '무죄': 1001, '애정': 1826, '멜로': 920, '최영장군': 2824, '이민호': 2218, '안치환': 1797, '방향': 1140, '노래': 495, '걸작': 130, '안보': 1784, '고도': 177, '오스트레일리아': 2011, '조금': 2559, '프랑스': 3080, '심장': 1693, '쫄깃하': 2739, '메이드': 914, '범죄물': 1175, '류승범': 801, '황정민': 3259, '기쁨': 361, '등장인물': 717, '절제': 2496, 
'소설가': 1506, '무척': 1003, '엄마': 1877, '운전': 2108, '맥거핀': 893, '아이맥스': 1753, '개봉': 97, '거임': 119, '완죤': 2046, '스타': 1589, '정부': 2517, '국민': 273, '의지': 2200, '퀘벡': 2907, '호킹': 3218, '능욕': 540, '보통': 1224, '전쟁영화': 2476, '특유': 2992, '연출': 1961, '이정현': 2260, '전형': 2487, '사이코': 1353, '우릴': 2091, '욕망': 2076, '명대사': 928, '똥칠': 738, '된거': 692, '점도': 2499, '블레이드러너': 1293, '버금': 1166, '결코': 160, '정일우': 2530, '상투': 1394, '존경': 2583, '반지': 1107, '제왕': 2550, '피터': 3091, '잭슨': 2442, '감독판': 50, '추억': 2837, '녹화': 507, '리메이크': 807, '무협': 1006, '보너스': 1209, '시내': 1622, '보신': 1216, '추강': 2831, '아이': 1749, '만화영화': 875, '액션영화': 1831, '서도': 1417, '귀수': 292, '아군': 1720, '첫사랑': 2784, '설레임': 1440, '그대로': 308, '관람': 235, '향수': 3177, '김민선': 387, '김규리': 382, '여신': 1922, '미모': 1035, '힐링': 3305, '무비': 990, '자리': 2356, '연속': 1957, '질리': 2718, '외계인': 2058, '신분': 1659, '라이언': 750, '애니': 1820, '코드': 2889, '반영': 1100, '짜임새': 2730, '러브': 761, '모드': 944, '여주': 1931, '어요': 1851, '하나요': 3102, '시간여행': 1618, '트릴': 2984, '방송': 1132, '털털': 2956, '실제': 1680, '시기': 1619, '나중': 433, '입문': 2345, '가연': 17, '물감': 1012, '애가': 1817, '주제': 2631, '배급사': 1144, '싸이코패스': 1707, '안인숙': 1792, '이편': 2271, '조각': 2558, '재밋': 2426, '용도': 2081, '케미': 2880, '이예': 2249, '여유': 1926, '일요일': 2325, '오후': 2028, '동화': 690, '세계': 1473, '스타워즈': 1591, '쌍탑': 1710, '일견': 2308, '알콜중독': 1808, '의문': 2194, '겉보기': 135, '안정': 1794, '그녀': 304, '외로움': 2061, '아픔': 1763, '로써': 795, '이유': 2253, '내생': 473, '가장': 22, '여명': 1919, '눈동자': 522, '허준': 3184, '모래시계': 946, '황금의제국': 3257, '규모': 296, '사업가': 1346, '집안': 2724, '갈등': 41, '서로': 1418, '원수': 2124, '필름': 3094, '무당': 980, '방이': 1136, '들개': 711, '해외': 3166, '수니': 1539, '웁니': 2113, '가원': 19, '웃기': 2114, '해지': 3167, '소녀시대': 1492, '윤아': 2175, '잘못': 2387, '박한별': 1087, '송지효': 1532, '조안': 2569, '싸이코': 1706, '역할': 1941, '나름': 422, '몸매': 966, '부분': 1259, '주연': 2624, '목소리': 960, '간직': 40, '사람과': 1334, '리가': 802, '폭풍': 3064, '집중': 2726, '도리': 647, '글쎄': 335, '알라': 1801, '미움': 1042, '가르침': 9, '모슬렘': 952, '신고': 1651, 
'애국': 1818, '미국': 1025, '시민': 1629, '말로': 877, '교회': 251, '반성': 1099, '기독교인': 353, '래야': 758, '나마': 423, '행운': 3174, '엔딩크레딧': 1901, '발견': 1110, '목록': 959, '배경': 1141, '휴식': 3286, '난리': 441, '람보': 755, '터미네이터': 2954, '맥클레인': 894, '얘기': 1840, '재밋다': 2431, '혼자': 3225, '올해': 2036, '짱짱': 2734, '눈빛': 524, '생동감': 1405, '몸짓': 968, '발짓': 1120, '거의': 118, '강아지': 77, '토토': 2963, '흐름': 3288, '희망': 3297, '슬픔': 1614, '비극': 1298, '동전': 688, '양면': 1836, '어릴떄': 1848, '아빠': 1736, '매일': 892, '지오': 2689, '자연': 2367, '압도': 1812, '댄스': 619, '롤라': 798, '새해': 1399, '아따맘마': 1726, '리지': 813, '차태현': 2751, '박중훈': 1085, '아즈': 1759, '환상': 3251, '헤메': 3191, '막시무스': 857, '빈민촌': 1315, '실상': 1677, '리얼': 810, '영화로': 1981, '미화': 1050, '생명': 1408, '그름': 318, '나라': 419, '스스로': 1579, '정해': 2536, '그게': 301, '앞뒤': 1816, '가까이': 1, '오만': 2006, '편견': 3031, '위트': 2145, '영화매니아': 1982, '지침': 2693, '다른': 546, '미드': 1032, '재방송': 2435, '캐스팅': 2871, '한효주': 3146, '문채원': 1010, '방금': 1127, '중학교': 2658, '담임': 576, '선생님': 1435, '이후': 2276, '가끔': 2, '백영규': 1158, '재난': 2422, '극치': 329, '돋넼': 668, '대부분': 601, '완성': 2042, '코믹': 2892, '요소': 2069, '아치': 1760, '애환': 1828, '조음': 2573, '동물': 675, '재밋는': 2429, '박유천': 1081, '한지민': 3141, '미가': 1024, '폭포': 3063, '탈출': 2942, '이전': 2258, '사상': 1343, '무렵': 983, '최진희': 2825, '로랑': 783, '건가': 125, '누가': 517, '중국영화': 2648, '로맨스': 786, '공존': 214, '번은': 1171, '감히': 66, '프레이져': 3083, '주걸륜': 2612, '머이쪙': 900, '박력': 1073, '스턴': 1596, '평작': 3045, '하정우': 3115, '개그': 86, '단발': 563, '의사': 2196, '적당': 2456, '밋게봣': 1062, '봣던': 1248, '사관': 1328, '키스신': 2920, '예수님': 1995, '계기': 171, '모던': 941, '시네마': 1623, '주의': 2625, '로맨틱': 787, '조우': 2572, '키아로스타미': 2921, '이중': 2265, '운동': 2106, '믿음': 1058, '가슴': 15, '정태우': 2534, '감사용': 55, '패전': 3017, '처리': 2771, '투수': 2975, '전문': 2466, '패가': 3014, '이기': 2207, '무승부': 993, '등판': 718, '본인': 1234, '상대': 1380, '방어율': 1135, '전체': 2478, '옴니버스': 2037, '진행': 2715, '색감': 1400, '상상력': 1384, '킬림': 2923, '임용': 2339, '낫다': 461, '연애': 1958, '여자': 1928, '관점': 239, '표정': 3070, 
'대사': 602, '콘서트': 2899, '현장': 3202, '전달': 2463, '명동역': 929, '젊은이': 2497, '헌신': 3186, '전범': 2468, '체코': 2792, '운명': 2107, '해리슨': 3161, '포드': 3050, '홍콩영화': 3235, '화가': 3236, '몽환': 972, '갑자기': 67, '다해': 558, '경의': 166, '우울함': 2099, '만큼': 870, '실감': 1671, '연민': 1953, '벤자민': 1189, '검색': 132, '종로': 2597, '장거리': 2393, '세트': 1485, '장만': 2402, '요크셔': 2074, '테리어': 2958, '소형견': 1515, '상태': 1393, '요키': 2075, '견주': 153, '전적': 2477, '잔인': 2384, '나안': 428, '도시': 653, '세상': 1480, '세심': 1481, '완벽주의자': 2041, '피에르': 3090, '주네': 2616, '이윤기': 2254, '한장': 3136, '면도': 922, '빵빵': 1320, '지고': 2668, '재밋는데': 2430, '적꿈': 2455, '일리': 2313, '므흣': 1023, '메각하킬': 908, '스타스크림': 1590, '활약': 3253, '이참': 2266, '트포': 2987, '오토봇': 2023, '로봇': 791, '대장': 609, '제외': 2551, '기술': 364, '여친': 1933, '주말': 2618, '가요': 18, '흐흐': 3290, '게이고': 141, '플롯': 3087, '전개': 2458, '화면': 3240, '구성': 259, '예술': 1996, '판도': 3008, '뉴문': 528, '신도': 1654, '벨라': 1190, '트와일라잇': 2986, '이클립스': 2268, '커플': 2877, '털보': 2955, '뒷모습': 700, '레이드': 774, '맨몸': 895, '결정': 159, '체다': 2789, '편도': 3032, '소름': 1498, '올드보이': 2034, '합작': 3155, '진품': 2714, '예나': 1991, '노릇': 498, '아들': 1725, '뮤직비디오': 1022, '동양인': 685, '하자': 3114, '자체': 2372, '디스': 719, '통한': 2971, '유지': 2164, '남녀': 451, '유태인': 2167, '레즈비언': 778, '오타쿠': 2021, '등등': 715, '목적': 962, '소설': 1505, '강간범': 69, '새끼': 1397, '강간': 68, '정신': 2523, '난도질': 440, '드래곤길들이기': 707, '감명': 53, '봣어': 1249, '싱하형': 1702, '변호인': 1198, '법정': 1176, '분통': 1274, '지경': 2667, '있냠': 2348, '삼류': 1371, '타이틀': 2932, '덱스터': 636, '데스몬드': 633, '리즈시절': 812, '인형': 2306, '색히': 1403, '영화음악': 1984, '라인': 751, '난캉이': 446, '필리핀': 3095, '서클': 1427, '배웅': 1155, '그때': 311, '풍경': 3075, '과거': 222, '회상': 3261, '요코': 2073, '고양이': 190, '재난영화': 2423, '아주': 1757, '흥미': 3293, '똥개': 736, '잼잇엇어': 2443, '에피소드': 1895, '사이코패스': 1354, '초점': 2808, '탄도': 2939, '크리스마스': 2908, '상영': 1388, '겨우': 144, '연기력': 1946, '최상': 2820, '몰입도': 965, '웅앙아': 2117, '신은경': 1663, '피아니스트': 3089, '보시': 1215, '니요': 541, '시즌': 1641, '다가': 544, '어쩌다가': 1859, '특징': 2994, 
'거짓말': 122, '태고': 2946, '살때': 1365, '십대영화': 1698, '울면': 2109, '사육사': 1350, ...}
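The numbers in `cv.vocabulary_` are column indices, assigned in sorted order of the tokens. The following is a simplified pure-Python sketch of that idea, not scikit-learn's actual implementation. It assumes `CountVectorizer`'s defaults (lowercasing, and a token pattern that keeps tokens of two or more word characters); `build_vocabulary` and the three-document corpus are hypothetical.

```python
import re

# Same pattern as CountVectorizer's default: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(documents):
    """Map each distinct token to its position in the sorted token list."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}


# Hypothetical mini-corpus built from noun lists shown earlier in this post.
docs = [
    "보고 지금 다시",
    "안개 밤하늘 초승달 영화",
    "연기 진짜 생각 몰입 진짜 영화",
]

vocab = build_vocabulary(docs)
print(vocab)
```

The absolute indices differ from Out[14] because the corpus is much smaller, but the relative order is the same: '다시' comes before '보고', which comes before '지금'.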
In [15]:
x = cv.transform(joined_nouns)
print(x)
  (0, 551)      1
  (0, 1206)     1
  (0, 2670)     1
  (1, 494)      1
  (1, 721)      1
  (1, 722)      1
  (1, 1011)     1
  (1, 1119)     1
  (1, 1333)     1
  (1, 1345)     1
  (1, 1363)     1
  (1, 1640)     1
  (1, 1964)     1
  (1, 2059)     1
  (1, 2090)     1
  (1, 2309)     1
  (1, 2479)     2
  (1, 2972)     1
  (1, 3120)     1
  (2, 1266)     1
  (2, 1628)     1
  (2, 2814)     1
  (2, 3067)     1
  (2, 3098)     1
  (3, 964)      1
  :     :
  (1990, 2499)  1
  (1990, 2892)  1
  (1990, 3046)  1
  (1991, 67)    1
  (1991, 435)   1
  (1991, 1957)  1
  (1991, 1977)  1
  (1991, 2347)  1
  (1992, 2947)  1
  (1993, 1990)  1
  (1994, 102)   1
  (1994, 1202)  1
  (1995, 963)   1
  (1995, 3057)  1
  (1997, 625)   1
  (1997, 819)   1
  (1997, 847)   1
  (1997, 1353)  1
  (1997, 1977)  2
  (1997, 2044)  1
  (1998, 746)   1
  (1998, 766)   1
  (1998, 1578)  1
  (1999, 1976)  3
  (1999, 2447)  1
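Each printed entry reads as `(row, column) count`: the review at that row contains the word with that vocabulary index that many times. A small sketch that decodes row 0, using only the three vocabulary entries and the three matrix entries shown in the outputs above:

```python
# Vocabulary entries copied from Out[14]; matrix entries copied from print(x).
vocabulary = {'보고': 1206, '지금': 2670, '다시': 551}
row0_entries = [(0, 551, 1), (0, 1206, 1), (0, 2670, 1)]  # (row, column, count)

# Invert the mapping so a column index can be turned back into a word.
index_to_word = {index: word for word, index in vocabulary.items()}

decoded = {index_to_word[col]: count for _, col, count in row0_entries}
print(decoded)  # {'다시': 1, '보고': 1, '지금': 1}
```

This matches the first review's noun list [보고, 지금, 다시]. With the fitted vectorizer, `cv.get_feature_names_out()` (scikit-learn 1.0 or later) should provide the same index-to-word lookup for every column.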
In [16]:
from sklearn.model_selection import train_test_split
y = df.label
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=814)
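With `test_size=0.2`, the 2,000 sampled reviews are divided into 1,600 training rows and 400 test rows. The idea behind a shuffled split can be sketched with the standard library alone. The helper `split_indices` is hypothetical, and it will not reproduce the exact rows that `train_test_split` selects, because the shuffling algorithm differs:

```python
import random


def split_indices(n_samples, test_size=0.2, seed=814):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2000)
print(len(train_idx), len(test_idx))  # 1600 400
```

Fixing the seed plays the same role as `random_state` above: rerunning the cell gives the same split, so the reported scores stay comparable between runs.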
In [17]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
pred = model.predict(x_test)
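For intuition about what `MultinomialNB` estimates, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also scikit-learn's default). It is a simplified illustration, not the library's implementation; the function names and the toy corpus are hypothetical and unrelated to the NSMC data.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token lists."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: class token total plus alpha per vocabulary word.
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus (1 = positive, 0 = negative).
train_docs = [["최고", "감동"], ["감동", "명작"], ["쓰레기", "시간"], ["시간", "후회"]]
train_labels = [1, 1, 0, 0]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["감동", "최고"], log_prior, log_likelihood))
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero, which matters here because most nouns occur in only a handful of the 1,600 training reviews.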
4. Evaluating the Model
In [18]:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, pred)
Out[18]:
0.69
In [19]:
cf_mtx = confusion_matrix(y_test, pred)
cf_mtx
Out[19]:
array([[123,  78],
       [ 46, 153]])
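The accuracy of 0.69 can be recovered directly from this matrix. In scikit-learn's `confusion_matrix`, rows are the actual labels and columns are the predicted labels, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below also derives precision and recall for the positive class (label 1), which the post itself does not report:

```python
# Confusion matrix from Out[19]; rows are actual labels, columns are predictions.
cf_mtx = [[123, 78],
          [46, 153]]

(tn, fp), (fn, tp) = cf_mtx
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of the reviews predicted positive, the share that is positive
recall = tp / (tp + fn)     # of the positive reviews, the share that was found

print(f"accuracy={accuracy:.2f}, precision={precision:.3f}, recall={recall:.3f}")
```

The 78 false positives outnumber the 46 false negatives, so on this sample the model leans toward predicting a review as positive.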
In [20]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(cf_mtx, cmap='coolwarm', annot=True, fmt='.0f')
plt.title("CONFUSION MATRIX")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
Mecab
In [21]:
!curl -s https://raw.githubusercontent.com/teddylee777/machine-learning/master/99-Misc/01-Colab/mecab-colab.sh | bash
--2022-11-16 05:30:26--  https://www.dropbox.com/s/9xls0tgtf3edgns/mecab-0.996-ko-0.9.2.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.71.18, 2620:100:6021:18::a27d:4112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.71.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
...
HTTP request sent, awaiting response... 200 OK
Length: 1414979 (1.3M) [application/binary]
Saving to: ‘mecab-0.996-ko-0.9.2.tar.gz?dl=1.2’
2022-11-16 05:30:27 (11.6 MB/s) - ‘mecab-0.996-ko-0.9.2.tar.gz?dl=1.2’ saved [1414979/1414979]
mecab-0.996-ko-0.9.2/
mecab-0.996-ko-0.9.2/example/
mecab-0.996-ko-0.9.2/example/example.cpp
...
mecab-0.996-ko-0.9.2/src/scoped_ptr.h
mecab-0.996-ko-0.9.2/Makefile.in
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
...
checking for gcc... gcc
checking whether the C compiler works... yes
...
./configure: line 7378: /usr/bin/file: No such file or directory
...
checking if g++ supports stl <functional> (required)... yes
checking if g++ supports stl <algorithm> (required)...
yes checking if g++ supports stl <string> (required)... yes checking if g++ supports stl <iostream> (required)... yes checking if g++ supports stl <sstream> (required)... yes checking if g++ supports stl <fstream> (required)... yes checking if g++ supports template <class T> (required)... yes checking if g++ supports const_cast<> (required)... yes checking if g++ supports static_cast<> (required)... yes checking if g++ supports reinterpret_cast<> (required)... yes checking if g++ supports namespaces (required) ... yes checking if g++ supports __thread (optional)... yes checking if g++ supports template <class T> (required)... yes checking if g++ supports GCC native atomic operations (optional)... yes checking if g++ supports OSX native atomic operations (optional)... no checking if g++ environment provides all required features... yes configure: creating ./config.status config.status: creating Makefile config.status: creating src/Makefile config.status: creating src/Makefile.msvc config.status: creating man/Makefile config.status: creating doc/Makefile config.status: creating tests/Makefile config.status: creating swig/version.h config.status: creating mecab.iss config.status: creating mecab-config config.status: creating mecabrc config.status: creating config.h config.status: config.h is unchanged config.status: executing depfiles commands config.status: executing libtool commands config.status: executing default commands make all-recursive make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2' Making all in src make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/src' make[2]: Nothing to be done for 'all'. make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/src' Making all in man make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/man' make[2]: Nothing to be done for 'all'. make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/man' Making all in doc make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/doc' make[2]: Nothing to be done for 'all'. 
make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/doc' Making all in tests make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[2]: Nothing to be done for 'all'. make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2' make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2' Making check in src make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/src' make[1]: Nothing to be done for 'check'. make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/src' Making check in man make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/man' make[1]: Nothing to be done for 'check'. make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/man' Making check in doc make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/doc' make[1]: Nothing to be done for 'check'. make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/doc' Making check in tests make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/tests' make check-TESTS make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/tests' ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 177 emitting double-array: 100% |###########################################| reading ./matrix.def ... 178x178 emitting matrix : 100% |###########################################| done! ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 83 emitting double-array: 100% |###########################################| reading ./matrix.def ... 84x84 emitting matrix : 100% |###########################################| done! 
./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 450 emitting double-array: 100% |###########################################| reading ./matrix.def ... 1x1 done! ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 162 emitting double-array: 100% |###########################################| reading ./matrix.def ... 3x3 emitting matrix : 100% |###########################################| done! ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 4 emitting double-array: 100% |###########################################| reading ./matrix.def ... 1x1 done! ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 11 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 1 reading ./matrix.def ... 1x1 done! ./pos-id.def is not found. minimum setting is used reading ./unk.def ... 2 emitting double-array: 100% |###########################################| ./model.def is not found. skipped. ./pos-id.def is not found. minimum setting is used reading ./dic.csv ... 1 reading ./matrix.def ... 1x1 done! PASS: run-dics.sh PASS: run-eval.sh seed/pos-id.def is not found. minimum setting is used reading seed/unk.def ... 40 emitting double-array: 100% |###########################################| seed/model.def is not found. skipped. 
seed/pos-id.def is not found. minimum setting is used reading seed/dic.csv ... 4335 emitting double-array: 100% |###########################################| reading seed/matrix.def ... 1x1 done! reading corpus ... Number of sentences: 34 Number of features: 64108 eta: 0.00005 freq: 1 eval-size: 6 unk-eval-size: 4 threads: 1 charset: EUC-JP C(sigma^2): 1.00000 iter=0 err=1.00000 F=0.35771 target=2406.28355 diff=1.00000 iter=1 err=0.97059 F=0.65652 target=1484.25231 diff=0.38318 iter=2 err=0.91176 F=0.79331 target=863.32765 diff=0.41834 iter=3 err=0.85294 F=0.89213 target=596.72480 diff=0.30881 iter=4 err=0.61765 F=0.95467 target=336.30744 diff=0.43641 iter=5 err=0.50000 F=0.96702 target=246.53039 diff=0.26695 iter=6 err=0.35294 F=0.95472 target=188.93963 diff=0.23361 iter=7 err=0.20588 F=0.99106 target=168.62665 diff=0.10751 iter=8 err=0.05882 F=0.99777 target=158.64865 diff=0.05917 iter=9 err=0.08824 F=0.99665 target=154.14530 diff=0.02839 iter=10 err=0.08824 F=0.99665 target=151.94257 diff=0.01429 iter=11 err=0.02941 F=0.99888 target=147.20825 diff=0.03116 iter=12 err=0.00000 F=1.00000 target=147.34956 diff=0.00096 iter=13 err=0.02941 F=0.99888 target=146.32592 diff=0.00695 iter=14 err=0.00000 F=1.00000 target=145.77299 diff=0.00378 iter=15 err=0.02941 F=0.99888 target=145.24641 diff=0.00361 iter=16 err=0.00000 F=1.00000 target=144.96490 diff=0.00194 iter=17 err=0.02941 F=0.99888 target=144.90246 diff=0.00043 iter=18 err=0.00000 F=1.00000 target=144.75959 diff=0.00099 iter=19 err=0.00000 F=1.00000 target=144.71727 diff=0.00029 iter=20 err=0.00000 F=1.00000 target=144.66337 diff=0.00037 iter=21 err=0.00000 F=1.00000 target=144.61349 diff=0.00034 iter=22 err=0.00000 F=1.00000 target=144.62987 diff=0.00011 iter=23 err=0.00000 F=1.00000 target=144.60060 diff=0.00020 iter=24 err=0.00000 F=1.00000 target=144.59125 diff=0.00006 iter=25 err=0.00000 F=1.00000 target=144.58619 diff=0.00004 iter=26 err=0.00000 F=1.00000 target=144.58219 diff=0.00003 iter=27 err=0.00000 
F=1.00000 target=144.58059 diff=0.00001 Done! writing model file ... model-ipadic.c1.0.f1.model is not a binary model. reopen it as text mode... reading seed/unk.def ... 40 reading seed/dic.csv ... 4335 emitting model-ipadic.c1.0.f1.dic/left-id.def/ model-ipadic.c1.0.f1.dic/right-id.def emitting model-ipadic.c1.0.f1.dic/unk.def ... 40 emitting model-ipadic.c1.0.f1.dic/dic.csv ... 4335 emitting matrix : 100% |###########################################| copying seed/char.def to model-ipadic.c1.0.f1.dic/char.def copying seed/rewrite.def to model-ipadic.c1.0.f1.dic/rewrite.def copying seed/dicrc to model-ipadic.c1.0.f1.dic/dicrc copying seed/feature.def to model-ipadic.c1.0.f1.dic/feature.def copying model-ipadic.c1.0.f1.model to model-ipadic.c1.0.f1.dic/model.def done! model-ipadic.c1.0.f1.dic/pos-id.def is not found. minimum setting is used reading model-ipadic.c1.0.f1.dic/unk.def ... 40 emitting double-array: 100% |###########################################| model-ipadic.c1.0.f1.dic/pos-id.def is not found. minimum setting is used reading model-ipadic.c1.0.f1.dic/dic.csv ... 4335 emitting double-array: 100% |###########################################| reading model-ipadic.c1.0.f1.dic/matrix.def ... 346x346 emitting matrix : 100% |###########################################| done! 
precision recall F LEVEL 0: 12.8959(57/442) 11.8998(57/479) 12.3779 LEVEL 1: 12.2172(54/442) 11.2735(54/479) 11.7264 LEVEL 2: 11.7647(52/442) 10.8559(52/479) 11.2921 LEVEL 4: 11.7647(52/442) 10.8559(52/479) 11.2921 PASS: run-cost-train.sh ================== All 3 tests passed ================== make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2' Making install in src make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/src' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/src' test -z "/usr/local/lib" || /bin/mkdir -p "/usr/local/lib" /bin/bash ../libtool --mode=install /usr/bin/install -c libmecab.la '/usr/local/lib' libtool: install: /usr/bin/install -c .libs/libmecab.so.2.0.0 /usr/local/lib/libmecab.so.2.0.0 libtool: install: (cd /usr/local/lib && { ln -s -f libmecab.so.2.0.0 libmecab.so.2 || { rm -f libmecab.so.2 && ln -s libmecab.so.2.0.0 libmecab.so.2; }; }) libtool: install: (cd /usr/local/lib && { ln -s -f libmecab.so.2.0.0 libmecab.so || { rm -f libmecab.so && ln -s libmecab.so.2.0.0 libmecab.so; }; }) libtool: install: /usr/bin/install -c .libs/libmecab.lai /usr/local/lib/libmecab.la libtool: install: /usr/bin/install -c .libs/libmecab.a /usr/local/lib/libmecab.a libtool: install: chmod 644 /usr/local/lib/libmecab.a libtool: install: ranlib /usr/local/lib/libmecab.a libtool: finish: PATH="/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin:/sbin" ldconfig -n /usr/local/lib ---------------------------------------------------------------------- Libraries have been installed in: /usr/local/lib If you ever happen to want to link against installed libraries in a given directory, LIBDIR, you must either use libtool, and specify the full pathname of the library, or use the 
`-LLIBDIR' flag during linking and do at least one of the following: - add LIBDIR to the `LD_LIBRARY_PATH' environment variable during execution - add LIBDIR to the `LD_RUN_PATH' environment variable during linking - use the `-Wl,-rpath -Wl,LIBDIR' linker flag - have your system administrator add LIBDIR to `/etc/ld.so.conf' See any operating system documentation about shared libraries for more information, such as the ld(1) and ld.so(8) manual pages. ---------------------------------------------------------------------- test -z "/usr/local/bin" || /bin/mkdir -p "/usr/local/bin" /bin/bash ../libtool --mode=install /usr/bin/install -c mecab '/usr/local/bin' libtool: install: /usr/bin/install -c .libs/mecab /usr/local/bin/mecab test -z "/usr/local/libexec/mecab" || /bin/mkdir -p "/usr/local/libexec/mecab" /bin/bash ../libtool --mode=install /usr/bin/install -c mecab-dict-index mecab-dict-gen mecab-cost-train mecab-system-eval mecab-test-gen '/usr/local/libexec/mecab' libtool: install: /usr/bin/install -c .libs/mecab-dict-index /usr/local/libexec/mecab/mecab-dict-index libtool: install: /usr/bin/install -c .libs/mecab-dict-gen /usr/local/libexec/mecab/mecab-dict-gen libtool: install: /usr/bin/install -c .libs/mecab-cost-train /usr/local/libexec/mecab/mecab-cost-train libtool: install: /usr/bin/install -c .libs/mecab-system-eval /usr/local/libexec/mecab/mecab-system-eval libtool: install: /usr/bin/install -c .libs/mecab-test-gen /usr/local/libexec/mecab/mecab-test-gen test -z "/usr/local/include" || /bin/mkdir -p "/usr/local/include" /usr/bin/install -c -m 644 mecab.h '/usr/local/include' make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/src' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/src' Making install in man make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/man' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/man' make[2]: Nothing to be done for 'install-exec-am'. 
test -z "/usr/local/share/man/man1" || /bin/mkdir -p "/usr/local/share/man/man1" /usr/bin/install -c -m 644 mecab.1 '/usr/local/share/man/man1' make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/man' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/man' Making install in doc make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/doc' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/doc' make[2]: Nothing to be done for 'install-exec-am'. make[2]: Nothing to be done for 'install-data-am'. make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/doc' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/doc' Making install in tests make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[2]: Nothing to be done for 'install-exec-am'. make[2]: Nothing to be done for 'install-data-am'. make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2/tests' make[1]: Entering directory '/tmp/mecab-0.996-ko-0.9.2' make[2]: Entering directory '/tmp/mecab-0.996-ko-0.9.2' test -z "/usr/local/bin" || /bin/mkdir -p "/usr/local/bin" /usr/bin/install -c mecab-config '/usr/local/bin' test -z "/usr/local/etc" || /bin/mkdir -p "/usr/local/etc" /usr/bin/install -c -m 644 mecabrc '/usr/local/etc' make[2]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2' make[1]: Leaving directory '/tmp/mecab-0.996-ko-0.9.2' --2022-11-16 05:31:06-- https://www.dropbox.com/s/i8girnk5p80076c/mecab-ko-dic-2.1.1-20180720.tar.gz?dl=1 Resolving www.dropbox.com (www.dropbox.com)... 162.125.7.18, 2620:100:6021:18::a27d:4112 Connecting to www.dropbox.com (www.dropbox.com)|162.125.7.18|:443... connected. HTTP request sent, awaiting response... 302 Found Location: /s/dl/i8girnk5p80076c/mecab-ko-dic-2.1.1-20180720.tar.gz [following] --2022-11-16 05:31:06-- https://www.dropbox.com/s/dl/i8girnk5p80076c/mecab-ko-dic-2.1.1-20180720.tar.gz Reusing existing connection to www.dropbox.com:443. 
HTTP request sent, awaiting response... 302 Found Location: https://uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com/cd/0/get/Bw1kH15CQEnLgG3rS5oZto4SRxUZU_tB_ogeVj7XCfDVuhHIeBUtisWuOvXrN4CNRs3UaXBz26qSR6QmsryRMXskR49C12CS9Kw-xrElUXAVq1RuPXRlHm35fTd3VA4GpQt6XOZeui0bOli6wjD3B76tRG6-OwvXyZ8WgZYNWElCP7OXMw8mRoFBleyJRfIr8dM/file?dl=1# [following] --2022-11-16 05:31:07-- https://uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com/cd/0/get/Bw1kH15CQEnLgG3rS5oZto4SRxUZU_tB_ogeVj7XCfDVuhHIeBUtisWuOvXrN4CNRs3UaXBz26qSR6QmsryRMXskR49C12CS9Kw-xrElUXAVq1RuPXRlHm35fTd3VA4GpQt6XOZeui0bOli6wjD3B76tRG6-OwvXyZ8WgZYNWElCP7OXMw8mRoFBleyJRfIr8dM/file?dl=1 Resolving uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com (uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com)... 162.125.65.15, 2620:100:6021:15::a27d:410f Connecting to uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com (uc803b9dfe98ccd0d9dd3f02a575.dl.dropboxusercontent.com)|162.125.65.15|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 49775061 (47M) [application/binary] Saving to: ‘mecab-ko-dic-2.1.1-20180720.tar.gz?dl=1.2’ mecab-ko-dic-2.1.1- 100%[===================>] 47.47M 21.2MB/s in 2.2s 2022-11-16 05:31:10 (21.2 MB/s) - ‘mecab-ko-dic-2.1.1-20180720.tar.gz?dl=1.2’ saved [49775061/49775061] Reading package lists... Done Building dependency tree Reading state information... Done autoconf is already the newest version (2.69-11). The following package was automatically installed and is no longer required: libnvidia-common-460 Use 'apt autoremove' to remove it. 0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded. 
mecab-ko-dic-2.1.1-20180720/ mecab-ko-dic-2.1.1-20180720/configure mecab-ko-dic-2.1.1-20180720/COPYING mecab-ko-dic-2.1.1-20180720/autogen.sh mecab-ko-dic-2.1.1-20180720/Place-station.csv mecab-ko-dic-2.1.1-20180720/NNG.csv mecab-ko-dic-2.1.1-20180720/README mecab-ko-dic-2.1.1-20180720/EF.csv mecab-ko-dic-2.1.1-20180720/MAG.csv mecab-ko-dic-2.1.1-20180720/Preanalysis.csv mecab-ko-dic-2.1.1-20180720/NNB.csv mecab-ko-dic-2.1.1-20180720/Person-actor.csv mecab-ko-dic-2.1.1-20180720/VV.csv mecab-ko-dic-2.1.1-20180720/Makefile.in mecab-ko-dic-2.1.1-20180720/matrix.def mecab-ko-dic-2.1.1-20180720/EC.csv mecab-ko-dic-2.1.1-20180720/NNBC.csv mecab-ko-dic-2.1.1-20180720/clean mecab-ko-dic-2.1.1-20180720/ChangeLog mecab-ko-dic-2.1.1-20180720/J.csv mecab-ko-dic-2.1.1-20180720/.keep mecab-ko-dic-2.1.1-20180720/feature.def mecab-ko-dic-2.1.1-20180720/Foreign.csv mecab-ko-dic-2.1.1-20180720/XPN.csv mecab-ko-dic-2.1.1-20180720/EP.csv mecab-ko-dic-2.1.1-20180720/NR.csv mecab-ko-dic-2.1.1-20180720/left-id.def mecab-ko-dic-2.1.1-20180720/Place.csv mecab-ko-dic-2.1.1-20180720/Symbol.csv mecab-ko-dic-2.1.1-20180720/dicrc mecab-ko-dic-2.1.1-20180720/NP.csv mecab-ko-dic-2.1.1-20180720/ETM.csv mecab-ko-dic-2.1.1-20180720/IC.csv mecab-ko-dic-2.1.1-20180720/Place-address.csv mecab-ko-dic-2.1.1-20180720/Group.csv mecab-ko-dic-2.1.1-20180720/model.def mecab-ko-dic-2.1.1-20180720/XSN.csv mecab-ko-dic-2.1.1-20180720/INSTALL mecab-ko-dic-2.1.1-20180720/rewrite.def mecab-ko-dic-2.1.1-20180720/Inflect.csv mecab-ko-dic-2.1.1-20180720/configure.ac mecab-ko-dic-2.1.1-20180720/NNP.csv mecab-ko-dic-2.1.1-20180720/CoinedWord.csv mecab-ko-dic-2.1.1-20180720/XSV.csv mecab-ko-dic-2.1.1-20180720/pos-id.def mecab-ko-dic-2.1.1-20180720/Makefile.am mecab-ko-dic-2.1.1-20180720/unk.def mecab-ko-dic-2.1.1-20180720/missing mecab-ko-dic-2.1.1-20180720/VCP.csv mecab-ko-dic-2.1.1-20180720/install-sh mecab-ko-dic-2.1.1-20180720/Hanja.csv mecab-ko-dic-2.1.1-20180720/MAJ.csv mecab-ko-dic-2.1.1-20180720/XSA.csv 
mecab-ko-dic-2.1.1-20180720/Wikipedia.csv mecab-ko-dic-2.1.1-20180720/tools/ mecab-ko-dic-2.1.1-20180720/tools/add-userdic.sh mecab-ko-dic-2.1.1-20180720/tools/mecab-bestn.sh mecab-ko-dic-2.1.1-20180720/tools/convert_for_using_store.sh mecab-ko-dic-2.1.1-20180720/user-dic/ mecab-ko-dic-2.1.1-20180720/user-dic/nnp.csv mecab-ko-dic-2.1.1-20180720/user-dic/place.csv mecab-ko-dic-2.1.1-20180720/user-dic/person.csv mecab-ko-dic-2.1.1-20180720/user-dic/README.md mecab-ko-dic-2.1.1-20180720/NorthKorea.csv mecab-ko-dic-2.1.1-20180720/VX.csv mecab-ko-dic-2.1.1-20180720/right-id.def mecab-ko-dic-2.1.1-20180720/VA.csv mecab-ko-dic-2.1.1-20180720/char.def mecab-ko-dic-2.1.1-20180720/NEWS mecab-ko-dic-2.1.1-20180720/MM.csv mecab-ko-dic-2.1.1-20180720/ETN.csv mecab-ko-dic-2.1.1-20180720/AUTHORS mecab-ko-dic-2.1.1-20180720/Person.csv mecab-ko-dic-2.1.1-20180720/XR.csv mecab-ko-dic-2.1.1-20180720/VCN.csv Looking in current directory for macros. configure.ac:2: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated. For more info, see: configure.ac:2: http://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation checking for a BSD-compatible install... /usr/bin/install -c checking whether build environment is sane... yes /tmp/mecab-ko-dic-2.1.1-20180720/missing: Unknown `--is-lightweight' option Try `/tmp/mecab-ko-dic-2.1.1-20180720/missing --help' for more information configure: WARNING: 'missing' script is too old or missing checking for a thread-safe mkdir -p... /bin/mkdir -p checking for gawk... no checking for mawk... mawk checking whether make sets $(MAKE)... yes checking whether make supports nested variables... yes checking for mecab-config... /usr/local/bin/mecab-config checking that generated files are newer than configure... done configure: creating ./config.status config.status: creating Makefile make: Nothing to be done for 'all'. 
make[1]: Entering directory '/tmp/mecab-ko-dic-2.1.1-20180720' make[1]: Nothing to be done for 'install-exec-am'. /bin/mkdir -p '/usr/local/lib/mecab/dic/mecab-ko-dic' /usr/bin/install -c -m 644 model.bin matrix.bin char.bin sys.dic unk.dic left-id.def right-id.def rewrite.def pos-id.def dicrc '/usr/local/lib/mecab/dic/mecab-ko-dic' make[1]: Leaving directory '/tmp/mecab-ko-dic-2.1.1-20180720' fatal: destination path 'mecab-python-0.996' already exists and is not an empty directory. Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/ Requirement already satisfied: konlpy in /usr/local/lib/python3.7/dist-packages (0.6.0) Requirement already satisfied: JPype1>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (1.4.1) Requirement already satisfied: lxml>=4.1.0 in /usr/local/lib/python3.7/dist-packages (from konlpy) (4.9.1) Requirement already satisfied: numpy>=1.6 in /usr/local/lib/python3.7/dist-packages (from konlpy) (1.21.6) Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from JPype1>=0.7.0->konlpy) (4.1.1) Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from JPype1>=0.7.0->konlpy) (21.3) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->JPype1>=0.7.0->konlpy) (3.0.9)
In [22]:
from konlpy.tag import Mecab
mecab = Mecab()
df = pd.read_csv(file_url, sep='\t', index_col=0)
df.head()
Out[22]:
id | document | label |
---|---|---|
8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 |
8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 |
4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 |
9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 |
10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 |
In [23]:
df = df.dropna()
df = pd.concat([df.head(1000), df.tail(1000)])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 8112052 to 8548411
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   document  2000 non-null   object
 1   label     2000 non-null   int64
dtypes: int64(1), object(1)
memory usage: 46.9+ KB
In [24]:
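# Imports used by this cell (harmless if already imported earlier in the notebook)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

def handle_naive_bayes(df: pd.DataFrame, tagger):
    # Tokenize each review with the given tagger, then join the tokens with spaces
    tokens = df.document.apply(tagger).apply(" ".join)
    # Build bag-of-words count features
    cv = CountVectorizer()
    x = cv.fit_transform(tokens)
    y = df.label
    # Default split is 75% train / 25% test; no random_state is set, so accuracy varies between runs
    x_train, x_test, y_train, y_test = train_test_split(x, y)
    model = MultinomialNB()
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    print(accuracy_score(y_test, pred))
    # Plot the confusion matrix as a heatmap
    sns.heatmap(confusion_matrix(y_test, pred),
                cmap='coolwarm', annot=True, fmt='.0f')
    plt.title("CONFUSION MATRIX")
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

handle_naive_bayes(df, mecab.nouns)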
0.62
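To make the count-then-classify step inside handle_naive_bayes more concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. It is not scikit-learn's implementation and not the code that produced the accuracy above. The helper names (fit_multinomial_nb, predict_nb) and the four toy reviews are made up for illustration, and it assumes each document is already a space-joined token string like the one the function builds.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on space-joined token strings (illustrative helper)."""
    class_doc_counts = Counter(labels)
    token_counts = defaultdict(Counter)  # class -> token -> count
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        token_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for c, n_c in class_doc_counts.items():
        # log P(class): share of training documents with this label
        model["log_prior"][c] = math.log(n_c / len(docs))
        # Smoothed denominator: all token counts in the class + alpha per vocabulary entry
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    tokens = [t for t in doc.split() if t in model["vocab"]]
    scores = {
        c: model["log_prior"][c]
        + sum(model["log_likelihood"][c][t] for t in tokens)
        for c in model["log_prior"]
    }
    return max(scores, key=scores.get)

# Made-up toy data (1 = positive, 0 = negative), not taken from the NSMC file
toy_docs = ["연기 최고 영화", "영화 감동", "시간 낭비 영화", "스토리 낭비"]
toy_labels = [1, 1, 0, 0]
toy_model = fit_multinomial_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "최고 감동"))  # tokens seen only in positive reviews
print(predict_nb(toy_model, "시간 낭비"))  # tokens seen only in negative reviews
```

One difference to keep in mind: this sketch keeps every whitespace-separated token, whereas CountVectorizer's default token pattern keeps only tokens of two or more word characters, so single-syllable nouns returned by a tagger are dropped from the features unless token_pattern is changed.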