A Study on Feature Extraction for Gender Prediction Using Text Documents and Their Application to Korean Corpus



KCI-indexed


Ye Rim Choi, So Lee Kim, Kyu Yon Park, Jong Hun Park

LG CNS Co., Ltd., December 2015

Entrue Journal of Information Technology, Vol. 14, No. 3, pp. 85-99 (15 pages)

UCI I410-ECN-0102-2016-560-000576123


Abstract

As personalized recommender systems and other services that require gender information become more common, predicting a user's gender has become an important research topic. Gender can be predicted from various data types, including images, video, and sensor data; among these, text documents such as social network posts or blog entries make it possible to infer the gender of their author. Prediction performance is known to depend on the types of features extracted from the documents. In this regard, this study surveys the feature types and extraction methods used in previous text-based gender prediction studies and examines whether they can be applied to Korean documents. More than 40 types of features were categorized, and those applicable to Korean were extracted from a Korean blog corpus and used in a gender prediction experiment. The experiment showed that open dictionary features outperformed the other lexical features and that semantic features were the most effective for gender prediction.
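To make the distinction between lexical feature types concrete, the sketch below counts word occurrences against two kinds of word lists. It assumes that a "closed dictionary" is a word list fixed in advance and an "open dictionary" is a vocabulary built from the corpus itself; the abstract does not define these terms, so this reading, the function names, and the toy English sentences are all illustrative assumptions and not the authors' method.

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization for illustration only. Korean text
    # would normally be split with a morphological analyzer instead.
    return text.lower().split()


def build_open_vocabulary(documents, min_count=1):
    # Open dictionary (assumed meaning): the word list is learned from the
    # corpus, keeping every token seen at least `min_count` times.
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return sorted(tok for tok, c in counts.items() if c >= min_count)


def count_features(text, word_list):
    # One feature per word in `word_list`: how often it occurs in `text`.
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]


# Hypothetical toy corpus, not data from the paper.
corpus = ["i love shopping love", "the game was great"]

# Closed dictionary (assumed meaning): a small list chosen in advance.
closed_dictionary = ["i", "the"]
open_vocabulary = build_open_vocabulary(corpus)

closed_vector = count_features("i saw the game the", closed_dictionary)
open_vector = count_features("love love game", open_vocabulary)
```

Either vector could then be passed to any standard classifier; the choice of classifier is left open here because the abstract does not name one.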

Keywords

Statistical Learning Method

Feature Extraction
