A Study of Big Data Domain Automatic Classification Using Machine Learning

Kong Seongwon (공성원), Hwang Deokyoul (황덕열)

KCI candidate journal

Korea Big Data Society (한국빅데이터학회), December 2018

Journal of the Korea Big Data Society (한국빅데이터학회지), Vol. 3, No. 2, pp. 11-18 (8 pages)

UCI I410-ECN-0102-2019-500-001349899

Abstract

This study addresses automatic domain classification for domain-based quality diagnosis, a key element of big data quality diagnosis. As the value and use of big data grow and the Fourth Industrial Revolution takes hold, efforts are under way to create new value from big data in fields that have converged with IT, such as law, medicine, and finance. However, analysis based on unreliable data causes critical problems in both the process and the results, and judgments based on those results become hard to trust. The need for highly reliable data has grown accordingly, yet research on securing data quality, and the results of that research, remain insufficient. This study aims to shorten working time by using machine learning to automate domain classification, a task previously performed by hand, within domain-based quality diagnosis, which is a key element of diagnostic evaluation for improving data quality. Information about the characteristics of data stored in a database whose domains have already been identified is extracted and converted into features, and machine learning is applied to these features to automate domain classification. The approach is intended for use in big data quality diagnosis and to contribute to quality improvement.
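The feature extraction step described in the abstract can be illustrated with a minimal sketch. The abstract does not list the variables the authors actually derive, so the function name `column_features`, the `DATE_PATTERN` regular expression, and the five features below are all hypothetical, chosen only to show how the characteristics of one database column might be turned into numbers.

```python
import re

# Hypothetical pattern for ISO-style dates; not taken from the paper.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def column_features(values):
    """Summarize one column's raw string values as numeric features.

    The feature set is an illustrative assumption; the paper's actual
    variables are not given in the abstract.
    """
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v is not None and v != ""]
    n = len(non_null)
    if n == 0:
        return {"null_ratio": 1.0, "mean_length": 0.0, "digit_ratio": 0.0,
                "distinct_ratio": 0.0, "date_like_ratio": 0.0}

    total_chars = sum(len(v) for v in non_null)
    digit_chars = sum(ch.isdigit() for v in non_null for ch in v)
    date_like = sum(bool(DATE_PATTERN.fullmatch(v)) for v in non_null)

    return {
        "null_ratio": 1 - n / len(values),        # share of missing values
        "mean_length": total_chars / n,           # average string length
        "digit_ratio": digit_chars / total_chars, # share of digit characters
        "distinct_ratio": len(set(non_null)) / n, # uniqueness of values
        "date_like_ratio": date_like / n,         # share matching DATE_PATTERN
    }


features = column_features(["2018-12-01", "2018-12-02", None, ""])
```

Feature vectors of this kind, paired with the domain labels already recorded for each column, could then be used to train a classifier. The keywords and outline suggest a random forest, though the abstract itself does not specify the model configuration.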

Keywords

Data Quality Diagnosis

Domain

Machine Learning

Random Forest

Ⅰ. Introduction
Ⅱ. Big Data and Machine Learning
Ⅲ. Random Forest
Ⅳ. Data Quality
Ⅴ. Research and Results
Ⅵ. Research Results
Ⅶ. Conclusion
References
