DB LG | Jul 2, 2017

Unsupervised classification of large-scale heterogeneous data (original title: Classification non supervisée des données hétérogènes à large échelle)

Mohamed Ali Zoghlami, Olfa Arfaoui, Minyar Sassi Hidri, Rahma Ben Ayed

arXiv:1707.00297v1

Originality: Synthesis-oriented

AI Analysis

This work addresses scalability and efficiency challenges in clustering for companies dealing with massive heterogeneous datasets, though it appears incremental as it builds on existing methods like MCA and MapReduce.

The paper tackles the problem of clustering large-scale heterogeneous data by proposing a framework that combines Multiple Correspondence Analysis (MCA) with MapReduce to improve response time and clustering quality, showing encouraging results in both quantitative and qualitative aspects.

When clustering massive data, response time, disk access, and the quality of the resulting clusters become major issues for companies. In this context, we define a clustering framework for large-scale heterogeneous data that helps address these issues. The proposed framework relies, first, on descriptive analysis based on MCA and, second, on the MapReduce paradigm in a large-scale environment. The results are encouraging and demonstrate the efficiency of the hybrid deployment in terms of response quality and response time, on both qualitative and quantitative data.
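The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes categorical records are first turned into the indicator (complete disjunctive) table that MCA starts from, and that clustering is then done as a k-means-style iteration written as a map phase and a reduce phase. The function names (`indicator_rows`, `map_phase`, `reduce_phase`, `kmeans_step`) are hypothetical, and the MCA projection step itself (a decomposition of the indicator table) is deliberately omitted.

```python
def indicator_rows(records):
    """One-hot (complete disjunctive) coding of categorical records.

    Assumption: this is the table MCA would start from; the MCA
    projection onto principal axes is NOT performed here.
    """
    columns = sorted({(j, v) for r in records for j, v in enumerate(r)})
    return [[1.0 if r[j] == v else 0.0 for j, v in columns] for r in records]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(chunk, centers):
    """Mapper: emit (index of nearest center, (point, 1)) for each point."""
    for p in chunk:
        k = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        yield k, (p, 1)


def reduce_phase(pairs):
    """Reducer: sum points and counts per key, then average into centers."""
    acc = {}
    for k, (p, n) in pairs:
        if k in acc:
            s, c = acc[k]
            acc[k] = ([a + b for a, b in zip(s, p)], c + n)
        else:
            acc[k] = (list(p), n)
    return {k: [x / c for x in s] for k, (s, c) in acc.items()}


def kmeans_step(chunks, centers):
    """One iteration: map over every data chunk, then reduce to new centers."""
    pairs = (kv for chunk in chunks for kv in map_phase(chunk, centers))
    new = reduce_phase(pairs)
    # A center that received no points keeps its previous position.
    return [new.get(i, centers[i]) for i in range(len(centers))]
```

In a real deployment each chunk would live on a different node and the mapper and reducer would run under a framework such as Hadoop; here the chunks are plain Python lists so the data flow can be followed on a single machine.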

