Hierarchical Clustering of Large-Scale Short Conversations Based on Domain Ontol

Started by aruljothi, Apr 11, 2009, 07:05 PM

aruljothi

With the rapid development of the internet and communication technology, huge volumes of data have accumulated. Short texts, such as chat room conversations and emails, are common in this data. Clustering such short documents is useful for revealing the structure of the data and for building other data mining applications. However, most current clustering algorithms cannot achieve acceptable accuracy on short documents because keywords appear in them with low frequency. Processing high-dimensional text data in very large databases is also difficult. In this paper, we propose a hierarchical clustering algorithm that uses a domain ontology to improve clustering accuracy. The algorithm is also parallel and based on frequent concepts, which makes it scalable to very large, high-dimensional text data. Our experimental study shows that it is more accurate than other hierarchical clustering algorithms when clustering short conversations. It also scales well and can be applied to very large data sets.
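
To make the idea in the abstract more concrete, here is a minimal Python sketch of clustering short messages by ontology concepts instead of raw keywords. It is not the paper's algorithm: the toy `ONTOLOGY` mapping, the function names, the `min_support` and `threshold` parameters, and the greedy Jaccard-based merging are all assumptions made for illustration, and the sketch is sequential rather than parallel.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ontology (an assumption): surface term -> domain concept.
ONTOLOGY = {
    "laptop": "computer", "notebook": "computer", "pc": "computer",
    "refund": "payment", "invoice": "payment", "charge": "payment",
    "delivery": "shipping", "courier": "shipping", "parcel": "shipping",
}


def to_concepts(text):
    """Map a short message to the set of ontology concepts it mentions."""
    return {ONTOLOGY[w] for w in text.lower().split() if w in ONTOLOGY}


def frequent_concepts(concept_sets, min_support):
    """Keep only concepts that occur in at least min_support messages."""
    counts = Counter(c for s in concept_sets for c in s)
    return {c for c, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, min_support=2, threshold=0.5):
    """Greedily merge the most similar pair of clusters until none qualifies.

    Each cluster is (member indices, union of its frequent concepts).
    Returns a list of clusters, each a list of message indices.
    """
    sets = [to_concepts(t) for t in texts]
    keep = frequent_concepts(sets, min_support)
    clusters = [([i], s & keep) for i, s in enumerate(sets)]
    while True:
        best = None
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            sim = jaccard(a[1], b[1])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            return [members for members, _ in clusters]
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
```

In this sketch, "my laptop will not boot" and "notebook screen is broken" share no keyword, yet both map to the hypothetical `computer` concept and so end up in the same cluster. That is the kind of gain the abstract attributes to using a domain ontology when keywords are sparse.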