Clustering
Clustering is one of the methods that serves for data classification. It is traditionally used as algorithm beyond the information retrieval process as the assessment documents relevance. The innovation which could be brought by this approach is the projection of this algorithm in the presentational level of the information retrieval system.
Cluster
Cluster is defined as number of similar items – things, persons or groups - grouped closely together. The difference between clusters and thesaurus classes is the unsupervised classification – clusters are not predefined. The initiative which activates the clustering process is the user’s need expressed by user’s information retrieval query. Clusters could show the natural grouping or structure in data set. There are several clusters as resulting forms that are exploited in different clustering methods and models (Zaïane, 1999):
- Exclusive Clustering – definite cluster with strict data
- Overlapping Clustering – fuzzy sets to cluster data, each data has different degree of membership, each cluster belongs to two or more clusters
- Hierarchical Clustering – union between two nearest clusters
- Probabilistic Clustering – completely probabilistic approach
Distance-based clustering
We could divide clusters in different groups according to the algorithm that defines different grouping. In the case of the first picture, we easily identify 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering (Zaïane, 1999) – items in the group share almost the same characteristics expressed by their position in the information space; items are depicted in the 2D or 3D space – in our case 2D - according to their options that establishes their uniform position.
Conceptual clustering
Another kind of clustering is conceptual - two or more objects belong to the same cluster if each one is defined by common concept to all that objects. Conceptual clustering is not based on perfect match and similarity between objects, but rather conceptual likeness (Tutorial, 2000). Categories and features that determinate the similarity of the group are fuzzy and more open than in the previous distance model, they cold be defined as overlapping clusters – items in the group have at least one “same” character.
The example of conceptual approach is Latent Semantic Indexing (LSI, see Deerwester et al., 1990). A query with one term (such as “pigs”) could have a high similarity with a document that has a related term (“hogs”). Rather than expanding queries based only a small set of term relations, LSI considers all terms potentially related to each other, and all documents to be similarly related (Newby, 2002).
Model-based clustering
Another of the conceptual clustering approaches is the model-based clustering methods. It is based on fit between two different data sets the data set and model. It emerges from the nonlinear m-dimensional inputs in data set. Which position is based on closeness. Thus the data set is selfcorrecting according to the changeable mental model. This theory is in connection with the SOM – self organizing model - theory from 1981 proposed by Kohen.
The further development in clustering theories is based on the likeness with human information acquisition. According to this approach precede the clustering theory the learning, statistic and probabilistic theories.
Cognitive aspects
The advantage of the clustering is the close similarity to the human way of thinking. It responses to the theory of inner mental modelling according to Wittgenstein and the theory of term and conceptual thinking, that enables people to deal with large data sets and easier to classify their long term memory (Loukotová, 2009). Clustering method though reflects the higher mental activities and is sufficient for information retrieval. Other important advantage is its relation to the changeable context of the real world. The structure of clusters is not fixed and it is reflecting the changes of the inner mental model depending on the reality.
The clustering method on the representative level could then bring a tool for easier understanding of the data set’s environment and deeper understanding of the relations in between the terms and objects and not to say the reality.
Problems
The exploitation of clustering method in the Web environment brings problem as each method that is based on similarity to the human thinking. There emerge a lot of different unknown and changeable facts that have to be taken in account. As bigger data set as more unknown facts. Other problem is the changeability of data set itself. In the web environment is the change of the amount and kind of data high and fast.
All kinds of clustering models are basically founded on sort of “distance” between terms and thus the right identification of the cluster is based on their representation in the information space. In follows the problem of filtering clusters is primarily consequent on the position of the clusters in the information space.
Conclusion
Nowadays clustering methods are highly exploited in the form of hidden algorithm. However their exploitation is not fully utilized. The potential is in the cognitive aspects of the method. As will be presented later, this approach is closest to the cognitive perception and ways of human thinking. That could in connection with search engines serve as the perfect information retrieval and learning tool.
Examples
Solitary applications: Carrot2Workbench
Web search engines: clusty.com
References
BORGMAN, Christine L. (1989). All Users of Information Retrieval Systems are Not Created Equal: An Exploration into Individual Differences. Information Processing and Management, vol. 25, no.3, pp. 237–251.
CARD, Stuart K., Mackinlay, Jock D., and Shneiderman, Ben. (1999). Readings in Information Visualization : Using Vision to Think. San Francisco: Morgan-Kaufman.
CEJPEK, J. (1998) Informace, komunikace a myšlení. Karolinum, Praha. 178
HULL, David A. (1999). The TREC-7 Filtering Track: Description and Analysis. In Voorhees, Ellen and Harman, Donna (Eds.), Proceedings of the 7th Text REtrieval Conference (TREC-7), Gaithersburg. Maryland: National Institute of Science and Technology
INGWERSEN, P. (1996). Cognitive Perspectives in Information Retrieval Interaction: Elements of a Cognitive IR Theory. J. Documentation, vol. 52, no. 1, pp. 3–50.
LOUKOTOVÁ, K. (2009) Úvod do problematiky uživatelského rozhraní. In Červenková, A. & Hořava, M. (Eds.), Uživatelsky přívětivá rozhraní. Horava &Associates, Praha.
NEWBY, G. B. (2002) Empirical Study of a 3D Visualization for Information Retrieval Tasks. Journal of Intelligent Information Systems, vol. 18, pp. 31–53.
SABOL, V. et al. (2002) Applications of a Lightweight, Web-Based Retrieval, Clustering, and Visualization Framework. In D. Karagiannis and U. Reimer (Eds.): PAKM 2002, LNAI 2569, pp. 359–368, 2002.
SHNEIDERMAN, Ben. (1996). The Eyes Have It: User Interfaces for Information Visualization. Technical Report No. CS-TR-3665, Human Computer Interface Laboratory. University of Maryland at College Park. Available at http://www.cs.umd.edu/TRs/groups/HCIL-no-abs.html
SCHAMBER, Linda, Eisenberg, Michael, and Nilan, Michael. (1991). Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management, vol. 26, no. 2, pp. 755–776.
TUTORIAL on Clustering Algorithms (2000) Politecnico di Milano. Available at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/
ZAÏANE, Osmar R. (1999) Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering. University of Alberta. Available on : http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html