Clustering

Clustering is one of the methods that serves for data classification. It is traditionally used as algorithm beyond the information retrieval process as the assessment documents relevance. The innovation which could be brought by this approach is the projection of this algorithm in the presentational level of the information retrieval system.

Cluster

Cluster is defined as number of similar items – things, persons or groups - grouped closely together. The difference between clusters and thesaurus classes is the unsupervised classification – clusters are not predefined. The initiative which activates the clustering process is the user’s need expressed by user’s information retrieval query. Clusters could show the natural grouping or structure in data set. There are several clusters as resulting forms that are exploited in different clustering methods and models (Zaïane, 1999):

Exclusive Clustering – definite cluster with strict data

Overlapping Clustering – fuzzy sets to cluster data, each data has different degree of membership, each cluster belongs to two or more clusters
Hierarchical Clustering – union between two nearest clusters
Probabilistic Clustering – completely probabilistic approach

Distance-based clustering

We could divide clusters in different groups according to the algorithm that defines different grouping. In the case of the first picture, we easily identify 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering (Zaïane, 1999) – items in the group share almost the same characteristics expressed by their position in the information space; items are depicted in the 2D or 3D space – in our case 2D - according to their options that establishes their uniform position.

Conceptual clustering

Another kind of clustering is conceptual - two or more objects belong to the same cluster if each one is defined by common concept to all that objects. Conceptual clustering is not based on perfect match and similarity between objects, but rather conceptual likeness (Tutorial, 2000). Categories and features that determinate the similarity of the group are fuzzy and more open than in the previous distance model, they cold be defined as overlapping clusters – items in the group have at least one “same” character.

The example of conceptual approach is Latent Semantic Indexing (LSI, see Deerwester et al., 1990). A query with one term (such as “pigs”) could have a high similarity with a document that has a related term (“hogs”). Rather than expanding queries based only a small set of term relations, LSI considers all terms potentially related to each other, and all documents to be similarly related (Newby, 2002).

Model-based clustering

Another of the conceptual clustering approaches is the model-based clustering methods. It is based on fit between two different data sets the data set and model. It emerges from the nonlinear m-dimensional inputs in data set. Which position is based on closeness. Thus the data set is selfcorrecting according to the changeable mental model. This theory is in connection with the SOM – self organizing model - theory from 1981 proposed by Kohen.

The further development in clustering theories is based on the likeness with human information acquisition. According to this approach precede the clustering theory the learning, statistic and probabilistic theories.

Cognitive aspects

The advantage of the clustering is the close similarity to the human way of thinking. It responses to the theory of inner mental modelling according to Wittgenstein and the theory of term and conceptual thinking, that enables people to deal with large data sets and easier to classify their long term memory (Loukotová, 2009). Clustering method though reflects the higher mental activities and is sufficient for information retrieval. Other important advantage is its relation to the changeable context of the real world. The structure of clusters is not fixed and it is reflecting the changes of the inner mental model depending on the reality.

The clustering method on the representative level could then bring a tool for easier understanding of the data set’s environment and deeper understanding of the relations in between the terms and objects and not to say the reality.

Problems

The exploitation of clustering method in the Web environment brings problem as each method that is based on similarity to the human thinking. There emerge a lot of different unknown and changeable facts that have to be taken in account. As bigger data set as more unknown facts. Other problem is the changeability of data set itself. In the web environment is the change of the amount and kind of data high and fast.

All kinds of clustering models are basically founded on sort of “distance” between terms and thus the right identification of the cluster is based on their representation in the information space. In follows the problem of filtering clusters is primarily consequent on the position of the clusters in the information space.

Conclusion

Nowadays clustering methods are highly exploited in the form of hidden algorithm. However their exploitation is not fully utilized. The potential is in the cognitive aspects of the method. As will be presented later, this approach is closest to the cognitive perception and ways of human thinking. That could in connection with search engines serve as the perfect information retrieval and learning tool.

Examples

Solitary applications: Carrot2Workbench

Web search engines: clusty.com

References

BORGMAN, Christine L. (1989). All Users of Information Retrieval Systems are Not Created Equal: An Exploration into Individual Differences. Information Processing and Management, vol. 25, no.3, pp. 237–251.

CARD, Stuart K., Mackinlay, Jock D., and Shneiderman, Ben. (1999). Readings in Information Visualization : Using Vision to Think. San Francisco: Morgan-Kaufman.

CEJPEK, J. (1998) Informace, komunikace a myšlení. Karolinum, Praha. 178

HULL, David A. (1999). The TREC-7 Filtering Track: Description and Analysis. In Voorhees, Ellen and Harman, Donna (Eds.), Proceedings of the 7th Text REtrieval Conference (TREC-7), Gaithersburg. Maryland: National Institute of Science and Technology

INGWERSEN, P. (1996). Cognitive Perspectives in Information Retrieval Interaction: Elements of a Cognitive IR Theory. J. Documentation, vol. 52, no. 1, pp. 3–50.

LOUKOTOVÁ, K. (2009) Úvod do problematiky uživatelského rozhraní. In Červenková, A. & Hořava, M. (Eds.), Uživatelsky přívětivá rozhraní. Horava &Associates, Praha.

NEWBY, G. B. (2002) Empirical Study of a 3D Visualization for Information Retrieval Tasks. Journal of Intelligent Information Systems, vol. 18, pp. 31–53.

SABOL, V. et al. (2002) Applications of a Lightweight, Web-Based Retrieval, Clustering, and Visualization Framework. In D. Karagiannis and U. Reimer (Eds.): PAKM 2002, LNAI 2569, pp. 359–368, 2002.

SHNEIDERMAN, Ben. (1996). The Eyes Have It: User Interfaces for Information Visualization. Technical Report No. CS-TR-3665, Human Computer Interface Laboratory. University of Maryland at College Park. Available at http://www.cs.umd.edu/TRs/groups/HCIL-no-abs.html

SCHAMBER, Linda, Eisenberg, Michael, and Nilan, Michael. (1991). Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management, vol. 26, no. 2, pp. 755–776.

TUTORIAL on Clustering Algorithms (2000) Politecnico di Milano. Available at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/

ZAÏANE, Osmar R. (1999) Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering. University of Alberta. Available on : http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html

Visual search engines

When talking about visual search engines, we are talking about visual interface of search engine. We omitted image search engine, but not the way of results representation.

The visual approach significantly evolved in 1980s when the Graphical User Interface (GUI) was fully implemented in computer applications instead of the traditional textual interface. The exploitation of graphical features became the leading approach. Nowadays we may consider according to web search engines that people use a visual metaphor for their core system interaction - that is, manipulating a mouse to select fields for data entry and submit a query for processing (Newby, 2002).

Cognitive aspects

The idea of visual representation emerged from the fact that humans initially acquired all information as symbols or images. That means before the natural language developing. As well as before the amount of information started to be enormous and brought necessity of their shattering on smaller solitary items. This development led into the term or conceptual thinking that substituted symbols and images by terms to ease the processing and extend the communication skills. Humans are thus more likely to be familiar with non-visual IR interfaces, but not visual interfaces.

This development could put the visual interface at a disadvantage, or create a need for extensive training (Newby, 2002). While the actual retrieval results are presented as linear text, supported by some hyperlinks; reflecting the evolution of cognitive process.

On the other hand the exploitation of images and alternative symbols never disappeared from the humans thinking and communicational processes (Cejpek, 1998). It causes the easement for the short-term memory in the low cognitive processes. Enables faster upload of already acquired information stored in human’s memory (Loukotová, 2009) and it underlines human-computer interaction (Newby, 2002).

Information retrieval

For purposes of information retrieval (IR) there was a long-standing interest in visualization of documents, collections and retrieval results presented by work Card et al. (1999).

Visual IR system is based on the idea of Information space that is defined as the set of relations among items held by an information system (cf. Ingwersen, 1996). Information space is multidimensional (2D x 3D) consisting of terms and documents found in retrieval results,which creates an intuitive landscape (Sabol, 2002).

We may think of the structure composed by collection of documents and their related terms as an information space. This idea is based on the vector space modeling where the document or collection is in centre (Newby, 2002). Information space is beyond the representational level of the IR system; however it may be apparent in different representational approaches:

· Book House – extension of library catalog. This approach works with items that could be catalogized as traditional documents and the structure of catalogue is based on bibliographic data. However in case of web sources as not catalogized items is definition of such structure not that easy – bibliographic data are substituted by metadata.

Hyperbolic tree – tree structure with focused term centered and gradually progressing branches of related terms in the hyperbolic space.
Visualization lexical thesaurus data – does reflect the structure of thesaurus. It is not related to the documents, it is based on hierarchical structure of the thesaurus’s network.

Problems

Current question of information space problematic is the use of 3D over 2D, however there is no simple recommendation, but rather the series of situations suitable for each approach. As well as there is nor the study that would prefer visualization over text.

Visual structure implemented in large data sets may bring difficulties of information overload and unnoticed results representation. For that reason there is Shneiderman’s “visualization mantra” (Shneiderman, 1996) that consists of three options that should be reached in visual search engine:

1. Overview first

2. Zoom and filter

3. Details on demand

I would suggest add two other options of visual search engine, as:

1) Interactivity – modifying visual presentation of a dataset according to user’s demand

2) Linking – connection to the desired information source/document.

The main problems are based on the data structures – hierarchies, thesauri - that are exploited as the base for visual representation. Aforementioned problems of implementation of such visual approach on the large data sets – Web – are mainly because of the insufficient data structure and data description – indexing – when acquired tacitly.

There are other IR approaches that serve as a background for the visual representation. Three general approaches are Boolean retrieval, probabilistic retrieval and vector retrieval. Where is the probabilistic approach based on Bayesian method. The probabilistic method is likely to be the leading method for next development as may be seen on the Latent semantic indexing approach , which will be described later.

Other potential sources beyond the visualized structure might include characteristics of the information seeker, such as standing profiles of information need (Hull, 1999), knowledge of the information seeker’s situation (Schamber et al., 1991), and individual differences among seekers (Borgman, 1989).

Conclusion

Nowadays web based visual search engines can not compete with other textual based search engines. The reason is mainly because of the development which supported since the beginning mainly term cognitional approach on the higher level of cognition and the exploitation of visual tools was led for low cognition as the basic automatic manipulation with applications. However the potential which is hidden in visual search engines approach is significant and the realization of web search engine as the real visual interactive and linked network is just the matter of time.

Examples

Search me application – new generation of visual search engine as the combination of tangent and visual approach. It is exploited more on the low cognitive approach.

Viewzi is similar to search me application, but it offers already some of structural backgrounds. It is highly designed and offers around 16 patterns of representation, unfortunately to the prejudice of the functionality.

Kartoo is probably the best version of web based visual search engine. It offers a structured map of terms, topics and the document connection.

References

CARD, Stuart K., Mackinlay, Jock D., and Shneiderman, Ben. (1999). Readings in Information Visualization : Using Vision to Think. San Francisco: Morgan-Kaufman.

CEJPEK, J. (1998) Informace, komunikace a myšlení. Karolinum, Praha. 178

INGWERSEN, P. (1996). Cognitive Perspectives in Information Retrieval Interaction: Elements of a Cognitive IR Theory. J. Documentation, vol. 52, no. 1, pp. 3–50.

LOUKOTOVÁ, K. (2009) Úvod do problematiky uživatelského rozhraní. In Červenková, A. & Hořava, M. (Eds.), Uživatelsky přívětivá rozhraní. Horava &Associates, Praha.

NEWBY, G. B. (2002) Empirical Study of a 3D Visualization for Information Retrieval Tasks. Journal of Intelligent Information Systems, vol. 18, pp. 31–53.

SCHAMBER, Linda, Eisenberg, Michael, and Nilan, Michael. (1991). Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management, vol. 26, no. 2, pp. 755–776.

TUTORIAL on Clustering Algorithms (2000) Politecnico di Milano. Available at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/

InfoStudiesBlog - Barbora

What you will find here

Saturday, December 26, 2009

TRENDS 2 - Clustering

Clustering

Cluster

Distance-based clustering

Conceptual clustering

Model-based clustering

Cognitive aspects

Problems

Conclusion

Examples

References

Tuesday, December 15, 2009

TRENDS 1 - Visual search

Visual search engines

Cognitive aspects

Information retrieval

Problems

Conclusion

Examples

References

About Me

LINKS

BLOG HISTORY

Followers