What you will find here

Saturday, December 26, 2009

TRENDS 2 - Clustering

Clustering

Clustering is one of the methods used for data classification. It is traditionally used as an algorithm behind the information retrieval process, for assessing the relevance of documents. The innovation this approach could bring is the projection of the algorithm onto the presentational level of the information retrieval system.

Cluster

A cluster is defined as a number of similar items (things, persons or groups) grouped closely together. Clusters differ from thesaurus classes in that the classification is unsupervised: clusters are not predefined. The clustering process is triggered by the user's need, expressed as an information retrieval query. Clusters can reveal the natural grouping or structure in a data set. Different clustering methods and models produce several kinds of clusters (Zaïane, 1999):

  • Exclusive Clustering – each item belongs to exactly one cluster
  • Overlapping Clustering – fuzzy sets are used to cluster data; each item has a degree of membership and may belong to two or more clusters
  • Hierarchical Clustering – built by repeatedly merging the two nearest clusters
  • Probabilistic Clustering – a completely probabilistic approach
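
The hierarchical variant in the list above can be sketched in a few lines of Python. This is a minimal illustration, not taken from Zaïane (1999): the single-link distance, the function names and the sample points are assumptions chosen for the example.

```python
from itertools import combinations
from math import dist

def single_link(a, b):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, target):
    """Merge the two nearest clusters until only `target` clusters remain."""
    clusters = [[p] for p in points]          # start: every item is its own cluster
    while len(clusters) > target:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]            # union of the two nearest clusters
        del clusters[j]                       # j > i, so index i stays valid
    return clusters

# Invented sample data: two tight pairs and one isolated point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, 3)
```

With these five points, the two pairs that lie one unit apart are merged first and the isolated point stays in a cluster of its own.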

Distance-based clustering

Clusters can be divided into groups according to the algorithm that defines the grouping. In the case of the first picture, we easily identify 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering (Zaïane, 1999). Items in a group share almost the same characteristics, expressed by their position in the information space. Items are depicted in 2D or 3D space (in our case 2D) according to the attributes that determine their position.
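
A minimal sketch of distance-based clustering follows. It uses k-means, one well-known distance-based method that the sources above do not name specifically; the sample points and the starting centroids are made up to mimic four visibly separate groups in 2D.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # New centroid = mean position of the group (keep the old one if the group is empty)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return groups

# Invented 2D data with four visibly separate groups
points = [(1, 1), (1, 2), (2, 1),      # lower left
          (8, 1), (9, 1), (9, 2),      # lower right
          (1, 8), (2, 9), (1, 9),      # upper left
          (8, 8), (9, 9), (8, 9)]      # upper right
start = [(0, 0), (10, 0), (0, 10), (10, 10)]   # hypothetical starting guesses
groups = kmeans(points, start)
```

Each point ends up with the centroid it is closest to, which is exactly the "close according to a given distance" criterion described above.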

Conceptual clustering

Another kind of clustering is conceptual: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. Conceptual clustering is not based on a perfect match or similarity between objects, but rather on conceptual likeness (Tutorial, 2000). The categories and features that determine the similarity of a group are fuzzier and more open than in the distance model above; they could be described as overlapping clusters, where items in a group share at least one “same” characteristic.

An example of the conceptual approach is Latent Semantic Indexing (LSI, see Deerwester et al., 1990). A query with one term (such as “pigs”) could have a high similarity with a document that contains a related term (“hogs”). Rather than expanding queries based on only a small set of term relations, LSI considers all terms to be potentially related to each other, and all documents to be similarly related (Newby, 2002).

Model-based clustering

Model-based clustering is another of the conceptual clustering approaches. It is based on the fit between the data set and a model. The model emerges from the nonlinear m-dimensional inputs in the data set, whose positions are based on closeness. The representation is thus self-correcting according to the changeable mental model. This theory is connected with the SOM (self-organizing map) theory proposed by Kohonen in 1981.
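
To illustrate the self-correcting behaviour, here is a very small self-organizing map in Python. It is a sketch under simplifying assumptions (a chain of three nodes, a Gaussian neighbourhood, a linearly decaying learning rate, invented 2D data), not Kohonen's original formulation.

```python
from math import exp

def distance2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def bmu(weights, x):
    """Index of the best matching unit: the node whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda i: distance2(weights[i], x))

def train_som(data, weights, epochs=50, rate=0.5, radius=1.0):
    """Train a 1-D self-organizing map (a chain of nodes) on `data`."""
    weights = [list(w) for w in weights]           # work on a copy
    for epoch in range(epochs):
        decay = 1 - epoch / epochs                 # learning rate shrinks over time
        for x in data:
            winner = bmu(weights, x)
            for i, w in enumerate(weights):
                # Nodes near the winner on the chain are pulled towards x as well
                influence = exp(-((i - winner) ** 2) / (2 * radius ** 2))
                for d in range(len(w)):
                    w[d] += rate * decay * influence * (x[d] - w[d])
    return weights

# Invented data: two items in the lower-left corner, two in the upper-right
data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.9)]
start = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]       # hypothetical initial model
trained = train_som(data, start)
```

After training, inputs that lie close together map to the same node, while distant inputs map to different nodes; the model has adjusted itself to the data.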

Further development in clustering theories is based on the likeness to human information acquisition. In this approach, clustering theory is preceded by learning, statistical and probabilistic theories.

Cognitive aspects

The advantage of clustering is its close similarity to the human way of thinking. It corresponds to the theory of inner mental modelling according to Wittgenstein and to the theory of term-based and conceptual thinking, which enables people to deal with large data sets and to organize their long-term memory more easily (Loukotová, 2009). The clustering method thus reflects higher mental activities and is well suited to information retrieval. Another important advantage is its relation to the changing context of the real world. The structure of clusters is not fixed; it reflects the changes of the inner mental model in response to reality.

Clustering at the representational level could then provide a tool for easier understanding of the data set’s environment and deeper understanding of the relations between terms and objects, not to mention reality itself.

Problems

Using clustering methods in the Web environment brings the same problem as any method based on similarity to human thinking: many unknown and changeable facts emerge that have to be taken into account. The bigger the data set, the more unknown facts. Another problem is the changeability of the data set itself. In the web environment, the amount and kind of data change quickly and extensively.

All kinds of clustering models are basically founded on some sort of “distance” between terms, and thus the correct identification of a cluster depends on their representation in the information space. It follows that the problem of filtering clusters depends primarily on the position of the clusters in the information space.

Conclusion

Nowadays clustering methods are widely used in the form of hidden algorithms. However, their potential is not fully utilized. That potential lies in the cognitive aspects of the method. As will be presented later, this approach is closest to cognitive perception and the ways of human thinking. In connection with search engines, it could serve as an ideal information retrieval and learning tool.

Examples

Standalone applications: Carrot2 Workbench

Web search engines: clusty.com


References

BORGMAN, Christine L. (1989). All Users of Information Retrieval Systems are Not Created Equal: An Exploration into Individual Differences. Information Processing and Management, vol. 25, no.3, pp. 237–251.

CARD, Stuart K., Mackinlay, Jock D., and Shneiderman, Ben. (1999). Readings in Information Visualization: Using Vision to Think. San Francisco: Morgan Kaufmann.

CEJPEK, J. (1998) Informace, komunikace a myšlení. Karolinum, Praha. 178

HULL, David A. (1999). The TREC-7 Filtering Track: Description and Analysis. In Voorhees, Ellen and Harman, Donna (Eds.), Proceedings of the 7th Text REtrieval Conference (TREC-7). Gaithersburg, Maryland: National Institute of Standards and Technology.

INGWERSEN, P. (1996). Cognitive Perspectives in Information Retrieval Interaction: Elements of a Cognitive IR Theory. J. Documentation, vol. 52, no. 1, pp. 3–50.

LOUKOTOVÁ, K. (2009) Úvod do problematiky uživatelského rozhraní. In Červenková, A. & Hořava, M. (Eds.), Uživatelsky přívětivá rozhraní. Horava &Associates, Praha.

NEWBY, G. B. (2002) Empirical Study of a 3D Visualization for Information Retrieval Tasks. Journal of Intelligent Information Systems, vol. 18, pp. 31–53.

SABOL, V. et al. (2002) Applications of a Lightweight, Web-Based Retrieval, Clustering, and Visualization Framework. In D. Karagiannis and U. Reimer (Eds.): PAKM 2002, LNAI 2569, pp. 359–368, 2002.

SHNEIDERMAN, Ben. (1996). The Eyes Have It: User Interfaces for Information Visualization. Technical Report No. CS-TR-3665, Human Computer Interface Laboratory. University of Maryland at College Park. Available at http://www.cs.umd.edu/TRs/groups/HCIL-no-abs.html

SCHAMBER, Linda, Eisenberg, Michael, and Nilan, Michael. (1991). Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management, vol. 26, no. 2, pp. 755–776.

TUTORIAL on Clustering Algorithms (2000) Politecnico di Milano. Available at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/

ZAÏANE, Osmar R. (1999) Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering. University of Alberta. Available on : http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html


Tuesday, December 15, 2009

TRENDS 1 - Visual search

Visual search engines

When we talk about visual search engines, we mean the visual interface of a search engine. We leave out image search engines, but not the way results are represented.

The visual approach evolved significantly in the 1980s, when the Graphical User Interface (GUI) was fully implemented in computer applications in place of the traditional textual interface. The use of graphical features became the leading approach. Today, judging by web search engines, we may consider that people use a visual metaphor for their core system interaction, that is, manipulating a mouse to select fields for data entry and submit a query for processing (Newby, 2002).

Cognitive aspects

The idea of visual representation emerged from the fact that humans initially acquired all information as symbols or images, that is, before natural language developed and before the amount of information became so enormous that it had to be broken down into smaller separate items. This development led to term-based or conceptual thinking, which substituted terms for symbols and images to ease processing and extend communication skills. Humans are thus more likely to be familiar with non-visual IR interfaces than with visual ones.

This development could put the visual interface at a disadvantage, or create a need for extensive training (Newby, 2002). Meanwhile, the actual retrieval results are presented as linear text supported by some hyperlinks, reflecting the evolution of the cognitive process.

On the other hand, the use of images and alternative symbols never disappeared from human thinking and communication processes (Cejpek, 1998). It relieves short-term memory in low-level cognitive processes, enables faster recall of already acquired information stored in human memory (Loukotová, 2009) and underlines human-computer interaction (Newby, 2002).

Information retrieval

For the purposes of information retrieval (IR), there has been a long-standing interest in the visualization of documents, collections and retrieval results, as presented in the work of Card et al. (1999).

A visual IR system is based on the idea of an information space, defined as the set of relations among items held by an information system (cf. Ingwersen, 1996). The information space is multidimensional (2D or 3D), consisting of terms and documents found in retrieval results, which creates an intuitive landscape (Sabol, 2002).

We may think of the structure composed of a collection of documents and their related terms as an information space. This idea is based on vector space modeling, where the document or collection is at the centre (Newby, 2002). The information space lies behind the representational level of the IR system; however, it may be apparent in different representational approaches:

  • Book House – an extension of the library catalog. This approach works with items that can be cataloged as traditional documents, and the structure of the catalogue is based on bibliographic data. For web sources, which are not cataloged items, defining such a structure is not that easy; bibliographic data are replaced by metadata.
  • Hyperbolic tree – a tree structure with the focused term at the centre and branches of related terms gradually extending into hyperbolic space.
  • Visualization of lexical thesaurus data – reflects the structure of the thesaurus. It is not related to the documents; it is based on the hierarchical structure of the thesaurus’s network.
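
The vector space idea behind the information space can be illustrated with a short Python sketch. It assumes plain term counts as vector components and cosine similarity as the closeness measure; the three sample documents and the query are invented.

```python
from collections import Counter
from math import sqrt

def vector(text):
    """Term-frequency vector of a text: each distinct term is one dimension."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two term vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented mini collection
docs = {
    "d1": "clustering groups similar documents",
    "d2": "visual search engines show results as maps",
    "d3": "clustering search results",
}
query = vector("clustering results")

# Documents ordered by their closeness to the query in the information space
ranking = sorted(docs, key=lambda d: cosine(query, vector(docs[d])), reverse=True)
```

A visual interface would draw these similarities as positions in a 2D or 3D landscape instead of returning them as a ranked list.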

Problems

A current question about the information space is the use of 3D over 2D; there is no simple recommendation, but rather a series of situations suitable for each approach. Nor is there any study that would prefer visualization over text.

Visual structure implemented in large data sets may bring difficulties of information overload and unnoticed results. For that reason there is Shneiderman’s “visualization mantra” (Shneiderman, 1996), which consists of three features that a visual search engine should provide:

1. Overview first

2. Zoom and filter

3. Details on demand

I would suggest adding two further features of a visual search engine:

1) Interactivity – modifying the visual presentation of a dataset according to the user’s demand

2) Linking – connection to the desired information source/document.

The main problems stem from the data structures (hierarchies, thesauri) that are used as the base for visual representation. The aforementioned problems of applying such a visual approach to large data sets such as the Web arise mainly from insufficient data structure and data description (indexing) when these are acquired tacitly.

There are other IR approaches that serve as a background for visual representation. The three general approaches are Boolean retrieval, probabilistic retrieval and vector retrieval, where the probabilistic approach is based on the Bayesian method. The probabilistic method is likely to be the leading method for further development, as may be seen in the Latent Semantic Indexing approach, which will be described later.
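
Of the three approaches, Boolean retrieval is the simplest to show in code. The sketch below assumes a tiny invented collection and an inverted index built with Python sets, so that AND, OR and NOT become set operations.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Invented mini collection
docs = {
    1: "visual search engine",
    2: "text search engine",
    3: "visual map of clusters",
}
index = build_index(docs)

both = index["visual"] & index["search"]        # visual AND search
either = index["visual"] | index["text"]        # visual OR text
without = index["search"] - index["visual"]     # search AND NOT visual
```

A document either matches the query or it does not; unlike the probabilistic and vector approaches, Boolean retrieval gives no ranking.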

Other potential sources beyond the visualized structure might include characteristics of the information seeker, such as standing profiles of information need (Hull, 1999), knowledge of the information seeker’s situation (Schamber et al., 1991), and individual differences among seekers (Borgman, 1989).

Conclusion

Nowadays web-based visual search engines cannot compete with text-based search engines. The main reason is that development has, from the beginning, favoured the term-based cognitive approach at the higher level of cognition, while visual tools were used for low-level cognition, that is, basic automatic manipulation of applications. However, the potential hidden in the visual search engine approach is significant, and the realization of a web search engine as a truly visual, interactive and linked network is just a matter of time.

Examples

Search me – a new generation of visual search engine that combines the tangent and visual approaches. It relies more on the low-level cognitive approach.

Viewzi is similar to Search me, but it already offers some structural background. It is highly designed and offers around 16 patterns of representation, unfortunately at the expense of functionality.

Kartoo is probably the best version of a web-based visual search engine. It offers a structured map of terms, topics and document connections.


References

BORGMAN, Christine L. (1989). All Users of Information Retrieval Systems are Not Created Equal: An Exploration into Individual Differences. Information Processing and Management, vol. 25, no.3, pp. 237–251.

CARD, Stuart K., Mackinlay, Jock D., and Shneiderman, Ben. (1999). Readings in Information Visualization: Using Vision to Think. San Francisco: Morgan Kaufmann.

CEJPEK, J. (1998) Informace, komunikace a myšlení. Karolinum, Praha. 178

HULL, David A. (1999). The TREC-7 Filtering Track: Description and Analysis. In Voorhees, Ellen and Harman, Donna (Eds.), Proceedings of the 7th Text REtrieval Conference (TREC-7). Gaithersburg, Maryland: National Institute of Standards and Technology.

INGWERSEN, P. (1996). Cognitive Perspectives in Information Retrieval Interaction: Elements of a Cognitive IR Theory. J. Documentation, vol. 52, no. 1, pp. 3–50.

LOUKOTOVÁ, K. (2009) Úvod do problematiky uživatelského rozhraní. In Červenková, A. & Hořava, M. (Eds.), Uživatelsky přívětivá rozhraní. Horava &Associates, Praha.

NEWBY, G. B. (2002) Empirical Study of a 3D Visualization for Information Retrieval Tasks. Journal of Intelligent Information Systems, vol. 18, pp. 31–53.

SABOL, V. et al. (2002) Applications of a Lightweight, Web-Based Retrieval, Clustering, and Visualization Framework. In D. Karagiannis and U. Reimer (Eds.): PAKM 2002, LNAI 2569, pp. 359–368, 2002.

SHNEIDERMAN, Ben. (1996). The Eyes Have It: User Interfaces for Information Visualization. Technical Report No. CS-TR-3665, Human Computer Interface Laboratory. University of Maryland at College Park. Available at http://www.cs.umd.edu/TRs/groups/HCIL-no-abs.html

SCHAMBER, Linda, Eisenberg, Michael, and Nilan, Michael. (1991). Towards a Dynamic, Situational Definition of Relevance. Information Processing and Management, vol. 26, no. 2, pp. 755–776.

TUTORIAL on Clustering Algorithms (2000) Politecnico di Milano. Available at: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/

ZAÏANE, Osmar R. (1999) Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering. University of Alberta. Available on : http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html


Monday, October 19, 2009

Knowledge management in consideration of Web 2.0

Web 2.0 principles and knowledge management

Bc. Barbora Poláková - April 2009, Åbo Akademi

Motto:

The most important feature of Web 2.0 is not to make money from it, but that we can cooperate to create a new world of dynamic knowledge and collective intelligence.

(Umeda, 2006)

„Knowledge management is no longer about connecting people to content, it is about connecting people to people.“

(Lamont, 2009)

Shift from information to knowledge

Today's society is commonly defined as a postindustrial knowledge society. This definition arose from Porat's (1977) theory of the information society, in which society is economically dependent on information, its distribution and usage. At the beginning of the 21st century the term information was replaced by the term knowledge, and society came to be described as a knowledge society.

The main difference between information and knowledge lies in context. Information in general is a context-independent unit, which can be indexed and organized according to norms and standards and is independent of its author; knowledge, on the other hand, is based on contextual engagement. Knowledge could thus be defined as „information in use“, shaped by the experience of its author and by the specific environment in which it develops. Knowledge is not necessarily expressed; it resides in people's minds, where, in the form of a „knowledge structure“, it helps them understand and manage their interaction with reality. Knowledge seems in fact to be the pragmatic reflection of information, represented by the intellectual capital of individuals (Bukh, 2001). It is therefore apparent that knowledge holds more economic potential than information itself. This potential is hidden in the complex understanding of a situation.

Knowledge management

Knowledge management was established in response to the shift from information to knowledge. It is meant to manage the distribution and usage of knowledge and other related processes.

The traditional approach to this issue is based on the assumption that knowledge can be managed independently of the individuals who possess it. It is thus treated merely as a question of codifying the transfer from authors' heads to knowledge systems in the form of normalized records (Tredinnick, 2006). This traditional (conventional) approach focuses on collecting knowledge in a centralized repository, with access provided mostly through the organization's intranet (Lee, 2007). Knowledge is scoped on two levels, inter-organizational and intra-organizational (Lee, 2007), and according to Case (2006) a more open organization is more likely to be exposed to relevant information. In practice this later meant building huge stores of potentially needed knowledge, a significant part of which was rarely used: the long tail effect (Tredinnick, 2006).

The problem, however, appears at the moment we export this knowledge out of its context: at that moment knowledge is transformed into information, because it loses the added value represented by the context.

The solution to this problem was found in the conversational approach, a way to manage knowledge in a context- and user-dependent manner while still standardising it through the necessary codifications of the knowledge management system. Such a system emphasises the integration and collaboration of knowledge creation amongst knowledge possessors (Lee, 2007), and its basic characteristic is interactivity.

Interactive Web / Web 2.0

Web 2.0 is a phenomenon that appeared in 2004 at the Web 2.0 Conference, where the framework of Web 2.0 was first presented. The most popular and most often cited definition was formulated in 2005 by Tim O'Reilly, who had already promoted the whole idea at the 2004 conference:

“the network as platform, spanning all connected devices; Web 2.0 applications are those that make the most of the intrinsic advantages of that platform: delivering software as a continually-updated service that gets better the more people use it, consuming and remixing data from multiple sources, including individual users, while providing their own data and services in a form that allows remixing by others, creating network effects through an "architecture of participation," and going beyond the page metaphor of Web 1.0 to deliver rich user experiences.”

(O'Reilly, 2005, in Lee, 2007)

It follows from the definition that the advent of Web 2.0 does not mean any significant technical change in platforms, but mainly a shift in the understanding and usage of information and knowledge, as described above, as well as a significant shift to a user-centred approach. The role of users is seen as active: users interact directly with web applications (Tredinnick, 2006). The direct participation of users as possessors of knowledge preserves the contextual information and thus the potential of knowledge.

This interactive participation takes different forms, for example updating, publishing, evaluating, creating one's own or a shared space in the web environment, or communicating with other users.

According to Lee (2007), the main characteristics of Web 2.0 that are beneficial for extending and developing knowledge management systems are:

Contribution/Publishing/Organization

„Every Internet user has the opportunity to freely provide their knowledge content to the relevant subject domains.“ The simplification of the publishing process makes content contribution accessible to almost everyone (only basic information literacy is needed), which has the following effects:

  • Speed – new content appears faster and is thus more current and relevant.
  • Volume – thanks to speed and accessibility, the growth of content is enormous, which could lead to complications in information retrieval.
  • Experts/Peers – anonymity is a significant characteristic; it erases the difference between experts and peers and puts them on an equal footing.

Organization of the content is mainly up to the participants. It is done through folksonomy and tagging. This allows participants to use an existing classification (prepared partly by developers, mainly by other users) or to create their own, which is later incorporated into the current classification system.
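
How tagging builds a classification from the bottom up can be sketched as follows. The class name, its methods and the sample users are hypothetical and only illustrate the principle; real tagging services differ in detail.

```python
from collections import Counter, defaultdict

class Folksonomy:
    """Tags are free-form labels; the vocabulary grows from what users actually assign."""

    def __init__(self):
        self.assignments = set()           # (user, resource, tag) triples seen so far
        self.by_tag = defaultdict(set)     # tag -> resources carrying it
        self.usage = Counter()             # tag -> how often users applied it

    def tag(self, user, resource, label):
        label = label.strip().lower()
        if (user, resource, label) in self.assignments:
            return                         # the same user tagging twice counts once
        self.assignments.add((user, resource, label))
        self.by_tag[label].add(resource)
        self.usage[label] += 1

    def suggest(self, n=3):
        """Most used tags, offered to the next participant as the existing classification."""
        return [t for t, _ in self.usage.most_common(n)]

    def find(self, label):
        """Resources that carry the given tag."""
        return sorted(self.by_tag.get(label.strip().lower(), ()))

f = Folksonomy()
f.tag("anna", "kartoo.com", "search")
f.tag("ben", "kartoo.com", "Visual")
f.tag("ben", "clusty.com", "search")
f.tag("carl", "clusty.com", "clustering")   # a new tag simply joins the vocabulary
```

No classification scheme is defined in advance: a tag exists as soon as one participant uses it, and frequently used tags become the suggestions shown to others.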

Sharing/Open source

„Knowledge contents are freely available to others. Secured mechanisms may be enforced to enable the knowledge sharing amongst legitimate members within specific communities.“ Knowledge sharing in the public Web 2.0 environment is based on the willingness to participate in creating collective intelligence, as seen in the example of Wikipedia.

Collaboration

„Knowledge contents are created and maintained collaboratively by knowledge providers. Internet users participating in the knowledge contents can have conversations as a kind of social interaction.“ Collaborative environment technologies include:

  • Synchronous technologies – instant chat, video, conferences and shared Group Decision Support System (GDSS)
  • Asynchronous technologies – Weblog, wiki, e-mail, moderated discussion forums

The long-term goal of Web 2.0 applications is to develop same-place, same-time technology that would enable two-way interaction between users (provider/recipient) in real time and in one web space, applying the many-to-many model of communication (Tredinnick, 2006).

An additional characteristic arising from collaboration is social networking, which enables users to create relationships with each other and thus boosts the emergence of social capital, individual as well as collective, as the promoter of collective knowledge intelligence (Baker, 2000).

Dynamic/Actuality

Thanks to direct user interaction, knowledge contents are „updated constantly to reflect the changing environment, situation“ and users' needs. The knowledge content is thus focused on current issues and offers faster and more relevant answers. This characteristic also addresses the long tail effect (Tredinnick, 2006), because information that is unusable at the moment is not acquired and stored.

Reliance

„Knowledge contribution should be based on trust between knowledge providers and domain experts.“ The degree of trust in such a system has to be quite high, because of the anonymity and ease of publishing. Responsibility for publishing, as well as for safeguarding the content, is left to the participants themselves. This factor could be a weakness as well as a strength of such systems; however, it is one of the basic principles of the Interactive Web.

Web 2.0 applications

The characteristics mentioned above depict the framework of Web 2.0 in a general context. They are reflected in Web 2.0 applications, which are mostly represented by blogs, wikis, RSS, virtual communities and indexing applications (tagging).

Blog

A blog is a simplified version of a web page, which enables users to create their own web space through a very simple interface, without any knowledge of HTML or CSS (Cascading Style Sheets). It is a simple publishing tool that offers some additional functions such as comments (collaboration) and, through profile options, social networking. This kind of application started as a sort of electronic diary and developed into a kind of public notepad used for presenting research and scientific work – www.blogspot.com.

Wiki

A wiki is based on the same principle as a blog: easy content publishing. The difference is in the number of participants. This kind of application supports group work, where more than one participant creates a single web space in the form of a wiki. Systems nowadays allow entries to be traced and linked to their authors, which enables collaboration within the group, while from outside the wiki can appear as one compact web space. It also allows comments and supports social networking – www.pbwiki.com.

RSS – Really Simple Syndication

RSS is a system that keeps track of updates posted across the web. It also has an aggregation function, which supports the creation of so-called mash-ups. A mash-up is a web page that concentrates content from different web pages in one place on the web and thus creates a kind of gateway. In combination with RSS it concentrates current, updated content that reflects the user's interests – www.igoogle.com.
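
The aggregation function can be sketched with Python's standard library. The two feeds below are invented RSS 2.0 documents with made-up example URLs; a real mash-up would download the feeds rather than embed them.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def read_items(rss_xml):
    """Extract (published, title, link) from every <item> of an RSS 2.0 document."""
    channel = ET.fromstring(rss_xml).find("channel")
    for item in channel.findall("item"):
        yield (parsedate_to_datetime(item.findtext("pubDate")),
               item.findtext("title"),
               item.findtext("link"))

def mash_up(feeds):
    """Merge the items of several feeds into one list, newest first."""
    return sorted((i for feed in feeds for i in read_items(feed)), reverse=True)

# Two invented feeds standing in for different web pages
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>Clustering</title><link>http://a.example/1</link>
<pubDate>Sat, 26 Dec 2009 10:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>Visual search</title><link>http://b.example/1</link>
<pubDate>Tue, 15 Dec 2009 09:00:00 GMT</pubDate></item>
<item><title>Web 2.0</title><link>http://b.example/2</link>
<pubDate>Mon, 19 Oct 2009 08:00:00 GMT</pubDate></item>
</channel></rss>"""

page = mash_up([FEED_A, FEED_B])
titles = [title for _, title, _ in page]
```

The merged list is the "gateway": one place that shows the latest items from several sources, ordered by publication date.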

Virtual communities

As already mentioned, virtual communities arose around Web 2.0 applications as an additional effect; nevertheless, virtual communities also arise in specialized social networking applications – www.ning.com.

Indexing

Indexing in Web 2.0 uses folksonomy and tagging as the basic principle for organizing web pages. It is based on user participation and thus helps make information retrieval more successful – www.blinklist.com.

Realization of Web 2.0 principles in knowledge management systems

Practical implementation of Web 2.0 principles in a closed knowledge management system is not only possible but above all desirable, if knowledge is accepted as the driving engine of success and development.

By implementing Web 2.0 principles and using Web 2.0 applications in an organization's knowledge management system, it is possible to manage the knowledge content of the company and its connected resources satisfactorily.

One of the most progressive approaches is the already mentioned Group Decision Support System (GDSS). These systems bring together several Web 2.0 applications and principles and work as aggregators. They mostly contain wikis and blogs as publishing systems, discussions and instant messengers as communication systems, and support folksonomy and quality evaluation of content as an indexing system. Examples of such systems are TeamPage, developed by Traction Software, and Velocity 6.0 and the Meet Stan application, developed by Vivisimo.

According to Lamont (2009), such a complex knowledge management system has to fulfil some important characteristics:

  • Scale to large groups, to be able to handle the whole company environment.
  • Authentication capability, to integrate seamlessly with other applications across the enterprise.
  • Functions have to be presented as a blend of traditional and Web 2.0 approaches.
  • It has to reflect user experience design that is based on organizational goals as well as user satisfaction.
  • Schema flexibility, which allows data to be analyzed, retrieved and managed regardless of source or structure.

It is important to acknowledge these characteristics before implementing such a system in the organizational knowledge structure; doing so enables its smooth adoption and the exploitation of the benefits that emerge from well organized and accessible knowledge content.


References

BAKER, W. (2000) What is social capital and why should you care about it? In Achieving success through social capital. University of Michigan Business School.

BUKH, P.N., Larsen, H.T., Mouritsen, J. (2001) Constructing intellectual capital statements. Scandinavian Journal of Management, vol. 17, pp. 87–108.

CASE, D. O. (2006). Information behaviour. In: Cronin Blaise. (ed.) Annual Review of Information Science and Technology (ARIST), vol. 40 (2006). pp. 293-327

LAMONT, J. (2008). KM past and future: Web 2.0 kicks it up a notch. KMWorld. no1.

LEE, M. R. & Lan, Y. (2007) From web 2.0 to conversational Knowledge Management: towards collaborative intelligence. [online] Journal of Entrepreneurship Research, vol. 2, no. 2, pp. 47-62. Available on:
http://www.cme.org.tw/journal/search/JournalFile/v02n02/V02N2-3.pdf

TREDINNICK, L. (2006) Web 2.0 and business: a pointer to the intranets of the future? [online] Business Information Review, 23(4), pp. 228-234. Available on: http://bir.sagepub.com/cgi/reprint/23/4/228.pdf



Tuesday, June 23, 2009

Knowledge Management in consideration of Web 2.0

This slideshow is based on a paper for the Information management course held at Åbo Akademi.

It contains the definition of knowledge and the knowledge society, as well as the definition and principles of Web 2.0 that could help in the efficient application of knowledge management principles.