Indexing vs. Tagging

Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices – (PDF, via Search Engine Land) – by Margaret E.I. Kipp and D. Grant Campbell – 2006-XX-XX:

Abstract
This paper analyzes the tagging patterns exhibited by users of del.icio.us, to assess how collaborative tagging supports and enhances traditional ways of classifying and indexing documents. Using frequency data and co-word analysis matrices analyzed by multi-dimensional scaling, the authors discovered that tagging practices to some extent work in ways that are continuous with conventional indexing. Small numbers of tags tend to emerge by unspoken consensus, and inconsistencies follow several predictable patterns that can easily be anticipated. However, the tags also indicated intriguing practices relating to time and task which suggest the presence of an extra dimension in classification and organization, a dimension which conventional systems are unable to facilitate.

Introduction
The rise of collaborative tagging systems has aroused the interest of information designers and information architects. Some argue that the tagging patterns emerging on such sites as del.icio.us, Flickr.com, 43things.com and Connotea constitute an entirely new method of information organization, one which can scale in ways that conventional systems cannot. Others argue that collaborative tagging systems flout too many standard principles of conventional indexing to be a viable replacement, no matter how far we lower our standards. This paper attempts to shed light on this debate by looking closely at the patterns of tagging in the del.icio.us bookmarking system, and assessing how the descriptors applied by individual users in this popular tool both resemble and differ from descriptors that professional indexers would apply.

Background
By allowing individual users to apply their own verbal descriptors to digital objects, personal tagging systems constitute a revolutionary form of verbal indexing. Rather than enforcing a pre-conceived controlled vocabulary, either by employing professional indexers or forcing authors to tag their articles from an established thesaurus, systems like del.icio.us assume that users can create useful results when given absolute freedom. This assumption has its roots in the development of the Web as a system which operates “as close as possible to no rules at all” (Berners-Lee & Fischetti, 1999, 15). Intellectually, the assumption echoes the widespread belief that the World Wide Web is a complex, adaptive system, of the sort described in complexity theory (Waldrop 1992, 11; Johnson 2000, 18). In a complex system, there is no pacemaker, of the sort we see in libraries, with their centralized indexing and cataloguing systems. Instead, each unit, following a set of very simple rules, contributes to a spontaneously self-organizing pattern, in which the whole becomes greater than the sum of its parts. This view gets some support from the success of the Google PageRank system, in which individual votes for a site’s quality, manifested as links, combine to create large-scale patterns of consensus about site quality and relevance that can drastically improve search engine performance. If the same holds true for tagging, then personal tagging systems offer the chance to create indexing systems that scale in ways which conventional organization schemes cannot, thereby making them appropriate for large, dynamic systems (Shirky, 2005). Just as the ranking system of Google becomes more illuminating as the number of links grows, so do the patterns and clusters of user-centered tags, as more and more people adopt tagging practices. Early research has shown regularities in user tagging activity, regularities which could enable us to predict stable tagging patterns.
Furthermore, early research suggests that when a URL acquires a certain number of taggers, the most common terms tend to remain stable (Golder & Huberman, 2006). Skeptics of tagging systems, many of them in the field of Information Architecture, argue that personal tagging merely constitutes “mob indexing” (Morville, 2006). Because users are untrained in indexing methods, they cannot create tags which cohere into useful patterns, at least not to the extent that would justify dispensing with controlled vocabularies and faceted browsing schemes. Before we can resolve this controversy, we must find ways to analyze the tagging data that is currently available to us, in order to measure the extent to which tagging represents a useful subject access mechanism. And because user-generated tags flout so many rules of indexing consistency, we cannot evaluate them by the usual standards of thesaurus construction. Untrained users will not, of themselves, produce rigorously-designed thesaural structures; we need to determine whether the results they do create are useful anyway. If user tags are forming useful patterns, co-word analysis provides a possible means of detecting them, based as it is on the assumption that the co-occurrence of words in a particular field in two or more documents is a measure of the strength of the relationship between the co-occurring words (Wilson 1999). Co-word analysis has many applications in Web environments (Leydesdorff & Vaughan 2006); many of these applications emerge from the historical use of co-word analysis to study changes in fields of study over time (Courtial, Callon & Sigogneau 1984; Courtial 1994). In particular, co-word analysis is sometimes used to enhance the effectiveness of traditional indexing, by providing a map for indexers to related words and synonyms (Courtial, Callon & Sigogneau 1984).
A co-word analysis of tagging practices, therefore, could provide insight into the patterns that are emerging through these practices, and the extent to which they are consistent with, and supportive of, conventional indexing and classification practices. Equally important, such an analysis might well show important differences between user tagging and conventional indexing: differences which indexers would do well to notice. Our study, therefore, posed three research questions:
· What patterns of consistent user tagging activity emerge through analyses of tagging frequency and co-word analysis?
· To what extent do these patterns of tagging support and enhance some of the other traditional ways of classifying documents?
· To what extent do these patterns defy these traditional methods, suggesting viable and promising alternatives to traditional subject access tools?
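The co-word assumption described in the Background (that tags appearing together on the same items are related) can be sketched in a few lines of Python. This is only an illustration, not the authors' procedure: the bookmark data and the function name coword_counts are made up for the example, and the multi-dimensional scaling step the paper applies to its matrices is not reproduced here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks: each set holds the tags one user gave one URL.
bookmarks = [
    {"web", "design", "css"},
    {"web", "css"},
    {"design", "toread"},
    {"web", "design"},
]

def coword_counts(tag_sets):
    """Count how many tag sets contain each unordered pair of tags."""
    counts = Counter()
    for tags in tag_sets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts

counts = coword_counts(bookmarks)
for pair, n in counts.most_common():
    print(pair, n)
```

A higher count for a pair is read as a stronger relationship between the two tags. A pair that never co-occurs, such as a term and its synonym used by different people, simply gets a count of zero.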

Conclusion and Future Directions
While the tagging practices prevalent on del.icio.us would never be mistaken for professional indexing practice, some of the differences may be less significant than they at first appear. The fact that not all tag frequency graphs show an immediate rapid drop off does not actually indicate a substantial departure from classical classification trends. In fact, it has commonly been accepted that documents are about multiple things and that the indexers’ task is to assign appropriate descriptors to describe this aboutness. Thus, the difference in the shapes of the frequency graphs seems to indicate the perceived breadth of the aboutness of the document. The data suggest that while the relationships between tags do not always follow the co-word clusters, they often follow relationships of synonymy that system designers can anticipate. Similarly, the presence of considerable consensus on certain terms suggests that emerging patterns from the tagging data are consistent, to some degree at least, with conventional indexing concepts of aboutness. In this sense, the data support both earlier studies that detect predictable patterns (Golder & Huberman 2006) and suggestions that tagging patterns provide a useful “fast layer” of subject access that can filter down to slower, more deliberate layers (Morville 2006, 140). However, the presence of time-related tags suggests that users are also doing something else with their tags: something that conventional subject access systems have always avoided. Tags such as “GTD,” “To Read,” and even “Cool” defy traditional subject analysis in several ways:
· They express a response from the user rather than a statement of the aboutness of the document;
· They are intrinsically time-sensitive;
· They suggest an active engagement with the text, in which the user is linking the perceived subject matter with a specific task or a specific set of interests.

These temporal tags could hardly be integrated into thesauri. The tag “To Read” expresses nothing about the document’s subject. Nor is it likely to remain relevant for any length of time; the user will eventually read it, or never get around to reading it, making the tag an anachronism either way. They do, however, provide interesting evidence of the energy that an individual user throws against a knowledge structure. Classification systems are traditionally visualized in two dimensions, offering coordinate relationships across an array, or hierarchical relationships along a chain. These two dimensions appear either as conventional tree-structure graphs, or as patterns of indentation in sequential lists of terms in thesauri or classification manuals. The temporal dimension of such systems manifests itself in the regular revisions of the system, generally through regular updates that are then consolidated into full-scale revisions. User tags such as “to read” suggest that the user brings a new temporal dimension to a classification: one related not to long-term shifts in terms and their relationships, but rather to short-term needs and enthusiasms, which, by relating to a specific interest or a specific task, place the document in a set of relationships that, while not expected to endure, pull documents into idiosyncratic relationships. If temporal tags were to become more sophisticated, their effect on subject access systems might be transformative. Vector space modeling offers us a method of modeling document similarities based on the presence or absence of terms by means of a hypercube: an n-dimensional space that offers dimensions corresponding to every term (Salton & McGill 1983; Meadow, Boyce & Kraft 2000). 
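To make the vector space idea concrete, here is a minimal Python sketch. Each document becomes a vector of 0s and 1s, one position per term, which places it at a corner of the hypercube. The vocabulary, the sample documents, the helper names, and the choice of cosine similarity as the comparison are all assumptions made for illustration; the passage above does not specify a particular similarity measure.

```python
import math

def term_vector(terms, vocabulary):
    """Binary vector: 1 where the document carries the term, else 0."""
    return [1 if term in terms else 0 for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary: one dimension of the space per term.
vocabulary = ["css", "design", "toread", "web"]

doc_a = term_vector({"web", "css"}, vocabulary)
doc_b = term_vector({"web", "design"}, vocabulary)
doc_c = term_vector({"toread"}, vocabulary)

print(cosine_similarity(doc_a, doc_b))  # one shared term out of two each
print(cosine_similarity(doc_a, doc_c))  # no shared terms
```

In this toy space a document tagged only “toread” shares no dimension with documents described by subject terms, which is one way of seeing why such tags sit awkwardly beside subject descriptors.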
If temporal tags add a third dimension to classification systems, it is possible that some transformation of the vector-space model might allow us to find new ways of modeling the passing effects of specific tasks and interests on document relationships in a dynamic way that serves as a foundation for new forms of visualizing document relationships. If this is true, we may well, as systems designers, want to be wary of solving problems too quickly. If we think of these tags as manifestations of this user energy, then the co-word visualizations acquire fresh implications. The distance between synonymous terms suggests the mutual exclusivity we expect from controlled vocabularies. But that doesn’t necessarily mean that the system designer should replicate the controlled vocabulary by collapsing the two terms together. The difference between “information architecture” and “ia” could represent significant cultural differences, with “information architecture” representing an academic formality and “ia” representing the convenient acronym of a consultant working within the professional information architecture community. As Shirky reminds us (2005), there may well be cases in which people who watch films don’t want to talk to people who watch movies. This study does not definitively resolve the controversies surrounding collaborative tagging systems. Frequency and co-word analysis, however, provide some illuminating insights into the way tagging patterns emerge. They reveal that closely-related terms are not necessarily revealed through co-occurrence; they also reveal that users employ a wide variety of conventions in constructing tags: conventions which they apply inconsistently. Time will tell whether consistent conventions emerge. While clustering is present in many cases, it is not always clearly marked, suggesting that tagging, like many other indexing methods, resorts to multiple terms to describe the aboutness of documents.
These results suggest two important directions for future research. First, they suggest that there is continuity between conventional indexing and user tagging: a continuity that could form the basis for a complementary system of subject access that could enrich conventional indexing rather than crowding it out. Second, the differences suggest that user tagging extends beyond the traditional objectives of subject access, and expresses a dynamic relationship between document and user, and between subject and task, which may lead to new ways of modeling subject access.

Psstt… They talk about “the aboutness of documents”. Could we talk about “the finality of documents”?? Just curious.

About Chris F. Masse

Founder and President of Midas Oracle