Integrated Knowledge Solutions: document characterization

Data retrieval and information retrieval are often confused. This short post attempts to clarify the difference between the two.

Information retrieval (IR) is a field of study dealing with the representation, storage, and organization of documents, and with access to them. The documents may be books, reports, pictures, videos, web pages or multimedia files. The whole point of an IR system is to give a user easy access to documents containing the desired information. One of the best-known examples of an IR system is the Google search engine. A data retrieval system, by contrast, is a database management system (DBMS).

An information retrieval system differs from a data retrieval system in the following ways:

– IR deals with unstructured or semi-structured data, while data retrieval (a database management system, or DBMS) deals with structured data that has well-defined semantics.

– Querying a DBMS produces exact results, or no results at all if no exact match is found.

– Querying an IR system produces multiple ranked results, and partial matches are allowed.
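
The contrast between exact matching and ranked partial matching can be sketched in a few lines of Python. This is only a toy illustration: the collection `DOCS` and the functions `exact_lookup` and `ranked_search` are hypothetical names made up for this example, and counting shared terms is an assumed stand-in for the far richer scoring that real IR systems use.

```python
# Toy collection: document id -> text (illustrative data, not from a real system).
DOCS = {
    1: "information retrieval ranks documents",
    2: "database systems store structured data",
    3: "retrieval of structured data from a database",
}


def exact_lookup(docs, query):
    """Data-retrieval style: return ids whose text equals the query exactly."""
    return [doc_id for doc_id, text in docs.items() if text == query]


def ranked_search(docs, query):
    """IR style: score each document by the number of query terms it shares."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:  # a partial match is enough to be returned
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by the smaller document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

For the query "structured data retrieval", `exact_lookup` returns an empty list because no document equals the query, while `ranked_search` returns `[3, 2, 1]`: every document shares at least one term, and they are ordered by how many terms they share.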

Document Characterization

Three kinds of characteristics are associated with a document.

Metadata Characterization

This kind of characterization refers to ownership, authorship, and other items of information about a document. The Library of Congress subject coding is one example of metadata. Another is the set of category headings used by the Internet search engine Yahoo. To standardize category headings, many areas use specific ontologies, which are hierarchical taxonomies of terms describing certain knowledge topics.

Presentation Characterization

This refers to attributes that control the formatting or presentation of a document.

Content Characterization

This refers to attributes that denote the semantic content of a document. Content characterization is of primary interest in IR. The common practice in IR is to represent a textual document by a set of keywords called index terms, or simply terms. An index term is a word or a phrase in a document whose semantics give an indication of the document’s theme. Index terms are mainly nouns, because nouns have meaning by themselves.
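
As a rough illustration, a document can be reduced to its index terms with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `index_terms` and the tiny `STOP_WORDS` list are hypothetical, and dropping stop words is only a crude stand-in for selecting nouns, which would need a part-of-speech tagger.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption for this sketch);
# real systems use much larger lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "by"}


def index_terms(text):
    """Return a Counter of lowercase terms with stop words removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```

Calling `index_terms("The retrieval of documents is the goal of retrieval.")` gives a count of 2 for "retrieval" and 1 each for "documents" and "goal"; the remaining words are discarded as stop words.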

The same concept can be applied to images and multimedia documents, characterizing them in terms of words using the aptly named Bag of Words (BoW) representation.

However, deciding what counts as a word in an image is tricky. One way to define such words is vector quantization, which yields a set of code words (a code word is simply an image patch of a certain size) that can be used to assemble given images. Once images are characterized in terms of code words, attributes such as the frequencies and joint frequencies of code word occurrences can be computed to group images into meaningful groups. An example of one such grouping is shown in the figure below, where the grouping of pictures is shown in the form of a graph.
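
The code word idea can be sketched with NumPy. This is an illustrative sketch, not the method used to produce the grouping described above: it assumes grayscale images stored as 2-D arrays, non-overlapping square patches, and plain k-means as the vector quantizer. All function names (`extract_patches`, `learn_codebook`, `assign`, `bow_histogram`) are hypothetical.

```python
import numpy as np


def extract_patches(image, size):
    """Cut a 2-D grayscale image into non-overlapping size x size patches."""
    h, w = image.shape
    patches = [
        image[r:r + size, c:c + size].ravel()  # flatten each patch to a vector
        for r in range(0, h - size + 1, size)
        for c in range(0, w - size + 1, size)
    ]
    return np.array(patches, dtype=float)


def assign(patches, codebook):
    """Index of the nearest code word (squared Euclidean) for each patch."""
    diffs = patches[:, None, :] - codebook[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def learn_codebook(patches, k, iterations=10, seed=0):
    """Plain k-means: returns k code words (cluster centres) as rows."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(patches), size=k, replace=False)
    codebook = patches[start].copy()
    for _ in range(iterations):
        labels = assign(patches, codebook)
        for j in range(k):
            members = patches[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook


def bow_histogram(image, codebook, size):
    """Bag-of-words vector: how often each code word occurs in the image."""
    labels = assign(extract_patches(image, size), codebook)
    return np.bincount(labels, minlength=len(codebook))
```

In this sketch each code word is a cluster centre with the same dimensions as a flattened patch, and the histogram returned by `bow_histogram` plays the role that term frequencies play for text. Histograms of different images could then be compared or clustered to group similar images.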
