Information Retrieval Vs. Data Retrieval

Data retrieval and information retrieval are often confused. This short post attempts to clarify the difference between the two.

Information retrieval (IR) is a field of study dealing with the representation, storage, and organization of documents, and with access to them. The documents may be books, reports, pictures, videos, web pages or multimedia files. The whole point of an IR system is to give a user easy access to documents containing the desired information. One of the best-known examples of an IR system is the Google search engine. A data retrieval system, by contrast, is a database management system (DBMS).

The difference between an information retrieval system and a data retrieval system is that

– IR deals with unstructured/semi-structured data while a data retrieval system (a database management system or DBMS) deals with structured data with well-defined semantics

– Querying a DBMS produces exact/precise results, or no results if no exact match is found

– Querying an IR system produces multiple results with ranking. Partial match is allowed.
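
The contrast between exact-match lookup and ranked partial matching can be sketched in a few lines of Python. This is only a toy illustration: the records, the documents, and the term-overlap score are all made up for this example, and real IR systems use far more refined ranking functions.

```python
# Toy records and documents, invented purely for illustration.
records = {
    101: {"name": "Alice", "dept": "Sales"},
    102: {"name": "Bob", "dept": "Research"},
}

documents = {
    "d1": "clustering methods for image retrieval",
    "d2": "database systems and structured query languages",
    "d3": "ranking documents in information retrieval",
}


def data_retrieval(key):
    """DBMS-style lookup: an exact match or nothing."""
    return records.get(key)


def information_retrieval(query):
    """IR-style search: score each document by shared terms, rank by score."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:  # a partial match is enough to be returned
            scored.append((doc_id, overlap))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


exact_hit = data_retrieval(101)    # the full record for key 101
exact_miss = data_retrieval(999)   # None: no exact match, no result
ranked = information_retrieval("information retrieval ranking")
print(ranked)  # [('d3', 3), ('d1', 1)]
```

Document d3 shares all three query terms and comes first, d1 matches only on "retrieval" and is still returned, and d2 shares nothing and is left out.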

Document Characterization

Three kinds of characteristics are associated with a document.

    Metadata characterization

This kind of characterization refers to ownership, authorship and other items of information about a document. The Library of Congress subject coding is also an example of metadata. Another example of metadata is the category headings at Internet search engine Yahoo. To standardize category headings, many areas use specific ontologies, which are hierarchical taxonomies of terms describing certain knowledge topics.

    Presentation Characterization

This refers to attributes that control the formatting or presentation of a document.

    Content Characterization

This refers to attributes that denote the semantic content of a document. Content characterization is of primary interest in IR. The common practice in IR is to represent a textual document by a set of keywords called index terms or simply terms. An index term is a word or a phrase in a document whose semantics give an indication of the document’s theme. Index terms are mainly nouns because nouns have meaning by themselves.

The same concept can be applied to images and multimedia documents, characterizing them in terms of words using the representation aptly known as the Bag of Words (BoW) representation.


However, what counts as a word for an image is a tricky question. One way to define such words for images is through vector quantization, which yields a set of code words (a code word is simply an image patch of a certain size) that can be used to assemble given images. Once we are able to characterize images in terms of code words, various attributes such as the frequencies and joint frequencies of code word occurrences can be computed to group images into meaningful groups. An example of one such grouping is shown in the figure below, where the grouping of pictures is shown in the form of a graph.
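
To make the idea of code words concrete, here is a minimal sketch of the quantization step. It assumes a codebook has already been learned somehow (for example by clustering many patches); the three two-valued code words and the four patches are hypothetical stand-ins for real image patches.

```python
from collections import Counter

# Hypothetical codebook: each code word is a tiny 2-value "patch" descriptor.
codebook = {
    "w0": (0.0, 0.0),
    "w1": (1.0, 0.0),
    "w2": (0.0, 1.0),
}


def nearest_code_word(patch):
    """Return the label of the code word closest to the patch (squared Euclidean)."""
    def sq_dist(label):
        return sum((p - c) ** 2 for p, c in zip(patch, codebook[label]))
    return min(codebook, key=sq_dist)


def bag_of_words(patches):
    """Quantize every patch and count how often each code word occurs."""
    return Counter(nearest_code_word(p) for p in patches)


# Made-up patches standing in for patches cut from one image.
image_patches = [(0.1, 0.0), (0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]
histogram = bag_of_words(image_patches)
print(histogram)
```

Here the image ends up described by one occurrence of w0, two of w1 and one of w2. Histograms like this one are what get compared when images are grouped.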

How Similar are Two Clustering Results?

While performing clustering, it is not uncommon to try a few different clustering methods. In such situations, we want to find out how similar the results produced by the different methods are. In other situations, we may be developing a new clustering algorithm or evaluating a particular algorithm for our use. To do so, we use data sets with known ground truth so that we can compare the results against the ground truth. One way to evaluate the clustering results in all these situations is to use a numerical measure known as the Rand index (RI). It is a measure of how similar two clustering results or groupings are.

Rand Index (RI)

RI works by looking at all possible unordered pairs of examples. If the number of examples or data vectors for clustering is n, then there are $\binom{n}{2} = n(n-1)/2$ pairs. For every example pair, there are three possibilities in terms of grouping. The first possibility is that the paired examples are always placed in the same group, that is, by both clusterings. Let's count how often this happens over all pairs and represent that count by a. The second possibility is that the paired examples are never grouped together. Let's use b to represent the count of all pairs that are never grouped together. The third possibility is that the paired examples are grouped together by one clustering but not by the other. The first two possibilities are treated as paired examples in agreement while the third possibility represents pairs in confusion. The RI of two groupings is then calculated by the following formula:

$\text{RI} = \frac{\text{Count of Pairs in Agreement}}{\text{Total Number of Pairs}} = \frac{(a+b)}{\binom{n}{2}}$

We can notice from the formula that RI can never exceed 1 and its possible lowest value is 0.

Let's take an example to illustrate the RI calculation. Say we have five examples clustered into two clusters using two different clustering methods. The first method groups examples A, B, and C into one group and examples D and E into another group. The second clustering method groups A and B together and C, D, and E together. To compute RI for this example, let's first list all possible unordered pairs of the five examples at hand. We have 10 (n*(n-1)/2) such pairs. These are: {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {B, E}, {C, D}, {C, E}, and {D, E}. Examining these pairs, we notice that the pairs {A, B} and {D, E} are always grouped together by both clustering methods. Thus, the value of a is two. We also notice that four pairs, {A, D}, {A, E}, {B, D}, and {B, E}, never occur together in any clustering result. Thus, the value of b is four. The Rand index (RI) is then (2 + 4)/10 = 0.6.
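
The pair counting above is easy to check in code. The following is a small sketch rather than a reference implementation; the function name `rand_index` and the dictionary encoding of the two groupings are choices made for this illustration.

```python
from itertools import combinations


def rand_index(labels_1, labels_2):
    """Rand index of two groupings given as dicts mapping example -> cluster label."""
    examples = sorted(labels_1)
    a = b = total = 0
    for x, y in combinations(examples, 2):
        same_1 = labels_1[x] == labels_1[y]
        same_2 = labels_2[x] == labels_2[y]
        if same_1 and same_2:
            a += 1  # together in both groupings
        elif not same_1 and not same_2:
            b += 1  # apart in both groupings
        total += 1
    return a, b, (a + b) / total


# The two groupings of the five examples A..E described in the text.
method_1 = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
method_2 = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}

a, b, ri = rand_index(method_1, method_2)
print(a, b, ri)  # 2 4 0.6
```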

Adjusted Rand Index (ARI)

RI suffers from one drawback: it yields a high value for pairs of random partitions of a given set of examples. To understand this drawback, think about randomly grouping a number of examples. When the number of partitions in each grouping, that is, the number of clusters, is increased, more and more example pairs are going to be in agreement because they are more likely not to be grouped together. This will result in a high RI value. Thus, RI is not able to take into consideration the effects of random groupings. To counter this drawback, an adjustment is made to the calculations by taking into consideration grouping by chance. This is done by using a specialized distribution, the generalized hypergeometric distribution, for modeling the randomness. The resulting measure is known as the adjusted Rand index (ARI).
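
A quick experiment makes this drawback visible. The sketch below assigns examples to clusters uniformly at random, twice, and computes RI between the two meaningless groupings. The sample size, the seed, and the cluster counts are arbitrary choices for illustration.

```python
import random
from itertools import combinations


def rand_index(labels_1, labels_2):
    """Fraction of pairs on which two label lists agree (same/same or apart/apart)."""
    pairs = list(combinations(range(len(labels_1)), 2))
    agree = sum(
        (labels_1[i] == labels_1[j]) == (labels_2[i] == labels_2[j])
        for i, j in pairs
    )
    return agree / len(pairs)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200                 # number of examples (arbitrary)
ri_by_k = {}
for k in (2, 5, 20):    # number of clusters in each random grouping
    g1 = [rng.randrange(k) for _ in range(n)]
    g2 = [rng.randrange(k) for _ in range(n)]
    ri_by_k[k] = rand_index(g1, g2)

for k, value in ri_by_k.items():
    print(k, round(value, 3))
```

With uniform random assignment, a pair agrees with probability of roughly 1/k² + (1 - 1/k)², so RI should come out near 0.5 for k = 2, near 0.68 for k = 5 and near 0.9 for k = 20, even though the two groupings have nothing to do with each other.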

ARI is best understood using an example. So let's look at the example of two clustering results used earlier. Let's create a contingency table summarizing the results of the two clustering methods. In this case, it is a 2x2 table wherein each cell shows the number of examples shared by the two clusters referenced by the corresponding row and column. M1C1 and M1C2 refer to the two clusters formed by a hypothetical method-1. M2C1 and M2C2 similarly refer to the two clusters formed by method-2. For clarity's sake, I have included the examples forming the respective clusters next to M1C1, M1C2 etc. The top left cell has an entry of 2 because the clusters M1C1 and M2C1 share two examples, A and B. Entries in the other cells have similar meaning. The numbers to the right and below the contingency table show the sums along respective rows and columns.
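
The contingency table just described can be rebuilt directly from the two groupings. Here is a short sketch; the variable names are mine, and the cluster labels follow the M1C1/M2C1 naming used above.

```python
from collections import Counter

# Cluster membership of examples A..E under the two methods from the text.
method_1 = {"A": "M1C1", "B": "M1C1", "C": "M1C1", "D": "M1C2", "E": "M1C2"}
method_2 = {"A": "M2C1", "B": "M2C1", "C": "M2C2", "D": "M2C2", "E": "M2C2"}

# Count how many examples fall into each (row cluster, column cluster) cell.
cells = Counter((method_1[x], method_2[x]) for x in method_1)
rows = sorted(set(method_1.values()))
cols = sorted(set(method_2.values()))

table = [[cells[(r, c)] for c in cols] for r in rows]
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]

print("      " + "  ".join(cols) + "  sum")
for r, row, total in zip(rows, table, row_sums):
    print(r + "  " + "  ".join(f"{v:>4}" for v in row) + f"  {total:>3}")
print("sum   " + "  ".join(f"{v:>4}" for v in col_sums))
```

The cells come out as 2 and 1 in the M1C1 row and 0 and 2 in the M1C2 row, with row sums 3 and 2 and column sums 2 and 3.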

To write the formula for ARI, let's generalize the entries of the contingency table using the following notation:

$n_{ij} = \text{Number of examples common to cluster i and cluster j}$

$a_i = \text{Sum of contingency cells in row i}$

$b_j = \text{Sum of contingency cells in column j}$

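The ARI is then expressed as:

$\text{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}$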

The first term in the numerator is known as the index, and the second term as the expected index. The first term in the denominator is called the maximum index, and the second term of the denominator is the same as the second term of the numerator. With these designations of the terms, the ARI is often expressed as

$\text{ARI} = \frac{\text{index - expected index}}{\text{maximum index - expected index}}$

Now let's go back to the contingency table for our example and calculate the different parts of the ARI formula first. We have:

$\sum_{ij}\binom{n_{ij}}{2} = \binom{2}{2} + \binom{1}{2} + \binom{0}{2} + \binom{2}{2} = (1 + 0 + 0 + 1)=2$

$\sum_i\binom{a_i}{2} = (\binom{3}{2}+\binom{2}{2}) = (3 + 1)=4$

$\sum_j\binom{b_j}{2} = (\binom{2}{2}+\binom{3}{2}) = (1 + 3) = 4$

Thus the index value for our example is 2; the expected index value is 1.6 (4*4/(5*4/2)). The maximum index value is 4 ((4 + 4)/2). Therefore, the ARI for our example is (2 - 1.6)/(4 - 1.6), which equals approximately 0.167. We see that RI is much higher than ARI; this is typical of these indices. While RI always lies between 0 and 1, ARI can also take a negative value.
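
The arithmetic above can be verified with a few lines of Python. This sketch takes a contingency table as a list of rows and follows the index / expected index / maximum index breakdown; the function name is my own, and it does not guard against the degenerate case where the maximum index equals the expected index.

```python
from math import comb


def adjusted_rand_index(table):
    """ARI from a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    index = sum(comb(n_ij, 2) for row in table for n_ij in row)
    sum_a = sum(comb(sum(row), 2) for row in table)        # over row sums a_i
    sum_b = sum(comb(sum(col), 2) for col in zip(*table))  # over column sums b_j
    expected_index = sum_a * sum_b / comb(n, 2)
    maximum_index = (sum_a + sum_b) / 2
    return (index - expected_index) / (maximum_index - expected_index)


# Contingency table of the running example: rows M1C1, M1C2; columns M2C1, M2C2.
ari = adjusted_rand_index([[2, 1], [0, 2]])
print(round(ari, 4))  # 0.1667

# A made-up table where the two groupings cut across each other.
negative = adjusted_rand_index([[1, 1], [1, 1]])
print(negative)  # -0.5
```

The second call shows a negative ARI: four examples are split into two pairs by one grouping, and the other grouping splits every one of those pairs.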

ARI is not the only measure for comparing two sets of groupings. A mutual-information-based measure, adjusted mutual information (AMI), is also used for this purpose. Maybe I will describe this measure in a future post.


Linear Regression using ChatGPT

[Originally published on March 7, 2023]

ChatGPT is a large language model (LLM) from OpenAI that was released a few months ago. Since then, it has created lots of excitement about a whole range of possible uses, lots and lots of hype, and a lot of concern about harm that might result from its use. Within five days of its release, ChatGPT had over one million users, and that number has been growing since then. The hype arising from ChatGPT is not surprising; the field of AI has been hyped from its inception. One just needs to be reminded of the Nobel Prize winner Herbert Simon’s statement “Machines will be capable, within twenty years, of doing any work that a man can do” made in 1965. Several concerns about the potential harm due to ChatGPT’s use have been expressed. It has been found to present inaccurate information very convincingly as fact. Its capabilities are so good that Elon Musk recently tweeted “ChatGPT is scary good. We are not far from dangerously strong AI.”

Since ChatGPT’s release, many companies and researchers have been playing with its capabilities, and this has given rise to what is being characterized as Generative AI. It has been used to write essays, emails, and even scientific articles, prepare travel plans, solve math problems, write code and create websites, among many other uses. Many companies have incorporated it into their apps. And of course, Microsoft has integrated it into its Bing search engine.

Given all the excitement about it, I decided to use it to build a linear regression model. The results of my interaction with ChatGPT are presented below. The complete interaction was over in a minute or so, slowed primarily by my one-finger typing.

So, all it took to build the regression model was to feed in the data and let ChatGPT know the predictor variables. It looks like a great tool. But like any other tool, it needs to be used in a constructive manner. I hope you like this simple demo of ChatGPT’s capabilities. I encourage you to try it on your own. ChatGPT is free to use, but you will need to register with OpenAI.