"The Holy Grail of Multimedia Information Retrieval: So Close or Yet So Far Away?"
An Answer

Theo Pavlidis
© Copyright 2008

The lead editorial of a recent (April 2008) special issue of the IEEE Proceedings on multimedia retrieval was titled "The Holy Grail of Multimedia Information Retrieval: So Close or Yet So Far Away?" (Hanjalic et al [HLMS08]). In this note I try to provide an answer. Several authors have already pointed out that the state of the art of Content-Based Image Retrieval (CBIR) is currently unsatisfactory: Yanai et al [YSGB05] state "current image retrieval methods are well off the required mark," Huston et al [HSHZ04] state "the semantic gap between the user's needs and the capability of CBIR algorithms remains significant," and Deserno et al [DAL08] state "Research as reported in the scientific literature, however, has not made significant inroads as medical CBIR applications incorporated into routine clinical medicine or medical research." I have posted on the web a lengthy discussion of the issues; this document is a summary, with frequent references to that posting for documentation.

Several authors (for example, Smeulders et al [SWSGJ00], Rasiwasia et al [RMV07], and Hanjalic et al [HLMS08]) point to the existence of a semantic gap between the pixel values and the interpretation of the image. Deserno et al [DAL08] point out that the semantic gap is only one of several gaps in CBIR efforts and present an interesting ontology of such gaps. I prefer to address the issue in a different way, based on my experience of both theoretical and applied research in pattern recognition. Certainly talking about a semantic gap is a major understatement: the semantic interpretation of an image has very little to do with statistics of the values of the pixels. One of the most popular sets of features used in the CBIR literature is color histograms. However, it is well known that histogram equalization rarely, if ever, affects the semantics of an image while it distorts the histogram significantly. (See the longer web posting for a demonstration of this fact.) One can repeat similar experiments with edge strength or texture histograms. Such global statistics can distinguish only extreme cases, for example, images with a single dominant color.
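
The histogram-equalization point can be made concrete with a small experiment. The sketch below (plain Python, using a hypothetical eight-level "image" of my own choosing) equalizes a tiny gray-level array: the histogram changes drastically, while the ordering of the pixel values, and hence the visible structure, is preserved.

```python
from collections import Counter

def equalize(pixels, levels=8):
    """Histogram equalization: map each gray level through the
    (scaled) cumulative distribution of the input image."""
    n = len(pixels)
    hist = Counter(pixels)
    cdf, running = {}, 0
    for level in range(levels):
        running += hist.get(level, 0)
        cdf[level] = running
    return [round((cdf[p] - 1) / (n - 1) * (levels - 1)) for p in pixels]

# A "dark" image: all values crowded into the low gray levels.
image = [0, 0, 1, 1, 1, 2, 2, 3]
output = equalize(image)

print(Counter(image))   # histogram crowded at levels 0-3
print(Counter(output))  # histogram spread over the full range

# Equalization is a monotone mapping, so the relative ordering of
# pixels (which is darker than which) -- and the image's "meaning" --
# is unchanged even though the histogram is not:
mapping = dict(zip(image, output))
assert all(mapping[a] <= mapping[b] for a in mapping for b in mapping if a <= b)
```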

The reliance on color appears particularly puzzling because it is well known that most of the information that appears to be conveyed by color is actually conveyed by luminance: two-color displays where both colors have the same luminance are hard to discern. This phenomenon is called isoluminance (see the site of Michael Bach for an impressive demonstration) and its existence implies that an image search based only on luminance is not going to be significantly inferior to a search that includes color. For the same reason the JPEG standard uses twice the resolution for luminance as it does for chrominance. The ability of color-blind people to deal quite well with most tasks of daily life, as well as the ability of everybody to obtain information from "black-and-white" images, offers additional support for this view.
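
For reference, the luminance (luma) channel in JPEG's YCbCr conversion is the standard Rec. 601 weighted sum of R, G, and B; the weights below are standard, while the two example colors are arbitrary choices of mine. Very different colors can land on nearly the same luminance, which is exactly the isoluminant situation described above.

```python
def luma(r, g, b):
    """Rec. 601 luma, as used in the JPEG YCbCr color conversion."""
    return 0.299 * r + 0.587 * g + 0.114 * b

# Two very different colors with nearly equal luminance -- an
# isoluminant pair that is hard to tell apart without color vision:
print(luma(255, 0, 0))  # pure red:       ~76.2
print(luma(0, 130, 0))  # a medium green: ~76.3
```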

The representation of an image by simple features usually results in loss of information, so that different pictures may map onto the same set of features. If an image has area A (in pixels) and each pixel has L possible levels, then there are L^A possible images. On the other hand, the number of possible histograms is less than A^L. Usually the perceptually significant L is much smaller than A, so that different images may have similar color histograms. Another way of expressing this phenomenon is to think of the mapping of an image into a set of features as a hashing process. The way such mappings are done results in too many hashing collisions: images of quite different nature are mapped into the same features.
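
The counting argument is easy to check on a toy example. In the sketch below the sizes A and L are arbitrary small values of my own choosing; the counts are exact, and by the pigeonhole principle millions of distinct images must share each histogram on average.

```python
from math import comb

A = 16  # image area in pixels (toy value)
L = 4   # gray levels per pixel (toy value)

num_images = L ** A                      # each pixel chooses one of L levels
num_histograms = comb(A + L - 1, L - 1)  # multisets of A pixels over L levels

print(num_images)                        # 4294967296 possible images
print(num_histograms)                    # only 969 possible histograms
print(num_images // num_histograms)      # average "hashing collisions" per histogram
```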

When a method is tested on a relatively small set of images such "hashing collisions" may not happen, so the results from such sets appear impressive. But the method is likely to fail on larger sets. The key parameter is not so much the absolute size of the set of images tested as its relation to the number of features used. This relationship has been studied in pattern classification, where it is usually called the "curse of dimensionality". Duda and Hart [DH73, p. 95] present the following analysis. Let d be the number of degrees of freedom of the classifier. If the classifier is linear, d equals the number of features F; otherwise it is greater than F. If n is the number of training samples, then for n equal to 2(d+1) there is a 50% probability that successful classification can be obtained with a random assignment of two classes to the sample points. Such a success may be called spurious classification, and the classifier is likely to fail on new samples. This issue is discussed at great length in Chapter 3.8 ("Problems of Dimensionality") and other parts of [DH73], as well as in Chapter 4.3 ("Parzen Windows") of the new edition of the text [DHS01]. For this reason it has been the practice to select the number of samples per class to be considerably greater than the number of features, usually five or even ten times as many.
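
The 50% figure comes from a counting result (Cover's function-counting theorem, which underlies the Duda-Hart discussion): of the 2^n ways to assign two classes to n points in general position, the fraction separable by a classifier with d degrees of freedom is 2^(1-n) times the sum of C(n-1, k) for k from 0 to d. The sketch below confirms that this fraction is exactly 1/2 at n = 2(d+1), for any d.

```python
from math import comb

def p_spurious(n, d):
    """Fraction of random two-class labelings of n points (in general
    position) that a classifier with d degrees of freedom can separate."""
    if n <= d + 1:
        return 1.0  # that few points are always separable
    return 2 ** (1 - n) * sum(comb(n - 1, k) for k in range(d + 1))

# At n = 2(d+1) training samples, random labels are "classified
# successfully" half the time -- a success that means nothing:
for d in (2, 10, 100):
    print(d, p_spurious(2 * (d + 1), d))  # prints 0.5 for every d
```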

A well-known observation from research in vision is that humans rely a lot on broad context and are likely to ignore local details. Bela Julesz [Ju91] stated that "in real-life situations, bottom-up and top-down processes are interwoven in intricate ways," and that "progress in psychobiology is ... hampered ... by our inability to find the proper levels of complexity for describing mental phenomena". According to V. S. Ramachandran, “Perceptions emerge as a result of reverberations of signals between different levels of the sensory hierarchy, indeed across different senses” ([RB98], p. 56). Ramachandran is explicitly critical of the view that “sensory processing involves a one-way cascade of information (processing)” [ibid]. Context has been a major challenge to Artificial Intelligence research overall, and the Nobel laureate Arno Penzias has put it very succinctly: “Human intelligence almost always thrives on context while computers work on abstract numbers alone. … Independence from context is in fact a great strength of mathematics.” [Pe89, p. 49]

The human visual system is able to ignore local cues once a high level interpretation has been achieved. This is illustrated by the following two sentences:
    New York State lacks proper facilities for the mentally III.
    The New York Jets won Super Bowl III.

The last word in each sentence is interpreted differently (the word "Ill" versus the Roman numeral three), even though the two character strings are identical. Such occurrences are even more likely in pictures. (See the main paper for more examples.)

Once we recognize the importance of context, there is at least one simple requirement for techniques for image retrieval: It is not enough to look at parts of an image, we must also look at the overall composition. Fischer et al [FSGD08] offer an example of improved retrieval by including object relationships in a medical application.

I believe that in order to construct a CBIR system we must make sure that the set of images is such that the following properties hold. The two "starred" properties (No. 2 and No. 3) may be considered optional: one may attempt to build a CBIR system without them, but when they hold, it is far more likely that a mechanical system will match or exceed human performance. Also, the presence of the properties listed below does not negate the need to follow other good CBIR practices, such as feedback from the user [IA03].

  1. The mapping between semantics and image features is well defined. This way the issue of the "semantic gap" is put to rest. The representation by features can be made without loss of information, in such a way that only images with similar interpretation, and no others, are mapped into the same set of features. (Examples abound in the pattern recognition literature, including OCR and fingerprint recognition.)
  2. * The context is known. For example, we may know that the object of interest occupies a particular area in the image or is close to some other well defined feature. The context can be taken advantage of in the design and selection of the features.
  3. * The matching of images requires careful scrutiny. This is a process that humans are not very good at and it is likely that machines can match or even exceed human performance. Medical, industrial, forensic, security, as well as other applications seem to offer problems of this type.
  4. The accuracy requirements are well defined. Knowing the relative significance of false matches versus omitted true matches is particularly important.
  5. The computational requirements are well defined. In many cases instant response is not needed. For example, in a medical application it may take well over an hour to produce an image, so waiting another hour to find matches in a database is not particularly onerous. In security applications a fast response is often necessary, but in that case the database is proprietary (not a public collection) and it may be organized accordingly for fast retrieval.

I do not see how we can have a mathematical representation without loss of information for the general category of "all images," but we have a good chance of achieving that by confining the problem to retrieving images from a particular category only (for example, matching an iris scan to a collection of iris scans). Biometrics and medical applications are areas that seem particularly promising. Probably the most successful instance of CBIR is automatic fingerprint identification (retrieving a match to a fingerprint from a database), which is used in practice by the FBI; there are also several commercial products for fingerprint verification (checking whether two fingerprints are the same), for example by Zvetco. (The mention of such systems attests only to their existence, rather than to their specific abilities.) In such applications the mapping of images to features was well understood and the main challenge has been the reliable extraction of such features from the image (for example, the minutiae in the case of fingerprints). Research is now continuing in more challenging soft biometrics, such as the matching of tattoos and scars (Lee et al [LJJ08]).

I think that the first property is critical because only if the images and objects of interest can be characterized formally can the results be scalable. The use of a standard set of images may be fine for the initial stages of the research, but the algorithm must soon be tested on any image that satisfies appropriate criteria. Let me digress here for a moment. I have worked in industry, where an imaging product is tested at the customer's site with blue collar workers (as opposed to graduate students) handling the equipment and, while hard, it is quite possible to design systems that perform well on an unlimited testing set (for example, in the patent by Pavlidis et al [PJHHL06]).

Evaluation of methodologies solely on the basis of statistical results of their performance on "standard sets" is fraught with risks because of the difficulty of ensuring that the membership of the set and the rules of evaluation are representative of the real world where the methodologies will be applied. See, for example, the study by Ioannidis [Io05]. Even though [Io05] addresses mainly medical studies, several of the issues it raises are also relevant to pattern recognition. It is important to pay close attention to what is considered satisfactory performance.

In conclusion, the answer to the question of the title is that past CBIR research has relied excessively on the mapping of images onto unreliable features and the use of statistical tests on relatively small sets of images. While such tests may produce impressive results in publications, they are not scalable and therefore not applicable to practical situations.

Retrieval of images without restrictions on their nature from large databases such as those found on the web is well beyond the current state of the art, and it will have to await advances both in image analysis and in computer architectures.

Sources Cited
[DAL08] T. M. Deserno, S. Antani, and R. Long "Ontology of Gaps in Content-Based Image Retrieval", Journal of Digital Imaging, 2008.
[DH73] R. O. Duda and P. E. Hart Pattern Classification and Scene Analysis, Wiley, 1973.
[DHS01] R. O. Duda, P. E. Hart, and D. G. Stork Pattern Classification, 2nd edition, Wiley, 2001.
[FSGD08] B. Fischer, M. Sauren, M. O. Güld, and T. M. Deserno "Scene Analysis with Structural Prototypes for Content-Based Image Retrieval in Medicine", Proc. SPIE Conf. No. 6914, 2008.
[HLMS08] A. Hanjalic, R. Lienhart, W-Y Ma, and J. R. Smith "The Holy Grail of Multimedia Information Retrieval: So Close or Yet So Far Away?" IEEE Proceedings, Special Issue on Multimedia Information Retrieval, 96 (4), 2008, pp. 541-547.
[HSHZ04] L. Huston, R. Sukthankar, D. Hoiem, and J. Zhang "SnapFind: Brute Force Interactive Image Retrieval" Proceedings of International Conference on Image Processing and Graphics, 2004.
[IA03] Q. Iqbal, and J. K. Aggarwal “Feature Integration, Multi-image Queries and Relevance Feedback in Image Retrieval”, Proc. VISUAL 2003, pp. 467-474.
[Io05] John P. A. Ioannidis "Why Most Published Research Findings Are False" Public Library of Science - Medicine 2(8): e124, 2005.
[Ju91] B. Julesz "Early vision and focal attention", Reviews of Modern Physics, vol. 63, (July 1991), pp. 735-772.
[LJJ08] J-E. Lee, A. K. Jain, R. Jin "Scars, Marks, and Tattoos (SMT): Soft Biometric for Suspect and Victim Identification" under review
[PJHHL06] T. Pavlidis, E. Joseph, D. He, E. Hatton, and K. Lu "Measurement of dimensions of solid objects from two-dimensional image(s)" U. S. Patent 6,995,762, February 7, 2006. Discussion of the methodology and links to sources on line.
[Pe89] Arno Penzias Ideas and Information, Norton, 1989.
[RB98] V. S. Ramachandran and S. Blakeslee Phantoms in the Brain, William Morrow and Company Inc., New York, 1998
[RMV07] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos “Bridging the Gap: Query by Semantic Example”, IEEE Trans. Multimedia, vol. 9, 2007, pp. 923-938.
[SWSGJ00] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain "Content-Based Image Retrieval at the End of the Early Years", IEEE Trans. PAMI, vol. 22, 2000, pp. 1349-1380.
[YSGB05] K. Yanai, N. V. Shirahatti, P. Gabbur, and K. Barnard "Evaluation Strategies for Image Understanding and Retrieval", Proc. ACM MIR, Nov. 2005.
