"The Holy Grail of Multimedia Information Retrieval: So Close or Yet
So Far Away?" An Answer
Theo Pavlidis
© Copyright 2008
The lead editorial of a recent (April 2008) special issue of the IEEE
Proceedings on multimedia retrieval was titled "The Holy Grail of
Multimedia Information Retrieval: So Close or Yet So Far Away?" (Hanjalic
et al [HLMS08]). In this note I try to provide an answer. Several
authors have already pointed out that the state of the art of Content-Based
Image Retrieval is currently unsatisfactory: Yanai et al [YSGB05]
state "current image retrieval methods are well off the required mark"
and Huston et al [HSHZ04] state "the semantic gap between
the user's needs and the capability of CBIR algorithms remains significant."
Deserno et al [DAL08] state "Research as reported in the
scientific literature, however, has not made significant inroads as medical
CBIR applications incorporated into routine clinical medicine or medical
research." I have posted on the web
a lengthy discussion of the issues; this document is a summary,
with frequent references to the posting for documentation.
Several authors point to the existence of a semantic gap (for
example, Smeulders et al [SWSGJ00], Rasiwasia et al
[RMV07], and Hanjalic et al [HLMS08]) as that between the pixel
values and the interpretation of the image. Deserno et al [DAL08]
point out that the semantic gap is only one of several gaps in CBIR efforts
and present an interesting ontology of such gaps. I prefer to address
the issue in a different way, based on my experience of both theoretical
and applied research in pattern recognition. Certainly talking about a
semantic gap is a major understatement. The semantic interpretation
of an image has very little to do with statistics of the values of the
pixels. One of the most popular sets of features used in the CBIR
literature is the color histogram. However, it is well known that
histogram equalization rarely, if ever, affects the semantics of an image
while it distorts its histogram significantly. One can repeat similar
experiments with edge strength or texture histograms. Such global statistics
can distinguish only extreme cases, for example, images with a single
dominant color.
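The effect of histogram equalization can be checked with a short experiment. The sketch below (plain Python, using an assumed synthetic low-contrast image rather than a real photograph) equalizes the image through its cumulative distribution: the histogram changes drastically, yet the mapping is monotone, so the ordering of gray values, and with it the visible content, is preserved.

```python
import random
from collections import Counter

def equalize(pixels, levels=256):
    """Histogram equalization: map each gray level to its scaled CDF value."""
    hist = Counter(pixels)
    lut, total = {}, 0
    for g in range(levels):
        total += hist.get(g, 0)
        lut[g] = round(total / len(pixels) * (levels - 1))
    return [lut[p] for p in pixels], lut

random.seed(0)
# A synthetic low-contrast "image": gray values crowded into a narrow band.
img = [random.randint(100, 139) for _ in range(64 * 64)]
eq, lut = equalize(img)

# The histogram changes drastically: the occupied range of gray levels
# jumps from about 40 levels to nearly the full 256.
spread_before = max(img) - min(img)
spread_after = max(eq) - min(eq)

# Yet the mapping is monotone, so the ordering of gray values -- and
# hence the content of the image -- is unchanged.
monotone = all(lut[g] <= lut[g + 1] for g in range(255))
```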
The reliance on color appears particularly puzzling because it is well
known that much of the information that appears to be conveyed by color is
actually conveyed by luminance: two-color displays in which both colors have
the same luminance are hard to discern. This phenomenon is called isoluminance (see
the site of Michael Bach
for an impressive demonstration) and its existence implies that an image
search based only on luminance is not going to be significantly inferior
to a search that includes colors. For the same reason the JPEG standard
uses twice the resolution for luminance as it does for chrominance. The
ability of color blind people to deal quite well with most tasks of daily
life as well as the ability of everybody to obtain information from "black-and-white"
images offer additional support to this view.
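As an illustration of that resolution asymmetry, here is a minimal sketch of JPEG-style 4:2:0 subsampling, assuming the full-range BT.601 RGB-to-YCbCr conversion used by baseline JPEG: the luminance channel is kept at full resolution while each chrominance channel is averaged over 2x2 blocks.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr, the conversion used by baseline JPEG."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def subsample_2x2(channel):
    """Average each 2x2 block: half the resolution in each direction."""
    h, w = len(channel), len(channel[0])
    return [[(channel[i][j] + channel[i][j + 1] +
              channel[i + 1][j] + channel[i + 1][j + 1]) / 4
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

# A tiny 4x4 RGB image (a red/blue checkerboard, chosen arbitrarily).
rgb = [[(255, 0, 0) if (i + j) % 2 == 0 else (0, 0, 255)
        for j in range(4)] for i in range(4)]
ycbcr = [[rgb_to_ycbcr(*px) for px in row] for row in rgb]

Y  = [[px[0] for px in row] for row in ycbcr]   # kept at full resolution
Cb = subsample_2x2([[px[1] for px in row] for row in ycbcr])
Cr = subsample_2x2([[px[2] for px in row] for row in ycbcr])
```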
The representation of an image by simple features usually results in loss of
information so that different pictures may map onto the same set of features.
If an image has area A and each pixel has L possible levels,
then there are L^A possible images. On the other hand,
the number of possible histograms is less than A^L. Usually
the perceptually significant L is much smaller than A, so that many
different images must share similar color histograms. Another way of expressing
this phenomenon is to think of the mapping of an image into a set of features as
a hashing process. The way such mappings are done results in too many
hashing collisions: images of quite different nature are mapped into the same
features.
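The counting argument can be verified exhaustively for toy values of A and L. The sketch below enumerates all L^A images of a 4-pixel, 3-level "image" and groups them by histogram; many distinct images collide on the same histogram.

```python
from collections import Counter
from itertools import product

L, A = 3, 4   # 3 gray levels, a 4-pixel "image" -- small enough to enumerate
images = list(product(range(L), repeat=A))        # all L**A = 81 images

collisions = Counter()
for img in images:
    histogram = tuple(sorted(Counter(img).items()))
    collisions[histogram] += 1                    # images sharing this histogram

n_images = len(images)                 # 81 distinct images ...
n_histograms = len(collisions)         # ... but only 15 distinct histograms
worst_case = max(collisions.values())  # 12 different images share one histogram
```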
When a method is tested on a relatively small set of images such "hashing
collisions" may not happen, so the results from such sets appear
impressive. But the method is likely to fail on larger sets. The key parameter
is not so much the absolute size of the set of images tested as its relation
to the number of features used. This relationship has been studied in
Pattern Classification and is usually called the "curse of dimensionality".
Duda and Hart [DH73, p. 95] present the following analysis. Let d
be the number of degrees of freedom of the classifier. If the classifier
is linear d equals the number of features F, otherwise
it is greater than F. If n is the number of training samples,
then for n equal to 2(d+1) there is a 50% probability
that successful classification can be obtained with a random assignment
of two classes to the sample points. Such a success can be called
spurious classification and the classifier is likely to fail
on new samples. This issue is discussed at great length in Chapter 3.8
("Problems of Dimensionality") and other parts of [DH73] as
well as in Chapter 4.3 ("Parzen Windows") of the new edition
of the text [DHS01]. For this reason it has been the practice to select
the number of samples per class to be considerably greater than the number
of features, usually five or even ten times as many.
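The 50% figure follows from Cover's function-counting theorem, which underlies the Duda-Hart analysis: for n points in general position in d dimensions, a hyperplane (d + 1 free parameters) can realize 2 * sum_{k=0..d} C(n-1, k) of the 2^n possible two-class labelings. The short sketch below checks that at n = 2(d+1) the fraction is exactly one half.

```python
from math import comb

def frac_separable(n, d):
    """Fraction of the 2**n two-class labelings of n points in general
    position in d dimensions that a hyperplane (d + 1 free parameters)
    can realize, by Cover's function-counting theorem."""
    if n <= d + 1:
        return 1.0            # every labeling is separable
    return 2 * sum(comb(n - 1, k) for k in range(d + 1)) / 2 ** n

d = 10                        # number of features of a linear classifier
p = frac_separable(2 * (d + 1), d)
# With n = 2(d+1) samples, random labels are perfectly separable
# half of the time: a "spurious classification".
```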
A well-known observation from research in vision is that humans rely a lot
on broad context and are likely to ignore local details. Bela Julesz [Ju91]
stated that "in real-life situations, bottom-up and top-down processes
are interwoven in intricate ways," and that "progress in psychobiology
is ... hampered ... by our inability to find the proper levels of complexity
for describing mental phenomena". According to V. S. Ramachandran “Perceptions
emerge as a result of reverberations of signals between different levels of
the sensory hierarchy, indeed across different senses” ([RB98], p. 56).
Ramachandran is explicitly critical of the view that “sensory processing
involves a one-way cascade of information (processing)” [ibid].
Context has been a major challenge for Artificial Intelligence research overall,
and the Nobel laureate Arno Penzias has put it very succinctly: “Human intelligence
almost always thrives on context while computers work on abstract numbers alone.
… Independence from context is in fact a great strength of mathematics.”
[Pe89, p. 49]
The human visual system is able to ignore local cues once a high level
interpretation has been achieved. This is illustrated by the following
two sentences:
New York State lacks proper facilities for the mentally III.
The New York Jets won Superbowl III.
The last word in each sentence is interpreted differently, even though
they are identical. Such occurrences are even more likely in pictures.
(See the main paper for
more examples.)
Once we recognize the importance of context, there is at least one simple
requirement for image retrieval techniques: it is not enough to
look at parts of an image; we must also look at the overall composition.
Fischer et al [FSGD08] offer an example of improved retrieval
by including object relationships in a medical application.
I believe that in order to construct a CBIR system we must make sure
that the set of images is such that the following properties hold. The
two "starred" properties (No. 2 and No. 3) may be considered
optional. One may attempt to build a CBIR system without them, but
when they hold it is far more likely that a mechanical system will match or
exceed human performance. Also, the presence of the properties listed below
does not negate the need to follow other good CBIR practices, such as feedback
from the user [IA03].
- The mapping between semantics and image features is well defined.
This way the issue of the "semantic gap" is put to rest. The representation
by features can be made without loss of information, in such a way that
only images with similar interpretations and no others are mapped into the
same set of features. (Examples abound in the pattern recognition literature,
including OCR and fingerprint recognition.)
- * The context is known. For example, we may
know that the object of interest occupies a particular area in the image
or is close to some other well-defined feature. The context
can be taken advantage of in the design and selection of the features.
- * The matching of images requires careful scrutiny.
This is a process that humans are not very good at and it is likely that machines
can match or even exceed human performance. Medical, industrial, forensic,
security, as well as other applications seem to offer problems of this type.
- The accuracy requirements are well defined. Knowing
the relative significance of false matches versus omitted true matches is
particularly important.
- The computational requirements are well defined.
In many cases instant response is not needed. For example, in a medical application
it may take well over an hour to produce an image, so waiting another hour
to find matches in a database is not particularly onerous. In security applications
a fast response is often necessary, but in that case the database is proprietary
(not a public collection) and it may be organized accordingly for fast retrieval.
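The point about accuracy requirements can be made concrete with a toy example (all scores below are hypothetical): the acceptance threshold that minimizes expected cost shifts sharply depending on the relative cost assigned to false matches versus omitted true matches.

```python
# Hypothetical match scores (higher = more similar); in a real system these
# would come from comparing a query image against a database.
true_pair_scores  = [0.9, 0.8, 0.75, 0.6, 0.4]   # pairs that truly match
false_pair_scores = [0.7, 0.5, 0.3, 0.2, 0.1]    # pairs that do not

def expected_cost(threshold, cost_false_match, cost_missed_match):
    """Cost of declaring a match whenever the score reaches `threshold`."""
    false_matches = sum(s >= threshold for s in false_pair_scores)
    missed_matches = sum(s < threshold for s in true_pair_scores)
    return (cost_false_match * false_matches +
            cost_missed_match * missed_matches)

thresholds = [i / 100 for i in range(101)]
# Forensic-style setting: a false match is 10x as costly as a miss.
strict = min(thresholds, key=lambda t: expected_cost(t, 10, 1))
# Screening-style setting: a miss is 10x as costly as a false match.
lenient = min(thresholds, key=lambda t: expected_cost(t, 1, 10))
```

With these (made-up) scores the cost-minimizing threshold is much higher when false matches dominate the cost than when misses do, which is exactly why the relative significance of the two error types must be specified in advance.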
I do not see how we can have a mathematical representation without loss
of information for the general category of "all images" but
we have a good chance of achieving that by confining the problem into
retrieving images from a particular category only (for example, matching
an iris scan to a collection of iris scans). Biometrics and medical applications
are areas that seem particularly promising. Probably the most successful
instance of CBIR is automatic fingerprint identification (retrieving
a match to a fingerprint from a database), which is used in practice by the FBI;
there are also several commercial products for fingerprint verification
(checking whether two fingerprints are the same), for example by Zvetco.
(The mention of such systems attests only to their existence, rather than
to their specific abilities.) In such applications the mapping of images
to features was well understood and the main challenge has been the reliable
extraction of such features from the image (for example, the minutiae
in the case of fingerprints). Research is now continuing in more challenging
soft biometrics such as the matching of tattoos and scars (Lee
et al [LJJ08]).
I think that the first property is critical, because only if the images and
objects of interest can be characterized formally can the results be scalable.
The use of a standard set of images may be fine for the initial stages
of the research but the algorithm must soon be tested on any image
that satisfies appropriate criteria. Let me digress here for a moment.
I have worked in industry, where an imaging product is tested at the customer's
site with blue-collar workers (as opposed to graduate students) handling
the equipment, and, while hard, it is quite possible to design systems
that perform well on an unlimited testing set (for example, in the patent
by Pavlidis et al [PJHHL06]).
Evaluation of methodologies solely on the basis of statistical results
of their performance on "standard sets" is fraught with risks
because of the difficulty of ensuring that the membership of the set and
the rules of evaluation are representative of the real world where the
methodologies will be applied. See, for example, the study by Ioannidis
[Io05]. Even though [Io05] addresses mainly medical studies, several of
the issues it raises are also relevant to pattern recognition. It is important
to pay close attention to what is considered a satisfactory performance.
In conclusion, the answer to the question of the title is that past CBIR
research has relied excessively on the mapping of images onto unreliable
features and the use of statistical tests on relatively small sets of
images. While such tests may produce impressive results in publications,
they are not scalable and therefore not applicable to practical situations.
Retrieval of images of unrestricted nature from large databases, such as
those found on the web, is well beyond the current state of the art and
will have to await advances both in image analysis and in computer architectures.
Sources Cited
[DAL08] T. M. Deserno, S. Antani, and R. Long, "Ontology of Gaps in
Content-Based Image Retrieval," Journal of Digital Imaging, 2008.
[DH73] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis,
Wiley, 1973.
[DHS01] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification,
2nd edition, Wiley, 2001.
[FSGD08] B. Fischer, M. Sauren, M. O. Güld, and T. M. Deserno, "Scene
Analysis with Structural Prototypes for Content-Based Image Retrieval
in Medicine," Proc. SPIE Conf. No. 6914, 2008.
[HLMS08] A. Hanjalic, R. Lienhart, W-Y Ma, and J. R. Smith, "The Holy
Grail of Multimedia Information Retrieval: So Close or Yet So Far
Away?" IEEE Proceedings, Special Issue on Multimedia Information
Retrieval, 96 (4), 2008, pp. 541-547.
[HSHZ04] L. Huston, R. Sukthankar, D. Hoiem, and J. Zhang, "SnapFind:
Brute Force Interactive Image Retrieval," Proceedings of International
Conference on Image Processing and Graphics, 2004.
[IA03] Q. Iqbal and J. K. Aggarwal, "Feature Integration, Multi-image
Queries and Relevance Feedback in Image Retrieval," Proc. VISUAL 2003,
pp. 467-474.
[Io05] J. P. A. Ioannidis, "Why Most Published Research Findings Are
False," Public Library of Science - Medicine, 2(8): e124, 2005.
[Ju91] B. Julesz, "Early vision and focal attention," Reviews of Modern
Physics, vol. 63, July 1991, pp. 735-772.
[LJJ08] J-E. Lee, A. K. Jain, and R. Jin, "Scars, Marks, and Tattoos (SMT):
Soft Biometric for Suspect and Victim Identification," under review.
[PJHHL06] T. Pavlidis, E. Joseph, D. He, E. Hatton, and K. Lu, "Measurement
of dimensions of solid objects from two-dimensional image(s),"
U.S. Patent 6,995,762, February 7, 2006. A discussion of the methodology
and links to sources are available on line.
[Pe89] A. Penzias, Ideas and Information, Norton, 1989.
[RB98] V. S. Ramachandran and S. Blakeslee, Phantoms in the Brain,
William Morrow and Company, New York, 1998.
[RMV07] N. Rasiwasia, P. J. Moreno, and N. Vasconcelos, "Bridging the Gap:
Query by Semantic Example," IEEE Trans. Multimedia, vol. 9, 2007,
pp. 923-938.
[SWSGJ00] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain,
"Content-Based Image Retrieval at the End of the Early Years,"
IEEE Trans. PAMI, vol. 22, 2000, pp. 1349-1380.
[YSGB05] K. Yanai, N. V. Shirahatti, P. Gabbur, and K. Barnard, "Evaluation
Strategies for Image Understanding and Retrieval," Proc. ACM MIR, Nov. 2005.