## An Evaluation of the Scale Invariant Feature Transform (SIFT)
## AbstractAn approximate analysis of SIFT combined with numerical results on the properties of extrema on which SIFT is based is used to suggest certain limitations of the transform. A set of images was selected to test this hypothesis and actual tests with an implementation of SIFT confirmed the limitations. SIFT appears to be useful in CBIR only when searching for nearly identical images. This is the case in forensic applications and in intellectual property disputes but it is hard to think of other applications. ## 1. IntroductionThe Scale Invariant Feature Transform (SIFT) has been proposed by D. Lowe [Lo04] as a tool for Content Based Image Retrieval (CBIR) and it has been used for that purpose by some authors (for example, [SZ08] and [LJJ08]). It has also been found lacking in some applications [Ba07]. SIFT is based on convolving an image with a Gaussian kernel for several
values of the variance denotes such differences, then the method is
based on finding extrema of this function with respect to all the three
variables. Such maxima are candidates for D(x,
y, σ)keypoints that,
after additional analysis, are used to characterize objects in an image.Extrema of support
of the Gaussian kernel matches (at least approximately) the shape
of the (gray scale) image function. Such a "tuning" property
in the scale space has been known so one may argue that SIFT acts as a
local frequency transform. However such extrema may occur in other positions
as well. [Lo04] discusses this issue briefly in Section 3.1 without any
analysis. The paper states that for a circle the maximum occurs at the
center when the "circular positive central region of the difference-of-Gaussian
function matches the size and location of the circle. For a very elongated
ellipse, there will be two maxima near each end of the ellipse."
It seems that the actual situation is more complex. In order to reduce
the clutter on images I consider only maxima of
and figures that are black on a white background. Figure F1 shows the
original shape (A), all the maxima found (B), and the two strongest maxima
(C). D(x, y, σ)
Extrema seem to occur not only as a result of matching a shape with the support of a kernel but also in other points in response to local curvature. This is also illustrated in Figure F2.
While [Lo04] suggests several ways to select keypoints from the candidate keypoints, there is no precise characterization of keypoints in terms of human perception of an image. Furthermore, the experience with the results of implementations of SIFT supports the view that there is no well defined connection between keypoints and the perceptual characteristics of an image. It is possible to obtain some analytical results for a limited class of images, as shown next. ## 2. The Case of Separable ImagesIn order to simplify the analysis I consider a separable image, one where
the gray scale function .
The Gaussian kernel is also separable,I(x,y) = X(x)Y(y) so
that the convolution of a separable image with a Gaussian kernel can be
expressed as the product of two one-dimensional convolutions, g(x, y, σ) = h(x, σ)h(y, σ) and XC(x,
σ) . If
two successive values of YC(y, σ) have ratio σ,
(q), then we have
q>1
and using to
subscript x to denote such a derivative, we have
x
is a step
function, 0 for negative values of X(x) and 1 for
positive values of x. Then it is easily shown
that for a symmetric kernel the derivative of the convolution equals x,
so that the above equation becomes
h(x, σ)
is also a step function the derivative
with respect to Y(y) takes a similar form:
y
find a common σ pair
that satisfies both. Then we have to look for an extremum of x_{0}, y_{0} with respect to D(x_{0},
y_{0}, σ).
If the ratio ofσ over YC(y, qσ)
is bounded by YC(y, σ) and a similar property holds for
q, then there will be values of XC()
(or x) that satisfy both equations as illustrated
in Figure F3. Because of the symmetric from of Eqs. (E3a) and (E3b)
zeros will occur when y and x
are the same so that they will be along the diagonal of the corner.
y
Therefore we may expect several locations of extrema in and the only issue is to find extrema
with respect to y. Appendices A and B show
specific examples. Because we cannot solve the Gaussian case analytically,
the analysis is performed for the σcase of a "boxcar" kernel
with support and it is followed by numerical results for the
Gaussian case. We consider two cases. In the first the one dimensional
function is a pulse so that the image contains a right rectangle (Appendix
A). In the second the one dimensional function is a step, so that
the image contains a corner with a vertical and a horizontal side (Appendix
B). and constant amplitude S
(so its area is one)1/SThe results confirm that candidates for keypoints occur not only when the support of the Gaussian kernel matches a shape, but also at several other points as predicted from Equ. (E3). ## 3. A Controlled Test of SIFT to Illustrate the Consequences of the Nature of the KeypointsIn order to illustrate the above observations I selected a set of images
with careful control of their characteristics. Professor Anil Jain of
Michigan State University kindly agreed to have one of his doctoral students,
Ms. Figure F4 shows the seven images used in the tests. The captions include the dimension of the image, the number of keypoints found by the SIFT method, and, in parentheses, the average number of pixels per keypoint.. The table below shows the results of the matching by SIFT keypoints in pairs of images from Figure F4.
Because of the brightness scale invariance of SIFT one would expected many matching key points for the pair A, B and this is indeed what happened as shown in the first row of the table. For this pair of images the performance of SIFT is far superior than that of algorithms based on color histograms or edge histograms. (The brightness enhancement transformation distorts both histograms in a significant way while the semantics remain unaltered.) For images A and F the performance of SIFT continues to be good, as expected. Because of the difference in pose of the main object, there only half as many matching keypoints in the A, G pair as they were in the A, F pair. Does such a drop agree with human impressions? Clearly, SIFT very well in matching images that are nearly identical, but even relatively small changes produce a big drop in matching points. The A, H pair presents a bigger challenge because H includes the subject of A in a different pose and with other object present. Figure F5 shows the matching of the keypoints.
Finally, SIFT does poorly in the pair A, J as expected because the object common in the two images is in significantly different poses. Figure F6 shows the matching of the keypoints.
I also thought to be interesting to compare image J and one centered on the beach ground (K). That produced several matches as shown in the Table and in Figure F7 below.
There are many keypoint matches not only between parts of the ground but also between parts if the dog and the ground. Note that image J appears to be more similar to K (91 matches) than to A (7 matches) which is not the answer that one looking for pictures of the dog would have liked. On the other hand, if one looks at images as a whole, then J and K are certainly more similar to each other than the A, J pair. ## 4. ConclusionsBoth the approximate analysis and the numerical simulations confirm that
extrema do occur at points where the support of the kernel matches roughly
the area of a shape (Figure F.B) but also in other places that have no
connection with image features. Keypoints depend strongly on A major characteristic of keypoints is that the luminance of the image does not enter in their definition. This can be both advantage (as in the case of matching images A and B in Section 3) but it can also be a disadvantage because slight variations in luminance will produce candidates for keypoints similar to those produced by large variations. One consequence of the close connection between the support of the kernel
and the size of variations in a image is that blurring will produce major
changes in the candidates for keypoints. If we assume that the blurring
kernel is also Gaussian with variance will be the same as convolving the original image
with Gaussian kernel with variance σ_{2
} ,
where σ is given by
σ
for
which the two keypoints were detected. As a result blurred images are
likely to produce a significant challenge (see also [Ba07]).
σBecause of the lack of perceptual significance of the keypoints, SIFT can be used to match images that are supposed to be nearly identical (as A, B, and F are), but not just similar (as A and G or H are). This is not a problem in the forensic application discussed in [LJJ08] where a tattoo image must be matched with an identical one in a database (if such an entry exists). Another potential application of SIFT is in intellectual property searches where one looks for nearly identical images. (The case of images A, B, and F offers such an example, all three have been produced from the same original.) The big question (given the computational cost of SIFT) is whether there are simpler ways to achieve the same result. ## 5. Works Cited[Ba07] T. Bakken "An Evaluation of the SIFT algorithm for CBIR"
[Lo04] D. Lowe "Distinctive Image Features from scale-invariant
keypoints" [LJJ08] J-E. Lee, A. K. Jain, R. Jin "Scars, Marks, and Tattoos (SMT):
Soft Biometric for Suspect and Victim Identification" [SZ08] J. Sivic and A. Zisserman "Efficient Visual Search for Objects
in Videos", [We] Weisstein, Eric W. "Convolution." From |

theopavlidis.com |
Site Map |