Horrible Software Engineering

Summary: It seems that the advertisement blocker that is part of the Norton Internet Security system decides that an image is an advertisement strictly on the basis of its dimensions. This causes the blocking of many images that are not advertisements and interferes with several applications. The remedy suggested on the Norton web site is six pages long and consists of adding sites to the "allowed" list. Norton has ignored much simpler solutions, such as considering the format of the image file (most advertisements are in GIF), as well as more reliable methods that are discussed in the literature. By opting for a minimal implementation the system places a significant burden on its users.

A Case Study in the Norton Internet Security System

It all started when the thumbnail sketches in one of my web pages would not display in my web browser. I checked the source code through the View>Source browser button and I saw that the respective code had just disappeared. The original code had been:
<body>
<h4>Mystery</h4>
<p><img src="books8JPG" width="160" height="120"></p>
<p><img src="books8JPG" width="120" height="90"></p>
</body>
The code downloaded by the browser was:
<body>
<h4>Mystery</h4>
<p><img src="books8JPG" width="160" height="120"></p>
</body>
The source code for the second image had disappeared! (I should add that the original file had many more pictures. The above file was created while trying to track down the problem.)

I determined that the problem was not in the server (both by checking with the server's customer service and by displaying the page correctly on another machine). The next possible culprit was the browser, but both Internet Explorer and Mozilla Firefox displayed the page incorrectly on my machine and correctly on another. Clearly there was a filter running on my machine that ate up pictures of certain dimensions. (I confirmed this by changing the height from 90 to 92, which caused the page to be displayed correctly.) My first thought was that such a filter was some "malware", maybe a practical joke. I used Norton to search for such a culprit, but then a friend suggested that Norton itself might be the culprit. I turned off Norton Internet Security and, presto, the file was downloaded and displayed correctly.

It seemed unbelievable that Norton Internet Security would eliminate an image simply on the basis of its dimensions but the evidence was so strong that I went to their customer support web site. At the top of the list of cases I received was:

Cannot access SoftBank HAWKS Baseball Broadband TV after installing Norton Internet Security
width=120 height=90

Obviously, I was not the first victim! There was a link to a lengthy set of instructions on how to fix the problem. I printed them out and they came to six pages. It was explained there that such images were suppressed because they were interpreted as advertisements!!! (Note: the original link to the Norton web site has expired, so I have copied the page in question to my own web site and also converted it to a PDF file.)

Obviously Norton took the standard sizes of the Internet Advertising Bureau and banned any image that has one of those sizes from any web page. What a solution! Maybe Norton has software engineers who lack not only technical knowledge but also common sense. However, Erez Zadok, a colleague at Stony Brook, suggested that the problem may lie less in the software engineers than in a corporate culture that wants "results" no matter what. So even people who should know better implement bad designs to meet goals imposed by clueless managers. By an interesting coincidence the Dilbert strip of the day I encountered this problem (Oct. 6, 2005) had the "boss" hiring "Dogbert" to write the FAQs in order to anticipate "our customers' most likely questions." First FAQ? "Where does your CEO live? I need to know so I can throw your cruddy product through his biggest window."

You may be curious to know whether the lengthy instructions helped me fix the problem. The answer is NO! So I applied a simpler remedy: I disabled the Norton Ad Blocker. You can do that by going to Norton AntiSpam>Ad Blocking and turning it off. (This is also a rather odd design. Most people associate SPAM with e-mail rather than internet browsing.)

If you are writing programs that create web pages that display images, you can use the following C++ function to Norton-proof your pages. This would save any future viewers of the pages from either having to figure out how to disable the Ad Blocker or having to struggle with the stuff in the Norton "support" page. The array of ad sizes has been taken from the standards site cited above.


#include <windows.h>	// BOOL, POINT, TRUE, FALSE

// If (*width, *height) is a standard advertising size, enlarge one of the
// dimensions by one pixel and return TRUE; otherwise leave both alone
// and return FALSE.
static BOOL AdProtect(int *width, int *height)
{
	static const POINT AdvSize[] = {	{300, 250}, {250, 250}, {240, 400},
		{336, 280}, {180, 150}, {468, 60}, {234, 60}, {88, 31},
		{120, 90}, {120, 60}, {120, 240}, {125, 125}, {728, 90},
		{160, 600}, {120, 600}, {300, 600} };
	const int ADSZ = sizeof(AdvSize) / sizeof(AdvSize[0]);
	for(int i=0; i<ADSZ; i++) {
		if(*width != AdvSize[i].x) continue;
		if(*height != AdvSize[i].y) continue;
		// we hit a standard advertising size
		// we increase the smaller dimension (the height, if equal) by one pixel
		// you can do something fancier if you wish
		if(*width > *height) (*width)++; else (*height)++;
		return TRUE;
	}
	return FALSE;
}

This horrible Norton feature has caused a lot of pain among users. I was directed to several web sites that discuss this problem:
page is trashed by ad blocker,
Thumbnail HTML is Fine on Server; "Corrupted" When Viewed in Browser,
Verschwundene Thumbnails (Vanished Thumbnails), etc.

How many person-hours have been lost because of the horrible Norton design?

At the very least, instead of just suppressing what it thinks is an advertisement the Norton software should have provided a hint to the viewer of the page, replacing the suppressed image tag by a short text, for example.

Some not so Horrible Solutions

The sad truth is that there are several simple ways to classify images as advertisements that are more reliable than their dimensions.

File Format: Most advertisements use the GIF format while most other images use the JPEG format. The following table shows the results from a quick test on three news web pages, two from Yahoo and one from the BBC.

Format/Kind      JPEG             GIF
Ad or Label      0+1+0 = 1        20+16+18 = 54
Photograph *     7+4+1 = 12       0+1+3 = 4

* I use the term photograph for a multitone image that is part of the main page content. (Illustrating a story, for example.)

Each of the numbers in a triplet corresponds to a particular page. If an "ad-detector" left JPEG images alone it would let very few ads go through and it would also have saved a lot of pain for people who create pages with thumbnail sketches (the latter are almost always in JPEG format). Since checking the format is a trivial operation (it can be found from the suffix of the file name) it is inexcusable that Norton did not implement that simple step.
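
The format check described above can be sketched in a few lines of C++. This is only an illustration, not anything Norton ships: the function names HasJpegSuffix and MayBeBlocked are hypothetical, and the sketch assumes it is given a bare file name or URL path with no query string after the suffix.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns true when the name ends in ".jpg" or
// ".jpeg", ignoring case. Assumes nothing follows the suffix.
static bool HasJpegSuffix(const std::string &name)
{
	// No dot means there is no suffix to inspect.
	std::string::size_type dot = name.rfind('.');
	if (dot == std::string::npos) return false;
	std::string ext = name.substr(dot + 1);
	std::transform(ext.begin(), ext.end(), ext.begin(),
		[](unsigned char c) { return static_cast<char>(std::tolower(c)); });
	return ext == "jpg" || ext == "jpeg";
}

// An ad filter following the suggestion in the text would leave JPEG
// images alone and consider only the remaining images for blocking.
static bool MayBeBlocked(const std::string &name)
{
	return !HasJpegSuffix(name);
}
```

With such a test in front of the size check, a 120 by 90 JPEG thumbnail would never reach the list of banned dimensions.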

Trivial Image Statistics: Images that are text displays or graphics tend to use significantly fewer distinct colors than photographs. The former usually have fewer than 100 distinct colors; the latter have in excess of 1000 distinct colors. Of course images with few colors are encoded best with GIF (or PNG) while images with thousands of colors are encoded best in JPEG. Unfortunately, creators of image files do not always use the best format, so this statistic may come in handy.
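
Counting distinct colors is equally simple once the image has been decoded. The sketch below assumes the pixels are available as packed 24-bit RGB (three bytes per pixel, no padding); that layout, the function names, and the UNDECIDED band between the two thresholds are choices made for this illustration. Only the thresholds of 100 and 1000 come from the text.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Counts the distinct colors in a decoded image stored as packed 24-bit
// RGB. The buffer layout is an assumption of this sketch.
static std::size_t CountDistinctColors(const std::vector<std::uint8_t> &rgb)
{
	std::unordered_set<std::uint32_t> seen;
	for (std::size_t i = 0; i + 2 < rgb.size(); i += 3) {
		// Pack the three channels into one integer key.
		std::uint32_t color = (std::uint32_t(rgb[i]) << 16) |
			(std::uint32_t(rgb[i + 1]) << 8) | std::uint32_t(rgb[i + 2]);
		seen.insert(color);
	}
	return seen.size();
}

enum ImageKind { LIKELY_GRAPHICS, LIKELY_PHOTOGRAPH, UNDECIDED };

// Thresholds 100 and 1000 are the rough figures quoted in the text;
// counts in between are left undecided here.
static ImageKind ClassifyByColorCount(std::size_t distinct)
{
	if (distinct < 100) return LIKELY_GRAPHICS;
	if (distinct > 1000) return LIKELY_PHOTOGRAPH;
	return UNDECIDED;
}
```

An image that comes out as LIKELY_PHOTOGRAPH would be left alone regardless of its dimensions or file format.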

Image Analysis: There are also statistical tests that distinguish text or graphics (which are present in all advertisements) from photographs. The two pictures below show an advertisement (the only one I found in JPEG in the pages I looked at) and its gray scale (luminance) version.

Notice that the gray scale version conveys all the useful information that the color version does. This is because the human eye has difficulty discriminating colors of the same luminance, so ad designers choose colors with different luminance for text and background. By going to the gray scale version we simplify the needed processing. The two drawings below are gray scale histograms. The one on the left has been taken over the part with the text "Earn the vacation ..." and the one on the right over the image of the sleeping person.
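
The conversion to gray scale might look as follows. The text does not say which luminance formula was used for the picture; this sketch assumes the common ITU-R BT.601 weights (0.299 R + 0.587 G + 0.114 B) and, as an illustration choice, a packed 24-bit RGB buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Luminance of one pixel with the BT.601 weights (an assumption of this
// sketch), computed in integer arithmetic with rounding.
static std::uint8_t Luminance(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
	return static_cast<std::uint8_t>(
		(299u * r + 587u * g + 114u * b + 500u) / 1000u);
}

// Converts packed 24-bit RGB (assumed layout) to 8-bit gray scale,
// one byte per pixel.
static std::vector<std::uint8_t> ToGray(const std::vector<std::uint8_t> &rgb)
{
	std::vector<std::uint8_t> gray;
	gray.reserve(rgb.size() / 3);
	for (std::size_t i = 0; i + 2 < rgb.size(); i += 3)
		gray.push_back(Luminance(rgb[i], rgb[i + 1], rgb[i + 2]));
	return gray;
}
```

After this step each pixel is a single number between 0 and 255, so a histogram has only 256 bins.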

The one on the left has three peaks that are far more prominent than those in the histogram on the right. The three peaks correspond to dark (shadow behind the letters), background and white (the letters). The histogram on the right has significant values outside its peaks. (It is easy to distinguish between the two histograms analytically but I want to keep this document free of mathematical formulas.)

Therefore one simple technique to identify images as possible advertisements is to compute the histogram over several parts of the image and look for the pattern that suggests a text display. If text covers a significant part of the image then we may flag it as an advertisement.
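
One possible way to turn the histogram test into code is sketched below. It is not the analytical test alluded to above, only a simple stand-in: it measures what fraction of the pixels in a region fall into the few most populated narrow bands of gray levels. The function names, the row-by-row 8-bit gray buffer, and the parameters (number of peaks, band half-width) are all assumptions of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::array<std::size_t, 256> Histogram;

// Gray-level histogram of the rectangle [x0, x1) x [y0, y1) of an 8-bit
// gray scale image stored row by row, imageWidth pixels per row.
static Histogram RegionHistogram(const std::vector<std::uint8_t> &gray,
	std::size_t imageWidth, std::size_t x0, std::size_t y0,
	std::size_t x1, std::size_t y1)
{
	Histogram h = {};
	for (std::size_t y = y0; y < y1; y++)
		for (std::size_t x = x0; x < x1; x++)
			h[gray[y * imageWidth + x]]++;
	return h;
}

// Fraction of the pixels that fall inside the `peaks` most populated
// bands of 2*halfWidth+1 adjacent gray levels. Values near 1 suggest a
// text display; lower values suggest a photograph.
static double PeakFraction(Histogram h, int peaks, int halfWidth)
{
	std::size_t total = 0, inPeaks = 0;
	for (std::size_t n : h) total += n;
	if (total == 0) return 0.0;
	for (int p = 0; p < peaks; p++) {
		int best = 0;
		std::size_t bestSum = 0;
		for (int c = 0; c < 256; c++) {
			std::size_t sum = 0;
			for (int g = c - halfWidth; g <= c + halfWidth; g++)
				if (g >= 0 && g < 256) sum += h[g];
			if (sum > bestSum) { bestSum = sum; best = c; }
		}
		inPeaks += bestSum;
		// Clear the chosen band so the next peak is a different one.
		for (int g = best - halfWidth; g <= best + halfWidth; g++)
			if (g >= 0 && g < 256) h[g] = 0;
	}
	return double(inPeaks) / double(total);
}
```

A caller would tile the image into regions, call PeakFraction(h, 3, 2) on each, and flag the image when the regions scoring above some cutoff cover a significant part of it. The cutoff would have to be tuned on real pages; no value is suggested here.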

Getting Really Serious

I described above some simple methods that could detect advertisements in a more reliable way than the current Norton implementation. However, there are also more reliable (although more complex) methods. The subject has received significant attention in the technical periodical literature. The following is the most recent publication on the topic that I was able to find:

Jianying Hu and Amit Bagga, "Categorizing images in Web documents", IEEE Multimedia, vol. 11 (2004), No. 1, pp. 22-30. (The article is available online for subscribers or for those who are members of institutions that subscribe to the IEEE journal package.)

The paper specifies several classes of images and assigns an image to a class on the basis of three types of criteria: whether it is graphics or multitone; the surrounding text; and text extracted from the image.

In addition to the relatively recent literature that deals directly with web issues, there is a huge literature on page segmentation. It describes techniques that have been used in high end copiers to apply different filters to different parts of the page, as well as in OCR systems to find the parts of the page with text to be recognized. You can start by visiting the Page Segmentation part of Keith Price's online bibliography on machine vision. Another interesting site is the one discussing the results of the 2003 ICDAR page segmentation competition. Note that these techniques are different from those that have access to the HTML file of a web page. The suspect image is considered as a page to be (possibly) segmented and classified as text/graphics or multitone image. Actually, the problem for web images is a bit easier than for printed images because the latter use halftones, so that both text and photographs use only two colors: black and white.

Theo Pavlidis (t dot pavlidis at ieee dot org)
Version of October 14, 2005 (Original: October 5, 2005)

 