Expressing Queries in Content Based Image Retrieval Systems

May 12th, 2009 by vq

In my work, I claim that there is indeed a need for various kinds of visual queries, and that systems for image retrieval should support these in some manner. Several different surveys of CBIR systems have been made, covering both different systems and different aspects of these systems. However, I was not able to find any surveys with a special focus on techniques for querying these systems. The 2000 survey by Veltkamp and Tanase includes a section on query specification techniques, but the findings are not summarized in any detail. Consequently, I decided to do a survey of the query techniques available in these systems. I’ve surveyed 58 systems from the past 13 years.
This has not been a very large part of my work, and I’m not claiming that this is a comprehensive study, that I have included all systems, or that this is a very in-depth analysis of query techniques. Still, I think I’ve been able to at least present a summary of what techniques have been used in CBIR systems. I’ve identified 6 main categories of query specification techniques. These are by no means new categories, and I think all of them have been described previously; I’ve only created my own descriptions of them:

  1. Query by Internal Example (QBIE). These systems allow queries based on images already present in the image database. Queries are made by selecting one or more images from the collection and using these as a basis for a similarity search. This type of query can either represent an initial query to the system, or be initiated based on the results obtained from another query. A number of the systems using this approach present the user with a small set of sample images, representing the topics or image types present in the collection. One recent example of the QBIE approach is the FIRE system.
  2. Query by External Example (QBEE). These systems allow queries based on images external to the image database. Queries are made by submitting one or more images to the system and using these as a basis for similarity search. One recent example of this approach is the Retrievr system.
  3. Query by Spatial Composition (QBSC). These systems allow the user to compose a query image representing their information needs using one or more drawing tools. This query type is also sometimes known as Query by Sketch.  A recent example of this approach is the Retrievr system.
  4. Query by Features (QBF). These systems allow queries based on the specification of low-level features. Queries are made by defining and manipulating colour histograms, creating or selecting texture samples, or using other feature specification techniques (a small code sketch of histogram-based matching follows this list). One example of this approach is the MARS system.
  5. Query by Image Area (QBA). These systems allow queries based on the selection of an image segment. Queries are expressed by selecting a section of an image (internal or external) and using this selection as a basis for a similarity search. One example of this approach is the BLOBWORLD system.
  6. Query by Text (QBT). These systems allow text-based queries. Queries are either expressed as keywords or through selection of collection categories. These queries are normally used as a method of initiating a search, providing the user with an initial set of images which can then be used as the basis for a query-by-example (QBIE) search. One recent example of this approach is the CORTINA system.
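To make the example-based query types a bit more concrete, here is a minimal sketch of how a system could rank collection images against an example image using colour histograms. This is purely illustrative (the function names are mine, and none of the surveyed systems necessarily works exactly like this), and it assumes NumPy and Pillow are available:

import numpy as np
from PIL import Image

def colour_histogram(path, bins=8):
    # Normalised RGB colour histogram for one image.
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.flatten() / hist.sum()

def histogram_intersection(h1, h2):
    # Similarity in [0, 1]; 1.0 means identical colour distributions.
    return float(np.minimum(h1, h2).sum())

def query_by_example(query_path, collection_paths):
    # Rank the collection by colour similarity to the example image.
    q = colour_histogram(query_path)
    return sorted(collection_paths,
                  key=lambda p: histogram_intersection(q, colour_histogram(p)),
                  reverse=True)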

Most of the surveyed systems support one (21) or two (25) of these query techniques. Only 8 systems support 3 specification techniques.

Type    Number    Percentage
QBIE    36        62.07 %
QBT     17        29.31 %
QBEE    16        27.59 %
QBSC    11        18.97 %
QBF     11        18.97 %
QBA      7        12.07 %

The most used technique is Query by Internal Example, both alone and in combination with other techniques. In a majority of the systems combining QBIE with other techniques, the other techniques are used as a point of entry to the system: the initial query is expressed using one of the other techniques, while QBIE is used either to refine the query through a relevance-feedback loop, or to initiate a new query based on one or more of the retrieved images. For example, in the AMORE system the user may select a category of images through a textual label (e.g. “arts” or “travel”), or by choosing a random set of images.
Only 11 of the systems support the type of queries I’m concerned with: queries by spatial composition. For these systems, I’ve tried to identify and classify the tools the user has available when composing the queries. I’ve identified 5 different categories of tools for composing queries using QBSC:

  1. Freehand drawing (F). These systems allow visual query specification through the use of freehand drawing. This refers to the use of a mouse (or similar tactile input devices) to create an interface similar to drawing using pen and paper. This allows the user a high degree of freedom to express any type of content, limited only by the user’s level of skill in freehand drawing.
  2. Colour specification (C). These systems allow visual query specification through the use of colours in combination with other tools. This allows the user to specify which colours should be present in the query image, as well as the spatial distribution of those colours (a small colour-layout sketch follows this list).
  3. Geometric Primitives (GP). These systems allow visual query specification through building a query image using geometric primitives such as circles, squares and lines. This allows the user to build the spatial composition of the query using such shapes.
  4. Shape Prototypes (SP). These systems allow visual query specification through the use of example shapes or shape prototypes, representing real-world objects. This allows the user to use these shape prototypes to spatially arrange the query participants within the query image.
  5. Texture (T). These systems allow visual query specification through the use of texture samples or texture specification tools. This allows the user to express which textures should be present in an image, as well as specify the spatial arrangement of these textures.
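As a rough illustration of how a spatially composed query could be matched against collection images, the sketch below reduces both the query drawing and each collection image to a coarse grid of average colours and compares them cell by cell. The grid size, the distance measure and all names are my own assumptions; none of the systems in the table below necessarily works this way:

import numpy as np
from PIL import Image

def colour_layout(path, grid=4):
    # Coarse colour layout: the image reduced to a grid x grid block of colours.
    small = Image.open(path).convert("RGB").resize((grid, grid))
    return np.asarray(small, dtype=float) / 255.0   # shape (grid, grid, 3)

def layout_distance(a, b):
    # Mean per-cell colour difference; smaller means a closer spatial match.
    return float(np.linalg.norm(a - b, axis=-1).mean())

# Usage: rank collection images against a composed sketch.
# sketch = colour_layout("sketch.png")
# ranked = sorted(paths, key=lambda p: layout_distance(sketch, colour_layout(p)))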

 

System                       Number of tool types
WISE                         3
Picasso                      3
CHROMA                       2
DrawSearch                   2
Retrievr                     2
VP Image Retrieval system    2
ImageScape                   2
QBIC                         3
Hermitage Museum             2
NETRA                        2
VisualSEEk                   2 (C and T)
Total (11 systems)           F: 7   C: 9   GP: 4   SP: 1   T: 4

I’ve used these findings in a discussion on the expressive convenience of visual queries and image retrieval. I’ll post more on this in the near future.

What is an image, exactly?

April 29th, 2009 by vq

It’s been very quiet on this site for quite some time now, as I was struck by illness some time ago (you can read about it – in Norwegian – at my private homepage). However, I’m back in business and working hard to hand in my dissertation sometime this summer.

One of the things I’ve spent some time struggling with is the definition of the concepts of “Image”, “Digital Image” and the more elusive “Mental Image”. On one hand, these are concepts most of us use on a daily basis, at least the first two. However, using them in my thesis has caused me some problems.

First of all, what is an image, and what is the difference between an image and a digital image? Is there any difference? The word “image” stems from the Latin word imago (imitation, copy, likeness, bust), but an image is generally a representation, or double, of something. In common usage, it is an artefact that reproduces the likeness of some subject, at several different levels. At the most basic level, an image represents a response to light as perceived by our visual senses, while at the most complex level it represents abstract ideas dependent on the viewer’s knowledge, experience and mood. In everyday life, terms like image and digital image are used interchangeably to describe this concept. However, the general term “image” might relate to several different concepts. Distinguishing between these concepts is essential for a discussion of digital image retrieval.

The first time I encountered these concepts in an academic setting was when I took an undergraduate course in Media Studies. If my memory isn’t entirely off, I think the definition we were given went something like “a visual representation of an object, scene, person or abstraction, produced on a medium”. When I first started my research in image retrieval (if my master’s thesis can actually be considered research), this was my working definition. However, when I started working with image information needs and image retrieval tasks, I ran into some problems. Basically, when someone wishes to find an image, they most likely have some idea of the type of image they are looking for, ranging from a very basic idea (“it should contain a dolphin! I like dolphins!”) to a specific image they have encountered before (“I would like to find that image of that strange Italian woman with the funny smile. Whatever was the name of that image?”). Whatever their need, they most likely have an idea of what they are looking for, or an “inner image”. For me, it became necessary to have a clear definition of what this is called, and to be able to distinguish it from the “real” image. When using visual queries, this “inner image” is likely to represent the only frame of reference for the user, and I needed a name and a proper definition for it.

Furthermore, what is a “digital image”? And is it actually any different from an “image”? I believe that most people do not have a clear distinction between these, other than that “digital images” are images shown on some sort of monitor, screen or other digital display device. But is this really the digital image? The definition of “image” I used above fits this perfectly. In my (somewhat confused) mind, what we see on a monitor is only a representation of the digital image. The digital image itself is nothing more than a binary stream, representing the pixels in an image. And it is this binary file that is analyzed and used in various content based image retrieval systems, not its visual representation. I’ve discussed this in the CAIM project and with my supervisor, and I think that we at least agree that it is an important distinction.
However, a large part of the existing literature in the field does not seem to think so. “Image” and “Digital Image” are used interchangeably, and “Image” is used to represent both digital images and “physical” images. Am I wrong in thinking that these are different concepts?

Anyway – I have decided to distinguish between the three concepts defined above, illustrated in the lovely image below:

An image is an image is an image

The figure presents the different image concepts I mentioned. First of all, there is the actual visual representation of the image: The observer is watching a visual representation of a dolphin produced on a computer screen. This is a common understanding of the term “image”. In this dissertation, this is called the visual image, and is defined as a visual representation of an object, scene, person or abstraction, produced on a medium. In the case of the image of a dolphin, this is a visual representation of a dolphin swimming under water, produced digitally on a computer monitor. This representation is not synonymous with the actual digital image stored in a computer system – it is merely a representation of the image. The actual digital image consists of pixels, arranged as arrays of two or more dimensions, as defined by the syntactical characteristics of the image. This I have defined as a Digital Image: a set of two-dimensional arrays composed of pixels whose locations hold digital colour and/or brightness information which, when represented on a suitable digital medium, form a visual image.
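A small code sketch may make the distinction clearer (the file name is hypothetical, and Pillow and NumPy are assumed): the digital image is just an array of pixel values, and a visual image only appears once those values are rendered on a display.

import numpy as np
from PIL import Image

digital = np.asarray(Image.open("dolphin.png").convert("RGB"))  # the digital image: pixel data
print(digital.shape, digital.dtype)   # e.g. (480, 640, 3) uint8
print(digital[0, 0])                  # the colour values of the top-left pixel

Image.fromarray(digital).show()       # only now is a visual image produced on a medium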

Finally, the mental image is the “inner image” a person has of the ideas, events and objects represented in a visual image. This might be the internalization of a visual image, or it might be a pure mental image, or ideal image, representing objects, scenes, concepts or events. In the context of image retrieval, a mental image may represent potential images an enquirer might be interested in. In the case of the figure above, the mental image is likely to be identical, or at least very similar, to the visual representation of the image. However, in some cases the mental image might not be directly related to a visual image. I have defined the “mental image” as: an internal visualization of an object, concept, event or scene in the mind of an individual.

Lastly, I have decided to use the term image as a common denominator for these three visual concepts when discussing matters that may be valid for all three forms. This is defined as: all visual representations of objects, concepts, events or scenes.

That’s about where I’m at right now. I guess these are quite likely the final definitions I’ll use, as time is running out, and I don’t think I’ll gain much by spending a lot more time on this. However, any ideas and thoughts you might have on this would be very welcome!

(And stay tuned for some more frequent updates)

Poster and presentation at Verdikt

October 16th, 2008 by vq

I will be presenting two parts of my project at the VERDIKT programme conference 2008:

A poster titled “A Framework for Evaluating Visual Query Modality”. This will be a presentation of a framework for evaluating visual query images based on the book “Reading Images: The Grammar of Visual Design” by Gunther Kress and Theo van Leeuwen. I will present both the framework and some of the results obtained through using it.

An oral presentation titled “Expressing Visual Queries – What are the Major Challenges?”. This will be a presentation of some of the main challenges facing users expressing visual queries. The work is based on three studies performed with students from the University of Bergen and students from the Bergen Academy of the Arts. Abstracts for both presentations can be found by following the above links.

Example of interface videos

September 26th, 2008 by vq

It’s been quiet here for a while. I’ve been rather busy finishing my last data collection round, and I’m deeply into analysing the stuff now. One of the data sources I’m studying is the participants’ interactions with the different visual query interfaces. I’d like to share two of these videos with you.

The first is a video of participant 22 creating queries in the “Retrievr” interface (The video is about 20mb in size, so modem users beware):

[qt:/movies/22_Retrievr.mov 480 240]

The second video is of participant 25 creating queries in the “VISI” interface (17mb).

[qt:/movies/25_VISI.mov 700 275]

The videos have been edited, so only the actual query process is shown. The format, particularly on the second video, is a bit off. I know, I’ll fix it. Soon. Enjoy!

A framework for evaluating visual queries

August 13th, 2008 by vq

The development of a formal framework for evaluating the visual query images has held a high priority for me during the last few months. My project requires that I perform a classification of the images, and the classification schema needs to be formalized in a way that can be discussed, criticized and repeated by my peers (Although I, at times, have some difficulties believing that anyone else would actually want to do this).

A nice rule of thumb is that an evaluation criterion should be objective. This is relatively easy to achieve for some measurements – this block of stone weighs exactly 5 pounds, or the temperature is exactly 15 degrees. However, interpreting and understanding images is very much dependent on the eye of the beholder. I can’t imagine any way of objectively determining, for example, that the use of colour in an image is exactly 49.5% realistic, or that the degree of abstraction in a drawing of a seagull is exactly 15%. And as I have come to understand during the last few years, the realism in an image might be very dependent on the cultural background and the personal experiences and knowledge of the observer. So how does one go about evaluating such images?

I have decided that there is no way of avoiding some sort of subjective evaluation of the query images. A purely mechanical evaluation process would be utterly pointless. Ask just about anyone who has worked within the field of CBIR during the last decade and you’ll probably get a lengthy explanation of why this is, at best, an utterly impossible and borderline insane approach (well, maybe not). At the same time, I would very much like to see at least some quantifiable measurements in the evaluation. Accordingly, I have developed a framework (or method?) which, hopefully, manages to combine a subjective evaluation with some criteria that have at least a minimal degree of objectivity.

In yesterday’s posts, I mentioned that I based my framework on the work on visual modality by Gunther Kress and Theo van Leeuwen. I considered each of their major modality markers, and tried to figure out which of them might be relevant when expressing visual queries. While all of them might be used, the nature of the query images is such that they would result in a minimal score on several of the markers. For example, the average number of colours used in the first 162 visual queries was 3.84 (min 2 colours, max 12 colours). Consequently, spending a lot of time evaluating the colour saturation, colour differentiation, colour modulation, brightness and illumination in these images might end up being an exercise in futility. So I decided to focus my attention on some modality markers which would provide me with some sort of tool for discriminating between the different images.

In order to combine the subjective evaluation with some sort of quantifiable measures, I decided to create a set of Boolean (yes/no) criteria for each of the markers. For each image, the evaluator (me, and hopefully some other (un)lucky individuals) would check whether the different criteria are satisfied. In addition, the evaluator would, for each modality marker and each image, subjectively rate the visual modality on a scale from 1 (lowest) through 5 (highest), based on their own feelings and knowledge about image modality. The modality markers and criteria are the following:

Use of colour
This is a sort of combination of the different colour markers presented by Kress and van Leeuwen. It describes the degree to which colour is used to create a realistic image. The following criteria have been used:

Monochromatic: The image is created exclusively using two colours, possibly different shades of the same colour.

Basic colour use: One or more objects in the image are coloured in a single colour. Different objects may have different colours.

Varied colour use: One or more objects in the image are coloured in more than a single colour.

Modulated colour use: One or more objects in the image are coloured using colour gradients, e.g. the sky is graded from a deep blue at the top towards a lighter blue further down.

Illumination: One or more light sources, either implicit or explicit, are used to create shadows, play of light, variances in brightness or other effects of light.

Contextualization
This describes the degree to which contextual elements and a background are used in the image. Contextual elements represent background and articulated details that are not directly relevant to the major objects of interest in an image, but provide these objects with a context. Examples of this would be the inclusion of the sun or a coral reef in a query for a dolphin. The following criteria have been used:

Use of Objects of Interest: The object(s) of interest are the major objects represented in a query, such as a dolphin. These are likely to be included in most images.

Use of background: This represents the inclusion of some sort of background other than a “neutral” background, i.e. a white background or a “neutral” canvas.

Use of symbolic contextual elements: This represents the inclusion of objects with a high symbolic value to the query, such as the inclusion of the sun to represent outdoors or a bright day, or the use of a straight or curved line to indicate the surface of the sea.

Use of detailed contextual elements: This represents the inclusion of objects which might not have a high symbolic value, but might naturally be found in a photograph or a drawing with a high visual modality, such as a school of fish, seaweed or a coral reef in a query for a dolphin.

Representation (Abstraction)
This describes the degree to which abstraction has been used in the image. In this context, abstraction describes the process of simplifying a visual object, from a completely realistic depiction to some sort of simpler depiction, while still retaining a connection to the object being simplified. The following criteria have been used:

Geometric primitives: This describes the use of geometric primitives (e.g. lines, squares and circles) to represent objects in an image.

Outlines: This describes the use of outlines to represent objects in an image.

Symbolic visual elements: This represents the use of visual elements that are of high symbolic value, such as a dolphin’s eyes or mouth, the limbs of a human, the branches of a tree or sails on a sailing boat.

Detailed visual elements: This describes the use of detailed visual elements, that is, elements that are not of high symbolic value, but would be included in a photograph or a highly realistic depiction. Examples of this could be individual strands of hair, leaves on a tree or similar minor details.

Texture: This describes the use of texture to represent the surface of an object, rather than using plain colour.

Depth (Or composition?)
This describes the degree to which depth has been used to give an image perspective and composition. The following criteria have been used:

Scaling: This represents the use of realistic scaling, i.e. a dolphin is depicted at its natural size relative to other objects in the image.

Overlap: This describes the use of overlapping objects to represent the order and distance of the depicted objects in relation to the observer.

Central perspective: This describes the use of the central perspective to give the image depth and composition.

I’m not completely satisfied with all of the terms I’ve used, such as “detailed contextual elements”. The proper naming of these terms, concepts and phrases is something I’m working on. Similarly, some of the criteria might be considered overlapping, or as not having well-defined borders, such as the distinction between “symbolic contextual elements” and “detailed contextual elements”. But this is the best I have come up with so far and, given the time I have available now, it will have to do.

In addition to this, the number of individual (or unique) objects in the image will be counted. An individual object is a visual object that represents a unique entity. For example, a disembodied head would be one individual object, while the hands, the head and the torso of a person might be considered a single object, even if they are not drawn completely together. An image of a dolphin eating a fish would contain at least 2 individual objects, even if they are connected in the drawing.

Finally, the evaluator will be given the opportunity to rate the difficulty of evaluating the image on a scale from 1 (easy) to 5 (difficult), in an attempt to identify images which might be problematic.
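To summarize the instrument, here is a small sketch of how a single evaluation could be recorded as data. The marker and criterion names follow the descriptions above, but the structure and the example values are illustrative assumptions of mine, not the actual evaluation forms:

from dataclasses import dataclass

@dataclass
class MarkerScore:
    criteria: dict      # criterion name -> True/False (satisfied or not)
    subjective: int     # subjective modality rating, 1 (lowest) to 5 (highest)

@dataclass
class ImageEvaluation:
    image_id: str
    markers: dict             # marker name -> MarkerScore, for the four markers above
    individual_objects: int   # number of unique visual objects in the image
    difficulty: int           # how hard the image was to evaluate, 1 (easy) to 5 (difficult)

# One (made-up) evaluation of a query image:
example = ImageEvaluation(
    image_id="participant22_seagull_1",
    markers={
        "use_of_colour": MarkerScore(
            {"monochromatic": True, "basic_colour_use": False, "varied_colour_use": False,
             "modulated_colour_use": False, "illumination": False},
            subjective=1),
        "contextualization": MarkerScore(
            {"objects_of_interest": True, "background": False,
             "symbolic_contextual_elements": True, "detailed_contextual_elements": False},
            subjective=2),
    },
    individual_objects=2,
    difficulty=2,
)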

This framework might not be ideal, or even very good. However, I feel that it is at least slightly better than relying on the gut instincts of one researcher. If everything turns out as I hope, at least two other evaluators with a background in the visual arts will provide me with their individual evaluations in addition to my own. Together, this might even turn out to produce some interesting results and insights.

I hope.

Kress and van Leeuwen on Visual Modality

August 12th, 2008 by vq

‘Modality’ is originally a term from linguistics, and refers to the truth value or credibility of statements about the world. In the book “Reading Images: The Grammar of Visual Design”, Kress and van Leeuwen discuss the concept of modality applied to images: visual modality. Visual modality, as used by me, should be understood as the degree to which an image represents a realistic depiction, both of the image as a whole and of the different objects represented in it.

While Kress and van Leeuwen present a thorough discussion of the role of visual modality, their concept of modality markers is of particular relevance for my project. A modality marker is a marker representing one facet of the naturalistic modality of an image. Together, these modality markers represent the important aspects of an image’s naturalistic modality. The key visual modality markers, as identified by Kress and van Leeuwen, are:

Colour Saturation: The degree of colour saturation used in an image, from full colour saturation to the absence of colour.

Colour Differentiation: The range of different colours used, from a maximally diversified range of colours to monochrome.

Colour Modulation: The use of different shades of a given colour, from, for example, the use of many different shades of red, to plain, unmodulated colour.

Contextualization: The use of background as a means to contextualize an image, from the absence of background to the most fully articulated and detailed background.

Representation: The degree of abstraction, from maximum abstraction to maximum representation of pictorial detail.

Depth: The degree of depth present in an image, from the absence of depth to maximally deep perspective.

Illumination: The degree to which the play of light is represented, from the fullest representation of the play of light and shade to its complete absence.

Brightness: The degree of brightness in an image, from the use of a maximum number of different degrees of brightness to just two degrees, such as black and white, dark grey and lighter grey, or two brightness values of the same colour.

Each of the modality markers can be represented as a scale from zero visual modality to full visual modality. Taken together, the modality markers determine an image’s modality configuration – how close a given visual query is to the real-world objects it represents.
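As a small sketch of the idea (my own notation, not Kress and van Leeuwen’s), a modality configuration can be treated as a vector of marker scores, say on a 1 to 10 scale, and the distance from a fully naturalistic configuration gives a rough sense of how far an image is from a realistic depiction:

MARKERS = ["colour saturation", "colour differentiation", "colour modulation",
           "contextualization", "representation", "depth", "illumination", "brightness"]

def modality_configuration(scores):
    # scores: dict mapping marker name -> 1..10; returns the ordered score vector.
    return [scores[m] for m in MARKERS]

def distance_from_naturalistic(scores, full=10):
    # How far a configuration is from full (naturalistic) modality on every marker.
    return sum(full - s for s in modality_configuration(scores))

# A typical sketched query image scores low on almost every marker:
sketch_scores = dict.fromkeys(MARKERS, 1)
sketch_scores["representation"] = 2
print(modality_configuration(sketch_scores))      # [1, 1, 1, 1, 2, 1, 1, 1]
print(distance_from_naturalistic(sketch_scores))  # 71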

When I first read this work, I simply could not believe my luck, and considered using the modality markers directly as a tool for evaluating and categorizing the visual query images. However, when I tried to use the framework on a sample of the query images, things became somewhat more complicated. The three images below represent typical visual query images created by the participants in my project.

Some more seagulls.

I attempted to map these images onto a modality configuration based on giving each marker a score from 1 through 10, where 1 represented the lowest possible modality, and 10 represented a “naturalistic” modality. The result of this is illustrated below:

Modality marker configuration

Basing my classification on Kress and van Leeuwen’s description of modality markers, almost all of “my” query images would obtain very low modality scores. This might be useful if my goal was to compare the visual queries to photographs, art images and the like. However, as a tool for discriminating between different approaches to creating visual queries, this is at best sub-optimal. So it seemed necessary to create some sort of new classification framework based on Kress and van Leeuwen’s visual modality markers, but adapted to my needs. This framework is now more or less completed, and will be presented here as soon as possible.

Thoughts on classifying visual query images

August 12th, 2008 by vq

One of the major goals in my project is to understand how people go about creating visual queries. Do they use colour? Do they create real-life, photorealistic drawings? To what degree do they simplify and abstract the objects they draw, and to what degree do they create complete drawings?

One of my main research hypotheses states that users will create simple, iconic queries rather than fully detailed, realistic drawings. But how should I go about evaluating this?

The 6 images above were all created by participants in my experiments, based on the textual queries “Find images of a seagull” and “Find images of a seagull eating”. Which one of these images would you consider the most “realistic” image? And to what degree are they “fully detailed”?

My first attempt to classify these images resulted in the paper Evaluating Use of Interfaces for Visual Query Specification, presented at Nokobit 2007. While the paper was well received, I am not particularly happy with the framework I used to classify the images. Images were classified on two axes: iconic vs. realistic, and complete vs. object-of-interest.

Iconic describes images where objects and scenes are drawn as simplified, iconic representations of their real-world counterparts. Realistic describes images where the objects and scenes are drawn so as to resemble the real-world objects they represent.

Complete describes images drawn so as to resemble a complete image, with objects in a natural environment. Object-of-interest describes images where only an object of interest, e.g. a dolphin, is depicted.

While this provided me with a quick-and-easy way of categorizing the images and provided some insights into how users create visual queries, the method leaves a lot to be desired. As we all know, validity and generality are two fundamental aspects of any scientific work. Unfortunately, the method used was neither very transparent nor repeatable. Classification into the different categories was performed by me alone, based on my own, barely-qualified opinions.

While I’m pretty confident that the findings reported in the study are more or less sound and provide important insights into the area, the data material would clearly benefit from a more transparent and repeatable analysis. As a result, I have developed a new framework for classifying query images. The framework is based on the work presented in Gunther Kress and Theo van Leeuwen’s book “Reading Images: The Grammar of Visual Design”, particularly concerning image modality.

The following posts will present the basics of their theory, the basics of the framework and how I’m using this framework to categorize the query images. Hopefully, the results will be slightly more general and valid than my first attempt.

Some example images

April 17th, 2008 by vq

Below are some examples of the visual queries created by the participants in my evaluations.


Nokobit 2007 – best contribution

November 21st, 2007 by vq

My Nokobit paper was selected as “Best Contribution” at Nokobit’07 today. Needless to say I’m both very happy and very proud about this :)

Visual Queries – What? Why?

October 26th, 2007 by vq

When asked what my research project is about, I usually start by answering “visual queries”. In approximately 90% of the cases, this is met by a blank stare. Further explanations usually end up with the question “Why is this interesting?”. So, as a public service: here are some basic concepts and the motivation for why on earth I’m interested in this.

Consider the task of retrieving images of a dolphin playing with a ball. In a traditional image retrieval system, such as Google Images, this would normally be achieved through a keyword based search, e.g. “Dolphin, ball, play”. Simple and easy.

A simple definition of a visual image query is that it is an image query based on similarity between a user-drawn image and images in an image collection. In the case of the playful dolphin, a visual query might be expressed through the following image:

Visual Query - dolphin playing with a ball

This might seem like a very inconvenient approach to image retrieval, particularly when a keyword-based search in Google Images is a trivial task. However, there are several cases where visual queries might be better suited than linguistic queries.

First of all, no satisfactory solution has yet been found for automatic indexing and description of the semantic content of an image. While it is possible to compare fingerprints and perform tasks such as face recognition using automated processes, creating software that can automatically identify a broad range of objects in an image has proven very difficult. As a result, the objects, scenes and activities illustrated in an image have to be described manually. As long as image collections are small, this does not pose a significant problem. But stop for a moment and consider the amount of images available through a service such as Google Images. A basic search on the word Dolphin returns almost 5 million images. And this is probably only a fraction of the images on the web containing dolphins, as the results are based on finding the textual term “dolphin” somehow connected to the image. As image collections grow larger, manual annotation becomes prone to the problems of volume and subjectivity.

The problem of volume refers to the fact that manual annotation of an image is a time-consuming task. Indexing times quoted in the literature range from about 7 minutes per image for stock photographs at Getty Images, to more than 40 minutes per image for a slide collection at Rensselaer Polytechnic. While it is relatively easy to create annotations for a small number of images, even a small personal computer can now store millions of images; at 7 minutes per image, annotating a collection of one million images would require roughly 7 million minutes, well over 100,000 hours of work, making manual annotation a daunting task, at best.

Furthermore, the combination of rich image content and differences in human perception makes it possible for two individuals to have very diverging interpretations of the same image. As a result, the description is prone to be both subjective and incomplete. Consider the image below.

Image from an Aquapark
Depending on your world view, you might say that the dolphin is joyfully playing with its caretaker. A second caretaker is watching, maybe evaluating the performance of the first caretaker. The trio is quite likely performing before an enthusiastic audience. On the other hand, it might be considered as an exploitation of an unhappy animal. The dolphin is held as a slave by the cynical owners of the aqua park in order to maximize their profits by showcasing the poor animal to a mindless audience. Both these stories might be true, and might prove a challenge when describing the image. This is called the problem of subjectivity.

While text-based classification has a high expressive power, there are some limitations when dealing with visual objects. Some syntactical image features are difficult to describe with words. For example, although we have a set of terms describing the different colours, none of these terms are exact. Every colour has a broad range of different shades and intensities. Although most people are able to differentiate between two different shades, it is difficult to express the difference verbally without using fuzzy terms like “more” or “less” red. Furthermore, creating exact and objective textual descriptions of textures or shapes is difficult. We call this the problem of explicability.

While it might be simple and unproblematic to describe a single image with keywords describing the basic objects in the image (dolphin, caretaker, beach ball), describing the structural characteristics of the image might prove a bigger problem. How does one describe the pose of the dolphin? The relationship between the dolphin and the caretaker, or the distribution of these objects in the image? And furthermore, how would you express a query for images with such distributions?

Finally, while I’m confident that every reader of this blog is more than capable of performing a search on Google Images using a basic set of keywords, there are some people who, for various reasons, are incapable of expressing even simple text-based keywords, whether through lack of education, mental handicaps or other reasons. Without aid, these people are denied access to the vast collection of information available to the rest of us. Even if they might not be able to read a linguistic text, they might benefit from images, sound clips or movies presenting information.