Nokia Challenge

Visual Landmark Recognition

Mobile devices provide ubiquitous access to the internet, and thus to almost unlimited amounts of data. Finding the information relevant to you can be a time-consuming task in itself. In modern fast-paced life, when you are on the go, coming up with suitable query terms and typing them on a virtual touch keyboard is simply too slow. Image recognition has been widely recognized as a potential novel way of accessing data relevant to your immediate surroundings: snap a picture of something and the system tells you about it.

Nokia and NAVTEQ together have created a dataset of street view data where individual buildings are identified. The dataset consists of 150k panoramic images aligned with a 3D city model consisting of 14k buildings obtained from footprint and elevation data. The images were labeled by projecting the 3D model into the panoramic images, computing visibility, and recording the identities of visible buildings.

For each visible building in each panorama, we computed a bounding box of its projection in the visibility mask and generated overlapping perspective views of its interior. Each image has a 60-degree field of view, a 640x480 resolution, and 50% overlap with its neighboring images. Because the virtual perspective camera has the same center as the original panorama, we call these perspective central images, or PCIs. In total, we generated 1.06M PCIs. For each PCI, we recorded the following information: field of view, center of projection, camera orientation, visibility mask, and the building label.
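The PCI generation step amounts to rendering a pinhole view from the panorama's center. The sketch below shows the core of that operation for an equirectangular panorama: build unit viewing rays for a 60-degree, 640x480 pinhole camera, map each ray to panorama coordinates, and sample. The function names are hypothetical, the camera orientation is assumed to be identity, and nearest-neighbor sampling stands in for proper interpolation.

```python
import numpy as np

def pci_rays(width=640, height=480, fov_deg=60.0):
    """Unit viewing rays for a pinhole camera with the given horizontal FOV."""
    f = (width / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    xs = np.arange(width) - (width - 1) / 2.0
    ys = np.arange(height) - (height - 1) / 2.0
    u, v = np.meshgrid(xs, ys)                             # (height, width) grids
    rays = np.stack([u, v, np.full_like(u, f)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def sample_panorama(pano, rays):
    """Map unit rays to equirectangular (lon, lat) and sample nearest pixels."""
    h, w = pano.shape[:2]
    lon = np.arctan2(rays[..., 0], rays[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * (w - 1)).round().astype(int)
    py = ((lat / np.pi + 0.5) * (h - 1)).round().astype(int)
    return pano[py, px]
```

Applying a rotation matrix to the rays before sampling would yield the differently oriented, 50%-overlapping views described above.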

Further, for each PCI, we also generated a perspective frontal image, or PFI, by shooting a ray through the center of projection of the PCI and computing the ray's intersection point with the scene geometry. The frontal view was generated by rendering the scene looking at a plane along the normal at the point where the ray meets the geometry. A PFI was only generated if an intersection point was found and the angle between the viewing direction and the normal at the intersection point was less than 45 degrees. In total, we generated 638k PFIs. For each PFI, in addition to the aforementioned types of information given for a PCI, we also give the warping plane parameters p, n, and d: the intersection point in the scene geometry, the normal of the plane at that point, and the distance of the virtual camera along the normal, respectively.
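The geometric test behind PFI generation can be sketched as a ray-plane intersection plus the 45-degree grazing-angle check. This is only an illustration under the assumption that the local scene geometry is a plane through point p with normal n; the actual pipeline intersects the ray with the full 3D city model, and the function name is hypothetical.

```python
import numpy as np

def pfi_plane(center, direction, p, n, max_angle_deg=45.0):
    """Intersect the PCI's central ray with the plane through p with normal n.
    Returns (intersection point, unit normal), or None if there is no valid
    intersection or the viewing angle exceeds the grazing-angle limit."""
    n = n / np.linalg.norm(n)
    d = direction / np.linalg.norm(direction)
    denom = np.dot(d, n)
    if abs(denom) < 1e-9:              # ray parallel to the plane
        return None
    t = np.dot(p - center, n) / denom
    if t <= 0:                         # plane is behind the camera
        return None
    # angle between the viewing direction and the surface normal
    angle = np.degrees(np.arccos(min(1.0, abs(denom))))
    if angle >= max_angle_deg:
        return None
    return center + t * d, n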

We are now opening up this database of PCI and PFI images to the community and challenge you to build a recognition system that can quickly retrieve the correct landmark given a query image. In other words, given a query image, return the correct building label. To test your system, we are also providing 803 labeled query images of landmarks in San Francisco, taken with several different camera phones by various people several months after the database images were collected. These images are taken from a pedestrian's perspective at street level.
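At its simplest, the task reduces to nearest-neighbor search over image descriptors: label the query with the building of its closest database image. The sketch below is only a minimal scaffold over hypothetical precomputed global descriptors; the baseline system instead matches local features with geometric verification, which is far more robust to the distortions described below.

```python
import numpy as np

def retrieve_label(query_desc, db_descs, db_labels):
    """Return the building label of the database image whose descriptor
    is closest to the query descriptor (brute-force L2 search)."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return db_labels[int(np.argmin(dists))]
```

For 1.06M PCIs, brute-force search is too slow at query time; an inverted index or approximate nearest-neighbor structure would replace it in practice.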

Challenges present in these queries include clutter (e.g., vehicles, pedestrians, seasonal vegetation), shadows cast on buildings, reflections and glare on windows, and severe perspective with extreme angles (e.g., photos taken at street corners). There are often large photometric and geometric distortions separating these query images from their closest matches in the database. For 596 of the query images, real GPS coordinates collected from the camera phones' onboard sensors are available. The remaining 207 query images have simulated GPS coordinates generated from a Gaussian error model. We specify if the GPS tag is real or simulated.
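A Gaussian error model for GPS can be simulated by perturbing the true position with zero-mean Gaussian offsets in metres and converting to degrees. The challenge does not publish the model's parameters, so the sigma below and the function name are assumptions for illustration only.

```python
import numpy as np

def simulate_gps(lat, lon, sigma_m=50.0, rng=None):
    """Perturb a true (lat, lon) with zero-mean Gaussian noise of sigma_m metres.
    sigma_m is an assumed value, not the challenge's actual error model."""
    rng = np.random.default_rng() if rng is None else rng
    dn, de = rng.normal(0.0, sigma_m, size=2)    # north/east offsets in metres
    dlat = dn / 111_320.0                        # metres per degree of latitude
    dlon = de / (111_320.0 * np.cos(np.radians(lat)))
    return lat + dlat, lon + dlon
```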

In addition to the accuracy of the system, its efficiency should be considered. Particularly in a mobile context, it is necessary to consider which computations should be done on the device and what information should be transmitted over the air to minimize query times. Additionally, the server-side processing needs to be fast enough to return the answer in a fraction of a second.

A baseline solution is given in Chen et al., "City-Scale Landmark Identification on Mobile Devices". A pre-print of the paper and the datasets are available at

ACM Multimedia 2011

Nov 28th - Dec 1st, 2011 Scottsdale, Arizona, USA
