Module1 - Mtrl - Computer vision

This page gives a short introduction to image (signal) processing and computer vision. For (much) more material, there is a nice online course here:

Introduction to Computer Vision (Udacity)

 

Tools

There are a number of very useful tools for doing computer vision. Deep learning has redefined the field more or less completely in the last few years.  

An often used library for low-level computer vision and image processing is OpenCV, which has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV is also well integrated with ROS, which is probably the most used framework for robotics research today. There is quite a large set of tutorials in OpenCV with code for several of these languages. More and more of what is shown below can be done with deep learning. However, it is still helpful to use the classical methods to illustrate the problems we are trying to solve, and in some cases the deep methods are still too computationally intensive for embedded applications.

You will investigate deep learning based methods for computer vision in Assignment Object detection.

 

Image processing / Filtering

We are now ready to start processing images.

Color space

You have probably heard of RGB, but did you know that there are 100+ different so-called color spaces? That is, different ways to represent color information.

In the RGB color space, the color of each pixel is represented by the three channels red, green and blue. HSV is an example of another color space. It stands for Hue, Saturation and Value. Here the hue corresponds roughly to the "color", which in some cases allows for simpler separation of colors. The figure below illustrates the HSV color space. Notice how it is shaped like a cone, where hue can be thought of as an angle that varies continuously across the color spectrum. Some cameras send the image information in YUV, and so on.

HSV_cone.png
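As a small, hedged sketch of how such a color-space conversion might look with OpenCV's Python interface (the file name is just an illustrative assumption):

```python
import cv2

# Load an image (the file name is an illustrative assumption) and convert it
# from OpenCV's default BGR channel order to the HSV color space.
img_bgr = cv2.imread("example.jpg")
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)

# Split into the three channels: hue, saturation, value.
# Note that OpenCV stores hue as 0-179 in 8-bit images.
h, s, v = cv2.split(img_hsv)
```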


 

Gradients and Edges

Images are nothing but 2D signals, and just like you look at derivatives/gradients of time-series data of scalar signals, you can do so spatially in the image. This gives information about how the intensity changes in the image. Large gradients correspond to quick changes, and these parts of the image typically contain important information. For example, the border between objects is typically defined by an edge, marking the change from one color/intensity to another.
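A minimal sketch of computing image gradients and edges with OpenCV (the file name and the Canny thresholds are illustrative assumptions):

```python
import cv2

# Gradients in x and y with the Sobel operator, plus Canny edge detection.
gray = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)

grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # derivative in x
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # derivative in y
magnitude = cv2.magnitude(grad_x, grad_y)            # gradient magnitude

edges = cv2.Canny(gray, 100, 200)  # hysteresis thresholds (arbitrary choices)
```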

Blurring, de-blurring and scale space

If you have ever used a camera, you know that your images will suffer from motion blur if the camera moves or if what you are trying to capture moves too fast relative to the exposure time. Long exposure times are typically needed when there is little light, since you then need to integrate over a longer time to get a good enough signal-to-noise ratio in the image sensor. This type of blur is something we typically want to get rid of. In a video, i.e. a sequence of images, you can estimate the motion and use this to compensate and perform de-blurring (example of deblurring in video editing software).

Odd as it may seem, we often perform blurring on purpose on images. This is not only the case when we want an object of interest to pop out more in the eye of an observer of a photo, but also when we work with computer vision. Blurring is the mechanism by which we can remove the fine-grained information (high spatial frequency) and let the computer focus on the larger-scale structures. One often performs analysis of images in scale space, i.e., we add one dimension to the analysis of an image and in addition to looking at different positions (x,y) we also look at different scales (different amounts of blurring/smoothing).
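A small sketch of what a simple scale space could look like, assuming Gaussian smoothing with increasing standard deviation (the sigma values are arbitrary):

```python
import cv2

# The same image smoothed with Gaussian kernels of increasing sigma;
# larger sigma removes more fine-grained (high spatial frequency) detail.
gray = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)
scale_space = [cv2.GaussianBlur(gray, (0, 0), sigmaX=sigma) for sigma in (1, 2, 4, 8)]
```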

Blurring / smoothing an image is an example of filtering the image information by removing the high-frequency content. The images below show examples of how effective median filtering (blurring) can be at removing noise. The left image is corrupted by noise and the right one shows the result after applying a median filter. The first example comes from this page.
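A hedged sketch of median filtering with OpenCV (the file name and kernel size are assumptions):

```python
import cv2

# Median filtering is particularly effective against impulse ("salt and pepper") noise.
noisy = cv2.imread("noisy.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.medianBlur(noisy, 5)  # each pixel is replaced by the median of its 5x5 neighborhood
```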

Frequency analysis

Again, just like when doing signal processing on time-series data, we can look at the spatial frequency content of an image and perform analysis and processing in that domain. We can represent the image information in the frequency domain, where some filtering can be done in a simpler way.
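A minimal sketch of frequency-domain processing with numpy (the cut-off radius is an arbitrary assumption):

```python
import cv2
import numpy as np

# 2D FFT of the image, with the zero-frequency (DC) component shifted to the center.
gray = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
spectrum = np.fft.fftshift(np.fft.fft2(gray))

# Crude low-pass filter: keep only frequencies within a radius of the center.
rows, cols = gray.shape
y, x = np.ogrid[:rows, :cols]
mask = (y - rows / 2) ** 2 + (x - cols / 2) ** 2 < 30 ** 2

low_pass = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))  # back to the image domain
```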

Segmentation and clustering

Another basic operation in image processing is image segmentation. Here the task is to separate the image into regions corresponding to, for example, different objects. As described in this wikipedia article, there are many different ways to go about this.

Take a look at the notebook http://localhost:8888/notebooks/cv_08_hsv_segmentation.ipynb for an example of HSV based segmentation. The image below is an example from that notebook.

jupyter_hsv_segmentation.png

Q3: Can you segment out the lemon in the image fruits_320_213.jpg (shown above)?
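A hedged sketch of HSV-based segmentation, in the spirit of the notebook above; the threshold values below are illustrative assumptions, not the ones used in the notebook:

```python
import cv2
import numpy as np

img_bgr = cv2.imread("fruits_320_213.jpg")
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)

# Keep roughly yellow hues (OpenCV hue range is 0-179) that are fairly saturated and bright.
lower = np.array([20, 100, 100])
upper = np.array([35, 255, 255])

mask = cv2.inRange(img_hsv, lower, upper)                 # binary mask
segmented = cv2.bitwise_and(img_bgr, img_bgr, mask=mask)  # keep only the masked pixels
```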

Morphological operations/transformations

You can use morphological operations to clean up your images. The two fundamental operations are dilation and erosion. In their simplest form, they operate on binary images, i.e., black and white, where we think of white as being the signal we are looking for. In this context you can think of dilation as letting each white pixel spill over to the pixels nearby. How far it spreads defines how much dilation is performed. Erosion is the reverse of this, where a pixel in the resulting image is only kept white if all pixels within some region around it in the original image are white.

The examples below are from http://what-when-how.com/introduction-to-video-and-image-processing/morphology-introduction-to-video-and-image-processing-part-1/

Erosion: ex_erosion.png

Dilation: ex_dilation.png

By combining these operations we can achieve a) Removing small objects, b) Filling holes and c) Isolating objects, as illustrated in the images below.

ex_dilation_and_erosion.png
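A minimal OpenCV sketch of these operations on a binary image (the file name and structuring-element size are assumptions):

```python
import cv2
import numpy as np

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)  # white = the signal we are looking for
kernel = np.ones((5, 5), np.uint8)                     # structuring element

eroded = cv2.erode(binary, kernel)
dilated = cv2.dilate(binary, kernel)

# Combinations: opening removes small objects, closing fills small holes.
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```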

2 min video by Aaron Bobick on morphological operations

 

Features

OpenCV provides a good tutorial on understanding features. An image feature typically represents a position in the image that is significant in some way; a good feature should be easy to track and easy to compare between images in order to find the same physical point in the world.

Features are used for an abundance of tasks in computer vision: stereo matching, object tracking, object recognition, optical flow calculation, etc.

When speaking of a feature detector, one often thinks of two mechanisms: one that finds the positions of points in the image that are of interest (keypoints), and one that describes these points somehow (descriptor).

For many of the tasks where we use features, it is important that the feature can be detected at the same location again if the object in the image has not changed. Many keypoint detectors look for positions in the image where the image gradients are large in both x and y. This means that the position is well defined. If the gradient is large in only one direction, i.e. we essentially have a line, the position can slide along the line without a big local change in the image. One of the first detectors, and still used, is the Harris Corner detector.
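A small sketch of the Harris corner detector in OpenCV (the parameter values are typical choices, not tuned for any particular image):

```python
import cv2
import numpy as np

gray = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)

# Corner response; high where the gradients are large in both x and y.
response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)

# Keep positions whose response is a sizeable fraction of the maximum.
corners = np.argwhere(response > 0.01 * response.max())
```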

Today, various forms of deep nets are gradually replacing the use of handcrafted features. Classical features such as SIFT, SURF and ORB are still being used heavily though.

Feature matching

Finding correspondences between features in one frame and those in another is a fundamental problem. One typically makes use of both the location of the feature and its descriptor when performing the matching. As this is often performed at a low level in a system and at high frame rates, it needs to be quick. There is therefore a trade-off: either use a very powerful descriptor that can be relied on heavily but is expensive to compute, or use simpler descriptors and rely on various forms of robust methods to tell inliers from outliers.

Below you can see an example output from matching two images. The upper image shows all matches; in the lower one only the strongest matches have been kept, and you can see that the proportion of correct matches is much higher.

jupyter_feature_matching.png
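A hedged sketch of what such matching could look like with ORB features in OpenCV; keeping only the strongest matches is done here with Lowe's ratio test, which is one common choice (the file names are assumptions):

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary ORB descriptors in both images.
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance, two candidates per feature.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.knnMatch(des1, des2, k=2)

# Keep a match only if it is clearly better than the second-best candidate.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
```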

 

RANSAC

RANSAC, or RANdom SAmple Consensus, is one of the most frequently used algorithms in the context of matching. The basic idea is simple. We randomly draw the minimum number of points needed to estimate a model. We then check how many of the other points support this model. We repeat this a number of times and keep the model that had the biggest support in the data.

RANSAC to find line in point data

Imagine that you have a set of points. These points could, for example, be points with high gradients in an image, i.e., points that could belong to lines. Now draw two points at random, calculate the equation of the line through them, and check how many of the other points are "close enough" to the line. Repeat this a number of times and pick the line with the best support in the data.
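A minimal numpy sketch of this procedure (the number of iterations and the inlier threshold are arbitrary assumptions):

```python
import numpy as np

def ransac_line(points, n_iters=100, threshold=1.0):
    """Fit a line ax + by + c = 0 to an (N, 2) array of points with RANSAC."""
    best_inliers, best_line = 0, None
    for _ in range(n_iters):
        # 1. Draw the minimal sample: two distinct points.
        i, j = np.random.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        # 2. Line through p and q, with a unit-length normal vector (a, b).
        a, b = q[1] - p[1], p[0] - q[0]
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        a, b = a / norm, b / norm
        c = -(a * p[0] + b * p[1])
        # 3. Count the points that are "close enough" to the line.
        distances = np.abs(points @ np.array([a, b]) + c)
        inliers = np.sum(distances < threshold)
        # 4. Keep the line with the largest support.
        if inliers > best_inliers:
            best_inliers, best_line = inliers, (a, b, c)
    return best_line, best_inliers
```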

RANSAC to find point correspondences

Now imagine instead that we have two sets of keypoints, one in each image. We want to find the best pairing between the keypoints. We randomly draw 4 point correspondences, the minimum needed to estimate the so-called homography between the two images. We transform the points from one image to the other using the homography and check how many end up close to points in the other image. Points that are close enough are considered inliers. We repeat this a number of times and pick the set of matches that resulted in the largest set of inliers.
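A hedged sketch of this, building on the ORB matching example above (so kp1, kp2 and good are assumed from that snippet; the reprojection threshold is an arbitrary choice):

```python
import cv2
import numpy as np

# Matched keypoint coordinates in the two images.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Estimate the homography with RANSAC; the mask tells us which matches are inliers.
H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, ransacReprojThreshold=5.0)
inlier_matches = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
```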

Look at the image below to see how we use RANSAC to clean up the matches between two images.

jupyter_feature_matching_ransac.png 

Stereo and Disparity

One of the most common uses for features is to be able to match points between the left and the right image in a stereo setup. If we have the correspondence between points in the two images we can calculate the distance to the corresponding point in the world. The difference in image position between the left and the right image is called disparity. Points in the world that are far away generate a small disparity, and points close by a large disparity. The depth is inversely proportional to the disparity.

The proportionality constant that relates depth and disparity is the product of the focal length of the camera and the baseline, i.e., the distance between the cameras (assuming that one image has been translated sideways). In formulas:

disparity = baseline * focal length / depth 

OpenCV provides an explanation for this here.

Below is an example of an image and the corresponding disparity image.

Note that you often need to specify how large a disparity you expect to find at most. For example, if you have a 0.1 m baseline, a focal length of 500 pixels and you expect nothing closer than 1 m, you would expect disparities of 50 pixels or less. As you see from the formula above, the closer the object, the larger the disparity. The method for disparity calculation used in the notebook makes use of a sliding window and looks for the highest correlation between windows in the left and right images. The distance between the locations of the highest correlated windows gives the disparity.
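A hedged sketch of block-matching based disparity in OpenCV together with the depth relation above (the baseline and focal length are the assumed values from the example, and the file names are assumptions):

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities is the maximum disparity we expect (must be a multiple of 16).
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # output is fixed-point

baseline = 0.1        # meters, as in the example above
focal_length = 500.0  # pixels, as in the example above
depth = baseline * focal_length / np.maximum(disparity, 1e-6)  # meters; invalid pixels become huge
```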

You can find many more images to test on at http://vision.middlebury.edu/stereo/data/.

Today there are deep learning based methods that can infer depth from single images. If you think about it, as a human you are able to estimate depth even from a single printed image, simply by making use of experience and the relative sizes of objects in the image. Take a look at the video Unsupervised Monocular Depth Estimation with Left-Right Consistency for an example.

 

RGB-D

One of the challenges when working with stereo is to find correspondences between the left and the right camera. This is especially troublesome in environments where there is little or no texture. Two images of a perfectly white wall provide no distinct points to match, and thus we cannot estimate the distance. Ever since Microsoft released the Kinect sensor, so-called RGB-D sensors have become very popular as a way to provide depth information directly from the sensor. The depth can be calculated in several ways.

One way is to use structured light. A projector projects a pattern onto the world, which a camera perceives. By finding correspondences between the known projected pattern and the recorded images, we can calculate the distance. This means that we essentially bring our own texture to the world. Most such sensors struggle in outdoor environments and have trouble when several sensors are looking at the same scene, as the patterns interfere. More recently, time-of-flight based RGB-D sensors are becoming widely available.

The video People Detection in RGB-D Data (Kinect based people detection) gives an example of how the data from a consumer level RGB-D sensor can be used to detect and track people.

 

Camera model and calibration

As we have seen above, there are several cases where we need a model of the camera to be able to calculate, for example, distances to objects. To get familiar with the standard camera model, take a look at these:

Q1: If you are given the principal point, the focal length and an image coordinate, can you calculate the corresponding bearing (assume that we have a perfect pinhole camera model)?
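As a small sketch of the standard pinhole relations (assuming no lens distortion; fx and fy are the focal lengths in pixels and (cx, cy) the principal point):

```python
import numpy as np

def pixel_to_bearing(u, v, fx, fy, cx, cy):
    """Back-project a pixel (u, v) to a unit direction (bearing) in the camera frame."""
    x = (u - cx) / fx      # normalized image coordinates
    y = (v - cy) / fy
    direction = np.array([x, y, 1.0])
    return direction / np.linalg.norm(direction)
```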

Q2: The image below was captured with a camera that was very roughly 40 cm above the table with the calibration pattern, at a resolution of 1280x720. The paper on the table is a standard A4 paper, i.e. 297x210 mm. The size of the pattern in the image is roughly 865x610 pixels. What are rough estimates for the focal lengths in x and y (in pixels)? You will be asked this in a quiz so work it out. How do the estimates for fx and fy differ?



Different camera models

Some camera setups require a bit more advanced models. If you have a catadioptric or a fisheye camera (very large field of view), you probably want to consider using, for example, this calibration toolbox:

https://sites.google.com/site/scarabotix/ocamcalib-omnidirectional-camera-calibration-toolbox-for-matlab

Below is an example of a catadioptric camera setup and an example image from such a camera.

catadioptric.jpeg          football_donut.jpg

The images below show a typical fisheye lens and an example image acquired with such a lens.

Fisheye lense.jpg  fisheyerecordcollector_1317908152.jpg 

Q3: Give an example of an application where a large field of view is desired and one where it is not.

Q4: What are the pros/cons of the two camera setups above that give a larger field of view?

Recognition and classification

One of the most studied areas in computer vision is that of object recognition and classification, that is, identifying instances and classes of objects. To see that this is challenging, look at the image below (from michael-felsberg-vision-20161006.pdf). We have three classes of objects (train, bottle and bicycle) and the images show a large variation in viewpoint, scale, illumination and context.

classification_challenges.png

Nowadays, classification is completely dominated by deep learning based techniques. Traditional methods typically made heavy use of bag-of-words representations (icvss08_az_bow.pdf).

The success of deep learning methods on these tasks can be attributed, in no small part, to the generation of a very large dataset of annotated images in ImageNet (http://www.image-net.org).

You will explore detection and classification using deep learning further down.

Detection

Before we can recognize or classify an object in a scene we need to detect it. Traditionally this has been approached along two paths. In one strand of work, methods have been developed to find regions in the image that are likely to contain objects. These object proposals have been based on various forms of visual attention mechanisms and/or functions that measure the "objectness" of regions. The idea is to identify a few object proposals in which recognition can then be performed; since recognition only needs to run on a few proposals, it can be quite expensive, and the effort is instead spent on the proposal generation. In a second strand of work, we use brute force to generate all possible object proposals by sweeping over the image in x, y and scale. Recognition then needs to be very fast since we perform it very often.
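A rough sketch of the brute-force (sliding window) strand; classify is a hypothetical, user-provided function, and the window size, stride and scale factor are arbitrary assumptions:

```python
import cv2

def sliding_window_detect(image, classify, win=64, stride=16, scale=1.5):
    """Sweep a fixed-size window over all positions and scales (via an image pyramid)."""
    detections = []
    factor = 1.0
    while min(image.shape[:2]) >= win:
        for y in range(0, image.shape[0] - win + 1, stride):
            for x in range(0, image.shape[1] - win + 1, stride):
                if classify(image[y:y + win, x:x + win]):
                    # Map the window back to coordinates in the original image.
                    detections.append((int(x * factor), int(y * factor), int(win * factor)))
        # Shrink the image to detect larger objects with the same window size.
        image = cv2.resize(image, None, fx=1.0 / scale, fy=1.0 / scale)
        factor *= scale
    return detections
```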

Modern deep learning based techniques typically combine the proposal generation with the classification.

One of the first well-working examples of object detection was the Viola-Jones object detector, used for face detection. In the image below, the face and one of the eyes are detected.
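A hedged sketch of running OpenCV's pre-trained Viola-Jones cascade for faces (the file name and parameter values are assumptions):

```python
import cv2

# Pre-trained frontal-face cascade that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

gray = cv2.imread("portrait.png", cv2.IMREAD_GRAYSCALE)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    print("face at", (x, y), "size", (w, h))
```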

Background subtraction / foreground detection

In some cases the setup allows us to make a good model of what the background looks like. Imagine a surveillance camera mounted in a fixed position. It will be possible to describe very well what the background should look like, so that any changes can be attributed to foreground objects.
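A minimal sketch of background subtraction on video from a fixed camera (the file name is an assumption):

```python
import cv2

cap = cv2.VideoCapture("surveillance.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2()  # adaptive background model

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # White where the frame differs from the learned background model.
    foreground_mask = subtractor.apply(frame)

cap.release()
```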

Additional work on background subtraction can be found here:

https://sites.google.com/site/pbassegmenter/
http://cvprlab.uniparthenope.it/#code

Tracking

Tracking objects in the image is typically done by tracking features or image patches between frames. In cases where assumptions about the motion of the object can be made, a motion model in combination with, for example, a Kalman filter can improve the tracking. Such a model, however, typically imposes stronger limitations on the motion of the object.

As you saw in the face detection example above, you can accomplish a form of tracking by detecting the object from scratch in every frame; what is left is then to connect the detections temporally. Most tracking use cases, however, assume that we get images at a high frame rate compared to the motion of the objects, which means that we can make assumptions about where the objects will be in the next frame. Instead of performing detection in the entire image, we can then look for the object near where it was before and rely on matching rather than detection.
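A hedged sketch of this kind of frame-to-frame tracking using pyramidal Lucas-Kanade optical flow (the file names and parameter values are assumptions):

```python
import cv2

prev_gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Pick good points to track in the first frame.
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=10)

# Search for each point near its previous location in the next frame.
new_points, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, points, None)
tracked = new_points[status.ravel() == 1]  # keep only successfully tracked points
```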

 

Deep Learning

By now you cannot have missed the enormous impact that deep learning has had on many domains in recent years. Many of you are probably using deep learning already or will use it in your research.

You will look at an example of applying this in Assignment Object detection.