Module 1 - Material - Computer vision

This page gives a short introduction to image (signal) processing and computer vision. For (much) more material, there is a nice online course here:

Introduction to Computer Vision (Udacity)

 

Tools

There are a number of very useful tools for doing computer vision. Deep learning has redefined the field more or less completely in the last few years.  

We will give a number of examples of more low-level computer vision, or image processing, where we use OpenCV, which has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV is also well integrated with ROS, which is probably the most used framework for robotics research today. There is quite a large set of OpenCV tutorials with code for several of these languages. Every day, more and more of what is shown below can be done by deep learning. However, it is still helpful to use the classical methods to illustrate the problems we are trying to solve, and in some cases the deep methods are still too computationally intensive for some embedded applications.

You will investigate deep learning based methods for computer vision in the assignment Object detection.

 

Preparing your computer

To go through these demos we will assume that you have an environment as described in Course computer environment.

Before moving on, we need to do some preparation. 

Open a terminal (if you do not know how to, you should go back to the page Course computer environment)

Create a directory where we will install some Python packages, so that we do not mess up anything else using Python and so that nothing else interferes with this installation. We will use a Python virtualenv for this.

cd ~
virtualenv --system-site-packages ~/cvenv

Activate the virtualenv

source ~/cvenv/bin/activate

This should change the prompt in the terminal to be prefixed with (cvenv). Now install some further python packages in this virtualenv

pip install opencv-python
pip install opencv-contrib-python
pip install jupyter
pip install matplotlib

To deactivate the virtualenv you do

deactivate

Remember to activate it again when you want to use it, which will probably be in about 1 min :-)

Now check out some code that we will look at to illustrate some key concepts

cd ~/
git clone https://github.com/pjensfelt/wasp_as1_cv
cd ~/wasp_as1_cv

From now on we will assume that you are standing in the directory ~/wasp_as1_cv and that you have activated the cvenv virtualenv. That is, we assume that you have done

source ~/cvenv/bin/activate
cd ~/wasp_as1_cv

You will be using Jupyter notebooks for most of the material here. Let us get started.

 

Getting the image into the computer

The first thing you want to do is to get image data into the computer so that you can do things with it. In this case we will simply display the data.

In a terminal execute the following

jupyter notebook

If a browser window does not open up automatically, take a look at the output in the terminal window and point the browser to the address displayed there. Open the file cv_01_loading_and_displaying.ipynb, which means that your browser will point to

http://localhost:8888/notebooks/cv_01_loading_and_displaying.ipynb

Note that if you start more notebook servers the port (default 8888) will be different for newer instances. The page should look something like

jupyter_01_cv.png

Now use the Run button to step through the notebook. As you press Run, that cell in the notebook will be executed. First a star (*) will show up, and then that star turns into a number when the execution of that cell is done. Any output produced by the code in the cell will be displayed in the output field under the cell. This is where errors will be shown as well. If all goes well, you should see what is shown in the image below. Notice the two windows that pop up.

If you encounter an error you will most likely have to fix the error before you can move on. In this notebook the most likely error will be that the image cannot be found.
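
If you want to see what such a cell does outside the notebook, a minimal sketch with OpenCV could look roughly like this (the filename is just a placeholder, use one of the images in the repository):

import cv2

# Load an image from disk (cv2.imread returns None if the path is wrong)
img = cv2.imread("test_images/some_image.png")
if img is None:
    raise FileNotFoundError("Could not load the image, check the path")

cv2.imshow("image", img)   # open a window showing the image
cv2.waitKey(0)             # wait for a key press
cv2.destroyAllWindows()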

Q0: Can you make it load another image?

Running a python program

Now let us do the same thing in a python program in a terminal.

Run

python cv_01_loading_and_displaying.py 

Not able to run? Did you remember to activate your cvenv if you changed windows?

Getting an image directly from a camera

Now open the second notebook,

http://localhost:8888/notebooks/cv_02_capture_cam_image.ipynb

You can easily open it from within the jupyter browser window with Open under the File menu. 

If you run in a virtual machine you need to remember to make the camera visible to your virtual machine. This will not work with some built-in laptop cameras.
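
For reference, grabbing a single image from a camera with OpenCV typically looks something like the sketch below. The camera index 0 is an assumption; it may be different on your machine (and inside a virtual machine).

import cv2

cap = cv2.VideoCapture(0)    # 0 is usually the first/built-in camera (assumption)
ret, frame = cap.read()      # grab a single frame
cap.release()
if not ret:
    raise RuntimeError("Could not read an image from the camera")
cv2.imshow("camera frame", frame)
cv2.waitKey(0)
cv2.destroyAllWindows()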

Stream images from a camera

Open and run the notebook 

http://localhost:8888/notebooks/cv_03_capture_image_stream.ipynb
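
A minimal sketch of streaming images in a loop is shown below (again, the camera index 0 is an assumption); this is roughly the pattern used when displaying and checking a key once per frame.

import cv2

cap = cv2.VideoCapture(0)             # camera index 0 is an assumption
while True:
    ret, frame = cap.read()
    if not ret:                       # no more frames / camera error
        break
    cv2.imshow("stream", frame)
    if cv2.waitKey(1) & 0xFF == 27:   # ESC quits the loop
        break
cap.release()
cv2.destroyAllWindows()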

Image processing / Filtering

We are now ready to start processing the image

Color space

You have probably heard of RGB, but did you know that there are 100+ different, so called, color spaces? That is, different ways to represent color information.

In the RGB color space, the color of each pixel is represented by the three channels red, green and blue. HSV is an example of another color space. It stands for Hue, Saturation and Value. Here Hue corresponds roughly to the "color", which in some cases allows for simpler separation of colors. The figure below illustrates the HSV color space. Notice how it looks like a cone, where Hue can be thought of as an angle that varies continuously across the color spectrum. Some cameras send the image information in YUV, and so on.

HSV_cone.png


Run the notebook

http://localhost:8888/notebooks/cv_04_color_spaces.ipynb

to investigate RGB and HSV color spaces a bit further. You will be able to look at images like the ones below.

 

Q1: How do you convert from RGB to HSV?
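
As a hint, in OpenCV the conversion is a single call to cv2.cvtColor. Note that OpenCV loads images in BGR order, so the sketch below converts from BGR rather than RGB (the filename is a placeholder):

import cv2

img = cv2.imread("test_images/some_image.png")   # placeholder path
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)        # OpenCV images are BGR, not RGB
h, s, v = cv2.split(hsv)                          # the individual H, S and V channels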

Gradients and Edges

Images are nothing but 2D signals, and just like you look at derivatives/gradients of time series of scalar signals, you can do so spatially in the image. This will give information about how the intensity changes in the image. Large gradients correspond to quick changes, and these parts of the image typically contain important information. For example, the border between objects is typically defined by an edge, marking the change from one color/intensity to another.

Run the notebook

http://localhost:8888/notebooks/cv_05_edge_detection.ipynb

to get a demo of the Canny edge detector.

Investigate the two parameters of the detector by running this notebook

http://localhost:8888/notebooks/cv_06_edge_detection_slider.ipynb

Note: ESC or SPACE closes the windows.
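
For reference, calling the Canny detector yourself is a one-liner; the two numbers are the thresholds that the slider notebook lets you vary (the filename is a placeholder and the threshold values are just a starting guess):

import cv2

img = cv2.imread("test_images/some_image.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
edges = cv2.Canny(img, 100, 200)   # lower and upper threshold for the hysteresis step
cv2.imshow("edges", edges)
cv2.waitKey(0)
cv2.destroyAllWindows()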

The advent of deep learning based methods may have reduced the number of times that you perform edge detection explicitly, but if you look at what a network is doing, you find that it often learns to look for gradient/edge information at the low levels of abstraction. The image below shows features from the first layers of a deep net (from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf).

LowLevelFeatures.png

 

Blurring, de-blurring and scale space

If you have ever used a camera you know that your images will suffer from motion blur if you move the camera, or if what you try to capture moves too fast with respect to your exposure time. Long exposure times are typically needed when there is little light, and you thus need to integrate over a longer time to get a good enough signal to noise ratio in the image sensor. This type of blur is something we typically want to get rid of. In a video, i.e. a sequence of images, you can estimate the motion and use this to compensate, to perform de-blurring (example of deblurring in video editing software).

Odd as it may seem, we often perform blurring on purpose on images. This is not only the case when we want an object of interest to pop out more in the eye of an observer of a photo, but also when we work with computer vision. Blurring is the mechanism by which we can remove the fine grained information (high spatial frequency) and let the computer focus on the more large scale structures. One often performs analysis of images in scale space, i.e., we add one dimension to the analysis of an image and in addition to looking at different positions (x,y) we also look at different scales (different amounts of blurring/smoothing).

Use notebook http://localhost:8888/notebooks/cv_07_blur.ipynb to test some blurring algorithms (box, Gaussian and Median). You can try different kernel sizes with a slider in the input image window.

jupyter_blur.png

Blurring / smoothing an image is an example of filtering the image information by removing the high frequency content. The images below show examples of how efficient median filtering (blurring) can be at removing noise. The left image is an image destroyed by noise and the right one shows the result after applying a median filter. The first example comes from this page.
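
A minimal sketch of median filtering with OpenCV is shown below; the kernel size is a parameter you will want to play with, and the path to the noisy image is an assumption about where it sits in the repository.

import cv2

noisy = cv2.imread("noisy_image_example.png")   # adjust the path to where the image is
denoised = cv2.medianBlur(noisy, 5)             # kernel size 5 is a starting guess; try 3, 5, 7, ...
cv2.imshow("noisy", noisy)
cv2.imshow("denoised", denoised)
cv2.waitKey(0)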

Q2: Can you replicate the result? Change to the image noisy_image_example.png (the left image above) and see if you can replicate the result to the right.

Another more extreme example from wikipedia can be seen below.

Medianfilterp.png 

Q2(extra): It is common to blur the image before doing edge detection. Write a program that combines the blurring with edge detection.
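
A minimal sketch of such a program, assuming a Gaussian blur followed by Canny (the filename and parameter values are placeholders to tune):

import cv2

img = cv2.imread("test_images/some_image.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
blurred = cv2.GaussianBlur(img, (5, 5), 0)   # smooth first to suppress noise and fine detail
edges = cv2.Canny(blurred, 50, 150)          # then detect edges on the smoothed image
cv2.imshow("edges after blur", edges)
cv2.waitKey(0)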

Frequency analysis

Again, just like when doing signal processing on time series data, we can look at the spatial frequency content of an image and perform analysis and processing in that domain. We can represent the image information in the frequency domain, and some filtering can be done in a simpler way there.

Let us test a frequency based method for removing motion blur. Open a terminal and make sure that you are in the wasp_as1_cv folder. Then invoke the deconvolution.py program in three different ways as shown below.

cd ~/wasp_as1_cv
python deconvolution.py --angle 135 --d 22 test_images/opencv_samples_data/licenseplate_motion.jpg
python deconvolution.py --angle 86 --d 31 test_images/opencv_samples_data/text_motion.jpg
python deconvolution.py --circle --d 19 test_images/opencv_samples_data/text_defocus.jpg

The first one should result in the images below

deconvolution.png

Segmentation and clustering

Another basic operation in image processing is image segmentation. Here the task is to separate the image into regions corresponding to, for example, different objects. As described in this wikipedia article, there are many different ways to go about this.

Take a look at the notebook http://localhost:8888/notebooks/cv_08_hsv_segmentation.ipynb for an example of HSV based segmentation. The image below is an example from that notebook.

jupyter_hsv_segmentation.png

Q3: Can you segment out the lemon in the image fruits_320_213.jpg (shown above)?
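
A rough sketch of HSV based segmentation is shown below. The hue/saturation/value thresholds are hypothetical and need tuning for the lemon, and the path to the image is an assumption.

import cv2
import numpy as np

img = cv2.imread("fruits_320_213.jpg")               # adjust the path to where the image is
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
lower = np.array([20, 100, 100])                     # hypothetical lower H, S, V bounds
upper = np.array([35, 255, 255])                     # hypothetical upper H, S, V bounds
mask = cv2.inRange(hsv, lower, upper)                # white where a pixel falls inside the range
segmented = cv2.bitwise_and(img, img, mask=mask)     # keep only the segmented pixels
cv2.imshow("mask", mask)
cv2.imshow("segmented", segmented)
cv2.waitKey(0)

Note that in OpenCV the hue channel runs from 0 to 179, not 0 to 359.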

Morphological operations/transformations

You can use morphological operations to clean up your images. The two fundamental operations are dilation and erosion. In their simplest form they operate on binary images, i.e., black and white, where we think of white as being the signal we are looking for. In this context you can think of dilation as letting each white pixel spill over to the pixels nearby. How far it spreads defines how much dilation is performed. Erosion is the reverse of this, where a pixel in the resulting image is only kept white if all pixels in the original image within some region are white.

The examples below are from http://what-when-how.com/introduction-to-video-and-image-processing/morphology-introduction-to-video-and-image-processing-part-1/

Erosion: ex_erosion.png  

Dilation: ex_dilation.png

By combining these operations we can achieve a) Removing small objects, b) Filling holes and c) Isolating objects, as illustrated in the images below.

ex_dilation_and_erosion.png
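
In OpenCV these operations are available directly; a minimal sketch on a binary image (the filename is a placeholder) could be:

import cv2
import numpy as np

mask = cv2.imread("binary_mask.png", cv2.IMREAD_GRAYSCALE)   # placeholder: a black and white image
kernel = np.ones((5, 5), np.uint8)              # structuring element; its size controls the effect
eroded  = cv2.erode(mask, kernel, iterations=1)
dilated = cv2.dilate(mask, kernel, iterations=1)
opened  = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # erosion then dilation: removes small objects
closed  = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # dilation then erosion: fills small holes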

2min video by Aaron Bobick on morphological operations

 

Features

OpenCV provides a good tutorial on understanding features. An image feature typically represents a position in the image that is significant somehow, and a good one should be easy to track and easy to compare between images in order to find the same corresponding physical point in the world.

Features are used for an abundance of tasks in computer vision: stereo matching, object tracking, object recognition, optical flow calculation, etc.

When speaking of a feature detector, one often thinks of two mechanisms: one that finds the positions of points in the image that are of interest (keypoints), and one that describes these points somehow (descriptor).

For many of the tasks where we use features it is important that the feature can be detected at the same location again if the object in the image has not changed. Many keypoint detectors look for positions in the image where the image gradients are large in both x and y. This means that the position is well defined. If the gradient is large in only one direction, i.e. we essentially have a line, the position can slide along the line without a big local change in the image. One of the first detectors, and still used, is the Harris Corner detector.

Today, various forms of deep nets are gradually replacing the use of handcrafted features. Classical features such as SIFT, SURF and ORB are still being used heavily though.

Take a look at the notebook http://localhost:8888/notebooks/cv_10_feature_detection.ipynb which should give you results like the ones below.

Note that there are a number of parameters for each feature detector in the example above.
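
As a reference for what such a detector call looks like, here is a minimal sketch using ORB (the filename is a placeholder and nfeatures is just one of the parameters you can vary):

import cv2

img = cv2.imread("test_images/some_image.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
orb = cv2.ORB_create(nfeatures=500)                       # cap on the number of keypoints
keypoints, descriptors = orb.detectAndCompute(img, None)
out = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imshow("ORB keypoints", out)
cv2.waitKey(0)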

Feature matching

Finding correspondences between features in one frame and those in another is a fundamental problem. One typically makes use of both the location of the feature and its descriptor when performing the matching. As this is something that is often performed at a low level in a system and at high frame rates, one wants it to be quick. There is therefore a trade-off between having a very powerful descriptor that we can rely on heavily, but that costs a lot of resources to compute, and using simpler descriptors, but relying on various forms of robust methods to tell inliers from outliers.

Run the notebook http://localhost:8888/notebooks/cv_11_feature_matching.ipynb to investigate feature matching between images. Below you can see an example output from matching two images. The upper image shows all matches and the second one shows only the strongest matches; you can see that the fraction of correct matches is much higher.

jupyter_feature_matching.png
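
For reference, a minimal sketch of brute force matching of ORB descriptors, keeping only the strongest matches (the filenames are placeholders):

import cv2

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # Hamming distance suits ORB's binary descriptors
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
out = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)  # draw the 30 strongest matches
cv2.imshow("matches", out)
cv2.waitKey(0)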

 

RANSAC

RANSAC, or RANdom SAmple Consensus, is one of the most frequently used algorithms in the context of matching. The basic idea is simple. We randomly draw the minimum number of points needed to estimate a model. We then check how many of the other points support this model. We repeat this a number of times and keep the model that had the biggest support in the data.

RANSAC to find line in point data

Imagine that you have a set of points. These points could, for example, be points with high gradients in an image, i.e., points that could belong to lines. Now draw 2 points randomly, calculate the equation of the line through the points, and then check how many of the other points are "close enough" to the line. Repeat this a bunch of times and then pick the line that had the best support in the data.

RANSAC to find point correspondences

Now imagine instead that we have two sets of keypoints in two images. We are looking for the best pairing between the keypoints. We randomly draw 4 point correspondences to be able to estimate the so called homography between the two images. We transform the points from one image to the other using the homography and check how many points end up close to points in the other image. Points that are close enough are considered to belong to the set of inliers. We repeat this a number of times and pick the set of matches that resulted in the largest set of inliers.

Look at the notebook http://localhost:8888/notebooks/cv_12_feature_matching_ransac.ipynb to see how we use RANSAC to clean up the matches between two images.

jupyter_feature_matching_ransac.png 
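
A minimal sketch of this use of RANSAC with OpenCV, building on ORB matches as above (the filenames and the 5 pixel inlier threshold are assumptions):

import cv2
import numpy as np

img1 = cv2.imread("image1.png", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("image2.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

# Estimate a homography with RANSAC; matches that agree with it are the inliers
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # 5.0 px reprojection threshold
inliers = [m for m, ok in zip(matches, inlier_mask.ravel()) if ok]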

Stereo and Disparity

One of the most common uses for features is to be able to match points between the left and the right image in a stereo setup. If we have the correspondence between points in the two images we can calculate the distance to the corresponding point in the world. The difference in image position between the left and the right image is called disparity. Points in the world that are far away generate a small disparity, and points close by a large disparity. The depth is inversely proportional to the disparity.

The proportionality constant that relates depth and disparity is the product of the focal length of the camera and the baseline, i.e., the distance between the cameras (assuming that one image has been translated sideways). In formulas

disparity = baseline * focal length / depth 

OpenCV provides an explanation for this here

Take the notebook http://localhost:8888/notebooks/cv_13_disparity.ipynb for a spin.

Note that you often need to specify how large a disparity you expect to find as a maximum. For example, if you have a 0.1m baseline, a focal length of 500 pixels and you expect nothing closer than 1m, you would expect disparities of 50 pixels or less. As you see from the formula above, the closer the object, the larger the disparity. The method for disparity calculation used in the notebook makes use of a sliding window and looks for the highest correlation between windows in the left and right images. The distance between the locations of the highest correlated windows gives the disparity.
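
For reference, a minimal block matching sketch with OpenCV (the filenames are placeholders for a rectified stereo pair, and the parameter values are just a starting point):

import cv2

left  = cv2.imread("left.png",  cv2.IMREAD_GRAYSCALE)   # placeholder rectified stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
# numDisparities is the largest disparity you expect (must be a multiple of 16)
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype("float32") / 16.0   # result is in 1/16 pixel units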

You can find lots more images to test on at http://vision.middlebury.edu/stereo/data/.

Today there are deep learning based methods that can infer depth from single images. If you think about it, as a human you are able to make estimates of depth even from a single printed image, simply by making use of experience and the relation between object sizes in the image. Take a look at this video for an example: Unsupervised Monocular Depth Estimation with Left-Right Consistency.

 

RGB-D

One of the challenges when working with stereo is to be able to find correspondences between the left and the right camera. This is especially troublesome in environments where there is little or no texture. Two images of a perfectly white wall provide no distinct points to match and thus we cannot estimate the distance. Ever since Microsoft released the Kinect sensor, so called RGB-D sensors have become very popular as a way to provide depth information directly from the sensor. The depth can be calculated in several ways.

One way is to use structured light. A projector projects a pattern onto the world, which a camera perceives. By finding correspondences between the known projected pattern and the recorded images we can calculate the distance. This means that we essentially bring our own texture to the world. Most such sensors struggle in outdoor environments and have trouble when several sensors are looking at the same scene, as the patterns interfere. More recently, time of flight based RGB-D sensors are becoming widely available.

The video People Detection in RGB-D Data (Kinect based people detection) gives an example of how the data from a consumer level RGB-D sensor can be used to detect and track people.

 

Camera model and calibration

As we have seen above, there are several cases where we need to have a model for the camera to be able to calculate for example distances to objects. To get familiar with the standard camera model take a look at these

Q4: If you are given the principal point, the focal length and an image coordinate, can you calculate the corresponding bearing (assume that we have a perfect pinhole camera model)?
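
A small worked sketch of the idea, with hypothetical numbers for the focal length, principal point and pixel coordinate:

import math

# Pinhole model: u = fx * X/Z + cx, so (u - cx) / fx = X/Z = tan(bearing)
fx, cx = 500.0, 320.0              # hypothetical focal length and principal point (pixels)
u = 400.0                          # hypothetical image x-coordinate (pixels)
bearing = math.atan2(u - cx, fx)   # horizontal bearing relative to the optical axis, in radians
print(math.degrees(bearing))       # roughly 9.1 degrees for these numbers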

Q5: The image below was captured with a camera that was very roughly 40cm above the table with the calibration pattern, at a resolution of 1280x720. The paper on the table is a standard A4 paper, i.e. 297x210mm. The size of the pattern in the image is roughly 865x610 pixels. What are rough estimates for the focal lengths in x and y (in pixels)? You will be asked this in a quiz so work it out. How do the estimates for fx and fy differ?



Different camera models

Some camera setups require a bit more advanced models. If you have a catadioptric or a fisheye camera (very large field of view), you probably want to consider using, for example, this calibration toolbox.

https://sites.google.com/site/scarabotix/ocamcalib-toolbox

Below is an example of a catadioptric camera setup and an example image from such a camera.

catadioptric.jpeg     football_donut.jpg

The images below give an example of a typical fisheye lens and an example image acquired with such a lens.

Fisheye lense.jpg  fisheyerecordcollector_1317908152.jpg 

Q6: Give an example of an application where a large field of view is desired and one where you do not want that.

Q7: What are pros/cons with the two camera setups above that give larger field of view?

Recognition and classification

One of the most studied areas in computer vision is that of object recognition and classification. That is, identifying instances and classes of objects. To see that this is challenging, look at the image below (from michael-felsberg-vision-20161006.pdf). We have three classes of objects (train, bottle and bicycle) and the images show a large variation in viewpoint, scale, illumination and context.

classification_challenges.png

Nowadays, classification is completely dominated by deep learning based techniques. Traditional methods typically made heavy use of bag-of-words (icvss08_az_bow.pdf) representations.

The success of deep learning methods on these tasks can be attributed, in no small part, to the generation of a very large dataset of annotated images in ImageNet (http://www.image-net.org).

You will explore detection and classification using deep learning further down.

Detection

Before we can recognize or classify an object in a scene we need to detect it. Traditionally this has been approached along two paths. In one strand of work, ways to find regions in the image that are likely to contain objects have been developed. These object proposals have been based on various forms of visual attention mechanisms and/or functions that measure the objectness of regions. The idea was to identify a few object proposals in which recognition of objects can be performed. We can then afford to run a quite expensive recognition algorithm, since we only need to do it for a few proposals; we instead spend the energy on the proposal generation. In a second strand of work, we use brute force to generate all possible object proposals in the image by sweeping over the image both in x and y and in scale. Recognition then needs to be very fast since we need to perform it very often.

Modern deep learning based techniques typically combine the proposal generation with the classification.

One of the first well working examples of object detection was the Viola-Jones object detector used for face detection.

Run the notebook http://localhost:8888/notebooks/cv_15_face_detection.ipynb and you should be able to detect faces in a live stream of images from your camera. If you have two good eyes and better lighting you should see two eyes per face.
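
A rough sketch of how this kind of detector is typically run on a camera stream with the pre-trained Haar cascade that ships with opencv-python (the camera index and the detection parameters are assumptions):

import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)                    # camera index 0 is an assumption
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw a box per face
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == 27:          # ESC quits
        break
cap.release()
cv2.destroyAllWindows()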

Background subtraction / foreground detection

In some cases the setup allows us to make a good model of what the background looks like. Imagine a surveillance camera mounted in a fixed position. It will be possible to describe very well what the background should look like, and any changes come from objects.

Let us take a look at how you can make use of this to find people moving in a scene where you have a fixed camera. Look at the notebook http://localhost:8888/notebooks/cv_16_background_subtraction.ipynb and you should see how the images below were produced.
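
For reference, a minimal background subtraction sketch using one of OpenCV's built-in background models (the video filename is a placeholder; a camera index works too if the camera is fixed):

import cv2

cap = cv2.VideoCapture("some_video.mp4")       # placeholder: video from a fixed camera
fgbg = cv2.createBackgroundSubtractorMOG2()    # learns a statistical model of the background
while True:
    ret, frame = cap.read()
    if not ret:
        break
    fgmask = fgbg.apply(frame)                 # white where the pixel differs from the background model
    cv2.imshow("foreground mask", fgmask)
    if cv2.waitKey(30) & 0xFF == 27:           # ESC quits
        break
cap.release()
cv2.destroyAllWindows()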

 

Additional work on background subtraction can be found here 

 https://sites.google.com/site/pbassegmenter/
 http://cvprlab.uniparthenope.it/index.php/download.html

Tracking

Tracking objects in the image is typically done by tracking features or image patches between frames. In cases where assumptions about the motion of the object can be made, a motion model in combination with, for example, a Kalman filter can improve the tracking. Such a model, however, typically imposes stronger limitations on the motion of the object.

As you saw in the face detection example above, you can accomplish a form of tracking by detecting the object from scratch in every frame. What is left is then to connect the detections temporally. Most tracking use cases assume that we get images at a high frame rate compared to the motion of the objects, which means that one can make assumptions about the location of the objects in the next frame. Instead of performing detection in the entire image, we can then look for the object near where it was before and make use of matching rather than detection.

 

Deep Learning

By now you cannot have missed the enormous impact that deep learning has had on many domains in the last few years. If you have somehow missed this, take a look at one of the many pages exemplifying the achievements.

Many of you are probably using deep learning already or will use it in your research. Here we will mostly consider the deep learning tools as black boxes and see what they can do for us. We will come back for more details in the next course (Autonomous Systems II). 

For a very quick introduction to deep learning, and convolutional neural networks in particular, take a look at the following two videos
Introduction to Deep Learning: What Is Deep Learning? (3:33min)
Introduction to Deep Learning: What are Convolutional Neural Networks? (4:44min)

 

For an example of how to use deep learning for image classification and how to do transfer learning take a look at https://www.tensorflow.org/tutorials/images/transfer_learning_with_hub