Module 1 - Material - Computer vision
Below you will be guided through some material on computer vision. A good starting point to get an overview is to scan Michael Felsberg's slides on sensing, perception and computer vision from the last course round of the WASP autonomous systems course. You will get a live presentation by Felsberg at the 2-day session.
- michael-felsberg-overview-20161005.pdf
- michael-felsberg-vision-20161006.pdf
For (much) more material there is also this nice online course
Introduction to Computer Vision (Udacity)
Tools
There are a number of very useful tools for doing computer vision. Those of you who have already attended a course on computer vision have probably used Matlab, which has a rich set of computer vision and image processing tools.
Another popular tool is Pillow, which is a continuation of PIL, the Python Imaging Library. You can find a tutorial for Pillow here.
Finally, there is OpenCV, which has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. OpenCV is also well integrated with ROS, which is probably the most used framework for robotics research today. OpenCV is currently maintained in two branches, version 2 and version 3. There is quite a large set of OpenCV tutorials with code in C++, Python and Java.
We will use OpenCV here as it has the richest set of features and naturally integrates with ROS. Buckle up and get ready!
Preparing your computer
Before moving on, we need to do some preparation.
Open a terminal (if you do not know how to, you should go back to the page Course computer environment)
Create a directory where we will install some Python packages, so that we do not mess up anything else that uses Python and so that other installations do not influence this one. We will use a Python virtualenv for this.
cd ~
virtualenv --system-site-packages ~/cvenv
Activate the virtualenv
source ~/cvenv/bin/activate
This should change the prompt in the terminal to be prefixed with (cvenv). Now install some further python packages in this virtualenv
pip install opencv-python
pip install opencv-contrib-python
pip install jupyter
pip install matplotlib
To deactivate the virtualenv you do
deactivate
Remember to activate it again when you want to use it, which will probably be in about a minute :-)
Now check out some code that we will look at to illustrate some key concepts
cd ~/
git clone https://github.com/pjensfelt/wasp_as1_cv
cd ~/wasp_as1_cv
From now on we will assume that you are standing in the directory ~/wasp_as1_cv and that you have activated the cvenv virtualenv. That is, we assume that you have done
source ~/cvenv/bin/activate
cd ~/wasp_as1_cv
You will be using Jupyter notebooks for most of the material here. Let us get started!
Getting the image into the computer
The first thing you want to do is to get image data into the computer so that you can do things with it. In this case we will simply display the data.
In a terminal execute the following
jupyter notebook cv_01_loading_and_displaying.ipynb
If a browser window does not open up automatically, take a look at the output in the terminal window and guide the browser to the address displayed there, typically http://localhost:8888/notebooks/cv_01_loading_and_displaying.ipynb
Note that if you start more notebooks, the port (default 8888) will be different for each new instance. The page should look something like this:
Now use the Run button to step through the notebook. When you press Run, the current cell in the notebook is executed. First a star (*) shows up, and when the execution of that cell is done the star turns into a number. Any output produced by the code in the cell is displayed in the output field under the cell; this is where errors are shown as well. If all goes well, you should see what is shown in the image below. Notice the two windows that pop up.
If you encounter an error you will most likely have to fix the error before you can move on. In this notebook the most likely error will be that the image cannot be found.
Q0: Can you make it load another image?
Running a python program
Now let us do the same thing in a python program in a terminal.
Run
python cv_01_loading_and_displaying.py
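For reference, the core of such a script is only a few lines of OpenCV. The sketch below is a minimal example of loading and displaying an image, not the exact contents of cv_01_loading_and_displaying.py, and the image path is just an assumption.

import cv2

# Load an image from disk (the file name is an assumption, adjust to an image in the repository)
img = cv2.imread("fruits_320_213.jpg")
if img is None:
    raise IOError("Could not load the image, check the path")

cv2.imshow("Image", img)   # show the image in a window
cv2.waitKey(0)             # wait for a key press
cv2.destroyAllWindows()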
Getting an image directly from a camera
Now open the second notebook,
http://localhost:8888/notebooks/cv_02_capture_cam_image.ipynb
You can easily do that from within the Jupyter browser window via Open under the File menu. If you already closed it, start it from the command line like you did with the first notebook.
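The essential OpenCV calls for grabbing a single frame from a webcam look roughly like the sketch below. This is a hedged outline rather than the exact notebook code, and device id 0 is an assumption.

import cv2

cap = cv2.VideoCapture(0)   # 0 is an assumed device id, adjust to your camera
if not cap.isOpened():
    raise IOError("Could not open the camera")

ret, frame = cap.read()     # grab a single frame
cap.release()

if ret:
    cv2.imshow("Captured frame", frame)
    cv2.waitKey(0)
    cv2.destroyAllWindows()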
Stream images from a camera
Open and run the notebook
http://localhost:8888/notebooks/cv_03_capture_image_stream.ipynb
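Streaming differs from single-frame capture mainly in that capture and display are done in a loop. Below is a rough sketch (an assumption about how the notebook does it, with device id 0 assumed).

import cv2

cap = cv2.VideoCapture(0)   # assumed device id
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow("Stream", frame)
    if cv2.waitKey(1) & 0xFF == 27:   # press ESC to stop the stream
        break
cap.release()
cv2.destroyAllWindows()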
Image processing / Filtering
We are now ready to start processing the image
Color space
You have probably heard of RGB, but did you know that there are 100+ different so-called color spaces? That is, different ways to represent color information.
In the RGB color space, the color of each pixel is represented by the three channels red, green and blue. HSV is an example of another color space. It stands for Hue, Saturation and Value. Here the hue corresponds roughly to the "color", which in some cases allows for simpler separation of colors. The figure below illustrates the HSV color space. Notice how it looks like a cone, where hue can be thought of as an angle that varies continuously across the color spectrum. Some cameras send the image information in yet another color space, YUV, and so on.
Run the notebook
http://localhost:8888/notebooks/cv_04_color_spaces.ipynb
to investigate RGB and HSV color spaces a bit further. You will be able to look at images like the ones below.
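In OpenCV, converting between color spaces is a single call to cv2.cvtColor. Note that cv2.imread loads images in BGR order rather than RGB. A small sketch (the image file name is an assumption):

import cv2

img_bgr = cv2.imread("fruits_320_213.jpg")            # OpenCV loads images as BGR
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)    # BGR -> RGB (e.g. for matplotlib)
img_hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)    # BGR -> HSV

h, s, v = cv2.split(img_hsv)                          # the individual H, S and V channels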
Q1: How do you convert from RGB to HSV?
Gradients and Edges
Images are nothing but 2D signals, and just like you look at derivatives/gradients of time series of scalar signals, you can do so spatially in an image. This gives information about how the intensity changes in the image. Large gradients correspond to quick changes, and these parts of the image typically contain important information. For example, the border between objects is typically defined by an edge, marking the change from one color/intensity to another.
Run the notebook
http://localhost:8888/notebooks/cv_05_edge_detection.ipynb
to get a demo of the Canny edge detector.
Investigate the two parameters for the detector by running this notebook
http://localhost:8888/notebooks/cv_06_edge_detection_slider.ipynb
Note: ESC or SPACE closes the windows.
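The Canny call itself takes the two thresholds that you play with in the slider notebook. A minimal sketch, where the image path and the threshold values are just examples:

import cv2

img = cv2.imread("fruits_320_213.jpg")        # assumed file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)             # lower and upper hysteresis thresholds

cv2.imshow("Edges", edges)
cv2.waitKey(0)
cv2.destroyAllWindows()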
The advent of deep learning based methods may have reduced the number of times you perform edge detection explicitly, but if you look at what a network is doing, you find that it often learns to look for gradient/edge information at the low levels of abstraction. The image below shows features from the first layers in a deep net (from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
Blurring, de-blurring and scale space
If you have ever used a camera you know that your images will suffer from motion blur if you move the camera, or if what you try to capture moves too fast relative to your exposure time. Long exposure times are typically needed when there is little light, since you then need to integrate over a longer time to get a good enough signal-to-noise ratio in the image sensor. This type of blur is something we typically want to get rid of. In a video, i.e. a sequence of images, you can estimate the motion and use this to compensate, i.e. to perform de-blurring (example of de-blurring in video editing software).
Odd as it may seem, we often blur images on purpose. This is not only the case when we want an object of interest to pop out more in the eye of an observer of a photo, but also when we work with computer vision. Blurring is the mechanism by which we can remove the fine-grained information (high spatial frequency) and let the computer focus on the larger-scale structures. One often performs analysis of images in scale space, i.e., we add one dimension to the analysis of an image and in addition to looking at different positions (x,y) we also look at different scales (different amounts of blurring/smoothing).
Use the notebook http://localhost:8888/notebooks/cv_07_blur.ipynb to test some blurring algorithms (box, Gaussian and median). You can try different kernel sizes with a slider in the input image window.
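For reference, the three filters in the notebook correspond to the following OpenCV calls. This is a hedged sketch; the file name and kernel sizes are just examples (and the kernel size must be odd for the Gaussian and median filters).

import cv2

img = cv2.imread("fruits_320_213.jpg")             # assumed file name
box_blur      = cv2.blur(img, (5, 5))              # box filter with a 5x5 kernel
gaussian_blur = cv2.GaussianBlur(img, (5, 5), 0)   # sigma derived from the kernel size when 0
median_blur   = cv2.medianBlur(img, 5)             # good at removing salt-and-pepper noise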
Blurring / smoothing an image is an example of filtering the image information by removing the high-frequency content. The images below show examples of how efficient median filtering (blurring) can be at removing noise. The left image is an image corrupted by noise and the right one shows the result after applying a median filter. The first example comes from this page.
Q2: Can you replicate the result? Change to the image noisy_image_example.png (the left image above) and see if you can replicate the result on the right.
Another, more extreme example from Wikipedia can be seen below.
Q2(extra): It is common to blur the image before doing edge detection. Write a program that combines the blurring with edge detection.
Frequency analysis
Again, just like when doing signal processing on time series data, we can look at the spatial frequency content of an image and perform analysis and processing in that domain. We can represent the image information in the frequency domain, and some filtering can be done in a simpler way there.
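To illustrate what representing an image in the frequency domain means, the sketch below computes and displays the magnitude spectrum of an image using numpy's FFT. This is a generic example and not code taken from deconvolution.py; the file name is an assumption.

import cv2
import numpy as np
import matplotlib.pyplot as plt

gray = cv2.imread("fruits_320_213.jpg", cv2.IMREAD_GRAYSCALE)   # assumed file name
F = np.fft.fftshift(np.fft.fft2(gray))     # 2D FFT with low frequencies moved to the center
magnitude = 20 * np.log(np.abs(F) + 1)     # log scale for visualization

plt.subplot(1, 2, 1); plt.imshow(gray, cmap="gray"); plt.title("Image")
plt.subplot(1, 2, 2); plt.imshow(magnitude, cmap="gray"); plt.title("Magnitude spectrum")
plt.show()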
Let us test a frequency based method for removing motion blur. Open a terminal and make sure that you are in the wasp_as1_cv folder. Then invoke the deconvolution.py program in three different ways as shown below.
cd ~/wasp_as1_cv
python deconvolution.py --angle 135 --d 22 test_images/opencv_samples_data/licenseplate_motion.jpg
python deconvolution.py --angle 86 --d 31 test_images/opencv_samples_data/text_motion.jpg
python deconvolution.py --circle --d 19 test_images/opencv_samples_data/text_defocus.jpg
The first one should result in the images below
Segmentation and clustering
Another basic operation in image processing is image segmentation. Here the task is to separate the image into regions corresponding to, for example, different objects. As described in this Wikipedia article there are many different ways to go about this.
Take a look at the notebook http://localhost:8888/notebooks/cv_08_hsv_segmentation.ipynb for an example of HSV based segmentation. The image below is an example from that notebook.
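The core of HSV-based segmentation is thresholding the channels with cv2.inRange. The sketch below is a hedged example; the threshold values are made up and need to be tuned for your image.

import cv2
import numpy as np

img = cv2.imread("fruits_320_213.jpg")            # assumed file name
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

lower = np.array([20, 100, 100])                  # example lower bound (H, S, V), needs tuning
upper = np.array([35, 255, 255])                  # example upper bound
mask = cv2.inRange(hsv, lower, upper)             # white where the pixel falls inside the range

segmented = cv2.bitwise_and(img, img, mask=mask)  # keep only the pixels selected by the mask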
Q3: Can you segment out the lemon in the image fruits_320_213.jpg (shown above)?
Morphological operations/transformations
You can use morphological operations to clean up your images. The two fundamental operations are dilation and erosion. In their simplest form they operate on binary images, i.e., black and white, where we think of white as being the signal we are looking for. In this context you can think of dilation as letting each white pixel spill over to the pixels nearby. How far it spreads defines how much dilation is performed. Erosion is the reverse of this, where a pixel in the resulting image is only kept white if all pixels in the original image within some region are white.
The examples below are from http://what-when-how.com/introduction-to-video-and-image-processing/morphology-introduction-to-video-and-image-processing-part-1/
Erosion:
Dilation:
By combining these operations we can achieve a) Removing small objects, b) Filling holes and c) Isolating objects, as illustrated in the images below.
2 min video by Aaron Bobick on morphological operations
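In OpenCV the basic morphological operations are cv2.erode and cv2.dilate, and combinations such as opening and closing are available through cv2.morphologyEx. A small sketch on a binary image; the file name and the 5x5 structuring element are assumptions.

import cv2
import numpy as np

binary = cv2.imread("binary_image.png", cv2.IMREAD_GRAYSCALE)   # assumed black-and-white image
kernel = np.ones((5, 5), np.uint8)                              # structuring element

eroded  = cv2.erode(binary, kernel)                             # shrinks white regions
dilated = cv2.dilate(binary, kernel)                            # grows white regions
opened  = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)      # erosion then dilation: removes small objects
closed  = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)     # dilation then erosion: fills small holes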
Features
OpenCV provides a good tutorial on understanding features. An image feature typically represents a position in the image that is significant somehow, and a good feature should be easy to track and easy to compare between images in order to find the same corresponding physical point in the world.
Features are used for an abundance of tasks in computer vision: stereo matching, object tracking, object recognition, optical flow calculation, etc.
When speaking of a feature detector, one often thinks of two mechanisms: one that finds the positions of points in the image that are of interest (keypoints), and one that describes these points somehow (descriptor).
For many of the tasks where we use features it is important that the feature can be detected at the same location again if the object in the image has not changed. Many keypoint detectors look for positions in the image where the image gradients are large in both x and y. This means that the position is well defined. If the gradient is large in only one direction, i.e. we essentially have a line, the position can slide along the line without a big local change in the image. One of the first detectors, and one that is still used, is the Harris corner detector.
Today, various forms of deep nets are gradually replacing the use of handcrafted features. Classical features such as SIFT, SURF and ORB are still heavily used, though.
Take a look at the notebook http://localhost:8888/notebooks/cv_10_feature_detection.ipynb, which should give you results like the ones below.
Note that there are a number of parameters for each feature detector in the example above.
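As an example of what such a notebook does under the hood, detecting and drawing ORB keypoints takes only a few calls. This is a hedged sketch, not the notebook's exact code; the file name is assumed and nfeatures is one example of the parameters mentioned above.

import cv2

img = cv2.imread("fruits_320_213.jpg")        # assumed file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

orb = cv2.ORB_create(nfeatures=500)           # detector and descriptor in one object
keypoints, descriptors = orb.detectAndCompute(gray, None)

img_kp = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imshow("ORB keypoints", img_kp)
cv2.waitKey(0)
cv2.destroyAllWindows()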
Feature matching
Finding correspondences between features in one frame and those in another is a fundamental problem. One typically makes use of both the location of the feature and its descriptor when performing the matching. As this is something that is often performed at a low level in a system and at high frame rates, one wants it to be quick. There is therefore a trade-off between, on the one hand, using a very powerful descriptor that you can rely on heavily but that costs a lot of resources to compute, and, on the other hand, using simpler descriptors and relying on various forms of robust methods to tell inliers from outliers.
Run the notebook http://localhost:8888/notebooks/cv_11_feature_matching.ipynb to investigate feature matching between images. Below you can see an example output from matching two images. The upper image shows all matches and the lower one shows only the strongest matches; note that the fraction of correct matches is much higher in the latter.
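A common recipe is brute-force matching of ORB descriptors combined with a ratio test to keep only the strongest matches. The sketch below is a hedged example; the file names are assumptions and the 0.75 ratio is a typical value rather than something prescribed by the notebook.

import cv2

img1 = cv2.imread("image1.jpg", cv2.IMREAD_GRAYSCALE)   # assumed file names
img2 = cv2.imread("image2.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_HAMMING)          # Hamming distance suits binary ORB descriptors
matches = bf.knnMatch(des1, des2, k=2)        # the two best matches for each descriptor

good = [m for m, n in matches if m.distance < 0.75 * n.distance]   # ratio test
result = cv2.drawMatches(img1, kp1, img2, kp2, good, None)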
RANSAC
RANSAC, or RANdom SAmple Consensus, is one of the most frequently used algorithms in the context of matching. The basic idea is simple. We randomly draw the minimum number of points needed to estimate a model. We then check how many of the other points support this model. We repeat this a number of times and keep the model that had the biggest support in the data.
RANSAC to find a line in point data
Imagine that you have a set of points. These points could, for example, be points with high gradients in an image, i.e., points that could belong to lines. Now draw 2 points randomly, calculate the equation of the line through them, and then check how many of the other points are "close enough" to the line. Repeat this a number of times and then pick the line that had the best support in the data.
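The procedure above is easy to write down in a few lines of numpy. The sketch below is a toy implementation under the stated assumptions; the number of iterations and the inlier threshold are arbitrary choices.

import numpy as np

def ransac_line(points, n_iterations=100, threshold=1.0):
    # points: Nx2 array. Returns the two sample points defining the best line and its inlier count.
    best_inliers, best_pair = 0, None
    for _ in range(n_iterations):
        i, j = np.random.choice(len(points), 2, replace=False)   # draw 2 points at random
        p1, p2 = points[i], points[j]
        d = p2 - p1
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue
        # perpendicular distance from every point to the line through p1 and p2
        dist = np.abs(d[0] * (points[:, 1] - p1[1]) - d[1] * (points[:, 0] - p1[0])) / norm
        inliers = np.sum(dist < threshold)                       # count points that are "close enough"
        if inliers > best_inliers:
            best_inliers, best_pair = inliers, (p1, p2)
    return best_pair, best_inliers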
RANSAC to find point correspondences
Now imagine instead that we have two sets of keypoints in two images. We want to find the best pairing between the keypoints. We randomly draw 4 point correspondences, the minimum needed to estimate the so-called homography between the two images. We transform the points from one image to the other using the homography and check how many points end up close to points in the other image. Points that are close enough are considered to belong to the set of inliers. We repeat this a number of times and pick the set of matches that resulted in the largest set of inliers.
Look at the notebook http://localhost:8888/notebooks/cv_12_feature_matching_ransac.ipynb to see how we use RANSAC to clean up the matches between two images.
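In OpenCV this whole procedure is wrapped in cv2.findHomography. Given the good matches from the feature matching sketch earlier, a hedged example looks like this (the 5.0 pixel reprojection threshold is just a typical value):

import cv2
import numpy as np

# kp1, kp2 and good are assumed to come from the feature matching sketch above
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

H, inlier_mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
inlier_matches = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print("Kept %d of %d matches as inliers" % (len(inlier_matches), len(good)))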
Stereo and Disparity
One of the most common uses for features is to be able to match points between the left and the right image in a stereo setup. If we have the correspondence between points in the two images we can calculate the distance to the corresponding point in the world. The difference in image position between the left and the right image is called disparity. Points in the world that are far away generate a small disparity, and points close by a large disparity. The depth is inversely proportional to the disparity.
The proportionality constant that relates depth and disparity is the product of the focal length of the camera and the baseline, i.e., the distance between the cameras (assuming a rectified setup where one camera is translated sideways relative to the other). In formulas:
disparity = baseline * focal length / depth
OpenCV provides an explanation for this here.
Take the notebook http://localhost:8888/notebooks/cv_13_disparity.ipynb for a spin.
Note that you often need to specify how large a disparity you expect to find at most. For example, if you have a 0.1 m baseline, a focal length of 500 pixels and you expect nothing closer than 1 m, you would expect disparities of 50 pixels or less. As you see from the formula above, the closer the object the larger the disparity. The method for disparity calculation used in the notebook makes use of a sliding window and looks for the highest correlation between windows in the left and right images. The distance between the locations of the highest correlated windows gives the disparity.
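The block-matching approach described above is available in OpenCV as StereoBM. Below is a hedged sketch; the file names are assumptions, numDisparities must be a multiple of 16, and both parameters need tuning to your setup.

import cv2

left  = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # assumed rectified stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)                 # fixed-point disparities (scaled by 16)

disp_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
cv2.imshow("Disparity", disp_vis)
cv2.waitKey(0)
cv2.destroyAllWindows()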
You can find lots more images to test on at http://vision.middlebury.edu/stereo/data/.
Today there are deep learning based methods that can infer depth from single images. If you think about it, as a human you are able to make estimates of depth even from a single printed image, simply by making use of experience and the relation between object sizes in the image. Take a look at this video for an example: Unsupervised Monocular Depth Estimation with Left-Right Consistency.
RGB-D
One of the challenges when working with stereo is to be able to find correspondences between the left and the right camera. This is especially troublesome in environments where there is little or no texture. Two images of a perfectly white wall provide no distinct points to match and thus we cannot estimate the distance. Ever since Microsoft released the Kinect sensor, so-called RGB-D sensors have become very popular as a way to provide depth information directly from the sensor. The depth can be calculated in several ways.
One way is to use structured light. A projector projects a pattern onto the world which a camera perceives. By finding correspondences between the known projected pattern and the recorded images we can calculate the distance. This means that we essentially bring our own texture to the world. Most such sensors struggle in outdoor environments and have trouble when several sensors are looking at the same scene, as the patterns interfere. More recently, time-of-flight based RGB-D sensors have become widely available.
The video People Detection in RGB-D Data (Kinect based people detection) gives an example of how the data from a consumer level RGB-D sensor can be used to detect and track people.
Camera model and calibration
As we have seen above, there are several cases where we need a model of the camera to be able to calculate, for example, distances to objects. To get familiar with the standard camera model, take a look at these:
- A camera model (3:00 min) from Udacity
- A camera model perspective (1:05 min) from Udacity
- The slides rob2-08-camera-calibration.pdf, including a description of the standard camera model
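To make the standard pinhole model concrete, the small sketch below projects a 3D point given in the camera frame onto the image plane using the focal lengths and the principal point. It assumes a perfect pinhole camera with no lens distortion, and the numbers in the example are made up.

def project_pinhole(X, Y, Z, fx, fy, cx, cy):
    # Pinhole camera model without lens distortion:
    # a 3D point (X, Y, Z) in the camera frame maps to pixel (u, v)
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# Example: fx = fy = 500 pixels, principal point at the center of a 640x480 image
print(project_pinhole(0.1, 0.0, 1.0, 500, 500, 320, 240))   # -> (370.0, 240.0)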
Q4: If you are given the principal point, the focal length and an image coordinate, can you calculate the corresponding bearing (assume that we have a perfect pinhole camera model)?
Q5: The image below was captured with a camera that was very roughly 40 cm above the table with the calibration pattern and with a resolution of 1280x720. The paper on the table is a standard A4 paper, i.e. 297x210 mm. The size of the pattern in the image is roughly 865x610 pixels. What are rough estimates for the focal lengths in x and y (in pixels)? You will be asked this in a quiz so work it out. How do the estimates for fx and fy differ?
Now open up the ROS tutorial for camera calibration that you can reach from this page: http://wiki.ros.org/camera_calibration
We will start by calibrating your single USB webcam, so go to the monocular calibration tutorial. Use this calibration pattern and print it on an A4 paper (or even better A3 if you have that).
Note that the size of the pattern that you are asked to provide to the calibration program is one less in each dimension than the number of squares, since the program looks for the corners between squares. That is, the pattern that you printed has 9 x 7 squares, which means that the size should be 8x6. The "--square" parameter expects the size of a square in meters; the larger the squares, the better typically. Also note that you need to adapt the camera name to our setup (e.g. cv_camera). A suitable calibration command might then look like
rosrun camera_calibration cameracalibrator.py --size 8x6 --square 0.0281 image:=/cv_camera/image_raw camera:=/cv_camera
Ideally you would want a calibration pattern that is mounted on a very stiff board. As an alternative you can place the pattern on a flat surface and instead of moving the pattern as described in the tutorial move the camera.
If you collect a lot of images (you can see how many in the window where you started the calibration), it could take several minutes once you press CALIBRATE.
The images that were used for the calibration are saved in a file in the /tmp folder. You can see this in the terminal where the calibration program was started.
Q6: What is the focal length of your camera? What about the principal point? What do these parameters correspond to physically?
Q7: What size of the image was the calibration program using?
Q8: Save an image of the calibration pattern like in the image above and make a rough estimate of the distance between the camera and the pattern. Repeat the calculations from the question about the focal length above. How close is your estimate to the values of the focal length that you got from the calibration? If they are very far off, could it be that the image sizes were different? What happens to the focal length (in pixels) when you change the image size?
Q9: The calibration routine gives a focal length in pixels. What do you need to know to calculate the focal length in mm instead? Why not give it in mm always?
Now that you have calibrated the camera, we can take a look at a rectified version of the image. That is, an image where we have used our model of the distortion and compensated for it. In one terminal run this
rosparam set cv_camera/device_id 0
rosrun cv_camera cv_camera_node
to start publishing images from your camera. Remember to set the device id for the camera you want to use. See Course computer environment for more information. In a second terminal run this, which will perform the actual rectification.
ROS_NAMESPACE=cv_camera rosrun image_proc image_proc
This will publish a bunch of new topics. Take a look at the rectified image using
rosrun image_view image_view image:=/cv_camera/image_rect
Q10: Do you see any difference? Look at how straight lines differ for example.
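If you want to do the same undistortion in Python instead of through image_proc, OpenCV provides cv2.undistort. The sketch below is a hedged example where the camera matrix and distortion coefficients are placeholders for the values from your own calibration.

import cv2
import numpy as np

# Replace these with your calibration results (fx, fy, cx, cy and distortion coefficients)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.array([-0.2, 0.05, 0.0, 0.0, 0.0])   # k1, k2, p1, p2, k3 (placeholder values)

img = cv2.imread("calibration_image.png")      # assumed file name
rectified = cv2.undistort(img, K, dist)
cv2.imshow("Rectified", rectified)
cv2.waitKey(0)
cv2.destroyAllWindows()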
Different camera models
Some camera setups require somewhat more advanced models. If you have a catadioptric or a fisheye camera (very large field of view), you probably want to consider using, for example, this calibration toolbox:
https://sites.google.com/site/scarabotix/ocamcalib-toolbox
Below is an example of a catadioptric camera setup and an example image from such a camera.
The images below give an example of a typical fisheye lens and an example image acquired with such a lens.
Q11: Give an example of an application where a large field of view is desired and one where you do not want that.
Q12: What are the pros/cons of the two camera setups above that give a larger field of view?
Recognition and classification
One of the most studied areas in computer vision is that of object recognition and classification. That is, identifying instances and classes of objects. To see that this is challenging, look at the image below (from michael-felsberg-vision-20161006.pdf). We have three classes of objects (train, bottle and bicycle) and the images show a large variation in viewpoint, scale, illumination and context.
Nowadays, classification is completely dominated by deep learning based techniques. Traditional methods typically made heavy use of bag-of-words (icvss08_az_bow.pdf) representations.
The success of deep learning methods on these tasks can be attributed, in no small part, to the generation of a very large dataset of annotated images in ImageNet (http://www.image-net.org).
You will explore classification using deep learning in the next section of this module in some depth.
Detection
Before we can recognize or classify an object in a scene we need to detect it. Traditionally this has been approached along two paths. In one strand of work, ways to find regions in the image that are likely to contain objects have been developed. These object proposals have been based on various forms of visual attention mechanisms and/or functions that measure the objectness of regions. The idea was to identify a few object proposals in which recognition of objects can then be performed. We can then afford to run a quite expensive recognition algorithm, since we only need to do it for a few proposals; the energy is instead spent on the proposal generation. In a second strand of work we use brute force to generate all possible object proposals in the image by sweeping over the image both in x and y and in scale. Recognition then needs to be very fast since we need to perform it very often.
Modern deep learning based techniques typically combine the proposal generation with the classification.
One of the first well-working examples of object detection was the Viola-Jones object detector, used for face detection.
Run the notebook http://localhost:8888/notebooks/cv_15_face_detection.ipynb and you should be able to detect faces in a live stream of images from your camera. If you have two good eyes and good lighting, you should see two eyes per face.
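The classical way to do this in OpenCV is with the pre-trained Haar cascades (the Viola-Jones approach), which is most likely what the notebook builds on. A hedged sketch of the essential calls on a single frame; the device id and detection parameters are assumptions that may need tuning.

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)   # assumed device id
ret, frame = cap.read()
cap.release()

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)   # draw a box around each face

cv2.imshow("Faces", frame)
cv2.waitKey(0)
cv2.destroyAllWindows()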
Background subtraction / foreground detection
In some cases the setup allows us to make a good model of what the background looks like. Imagine a surveillance camera mounted in a fixed position. It will be possible to describe very well what the background should look like, and changes in the image then come from objects in the scene.
Let us take a look at how you can make use of this to find people moving in a scene where you have a fixed camera. Look at the notebook http://localhost:8888/notebooks/cv_16_background_subtraction.ipynb and you should see how the images below were produced.
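OpenCV ships with ready-made background subtractors, for example the MOG2 Gaussian-mixture model used in the sketch below. This is a hedged example and not necessarily the exact method used in the notebook; the video file name is an assumption.

import cv2

cap = cv2.VideoCapture("surveillance.mp4")     # assumed video file; a camera index works too
fgbg = cv2.createBackgroundSubtractorMOG2()    # learns a per-pixel background model over time

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    fgmask = fgbg.apply(frame)                 # white where the pixel deviates from the background
    cv2.imshow("Foreground mask", fgmask)
    if cv2.waitKey(30) & 0xFF == 27:           # press ESC to quit
        break
cap.release()
cv2.destroyAllWindows()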
Additional work on background subtraction can be found here:
https://sites.google.com/site/pbassegmenter/
http://cvprlab.uniparthenope.it/index.php/download.html
Tracking
Tracking objects in the image is typically done by tracking features or image patches between frames. In cases where assumptions about the motion of the object can be made, a motion model in combination with, for example, a Kalman filter can improve the tracking. Such a model, however, typically imposes stronger limitations on the motion of the object.
As you saw in the face detection example above, you can accomplish a form of tracking by detecting the object from scratch in every frame. What is left is then to connect the detections temporally. Most tracking use cases assume that we get images at a high frame rate compared to the motion of the objects, which means that one can make assumptions about the location of the objects in the next frame. Instead of performing detection in the entire image, we can then look for the object near where it was before and make use of matching rather than detection.
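A classical way to exploit the small frame-to-frame motion is sparse optical flow, for example the Lucas-Kanade tracker: detect good points once and then track them from frame to frame instead of re-detecting them. The sketch below is a hedged example; the device id and the parameter values are typical defaults, not tuned settings.

import cv2

cap = cv2.VideoCapture(0)   # assumed device id
ret, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)

while cap.isOpened():
    ret, frame = cap.read()
    if not ret or points is None or len(points) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_points, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    tracked = new_points[status.ravel() == 1]   # keep only the successfully tracked points
    for p in tracked:
        x, y = p.ravel()
        cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imshow("Tracked points", frame)
    if cv2.waitKey(1) & 0xFF == 27:             # press ESC to quit
        break
    prev_gray, points = gray, tracked.reshape(-1, 1, 2)
cap.release()
cv2.destroyAllWindows()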
Action recognition
Most of what we have talked about so far has been about single images or tracking objects between frames. Highly autonomous systems interacting with or operating near people will need to be able to recognize activities as well. It is not enough to tell what objects are in the scene; we also need to know what is going on. A good resource for action recognition papers can be found here.
When recognizing actions, it is important to have an idea of how the human body is moving. OpenPose can be used to detect body, hand and facial keypoints for multiple people in the same frame.
Challenges and Benchmarks
The computer vision community has been using benchmark datasets for quite some time to measure performance and compare approaches. Below is a list of some of the ongoing challenges. Studying the results from these gives a clear indication of how deep learning has impacted the computer vision field. Deep methods now define the state of the art in most areas.
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
http://www.image-net.org/challenges/LSVRC/
https://www.kaggle.com/c/imagenet-object-localization-challenge
"The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale. One high level motivation is to allow researchers to compare progress in detection across a wider variety of objects -- taking advantage of the quite expensive labeling effort. Another motivation is to measure the progress of computer vision for large scale image indexing for retrieval and annotation."
Visual Object Tracking (VOT)
http://www.votchallenge.net
"The VOT challenges provide the visual tracking community with a precisely defined and repeatable way of comparing short-term trackers as well as a common platform for discussing the evaluation and advancements made in the field of visual tracking."
A Large-Scale Video Benchmark for Human Activity Understanding (ActivityNet)
http://activity-net.org
"Our benchmark aims at covering a wide range of complex human activities that are of interest to people in their daily living. We illustrate three scenarios in which ActivityNet can be used to compare algorithms for human activity understanding: global video classification,trimmed activity classification and activity detection."