Description Assignment 2: Perception
Objectives
The objective of the second assignment is to learn about sensor fusion and learning-based perception. Sensor fusion is, as the name implies, the fusion of data from two or more sensors. It is used to reduce uncertainty and improve robustness. Different sensors are good at different things, and with sensor fusion one can take advantage of the strengths of each sensor.
Learning-based perception is extensively used in autonomous driving scenarios. It is used to identify and track other vehicles, to segment the different driving lanes, to read traffic signals and signs, and to classify pedestrians to name a few.
Prerequisite
First make sure that you have completed the assignment setup.
Submission
Reading material
The tasks in this assignment are designed to be mostly stand-alone, and the topics are vast, so it is almost impossible to provide a concise reading guideline.
Sensor fusion
A brief, high-level introduction to sensor fusion is provided by Wikipedia.
When working with sensor fusion we often combine a model that describes what we know about the system with sensor data. The sensor data is usually assumed to be uncertain, and this uncertainty is typically described by some statistic such as the standard deviation (covariance in the multi-dimensional case). The system is often a dynamical system, i.e., a system that can be described by a differential equation. In a computer model the differential equation is often replaced by a difference equation of the form $x_{k+1} = f(x_k, u_k) + w_k$, where
- $x_k$ is the vector of state variables (sometimes a scalar) we are trying to estimate, such as the position of a robot or the temperature in the center of a boiler,
- $u_k$ is the input signal,
- $f$ describes the system dynamics, i.e., how the future state of the system depends on the previous state and the input signal, and
- $w_k$ is the so-called process noise, expressing our uncertainty in the model.
We will get back to how to compute $u_k$ given that we want to achieve certain states. To be able to use sensor data we need to relate the sensor data to the state of the system. This is modeled by the measurement equation $y_k = h(x_k) + v_k$, where
- $y_k$ is the measurement vector and
- $v_k$ is the measurement noise, which models the uncertainty in the measurements.
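To make the notation concrete, here is a small toy sketch in Python (not part of any assignment code) that simulates a system of this form and its noisy measurements; the dynamics, noise levels, and names are made-up assumptions chosen purely for illustration.

import numpy as np

# Toy example of x_{k+1} = f(x_k, u_k) + w_k and y_k = h(x_k) + v_k.
# f is a simple integrator (position driven by a velocity input)
# and h measures the position directly. All numbers are made up.

dt = 0.1          # timestep [s]
sigma_w = 0.05    # process noise std (model uncertainty)
sigma_v = 0.5     # measurement noise std (sensor uncertainty)

def f(x, u):
    # System dynamics: new state = old state + input * timestep
    return x + u * dt

def h(x):
    # Measurement model: the sensor observes the state directly
    return x

rng = np.random.default_rng(0)
x = 0.0           # true (unknown) state
for k in range(50):
    u = 1.0                                   # constant input signal
    x = f(x, u) + rng.normal(0.0, sigma_w)    # state propagation with process noise w_k
    y = h(x) + rng.normal(0.0, sigma_v)       # noisy measurement y_k
    print(f"k={k:2d}  true x={x:6.3f}  measured y={y:6.3f}")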
If you are new to the topic, the following movies provide an easily accessible introduction. Movies 1-3 are most relevant.
The Kalman filter is used in the "GPS aided INS" exercises, in a similar way as described in movie 3 above. The Kalman filter is covered in an easily accessible way in the following movies. Movies 1-5 are most relevant for this course.
If you find the topic interesting, you could also look into the course in Sensor Fusion offered at Linköping University. The course covers both theory and practice in sensor fusion. Most of the course material, as well as some movies introducing the different concepts in the course, are available from the course homepage.
During the first 2-day meeting you got some hands-on experience with the IMU. Remember that if you did not do the exercise, then this is part of the Make Up Routines.
Learning-based perception
We will leave the details of learning-based methods to the numerous courses in WASP that teach them. In this course we will instead focus on using them, with the main aim of getting some feeling for what is possible and roughly what goes into using them.
If you have no idea about deep neural networks, we suggest that you watch, for example,
https://www.youtube.com/watch?v=QDX-1M5Nj7s
In task 2.3 you will look at object detection with YOLO. To get a bit of background, take a look at the following page, which describes an earlier version. There is no need to study the architecture of YOLOv8 itself.
https://www.jeremyjordan.me/object-detection-one-stage/
Task 2.1
In the first task we will look at fusing information from two wheel encoders to estimate the position of a mobile robot. This results in something called odometry. In this specific case, f expresses the kinematic model of the robot, and the difference equation from before, $x_{k+1} = f(x_k, u_k)$, is given by
$x_{k+1} = x_k + v \cdot \Delta t \cdot \cos(\theta)$
$y_{k+1} = y_k + v \cdot \Delta t \cdot \sin(\theta)$
$\theta_{k+1} = \theta_k + \omega \cdot \Delta t$
or in vector form $x_{k+1} = f_1(x_k, u^1_k)$, where
- $x = (x, y, \theta)^T$ is the robot pose,
- $u^1_k = (v_k, \omega_k)^T$ is the input when considering the control input to be the linear and angular velocities respectively, and
- $\Delta t$ is the timestep.
In this assignment we will work with a differential drive robot. This robot has two powered wheels that can be controlled independently. The translation speed of each wheel is given by $r \cdot \frac{d\varphi}{dt}$.
The translation and rotation speeds of the robot as a function of the wheel speeds are given by
$v = \frac{r}{2}(\dot{\varphi}_R + \dot{\varphi}_L) = \frac{r}{2}(d\varphi_R/dt + d\varphi_L/dt)$
$\omega = \frac{r}{B}(\dot{\varphi}_R - \dot{\varphi}_L) = \frac{r}{B}(d\varphi_R/dt - d\varphi_L/dt)$
with $\varphi_R$ and $\varphi_L$ being the wheel rotation angles for the right (R) and left (L) wheels.
Another way to write the equation, often more natural when working with wheel encoders, is to consider the relative distance ($D_k$) and rotation ($\Delta\theta_k$):
$x_{k+1} = x_k + D_k \cdot \cos(\theta)$
$y_{k+1} = y_k + D_k \cdot \sin(\theta)$
$\theta_{k+1} = \theta_k + \Delta\theta_k$
or in vector form $x_{k+1} = f_2(x_k, u^2_k)$, where $u^2_k = (D_k, \Delta\theta_k)^T$.
Let $\Delta E_R$ and $\Delta E_L$ denote the difference in encoder readings over a period of time $\Delta T$. If $K$ is the scaling factor to transform an encoder difference to an angle, $\Delta\varphi = K \Delta E$, then
$v = \frac{r}{2}(K\Delta E_R + K\Delta E_L)/\Delta T \;\rightarrow\; D = v\Delta T = \frac{r}{2}(K\Delta E_R + K\Delta E_L)$
$\omega = \frac{r}{B}(K\Delta E_R - K\Delta E_L)/\Delta T \;\rightarrow\; \Delta\theta = \omega\Delta T = \frac{r}{B}(K\Delta E_R - K\Delta E_L)$
The kinematic parameters for the differential drive robot configuration are:
- Ticks per revolution: 48 * 64 = 3072
- Update interval/rate: 50 ms (20 Hz)
- Wheel radius: r
- Distance between the wheels, "base": B
You need to find values for r and B for the DD2419 robot platform.
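As a concrete illustration of the equations and parameters above, here is a minimal sketch of a single odometry update from encoder tick differences. The function name and the placeholder values for r and B are assumptions for this example; your actual task is to fill in the skeleton in odometry.py, which may structure things differently.

import math

TICKS_PER_REV = 48 * 64            # 3072 encoder ticks per wheel revolution
K = 2.0 * math.pi / TICKS_PER_REV  # encoder ticks -> wheel angle [rad]
r = 0.033                          # wheel radius [m] -- placeholder, look up the real value
B = 0.16                           # wheel base [m]   -- placeholder, look up the real value

def odometry_step(x, y, theta, delta_e_left, delta_e_right):
    """One odometry update given encoder tick differences over one 50 ms interval."""
    d_phi_l = K * delta_e_left               # left wheel rotation [rad]
    d_phi_r = K * delta_e_right              # right wheel rotation [rad]
    D = (r / 2.0) * (d_phi_r + d_phi_l)      # distance travelled during the interval
    d_theta = (r / B) * (d_phi_r - d_phi_l)  # change in heading during the interval
    x += D * math.cos(theta)
    y += D * math.sin(theta)
    theta += d_theta
    return x, y, theta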
Your task is to implement simple odometry from wheel encoders using a rosbag recorded with a TurtleBot, and to show your estimated trajectory and compare it to the expected trajectory. Please look at the report template to see what you are expected to be able to answer and comment on.
Run the odometry node (which you will edit when doing the task):
ros2 run assignment_2 odometry
Run the rosbag using the course supplied launch file:
ros2 launch assignment_2 odometry.launch.py bag_dir:=<DIR>/real_robot
If you use the external SSD that we provide it would be:
ros2 launch assignment_2 odometry.launch.py bag_dir:=/home/wasp/Downloads/real_robot
As you will notice, the robot does not move in RViz; all the data is presented at the origin. Your task is to implement functioning odometry, using encoder data recorded from the robot. You will do this by filling in the missing code in the file wasp_autonomous_systems/assignment_2/assignment_2/odometry.py. We have supplied you with a code skeleton and marked a couple of places with the comment # TODO: Fill in, though you are allowed to change the file however you like.
After you have filled in your code you can again run the odometry node and the launch file:
ros2 run assignment_2 odometry
ros2 launch assignment_2 odometry.launch.py bag_dir:=<DIR>/real_robot
If you use the external SSD that we provide it would be:
ros2 launch assignment_2 odometry.launch.py bag_dir:=/home/wasp/Downloads/real_robot
NOTE: It is a good idea to start the odometry node before the launch file, which plays the rosbag, as otherwise you may miss some of the data.
Video Demonstration
Below is a video demonstrating what you should be seeing.
Task 2.2
Detect collisions with the accelerometer and trigger a change of direction for the TurtleBot in the simulator. Show the trajectory of the robot in a room.
Run the supplied collision detection launch file:
ros2 launch assignment_2 collision_detection.launch.py
You will see the TurtleBot in a new, simplified environment. The TurtleBot is on its way to becoming autonomous! However, as you will see, it needs your help. If you run the autonomous controller:
ros2 run wasp_autonomous_systems autonomous_controller --ros-args -p use_sim_time:=true
then you will see that the TurtleBot starts moving on its own! However, it can only move in a straight line and will therefore crash. Luckily, a mechanism has been implemented that will make the TurtleBot reverse and turn around when a collision has been detected. The collision detection is not finished though. It is therefore your job to finish it by editing the file wasp_autonomous_systems/assignment_2/assignment_2/collision_detection.py. We have supplied you with a code skeleton and marked a couple of places with the comment # TODO: Fill in, though you are allowed to change the file however you like.
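To give a feeling for what such a detector can look like, here is a minimal sketch of a node that flags a collision when the measured acceleration deviates strongly from gravity. The topic name '/imu', the gravity-deviation test, and the threshold value are assumptions made for illustration; the supplied skeleton in collision_detection.py defines its own structure that you should follow.

import math
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Imu

class CollisionSketch(Node):
    """Toy collision detector: a large deviation from 1 g is treated as a collision."""

    def __init__(self):
        super().__init__('collision_sketch')
        # '/imu' is an assumed topic name for this illustration.
        self.create_subscription(Imu, '/imu', self.imu_callback, 10)
        self.threshold = 3.0  # m/s^2 deviation from gravity; tune by inspecting the data in rqt

    def imu_callback(self, msg: Imu):
        a = msg.linear_acceleration
        magnitude = math.sqrt(a.x**2 + a.y**2 + a.z**2)
        if abs(magnitude - 9.81) > self.threshold:
            self.get_logger().info('Possible collision detected!')

def main():
    rclpy.init()
    rclpy.spin(CollisionSketch())

if __name__ == '__main__':
    main()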
Hopefully you noticed this when we ran the autonomous controller: --ros-args -p use_sim_time:=true. This enables us to speed up or slow down the simulation without altering how the robot behaves. Go ahead and speed up the simulation in the Webots GUI. Hopefully you will see the robot moving faster, but behaving the same as it did when running at real-time. What happens when you set 'use_sim_time' to false for the autonomous controller node (i.e., --ros-args -p use_sim_time:=false)? Why do you think this happens? You can look at the autonomous controller code at wasp_autonomous_systems/wasp_autonomous_systems/wasp_autonomous_systems/autonomous_controller.py. Why do you think it is not as necessary to set this argument for your collision detection node?
After you have filled in your code you can again run the launch file:
ros2 launch assignment_2 collision_detection.launch.py
together with your collision detection:
ros2 run assignment_2 collision_detection
and the autonomous controller
ros2 run wasp_autonomous_systems autonomous_controller
EXTRA: Look at the file wasp_autonomous_systems/wasp_autonomous_systems/wasp_autonomous_systems/autonomous_controller.py
and see if you can improve it. Maybe you can use other sensor data to prevent the collision from taking place in the first place.
HINT: Use rqt to figure out what happens with the IMU data when a collision occurs.
Video Demonstration
Below is a video demonstrating what you should be seeing.
Task 2.3
In this task we will look at using a learning-based method for perception. To be more specific, we will use a version of YOLO (You Only Look Once), which was originally presented in this CVPR 2016 paper. "Looking once" refers to performing the object detection in a single pass, where object candidates and class probabilities are regressed in one step from images. This is in contrast to two-stage methods (e.g., R-CNN), where in a first stage object proposals are found (there might be an object in this region, irrespective of class) and then, in a second stage, the objects in these regions are classified.
We will experiment with YOLOv8, which is a recent method for object detection and image segmentation. That is, it segments the image to tell exactly which pixels belong to which object. The output from YOLOv8 is object masks and class probabilities (see the sketch below). We experiment with the same model in two contexts:
- the KITTI sequence from the previous assignment, and
- a video stream from your own camera.
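To give a rough feeling for what the model returns, here is a minimal sketch using the ultralytics Python package directly on a single image, outside of ROS. The image path and the printed post-processing are assumptions for illustration only; the course programs kitti_segmentation and camera_segmentation wrap similar calls for you.

from ultralytics import YOLO

# Loads (and on first use downloads) the YOLOv8 nano segmentation model,
# as noted elsewhere on this page.
model = YOLO('yolov8n-seg.pt')

# 'some_image.png' is a placeholder path used only for this illustration.
results = model('some_image.png', device='cpu')

res = results[0]
if res.masks is not None:
    print('Detected objects:', len(res.boxes))
    for box in res.boxes:
        cls_id = int(box.cls)          # class index
        confidence = float(box.conf)   # class probability / confidence
        print(res.names[cls_id], f'{confidence:.2f}')
    print('Mask tensor shape:', tuple(res.masks.data.shape))  # one mask per detected object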
NOTE: When you run the YOLOv8 model with the program kitti_segmentation, it will download the model if it is not found in the current directory. This means that if you later run the code in another directory, it will have to download the model again.
KITTI
First run YOLOv8 on the KITTI sequence. Keep an eye on the report template, which contains the questions to answer.
Terminal 1:
rviz2 -d ~/ros2_ws/install/assignment_2/share/assignment_2/rviz/kitti_segmentation.rviz --ros-args -p use_sim_time:=true
Terminal 2:
ros2 run assignment_2 kitti_segmentation --ros-args -p use_sim_time:=true -p "device:='cpu'"
You can change "device:='cpu'"
to "device:='0'"
or "device:='1'"
, ..., if you have a supported (NVidia) GPU that you can run on. The number correspond to the GPU to run on, so if you only have a single GPU then it would be 0
. First time you run this it will download the YOLOv8n-seg model, make sure that it has finished downloading before proceeding.
Terminal 3:
ros2 launch wasp_autonomous_systems kitti_car.launch.py
Terminal 4:
ros2 bag play --read-ahead-queue-size 100 -l -r 1.0 --clock 100 <DIR>/kitti
If you use the external SSD that we provide it would be:
ros2 bag play --read-ahead-queue-size 100 -l -r 1.0 --clock 100 ~/Downloads/kitti
Below is a video demonstrating what you should be seeing.
Your own USB camera
In this second experiment we will use the stream from your own camera. We assume that you have access to a USB camera (like the one built into almost all laptops, or the one you use for your Zoom calls if that is different). You do not need it for long, so borrow one if you do not have one.
First we need to connect to the camera and make sure that we stream the data in the right format.
In terminal 1, run the usb_cam node and look at the output, which will show which pixel formats your camera supports.
The example below comes from a DELL XPS 15 laptop.
[INFO] [1694371116.778560205] [usb_cam]: Starting 'default_cam' (/dev/video0) at 640x480 via mmap (yuyv) at 30 FPS
[INFO] [1694371116.906163866] [usb_cam]: This devices supproted formats:
[INFO] [1694371116.906502453] [usb_cam]: Motion-JPEG: 1280 x 720 (30 Hz)
[INFO] [1694371116.906553468] [usb_cam]: Motion-JPEG: 960 x 540 (30 Hz)
[INFO] [1694371116.906588362] [usb_cam]: Motion-JPEG: 848 x 480 (30 Hz)
[INFO] [1694371116.906622112] [usb_cam]: Motion-JPEG: 640 x 480 (30 Hz)
[INFO] [1694371116.906655094] [usb_cam]: Motion-JPEG: 640 x 360 (30 Hz)
[INFO] [1694371116.906685552] [usb_cam]: YUYV 4:2:2: 640 x 480 (30 Hz)
[INFO] [1694371116.906716407] [usb_cam]: YUYV 4:2:2: 640 x 360 (30 Hz)
[INFO] [1694371116.906751438] [usb_cam]: YUYV 4:2:2: 424 x 240 (30 Hz)
[INFO] [1694371116.906784599] [usb_cam]: YUYV 4:2:2: 320 x 240 (30 Hz)
[INFO] [1694371116.906818547] [usb_cam]: YUYV 4:2:2: 320 x 180 (30 Hz)
[INFO] [1694371116.906851220] [usb_cam]: YUYV 4:2:2: 160 x 120 (30 Hz)
[INFO] [1694371116.906884296] [usb_cam]: Setting 'brightness' to 50
unknown control 'white_balance_temperature_auto'
Now you want to map this output to one of the possible pixel formats:
- yuyv2rgb: V4L2 capture format of YUYV, ROS image encoding of RGB8
- uyvy2rgb: V4L2 capture format of UYVY, ROS image encoding of RGB8
- mjpeg2rgb: V4L2 capture format of MJPEG, ROS image encoding of RGB8
- rgb8: V4L2 capture format and ROS image encoding format of RGB8
- yuyv: V4L2 capture format and ROS image encoding format of YUYV
- uyvy: V4L2 capture format and ROS image encoding format of UYVY
- m4202rgb8: V4L2 capture format of M420 (aka YUV420), ROS image encoding of RGB8
- mono8: V4L2 capture format and ROS image encoding format of MONO8
- mono16: V4L2 capture format and ROS image encoding format of MONO16
- y102mono8: V4L2 capture format of Y10 (aka MONO10), ROS image encoding of MONO8
Look at the output from your camera and pick one of the formats from the top of the list, but maybe not the one with the highest resolution and frequency, to reduce the computational load. In the example output above, Motion-JPEG is at the top, and this maps to mjpeg2rgb in the format list just above. In the example command below, you would instead pick the format uyvy (note the lower case!).
Terminal 1, start your camera, assuming that you have the camera that we tested last above, i.e., format uyvy, and that we want a somewhat lower resolution of 1280x720:
ros2 run usb_cam usb_cam_node_exe --ros-args -p framerate:=30.0 -p image_height:=720 -p image_width:=1280 -p pixel_format:=uyvy
Terminal 2, check if you can see an image from your camera:
ros2 run rqt_image_view rqt_image_view /image_raw
Terminal 3, run YOLOv8 segmentation:
ros2 run assignment_2 camera_segmentation --ros-args -p "device:='cpu'"
You can change "device:='cpu'"
to "device:='0'"
or "device:='1'"
, ..., if you have a supported (NVidia) GPU that you can run on. The number corresponds to the GPU to run on, so if you only have a single GPU then it would be 0
.
Terminal 4, show segmentation:
ros2 run rqt_image_view rqt_image_view /segmentation/image_raw
If you have not done so already, make sure to check what you are supposed to do with all of this :-) by looking at Assignment 2: Tasks 2.1-3.
If you want to see how we can combine the information from the semantic segmentation from the camera to mark which points in the LiDAR data come from dynamic objects, take a look at the conditionally elective task Description Elective task A.
Below is a video demonstrating what you should be seeing.
Task 2.4 (Sensor Fusion GPS aided Inertial Navigation System)
The mandatory task 2.4 is described in Part 1 of this document.
It is easiest to solve in Matlab, for which these files are available for download. However, you can also use Octave, Python, Julia, or any other tool of your choice, but you will then need to convert the provided code yourself.
The problem describes how to use sensor fusion by a Kalman filter to do positioning by combining sensor information from a GPS and an IMU (accelerometer and gyro). You will use prerecorded real world data and study the performance in a situation with GPS outage. You will get some experience of tuning a sensor fusion filter in a real situation. You will also study the performance improvement achievable by using an improved motion model and an additional speedometer sensor.
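For orientation before diving into the provided code, here is a minimal, hedged Python sketch of the Kalman filter predict/update cycle that the task builds on: accelerometer readings drive the prediction of a simple 1D position/velocity state, and GPS position measurements correct it. The model, noise values, and variable names are simplified assumptions and not the assignment's actual model.

import numpy as np

dt = 0.01                       # IMU sample period [s] (assumed)
F = np.array([[1, dt],          # state transition for [position, velocity]
              [0,  1]])
G = np.array([[0.5 * dt**2],    # how the acceleration input enters the state
              [dt]])
H = np.array([[1.0, 0.0]])      # GPS measures position only
Q = 0.1 * G @ G.T               # process noise (accelerometer uncertainty, made up)
R = np.array([[5.0]])           # GPS measurement noise variance (made up)

def predict(x, P, acc):
    """Time update driven by an accelerometer reading."""
    x = F @ x + G * acc
    P = F @ P @ F.T + Q
    return x, P

def update(x, P, gps_pos):
    """Measurement update with a GPS position (skipped during GPS outage)."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.atleast_2d(gps_pos) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

During a GPS outage only the prediction step runs, so the state covariance grows; studying and tuning that behaviour is a central part of the task.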
Optional material: To really understand the dynamical model and transformations used in the assignment you might want to have a look at
- How quaternions or rotational matrices are used to describe rotations.
- Another nice video with further explanation.
- Or alternatively this open-gl tutorial about rotations.
- An experiment showing why you might prefer quaternions. (video 2min)
but you will be able to do the exercise without looking at this.
Extra task (if you miss the deadline)
Implement a Kalman filter for Task 2.1. Run the prediction and see how the uncertainty increases as the robot moves. It is sufficient to plot the standard deviations in $x$, $y$, and $\theta$ in a graph as a function of time. How does the uncertainty change along the path?
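If you attempt this extra task, the following is a minimal, hedged sketch of how the pose covariance could be propagated in the prediction step for the odometry model $f_2$ from Task 2.1. The noise parameters and function layout are made-up assumptions; integrating this with your odometry node and plotting the standard deviations is left to you.

import math
import numpy as np

# Made-up process noise on the inputs (D, dtheta); tune to your robot.
SIGMA_D = 0.01        # [m] per step
SIGMA_DTHETA = 0.01   # [rad] per step

def ekf_predict(x, P, D, dtheta):
    """EKF prediction step for the odometry model f2(x, u) with u = (D, dtheta)."""
    px, py, theta = x

    # Mean prediction (same equations as in Task 2.1)
    x_new = np.array([px + D * math.cos(theta),
                      py + D * math.sin(theta),
                      theta + dtheta])

    # Jacobian of f2 with respect to the state (x, y, theta)
    Fx = np.array([[1.0, 0.0, -D * math.sin(theta)],
                   [0.0, 1.0,  D * math.cos(theta)],
                   [0.0, 0.0,  1.0]])

    # Jacobian of f2 with respect to the input (D, dtheta)
    Fu = np.array([[math.cos(theta), 0.0],
                   [math.sin(theta), 0.0],
                   [0.0,             1.0]])

    Qu = np.diag([SIGMA_D**2, SIGMA_DTHETA**2])   # input noise covariance

    P_new = Fx @ P @ Fx.T + Fu @ Qu @ Fu.T
    return x_new, P_new

# The standard deviations to plot over time are the square roots of the
# diagonal of P: np.sqrt(np.diag(P)).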
Submission: Include your results in the report by extending the template appropriately.
Conditionally elective tasks
- Elective task A Dynamic Object Detection
- Described here Description Elective task A
- Elective task B (GPS aided INS - with added motion constraint model)
- Described here Description Elective task B