Reinforcement Learning Challenge
Hints:
- The challenge can be solved without any background on Deep Reinforcement Learning, but such knowledge might help.
- The actor-critic network (DDPG algorithm) is described in the downloadable notes.
- Be prepared that training will take time. You can solve other tasks while the system is training.
- You must use Google Chrome as your browser.
----------------------------------------------------------------------------------------------------------
The task is to learn how to swing up and balance the pendulum in its upright position.
Phase 1 (model):
- Using Google Chrome, enter philon-xx.control.lth.se in the address field, where xx is your team number (so team 01 uses philon-01.control.lth.se, etc.). It is possible to log in to other teams' computers, but do not do that. Let us know if you have trouble connecting.
Log in with your team's credentials. Start the server and choose the Docker image frtn75_rl.
- Use the Jupyter notebook Lab2_1_RL.ipynb, which implements an actor-critic RL network that trains a swingup on a model of the system. It is written in the Julia language.
- Be prepared that training might take some time. You are aiming for a plot where the variable theta is kept close to zero. If your network successfully swings up the pendulum most of the times you try it, you are ready for Phase 2.
- Save screenshots of your most successful attempt.
- Save your RL agent as a BSON file and download it to your own computer (a minimal saving sketch follows below). You will use it in Phase 2, available Wednesday morning.
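The notebook may already contain a cell for saving the agent; if not, a minimal sketch using the BSON.jl package could look like the following. The variable name agent and the file name are assumptions, not prescribed by the lab.

    using BSON: @save, @load

    # Assumption: the trained agent is stored in a variable called agent
    @save "swingup_agent.bson" agent     # write the agent to a file you can download

    # Later, in Phase 2, load it back with:
    # @load "swingup_agent.bson" agent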
-------
Further illustrations:
The following figure shows typical behavior in the initial phase of the training. The pendulum angle theta = ±pi corresponds to the pendulum hanging down. The figure shows the pendulum oscillating violently for 800 seconds.
After some training it might look like this, where there are some almost successful balancing attempts.
And if you are successful, things should start to look more like this:
Hint: It is a good idea to test your final swingup agent at least 10 times, using test(agent, env), and to make sure it succeeds most of the time.
Hint: The command train(agent, n_episodes = N) will continue training your agent from the state it is in, using an additional N episodes.
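As a concrete illustration, a loop like the sketch below repeats the final evaluation and, if needed, continues training. This is only a sketch: test(agent, env) and train(agent, n_episodes = N) are the commands named in the hints above, while how each run is reported (plot or printout) is up to the notebook.

    # Sketch only: repeat the final evaluation several times.
    for i in 1:10
        println("Test run $i")
        test(agent, env)     # inspect the resulting theta plot for this run
    end

    # If the agent still fails too often, continue training from its current state:
    train(agent, n_episodes = 100)   # 100 extra episodes is just an example value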
Phase 2 (real system) - day 2 (9.00-11.00)
- The real system is connected to different computers: heron-xx.control.lth.se (where xx is your team number). Only one user can be logged in and running the real system at a time. Log in the same way as before, and start the same container (frtn75-rl). Login opens at 09.00.
- Transfer your trained agent (the BSON file) to the folder on the heron machine.
- Run the notebook Lab2_2_Experiment on the heron machine and test the RL agent you trained in Lab2_1_RL on the physical device.
- The Lab2_2_Experiment notebook will execute your agent 10 times on the real system and present the results. If we are lucky, a video stream of the real pendulum will also be available. Your score will be judged from the resulting plots of these 10 swingup trials. Count how many of the 10 swingups were successful (for a swingup trial to be judged successful, theta should be close to zero for at least half of the time period); see the sketch after this list.
- You can try several times (but see the scoring system below).
- Mail 1) the number of successful attempts and 2) the BSON file to bob@control.lth.se. After we receive the mail, we will not accept any new entries from your team.
- You will be scored both by the number of successful swingups (max 10 p) and by the time at which you mail your result. The first mailed result (with at least one successful swingup) gets 3 bonus points, the 2nd team gets 2 p, and the 3rd team gets 1 p.
- After finishing the experiment, please log out from the Jupyter notebook before closing the browser window!
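If you want to pre-check your own trials against the scoring rule above, a small helper like the sketch below could be used. It assumes you can extract the logged theta samples of one trial as a vector; the tolerance for "close to zero" (0.2 rad) is an assumed value, not specified in the instructions.

    # Hypothetical helper: judge one swingup trial from its logged theta samples.
    # A trial counts as successful if theta stays close to zero for at least
    # half of the time period.
    is_successful(theta::AbstractVector; tol = 0.2) =
        count(abs.(theta) .< tol) >= length(theta) / 2

    # Example: with theta_trials a vector of logged theta vectors (one per trial),
    # n_success = count(is_successful, theta_trials)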
---------------------------------------------------------------------------------------------------
Further illustrations:
This is a borderline failed attempt on the real system:
and this is a successful attempt (with the same agent):