Uploading files from a set of saved KTH Social web pages

Given a directory containing a set of saved KTH Social pages, a config.json file, and a transformed_urls.json file, the script described below extracts the URLs from the "mainContent" part of each saved KTH Social page and then runs wget to fetch each of the files.
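As a rough illustration of the extraction step, the following sketch collects the links found inside a div with class "mainContent" that contains a div with class "paragraphs". It uses only the Python standard library rather than the XPath approach of the script's own function (shown at the end of this page). The class name MainContentLinkParser and the function extract_main_content_links are hypothetical names introduced here, and the sketch assumes the saved page has properly balanced div tags.

```python
from html.parser import HTMLParser


class MainContentLinkParser(HTMLParser):
    """Collect href values of <a> tags nested in mainContent -> paragraphs divs."""

    def __init__(self):
        super().__init__()
        self.open_divs = []   # class attribute of every currently open <div>
        self.links = []       # href values found in the wanted region

    def _inside_paragraphs(self):
        # True when a "paragraphs" div is open inside an open "mainContent" div
        if "mainContent" not in self.open_divs:
            return False
        start = self.open_divs.index("mainContent")
        return "paragraphs" in self.open_divs[start + 1:]

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            self.open_divs.append(attributes.get("class") or "")
        elif tag == "a" and attributes.get("href") and self._inside_paragraphs():
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()


def extract_main_content_links(page_contents):
    """Return the list of link targets in the main content of a saved page."""
    parser = MainContentLinkParser()
    parser.feed(page_contents)
    parser.close()
    return parser.links
```

An empty result suggests that the file is not a saved KTH Social page or that it contains no links in its main content.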

The config.json file contains the access token:
{
"canvas":{
 "access_token": "8..K"
 }
}

The transformed_urls.json file can initially contain just an empty JSON object: {}
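The two files could be read along the following lines. This is only a sketch: the helper names load_json_file and load_settings are hypothetical, and treating a missing transformed_urls.json as an empty mapping is an assumption made here, not something the script is known to do.

```python
import json
from pathlib import Path


def load_json_file(path, default=None):
    """Return the parsed JSON in path, or default if the file does not exist."""
    path = Path(path)
    if not path.exists():
        if default is None:
            raise FileNotFoundError("missing required file: {}".format(path))
        return default
    with path.open("r", encoding="utf-8") as file_handle:
        return json.load(file_handle)


def load_settings(directory="."):
    """Return (access_token, transformed_urls) read from the given directory."""
    directory = Path(directory)
    config = load_json_file(directory / "config.json")
    access_token = config["canvas"]["access_token"]
    # assumption: a missing transformed_urls.json is treated like a file containing {}
    transformed_urls = load_json_file(directory / "transformed_urls.json", default={})
    return access_token, transformed_urls
```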

The mine_for_urls_in_page.py script processes an HTML file, fetches all of the KTH Social files linked from it, and saves them with the course code as a prefix. Note that the hash value/time stamp from the URL is retained as part of the final file name. This ensures that the file names are unique within the set of files for a course, since Canvas distinguishes files only by their name and does not consider which folder they are in. For example:

for i in ../*.html; do ../mine_for_urls_in_page.py 11 IK1552 Internetworking "$i"; done

will process each of the files in turn, and output some text as it works:

list_of_URLS_in_page:
https://www.kth.se/social/files/553a6ea9f276542b61f29a55/IK1550-1552-Acronyms-list-20150424.xlsx
new filename: IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx
wget_cmd: wget -O IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx https://www.kth.se/social/files/553a6ea9f276542b61f29a55/IK1550-1552-Acronyms-list-20150424.xlsx
 --2016-07-01 13:25:55-- https://www.kth.se/social/files/553a6ea9f276542b61f29a55/IK1550-1552-Acronyms-list-20150424.xlsx
Resolving www.kth.se (www.kth.se)... 130.237.28.40, 2001:6b0:1:11c2::82ed:1c28
Connecting to www.kth.se (www.kth.se)|130.237.28.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18304 (18K) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx’
IK1552-53a6ea9f2765 100%[=====================>] 17.88K --.-KB/s in 0.001s
2016-07-01 13:25:55 (24.9 MB/s) - ‘IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx’ saved [18304/18304]
...
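The naming rule visible in the output above can be sketched as a small function. The function name derive_filename is hypothetical. Note that, as in the sample output, the first character of the hash is dropped; the sketch reproduces that behavior rather than changing it.

```python
KTH_SOCIAL_FILE_PREFIX = "https://www.kth.se/social/files/"


def derive_filename(course_code, url):
    """Return the local file name for a KTH Social file URL, or None if not one."""
    if not url.startswith(KTH_SOCIAL_FILE_PREFIX):
        return None
    # skip the prefix plus one more character, which matches the sample output
    # (the leading character of the hash does not appear in the new file name)
    offset = len(KTH_SOCIAL_FILE_PREFIX) + 1
    return course_code + "-" + url[offset:].replace("/", "-")


name = derive_filename(
    "IK1552",
    "https://www.kth.se/social/files/553a6ea9f276542b61f29a55/"
    "IK1550-1552-Acronyms-list-20150424.xlsx",
)
print(name)  # IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx
```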

The directory will now contain the two JSON files and all of the downloaded files. In this case:

config.json
IK1552-50eaa7df27654063cd0998b-IK1550-Module1.pdf
IK1552-50eaa9af276542210c341a8-IK1550-Module2.pdf
IK1552-516b3c2f2765470d14e44c3-IK1550-Module3-notes.pdf
IK1552-516bf44f276542d5cecf13a-IK1550-Module4-notes.pdf
IK1552-516c58ff276540cd955d2f4-IK1550-Module5-notes.pdf
IK1552-5264925f27654746e9d30bd-IK1550-Module6-word.pdf
IK1552-5267860f276542d39ff3265-IK1550-Module7-word.pdf
IK1552-526789ff276542b209082c4-IK1550-Module8-word.pdf
IK1552-533a611f276545fa9443cfa-IK1550-Module9-word.pdf
IK1552-533a628f2765470ecb9ae8d-IK1550-Module10-word.pdf
IK1552-53a409af2765403f4851cb4-IK1550-Module11-word.pdf
IK1552-53a4612f27654068a92ca21-IK1552-Module12-word.pdf
IK1552-53a5141f276540812767983-IK1550-Module13-word.pdf
IK1552-53a5807f276541665804d13-IK1550-Module14-word.pdf
IK1552-53a6ea9f276542b61f29a55-IK1550-1552-Acronyms-list-20150424.xlsx
...
IK1552-5cdd3a6f276543faf991acc-ICT-keywords-20141203.xlsx
IK1552-6f8161ff27654133c0795b4-Module1-notes.pdf
IK1552-6f81cf1f276541c428495ff-Module2-notes.pdf
IK1552-6f81f82f276541560e279e9-Module3-notes.pdf
IK1552-6f92de2f276541f2b32f41c-Module4-notes.pdf
IK1552-6f938d3f27654353fd5cfed-Module5-notes.pdf
IK1552-6f93fd3f276542f645ad0e5-Module6-notes.pdf
IK1552-6f9633ff276544dd954b26b-Module7-notes.pdf
IK1552-6f96b15f276544f402292d6-Module8-notes.pdf
IK1552-6f9979ff276547af58d8aea-Module9-notes.pdf
IK1552-6f9f15cf2765401e8bb290d-Module10-notes.pdf
IK1552-6fcfd0cf276543c09079a93-Module11-notes.pdf
IK1552-6fd05a1f27654509d72f5d4-Module12-notes.pdf
IK1552-6fd176bf2765463049daa72-Module13-notes.pdf
IK1552-6fd1c69f27654652f45956e-Module14-notes.pdf
IK1552-6fd2c59f276546b363f8976-Simple_UDP_server-example2.c
IK1552-6fd2c81f276546ea0356a49-Simple_UDP_client-example2.c
IK1552-6fd2cd7f276547295b5c8da-udp-socket-server-example6.c
IK1552-6fd2d0df27654718945c051-udp-socket-example6.c
IK1552-70df84bf276543c19a49ad8-II2202-presenting-data-Maguire-20150920a-with-notes.pdf
IK1552-729da01f27654389f1b67e7-cyber-defense-exercise%20v%202016-04-08_print.pdf
transformed_urls.json

Now you can simply zip these files:

zip IK1552files IK1552*
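If the zip command is not available, roughly the same archive could be built with Python's zipfile module. This is a sketch under the assumption that only regular files whose names start with the course code should be included; the function name zip_course_files is hypothetical.

```python
import glob
import os
import zipfile


def zip_course_files(course_code, directory="."):
    """Create <course_code>files.zip from the files named <course_code>*."""
    archive_name = os.path.join(directory, course_code + "files.zip")
    pattern = os.path.join(glob.escape(directory), course_code + "*")
    members = sorted(
        path for path in glob.glob(pattern)
        # skip directories and the archive itself (it also matches the pattern)
        if os.path.isfile(path)
        and os.path.abspath(path) != os.path.abspath(archive_name)
    )
    with zipfile.ZipFile(archive_name, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            # store each file under its bare name, without the directory part
            archive.write(path, arcname=os.path.basename(path))
    return archive_name
```

Calling zip_course_files("IK1552") in the download directory would then write IK1552files.zip, leaving config.json and transformed_urls.json out of the archive.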

Next, create a new folder in your course files in Canvas, for example IK1552-files, then upload the zip file to this folder and extract it there (see https://guides.instructure.com/m/4152/l/41385-how-do-i-upload-zip-files-as-an-instructor). This will produce:

[Screenshot: uploaded-IK1552-pages-20160701-trimmed.png]

Running the script also produces an updated transformed_urls.json file, whose contents can be displayed with the following command:

cat transformed_urls.json | python -m json.tool

{
    "IK1552": {
        "https://www.kth.se/social/files/550eaa7df27654063cd0998b/IK1550-Module1.pdf": "IK1552-50eaa7df27654063cd0998b-IK1550-Module1.pdf",
        "https://www.kth.se/social/files/550eaa9af276542210c341a8/IK1550-Module2.pdf": "IK1552-50eaa9af276542210c341a8-IK1550-Module2.pdf",
        "https://www.kth.se/social/files/5516b3c2f2765470d14e44c3/IK1550-Module3-notes.pdf": "IK1552-516b3c2f2765470d14e44c3-IK1550-Module3-notes.pdf",
        ...
        "https://www.kth.se/social/files/570df84bf276543c19a49ad8/II2202-presenting-data-Maguire-20150920a-with-notes.pdf": "IK1552-70df84bf276543c19a49ad8-II2202-presenting-data-Maguire-20150920a-with-notes.pdf",
        "https://www.kth.se/social/files/5729da01f27654389f1b67e7/cyber-defense-exercise%20v%202016-04-08_print.pdf": "IK1552-729da01f27654389f1b67e7-cyber-defense-exercise%20v%202016-04-08_print.pdf"
    }
}
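Since this file maps each original URL to a local file name, it can also serve as a check that every download succeeded before the files are zipped. The following sketch, with the hypothetical function name missing_downloads, lists the entries whose local file is absent.

```python
import json
import os


def missing_downloads(course_code, directory=".", mapping_file="transformed_urls.json"):
    """Return {url: filename} for mapped files that are not present in directory."""
    with open(os.path.join(directory, mapping_file), "r", encoding="utf-8") as file_handle:
        transformed_urls = json.load(file_handle)
    # a course code without any entries is treated as an empty mapping
    url_to_filename = transformed_urls.get(course_code, {})
    return {
        url: filename
        for url, filename in url_to_filename.items()
        if not os.path.isfile(os.path.join(directory, filename))
    }
```

An empty dictionary means that every file recorded for the course is present.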

The main function that does the mining and download is basically as follows:

import subprocess
from io import StringIO          # assumes Python 3

from lxml import html            # assumed: html.parse() and xpath() as used below are from lxml


# Verbose_Flag and transformed_urls are globals set up elsewhere in the script
def mine_page_info_for_URLs(course_id, course_code, module_name, filename):
    global Verbose_Flag
    global transformed_urls

    # get existing HTML page; the with statement closes the file
    with open(filename, 'r') as file_handle:
        page_contents = file_handle.read()

    tree = html.parse(StringIO(page_contents), html.HTMLParser())

    # The actual content of the page can be found in the <div class="mainContent"> in the <div class="paragraphs">
    mainContent=tree.xpath('//div[@class="mainContent"]')
    if len(mainContent) == 0:
        if Verbose_Flag:
            print("No mainContent - file {} is not from KTH Social".format(filename))
        return False
    list_of_URLs=tree.xpath('//div[@class="mainContent"]//div[@class="paragraphs"]//a/@href')
    # if there are no entries for this course, asking for these URLS will generate a KeyError
    try:
        transformed_urls_for_this_course=transformed_urls[course_code]
    except KeyError:
        if Verbose_Flag:
            print("no transformed URLs for course code={}".format(course_code))
        transformed_urls_for_this_course={}
        transformed_urls[course_code]=transformed_urls_for_this_course
    print("list_of_URLS_in_page:")
    for e in list_of_URLs:
        # copy URLs from within KTH Social
        # use as the name of the file, the string after the prefix with "/" turned into "-"
        KTH_social_file_prefix="https://www.kth.se/social/files/"
        if e.startswith(KTH_social_file_prefix):
            # the +1 also skips the first character of the hash, as seen in the sample output
            filename_offset=len(KTH_social_file_prefix)+1
            filename_for_download=course_code+"-"+e[filename_offset:].replace("/", "-")
            # remember which local file name this URL was saved under
            transformed_urls[course_code][e]=filename_for_download
            wget_cmd="wget -O "+ filename_for_download + " " + e
            return_code = subprocess.call(wget_cmd, shell=True)
    return False
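The function above builds the wget command as a single string and runs it through the shell, so a URL or file name containing spaces or shell metacharacters could be misinterpreted. One possible alternative, sketched here with the hypothetical helpers build_wget_args and fetch, is to pass the arguments to subprocess.call as a list, which bypasses the shell.

```python
import subprocess


def build_wget_args(filename_for_download, url):
    """Return the wget command as an argument list (no shell quoting needed)."""
    return ["wget", "-O", filename_for_download, url]


def fetch(filename_for_download, url):
    """Run wget without a shell and return its exit code (0 means success)."""
    return subprocess.call(build_wget_args(filename_for_download, url))
```

With this form, each argument reaches wget exactly as given, and the return code can be checked to detect failed downloads.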