Indexing a course
The process of generating an index of existing course pages consists of the following steps:
Get the programs needed from https://github.com/gqmaguirejr/Canvas-tools
The programs are
cgetall.py - to get all the pages for the course to a local directory
modules-items-in-course.py - to get information about the modules in the course
find_keyords_phrase_in_files.py - to collect the keywords and phrases
create_page_from_json.py - to create a page with the index material
ccreate.py - to insert the resulting page (not yet available in the public GitHub repository)
Create a directory to put all the pages for the course
mkdir /tmp/testdik1552
Get all the pages for the course_id 17234
./cgetall.py 17234 /tmp/testdik1552
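As a quick optional check that the download worked (this assumes cgetall.py writes one file per course page into the target directory; it is not one of the scripts above):
import os
print(len(os.listdir('/tmp/testdik1552')), 'files fetched')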
Get information about the modules in this course
./modules-items-in-course.py 17234
The above creates the file: modules-in-course-17234.json
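To verify that this metadata file parses, without assuming anything about its internal layout, one can do:
import json
with open('modules-in-course-17234.json') as f:
    modules = json.load(f)
print('top-level type:', type(modules).__name__)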
Find the keywords and phrases in the files in the indicated directory
./find_keyords_phrase_in_files.py -r /tmp/testdik1552
The above creates the file: keywords_and_phrases_testdik1552.json
The "-r" option causes the program to ignore everything after a horizontal rules <hr> or </ hr>
Create a file words-for-course-COURSE_ID.json
Initially, this file should contain only the skeleton:
{"words_to_ignore": [ "______last_word_marker______" ], "words_to_merge": { "______last_merge_marker______": [] } }
Later, add appropriate entries to this file (see the combined example below):
- For example, in words_to_ignore: "3rd method use alternative algorithms"
- For example, in words_to_merge: "(George) Guilder’s Law": ["George) Guilder’s Law", "Guilder’s Law"]
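Putting the skeleton and the example entries together, a words-for-course-17234.json could look like the following (the entries shown are just the examples above):
{"words_to_ignore": [ "3rd method use alternative algorithms", "______last_word_marker______" ],
 "words_to_merge": { "(George) Guilder’s Law": ["George) Guilder’s Law", "Guilder’s Law"],
                     "______last_merge_marker______": [] } }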
Create an index
./create_page_from_json.py 17234 keywords_and_phrases_testdik1552.json
The above creates the file: stats_for_course-17234.html
Copy (or rename) the file, so that it has a name suitable for a Canvas page
cp stats_for_course-17234.html test-page-3.html
Upload the created page and give it a title, in this case, "Test page 3"
./ccreate.py https://xxxx.instructure.com/courses/17234/pages/test-page-3 "Test page 3"
Put the page into a module in the course by going to a module, clicking the plus symbol, selecting "Page" as the type of item to add, and then selecting "Test page 3" in the scrolling list.
Note that you need to refresh the "Modules" page to be able to see the new choice of page.
Note also that if you use the ccreate.py script multiple times, the pages will have names of the form "Test page 3-i", such as "Test page 3-6".
You can delete the uploaded page via the page https://xxxx.instructure.com/courses/17234/pages : click on the three vertical dots on the right and select "Delete".
You can remove the page from the module by clicking on the three dots on the right and selecting "Remove".
Background
Basically, the process is based on creating, in a local directory, a copy of all of the HTML pages in a Canvas course, along with some metadata about the module items in the course. Once you have the files, you can find keywords and phrases in the HTML and then construct the index (or, in my case, a number of different indexes). I have split the process of finding keywords and phrases into two parts: the first works on the HTML files to find the strings in the various tags and stores them in a JSON-formatted file; the second computes the indexes. In the second program I started by splitting the text into words with a simple regular expression and then switched to using the Python NLTK package, specifically the functions nltk.sent_tokenize(string1) and nltk.word_tokenize(string2).
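As a rough sketch of those two parts, the tokenization could look like the following. This is a minimal illustration, not the actual program: the tag extraction with BeautifulSoup, the English stop-word filtering, and the file name are my assumptions about details the text only hints at.
import nltk
from bs4 import BeautifulSoup  # assumed here for pulling strings out of the tags

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
stop_words = set(nltk.corpus.stopwords.words('english'))  # English-centric, as noted below

def candidate_words(html):
    # Part 1 (simplified): collect the text strings found in the HTML tags
    strings = list(BeautifulSoup(html, 'html.parser').stripped_strings)
    # Part 2 (simplified): sentence-tokenize, then word-tokenize, drop stop words
    words = []
    for string1 in strings:
        for string2 in nltk.sent_tokenize(string1):
            words.extend(w for w in nltk.word_tokenize(string2)
                         if w.isalpha() and w.lower() not in stop_words)
    return words

with open('/tmp/testdik1552/some-page.html') as f:  # hypothetical file name
    print(sorted(set(candidate_words(f.read()))))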
Observations and Future work
Overall, the process of generating an index was useful: I found misspellings, inconsistent use of various terms and capitalization, and random characters that seemed to have been typos or poor alternative img descriptions. It also was a nice forcing function to rework some of the content.
However, it remains a work in progress. I know that there are a number of weaknesses, such as not language-tagging the entries in the final index, and there is a need to remove some additional words that probably should not be in the index. Also, this is not a general-purpose natural-language-processing program: it could make better use of the NLTK package, and it is very English-centric (it assumes the default language of the course is English, it does not pass the actual language information to the tokenization function, and it only contains stop words in English).