Notes-20160714
Since it is so simple to access the content of a page, an obvious tool is a program that extracts all of the text from each page of a course to a file. These files can then be concatenated together and fed to a spelling checker, grammar checker, word processing package, or other tool.
This tool is called extract_text_from_pages_in_course.py. You run it by simply giving the course_id as the only argument, for example:
../extract_text_from_pages_in_course.py 11
This outputs the titles of the pages as it goes. It creates files of the form extractedtext<course_id>-<tail_of_page_url>.txt, for example:
extractedtext11-aim.txt
Where the last part of the page's URL (i.e., everything to the right of the rightmost "/") was "aim".
Now you can put all of the extracted text into a single file with:
cat extractedtext11-* > combined-text.txt
Note that the start of each file is marked with a header line constructed as:
'\n<<<<<<<<<<' + course_id + '-' + tail_of_page_url + '>>>>>>>>>>\n'
Note that the line begins and ends with a newline character.
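Because this marker is distinctive, the combined file can later be split back into per-page pieces. Here is a minimal sketch of such a splitter (not part of the tool itself), assuming the marker format above:

import re

with open('combined-text.txt', encoding='utf-8') as f:
    combined = f.read()

# re.split with a capture group returns [before_first_marker, name1, text1, name2, text2, ...]
parts = re.split(r'\n<{10}(.*?)>{10}\n', combined)
for name, text in zip(parts[1::2], parts[2::2]):
    print(name, len(text))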
--------------
Note that I updated this page and the program on 2016.07.15 to use the page's URL, rather than the page title. This was to avoid needing to clean up the page title to avoid problems with filenames.
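For example, a title can contain characters such as "/" or "&" that are awkward in filenames, while the page's URL is already a safe slug (the title and slug below are made-up examples):

# hypothetical example: a page title versus the tail of the page's URL
title = 'Aims & Goals: 2016/2017'                    # '/', '&', and ':' cause filename trouble
page_url = 'aims-goals-2016-2017'                    # Canvas exposes the page's URL as a slug
tail_of_page_url = page_url[page_url.rfind("/")+1:]  # same computation as in the program
print(tail_of_page_url)                              # prints: aims-goals-2016-2017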
-------------
The key part of the code is:
# requires the requests and lxml packages; baseUrl, header, Verbose_Flag, and
# write_to_log are defined earlier in the program
import requests
from lxml import html

def extract_text_from_pages_in_course(course_id):
    list_of_all_pages=[]
    # Use the Canvas API to get the list of pages for this course
    # GET /api/v1/courses/:course_id/pages
    url = baseUrl + '%s/pages' % (course_id)
    if Verbose_Flag:
        print("url: " + url)
    r = requests.get(url, headers = header)
    if Verbose_Flag:
        write_to_log("result of getting pages: " + r.text)
    if r.status_code == requests.codes.ok:
        page_response=r.json()
    else:
        print("No pages for course_id: {}".format(course_id))
        return False
    for p_response in page_response:
        list_of_all_pages.append(p_response)
    # the following is needed when the response has been paginated
    # i.e., when the response is split into pieces - each returning only some of the list of pages
    # see "Handling Pagination" - Discussion created by tyler.clair@usu.edu on Apr 27, 2015, https://community.canvaslms.com/thread/1500
    while r.links['current']['url'] != r.links['last']['url']:
        r = requests.get(r.links['next']['url'], headers=header)
        page_response = r.json()
        for p_response in page_response:
            list_of_all_pages.append(p_response)
    for p in list_of_all_pages:
        print("{}".format(p["title"]))
        # Use the Canvas API to GET the page
        # GET /api/v1/courses/:course_id/pages/:url
        url = baseUrl + '%s/pages/%s' % (course_id, p["url"])
        if Verbose_Flag:
            print(url)
        payload={}
        r = requests.get(url, headers = header, data=payload)
        if r.status_code == requests.codes.ok:
            page_response = r.json()
            if Verbose_Flag:
                print("body: {}".format(page_response["body"]))
            # strip the HTML markup, leaving only the text content of the page
            document = html.document_fromstring(page_response["body"])
            raw_text = document.text_content()
            if Verbose_Flag:
                print("raw_text: {}".format(raw_text))
        else:
            print("No page {0} for course_id: {1}".format(p["url"], course_id))
            return False
        # see http://www.erinhengel.com/software/textatistic/
        # there is no sense processing files that do not have text in them
        if len(raw_text) > 0:
            page_url=page_response["url"]
            tail_of_page_url=page_url[page_url.rfind("/")+1:]
            try:
                with open('extractedtext'+course_id+'-'+tail_of_page_url+'.txt', "wb") as writer:
                    # put a distinct header line in so that one can concatenate all of the extracted
                    # text and feed it to Word or other systems for spelling and grammar checking
                    header_line='\n<<<<<<<<<<'+course_id+'-'+tail_of_page_url+'>>>>>>>>>>\n'
                    encoded_output=bytes(header_line, 'UTF-8')
                    writer.write(encoded_output)
                    #
                    # now output the text that came from the page
                    #
                    encoded_output=bytes(raw_text, 'UTF-8')
                    writer.write(encoded_output)
                # the with statement closes the file automatically
                continue
            except IOError as e:
                print("for filename: {0} I/O error({1}): {2}".format(tail_of_page_url, e.errno, e.strerror))
                continue
        else:
            continue
    return True
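For completeness, here is a minimal sketch of the scaffolding the function above relies on. The base URL and token handling here are assumptions for illustration; the actual script sets these up in its own way:

import sys

# hypothetical configuration - replace with your own Canvas instance and access token
access_token = 'XXXXXXXX'
baseUrl = 'https://canvas.example.com/api/v1/courses/'
header = {'Authorization': 'Bearer ' + access_token}
Verbose_Flag = False

def write_to_log(message):
    # minimal stand-in for the script's logging helper
    with open('log.txt', 'a') as log:
        log.write(message + '\n')

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: extract_text_from_pages_in_course.py course_id")
        sys.exit(1)
    extract_text_from_pages_in_course(sys.argv[1])

Note that course_id arrives as a string from sys.argv, which is why the program can concatenate it directly into the output filename.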