Notes-20160714

Since it is so simple to access the content of a page, an obvious tool is a program that extracts all of the text from each page to a file. These files can be concatenated together and fed to a spelling checker, grammar checker, word processing package, or other tool.

This tool is called extract_text_from_pages_in_course.py

You run the program by simply giving the course_id as the only argument, for example:

../extract_text_from_pages_in_course.py 11
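The command-line handling itself is not shown on this page; a minimal sketch of how the single course_id argument might be picked up (parse_course_id is a hypothetical name, not necessarily what the script uses):

```python
import sys

def parse_course_id(argv):
    """Return the course_id given as the only command-line argument."""
    if len(argv) != 2:
        raise SystemExit("Usage: extract_text_from_pages_in_course.py course_id")
    # keep it as a string - it is only used to build URLs and filenames
    return argv[1]
```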

The program outputs the title of each page as it goes. It creates files of the form extractedtext<course_id>-<page_url>.txt, for example:

extractedtext11-aim.txt

where the last part of the page's URL (i.e., everything to the right of the rightmost "/") was "aim".
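The filename derivation can be sketched as a small function (output_filename is a hypothetical helper; the string expressions mirror the ones in the code below):

```python
def output_filename(course_id, page_url):
    """Build the output filename from the course_id and the page's URL."""
    # keep only the part to the right of the rightmost "/";
    # if there is no "/", rfind returns -1 and the whole URL is kept
    tail_of_page_url = page_url[page_url.rfind("/") + 1:]
    return 'extractedtext' + course_id + '-' + tail_of_page_url + '.txt'
```

For example, output_filename('11', 'aim') gives 'extractedtext11-aim.txt'.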

Now you can put all of the extracted text into a single file with:

 cat extractedtext11-* > combined-text.txt

Note that the start of each file is marked with:

"\n<<<<<<<<<<" + course_id + '-' + tail_of_page_url + '>>>>>>>>>>\n'

Note that the line begins and ends with a newline character.
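Because these header lines are distinct, the combined file can be split back into per-page chunks. A sketch, under the assumption that each marker is exactly ten "<" and ten ">" characters (split_combined_text is a hypothetical helper):

```python
import re

def split_combined_text(text):
    """Map each 'course_id-page_url' header to the text that follows it."""
    parts = re.split(r'\n<{10}(.+?)>{10}\n', text)
    # re.split returns [text-before-first-header, name1, body1, name2, body2, ...]
    return dict(zip(parts[1::2], parts[2::2]))
```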

--------------

Note that I updated this page and the program on 2016.07.15 to use the page's URL rather than the page title. This avoids having to clean up the page title to make it safe for use in filenames.

-------------

The key part of the code is shown below (it relies on requests, lxml's html module, and the globals baseUrl, header, and Verbose_Flag set up elsewhere in the script):

def extract_text_from_pages_in_course(course_id):
       list_of_all_pages=[]

       # Use the Canvas API to get the list of pages for this course
       #GET /api/v1/courses/:course_id/pages

       url = baseUrl + '%s/pages' % (course_id)
       if Verbose_Flag:
              print("url: " + url)

       r = requests.get(url, headers = header)
       if Verbose_Flag:
              write_to_log("result of getting pages: " + r.text)
       if r.status_code == requests.codes.ok:
              page_response=r.json()
       else:
              print("No pages for course_id: {}".format(course_id))
              return False


       for p_response in page_response:  
              list_of_all_pages.append(p_response)

       # the following is needed when the response has been paginated
       # i.e., when the response is split into pieces - each returning only some of the list of modules
       # see "Handling Pagination" - Discussion created by tyler.clair@usu.edu on Apr 27, 2015, https://community.canvaslms.com/thread/1500
       # r.links is empty when the response is not paginated, so check for a "next" link
       while 'next' in r.links:
              r = requests.get(r.links['next']['url'], headers=header)
              page_response = r.json()  
              for p_response in page_response:  
                     list_of_all_pages.append(p_response)

       for p in list_of_all_pages:
              print("{}".format(p["title"]))
              # Use the Canvas API to GET the page
              #GET /api/v1/courses/:course_id/pages/:url

              url = baseUrl + '%s/pages/%s' % (course_id, p["url"])
              if Verbose_Flag:
                     print(url)
              payload={}
              r = requests.get(url, headers = header, data=payload)
              if r.status_code == requests.codes.ok:
                     page_response = r.json()  
                     if Verbose_Flag:
                            print("body: {}".format(page_response["body"]))

                     document = html.document_fromstring(page_response["body"])
                     raw_text = document.text_content()
                     if Verbose_Flag:
                            print("raw_text: {}".format(raw_text))
              else:
                     print("Unable to get page {0} for course_id: {1}".format(p["url"], course_id))
                     return False

              # see http://www.erinhengel.com/software/textatistic/
              # there is no sense processing files that do not have text in them
              if len(raw_text) > 0:
                     try:
                            page_url=page_response["url"]
                            tail_of_page_url=page_url[page_url.rfind("/")+1:]
                            with open('extractedtext'+course_id+'-'+tail_of_page_url+'.txt', "wb") as writer:
                                   # put a distinct header line in so that one can concatenate all of the extracted text and feed it to Word or other systems for spelling and grammar checking
                                   header_line="\n<<<<<<<<<<"+course_id+'-'+tail_of_page_url+'>>>>>>>>>>\n'
                                   encoded_output=bytes(header_line, 'UTF-8')
                                   writer.write(encoded_output)
                                   #
                                   # now output the text that came from the page
                                   #
                                   encoded_output=bytes(raw_text, 'UTF-8')
                                   writer.write(encoded_output)
                                   # note: the with statement closes the file, so no explicit close() is needed
                            continue
                     except IOError as e:
                            print("for filename: {0} I/O error({1}): {2}".format(tail_of_page_url, e.errno, e.strerror))
                            continue
              else:
                     continue
       return True
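The pagination pattern above can be factored into a small helper; a sketch that follows requests-style r.links until there is no "next" link (get_all_paginated is a hypothetical name; anything with a requests-style get() works):

```python
def get_all_paginated(session, url, headers=None):
    """Collect the JSON lists from every page of a paginated API response."""
    results = []
    while url:
        r = session.get(url, headers=headers)
        results.extend(r.json())
        # requests exposes RFC 5988 Link headers via r.links;
        # stop when there is no "next" link
        url = r.links.get('next', {}).get('url')
    return results
```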