HW3 HT2021

Due 4 Oct 2021 by 17:00
Points 1
Submitting a file upload
Available 23 Sep 2021 at 17:00 - 17 Jan 2022 at 17:00

This assignment was locked 17 Jan 2022 at 17:00.

Homework 3: Explanation and causality

Due Mon Oct 4 at 17:00

1. Two research studies

1) A group of students is asked to write a program that solves a specified problem.

Each student may choose one of two groups:

Group A is instructed to comment their code according to a given Style Guide.
Group B does not need to comment their code at all.

When running the programs, it turns out that the best programs were from group A (the commented programs; "best" being defined in some way not relevant to this assignment).

a. Can we draw the conclusion that comments improve the code? Explain!

b. Suggest improvements to the study!

2) Another group of students is asked to write a program that solves a specific problem.

They are randomly assigned to two groups:

Group C will use Haskell (a functional programming language)
Group D will use Java

When running the resulting programs, it turns out that the fastest running programs were from group D.

a. Can we draw the conclusion that functional programming languages give slower code? Explain!

b. Suggest improvements to the study!

2. Zipf's law

The American linguist George Kingsley Zipf (1902-1950) gave his name to Zipf's law, which describes, for example, the distribution of word frequencies in a language. Typically, a few words are used often and most words are used only rarely. The most common words in English are "the" (7.14%), "of" (4.16%), and "and" (3.04%). These frequencies were calculated from a large dataset of Google Books (see the article by Peter Norvig), which contained 97,565 distinct words occurring a total of approximately 743 billion times.

Zipf's law states that if the words are ranked according to their frequency of occurrence, with the most common word getting rank 1, the next most common word rank 2, etc., then the frequency of the word of rank n is approximately ${C n^{-\alpha}}$ for some parameters ${C, \alpha > 0}$, where ${\alpha}$ is close to 1.
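
As a short worked step, taking logarithms of both sides shows why this law is usually inspected in log-log coordinates. The symbol ${f(n)}$ for the frequency of the word of rank n is introduced here only for illustration:

$$\log f(n) \approx \log C - \alpha \log n$$

Under this form, a power law appears as a straight line with slope ${-\alpha}$ and intercept ${\log C}$ when ${\log f(n)}$ is plotted against ${\log n}$.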

Zipf's law is also encountered in quite different datasets, such as the distribution of population sizes of cities, and many others.

a. Find a dataset where it is reasonable to test Zipf's law. It should involve values (e.g., frequencies, sizes, etc.) that span several orders of magnitude, and have a reasonably large number of instances. Here are some examples - finding some other dataset yourself to try out is even better:

b. Order the frequency (or size etc.) values according to rank, and illustrate their statistics in a log-log diagram of frequency vs rank. Make sure that your diagram is created to the standards of a scientific publication: for example, it should have labeled axes and a figure caption that describes the diagram, and it should be plotted in a clear and readable way.
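
The ranking step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `rank_frequency` and the toy input are assumptions made for this example, not part of the assignment.

```python
from collections import Counter

def rank_frequency(items):
    """Return (rank, count) pairs, with rank 1 for the most frequent item."""
    counts = Counter(items)
    # Sort the counts in descending order; tied counts get consecutive ranks.
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

# Toy input, purely illustrative; a real dataset would be far larger.
sample = "the cat and the dog and the bird".split()
pairs = rank_frequency(sample)
print(pairs)
```

The resulting pairs could then be passed to a plotting tool with both axes set to logarithmic scale (for instance `loglog` in matplotlib, if that is the library you choose).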

c. Use linear regression on the values in the diagram to fit a power law of the form ${C n^{-\alpha}}$ to the data. Plot the resulting line in the log-log diagram, and also give all details of your fit (i.e., the resulting coefficients). You can use any tool or library that you prefer for this; just make sure to describe your methods. Does your result agree with Zipf's law?
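
One possible way to carry out such a fit is ordinary least squares on the logarithms of rank and frequency. The sketch below uses only the standard library; the function name `fit_power_law` and the synthetic data are assumptions for illustration, and any regression tool you prefer would serve equally well.

```python
import math

def fit_power_law(ranks, freqs):
    """Least-squares line through (log n, log f); returns (C, alpha)."""
    xs = [math.log(n) for n in ranks]
    ys = [math.log(f) for f in freqs]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log f = log C - alpha * log n, so C = exp(intercept) and alpha = -slope.
    return math.exp(intercept), -slope

# Synthetic data that follows f = 100 * n^(-1) exactly (illustrative only).
ranks = list(range(1, 51))
freqs = [100.0 / n for n in ranks]
C, alpha = fit_power_law(ranks, freqs)
print(C, alpha)
```

Because the fit is done in log space, the reported coefficients are the slope and intercept of the straight line in the log-log diagram, converted back to ${C}$ and ${\alpha}$.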

d. Try to find a published scientific article that deals with your dataset, and compare its results to yours (preferably a peer-reviewed publication, but a recent publication in a preprint archive such as arXiv may also be used). Do not modify your own results; a comparison is sufficient. If you cannot find any published research, which could happen, e.g., if you found a dataset on your own, then mention briefly how you did the search.

e. What type of scientific statement is Zipf's law - is it a hypothesis, a conjecture, a mathematical theorem, or something else? Discuss and justify your answer briefly.
