HW3 HT2022

Due 5 Oct 2022 by 17:00
Points 1
Submitting a file upload
Available 23 Sep 2022 at 17:00 - 31 Jan 2023 at 17:00



Homework 3: Explanation and causality, scientific laws and regularities

Due Wednesday Oct 5 at 17:00

1. Two research studies

1) A group of students is asked to write a program that solves a specified problem.

Each student may choose one of two groups:

Group A is instructed to comment their code according to a given Style Guide.
Group B does not need to comment their code at all.

When running the programs, it turns out that the best programs were from group A (the commented programs; "best" being defined in some way not relevant to this assignment).

a. Can we draw the conclusion that comments improve the code? Explain!

b. Suggest improvements to the study!

2) Another group of students is asked to write a program that solves a specific problem.

They are randomly assigned to two groups:

Group C will use Haskell (a functional programming language)
Group D will use Java

When running the resulting programs, it turns out that the fastest running programs were from group D.

a. Can we draw the conclusion that functional programming languages give slower code? Explain!

b. Suggest improvements to the study!

2. Zipf's law

The American linguist George Kingsley Zipf (1902-1950) gave his name to Zipf's law, which describes, for example, the distribution of word frequencies in a language. Typically a few words are used often, and most words are used only rarely. The most common words in English are "the" (7.14%), "of" (4.16%), and "and" (3.04%). These frequencies were calculated from a large dataset of Google books (see the article by Peter Norvig), which contained 97,565 distinct words occurring a total of approximately 743 billion times.

Zipf's law states that if the words are ranked according to their frequency of occurrence, with the most common word getting rank 1, the next most common word rank 2, etc., then the frequency of the word of rank ${n}$ is approximately ${C n^{-\alpha}}$ for some parameters ${C, \alpha > 0}$, where ${\alpha}$ is close to 1.
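Taking logarithms of both sides shows why such a power law appears as a straight line in a log-log diagram. The symbol $f(n)$ for the frequency of the word of rank $n$ is introduced here only for illustration:

```latex
f(n) \approx C n^{-\alpha}
\quad\Longrightarrow\quad
\log f(n) \approx \log C - \alpha \log n
```

The line has slope $-\alpha$ and intercept $\log C$, so a slope close to $-1$ corresponds to ${\alpha}$ close to 1.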

Zipf's law is also encountered in quite different datasets, such as the distribution of population sizes of cities, and many others.

a. Find a dataset where it is reasonable to test Zipf's law. It should involve values (e.g., frequencies, sizes, etc.) that span several orders of magnitude, and have a reasonably large number of instances. Here are some examples:

+ word frequencies in English (from Peter Norvig)
+ word frequencies in Chinese
+ frequencies of family names, e.g., in Sweden
+ sizes of lunar craters
+ wealth of very rich people
+ world's largest cities
+ citation data for scientific articles or authors (several datasets)
+ any of the datasets linked on Aaron Clauset's page, such as
  + peak intensity of gamma-ray solar flares 1980-89
  + all US city populations from the 2000 census
  + terrorist attacks worldwide 1968-2006 - number of direct deaths
  + frequency of occurrence of unique words in Herman Melville's Moby Dick

Finding an interesting dataset of your own to explore is even better. There are, for example, many possibilities relating to internet statistics, from the number of followers of top Instagram or TikTok influencers (Cristiano Ronaldo (480M), Kylie Jenner (370M), and down in the first case; Khabane Lame (150M), Charli D'Amelio (147M), and down in the second case) to more technical aspects of internet traffic statistics (a recent review article, though no dataset, can be found here).

b. Order the frequency (or size, etc.) values according to rank, and illustrate their statistics in a log-log diagram of frequency vs rank. Your diagram should be created to the standards of a scientific publication, for example:
+ it should have correctly labeled axes,
+ it should have a figure caption that describes the diagram,
+ it should be plotted in a clear and readable way: large enough to be easily interpreted, with symbols for individual points that are small and precise but still easy to read (small crosses or x's, or dots of suitable size).
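The ranking and plotting steps in part b could be sketched as follows. This is a minimal illustration under stated assumptions, not a required method: the helper names `rank_frequency` and `plot_rank_frequency`, the output file name, and the toy sentence are all made up for this example, and the plotting helper assumes that matplotlib is installed.

```python
from collections import Counter


def rank_frequency(values):
    """Sort values in descending order and pair each with its rank (1 = largest)."""
    ordered = sorted(values, reverse=True)
    ranks = list(range(1, len(ordered) + 1))
    return ranks, ordered


def plot_rank_frequency(ranks, values, path="zipf_loglog.png"):
    """Draw a log-log scatter plot of value vs rank and save it to `path`."""
    # Imported here so that the ranking code above runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    # Small dots and no connecting line, so individual points stay readable.
    ax.loglog(ranks, values, linestyle="none", marker=".", markersize=4)
    ax.set_xlabel("Rank $n$")
    ax.set_ylabel("Frequency (number of occurrences)")
    fig.tight_layout()
    fig.savefig(path, dpi=300)
    plt.close(fig)


# Toy data, made up for illustration; a real dataset needs far more instances.
text = "the cat and the dog and the bird saw the cat"
counts = Counter(text.split())
ranks, freqs = rank_frequency(counts.values())
```

Calling `plot_rank_frequency(ranks, freqs)` would then write the figure to disk. The figure caption is not produced by this sketch and belongs in the report text.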

c. Use linear regression on the values in the diagram to fit a power law of the form ${C n^{-\alpha}}$ to the data (i.e., fit a linear function to the logarithmic data). Plot the resulting line in the log-log diagram, and give all details of your fit (i.e., the resulting coefficients). You can use any tool or library that you prefer for this; just make sure to describe your methods. Does your result agree with Zipf's law? (Some background on linear regression can, for example, be found here.)
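One way to carry out the fit in part c is ordinary least squares on the base-10 logarithms of rank and value. The sketch below uses only the Python standard library; the function name `fit_power_law_loglog` is hypothetical, and the data are synthetic values that follow a power law exactly, so they say nothing about any real dataset.

```python
import math


def fit_power_law_loglog(ranks, values):
    """Least-squares fit of log10(value) = log10(C) - alpha * log10(rank).

    Returns the pair (C, alpha).
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of the fitted line is -alpha, the intercept is log10(C).
    return 10 ** intercept, -slope


# Synthetic data following C * n**(-alpha) exactly, with C = 1000 and alpha = 1.
ranks = list(range(1, 101))
values = [1000.0 / n for n in ranks]
C, alpha = fit_power_law_loglog(ranks, values)
```

To restrict the fit to a chosen range of ranks, pass slices of both lists, for example `fit_power_law_loglog(ranks[4:50], values[4:50])`.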

Note that when fitting experimental data, it is a normal part of an experimental procedure to determine the range over which the data is fitted based on the researcher's best judgement. Data at either end of the range of values (such as rank) may be influenced by effects other than the scaling law (for example, data on city size is less likely to include smaller villages and settlements).

d. What type of scientific statement is Zipf's law - is it a hypothesis, a conjecture, a mathematical theorem or something else? Discuss and motivate your answer.

-----------------------------------------------------

The purpose of this exercise is to give a perspective on the nature of scientific laws and regularities through an example, but also to practice the creation and presentation of figures and diagrams, which is an important part of academic writing. Academic writing in turn is one of the most essential learning goals of the course.

An entirely voluntary exercise for those who would like to learn more about regression would be to also try an alternative method of curve fitting, which is to fit a power law of the form ${C n^{-\alpha}}$ directly to the data using some suitable library function (the result could then be plotted as a straight line in the log-log diagram), rather than first taking logarithms and fitting a straight line after that. These two methods will most likely not give the same result - why is that?
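For those who want to try the direct fit without an external library, the following is a rough standard-library stand-in for a library curve-fitting routine, not a recommended method. It minimises the sum of squared differences between the values and ${C n^{-\alpha}}$: for a fixed ${\alpha}$ the best ${C}$ has a closed form, and ${\alpha}$ is found by a golden-section search. The function names, the search interval from 0 to 5, and the synthetic data are all assumptions made for this sketch.

```python
import math


def sse_for_alpha(alpha, ranks, values):
    """For a fixed alpha, compute the best C in closed form; return (sse, C)."""
    basis = [r ** (-alpha) for r in ranks]
    C = sum(v * b for v, b in zip(values, basis)) / sum(b * b for b in basis)
    sse = sum((v - C * b) ** 2 for v, b in zip(values, basis))
    return sse, C


def fit_power_law_direct(ranks, values, lo=0.0, hi=5.0, tol=1e-10):
    """Golden-section search over alpha minimising sum (value - C n^-alpha)^2.

    Assumes the error has a single minimum for alpha between lo and hi.
    Returns the pair (C, alpha).
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sse_for_alpha(c, ranks, values)[0] < sse_for_alpha(d, ranks, values)[0]:
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    alpha = (a + b) / 2
    return sse_for_alpha(alpha, ranks, values)[1], alpha


# Synthetic data following C * n**(-alpha) exactly, with C = 1000 and alpha = 1.2.
ranks = list(range(1, 101))
values = [1000.0 * n ** -1.2 for n in ranks]
C, alpha = fit_power_law_direct(ranks, values)
```

On exact power-law data like this, the direct fit and the log-log fit recover the same parameters; the comparison asked for in the exercise only becomes interesting on real, noisy data.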

And should anyone like to explore these topics further, you could for example look at:

A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data," SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062)

M.E.J. Newman, "Power laws, Pareto distributions and Zipf's law," Contemporary Physics 46, 323 (2005).
