HW3 HT2021
- Due 4 Oct 2021 by 17:00
- Points 1
- Submitting a file upload
- Available 23 Sep 2021 at 17:00 - 17 Jan 2022 at 17:00
Homework 3: Explanation and causality
Suggested reading
- Ladyman 3.3-3.6, 7.1-7.2
- Ten Simple Rules for Better Figures
Optional reading:
Zipf's law and city sizes
1. Two research studies
1) A group of students is asked to write a program that solves a specified problem.
Each student may choose one of two groups:
- Group A is instructed to comment their code according to a given Style Guide.
- Group B does not need to comment their code at all.
When running the programs, it turns out that the best programs were from group A (the commented programs; "best" being defined in some way not relevant to this assignment).
a. Can we draw the conclusion that comments improve the code? Explain!
b. Suggest improvements to the study!
2) Another group of students is asked to write a program that solves a specific problem.
They are randomly assigned to two groups:
- Group C will use Haskell (a functional programming language)
- Group D will use Java
When running the resulting programs, it turns out that the fastest running programs were from group D.
a. Can we draw the conclusion that functional programming languages give slower code? Explain!
b. Suggest improvements to the study!
2. Zipf's law
The American linguist George Kingsley Zipf (1902-1950) gave his name to Zipf's law, which describes, for example, the distribution of word frequencies in a language. Typically a few words are used often, and most words are used only rarely. The most common words in English are "the" (7.14%), "of" (4.16%), and "and" (3.04%). These frequencies were calculated from a large dataset of Google books (see the article by Peter Norvig), which contained 97,565 distinct words occurring a total of approximately 743 billion times.
Zipf's law states that if the words are ranked according to their frequency of occurrence, with the most common word getting rank 1, the next most common word rank 2, etc., then the frequency of the word of rank $n$ is approximately $f(n) \approx C n^{-a}$ for some parameters $C$ and $a$, where $a$ is close to 1.
Zipf's law is also encountered in quite different datasets, such as the distribution of population sizes of cities, and many others.
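The reason a log-log diagram is useful here can be shown in one step. The sketch below assumes the law is written as a power law with a prefactor $C$ and an exponent $a$ (these symbol names are chosen for illustration). Taking logarithms of both sides turns the power law into a straight line with slope $-a$ and intercept $\log C$:

```latex
\begin{align*}
f(n) &\approx C\, n^{-a} \\
\log f(n) &\approx \log C - a \log n
\end{align*}
```

Under this assumption, data that follow Zipf's law should fall roughly on a line with slope close to $-1$ when $\log f(n)$ is plotted against $\log n$.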
a. Find a dataset where it is reasonable to test Zipf's law. It should involve values (e.g., frequencies, sizes, etc.) that span several orders of magnitude, and have a reasonably large number of instances. Here are some examples; finding another dataset yourself to try out is even better:
- word frequencies in English (from Peter Norvig)
- word frequencies in Chinese
- frequencies of family names, e.g., in Sweden
- sizes of lunar craters
- wealth of very rich people
- world's largest cities
- citation data for scientific articles or authors (several datasets)
- peak intensity of gamma ray sun flares 1980-89
- all US city populations from the 2000 census
- terrorist attacks worldwide 1968-2006 - number of direct deaths
- frequency of occurrence of unique words in Herman Melville's Moby Dick
- etc
b. Order the frequency (or size, etc.) values according to rank, and illustrate their statistics in a log-log diagram of frequency vs. rank. Make sure that your diagram meets the standards of a scientific publication: for example, it should have labeled axes and a figure caption that describes the diagram, and it should be plotted in a clear and readable way.
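The ranking step can be sketched as follows. This is only an illustration, not a required method: the function names `rank_frequency` and `to_log_log` and the toy counts are hypothetical, and base-10 logarithms are an arbitrary choice.

```python
import math

def rank_frequency(values):
    """Return (rank, value) pairs, with rank 1 assigned to the largest value."""
    ordered = sorted(values, reverse=True)
    return [(rank, value) for rank, value in enumerate(ordered, start=1)]

def to_log_log(pairs):
    """Map (rank, value) pairs to (log10 rank, log10 value).

    Non-positive values are skipped because their logarithm is undefined.
    """
    return [(math.log10(r), math.log10(v)) for r, v in pairs if v > 0]

# Hypothetical toy counts, for illustration only (not real data).
counts = [5, 100, 20, 50, 10]
ranked = rank_frequency(counts)
points = to_log_log(ranked)
```

The resulting pairs can be handed to any plotting tool. Many tools can also draw logarithmic axes directly from the raw rank and value columns (for example, Matplotlib's `loglog`), which keeps the tick labels in the original units.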
c. Use linear regression on the values in the diagram to fit a power law of the form $f(n) = C n^{-a}$ to the data. Plot the resulting line in the log-log diagram, and also give all details of your fit (i.e., the resulting coefficients). You can use any tool or library that you prefer for this, just make sure to describe your methods. Does your result agree with Zipf's law?
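One possible way to do the fit is ordinary least squares on the logarithms of rank and value. The sketch below is a minimal, assumption-laden example: it writes the power law as value ≈ C · rank^(−a) (symbol names assumed here), the function name `fit_power_law` is hypothetical, and the data are synthetic rather than taken from any of the datasets above.

```python
import math

def fit_power_law(ranks, values):
    """Least-squares fit of log10(value) = intercept + slope * log10(rank).

    Returns (C, a) for the assumed model value ~ C * rank**(-a),
    where C = 10**intercept and a = -slope.
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope

# Synthetic data that follow value = 1000 / rank exactly (illustration only).
ranks = list(range(1, 51))
values = [1000.0 / r for r in ranks]
C, a = fit_power_law(ranks, values)
```

Real data will scatter around the line, so it is worth reporting both coefficients along with how the fit was done (which points were included, which logarithm base was used). Note that a straight-line fit in log-log space weights all ranks equally, so the sparse high-frequency head and the noisy low-frequency tail can pull the slope in different directions.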
d. Try to find a published scientific article that deals with your dataset, and compare its results to yours (preferably a peer-reviewed publication, but a recent publication in a preprint archive such as arXiv may also be used). Do not modify your own results; a comparison is sufficient. If you cannot find any published research, which could happen, e.g., if you found a dataset on your own, then mention briefly how you did the search.
e. What type of scientific statement is Zipf's law - is it a hypothesis, a conjecture, a mathematical theorem or something else? Discuss and motivate your answer briefly.