DA2210 HT21 (vettig21)

HW3 HT2021

  • Due 4 Oct 2021 by 17:00
  • Points 1
  • Submitting a file upload
  • Available 23 Sep 2021 at 17:00 - 17 Jan 2022 at 17:00
This assignment was locked 17 Jan 2022 at 17:00.

Homework 3: Explanation and causality

Due Mon Oct 4 at 17:00

Suggested reading

  • Ladyman 3.3-3.6, 7.1-7.2
  • Ten Simple Rules for Better Figures

Optional reading:

Zipf's law and city sizes

1. Two research studies

1) A group of students is asked to write a program that solves a specified problem.

Each student may choose one of two groups:

  • Group A is instructed to comment their code according to a given Style Guide.
  • Group B does not need to comment their code at all.

When running the programs, it turns out that the best programs were from group A (the commented programs; "best" being defined in some way not relevant to this assignment).

a. Can we draw the conclusion that comments improve the code? Explain!

b. Suggest improvements to the study!

2) Another group of students is asked to write a program that solves a specific problem.

They are randomly assigned to two groups:

  • Group C will use Haskell (a functional programming language)
  • Group D will use Java
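
The random assignment described above can be sketched as follows. This is a minimal illustration, not the procedure used in any actual study: the function name `assign_groups`, the placeholder student labels, and the fixed seed are all assumptions introduced here.

```python
import random


def assign_groups(participants, seed=None):
    """Shuffle the participants and split them into two halves.

    Hypothetical helper: the assignment text only says that students
    are randomly assigned; it does not say how.
    """
    rng = random.Random(seed)      # private generator, reproducible if seeded
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Placeholder participants, not real data.
students = [f"student_{i}" for i in range(1, 11)]
group_c, group_d = assign_groups(students, seed=2021)
```

Because chance alone decides group membership, the students have no way to sort themselves into a group according to skill or preference.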

When running the resulting programs, it turns out that the fastest running programs were from group D.

a. Can we draw the conclusion that functional programming languages give slower code? Explain!

b. Suggest improvements to the study!

2. Zipf's law

The American linguist George Kingsley Zipf (1902-1950) gave his name to Zipf's law, which describes, for example, the distribution of word frequencies in a language. Typically, a few words are used often and most words are used only rarely. The most common words in English are "the" (7.14%), "of" (4.16%), and "and" (3.04%). These frequencies were calculated from a large dataset of Google books (see the article by Peter Norvig), which contained 97,565 distinct words occurring a total of approximately 743 billion times.

Zipf's law states that if the words are ranked according to their frequency of occurrence, with the most common word getting rank 1, the next most common word rank 2, etc., then the frequency of the word of rank n is approximately $C n^{-\alpha}$ for some parameters $C, \alpha > 0$, where $\alpha$ is close to 1.
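
It may help to see why a log-log diagram is a natural way to test this law: taking logarithms of the power law gives a linear relation. Here f(n) denotes the frequency of the word of rank n; this symbol is introduced only for this illustration and does not appear in the assignment text.

```latex
f(n) \approx C\, n^{-\alpha}
\quad\Longrightarrow\quad
\log f(n) \approx \log C - \alpha \log n
```

Data that follow Zipf's law should therefore fall close to a straight line with slope $-\alpha$ and intercept $\log C$ when $\log f(n)$ is plotted against $\log n$.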

Zipf's law is also encountered in quite different datasets, such as the distribution of population sizes of cities, and many others.

a. Find a dataset where it is reasonable to test Zipf's law. It should involve values (e.g., frequencies, sizes, etc.) that span several orders of magnitude, and have a reasonably large number of instances. Here are some examples; finding some other dataset yourself to try out is even better:

  • word frequencies in English  (from Peter Norvig)
  • word frequencies in Chinese
  • frequencies of family names, e.g., in Sweden
  • sizes of lunar craters
  • wealth of very rich people
  • world's largest cities
  • citation data for scientific articles or authors (several datasets)
  • peak intensity of gamma-ray solar flares 1980-89
  • all US city populations from the 2000 census
  • terrorist attacks worldwide 1968-2006 - number of direct deaths
  • frequency of occurrence of unique words in Herman Melville's Moby Dick
  • etc

b. Order the frequency (or size, etc.) values according to rank, and illustrate their statistics in a log-log diagram of frequency vs. rank. Make sure that your diagram meets the standards of a scientific publication: for example, it should have labeled axes and a figure caption that describes the diagram, and it should be plotted in a clear and readable way.
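
One possible way to do the ranking and the plot is sketched below in Python with Matplotlib. It is only an illustration under stated assumptions: the helper names `rank_frequency` and `plot_rank_frequency`, the output file name, and the toy sentence are all invented here, and any other tool is equally acceptable.

```python
from collections import Counter


def rank_frequency(counts):
    """Sort values in descending order; rank 1 is the largest value."""
    values = sorted(counts, reverse=True)
    ranks = list(range(1, len(values) + 1))
    return ranks, values


def plot_rank_frequency(ranks, values, path="rank_frequency.png"):
    """Save a log-log scatter plot of value vs. rank (hypothetical helper)."""
    # Imported here so that the ranking helper works without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # file output only, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 4))
    ax.loglog(ranks, values, marker=".", linestyle="none")
    ax.set_xlabel("Rank n")
    ax.set_ylabel("Frequency (number of occurrences)")
    fig.tight_layout()
    fig.savefig(path, dpi=300)
    plt.close(fig)


# Toy input, far too small for a real test of Zipf's law.
text = "the cat and the dog and the bird"
word_counts = Counter(text.split())
ranks, values = rank_frequency(word_counts.values())
```

Calling `plot_rank_frequency(ranks, values)` writes the figure to a file. The figure caption is not produced by the code and still has to be written in the report itself.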

c. Use linear regression on the values in the diagram to fit a power law of the form $C n^{-\alpha}$ to the data. Plot the resulting line in the log-log diagram, and also give all details of your fit (i.e., the resulting coefficients). You can use any tool or library that you prefer for this; just make sure to describe your methods. Does your result agree with Zipf's law?
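
As a rough sketch of such a fit, ordinary least squares can be applied to the logarithms of rank and value, so that the slope estimates $-\alpha$ and the intercept estimates $\log C$. The function name `fit_power_law` and the synthetic data below are assumptions made for illustration; the data are generated to follow an exact power law so that the fit can be checked.

```python
import math


def fit_power_law(ranks, values):
    """Least-squares fit of log(value) = log(C) - alpha * log(rank).

    Returns the pair (C, alpha). All ranks and values must be positive.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data following value = 1000 / rank exactly (not real data).
ranks = list(range(1, 101))
values = [1000.0 / r for r in ranks]
C, alpha = fit_power_law(ranks, values)
```

On real data the points will scatter around the line, and a fit in log space gives the many low-frequency ranks as much weight as the few high-frequency ones, which is worth mentioning when describing the method.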

d. Try to find a published scientific article that deals with your dataset, and compare its results to yours (preferably a peer-reviewed publication, but a recent publication in a preprint archive such as arXiv may also be used). Do not modify your own results; a comparison is sufficient. If you cannot find any published research, which could happen, e.g., if you found a dataset on your own, then mention briefly how you did the search.

e. What type of scientific statement is Zipf's law - is it a hypothesis, a conjecture, a mathematical theorem or something else? Discuss and motivate your answer briefly.
