Re-use of data
Key takeaways
- There are different reasons for re-using data.
- Re-using data opens up new possibilities for data-driven research, but there are also pitfalls.
- Licenses are important to clarify the conditions for re-use of data (and source code).
- High-quality metadata is important to make data more re-usable.
- There are many resources available for finding useful data.
Re-using data can be more efficient than collecting it yourself: if someone else has already collected useful data, you save the time that collection would take. In practice, though, re-usability depends heavily on data quality and data documentation. When re-using other people's data, expect to spend some time finding out how it is structured and what you need to do to make it work for your purposes. Still, the benefits of re-using data can be seen in many research fields, both in areas that have been data-driven for a long time and in fields where a data-driven approach is relatively new. Other stakeholders may also benefit from the re-use of data.
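As an illustration of the time it takes to find out how someone else's data works, a first look at an unfamiliar file often starts with listing its columns and counting missing values. The sketch below is a minimal example under stated assumptions: the sample data, the column names and the set of missing-value markers are all hypothetical, and a real dataset's documentation should tell you which markers it actually uses.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded data file.
SAMPLE_CSV = """station,date,temperature_c
A,2021-01-01,1.5
A,2021-01-02,
B,2021-01-01,NA
B,2021-01-02,0.4
"""

# Assumption: these strings mark missing values.
# Check the dataset's own documentation for the markers it really uses.
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def profile_csv(text):
    """Return the row count and, per column, the number of missing values."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {name: 0 for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in missing:
            value = row[name]
            # A short row yields None for the columns it lacks.
            if value is None or value.strip() in MISSING_MARKERS:
                missing[name] += 1
    return row_count, missing


rows, missing = profile_csv(SAMPLE_CSV)
print(f"{rows} rows")
for column, count in missing.items():
    print(f"{column}: {count} missing")
```

A quick profile like this does not replace reading the documentation, but it can show early on whether the data is complete enough for your purpose.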
Re-use of data also enables a more sustainable use of resources. Listen again to Anna and Jonas talk about the benefits they see from improved access to data in their fields of research.
When data is shared in public data repositories, it opens up possibilities for a more data-driven approach in research fields, even if you do not produce all the data yourself. Let's hear Johan Rung talk about data sharing in Covid-19 research and how the data centre at SciLifeLab built a Swedish data portal for sharing Covid-19-related data.
Martin Isaksson shares another perspective on the value of open datasets and re-use of data, where already existing data makes it possible to avoid dealing with sensitive data:
Assignment
Explore
Can you find data within your own research area that may be useful to you? Browse for subject-specific data repositories at Re3data.org, or search in a broad generic data portal or one of the other data sources listed in the Learn more section below.
Reflection
Read the essay On the Reuse of Scientific Data [1]. Who do you think would re-use your data, and for what purpose: reproducibility or integration? Do the different purposes come with different quality requirements or standards?
Would you consider re-using someone else's data? What would you need in order to do that?
Learn more
Sources for open datasets:
Scientific data repositories, browse by subject, type etc.: Re3data.org
Sources for open datasets and APIs at KTH:
https://www.adeye.se/open-kth
https://zenodo.org/communities/kth
https://www.kth.se/en/api/anvand-data-fran-kth-1.57059
Broad generic data portals, search engines and databases:
Zenodo
Google Dataset Search
EU Open Data Portal
SND (Swedish National Data Service) data catalogue
Datahub
Datasets mentioned by Martin Isaksson are available here:
https://datasetsearch.research.google.com/
https://www.kaggle.com/datasets
https://www.openml.org/search?type=data
(see the API tutorial here: https://openml.github.io/openml-python/develop/examples/30_extended/datasets_tutorial.html)
https://www.tensorflow.org/datasets
https://pytorch.org/docs/stable/torchvision/datasets.html