Catalyzing Science with Open Data — What is Open Data?

A new data age is sweeping the nation. In February 2015, the White House hired the first-ever U.S. Chief Data Scientist, D.J. Patil, announcing that “the data age has arrived.” The appointment accelerates Federal agency efforts to unleash the power of government data by making it available to everyone. These data are treated as a strategic national asset for entrepreneurship and economic development, one that belongs to all American citizens because taxpayers paid to create them. Similar initiatives are underway at international, state, and local government scales.

Now, more than at any time in human history, scientists have a plethora of machine-readable datasets and big data that enable analysis and research never before possible. For example, the U.S. government’s Federal hub for open data – data.gov – has over 125,000 datasets that are free, discoverable, and usable with open licenses. Anyone and everyone – students, scientists, government employees, businesses, the general public – is invited to use and leverage these open resources.

Open data has multi-trillion dollar potential to unlock innovation, fuel entrepreneurship, and stimulate economic development with high-tech, high-paying jobs. A 2013 McKinsey report, “Open data: Unlocking innovation and performance with liquid information,” estimates that open data will add $3 trillion to $5 trillion in economic value to the global economy each year.

Open data is already changing the face of our world. You might not even realize it, but you are using open data every day. Open data underlies web search for real estate (Zillow; RedFin), transportation and GPS options (Google maps; Uber), consumer information and marketing (FindTheBest; CarFax), and a multiplicity of real-time weather websites and smartphone applications. These are just a handful of examples. Incredibly creative people are leveraging open data to develop new ways of sharing information and interacting with the world, and changing the face of science and our economy in the process.

For academics and researchers, this open data revolution means opportunity. It means a chance to do something we’ve never been able to do before, and to change the world as a result. Doing that requires new ways of thinking, new modes of operation, and new connections between research and government. This is an exciting time for researchers, and open data is a fantastic new tool for pushing the boundaries of science.

What is Open Data? Key Terms

(definitions from OMB Memorandum M-13-13, the U.S. Federal Open Data Policy – Managing Information as an Asset)

Creative Commons Licenses: A family of public copyright licenses that enable the free distribution of an otherwise copyrighted work. A CC license is used when authors want to give people the right to share, use, and build upon a work that they have created.
Data: All structured information.
Dataset: A collection of data presented in tabular or non-tabular form.
Government information: Information created, collected, processed, disseminated, or disposed of, by or for the Federal Government.
Information: Any communication or representation of knowledge such as facts, data, or opinions in any medium or forum, including textual, numerical, graphic, cartographic, narrative, or audiovisual forms.
License: (1) a contract that grants explicit rights to use data and intellectual property, or (2) a digital permit containing descriptions of rights that can be applied to data and content. See link for information on Project Open Data licenses.
Open data: Publicly available data structured in a way that enables the data to be fully discoverable and usable by end users. In general, open data will be consistent with the following principles:

  • Public. Consistent with OMB’s Open Government Directive, agencies must adopt a presumption in favor of openness to the extent permitted by law and subject to privacy, confidentiality, security, or other valid restrictions.
  • Accessible. Open data are made available in convenient, modifiable, and open formats that can be retrieved, downloaded, indexed, and searched. Formats should be machine-readable (i.e., data are reasonably structured to allow automated processing). Open data structures do not discriminate against any person or group of persons and should be made available to the widest range of users for the widest range of purposes, often by providing the data in multiple formats for consumption. To the extent permitted by law, these formats should be non-proprietary, publicly available, and no restrictions should be placed upon their use.
  • Described. Open data are described fully so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations, security requirements, as well as how to process them. This involves the use of robust, granular metadata (i.e., fields or elements that describe data), thorough documentation of data elements, data dictionaries, and, if applicable, additional descriptions of the purpose of the collection, the population of interest, the characteristics of the sample, and the method of data collection.
  • Reusable. Open data are made available under an open license that places no restrictions on their use.
  • Complete. Open data are published in primary forms (i.e., as collected at the source), with the finest possible level of granularity that is practicable and permitted by law and other requirements. Derived or aggregate open data should also be published but must reference the primary data.
  • Timely. Open data are made available as quickly as necessary to preserve the value of the data. Frequency of release should account for key audiences and downstream needs.
  • Managed Post-Release. A point of contact must be designated to assist with data use and to respond to complaints about adherence to these open data requirements.
Project Open Data: An online repository of tools, best practices, and schema to help Federal agencies adopt the Open Data Policy framework. It includes definitions, code, checklists, case studies, and more, and enables collaboration across the Federal Government, in partnership with public developers, as applicable. Visit Project Open Data for a more comprehensive glossary of terms related to open data.
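To make the principles above more concrete, here is a minimal Python sketch of how one dataset's metadata record might be screened against them. It is only an illustration: the field names (accessLevel, license, contactPoint, distribution, and so on), the list of machine-readable formats, and the example record are all assumptions chosen for this sketch, not an official schema or validator. It also checks only whether a field is present, so it says nothing about whether a dataset is truly complete or released on time.

```python
# Illustrative sketch only. Field names and the format list are assumptions
# made for this example; they are not an official schema or validator.
MACHINE_READABLE_FORMATS = {"csv", "json", "xml"}


def check_open_data_principles(record):
    """Map each open data principle to True/False for one metadata record."""
    distributions = record.get("distribution", [])
    # Normalize format names so "CSV" and "csv" are treated the same.
    formats = {(d.get("format") or "").lower() for d in distributions}
    return {
        "public": record.get("accessLevel") == "public",
        "accessible": bool(formats & MACHINE_READABLE_FORMATS),
        "described": bool(record.get("description")) and bool(record.get("keyword")),
        "reusable": bool(record.get("license")),
        # Presence of a date only; judging timeliness needs human context.
        "timely": bool(record.get("modified")),
        "managed_post_release": bool(record.get("contactPoint")),
    }


# A made-up record used purely for demonstration.
example_record = {
    "title": "Hypothetical air quality readings",
    "description": "Hourly sensor readings from a made-up monitoring network.",
    "keyword": ["air quality", "sensors"],
    "modified": "2015-03-01",
    "accessLevel": "public",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "contactPoint": {"fn": "Data Steward", "hasEmail": "mailto:data@example.org"},
    "distribution": [{"format": "CSV", "downloadURL": "https://example.org/air.csv"}],
}

if __name__ == "__main__":
    for principle, ok in check_open_data_principles(example_record).items():
        print(f"{principle}: {'ok' if ok else 'missing'}")
```

The “Complete” principle is left out on purpose, because whether data are published in their primary form cannot be judged from a metadata record alone.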

 


This is the first in a new Capital Chemist series of “Catalyzing Science with Open Data” posts. Stay tuned for future posts on how open data matters to you!

Please tweet @khoney with hashtag #OpenScience to suggest future topics, projects, and success stories.

The #OpenScience blog series is cross-posted on the American Association for the Advancement of Science (AAAS) Fellowship Program’s Sci on the Fly blog.


Photo credit (featured image): sacharules, CC BY-NC-ND 2.0