Stop wasting data! Make units of measurement machine-readable

The media was abuzz in 1999 when NASA's Mars Climate Orbiter was lost in the Martian atmosphere after missing its intended orbit. The cause was navigation software that had no way to exchange or verify information about units. The software of one team produced data in imperial units instead of the expected metric ones, which sent the spacecraft on an incorrect trajectory. The result was the loss of five years of effort and hundreds of millions of taxpayer dollars.

Two decades on, such problems persist. Researchers in different fields often assume that their colleagues already know the details, and so fail to specify them. As a result, they are remiss in documenting units: they may leave them out completely, give multiple definitions, or use convenience units that were never officially recognized.

Numbers whose units are missing or sloppily recorded are hard enough for humans to interpret. The problem is even harder when computers are involved. With a few exceptions, such as F#, most software packages, data-management tools and programming languages have no built-in support for associating units with numeric data. Information is stored and managed as 'unitless values'. Some fields, bioscience and aerospace engineering among them, have established conventions for unit representation, including the Unified Code for Units of Measure (UCUM) and the Quantities, Units, Dimensions and Types (QUDT) Ontology. But there is no common technical specification for representing quantities and their associated units, which can confuse machines.
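To make the idea of associating units with numeric data concrete, here is a minimal Python sketch. The `Quantity` class and the `UNITS` table are hypothetical names invented for this illustration, and they cover only a couple of units. The sketch is not a description of any existing library or of the software used on any mission.

```python
from dataclasses import dataclass

# Illustrative table (an assumption of this sketch):
# unit symbol -> (kind of quantity, factor that converts to the SI unit).
UNITS = {
    "N": ("force", 1.0),                # newton
    "lbf": ("force", 4.4482216152605),  # pound-force expressed in newtons
    "kg": ("mass", 1.0),                # kilogram
}


@dataclass(frozen=True)
class Quantity:
    """A number that always travels with its unit."""
    value: float
    unit: str

    def to(self, target: str) -> "Quantity":
        kind, factor = UNITS[self.unit]
        target_kind, target_factor = UNITS[target]
        if kind != target_kind:
            raise ValueError(
                f"cannot convert {self.unit} ({kind}) to {target} ({target_kind})"
            )
        return Quantity(self.value * factor / target_factor, target)

    def __add__(self, other):
        # A bare number has no unit, so the addition is refused outright.
        if not isinstance(other, Quantity):
            raise TypeError("refusing to add a unitless value to a Quantity")
        converted = other.to(self.unit)
        return Quantity(self.value + converted.value, self.unit)


# Mixed units are converted explicitly instead of being silently added.
total = Quantity(5.0, "N") + Quantity(10.0, "lbf")
print(total)
```

With values stored this way, adding a bare number or mixing incompatible kinds of quantity raises an error rather than producing a plausible but wrong result.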

In recent years, there have been numerous calls to make data sets FAIR (findable, accessible, interoperable and reusable) and to ensure that open data adhere to the 5-star deployment scheme suggested by Tim Berners-Lee, inventor of the World Wide Web. This scheme aims to make open data free, structured and searchable. Many researchers now commit to depositing data, with appropriate metadata, in free and open repositories.

These efforts are undermined by the chaos around units. Many scientists spend more time wrangling data than doing research. If data are not interoperable and machine-readable, researchers' own informatics methods are hindered and data sharing brings fewer benefits.

If units of measurement are not documented consistently, in a way that allows easy and unambiguous data exchange, information may be hard to use or may be misinterpreted. Tackling global problems such as climate change and pandemics depends on high-quality data from multidisciplinary, international sources. The cost to humanity of getting this wrong would be far greater than the hundreds of millions of dollars lost with a single crashed spacecraft.

The Task Group on Digital Representation of Units of Measurement (DRUM) was established by CODATA, the Committee on Data of the International Science Council. The group includes scientists with backgrounds in computer science, chemistry and metrology. DRUM's goal is to work with the international science unions that fall under the International Science Council to raise awareness of how units and quantities are represented in digital formats, and to enable the unions' communities to represent them. In 2019, the International Committee for Weights and Measures (CIPM), an intergovernmental organization, created a task group on the Digital International System of Units (Digital SI). The goals of its Digital SI Expert Group complement those of DRUM: the group focuses on globally agreed norms for unit representation within the metrology community. Between them, the authors of this comment represent these groups.

A few years into our mission, we now need your help. We ask scientists, information technologists and standards organizations to submit case studies, problem areas, pain points and possible solutions (see "Call to Action").

Here is how you can all help to create interoperable data using machine-readable quantities and units of measurement.

Scientists: Make sure units are present and properly annotated. Check that your software and analysis tools can associate quantities with units. Use symbols that are easily understood.

Developers: Familiarize yourselves with the widely used digital representation systems for units, and choose one to include in your system.

Funders: Support the development of fully interoperable platforms and services for unit representation.

Everyone: Send us your use cases, problems and solutions at drum@codata.org. Get in touch to find out whether your science union or professional society has a designated ambassador.

Many measurements are reported without units, on the assumption that the unit is obvious from the context. Temperature is an example: "in the 20s" means bitterly cold in the United States, whereas in countries that use Celsius it describes a warm summer day. Whether cholesterol is measured in milligrams per decilitre or millimoles per litre depends on the country. Skilled people can often deduce what unitless numbers in scientific papers and data sets mean. Computers cannot draw on context or common sense, which makes it even harder for them to unravel such problems.

The same unit name or symbol can mean different things depending on the situation. A Calorie, written with a capital C, is one kilocalorie. This unit is used to denote food energy: it is the energy required to heat 1 kilogram of water by 1 °C. Calories and calories differ by a factor of 1,000, yet the abbreviation cal (with a lower-case c) is widely used for both. The intended meaning may be obvious to someone interested in either thermodynamics or the nutritional value of hamburgers, but it is hard for a computer to work out. The gravitational constant G is sometimes confused with g, the local acceleration due to gravity, and g is also the symbol for grams. The metre is sometimes written as M, which is also the mega prefix and the symbol for molarity. Computers can stumble over such conventions.

Often, the same quantity is expressed in different units. Solubility, for example, can be given in kilograms per litre or in moles per cubic decimetre (mol dm⁻³). Conversion between them is straightforward, provided that the units are properly documented. Sometimes, the same unit can be written in more than one way. A microgram can be written as µg, mcg or ug. Acceleration in metres per second squared can be written as m/s², m s⁻² or m/s^2. Typographical conventions such as italics, bold type and superscripts are easy for humans to read but hard for machines to interpret. This makes it difficult to parse such units automatically or to map them to a single, unambiguous and interoperable representation.
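The mapping problem for variant spellings can be sketched in a few lines of Python. The variant table, the function names and the choice of canonical codes below are all assumptions made for this illustration; a real system would draw them from an agreed unit representation. Lookups stay case-sensitive on purpose, because upper and lower case carry meaning in unit symbols (M versus m, for example).

```python
import unicodedata


def _fold(text):
    """Reduce typographic variation without touching letter case."""
    # NFKC turns superscript digits into plain digits and unifies the
    # micro sign with the Greek letter mu.
    folded = unicodedata.normalize("NFKC", text.strip())
    folded = folded.replace("\u2212", "-")  # minus sign -> hyphen-minus
    return " ".join(folded.split())         # collapse runs of whitespace


# Hypothetical table: canonical code -> spellings seen in the wild.
_VARIANTS = {
    "ug": ["ug", "mcg", "\u00b5g"],
    "m/s2": ["m/s2", "m/s^2", "m/s\u00b2", "m s-2", "m s\u207b\u00b2"],
}
_LOOKUP = {
    _fold(variant): canonical
    for canonical, variants in _VARIANTS.items()
    for variant in variants
}


def normalize_unit(text):
    """Return the canonical code, or fail loudly on unknown strings."""
    key = _fold(text)
    if key not in _LOOKUP:
        raise KeyError(f"unrecognized unit string: {text!r}")
    return _LOOKUP[key]


print(normalize_unit(" mcg "))
print(normalize_unit("m s\u207b\u00b2"))
```

Raising an error for unknown strings, rather than guessing, keeps an ambiguous label such as cal from being silently assigned the wrong meaning.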

The computer systems that crunch numbers and share data are not designed to help users with units. Take Excel spreadsheets as an example: a currency sign is the only unit that can go in a computable field. A unit can be associated with a quantity, but only through inconsistent, arbitrary practices such as putting the unit string in the header row. This association is easily broken when data are transferred or used in calculations.
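One way to stop the header-row convention from breaking is to parse the unit out of the header once and keep it attached to every value. The Python sketch below assumes a hypothetical convention in which each column header reads "name (unit)" or "name [unit]". The function name and the sample data are invented for illustration.

```python
import csv
import io
import re

# Assumed header convention: "name (unit)" or "name [unit]".
HEADER = re.compile(
    r"^\s*(?P<name>[^()\[\]]+?)\s*[(\[]\s*(?P<unit>[^)\]]+?)\s*[)\]]\s*$"
)


def read_with_units(text):
    """Parse CSV text into rows of {column name: (value, unit)}."""
    reader = csv.reader(io.StringIO(text))
    columns = []
    for cell in next(reader):
        match = HEADER.match(cell)
        if match is None:
            # Refuse columns whose unit would have to be guessed.
            raise ValueError(f"column {cell!r} does not declare a unit")
        columns.append((match.group("name"), match.group("unit")))
    rows = []
    for record in reader:
        rows.append({
            name: (float(value), unit)
            for (name, unit), value in zip(columns, record)
        })
    return rows


sample = "mass (kg),cholesterol [mmol/L]\n70.5,5.2\n"
print(read_with_units(sample))
```

Each value now carries its unit wherever the row goes, and a column without a declared unit is rejected at load time instead of being treated as a bare number.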

A lot of work is under way to address these problems. Many standards, conventions and best practices for units are readily available. The International System of Units (SI) is widely used and provides standard names and typographical representations for quantities and their associated units. International bodies such as the International Organization for Standardization (ISO), the International Electrotechnical Commission (IEC) and the United Nations Economic Commission for Europe (UNECE) have also achieved significant standardization.

The FAIR Digital Objects Forum (FDO Forum) aims to improve the representation and transmission of scientific information, including fully machine-actionable semantics. FAIR Digital Objects, in principle, "bind all the critical information about an entity and create a new type of actionable, meaningful and technology-independent object that pervades all aspects of life today". But there is still much to be done.

Around 20 systems support machine reading of units. They include UCUM and the QUDT Ontology. Each meets the needs of particular communities, but each has its own shortcomings.

Many efforts are under way to link these conventions, to encourage interoperability and to allow analyses that combine multiple data sets. The Units of Measurement web service, prepared by a member of our task group, uses UCUM codes to map between definitions in six unit representation systems. Another DRUM member is developing a pilot Units of Measurement Interoperability Service that is intended to cover more representation systems (see go.nature.com/3vevfdo). But because no system has been fully adopted, there is no universal way to connect them.
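The idea of mapping between representation systems through a shared set of codes can be sketched as follows. This is not the implementation of either service mentioned above. The systems "lab_a" and "lab_b", their labels and the `translate` function are hypothetical, and the hub codes are merely written in a UCUM-like style.

```python
# Hypothetical crosswalk: for each system, local label -> hub code.
TO_HUB = {
    "lab_a": {
        "microgram": "ug",
        "metre per second squared": "m/s2",
        "degree Celsius": "Cel",
        "calorie": "cal",
    },
    "lab_b": {
        "mcg": "ug",
        "m/s^2": "m/s2",
        "degC": "Cel",
    },
}

# Reverse tables: for each system, hub code -> local label.
FROM_HUB = {
    system: {hub: local for local, hub in table.items()}
    for system, table in TO_HUB.items()
}


def translate(label, source, target):
    """Translate a unit label from one system to another via the hub."""
    try:
        hub = TO_HUB[source][label]
    except KeyError:
        raise LookupError(f"{label!r} is not defined in {source!r}") from None
    try:
        return FROM_HUB[target][hub]
    except KeyError:
        raise LookupError(
            f"{target!r} has no equivalent for hub code {hub!r}"
        ) from None


print(translate("mcg", "lab_b", "lab_a"))
print(translate("degree Celsius", "lab_a", "lab_b"))
```

Routing every translation through one hub means that each system needs only a single table, rather than a separate mapping for every pair of systems. Gaps, such as the calorie entry that "lab_b" lacks, surface as explicit errors.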

DRUM and the Digital SI group have been working with other international organizations, such as the CIPM, the International Science Council, the Research Data Alliance and the GO FAIR Initiative, to raise awareness and support interoperability efforts.

We want to collect and organize the legacy solutions that have been used to achieve interoperability. These can be compiled into an "information layer", a kind of helpline for computers.

The Digital SI Task Group, which appointed the Digital SI Expert Group, has set another ambitious goal: to build a robust, unambiguous data-exchange framework based on SI units. This would allow long-standing problems to be resolved properly. It could, for example, end the practice of representing the units of certain quantities in several different ways. It would also help to prevent future systems from perpetuating the problems that plague the digital domain. The project will eventually produce standards for unit representation in the global metrology community. These will be applicable to research, industry and commerce at all levels, and flexible enough to accommodate diverse constituencies.

So far, DRUM and the Digital SI Expert Group have collected a dozen use cases and curated a list of nearly 50 available unit representation systems to improve understanding of how units are expressed in databases, digital publishing, software, code, scripting and scientific field vocabularies and ontologies (see go.nature.com/38mbpxo).

DRUM also has a network of 26 'ambassadors' from 46 international science associations and unions. The DRUM task group is currently conducting surveys to determine how units are used; the results will be published later in the year.

What follows is intended as a guideline. Scientists must agree on models for representing units and quantities. These models should contain formal definitions that are both human-friendly and suitable for machine processing. Databases should provide access to this knowledge. Service-oriented infrastructures, such as websites and computer programs, should supply unit information and conversions. Programming environments, data-storage platforms and analytical software must be 'unit aware'.

DRUM can seed this work, but it will not succeed without broad collaboration across many scientific and information-technology communities. The effort is currently supported by volunteers such as us, and should be funded by private-sector firms and funding agencies. Even a small amount of R&D funding could sustain the work and pave the way for national and international agreements that encourage the use of interoperable units.

Everyone agrees that useful, intelligible data are at the core of science, and that solving global problems requires insights from many disciplines. Current research systems are not meeting these needs. Now is the time to make data and knowledge easily accessible to both machines and people.