Is data the New Oil in R&D too?


The value of the DATA 

“Data is the new oil” (1) means that data has become the most valuable economic resource, as indicated by the skyrocketing value of the digital tech giants. In R&D, however, data has always been the most valuable resource, inspiring scientists and making or breaking theories. So, in R&D, data is the (good) old oil.

The DIKW Pyramid

Nevertheless, the analogy between the roles of oil for the economy and data for R&D is worth pursuing while walking up the “DIKW” pyramid (2):


  • Data without a context pollutes R&D’s computers, just like crude oil outside a proper container; imagine oil spills polluting a beach.
  • Information is data with context. Just like the crude oil in a tanker, it can be converted into something useful by refining.
  • Knowledge is a big word for information ready to be used; in R&D, you may prefer “Know-how” as the K-word. Creating knowledge from information in R&D is like converting crude oil into products that can readily be used to fuel airplanes, to make plastics, etc.
  • Wisdom is an even bigger word for being able to decide “What” to do and “Why”, giving us two simple W-words. In a responsible company like Unilever (3), people question whether it is right to fly to a meeting or to use plastics for packaging, given the environmental impact; such decisions affect the use of oil. In R&D, the decision was taken to stop animal testing to generate product safety data.

In this contribution, I focus on the bottom half of the DIKW pyramid: Data and Information. In a next contribution I will deal with the upper half: Knowledge and Wisdom.

So, nothing has changed in the lab?

Looking from the corridor at people working in a lab, one may have the impression that little has changed in the last century: people still work at a bench, weighing and pipetting. A closer look inside reveals a lot of change: new kinds of measurement, hyphenated techniques, sample changers, computers embedded in the equipment. A look at the data produced shows that the change is huge: the data from a measurement has exploded from a single visual reading to computer files, with the largest files growing by a factor of 1000 per decade (4). The data from the lab has changed, so what is the effect on the rest of the DIKW pyramid?

What is the effect on the rest of the DIKW pyramid?

Once, the context promoting data to information was recorded in a paper lab notebook together with the data, often just a visual reading of the result of the measurement. In those days, glueing charts in the notebook was a burden, but at least data and context were together. With the advent of computers embedded in scientific instruments, the amount of data that could be printed and plotted exploded beyond what could practically be glued into a paper notebook. Nowadays, the data is often even too big to include in the database behind an Electronic Laboratory Notebook (ELN). In my opinion, the primary function of an ELN is to capture the context and the data that is discussed in the ELN entry, which I call “result data” and put at the top of the data layer. This implies that the ELN does not have to include the complete data if the underlying dataset is stored elsewhere, for example in a data lake dedicated to the ELN.

That dataset must refer back to the ELN entry, be immutable, and be hashed to guarantee data integrity.
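As a minimal sketch of these three requirements, assuming the dataset is a single file and choosing SHA-256 as the hash (the text above does not prescribe an algorithm), a record that links a dataset to its ELN entry might look like this. The names `DatasetRecord`, `register_dataset`, and `verify_dataset`, as well as the format of the entry identifier, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)  # frozen: the record itself cannot be changed after creation
class DatasetRecord:
    eln_entry_id: str  # hypothetical identifier of the ELN entry
    path: str          # where the dataset is stored
    sha256: str        # hash taken when the dataset was registered


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_dataset(path: Path, eln_entry_id: str) -> DatasetRecord:
    """Create the record that refers the dataset back to its ELN entry."""
    return DatasetRecord(eln_entry_id, str(path), file_sha256(path))


def verify_dataset(record: DatasetRecord) -> bool:
    """Return True if the stored file still matches the registered hash."""
    return file_sha256(Path(record.path)) == record.sha256
```

The frozen record only protects the link and the hash. Keeping the file itself immutable is a matter for the storage layer (for example write-once storage); the hash merely makes any later change detectable.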

Information, please!

I assume here that the context in an ELN entry meets its author’s own needs, company policies, and the demands of regulatory bodies. However, does the ELN entry contain the extra context needed to turn this data into information to build a piece of knowledge? How is the author to know about such needs? Is the author willing to make the effort to add the extra context, and does their management allow them to spend time on it? These questions indicate that data may have proper context for its original purpose and still not be fit for building knowledge.

To ensure that a measurement can be used to build a piece of knowledge without the need to consult the experimenter’s memory, it helps if the measurement is executed, processed, and reported according to a Standard Operating Procedure (SOP). This SOP should specify metadata as the structured part of the context, including metadata that is only needed to build the knowledge. Standardising the measurement and structuring the data goes beyond the “paper-on-glass” ELN as a simple replacement of the paper lab notebook. The required functionality is sometimes called a Laboratory Execution System (LES) but can be part of e.g. a Laboratory Information Management System (LIMS). I advise still capturing the context and results of a measurement in an ELN as a personal record, but referring to the LES database for the structured data and metadata.
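To illustrate what “metadata specified by the SOP” could mean in practice, here is a small sketch that checks one measurement record against a list of required fields. The field names and the function `metadata_problems` are invented for illustration; a real SOP would define its own list.

```python
# Hypothetical metadata required by one SOP; field names are assumptions.
REQUIRED_METADATA = {
    "sop_id": str,
    "sample_code": str,
    "instrument_id": str,
    "temperature_c": float,
}


def metadata_problems(metadata: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected_type in REQUIRED_METADATA.items():
        if field not in metadata:
            problems.append(f"missing: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"wrong type: {field}")
    return problems
```

Running such a check at the moment of reporting, rather than years later, is the point: the experimenter is still there to supply what is missing.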

Stewards needed

Even when an experimenter has performed a quality measurement, the reported tuple of result data and metadata might not add to the information needed to build a piece of knowledge. I remember a collection of excellent property data generated over many years that proved to be useless for knowledge building because it contained only the codes of the measured samples. The composition of the samples had been known to the project teams requesting the measurements, but had not been preserved afterwards. The lesson I learnt is that the information to build a piece of knowledge needs a steward (5) who actively monitors the information becoming available for completeness and consistency.
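The sample-code example suggests one completeness check a steward could automate. The sketch below assumes that measurements carry a `sample_code` field and that compositions are kept in a separate registry keyed by that code; both structures and the function name are hypothetical.

```python
def samples_without_composition(measurements: list[dict], compositions: dict) -> list[str]:
    """Return the sorted sample codes that were measured but have no recorded composition."""
    measured_codes = {m["sample_code"] for m in measurements}
    return sorted(measured_codes - set(compositions))


# Illustrative data only.
measurements = [
    {"sample_code": "S-001", "viscosity": 1.2},
    {"sample_code": "S-002", "viscosity": 3.4},
]
compositions = {"S-001": {"water": 0.9, "surfactant": 0.1}}

orphans = samples_without_composition(measurements, compositions)
```

Here `orphans` contains `"S-002"`: a good measurement that cannot contribute to knowledge until its composition is recovered.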

It takes effort to make the data and metadata readable for people outside your own organisation or make it machine-readable according to the FAIR guiding principles (6).

However, consider the benefits if you have multiple lab sites using different instruments to collect the same type of data, if you need to archive the data beyond the lifetime of the instruments, or if you want to build and maintain pieces of knowledge with machine support. In general, adopting global standards (7) contributes to the efficiency of the stewardship.

Remarks:

  1. This catchy phrase was apparently coined in 2006 by the mathematician and entrepreneur Clive Humby.
  2. The Data-Information-Knowledge-Wisdom pyramid or hierarchy was made a commonly used framework in 1989 by Russell Ackoff, but similar propositions were made before by others.
  3. I worked 30 years for Unilever as a scientist, IT manager for R&D, and member of the R&D digital transformation team. I use examples from this experience when appropriate.
  4. The factor of 1000 is not an accurate estimate, but rather my impression. In the days that I worked as an NMR spectroscopist, I saw the file size grow from kB to MB to GB with the advent of multi-dimensional techniques. Later, working as the IT manager of an R&D lab, I saw TB storage devices “smuggled” into the lab by scientists, far exceeding the storage capacity offered to R&D by IT.
  5. The term “data steward” is most common, but I link the role more to information and knowledge than data. Let’s use it as shorthand for “DIKW steward”.
  6. Findable, Accessible, Interoperable, and Reusable. See https://www.go-fair.org/fair-principles
  7. Amongst others, I recommend:
    Allotrope Foundation  
    Pistoia Alliance
    SiLA Standardization in Lab Automation 

Rik Pepermans recently established Rik Pepermans Consulting to guide R&D organisations in their digital transformation. He assists his customers in defining their digital ambition, evaluates their current digital maturity, facilitates creating a roadmap to meet their ambition, and supports its implementation.

Rik has a degree in chemical engineering and a PhD in chemistry. He enjoyed the synthesis lab and advanced NMR, but modelling of molecules got him really excited: “You don’t understand what you cannot simulate.” He started teaching molecular modelling just before he switched to Unilever R&D as his main job but kept teaching for 20 more years to share his vision with the next generation.

After 20 years as a researcher and R&D manager, Rik made a career switch to IT manager for R&D, driven by the vision of a digitised research function. His perseverance as a matchmaker between R&D and IT earned him the nickname “R&D IT Evangelist”. In the last decade, his combined experience in R&D and IT gave him the opportunity to co-shape the digital transformation of Unilever R&D. He has now founded his own company to catalyse the digital transformation of other R&D organisations.