Trust in data and data quality through data lineage
Data is generated, analysed, shared and used for a wide range of purposes at an unprecedented rate. For example, millions of pictures are shared through social media daily, and personal information is registered by government organizations to govern effectively. When such data is used, the origin and quality of this data is not always known. Furthermore, when data is generated, it is also not always known how that data is going to be used or where it ends up. With increasing digitization of society, it is imperative for societies and governments to be able to trust their data—this is possible through data lineage, according to a new WODC-study.

What is data lineage?
Following the life course of data through registering its origin, changes along the way and end-use is called data lineage. This is possible through metadata, which is the inclusion of information on the data into the data itself.
An example of data lineage is when someone takes a picture while on vacation. As the picture is taken, the name of the photographer is added to the metadata, as well as the date when the picture was taken and the location. Before the picture is uploaded to social media, the photographer uses a graphic editor to remove several tourists from the image. This edit is also included in the metadata.
When someone sees the picture on social media and accesses the metadata, it becomes immediately clear that the picture is altered. That way, an end-user can consider whether the picture is authentic or an adequate source. For example, the picture might be considered adequate to determine where the photographer was on vacation, but not necessarily to determine how busy the visited location was, due to the edit made.
Data lineage in the justice system
The Dutch justice system is increasingly relying on algorithmic and data-driven tools when making and implementing policy. Generating and exchanging data within the justice system is fragmented though, limiting the extent to which data origin and quality can be determined and evaluated. Thus, the need for data lineage (tools) is highlighted within this domain, as users should be able to adequately consider what data to use (or not use).
The WODC-study suggests various ways and conditions for data professionals how to tackle data lineage issues, such as what (technical) tools and methods are available, and what the pros and cons of these tools and methods are.