What are the differences between Data Lineage and Data Provenance?

Solution 1

From our experience, data provenance includes only a high-level view of the system for business users, so they can roughly navigate where their data comes from. It is provided by a variety of modeling tools or just simple custom tables and charts. Data lineage is a more specific term and has two sides: business (data) lineage and technical (data) lineage. Business lineage depicts data flows at the business-term level and is provided by solutions such as Collibra, Alation, and many others. Technical data lineage is created from actual technical metadata and tracks data flows at the lowest level: actual tables, scripts, and statements. It is provided by solutions such as MANTA or Informatica Metadata Manager.
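To make "technical lineage" more concrete, here is a minimal sketch of tracing data flows at the table and script level. It is not how MANTA or Informatica work internally; the table names, script names, and the `upstream` helper are all hypothetical and only illustrate the idea of walking a flow graph back to its sources.

```python
from collections import defaultdict

# Hypothetical technical-lineage edges: (source table, target table, script).
EDGES = [
    ("raw.orders", "staging.orders", "load_orders.sql"),
    ("raw.customers", "staging.customers", "load_customers.sql"),
    ("staging.orders", "mart.revenue", "build_revenue.sql"),
    ("staging.customers", "mart.revenue", "build_revenue.sql"),
]


def upstream(table, edges=EDGES):
    """Return every table that directly or indirectly feeds `table`."""
    # Index the edges by target so we can look up direct parents quickly.
    parents = defaultdict(set)
    for src, dst, _script in edges:
        parents[dst].add(src)

    # Depth-first walk from the target back towards the sources.
    seen = set()
    stack = [table]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("mart.revenue")))
```

A business-lineage view of the same flow would typically collapse these tables into terms such as "Orders" and "Revenue" and hide the scripts.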

Solution 2

Data Provenance is,

data lineage (what is the genealogy and history of its journey: where did it begin, how did it come into being, how did it change over time, where has it been, which systems has it traveled through, any loss or gain) (i.e. data oriented, metadata)

PLUS

the inputs, entities, systems and processes that influenced the data (i.e. process oriented) which can be used to reproduce the data.
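One way to picture this "lineage PLUS process" definition is as two record types, where the provenance record wraps the lineage record and adds what is needed to reproduce the data. This is only a sketch under that definition; the class names, fields, and the `can_reproduce` check are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Data oriented: where the data began and which systems it passed through."""
    dataset: str
    origin: str
    hops: list = field(default_factory=list)  # systems visited, in order


@dataclass
class ProvenanceRecord:
    """Lineage plus the process-oriented details that influenced the data."""
    lineage: LineageRecord
    inputs: dict = field(default_factory=dict)  # e.g. parameters, source versions
    process: str = ""                           # e.g. script or job that ran
    agent: str = ""                             # person or system that ran it

    def can_reproduce(self):
        # Hypothetical rule: reproduction needs both the inputs and the process.
        return bool(self.inputs) and bool(self.process)


lineage = LineageRecord("mart.revenue", "raw.orders", ["staging.orders"])
provenance = ProvenanceRecord(
    lineage,
    inputs={"snapshot": "2021-08-01"},
    process="build_revenue.sql",
    agent="etl-service",
)
print(provenance.can_reproduce())
```

Under this reading, a lineage record alone tells you the path the data took, while the provenance record also tells you how to rebuild it.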

Solution 3

See this section in the Wikipedia article on provenance: https://en.wikipedia.org/wiki/Provenance#Science. It links to collections of academic and industry work on provenance.

To succinctly answer your question: in general, there's not enough context known to differentiate between data lineage and data provenance. Within a specific context, you could look for, or create, specific and possibly different, definitions.

Author by

CSY

Updated on August 04, 2021

Comments

  • CSY
    CSY almost 3 years

    From wiki,

    Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.

    Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins.

    It seems that both concepts are talking about where the data comes from, but I'm still confused about the differences. Are the two concepts the same? If they are different, can someone share an example?

    Thanks,

  • nircraft
    nircraft over 5 years
    Can you explain it a little more for others?
  • Dennis Jaheruddin
    Dennis Jaheruddin almost 3 years
    I think this is a reasonable analysis of the definition, and have written an answer on what this should mean in practice stackoverflow.com/a/68642058/983722
  • Dennis Jaheruddin
    Dennis Jaheruddin almost 3 years
    Though there may be various definitions, according to every one that I am aware of, provenance is certainly more than just the point of origin.