SQLTeam.com | Weblogs | Forums

Homebrew data lineage


Given a number of tables (from staging up to denormalised tables), views and stored procedures. How can I identify where a column in a view or table originates from or is used?

I googled on "data lineage" and it apears they are all part of a "platform", a big solution.

How can I parse SQL scripts or SP scripts to find out the source and the destination tables/views and columns?

I am not sure I understand but one way is to do

Sp_helptext your_view
Sp_helptext your_procedure

Or in ssms, open database where artifacts exist, open views node or procedures node, right click and select create

Sp_helptext returns the definition of the view (or SP). (I learned something new)

Let me explain what we want.
Suppose we have a database with a number of tables and views.

create table dbo.DaTable(
	id		int	identity	NOT NULL	primary key
	, a_code	int	NOT NULL
	, amount	int	NOT NULL
	, price		decimal(8, 2)	NOT NULL
	, client_id	int	NOT NULL

CREATE VIEW dbo.view2 AS
SELECT id as view_id, a_code as MyCode, amount MyAmount, price
FROM dbo.DaTable
WHERE a_code IN (6, 12)

create table dbo.aggregatedDaTable(
	id		int	identity	NOT NULL	primary key
	, Filtered_code	int	NOT NULL
	, ordered_amount	int	NOT NULL
	, total_price	decimal(8, 2)	NOT NULL

A SP could contain this code:

INSERT INTO  dbo.aggregatedDaTable (Filtered_code, ordered_amount, total_price)
SELECT MyCode, SUM(MyAmount), SUM(MyAmount * price)
FROM dbo.view2

We want to know in what tables and views data from dbo.DaTable is used.
We want to know where the data in dbo.aggregatedDaTable comes from.

If we can extract the dependencies between the tables/views and their columns, we can query that data set. We need the dependencies in a format somewhat like this:

Table_View				Column			SrcTable	SourceColumn
dbo.aggregatedDaTable	id				null		null
dbo.aggregatedDaTable	Filtered_code	dbo.view2	MyCode
dbo.aggregatedDaTable	ordered_amount	dbo.view2	MyAmount
dbo.aggregatedDaTable	total_price		dbo.view2	MyAmount
dbo.aggregatedDaTable	total_price		dbo.view2	price
dbo.view2				view_id			dbo.DaTable	id
dbo.view2				MyCode			dbo.DaTable	a_code
dbo.view2				MyAmount		dbo.DaTable	amount
dbo.view2				price			dbo.DaTable	price

If you want to know where the data in dbo.aggregatedDaTable.total_price comes from, you get dbo.view2.MyAmount and price.
If you want to know where the data in dbo.view2.MyAmount comes from, you get dbo.DaTable.Amount.

When someone finds an anomaly in dbo.aggregatedDaTable.total_price, you now have a means to find where the data originates from. It may be a bug in the view definition or a data quality problem in the staging table.
Suppose, you find out the data in the staging table dbo.DaTable is corrupted is some way, you can now find out all the tables and views that data is used in.

This should get you on the right track. What you ask for could be done using dynamic sql as you can see the sql artifacts are referenced using a single quote, so you doing this dynamically is possible. heading off to work. will try to look later on

SELECT referencing_schema_name
FROM sys.dm_sql_referencing_entities('dbo.projects', 'Object')

select referenced_schema_name, referenced_entity_name,referenced_minor_name 
from sys.dm_sql_referenced_entities('dbo.projects_sp', 'Object')

select referenced_schema_name, referenced_entity_name,referenced_minor_name 
from sys.dm_sql_referenced_entities('dbo.projects', 'Object')

select *
 from sys.dm_sql_referencing_entities('dbo.projects', 'Object') x

ok try this and post back

SELECT o.name AS ObjectName,
       c.name as ColumnName,
	   t.name as SrcTable,	
	   o.type_desc as ObjectType

  FROM plt.sys.sql_expression_dependencies sed
  inner JOIN sys.objects o ON sed.referencing_id = o.[object_id]
  inner join sys.tables t on t.name = sed.referenced_entity_name
  inner join sys.columns c on c.object_id = t.object_id

Download and install SQL Search from Redgate - then use that to search the system for any matching values.