Prevent recursive CTE visiting nodes multiple times

sql sql-server recursion graph common-table-expression

15,525

Solution 1

The CTE's are recursive.

When your CTE's have multiple initial conditions, that means they also have different recursion stacks, and there is no way to use information from one stack in another stack.

In your example, the recursion stacks will go as follows:

(1) - first IN condition
(1, 2)
(1, 2, 3)
(1, 2, 3, 4)
(1, 2, 3) - no more children
(1, 2) - no more children
(1) - no more children, going to second IN condition

(3) - second condition
(3, 4)
(3) - no more children, returning

As you can see, these recursion stack do not intersect.

You could probably record the visited values in a temporary table, JOIN each value with the temptable and do not follow this value it if it's found, but SQL Server does not support these things.

So you just use SELECT DISTINCT.

Solution 2

This is the approach I used. It has been tested against several methods and was the most performant. It combines the temp table idea suggested by Quassnoi and the use of both distinct and a left join to eliminate redundant paths to the recursion. The level of the recursion is also included.

I left the failed CTE approach in the code so you could compare results.

If someone has a better idea, I'd love to know it.

create table #bar (unique_id int identity(10,10), parent_id int, child_id int)
insert #bar  (parent_id, child_id)
SELECT 1,2 UNION ALL
SELECT 2,3 UNION ALL
SELECT 3,4 UNION ALL
SELECT 2,5 UNION ALL
SELECT 2,5 UNION ALL
SELECT 5,6

SET NOCOUNT ON

;with foo(unique_id, parent_id,child_id, ord, lvl) as (
    -- anchor member that happens to select first and last edges:
    select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
    from #bar where parent_id in (1,3)
union all
-- recursive member:
select b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), foo.lvl+1
    from #bar b
    join foo on b.parent_id = foo.child_id
)
select unique_id, parent_id,child_id, ord, lvl from foo

/***********************************
    Manual Recursion
***********************************/
Declare @lvl as int
Declare @rows as int
DECLARE @foo as Table(
    unique_id int,
    parent_id int,
    child_id int,
    ord int,
    lvl int)

--Get anchor condition
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
    from #bar where parent_id in (1,3)

set @rows=@@ROWCOUNT
set @lvl=0

--Do recursion
WHILE @rows > 0
BEGIN
    set @lvl = @lvl + 1

    INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
    SELECT DISTINCT b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), @lvl
    FROM #bar b
     inner join @foo f on b.parent_id = f.child_id
     --might be multiple paths to this recursion so eliminate duplicates
     left join @foo dup on dup.unique_id = b.unique_id
    WHERE f.lvl = @lvl-1 and dup.child_id is null

    set @rows=@@ROWCOUNT 
END

SELECT * from @foo

DROP TABLE #bar

Solution 3

Do you happen to know which of the two edges is on a deeper level in the tree? Because in that case, you could make edge 3->4 the anchor member and start walking up the tree until you find edge 1->2.

Something like this:

with foo(parent_id, child_id)
as
(
    select parent_id, child_id
    from #bar
    where parent_id = 3

    union all

    select parent_id, child_id
    from #bar b
    inner join foo f on b.child_id = f.parent_id
    where b.parent_id <> 1
)
select *
from foo

Solution 4

Is this what you want to do?

create table #bar (parent_id int, child_id int)
insert #bar values (1,2)
insert #bar values (2,3)
insert #bar values (3,4)

declare @start_node table (parent_id int)
insert @start_node values (1)
insert @start_node values (3)

;with foo(parent_id,child_id) as (
    select
        parent_id
        ,child_id
    from #bar where parent_id in (select parent_id from @start_node)

    union all

    select
        #bar.*
    from #bar
        join foo on #bar.parent_id = foo.child_id
    where foo.child_id not in (select parent_id from @start_node)
)
select parent_id,child_id from foo

Edit - @bacar - I don't think this is the temp table solution Quasnoi was proposing. I believe they were suggesting basically duplicate the entire recursion member contents during each recursion, and use that as a join to prevent reprocessing (and that this is not supported in ss2k5). My approach is supported, and the only change to your original is in the predicate in the recursion member to exclude recursing down paths that were explicitly in your starting set. I only added the table variable so that you would define the starting parent_ids in one location, you could just as easily have used this predicate with your original query:

where foo.child_id not in (1,3)

Solution 5

EDIT -- This doesn't work at all. This is a method to stop chasing triangle routes. It doesn't do what the OP wanted.

Or you can use a recursive token separated string.

I'm at home on my laptop ( no sql server ) so this might not be completely right but here goes.....

; WITH NodeNetwork AS (
  -- Anchor Definition
  SELECT
     b.[parent_Id] AS [Parent_ID]
     , b.[child_Id] AS [Child_ID]
     , CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS [NodePath]
  FROM
     #bar AS b

  -- Recursive Definition
  UNION ALL SELECT
     b.[Parent_Id]
     , b.[child_Id]
     , CAST(nn.[NodePath] + '-' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) AS VARCHAR(MAX))
  FROM
     NodeNetwork AS nn
     JOIN #bar AS b ON b.[Parent_Id] = nn.[Child_ID]
  WHERE
     nn.[NodePath] NOT LIKE '%[-]' + CAST(b.[Parent_Id] AS VARCHAR(MAX)) + '%'
  )
  SELECT * FROM NodeNetwork

Or similar. Sorry It's late and I can't test it. I'll check on Monday morning. Credit for this must go to Peter Larsson (Peso)

The idea was generated here: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=115290

View more solutions

15,525

Author by

bacar

An experienced & analytical software development leader, with polyglot hand-on development skills, expertise in design/architecture of maintainable, scalable systems, and exotics equity derivatives pricing & risk systems. Extensive experience with user-facing systems in investment banks, working closely with quants, and on regulatory projects. A strong record in driving DevOps & agile, improving development processes and self-improvement. https://www.linkedin.com/in/barisacar/

Updated on June 06, 2022

Comments

bacar almost 2 years
Consider the following simple DAG:
```
  1->2->3->4
```
And a table, #bar, describing this (I'm using SQL Server 2005):
```
parent_id   child_id
1           2
2           3
3           4
//... other edges, not connected to the subgraph above
```
Now imagine that I have some other arbitrary criteria that select the first and last edges, i.e. 1->2 and 3->4. I want to use these to find the rest of my graph.

I can write a recursive CTE as follows (I'm using terminology from MSDN):
```
with foo(parent_id,child_id) as (
// anchor member that happens to select first and last edges:
select parent_id,child_id from #bar where parent_id in (1,3)
union all
// recursive member:
select #bar.* from #bar
join foo on #bar.parent_id = foo.child_id
)
select parent_id,child_id from foo
```
However, this results in edge 3->4 being selected twice:
```
parent_id  child_id
1          2
3          4
2          3
3          4    // 2nd appearance!
```
How can I prevent the query from recursing into subgraphs that have already been described? I could achieve this if, in my "recursive member" part of the query, I could reference all data that has been retrieved by the recursive CTE so far (and supply a predicate indicating in the recursive member excluding nodes already visited). However, I think I can access data that was returned by the last iteration of the recursive member only.

This will not scale well when there is a lot of such repetition. Is there a way of preventing this unnecessary additional recursion?

Note that I could use "select distinct" in the last line of my statement to achieve the desired results, but this seems to be applied after all the (repeated) recursion is done, so I don't think this is an ideal solution.

Edit - hainstech suggests stopping recursion by adding a predicate to exclude recursing down paths that were explicitly in the starting set, i.e. recurse only where foo.child_id not in (1,3). That works for the case above only because it simple - all the repeated sections begin within the anchor set of nodes. It doesn't solve the general case where they may not be. e.g., consider adding edges 1->4 and 4->5 to the above set. Edge 4->5 will be captured twice, even with the suggested predicate. :(
bacar almost 15 years

I only have one initial condition but it can select multiple rows. Do you have a link to any documentation that states that each row is treated with independent stacks? (I presume I could emulate the desired effect using my own temp tables - not tried yet) I don't have enough 'reputation' to vote up your answer >:(
bacar almost 15 years

aha, you revised your answer to suggest temp table while I was commenting :)
bacar almost 15 years

I'm afraid I don't - I generally don't even know whether the edges selected in the initial query might belong to the same overall graph, or not :( Thanks though!
bacar almost 15 years

I don't want to eliminate routes that don't end up in my 'last' edge (I want everything that connects to any of the edges specified in the anchor) - but I agree that this is an interesting (separate) problem :)
bacar almost 15 years

Yes - this is an implementation of the "temp table" solution suggested by Quassnoi Thanks!
Andomar almost 15 years

If you add (6,1), would that count as being connected?
bacar almost 15 years

OK, I see now - that works for this simple case where all the repeated parts are contained within the initial set - it doesn't prevent repeatedly recursing into other subgraphs that are shared by, but not included directly within the initial set. e.g. consider graph: insert #bar values (1,2) insert #bar values (2,3) insert #bar values (1,3) insert #bar values (3,4) And initial start node of 1 only. Your query would still pick out edge 3->4 twice, because I can't insert the nodes I've visited into the table variable in the recursive member part (as you mention). :(
bacar almost 15 years

(Note that an ability to track nodes you've already visited would actually help in not infinitely looping through cyclic graphs, too!)
Andomar almost 15 years

Heh, I meant (6,1) to your original example! Or say, (7,1) to mine. Or is it given that (1,2) is an edge node and can't have entries before it?
bacar almost 15 years

Oh, right! Sorry. Then no, I don't want to select that. Sorry, I've been rather lax with my terminology - I only want to select the subgraphs that are "downstream" of the initial set of nodes.
Andomar almost 15 years

Thanks, I guess that's implicitly clear when people talk about graphs (I oughta read a book about them some day!)
ahains almost 15 years

@bacar - thanks for the helpful example, you may want to edit your question to incorporate that. You can't save any kind of state with the CTE, so I think you may have to go with an iterative tsql approach or something with the CLR.
bacar almost 15 years

Done - have tried to keep it as simple as possible though. Thanks!
bacar about 13 years

Not sure I've followed what this is meant to achieve. This actually selects edge 3->4 three times, rather than two! (I wanted it just once). It also select 2->3 twice. It also doesn't apply the original "arbitrary criteria" (where parent_id in (1,3)). If I add it, it gives the same results as I got in the initial problem statement.
bacar about 13 years

(FWIW - Taking @Quassnoi's answer/comments as true, I don't see how any solution involving just a recursive CTE could possibly solve the problem, irrespective of how you manipulate the data within it.)
Transact Charlie about 13 years

ya - -you are right. This would stop you from chasing triangle routes (if you had some entry where a child somehow pointed back at it's own parent or grandparent). It doesn't actually answer your question.
BD. about 10 years

Awesome, easy to understand and runs FAST! Thanks!
Smitty about 3 years

Super fast execution using this query. Helped me reduce the time for a stored proc from 2 mins to under 1 second. I know it isn't the answer to "how to do this w/ a CTE" but it is a very performant solution if you are willing to break away from the CTE recursion.