What are the advantages of a query using a derived table(s) over a query not using them?

12,258

Solution 1

In your examples, the derived table is not strictly necessary. There are numerous cases where you might need to join to an aggregate or similar, and a derived table is really the only way to handle that:

SELECT *
FROM A
LEFT JOIN (
    SELECT x, SUM(y)
    FROM B
    GROUP BY x
) AS B
    ON B.x = A.x

In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:

SELECT x, y, z1, z2
FROM (
    SELECT *
           ,x + y AS z1
           ,x - y AS z2
    FROM (
        SELECT x * 2 AS y
        FROM A
    ) AS A
) AS A

As far as maintainability, using stacked CTEs or derived tables (they are basically equivalent) and can make for more readable and maintainable code, as well as facilitating cut-and-paste re-use and refactoring. The optimizer can typically flatten then very easily.

I typically use stacked CTEs instead of nesting for a little better readability (same two examples):

WITH B AS (
    SELECT x, SUM(y)
    FROM B
    GROUP BY x
)
SELECT *
FROM A
LEFT JOIN B
    ON B.x = A.x

WITH A1 AS (
    SELECT x * 2 AS y
    FROM A
)
,A2 AS (
    SELECT *
           ,x + y AS z1
           ,x - y AS z2
    FROM A1
)
SELECT x, y, z1, z2
FROM A2

Regarding your question about:

SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2 
FROM A 

This has the x * 2 code repeated 3 times. If this business rule needs to change, it will have to change in 3 places - a recipe for injection of defects. This gets compounded any time you have intermediate calculations which need to be consistent and defined in only one place.

This would not be as much of a problem if SQL Server's scalar user-defined functions could be inlined (or if they performed acceptably), you could simply build your UDFs to stack your results and the optimizer would elimnate redundant calls. Unfortunately SQL Server's scalar UDF implementation cannot handle that well for large sets of rows.

Solution 2

I typically use a derived table (or a CTE, which is a sometimes-superior alternative to derived queries in SQL 2005/2008) to simplify reading and building queries, or in cases where SQL doesn't allow me to do a particular operation.

For example, one of the things you can't do without a derived table or CTE is put an aggregate function in a WHERE clause. This won't work:

SELECT  name, city, joindate
FROM    members 
        INNER JOIN cities ON cities.cityid = derived.cityid
WHERE   ROW_NUMBER() OVER (PARTITION BY cityid ORDER BY joindate) = 1

But this will work:

SELECT  name, city, joindate
FROM    
( 
    SELECT  name, 
            cityid,
            joindate,
            ROW_NUMBER() OVER (PARTITION BY cityid ORDER BY joindate) AS rownum 
    FROM    members 
) derived INNER JOIN cities ON cities.cityid = derived.cityid
WHERE   rn = 1

Advanced caveats, especially for large-scale analytics

If you're working on relatively small data sets (not gigabytes) you can probably stop reading here. If you're working with gigabytes ot terabytes of data and using derived tables, read on...

For very large-scale data operations, it's sometimes preferable to create a temporary table instead of using a derived query. This may happen if SQL's statistics suggest that your derived query will return many more rows than the query will actually return, which happens more often than you'd think. Queries where your main query self-joins with a non-recursive CTE are are also problematic.

It's also possible that derived tables will generate unexpected query plans. For example, even if you put a strict WHERE clause in your derived table to make that query very selective, SQL Server may re-order your query plan so your WHERE clause is evaluated in the query plan. See this Microsoft Connect feedback for a discussion of this issue and a workaround.

So, for very performance-intensive queries (especially data-warehousing queries on 100GB+ tables), I always like to prototype a temporary-table solution to see if you get better performance than you get from a derived table or CTE. This seems counter-intuitive since you're doing more I/O than an ideal single-query solution, but with temp tables you get total control over the query plan used and the order each subquery is evaluated. Sometimes this can increase performance 10x or more.

I also tend to prefer temp tables in cases where I have to use query hints to force SQL to do what I want-- if the SQL optimizer is already "misbehaving", temp tables are often a clearer way to force them to act the way you want.

I'm not suggesting this is a common case-- most of the time the temporary table solution will be at least a little worse and sometimes query hints are one's only recourse. But don't assume that a CTE or derived-query solution will be your fastest option either. Test, test, test!

Solution 3

Derived tables often replace correlated subqueries and are generally considerably faster.

They also can be used to limit greatly the number of records searched thorugh for a large table and thus may also improve speed of the query.

As with all potentially performance improving techniques, you need to test to see if they did improve performance. A derived table will almost always strongly outperform a correlated subquery but there is the possibility it may not.

Further there are times when you need to join to data containing an aggregate calulation which is almost impossible to do without a derived table or CTE (which is essentually another way of writing a derived tbale in many cases).

Derived tables are one of my most useful ways of figuring out complex data for reporting as well. You can do this in pieces using table variables or temp tables too, but if you don't want to see the code in procedural steps, people often change them to derived tables once they work out what they want using temp tables.

Aggregating data from a union is another place where you need derived tables.

Solution 4

Using your terminology and example the derived tables is only more complex with no advantages. However, some things require a derived table. Those can be in the most complex cases CTEs (as demonstrated above). But, simple joins can demonstrate the necessity of derived tables, all you must do is craft a query that requires the use of an aggregate, here we use a variant of the quota query to demonstrate this.

Select all of the customer's most expensive transactions

SELECT transactions.*
FROM transactions
JOIN (
  select user_id, max(spent) AS spent
  from transactions
  group by user_id
) as derived_table
USING (
  derived_table.user_id = transaction.user_id
  AND derived_table.spent = transactions.spent
)

Solution 5

In this case, the derived table allows YEAR(O.OrderDate) = 1996 in a WHERE clause.

In the outer where clause, it's useless because it would change the JOIN to INNER.

Personally, I prefer the derived table (or CTE) construct because it puts the filter into the correct place

Another example:

SELECT
     C.CustomerID, C.CompanyName,
     COUNT(D.OrderID) AS TotalOrders,
     COUNT(DISTINCT D.ProductID) AS DifferentProducts
FROM
     Customers C
     LEFT OUTER JOIN
     (
     SELECT
        OrderID, P.ProductID
     FROM
        Orders O
        JOIN
        Products P ON O.somethingID = P.somethingID
     WHERE YEAR(Orders.OrderDate) = 1996
     ) D
     ON C.CustomerID = D.CustomerID
GROUP BY
     C.CustomerID, C.CompanyName

In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products. Of course, this is contrived but I hope I've captured the concept

Edit:

I need to explicitly JOIN T1 and T2 before the JOIN onto MyTable. It does happen. The derived T1/T2 join can be a different query to 2 LEFT JOINs with no derived table. It happens quite often

SELECT
     --stuff--
FROM
     myTable M1
     LEFT OUTER JOIN
     (
     SELECT
        T1.ColA, T2.ColB
     FROM
        T1
        JOIN
        T2 ON T1.somethingID = T2.somethingID
     WHERE
        --filter--
     ) D
     ON M1.ColA = D.ColA AND M1.ColB = D.ColB
Share:
12,258
AspOnMyNet
Author by

AspOnMyNet

Updated on June 30, 2022

Comments

  • AspOnMyNet
    AspOnMyNet about 2 years

    I know how derived tables are used, but I still can’t really see any real advantages of using them.

    For example, in the following article http://techahead.wordpress.com/2007/10/01/sql-derived-tables/ the author tried to show benefits of a query using derived table over a query without one with an example, where we want to generate a report that shows off the total number of orders each customer placed in 1996, and we want this result set to include all customers, including those that didn’t place any orders that year and those that have never placed any orders at all( he’s using Northwind database ).

    But when I compare the two queries, I fail to see any advantages of a query using a derived table ( if nothing else, use of a derived table doesn't appear to simplify our code, at least not in this example):

    Regular query:

    SELECT C.CustomerID, C.CompanyName, COUNT(O.OrderID) AS TotalOrders
    FROM Customers C LEFT OUTER JOIN Orders O ON
           C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
    GROUP BY C.CustomerID, C.CompanyName
    

    Query using a derived table:

    SELECT C.CustomerID, C.CompanyName, COUNT(dOrders.OrderID) AS TotalOrders
    FROM Customers C LEFT OUTER JOIN
            (SELECT * FROM Orders WHERE YEAR(Orders.OrderDate) = 1996) AS dOrders
         ON
            C.CustomerID = dOrders.CustomerID
    GROUP BY C.CustomerID, C.CompanyName
    

    Perhaps this just wasn’t a good example, so could you show me an example where benefits of derived table are more obvious?

    thanx

    REPLY TO GBN:

    In this case, you couldn't capture both products and order aggregates if there is no relation between Customers and Products.

    Could you elaborate what exactly you mean? Wouldn’t the following query produce the same result set as your query:

    SELECT 
         C.CustomerID, C.CompanyName,
         COUNT(O.OrderID) AS TotalOrders,
         COUNT(DISTINCT P.ProductID) AS DifferentProducts 
    FROM Customers C LEFT OUTER JOIN Orders O ON
           C.CustomerID = O.CustomerID AND YEAR(O.OrderDate) = 1996
       LEFT OUTER JOIN Products P ON 
           O.somethingID = P.somethingID  
    GROUP BY C.CustomerID, C.CompanyName
    

    REPLY TO CADE ROUX:

    In addition, if expressions are used to derive columns from derived columns with a lot of shared intermediate calculations, a set of nested derived tables or stacked CTEs is the only way to do it:

    SELECT x, y, z1, z2
    FROM (
        SELECT *
               ,x + y AS z1
               ,x - y AS z2
        FROM (
            SELECT x * 2 AS y
            FROM A
        ) AS A
    ) AS A
    

    Wouldn't the following query produce the same result as your above query:

    SELECT x, x * 2 AS y, x + x*2 AS z1, x - x*2 AS z2
    FROM A