Why & When should I use SPARSE COLUMN? (SQL SERVER 2008)

40,328

Solution 1

A sparse column doesn't use 4x the amount of space to store a value, it uses a (fixed) 4 extra bytes per non-null value. (As you've already stated, a NULL takes 0 space.)

  • So a non-null value stored in a bit column would be 1 bit + 4 bytes = 4.125 bytes. But if 99% of these are NULL, it is still a net savings.

  • A non-null value stored in a GUID (UniqueIdentifier) column is 16 bytes + 4 bytes = 20 bytes. So if only 50% of these are NULL, that's still a net savings.

So the "expected savings" depends strongly on what kind of column we're talking about, and your estimate of what ratio will be null vs non-null. Variable width columns (varchars) are probably a little more difficult to predict accurately.

This Books Online Page has a table showing what percentage of different data types would need to be null for you to end up with a benefit.

So when should you use a Sparse Column? When you expect a significant percentage of the rows to have a NULL value. Some examples that come to mind:

  • A "Order Return Date" column in an order table. You would hope that a very small percent of sales would result in returned products.
  • A "4th Address" line in an Address table. Most mailing addresses, even if you need a Department name and a "Care Of" probably don't need 4 separate lines.
  • A "Suffix" column in a customer table. A fairly low percent of people have a "Jr." or "III" or "Esquire" after their name.

Solution 2

  • Storing a null in a sparse column takes up no space at all.

  • To any external application the column will behave the same

  • Sparse columns work really well with filtered indexes as you will only want to create an index to deal with the non-empty attributes in the column.

  • You can create a column set over the sparse columns that returns an xml clip of all of the non-null data from columns covered by the set. The column set behaves like a column itself. Note: you can only have one column set per table.

  • Change Data Capture and Transactional replication both work, but not the column sets feature.

Downsides

  • If a sparse column has data in it it will take 4 more bytes than a normal column e.g. even a bit (0.125 bytes normally) is 4.125 bytes and unique identifier rises form 16 bytes to 20 bytes.

  • Not all data type can be sparse: text, ntext, image, timestamp, user-defined data type, geometry, or geography or varbinray (max) with the FILESTREAM attribute cannot be sparse. (Changed17/5/2009 thanks Alex for spotting the typo)

  • computed columns can't be sparse (although sparse columns can take part in a calculation in another computed column)

  • You can't apply rules or have default values.

  • Sparse columns cannot form part of a clustered index. If you need to do that use a computed column based on the sparse column and create the clustered index on that (which sort of defeats the object).

  • Merge replication doesn't work.

  • Data compression doesn't work.

  • Access (read and write) to sparse columns is more expensive, but I haven't been able to find any exact figures on this.

Reference

Solution 3

You're reading it wrong - it never takes 4x the space.

Specifically, it says 4* (4 bytes, see footnote), not 4x (multiply by 4). The only case where it's exactly 4x the space is a char(4), which would see savings if the NULLs exist more than 64% of the time.

"*The length is equal to the average of the data that is contained in the type, plus 2 or 4 bytes."

Solution 4

| datetime NULL      | datetime SPARSE NULL | datetime SPARSE NULL |
|--------------------|----------------------|----------------------|
| 20171213 (8 bytes) | 20171213 (12 bytes)  | 20171213 (12 bytes)  |
| NULL     (8 bytes) | 20171213 (12 bytes)  | 20171213 (12 bytes)  |
| 20171213 (8 bytes) | NULL      (0 bytes)  | NULL      (0 bytes)  |
| NULL     (8 bytes) | NULL      (0 bytes)  | NULL      (0 bytes)  |

You lose 4 bytes not just once per row; but for every cell in the row that is not null.

Share:
40,328
priyanka.sarkar
Author by

priyanka.sarkar

Student

Updated on July 05, 2022

Comments

  • priyanka.sarkar
    priyanka.sarkar almost 2 years

    After going thru some tutorials on SQL Server 2008's new feature "SPARSE COLUMN", I have found that it doesn't take any space if the column value is 0 or NULL but when there is a value, it takes 4 times the space a regular(non sparse) column holds.

    If my understanding is correct, then why I will go for that at the time of database design? And if I use that, then at what situation will I be?

    Also out of curiosity, how does no space get reserved when a column is defined as sparse column (I mean to say, what is the internal implementation for that?)

  • priyanka.sarkar
    priyanka.sarkar over 14 years
    Thanks .. I already covered that site. But even in that site failed to answer the questions which I have asked here. In which situation should I go with sparse column when it takes 4 times extra space!
  • topski
    topski over 11 years
    There was a suggested edit that it is 4 extra bytes per non-null field. This is incorrect, I've rolled this back to say row. To clarify with an example: if we have a 100-row table with a GUID field, 10 rows of which have values, 90 rows are NULL. A GUID field is normally 16 bytes, so a normal (non-sparse) GUID would be 16*100 = 1600 bytes. If we made this a sparse field, the space used would instead be only (16+4)*10 = 200 bytes, not (16*10)+4 = 164 bytes. The 4-byte penalty applies to every populated row.
  • Tim Post
    Tim Post over 11 years
    Stack overflow isn't a place for protracted discussion .. additionally, you answered rather late basically just summarizing information that was already neatly summarized and accepted. I'm not removing this, as it does answer the question, but the community might react negatively to it.
  • aelveborn
    aelveborn about 10 years
    But surely, if there are multiple sparse fields in a row, the 4-byte penalty applies to each of the fields separately? In which case it would be more natural to say per field (which would really mean per field per row). It actually took me quite a while to realize your understanding of per field was per field per table.
  • topski
    topski about 10 years
    Yes, you'd have to calculate the predicted penalty separately for multiple sparse fields in the same table. I guess it would make sense to evaluate them together if you expect them to coincide (like if you have 5 fields that are only populated if a product gets returned or something).
  • OGHaza
    OGHaza about 10 years
    Row feels very misleading to me. Tables have columns, records have fields, correct? 4 bytes per non-null column might be ambiguous, but I don't think 4 bytes per non-null field is. As it is this answer implies there is a 4 byte cap per record, no matter how many non-null sparse fields it has.
  • topski
    topski almost 10 years
    I get your point, @OGHaza, I've always used "column" and "field" interchangeably, so I'm not sure if everyone would understand the distinction based on those terms alone. But yes, you'd get multiple penalties if you had multiple sparse columns in a table. I guess I assumed this would be obvious, the "it" in my very first sentence is really "each sparse column". I'll think about how to clarify the answer.
  • ypercubeᵀᴹ
    ypercubeᵀᴹ almost 10 years
    @OGHaza Regarding Rows vs Records
  • OGHaza
    OGHaza almost 10 years
    ypercube aha so my terminology is all over the place :P @BradC, having read your comment, and then reread the above I do see what you mean and actually your phrasing seems about right. Apologies
  • topski
    topski almost 10 years
    I've edited some of the terminology in the answer to make it more clear, and hopefully avoid the field/row/column confusion.
  • MattW
    MattW over 9 years
    Space is not the only thing you should be considering. Update load is also important - remember, any time you null out or bring in a value, you're changing the row length and hence having to potentially rebuild, and possibly split, your page. So while a person's name suffix or an address line 4 is still a good candidate for sparse, return date might not be - if you're say Amazon you would probably be happier chewing up the space to reduce the cost of updates.
  • Trisped
    Trisped over 9 years
    @OGHaza Actually, ypercube's meta link is an example of why you can't ask random people for a definition and expect a correct definition. The most up voted answer takes two people's personal definition and presents it as truth, the next two take a survey of terms in documentation, and the next three are unsupported opinions. While they do a good job of pointing out what a record is, they fail to provide backup for what it is not, what a record is, what to call where a value is in a row, or provide any documentation defining the terms Row, Record, and Field.
  • sam yi
    sam yi about 9 years
  • Brain2000
    Brain2000 about 7 years
    I'm going to upvote this. The explanation of all the sparse fields being held in one XML field clears things up in my head a bit. No-one else mentioned that.
  • Martin Smith
    Martin Smith over 6 years
    @Brain2000 probably because it is incorrect. This is how sparse columns are actually stored. improve.dk/sparse-column-storage-ndash-the-sparse-vector. They can be returned to the client as XML if using column sets,