Database design: preferred field length for file paths


Solution 1

You can use VARCHAR(MAX) or NVARCHAR(MAX).

These are variable-length fields, meaning they are designed to store values of differing lengths. A large declared maximum adds no storage overhead for shorter values.

Defining MAX means the field can hold up to 2^31-1 bytes (about 2 GB).

From MSDN (varchar), nvarchar has similar documentation:

Use varchar when the sizes of the column data entries vary considerably.

Use varchar(max) when the sizes of the column data entries vary considerably, and the size might exceed 8,000 bytes.
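As a minimal sketch of this approach, assuming a hypothetical dbo.StoredFiles table (the table and column names are illustrative, not from the question):

```sql
-- Hypothetical table using a (max) column for paths.
-- NVARCHAR preserves Unicode characters that can appear in file names.
CREATE TABLE dbo.StoredFiles
(
    file_id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    file_path nvarchar(max)     NOT NULL
);

-- Both short and long paths fit without any per-row size tuning.
INSERT INTO dbo.StoredFiles (file_path)
VALUES (N'/tmp/aaa/bbb'), (N'C:\temp\xxx\yyy');
```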

Solution 2

Use the appropriate length for the data you intend to support. Since you're using SQL Server, you should use nvarchar(260) as the upper limit for storing path names, since that is the specification limit for typical Windows machines. Under certain circumstances you can create paths longer than that, but Windows Explorer tends to have problems handling them. SQL Server itself cannot handle filenames longer than 260 characters, and that includes SQL Server on Linux.
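A sketch of this design, using a hypothetical dbo.StoredPaths table (names are illustrative):

```sql
-- Hypothetical table sized to the Windows MAX_PATH convention.
CREATE TABLE dbo.StoredPaths
(
    path_id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    file_path nvarchar(260)     NOT NULL
);

-- At 260 characters * 2 bytes = 520 bytes, the column fits comfortably
-- under the non-clustered index key-size limit, so it can be indexed:
CREATE UNIQUE INDEX UX_StoredPaths_file_path
    ON dbo.StoredPaths (file_path);
```

A bounded length is what makes the unique index possible; a (max) column could not be used as an index key at all.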

I can prove SQL Server uses an nvarchar(260) column internally to store SQL Server Database filenames, with the path included. Checking the definition of the sys.master_files view, we see the following T-SQL:

 CREATE VIEW sys.master_files AS
    SELECT
        database_id     = f.dbid,
        file_id         = f.fileid,
        file_guid       = f.fileguid,
        type            = f.filetype,
        type_desc       = ft.name,
        data_space_id   = f.grpid,
        name            = f.lname,
        physical_name   = f.pname,
        state           = convert(tinyint, case f.filestate     -- Map enum EMDFileState to AvailablityStates
                                when 0 then 0 when 10 then 0    -- ONLINE
                                when 4 then 7   -- DEFUNCT
                                when 5 then 3 when 9 then 3 -- RECOVERY_PENDING
                                when 7 then 1 when 8 then 1 when 11 then 1  -- RESTORING
                                when 12 then 4  -- SUSPECT
                                else 6 end),    -- OFFLINE
        state_desc      = st.name,
        f.size,
        max_size            = f.maxsize,
        f.growth,
        is_media_read_only  = sysconv(bit, f.status & 8),       -- FIL_READONLY_MEDIA
        is_read_only            = sysconv(bit, f.status & 16),  -- FIL_READONLY
        is_sparse           = sysconv(bit, f.status & 256), -- FIL_SPARSE_FILE
        is_percent_growth   = sysconv(bit, f.status & 32),  -- FIL_PERCENT_GROWTH
        is_name_reserved        = sysconv(bit, case f.filestate when 3 then 1 else 0 end), -- x_efs_DroppedReusePending
        create_lsn          = GetNumericLsn(f.createlsn),
        drop_lsn                = GetNumericLsn(f.droplsn),
        read_only_lsn           = GetNumericLsn(f.readonlylsn),
        read_write_lsn      = GetNumericLsn(f.readwritelsn),
        differential_base_lsn   = GetNumericLsn(f.diffbaselsn),
        differential_base_guid  = f.diffbaseguid,
        differential_base_time  = nullif(f.diffbasetime, 0),
        redo_start_lsn          = GetNumericLsn(f.redostartlsn),
        redo_start_fork_guid    = f.redostartforkguid,
        redo_target_lsn     = GetNumericLsn(f.redotargetlsn),
        redo_target_fork_guid   = f.forkguid,
        backup_lsn          = GetNumericLsn(f.backuplsn),
        credential_id       = cr.credential_id
    FROM sys.sysbrickfiles f
    LEFT JOIN sys.syspalvalues st ON st.class = 'DBFS' AND st.value = f.filestate
    LEFT JOIN sys.syspalvalues ft ON ft.class = 'DBFT' AND ft.value = f.filetype
    LEFT JOIN sys.credentials cr ON f.pname LIKE cr.name + N'%' COLLATE database_default
    WHERE f.dbid < 0x7fff -- consistent with sys.databases
        AND f.pruid = 0
        AND f.filestate NOT IN (1, 2)   -- x_efs_Dummy, x_efs_Dropped
        AND has_access('MF', 1) = 1

Microsoft Docs for sys.master_files says this about the physical_name column:

physical_name nvarchar(260) Operating-system file name.

But let's not trust that. We see the physical file name is referenced as physical_name = f.pname. And the table alias "f" points to FROM sys.sysbrickfiles f. Therefore, SQL Server stores the filename in sys.sysbrickfiles, an internal table that is only visible from the Dedicated Administrator Connection, or DAC, as it's frequently known. Connecting to the DAC and generating a temp table from the output of sys.sysbrickfiles, we see the following:

CREATE TABLE #sysbrickfiles
(
      brickid           int              NOT NULL
    , dbid              int              NOT NULL
    , pruid             int              NOT NULL
    , fileid            int              NOT NULL
    , grpid             int              NOT NULL
    , status            int              NOT NULL
    , filetype          tinyint          NOT NULL
    , filestate         tinyint          NOT NULL
    , size              int              NOT NULL
    , maxsize           int              NOT NULL
    , growth            int              NOT NULL
    , lname             nvarchar(128)    NOT NULL
    , pname             nvarchar(260)    NOT NULL
    , createlsn         binary(10)       NULL
    , droplsn           binary(10)       NULL
    , fileguid          uniqueidentifier NULL
    , internalstatus    int              NOT NULL
    , readonlylsn       binary(10)       NULL
    , readwritelsn      binary(10)       NULL
    , readonlybaselsn   binary(10)       NULL
    , firstupdatelsn    binary(10)       NULL
    , lastupdatelsn     binary(10)       NULL
    , backuplsn         binary(10)       NULL
    , diffbaselsn       binary(10)       NULL
    , diffbaseguid      uniqueidentifier NULL
    , diffbasetime      datetime         NOT NULL
    , diffbaseseclsn    binary(10)       NULL
    , redostartlsn      binary(10)       NULL
    , redotargetlsn     binary(10)       NULL
    , forkguid          uniqueidentifier NULL
    , forklsn           binary(10)       NULL
    , forkvc            bigint           NOT NULL
    , redostartforkguid uniqueidentifier NULL
);

As you can see, the pname column is indeed defined as nvarchar(260).

Also, if we attempt to create a database using a filename that is longer than 260 characters, we see an error is returned:

Msg 103, Level 15, State 3, Line 7
The file that starts with 'F:\AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAARGH.mdf' is too long. Maximum length is 259.

Using anything other than an nvarchar(260) column to store filenames in SQL Server is wasteful and creates technical debt.

The length of a column is extremely important performance-wise. Column length directly affects:

  • Memory grants for queries against the column. When the query processor creates a query plan, it uses the size of each column present in the query as the basis for the amount of memory required to run the query. It doesn't use the actual size of the data in each column; instead it estimates that the average size of the data will be 50% of the maximum length of the column.
  • Ability to index the column efficiently. Larger columns create significantly larger indexes. Larger indexes require more memory and disk throughput than smaller indexes. SQL Server has a maximum key length of 1,700 bytes for non-clustered indexes (as of SQL Server 2016) and a maximum key length of 900 bytes for clustered indexes. If you attempt to create an index on columns larger than those maximums, you get errors, possibly not until run-time, when it can be very costly to fix.
  • Character-based primary/foreign key performance is severely affected by larger column lengths. When primary keys are referenced via foreign keys, the size requirements for memory, disk, and I/O are duplicated for each foreign key. Take for example a Customer table where the key is the CustomerName column, defined as varchar(500). Every table that references customers will now require a 500-byte CustomerName column. If that column were defined as varchar(100) instead, every query referencing those columns would save 200 bytes per row in memory and disk I/O.
  • Erik Darling shows that Predicate Pushdown does not work for (MAX) data types, which can severely limit performance.
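The indexing point above can be demonstrated directly: SQL Server rejects a (max) column as an index key outright, while a bounded column indexes fine. A sketch with a hypothetical dbo.PathDemo table:

```sql
-- Hypothetical table contrasting a (max) path column with a bounded one.
CREATE TABLE dbo.PathDemo
(
    id         int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    path_max   nvarchar(max) NULL,
    path_fixed nvarchar(260) NULL
);

-- Fails with an error: columns of type nvarchar(max) cannot be
-- used as index key columns.
-- CREATE INDEX IX_PathDemo_max ON dbo.PathDemo (path_max);

-- Succeeds: 520 bytes is within the 1,700-byte non-clustered key limit.
CREATE INDEX IX_PathDemo_fixed ON dbo.PathDemo (path_fixed);
```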

Solution 3

If you're using SQL Server, it's good to know that Microsoft uses nvarchar(260) fields to store the file path and name in the system tables (like sys.database_files, sys.sysaltfiles, or sys.master_files).

Column name      Data type        Description
-------------    -------------    ---------------------------
physical_name    nvarchar(260)    Operating-system file name.

A good practice could be to use the same format to store your path and file name.

You will, of course, need to enforce the size in your UI to be sure that it will not be truncated during INSERT or UPDATE.
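Beyond the UI, the length can also be enforced at the database layer. A hedged sketch, assuming a hypothetical dbo.Documents table and procedure (both names are made up for illustration):

```sql
-- Hypothetical table mirroring the system tables' 260-character format.
CREATE TABLE dbo.Documents
(
    doc_id    int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    file_path nvarchar(260)     NOT NULL
);
GO

-- Hypothetical guard procedure: reject over-length paths explicitly
-- rather than letting the INSERT fail with a truncation error.
CREATE PROCEDURE dbo.AddDocument
    @file_path nvarchar(max)
AS
BEGIN
    IF LEN(@file_path) > 260
        THROW 50001, N'file_path exceeds 260 characters.', 1;

    INSERT INTO dbo.Documents (file_path)
    VALUES (@file_path);
END;
```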

Solution 4

The length of a file path cannot be predicted. It could be as short as 'C:\' or quite long, like 'C:\Program Files\Microsoft SQL Server\110\LocalDB\Binn\Resources\1033', or even longer. At the database level there is no harm in using something like VARCHAR(MAX).

See Maximum size of VARCHAR(MAX)

Solution 5

I suggest you do not store the paths in your existing table. Create a new table with a sequential counter as the clustered primary key and a character column of the maximum length your database supports. I use SQL Server, so I would use varchar(max).

Create a column in your data table to hold the primary key of the "paths" table. Insert into the "paths" table first then use the primary key as the foreign key back in your data table.

The advantage of storing the value in another table is that it does not influence the row size of the base table. Queries of the base table which do not involve the paths do not suffer from having to pull in a large character value, which increases the I/O traffic.
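The two-table layout described above can be sketched as follows, with hypothetical table and column names:

```sql
-- Hypothetical "paths" table: sequential surrogate key, max-length column.
CREATE TABLE dbo.Paths
(
    path_id int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    path    varchar(max)      NOT NULL
);

-- Hypothetical data table holding only the key of the path.
CREATE TABLE dbo.DataRows
(
    data_id int NOT NULL PRIMARY KEY,
    path_id int NOT NULL REFERENCES dbo.Paths (path_id)
    -- ...other data columns...
);

-- Insert into the paths table first, then use its key as the foreign key.
INSERT INTO dbo.Paths (path) VALUES ('/tmp/aaa/bbb');
INSERT INTO dbo.DataRows (data_id, path_id)
VALUES (1, SCOPE_IDENTITY());
```

Queries that don't need the path never touch dbo.Paths, which is the I/O saving the answer describes.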

Author: Frosty Z (https://www.linkedin.com/in/maxime-pacary)

Updated on July 09, 2022

Comments

  • Frosty Z
    Frosty Z almost 2 years

    I have to store file paths in a DB field (/tmp/aaa/bbb, C:\temp\xxx\yyy, etc.). I can't really tell how long they could be.

    Given this comparison of file systems (http://en.wikipedia.org/wiki/Comparison_of_file_systems) and this Windows file-naming documentation (http://msdn.microsoft.com/en-us/library/aa365247.aspx), depending on the file system there could theoretically be no length limit for a path.

    I guess that defining this field as a LONGBLOB or VARCHAR(very high value) wouldn't be wise. I've thought about something like VARCHAR(1024), which should be suitable for the most frequent (even if not all) cases, and not too big as a DB field. What would you recommend?

    Thanks.

  • Frosty Z
    Frosty Z over 13 years
    I guess that such a technique should be used carefully, since it involves an additional join (with its performance impacts) each time we want the main table data + the path...
  • RC_Cleland
    RC_Cleland over 13 years
    Frosty Z, you are correct that it should be used carefully, as any design should be. My experience is that the reduction in table space (I/O) far outweighs any join cost. Also, the main table will be reduced by a WHERE clause prior to the join, so the overhead of retrieving data for the join is minimized.
  • David Aldridge
    David Aldridge about 9 years
    It's difficult to judge the effectiveness of the technique without knowing more about the table structure. If this is a table that holds only the details of files then keeping the paths in this table probably makes sense. If the majority of the table accesses do not require the path to be read then it might be worthwhile, but the saving would mostly affect full table scans. It would always make the retrieval of the path less efficient, of course.
  • Paul
    Paul almost 6 years
    I found this answer useful for defining a Windows filename length (excluding the folder path).
  • Hannah Vernon
    Hannah Vernon about 5 years
    @Paul - actually, the 260 character limit is including the path.
  • Hannah Vernon
    Hannah Vernon about 5 years
    Actually, there is harm in using (max) types. See my answer for the reasons, not least of which are unnecessarily large memory allocations. The Windows NTFS file system supports paths up to 32768 characters in length, but only when using the Unicode APIs. When using long path names, prefix the path with the characters \\?\. However, SQL Server does not support long filenames, even if you attempt to use the Unicode API method. As of 2019, SQL Server only supports filenames up to 260 characters in length.
  • CervEd
    CervEd over 3 years
    While this is a very thorough and detailed answer, it heavily presumes that all paths stored are limited to 260 chars. That's not necessarily the case if they are Unix paths, URLs, etc.
  • Hannah Vernon
    Hannah Vernon over 3 years
    @CervEd - that's a good point, especially since SQL Server can run on Linux. Having said that, the point is to not just blindly use a varchar(max) for the memory and performance reasons noted in my answer. Choose a maximum length that makes sense and that you're prepared to support.
  • CervEd
    CervEd over 3 years
    No, I get it, and I think it makes sense in most scenarios and workloads, but there are instances where 260 wouldn't be enough. In such cases the operation would fail, at which point one could revisit whether VARCHAR(n) should be larger.