SQL Server uses high CPU when searching inside nvarchar strings

Solution 1

Looking for an explanation for this.

NVARCHAR stores 16-bit characters, and Unicode comparison rules are a lot more complicated than ASCII rules: the special characters of all the languages that are supported at the same time require quite a bit more processing.

Solution 2

My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.

Solution 3

A LIKE '%...%' search is implemented as > and < comparisons. The more rows there are, the longer the processing takes, because SQL Server can't make effective use of statistics for LIKE searches with a leading wildcard.

Additionally, a Unicode search needs more storage and, combined with the collation complications, is typically not as efficient as a plain varchar search. The fastest collation search, as you have observed, is the binary collation search.
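A minimal sketch of the binary collation search mentioned above, assuming the `test` table from the question's repro script already exists in tempdb. The `COLLATE` clause is the one the asker reports trying in the comments. A binary collation compares code points, so the match becomes case- and accent-sensitive and the search pattern has to use the same casing as the stored data.

```sql
use tempdb

set statistics time on

-- Default collation: full Unicode comparison rules apply to every candidate position.
select COUNT(1) from test where nv like N'%ABCD%' option (maxdop 1)

-- Binary collation: code points are compared directly.
-- The pattern is uppercase because a binary comparison is case-sensitive.
select COUNT(1) from test where nv collate Latin1_General_BIN like N'%ABCD%' option (maxdop 1)
```

Comparing the CPU time reported for the two statements on your own data shows how much of the cost comes from the collation rules. The two queries only return the same count when the default collation is case-insensitive and the data contains no differently cased matches.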

These kinds of searches are best suited to Full-Text Search, or can be implemented using FuzzyLookup with an in-memory hash table if you have lots of RAM and a fairly static table.

HTH

Solution 4

I've seen similar problems in SQL Server. In one case I was using parameterized queries, my parameter was UTF-8 (the default in .NET), and the field was varchar (so not UTF-8). SQL Server ended up converting every index value to UTF-8 just to do a simple index lookup. This might be related, in that the entire string might be getting translated to another character set to do the comparison. Also, for nvarchar, "a" would be the same as "á", meaning that there's a lot more work going on to figure out whether two strings are equal in Unicode. You might also want to use full-text indexing, although I'm not sure whether that solves your problem.
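A minimal sketch of the parameter-type mismatch this answer describes, assuming a hypothetical `#lookup` table and index that are not part of the original post. As the comments on this answer point out, a .NET string reaches SQL Server as nvarchar (UTF-16) rather than UTF-8, and it is the varchar column that gets converted, because nvarchar has the higher datatype precedence.

```sql
-- Hypothetical table and index for illustration only.
create table #lookup
(
    id int identity primary key,
    code varchar(36) not null
);
create index ix_lookup_code on #lookup (code);

insert #lookup (code) values ('abcd'), ('efgh');

declare @code_n nvarchar(36) = N'abcd'; -- nvarchar parameter, as a .NET string is sent by default
declare @code_v varchar(36) = 'abcd';   -- parameter typed to match the column

-- Mismatched types: the varchar column is implicitly converted to nvarchar.
select id from #lookup where code = @code_n;

-- Matching types: the column is compared as stored.
select id from #lookup where code = @code_v;

drop table #lookup;
```

Looking at the two execution plans should show a CONVERT_IMPLICIT on the column only for the first query. Whether that conversion also prevents an efficient index seek depends on the column's collation, as discussed in the comments.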

Author by

Michael J Swart

See my website for more about me: http://michaeljswart.com/?page_id=2

Updated on July 07, 2020

Comments

  • Michael J Swart
    Michael J Swart almost 4 years

    Check out the following example. It shows that searching within a unicode string (nvarchar) is almost eight times as bad as searching within a varchar string. And on par with implicit conversions. Looking for an explanation for this. Or a way to search within nvarchar strings more efficiently.

    use tempdb
    create table test
    (
        testid int identity primary key,
        v varchar(36),
        nv nvarchar(36),
        filler char(500)
    )
    go
    
    set nocount on
    set statistics time off
    insert test (v, nv)
    select CAST (newid() as varchar(36)),
        CAST (newid() as nvarchar(36))
    go 1000000
    
    set statistics time on
    -- search varchar string
    select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
    -- CPU time = 906 ms,  elapsed time = 911 ms.
    
    -- search varchar string using a unicode pattern (uses convert_implicit)
    select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
    -- CPU time = 6969 ms,  elapsed time = 6970 ms.
    
    -- search unicode string
    select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
    -- CPU time = 6844 ms,  elapsed time = 6911 ms.
    
    • Michael J Swart
      Michael J Swart over 13 years
      FYI, turns out the higher CPU in the implicit conversion example (query 2) is not due to the conversion itself, but to unicode comparison logic, just like the other unicode query (query 3).
    • ZygD
      ZygD over 13 years
      This is an excellent question and I've added a link to my answer here varchar-vs-nvarchar-performance
    • Michael J Swart
      Michael J Swart over 13 years
      @gbn, in that post you linked to msdn.microsoft.com/en-us/library/ms189617.aspx which is the explanation I like best. Thanks!
    • Michael J Swart
      Michael J Swart about 13 years
      Turned this question into a blog post: michaeljswart.com/2011/02/…
  • Michael J Swart
    Michael J Swart over 13 years
    Hmmm. interesting. In theory then using a binary collation might be a bit faster... stay tuned.
  • Michael J Swart
    Michael J Swart over 13 years
    Oh my God, that's it! When using "nv COLLATE Latin1_General_Bin like N'%ABCD%'" I get: -- CPU time = 890 ms, elapsed time = 881 ms.
  • Michael J Swart
    Michael J Swart over 13 years
    Thanks Kibbee. The collation that was used was already accent sensitive and so it wasn't that particular cause. Also full text indexing doesn't work in my case because the strings I'm searching aren't on word boundaries. But thanks for helping.
  • Michael J Swart
    Michael J Swart over 13 years
    You're right, that explanation is consistent with the numbers, until I did a further experiment (see comment under TomTom's answer). Thanks for stopping by Larry
  • Larry Coleman
    Larry Coleman over 13 years
    @Michael: I'm curious about whether you see the same result with the varchar column.
  • Michael J Swart
    Michael J Swart over 13 years
    With varchar+latin collation I get "cpu time = 891" which is a bit better than without the collation, but I can't tell if it's significantly better without having a decent grasp of stats. :-)
  • TomTom
    TomTom over 13 years
    Let me guess: you are an English speaker ;) Talk to some people from Germany and France and you start realizing the partially ODD rules around accents and special chars. This simply takes time to resolve ;) Good that we nailed that ;)
  • Solomon Rutzky
    Solomon Rutzky about 7 years
    -1 I really don't want to be negative, but everything stated in this answer is incorrect. .NET / Windows / SQL Server use UTF-16 Little Endian ("Unicode" in Microsoft-land). There is no UTF-8 unless you have a byte[] of those bytes; a string is UTF-16 LE, same as NVARCHAR (and XML) in SQL Server. Your issue was VARCHAR data using a SQL Server Collation (one starting with SQL_) in the index and comparing that to an NVARCHAR string. That combination requires an implicit conversion due to 2 different sorting algorithms. VARCHAR data with a Windows Collation wouldn't do that. (cont)
  • Solomon Rutzky
    Solomon Rutzky about 7 years
    Also, 'a' and 'á' are not the same in NVARCHAR. Whether or not they equate is determined by the accent-sensitivity option (i.e. _AI vs _AS in the name) of each particular Collation. And they can be deemed as being either the same or different for both VARCHAR and NVARCHAR. Try the following to see them as being equal as VARCHAR data using a deprecated SQL Server Collation: SELECT 1 WHERE 'a' = 'á' COLLATE SQL_Latin1_General_CP1_CI_AI;. And to clarify: you could have used a VARCHAR param in your query to fix it; it converted to NVARCHAR due to datatype precedence.
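The point about accent sensitivity in the last comment can be checked with a short query. This is a sketch that assumes the Windows collations `Latin1_General_100_CI_AI` and `Latin1_General_100_CI_AS` (chosen here for illustration; the comment itself uses `SQL_Latin1_General_CP1_CI_AI`) and a database code page that can represent 'á' in a varchar literal.

```sql
-- Same comparison under accent-insensitive (_AI) and accent-sensitive (_AS)
-- collations, for both varchar and nvarchar literals.
select
    case when 'a' = 'á' collate Latin1_General_100_CI_AI
         then 'equal' else 'different' end as varchar_ai,
    case when 'a' = 'á' collate Latin1_General_100_CI_AS
         then 'equal' else 'different' end as varchar_as,
    case when N'a' = N'á' collate Latin1_General_100_CI_AI
         then 'equal' else 'different' end as nvarchar_ai,
    case when N'a' = N'á' collate Latin1_General_100_CI_AS
         then 'equal' else 'different' end as nvarchar_as;
```

Both `_AI` columns should report the characters as equal and both `_AS` columns as different. That matches the comment's claim that the collation, not the choice between varchar and nvarchar, decides whether 'a' and 'á' compare as equal.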