SQL Server uses high CPU when searching inside nvarchar strings

sql-server tsql

11,771

Solution 1

> Looking for an explanation for this.

NVARCHAR is 16-bit, and Unicode comparison rules are a lot more complicated than ASCII: the special characters of the various languages that are supported at the same time require quite a bit more processing.
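One way to check whether the comparison rules are the cost is to bypass them with a binary collation. The sketch below assumes the `test` table built by the question's repro script in tempdb; `Latin1_General_Bin` is the collation the asker reports trying in the comments. Because `newid()` renders as upper-case hex and a binary collation is case-sensitive, the pattern is upper-cased here.

```sql
use tempdb;
go
set statistics time on;

-- Default collation: the full Unicode comparison rules apply to every position.
select count(1)
from test
where nv like N'%abcd%'
option (maxdop 1);

-- Binary collation: characters are compared by their code values only.
-- Upper-case pattern, because binary comparison is case-sensitive.
select count(1)
from test
where nv collate Latin1_General_Bin like N'%ABCD%'
option (maxdop 1);
```

The trade-off is that a binary collation gives up case- and accent-insensitive matching, so it only fits data where an exact code-point match is what you want.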

Solution 2

My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.

Solution 3

A LIKE %% search is implemented as > and <. The more rows there are, the more processing time is needed, since SQL Server can't really make effective use of statistics for LIKE %% searches.

Additionally, a Unicode search requires additional storage and, along with the collation complications, is typically not as efficient as a plain vanilla varchar search. The fastest collation search, as you have observed, is the binary collation search.

These kinds of searches are best suited to Full-Text Search, or can be implemented using FuzzyLookup with an in-memory hash table if you have lots of RAM and a pretty static table.

HTH

Solution 4

I've seen similar problems in SQL Server. There was a case where I was using parameterized queries, and my parameter was UTF-8 (default in .NET) and the field was varchar (so not UTF-8). SQL Server ended up converting every index value to UTF-8 just to do a simple index lookup. This might be related, in that the entire string might be getting translated to another character set to do the comparison. Also, for nvarchar, "a" would be the same as "á", meaning that there's a lot more work going on to figure out whether two strings are equal in Unicode. You might also want to use full-text indexing, although I'm not sure that solves your problem.
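The parameter-type situation described above can be sketched against the question's `test` table. This is an illustration under assumptions: the index `ix_test_v` and the all-zero lookup value are hypothetical and not part of the original script. Whether the plan converts the column or still manages a seek depends on the column's collation, as the comments on this answer point out, so compare the actual execution plans rather than relying on this sketch.

```sql
use tempdb;
go
-- Hypothetical index (not in the original script) so an equality lookup on v can seek.
create index ix_test_v on test (v);
go

-- nvarchar parameter against the varchar column: data type precedence
-- converts the varchar side to nvarchar for the comparison.
exec sp_executesql
    N'select count(1) from test where v = @p',
    N'@p nvarchar(36)',
    @p = N'00000000-0000-0000-0000-000000000000';

-- varchar parameter: the types already match, so no conversion is needed.
exec sp_executesql
    N'select count(1) from test where v = @p',
    N'@p varchar(36)',
    @p = '00000000-0000-0000-0000-000000000000';
```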


Michael J Swart

See my website for more about me: http://michaeljswart.com/?page_id=2

Updated on July 07, 2020

Comments

Michael J Swart almost 4 years
Check out the following example. It shows that searching within a unicode string (nvarchar) is almost eight times as bad as searching within a varchar string. And on par with implicit conversions. Looking for an explanation for this. Or a way to search within nvarchar strings more efficiently.
```
use tempdb
create table test
(
    testid int identity primary key,
    v varchar(36),
    nv nvarchar(36),
    filler char(500)
)
go

set nocount on
set statistics time off
insert test (v, nv)
select CAST (newid() as varchar(36)),
    CAST (newid() as nvarchar(36))
go 1000000

set statistics time on
-- search varchar string (varchar here is code-page data, not UTF-8)
select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
-- CPU time = 906 ms,  elapsed time = 911 ms.

-- search varchar string using a unicode pattern (uses convert_implicit)
select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
-- CPU time = 6969 ms,  elapsed time = 6970 ms.

-- search unicode string
select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
-- CPU time = 6844 ms,  elapsed time = 6911 ms.
```
- Michael J Swart over 13 years
  
  FYI, turns out the higher CPU in the implicit conversion example (query 2) is not due to the conversion itself, but to unicode comparison logic, just like the other unicode query (query 3).
- ZygD over 13 years
  
  This is an excellent question, and I've added a link to my answer here: varchar-vs-nvarchar-performance
- Michael J Swart over 13 years
  
  @gbn, in that post you linked to msdn.microsoft.com/en-us/library/ms189617.aspx which is the explanation I like best. Thanks!
- Michael J Swart about 13 years
  
  Turned this question into a blog post: michaeljswart.com/2011/02/…
Michael J Swart over 13 years

Hmmm, interesting. In theory, then, using a binary collation might be a bit faster... stay tuned.
Michael J Swart over 13 years

Oh my God, that's it! When using "nv COLLATE Latin1_General_Bin like N'%ABCD%'" I get: -- CPU time = 890 ms, elapsed time = 881 ms.
Michael J Swart over 13 years

Thanks Kibbee. The collation that was used was already accent sensitive and so it wasn't that particular cause. Also full text indexing doesn't work in my case because the strings I'm searching aren't on word boundaries. But thanks for helping.
Michael J Swart over 13 years

You're right, that explanation was consistent with the numbers, until I did a further experiment (see comment under TomTom's answer). Thanks for stopping by, Larry.
Larry Coleman over 13 years

@Michael: I'm curious about whether you see the same result with the varchar column.
Michael J Swart over 13 years

With varchar+latin collation I get "cpu time = 891" which is a bit better than without the collation, but I can't tell if it's significantly better without having a decent grasp of stats. :-)
TomTom over 13 years

Let me guess - you are an English speaker ;) Talk to some people from Germany and France and you start realizing the partially ODD rules around accents and special chars. These simply take time to resolve ;) Good that we nailed that ;)
Solomon Rutzky about 7 years

-1 I really don't want to be negative, but everything stated in this answer is incorrect. .NET / Windows / SQL Server use UTF-16 Little Endian ("Unicode" in Microsoft-land). There is no UTF-8 unless you have a byte[] of those bytes; a string is UTF-16 LE, same as NVARCHAR (and XML) in SQL Server. Your issue was VARCHAR data using a SQL Server Collation (one starting with SQL_) in the index and comparing that to an NVARCHAR string. That combination requires an implicit conversion due to 2 different sorting algorithms. VARCHAR data with a Windows Collation wouldn't do that. (cont)
Solomon Rutzky about 7 years

Also, 'a' and 'á' are not the same in NVARCHAR. Whether or not they equate is determined by the accent-sensitivity option (i.e. _AI vs _AS in the name) of each particular Collation. And they can be deemed as being either the same or different for both VARCHAR and NVARCHAR. Try the following to see them as being equal as VARCHAR data using a deprecated SQL Server Collation: SELECT 1 WHERE 'a' = 'á' COLLATE SQL_Latin1_General_CP1_CI_AI;. And to clarify: you could have used a VARCHAR param in your query to fix it; it converted to NVARCHAR due to datatype precedence.
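The accent-sensitivity point can be checked directly with literals. A minimal sketch, assuming SQL Server 2008 or later for the `Latin1_General_100_*` Windows collations (chosen here for illustration; they are not named in the thread) and a script saved in an encoding that preserves `á`:

```sql
-- Accent-insensitive (_AI): the two literals compare as equal, so one row comes back.
select 1 as equal_under_ai
where N'a' = N'á' collate Latin1_General_100_CI_AI;

-- Accent-sensitive (_AS): the literals differ, so no row comes back.
select 1 as equal_under_as
where N'a' = N'á' collate Latin1_General_100_CI_AS;
```

The data type is nvarchar in both queries; only the collation changes, which is what decides whether the two characters equate.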