SQL Server uses high CPU when searching inside nvarchar strings
Solution 1
Looking for an explanation for this.
NVarchar is 16-bit, and Unicode comparison rules are a lot more complicated than ASCII ones: the special characters of the various languages that are supported at the same time require quite a bit more processing.
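Because the comparison rules that apply depend on the collation of each column, it can help to check which collation is actually in play. The following is a minimal sketch, assuming the `test` table from the question exists in `tempdb` under the `dbo` schema (the schema name is an assumption, not stated in the question).

```sql
USE tempdb;

-- List the collation of each character column in the question's test table.
-- Non-character columns (such as testid) report NULL and are filtered out.
SELECT c.name AS column_name,
       t.name AS data_type,
       c.collation_name
FROM sys.columns AS c
JOIN sys.types AS t
    ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.test')
  AND c.collation_name IS NOT NULL;
```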
Solution 2
My guess is that LIKE is implemented using an O(n^2) algorithm as opposed to an O(n) algorithm; it would probably have to be for the leading % to work. Since the Unicode string is twice as long, that seems consistent with your numbers.
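The "twice as long" part of this guess refers to storage size, and it can be checked directly. A small sketch, using only literals so it does not depend on any table:

```sql
-- DATALENGTH returns the number of bytes used by an expression.
-- A varchar literal with plain ASCII letters uses 1 byte per character;
-- the same text as nvarchar (UTF-16) uses 2 bytes per character.
SELECT DATALENGTH('abcd')  AS varchar_bytes,   -- 4
       DATALENGTH(N'abcd') AS nvarchar_bytes;  -- 8
```

This only confirms the size difference; it says nothing about which algorithm LIKE uses internally.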
Solution 3
A LIKE %% search is implemented as > and < comparisons. The more rows there are, the longer the processing takes, because SQL Server can't really make effective use of statistics for LIKE %% searches.
Additionally, a Unicode search requires additional storage and, along with collation complications, is typically not as efficient as a plain vanilla varchar search. The fastest collation search, as you have observed, is the binary collation search.
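As a rough sketch of the binary collation search, assuming the `test` table from the question: the `COLLATE` clause forces a byte-wise comparison for this predicate only. A binary collation is case-sensitive, so the pattern is written in upper case here to match the upper-case hex text that casting `NEWID()` to a string produces; the asker's follow-up comment uses the same form.

```sql
-- Unicode column, compared with a binary collation instead of the
-- column's default linguistic collation.
SELECT COUNT(1)
FROM test
WHERE nv COLLATE Latin1_General_BIN LIKE N'%ABCD%'
OPTION (MAXDOP 1);

-- The same idea applied to the varchar column.
SELECT COUNT(1)
FROM test
WHERE v COLLATE Latin1_General_BIN LIKE '%ABCD%'
OPTION (MAXDOP 1);
```

The trade-off is that a binary comparison ignores linguistic rules entirely, so it only fits data where exact code-point matching is acceptable, such as GUID text.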
These kinds of searches are best suited to Full-Text Search, or can be implemented using FuzzyLookup with an in-memory hash table if you have lots of RAM and a pretty static table.
HTH
Solution 4
I've seen similar problems in SQL Server. There was a case where I was using parameterized queries, and my parameter was UTF-8 (default in .net) and the field was varchar (so not utf-8). It ended up converting every index value to utf-8 just to do a simple index lookup. This might be related in that the entire string might be getting translated to another character set to do the comparison. Also, for nvarchar, "a" would be the same as "á", meaning that there's a lot more work going on there to figure out whether 2 strings are equal in unicode. Also, you might want to use full text indexing, although I'm not sure if that solves your problem.
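The parameter-type mismatch described here can be sketched against the question's `test` table. The lookup value and the variable names `@asUnicode` and `@asVarchar` are hypothetical, chosen only for illustration. Note that a commenter on this answer disputes the UTF-8 explanation and attributes the conversion to data type precedence between varchar and nvarchar, which is what the comments in the sketch assume.

```sql
-- Hypothetical lookup value; any 36-character string works for the illustration.
DECLARE @asUnicode nvarchar(36) = N'00000000-0000-0000-0000-000000000000';
DECLARE @asVarchar varchar(36)  =  '00000000-0000-0000-0000-000000000000';

-- nvarchar has higher data type precedence than varchar, so the varchar
-- column v is implicitly converted (CONVERT_IMPLICIT in the plan)
-- before it is compared with the parameter.
SELECT COUNT(1) FROM test WHERE v = @asUnicode;

-- Matching types: v is compared as varchar and the column is not converted.
SELECT COUNT(1) FROM test WHERE v = @asVarchar;
```

The question's table has no index on `v`, so both statements scan; the difference would show up in the execution plan's predicate rather than in a seek versus a scan.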
Michael J Swart
See my website for more about me: http://michaeljswart.com/?page_id=2
Updated on July 07, 2020
Comments
-
Michael J Swart almost 4 years
Check out the following example. It shows that searching within a unicode string (nvarchar) is almost eight times as bad as searching within a varchar string. And on par with implicit conversions. Looking for an explanation for this. Or a way to search within nvarchar strings more efficiently.
    use tempdb

    create table test
    (
        testid int identity primary key,
        v varchar(36),
        nv nvarchar(36),
        filler char(500)
    )
    go

    set nocount on
    set statistics time off
    insert test (v, nv)
    select CAST (newid() as varchar(36)),
        CAST (newid() as nvarchar(36))
    go 1000000
    set statistics time on

    -- search utf8 string
    select COUNT(1) from test where v like '%abcd%' option (maxdop 1)
    -- CPU time = 906 ms, elapsed time = 911 ms.

    -- search utf8 string using unicode (uses convert_implicit)
    select COUNT(1) from test where v like N'%abcd%' option (maxdop 1)
    -- CPU time = 6969 ms, elapsed time = 6970 ms.

    -- search unicode string
    select COUNT(1) from test where nv like N'%abcd%' option (maxdop 1)
    -- CPU time = 6844 ms, elapsed time = 6911 ms.
-
Michael J Swart over 13 years
FYI, it turns out the higher CPU in the implicit conversion example (query 2) is not due to the conversion itself, but to Unicode comparison logic, just like the other Unicode query (query 3).
-
ZygD over 13 years
This is an excellent question, and I've added a link to my answer here: varchar-vs-nvarchar-performance
-
Michael J Swart over 13 years
@gbn, in that post you linked to msdn.microsoft.com/en-us/library/ms189617.aspx which is the explanation I like best. Thanks!
-
Michael J Swart about 13 years
Turned this question into a blog post: michaeljswart.com/2011/02/…
-
-
Michael J Swart over 13 years
Hmmm. Interesting. In theory, then, using a binary collation might be a bit faster... stay tuned.
-
Michael J Swart over 13 years
Oh my God, that's it! When using "nv COLLATE Latin1_General_Bin like N'%ABCD%'" I get: -- CPU time = 890 ms, elapsed time = 881 ms.
-
Michael J Swart over 13 years
Thanks Kibbee. The collation that was used was already accent sensitive, so that wasn't the cause here. Also, full text indexing doesn't work in my case because the strings I'm searching for aren't on word boundaries. But thanks for helping.
-
Michael J Swart over 13 years
You're right, that explanation was consistent with the numbers, until I did a further experiment (see the comment under TomTom's answer). Thanks for stopping by, Larry.
-
Larry Coleman over 13 years
@Michael: I'm curious about whether you see the same result with the varchar column.
-
Michael J Swart over 13 years
With varchar+latin collation I get "cpu time = 891" which is a bit better than without the collation, but I can't tell if it's significantly better without having a decent grasp of stats. :-)
-
TomTom over 13 years
Let me guess: you are an English speaker ;) Talk to some people from Germany and France and you start realizing the partially ODD rules around accents and special chars. This simply takes time to resolve ;) Good that we nailed that ;)
-
Solomon Rutzky about 7 years
-1 I really don't want to be negative, but everything stated in this answer is incorrect. .NET / Windows / SQL Server use UTF-16 Little Endian ("Unicode" in Microsoft-land). There is no UTF-8 unless you have a byte[] of those bytes; a string is UTF-16 LE, same as NVARCHAR (and XML) in SQL Server. Your issue was VARCHAR data using a SQL Server Collation (one starting with SQL_) in the index and comparing that to an NVARCHAR string. That combination requires an implicit conversion due to 2 different sorting algorithms. VARCHAR data with a Windows Collation wouldn't do that. (cont)
-
Solomon Rutzky about 7 years
Also, 'a' and 'á' are not the same in NVARCHAR. Whether or not they equate is determined by the accent-sensitivity option (i.e. _AI vs _AS in the name) of each particular Collation. And they can be deemed as being either the same or different for both VARCHAR and NVARCHAR. Try the following to see them as being equal as VARCHAR data using a deprecated SQL Server Collation: SELECT 1 WHERE 'a' = 'á' COLLATE SQL_Latin1_General_CP1_CI_AI;. And to clarify: you could have used a VARCHAR param in your query to fix it; it converted to NVARCHAR due to datatype precedence.
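The accent-sensitivity point can be seen side by side with NVARCHAR literals. This is a small sketch that swaps in the Windows collations `Latin1_General_CI_AI` and `Latin1_General_CI_AS` (chosen here for illustration, not taken from the comment) so that only the accent option differs between the two comparisons.

```sql
-- Same two NVARCHAR literals, compared under an accent-insensitive (_AI)
-- and an accent-sensitive (_AS) collation.
SELECT
    CASE WHEN N'a' = N'á' COLLATE Latin1_General_CI_AI
         THEN 'equal' ELSE 'different' END AS accent_insensitive,  -- equal
    CASE WHEN N'a' = N'á' COLLATE Latin1_General_CI_AS
         THEN 'equal' ELSE 'different' END AS accent_sensitive;    -- different
```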