What can cause long query durations without high resource usage?

8,997

Solution 1

As you are running SQL 2005 you can take your SQL Profiler data and compare it against the Perfmon data to see if you can see a correlation. This is done by saving your trace data and perfmon data to files using the normal techniques. Then open the SQL Profiler trace within profiler, and then one of the options in the file menu will be Import Performance Data. This will let you select a query and see what the counters were doing at that time (or close to it depending on your perfmon collection interval).

Disk queue spikes are never good. Especially that high. What's the IO that you are pushing to the disk when the queue gets that high? Basically you don't want a disk queue any higher than (2*n) where n in the number of disks in the array. Since you are using a 2 disk RAID 1 n=1 in your case (since you only get the speed of a single disk).

In perfmon there's a counter which is the seconds per read and seconds per write. What do these counters look like when the queries start taking a long time to run. What about normally? (Anything over .02 seconds is bad.) What's the estimated page life expectancy? (Anything under 300 seconds is usually bad, but this can vary.) What's the SQL Server cache hit ratio? (Anything below ~97% is usually bad. I like mine above 99.9%.)

Solution 2

Few things that maybe unhelpful or could be useful;
If this is happening to stored procedures, it could be parameter sniffing -> http://omnibuzz-sql.blogspot.com/2006/11/parameter-sniffing-stored-procedures.html
Are you using ASP for the web application? We had an issue that was similar but related to ASP + IIS and SQL using stored procedures. I seem to remember it being semaphore timeouts that would cause this. It'd take almost 30+ seconds to run a query but then everything was fine for a while. I can't find my info on it but I seem to remember it being related to IIS timeouts, this was on the IIS side.

This tool might be helpful too -> http://blog.brianhartsock.com/2008/12/16/quick-and-dirty-sql-server-slow-query-log/

Share:
8,997

Related videos on Youtube

Ge3ng
Author by

Ge3ng

Updated on September 17, 2022

Comments

  • Ge3ng
    Ge3ng over 1 year

    In advance: sorry for the length of the question... unable to get the right balance between detail and brevity.

    We are having issues with the DB Server for our web application where queries that should (and usually do) run in a very short time (< 10ms) are, on random occasions, taking anywhere between 1 ~ 30 seconds to execute - with no obvious pattern. According to our profiler traces, some of them are even "do-nothing" queries, such as "exec sp_reset_connection" (usually runs in 0ms; observed peaks of 3~6s), and "SET NO_BROWSETABLE ON", etc. Some examples are:

    SELECT * FROM [Localisation].[TimeZoneRule] WHERE [Name] = 'EU'
    

    Where TimeZoneRule has about 500,000 rows in 5 columns. Has a surrogate primary key, and an index on Name. Usually takes 0.97ms, peaks at 11s. Table is NEVER written to (was pre-populated prior to go-live). Profiler records it as taking 0-15 CPU, 18-25 Reads, 0-1 Writes (no idea why the write).

    UPDATE [Core].[User] SET [LastUsed] = GETUTCDATE() WHERE Id = '<uid>'
    

    Where User has about 30,000 rows over about 10 columns (one of which is an Xml column). Id is the clustered primary key. Table is regularly written to and read from. Usually takes 10~20ms, peaks at 26s. Profiler records it as taking 0 CPU, 15-36 Reads, 0-1 Writes.

    INSERT INTO [Log].[Session] (ASPSessionId, Start, ClientAddress, ClientSoftware, ProxyAddress, ProxySoftware)
       VALUES(<number>, GETUTCDATE(), '<ipv4address>', '<User agent string>', '<ipv4address>', '<proxy software name (if present)>')
    

    Where Session has about 1,000,000 rows in it over about 8 columns. Has a surrogate primary key (identity), and an index on ASPSessionId. Table is regularly written to, but rarely read from (only by us directly from SSMS). Usually takes 15~150ms, peaks at 5s. I don't have the profile record for it at hand, but from memory, CPU is around 0, Reads and writes were between 0 and 100 each.

    The setup we're using is a mirrored setup with a Dell 2950 as the principle (2 4-core xeon 2.6, 16Gb RAM), and a Dell 6850 as the mirror (4 HT Xeon 3.2, 8Gb RAM). Both running SQL 2005 SP4 64-bit. The database in question is not particularly large, around 16Gb in size. The primary has 6 SAS disks divided into 3 RAID-1 volumes; one for System + Page + TempDB, one for the database's MDF, and one for the transaction log + hourly log backup + daily DB backup. I know the log situation is far from the best - in terms of disk IO (see below), and data security.

    So far, we think we've eliminated:

    • The mirror. We separated the servers, and ran using one of them (and then switched to the other), but the performance issues remained.
    • Blocking due to locks(*). TimeZoneRule is never written to and, by my reckoning, should never have an exclusive lock on it. Additionally, we've checked through the traces and on many occasions the "problem query" is the only one running - the only other activity is other connections disconnecting
    • Poor indexing. With the low figures for Reads and CPU, it would suggest that SQL Server is using indexes effectively.
    • Disk IO. PerfMon indicates some odd figures for the data file drive (but only that drive) - whilst data read/ write rates rarely seem to exceed 32KB/s, the Current Disk Queue Length spikes to around 215 for durations of 2-5 seconds at approximately 45-60minute intervals with no fixed pattern. However, these do not correlate with times of poor query performance. The disk queue lengths for the other two drives [system + page + tempdb] and [log + backups] never exceeds 3.

    (*) We've tried getting profiler to capture events related to lock acquisition, but the trace bloats to unreadable proportions and, worse, the web application grinds to a halt.

    Not being DBAs, we're rapidly running out of ideas. Can anyone think of anything I should consider looking at next or anything I've stupidly missed?

    • Admin
      Admin almost 15 years
      Any chance you can capture the Process category counters (all instances) and hopefully be able to identify the process that is generating the periodic I/O?
  • Ge3ng
    Ge3ng almost 15 years
    Thanks for your response. Yes, the application is classic ASP. Occasionally we notice the poor performance on stored procedures, but mainly they are "ad-hoc" type queries. Regarding timeouts: the ADODB Command and Connection objects both support the CommandTimeout property that cancels a query that runs longer than it's specified value (default: 30seconds).
  • Ge3ng
    Ge3ng almost 15 years
    Thanks for your response. The error log is showing no growth events of any type (but we're currently in an activity lull since most of our users aren't around); just backup and checkdb successes from our automated tasks.
  • Ge3ng
    Ge3ng almost 15 years
    Thanks for your response. We have examined the query plans (both estimated and actual) of many of our longer running queries. We have managed to bring the performance up of those that had excessive Reads/ CPU figures - but we're still seeing queries that have reads in the order of 18-30, no CPU usage, and no writes taking over 10 seconds. The query plans for these show they're doing absolutely nothing taxing. We have a trace permanently running from another server logging all queries into a separate DB but have found no patterns in the issues.
  • Ge3ng
    Ge3ng almost 15 years
    Thanks for your response. I didn't realise profiler had a built-in ability to compare perf data - very handy. We did this exercise manually and found no real correlation (other than at backup/ checkdb time at midnight). Outside of backup time, the disk queue length spikes are accompanied by low read rates of 6KB/sec (and no writes) - very puzzling. During backup, the read rates peak at 1MB/sec, and write at 34KB/sec. Have just added the other counters (seconds to read/write, page life, and ratio) - will let you know what they say when I have a representative sample.
  • Ge3ng
    Ge3ng almost 15 years
    I have representative samples of "normal" running. Page life expectency: 38468 - 41425 (avg: 39943). Cache hit ratio 99.79 - 9.88% (avg: 99.85%). Average seconds to read: 0.005 - 0.196 (avg: 0.006), to write: 0 - 0.001 (avg: 0). Will investigate the cause of the 0.196 spike.
  • Ge3ng
    Ge3ng almost 15 years
    Thanks for your response. We're not updating our statistics regularly, and had just left it to auto update on it's own. I've now enabled asynchronous updating as you suggest. However, I imagine that a statistics update would only affect queries using a table upon which the stats are being rebuilt - given that TimeZoneRule is never written to, it's stats will never get out of date (according to what I've read so far, an automatic rebuild is triggered when a table changes by 20%). Additionally, stats rebuilds would not affect queries like "exec sp_reset_connection"
  • Ge3ng
    Ge3ng almost 15 years
    In some of our performance traces, the seconds to read is a little high - 1.02s. I'll get our hardware guys to look into this. Thanks.
  • Ryan Ferretti
    Ryan Ferretti almost 15 years
    That could be a cause or a symptom. If the hardware is being pushed beyond it's limit because of lack of indexes then it's a symptom. If the hardware just isn't powerfull enough and the database is properly tuned then its a cause. Happy hunting.
  • Peter
    Peter over 12 years
    Did you ever find out why the sp_reset_connection ran slow im having the same problem...