SQL Server Lock Timeout Exceeded Deleting Records in a Loop

10,446

Solution 1

I've found the answer: my looped delete is conflicting with the ghost cleanup proc.

Using Nicholas's suggestion, I added a BEGIN TRANSACTION and a COMMIT. I wrapped the delete loop in a BEGIN TRY / BEGIN CATCH. In the BEGIN CATCH, right before a ROLLBACK, I ran sp_lock and sp_who2. (I added the code changes in the question above.)

When my process blocked, I saw the following output:

spid   dbid   ObjId       IndId  Type Resource                         Mode     Status
------ ------ ----------- ------ ---- -------------------------------- -------- ------
20     2      1401108082  0      TAB                                   IX       GRANT
20     2      1401108082  1      PAG  1:102368                         X        GRANT

SPID  Status     Login HostName BlkBy DBName Command       CPUTime DiskIO
----  ---------- ----- -------- ----- ------ ------------- ------- ------
20    BACKGROUND sa    .        .     tempdb GHOST CLEANUP 31      0

For future reference, when SQL Server deletes records, it sets a bit on them to just mark them as "ghost records". Every few minutes, an internal process called ghost cleanup runs to reclaim pages of records that have been fully deleted (i.e. all records are ghost records).

The ghost cleanup process was discussed on ServerFault in this question.

Here is Paul S. Randal's explanation of the ghost cleanup process.

It is possible to disable the ghost cleanup process with a trace flag. But I did not have to do so in this case.

I ended up adding a lock wait timeout of 100 ms. This causes occasional lock wait timeouts in the ghost record cleanup process, but that is acceptable. I also added an our loop that retries lock timeouts up to 5 times. With these two changes, my process now usually completes. Now it only gets a timeout if there is a very long process pushing lots of data around that acquires table or page locks on the data that my process needs to clean up.

EDIT 2016-07-20

The final code looks like this:

-- Do not block long if records are locked.
SET LOCK_TIMEOUT 100

-- This process volunteers to be a deadlock victim in the case of a deadlock.
SET DEADLOCK_PRIORITY LOW

DECLARE @Error BIT
SET @Error = 0

DECLARE @ErrMsg VARCHAR(1000)
DECLARE @DeletedCount INT
SELECT @DeletedCount = 0

DECLARE @LockTimeoutCount INT
SET @LockTimeoutCount = 0

DECLARE @ContinueDeleting BIT,
    @LastDeleteSuccessful BIT

SET @ContinueDeleting = 1
SET @LastDeleteSuccessful = 1

WHILE @ContinueDeleting = 1
BEGIN
    DECLARE @RowCount INT
    SET @RowCount = 0

    BEGIN TRY

        BEGIN TRANSACTION

        -- The READPAST below attempts to skip over locked records.
        -- However, it might still cause a lock wait error (1222) if a page or index is locked, because the delete has to modify indexes.
        -- The threshold for row lock escalation to table locks is around 5,000 records,
        -- so keep the deleted number smaller than this limit in case we are deleting a large chunk of data.
        -- Table name, field, and value are all set dynamically in the actual script.
        SET @SQL = N'DELETE TOP (1000) MyTable WITH(ROWLOCK, READPAST) WHERE MyField = SomeValue' 
        EXEC sp_executesql @SQL, N'@ProcGuid uniqueidentifier', @ProcGUID

        SET @RowCount = @@ROWCOUNT

        COMMIT

        SET @LastDeleteSuccessful = 1

        SET @DeletedCount = @DeletedCount + @RowCount
        IF @RowCount = 0
        BEGIN
            SET @ContinueDeleting = 0
        END

    END TRY
    BEGIN CATCH

        IF @@TRANCOUNT > 0
            ROLLBACK

        IF Error_Number() = 1222 -- Lock timeout
        BEGIN

            IF @LastDeleteSuccessful = 1
            BEGIN
                -- If we hit a lock timeout, and we had already deleted something successfully, try again.
                SET @LastDeleteSuccessful = 0
            END
            ELSE
            BEGIN
                -- The last delete failed, too.  Give up for now.  The job will run again shortly.
                SET @ContinueDeleting = 0
            END
        END
        ELSE -- On anything other than a lock timeout, report an error.
        BEGIN       
            SET @ErrMsg = 'An error occurred cleaning up data.  Table: MyTable Column: MyColumn Value: SomeValue.  Message: ' + ERROR_MESSAGE() + ' Error Number: ' + CONVERT(VARCHAR(20), ERROR_NUMBER()) + ' Line: ' + CONVERT(VARCHAR(20), ERROR_LINE())
            PRINT @ErrMsg -- this error message will be included in the SQL Server job history
            SET @Error = 1
            SET @ContinueDeleting = 0
        END

    END CATCH

END

IF @Error <> 0
    RAISERROR('Not all data could be cleaned up.  See previous messages.', 16, 1)

Solution 2

You or somebody else using the connection is setting the lock timeout to something other than the default. See http://msdn.microsoft.com/en-US/library/ms189470(v=SQL.90).aspx for details.

The default lock time it is -1 milliseconds, meaning "Wait Forever".

The row hints are nice, but they are a code smell and should be avoided. Let SQL Server do its job. It's got more information than you do about the system as a whole.

For starters, you can't control the lock size: Lock escalation occurs automatically, based on the number of outstanding locks. It starts with row locks. If you accumulate too many row locks, SQL Server escalates to page lock. Acquire too many page locks and it escalates to table locks. See http://msdn.microsoft.com/en-us/library/ms184286(v=SQL.90).aspx for lock escalation details. There are a couple of trace flags you can set, however, that will prevent lock escalation: however, that will degrade the SQL Server's performance.

Another thing: you should wrap the DELETE statement in a transaction, especially in a stored procedure.

DECLARE @Count INT
SET @Count = 1
WHILE @Count > 0
  BEGIN
    BEGIN TRANSACTION
    DELETE TOP (1000) FROM MyTable WITH (ROWLOCK, READPAST) WHERE MyField = SomeValue
    SET @Count = @@ROWCOUNT
    COMMIT TRANSACTION
  END

This makes clear your intent and ensures that locks are release when they ought to be.

10,446

Paul Williams

I've been programming for about 24 years now (eek!). My first job was in COBOL, but mainframes and I do not mix. I coded several years in Java back in the JDK 1.0.2 days. Did some C++ and COM. Worked on Java + WFC from Microsoft. Been writing C# and SQL code since 2004. Married with two kids. Kids are destined to either game or sympathize with gamers. Wife is wonderful, understanding, and sympathetic, but she=g=d=+++----CARRIER LOST

Updated on June 04, 2022

Comments

Paul Williams almost 2 years
I am testing a process that deletes many, many records at once. It cannot TRUNCATE TABLE, because there are records in there that need to stay.

Because of the volume, I have broken the delete into a loop similar to this:
```
-- Do not block if records are locked.
SET LOCK_TIMEOUT 0
-- This process should be chosen as a deadlock victim in the case of a deadlock.
SET DEADLOCK_PRIORITY LOW
SET NOCOUNT ON

DECLARE @Count
SET @Count = 1
WHILE @Count > 0
BEGIN TRY
    BEGIN TRANSACTION -- added per comment below

    DELETE TOP (1000) FROM MyTable WITH (ROWLOCK, READPAST) WHERE MyField = SomeValue
    SET @Count == @@ROWCOUNT

    COMMIT
END TRY
BEGIN CATCH
    exec sp_lock -- added to display the open locks after the timeout
    exec sp_who2 -- shows the active processes

    IF @@TRANCOUNT > 0
        ROLLBACK
    RETURN -- ignoring this error for brevity
END CATCH
```
MyTable is a clustered table. MyField is in the first column in the clustered index. It indicates a logical grouping of records, so MyField = SomeValue often selects many records. I don't care in which order they are deleted so long as one group is processed at a time. There are no other indexes on this table.

I added the ROWLOCK hint to try to avoid lock escalations we have seen in production. I added the READPAST hint to avoid deleting records locked by other processes. That should never happen, but I am trying to be safe.

Problem: sometimes this loop hits a lock timeout 1222 "Lock request time out period exceeded" when it is the only thing running.

I am positive there is no other activity on this system while I am testing this process, because it is my own developer box, nobody else is connected, there are no other processes running on it, and the profiler shows no activity.

I can re-run the same script a second later and it picks up where it left off, happily deleting records-- until the next lock timeout.

I have tried a BEGIN TRY / BEGIN CATCH to ignore the 1222 error and retry the delete, but it fails again immediately with the same lock timeout error. It also fails again if I add a short delay before retrying.

I assume the lock timeouts are because of something like a page split, but I am not sure why this would conflict with the current loop iteration. The prior delete statement should have already completed, and I thought that meant any page splits were also finished.

Why is the DELETE loop hitting a lock timeout against itself?

Is there a way the process can avoid this lock timeout or detect that it is safe to resume?

This is on SQL Server 2005.

-- EDIT --

I added the Lock:Timeout event to the profiler. It's timing out on a PAGELOCK during the delete:
```
Event Class: Lock:Timeout
TextData:    1:15634  (one example of several)
Mode:        7 - IU
Type:        6 - PAGE
```
DBCC PAGE reports these pages are outside of the range of the master database (ID 1).

-- EDIT 2 --

I added a BEGIN TRY / BEGIN CATCH and ran an exec sp_lock in the catch block. Here is what I saw:
```
spid dbid ObjId      IndId Type Resource Mode Status
19   2    1401108082 1     PAG  1:52841  X    GRANT  (tempdb.dbo.MyTable)
19   2    1401108082 0     TAB           IX   GRANT  (tempdb.dbo.MyTable)
Me   2    1401108082 0     TAB           IX   GRANT  (tempdb.dbo.MyTable)
Me   1    1115151018 0     TAB           IS   GRANT  (master..spt_values)  (?)
```
SPID 19 is a SQL Server TASK MANAGER. Why would one of these task managers be acquiring locks on MyTable?
Quantum Elf about 13 years

SQL does not escalate a row lock to a page lock - it directly escalates to a table lock. sqlskills.com/BLOGS/PAUL/post/…
Paul Williams about 13 years

You are correct that the code sets LOCK_TIMEOUT to 0. I just included that above; sorry for not mentioning it earlier.
Paul Williams about 13 years

Wrapping this in a transaction has helped identify the open locks at the time of the lock timeout. See edits above.
MsBao almost 8 years

Could you post your production solution after the fix?
Paul Williams almost 8 years

@RonnieOverby I added a sample solution. Our actual production code is more complex than this, because it cleans up several different tables via dynamic SQL. This code above does not include that extra baggage.
MsBao almost 8 years

Awesome. Thanks for taking the time to do that.