Recovering from Redis connectivity loss

10,153

Solution 1

Edit 2015-05-02: Although the later versions of the StackExchange.Redis client are explicitly supposed to handle this sort of "lost connection" -> "reconnect" logic internally and automagically, my testing has shown beyond any real doubt that they don't manage to pull it off successfully, and hence this sort of thing is still needed, at least in busy environments. I yanked the code below out of my caching layer for a while, and ended up with tens of thousands of connection failure errors. I put it back in, and those all went away.

Edit 2015-02-24: This method should no longer be necessary. The latest versions of the StackExchange.Redis client correctly handle being disconnected - they'll reconnect themselves up automatically, and the workaround below just interferes with things. Keeping it here for historical purposes, but my recommendation is to ignore it.


Here's a few methods from my SimpleCacheRedis<T> wrapper that show how I'm handling the problem:

public async Task<TValue> GetAsync(string key, Func<Task<TValue>> missingFunc)
{
    key = GetKey(key);
    var value = default(TValue);
    try
    {
        var db = _connection.GetDatabase();
        var str = await db.StringGetAsync(key);
        if (!str.IsNullOrEmpty)
        {
            value = _jsonSerializer.Deserialize<TValue>(str);
        }
    }
    catch (RedisConnectionException ex)
    {
        HandleRedisConnectionError(ex);
    }
    catch (Exception ex)
    {
        _logger.Error("Error retrieving item '" + key +
                      "' from Redis cache; falling back to missingFunc(). Error = " + ex);
    }
    if (value == default(TValue))
    {
        present = false;
        value = await missingFunc();
        await PerformAddAsync(key, value);
    }
    return value;
}

private void HandleRedisConnectionError(RedisConnectionException ex)
{
    _logger.Error("Connection error with Redis cache; recreating connection for the next try, and falling back to missingFunc() for this one. Error = " + ex.Message);
    Task.Run(async () =>
    {
        try
        {
            await CreateConnectionAsync();
        }
        catch (Exception genEx)
        {
            _logger.Error("Unable to recreate redis connection (sigh); bailing for now: " + genEx.Message);
        }
    });
}

private async Task CreateConnectionAsync()
{
    if (_attemptingToConnect) return;
    var sw = new StringWriter();
    try
    {
        _attemptingToConnect = true;
        _connection = await ConnectionMultiplexer.ConnectAsync(_redisCs, sw);
    }
    catch (Exception ex)
    {
        _logger.Error("Unable to connect to redis async: " + ex);
        _logger.Debug("internal log: \r\n" + sw);
        throw;
    }
    finally
    {
        _attemptingToConnect = false;
    }
}

Basically, if I see that I can't connect to Redis because of a RedisConnectionException, I spin off a separate async task to recreate the shared connection. That call, of course, may fail, but calls were going to be failing during that period anyway. And as soon as it succeeds, any new calls will use that newly (re-)created connection. Like I said above, kinda boring.

My situation may be a little different from yours, in that I'm not using Redis as a permanent store, but just as a cache. That means the only impact from a lost redis connection is that I need to retrieve the result from the DB rather than from cache. So I can afford to treat certain things a little loosely.

Solution 2

All right, I guess I'll answer my own question if nobody wants to, although it seems strange, being such a basic use case.

Here is the class managing connectivity loss:

static class RedisConnectionManager
{
    private static readonly Dictionary<string, IDatabase> _dictionary = new Dictionary<string, IDatabase>();

    internal static IDatabase GetDatabase(string connectionString)
    {
        lock (_dictionary)
        {
            if (!_dictionary.ContainsKey(connectionString))
                _dictionary.Add(connectionString, ConnectionMultiplexer.Connect(connectionString).GetDatabase());
            if (!_dictionary[connectionString].Multiplexer.IsConnected)
            {
                _dictionary[connectionString].Multiplexer.Dispose();
                _dictionary[connectionString] = ConnectionMultiplexer.Connect(connectionString).GetDatabase();
            }
            return _dictionary[connectionString];
        }
    }
}

This class handles multiple connection strings, so if you have just one, the code will be even simpler. Please note explicit Multiplexer.Dispose() call. Since underlying object owns physical TCP connection, you can't wait until GC kicks in to release the resources. By that time, depending on your load, you may have thousands of orphaned TCP connections hanging around.

This code works reasonably well, but I am still not 100% sure this is the best way to handle this. Please let me know if anyone has the idea how to improve this.

Share:
10,153
Konstantin Surkov
Author by

Konstantin Surkov

Updated on June 21, 2022

Comments

  • Konstantin Surkov
    Konstantin Surkov almost 2 years

    I am looking for the reference implementation of the recovery [in multi-threaded environment] after Redis connection has been lost. Was not able to find anything that makes sense so far.

    Setup: I have an Azure worker role that runs the same code (ThreadProc) in multiple threads. Initially, I had static ConnectionMultiplexer and did .GetDatabase() before every Redis operation. That didn't pass stress testing at all (tons of "can't connect" errors once the load increased from low to moderate). I have changed it to this:

    static readonly ConnectionMultiplexer _connection = ConnectionMultiplexer.Connect(...);
    static readonly IDatabase _cache = _connection.GetDatabase();
    
    void ThreadProc() // running in multiple threads
    {
        while (true)
        {
          // using _cache here
        }
    }
    

    That worked pretty well even under high load (1000+ op/s per worker role instance) until I got "No connection is available to service this operation", and the thing can't recover since.

    Please let me know what would be the correct/recommended code that can recover from intermittent connectivity problems.

  • Myles J
    Myles J over 9 years
    Nice solution! Thanks for sharing.
  • Jeff
    Jeff over 8 years
    Is this still the case that SS can't handle reconnect?
  • Pharao2k
    Pharao2k over 8 years
    I don't want to criticize your code but it doesn't handle parallelism cleanly, correct? Since no locking is involved, two parallel get-invocations could result in two created connection, one of which would not be referenced anymore. Though I'm unsure if that would actually be a problem ;)
  • Ken Smith
    Ken Smith over 8 years
    @Pharao2k - That's a valid point: you could definitely have two threads simultaneously check the _attemptingToConnect variable, see that it's not set, and then both try to establish a connection, one of which would get orphaned, never used and never closed (by my code, at least). A lock of some sort would fix that. On the other hand, unless you have a really busy server, it's not going to happen frequently, I think the consequences of an orphaned connection are small, and I like having simpler code that doesn't suffer from the complexities of navigating locks. Still might be worth doing...