How to increase Redis performance when 100% CPU? Sharding? Fastest .Net Client?


Solution 1

We found the issue inside our application. Updates to data in our cache were communicated to the local in-memory cache on each web server through a Redis channel subscription (pub/sub).

Every time the local cache was flushed, an item expired, or an item was updated, a message was sent to all 35 web servers, which in turn started updating more items, and so on.

Disabling the messages for updated keys improved our situation tenfold.

Network bandwidth dropped from 1.2 Gbps to 200 Mbps, and CPU utilization is now 40% at 150% of the highest load we had seen so far, during a period of extreme calculations and updates.
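
To illustrate why this kind of feedback loop is so expensive, here is a toy in-process model, not our actual code. The `Bus` and `WebServer` classes and the `hops` parameter are hypothetical names made up for the sketch; the only figure taken from our setup is the 35 web servers. It counts delivered messages when a receiver only drops its local copy versus when it reacts by publishing again.

```python
from collections import deque


class Bus:
    """Stand-in for a Redis pub/sub channel (hypothetical in-process model)."""

    def __init__(self):
        self.servers = []
        self.queue = deque()
        self.delivered = 0

    def publish(self, sender, key, hops):
        # Fan out to every subscriber except the sender.
        for server in self.servers:
            if server is not sender:
                self.queue.append((server, key, hops))

    def drain(self):
        # Deliver queued messages until the system goes quiet.
        while self.queue:
            server, key, hops = self.queue.popleft()
            self.delivered += 1
            server.on_message(key, hops)


class WebServer:
    def __init__(self, bus, rebroadcast):
        self.bus = bus
        self.rebroadcast = rebroadcast
        self.local_cache = {}

    def on_message(self, key, hops):
        # Always drop the stale local copy.
        self.local_cache.pop(key, None)
        # The problematic variant reacts by publishing again.
        if self.rebroadcast and hops > 0:
            self.bus.publish(self, key, hops - 1)


def count_messages(n_servers, rebroadcast, hops=1):
    """Return how many messages one initial update causes in total."""
    bus = Bus()
    bus.servers = [WebServer(bus, rebroadcast) for _ in range(n_servers)]
    bus.publish(bus.servers[0], "item:1", hops)
    bus.drain()
    return bus.delivered


print(count_messages(35, rebroadcast=False))  # 34
print(count_messages(35, rebroadcast=True))   # 1190
```

Even with a single extra round of re-publishing, one update turns into 34 + 34 * 34 = 1190 deliveries instead of 34, which matches the kind of amplification we saw on the network.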

Solution 2

My first, simple suggestion, if you haven't done it already, would be to turn off all RDB or AOF backups on your master at the very least. Of course, your slaves might then fall behind if they're still saving to disk. See this for an idea of the cost of RDB dumps.
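
As a minimal sketch, assuming a stock redis.conf on the master, these are the two directives involved. Whether the master can safely run memory-only depends on having a slave (or some other copy) that still persists to disk.

```
# redis.conf on the master (sketch)
# An empty save directive disables RDB snapshots.
save ""
# Keep the append-only file off (this is the default).
appendonly no
```

The same settings can be applied to a running server with CONFIG SET save "" and CONFIG SET appendonly no, but they are lost on restart unless the config file is changed as well.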

Another thing to do is to make sure you're pipelining all of your commands. If you're sending many commands individually that could be grouped into a pipeline, you should see a bump in performance once you batch them.
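
To show what pipelining means on the wire, here is a small sketch that encodes several commands in the Redis protocol (RESP) and joins them into one buffer, so they can go out in a single write instead of one round trip each. The helper names `encode_command` and `encode_pipeline` are made up for this example; in practice your client library's own pipeline or batch API does this for you.

```python
def encode_command(*args):
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def encode_pipeline(commands):
    """Concatenate several encoded commands into one payload."""
    return b"".join(encode_command(*cmd) for cmd in commands)


payload = encode_pipeline([
    ("SET", "user:1:name", "alice"),
    ("EXPIRE", "user:1:name", 3600),
    ("GET", "user:1:name"),
])
print(payload)
```

Sending `payload` with one `socket.sendall(payload)` and then reading three replies replaces three separate request/response round trips with one.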

Also, this SO post has a good answer about profiling Redis

More info about your use case and data structures would be helpful in deciding whether there's a simple change to the way you're actually using Redis that would give you an improvement.

Edit: In response to your latest comment, it's worth noting that every time a slave loses its connection and reconnects, it re-syncs with the master. In previous versions of Redis this was always a complete re-sync, so it was quite expensive. Apparently in 2.8 the slave is now able to request a partial re-sync of just the data it has missed since its disconnection. I don't know much about the details, but if either your master or any of your slaves aren't on 2.8.* and you have a shaky connection, that could really hurt your CPU performance by constantly forcing your master to re-sync the slaves. More info Here
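
A partial re-sync can only succeed if the data the slave missed is still in the master's replication backlog (repl-backlog-size). As a rough, back-of-the-envelope sketch, the function below estimates how many seconds of disconnection the backlog can cover. The function name is hypothetical, the inputs are the values from the INFO output posted in the question, and it assumes master_repl_offset has grown from roughly zero since the server started, so the result is only an average; peak write rates will make the window shorter.

```python
def backlog_coverage_seconds(backlog_size_bytes, repl_offset_bytes, uptime_seconds):
    """Estimate how long a slave may be gone before a full re-sync is needed."""
    # Average replication stream rate in bytes per second (assumption:
    # the offset started near zero when the server came up).
    avg_rate = repl_offset_bytes / uptime_seconds
    return backlog_size_bytes / avg_rate


seconds = backlog_coverage_seconds(
    backlog_size_bytes=1048576,      # repl_backlog_size
    repl_offset_bytes=272643811961,  # master_repl_offset
    uptime_seconds=152778,           # uptime_in_seconds
)
# With these inputs the window comes out at well under one second.
print(f"{seconds:.2f} s of writes fit in the backlog")
```

If the estimate is anywhere near right, a 1 MB backlog is far too small for this write volume, which would be consistent with sync_partial_ok:0 and sync_full:10 in that same output. Raising repl-backlog-size is the usual lever, but the right value depends on your traffic.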

Solution 3

The first thing to do would be to look at slowlog get 50 (or pick any number of rows) - this shows the last 50 commands that took non-trivial amounts of time. It could be that some of the things you are doing are simply taking too long. I get worried if I see anything in slowlog - I usually see items every few days. If you are seeing lots of items constantly, then you need to investigate what you are actually doing on the server. One killer thing to never do is KEYS, but there are other things.
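
For a quick overview of a long slowlog, something like the following sketch can help. It assumes you have already fetched the entries with your client and that each one has the shape Redis 2.8 returns: id, unix timestamp, execution time in microseconds, and the argument list. The function name, the 5 ms default threshold and the sample data are purely illustrative.

```python
from collections import defaultdict


def summarize_slowlog(entries, threshold_ms=5.0):
    """Group slowlog entries by command name, keeping count and worst time."""
    stats = defaultdict(lambda: {"count": 0, "max_ms": 0.0})
    for _entry_id, _timestamp, micros, args in entries:
        ms = micros / 1000.0
        if ms < threshold_ms:
            continue  # not interesting at this threshold
        name = args[0].upper()
        stats[name]["count"] += 1
        stats[name]["max_ms"] = max(stats[name]["max_ms"], ms)
    return dict(stats)


# Hypothetical sample data, not taken from a real server.
sample = [
    (1, 1403274000, 12000, ["GET", "page:home"]),
    (2, 1403274001, 18000, ["get", "page:cart"]),
    (3, 1403274002, 95000, ["PSYNC", "?", "-1"]),
    (4, 1403274003, 2000, ["SET", "page:home", "..."]),
]
for name, info in sorted(summarize_slowlog(sample).items()):
    print(name, info["count"], info["max_ms"])
```

Note that the server only records commands slower than slowlog-log-slower-than (10000 microseconds by default) and keeps at most slowlog-max-len entries, so a full slowlog says little about how quickly it filled up.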

The next thing to do is: cache. Requests that get short-circuited before they hit the back end are free. We use redis extensively, but that doesn't mean we ignore local memory too.
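
As a minimal sketch of that idea (not our production code), here is a tiny read-through cache with a time-to-live that sits in front of Redis. The `LocalCache` class and the `load_from_redis` stub are hypothetical names; in a real application the loader would call your Redis client.

```python
import time


class LocalCache:
    """In-process read-through cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        now = self._clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # served locally, Redis is never contacted
        value = loader(key)  # miss or expired: go to the back end once
        self._items[key] = (now + self._ttl, value)
        return value


redis_calls = []


def load_from_redis(key):
    # Stub standing in for a real Redis GET (assumption for the example).
    redis_calls.append(key)
    return "value-for-" + key


cache = LocalCache(ttl_seconds=30)
for _ in range(1000):
    cache.get("product:42", load_from_redis)
print(len(redis_calls))  # 1
```

A short TTL keeps staleness bounded without needing any cross-server invalidation messages, at the cost of serving slightly old data for up to that many seconds.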

Author by

baskabas

Developer and part-time audiophile

Updated on June 16, 2022

Comments

  • baskabas
    baskabas almost 2 years

    Due to a massive load increase on our website, Redis is now struggling at peak load: the Redis server instance reaches 100% CPU (on one of eight cores), resulting in timeouts.

    We've updated our client software to ServiceStack V3 (coming from BookSleeve 1.1.0.4) and upgraded the Redis server to 2.8.11 (coming from 2.4.x). I chose ServiceStack because of the Harbour.RedisSessionStateStore, which uses ServiceStack.Redis. We used AngiesList.Redis together with BookSleeve before, but we hit 100% CPU with that too.

    We have eight Redis servers. One single server handles session state; the other seven are for the data cache and are configured as a master/slave tree: one master with two master/slaves, each of which is connected to two slaves.

    The servers hold about 600 client connections at peak when they start to get clogged at 100% CPU.

    What can we do to increase performance?

    Sharding and/or switching to the StackExchange.Redis client (although no session state client is available for it, to my knowledge)?

    Or could it be something else? The session server also hits 100% and it is not connected to any other servers (data and network throughput are low).


    Update 1: Analysis of redis-cli INFO

    Here's the output of the INFO command after one night of running Redis 2.8.

    # Server
    redis_version:2.8.11
    redis_git_sha1:00000000
    redis_git_dirty:0
    redis_build_id:7a57b118eb75b37f
    redis_mode:standalone
    os:Linux 2.6.32-431.11.2.el6.x86_64 x86_64
    arch_bits:64
    multiplexing_api:epoll
    gcc_version:4.4.7
    process_id:5843
    run_id:d5bb838857d61a9673e36e5bf608fad5a588ac5c
    tcp_port:6379
    uptime_in_seconds:152778
    uptime_in_days:1
    hz:10
    lru_clock:10765770
    config_file:/etc/redis/6379.conf
    
    # Clients
    connected_clients:299
    client_longest_output_list:0
    client_biggest_input_buf:0
    blocked_clients:0
    
    # Memory
    used_memory:80266784
    used_memory_human:76.55M
    used_memory_rss:80719872
    used_memory_peak:1079667208
    used_memory_peak_human:1.01G
    used_memory_lua:33792
    mem_fragmentation_ratio:1.01
    mem_allocator:jemalloc-3.2.0
    
    # Persistence
    loading:0
    rdb_changes_since_last_save:70245
    rdb_bgsave_in_progress:0
    rdb_last_save_time:1403274022
    rdb_last_bgsave_status:ok
    rdb_last_bgsave_time_sec:0
    rdb_current_bgsave_time_sec:-1
    aof_enabled:0
    aof_rewrite_in_progress:0
    aof_rewrite_scheduled:0
    aof_last_rewrite_time_sec:-1
    aof_current_rewrite_time_sec:-1
    aof_last_bgrewrite_status:ok
    aof_last_write_status:ok
    
    # Stats
    total_connections_received:3375
    total_commands_processed:30975281
    instantaneous_ops_per_sec:163
    rejected_connections:0
    sync_full:10
    sync_partial_ok:0
    sync_partial_err:5
    expired_keys:8059370
    evicted_keys:0
    keyspace_hits:97513
    keyspace_misses:46044
    pubsub_channels:2
    pubsub_patterns:0
    latest_fork_usec:22040
    
    # Replication
    role:master
    connected_slaves:2
    slave0:ip=xxx.xxx.xxx.xxx,port=6379,state=online,offset=272643782764,lag=1
    slave1:ip=xxx.xxx.xxx.xxx,port=6379,state=online,offset=272643784216,lag=1
    master_repl_offset:272643811961
    repl_backlog_active:1
    repl_backlog_size:1048576
    repl_backlog_first_byte_offset:272642763386
    repl_backlog_histlen:1048576
    
    # CPU
    used_cpu_sys:20774.19
    used_cpu_user:2458.50
    used_cpu_sys_children:304.17
    used_cpu_user_children:1446.23
    
    # Keyspace
    db0:keys=77863,expires=77863,avg_ttl=3181732
    db6:keys=11855,expires=11855,avg_ttl=3126767
    

    Update 2: twemproxy (Sharding)

    I've discovered an interesting component called twemproxy. As I understand it, this component could shard across multiple Redis instances.

    Would this help relieve the CPU?

    It would save us a lot of programming time, but it would still take some effort to configure 3 extra instances on each server. So I'm hoping somebody can confirm or debunk this solution before we put in the work.

    • Marc Gravell
      Marc Gravell almost 10 years
      Can you clarify: is it Redis that has high CPU, or your web tier? This is important to be clear about. What is the op rate? (Redis shows the instantaneous op count via "info")
    • baskabas
      baskabas almost 10 years
      @Marc: I clarified the post. I will have a look at the op count. The info command is very slow during these loads.
  • baskabas
    baskabas almost 10 years
    Slowlog shows 71 entries (and counting) on our session cache and 128 entries (and counting) on the data cache. We are not experiencing a lot of load at the moment. How long do entries stay in here? Are these counts a lot?
  • Marc Gravell
    Marc Gravell almost 10 years
    @baskabas that doesn't sound like it is continuous, at least - unless it is simply trimmed to that number. Are any of the times offensively large? I get worried at anything over 5ms. But the next thing to look at would be "monitor" (briefly) - to see what the server is actually doing.
  • baskabas
    baskabas almost 10 years
    Could you elaborate on your last remark (in the answer)? We use local memcache on our webservers to lighten the load on our cache. Because we run 2 workers we are now also putting a local instance of redis on each webserver. What would happen if we would make them slaves?
  • baskabas
    baskabas almost 10 years
    most of the entries are between 10ms-20ms. Two stand out, both "PSYNC ? -1" 80ms-105ms. Again we only have 750 concurrent users right now... last night we hit 10K+. I've monitored quite a few times and I just see a lot of keys (and their data) passing by. What would you be looking for?
  • baskabas
    baskabas almost 10 years
    I am willing to try this. Could someone confirm that this does not introduce any other issues?
  • baskabas
    baskabas almost 10 years
    I also found a setting called rdbcompression = true. Could disabling this save CPU? Or will the larger file be the next bottleneck? We have more than enough memory.
  • wallacer
    wallacer almost 10 years
    which doesn't introduce other issues? disabling RDB backup? We run our master and one slave as memory-only. No RDB or AOF. We then have another disk slave which handles creating our disk backups. It causes no "other issues"... You're just giving the redis-server process less work to do.
  • wallacer
    wallacer almost 10 years
    yes, disabling rdbcompression should reduce CPU usage as well. But only if you're actually saving rdb backups on your master, which I'm suggesting you don't do. Disk backups are a great job for a slave machine...
  • baskabas
    baskabas almost 10 years
    Yes, thanks wallacer. I've read into the subject by now. We've disabled saving the RDB since we don't need it. AOF was disabled by default.
  • wallacer
    wallacer almost 10 years
    awesome. Feel free to upvote if this was helpful... just sayin ;)
  • baskabas
    baskabas almost 10 years
    I was trying to, but you need 15+ reputation for that... we'll have to see if it makes any difference tonight, but thanks! ;)
  • Marc Gravell
    Marc Gravell almost 10 years
    @baskabas psync is replication, but that doesn't sound terrible. Monitor would be my next step.
  • baskabas
    baskabas almost 10 years
    Will do on tonight's shift. I have increased the slowlog capacity to 1024 (it was at 128) and it filled up quite fast.
  • baskabas
    baskabas almost 10 years
    Disabling the snapshots and AOF relieved the cache quite a bit. A colleague mentioned something about buffer limits: " fd=259 name= age=3 idle=3 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=3190 oll=25 omem=298154600 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. [5843] 22 Jun 08:23:29.808 # Connection with slave xx.xx.xx.xx:6379 lost."
  • Philip P.
    Philip P. over 9 years
    Interesting answer and an interesting problem. To clarify, you used Redis pub/sub to keep caches on the 35 web boxes in sync, is that correct? Was there any particular pattern that you used there, for example using keys command as was suggested in one of the answers? Also out of interest, what sort of CPU spec was the Redis machine?
  • baskabas
    baskabas over 9 years
    Nothing particular, just send a message when a key gets updated in our CacheManager and remove that key from local memcache on the other end. CPU: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz.