Netty slower than Tomcat

The method messageReceived is executed on a worker thread that is possibly getting blocked by RequestHandler#handle, which may be busy doing I/O work. You could try adding an ExecutionHandler backed by an OrderedMemoryAwareThreadPoolExecutor to the channel pipeline (recommended) so the handlers run off the I/O worker threads, or alternatively dispatch your handler work to a separate ThreadPoolExecutor yourself, keeping a reference to the channel so the response can be written back to the client later. For example:

@Override
public void messageReceived(ChannelHandlerContext ctx, final MessageEvent e) {
    // Hand the potentially blocking work off to a separate executor so the
    // Netty I/O worker thread is released immediately.
    executor.submit(new Runnable() {
        @Override
        public void run() {
            processHandlerAndRespond(e);
        }
    });
}

private void processHandlerAndRespond(MessageEvent e) {

    ChannelBuffer in = (ChannelBuffer) e.getMessage();
    in.readerIndex(4);
    ChannelBuffer out = ChannelBuffers.dynamicBuffer(512);
    out.writerIndex(8); // Skip the length and status code
    boolean success = handler.handle(new ChannelBufferInputStream(in), new ChannelBufferOutputStream(out), new NettyErrorStream(out));
    if (success) {
        out.setInt(0, out.writerIndex() - 8); // length
        out.setInt(4, 0); // Status
    }
    Channels.write(e.getChannel(), out, e.getRemoteAddress());
} 
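
For the first (recommended) option, here is a minimal sketch of the pipeline change, assuming Netty 3.x and reusing the ServerBootstrap/injector setup from the question below; the thread count and memory limits are placeholder values, not tuned numbers:

import java.util.concurrent.Executor;

import org.jboss.netty.channel.ChannelPipeline;
import org.jboss.netty.channel.ChannelPipelineFactory;
import org.jboss.netty.channel.Channels;
import org.jboss.netty.handler.execution.ExecutionHandler;
import org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor;

// 16 threads, 1 MB per-channel and 32 MB total queued-memory limits (placeholders).
Executor eventExecutor = new OrderedMemoryAwareThreadPoolExecutor(16, 1048576, 33554432);
final ExecutionHandler executionHandler = new ExecutionHandler(eventExecutor);

server.setPipelineFactory(new ChannelPipelineFactory() {
  public ChannelPipeline getPipeline() {
    RequestDecoder decoder = injector.getInstance(RequestDecoder.class);
    ContentStoreChannelHandler handler = injector.getInstance(ContentStoreChannelHandler.class);
    // Upstream events that pass the ExecutionHandler are handed to the executor,
    // so ContentStoreChannelHandler (and RequestHandler#handle inside it) no longer
    // runs on the NIO worker threads; the decoder stays on the I/O thread.
    return Channels.pipeline(decoder, executionHandler, handler);
  }
});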

Comments

  • voidmain
    voidmain almost 2 years

    We just finished building a server to store data to disk and fronted it with Netty. During load testing we were seeing Netty scale to about 8,000 messages per second. Given our systems, this looked really low. For a benchmark, we wrote a Tomcat front-end and ran the same load tests. With these tests we were getting roughly 25,000 messages per second.

    Here are the specs for our load testing machine:

    • MacBook Pro, quad core
    • 16GB of RAM
    • Java 1.6

    Here is the load test setup for Netty:

    • 10 threads
    • 100,000 messages per thread
    • Netty server code (pretty standard) - our Netty pipeline on the server consists of two handlers: a FrameDecoder and a SimpleChannelHandler that handles the request and response.
    • Client side: blocking Java I/O ("JIO") using Commons Pool to pool and reuse connections (the pool was sized the same as the # of threads); a hypothetical sketch of such a pooled client follows this list.
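
    A hypothetical sketch of such a pooled client, assuming Commons Pool 1.x and the 4-byte length-prefixed framing used by the server code below; the host, port, and placeholder payload are assumptions, not the actual test client:

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.Socket;

    import org.apache.commons.pool.BasePoolableObjectFactory;
    import org.apache.commons.pool.impl.GenericObjectPool;

    // Hypothetical client-side socket pool; host and port are placeholders.
    final GenericObjectPool socketPool = new GenericObjectPool(new BasePoolableObjectFactory() {
      @Override
      public Object makeObject() throws Exception {
        return new Socket("localhost", 9090);
      }

      @Override
      public void destroyObject(Object obj) throws Exception {
        ((Socket) obj).close();
      }
    });
    socketPool.setMaxActive(10); // pool sized the same as the number of load threads

    // Each load thread: check a socket out, send one framed request, read the response.
    byte[] payload = new byte[256]; // placeholder request body
    Socket socket = (Socket) socketPool.borrowObject();
    try {
      DataOutputStream out = new DataOutputStream(socket.getOutputStream());
      out.writeInt(payload.length); // 4-byte length prefix, matching the FrameDecoder below
      out.write(payload);
      out.flush();

      DataInputStream in = new DataInputStream(socket.getInputStream());
      int responseLength = in.readInt();
      int status = in.readInt(); // status code written by the server
      in.skipBytes(responseLength); // consume the response body
    } finally {
      socketPool.returnObject(socket);
    }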

    Here is the load test setup for Tomcat:

    • 10 threads
    • 100,000 messages per thread
    • Tomcat 7.0.16 with default configuration using a Servlet to call the server code; a hypothetical sketch of such a servlet follows this list
    • Client side using URLConnection without any pooling
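
    A hypothetical sketch of such a servlet front-end (the class name ContentStoreServlet and the handler wiring are assumptions; per the comment discussion below, the servlet passes HttpServletRequest.getInputStream() and HttpServletResponse.getOutputStream() to the same RequestHandler the Netty server uses):

    import java.io.IOException;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Hypothetical servlet front-end; only the use of the request/response streams
    // is taken from the discussion below, the rest is an assumption.
    public class ContentStoreServlet extends HttpServlet {
      // How the handler is wired in (e.g. via Guice, as in the Netty server) is assumed.
      private final RequestHandler handler;

      public ContentStoreServlet(RequestHandler handler) {
        this.handler = handler;
      }

      @Override
      protected void doPost(HttpServletRequest request, HttpServletResponse response)
          throws ServletException, IOException {
        // Same RequestHandler as the Netty server, fed servlet streams instead of
        // ChannelBuffer wrappers; the error-stream argument is omitted in this sketch.
        boolean success = handler.handle(request.getInputStream(),
                                         response.getOutputStream(),
                                         null /* error stream omitted */);
        if (!success) {
          response.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
        }
      }
    }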

    My main question is why there is such a huge difference in performance. Is there something obvious with respect to Netty that can get it to run faster than Tomcat?

    Edit: Here is the main Netty server code:

    NioServerSocketChannelFactory factory = new NioServerSocketChannelFactory();
    ServerBootstrap server = new ServerBootstrap(factory);
    server.setPipelineFactory(new ChannelPipelineFactory() {
      public ChannelPipeline getPipeline() {
        RequestDecoder decoder = injector.getInstance(RequestDecoder.class);
        ContentStoreChannelHandler handler = injector.getInstance(ContentStoreChannelHandler.class);
        return Channels.pipeline(decoder, handler);
      }
    });
    
    server.setOption("child.tcpNoDelay", true);
    server.setOption("child.keepAlive", true);
    Channel channel = server.bind(new InetSocketAddress(port));
    allChannels.add(channel);
    

    Our handlers look like this:

    public class RequestDecoder extends FrameDecoder {
      @Override
      protected ChannelBuffer decode(ChannelHandlerContext ctx, Channel channel, ChannelBuffer buffer) {
        if (buffer.readableBytes() < 4) {
          return null;
        }
    
        buffer.markReaderIndex();
        int length = buffer.readInt();
        if (buffer.readableBytes() < length) {
          buffer.resetReaderIndex();
          return null;
        }
    
        return buffer;
      }
    }
    
    public class ContentStoreChannelHandler extends SimpleChannelHandler {
      private final RequestHandler handler;
    
      @Inject
      public ContentStoreChannelHandler(RequestHandler handler) {
        this.handler = handler;
      }
    
      @Override
      public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
        ChannelBuffer in = (ChannelBuffer) e.getMessage();
        in.readerIndex(4);
    
        ChannelBuffer out = ChannelBuffers.dynamicBuffer(512);
        out.writerIndex(8); // Skip the length and status code
    
        boolean success = handler.handle(new ChannelBufferInputStream(in), new ChannelBufferOutputStream(out), new NettyErrorStream(out));
        if (success) {
          out.setInt(0, out.writerIndex() - 8); // length
          out.setInt(4, 0); // Status
        }
    
        Channels.write(e.getChannel(), out, e.getRemoteAddress());
      }
    
      @Override
      public void exceptionCaught(ChannelHandlerContext ctx, ExceptionEvent e) {
        Throwable throwable = e.getCause();
        ChannelBuffer out = ChannelBuffers.dynamicBuffer(8);
        out.writeInt(0); // Length
        out.writeInt(Errors.generalException.getCode()); // status
    
        Channels.write(ctx, e.getFuture(), out);
      }
    
      @Override
      public void channelOpen(ChannelHandlerContext ctx, ChannelStateEvent e) {
        NettyContentStoreServer.allChannels.add(e.getChannel());
      }
    }
    

    UPDATE:

    I've managed to get my Netty solution to within 4,000/second of Tomcat. A few weeks back I was testing a client-side PING in my connection pool as a safeguard against idle sockets, but I forgot to remove that code before I started load testing. This code effectively PINGed the server every time a Socket was checked out from the pool (using Commons Pool). I commented that code out and I'm now getting 21,000/second with Netty and 25,000/second with Tomcat.

    Although this is great news on the Netty side, I'm still getting 4,000/second fewer with Netty than with Tomcat. I can post my client side (which I thought I had ruled out, but apparently not) if anyone is interested in seeing that.
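
    For context, a hypothetical sketch of how such a checkout-time PING can creep in with Commons Pool: with testOnBorrow enabled, validateObject runs on every borrowObject call, so each message pays for an extra round trip. The PING framing below is an assumption, not the actual code:

    // Hypothetical validation hook on the client-side pool; with testOnBorrow
    // enabled it runs a full round trip on every checkout.
    socketPool.setTestOnBorrow(true);

    // Inside the BasePoolableObjectFactory:
    @Override
    public boolean validateObject(Object obj) {
      try {
        Socket socket = (Socket) obj;
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());
        out.writeInt(0); // assumed empty PING frame
        out.flush();
        DataInputStream in = new DataInputStream(socket.getInputStream());
        in.readInt();    // response length
        in.readInt();    // status code
        return true;
      } catch (IOException e) {
        return false;
      }
    }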

    • Sully
      Sully over 11 years
    • irreputable
      irreputable over 11 years
      it's almost as if Netty used only 1 core.
    • Veebs
      Veebs over 11 years
      Here are some tuning tips for netty: stackoverflow.com/questions/6856116/…
    • forty-two
      forty-two over 11 years
      If the load test setups are different, how can you attribute the difference in results to the server?
    • voidmain
      voidmain over 11 years
      @D3mon-1stVFW I don't see anything there that indicates why the performance is bad.
    • voidmain
      voidmain over 11 years
      @irreputable that would actually make some sense. Although our Netty setup definitely uses multiple threads on the server.
    • voidmain
      voidmain over 11 years
      @forty-two I definitely agree to an extent. The place where it breaks down is that we did a lot of timing and it appeared that the server was the slow point. In fact, it seems that the server was spending most of its time in select and socket reads.
    • voidmain
      voidmain over 11 years
      @Veebs I can try some VM tweaks, but both are using the same parameters. And besides, that shouldn't make a 300% difference.
    • CharlieQ
      CharlieQ over 11 years
      How many workers are used? Did your SimpleChannelHandler fork another handler thread?
    • Norman Maurer
      Norman Maurer over 11 years
      Without seeing your actual code it's impossible to say. I suspect you may be blocking the worker thread, but I can't say for sure without seeing the source.
    • voidmain
      voidmain over 11 years
      @CharlieQ we don't fork anything in the SimpleChannelHandler. We are using the default constructor for the NIOServerSocketChannelFactory, which uses Executors.newCachedThreadPool().
    • voidmain
      voidmain over 11 years
      @NormanMaurer I added in our Netty code. Let me know if you need more or have questions.
    • Norman Maurer
      Norman Maurer over 11 years
      What does RequestHandler.handle(..) do?
    • voidmain
      voidmain over 11 years
      @NormanMaurer that is our code that actually does the server processing. Both the Tomcat and Netty servers call that method. Tomcat uses the HttpServletRequest.getInputStream() and HttpServletResponse.getOutputStream() to get the streams. Netty uses the ChannelBuffer wrappers. Not saying that it couldn't be the issue, but I highly doubt that it is since both Netty and Tomcat are using the same method call.
    • Norman Maurer
      Norman Maurer over 11 years
      Do you do any blocking stuff in there? Anything that could take some time?
    • Nic
      Nic over 11 years
      I've seen some slowdowns happening with 'getRemoteAddress()' on OS X (because of InetAddress lookups...) Maybe try and see if the lookups are having an impact?
    • voidmain
      voidmain over 11 years
      @NormanMaurer there isn't much blocking code in there and it wouldn't matter since the Tomcat and Netty servers use the exact same code. Therefore, I think we can safely rule out everything from RequestHandler down. It has to be somewhere in the IO stack.
    • voidmain
      voidmain over 11 years
      @Nic that actually gave me a 15% bump or so. However, I am still only getting around 9,000/second.
    • voidmain
      voidmain over 11 years
      Everyone, I just found a major oversight on my part that has really improved Netty's performance. However, it still isn't as fast as Tomcat. See the update in the main post for information.
    • Morten Haraldsen
      Morten Haraldsen over 11 years
      To simplify your code, use ReplayingDecoder [1]. (I do not know how big your messages are, but we easily do >75k messages on an average laptop with binary payloads of 12-200 bytes.) [1] link
    • johnstlr
      johnstlr over 11 years
      As Norman said, do consider whether your handler code is performing blocking operations. Netty and Tomcat have different thread architectures out of the box. Consider: on a quad-core machine Netty will by default allocate 8 worker threads (the count is configurable; see the sketch after this comment thread), while Tomcat will probably allocate up to 200. You have 10 threads posting in. Assuming the values I've given here, 20% of your load threads will be waiting rather than being processed by Netty.
    • voidmain
      voidmain over 11 years
      @MortenHaraldsen ReplayingDecoder doesn't work according to the docs. I tried 3 different forms and all had the same error: the returned ChannelBuffer didn't contain everything. Reverting back to my FrameDecoder works fine, though. Looks like a bug in Netty. I would love to know how you are getting 75K messages, though. That is what we were expecting from Netty, but we have not even gotten close. We are still slower than Tomcat by a large number.
    • voidmain
      voidmain over 11 years
      @johnstlr I bumped the worker threads to 100 and it didn't improve performance at all. In fact, it looks like the performance went down by 1,000/second.
    • voidmain
      voidmain over 11 years
      Also, I wanted to mention that we have a performance test that hits the RequestHandler directly (without any networking code) and that portion of the code can process 100,000/second.
    • johnstlr
      johnstlr over 11 years
      Have you tried calling ContentStoreChannelHandler.messageReceived directly (comment out the channel.write) to see if you can get 100,000/second? This would rule out the integration with Netty at that level. Also, how big are the requests? Just wondering if the default AdaptiveReceiveBufferSizePredictor, starting at reads of 1024 bytes and increasing over time, is appropriate.
    • voidmain
      voidmain over 11 years
      @johnstlr I played around with that this morning a bit, and I wasn't able to get it working without major rework to the client and server. That change would essentially make the server not send a response, which caused the client to become more asynchronous, and that's not what our use case is. I could play around with the AdaptiveReceiveBufferSize configuration, but I doubt it would have a huge impact. Anyone else have any thoughts?
    • voidmain
      voidmain over 11 years
      Bump. Anyone have ideas on this one?
    • Yuriy Nakonechnyy
      Yuriy Nakonechnyy almost 9 years
      Curiosity bump :) Was this problem solved and did Netty beat Tomcat?
    • voidmain
      voidmain almost 9 years
      @Yura - We never went back to Netty for more testing. In fact, that application never made it to production. We will likely resurrect it sometime this year, though. I'll likely try the latest versions of Netty and see if I can get it to work.
    • Yuriy Nakonechnyy
      Yuriy Nakonechnyy almost 9 years
      @BrianPontarelli OK, thanks for the reply - it would be great to hear the results :) I'll probably also test Netty vs Tomcat performance to compare the two, because strangely this Q/A was the only adequate benchmark between them that I found via Google
    • hariszhr
      hariszhr almost 8 years
      This is just a wild idea: if your data is flowing only one way, maybe you can try making it a one-way road.
    • HoaPhan
      HoaPhan over 2 years
      You can use jvisualvm or top to check their resource consumption. If Netty is using less than Tomcat, then you can imagine running several replicas of that server and dividing the load, or testing with just a fraction of the load. Basically, the idea of Netty, or of non-blocking I/O in general, is being cheap to start with and being elastic, I think - kind of what Java was missing from the start (the big and mighty JVM with a massive number of threads/workers for performance).
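
On johnstlr's worker-thread point above, a minimal sketch of raising the Netty 3 worker count explicitly (100 is the value voidmain reports trying; per the comments it did not help):

import java.util.concurrent.Executors;

import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

// Netty 3 defaults to 2 * availableProcessors worker threads; this constructor
// overrides that count. 100 matches the value tried in the comments above and
// is not a recommendation.
NioServerSocketChannelFactory factory = new NioServerSocketChannelFactory(
    Executors.newCachedThreadPool(),  // boss threads (accept connections)
    Executors.newCachedThreadPool(),  // worker threads (non-blocking I/O)
    100);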