Small writes to SMB network share are slow on Windows, fast over CIFS Linux mount


Solution 1

The C++ endl is defined to output '\n' followed by a flush. flush() is an expensive operation, so you should generally avoid using endl as your default end of line: it can create exactly the performance issue you are seeing (and not just with SMB, but with any ofstream whose flush is expensive, including local spinning rust or even the latest NVMe at some ridiculously high rate of output).

Replacing endl with "\n" will fix the performance issue above by allowing the system to buffer as intended. Some libraries may flush on "\n", however, in which case you have more headaches (see https://stackoverflow.com/questions/21129162/tell-endl-not-to-flush for a solution overriding the sync() method).
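A minimal sketch of that replacement (mirroring the test loop from the question further down; the single explicit flush at the end is then the only one issued):

    #include <fstream>

    int main()
    {
        std::ofstream outFile("test.txt");
        for (int i = 0; i < 1000000; i++)
        {
            outFile << "Line #" << i << "\n";  // ends the line without forcing a flush
        }
        outFile.flush();  // one flush at the end lets the library buffer as intended
        return 0;
    }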

Now to complicate things, flush() is only defined for what happens within the library buffers. The effect of flush on operating-system, disk, and other external buffers is not defined. For Microsoft .NET, "When you call the FileStream.Flush method, the operating system I/O buffer is also flushed." (https://msdn.microsoft.com/en-us/library/2bw4h516(v=vs.110).aspx) This makes flush particularly expensive for Visual Studio C++, as it round-trips the write all the way out to the physical media at the far end of your remote server, exactly as you are seeing. GCC, on the other hand, says: "A last reminder: there are usually more buffers involved than just those at the language/library level. Kernel buffers, disk buffers, and the like will also have an effect. Inspecting and changing those are system-dependent." (https://gcc.gnu.org/onlinedocs/libstdc++/manual/streambufs.html) Your Ubuntu traces would seem to indicate that the operating-system/network buffers are not flushed by the library flush(). This system-dependent behaviour is all the more reason to avoid endl and excessive flushing. If you are using VC++, you might try switching to a Windows GCC derivative to see how the system-dependent behaviours react, or alternatively use Wine to run the Windows executable on Ubuntu.

More generally, you need to think about your requirements to determine whether flushing every line is appropriate. endl is generally suitable for interactive streams such as the display (where the user needs to actually see the output, and not in bursts), but generally not suitable for other types of streams, including files, where the flushing overhead can be significant. I've seen apps flush on every 1-, 2-, 4-, and 8-byte write... it's not pretty to watch the OS grind through millions of I/Os to write a 1 MB file.

As an example, a log file may need flushing on every line if you are debugging a crash, because you need the ofstream flushed before the crash occurs; another log file may not need per-line flushing if it is just producing verbose informational logging that is expected to be flushed automatically before the application terminates. Nor need it be either/or: you could derive a class with a more sophisticated flush algorithm to suit specific requirements, as in the sketch below.
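As a sketch of that idea (the class name and the flush-every-N-lines policy are illustrative, not a fixed recipe):

    #include <fstream>
    #include <string>

    // Illustrative wrapper: flushes only every N lines instead of on every write,
    // trading a bounded amount of unflushed data for far fewer expensive flushes.
    class PeriodicFlushLog
    {
    public:
        PeriodicFlushLog(const std::string& path, int flushEvery)
            : out_(path), flushEvery_(flushEvery) {}

        void writeLine(const std::string& line)
        {
            out_ << line << '\n';
            if (++count_ % flushEvery_ == 0)
                out_.flush();
        }

    private:
        std::ofstream out_;
        int flushEvery_;
        int count_ = 0;
    };

Usage would be along the lines of PeriodicFlushLog log("app.log", 1000); with log.writeLine(...) in the hot loop.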

Compare your case with the contrasting case of people who need to ensure their data is completely persisted to disk and not vulnerable in an operating system buffer (https://stackoverflow.com/questions/7522479/how-do-i-ensure-data-is-written-to-disk-before-closing-fstream).
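For that latter case, a library-level flush is not enough; here is a minimal POSIX sketch (the helper name is mine, and on Windows the equivalent kernel-level call is FlushFileBuffers):

    #include <cstdio>
    #include <unistd.h>  // fsync, fileno (POSIX)

    // Push data out of the library buffer, then ask the kernel to persist
    // its own buffers to the physical media.
    void persistToDisk(std::FILE* f)
    {
        std::fflush(f);    // library buffer -> kernel
        fsync(fileno(f));  // kernel buffer  -> device
    }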

Note that, as written, your outFile.flush() is superfluous, as it flushes an already-flushed ofstream. To be pedantic, you should have used either endl alone or, preferably, "\n" throughout followed by a single outFile.flush() at the end, but not both.

Solution 2

The performance of remote file operations, such as read/write, over the SMB protocol can be affected by the size of the buffers allocated by servers and clients. The buffer size determines the number of round trips needed to send a fixed amount of data. Every time requests and responses are exchanged between client and server, the time taken is at least the latency between the two sides, which can be very significant in the case of a Wide Area Network (WAN).
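As an illustration with made-up numbers: sending 1 MB through a 64 KB buffer needs at least 16 round trips, so at 1 ms of latency that is roughly 16 ms of waiting; shrink the buffer to 1 KB and the same megabyte costs 1024 round trips, i.e. over a second lost to latency alone.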

SMB buffer -- The MaxBufferSize can be configured through the following registry setting:

HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\SizeReqBuf

Data Type: REG_DWORD

Range: 1024 to 65535 (choose a value above 5000, as appropriate for your requirements)

However, SMB signing affects the maximum buffer size allowed, so to achieve our goal we need to disable SMB signing as well. The following registry value needs to be created on the server side and, if possible, on the client side as well.

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\LanManWorkstation\Parameters

Value Name: EnableSecuritySignature

Data Type: REG_DWORD

Data: 0 (disable), 1 (enable)
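For illustration, both settings could be applied from an elevated command prompt as follows (61440 is just an example within the documented range; a reboot, or a restart of the Server and Workstation services, may be needed for the changes to take effect):

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" /v SizeReqBuf /t REG_DWORD /d 61440
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanManWorkstation\Parameters" /v EnableSecuritySignature /t REG_DWORD /d 0

Keep in mind that disabling SMB signing trades away an integrity-protection feature for throughput.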

Solution 3

I don't have enough reputation to leave a comment (which I think would be better given the level of verification on this answer).

I notice that one big variance between your Linux and Windows traces is that you're using SMB1 on Linux and SMB2 on Windows. Perhaps the batch oplock mechanism in SMB1 Samba performs better than the SMB2 exclusive lease implementation. In both cases these should allow for some amount of client-side caching.

1. Perhaps try setting a lower max protocol level in Samba to try out Windows with SMB1 (see the sketch below).
2. Validate that exclusive oplocks or leases are taken out.
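Concretely, capping the dialect on the Samba side would look like the line below (this matches the update in the question further down), and smbstatus is one way to check which locks/oplocks are currently granted:

    # smb.conf [global]: cap the negotiated dialect at SMB1
    max protocol = NT1

    # on the Samba server: list current locks/oplocks
    smbstatus -L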

Hope this helps :)

Solution 4

Interesting phenomenon. Here is what I would try - I have no idea if it will really help. If it were my machine, I would watch the SMB perfcounters extensively; one of them will show the cause.

More things to try

Add more Worker Threads

In case the SMB redirector (SMB_RDR) opens one write I/O request per line (which should not happen here), it may help to add some threads to the execution engine.

Set "AdditionalCriticalWorkerThreads" to 2, then to 4.

HKLM\System\CurrentControlSet\Control\Session Manager\Executive\AdditionalCriticalWorkerThreads

The default is 0, meaning no additional critical kernel worker threads are added, which is usually fine. This value affects the number of threads that the file system cache uses for read-ahead and write-behind requests. Raising it allows more queued I/O in the storage subsystem (good when you want to write line by line), but it is more CPU-expensive.

Increase the Queue Length

Increasing the "AdditionalCriticalWorkerThreads" value raises the number of threads that the file server can use to service concurrent requests.

HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters\MaxThreadsPerQueue

The default is 20. An indication that the value may need to be increased is the SMB2 work queues growing very large (the perfcounter 'Server Work Queues\Queue Length\SMB2*' should stay below 100).
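A sketch of both tweaks from an elevated prompt (values as suggested above; the Get-Counter path is an assumption based on the counter name cited, and the Executive change requires a reboot):

    reg add "HKLM\System\CurrentControlSet\Control\Session Manager\Executive" /v AdditionalCriticalWorkerThreads /t REG_DWORD /d 2
    reg add "HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters" /v MaxThreadsPerQueue /t REG_DWORD /d 40

    # PowerShell: watch the SMB work queues while the test runs
    Get-Counter '\Server Work Queues(*)\Queue Length'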



Comments

  • mevatron
    mevatron over 1 year

I have been struggling to fix a performance problem with an SMB/CIFS share when performing small writes.

    First, let me describe my current network setup:

    Server

    • Synology DS215j (with SMB3 support enabled)

Clients (same computer, dual-booted, wired Gig-E)

    • Ubuntu 14.04.5 LTS, Trusty Tahr
    • Windows 8.1

    smb.conf

    [global]
        printcap name=cups
        winbind enum groups=yes
        include=/var/tmp/nginx/smb.netbios.aliases.conf
        socket options=TCP_NODELAY IPTOS_LOWDELAY SO_RCVBUF=65536 SO_SNDBUF=65536
        security=user
        local master=no
        realm=*
        passdb backend=smbpasswd
        printing=cups
        max protocol=SMB3
        winbind enum users=yes
        load printers=yes
        workgroup=WORKGROUP
    

    I'm currently testing the small write performance with the following program written in C++ (on GitHub here):

    #include <iostream>
    #include <fstream>
    #include <sstream>
    
    using namespace std;
    
    int main(int argc, char* argv[])
    {
        ofstream outFile(argv[1]);
        for(int i = 0; i < 1000000; i++)
        {
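            // endl forces a flush after every line; over SMB each flush can become a network round trip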
            outFile << "Line #" << i << endl;   
        }
    
        outFile.flush();
        outFile.close();
        return 0;
    }
    

    Linux mount configuration:

    //192.168.1.10/nas-main on /mnt/nas-main type cifs (rw,noexec,nodev)
    

    Program run-time on Linux (peaks network output at ~100Mbps):

    $ time ./nas-write-test /mnt/nas-main/home/will/test.txt
    
    real    0m0.965s
    user    0m0.148s
    sys 0m0.672s
    

    PCAP snapshot showing chunking of many lines into a single TCP packet:

    Linux PCAP snapshot

    Program run-time on Windows as measured by PowerShell:

    > Measure-Command {start-process .\nas-write-test.exe -argumentlist "Z:\home\will\test-win.txt" -wait}
    
    
    Days              : 0
    Hours             : 0
    Minutes           : 9
    Seconds           : 29
    Milliseconds      : 316
    Ticks             : 5693166949
    TotalDays         : 0.00658931359837963
    TotalHours        : 0.158143526361111
    TotalMinutes      : 9.48861158166667
    TotalSeconds      : 569.3166949
    TotalMilliseconds : 569316.6949
    

    PCAP snapshot on Windows showing single line per SMB Write Request:

    Windows PCAP snapshot

    This same program takes about 10 minutes (~2.3Mbps) on Windows. Obviously, the Windows PCAP shows a very noisy SMB conversation with very low payload efficiency.

Are there any settings on Windows that can improve small write performance? It seems from looking at packet captures that Windows doesn't buffer the writes properly and immediately sends out the data one line at a time, whereas on Linux the data is heavily buffered and thus achieves far superior performance. Let me know if the PCAP files would be helpful, and I can find a way to upload them.

    Update 10/27/16:

As mentioned by @sehafoc, I reduced the Samba server's max protocol setting to SMB1 with the following:

    max protocol=NT1

    The above setting resulted in the exact same behavior.

    I also removed the variable of Samba by creating a share on another Windows 10 machine, and it also exhibits the same behavior as the Samba server, so I'm beginning to believe this is a write caching bug with Windows clients in general.

    Update: 10/06/17:

    Full Linux packet capture (14MB)

    Full Windows packet capture (375MB)

    Update: 10/12/17:

I also set up an NFS share, and Windows writes with no buffering to it as well. So it's definitely an underlying Windows client issue as far as I can tell, which is definitely unfortunate :-/

    Any help would be appreciated!

  • mevatron
    mevatron over 7 years
    Thanks for the tip; however, I tried both of these remedies and I am still seeing the above behavior :-/
  • Adi Jha
    Adi Jha over 7 years
You may also like to check why the "Synology DS215j" is not using SMB3. By default, SMB3 is enabled on Win 8.1.
  • mevatron
    mevatron over 6 years
    Thanks a million! You deserve way more than 100 points, but that's all I can give :) This was definitely the problem!