efficiency of fwrite for massive numbers of small writes
Solution 1
First of all, fwrite() is a library function and not a system call. Secondly, it already buffers the data.
You might want to experiment with increasing the size of the buffer. This is done by using setvbuf(). On my system this only helps a tiny bit, but YMMV.
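As a minimal sketch of the setvbuf() approach: the helper below opens a file and asks stdio for a larger, fully buffered stream. The name open_with_big_buffer is hypothetical, and whatever buffer size you pass is something to tune by measurement rather than a recommendation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open `path` for binary writing and request a
 * fully buffered stream of `bufsize` bytes. Passing NULL as the buffer
 * lets the C library allocate it. Returns NULL on failure. */
FILE *open_with_big_buffer(const char *path, size_t bufsize)
{
    assert(path != NULL);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return NULL;
    /* setvbuf() must be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IOFBF, bufsize) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

The returned stream is used with fwrite() and fclose() exactly as before; only the size of the internal buffer changes.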
If setvbuf() does not help, you could do your own buffering and only call fwrite() once you've accumulated enough data. This involves more work, but will almost certainly speed up the writing, as your own buffering can be made much more lightweight than fwrite()'s.
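One way the do-it-yourself buffering could look is sketched below. The names smallbuf, smallbuf_put and smallbuf_flush, and the 64 KiB size, are assumptions made for illustration; the idea is simply to memcpy() small items into an array and call fwrite() only when the array is full.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SMALLBUF_SIZE (1 << 16)  /* assumed size; tune by measurement */

/* Hypothetical accumulator: collects small writes in memory. */
struct smallbuf {
    FILE *fp;                    /* destination stream */
    size_t used;                 /* bytes currently held in data[] */
    char data[SMALLBUF_SIZE];
};

/* Write out whatever has accumulated. Returns 0 on success, -1 on error. */
static int smallbuf_flush(struct smallbuf *b)
{
    assert(b != NULL);
    if (b->used > 0 && fwrite(b->data, 1, b->used, b->fp) != b->used)
        return -1;
    b->used = 0;
    return 0;
}

/* Append n bytes; fwrite() is only reached when the buffer would overflow. */
static int smallbuf_put(struct smallbuf *b, const void *src, size_t n)
{
    if (n > SMALLBUF_SIZE - b->used) {
        if (smallbuf_flush(b) != 0)
            return -1;
        if (n > SMALLBUF_SIZE)   /* too large to buffer: write it directly */
            return fwrite(src, 1, n, b->fp) == n ? 0 : -1;
    }
    memcpy(b->data + b->used, src, n);
    b->used += n;
    return 0;
}
```

Set fp and zero used before the first smallbuf_put(), and call smallbuf_flush() once more before fclose() so the final partial buffer reaches the file.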
Edit: If anyone tells you that it's the sheer number of fwrite() calls that is the problem, demand to see evidence. Better still, do your own performance tests. On my computer, 500,000,000 two-byte writes using fwrite() take 11 seconds. This equates to a throughput of about 90 MB/s.
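If you want to run such a test yourself, something along these lines may serve as a starting point. It is a sketch, not the code behind the figures above: time_small_fwrites is a hypothetical name, clock() reports CPU time rather than wall-clock time, and it assumes unsigned short is two bytes on your platform.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: issue `count` small fwrite() calls to `path`
 * and return the CPU time consumed in seconds, or -1.0 on error. */
double time_small_fwrites(const char *path, long count)
{
    assert(path != NULL && count >= 0);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1.0;
    const unsigned short value = 0xABCD;  /* assumed to be 2 bytes */
    clock_t start = clock();
    for (long i = 0; i < count; ++i) {
        if (fwrite(&value, sizeof value, 1, fp) != 1) {
            fclose(fp);
            return -1.0;
        }
    }
    /* Close before stopping the clock so buffered data is flushed. */
    if (fclose(fp) != 0)
        return -1.0;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_small_fwrites("test.bin", 500000000L) would approximate the experiment described above; be aware that it writes about 1 GB to disk.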
Last but not least, the huge discrepancy between 11 seconds in my test and one hour mentioned in your question hints at the possibility that there's something else going on in your code that's causing the very poor performance.
Solution 2
Your problem is not the buffering in fwrite(), but the total overhead of making the library call with small amounts of data. If you write just 1 MB of data 4 bytes at a time, you make 250,000 function calls. You'd do better to collect your data in memory and then write it to disk with a single call to fwrite().
UPDATE: if you need evidence:
$ dd if=/dev/zero of=/dev/null count=50000000 bs=2
50000000+0 records in
50000000+0 records out
100000000 bytes (100 MB) copied, 55.3583 s, 1.8 MB/s
$ dd if=/dev/zero of=/dev/null count=50 bs=2000000
50+0 records in
50+0 records out
100000000 bytes (100 MB) copied, 0.0122651 s, 8.2 GB/s
Solution 3
OK, well, that was interesting. I thought I'd write some actual code to see what the speed was. And here it is. Compiled using C++ DevStudio 2010 Express. There's quite a bit of code here. It times 5 ways of writing the data:-
- Naively calling fwrite
- Using a buffer and doing fewer calls to fwrite using bigger buffers
- Using the Win32 API naively
- Using a buffer and doing fewer calls to Win32 using bigger buffers
- Using Win32 but double buffering the output and using asynchronous writes
Please check that I've not done something a bit stupid with any of the above.
The program uses QueryPerformanceCounter for timing the code and ends the timing after the file has been closed to try and include any pending internal buffered data.
The results on my machine (an old WinXP SP3 box):-
- fwrite on its own is generally the fastest although the buffered version can sometimes beat it if you get the size and iterations just right.
- Naive Win32 is significantly slower
- Buffered Win32 doubles the speed but it is still easily beaten by fwrite
- Asynchronous writes were not significantly better than the buffered version. Perhaps someone could check my code and make sure I've not done something stupid as I've never really used the asynchronous IO before.
You may get different results depending on your setup.
Feel free to edit and improve the code.
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <memory.h>
#include <Windows.h>
const int
// how many times fwrite/my_fwrite is called
c_iterations = 10000000,
// the size of the buffer used by my_fwrite
c_buffer_size = 100000;
char
buffer1 [c_buffer_size],
buffer2 [c_buffer_size],
*current_buffer = buffer1;
int
write_ptr = 0;
__int64
write_offset = 0;
OVERLAPPED
overlapped = {0};
// write to a buffer, when buffer full, write the buffer to the file using fwrite
void my_fwrite (void *ptr, int size, int count, FILE *fp)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
fwrite (buffer1, write_ptr, 1, fp);
write_ptr = 0;
}
memcpy (&buffer1 [write_ptr], ptr, c);
write_ptr += c;
}
// write to a buffer, when buffer full, write the buffer to the file using Win32 WriteFile
void my_fwrite (void *ptr, int size, int count, HANDLE fp)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
DWORD
written;
WriteFile (fp, buffer1, write_ptr, &written, 0);
write_ptr = 0;
}
memcpy (&buffer1 [write_ptr], ptr, c);
write_ptr += c;
}
// write to a double buffer, when buffer full, write the buffer to the file using
// asynchronous WriteFile (waiting for previous write to complete)
void my_fwrite (void *ptr, int size, int count, HANDLE fp, HANDLE wait)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
WaitForSingleObject (wait, INFINITE);
overlapped.Offset = write_offset & 0xffffffff;
overlapped.OffsetHigh = write_offset >> 32;
overlapped.hEvent = wait;
WriteFile (fp, current_buffer, write_ptr, 0, &overlapped);
write_offset += write_ptr;
write_ptr = 0;
current_buffer = current_buffer == buffer1 ? buffer2 : buffer1;
}
memcpy (current_buffer + write_ptr, ptr, c);
write_ptr += c;
}
int main ()
{
// do lots of little writes
FILE
*f1 = fopen ("f1.bin", "wb");
LARGE_INTEGER
f1_start,
f1_end;
QueryPerformanceCounter (&f1_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
fwrite (&i, sizeof i, 1, f1);
}
fclose (f1);
QueryPerformanceCounter (&f1_end);
// do a few big writes
FILE
*f2 = fopen ("f2.bin", "wb");
LARGE_INTEGER
f2_start,
f2_end;
QueryPerformanceCounter (&f2_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f2);
}
if (write_ptr)
{
fwrite (buffer1, write_ptr, 1, f2);
write_ptr = 0;
}
fclose (f2);
QueryPerformanceCounter (&f2_end);
// use Win32 API, without buffer
HANDLE
f3 = CreateFile (TEXT ("f3.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
LARGE_INTEGER
f3_start,
f3_end;
QueryPerformanceCounter (&f3_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
DWORD
written;
WriteFile (f3, &i, sizeof i, &written, 0);
}
CloseHandle (f3);
QueryPerformanceCounter (&f3_end);
// use Win32 API, with buffer
HANDLE
f4 = CreateFile (TEXT ("f4.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_FLAG_WRITE_THROUGH, 0);
LARGE_INTEGER
f4_start,
f4_end;
QueryPerformanceCounter (&f4_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f4);
}
if (write_ptr)
{
DWORD
written;
WriteFile (f4, buffer1, write_ptr, &written, 0);
write_ptr = 0;
}
CloseHandle (f4);
QueryPerformanceCounter (&f4_end);
// use Win32 API, with double buffering
HANDLE
f5 = CreateFile (TEXT ("f5.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_FLAG_OVERLAPPED | FILE_FLAG_WRITE_THROUGH, 0),
wait = CreateEvent (0, false, true, 0);
LARGE_INTEGER
f5_start,
f5_end;
QueryPerformanceCounter (&f5_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f5, wait);
}
if (write_ptr)
{
WaitForSingleObject (wait, INFINITE);
overlapped.Offset = write_offset & 0xffffffff;
overlapped.OffsetHigh = write_offset >> 32;
overlapped.hEvent = wait;
WriteFile (f5, current_buffer, write_ptr, 0, &overlapped);
WaitForSingleObject (wait, INFINITE);
write_ptr = 0;
}
CloseHandle (f5);
QueryPerformanceCounter (&f5_end);
CloseHandle (wait);
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
// the elapsed times are 64-bit values, so cast them to int to match the %d format
printf (" fwrites without buffering = %dms\n", (int) ((1000 * (f1_end.QuadPart - f1_start.QuadPart)) / freq.QuadPart));
printf (" fwrites with buffering = %dms\n", (int) ((1000 * (f2_end.QuadPart - f2_start.QuadPart)) / freq.QuadPart));
printf (" Win32 without buffering = %dms\n", (int) ((1000 * (f3_end.QuadPart - f3_start.QuadPart)) / freq.QuadPart));
printf (" Win32 with buffering = %dms\n", (int) ((1000 * (f4_end.QuadPart - f4_start.QuadPart)) / freq.QuadPart));
printf ("Win32 with double buffering = %dms\n", (int) ((1000 * (f5_end.QuadPart - f5_start.QuadPart)) / freq.QuadPart));
}
Solution 4
First and foremost: small fwrite() calls are slower, because each fwrite() has to test the validity of its parameters, do the equivalent of flockfile(), possibly fflush(), append the data, and return success. This overhead adds up: not as much as with tiny calls to write(2), but it's still noticeable.
Proof:
#include <stdio.h>
#include <stdlib.h>
static void w(const void *buf, size_t nbytes)
{
size_t n;
if(!nbytes)
return;
n = fwrite(buf, 1, nbytes, stdout);
if(n >= nbytes)
return;
if(!n) {
perror("stdout");
exit(111);
}
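/* arithmetic on void * is a GNU extension, so step through the buffer as char */
w((const char *)buf + n, nbytes - n);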
}
/* Usage: time $0 $blocksize <$bigfile >/dev/null (a blocksize of 0 selects the full 32 KiB buffer) */
int main(int argc, char *argv[])
{
char buf[32*1024];
size_t sz;
if(argc < 2)
return 111;
sz = atoi(argv[1]);
if(sz > sizeof(buf))
return 111;
if(sz == 0)
sz = sizeof(buf);
for(;;) {
size_t r = fread(buf, 1, sz, stdin);
if(r < 1)
break;
w(buf, r);
}
return 0;
}
That being said, you could do what many commenters suggested, i.e. add your own buffering in front of fwrite(): the code is trivial, but you should test whether it really gives you any benefit.
If you don't want to roll your own, you can use e.g. the buffer interface in skalibs, but you'll probably take longer to read the docs than to write it yourself (IMHO).
camelccc
Updated on July 22, 2022
Comments
-
camelccc almost 2 years
I have a program that saves many large files (>1GB) using fwrite. It works fine, but unfortunately, due to the nature of the data, each call to fwrite only writes 1-4 bytes, with the result that the write can take over an hour, with most of this time seemingly due to the syscall overhead (or at least the overhead of the fwrite library function). I have a similar problem with fread.
Does anyone know of any existing library functions that will buffer these writes and reads with an inline function, or is this another roll-your-own?
-
Skizz over 11 years
So instead of calling fwrite, use a memory buffer and a current write/read pointer, flushing/filling the buffer when full/empty and starting at the beginning again.
-
Skizz over 11 years
The problem's not the buffering, but the sheer number of calls to fwrite.
-
NPE over 11 years
@Skizz: What makes you think that? If you have any evidence, I'd love to see it.
-
lenik over 11 years
@Skizz please show us how you generate the data, then you may get some advice. But generally std::vector<your_stuff> should solve the problem with pointers, writing, and flushing, and you need only one fwrite() at the end, or maybe more, from time to time.
-
Skizz over 11 years
Well, writing over a gigabyte of data in 1-4 byte chunks is an awful lot of fwrite calls.
-
NPE over 11 years
@Skizz: That's not exactly evidence, is it?
-
Skizz over 11 years
How does std::vector<T> work with various types of different size?
-
Skizz over 11 years
@NPE: Well, more gut feeling. If you can reduce the 250 million function calls by a factor of a million (250 writes of a 4 meg buffer), then you've saved yourself a lot of execution time. If you then double buffer the writing, the limiting factor becomes the IO bandwidth.
-
Skizz over 11 years
If I could, I'd give an extra +1 for the timings.
-
NPE over 11 years
With regards to the timings, GNU dd does not use fwrite(). Assuming your dd is the same, the timings have little to do with the question.
-
NPE over 11 years
On my system, I can do 250 million fwrite() calls in 5.5 seconds. If you reduce this by a factor of ten, you've saved five seconds. OP is talking about an hour.
-
geekpp over 11 years
I agree with NPE. fwrite is NOT a system call!! There is no cost to calling it multiple times. People saying the opposite need to go back to school. You can just set up a big enough buffer to reduce the number of underlying system calls, which are calls to the "write(fd,void*,int)" function.
-
geekpp over 11 years
This answer is plain wrong. Take a look at NPE's answer and the comments (or my C++ solution) to save yourself time.
-
camelccc over 11 years
This gets more and more interesting. Tried setbuf() and no difference. I think the OS is buffering anyway. Did a few tests and it seems that calling fwrite with a lot of small writes that are all different sizes is rather slower than calling fwrite with small writes that are all the same size.
-
Skizz over 11 years
I should add that I built the program as a Windows Console application.
-
tmyklebu over 11 years
Cool! What results do you get?
-
fxtentacle about 7 years
This answer is highly misleading. dd with bs=2 will actually issue one write syscall to the kernel every two bytes. fwrite with its default buffering enabled will be one local library function call every two bytes, and then a write syscall every time the buffer gets full. The main overhead is the kernel calls, so dd bs=2 is not an accurate emulation of fwrite with 2-byte blocks.