How to write repeated free-form strings to a file, as fast as 'dd'?

Solution 1

$ time perl -e \
    '$count=1024*1024; while ($count>0) { print "x" x 384; $count--; }' > out
real    0m1.284s
user    0m0.316s
sys 0m0.961s
$ ls -lh out
-rw-r--r-- 1 me group 384M Apr 16 19:47 out

Replace "x" x 384 (which produces a string of 384 xs) with whatever you like.

You can optimize this further by writing a bigger string in each loop iteration and bypassing the normal stdout buffering.

$ perl -e \
   '$count=384; while ($count>0) {
      syswrite(STDOUT, "x" x (1024*1024),  1024*1024);
      $count--;
    }' > out

In this case, each syswrite call passes 1 MiB at a time down to the underlying write syscall, which is already quite efficient. (I'm getting around 0.940s user with this.)
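
For comparison, here is a minimal Python 3 sketch of the same idea: build one large chunk out of whole copies of the string and hand it straight to the write syscall with os.write, skipping the buffered file object. The function name write_repeated and its chunk_bytes parameter are made up for this illustration, and I haven't benchmarked it against the Perl version.

```python
import os

def write_repeated(fd, unit, total_bytes, chunk_bytes=1024 * 1024):
    """Write `unit` over and over until `total_bytes` bytes have been written."""
    if not unit:
        raise ValueError("unit must not be empty")
    # One chunk holds whole copies of the unit, so every chunk starts aligned.
    chunk = unit * max(1, chunk_bytes // len(unit))
    written = 0
    while written < total_bytes:
        piece = memoryview(chunk)[: total_bytes - written]
        # os.write may accept fewer bytes than offered, so loop until done.
        offset = 0
        while offset < len(piece):
            offset += os.write(fd, piece[offset:])
        written += len(piece)
    return written
```

Something like write_repeated(sys.stdout.fileno(), b"x" * 384, 384 * 1024 * 1024), with stdout redirected to a file, would correspond to the Perl one-liner above.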

Hint: make sure you call sync between each test to avoid having the previous run's flushing interfere with the current run's I/O.
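
If you drive the tests from a script, the hint above could look roughly like this in Python 3. This is only a sketch: timed_run is a made-up helper name, and os.sync is only available on Unix-like systems, so the code skips it elsewhere.

```python
import os
import time

def timed_run(fn, *args):
    """Flush pending disk writes, then time a single call of fn(*args)."""
    sync = getattr(os, "sync", None)  # not present on every platform
    if sync is not None:
        sync()  # let the previous test's dirty pages hit the disk first
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start
```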

For reference, I get this time:

$ time dd if=/dev/zero bs=1024 count=$((1024*384)) of=./out
393216+0 records in
393216+0 records out
402653184 bytes (403 MB) copied, 1.41404 s, 285 MB/s

real    0m1.480s
user    0m0.054s
sys 0m1.410s

Solution 2

It's generally expected that shells are slow at processing large pieces of data. For most scripts, you know in advance which bits of data are likely to be small and which bits of data are likely to be large.

  • Prefer to rely on shell built-ins for small data, because forking and exec'ing an external process induces a constant overhead.
  • Prefer to rely on external, special-purpose tools for large data, because special-purpose compiled tools are more efficient than an interpreted general-purpose language.

dd makes read and write calls that use the block size. You can observe this with strace (or truss, trace, … depending on your OS):

$ strace -s9 dd if=/dev/zero of=/dev/null ibs=1024k obs=2048k count=4
✄
read(0, "\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
read(0, "\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
read(0, "\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
write(1, "\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
✄

Most other tools have a much lower cap on the maximum buffer size, so they would make more syscalls, and hence take more time. But note that this is an unrealistic benchmark: if you were writing to a regular file or a pipe or a socket, the kernel would probably not write more than a few kilobytes per syscall anyway.
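
To put numbers on the "more syscalls" point, here is a small Python sketch that counts how many write calls are needed for the 384 MiB file used elsewhere on this page, assuming every call transfers one full block. The helper name write_calls is just for illustration.

```python
def write_calls(total_bytes, block_size):
    """Number of write() calls needed if each call transfers one full block."""
    return -(-total_bytes // block_size)  # ceiling division

total = 384 * 1024 * 1024  # 402653184 bytes
for block_size in (384, 1024, 1024 * 1024):
    print(block_size, write_calls(total, block_size))
# prints:
# 384 1048576
# 1024 393216
# 1048576 384
```

The 1024-byte row matches the "393216+0 records" that dd reports for bs=1024 in the first answer.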

Solution 3

You can use dd for this! First write the string to the beginning of the file. That first copy is already in place, so append the remaining copies with:

dd if=$FILE of=$FILE bs=$STRING_LENGTH seek=1 count=$((REPEAT_TIMES-1))

Note: if your $STRING_LENGTH is small, you might first grow the file to one 1024-byte block and then copy in 1024-byte blocks, for a total of REPEAT_TIMES copies:

dd if=$FILE of=$FILE bs=$STRING_LENGTH seek=1 count=$((1024/STRING_LENGTH-1))
dd if=$FILE of=$FILE bs=1024 seek=1 count=$((REPEAT_TIMES*STRING_LENGTH/1024-1))

(This example only works if STRING_LENGTH is a power of 2 no larger than 1024 and REPEAT_TIMES is a multiple of 1024, but you get the idea.)

If you want to use this to overwrite an existing file (e.g. for purging), use conv=notrunc.
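
The same self-copy trick can be sketched in Python 3, which may make the mechanics easier to follow: write the string once, then keep copying the start of the file onto its end, with the copy size doubling until it reaches a cap. The name fill_by_self_copy and the max_block parameter are assumptions made for this sketch, not part of dd.

```python
def fill_by_self_copy(path, unit, repeat_times, max_block=1024 * 1024):
    """Create `path` holding `unit` repeated `repeat_times` times."""
    if not unit or repeat_times < 1:
        raise ValueError("need a non-empty unit and at least one repeat")
    target = len(unit) * repeat_times
    # Cap the copy size at a whole number of units so copies stay aligned.
    cap = max(len(unit), (max_block // len(unit)) * len(unit))
    with open(path, "wb") as f:
        f.write(unit)  # the first copy, like writing the string by hand
    size = len(unit)
    with open(path, "r+b") as f:
        while size < target:
            n = min(size, target - size, cap)
            f.seek(0)
            data = f.read(n)  # read from the start of the file...
            f.seek(size)
            f.write(data)     # ...and append it at the current end
            size += n
    return size
```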

Solution 4

I've finally got my idea for doing this working. It uses a tee | tee | tee chain, which runs at close to dd's speed.

# ============================================================================
# repstr
#
# Brief:
#   Make multiple (repeat) copies of a string.
#   Option -e, --eval is used as in 'echo -e'
#
# Return:
#   The resulting string is sent to stdout
#
#   Args:       Option      $1         $2
#             -e, --eval   COUNT      STRING
#     repstr             $((2**40))    "x"       # 1 TB:     xxxxxxxxx...
# eg. repstr  -e            7         "AB\tC\n"  # 7 lines:  AB<TAB>C
#     repstr                2         "ऑढळ|a"   # 2 copies:  ऑढळ|aऑढळ|a 
#

[[ "$1" == "-e" || "$1" == "--eval" ]] && { e="-e"; shift 1; }|| e=""
 count="$1"
string="$2"
[[ "${count}" == ""         ]] && exit 1 # $count is missing
[[ "${count//[0-9]/}" != "" ]] && exit 2 # $count is not an integer
[[ "${count}" == "0"        ]] && exit 0 # nothing to do
[[ "${string}" == ""        ]] && exit 0 # nothing to do
#
# ========================================================================
# Find the highest power of 2 which is <= count
#   ie. check ascending powers of 2
((leqXpo=0))  # Exponent such that 2**leqXpo <= count
((leqCnt=1))  # The matching power of 2, which is <= count
while ((count>=leqCnt)) ;do
  ((leqXpo+=1))
  ((leqCnt*=2))
done
((leqXpo-=1))
((leqCnt/=2))
#   
# ======================================================================================
# Output $string to 'tee's which are daisy-chained in groups of descending 'powers of 2'
todo=$count
for ((xpo=leqXpo ;xpo>0 ;xpo--)) ;do
  tchain="" 
  floor=$((2**xpo))
  if ((todo>=(2**xpo))) ; then
    for ((t=0 ;t<xpo ;t++)) ;do tchain="$tchain|tee -" ;done
    eval echo -n $e \"'$string'\" $tchain # >/dev/null
    ((todo-=floor))
  fi
done
if ((todo==1)) ;then 
  eval echo -n $e \"'$string'\" # >/dev/null
fi
#
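
To make the arithmetic behind the script easier to see, here is a small in-memory Python 3 sketch of the same binary decomposition. It is only an illustration: the real script streams through tee processes, while this builds the whole result in memory, and the names tee_chain_plan and repeat_by_doubling are made up here.

```python
def tee_chain_plan(count):
    """Exponents of the powers of 2 that add up to count, largest first."""
    return [x for x in range(count.bit_length() - 1, -1, -1) if count >> x & 1]

def repeat_by_doubling(unit, count):
    """Build `count` copies of `unit` from groups of 2**exponent copies."""
    pieces = []
    for exponent in tee_chain_plan(count):
        block = unit
        for _ in range(exponent):
            block += block  # plays the role of one "tee -" stage: doubles the data
        pieces.append(block)
    return b"".join(pieces)

print(tee_chain_plan(13))             # [3, 2, 0]  because 13 = 8 + 4 + 1
print(repeat_by_doubling(b"ab", 3))   # b'ababab'
```

An exponent of 0 corresponds to the final plain echo in the script, which runs when one copy is still left to write.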

Here are some timing results. I've gone up to 32 GB because that's about the size of the test file I wanted to create (which is what started me off on this issue).

NOTE: (2**30), etc. refers to the number of strings (to achieve a particular GB filesize)
-----
dd method (just for reference)                              real/user/sys
* 8GB                                                       =================================
    if=/dev/zero bs=1024 count=$(((1024**2)*8))         #   2m46.941s / 00m3.828s / 0m56.864s

tee method: fewer tests, because it didn't overflow, and the number-of-strings:time ratio is linear
tee method:              count        string                real/user/sys  
* 8GB                    ==========   ============          =================================
  tee(2**33)>stdout      $((2**33))   "x"               #   1m50.605s / 0m01.496s / 0m27.774s
  tee(2**30)>stdout  -e  $((2**30))   "xxx\txxx\n"      #   1m49.055s / 0m01.560s / 0m27.750s
* 32GB                                                     
  tee(2**35)>stdout  -e  $((2**35))   "x"               #   
  tee(2**32)>stdout  -e  $((2**32))   "xxx\txxx\n"      #   7m34.867s / 0m06.020s / 1m52.459s

python method: '.write'  uses 'file.write()' 
               '>stdout' uses 'sys.stdout.write()'. It handles \n in args (but I know very little python)
                            count   string                   real/user/sys
* 8GB                       =====   ===================      =================================
  python(2**33)a .write     2**33    "x"                 # OverflowError: repeated string is too long
  python(2**33)a >stdout    2**33    "x"                 # OverflowError: repeated string is too long
  python(2**30)b .write     2**30   '"xxxxxxxX" *2**0'   #   6m52.576s / 6m32.325s / 0m19.701s
  python(2**30)b >stdout    2**30   '"xxxxxxxX" *2**0'   #   8m11.374s / 7m49.101s / 0m19.573s
  python(2**30)c .write     2**20   '"xxxxxxxX" *2**10'  #   2m14.693s / 0m03.464s / 0m22.585s 
  python(2**30)c >stdout    2**20   '"xxxxxxxX" *2**10'  #   2m32.114s / 0m03.828s / 0m22.497s
  python(2**30)d .write     2**10   '"xxxxxxxX" *2**20'  #   2m16.495s / 0m00.024s / 0m12.029s
  python(2**30)d >stdout    2**10   '"xxxxxxxX" *2**20'  #   2m24.848s / 0m00.060s / 0m11.925s
  python(2**30)e .write     2**0    '"xxxxxxxX" *2**30'  # OverflowError: repeated string is too long
  python(2**30)e >stdout    2**0    '"xxxxxxxX" *2**30'  # OverflowError: repeated string is too long
* 32GB
  python(2**32)f.write      2**12   '"xxxxxxxX" *2**20'  #   7m58.608s / 0m00.160s / 0m48.703s
  python(2**32)f>stdout     2**12   '"xxxxxxxX" *2**20'  #   7m14.858s / 0m00.136s / 0m49.087s

perl method:
                           count   string                    real      / user       / sys
* 8GB                      =====   ===================       =================================
  perl(2**33)a .syswrite>  2**33    "a"        x 2**0    # Sloooooow! It would take 24 hours.   I extrapolated after 1 hour.   
  perl(2**33)a >stdout     2**33    "a"        x 2**0    #  31m46.405s / 31m13.925s /  0m22.745s
  perl(2**30)b .syswrite>  2**30    "aaaaaaaA" x 2**0    # 100m41.394s / 11m11.846s / 89m27.175s
  perl(2**30)b >stdout     2**30    "aaaaaaaA" x 2**0    #   4m15.553s /  3m54.615s /  0m19.949s
  perl(2**30)c .syswrite>  2**20    "aaaaaaaA" x 2**10   #   1m47.996s /  0m10.941s /  0m15.017s
  perl(2**30)c >stdout     2**20    "aaaaaaaA" x 2**10   #   1m47.608s /  0m12.237s /  0m23.761s
  perl(2**30)d .syswrite>  2**10    "aaaaaaaA" x 2**20   #   1m52.062s /  0m10.373s /  0m13.253s
  perl(2**30)d >stdout     2**10    "aaaaaaaA" x 2**20   #   1m48.499s /  0m13.361s /  0m22.197s
  perl(2**30)e .syswrite>  2**0     "aaaaaaaA" x 2**30   # Out of memory during string extend at -e line 1.   
  perl(2**30)e >stdout     2**0     "aaaaaaaA" x 2**30   # Out of memory during string extend at -e line 1.   
* 32GB
  perl(2**32)f .syswrite>  2**12    "aaaaaaaA" x 2**20   #   7m34.241s /  0m41.447s / 0m51.727s
  perl(2**32)f >stdout     2**12    "aaaaaaaA" x 2**20   #  10m58.444s /  0m53.771s / 1m28.498s

Solution 5

Python version:

import sys

CHAR = sys.argv[1] if len(sys.argv) > 1 else "x"

block = CHAR * 1024
count = 1024 * 384

with open("testout.bin", "w") as outf:
    for i in xrange(count):
        outf.write(block)

python2.7 writestr.py x
0.27s user 0.69s system 99% cpu 0.963 total

dd if=/dev/zero of=testout.bin bs=1024 count=$((1024*384))
0.05s user 1.05s system 94% cpu 1.167 total

Python has a higher initialization cost, but overall beat dd on my system.

Peter.O
Author by

Peter.O

Updated on September 18, 2022

Comments

  • Peter.O
    Peter.O almost 2 years

    dd can write repeating \0 bytes to a file very fast, but it can't write repeating arbitrary strings.
    Is there a bash-shell method to write repeating arbitrary strings equally as fast as 'dd' (including \0)?

    All the suggestions I've encountered in 6 months of Linux are things like printf "%${1}s" | sed -e "s/ /${2}/g", but this is painfully slow compared to dd, as shown below, and sed crashes after approximately 384 MB (on my box) -- actually that's not bad for a single line-length :) -- but it did crash!
    I suppose that wouldn't be an issue for sed, if the string contained a newline.

    Speed comparison of dd vs. printf+sed:

                                real        user        sys       
    WRITE 384 MB: 'dd'          0m03.833s   0m00.004s   0m00.548s
    WRITE 384 MB: 'printf+sed'  1m39.551s   1m34.754s   0m02.968s
    
    # the two commands used   
    dd if=/dev/zero bs=1024 count=$((1024*384))
    printf "%$((1024*1024*384))s" |sed -e "s/ /x/g"
    

    I have an idea how to do this in a bash-shell script, but there's no point re-inventing the wheel. :)

  • Peter.O
    Peter.O about 13 years
    Another inciteful answer, thanks... I really like your bullet-point maxims about when to "Prefer"... I'm starting to differentiate between shell built-ins and the externals... I've close to finished my alternative method. Its speed is very close to dd's, and it seems to be rather indifferent to the string size... (I'll try to post it sometime tomorrow, once I get it ship-shape :)
  • Peter.O
    Peter.O about 13 years
    Interesting and useful. As the string length reduces, the time increases. On my box your exact command took real/user/sys 0m4.565s/0m0.804s/0m0.904s; with a string "x\n", it took r/u/s 0m30.227s/0m29.202s/0m0.880s... but that's still certainly faster than printf+sed. The 384-byte string version is about the same speed as dd on my system too (it's funny how things vary; I actually got a slower dd speed this time).
  • asoundmove
    asoundmove about 13 years
    @fred.bear, spelling tip: I suppose you meant "insightful" rather than "inciteful" (which does not exist, but could be linked to "to incite").
  • Peter.O
    Peter.O about 13 years
    @asoundmove: Thanks. I'm quite happy with such alerts.. but I definitely(?) meant 'inciteful' :) oxforddictionaries.com/view/entry/m_en_gb0404940#m_en_gb0404940 (but not to incite to illegal actions, as the strict sense of the word implies)... I may have got the two cross wired.. I recall both sentiments; "insight" and "being spurred on"... Actually, I'll concede.. Hey, :) my excuse is: not a lot of sleep last night. too much Q&A.... I think I did mean mainly "insight".. but I definitely recall thinking of both words. (a bit off topic, but a change is as good as a holiday :)
  • asoundmove
    asoundmove about 13 years
    @fred.bear: oh it does exist! New one on me. Learn something new everyday.
  • Peter.O
    Peter.O about 13 years
    In essence that's what I've done too..(but differently)... I'll check this later (busy now), and I've "answered" the question with my "tee" version...
  • Peter.O
    Peter.O about 13 years
    @user-unknown: I've looked at it again.. I think the idea is good (but I would, as we have both used a binary doubling :).. It creates a lot of files.. which then have to be selectively cat'd again to get the final desired number of strings, eg 987654321 repeats of your string... and as you said, it slows down a lot with larger numbers of repeating strings... It had been running for approx 40 mins to make a 32GB file, so I killed it. (I'm after a 35 GB file..) ... The tee process I've used takes 7-9 minutes... but I'm all for the binary idea.. binary splits and doublings are powerful tools
  • Peter.O
    Peter.O about 13 years
    This is looking very good.. I think that the actual number of repeats (xrange) would depend on system resources, but it can get several GB of strings from xrange alone... (easily dealt with, with a bit of bounds checking)... I've included some test times in my answer.. Both your method and my method are close to 'dd', timewise..
  • Peter.O
    Peter.O about 13 years
    I've included some test times in my answer (so that all times relate to the same hardware).
  • erik
    erik about 11 years
    For better comparability, you should perform all of your tests on /dev/shm to avoid interference from the cache of your hard disk. Of course, only if you have enough RAM in your machine.