wget: obtaining files matching regex


Solution 1

Be careful: --accept-regex is matched against the complete URL, but the goal here is to select specific file names, so use -A (--accept) instead.

For example,

wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/ 

downloads every file named IMG00.jpg through IMG29.jpg that is found under that URL.

Note that a matching pattern contains shell-like wildcards, e.g. ‘books*’ or ‘zelazny*196[0-9]*’. Quote the pattern so that your shell does not expand it before wget sees it.
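To preview which names a pattern such as IMG[012][0-9].jpg would accept before starting a recursive download, the glob can be tried in the shell first. This is only a sketch: it assumes the shell's case globbing behaves like wget's -A matching for this simple pattern, and the matches_glob helper and the candidate file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the name matches the same glob passed to -A.
# Assumption: shell globbing and wget's pattern matching agree for this pattern.
matches_glob() {
  case "$1" in
    IMG[012][0-9].jpg) return 0 ;;
    *) return 1 ;;
  esac
}

# Candidate names are invented for this check; nothing is downloaded.
for name in IMG00.jpg IMG29.jpg IMG30.jpg IMG7.jpg; do
  if matches_glob "$name"; then
    echo "$name: accepted"
  else
    echo "$name: rejected"
  fi
done
```

Here IMG00.jpg and IMG29.jpg are accepted, while IMG30.jpg and IMG7.jpg are rejected because the first digit must be 0, 1 or 2 and exactly two digits are required.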

References:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex tutorial: https://regexone.com/

Solution 2

I'm reading in the wget man page:

  --accept-regex urlregex
  --reject-regex urlregex
       Specify a regular expression to accept or reject the complete URL.

and noticing that it mentions the complete URL (e.g. something like
ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-000121.tar.gz)

So I suggest (without having tried it) to use
--accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz'

(and perhaps give the appropriate --regex-type too)
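One way to sanity-check such a regex before handing it to wget is to run it against a few complete URLs with grep -E. This is a rough sketch, assuming grep's extended regex syntax is close enough to wget's default posix regex type for this pattern. The url_accepted helper and the diffs-000207.tar.gz name are made up for illustration. The backslash before the hyphen is dropped because a hyphen needs no escaping outside a bracket expression.

```shell
#!/bin/sh
# Regex tested against the complete URL, as --accept-regex is documented to do.
url_regex='diffs-0001[0-9]{2}\.tar\.gz$'

# Hypothetical helper: succeeds if the full URL matches the regex.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$url_regex"
}

base='ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs'

# diffs-000207.tar.gz is an invented name used as a negative case.
for f in diffs-000107.tar.gz diffs-000121.tar.gz diffs-000207.tar.gz; do
  if url_accepted "$base/$f"; then
    echo "$f: would be accepted"
  else
    echo "$f: would be rejected"
  fi
done
```

This only checks the pattern itself; it does not show whether a given wget version applies --accept-regex to FTP directory listings.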

BTW, for such tasks, I would also consider using some scripting language à la Python (or use libcurl or curl)


Author: Mark Jin

5th year PhD student in University of Michigan. Working on data transformation, data integration, database usability.

Updated on August 02, 2022

Comments

  • Mark Jin 3 months

    According to the wget man page, --accept-regex is the option to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.

    Suppose I want to obtain the files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz and diffs-000128.tar.gz from the IMDB data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/. "diffs\-0001[0-9]{2}\.tar\.gz" seems to be a suitable regex to describe these file names.

    However, when executing the following wget command

    wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
    

    wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.

    Could anyone tell me what I have done wrong?

  • Mark Jin over 5 years
    Thanks, Basile. I tried what you suggested, and even added "--regex-type=posix". But the same problem still exists.
