wget: obtaining files matching regex
Solution 1
Be careful: --accept-regex is matched against the complete URL. Our target is a set of specific file names, so we will use -A (--accept) instead.
For example,
wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/
will download all the files from IMG00.jpg to IMG29.jpg from the URL.
Note that the argument to -A is a file suffix or a shell-style wildcard pattern, not a regular expression. The manual's examples of matching patterns are ‘books*’ and ‘zelazny*196[0-9]*’.
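To check which names a pattern like the one above would select before running wget, here is a minimal sketch that uses the shell's own glob matching as a stand-in for -A. It assumes wget's patterns behave like shell wildcards, as the manual describes. glob_match is a hypothetical helper, and IMG30.jpg and IMG05.png are made-up names used only as negative cases.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a file name matches a shell-style
# pattern. The shell's "case" matching stands in for wget's -A matching.
glob_match() {
    # $1 = pattern, $2 = file name
    case "$2" in
        $1) echo "match" ;;      # unquoted $1 is treated as a glob pattern
        *)  echo "no match" ;;
    esac
}

pattern='IMG[012][0-9].jpg'
for name in IMG00.jpg IMG29.jpg IMG30.jpg IMG05.png; do
    printf '%s: %s\n' "$name" "$(glob_match "$pattern" "$name")"
done
```

Unlike in a regular expression, the dot in a glob is a literal dot and [012] matches exactly one character, so no escaping is needed in the pattern.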
References:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex tutorial: https://regexone.com/
Solution 2
I read in the wget man page:
--accept-regex urlregex
--reject-regex urlregex
    Specify a regular expression to accept or reject the complete URL.
Note that it says the complete URL (e.g. something like
ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/diffs-000121.tar.gz).
So I suggest (without having tried it) using
--accept-regex='.*diffs\-0001[0-9][0-9]\.tar\.gz'
(and perhaps give the appropriate --regex-type too)
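Since I have not tried this with wget itself, here is a small sketch that at least checks the regex against complete URLs using grep -E. It only tests the expression, not how wget applies it. url_matches is a hypothetical helper, diffs-000221.tar.gz is a made-up name used as a negative case, and I dropped the backslash before the hyphen because a hyphen outside a bracket expression needs no escaping.

```shell
#!/bin/sh
# Extended regex intended to match the complete URL, not just the file name.
regex='.*diffs-0001[0-9][0-9]\.tar\.gz'

# Hypothetical helper: succeeds if the regex matches the given URL string.
url_matches() {
    printf '%s\n' "$1" | grep -Eq -- "$regex"
}

base='ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs'
# diffs-000221.tar.gz is a made-up name, used only as a negative case.
for url in "$base/diffs-000121.tar.gz" "$base/diffs-000221.tar.gz"; do
    if url_matches "$url"; then
        echo "accept: $url"
    else
        echo "reject: $url"
    fi
done
```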
BTW, for such tasks, I would also consider using a scripting language such as Python (or libcurl or curl).
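In the same spirit, when the wanted files are known in advance, a plain shell loop (rather than Python) can simply enumerate their URLs and skip pattern matching altogether. This is only a sketch: diff_urls is a hypothetical helper, and the four file numbers are the ones named in the question.

```shell
#!/bin/sh
base='ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs'

# Hypothetical helper: prints one URL per wanted file.
diff_urls() {
    for n in 07 14 21 28; do
        printf '%s/diffs-0001%s.tar.gz\n' "$base" "$n"
    done
}

# Print the list so it can be inspected before downloading anything.
diff_urls

# To actually download, pipe the list into wget, which reads URLs from
# standard input when given "-i -":
#   diff_urls | wget -i -
```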
Mark Jin
5th-year PhD student at the University of Michigan. Working on data transformation, data integration, database usability.
Updated on August 02, 2022
Comments
-
Mark Jin 3 months
According to the man page of wget, --accept-regex is the option to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.
Suppose I want to obtain the files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz and diffs-000128.tar.gz in the IMDB data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/. "diffs\-0001[0-9]{2}\.tar\.gz" seems to be an OK regex to describe the file names.
However, when executing the following wget command
wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.
Could anyone tell me what I might have done wrong?
-
Mark Jin over 5 years
Thanks, Basile. I tried what you suggested, and even added "--regex-type=posix". But the same problem still exists.