How to get a list of available files using wget or curl?

Solution 1

You can't do the equivalent of an ls unless the server itself provides such a listing. You could, however, retrieve index.html and then check it for includes, e.g. with something like

wget -O - http://www.example.com | grep "type=.\?text/javascript.\?"

Note that this relies on the HTML being formatted in a certain way, in this case with each include on its own line. If you want to do this properly, I'd recommend parsing the HTML and extracting the javascript includes that way, as in the sketch below.
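For example, if xmllint (part of libxml2) is installed, you can let a real HTML parser pull out the script sources instead of relying on the line layout. This is only a sketch, and www.example.com is a placeholder:

curl -s http://www.example.com | xmllint --html --xpath '//script/@src' - 2>/dev/null

Each match is printed as src="...", so a final cut -d'"' -f2 will reduce the output to bare paths if you need them.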

Solution 2

Let's consider this open directory (http://tug.ctan.org/macros/latex2e/required/amscls/) as the object of our experimentation. This directory belongs to the Comprehensive TeX Archive Network, so don't be too worried about downloading malicious files.

Now, let's suppose that we want to list all files whose extension is pdf. We can do so by executing the command shown below.

The command saves the output of wget in the file main.log. Because wget sends a request for each file and prints some information about it, we can then grep the output to get a list of the files in the specified directory. The --spider option makes wget check that each file exists without downloading it, and --reject-regex skips the column-sorting links (?C=N;O=D and the like) that the server's index pages generate.

wget \
  --accept '*.pdf' \
  --reject-regex '/\?C=[A-Z];O=[A-Z]$' \
  --execute robots=off \
  --recursive \
  --level=0 \
  --no-parent \
  --spider \
  'http://tug.ctan.org/macros/latex2e/required/amscls/doc/' 2>&1 | tee main.log

Now, we can list the requested URLs by using grep. Apart from the first lines, which correspond to the directory page itself, they are exactly the PDF files.

grep '^--' main.log
--2020-11-23 10:39:46--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsbooka.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsclass.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsdtx.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsmidx.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsthdoc.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/thmtest.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/upref.pdf
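To keep only the PDF URLs, without the timestamps and the directory entries, you can filter the log a bit further, e.g.:

grep '^--' main.log | grep '\.pdf$' | awk '{print $3}'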

Note that we could also have crawled the whole directory and then run grep on the output. However, that would take more time, since a request is sent for each file. By using --accept, we make wget send requests only for the files we are interested in.
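If the listing fits on a single index page, a plain curl request is an alternative that avoids crawling altogether. The sketch below assumes that the index page links each file with an ordinary href attribute, as CTAN's directory pages do:

curl -s 'http://tug.ctan.org/macros/latex2e/required/amscls/doc/' \
  | grep -oE 'href="[^"]+\.pdf"' \
  | cut -d'"' -f2

This prints the file names relative to the directory, so prepend the directory URL if you need absolute links.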

Last but not least, the sizes of the files are also recorded in main.log, so you can look that information up there.
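Assuming wget's usual output format, each requested URL is followed by a Length: line with the size in bytes, so you can pair them up with a single grep:

grep -E '^--|^Length:' main.log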

Comments

  • nachocab over 3 years

    I'd like to know if it's possible to do an ls of a URL, so I can see what *.js files are available on a website, for example. Something like:

    wget --list-files -A.js stackoverflow.com
    

    and get

    ajax/libs/jquery/1.7.1/jquery.min.js
    js/full.js
    js/stub.js
    ...