How to programmatically extract information from a web page, using Linux command line?

Solution 1

That's because wget is sending certain headers that make it easy to detect.

# wget --debug cnet.com 2>&1 | less
[...]
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: www.cnet.com
Connection: Keep-Alive
[...]

Notice the

User-Agent: Wget/1.13.4 

I think that if you change it to

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14

it should work.

# wget --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'

That seems to be working fine from here. :D
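A sketch of how such a per-date fetch could be composed (the wget command is only echoed rather than executed, since xe.com's terms prohibit automated extraction; the dates are illustrative examples):

```shell
# Sketch only: build the per-date URL and the wget invocation with a
# browser-like User-Agent. The command is echoed, not run, because
# xe.com's terms of use prohibit automated extraction.
ua='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14'
for date in 2012-10-15 2012-10-16; do
    url="http://www.xe.com/currencytables/?from=USD&date=$date"
    echo wget -q -O- --header="User-Agent: $ua" "$url"
done
```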

Solution 2

You need to use -O- to write to STDOUT, and quote the URL so the shell does not treat the & as a background operator:

wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'

But it looks like xe.com does not want you to do automated downloads, so I would suggest not doing them.
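If the site did permit it, the page written to STDOUT could be filtered for the EUR line with awk, along the lines the question suggests. A hypothetical sketch; the table contents below are invented stand-ins, not real xe.com data or the page's real HTML layout:

```shell
# Hypothetical sketch of the grep/awk step: pick the EUR row out of a
# two-column "CODE rate" table. The values here are made up.
eur=$(awk '$1 == "EUR" { print $2 }' <<'EOF'
EUR 0.7716
GBP 0.6208
JPY 78.5150
EOF
)
echo "EUR rate: $eur"
```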

Solution 3

Did you visit the link in the response?

From http://www.xe.com/errors/noautoextract.htm:

We do offer a number of licensing options which allow you to incorporate XE.com currency functionality into your software, websites, and services. For more information, contact us at:

XE.com Licensing
+1 416 214-5606
[email protected]

You will appreciate that the time, effort and expense we put into creating and maintaining our site is considerable. Our services and data is proprietary, and the result of many years of hard work. Unauthorized use of our services, even as a result of a simple mistake or failure to read the terms of use, is unacceptable.

This sounds like there is an API that you could use, but you will have to pay for it. Needless to say, you should respect these terms and not try to get around them.

Author: ysap

Updated on July 18, 2022

Comments

  • ysap
    ysap almost 2 years

    I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.

The www.xe.com website provides a historical lookup tool, and by using a fully specified URL, one can get the rate table for a specific date without manually populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies for Oct. 15th, 2012.

    Now, assuming I have a list of dates, I can loop through it and change the date part of that URL to get the required page. If I can extract the rates list, then a simple grep EUR will give me the relevant exchange rate (and I can use awk to extract the rate field specifically).

    The question is, how can I get the page(s) using a Linux command line tool? I tried wget but it did not do the job.

    If not from the CLI, is there an easy and straightforward way to do this programmatically (i.e., one that requires less time than copy-pasting the dates into the browser's address bar)?


    UPDATE 1:

    When running:

    $ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
    

    I get a file which contain:

    <HTML>
    <HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
    <BODY>
    Automated extraction of our content is prohibited.  See <A HREF="http://www.xe.com/errors/noautoextract.htm">http://www.xe.com/errors/noautoextract.htm</A>.
    </BODY>
    </HTML>
    

    so it seems like the server can identify the type of request and block the wget. Any way around this?


    UPDATE 2:

    After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:

    You agree that you shall not:
    ...
    f. use any automatic or manual process to collect, harvest, gather, or extract
       information about other visitors to or users of the Services, or otherwise
       systematically extract data or data fields, including without limitation any
       financial and/or currency data or e-mail addresses;
    

    which, I guess, concludes the efforts on this front.


    Now, for my curiosity: if wget generates an HTTP request, how does the server know that it came from a command-line tool and not from a browser?

  • ysap
    ysap about 11 years
    Thanks. Although, to the best of my understanding, this paragraph, by itself, does not prohibit my attempt (but, IANAL), I did find the relevant clause in the ToS and updated the question accordingly.
  • ysap
    ysap almost 10 years
    Yes, this seems to do the trick. Tested only for curiosity's sake, however, since the ToS prohibit this.
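As for the closing question of how the server tells wget apart from a browser: as Solution 1's --debug output shows, wget announces itself in the User-Agent request header, and a server can classify clients on that string alone. A toy sketch of such a check (the classification logic is invented for illustration, not xe.com's actual rule):

```shell
# Toy illustration: a server can classify a client purely from the
# User-Agent request header it sends. The header strings mirror the
# --debug output shown in Solution 1; the rule itself is made up.
classify() {
    case "$1" in
        Wget/*|curl/*) echo blocked ;;   # well-known CLI tool signatures
        *)             echo allowed ;;   # anything browser-like passes
    esac
}

classify 'Wget/1.13.4 (linux-gnu)'
classify 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)'
```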