wget grabbing empty files that actually exist

Solution 1

You might want to try turning on wget's debug switch -d to see what's going on.

Example

$ wget -d http://services.runescape.com/m=itemdb_rs/api/graph/19227.json
DEBUG output created by Wget 1.12 on linux-gnu.

--2013-09-21 13:22:46--  http://services.runescape.com/m=itemdb_rs/api/graph/19227.json
Resolving services.runescape.com... 216.115.77.143, 8.26.16.145, 62.67.0.145, ...
Caching services.runescape.com => 216.115.77.143 8.26.16.145 62.67.0.145 64.94.237.145
Connecting to services.runescape.com|216.115.77.143|:80... connected.
Created socket 3.
Releasing 0x0000000000f251e0 (new refcount 1).

---request begin---
GET /m=itemdb_rs/api/graph/19227.json HTTP/1.0
Referer: http://www.google.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: services.runescape.com
Connection: Keep-Alive
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK
Date: Sat, 21-Sep-2013 17:22:47 GMT
Server: JAGeX/3.1
Content-type: text/html; charset=ISO-8859-1
Content-Encoding: gzip
Cache-control: no-cache
Pragma: no-cache
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: settings=wwGlrZHF5gKN6D3mDdihco3oPeYN2KFybL9hUUFqOvk; version=1; path=/; domain=.runescape.com; Expires=Tue, 20-Sep-2016 17:22:47 GMT; Max-Age=94608000
Connection: Keep-alive
Content-length: 1668

---response end---
200 OK
cdm: 1 2 3 4 5 6 7 8
Stored cookie runescape.com -1 (ANY) / <permanent> <insecure> [expiry 2016-09-20 13:22:47] settings wwGlrZHF5gKN6D3mDdihco3oPeYN2KFybL9hUUFqOvk
Registered socket 3 for persistent reuse.
Length: 1668 (1.6K) [text/html]
Saving to: “19227.json”

100%[==============================================================================================================================>] 1,668       --.-K/s   in 0.08s   

2013-09-21 13:22:47 (21.4 KB/s) - “19227.json” saved [1668/1668]

Solution 2

Is there anywhere I could improve my script to check filesize or whatever before downloading?

Checking before downloading wouldn't make any sense, because the server evidently fails to reply properly to your download requests. It should either return the proper file or an HTTP error code, but apparently it does neither. You could try to determine the remote file size with an HTTP HEAD request, but that won't do you any good when the remote file is fine and only the GET delivery fails.
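For illustration, wget can send such a request itself: spider mode (--spider) checks the resource without downloading it, and --server-response (-S) prints the headers the server returns, including whatever Content-Length it claims:

$ wget --spider --server-response http://services.runescape.com/m=itemdb_rs/api/graph/19227.json

But again, a plausible Content-Length in the HEAD response tells you nothing about whether the next GET will actually deliver the file.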

Instead, use a loop in your script to loop through all the files you want to download. Download each file with a single wget request, and then check the file size of the file you have downloaded. If it is a 0 byte file and you are sure that it shouldn't be, repeat the request. You should of course add a failsafe limit so your script won't repeat the request endlessly if it always fails, and maybe also a delay (in case the server is rate limiting your requests and failing them intentionally).
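Here is a rough sketch of such a loop in shell. The URL pattern and the ./jsons/ output directory are taken from the question; the ids.txt input file (one item ID per line) is an assumption made up for the example, so adapt it to however your script enumerates its 132 files:

#!/bin/bash
# Download each file; if the result is 0 bytes, wait and retry,
# up to a fixed number of attempts so the loop can never run forever.
max_tries=5   # failsafe limit
delay=10      # seconds between retries, in case the server is rate limiting

while read -r id; do
    url="http://services.runescape.com/m=itemdb_rs/api/graph/${id}.json"
    out="./jsons/${id}.json"
    for (( try=1; try<=max_tries; try++ )); do
        wget -q -O "$out" "$url"
        # -s tests that the downloaded file exists and is non-empty
        if [ -s "$out" ]; then
            break
        fi
        echo "Empty download for ${id} (attempt ${try}/${max_tries}), retrying..." >&2
        sleep "$delay"
    done
    [ -s "$out" ] || echo "Giving up on ${id} after ${max_tries} attempts." >&2
done < ids.txt

Note that wget -O writes the output file even when the body is empty, which is exactly why checking the file size afterwards is the reliable signal here.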



Comments

  • Jaska Börner
    Jaska Börner almost 2 years

    I have a script that's meant to download a certain number of files from a remote server. It only needs to do this every 24 hours, since they are JSON sources for a database on my server. The files are updated on the remote server around midnight GMT, and my script runs an hour after that to ensure they are properly updated already.

    The problem is that I consistently notice it fails to download at least twenty or more out of the 132 files, except it doesn't think it's failed at all (I see 200 OK). They are JSONs, so they are at most 8KB in size. In the wget logfile, I see this:

    --2013-09-21 12:01:10--  http://services.runescape.com/m=itemdb_rs/api/graph/19227.json
    Reusing existing connection to services.runescape.com:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 0 [text/html]
    Saving to: `./jsons/19227.json'
    
     0K                                                        0.00 =0s
    
    2013-09-21 12:01:10 (0.00 B/s) - `./jsons/19227.json' saved [0/0]
    

    This doesn't make any sense. There's no rhyme or reason to the failures. I re-tried many times and each time it wrote 0-byte files at random, not failing on the same files each time. The frustrating part is there are no errors anywhere, so nothing gets caught in the error log...

    no-clobber doesn't matter in this case. The files are meant to be overwritten, as they become out-of-date every 24 hours, and even "good data" from the day before is "bad data" today.

    Is there anywhere I could improve my script to check filesize or whatever before downloading? I tried on my Mac at home and got the same exact result, even using "spider mode" to check if it exists first. The most frustrating part is if I were to paste the URL into a browser, it loads the whole JSON just as it should...I take it "retries" won't help as wget is not running into any HTTP errors anyway.

    • Martin von Wittich
      Martin von Wittich almost 11 years
      Is it possible that the API you're using has limits in place and that you are exceeding these limits, causing the server to return empty results to your requests?
    • slm
      slm almost 11 years
      Where's the script?
    • Jaska Börner
      Jaska Börner almost 11 years
      The script itself is here and yes it is possible the remote server would throttle me, but that wouldn't explain why I'm able to just go ahead and re-run the script immediately. I asked them - my server is actually whitelisted.
  • Jaska Börner
    Jaska Börner almost 11 years
    I was using a delay before (when I tried with cURL) but it didn't seem to change anything. But then again I was calling it from PHP and it just did not work on my server, so I will try implementing a bit of a delay between requests here as well.
  • Jaska Börner
    Jaska Börner almost 11 years
    I did try this but it's literally not showing a single error anywhere. Just more verbose information about a 0-byte file. I even put in a function to check how many downloads resulted in 0-byte files...honestly I'm starting to want to blame their server! I'll try contacting them in case...thanks for the tip on debug messages though. Useful!
  • slm
    slm almost 11 years
    @JaskaBörner - yeah, in looking at your script I don't see you doing anything wrong or out of the ordinary, looks perfectly valid to me.
  • roaima
    roaima almost 5 years
    So now you've not only ignored the missing data but also deleted any indication of it having failed? How does that address the OP's issue?