Using wget to download protected web pages

Commercial websites often protect some of their content with a password, yet still want that content to be searchable. To get it indexed, they grant search engines access to the private part of the site. To do so, they check the signature of the agent connecting to their web server.

By changing the signature of your agent, i.e. by forcing the Referer: header and setting the User-Agent: header to that of a search engine crawler, you may get access to this protected content on sites that check nothing more than these headers.

How to access this protected content with wget


You need to create a .wgetrc file in your home directory that sets the referer and user_agent variables to the signature of a well-known search engine.

Example:
# .wgetrc file
# Last change: 11-24-2006

header = Accept-Language: en
header = Accept-Encoding: gzip, deflate
header = Keep-Alive: 300
user_agent = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
referer = http://www.google.com
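
For a one-off test, the same values can be passed on the command line instead of being stored in .wgetrc. The sketch below assumes GNU wget and uses its --user-agent, --referer and --header options. The helper name fetch_as_bot is hypothetical and only serves to keep the example short.

```shell
# Values copied from the .wgetrc example above.
UA='Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
REFERER='http://www.google.com'

# fetch_as_bot is a hypothetical helper name, not part of wget.
# It fetches the URL given as first argument with the headers set above.
fetch_as_bot() {
    wget --user-agent="$UA" \
         --referer="$REFERER" \
         --header='Accept-Language: en' \
         "$1"
}
```

Usage: fetch_as_bot http://www.example.com/protected_content.html
Options given on the command line apply to that run only, so the default signature stays in place for every other download.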

Then, when you fetch the page with wget, you'll get output like this:

$ wget http://www.example.com/protected_content.html
--14:30:23--  http://www.example.com/protected_content.html
           => `protected_content.html'
Resolving www.example.com... 172.16.1.66
Connecting to www.example.com|172.16.1.66|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37,356 (36K) [text/html]

100%[====================================>] 37,356       100.31K/s 

14:30:25 (100.05 KB/s) - `protected_content.html' saved [37356/37356]

I can't show you a real example here, as it would be considered stealing.

For more detail on the HTTP protocol, see RFC 1945 (HTTP/1.0) and RFC 2068 (HTTP/1.1).
For more information about the User-Agent header, search for User-Agent on Wikipedia.
 
