What is Wget?
“Wget is a computer program that retrieves content from web servers, and is part of the GNU Project. Its name is derived from World Wide Web and get. It supports downloading via HTTP, HTTPS, and FTP protocols.” -Wikipedia
In technical terms, wget is essentially a web spider that scrapes or leeches content from web pages. It is far more useful in daily life than it gets credit for. For example, if you have ever needed to download some 20 to 30 images from a web page, you know how tiresome doing it by hand can be. This is where wget comes in: with just one line you can download all the files in one go. Now that’s COOL.
Although wget has been around since 1996, most average users today don’t know about it.
Features Of Wget :-
Wget is very useful on slow or unreliable networks: if a download is interrupted, it retries and, where the server supports it, resumes from the point where it stopped, until the whole file has been downloaded.
Recursive Download :
Wget can follow the HTML links on a web page and recursively download the files. It is the same tool that a soldier used to download thousands of secret documents from the US Army’s intranet, which were later published on the WikiLeaks website.
Wget works on a ‘point and shoot’ principle: once started, it requires no further user interaction to complete the given task.
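The retry-and-resume behaviour described above can be spelled out with a few flags. What follows is only a minimal sketch: the function name fetch_resumable and the WGET override variable are illustrative assumptions rather than part of wget, and the retry values are arbitrary.

```shell
# WGET is a hypothetical override used for dry runs (for example WGET=echo);
# by default it points at the real wget binary.
WGET="${WGET:-wget}"

# fetch_resumable URL
# --continue     resume a partially downloaded file instead of starting over
# --tries=10     give up after 10 attempts
# --waitretry=5  back off between retries, waiting up to 5 seconds
fetch_resumable() {
    "$WGET" --continue --tries=10 --waitretry=5 "$1"
}
```

Calling fetch_resumable example.com/big.file.iso then runs unattended until the file is complete or the attempts run out.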
Wget Commands :-
- Download a single file from the Internet
- Download a file but save it locally under a different name
wget --output-document=filename.html example.com
- Download a file and save it in a specific folder
wget --directory-prefix=folder/subfolder example.com
- Resume an interrupted download previously started by wget itself
wget --continue example.com/big.file.iso
- Download a file but only if the version on server is newer than your local copy
wget --continue --timestamping wordpress.org/latest.zip
- Download multiple URLs with wget. Put the list of URLs in another text file on separate lines and pass it to wget.
wget --input-file=list-of-file-urls.txt
- Download a list of sequentially numbered files from a server
- Download a web page with all assets – like stylesheets and inline images – that are required to properly display the web page offline.
wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file
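For the sequentially numbered files case in the list above, one possible approach is to generate the URLs into a text file and pass that file to wget, as in the multiple-URLs item. This is a sketch under assumptions: the URL pattern http://example.com/images/N.jpg, the range 1 to 20, and the function name make_url_list are all made up for illustration.

```shell
# Hypothetical URL pattern; replace it with the real location of the files.
base="http://example.com/images"

# Print one URL per line for 1.jpg through 20.jpg.
make_url_list() {
    i=1
    while [ "$i" -le 20 ]; do
        printf '%s/%s.jpg\n' "$base" "$i"
        i=$((i + 1))
    done
}

# Save the list so that wget can read it.
make_url_list > list-of-file-urls.txt

# Then hand the list to wget (needs network access, so it is left commented out):
# wget --input-file=list-of-file-urls.txt
```

Writing the list to a file first makes it easy to inspect or trim the URLs before anything is downloaded.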
Mirror Complete Websites With Wget :-
- Download an entire website including all the linked pages and files
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/
- Download all the MP3 files from a sub directory
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/
- Download all images from a website in a common folder
wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/
- Download the PDF documents from a website through recursion but stay within specific domains.
wget --mirror --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/
- Download all files from a website but exclude a few directories.
wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com
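Several of the commands above share the same shape: recurse, stay below the starting directory, and keep only certain file types. If you use that pattern often, it could be wrapped in a small helper. The sketch below is just one way to do it; grab_by_type and the WGET override are illustrative names, not wget features.

```shell
# WGET is a hypothetical override used for dry runs (for example WGET=echo);
# by default it points at the real wget binary.
WGET="${WGET:-wget}"

# grab_by_type EXTENSIONS URL
# Download only files whose extension appears in the comma-separated list,
# following links one level deep and never climbing to the parent directory.
grab_by_type() {
    "$WGET" --recursive --level=1 --no-parent --no-clobber --accept "$1" "$2"
}
```

For example, grab_by_type mp3,MP3 http://example.com/mp3/ behaves like the MP3 command above, with --no-clobber added so existing files are not downloaded again.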
Download Restricted Content With Wget :-
There are many occasions where content sits behind a login screen or is restricted to a group of users. In these cases wget can be used to download the required content.
- Download files from websites that check the User Agent and the HTTP Referer
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com
- Download files from a password-protected site
wget --http-user=labnol --http-password=hello123 http://example.com/secret/file.zip
- Fetch pages that are behind a login page. You need to replace user and password with the actual form fields while the URL should point to the Form Submit (action) page.
wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=labnol&password=123' http://example.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall
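The two login commands above always run as a pair, so they can be combined into one step that also cleans up the cookie file afterwards. This is a hedged sketch, not a definitive recipe: fetch_behind_login and the WGET override are illustrative names, and the form field names still depend on the site in question.

```shell
# WGET is a hypothetical override used for dry runs (for example WGET=echo);
# by default it points at the real wget binary.
WGET="${WGET:-wget}"

# fetch_behind_login LOGIN_URL POST_DATA PAGE_URL
fetch_behind_login() {
    # Keep the session cookies in a temporary file instead of cookies.txt.
    cookies=$(mktemp) || return 1
    status=0

    # Step 1: submit the login form, discard the response body, keep the cookies.
    "$WGET" --save-cookies "$cookies" --keep-session-cookies \
        --post-data "$2" --output-document=/dev/null "$1" || status=$?

    # Step 2: only if the login request succeeded, fetch the protected page.
    if [ "$status" -eq 0 ]; then
        "$WGET" --load-cookies "$cookies" "$3" || status=$?
    fi

    # Remove the cookie file so the session does not linger on disk.
    rm -f "$cookies"
    return "$status"
}
```

A call might look like fetch_behind_login http://example.com/login.php 'user=labnol&password=123' http://example.com/paywall. The --cookies=on flag is left out because cookie handling is on by default.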
These are some of the most commonly used wget commands; there are many more that you can look up and try for yourself. There is much more to this tool than meets the eye. Ciao.