In this quick post I will explain how to edit an external website transparently, allowing
wget to follow links it would not have seen otherwise.
Wget mirror issues
You may already know this: wget has an awesome mirroring mode (the --mirror option).
Here is the command I use, for example:
    wget --mirror --convert-links --adjust-extension \
        --page-requisites --no-parent --progress=dot \
        --recursive --level=6 https://target.com/
This will give me a folder with everything nice and tidy with directories, HTML files, etc.
wget is able to follow links in various tags (iframe and others), as you can see in its source code.
But some websites use custom attributes, like the famous data-src used by lazy loaders.
Since wget does not offer any option to follow custom attributes on a tag (the --follow-tags option only selects which HTML tags are considered; you cannot control the attribute itself), we will have to cheat if we want it to download our images.
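To get a feel for what a src-based crawler misses, here is a minimal sketch using Python's standard html.parser. The names LazyImageFinder and find_lazy_images, as well as the sample markup, are made up for illustration; this is not how wget parses pages, only a way to list the images hidden behind data-src.

```python
from html.parser import HTMLParser


class LazyImageFinder(HTMLParser):
    """Collect <img> URLs that only appear in a data-src attribute."""

    def __init__(self):
        super().__init__()
        self.missed = []  # URLs a plain src-based crawler would not see

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        lazy_url = attributes.get("data-src")
        # The real image is "missed" when data-src points somewhere src does not.
        if lazy_url and attributes.get("src") != lazy_url:
            self.missed.append(lazy_url)


def find_lazy_images(html):
    """Return the data-src URLs found in the <img> tags of an HTML string."""
    finder = LazyImageFinder()
    finder.feed(html)
    return finder.missed


# Hypothetical page fragment, for illustration only.
sample = '<img src="/a.png"><img data-src="/b.png"><img src="/ph.gif" data-src="/c.png">'
print(find_lazy_images(sample))  # ['/b.png', '/c.png']
```

Running something like this on a page saved by wget is a quick way to check whether the mirror is missing lazy-loaded images.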
Using a proxy to edit the HTML on the fly
If you cannot control the website you are mirroring, one option is to put a local proxy in front of it.
That’s what we are going to do: instead of wget going directly to our target, we will put a proxy in between:
             local
        ┌────────────┐
        │            │
        │  ┌──────┐  │         ┌──────────┐
        │  │ wget ├──┼────────►│target.com│
        │  └──────┘  │         └──────────┘
        │            │
        │            │
        │  ┌──────┐  │         ┌──────────┐
        │  │ wget │  │         │target.com│
        │  └──┬───┘  │         └──────────┘
        │     │      │              ▲
        │     ▼      │              │
        │  ┌──────┐  │              │
        │  │proxy ├──┼──────────────┘
        │  └──────┘  │
        │            │
        └────────────┘
This proxy can be anything but let’s use an open-source one: https://mitmproxy.org/ ❤️
We start by writing a small Python script that will edit all the HTTP responses passing through this proxy:
""" Fix the lazy loader SRC. """ from mitmproxy import http def response(flow: http.HTTPFlow) -> None: if flow.response and flow.response.content: flow.response.content = flow.response.content.replace( b"data-src", b"src" )
and then we can launch the proxy:
mitmdump -s src.py
This starts a proxy listening on localhost:8080, so let’s tell wget to use it and we are good to go!
    wget -e use_proxy=yes -e http_proxy=localhost:8080 -e https_proxy=localhost:8080 \
        --mirror --convert-links --adjust-extension \
        --page-requisites --no-parent --progress=dot \
        --recursive --level=6 https://target.com/
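If you run this mirror often, the command can be assembled from a small script. The sketch below only builds the argument list; the helper name build_wget_command and its parameters are hypothetical. The optional ca_cert argument is an assumption on my side: since the proxy re-signs HTTPS traffic with its own certificate authority, wget may refuse the connection unless it is given that CA file through its --ca-certificate option.

```python
import shlex


def build_wget_command(url, proxy="localhost:8080", ca_cert=None):
    """Return the wget argument list for mirroring `url` through `proxy`."""
    command = [
        "wget",
        "-e", "use_proxy=yes",
        "-e", f"http_proxy={proxy}",
        "-e", f"https_proxy={proxy}",
    ]
    if ca_cert:
        # Assumption: HTTPS targets need the proxy's CA certificate to be trusted.
        command.append(f"--ca-certificate={ca_cert}")
    command += [
        "--mirror", "--convert-links", "--adjust-extension",
        "--page-requisites", "--no-parent", "--progress=dot",
        "--recursive", "--level=6",
        url,
    ]
    return command


command = build_wget_command("https://target.com/")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Keeping the arguments in a list rather than one long string avoids quoting surprises when the URL contains special characters.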
Every request now passes through our proxy and our Python script: the <img data-src tags are seen as <img src, and our local mirror will be complete!
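One caveat: the script above replaces data-src in every response, including JavaScript, CSS or binary files where that string may appear. If that turns out to be a problem, a narrower variant could look like the sketch below. It is not the script used in this post: the rewrite_lazy_images helper and the regular expression are my own assumptions, and the hook is written without type hints so it can be read (and tried) without mitmproxy installed.

```python
import re

# Matches the data-src attribute name inside an <img ...> tag only.
IMG_DATA_SRC = re.compile(rb"(<img\b[^>]*?\s)data-src(\s*=)", re.IGNORECASE)


def rewrite_lazy_images(body):
    """Rename data-src to src inside the <img> tags of an HTML body (bytes)."""
    return IMG_DATA_SRC.sub(rb"\1src\2", body)


def response(flow):
    """mitmproxy hook: only rewrite responses declared as HTML."""
    if not flow.response or not flow.response.content:
        return
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type.lower():
        flow.response.content = rewrite_lazy_images(flow.response.content)
```

It can be loaded the same way, with mitmdump -s and the file name. Like the original, it does not handle tags that already carry a placeholder src: those end up with two src attributes, and this sketch does not control which one wget picks.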
This tip can be used for a lot of other use-cases, and thanks to Mitmproxy, it’s really easy to set up. Happy proxifying! 👋