
Pro-tip: Using wget Mirror Mode with Custom HTML Attributes

In this quick post I will explain how to edit an external website transparently, allowing wget to follow links it would not have seen otherwise.

Wget mirror issues

As you may already know, wget has an awesome --mirror option that captures an entire website along with all the images, JavaScript, CSS and fonts it needs. Combined with --convert-links, it also rewrites the links to be relative, so the copy can be browsed locally.

Here is the command I use for example:

wget --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent --progress=dot \
  --recursive --level=6 https://target.com/

This will give me a folder with everything nice and tidy with directories, HTML files, etc.

wget is able to follow links in various places such as href, img src and iframe, as you can see in its source code.

But some websites use custom attributes such as the famous data-src, used by plenty of lazy-loading JavaScript libraries from the time before browsers implemented native lazy loading.

wget offers no option to follow custom attributes on a tag: --follow-tags only selects which HTML tags are considered, not which attributes are read. So we will have to cheat if we want it to download our images.
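To make the problem concrete, here is a minimal Python sketch of a link extractor that, like wget, only looks at a fixed set of tag and attribute pairs. The FOLLOWED table and the LinkCollector class are simplified stand-ins made up for this illustration, not wget's actual implementation.

```python
from html.parser import HTMLParser

# Simplified, hypothetical stand-in for wget's tag/attribute table.
FOLLOWED = {("a", "href"), ("img", "src"), ("iframe", "src")}


class LinkCollector(HTMLParser):
    """Collect URLs only from the (tag, attribute) pairs in FOLLOWED."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in FOLLOWED and value:
                self.links.append(value)


def collect_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


sample = '<a href="/about"></a><img src="/logo.png"><img data-src="/lazy.jpg">'
print(collect_links(sample))  # → ['/about', '/logo.png']
```

The lazy-loaded /lazy.jpg never shows up in the list, which is exactly why it would be missing from the mirror.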

Using a proxy to edit the HTML on the fly

If you cannot control the website you are mirroring, one option is to put a local proxy in front of it.

That’s what we are going to do: instead of letting wget talk directly to our target, we will put a proxy in between:

    local
┌────────────┐
│            │
│  ┌──────┐  │         ┌──────────┐
│  │ wget ├──┼────────►│target.com│
│  └──────┘  │         └──────────┘
│            │
│            │
│  ┌──────┐  │         ┌──────────┐
│  │ wget │  │         │target.com│
│  └──┬───┘  │         └──────────┘
│     │      │              ▲
│     ▼      │              │
│  ┌──────┐  │              │
│  │proxy ├──┼──────────────┘
│  └──────┘  │
│            │
└────────────┘

This proxy can be anything but let’s use an open-source one: https://mitmproxy.org/ ❤️

We start by writing a small Python script that will edit all the HTTP responses passing through this proxy:

"""
Fix the lazy loader SRC.
"""
from mitmproxy import http

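def response(flow: http.HTTPFlow) -> None:
    # Skip empty responses.
    if not flow.response or not flow.response.content:
        return
    # Only rewrite HTML pages; images, fonts and scripts pass through untouched.
    if "text/html" not in flow.response.headers.get("content-type", ""):
        return
    # Expose lazy-loaded images to wget by turning data-src into src.
    flow.response.content = flow.response.content.replace(b"data-src", b"src")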
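A blanket replacement of data-src also touches any occurrence of that string outside of image tags, and it produces two src attributes when the tag already carries a placeholder src. If that becomes a problem, a more targeted rewrite could look like the sketch below. The promote_data_src helper and the data-placeholder attribute name are hypothetical, and the regular expressions are deliberately simple (they assume no ">" inside attribute values).

```python
import re

# Whole <img ...> tags, then the two attributes we care about inside them.
IMG_TAG = re.compile(rb"<img\b[^>]*>", re.IGNORECASE)
DATA_SRC = re.compile(rb"\sdata-src=", re.IGNORECASE)
PLAIN_SRC = re.compile(rb"\ssrc=", re.IGNORECASE)


def promote_data_src(html: bytes) -> bytes:
    """Turn data-src into src on <img> tags, parking any placeholder src."""

    def fix_tag(match: re.Match) -> bytes:
        tag = match.group(0)
        if not DATA_SRC.search(tag):
            return tag  # nothing lazy-loaded on this image
        # Move an existing placeholder out of the way (hypothetical name).
        tag = PLAIN_SRC.sub(b" data-placeholder=", tag)
        return DATA_SRC.sub(b" src=", tag)

    return IMG_TAG.sub(fix_tag, html)


print(promote_data_src(b'<img src="p.gif" data-src="a.jpg">'))
# → b'<img data-placeholder="p.gif" src="a.jpg">'
```

Inside the mitmproxy hook, this function could take the place of the replace call: flow.response.content = promote_data_src(flow.response.content).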

Then we can launch the proxy:

mitmdump -s src.py

This starts a proxy listening on localhost:8080, so let’s tell wget to use it and we are good to go!

wget -e use_proxy=yes -e http_proxy=localhost:8080 -e https_proxy=localhost:8080 \
  --mirror --convert-links --adjust-extension \
  --page-requisites --no-parent --progress=dot \
  --recursive --level=6 https://target.com/

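Every request now passes through our proxy and our Python script, so the <img data-src tags are seen as <img src and our local mirror will be complete! Since the target is served over HTTPS, wget also has to trust the mitmproxy CA certificate for the interception to work: point --ca-certificate at ~/.mitmproxy/mitmproxy-ca-cert.pem (the default location), or use --no-check-certificate for a throwaway run.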
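To check the result, one option is to scan the mirrored pages for images that still point to a remote URL. With --convert-links, wget rewrites links to files it did not download as absolute URLs, so those are the images it missed. The find_remote_images helper below is a hypothetical sketch, and it assumes the mirrored pages end in .html (which --adjust-extension should ensure).

```python
from html.parser import HTMLParser
from pathlib import Path


class RemoteImageFinder(HTMLParser):
    """Collect <img src> values that still point to an absolute URL."""

    def __init__(self):
        super().__init__()
        self.remote = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value and value.startswith(("http://", "https://")):
                self.remote.append(value)


def find_remote_images(mirror_dir: str) -> dict[str, list[str]]:
    """Map each mirrored HTML file to the remote image URLs it still uses."""
    report = {}
    for page in Path(mirror_dir).rglob("*.html"):
        finder = RemoteImageFinder()
        finder.feed(page.read_text(encoding="utf-8", errors="replace"))
        if finder.remote:
            report[str(page)] = finder.remote
    return report
```

Calling find_remote_images("target.com") on the folder created by wget should return an empty dictionary when every image made it into the mirror.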

This tip can be used for a lot of other use-cases, and thanks to Mitmproxy, it’s really easy to set up. Happy proxifying! 👋
