Monolith: The Perfect Tool for Web Archiving

Today, I want to talk to you about a really cool tool for archiving web pages. Sure, you can already save a web page with your browser, but this tool, called Monolith, does it far better. It not only saves the target page but also embeds its CSS, images, and JavaScript into a single HTML5 file.

Unlike a standard browser save or a wget download, which leave the assets in separate files, Monolith embeds every asset as a data URL. This means that your browser can display the saved page as it looked on the web, even without an Internet connection!
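If you are wondering what a data URL actually looks like, here is a minimal sketch that builds one by hand from a small file. The function name to_data_url is made up for this illustration, and Monolith's own implementation may well differ in its details.

```shell
#!/bin/sh
# to_data_url MIME FILE
# Prints a data URL for FILE, the kind of string that can stand in for an
# external src/href reference. Hypothetical helper, for illustration only.
to_data_url() {
    mime=$1
    file=$2
    # base64 wraps its output on some systems, so strip the newlines.
    payload=$(base64 < "$file" | tr -d '\n')
    printf 'data:%s;base64,%s\n' "$mime" "$payload"
}

# Example (logo.png is a placeholder file name):
#   to_data_url image/png logo.png
```

Once an image is inlined this way, the page no longer needs the network to show it.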

Installing it is super simple. Whether you are on Windows, macOS, GNU/Linux, or even on exotic devices with ARM processors, it will work:

  • With Cargo (cross-platform): cargo install monolith
  • Via Homebrew (macOS and GNU/Linux): brew install monolith
  • With Snapcraft (GNU/Linux): snap install monolith
  • And many other options…
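If you are not sure which of these applies to your machine, a small sketch like the one below can tell you. The function names are hypothetical, and the order in which the package managers are tried is just my assumption, not a recommendation from the Monolith project.

```shell
#!/bin/sh
# suggest_install
# Prints the first install command from the list above whose package
# manager is found on PATH. Hypothetical helper.
suggest_install() {
    if command -v cargo >/dev/null 2>&1; then
        echo "cargo install monolith"
    elif command -v brew >/dev/null 2>&1; then
        echo "brew install monolith"
    elif command -v snap >/dev/null 2>&1; then
        echo "snap install monolith"
    else
        echo "no known package manager found" >&2
        return 1
    fi
}

# check_monolith
# Only suggests an install command when monolith is not on PATH yet.
check_monolith() {
    if command -v monolith >/dev/null 2>&1; then
        echo "monolith is already installed"
    else
        suggest_install
    fi
}
```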

To save, for example, this article from my site, just enter the following command:

monolith https://www.tech2geek.net/monolith-archivage-web-html-autonome.html -o monolith.html

And bam, it generates a monolith.html file with everything in it. You can open it in your browser even without Internet access. It’s magical.
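The same command scales to a whole list of pages. Here is a rough sketch, assuming a text file with one URL per line; the helper names url_to_filename and archive_list, and the way output names are derived, are my own choices rather than anything Monolith provides.

```shell
#!/bin/sh
# url_to_filename URL
# Derives a flat, filesystem-safe .html name from a URL. Hypothetical helper.
url_to_filename() {
    printf '%s\n' "$1" |
        sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' \
            -e 's|/*$||' \
            -e 's|\.html\{0,1\}$||' \
            -e 's|[^A-Za-z0-9._-]|_|g' \
            -e 's|$|.html|'
}

# archive_list FILE
# Archives every URL listed in FILE (one per line, '#' starts a comment).
archive_list() {
    while IFS= read -r url; do
        case $url in ''|'#'*) continue ;; esac
        # stdin is redirected defensively so the loop keeps its own input.
        monolith "$url" -o "$(url_to_filename "$url")" < /dev/null ||
            echo "failed: $url" >&2
    done < "$1"
}

# Example (urls.txt is a placeholder file name):
#   archive_list urls.txt
```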

But Monolith has many more tricks up its sleeve. You can, for instance, feed it HTML directly through standard input (STDIN):

cat index.html | monolith -aMcIiFfv -b https://site.com/ - > result.html

Here, we pass the HTML content via standard input (that is what the lone - argument means), use -b to set the base URL for the page, and add a few more options:

  • -a to remove audio
  • -M to omit the timestamp and URL metadata
  • -c to exclude CSS
  • -I to isolate the document (cut the saved page off from the Internet)
  • -i to remove images
  • -F to exclude web fonts
  • -f to skip frames
  • -v to remove videos

In short, you have full control over what you want to keep or exclude.
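Single-letter options are easy to mix up, so in a script it can help to spell them out. The sketch below maps readable names to the letters listed above; the names themselves (no-audio, isolate, and so on) are invented for this example, only the letters come from Monolith.

```shell
#!/bin/sh
# monolith_flags NAME...
# Translates readable names into the single-letter options listed above.
# The names are made up for this sketch.
monolith_flags() {
    flags=
    for name in "$@"; do
        case $name in
            no-audio)    flags="${flags}a" ;;
            no-metadata) flags="${flags}M" ;;
            no-css)      flags="${flags}c" ;;
            isolate)     flags="${flags}I" ;;
            no-images)   flags="${flags}i" ;;
            no-fonts)    flags="${flags}F" ;;
            no-frames)   flags="${flags}f" ;;
            no-video)    flags="${flags}v" ;;
            *) echo "unknown option name: $name" >&2; return 1 ;;
        esac
    done
    # Print nothing when no names were given, so the caller can omit the argument.
    [ -z "$flags" ] || printf '%s\n' "-$flags"
}

# Example: a text-focused archive without images or videos.
#   monolith $(monolith_flags isolate no-images no-video) https://example.com -o text-only.html
```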

You can also specify allowed or forbidden domains for fetching assets, like:

monolith -I -d example.com -d www.example.com https://example.com -o example-only.html

Here we only allow assets from the domains example.com and www.example.com; everything else is excluded. Conversely, adding -B turns the -d list into a blocklist, which is handy for domains that serve ads or trackers:

monolith -I -B -d .googleusercontent.com -d googleanalytics.com -d .google.com https://example.com -o example-no-ads.html
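When the list of domains grows, typing every -d by hand gets tedious. As a sketch, you could keep the domains in a text file and generate the arguments from it; the function name domain_args and the file format (one domain per line, '#' for comments) are assumptions of mine.

```shell
#!/bin/sh
# domain_args FILE
# Prints "-d <domain>" pairs on one line for every non-empty,
# non-comment line in FILE. Hypothetical helper.
domain_args() {
    args=
    while IFS= read -r domain || [ -n "$domain" ]; do
        case $domain in ''|'#'*) continue ;; esac
        args="${args:+$args }-d $domain"
    done < "$1"
    printf '%s\n' "$args"
}

# Example (blocklist.txt is a placeholder file name; the substitution is
# left unquoted on purpose so each word becomes its own argument):
#   monolith -I -B $(domain_args blocklist.txt) https://example.com -o example-no-ads.html
```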

Note that Monolith does not include a JavaScript engine, so it can fall short on more complex pages that fetch data after the initial load. But no worries! We can render the page with a headless browser like Chromium first, then pass the resulting DOM to Monolith:

chromium --headless --incognito --dump-dom https://github.com | monolith - -I -b https://github.com -o github.html

And there you go, problem solved!
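The browser binary is not called chromium everywhere, so a script version of this pipeline may want to look for it first. Here is a hedged sketch: the candidate names (chromium, chromium-browser, google-chrome) are guesses that you should adjust for your system, and both function names are hypothetical.

```shell
#!/bin/sh
# first_available CMD...
# Prints the first command name found on PATH, or fails if none is.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# archive_rendered URL OUTPUT
# Renders URL in a headless browser, then hands the resulting DOM to Monolith.
archive_rendered() {
    url=$1
    out=$2
    # Candidate binary names are assumptions; adjust them for your system.
    browser=$(first_available chromium chromium-browser google-chrome) || {
        echo "no Chromium-based browser found" >&2
        return 1
    }
    "$browser" --headless --incognito --dump-dom "$url" |
        monolith - -I -b "$url" -o "$out"
}

# Example:
#   archive_rendered https://github.com github.html
```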

Monolith is perfect for web archivists and data hoarders who want to keep a trace of everything, or even automate it all in their scripts.

Mohamed SAKHRI

I'm the creator and editor-in-chief of Tech To Geek. Through this little blog, I share with you my passion for technology. I specialize in various operating systems such as Windows, Linux, macOS, and Android, focusing on providing practical and valuable guides.
