Hello everyone! Today, I’m going to tell you about a new ally in our quest to explore and capture the vastness of the web: Trafilatura.
This open-source tool, coded in Python, will allow you to efficiently collect text from web pages while greatly simplifying the task for developers and users. Let’s dive into the details together.
Trafilatura is both a command-line tool and a Python (+R) library developed to meet the specific needs of crawling, extracting, and processing text from internet sources.
The tool is also capable of retrieving metadata or comments from web pages. The idea behind this project is to avoid getting lost in the HTML jungle by extracting only the essential content and disregarding the rest (sidebar, header, footer, etc.). The challenge here is to eliminate these unnecessary “parasitic elements” and access only the relevant content.
Installing it is super simple with pip:
pip install trafilatura
And in code, here is a usage example:
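from trafilatura import fetch_url, extract

# Download the page; fetch_url returns None if the request fails.
downloaded = fetch_url('https://easy-tutorials.com')
# By default, extract returns the main content as a plain string (or None if nothing is found).
result = extract(downloaded)
print(result)  # Print main content only.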
Now that you have an understanding of what Trafilatura is capable of, let’s delve into its available features. Trafilatura can handle sitemaps (txt and xml) as well as feeds (atom, json, rss). You can provide it with lists of links to retrieve and even apply filters to specific content, including deduplication if needed.
In terms of input, Trafilatura can accept a simple URL or HTML directly. It efficiently manages requests, minimizing the risk of being blocked by servers, and allows you to initiate multiple retrievals in parallel. As for the retrieved data, in addition to metadata and text, Trafilatura can also extract links, HTML formatting, and comments from web pages.
When it comes to output, Trafilatura supports various formats such as text, CSV, JSON, and even XML.
Additionally, you can directly call Trafilatura from your terminal using the following command:
trafilatura -u "https://easy-tutorials.com"
In summary, Trafilatura is an incredibly valuable tool for individuals working with web data who want to focus on the essential content. It simplifies the process of retrieving and extracting data from web pages, allowing users to efficiently manage sitemaps, feeds, and lists of links. With features like content filtering, deduplication, and support for various output formats, Trafilatura streamlines the workflow and helps users extract the relevant information they need from the vastness of the web.