
What is the gap between this and Beautiful Soup?


This tool can extract data in a structured format from virtually any website, with any HTML structure.

With Beautiful Soup, you'd need to explicitly specify where each piece of data lives by referencing HTML tags, ids, classes, etc., and you'd have to do that for each website you want to process.
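
To make the contrast concrete, here is a rough sketch of the per-site wiring that Beautiful Soup needs. The HTML snippet, the site key `example-blog`, and the CSS selectors are all made up for illustration; every real site would need its own entry in the selector map.

```python
from bs4 import BeautifulSoup

# Hypothetical page from one site; real markup differs from site to site.
SAMPLE_HTML = """
<html><body>
  <h1 class="post-title">Scraping notes</h1>
  <span id="author">Jane Doe</span>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

# One selector map per site: this is the part you write and maintain by hand.
SITE_SELECTORS = {
    "example-blog": {
        "title": "h1.post-title",
        "author": "#author",
        "body": "div.post-body",
    },
}


def scrape(html, site):
    """Extract the fields configured for `site`; a missing node yields None."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, selector in SITE_SELECTORS[site].items():
        node = soup.select_one(selector)
        record[field] = node.get_text(" ", strip=True) if node else None
    return record


record = scrape(SAMPLE_HTML, "example-blog")
print(record)
```

If the site changes its class names, the selectors silently stop matching and the fields come back as `None`, which is the maintenance cost being described.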


The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features

Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
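
To give a feel for how much code even one item on that list takes, here is a minimal standard-library sketch of URL de-duplication. It is not how trafilatura does it; the normalization rules and the `TRACKING_PARAMS` list are my own assumptions, picked only to show the idea.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters treated as tracking noise; this list is an assumption.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the cleaned query.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def dedupe(urls):
    """Return URLs in first-seen order, skipping normalized duplicates."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "https://Example.com/post/",
    "https://example.com/post?utm_source=hn",
    "https://example.com/post#comments",
    "https://example.com/other",
]
print(dedupe(urls))
```

That covers one bullet point, and it still ignores redirects, `www.` prefixes, and session ids. Multiply by the rest of the list and the "enormous amount of extra code" adds up quickly.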


Maybe bs4 + newspaper3k rolled into one? But still, what's the gap?


Regarding content extraction, it's more accurate than newspaper3k (especially for languages other than English) and it extracts more information: metadata, text, and comments. It works out of the box in most cases, so there's no need to write a dedicated scraper for each website, which saves time. If you only care about 2-3 websites and are willing to write and maintain scraping scripts, then bs4/lxml/whatever is also fine.
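
For reference, the out-of-the-box usage looks roughly like the sketch below. It relies on `trafilatura.fetch_url` and `trafilatura.extract` as described in the project's documentation; the wrapper name `extract_main_text` is mine, and keyword arguments may differ between versions, so check the docs for the release you install.

```python
from typing import Optional


def extract_main_text(url: str) -> Optional[str]:
    """Download a page and return its main text, or None on failure."""
    # Imported inside the function so the sketch loads even where the
    # third-party trafilatura package is not installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download fails
    if downloaded is None:
        return None
    # No per-site selectors: the same call is used for every website.
    return trafilatura.extract(downloaded, include_comments=True)
```

The point is that the same function is called for any article URL, with no selector map to maintain per site.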

It also provides functions and a command-line interface for collecting data on your own (say, finding recent news using feeds). So in the end it's not merely about text extraction but also about text discovery.
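
As an illustration of what "finding recent news using feeds" involves, here is a small standard-library sketch that pulls recent links out of an RSS 2.0 feed. It is only meant to show the idea, not the library's own implementation; the sample feed, the `recent_links` helper, and the 7-day window are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; a real one would be downloaded first.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item>
    <title>Fresh story</title>
    <link>https://example.org/fresh</link>
    <pubDate>Tue, 02 Jan 2024 09:00:00 +0000</pubDate>
  </item>
  <item>
    <title>Old story</title>
    <link>https://example.org/old</link>
    <pubDate>Fri, 01 Dec 2023 09:00:00 +0000</pubDate>
  </item>
</channel></rss>
"""


def recent_links(feed_xml, now, max_age=timedelta(days=7)):
    """Return links of feed items published within `max_age` of `now`."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        published = item.findtext("pubDate")
        if not link or not published:
            continue  # skip items without the fields we need
        if now - parsedate_to_datetime(published) <= max_age:
            links.append(link.strip())
    return links


now = datetime(2024, 1, 3, tzinfo=timezone.utc)
print(recent_links(SAMPLE_FEED, now))
```

The resulting links are what you would then hand to the extraction step, which is how discovery and extraction fit together.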



