Hello, I am the author of the scrape. I did it mostly as an experiment, but who knows, maybe it will be useful to someone.
I went through the description pages like http://thepiratebay.se/torrent/$i, incrementing $i and saving the magnet link whenever Pirate Bay didn't return a 404 error. I fetched the pages without logging in, though, so that might be why I got only 1.5m torrents.
I didn't know Pirate Bay has hidden porn torrents; there are TONS of porn torrents in the scrape already.
The script is in Perl; I will post it to Pastebin in a moment.
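For readers who want to see the shape of the ID-walking approach described above, here is a minimal sketch. It is written in Python, not Perl, and it is not the actual script: the function names, the regular expression for finding the magnet link, and the assumption that the link appears as an href attribute in the page are all illustrative guesses. Only the URL pattern and the "skip on 404" rule come from the post.

```python
import html
import re
import urllib.error
import urllib.request

# URL pattern taken from the post; everything else below is an assumption.
BASE_URL = "http://thepiratebay.se/torrent/{}"

# Assumed markup: the magnet link sits in an href="..." attribute.
MAGNET_RE = re.compile(r'href="(magnet:\?[^"]+)"')


def fetch_page(torrent_id):
    """Return the description page as text, or None if the site answers 404."""
    url = BASE_URL.format(torrent_id)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no torrent with this id
        raise  # any other HTTP error is left to the caller


def extract_magnet(page):
    """Return the first magnet link found in the page, or None."""
    match = MAGNET_RE.search(page)
    if match is None:
        return None
    # Undo HTML escaping such as &amp; inside the attribute value.
    return html.unescape(match.group(1))


def scrape_range(start, stop, fetch=fetch_page):
    """Yield (id, magnet) pairs for every id in [start, stop) that has one."""
    for torrent_id in range(start, stop):
        page = fetch(torrent_id)
        if page is None:
            continue  # 404: skip this id
        magnet = extract_magnet(page)
        if magnet is not None:
            yield torrent_id, magnet
```

The fetch parameter exists only so the loop can be tried against canned pages instead of the live site, for example scrape_range(1, 4, fetch=lambda i: pages.get(i)) with a small dict of test pages.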
I am thinking of releasing a new version once a week and posting the hash of the newest version's torrent on some public site (say, a Twitter account).
But it would still be more a proof of concept than anything really useful - the comments and descriptions ARE important.
edit: The more I think about it, the less useful it sounds.
First, the seeder counts vary constantly, especially for new torrents.
Also, it STILL depends on a single point of failure - the Pirate Bay itself. If TPB goes down for any reason, I will have nowhere to scrape from and it will all fall apart anyway.
Plus, I think Pirate Bay itself should make dumps like this. It would probably be much better for their database anyway :)
I like the idea of a weekly Twitter update with the master magnet hash. I feel like the purpose would be less about the usefulness of the string of chars and more about proving a point.
The porn torrents are only hidden from naive searchers; all the pages for them are still accessible if you've got a direct link to them, so your scraper should've picked all of them up.
I tried to run the script, but I get an error (I added the diagnostics pragma for more info, so line 13 below refers to line 11 of your script, and line 27 to line 25):
Can't use an undefined value as an ARRAY reference at
piratebay_magnet_scrape.pl line 13 (#1)
(F) A value used as either a hard reference or a symbolic reference must
be a defined value. This helps to delurk some insidious errors.
Uncaught exception from user code:
Can't use an undefined value as an ARRAY reference at piratebay_magnet_scrape.pl line 13.
at piratebay_magnet_scrape.pl line 13
main::__ANON__(20697, 0, undef, 0, 0) called at /usr/share/perl5/Parallel/ForkManager.pm line 354
Parallel::ForkManager::on_finish('Parallel::ForkManager=HASH(0x9cd7ac8)', 20697, 0, undef, 0, 0) called at /usr/share/perl5/Parallel/ForkManager.pm line 333
Parallel::ForkManager::wait_one_child('Parallel::ForkManager=HASH(0x9cd7ac8)', undef) called at /usr/share/perl5/Parallel/ForkManager.pm line 285
Parallel::ForkManager::start('Parallel::ForkManager=HASH(0x9cd7ac8)') called at piratebay_magnet_scrape.pl line 27
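The trace shows the finish callback being called with five arguments (20697, 0, undef, 0, 0) and then failing while dereferencing an undefined value as an array. One plausible reading, which is an assumption since the script is not shown here, is that the callback expects a list of results from the child that never arrived. The Parallel::ForkManager documentation describes passing data back to the parent as new in version 0.7.6, so an older installed module could be one cause. Whatever the cause, the usual defence is to check for a missing result before using it. The sketch below shows that guard in Python rather than Perl; collect_results and the worker argument are hypothetical names, and a thread pool stands in for the forked children.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_results(ids, worker, max_workers=4):
    """Run worker(id) in parallel and merge the lists the workers return.

    A worker may return None (nothing found, or it could not report back),
    so the parent checks before treating the result as a list.
    """
    collected = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as ids.
        for result in pool.map(worker, ids):
            if result is None:
                continue  # the guard: skip workers that delivered no data
            collected.extend(result)
    return collected
```

In the Perl script the equivalent would be a defined-check on the data argument inside the finish callback before it is used as an array reference.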
The script is in Perl; I will post it to Pastebin in a moment.
edit: All right, the script itself is here: http://pastebin.com/8RXXthXB
As you can see, it's not very complicated.