00:03:44 what non-crawling URL enumeration mechanisms are there other than CDX and search engines?
00:19:46 Is something going on with Tumblr or is it just standard archiving going on in ArchiveBot with no special circumstances behind it?
00:19:57 I'm seeing a lot of Tumblr blogs there
00:20:28 i'm doing a backup with some help
00:20:42 tumblr's been deleting blogs and i want to save what i care about and was going to save later this summer
00:20:42 We have a channel for Tumblr, and the reason's been mentioned there: some ToS change in November.
00:20:59 (what's the channel for tumblr?)
00:21:16 Please be dumblr
00:21:36 #tumbledown (see wiki, where each major project has a page mentioning the channel)
00:22:01 I was about to say #tumbledown, just remembered it lol
00:50:47 Does anyone here have time to archive https://forum.doom9.org/, https://forum.videohelp.com/, https://www.digitalfaq.com/forum/, and https://audiosex.pro/ to get whatever Imgur links remain there?
04:00:13 Thezt edited ISP Hosting (+180, ZoomInternet offline): https://wiki.archiveteam.org/?diff=49843&oldid=49839
04:54:42 Worth keeping an eye on Nintendo emulators; there was a DMCA against Dolphin’s Steam release: https://dolphin-emu.org/blog/2023/05/27/dolphin-steam-indefinitely-postponed/
05:18:40 XDA Forums is another one worth archiving and scraping for Imgur links
05:18:52 Nintendo being Nintendo
05:23:27 They poked the hornets nest by trying to get listed on Steam
05:23:51 Their software is legal, but Nintendo doesn't give a shit about legality
05:25:06 the takedown also makes no sense
05:25:36 you can send a takedown for a "coming soon" page because you think the software that *will* be published in the future there is a copyright violation?
13:03:38 buttflare 521 stole me a few pages at planetminecraft crawl and due to how their shitty pagination works the missing data is at the end of the pagination and there is no way to skip to there
13:10:24 the takedown also makes no sense
13:10:24 you can send a takedown for a "coming soon" page because you think the software that *will* be
13:10:31 They're a Japanese company
13:10:41 Copyright law tends to be whatever they want it to be
13:10:50 And if it's not they don't give a shit
13:12:21 What are the devs going to do? Spend hundreds of thousands of dollars trying to sue them in court which Nintendo can easily drag out to a year long court battle in order to bleed them dry?
13:12:45 This is the same company that DMCAs let's play videos on YouTube
13:13:52 it was apparently not exactly a DMCA takedown; see https://mastodon.delroth.net/@delroth/110440308907131051
13:14:29 but basically seems to just have been between Valve & Nintendo
13:14:34 but yes, Ninty being Ninty
21:29:21 JAA: Seems like The PokéCommunity archivation job is not as "almost done" as I thought. Do you mind doing another Imgur batch from there once you have time?
22:06:46 on the topic of Nintendo, https://torrentfreak.com/nintendos-war-with-1fichier-is-not-over-but-could-be-for-0-00-230419/ is a fascinating read
22:26:42 Doesn't surprise me that Nintendo didn't take them up on their offer
22:27:50 Nintendo wants it their way no matter what
22:36:01 I'm not convinced Nintendo will win this in appeals
22:36:15 there's some very interesting stuff going on there with the reasoning of the court
22:36:53 the argument is essentially "everybody knows Nintendo, you should've known", but... that is not consistent with the expectation of "equal rule of law for all" that AFAIK the EU sets as a hard requirement for membership
22:37:16 if 1fichier takes this to the EU, it could become a very tasty case
22:44:21 Yeah, but countries can ignore EU law once they're in
22:44:47 They might lose EU grants over it, but other than that there's nothing stopping them
22:46:04 For example anti-LGBT zones in Poland, pretty much everything going on in Hungary, or how countries like Sweden chose to ignore the repeal of the Data Retention Directive
22:47:10 And since France is #2 in the EU after Germany, nothing will happen to them
22:47:24 Germany needs their approval to rule the EU
23:10:09 Wow, grab-site can be pretty CPU intensive
23:10:15 I presume this is a Python moment?
23:17:22 andrew: Is this a large crawl that has been retrieving stuff from numerous hosts?
23:23:37 JAA: it's a pretty large crawl intended to only crawl a couple hosts
23:23:59 it started veering off course and crawling some other stuff though, so I added a nice regex to the ignores list
23:24:27 Hmm, typically, slowdowns are from the cookie jar, which is horrible once it accumulates cookies from a couple thousand hosts.
23:24:46 its CPU usage seems to be high when it's parsing through some large pages
23:24:56 with lots of links
23:25:36 I know that grab-site does some things differently than AB in that area but am not familiar with the details, so can't comment on that.
23:25:53 HTML parsing is a significant factor overall though.
23:25:58 is the HTML parsing done in Python
23:26:26 I don't know what grab-site does. AB uses libxml2. wpull defaults to html5lib, which is pure Python and *SLOW*.
23:26:42 well that explains a lot :P
23:27:05 But IIRC grab-site uses something else that's supposed to be faster.
23:27:34 Yeah, ludios_wpull uses html5-parser.
23:28:51 Another thing that can matter is the DB insertion, but that only comes into play when you go to ... hundreds of millions of URLs?
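
Editor's sketch for the 23:23:59 message about adding a regex to the ignores list to keep a crawl on a couple of hosts. The log does not show the actual regex or hosts, so everything here is hypothetical: the host names, the helper names `build_offsite_ignore` and `should_ignore`, and the assumption that an ignore pattern is a regular expression searched against the full URL. It only illustrates one way such a pattern could be written, using a negative lookahead so that any URL on a host outside the allowed set is ignored.

```python
import re

# Hypothetical hosts; the log does not say which hosts the crawl targeted.
ALLOWED_HOSTS = ["forum.example.com", "www.example.com"]


def build_offsite_ignore(allowed_hosts):
    """Return a regex that matches http(s) URLs whose host is NOT allowed."""
    hosts = "|".join(re.escape(h) for h in allowed_hosts)
    # Negative lookahead: fail the match when an allowed host (with an
    # optional port) is followed by a path, query, fragment, or end of URL.
    return re.compile(
        r"^https?://(?!(?:%s)(?::\d+)?(?:[/?#]|$))" % hosts,
        re.IGNORECASE,
    )


def should_ignore(url, patterns):
    """True if any ignore pattern is found in the URL (assumed semantics)."""
    return any(p.search(url) for p in patterns)


IGNORES = [build_offsite_ignore(ALLOWED_HOSTS)]

for url in (
    "https://forum.example.com/thread/1",
    "https://cdn.example.net/a.png",
    "https://forum.example.com.evil.test/",
):
    print(url, "ignored" if should_ignore(url, IGNORES) else "kept")
```

The trailing `(?:[/?#]|$)` in the lookahead is there so that a look-alike host such as `forum.example.com.evil.test` is not mistaken for an allowed one.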