-
pabs
what non-crawling URL enumeration mechanisms are there other than CDX and search engines?
-
icedice
Is something going on with Tumblr or is it just standard archiving going on in ArchiveBot with no special circumstances behind it?
-
icedice
I'm seeing a lot of Tumblr blogs there
-
hlgs|m
i'm doing a backup with some help
-
hlgs|m
tumblr's been deleting blogs and i want to save what i care about and was going to save later this summer
-
JAA
We have a channel for Tumblr, and the reason's been mentioned there: some ToS change in November.
-
hlgs|m
(what's the channel for tumblr?)
-
icedice
Please be dumblr
-
JAA
#tumbledown (see wiki, where each major project has a page mentioning the channel)
-
icedice
I was about to say #tumbledown, just remembered it lol
-
icedice
Does anyone here have time to archive
forum.doom9.org,
forum.videohelp.com,
digitalfaq.com/forum, and
audiosex.pro to get whatever Imgur links remain there?
-
h2ibot
Thezt edited ISP Hosting (+180, ZoomInternet offline):
wiki.archiveteam.org/?diff=49843&oldid=49839
-
tech234a
Worth keeping an eye on Nintendo emulators; there was a DMCA against Dolphin’s Steam release:
dolphin-emu.org/blog/2023/05/27/dolphin-steam-indefinitely-postponed
-
icedice
XDA Forums is another one worth archiving and scraping for Imgur links
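
A minimal sketch of what "scraping for Imgur links" could look like once forum pages have been saved, assuming the pages are available as HTML strings. The pattern and the helper name `extract_imgur_links` are illustrative assumptions, not the tooling actually used for these jobs, and the regex will not catch every Imgur URL form.

```python
import re

# Assumption: an Imgur link is any URL on imgur.com or one of its subdomains
# (e.g. i.imgur.com), with or without an explicit scheme.
IMGUR_URL = re.compile(
    r"(?:https?:)?//(?:[a-z0-9-]+\.)?imgur\.com/[A-Za-z0-9/_.-]+",
    re.IGNORECASE,
)


def extract_imgur_links(html):
    """Return the unique Imgur URLs found in an HTML string, in first-seen order."""
    seen = set()
    links = []
    for match in IMGUR_URL.finditer(html):
        # Drop a trailing full stop picked up from surrounding prose.
        url = match.group(0).rstrip(".")
        # Normalise protocol-relative URLs so duplicates collapse.
        if url.startswith("//"):
            url = "https:" + url
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

The resulting list could then be deduplicated across pages and fed to whatever job handles the Imgur items.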
-
icedice
Nintendo being Nintendo
-
icedice
They poked the hornet's nest by trying to get listed on Steam
-
icedice
Their software is legal, but Nintendo doesn't give a shit about legality
-
nicolas17
the takedown also makes no sense
-
nicolas17
you can send a takedown for a "coming soon" page because you think the software that *will* be published in the future there is a copyright violation?
-
masterX244
buttflare 521 stole me a few pages at planetminecraft crawl and due to how their shitty pagination works the missing data is at the end of the pagination and there is no way to skip to there
-
icedice
<nicolas17> the takedown also makes no sense
-
icedice
<nicolas17> you can send a takedown for a "coming soon" page because you think the software that *will* be
-
icedice
They're a Japanese company
-
icedice
Copyright law tends to be whatever they want it to be
-
icedice
And if it's not they don't give a shit
-
icedice
What are the devs going to do? Spend hundreds of thousands of dollars trying to sue them in court which Nintendo can easily drag out to a year long court battle in order to bleed them dry?
-
icedice
This is the same company that DMCAs let's play videos on YouTube
-
FireFly
it was apparently not exactly a DMCA takedown; see
mastodon.delroth.net/@delroth/110440308907131051
-
FireFly
but basically seems to just have been between Valve & Nintendo
-
FireFly
but yes, Ninty being Ninty
-
icedice
JAA: Seems like The PokéCommunity archivation job is not as "almost done" as I thought. Do you mind doing another Imgur batch from there once you have time?
-
icedice
Doesn't surprise me that Nintendo didn't take them up on their offer
-
icedice
Nintendo wants it their way no matter what
-
joepie91|m
I'm not convinced Nintendo will win this in appeals
-
joepie91|m
there's some very interesting stuff going on there with the reasoning of the court
-
joepie91|m
the argument is essentially "everybody knows Nintendo, you should've known", but... that is not consistent with the expectation of "equal rule of law for all" that AFAIK the EU sets as a hard requirement for membership
-
joepie91|m
if 1fichier takes this to the EU, it could become a very tasty case
-
icedice
Yeah, but countries can ignore EU law once they're in
-
icedice
They might lose EU grants over it, but other than that there's nothing stopping them
-
icedice
For example anti-LGBT zones in Poland, pretty much everything going on in Hungary, or how countries like Sweden chose to ignore the repeal of the Data Retention Directive
-
icedice
And since France is #2 in the EU after Germany, nothing will happen to them
-
icedice
Germany needs their approval to rule the EU
-
andrew
Wow, grab-site can be pretty CPU intensive
-
andrew
I presume this is a Python moment?
-
JAA
andrew: Is this a large crawl that has been retrieving stuff from numerous hosts?
-
andrew
JAA: it's a pretty large crawl intended to only crawl a couple hosts
-
andrew
it started veering off course and crawling some other stuff though, so I added a nice regex to the ignores list
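
A rough sketch of the kind of ignore regex that keeps a crawl on a couple of hosts, assuming ignore patterns are regular expressions searched against the full URL. The host names and the helpers `build_offsite_ignore` and `is_ignored` are hypothetical and only illustrate the idea; this is not the pattern andrew actually added.

```python
import re

# Hypothetical hosts the crawl is supposed to stay on.
ALLOWED_HOSTS = {"example.org", "forum.example.org"}


def build_offsite_ignore(allowed_hosts):
    """Build a regex that matches any http(s) URL whose host is NOT allowed."""
    alternatives = "|".join(re.escape(host) for host in sorted(allowed_hosts))
    # Negative lookahead: the host (plus optional port) must be followed by
    # "/" or the end of the URL, so "example.org.evil.com" is still ignored.
    return re.compile(rf"^https?://(?!(?:{alternatives})(?::\d+)?(?:/|$))")


def is_ignored(url, pattern):
    """Return True if the URL matches the ignore pattern."""
    return pattern.search(url) is not None


OFFSITE_IGNORE = build_offsite_ignore(ALLOWED_HOSTS)
```

Printing `OFFSITE_IGNORE.pattern` gives a single pattern string that could be pasted into an ignores list, after checking it against a handful of known on-site and off-site URLs.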
-
JAA
Hmm, typically, slowdowns are from the cookie jar, which is horrible once it accumulates cookies from a couple thousand hosts.
-
andrew
its CPU usage seems to be high when it's parsing through some large pages
-
andrew
with lots of links
-
JAA
I know that grab-site does some things differently than AB in that area but am not familiar with the details, so can't comment on that.
-
JAA
HTML parsing is a significant factor overall though.
-
andrew
is the HTML parsing done in Python?
-
JAA
I don't know what grab-site does. AB uses libxml2. wpull defaults to html5lib, which is pure Python and *SLOW*.
-
andrew
well that explains a lot :P
-
JAA
But IIRC grab-site uses something else that's supposed to be faster.
-
JAA
Yeah, ludios_wpull uses html5-parser.
-
JAA
Another thing that can matter is the DB insertion, but that only comes into play when you go to ... hundreds of millions of URLs?