-
h2ibot
JustAnotherArchivist edited Blogger (+182, Fix source and tracker links, update status):
wiki.archiveteam.org/?diff=51174&oldid=51148
-
phuz-test
Anyone wanna archive the Questionable Content forums (forums.questionablecontent.net)? It's a web forum with approximately 900k posts, about 20 years old. Mostly webcomic-related content.
-
nicolas17
why though? is it at risk of dying?
-
phuzion
Yeah, the comic's server is showing 503s a lot, and it's likely that they're going to move to a new host. The forum was locked some time ago, and no new posts or registrations are allowed
-
phuzion
Locked as of Jan 1 this year.
-
nicolas17
oh :| I didn't know of that
-
phuzion
The current speculation is that Jeph (comic author) and his tech team might opt not to migrate the forums because of the additional complexity in doing so
-
» nicolas17 hasn't even read the comics in a few years
-
» Pedrosso didn't know of its existence
-
» Pedrosso wants it saved anyway, of course
-
pokechu22
phuzion: I've queued an archivebot job for them - 900k posts is large but should be doable
-
phuzion
pokechu22: the forums are behind cloudflare, so I'd check to make sure that it's working properly at some point.
-
pokechu22
Yeah - there's other stuff running in archivebot right now so it's not started yet, but if it finishes abnormally quickly then I'll know it's cloudflare at least (not sure what we could do about it for something that large though)
-
JAA
ohai
-
JAA
Looks like their Buttflare config isn't very aggressive, so should be possible even if it doesn't work with AB.
-
Pedrosso
should DeviantArt's sitemap be grabbed proactively? I'm surprised it hasn't been hidden from public yet and it's very big
-
JAA
Seems to be running fine, albeit with timeouts and general sluggishness.
-
phuzion
JAA: Yeah that server is creaky right now. The main comic page doesn't load about 1/3 of the time, it seems.
-
JAA
Barto pls
-
JAA
Worst SLA ever
-
fireonlive
xD
-
project10
JAA: are the logs from an AB job saved/accessible anywhere?
-
JAA
project10: Yes, they're in the *-meta.warc.gz file. For aborted or crashed jobs, there's a -wpull.log.gz file instead, though that isn't indexed by the viewer; it should normally be in the same item as the *.json file.
-
project10
cool, I kinda had an inkling it might be saved in the data uploaded to IA itself. Throw nothing away and all that
-
fireonlive
=]
-
JAA
Yeah, we do currently throw away the DB file though, which has some data that's hard to extract otherwise and is much more suitable for many analysis things.
-
fireonlive
ah like a quick sweep for failed urls or certain outlinks i suppose
-
JAA
There's a... uh... three-year-old issue about it:
ArchiveTeam/ArchiveBot #465
-
JAA
Yeah. And some links get indexed by wpull but silently ignored. They only appear in the raw responses and in the DB.
-
fireonlive
ahh
-
sonick
Has anyone already mentioned dotup.org and its light version, light.dotup.org, a website that will be shut down on November 30?
-
sonick
These sites are relatively simple and could be done by AB.
-
sonick
The light version and the normal version seem to have different size limits for uploading and different uploaded content.
-
pabs
sonick: JAA did the non-lite version on 20231103
-
pabs
stuck light one in AB now
-
pabs
hmm, file uploads are still enabled
-
pabs
it's in Deathwatch so I guess someone will do another save near the deadline
-
sonick
ok, thanks.
-
JAA
sonick, pabs: That job only grabbed the most recent files because the pagination's limited, but I intend to do another run bruteforcing the older files (extensions have to be guessed).
-
JAA
If anyone is able to access javiermilei.com, a grab-site crawl would be great. I tried from machines in 9 countries and got blocked everywhere. It might need to be run in Argentina.
-
anarchat
i'm blocked by CF there as well
-
nicolas17
JAA: yeah works for me, I think he had some kind of raffle a while ago so they had to stop people using bots from foreign IPs?
-
nicolas17
give me a tl;dr for grab-site
-
JAA
nicolas17: Nice. The way I run grab-site is without the web interface and stuff, one container per target site:
gitea.arpa.li/JustAnotherArchivist/grab-site-docker
-
JAA
Since I can't look at the site, no idea what options are required here. :-|
-
that_lurker
Do you use docker exec for ignores and such or just yolo it without
-
JAA
nano as root
-
JAA
No risk, no fun.
-
JAA
But since it's a mount, it should be fine.
-
fireonlive
no vim for JAA :o?
-
project10
no emacs for JAA?
-
JAA
I'd normally use a magnetised needle, but that's kind of hard to do remotely.
-
fireonlive
true true
-
fireonlive
we need to get you one of those surgeon robots
-
JAA
Butterflies would work though, I suppose.
-
kpcyrd
I wish we could archive the editor jokes eventually
-
» JAA archives kpcyrd.
-
that_lurker
I like to rearrange the 1s and 0s with an electromagnet. Might edit the text or might kill the disk, but at least I'm not running as root so I'm safe
-
» kpcyrd .zip
-
nicolas17
JAA: do I need to build the container or is it in some repo already?
-
that_lurker
the build command is in the readme
-
JAA
^
-
JAA
Haven't pushed it anywhere, no.
-
nicolas17
yes
-
nicolas17
that_lurker: I'm asking if I have to :P
-
nicolas17
ugh
-
nicolas17
JAA: does grab-site run as an unprivileged user inside the container?
-
nicolas17
[Errno 13] Permission denied: '/data/javiermilei.com-2023-11-20-e125cf6c'
-
nicolas17
so I guess I need to make the data dir world writable
-
nicolas17
RuntimeError: html5-parser and lxml are using different versions of libxml2. This happens commonly when using pip installed versions of lxml. Use pip install --no-binary lxml lxml instead. libxml2 versions: html5-parser: (2, 9, 14) != lxml: (2, 10, 3)
-
JAA
Hmm, yeah, there might be room for improvement.
-
JAA
Ugh
-
nicolas17
non-reproducible build tsk tsk
-
fireonlive
you developers and your chmod 777 😾
-
JAA
I'll take a look at the libxml2 issue in a sec.
-
project10
.
-
AK
my builds are reproducible, they will fail every time 🤷
-
fireonlive
xD
-
h2ibot
Megame edited Deathwatch (+219, /* 2023 */ Okada Books - Nov 30):
wiki.archiveteam.org/?diff=51175&oldid=51170
-
JAA
Is there an equivalent to snapshot.debian.org for Alpine, such that you can 'install packages as they were at a specific datetime'?
-
nicolas17
forcing it to alpine 3.13 isn't enough? there's breaking changes to packages within the same alpine version? bleh
-
kpcyrd
JAA: no, but please let me know if you find one
-
JAA
nicolas17: I mean, it might be, but my point is rather about how to do reproducible builds with Alpine.
-
kpcyrd
sad news: you can't
-
JAA
Welp
-
kpcyrd
the error is likely related to python dependencies and unrelated to alpine tho?
-
JAA
Yeah
-
JAA
I was thinking more broadly about reproducibility.
-
nicolas17
-
JAA
I was not aware they outright delete old packages. Oof.
-
nicolas17
so does debian
-
nicolas17
hence having a separate snapshot service :P
-
kpcyrd
the other problem with alpine is the build environments are not really documented, even if you have all old packages it's difficult to tell which ones you need to pick to re-create the original build environment
-
kpcyrd
other distros solve this with buildinfo files (the OG sbom basically), but Alpine is also stuck in this apk2-apk3 migration thing
-
kpcyrd
so they decided against adding buildinfo files to apk2
-
JAA
Well, yeah, but snapshot is part of the Debian project. So bit different, I think. (Although I believe snapshot.d.o might sometimes miss things if there are rapid uploads? I've heard something like that at least.)
-
kpcyrd
the only "proper" archive I'm aware of is
archive.archlinux.org
-
nicolas17
snapshot.debian.org becoming an Official Part of the Project is relatively recent, it used to be snapshot.debian.net
-
JAA
Yeah, Arch seems to do a good job at this.
-
JAA
Ah, interesting.
-
JAA
> The official recommendation is to keep your own mirror / repository with all the specific package and their versions that you may want to use.
-
JAA
For Alpine. Ok then...
-
kpcyrd
🤷
-
nicolas17
but yeah I bet your problem is not pinning Python dep versions
-
JAA
Yeah, but indirectly. grab-site doesn't directly depend on html5-parser or lxml.
-
nicolas17
or you could push your working image somewhere :p
-
kpcyrd
JAA: the python ecosystem is very silly compared to other languages. Ideally you would have something like package-lock.json, Cargo.lock or composer.lock that records your dependency graph.
-
fireonlive
hmm you can pin requirements in the .txt can’t you
-
fireonlive
but i guess that’s also annoying
-
kpcyrd
"python is supposed to be easy, can we have easy dependency management too?" - "we have easy dependency management at home"
-
fireonlive
😅
-
nicolas17
kpcyrd: yet pipenv and poetry seem to do exactly what you say?
-
fireonlive
but who uses those
-
JAA
There's also pip-tools. But I don't disagree.
-
JAA
On the other hand, those packages need to be *constantly* updated for bug or security fixes anywhere in the dependency tree, which is also very silly.
-
JAA
those package lists*
-
fireonlive
hmm yeah
-
Gooshka
-
Gooshka
-
fireonlive
oh interesting, thanks Gooshka. what sites/fronts does it have there?
-
Gooshka
I sent some links in AB channel.
-
fireonlive
oh wow; all of yandex
-
fireonlive
Gooshka: ah! i missed that. thanks as always :)
-
Gooshka
etc.
-
Gooshka
yandex.ru/dev - technologies of Yandex.
-
Gooshka
yatalks2023.com, yatalks.yandex.ru: I can't find YaTalks before 2023 on sites like yatalks2023.com, only pages like this: yatalks2023.com/2022/ru.
-
nicolas17
JAA: got a working grab-site yet? :P
-
Gooshka
shedevrum.ai - AI by Yandex creates beautiful pictures of animals and people.
-
Gooshka
yandex.ru/lab/countries - a game in which you guess which country is in the photo. It follows Russian laws, so Abkhazia is not part of Georgia according to this. Player 2 is Alisa, AI by Yandex. There are some other goodies under the /lab/ directory.
-
fireonlive
hii; so i'm very crudely™ monitoring urls that archivebot hits until something more betterer is in place - so far i'm looking for blogger/blogspot and imgur.. any others I should look for?
-
fireonlive
imgur because most pipelines just get a 429 from imgur right away
-
fireonlive
(it seems)
-
thuban
-
fireonlive
ah yes :)
-
fireonlive
thanks
-
JAA
fireonlive: Telegram, perhaps?
-
JAA
You'll want to filter out the share links though.
-
fireonlive
ah right!
-
thuban
ah, someone should add the 'do you have a list' template to the telegram wiki page
-
thuban
idk exactly what the regex would be
-
JAA
The reason it isn't there is that we don't currently have a bot that takes arbitrary URLs and extracts items for the tracker from it, like we do for Imgur and MediaFire.
-
JAA
Added it, but we won't be able to make full use of the lists easily yet.
-
h2ibot
JustAnotherArchivist edited Telegram (+113, Add URL list CTA):
wiki.archiveteam.org/?diff=51176&oldid=50298
-
fireonlive
JAA++
-
eggdrop
[karma] 'JAA' now has 4 karma!
-
thuban
ah, fair