00:34:00 <h2ibot> JustAnotherArchivist edited Blogger (+182, Fix source and tracker links, update status): https://wiki.archiveteam.org/?diff=51174&oldid=51148
02:09:06 <phuz-test> Anyone wanna archive the Questionable Content forums? (https://forums.questionablecontent.net) It's a webforum with approximately 900k posts, about 20 years old. Mostly webcomic related content.
02:09:59 <nicolas17> why though? is it at risk of dying?
02:10:40 <phuzion> Yeah, the comic's server is showing 503s a lot, and it's likely that they're going to move to a new host. The forum was locked some time ago, and no new posts or registrations are allowed
02:10:49 <phuzion> Locked as of Jan 1 this year.
02:11:12 <nicolas17> oh :| I didn't know of that
02:11:40 <phuzion> The current speculation is that Jeph (comic author) and his tech team might opt not to migrate the forums because of the additional complexity in doing so
02:11:53 * nicolas17 hasn't even read the comics in a few years
02:15:24 * Pedrosso didn't know of its existence
02:15:41 * Pedrosso wants it saved anyway, of course
02:21:36 <pokechu22> phuzion: I've queued an archivebot job for them - 900k posts is large but should be doable
02:22:36 <phuzion> pokechu22: the forums are behind cloudflare, so I'd check to make sure that it's working properly at some point.
02:26:53 <pokechu22> Yeah - there's other stuff running in archivebot right now so it's not started yet, but if it finishes abnormally quickly then I'll know it's cloudflare at least (not sure what we could do about it for something that large though)
02:37:18 <JAA> ohai
02:37:39 <JAA> Looks like their Buttflare config isn't very aggressive, so should be possible even if it doesn't work with AB.
02:42:11 <Pedrosso> should DeviantArt's sitemap be grabbed proactively? I'm surprised it hasn't been hidden from public yet and it's very big
02:43:11 <JAA> Seems to be running fine, albeit with timeouts and general sluggishness.
03:19:17 <phuzion> JAA: Yeah that server is creaky right now. The main comic page doesn't load about 1/3 of the time, it seems.
03:35:44 <JAA> Barto pls
03:35:47 <JAA> Worst SLA ever
03:35:52 <fireonlive> xD
05:41:05 <project10> JAA: are the logs from an AB job saved/accessible anywhere?
05:46:04 <JAA> project10: Yes, they're in the *-meta.warc.gz file. For aborted or crashed jobs, there's a -wpull.log.gz file instead, though that isn't indexed by the viewer; it should normally be in the same item as the *.json file.
05:46:43 <project10> cool, I kinda had an inkling it might be saved in the data uploaded to IA itself. Throw nothing away and all that
05:47:20 <fireonlive> =]
05:47:34 <JAA> Yeah, we do currently throw away the DB file though, which has some data that's hard to extract otherwise and is much more suitable for many analysis things.
05:48:17 <fireonlive> ah like a quick sweep for failed urls or certain outlinks i suppose
05:48:19 <JAA> There's a... uh... three years old issue about it: https://github.com/ArchiveTeam/ArchiveBot/issues/465
05:49:19 <JAA> Yeah. And some links get indexed by wpull but silently ignored. They only appear in the raw responses and in the DB.
05:50:20 <fireonlive> ahh
07:42:13 <sonick> Has anyone already mentioned dotup.org and its light version, light.dotup.org, which will be shut down on November 30?
07:43:00 <sonick> These sites are relatively simple and could be done by AB.
07:45:02 <sonick> The light version and the normal version seem to have different size limits for uploading and different uploaded content.
08:41:58 <pabs> sonick: JAA did the non-lite version on 20231103
08:42:37 <pabs> https://archive.fart.website/archivebot/viewer/?q=dotup.org
08:44:59 <pabs> stuck light one in AB now
08:45:32 <pabs> hmm, file uploads are still enabled
08:46:03 <pabs> it's in Deathwatch so I guess someone will do another save near the deadline
08:59:34 <sonick> ok, thanks.
15:46:01 <JAA> sonick, pabs: That job only grabbed the most recent files because the pagination's limited, but I intend to do another run bruteforcing the older files (extensions have to be guessed).
15:57:23 <JAA> If anyone is able to access https://javiermilei.com/ , a grab-site crawl would be great. I tried from machines in 9 countries and got blocked everywhere. It might need to be run in Argentina.
15:58:16 <anarchat> i'm blocked by CF there as well
16:41:10 <nicolas17> JAA: yeah works for me, I think he had some kind of raffle a while ago so they had to stop people using bots from foreign IPs?
16:41:40 <nicolas17> give me a tl;dr for grab-site
16:51:53 <JAA> nicolas17: Nice. The way I run grab-site is without the web interface and stuff, one container per target site: https://gitea.arpa.li/JustAnotherArchivist/grab-site-docker
16:54:28 <JAA> Since I can't look at the site, no idea what options are required here. :-|
17:07:52 <that_lurker> Do you use docker exec for ignores and such or just yolo it without
17:10:48 <JAA> nano as root
17:11:01 <JAA> No risk, no fun.
17:11:34 <JAA> But since it's a mount, it should be fine.
17:12:11 <fireonlive> no vim for JAA :o?
17:12:41 <project10> no emacs for JAA?
17:13:50 <JAA> I'd normally use a magnetised needle, but that's kind of hard to do remotely.
17:14:01 <fireonlive> true true
17:14:09 <fireonlive> we need to get you one of those surgeon robots
17:14:11 <JAA> Butterflies would work though, I suppose.
17:15:02 <kpcyrd> I wish we could archive the editor jokes eventually
17:16:30 * JAA archives kpcyrd.
17:16:39 <that_lurker> I like to rearrange the 1s and 0s with an electromagnet. Might edit the text or might kill the disk, but at least I'm not running as root so I'm safe
17:16:44 * kpcyrd .zip
17:17:55 <nicolas17> JAA: do I need to build the container or is it in some repo already?
17:18:17 <that_lurker> the build command is in the readme
17:18:43 <JAA> ^
17:18:55 <JAA> Haven't pushed it anywhere, no.
17:18:55 <nicolas17> yes
17:19:02 <nicolas17> that_lurker: I'm asking if I have to :P
17:25:16 <nicolas17> ugh
17:25:30 <nicolas17> JAA: does grab-site run as an unprivileged user inside the container?
17:27:05 <nicolas17> [Errno 13] Permission denied: '/data/javiermilei.com-2023-11-20-e125cf6c'
17:27:38 <JAA> nicolas17: Yes: https://gitea.arpa.li/JustAnotherArchivist/grab-site-docker/src/commit/398726f73e84233a584fe096d916799fa3c90006/Dockerfile#L48
17:28:05 <nicolas17> so I guess I need to make the data dir world writable
17:28:31 <nicolas17> RuntimeError: html5-parser and lxml are using different versions of libxml2. This happens commonly when using pip installed versions of lxml. Use pip install --no-binary lxml lxml instead. libxml2 versions: html5-parser: (2, 9, 14) != lxml: (2, 10, 3)
17:28:32 <JAA> Hmm, yeah, there might be room for improvement.
17:28:49 <JAA> Ugh
17:29:14 <nicolas17> non-reproducible build tsk tsk
17:29:58 <fireonlive> you developers and your chmod 777 😾
17:32:44 <JAA> I'll take a look at the libxml2 issue in a sec.
17:33:58 <project10> .
17:34:02 <AK> my builds are reproducible, they will fail every time 🤷
17:36:40 <fireonlive> xD
17:37:31 <h2ibot> Megame edited Deathwatch (+219, /* 2023 */ Okada Books - Nov 30): https://wiki.archiveteam.org/?diff=51175&oldid=51170
18:00:25 <JAA> Is there an equivalent to https://snapshot.debian.org/ for Alpine, such that you can 'install packages as they were at a specific datetime'?
18:02:10 <nicolas17> forcing it to alpine 3.13 isn't enough? there's breaking changes to packages within the same alpine version? bleh
18:05:31 <kpcyrd> JAA: no, but please let me know if you find one
18:06:21 <JAA> nicolas17: I mean, it might be, but my point is rather about how to do reproducible builds with Alpine.
18:06:35 <kpcyrd> sad news: you can't
18:06:38 <JAA> Welp
18:06:55 <kpcyrd> the error is likely related to python dependencies and unrelated to alpine tho?
18:07:00 <JAA> Yeah
18:07:10 <JAA> I was thinking more broadly about reproducibility.
18:07:12 <nicolas17> https://gitlab.alpinelinux.org/alpine/abuild/-/issues/9996
18:07:56 <JAA> I was not aware they outright delete old packages. Oof.
18:08:17 <nicolas17> so does debian
18:08:23 <nicolas17> hence having a separate snapshot service :P
18:08:44 <kpcyrd> the other problem with alpine is the build environments are not really documented, even if you have all old packages it's difficult to tell which ones you need to pick to re-create the original build environment
18:10:04 <kpcyrd> other distros solve this with buildinfo files (the OG sbom basically), but Alpine is also stuck in this apk2-apk3 migration thing
18:10:14 <kpcyrd> so they decided against adding buildinfo files to apk2
18:10:25 <JAA> Well, yeah, but snapshot is part of the Debian project. So bit different, I think. (Although I believe snapshot.d.o might sometimes miss things if there are rapid uploads? I've heard something like that at least.)
18:10:54 <kpcyrd> the only "proper" archive I'm aware of is https://archive.archlinux.org/
18:11:06 <nicolas17> snapshot.debian.org becoming an Official Part of the Project is relatively recent, it used to be snapshot.debian.net
18:11:07 <JAA> Yeah, Arch seems to do a good job at this.
18:11:16 <JAA> Ah, interesting.
18:17:14 <JAA> > The official recommendation is to keep your own mirror / repository with all the specific package and their versions that you may want to use.
18:17:27 <JAA> For Alpine. Ok then...
18:18:04 <kpcyrd> 🤷
18:20:00 <nicolas17> but yeah I bet your problem is not pinning Python dep versions
18:21:18 <JAA> Yeah, but indirectly. grab-site doesn't directly depend on html5-parser or lxml.
18:22:16 <nicolas17> or you could push your working image somewhere :p
18:37:24 <kpcyrd> JAA: the python ecosystem is very silly compared to other languages. Ideally you would have something like package-lock.json, Cargo.lock or composer.lock that records your dependency graph.
18:38:36 <fireonlive> hmm you can pin requirements in the .txt can’t you
18:38:44 <fireonlive> but i guess that’s also annoying
18:40:07 <kpcyrd> tl;dr "yeah idk lol" https://stackoverflow.com/questions/52665596/equivalent-of-package-json-and-package-lock-json-for-pip
18:42:26 <kpcyrd> "python is supposed to be easy, can we have easy dependency management too?" - "we have easy dependency management at home"
18:42:34 <kpcyrd> dependency management at home: https://stackoverflow.com/questions/58218592/feature-comparison-between-npm-pip-pipenv-and-poetry-package-managers
18:47:25 <fireonlive> 😅
19:00:18 <nicolas17> kpcyrd: yet pipenv and poetry seem to do exactly what you say?
19:01:00 <fireonlive> but who uses those
19:08:47 <JAA> There's also pip-tools. But I don't disagree.
19:09:32 <JAA> On the other hand, those packages need to be *constantly* updated for bug or security fixes anywhere in the dependency tree, which is also very silly.
19:09:40 <JAA> those package lists*
19:55:02 <fireonlive> hmm yeah
20:02:54 <Gooshka> https://www.forbes.ru/biznes/494353-andeks-zadumalsa-o-prodaze-svoego-biznesa-v-izraile - Yandex thinks about selling its business located in Israel.
20:03:19 <Gooshka> https://www.golosameriki.com/a/yandex-can-sell-its-entire-business-in-russia/7355003.html - Yandex can sell its entire business in Russia
20:03:20 <fireonlive> oh interesting, thanks Gooshka. what sites/fronts does it have there?
20:03:47 <Gooshka> I sent some links in AB channel.
20:03:51 <fireonlive> oh wow; all of yandex
20:04:00 <fireonlive> Gooshka: ah! i missed that. thanks as always :)
20:05:44 <Gooshka> https://github.com/yandex/ , https://github.com/yandex-cloud/ , https://huggingface.co/yandex , https://yandex.ru/company/ , https://yandex.ru/legal/ , https://yandex.ru/support/
20:05:49 <Gooshka> etc.
20:09:37 <Gooshka> https://toloka.ai/ , https://toloka.ai/tolokers/ru/ (formerly https://toloka.yandex.ru/ ), has page on WKP: https://en.wikipedia.org/wiki/Toloka .
20:11:27 <Gooshka> https://yandex.ru/dev/ - technologies of Yandex.
20:15:15 <Gooshka> https://yatalks2023.com/ , https://yatalks.yandex.ru/ , I can't find YaTalks before 2023 on sites like yatalks2023.com, only pages like this: https://yatalks2023.com/2022/ru .
20:15:37 <nicolas17> JAA: got a working grab-site yet? :P
20:36:00 <Gooshka> https://habr.com/ru/companies/yandex/ - blog of Yandex team.
20:36:50 <Gooshka> https://shedevrum.ai/ - AI by Yandex creates beautiful pictures of animals and people.
20:41:35 <Gooshka> https://yandex.ru/lab/countries - a game in which you guess which country is in the photo. It follows Russian laws, so Abkhazia is not part of Georgia according to this. Player 2 is Alisa, AI by Yandex. There are some other goodies under the /lab/ directory.
21:54:50 <fireonlive> hii; so i'm very crudely™ monitoring urls that archivebot hits until something more betterer is in place - so far i'm looking for blogger/blogspot and imgur.. any others I should look for?
21:55:16 <fireonlive> imgur because most pipelines just get a 429 from imgur right away
21:55:19 <fireonlive> (it seems)
22:06:49 <thuban> fireonlive: https://wiki.archiveteam.org/index.php/Category:Projects_requiring_URL_lists mediafire
22:07:07 <fireonlive> ah yes :)
22:07:10 <fireonlive> thanks
22:37:04 <JAA> fireonlive: Telegram, perhaps?
22:37:34 <JAA> You'll want to filter out the share links though.
22:38:26 <fireonlive> ah right!
22:43:56 <thuban> ah, someone should add the 'do you have a list' template to the telegram wiki page
22:44:49 <thuban> idk exactly what the regex would be
22:48:21 <JAA> The reason it isn't there is that we don't currently have a bot that takes arbitrary URLs and extracts items for the tracker from it, like we do for Imgur and MediaFire.
22:57:55 <JAA> Added it, but we won't be able to make full use of the lists easily yet.
22:58:35 <h2ibot> JustAnotherArchivist edited Telegram (+113, Add URL list CTA): https://wiki.archiveteam.org/?diff=51176&oldid=50298
23:04:34 <fireonlive> JAA++
23:04:34 -eggdrop- [karma] 'JAA' now has 4 karma!
23:21:06 <thuban> ah, fair