00:34:00 JustAnotherArchivist edited Blogger (+182, Fix source and tracker links, update status): https://wiki.archiveteam.org/?diff=51174&oldid=51148
02:09:06 Anyone wanna archive the Questionable Content forums? (https://forums.questionablecontent.net) It's a webforum with approximately 900k posts, about 20 years old. Mostly webcomic related content.
02:09:59 why though? is it at risk of dying?
02:10:40 Yeah, the comic's server is showing 503s a lot, and it's likely that they're going to move to a new host. The forum was locked some time ago, and no new posts or registrations are allowed
02:10:49 Locked as of Jan 1 this year.
02:11:12 oh :| I didn't know of that
02:11:40 The current speculation is that Jeph (comic author) and his tech team might opt not to migrate the forums because of the additional complexity in doing so
02:11:53 * nicolas17 hasn't even read the comics in a few years
02:15:24 * Pedrosso didn't know of its existence
02:15:41 * Pedrosso wants it saved anyway, of course
02:21:36 phuzion: I've queued an archivebot job for them - 900k posts is large but should be doable
02:22:36 pokechu22: the forums are behind cloudflare, so I'd check to make sure that it's working properly at some point.
02:26:53 Yeah - there's other stuff running in archivebot right now so it's not started yet, but if it finishes abnormally quickly then I'll know it's cloudflare at least (not sure what we could do about it for something that large though)
02:37:18 ohai
02:37:39 Looks like their Buttflare config isn't very aggressive, so should be possible even if it doesn't work with AB.
02:42:11 should DeviantArt's sitemap be grabbed proactively? I'm surprised it hasn't been hidden from public yet and it's very big
02:43:11 Seems to be running fine, albeit with timeouts and general sluggishness.
03:19:17 JAA: Yeah that server is creaky right now. The main comic page doesn't load about 1/3 of the time, it seems.
03:35:44 Barto pls
03:35:47 Worst SLA ever
03:35:52 xD
05:41:05 JAA: are the logs from an AB job saved/accessible anywhere?
05:46:04 project10: Yes, they're in the *-meta.warc.gz file. For aborted or crashed jobs, there's a -wpull.log.gz file instead, though that isn't indexed by the viewer; it should normally be in the same item as the *.json file.
05:46:43 cool, I kinda had an inkling it might be saved in the data uploaded to IA itself. Throw nothing away and all that
05:47:20 =]
05:47:34 Yeah, we do currently throw away the DB file though, which has some data that's hard to extract otherwise and is much more suitable for many analysis things.
05:48:17 ah like a quick sweep for failed urls or certain outlinks i suppose
05:48:19 There's a... uh... three years old issue about it: https://github.com/ArchiveTeam/ArchiveBot/issues/465
05:49:19 Yeah. And some links get indexed by wpull but silently ignored. They only appear in the raw responses and in the DB.
05:50:20 ahh
07:42:13 Has anyone already mentioned dotup.org and its light version, light.dotup.org, the website that will be shut down on November 30?
07:43:00 These sites are relatively simple and could be done by AB.
07:45:02 The light version and the normal version seem to have different size limits for uploading and different uploaded content.
08:41:58 sonick: JAA did the non-lite version on 20231103
08:42:37 https://archive.fart.website/archivebot/viewer/?q=dotup.org
08:44:59 stuck light one in AB now
08:45:32 hmm, file uploads are still enabled
08:46:03 it's in Deathwatch so I guess someone will do another save near the deadline
08:59:34 ok, thanks.
15:46:01 sonick, pabs: That job only grabbed the most recent files because the pagination's limited, but I intend to do another run bruteforcing the older files (extensions have to be guessed).
15:57:23 If anyone is able to access https://javiermilei.com/ , a grab-site crawl would be great. I tried from machines in 9 countries and got blocked everywhere. It might need to be run in Argentina.
15:58:16 i'm blocked by CF there as well
16:41:10 JAA: yeah works for me, I think he had some kind of raffle a while ago so they had to stop people using bots from foreign IPs?
16:41:40 give me a tl;dr for grab-site
16:51:53 nicolas17: Nice. The way I run grab-site is without the web interface and stuff, one container per target site: https://gitea.arpa.li/JustAnotherArchivist/grab-site-docker
16:54:28 Since I can't look at the site, no idea what options are required here. :-|
17:07:52 Do you use docker exec for ignores and such or just yolo it without
17:10:48 nano as root
17:11:01 No risk, no fun.
17:11:34 But since it's a mount, it should be fine.
17:12:11 no vim for JAA :o?
17:12:41 no emacs for JAA?
17:13:50 I'd normally use a magnetised needle, but that's kind of hard to do remotely.
17:14:01 true true
17:14:09 we need to get you one of those surgeon robots
17:14:11 Butterflies would work though, I suppose.
17:15:02 I wish we could archive the editor jokes eventually
17:16:30 * JAA archives kpcyrd.
17:16:39 I like to rearrange the 1s and 0s with an electric magnet. Might edit the text or might kill the disk, but at least I'm not running as root so I'm safe
17:16:44 * kpcyrd .zip
17:17:55 JAA: do I need to build the container or is it in some repo already?
17:18:17 the build command is in the readme
17:18:43 ^
17:18:55 Haven't pushed it anywhere, no.
17:18:55 yes
17:19:02 that_lurker: I'm asking if I have to :P
17:25:16 ugh
17:25:30 JAA: does grab-site run as an unprivileged user inside the container?
17:27:05 [Errno 13] Permission denied: '/data/javiermilei.com-2023-11-20-e125cf6c'
17:27:38 nicolas17: Yes: https://gitea.arpa.li/JustAnotherArchivist/grab-site-docker/src/commit/398726f73e84233a584fe096d916799fa3c90006/Dockerfile#L48
17:28:05 so I guess I need to make the data dir world writable
17:28:31 RuntimeError: html5-parser and lxml are using different versions of libxml2. This happens commonly when using pip installed versions of lxml. Use pip install --no-binary lxml lxml instead. libxml2 versions: html5-parser: (2, 9, 14) != lxml: (2, 10, 3)
17:28:32 Hmm, yeah, there might be room for improvement.
17:28:49 Ugh
17:29:14 non-reproducible build tsk tsk
17:29:58 you developers and your chmod 777 😾
17:32:44 I'll take a look at the libxml2 issue in a sec.
17:33:58 .
17:34:02 my builds are reproducible, they will fail every time 🤷
17:36:40 xD
17:37:31 Megame edited Deathwatch (+219, /* 2023 */ Okada Books - Nov 30): https://wiki.archiveteam.org/?diff=51175&oldid=51170
18:00:25 Is there an equivalent to https://snapshot.debian.org/ for Alpine, such that you can 'install packages as they were at a specific datetime'?
18:02:10 forcing it to alpine 3.13 isn't enough? there's breaking changes to packages within the same alpine version? bleh
18:05:31 JAA: no, but please let me know if you find one
18:06:21 nicolas17: I mean, it might be, but my point is rather about how to do reproducible builds with Alpine.
18:06:35 sad news: you can't
18:06:38 Welp
18:06:55 the error is likely related to python dependencies and unrelated to alpine tho?
18:07:00 Yeah
18:07:10 I was thinking more broadly about reproducibility.
18:07:12 https://gitlab.alpinelinux.org/alpine/abuild/-/issues/9996
18:07:56 I was not aware they outright delete old packages. Oof.
18:08:17 so does debian
18:08:23 hence having a separate snapshot service :P
18:08:44 the other problem with alpine is the build environments are not really documented, even if you have all old packages it's difficult to tell which ones you need to pick to re-create the original build environment
18:10:04 other distros solve this with buildinfo files (the OG sbom basically), but Alpine is also stuck in this apk2-apk3 migration thing
18:10:14 so they decided against adding buildinfo files to apk2
18:10:25 Well, yeah, but snapshot is part of the Debian project. So bit different, I think. (Although I believe snapshot.d.o might sometimes miss things if there are rapid uploads? I've heard something like that at least.)
18:10:54 the only "proper" archive I'm aware of is https://archive.archlinux.org/
18:11:06 snapshot.debian.org becoming an Official Part of the Project is relatively recent, it used to be snapshot.debian.net
18:11:07 Yeah, Arch seems to do a good job at this.
18:11:16 Ah, interesting.
18:17:14 > The official recommendation is to keep your own mirror / repository with all the specific package and their versions that you may want to use.
18:17:27 For Alpine. Ok then...
18:18:04 🤷
18:20:00 but yeah I bet your problem is not pinning Python dep versions
18:21:18 Yeah, but indirectly. grab-site doesn't directly depend on html5-parser or lxml.
18:22:16 or you could push your working image somewhere :p
18:37:24 JAA: the python ecosystem is very silly compared to other languages. Ideally you would have something like package-lock.json, Cargo.lock or composer.lock that records your dependency graph.
18:38:36 hmm you can pin requirements in the .txt can’t you
18:38:44 but i guess that’s also annoying
18:40:07 tl;dr "yeah idk lol" https://stackoverflow.com/questions/52665596/equivalent-of-package-json-and-package-lock-json-for-pip
18:42:26 "python is supposed to be easy, can we have easy dependency management too?" - "we have easy dependency management at home"
18:42:34 dependency management at home: https://stackoverflow.com/questions/58218592/feature-comparison-between-npm-pip-pipenv-and-poetry-package-managers
18:47:25 😅
19:00:18 kpcyrd: yet pipenv and poetry seem to do exactly what you say?
19:01:00 but who uses those
19:08:47 There's also pip-tools. But I don't disagree.
19:09:32 On the other hand, those packages need to be *constantly* updated for bug or security fixes anywhere in the dependency tree, which is also very silly.
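A minimal sketch of the "pin requirements in the .txt" idea from the discussion above, assuming a plain requirements.txt. It only flags lines without an exact `==` pin. The helper name and the sample file contents (including the version number) are hypothetical, and this is not how grab-site or its Dockerfile handles dependencies. It also only sees what is listed: transitive dependencies such as html5-parser or lxml are only covered if the file lists the whole resolved tree.

```python
import re

# An exact pin looks like "name==1.2.3", optionally with extras and an
# environment marker. Ranges, wildcards and bare names count as unpinned.
PIN_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s;,*]+\s*(;.*)?$"
)


def unpinned_requirements(text):
    """Return the requirement lines in `text` that lack an exact '==' pin."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        if not PIN_RE.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "lxml==4.9.3\nhtml5-parser\nrequests>=2.0\n"
    for line in unpinned_requirements(sample):
        print("not pinned:", line)
```

Running a check like this in CI would at least show when a rebuild can pull in a different version than the last working one.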
19:09:40 those package lists*
19:55:02 hmm yeah
20:02:54 https://www.forbes.ru/biznes/494353-andeks-zadumalsa-o-prodaze-svoego-biznesa-v-izraile - Yandex thinks about selling its business located in Israel.
20:03:19 https://www.golosameriki.com/a/yandex-can-sell-its-entire-business-in-russia/7355003.html - Yandex can sell its entire business in Russia
20:03:20 oh interesting, thanks Gooshka. what sites/fronts does it have there?
20:03:47 I sent some links in AB channel.
20:03:51 oh wow; all of yandex
20:04:00 Gooshka: ah! i missed that. thanks as always :)
20:05:44 https://github.com/yandex/ , https://github.com/yandex-cloud/ , https://huggingface.co/yandex , https://yandex.ru/company/ , https://yandex.ru/legal/ , https://yandex.ru/support/
20:05:49 etc.
20:09:37 https://toloka.ai/ , https://toloka.ai/tolokers/ru/ (formerly https://toloka.yandex.ru/ ), has page on WKP: https://en.wikipedia.org/wiki/Toloka .
20:11:27 https://yandex.ru/dev/ - technologies of Yandex.
20:15:15 https://yatalks2023.com/ , https://yatalks.yandex.ru/ , I can't find YaTalks before 2023 on sites like yatalks2023.com, only pages like this: https://yatalks2023.com/2022/ru .
20:15:37 JAA: got a working grab-site yet? :P
20:36:00 https://habr.com/ru/companies/yandex/ - blog of Yandex team.
20:36:50 https://shedevrum.ai/ - AI by Yandex creates beautiful pictures of animals and people.
20:41:35 https://yandex.ru/lab/countries - game in which you guess what country is in the photo. It follows Russian laws, so Abkhazia is not part of Georgia according to this. Player 2 is Alisa, AI by Yandex. Some other goods under /lab/ directory.
21:54:50 hii; so i'm very crudely™ monitoring urls that archivebot hits until something more betterer is in place - so far i'm looking for blogger/blogspot and imgur.. any others I should look for?
21:55:16 imgur because most pipelines just get a 429 from imgur right away
21:55:19 (it seems)
22:06:49 fireonlive: https://wiki.archiveteam.org/index.php/Category:Projects_requiring_URL_lists mediafire
22:07:07 ah yes :)
22:07:10 thanks
22:37:04 fireonlive: Telegram, perhaps?
22:37:34 You'll want to filter out the share links though.
22:38:26 ah right!
22:43:56 ah, someone should add the 'do you have a list' template to the telegram wiki page
22:44:49 idk exactly what the regex would be
22:48:21 The reason it isn't there is that we don't currently have a bot that takes arbitrary URLs and extracts items for the tracker from it, like we do for Imgur and MediaFire.
22:57:55 Added it, but we won't be able to make full use of the lists easily yet.
22:58:35 JustAnotherArchivist edited Telegram (+113, Add URL list CTA): https://wiki.archiveteam.org/?diff=51176&oldid=50298
23:04:34 JAA++
23:04:34 -eggdrop- [karma] 'JAA' now has 4 karma!
23:21:06 ah, fair
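
A rough sketch of the URL monitoring discussed above: it sorts URLs into the projects named in the chat (blogger/blogspot, imgur, mediafire, Telegram) by hostname instead of a single regex. The host table, the function name, and the rule that Telegram share links sit under `/share` on `t.me` are assumptions for illustration, not what any ArchiveTeam bot actually does. Country-specific blogspot domains are not handled.

```python
from urllib.parse import urlsplit

# Host suffix -> project name. This table is an illustrative assumption.
WATCHED = {
    "blogspot.com": "blogger",
    "blogger.com": "blogger",
    "imgur.com": "imgur",
    "mediafire.com": "mediafire",
    "t.me": "telegram",
}


def classify(url):
    """Return the project a URL belongs to, or None if it is not watched."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    for suffix, project in WATCHED.items():
        # Match the domain itself or any subdomain, but not lookalikes
        # such as "notimgur.com".
        if host == suffix or host.endswith("." + suffix):
            # Assumption: Telegram share links live under /share and are noise.
            if project == "telegram" and parts.path.startswith("/share"):
                return None
            return project
    return None


if __name__ == "__main__":
    for u in (
        "https://example.blogspot.com/2023/11/post.html",
        "https://i.imgur.com/abc.png",
        "https://t.me/share/url?url=https://example.com",
    ):
        print(u, "->", classify(u))
```

Matching on the parsed hostname avoids false hits where one of these domains only appears inside a query string.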