00:04:36 <thuban> ok, i'm currently scraping all the page images, and i'll feed them back into archivebot once i've got the list
00:04:46 <thuban> (it'll take a little while, i'm being gentle)
00:07:32 <h2ibot> JustAnotherArchivist edited Deathwatch (+159, /* 2023 */ Add OneHallyu): https://wiki.archiveteam.org/?diff=51153&oldid=51152
00:08:12 <vokunal|m> It looks to be about 1.1M posts
00:08:21 <vokunal|m> nvm
00:09:18 <thuban> Peroniko: did you want any advice about converting to pdf and/or uploading to ia?
00:09:20 <thuban> the former is easy, but the latter will probably require some manual work due to the lack / inconsistent formatting of metadata
00:10:14 <vokunal|m> ~11.7M posts on OneHallyu
00:10:33 <JAA> Yeah, 11.8 million is what the homepage says.
00:11:38 <vokunal|m> someday i'll learn. First time i helped with a forum, I didn't see any number, the second time, i count it manually. Fool me twice, I better learn the third time
00:11:45 <JAA> :-)
00:11:46 <Peroniko> thuban: I have uploaded some thing to IA, but if there is a guide for better metadata it would be helpful.
00:11:55 <Peroniko> I've uploaded this for example: https://archive.org/details/arhitektura-graficki-dio
00:13:11 <thuban> Peroniko: the general metadata documentation is here: https://archive.org/developers/metadata-schema/index.html
00:24:25 <JAA> OneHallyu is running through AB now. We'll see how that goes.
00:24:40 <JAA> Buttflare is involved.
00:39:57 <kpcyrd> from -ot: how do I archive videos hosted on sharepoint? it's going to be deleted in a few days: https://kth-my.sharepoint.com/:v:/g/personal/longz_ug_kth_se/EesSEHqiHHtQabKFAAAx5PsB-5r8MVtnp5NECOtKN-YGsA?e=8s0kaj
00:42:55 <nicolas17> it's probably some temporary signed URL that will change on every load and can't be archived in a way that lets the original link work in WBM
00:44:58 <nicolas17> oof that seems to be DASH even
00:50:33 <thuban> kpcyrd: i don't think there's anything reliably plug-and-play for sharepoint
00:50:41 <nicolas17> transcoded on the fly from an original .mp4 that seems impossible to access
00:51:11 <thuban> you could try https://github.com/kylon/Sharedown or some of the workarounds suggested in sharepoint-related issues at https://github.com/snobu/destreamer
00:51:26 <thuban> (or punt and screen-record it)
00:52:19 <nulldata> There's an open PR for yt-dlp to add SharePoint. https://github.com/yt-dlp/yt-dlp/pull/6531
00:52:52 <nicolas17> I *could* grab the DASH but it sucks that I can't access the original .mp4 :/
00:54:12 <kpcyrd> I'm trying to get it into the wayback machine specifically:
00:54:13 <kpcyrd> https://web.archive.org/web/20231116002236/https://kth-my.sharepoint.com/personal/longz_ug_kth_se/_layouts/15/stream.aspx?id=%2Fpersonal%2Flongz%5Fug%5Fkth%5Fse%2FDocuments%2Fbox%5Ffiles%2FKTH%20SR%20Meetup%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%20%2D%20Jifeng%
00:54:15 <kpcyrd> 20Xuan%2Emp4&ga=1
00:54:19 <nicolas17> that's not going to work
00:54:25 <kpcyrd> rip
00:54:32 <nicolas17> there's timestamped, signed URLs that change every time you load the page
00:55:04 <thuban> ia item not adequate?
00:55:41 <nicolas17> ffmpeg doesn't do parallel requests (in fact I'm not sure if it does proper HTTP keepalive) so this DASH remux is taking me forever
01:09:31 <nicolas17> oh great I got some 503 Service Unavailable too
02:12:56 <nulldata> Gitlab is now requiring new users to verify using a phone number or credit card, or account will be deleted. So far only seems to apply to new accounts, but something to keep an eye on if they expand it to existing accounts. https://lemmy.world/post/8297909
02:23:49 <flashfire42|m> https://www.androidpolice.com/ensuring-high-quality-apps-on-google-play/
02:26:41 <flashfire42|m> https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/
02:28:42 <JAA> Correct link for the latter: https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/.204339
05:00:33 <h2ibot> JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51154&oldid=51143
07:53:11 <that_lurker> https://bird.makeup would be a nice alternative way to grab twitter(x) stuff. They create a mastodon account where all the tweets are posted.
07:54:10 <that_lurker> also means there is no rate limiting
07:54:26 <that_lurker> other than on their end of course
13:28:29 <null> https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat
13:29:54 <rktk> stupid sexy nickserv
13:29:56 <rktk> https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat
13:30:03 <rktk> Does anyone know of a full dump or half dump of open subtitles
13:30:07 <rktk> this is a real slap in the face
13:34:18 <h2ibot> MasterX244 edited List of websites excluded from the Wayback Machine (+28): https://wiki.archiveteam.org/?diff=51155&oldid=51036
14:00:23 <h2ibot> JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51156&oldid=51155
15:19:34 <arkiver> opensubtitles closing themselves?
16:31:29 <Hans5958> Has anyone backed this up yet? https://pabio.com/blog/company/bankruptcy/
16:43:55 <murb> oh talking of which https://www.bleed-clothing.com/de/info # "Wir sind insolvent."
17:12:09 <JAA> arkiver: 'Only' the API, as I understand it?
17:20:39 <Megame> Hans5958, murb I threw them in AB
17:21:06 <murb> ta
18:16:40 <fireonlive> here's the forum post about it: https://forum.opensubtitles.org/viewtopic.php?t=17930#p47873
18:17:16 <fireonlive> looks like the 'new rest api' still has a free tier: https://opensubtitles.stoplight.io/docs/opensubtitles-api/a7d25b650b784-api-subscription-prices
18:17:42 <fireonlive> though those prices are.. hm.
18:17:54 <JAA> XML-RPC... Ok, yeah, I agree that needs to die already.
18:19:02 <fireonlive> https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat posted earlier says "This decision, initially disclosed in a forum post, will primarily affect non-VIP users, while VIP members will continue to enjoy access to the API." so i guess they're keeping it around for VIP people for a
18:19:02 <fireonlive> bit longer at least?
18:19:12 <fireonlive> and yeah it does haha
20:15:41 <anarchat> so uh
20:16:24 <anarchat> apparently, i have a blogspot blog, two actually... and i learned this because google/blogger.com wrote me to tell me i haven't logged in since 2007 and so they will delete my shit... i wonder if we need to do something about this
20:16:36 <anarchat> my two blogs are totally irrelevant and empty, but there might be others facing destruction out there
20:18:50 <nicolas17> anarchat: it has been discussed before; how do we find "all blogs"?
20:20:17 <anarchat> i have no idea
20:30:16 <fireonlive> #frogger :)
20:33:34 <nicolas17> kpcyrd: https://data.nicolas17.xyz/localization-unreproducible-builds.mp4 this is from the DASH stream on sharepoint
20:34:08 <nicolas17> it wasn't easy to get because if I just fed the DASH manifest to ffmpeg, one or two segments would randomly give a "503 service unavailable" and ffmpeg doesn't retry
20:34:17 <nicolas17> so I got a gap in the video
20:34:36 <nicolas17> I had to download all segments and rewrite the manifest to use those local files
20:35:21 <nicolas17> take it and figure out what to do with it; archive.org item or whatever :P
20:46:41 <Pedrosso> https://transfer.archivete.am/J2GVQ/sporeforums1.txt does this list of largely unarchived spore forums have any URLs that the bot wouldn't be able to archive properly? In either case could the viable URLs be fed to AB?
20:48:17 <JAA> The ones that aren't entire domains could be problematic recursion-wise.
20:48:31 <Pedrosso> Problematic in what way?
20:48:31 <JAA> Other than that, not sure.
20:49:18 <JAA> Not recursing properly. If you !a https://example.org/foo/ and it has a link to /bar, that won't be followed.
20:49:32 <Pedrosso> not even with offsite links allowed?
20:49:46 <JAA> No, because they're not offsite.
20:50:21 <JAA> Offsite = different host
20:50:54 <Pedrosso> I getcha, had hoped it was just offsite (named after different host) = outside recursion
20:51:41 <JAA> For example, it wouldn't recurse anywhere useful from https://www.mobygames.com/forum/game/36030/spore/ because those URLs aren't in .../spore/.
20:52:11 <JAA> In that case, !a https://www.mobygames.com/forum/game/36030/spore would work though.
20:52:22 <JAA> But yeah, each of those needs to be looked at individually.
20:52:31 <JAA> And some might simply not be possible.
20:52:42 <Pedrosso> what does the / at the end do?
20:53:25 <JAA> It's a path segment delimiter. For the purpose of AB, the last slash in the path part of the URL determines where it'll recurse onsite.
20:53:50 <JAA> From https://www.mobygames.com/forum/game/36030/spore, it would recurse to any link starting with https://www.mobygames.com/forum/game/36030/ .
20:54:00 <Pedrosso> I see
20:56:06 <Pedrosso> Should I send an transfer.archivete.am link with just the full-domain ones then?
20:57:17 <JAA> No need, this one is fine.
20:58:29 <Pedrosso> Alright. Thanks a lot then
21:04:48 <JAA> I wonder whether https://blog.seamonkey-project.org/2023/11/14/migrating-off-archive-mozilla-org/ only applies to SeaMonkey or also to other projects or even the entire archive.mozilla.org.
21:04:56 <JAA> (It's already running through AB courtesy of arkiver.)
21:05:22 <JAA> Cc pabs ^
21:12:50 <JAA> I'm currently listing all of archive.mozilla.org. It's ... large.
21:13:09 <JAA> I'll have a size estimate later.
21:20:27 <arkiver> JAA: maybe it's too large for ArchiveBot, i wonder how large it is. hope we can archive it entirely
21:24:12 <JAA> I'm already up to over 1.2 million *directories* after only processing 17k.
21:24:15 <JAA> So yeah...
21:29:16 <JAA> To rephrase it a bit clearer: I've processed 17k directories and discovered over 1.2 million directories from those. I'm recursing through the dir tree, obviously.
21:30:07 <JAA> And those numbers are now at 32k done, 2.1M discovered.
21:30:14 <JAA> It'll be a while...
21:31:00 <Pedrosso> How long will it stay up? Assuming it has any sort of shutdown date
21:31:55 <JAA> See link above
21:32:43 <JAA> Beware of https://archive.mozilla.org/pub/firefox/tinderbox-builds/ , those subdirs are *massive*. Like, 100 MB dir listings massive.
21:33:48 <JAA> There's also at least one which doesn't finish loading within a minute.
21:34:13 <Pedrosso> does AB ignore something if it doesn't load within a minute?
21:34:14 <project10> mod_autoindex like 😰
21:34:39 <JAA> It's complicated.
21:35:27 <JAA> AB expects the HTTP headers within 20 seconds and the complete response within 6 hours, but slow processing of parallel requests (such as link extraction or compressing for WARC) can break the retrieval.
21:35:45 <JAA> I bet most of the dirs in there were not listed correctly on the first attempt by AB.
21:36:54 <JAA> The 1 minute timeout is the default in qwarc, which I'm using for listing this more efficiently.
21:37:09 <nicolas17> 100MB *listings*?
21:38:29 <JAA> Running into a problem, will need to restart the listing.
21:40:52 <Pedrosso> I bid you good luck with this, lookin' forward to seeing just how big the listing file will be.
21:41:50 <JAA> nicolas17: Yes, autoland-linux64 is that one, it contains 195k entries.
21:42:37 <JAA> autoland-macosx64-debug times out on the server side after a bit over a minute with a 502.
21:50:49 <JAA> Listing restarted, going more faster now.
21:54:51 <JAA> (I hope there are no loops via symlinks.)
21:59:36 <JAA> Oh, this time autoland-linux64 repeatedly timed out as well, yay.
22:02:12 <JAA> I think I'm running into SQLite lock contention at this point. But processing 7-9k dirs per minute isn't bad.
22:09:40 <Pedrosso> As for what I sent of spore forums, here are a few archive-related comments about the domains of the few that weren't directly in the domain https://transfer.archivete.am/mawW8/sporeforums%20addendum.txt
22:17:30 <Pedrosso> An addendum to that addendum; https://gamefaqs.gamespot.com/ has an archive but https://gamefaqs.gamespot.com/boards/926714-spore/72994456 (posted before the archive) is missing (https://gamefaqs.gamespot.com/boards/926714-spore has 1 archive from ArchiveTeam though)
22:20:32 <nicolas17> my modem rebooted... maybe because of telegrab at high concurrency /o\
22:21:09 <JAA> I don't think we ever fully archived GameFAQs. I believe there were unsuccessful/incomplete attempts only.
22:21:19 <nicolas17> (reddit is much more prone to doing that)
22:22:31 <nicolas17> does wget-at use keepalive?
22:25:45 <Pedrosso> Ah, I see.
22:50:33 <JAA> Now doing over 10k dirs per minute. Brrrrr
22:50:55 <JAA> Still going to take at least 3 hours to get through the remaining queue. lol
22:51:06 <JAA> So yes, it is marginally too big for AB. :-P
22:52:14 <Pedrosso> What alternatives are there then?
22:52:46 <JAA> It does depend a bit on how many files there are and how large they are.
22:52:53 <JAA> DPoS would be an option.
22:53:13 <JAA> Or maybe it can be done with AB with a few !ao < jobs rather than one big recursive one.
22:53:43 <JAA> The listings I've retrieved so far are already over 1 GiB of WARC, i.e. after compression.
22:54:04 <Pedrosso> o_o
22:54:10 <Pedrosso> how many would "a few" be?
22:54:11 <h2ibot> Arkiver uploaded File:Blogger-icon.png: https://wiki.archiveteam.org/?title=File%3ABlogger-icon.png
22:54:45 <JAA> And arkiver spoke: 'let there be an icon!', and there was an icon.
22:54:57 <Pedrosso> That is how it be
22:55:17 <fireonlive> and it was glorious
22:55:56 <Flashfire42> And the administrators of those websites said "Did anybody hear that?, Must have been the wind"
22:57:57 <Flashfire42> Whenever a new project is about to start I always imagine some kind of eldritch abomination machine just slowly whirring to life. With eyes of red blink into existence and start a march towards their target
23:00:09 <JAA> Pedrosso: 'A few' would be more than 'a couple' but not 'many'. :-P  I don't know, it depends on the output of the listing.
23:00:12 <h2ibot> JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=51159&oldid=51154
23:02:59 <arkiver> yep :)
23:03:19 <arkiver> JAA: what is your opinion on already writing a WARC-TLS-Cipher-Suite field before it's standardised?
23:03:32 <arkiver> (related to that issue on the warc specs github repo)
23:04:50 <arkiver> or actually
23:05:05 <arkiver> WARC-Cipher-Suite (the value starting with TLS_ already makes it clear it's for TLS)
23:05:21 <fireonlive> (thank you for not calling it SSL)
23:06:01 <arkiver> i'm glad i made your day fireonlive :)
23:06:07 <fireonlive> :D
23:11:28 <Pedrosso> JAA: oh, well it's nice that there are such convenient solutions
23:26:10 <pabs> JAA: I expect archive.mozilla.org has a lot of stuff that isn't that useful to archive, like millions of test results :)
23:27:50 <JAA> arkiver: Fine with me, it's not a violation of the spec to write fields that aren't specified. Might be worth leaving a comment about the intent on https://github.com/iipc/warc-specifications/issues/86 though and seeing if anyone else has concerns about that.
23:29:22 <JAA> pabs: Yeah, I'm sure there are more and less useful parts to it.
23:29:40 <JAA> Have you possibly seen another announcement from Mozilla themselves about it?
23:29:54 <pabs> not yet, but I did just wake up :)
23:30:46 <pabs> nothing on https://planet.mozilla.org/
23:31:22 <pabs> nothing on https://blog.thunderbird.net/ either
23:31:32 <JAA> Ah yes, time zones. :-)
23:31:35 <pabs> maybe it is only ex-Mozilla projects moving?
23:49:23 <thuban> repeating some requests related to old.dlib.me here, since they got lost in #archivebot:
23:49:33 <thuban> https://transfer.archivete.am/AGArb/www.old.dlib.me-document-viewers-nom - yet another slightly different viewer url
23:49:45 <thuban> https://transfer.archivete.am/ej3GO/www.old.dlib.me-item-pdfs - a small number of items available as pdf rather than through the document viewer
23:50:04 <thuban> https://transfer.archivete.am/y7EDo/www.old.dlib.me-item-info-byname - item info pages, as linked from the library index (extracted from post xhr--not that we can duplicate that, but it's what external links are likely to be). media items like photos and videos are included in page assets
23:50:19 <thuban> https://transfer.archivete.am/157ht6/www.old.dlib.me-item-info-byid - item info pages, by document id (this is the only way to see metadata for some items, mostly newspapers)
23:50:39 <thuban> i believe that's everything that will actually work
23:56:23 <JAA> I'll run them shortly.