00:04:36 ok, i'm currently scraping all the page images, and i'll feed them back into archivebot once i've got the list
00:04:46 (it'll take a little while, i'm being gentle)
00:07:32 JustAnotherArchivist edited Deathwatch (+159, /* 2023 */ Add OneHallyu): https://wiki.archiveteam.org/?diff=51153&oldid=51152
00:08:12 It looks to be about 1.1M posts
00:08:21 nvm
00:09:18 Peroniko: did you want any advice about converting to pdf and/or uploading to ia?
00:09:20 the former is easy, but the latter will probably require some manual work due to the lack / inconsistent formatting of metadata
00:10:14 ~11.7M posts on OneHallyu
00:10:33 Yeah, 11.8 million is what the homepage says.
00:11:38 someday i'll learn. First time i helped with a forum, I didn't see any number, the second time, i count it manually. Fool me twice, I better learn the third time
00:11:45 :-)
00:11:46 thuban: I have uploaded some things to IA, but if there is a guide for better metadata it would be helpful.
00:11:55 I've uploaded this for example: https://archive.org/details/arhitektura-graficki-dio
00:13:11 Peroniko: the general metadata documentation is here: https://archive.org/developers/metadata-schema/index.html
00:24:25 OneHallyu is running through AB now. We'll see how that goes.
00:24:40 Buttflare is involved.
00:39:57 from -ot: how do I archive videos hosted on sharepoint?
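The metadata discussion above can be illustrated with a short sketch. This is not from the chat: the identifier, filenames, and field values below are hypothetical examples, and the (commented-out) `upload()` call is from the third-party `internetarchive` Python library that IA's developer docs describe.

```python
# Hypothetical metadata dict for an IA upload, following the schema linked
# above (https://archive.org/developers/metadata-schema/index.html).
# All values here are made-up examples, not taken from the chat.
metadata = {
    "mediatype": "texts",                  # controls how IA renders the item
    "title": "Arhitektura - graficki dio", # human-readable title
    "language": "hrv",                     # ISO 639 language code
    "subject": ["architecture", "course material"],  # a list becomes a repeated field
    "description": "Scanned course material; metadata filled in manually.",
}

# With the third-party `internetarchive` package installed and configured,
# the upload itself would look roughly like:
# from internetarchive import upload
# upload("arhitektura-graficki-dio", files=["scans.pdf"], metadata=metadata)
```

Filling in at least `mediatype`, `title`, and `language` up front avoids the manual cleanup mentioned above.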
it's going to be deleted in a few days: https://kth-my.sharepoint.com/:v:/g/personal/longz_ug_kth_se/EesSEHqiHHtQabKFAAAx5PsB-5r8MVtnp5NECOtKN-YGsA?e=8s0kaj
00:42:55 it's probably some temporary signed URL that will change on every load and can't be archived in a way that lets the original link work in WBM
00:44:58 oof that seems to be DASH even
00:50:33 kpcyrd: i don't think there's anything reliably plug-and-play for sharepoint
00:50:41 transcoded on the fly from an original .mp4 that seems impossible to access
00:51:11 you could try https://github.com/kylon/Sharedown or some of the workarounds suggested in sharepoint-related issues at https://github.com/snobu/destreamer
00:51:26 (or punt and screen-record it)
00:52:19 There's an open PR for yt-dlp to add SharePoint. https://github.com/yt-dlp/yt-dlp/pull/6531
00:52:52 I *could* grab the DASH but it sucks that I can't access the original .mp4 :/
00:54:12 I'm trying to get it into the wayback machine specifically:
00:54:13 https://web.archive.org/web/20231116002236/https://kth-my.sharepoint.com/personal/longz_ug_kth_se/_layouts/15/stream.aspx?id=%2Fpersonal%2Flongz%5Fug%5Fkth%5Fse%2FDocuments%2Fbox%5Ffiles%2FKTH%20SR%20Meetup%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%2F2020%2D11%2D24%2013%2E06%2E13%20Localization%20of%20Unreproducible%20Builds%20%2D%20Jifeng%
00:54:15 20Xuan%2Emp4&ga=1
00:54:19 that's not going to work
00:54:25 rip
00:54:32 there are timestamped, signed URLs that change every time you load the page
00:55:04 ia item not adequate?
00:55:41 ffmpeg doesn't do parallel requests (in fact I'm not sure if it does proper HTTP keepalive) so this DASH remux is taking me forever
01:09:31 oh great I got some 503 Service Unavailable too
02:12:56 Gitlab is now requiring new users to verify using a phone number or credit card, or their account will be deleted. So far it only seems to apply to new accounts, but something to keep an eye on if they expand it to existing accounts.
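The intermittent 503s on DASH segments mentioned above are exactly the case where ffmpeg gives up and leaves a gap. A manual fetch loop with retries avoids that; the sketch below is a generic stdlib-only helper, not anything SharePoint-specific, and the segment URL in the docstring is hypothetical.

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, attempts=5, backoff=2.0):
    """Fetch `url` (e.g. a single DASH segment), retrying on 5xx responses
    with exponential backoff instead of failing like ffmpeg does."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # Give up immediately on client errors or on the last attempt.
            if e.code < 500 or attempt == attempts - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```

Downloading every segment this way, then pointing the manifest at the local files, sidesteps both the no-retry and the no-parallelism problems.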
https://lemmy.world/post/8297909
02:23:49 https://www.androidpolice.com/ensuring-high-quality-apps-on-google-play/
02:26:41 https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/
02:28:42 Correct link for the latter: https://www.animenewsnetwork.com/news/2023-11-11/crunchyroll-ends-digital-manga-app-on-mobile-web-on-december-11/.204339
05:00:33 JAABot edited CurrentWarriorProject (-4): https://wiki.archiveteam.org/?diff=51154&oldid=51143
07:53:11 https://bird.makeup would be a nice alternative way to grab twitter(x) stuff. They create a mastodon account where all the tweets are posted.
07:54:10 also means there is no rate limiting
07:54:26 other than on their end of course
13:28:29 https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat
13:29:54 stupid sexy nickserv
13:29:56 https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat
13:30:03 Does anyone know of a full dump or half dump of open subtitles
13:30:07 this is a real slap in the face
13:34:18 MasterX244 edited List of websites excluded from the Wayback Machine (+28): https://wiki.archiveteam.org/?diff=51155&oldid=51036
14:00:23 JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51156&oldid=51155
15:19:34 opensubtitles closing themselves?
16:31:29 Has anyone backed this up yet? https://pabio.com/blog/company/bankruptcy/
16:43:55 oh talking of which https://www.bleed-clothing.com/de/info # "Wir sind insolvent." ("We are insolvent.")
17:12:09 arkiver: 'Only' the API, as I understand it?
17:20:39 Hans5958, murb I threw them in AB
17:21:06 ta
18:16:40 here's the forum post about it: https://forum.opensubtitles.org/viewtopic.php?t=17930#p47873
18:17:16 looks like the 'new rest api' still has a free tier: https://opensubtitles.stoplight.io/docs/opensubtitles-api/a7d25b650b784-api-subscription-prices
18:17:42 though those prices are.. hm.
18:17:54 XML-RPC... Ok, yeah, I agree that needs to die already.
18:19:02 https://blog.opensubtitles.com/opensubtitles/saying-goodbye-to-opensubtitles-org-api-embrace-the-20-black-friday-treat posted earlier says "This decision, initially disclosed in a forum post, will primarily affect non-VIP users, while VIP members will continue to enjoy access to the API." so i guess they're keeping it around for VIP people for a
18:19:02 bit longer at least?
18:19:12 and yeah it does haha
20:15:41 so uh
20:16:24 apparently, i have a blogspot blog, two actually... and i learned this because google/blogger.com wrote me to tell me i haven't logged in since 2007 and so they will delete my shit... i wonder if we need to do something about this
20:16:36 my two blogs are totally irrelevant and empty, but there might be others facing destruction out there
20:18:50 anarchat: it has been discussed before; how do we find "all blogs"?
20:20:17 i have no idea
20:30:16 #frogger :)
20:33:34 kpcyrd: https://data.nicolas17.xyz/localization-unreproducible-builds.mp4 this is from the DASH stream on sharepoint
20:34:08 it wasn't easy to get because if I just fed the DASH manifest to ffmpeg, one or two segments would randomly give a "503 service unavailable" and ffmpeg doesn't retry
20:34:17 so I got a gap in the video
20:34:36 I had to download all segments and rewrite the manifest to use those local files
20:35:21 take it and figure out what to do with it; archive.org item or whatever :P
20:46:41 https://transfer.archivete.am/J2GVQ/sporeforums1.txt does this list of largely unarchived spore forums have any URLs that the bot wouldn't be able to archive properly? In either case could the viable URLs be fed to AB?
20:48:17 The ones that aren't entire domains could be problematic recursion-wise.
20:48:31 Problematic in what way?
20:48:31 Other than that, not sure.
20:49:18 Not recursing properly. If you !a https://example.org/foo/ and it has a link to /bar, that won't be followed.
20:49:32 not even with offsite links allowed?
20:49:46 No, because they're not offsite.
20:50:21 Offsite = different host
20:50:54 I getcha, had hoped it was just offsite (named after different host) = outside recursion
20:51:41 For example, it wouldn't recurse anywhere useful from https://www.mobygames.com/forum/game/36030/spore/ because those URLs aren't in .../spore/.
20:52:11 In that case, !a https://www.mobygames.com/forum/game/36030/spore would work though.
20:52:22 But yeah, each of those needs to be looked at individually.
20:52:31 And some might simply not be possible.
20:52:42 what does the / at the end do?
20:53:25 It's a path segment delimiter. For the purpose of AB, the last slash in the path part of the URL determines where it'll recurse onsite.
20:53:50 From https://www.mobygames.com/forum/game/36030/spore, it would recurse to any link starting with https://www.mobygames.com/forum/game/36030/ .
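The recursion rule described above (everything up to the last slash in the path decides where ArchiveBot recurses onsite) can be sketched in a few lines. This is an illustrative reimplementation of the rule as stated in the chat, not actual ArchiveBot code.

```python
from urllib.parse import urlsplit

def recursion_prefix(job_url: str) -> str:
    """Return the prefix a link must start with to be followed onsite,
    per the rule described above: drop the last path segment."""
    parts = urlsplit(job_url)
    path = parts.path.rsplit("/", 1)[0] + "/"  # everything up to the last slash
    return f"{parts.scheme}://{parts.netloc}{path}"

# The trailing slash matters:
print(recursion_prefix("https://www.mobygames.com/forum/game/36030/spore"))
# -> https://www.mobygames.com/forum/game/36030/
print(recursion_prefix("https://www.mobygames.com/forum/game/36030/spore/"))
# -> https://www.mobygames.com/forum/game/36030/spore/
```

This is why `!a .../spore` can reach the sibling thread URLs while `!a .../spore/` cannot.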
20:54:00 I see
20:56:06 Should I send a transfer.archivete.am link with just the full-domain ones then?
20:57:17 No need, this one is fine.
20:58:29 Alright. Thanks a lot then
21:04:48 I wonder whether https://blog.seamonkey-project.org/2023/11/14/migrating-off-archive-mozilla-org/ only applies to SeaMonkey or also to other projects or even the entire archive.mozilla.org.
21:04:56 (It's already running through AB courtesy of arkiver.)
21:05:22 Cc pabs ^
21:12:50 I'm currently listing all of archive.mozilla.org. It's ... large.
21:13:09 I'll have a size estimate later.
21:20:27 JAA: maybe it's too large for ArchiveBot, i wonder how large it is. hope we can archive it entirely
21:24:12 I'm already up to over 1.2 million *directories* after only processing 17k.
21:24:15 So yeah...
21:29:16 To rephrase it a bit more clearly: I've processed 17k directories and discovered over 1.2 million directories from those. I'm recursing through the dir tree, obviously.
21:30:07 And those numbers are now at 32k done, 2.1M discovered.
21:30:14 It'll be a while...
21:31:00 How long will it stay up? Assuming it has any sort of shutdown date
21:31:55 See link above
21:32:43 Beware of https://archive.mozilla.org/pub/firefox/tinderbox-builds/ , those subdirs are *massive*. Like, 100 MB dir listings massive.
21:33:48 There's also at least one which doesn't finish loading within a minute.
21:34:13 does AB ignore something if it doesn't load within a minute?
21:34:14 mod_autoindex like 😰
21:34:39 It's complicated.
21:35:27 AB expects the HTTP headers within 20 seconds and the complete response within 6 hours, but slow processing of parallel requests (such as link extraction or compressing for WARC) can break the retrieval.
21:35:45 I bet most of the dirs in there were not listed correctly on the first attempt by AB.
21:36:54 The 1 minute timeout is the default in qwarc, which I'm using for listing this more efficiently.
21:37:09 100MB *listings*?
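The directory-tree listing described above (process a directory listing, discover subdirectories, queue them, repeat) is essentially a breadth-first crawl of mod_autoindex pages. A rough illustrative sketch, not the actual qwarc-based tooling; the regex, the fetch callback, and the example URLs are simplifications.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive href extractor for autoindex-style listings; skips "?C=N;O=D" sort links.
HREF = re.compile(r'href="([^"?][^"]*)"')

def list_tree(fetch, root):
    """Breadth-first listing of a directory tree. `fetch(url)` is a caller-
    supplied function returning the listing HTML; returns discovered file URLs."""
    queue, seen, files = deque([root]), {root}, []
    while queue:
        url = queue.popleft()
        for href in HREF.findall(fetch(url)):
            target = urljoin(url, href)
            if not target.startswith(root):  # ignore parent-dir and offsite links
                continue
            if target.endswith("/"):         # subdirectory: queue it once
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
            else:                            # regular file entry
                files.append(target)
    return files
```

With millions of directories, the queue itself (here an in-memory deque, in the real run an SQLite-backed one) becomes the bottleneck, which matches the lock-contention remarks later in the log.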
21:38:29 Running into a problem, will need to restart the listing.
21:40:52 I bid you good luck with this, lookin' forward to seeing just how big the listing file will be.
21:41:50 nicolas17: Yes, autoland-linux64 is that one, it contains 195k entries.
21:42:37 autoland-macosx64-debug times out on the server side after a bit over a minute with a 502.
21:50:49 Listing restarted, going faster now.
21:54:51 (I hope there are no loops via symlinks.)
21:59:36 Oh, this time autoland-linux64 repeatedly timed out as well, yay.
22:02:12 I think I'm running into SQLite lock contention at this point. But processing 7-9k dirs per minute isn't bad.
22:09:40 As for what I sent of spore forums, here are a few archive-related comments about the domains of the few that weren't directly in the domain https://transfer.archivete.am/mawW8/sporeforums%20addendum.txt
22:17:30 An addendum to that addendum; https://gamefaqs.gamespot.com/ has an archive but https://gamefaqs.gamespot.com/boards/926714-spore/72994456 (posted before the archive) is missing (https://gamefaqs.gamespot.com/boards/926714-spore has 1 archive from ArchiveTeam though)
22:20:32 my modem rebooted... maybe because of telegrab at high concurrency /o\
22:21:09 I don't think we ever fully archived GameFAQs. I believe there were unsuccessful/incomplete attempts only.
22:21:19 (reddit is much more prone to doing that)
22:22:31 does wget-at use keepalive?
22:25:45 Ah, I see.
22:50:33 Now doing over 10k dirs per minute. Brrrrr
22:50:55 Still going to take at least 3 hours to get through the remaining queue. lol
22:51:06 So yes, it is marginally too big for AB. :-P
22:52:14 What alternatives are there then?
22:52:46 It does depend a bit on how many files there are and how large they are.
22:52:53 DPoS would be an option.
22:53:13 Or maybe it can be done with AB with a few !ao < jobs rather than one big recursive one.
22:53:43 The listings I've retrieved so far are already over 1 GiB of WARC, i.e. after compression.
22:54:04 o_o
22:54:10 how many would "a few" be?
22:54:11 Arkiver uploaded File:Blogger-icon.png: https://wiki.archiveteam.org/?title=File%3ABlogger-icon.png
22:54:45 And arkiver spoke: 'let there be an icon!', and there was an icon.
22:54:57 That is how it be
22:55:17 and it was glorious
22:55:56 And the administrators of those websites said "Did anybody hear that?, Must have been the wind"
22:57:57 Whenever a new project is about to start I always imagine some kind of eldritch abomination machine just slowly whirring to life. With eyes of red blink into existence and start a march towards their target
23:00:09 Pedrosso: 'A few' would be more than 'a couple' but not 'many'. :-P I don't know, it depends on the output of the listing.
23:00:12 JAABot edited CurrentWarriorProject (+4): https://wiki.archiveteam.org/?diff=51159&oldid=51154
23:02:59 yep :)
23:03:19 JAA: what is your opinion on already writing a WARC-TLS-Cipher-Suite field before it's standardised?
23:03:32 (related to that issue on the warc specs github repo)
23:04:50 or actually
23:05:05 WARC-Cipher-Suite (the value starting with TLS_ already makes it clear it's for TLS)
23:05:21 (thank you for not calling it SSL)
23:06:01 i'm glad i made your day fireonlive :)
23:06:07 :D
23:11:28 JAA: oh, well it's nice that there are such convenient solutions
23:26:10 JAA: I expect archive.mozilla.org has a lot of stuff that isn't that useful to archive, like millions of test results :)
23:27:50 arkiver: Fine with me, it's not a violation of the spec to write fields that aren't specified. Might be worth leaving a comment about the intent on https://github.com/iipc/warc-specifications/issues/86 though and seeing if anyone else has concerns about that.
23:29:22 pabs: Yeah, I'm sure there are more and less useful parts to it.
23:29:40 Have you possibly seen another announcement from Mozilla themselves about it?
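The WARC-Cipher-Suite field discussed above is a proposed extension, not (yet) part of the WARC standard; as noted, writing unregistered fields doesn't violate the spec. A minimal sketch of what such a record header block might look like, with example values (the cipher-suite name is a real TLS 1.3 suite, but everything else here is illustrative):

```python
def warc_record_header(fields: dict) -> bytes:
    """Serialize a WARC record header block: version line, one named field
    per line, CRLF line endings, blank line terminator."""
    lines = ["WARC/1.1"] + [f"{name}: {value}" for name, value in fields.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")

header = warc_record_header({
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.org/",       # illustrative URI
    "WARC-Cipher-Suite": "TLS_AES_256_GCM_SHA384",   # proposed extension field
})
```

Since the `TLS_` prefix identifies the protocol, the shorter `WARC-Cipher-Suite` name (rather than `WARC-TLS-Cipher-Suite`) carries the same information.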
23:29:54 not yet, but I did just wake up :)
23:30:46 nothing on https://planet.mozilla.org/
23:31:22 nothing on https://blog.thunderbird.net/ either
23:31:32 Ah yes, time zones. :-)
23:31:35 maybe it is only ex-Mozilla projects moving?
23:49:23 repeating some requests related to old.dlib.me here, since they got lost in #archivebot:
23:49:33 https://transfer.archivete.am/AGArb/www.old.dlib.me-document-viewers-nom - yet another slightly different viewer url
23:49:45 https://transfer.archivete.am/ej3GO/www.old.dlib.me-item-pdfs - a small number of items available as pdf rather than through the document viewer
23:50:04 https://transfer.archivete.am/y7EDo/www.old.dlib.me-item-info-byname - item info pages, as linked from the library index (extracted from post xhr--not that we can duplicate that, but it's what external links are likely to be). media items like photos and videos are included in page assets
23:50:19 https://transfer.archivete.am/157ht6/www.old.dlib.me-item-info-byid - item info pages, by document id (this is the only way to see metadata for some items, mostly newspapers)
23:50:39 i believe that's everything that will actually work
23:56:23 I'll run them shortly.