08:21:47 Sent an email to the Wysp contact email since it looks like it is/was run by one person
08:51:01 looks like youtube-grab was filling my root partition like crazy
09:51:30 PaulWise edited Mailman2 (+31, tetaneutral lists not yet archived): https://wiki.archiveteam.org/?diff=50007&oldid=49982
12:41:48 Over 80% of the Knowledge Adventure CDN is done. ETA is still early hours of the 28th.
14:20:49 Is there a script floating about I can dump a list of warrior ips into to get some stats out of? I've got all the os metrics but working on visibility of the warriors atm and it occurred to me it may be a solved problem
14:24:55 Maybe, but I haven't heard of it. Most powerusers don't run the warrior, so that's one thing to keep in mind. Docker log aggregation is a solved problem in a much more general sense than AT, of course. Loki comes to mind (and is used by some people here), but there are countless others.
14:26:18 Oh yeah I have grafana up I could use loki good point
14:26:49 I'm sure I'll get a bit of a perf boost dropping docker but for scaling it makes it too easy atm lol, cheers JAA
14:28:18 I'm not suggesting dropping Docker. I'm distinguishing the warrior (either VM or Docker image, controlled via web interface, yadda yadda) from the project images. The latter is what virtually all powerusers run.
14:28:45 Bare metal is too easy to mess up and produce weird or bad data, so it's strongly recommended against.
14:29:23 And with modern containerisation, it probably doesn't even matter that much for performance anymore.
14:29:50 (Unless you're on Windows/macOS, where Docker is basically a Linux VM as I understand it.)
14:31:54 This is all on digital ocean, hetz and aws so yeah the docker overhead won’t be much, dw no plans to go bare metal, I’m just working on a script to let me control all those warrior instances at the same time and wanted to see if I was duplicating work
14:36:48 i suggest you edit the main page, cuz it has not been edited since 24th of April. Also, after we sort out the Egloos and LINE BLOG data, we should focus on some other short term projects to do, like Bedrock Automation.
14:37:02 on Wiki, ofc
14:39:34 VickoSaviour: every account can edit wiki, edit needs to be accepted by op
14:52:01 VickoSaviour: The main page was last edited a week ago. You're not looking in the right place.
18:41:17 This Discord server will go unmoderated: https://discord.gg/5Zq8MVq2SW https://www.tiktok.com/t/ZT8Jch5cw/
18:54:52 balrog: I have attempted to download sourceforge wikis using wikiteam tools but I'm not 100% sure I got all of them since there's no good index (and it seems I didn't find https://mldonkey.sourceforge.net/Main_Page). I'm downloading that now though
18:55:50 pokechu22: it's not only wikis but anything hosted that's affected (*.sourceforge.net)
18:56:20 I'm not sure if there's a way to get an index of all SF projects and then generate an index?
18:56:27 Yeah, just pointing out my previous attempt at downloading wikis
18:56:38 e.g. for mldonkey there's https://mldonkey.sourceforge.net/forums/ as well
18:56:47 and wikiteam tools won't get that
18:57:05 now, a lot of projects have merely static sites, not PHP-driven
18:57:17 but if they're forcing update to PHP7, I suspect a lot of them will just break
18:57:25 a lot of the PHP-driven ones*
18:57:43 anyway I wanted to bring this to the attention of Archive Team, since this was not properly announced...
18:58:07 I believe my main technique was stuff like googling `site:sourceforge.net wiki` and then trying to get stuff to show up; I also did `site:sourceforge.net inurl:Main_Page` etc. But that isn't complete.
There's also a lot of ones that seem to have been already lost or moved to the non-mediawiki system (e.g. see https://axiomengine.sourceforge.net/, or
18:58:09 https://tilemaster.sourceforge.net/ which links to http://sourceforge.net/apps/mediawiki/tilemaster/ which is dead)
18:59:04 might be able to scrape project names from https://sourceforge.net/directory/
18:59:52 that tilemaster one looks like an SF-run mediawiki system that's dead, as opposed to the project maintainers installing mediawiki themselves
19:00:19 oh yeah, one other thing to note is that running the tools directly on https://mldonkey.sourceforge.net/Main_Page will fail because it detects http://mldonkey.sourceforge.net/mediawiki/api.php and http://mldonkey.sourceforge.net/mediawiki/index.php but that redirects to https. This breaks the POST requests wikiteam tools use (causing only 1 revision to be exported for each
19:00:20 page). You need to specify --api and --index for it to work right.
19:00:56 Yeah, that's what I gathered, but it still makes finding valid wikis difficult since there's a bunch of dead links in pages that still say "wiki"
19:01:16 An appropriate approach might be: 1. collect list of all SF projects; 2. use wget-warc/wpull/archivebot to ingest PROJECTNAME.sourceforge.net but do not recursively follow redirects to https://sourceforge.net/*; 3.
review the scraped contents for actual MediaWiki
19:02:01 (and for 2, do not recursively follow redirects to different domains, again to reduce scope)
19:02:01 I initially scraped wikiapiary for this which is where I got my big list, but the query I used gives different results now:
19:02:03 https://wikiapiary.com/w/index.php?title=Special:Ask&limit=500&q=%5B%5BCategory%3AWebsite%5D%5D+%5B%5BHas+farm%3A%3AFarm%3ASourceForge%5D%5D+%5B%5BIs+defunct%3A%3Afalse%5D%5D&p=mainlabel%3D-2D%2Fformat%3Dtable&po=%3F%3DWiki%0A%3FHas+pages+count%3DPages%0A%3FHas+edit+count%3DEdits%0A%3FHas+API+URL%3DAPI%0A%3FHas+URL%3DURL%0A%3FHas+Internet+Archive+added+date%3DIA%0A%3FHas+imag
19:02:05 es+count%3DFiles%0A%3FCapture+date%3DLast+sample%0A
19:02:13 pokechu22: my point is that this isn't only wikis here
19:02:24 this is HTTPS sites, whatever the maintainers chose to run
19:02:26 Yeah, I understand that
19:02:57 I suspect that 99%+ of them redirect to https://sourceforge.net/projectname or so
19:03:09 and/or the maintainers never uploaded files
19:03:18 Does every project have a subdomain like that? (I also remember seeing some things on sourceforge.io and I don't know what the difference was)
19:04:23 I believe yes all do have a subdomain but most don't have content, and those subdomains redirect to https://sourceforge.net/projects/projectname for those that don't have content there
19:05:28 yeah that's the behavior that I see. For some the "content" there is just a redirect to another website
19:05:50 My old notes mention https://bibdesk.sourceforge.io/mediawiki as something broken, and it looks like https://bibdesk.sourceforge.net/ redirects to https://bibdesk.sourceforge.io/. I've definitely seen other ones that just redirect off site (e.g. to github)
19:06:20 another .io redirect: https://lynkeos.sourceforge.net/ -> https://lynkeos.sourceforge.io/ - this one does have a working wiki
19:06:57 ...
and on the other hand, https://alphaplot.sourceforge.net/wiki redirects to http://www.alphaplotwiki.com/, which is now dead (but I did previously save it)
19:07:05 directory; sorting by name and paginating though only lets you go to 999 before it errors out
19:07:06 .io is a separate newer infra apparently https://sourceforge.net/p/forge/documentation/Project%20Web%20Services/#php-version-and-io-domain
19:07:10 https://sourceforge.net/directory/?sort=name&page=999 vs https://sourceforge.net/directory/?sort=name&page=1000
19:07:28 999 ends on "Boy on Riddlin" so it's not everything
19:08:06 Looks like https://micro-os-plus.sourceforge.io/ redirects to https://micro-os-plus.sourceforge.net/ so .net vs .io doesn't matter apart from making AB's on-site handling a bit of a mess
19:08:09 fireonlive: sigh. and their UI shows buttons for 1000+
19:08:31 yeah :/
19:08:46 > Projects web space is a subdomain under .sourceforge.io and uses PHP 7. Please be prepared for PHP 8 upgrades which are expected later in the year.
19:08:46 > Projects registered before Nov 2016 started on an older service using PHP 5.4 and subdomain of sourceforge.net. If you have an older project, you can switch your project web over at any point using the project web settings under Admin -> Project Web Hosting -> PHP Version.
19:09:00 looks like you can search aa* ab* etc (but not a*)
19:09:04 so .sourceforge.net are PHP 5.4 (and at risk), while .sourceforge.io are PHP7 (and not yet at risk)
19:10:16 "We based our search system on the Lucene search engine. We expose a sub-set of the Lucene Query Parser Syntax so you can try some advanced queries." (https://sourceforge.net/p/forge/documentation/Finding%20Software/)
19:10:58 I should also note that I get a 429 when using wikiteam tools after ~400, 500 pages (and then wikiteam tools immediately stop rather than retrying). This happens both with no delay specified and with a 1-second delay between each request. Dunno what causes it exactly.
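Editorial sketch for the 19:09:00 observation that the directory search accepts two-letter wildcards (aa*, ab*, ...) but not one-letter ones. This only generates the query strings; it does not contact SourceForge, and the function name is made up for illustration. It assumes project names of interest start with two ASCII letters, so names starting with digits or other characters would need a wider alphabet, and whether results for a single prefix are themselves capped is not known from the log.

```python
from itertools import product
from string import ascii_lowercase


def two_char_prefix_queries(alphabet=ascii_lowercase):
    """Return wildcard queries 'aa*', 'ab*', ... for every two-character prefix.

    Hypothetical helper: the log only says that two-letter wildcards work
    in the search box, not how results are paginated or limited.
    """
    return [f"{first}{second}*" for first, second in product(alphabet, repeat=2)]


queries = two_char_prefix_queries()
print(len(queries), queries[:3], queries[-1])
```

Passing a longer alphabet (for example `ascii_lowercase + digits`) widens the coverage at the cost of more queries.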
19:12:10 name:/title:/project:aa* don't seem to work though; so i guess rough search is all you can get
19:17:00 russss: looks like https://audioscrobbler.sourceforge.net/wiki/ is broken (which doesn't surprise me but is still unfortunate). https://audioscrobbler.sourceforge.net/wiki/index.php is borked and https://audioscrobbler.sourceforge.net/wiki/api.php doesn't exist at all
19:17:59 https://gitlab.softwareheritage.org/swh/meta/-/issues/735
19:18:03 interesting issue here
19:18:37 looks like all projects might be in the sitemaps?
19:18:54 pokechu22: I looked and I'm not sure there was ever anything in that wiki
19:19:11 found via: https://forge.softwareheritage.org/T735
19:19:16 I think that was from some separate Sourceforge wiki product
19:19:31 https://web.archive.org/web/*/https://audioscrobbler.sourceforge.net/* makes it look like there is something...
19:19:44 ah, not mediawiki, though: https://web.archive.org/web/20030415211152/http://audioscrobbler.sourceforge.net:80/wiki/index.php/Bob's%20Happy%20Fun%20Page
19:19:52 oh gitlab has the same content
19:24:34 it was a huge blast from the past to get the email from Sourceforge saying it would be shut down, I had no idea it was still there
19:25:11 and that PHP version is massively long out of support. someone said that the version they're upgrading to is also out of support!
19:25:37 so kudos to SF for keeping it up this long I guess?
19:26:31 i ran https://forge.softwareheritage.org/source/snippets/browse/master/listers/sourceforge/sourceforge-ls-projects.py and got 108,911 results: https://transfer.archivete.am/inline/13m9To/sourceforge-projects.txt
19:26:49 seems low, though. unless sourceforge isn't just that popular
19:27:10 though..
the tool does exclude the projects namespace assuming /p/ exists too hm
19:36:22 oh doesn't actually seem to be any '.net/projects/' links in those sitemaps
19:36:32 gotta go for a bit; gl :3
21:13:44 on 2020-10-22 it was reported there were 480,711 projects and in 2021 317,973 (via https://gitlab.softwareheritage.org/swh/meta/-/issues/735) and now (maybe?) 108,911? i can't see anything about them purging inactive projects on a quick search... 209,062 less projects in 2 years? unless sitemap doesn't list everything anymore or there's a bug in
21:13:44 that code i didn't quickly spot :p
21:15:26 (or that's the old school system and the others are on .io?)
21:15:30 (not sure)
21:31:17 https://sourceforge.net/directory/?clear says there should be at least 192k projects.
21:31:49 Some of the filters are overlapping (e.g. projects supporting multiple OS), others don't seem to cover all projects (e.g. Status), but 192k hits for Windows projects, so it should be at least that.
21:33:19 hmm so 108,911 is deffo undercounting then
21:34:10 Also, #sourceforget exists, let's use it.
22:08:48 JAA: how is progress on the school of dragons archive? The game shuts down on Friday.
22:11:25 betamax: 12:41:48 <@JAA> Over 80% of the Knowledge Adventure CDN is done. ETA is still early hours of the 28th.
22:13:50 betamax: this is one of the files that appeared in the last few weeks, sad http://media.schoolofdragons.com/Content/DWAPromos/en-US/SoD-061623_ClosingSale.jpg
22:24:09 thanks! (I assume knowledge adventure CDN == School of Dragons?)
22:24:26 (sorry I'm a bit out of the loop)
22:24:56 yes, origin.ka.cdn
22:38:39 closing sale??
22:38:45 for a game that won't be playable?
22:38:49 o_o
22:39:04 s/playable/playable very shortly after/
22:39:26 that's kind of a weird concept yeah
22:39:30 unless it's like, physical merch
22:39:33 'We'll slaughter your beloved game next week. Now give us monies please.'
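Editorial sketch re-checking the SourceForge project counts quoted at 21:13:44 and 21:31:17. The three counts are taken from the log; the directory figure is the rounded "at least 192k" lower bound, so the second result is approximate. Variable names are illustrative only.

```python
# Figures as quoted in the log.
reported_2020 = 480_711        # reported on 2020-10-22
reported_2021 = 317_973        # reported in 2021
sitemap_listing = 108_911      # result of the sitemap-based lister run
directory_lower_bound = 192_000  # "at least 192k", rounded in the log

# Drop between the 2021 figure and the sitemap-based listing.
drop_since_2021 = reported_2021 - sitemap_listing

# Minimum number of projects the sitemap listing appears to miss,
# if the directory lower bound is taken at face value.
missing_at_least = directory_lower_bound - sitemap_listing

print(drop_since_2021, missing_at_least)
```

The first number matches the 209,062 mentioned in the chat; the second puts the sitemap listing at least roughly 83k short of the directory's own lower bound.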
22:39:40 (but I guess not)
22:40:11 might be the price of things within the game, in in-game money
22:41:51 Could be, their original announcement from a few weeks ago did say that purchases would be disabled and to use in-game currency before the shutdown.
22:42:30 ahh
22:43:30 hmm.. lol. i guess if in app purchases are disabled... but at that point just make everything free i guess for one last hurrah lol
22:43:54 Yeah, that's what other games have done. No reason not to, really.
22:44:53 Regarding the CDN/bucket, when the current download completes, I'll rerun the bucket listing and grab anything that was missed. Then there are a few objects that weren't downloaded correctly due to URL encoding reasons (question marks in names...).
22:45:18 s/that was missed/that's new/ I guess
23:18:05 Manu edited ArchiveTeam Warrior (+255, Add info that an rsync max connection error is…): https://wiki.archiveteam.org/?diff=50013&oldid=49883
23:21:03 Anyone trying to tackle Microsoft Language Portal? It's shutting down on 2023 June 30 as per https://old.reddit.com/r/Archiveteam/comments/1488sdy/microsoft_language_portal_will_be_removed_on_june/
23:25:45 afaik no
23:26:24 it says some or all? of it will be moved to microsoft learn but... we all know how careful companies are
23:32:53 there's large Polish-language PhpBB forum farm at https://www.fora.pl/ existing since 2005. A large portion of the forums is dead, there was also a pruning done in 2016 by the server owners
23:34:38 thinking about archiving it. Is this doable for ArchiveBot? I presume that yall need some discussion or reason to throw larger sites into ArchiveBot
23:53:07 you could ask in #archivebot there seem to be a few people pumping away commands at the moment in there
23:55:10 I don't know polish, but those look like large numbers in the statistics section at the bottom...
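Editorial sketch for the 22:44:53 remark about objects that failed to download because of question marks in their names. One plausible reading (an assumption, the log does not give details) is that an unescaped `?` in the object key was treated as the start of a query string. The snippet shows percent-encoding a key before building the URL; the base URL and key are placeholders, not real bucket contents.

```python
from urllib.parse import quote


def object_url(base_url: str, key: str) -> str:
    """Build a download URL for an object key, percent-encoding reserved characters.

    '/' is kept as the path separator; '?' becomes '%3F' so it is not
    mistaken for the start of a query string.
    """
    return base_url.rstrip("/") + "/" + quote(key, safe="/")


# Placeholder host and key, purely for illustration.
example = object_url("https://cdn.example.invalid", "Content/promo?v2.jpg")
print(example)
```

Whether the real CDN expects `%3F` or some other form for these names would have to be checked against the bucket listing itself.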
23:55:34 Hmm, mikolaj|m, it looks like it's not just https://www.fora.pl/ to grab, since it's just a hub, but also the countless subdomains that represent the forums they are hosting, like http://www.naturalnemetody.fora.pl/ - coming from https://www.fora.pl/?file=cat&md=index&cid=15
23:55:52 I was about to throw it in until you said that ryz
23:55:54 82 078 forums, 5 616 613 users, 159 344 842 "statements" (not sure if those are threads or posts)
23:56:13 Time to start chipping away at the individual forums?
23:56:27 82 thousand forums?! Oo;;;
23:56:29 I can start queueing those 1 at a time until I forget what I am doing
23:56:30 I like how http://www.random.fora.pl/ seems to be an actual forum in addition to a redirect site
23:57:01 I've done long lists, but those were usually a few hundred at most; 82k is probably too much to manually handle
23:57:12 ...82 thousand is holy shit, a lot Oo;
23:57:37 I once scraped a forum by hand for links to other sites. I will start if I get given the OK
23:58:09 It looks like the list is at https://img.mruczek.trade/fora.txt which is "only" 64790
23:58:09 First things first, need to find where the 82k forums can be found, if it's more than just what the example source link is about~
23:58:48 That's from line 279 of view-source:http://www.random.fora.pl/
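Editorial sketch for turning a forum list like the one mentioned at 23:58:09 into a deduplicated set of forum hostnames. The format of fora.txt is not shown in the log, so this assumes one URL or bare hostname per line; the function name is made up, and the sample input reuses only hostnames that appear in the chat. The hub www.fora.pl is skipped because it was described as just a hub.

```python
from urllib.parse import urlsplit


def forum_hosts(lines, suffix=".fora.pl"):
    """Return sorted unique forum hostnames from lines of URLs or bare hostnames.

    Assumption: one entry per line. Entries outside the suffix, blank lines,
    and the hub itself ('www' + suffix) are dropped.
    """
    hosts = set()
    for line in lines:
        entry = line.strip().lower()
        if not entry:
            continue
        # urlsplit only fills in the host when a scheme or '//' prefix is present.
        host = urlsplit(entry if "://" in entry else "//" + entry).hostname
        if host and host.endswith(suffix) and host != "www" + suffix:
            hosts.add(host)
    return sorted(hosts)


sample = [
    "http://www.naturalnemetody.fora.pl/",
    "www.random.fora.pl",
    "https://www.fora.pl/",
    "",
    "HTTP://WWW.RANDOM.FORA.PL/",
]
hosts = forum_hosts(sample)
print(hosts)
```

The sketch deliberately does not merge `www.name.fora.pl` with `name.fora.pl`, since the log does not say whether both forms serve the same forum.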