-
OrIdow6
Sent an email to the Wysp contact email since it looks like it is/was run by one person
-
Barto
looks like youtube-grab was filling my root partition like crazy
-
h2ibot
PaulWise edited Mailman2 (+31, tetaneutral lists not yet archived):
wiki.archiveteam.org/?diff=50007&oldid=49982
-
JAA
Over 80% of the Knowledge Adventure CDN is done. ETA is still early hours of the 28th.
-
Dallas
Is there a script floating about that I can dump a list of warrior IPs into to get some stats out of? I've got all the OS metrics but I'm working on visibility of the warriors atm and it occurred to me it may be a solved problem
-
JAA
Maybe, but I haven't heard of it. Most powerusers don't run the warrior, so that's one thing to keep in mind. Docker log aggregation is a solved problem in a much more general sense than AT, of course. Loki comes to mind (and is used by some people here), but there are countless others.
-
Dallas
Oh yeah I have grafana up I could use loki good point
-
Dallas
I'm sure I'd get a bit of a perf boost dropping docker but for scaling it makes it too easy atm lol, cheers JAA
-
JAA
I'm not suggesting dropping Docker. I'm distinguishing the warrior (either VM or Docker image, controlled via web interface, yadda yadda) from the project images. The latter is what virtually all powerusers run.
-
JAA
Bare metal is too easy to mess up and produce weird or bad data, so it's strongly recommended against.
-
JAA
And with modern containerisation, it probably doesn't even matter that much for performance anymore.
-
JAA
(Unless you're on Windows/macOS, where Docker is basically a Linux VM as I understand it.)
-
Dallas
This is all on digital ocean, hetz and aws so yeah the docker overhead won’t be much, dw no plans to go bare metal, I’m just working on a script to let me control all those warrior instances at the same time and wanted to see if I was duplicating work
-
VickoSaviour
i suggest you edit the main page, cuz it has not been edited since the 24th of April. Also, after we sort out the Egloos and LINE BLOG data, we should focus on some other short term projects, like Bedrock Automation.
-
VickoSaviour
on Wiki, ofc
-
BigBrain
VickoSaviour: every account can edit the wiki, but edits need to be accepted by an op
-
JAA
VickoSaviour: The main page was last edited a week ago. You're not looking in the right place.
-
upintheairsheep
-
pokechu22
balrog: I have attempted to download sourceforge wikis using wikiteam tools but I'm not 100% sure I got all of them since there's no good index (and it seems I didn't find
mldonkey.sourceforge.net/Main_Page). I'm downloading that now though
-
balrog
pokechu22: it's not only wikis but anything hosted that's affected (*.sourceforge.net)
-
balrog
I'm not sure if there's a way to get an index of all SF projects and then generate an index?
-
pokechu22
Yeah, just pointing out my previous attempt at downloading wikis
-
balrog
e.g. for mldonkey there's
mldonkey.sourceforge.net/forums as well
-
balrog
and wikiteam tools won't get that
-
balrog
now, a lot of projects have merely static sites, not PHP-driven
-
balrog
but if they're forcing update to PHP7, I suspect a lot of them will just break
-
balrog
a lot of the PHP-driven ones*
-
balrog
anyway I wanted to bring this to the attention of Archive Team, since this was not properly announced...
-
pokechu22
I believe my main technique was stuff like googling `site:sourceforge.net wiki` and then trying to get stuff to show up; I also did `site:sourceforge.net inurl:Main_Page` etc. But that isn't complete. There's also a lot of ones that seem to have been already lost or moved to the non-mediawiki system (e.g. see
axiomengine.sourceforge.net, or
-
pokechu22
-
balrog
might be able to scrape project names from
sourceforge.net/directory
-
balrog
that tilemaster one looks like an SF-run mediawiki system that's dead, as opposed to the project maintainers installing mediawiki themselves
-
pokechu22
oh yeah, one other thing to note is that running the tools directly on
mldonkey.sourceforge.net/Main_Page will fail because it detects
mldonkey.sourceforge.net/mediawiki/api.php and
mldonkey.sourceforge.net/mediawiki/index.php but that redirects to https. This breaks the POST requests wikiteam tools use (causing only 1 revision to be exported for each
-
pokechu22
page). You need to specify --api and --index for it to work right.
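-
Editor's note
A minimal sketch of the workaround described above: build explicit HTTPS `api.php` and `index.php` URLs so the POST requests never hit the http-to-https redirect. The `--api` and `--index` flag names come from the message; the `/mediawiki` script path matches the mldonkey example only, and the `--flag=value` form and both helper names are assumptions for illustration.

```python
from urllib.parse import urlsplit


def explicit_endpoints(wiki_url, script_path="/mediawiki"):
    """Return HTTPS api.php and index.php URLs for the wiki's host.

    script_path is an assumption taken from the mldonkey example;
    other wikis may keep their scripts somewhere else.
    """
    if "://" not in wiki_url:
        wiki_url = "https://" + wiki_url
    host = urlsplit(wiki_url).netloc
    base = f"https://{host}{script_path}"
    return {"api": f"{base}/api.php", "index": f"{base}/index.php"}


def dump_arguments(wiki_url):
    """Build the two arguments that point the tools at the HTTPS endpoints."""
    endpoints = explicit_endpoints(wiki_url)
    return [f"--api={endpoints['api']}", f"--index={endpoints['index']}"]


if __name__ == "__main__":
    for argument in dump_arguments("mldonkey.sourceforge.net/Main_Page"):
        print(argument)
```

Passing both URLs explicitly skips the autodetection step, which is where the plain-http addresses get picked up.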
-
pokechu22
Yeah, that's what I gathered, but it still makes finding valid wikis difficult since there's a bunch of dead links in pages that still say "wiki"
-
balrog
An appropriate approach might be: 1. collect list of all SF projects; 2. use wget-warc/wpull/archivebot to ingest PROJECTNAME.sourceforge.net but do not recursively follow redirects to
sourceforge.net/*; 3. review the scraped contents for actual MediaWiki
-
balrog
(and for 2, do not recursively follow redirects to different domains, again to reduce scope)
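-
Editor's note
The scope rule in step 2 can be sketched as a small URL classifier: stay on PROJECTNAME.sourceforge.net, skip redirects to the main sourceforge.net site, and skip every other domain. This is only an illustration of the rule as stated, not a wget, wpull, or ArchiveBot configuration; the function name and the labels are made up.

```python
from urllib.parse import urlsplit


def classify(project, url):
    """Decide whether a URL (e.g. a redirect target) is in scope for a project."""
    host = (urlsplit(url).hostname or "").lower()
    if host == f"{project.lower()}.sourceforge.net":
        return "in-scope"
    if host in ("sourceforge.net", "www.sourceforge.net"):
        return "skip: main site"
    return "skip: other domain"


if __name__ == "__main__":
    for target in (
        "http://mldonkey.sourceforge.net/forums",
        "https://sourceforge.net/projects/mldonkey",
        "https://github.com/",
    ):
        print(target, "->", classify("mldonkey", target))
```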
-
pokechu22
I initially scraped wikiapiary for this which is where I got my big list, but the query I used gives different results now:
-
pokechu22
-
pokechu22
es+count%3DFiles%0A%3FCapture+date%3DLast+sample%0A
-
balrog
pokechu22: my point is that this isn't only wikis here
-
balrog
this is HTTPS sites, whatever the maintainers chose to run
-
pokechu22
Yeah, I understand that
-
balrog
I suspect that 99%+ of them redirect to
sourceforge.net/projectname or so
-
balrog
and/or the maintainers never uploaded files
-
pokechu22
Does every project have a subdomain like that? (I also remember seeing some things on sourceforge.io and I don't know what the difference was)
-
balrog
I believe yes all do have a subdomain but most don't have content, and those subdomains redirect to
sourceforge.net/projects/projectname for those that don't have content there
-
balrog
yeah that's the behavior that I see. For some the "content" there is just a redirect to another website
-
pokechu22
My old notes mention
bibdesk.sourceforge.io/mediawiki as something broken, and it looks like
bibdesk.sourceforge.net redirects to
bibdesk.sourceforge.io. I've definitely seen other ones that just redirect off site (e.g. to github)
-
pokechu22
another .io redirect:
lynkeos.sourceforge.net ->
lynkeos.sourceforge.io - this one does have a working wiki
-
pokechu22
... and on the other hand,
alphaplot.sourceforge.net/wiki redirects to
alphaplotwiki.com, which is now dead (but I did previously save it)
-
fireonlive
directory; sorting by name and paginating though only lets you go to 999 before it errors out
-
balrog
-
fireonlive
-
fireonlive
999 ends on "Boy on Riddlin" so it's not everything
-
pokechu22
Looks like
micro-os-plus.sourceforge.io redirects to
micro-os-plus.sourceforge.net so .net vs .io doesn't matter apart from making AB's on-site handling a bit of a mess
-
balrog
fireonlive: sigh. and their UI shows buttons for 1000+
-
fireonlive
yeah :/
-
balrog
> Projects web space is a subdomain under .sourceforge.io and uses PHP 7. Please be prepared for PHP 8 upgrades which are expected later in the year.
-
balrog
> Projects registered before Nov 2016 started on an older service using PHP 5.4 and subdomain of sourceforge.net. If you have an older project, you can switch your project web over at any point using the project web settings under Admin -> Project Web Hosting -> PHP Version.
-
fireonlive
looks like you can search aa* ab* etc (but not a*)
-
balrog
so .sourceforge.net are PHP 5.4 (and at risk), while .sourceforge.io are PHP7 (and not yet at risk)
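-
Editor's note
A tiny sketch of the split quoted above, useful for tagging a host list by which project web service it sits on. The mapping (.sourceforge.net is the older PHP 5.4 service, .sourceforge.io is PHP 7) is taken from the quoted documentation; the function name and return labels are hypothetical.

```python
def hosting_generation(hostname):
    """Label a project web host by the service generation its suffix implies."""
    host = hostname.lower().rstrip(".")
    if host.endswith(".sourceforge.net"):
        return "php5.4"  # older service, the one at risk
    if host.endswith(".sourceforge.io"):
        return "php7"  # newer service
    return "unknown"


if __name__ == "__main__":
    for name in ("mldonkey.sourceforge.net", "lynkeos.sourceforge.io", "github.com"):
        print(name, hosting_generation(name))
```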
-
fireonlive
"We based our search system on the Lucene search engine. We expose a sub-set of the Lucene Query Parser Syntax so you can try some advanced queries." (
sourceforge.net/p/forge/documentation/Finding%20Software)
-
pokechu22
I should also note that I get a 429 when using wikiteam tools after ~400-500 pages (and then wikiteam tools immediately stop rather than retrying). This happens both with no delay specified and with a 1-second delay between each request. Dunno what causes it exactly.
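-
Editor's note
One generic way to survive a 429 instead of stopping is to retry with an increasing delay. The sketch below is not wikiteam code; `RateLimited`, `fetch_with_retry`, and the delay values are all assumptions, and the `fetch` callable stands in for whatever actually performs the request.

```python
import time


class RateLimited(Exception):
    """Raised by a fetch callable when the server answers HTTP 429."""


def fetch_with_retry(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), backing off exponentially whenever it is rate limited."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # base_delay, then 2x, 4x, ...


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch(url):
        # Pretend the first two requests are rate limited.
        calls["count"] += 1
        if calls["count"] <= 2:
            raise RateLimited(url)
        return "page body"

    print(fetch_with_retry(flaky_fetch, "https://example.invalid/page", sleep=lambda s: None))
```

Whether the server would accept requests again after such a pause is unknown, since the cause of the 429 was not identified.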
-
fireonlive
name:/title:/project:aa* don't seem to work though; so i guess rough search is all you can get
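-
Editor's note
Since two-letter wildcard searches work and paging stops at 999, one way to cover the directory is to generate every two-character prefix and page through each result set separately. A rough sketch; the alphabet (lowercase letters plus digits) is an assumption, and any prefix that still returns more than 999 pages would need a third character.

```python
from itertools import product
from string import ascii_lowercase, digits


def prefix_queries(alphabet=ascii_lowercase + digits, length=2):
    """Return wildcard queries such as 'aa*', 'ab*', ... for every prefix."""
    return ["".join(chars) + "*" for chars in product(alphabet, repeat=length)]


if __name__ == "__main__":
    queries = prefix_queries()
    print(len(queries), "queries, starting with", queries[:3])
```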
-
pokechu22
-
fireonlive
-
fireonlive
interesting issue here
-
fireonlive
looks like all projects might be in the sitemaps?
-
russss
pokechu22: I looked and I'm not sure there was ever anything in that wiki
-
fireonlive
-
russss
I think that was from some separate Sourceforge wiki product
-
pokechu22
-
pokechu22
-
fireonlive
oh gitlab has the same content
-
russss
it was a huge blast from the past to get the email from Sourceforge saying it would be shut down, I had no idea it was still there
-
russss
and that PHP version is massively long out of support. someone said that the version they're upgrading to is also out of support!
-
russss
so kudos to SF for keeping it up this long I guess?
-
fireonlive
-
fireonlive
seems low, though. unless sourceforge just isn't that popular
-
fireonlive
though.. the tool does exclude the projects namespace assuming /p/ exists too hm
-
fireonlive
oh doesn't actually seem to be any '.net/projects/' links in those sitemaps
-
fireonlive
gotta go for a bit; gl :3
-
fireonlive
on 2020-10-22 it was reported there were 480,711 projects and in 2021 317,973 (via
gitlab.softwareheritage.org/swh/meta/-/issues/735) and now (maybe?) 108,911? i can't see anything about them purging inactive projects on a quick search... 209,062 fewer projects in 2 years? unless sitemap doesn't list everything anymore or there's a bug in
-
fireonlive
that code i didn't quickly spot :p
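-
Editor's note
A rough sketch of counting distinct projects in a sitemap while accepting both the `/projects/` and `/p/` URL forms, so that neither namespace is dropped by accident. The sample XML is invented for illustration and is not real SourceForge sitemap content; the function name and regex are assumptions too.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Accept both URL forms mentioned in the discussion.
PROJECT_RE = re.compile(r"^https?://sourceforge\.net/(?:projects|p)/([^/?#]+)")


def project_names(sitemap_xml):
    """Return the set of project names found in one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    names = set()
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        match = PROJECT_RE.match((loc.text or "").strip())
        if match:
            names.add(match.group(1).lower())
    return names


# Invented sample data, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://sourceforge.net/projects/example-one/</loc></url>
  <url><loc>https://sourceforge.net/p/example-one/wiki/</loc></url>
  <url><loc>https://sourceforge.net/p/example-two/</loc></url>
  <url><loc>https://sourceforge.net/directory/</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sorted(project_names(SAMPLE)))
```

Collecting names into a set means a project listed under both forms is counted once.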
-
fireonlive
(or that's the old school system and the others are on .io?)
-
fireonlive
(not sure)
-
JAA
sourceforge.net/directory/?clear says there should be at least 192k projects.
-
JAA
Some of the filters are overlapping (e.g. projects supporting multiple OS), others don't seem to cover all projects (e.g. Status), but 192k hits for Windows projects, so it should be at least that.
-
fireonlive
hmm so 108,911 is deffo undercounting then
-
JAA
Also, #sourceforget exists, let's use it.
-
betamax
JAA: how is progress on the school of dragons archive? The game shuts down on Friday.
-
JAA
betamax: 12:41:48 <@JAA> Over 80% of the Knowledge Adventure CDN is done. ETA is still early hours of the 28th.
-
nicolas17
betamax: this is one of the files that appeared in the last few weeks, sad
media.schoolofdragons.com/Content/D…os/en-US/SoD-061623_ClosingSale.jpg
-
betamax
thanks! (I assume Knowledge Adventure CDN == School of Dragons?)
-
betamax
(sorry I'm a bit out of the loop)
-
nicolas17
yes, origin.ka.cdn
-
fireonlive
closing sale??
-
fireonlive
for a game that won't be playable?
-
fireonlive
o_o
-
fireonlive
s/playable/playable very shortly after/
-
FireFly
that's kind of a weird concept yeah
-
FireFly
unless it's like, physical merch
-
JAA
'We'll slaughter your beloved game next week. Now give us monies please.'
-
FireFly
(but I guess not)
-
nicolas17
might be the price of things within the game, in in-game money
-
JAA
Could be, their original announcement from a few weeks ago did say that purchases would be disabled and to use in-game currency before the shutdown.
-
fireonlive
ahh
-
fireonlive
hmm.. lol. i guess if in app purchases are disabled... but at that point just make everything free i guess for one last hurrah lol
-
JAA
Yeah, that's what other games have done. No reason not to, really.
-
JAA
Regarding the CDN/bucket, when the current download completes, I'll rerun the bucket listing and grab anything that was missed. Then there are a few objects that weren't downloaded correctly due to URL encoding reasons (question marks in names...).
-
JAA
s/that was missed/that's new/ I guess
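-
Editor's note
The question-mark problem is a general URL-encoding one: a literal `?` in an object name is read as the start of the query string unless it is percent-encoded. A minimal sketch using the standard library; the base URL and key below are placeholders, not real objects from the bucket.

```python
from urllib.parse import quote


def object_url(base, key):
    """Build a download URL for an object key, percent-encoding unsafe characters.

    Slashes are kept so the key's directory structure survives; a literal
    '?' becomes %3F instead of starting a query string.
    """
    return base.rstrip("/") + "/" + quote(key, safe="/")


if __name__ == "__main__":
    print(object_url("https://cdn.example.invalid/", "Content/what is this?.jpg"))
```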
-
h2ibot
Manu edited ArchiveTeam Warrior (+255, Add info that an rsync max connection error is…):
wiki.archiveteam.org/?diff=50013&oldid=49883
-
Ryz
Anyone trying to tackle Microsoft Language Portal? It's shutting down on 2023 June 30 as per
old.reddit.com/r/Archiveteam/commen…uage_portal_will_be_removed_on_june
-
fireonlive
afaik no
-
fireonlive
it says some or all? of it will be moved to microsoft learn but... we all know how careful companies are
-
mikolaj|m
there's a large Polish-language phpBB forum farm at
fora.pl existing since 2005. A large portion of the forums is dead, there was also a pruning done in 2016 by the server owners
-
mikolaj|m
thinking about archiving it. Is this doable for ArchiveBot? I presume that yall need some discussion or reason to throw larger sites into ArchiveBot
-
fireonlive
you could ask in #archivebot there seem to be a few people pumping away commands at the moment in there
-
pokechu22
I don't know polish, but those look like large numbers in the statistics section at the bottom...
-
Ryz
Hmm, mikolaj|m, it looks like it's not just
fora.pl to grab, since it's just a hub, but also the countless subdomains that represent the forums they are hosting, like
naturalnemetody.fora.pl - coming from
fora.pl/?file=cat&md=index&cid=15
-
flashfire42
I was about to throw it in until you said that ryz
-
pokechu22
82 078 forums, 5 616 613 users, 159 344 842 "statements" (not sure if those are threads or posts)
-
flashfire42
Time to start chipping away at the individual forums?
-
Ryz
82 thousand forums?! Oo;;;
-
flashfire42
I can start queueing those 1 at a time until I forget what I am doing
-
pokechu22
I like how
random.fora.pl seems to be an actual forum in addition to a redirect site
-
pokechu22
I've done long lists, but those were usually a few hundred at most; 82k is probably too much to manually handle
-
Ryz
...82 thousand is holy shit, a lot Oo;
-
flashfire42
I once scraped a forum by hand for links to other sites. I will start if I get given the OK
-
pokechu22
It looks like the list is at
img.mruczek.trade/fora.txt which is "only" 64790
-
Ryz
First things first, need to find where all 82k forums are listed, if there's more than what the example source link covers~
-
pokechu22
That's from line 279 of view-source:http://www.random.fora.pl/
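-
Editor's note
With tens of thousands of forums, a list like that would need deduplicating and splitting into manageable batches before anyone queues it. A small sketch; the input format (one hostname or URL per line) is an assumption, since the layout of the list file is not shown here, and both function names are hypothetical.

```python
def normalise_hosts(lines, suffix=".fora.pl"):
    """Reduce raw lines to unique lowercase hostnames under the given suffix."""
    seen = set()
    hosts = []
    for line in lines:
        host = line.strip().lower()
        # Drop an optional scheme and anything after the hostname.
        host = host.split("://", 1)[-1].split("/", 1)[0]
        if host.endswith(suffix) and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    raw = [
        "http://www.random.fora.pl/",
        "naturalnemetody.fora.pl",
        "NaturalneMetody.fora.pl",
        "",
    ]
    hosts = normalise_hosts(raw)
    print(len(hosts), "unique hosts in", len(batches(hosts, 1)), "batches of 1")
```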