-
fireonlive
oof
-
h2ibot
FireonLive edited Mailman2 (+44, Add CA/Browser Forum):
wiki.archiveteam.org/?diff=50159&oldid=50149
-
h2ibot
PaulWise created Bugzilla (+3994, add project to archive bugzilla instances):
wiki.archiveteam.org/?title=Bugzilla
-
pabs
JAA: ^
-
fireonlive
pabs: 👍
-
» pabs just airing out his todo/archive-* lists :)
-
pabs
hope other folks can/want to help with them :)
-
fireonlive
:)
-
nicolas17
that's always tricky wrt cooperative instances
-
pabs
hm?
-
nicolas17
like, I *could* give you a DB dump of the KDE forum and you avoid having to scrape it, but it would include private messages, so I would need to figure out what tables to exclude
-
pabs
scraping is probably better anyway so it ends up in the WBM?
-
nicolas17
same for bugzilla, there's private tickets sometimes
-
nicolas17
yeah true
-
pabs
there are similar issues with GitLab/etc instances too
-
nicolas17
guess the most helpful thing there is an admin providing IDs then
-
pabs
the buglist.cgi search on the page can handle that I think
-
nicolas17
and if I don't bother filtering out stuff and give you the ID of a private ticket, you can't fetch that anyway
-
nicolas17
pabs: I meant more broadly (IDs of forum posts, gitlab project list, etc)
-
pabs
ack yeah
-
fireonlive
incoming shit
-
pabs
are KDE git repos on SWH or the TODO for Codearchiver?
-
h2ibot
FireonLive edited Discourse (+360489, Add in uncategorized forums that don't require…):
wiki.archiveteam.org/?diff=50161&oldid=50148
-
fireonlive
there it is
-
fireonlive
i don't love it but also don't want to lose it 🤷
-
fireonlive
i like pabs' layout more but not my page
-
nicolas17
"+360489" wow
-
fireonlive
i guess i coulda manually visited all 4k links myself :D
-
fireonlive
it'd have to be like right after a certain something in the day
-
h2ibot
Pokechu22 edited Bugzilla (+35, /* Archived */…):
wiki.archiveteam.org/?diff=50162&oldid=50160
-
fireonlive
watch next, wherein fireonlive edits 4 TiB into the wiki to hold some personal backups
-
nicolas17
pabs: when I offered stuff to softwareheritage they were in "we're busy getting started and archiving stuff from big sites like github" mode and would get to custom stuff later
-
pabs
nicolas17: they now have a self-service(ish) thing for archiving gitlab and other forge types
-
fireonlive
pabs: should there be a section for dead bugzillas?
-
nicolas17
then it seems 7 years passed and they didn't bother contacting KDE? time flies
-
pabs
yeah, I sense they are not well organised or under-resourced technically
-
fireonlive
they still use svn :o
-
h2ibot
FireonLive edited Bugzilla (+37, add The Document Foundation):
wiki.archiveteam.org/?diff=50163&oldid=50162
-
nicolas17
I was almost expecting to find "freenode" mentioned in
wiki.softwareheritage.org/wiki/IRC_channels :P
-
fireonlive
haha
-
pabs
ah, I already submitted
invent.kde.org there, it is pending on them contacting the KDE folks though
-
fireonlive
i find it interesting they ask for random gitlab (gitea/etc) instances but not for users' github (or gitlab.com?) repos
-
pabs
they archive all of github
-
fireonlive
is it just because of potential costs i wonder or something else
-
pabs
and gitlab.com and many other gitlab sites
-
pabs
fireonlive: re dead bugzillas, yeah probably, for folks to look up old archives in the WBM?
-
fireonlive
ye but they stop to ask KDE 'can we' first
-
fireonlive
pabs: ye i was thinking so
-
fireonlive
versus just everyone on github
-
pabs
right
-
fireonlive
wonder why the difference
-
nicolas17
fireonlive: if they stop to ask KDE "can we svnmirror your entire SVN repository", we'll tell them "no, we can just send you a tarball!"
-
pabs
maybe in case they overload the sites?
-
fireonlive
ah perhaps
-
fireonlive
nicolas17: true in that case :)
-
fireonlive
oh i guess it's channel-based
-
fireonlive
so i'll allow it lol
-
nicolas17
pabs: I originally created the kde-git-repositories item on archive.org when some Russian devs were worried about Internet blockages, or depeering strongly affecting their bandwidth, and this way they could use bittorrent
-
fireonlive
russians? in MY kde? it's more likely than you think!
-
Barto
pabs: ab goes brr
-
fireonlive
brrrrrrrrr
-
VickoSaviour
progaming.ba forum is up for a limited amount of time
-
VickoSaviour
even though it is login-walled, i have the username and password to get all of the files on it.
-
VickoSaviour
just send me a pm on hackint
-
OrIdow6
Why would replayweb.page say that a URL is in the WARC when listing requests, but claim it wasn't found when I try to view it?
-
OrIdow6
Whatever, record's in the file
-
rewby
tzt, fireonlive, JAA, arkiver: I got a (probably non-exhaustive) list of domains hosted by the (soon to shut down) FutureQuest:
transfer.archivete.am/rgTXc/domains.txt
-
dx
hey! do you have any graceful ways to handle the thing where phpbb forums add &sid=hash to every link? archive.org seems to struggle with it, every thread link here goes nowhere:
web.archive.org/web/20230402125320/…id=d29688f6831c923e7a7ec107ad150803
-
masterX244
I think the ?archiveteam urlfudgery on archivebot crawls is there to suppress that
-
thuban
fireonlive: How do I patch KDE2 under FreeBSD?
-
OrIdow6
Wysp will be delayed another day, got sidetracked
-
imer
alright, keep us posted :)
-
arkiver
rewby: nice! checking it out
-
thuban
fyi all: VickoSaviour is offline, but i am grabbing progaming.ba per some previous discussion
-
arkiver
OrIdow6: do you have a channel name idea? :) i believe an idea was posted here before too
-
arkiver
rewby: how did you collect this list?
-
OrIdow6
arkiver: Not really, may be able to do something with will-o-the-wisps or whispers
-
OrIdow6
Part of the issue is that the obvious puns are so straightforward as to be uncreative
-
rewby|backup
arkiver: It's the list from the forward dns section of
bgp.tools/prefix/69.5.0.0/19#dns (which in turn is certificate transparency logs and other magic that I don't recall)
-
rewby|backup
Worth noting I didn't write it out manually, I asked the developer of the site to run a DB query for me
-
JAA
dx: To expand on masterX244's reply: What we do is start the crawl from
example.org/?archiveteam. That request sets the cookies, and then pages loaded after that won't have the sid params in links. It's a separate URL so that when the homepage gets loaded later, the cookies are already in place and browsing will work naturally. Once it got a few URLs, we ignore any URL with an sid param. The
-
JAA
'?archiveteam' suffix has no special meaning; it just has to be a unique URL so the actual homepage is retrieved with cookies later.
-
JAA
This isn't perfect though. Eventually, the session cookie might expire, and then the crawl gets another page with sid param links, which would get ignored, so coverage might be slightly incomplete. Unless the forums are very broken, that shouldn't be a significant fraction though.
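
The "ignore any URL with an sid param" step can be sketched in a few lines of Python. This is only an illustration of the idea, not how ArchiveBot implements it; the function names and the example.org links are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs


def has_sid_param(url: str) -> bool:
    """Return True if the URL's query string carries an `sid` parameter."""
    query = urlsplit(url).query
    return "sid" in parse_qs(query, keep_blank_values=True)


def filter_session_links(urls):
    """Drop links that embed a session ID; keep everything else in order."""
    return [u for u in urls if not has_sid_param(u)]


# Hypothetical links as a crawler might discover them on a phpBB forum.
links = [
    "http://example.org/viewtopic.php?t=42",
    "http://example.org/viewtopic.php?t=42&sid=d29688f6831c923e7a7ec107ad150803",
    "http://example.org/?archiveteam",
]
kept = filter_session_links(links)
```

Here `kept` holds the first and third link; the sid variant of the thread is dropped because the cookie-based copy of the same page is expected to be crawled instead.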
-
dx
JAA, masterX244: thank you!
-
nighthnh099_
I have a massive list of urls (90K urls) for a website that might shut down any day now. not all of these exist, so I wrote a script that checks which ones give back a status 200 and then mirrors them
-
nighthnh099_
when I ran the script, my computer started lagging and explorer did some very strange things so I had to restart my computer
-
nighthnh099_
can someone else run the script for me? after running the script, you can run a dir command and do a find and replace to turn the files it mirrored into urls
-
nighthnh099_
then whoever runs the script can just put it into a spreadsheet and let ia save the urls
-
phaeton
if you're still looking for channel name ideas, i propose #wispaway....wisp away is semi-commonly misused instead of whisk away which means to take away suddenly
-
JAA
nighthnh099_: We have our own tooling that can archive things much more efficiently and quickly than feeding to IA. I can take a look. Which site is it? And please upload the list to
transfer.archivete.am .
-
nighthnh099_
transfer.archivete.am/2mctU/urls.txt the urls start at 4000 because that's as far as I got before I had to restart; basically the urls are a bunch of game scripts for an app, not all of the urls exist though; I might need help with finding the upper limit of the list because I forgot to do that
-
JAA
Yeah, the upper limit is definitely higher.
-
JAA
Quickly poked the APK but didn't see anything of relevance. Might need DEX decompiling.
-
nighthnh099_
I already did all of that
-
nighthnh099_
oh wait do you need the script I said? sorry I forgot to ask
-
JAA
No need, I'll run the list through ArchiveBot. But need to find the upper bound first.
-
nighthnh099_
archivebot skips 404s?
-
JAA
No, they'll just get archived as well.
-
nighthnh099_
oh, that's kinda messy haha
-
JAA
Well, depends on how you look at it.
-
JAA
Archiving them records that they didn't exist.
-
JAA
Whereas if you only archive the ones that exist, a future archaeologist won't know whether they were simply missed.
-
nighthnh099_
oh, my reasons for not archiving them would be it's hard to filter through them when someone in the future decides to make a local server for the game
-
JAA
It's trivial to filter that out.
-
nighthnh099_
oh how? I don't know haha
-
JAA
Especially when you work with the WARC file ArchiveBot will produce.
-
JAA
Well, the tooling for it is currently suboptimal, but it can be done with warcio and a 10-line Python script or so.
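
A rough sketch of what such a warcio script might look like, assuming the goal is simply to list the URLs that returned 200 and skip the 404s. The helper names are invented here, and this is not the script JAA had in mind.

```python
def is_successful_response(rec_type: str, status: str) -> bool:
    """Keep only HTTP response records whose status code is 200."""
    return rec_type == "response" and status == "200"


def list_hits(warc_path: str):
    """Yield the target URI of every 200 response in a WARC file."""
    # Imported here so the helper above stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.http_headers is None:
                continue  # e.g. warcinfo or other non-HTTP records
            status = record.http_headers.get_statuscode()
            if is_successful_response(record.rec_type, status):
                yield record.rec_headers.get_header("WARC-Target-URI")
```

Calling `list_hits()` on a downloaded WARC (for example a file named `job.warc.gz`, a placeholder name) and printing each result would give the list of URLs that actually existed.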
-
JAA
It'll be easier once I finish the thing I've been working on for far too long now.
-
JAA
Anyway... so how do we find the upper limit?
-
JAA
Actually, I just checked 100000 to 100099, no hits there, so I'll do up to 100k.
-
JAA
It's running, current ETA is 5-6 hours.
-
nighthnh099_
JAA: I think 97017 is the upper limit
-
nighthnh099_
thanks for running it! also worth noting that it needs to be http, not https; everything is 404 on https for some reason
-
JAA
Yeah, I noticed.
-
JAA
Just another badly configured web server. :-)
-
nighthnh099_
oh wait a second, the mention on the site of a shut down is just the name of a story someone uploaded
-
nighthnh099_
well doesn't make anything less urgent I guess
-
nighthnh099_
the app itself has been gone since 2020 so the site could shut down any day now
-
JAA
Yeah, given how small it is, no reason not to archive it anyway.
-
nighthnh099_
JAA: I have to log out now so I guess I'll just see those urls in the CDX at some point?
-
nighthnh099_
also maybe you can zip up the files it mirrored and send it to me? I want a copy myself haha
-
nighthnh099_
will probably just ping once I open irc again
-
fireonlive
thuban: :3
-
JAA
nighthnh099_: Yes, they'll appear in the WBM eventually. The WARCs will be listed at
archive.fart.website/archivebot/viewer/job/61ha7 eventually. We don't produce plain files, so I can't simply create a ZIP for you.
-
nighthnh099_
oh wait I wasn't joined to archivebot, oh okay
-
masterX244
warcat can "unpack" WARCs though if you need the plain files inside
-
nighthnh099_
thanks
-
JAA
Yeah, not sure how that would handle the 404s though.
-
Barto
pabs: that thing definitely goes brr too
-
kiska
Not sure how much content there is to save from this
-
fireonlive
“30,000,000 pieces of content” interesting… hm. it’s blockchain stuff so idk lol
-
nicolas17
why should we worry, it's decentralized right? :P
-
fireonlive
🤐
-
nicolas17
(it's probably centralized and only using blockchain for regulation evasion purposes)
-
fireonlive
i seem to recall public companies just changing their names to like include AI or blockchain and their stock prices just shooting up instantly
-
fireonlive
semi related lol
-
FavoritoHJS
also appears twitter no longer requires account login for seeing posts? if so, i guess a warrior project is once again possible
-
nicolas17
only individual posts afaik
-
nicolas17
you can't see replies to them or what it's replying to
-
FavoritoHJS
still better than the nothing that was there before
-
fireonlive
#deadcat for Gfycat
-
arkiver
JAA: those domains rewby|backup found - do you think AB is enough for that?
-
JAA
1555 domains, might be feasible, but not sure.
-
JAA
Ryz has been feeding domains from that platform in, I think?
-
JAA
I haven't been paying a whole lot of attention.
-
arkiver
looks like these sites may not be very large?
-
arkiver
upcoming projects this month are:
-
arkiver
Wysp ( OrIdow6 )
-
arkiver
Skyblog
-
arkiver
Stitcher
-
arkiver
Xuite
-
arkiver
and Gfycat
-
arkiver
if stitcher is not huge we'll get it with AB
-
arkiver
OrIdow6: #wyspedaway for wysp
-
anarcat
"The Bluetooth connection between your smartphone and your VanMoof is encrypted for security purposes. Each time you log into your VanMoof account, this encryption key is being downloaded from VanMoof’s server."
-
anarcat
kolektiva.social/@phill⊙mnc/110701490653058697 "Little birdies tell me VanMoof has officially collapsed. They'll be making a statement shortly.
-
anarcat
If you own one of their bikes now is the time to grab your encryption keys before their servers go offline"
-
anarcat
isn't the future great?
-
murb
oh i recognise the shape of the bike, so i've probably seen them. but wasn't aware of the brand until now.
-
arkiver
hah
-
arkiver
sounds like all 'smart' things about that bike will soon stop functioning
-
murb
i wonder what smart things you need on a bike...
-
flashfire42
A helmet?
-
murb
predictive braking?
-
flashfire42
an icecream container?
-
JAA
Ah yes, the internet of shit.
-
flashfire42
on your head
-
murb
flashfire42: why would you need one of those?
-
murb
cycling is really quite safe.
-
flashfire42
Magpies
-
murb
flashfire42: avoid .au then.
-
flashfire42
Bit hard when I live there
-
murb
how i avoid being swooped,.. i live on another continent.
-
h2ibot
Yts98 created Games/Engines, Platforms and Hostings (+2012, Created page with "== Engines == *…):
wiki.archiveteam.org/?title=Games/E…nes%2C%20Platforms%20and%20Hostings
-
JAA
> The VanMoof S5 & A5 will just keep getting better. And better. Via over-the-air updates, we can continuously improve your bike long after your first ride. From the Halo Ring Interface to Hi-Vis Lights, this bike has revolution, built in.
-
JAA
...
-
JAA
Off to -ot for that I guess.
-
murb
i hope they'll change the tyres etc.
-
masterX244
smart shit is a PITA; or anything with firmware. (it's rare to find a firmware updater that allows local files instead of only connecting straight to the server, and for those that also allow local files: always back up those files)
-
Barto
#stallmanwasright
-
h2ibot
FireonLive edited Current Projects (+38, add IRC channel for Wysp):
wiki.archiveteam.org/?diff=50166&oldid=50156
-
h2ibot
FireonLive edited Wysp (+19, add IRC channel):
wiki.archiveteam.org/?diff=50167&oldid=50158
-
murb
Barto: a stopped clock etc.
-
h2ibot
Yts98 edited Games/Engines, Platforms and Hostings (+263):
wiki.archiveteam.org/?diff=50168&oldid=50164
-
Ryz
Hello JAA and arkiver, I'm basing the archiving regarding FutureQuest run domains on what flashfire42 has fed me with
bgp.tools/prefix/69.5.0.0/19#dns
-
JAA
There is a more complete list, also from bgp.tools, see above.
-
JAA
I can throw it all into queueh2ibot if it's suitable for that.
-
arkiver
the one that rewby posted
-
h2ibot
Arkiver edited YouTube (+120, Change YouTube rules):
wiki.archiveteam.org/?diff=50169&oldid=49723
-
fireonlive
here's rewb\y's list via bgp.tools (thanks rewb\y!):
transfer.archivete.am/rgTXc/domains.txt
-
Ryz
JAA, I'm not too certain on running it automatically via queueh2ibot because I've encountered some oddball websites where it would need to be treated with that particular pipeline, and others are just geo-restricted
-
h2ibot
Arkiver edited YouTube (+9, Fix formatting):
wiki.archiveteam.org/?diff=50172&oldid=50171
-
JAA
I mean, if you'd like to run the 1555 domains manually, that's also fine with me, but it's a lot of work.
-
Ryz
That is unfortunately true... ><;
-
h2ibot
FireonLive edited YouTube (+209, Update infoboxes):
wiki.archiveteam.org/?diff=50173&oldid=50172
-
h2ibot
FireonLive edited YouTube (+20, use 2=YouTube to make infobox not so weird):
wiki.archiveteam.org/?diff=50174&oldid=50173
-
Ryz
Hmm, how about this JAA, while I work on the one that flashfire42 gave me for now, queueh2ibot can process
transfer.archivete.am/rgTXc/domains.txt
-
nicolas17
so, how do I archive 5-10GB files such that they appear on WBM, with inter-URL payload deduplication?
-
JAA
Ryz: Sure, if you give me that list to filter out duplicates.
-
JAA
nicolas17: wget-at or qwarc
-
flashfire42
such as they appear in wbm
-
flashfire42
is the core issue
-
nicolas17
archivebot doesn't deduplicate, qwarc would work but then I need Approval:tm: to make my uploaded WARCs appear in WBM
-
JAA
Right
-
Ryz
Here it is JAA, what flashfire42 has fed me:
bgp.tools/prefix/69.5.0.0/19#dns
-
Ryz
Which includes the odd URLs that for some reason end with '.' oo;
-
fireonlive
those are “fully qualified”
-
JAA
Technically, all domains end with a dot.
-
fireonlive
ye
-
JAA
That has 1999 domains...?
-
h2ibot
Arkiver edited YouTube (+59, Allow archiving ads that are actually used as…):
wiki.archiveteam.org/?diff=50175&oldid=50174
-
pabs
anarcat: dunno if you have any bandwidth for AB jobs, but you might be interested in these new projects
wiki.archiveteam.org/?title=Bugzilla wiki.archiveteam.org/?title=IRC/Logs also
wiki.archiveteam.org/?title=Mailman2
-
JAA
rewby: See above, I'm getting 1999 results on
bgp.tools/prefix/69.5.0.0/19#dns , i.e. more than the 1555 in your list that comes directly from the DB? Something's not right there.
-
JAA
No dupes with the trailing dot either.
-
JAA
There's little overlap, too.
-
JAA
1238 domains appear in the DB list but not on the page. 1683 domains appear on the page but not in the DB list.
-
JAA
So only a bit over 300 overlap.
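
The comparison above (names only in the DB list, only on the page, and in both, with trailing dots ignored) can be reproduced with set operations. A minimal sketch, assuming each list is one domain per line; the function names and sample domains are placeholders, not the real data.

```python
def normalize(domain: str) -> str:
    """Lowercase and strip the trailing root dot of a fully qualified name."""
    return domain.strip().lower().rstrip(".")


def compare_lists(db_list, page_list):
    """Return (only_in_db, only_on_page, in_both) as sets of normalized names."""
    db = {normalize(d) for d in db_list if d.strip()}
    page = {normalize(d) for d in page_list if d.strip()}
    return db - page, page - db, db & page


# Tiny made-up sample; the real lists had 1555 and 1999 entries.
only_db, only_page, both = compare_lists(
    ["example.org", "Example.net."],
    ["example.net", "example.com."],
)
```

Normalizing before building the sets means "example.net." and "Example.net" count as the same domain, so the trailing dot alone cannot explain a small overlap.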