00:05:27 <fireonlive> oof
01:08:30 <h2ibot> FireonLive edited Mailman2 (+44, Add CA/Browser Forum): https://wiki.archiveteam.org/?diff=50159&oldid=50149
04:11:04 <h2ibot> PaulWise created Bugzilla (+3994, add project to archive bugzilla instances): https://wiki.archiveteam.org/?title=Bugzilla
04:11:39 <pabs> JAA: ^
04:11:43 <fireonlive> pabs: 👍
04:12:13 * pabs just airing out his todo/archive-* lists :)
04:12:26 <pabs> hope other folks can/want to help with them :)
04:13:20 <fireonlive> :)
04:16:04 <nicolas17> that's always tricky wrt cooperative instances
04:17:13 <pabs> hm?
04:17:14 <nicolas17> like, I *could* give you a DB dump of the KDE forum and you avoid having to scrape it, but it would include private messages, so I would need to figure out what tables to exclude
04:17:52 <pabs> scraping is probably better anyway so it ends up in the WBM?
04:18:13 <nicolas17> same for bugzilla, there's private tickets sometimes
04:18:17 <nicolas17> yeah true
04:18:20 <pabs> there are similar issues with GitLab/etc instances too
04:18:33 <nicolas17> guess the most helpful thing there is an admin providing IDs then
04:18:57 <pabs> the buglist.cgi search on the page can handle that I think
04:19:09 <nicolas17> and if I don't bother filtering out stuff and give you the ID of a private ticket, you can't fetch that anyway
04:19:34 <nicolas17> pabs: I meant more broadly (IDs of forum posts, gitlab project list, etc)
04:19:45 <pabs> ack yeah
04:21:41 <nicolas17> and that reminds me I should update https://archive.org/details/kde-git-repositories
04:22:11 <fireonlive> incoming shit
04:22:42 <pabs> are KDE git repos on SWH or the TODO for Codearchiver?
04:23:06 <h2ibot> FireonLive edited Discourse (+360489, Add in uncategorized forums that don't require…): https://wiki.archiveteam.org/?diff=50161&oldid=50148
04:23:09 <fireonlive> there it is
04:23:31 <fireonlive> i don't love it but also don't want to lose it 🤷
04:23:44 <pabs> hmm, only KDE phabricator on https://archive.softwareheritage.org/coverage/
04:23:48 <fireonlive> i like pabs' layout more but not my page
04:24:28 <nicolas17> "+360489" wow
04:24:43 <fireonlive> i guess i coulda manually visited all 4k links myself :D
04:25:07 <fireonlive> it'd have to be like right after a certain something in the day
04:26:07 <h2ibot> Pokechu22 edited Bugzilla (+35, /* Archived */…): https://wiki.archiveteam.org/?diff=50162&oldid=50160
04:26:10 <fireonlive> watch next, wherein fireonlive edits 4 TiB into the wiki to hold some personal backups
04:26:53 <nicolas17> pabs: when I offered stuff to softwareheritage they were in "we're busy getting started and archiving stuff from big sites like github" mode and would get to custom stuff later
04:27:30 <pabs> nicolas17: they now have a self-service(ish) thing for archiving gitlab and other forge types
04:27:31 <fireonlive> pabs: should there be a section for dead bugzillas?
04:27:31 <nicolas17> then it seems 7 years passed and they didn't bother contacting KDE? time flies
04:27:46 <pabs> https://archive.softwareheritage.org/add-forge/request/
04:27:49 <nicolas17> https://wiki.softwareheritage.org/wiki/Suggestion_box:_source_code_to_add/KDE
04:28:26 <pabs> yeah, I sense they are not well organised or under-resourced technically
04:28:55 <fireonlive> they still use svn :o
04:28:59 <pabs> er better link https://archive.softwareheritage.org/add-forge/request/list/
04:29:07 <h2ibot> FireonLive edited Bugzilla (+37, add The Document Foundation): https://wiki.archiveteam.org/?diff=50163&oldid=50162
04:29:17 <nicolas17> I was almost expecting to find "freenode" mentioned in https://wiki.softwareheritage.org/wiki/IRC_channels :P
04:29:38 <fireonlive> haha
04:29:46 <pabs> ah, I already submitted https://invent.kde.org/ there, it is pending on them contacting the KDE folks though
04:30:42 <fireonlive> i find it interesting they ask for random gitlab (gittea/etc) intances but not for users' github (or gitlab.com?) repos
04:30:53 <pabs> they archive all of github
04:30:56 <fireonlive> is it just because of potential costs i wonder or something else
04:31:06 <pabs> and gitlab.com and many other gitlab sites
04:31:16 <pabs> fireonlive: re dead bugzillas, yeah probably, for folks to look up old archives in the WBM?
04:31:19 <fireonlive> ye but they stop to ask KDE 'can we' first
04:31:25 <fireonlive> pabs: ye i was thinking so
04:31:34 <fireonlive> versus just everyone on github
04:31:41 <pabs> right
04:31:56 <fireonlive> wonder why the difference
04:32:11 <nicolas17> fireonlive: if they stop to ask KDE "can we svnmirror your entire SVN repository", we'll tell them "no, we can just send you a tarball!"
04:32:14 <pabs> maybe in case they overload the sites?
04:32:19 <fireonlive> ah perhaps
04:32:31 <fireonlive> nicolas17: true in that case :)
04:33:27 <fireonlive> https://wiki.softwareheritage.org/wiki/IRC#IRC_access_list pffft no groupserv
04:33:53 <fireonlive> oh i guess it's channel-based
04:34:14 <fireonlive> so i'll allow it lol
04:34:18 <nicolas17> pabs: I originally created the kde-git-repositories item on archive.org when some Russian devs were worried about Internet blockages, or depeering strongly affecting their bandwidth, and this way they could use bittorrent
05:14:55 <fireonlive> russians? in MY kde? it's more likely than you think!
05:44:34 <pabs> https://techcrunch.com/2023/07/10/vanmoof-the-e-bike-darling-skids-off-track-sales-paused-execs-depart/
06:08:00 <Barto> pabs: ab goes brr
06:16:16 <fireonlive> brrrrrrrrr
08:25:34 <VickoSaviour> progaming.ba forum is up for a limited amount of time
08:26:39 <VickoSaviour> even tho is wall locked, i have the username and password to get all of the files on.
08:28:50 <VickoSaviour> just send me pm on hackint
09:40:39 <OrIdow6> Why would replayweb.page say that an URL is in the WARC when listing requests, but claim it wasn't found when I try to view it?
09:42:31 <OrIdow6> Whatever, record's in the file
09:47:47 <pabs> Barto: another one for you https://simpleflying.com/wisk-aero-boeing-subsidiary/ :)
10:12:18 <rewby> tzt, fireonlive, JAA, arkiver: I got a (probably non-exhaustive) list of domains hosted by the (soon to shut down) FutureQuest: https://transfer.archivete.am/rgTXc/domains.txt
10:14:02 <dx> hey! do you have any graceful ways to handle the thing where phpbb forums add &sid=hash to every link? archive.org seems to struggle with it, every thread link here goes nowhere: https://web.archive.org/web/20230402125320/https://www.freestompboxes.org/viewforum.php?f=1&sid=d29688f6831c923e7a7ec107ad150803
12:07:54 <masterX244> I think the ?archiveteam urlfudgery on archivebot crawls is there to suppress that
12:41:07 <thuban> fireonlive: Как пропатчить KDE2 под FreeBSD?
13:10:28 <OrIdow6> Wysp will be delayed another day, got sidetracked
13:17:43 <imer> alright, keep us posted :)
13:33:40 <arkiver> rewby: nice! checking it out
13:33:55 <thuban> fyi all: VickoSaviour is offline, but i am grabbing progaming.ba per some previous discussion
13:34:06 <arkiver> OrIdow6: do you have a channel name idea? :) i believe as idea here was posted before too
13:34:33 <arkiver> rewby: how did you collect this list?
13:41:14 <OrIdow6> arkiver: Not really, may be able to do something with will-o-the-wisps or whispers
13:43:36 <OrIdow6> Part of the issue is that the obvious puns are so straightforward as to be uncreative
14:07:01 <rewby|backup> arkiver: It's the list from the forward dns section of https://bgp.tools/prefix/69.5.0.0/19#dns (which in turn is certificate transparency logs and other magic that I don't recall)
14:07:24 <rewby|backup> Worth noting I didn't write it out manually, I asked the developer of the site to run a DB query for me
14:24:43 <JAA> dx: To expand on masterX244's reply: What we do is start the crawl from https://example.org/?archiveteam. That request sets the cookies, and then pages loaded after that won't have the sid params in links. It's a separate URL so that when the homepage gets loaded later, the cookies are already in place an browsing will work naturally. Once it got a few URLs, we ignore any URL with an sid param. The
14:24:49 <JAA> '?archiveteam' suffix has no special meaning; it just has to be a unique URL so the actual homepage is retrieved with cookies later.
14:25:35 <JAA> This isn't perfect though. Eventually, the session cookie might expire, and then the crawl gets another page with sid param links, which would get ignored, so coverage might be slightly incomplete. Unless the forums are very broken, that shouldn't be a significant fraction though.
15:01:42 <dx> JAA, masterX244: thank you!
15:02:53 <nighthnh099_> I have a massive list of urls (90K urls) for a website that might shut down any day now, not all of these exist so I got a script that checked which gave back a status 200 then to mirror it
15:03:16 <nighthnh099_> when I ran the script, my computer started lagging and explorer did some very strange things so I had to restart my computer
15:03:48 <nighthnh099_> can someone else run the script for me? after running the script, you can run a dir command and do a find and replace to turn the files it mirrored into urls
15:04:12 <nighthnh099_> then whoever runs the script can just put it into a spreadsheet and let ia save the urls
15:08:14 <phaeton> if you're still looking for channel name ideas, i propose #wispaway....wisp away is semi-commonly misused instead of whisk away which means to take away suddenly
15:12:04 <JAA> nighthnh099_: We have our own tooling that can archive things much more efficiently and quickly than feeding to IA. I can take a look. Which site is it? And please upload the list to https://transfer.archivete.am/ .
15:14:49 <nighthnh099_> https://transfer.archivete.am/2mctU/urls.txt the urls start at 4000 because that's as far as I got before I had to restart; basically the urls are a bunch of game scripts for an app, not all of the urls exist though; I might need help with finding the upper limit of the list because I forgot to do that
15:18:54 <JAA> Yeah, the upper limit is definitely higher.
15:32:11 <JAA> Quickly poked the APK but didn't see anything of relevance. Might need DEX decompiling.
15:35:00 <nighthnh099_> I already did all of that
15:35:22 <nighthnh099_> oh wait do you need the script I said? sorry I forgot to ask
15:36:41 <JAA> No need, I'll run the list through ArchiveBot. But need to find the upper bound first.
15:37:37 <nighthnh099_> archivebot skips 404s?
15:39:30 <JAA> No, they'll just get archived as well.
15:40:11 <nighthnh099_> oh, that's kinda messy haha
15:40:26 <JAA> Well, depends on how you look at it.
15:40:34 <JAA> Archiving them records that they didn't exist.
15:40:55 <JAA> Whereas if you only archive the ones that exist, a future archaeologist won't know whether they were simply missed.
15:41:26 <nighthnh099_> oh, my reasons for not archiving them would be it's hard to filter through them when someone in the future decides to make a local server for the game
15:42:04 <JAA> It's trivial to filter that out.
15:42:29 <nighthnh099_> oh how? I don't know haha
15:42:31 <JAA> Especially when you work with the WARC file ArchiveBot will produce.
15:43:22 <JAA> Well, the tooling for it is currently suboptimal, but it can be done with warcio and a 10-line Python script or so.
15:43:45 <JAA> It'll be easier once I finish the thing I've been working on for far too long now.
15:43:59 <JAA> Anyway... so how do we find the upper limit?
15:44:56 <JAA> Actually, I just checked 100000 to 100099, no hits there, so I'll do up to 100k.
15:49:14 <JAA> It's running, current ETA is 5-6 hours.
15:52:09 <nighthnh099_> JAA: I think 97017 is the upper limit
15:52:52 <nighthnh099_> thanks for running it! also worth noting that it needs to be http, not https; everything is 404 on https for some reason
15:53:01 <JAA> Yeah, I noticed.
15:53:12 <JAA> Just another badly configured web server. :-)
15:54:57 <nighthnh099_> oh wait a second, the mention on the site of a shut down is just the name of a story someone uploaded
15:55:09 <nighthnh099_> well doesn't make anything less urgent I guess
15:55:26 <nighthnh099_> the app itself has been gone since 2020 so the site could shut down any day now
15:56:25 <JAA> Yeah, given how small it is, no reason not to archive it anyway.
16:04:11 <nighthnh099_> JAA: I have to log out now so I guess I'll just see those urls in the CDX at some point?
16:05:07 <nighthnh099_> also maybe you can zip up the files it mirrored and send it to me? I want a copy myself haha
16:05:26 <nighthnh099_> will probably just ping once I open irc again
16:05:29 <fireonlive> thuban: :3
16:10:03 <JAA> nighthnh099_: Yes, they'll appear in the WBM eventually. The WARCs will be listed at https://archive.fart.website/archivebot/viewer/job/61ha7 eventually. We don't produce plain files, so I can't simply create a ZIP for you.
16:10:38 <nighthnh099_> oh wait I wasn't joined to archivebot, oh okay
16:11:18 <masterX244> warcat allows to "unpack" WARCs though if you need the plain files inside
16:11:28 <nighthnh099_> thanks
16:11:48 <JAA> Yeah, not sure how that would handle the 404s though.
17:10:19 <Barto> pabs: that thing definitely goes brr too
17:16:31 <kiska> RIP LBRY? https://twitter.com/LBRYcom/status/1678866789407551489
17:18:24 <kiska> Not sure how much content there is to save from this
17:28:20 <fireonlive> “30,000,000 pieces of content” interesting… hm. it’s blockchain stuff so idk lol
17:34:22 <nicolas17> why should we worry, it's decentralized right? :P
17:35:46 <fireonlive> 🤐
17:36:18 <nicolas17> (it's probably centralized and only using blockchain for regulation evasion purposes)
17:38:11 <fireonlive> i seem to recall public companies just changing their names to like include AI or blockchain and their stock prices just shooting up instantly
17:38:14 <fireonlive> semi related lol
19:42:29 <FavoritoHJS> this is not a drill, we have a dying site https://forums.terraria.org/index.php?threads/gfycat-shutting-down-this-september.120070/
19:44:42 <FavoritoHJS> also appears twitter no longer requires account login for seeing posts? if so, i guess a warrior project is once again possible
19:45:16 <nicolas17> only individual posts afaik
19:45:28 <nicolas17> you can't see replies to them or what it's replying to
19:45:39 <FavoritoHJS> still better than the nothing that was there before
19:45:39 <fireonlive> #deadcat for Gfycat
20:50:46 <arkiver> JAA: those domains rewby|backup found - do you think AB is enough for that?
20:53:19 <JAA> 1555 domains, might be feasible, but not sure.
20:53:31 <JAA> Ryz has been feeding domains from that platform in, I think?
20:53:37 <JAA> I haven't been paying a whole lot of attention.
20:54:59 <arkiver> looks like these sites may not be very large?
20:55:20 <arkiver> upcoming projects this month are:
20:55:27 <arkiver> Wysp ( OrIdow6 )
20:55:30 <arkiver> Skyblog
20:55:35 <arkiver> Stitcher
20:55:38 <arkiver> Xuite
20:55:40 <arkiver> and Gfycat
20:56:04 <arkiver> if stitcher is not huge we'll get in with AB
20:57:01 <arkiver> OrIdow6: #wyspedaway for wysp
20:58:31 <anarcat> i'm not sure how we can help with this, but https://github.com/grossartig/vanmoof-encryption-key-exporter
20:58:48 <anarcat> "The Bluetooth connection between your smartphone and your VanMoof is encrypted for security purposes. Each time you log into your VanMoof account, this encryption key is being downloaded from VanMoof’s server."
20:59:09 <anarcat> https://kolektiva.social/@phill⊙mnc/110701490653058697 "Little birdies tell me VanMoof has officially collapsed. They'll be making a statement shortly.
20:59:10 <anarcat> If you own one of their bikes now is the time to grab your encryption keys before their servers go offline"
20:59:22 <anarcat> isn't the future great?
21:01:55 <murb> oh i recongise the shape of the bike, so i've probably seen them. but wasn't aware of the brand until onw.
21:03:19 <arkiver> hah
21:03:32 <arkiver> sounds like all 'smart' things about that bike will soon stop functioning
21:03:54 <murb> i wonder what smart things you need on a bike...
21:04:04 <flashfire42> A helmet?
21:04:05 <murb> predictive braking?
21:04:08 <flashfire42> an icecream container?
21:04:09 <JAA> Ah yes, the internet of shit.
21:04:10 <flashfire42> on your head
21:04:22 <murb> flashfire42: why would you need one of those?
21:04:30 <murb> cycling is really quite safe.
21:04:30 <flashfire42> Magpies
21:04:36 <murb> flashfire42: avoid .au then.
21:04:44 <flashfire42> Bit hard when I live there
21:05:17 <murb> how i avoid being swooped,.. i live on another continent.
21:05:24 <h2ibot> Yts98 created Games/Engines, Platforms and Hostings (+2012, Created page with "== Engines ==  *…): https://wiki.archiveteam.org/?title=Games/Engines%2C%20Platforms%20and%20Hostings
21:06:23 <h2ibot> Yts98 edited Games (+90): https://wiki.archiveteam.org/?diff=50165&oldid=46613
21:06:39 <JAA> > The VanMoof S5 & A5 will just keep getting better. And better. Via over-the-air updates, we can continuously improve your bike long after your first ride. From the Halo Ring Interface to Hi-Vis Lights, this bike has revolution, built in.
21:06:43 <JAA> ...
21:06:57 <JAA> Off to -ot for that I guess.
21:06:59 <murb> i hope they'll change the tyres etc.
21:07:48 <masterX244> smart shit is a PITA; or anythbing with firmware. (its rare to find a firmware updater that allows local files instead of only connecting straight to server, and for those that also allow local files: always backup those files)
21:12:44 <Barto> #stallmanwasright
21:31:28 <h2ibot> FireonLive edited Current Projects (+38, add IRC channel for Wysp): https://wiki.archiveteam.org/?diff=50166&oldid=50156
21:32:28 <h2ibot> FireonLive edited Wysp (+19, add IRC channel): https://wiki.archiveteam.org/?diff=50167&oldid=50158
21:35:27 <murb> Barto: a stopped clock etc.
21:49:32 <h2ibot> Yts98 edited Games/Engines, Platforms and Hostings (+263): https://wiki.archiveteam.org/?diff=50168&oldid=50164
22:42:29 <Ryz> Hello JAA and arkiver, I'm basing the archiving regarding FutureQuest run domains on what flashfire42 has fed me with https://bgp.tools/prefix/69.5.0.0/19#dns
22:42:53 <JAA> There is a more complete list, also from bgp.tools, see above.
22:43:24 <JAA> I can throw it all into queueh2ibot if it's suitable for that.
22:43:33 <arkiver> the one that rewby posted
22:43:44 <h2ibot> Arkiver edited YouTube (+120, Change YouTube rules): https://wiki.archiveteam.org/?diff=50169&oldid=49723
22:44:26 <fireonlive> here's rewb\y's list via bgp.tools (thanks rewb\y!): https://transfer.archivete.am/rgTXc/domains.txt
22:44:44 <h2ibot> Arkiver edited YouTube (+95): https://wiki.archiveteam.org/?diff=50170&oldid=50169
22:44:48 <Ryz> JAA, I'm not too certain on running it automatically via queueh2ibot because I've encountered some oddball websites where it would need to be treated with that particular pipeline, and others are just geo-restricted
22:45:45 <h2ibot> Arkiver edited YouTube (+7): https://wiki.archiveteam.org/?diff=50171&oldid=50170
22:45:46 <h2ibot> Arkiver edited YouTube (+9, Fix formatting): https://wiki.archiveteam.org/?diff=50172&oldid=50171
22:45:59 <JAA> I mean, if you'd like to run the 1555 domains manually, that's also fine with me, but it's a lot of work.
22:46:40 <Ryz> That is unfortunately true... ><;
22:52:46 <h2ibot> FireonLive edited YouTube (+209, Update infoboxes): https://wiki.archiveteam.org/?diff=50173&oldid=50172
23:00:49 <h2ibot> FireonLive edited YouTube (+20, use 2=YouTube to make infobox not so weird): https://wiki.archiveteam.org/?diff=50174&oldid=50173
23:01:38 <Ryz> Hmm, how about this JAA, while I work on the one that flashfire42 gave me for now, queueh2ibot can process https://transfer.archivete.am/rgTXc/domains.txt
23:09:57 <nicolas17> so, how do I archive 5-10GB files such that they appear on WBM, with inter-URL payload deduplication?
23:10:42 <JAA> Ryz: Sure, if you give me that list to filter out duplicates.
23:10:46 <JAA> nicolas17: wget-at or qwarc
23:10:58 <flashfire42> such as they appear in wbm
23:11:02 <flashfire42> is the core issue
23:11:07 <nicolas17> archivebot doesn't deduplicate, qwarc would work but then I need Approval:tm: to make my uploaded WARCs appear in WBM
23:11:14 <JAA> Right
23:11:20 <Ryz> Here it is JAA, what flashfire42 has fed me: https://bgp.tools/prefix/69.5.0.0/19#dns
23:11:33 <Ryz> Which includes the odd URLs that for some reason end with '.' oo;
23:11:49 <fireonlive> those are “fully qualified”
23:11:50 <JAA> Technically, all domains end with a dot.
23:11:53 <fireonlive> ye
23:13:04 <JAA> That has 1999 domains...?
23:13:51 <h2ibot> Arkiver edited YouTube (+59, Allow archiving ads that are actually used as…): https://wiki.archiveteam.org/?diff=50175&oldid=50174
23:18:51 <h2ibot> Arkiver edited YouTube (+0): https://wiki.archiveteam.org/?diff=50176&oldid=50175
23:19:04 <pabs> anarcat: dunno if you have any bandwidth for AB jobs, but you might be interested in these new projects https://wiki.archiveteam.org/?title=Bugzilla https://wiki.archiveteam.org/?title=IRC/Logs also https://wiki.archiveteam.org/?title=Mailman2
23:20:15 <JAA> rewby: See above, I'm getting 1999 results on https://bgp.tools/prefix/69.5.0.0/19#dns , i.e. more than the 1555 in your list that comes directly from the DB? Something's not right there.
23:20:55 <JAA> No dupes with the trailing dot either.
23:23:03 <JAA> There's little overlap, too.
23:23:34 <JAA> 1238 domains appear in the DB list but not on the page. 1683 domains appear on the page but not in the DB list.
23:23:52 <h2ibot> Arkiver edited YouTube (+9): https://wiki.archiveteam.org/?diff=50177&oldid=50176
23:23:53 <JAA> So only a bit over 300 overlap.