00:05:27 oof
01:08:30 FireonLive edited Mailman2 (+44, Add CA/Browser Forum): https://wiki.archiveteam.org/?diff=50159&oldid=50149
04:11:04 PaulWise created Bugzilla (+3994, add project to archive bugzilla instances): https://wiki.archiveteam.org/?title=Bugzilla
04:11:39 JAA: ^
04:11:43 pabs: 👍
04:12:13 * pabs just airing out his todo/archive-* lists :)
04:12:26 hope other folks can/want to help with them :)
04:13:20 :)
04:16:04 that's always tricky wrt cooperative instances
04:17:13 hm?
04:17:14 like, I *could* give you a DB dump of the KDE forum and you avoid having to scrape it, but it would include private messages, so I would need to figure out what tables to exclude
04:17:52 scraping is probably better anyway so it ends up in the WBM?
04:18:13 same for bugzilla, there are private tickets sometimes
04:18:17 yeah true
04:18:20 there are similar issues with GitLab/etc instances too
04:18:33 guess the most helpful thing there is an admin providing IDs then
04:18:57 the buglist.cgi search on the page can handle that I think
04:19:09 and if I don't bother filtering out stuff and give you the ID of a private ticket, you can't fetch that anyway
04:19:34 pabs: I meant more broadly (IDs of forum posts, gitlab project list, etc)
04:19:45 ack yeah
04:21:41 and that reminds me I should update https://archive.org/details/kde-git-repositories
04:22:11 incoming shit
04:22:42 are KDE git repos on SWH or on the TODO for codearchiver?
04:23:06 FireonLive edited Discourse (+360489, Add in uncategorized forums that don't require…): https://wiki.archiveteam.org/?diff=50161&oldid=50148
04:23:09 there it is
04:23:31 i don't love it but also don't want to lose it 🤷
04:23:44 hmm, only KDE phabricator on https://archive.softwareheritage.org/coverage/
04:23:48 i like pabs' layout more but not my page
04:24:28 "+360489" wow
04:24:43 i guess i coulda manually visited all 4k links myself :D
04:25:07 it'd have to be like right after a certain something in the day
04:26:07 Pokechu22 edited Bugzilla (+35, /* Archived */…): https://wiki.archiveteam.org/?diff=50162&oldid=50160
04:26:10 watch next, wherein fireonlive edits 4 TiB into the wiki to hold some personal backups
04:26:53 pabs: when I offered stuff to softwareheritage they were in "we're busy getting started and archiving stuff from big sites like github" mode and would get to custom stuff later
04:27:30 nicolas17: they now have a self-service(ish) thing for archiving gitlab and other forge types
04:27:31 pabs: should there be a section for dead bugzillas?
04:27:31 then it seems 7 years passed and they didn't bother contacting KDE?
time flies
04:27:46 https://archive.softwareheritage.org/add-forge/request/
04:27:49 https://wiki.softwareheritage.org/wiki/Suggestion_box:_source_code_to_add/KDE
04:28:26 yeah, I sense they are not well organised or are under-resourced technically
04:28:55 they still use svn :o
04:28:59 er, better link: https://archive.softwareheritage.org/add-forge/request/list/
04:29:07 FireonLive edited Bugzilla (+37, add The Document Foundation): https://wiki.archiveteam.org/?diff=50163&oldid=50162
04:29:17 I was almost expecting to find "freenode" mentioned in https://wiki.softwareheritage.org/wiki/IRC_channels :P
04:29:38 haha
04:29:46 ah, I already submitted https://invent.kde.org/ there, it is pending on them contacting the KDE folks though
04:30:42 i find it interesting they ask for random gitlab (gitea/etc) instances but not for users' github (or gitlab.com?) repos
04:30:53 they archive all of github
04:30:56 is it just because of potential costs i wonder, or something else
04:31:06 and gitlab.com and many other gitlab sites
04:31:16 fireonlive: re dead bugzillas, yeah probably, for folks to look up old archives in the WBM?
04:31:19 ye but they stop to ask KDE 'can we' first
04:31:25 pabs: ye i was thinking so
04:31:34 versus just everyone on github
04:31:41 right
04:31:56 wonder why the difference
04:32:11 fireonlive: if they stop to ask KDE "can we svnmirror your entire SVN repository", we'll tell them "no, we can just send you a tarball!"
04:32:14 maybe in case they overload the sites?
04:32:19 ah perhaps
04:32:31 nicolas17: true in that case :)
04:33:27 https://wiki.softwareheritage.org/wiki/IRC#IRC_access_list pffft no groupserv
04:33:53 oh i guess it's channel-based
04:34:14 so i'll allow it lol
04:34:18 pabs: I originally created the kde-git-repositories item on archive.org when some Russian devs were worried about Internet blockages, or depeering strongly affecting their bandwidth, and this way they could use bittorrent
05:14:55 russians? in MY kde?
it's more likely than you think!
05:44:34 https://techcrunch.com/2023/07/10/vanmoof-the-e-bike-darling-skids-off-track-sales-paused-execs-depart/
06:08:00 pabs: ab goes brr
06:16:16 brrrrrrrrr
08:25:34 the progaming.ba forum is up for a limited amount of time
08:26:39 even though it is login-walled, i have a username and password to get all of the files on it
08:28:50 just send me a pm on hackint
09:40:39 Why would replayweb.page say that a URL is in the WARC when listing requests, but claim it wasn't found when I try to view it?
09:42:31 Whatever, the record's in the file
09:47:47 Barto: another one for you https://simpleflying.com/wisk-aero-boeing-subsidiary/ :)
10:12:18 tzt, fireonlive, JAA, arkiver: I got a (probably non-exhaustive) list of domains hosted by the (soon to shut down) FutureQuest: https://transfer.archivete.am/rgTXc/domains.txt
10:14:02 hey! do you have any graceful ways to handle the thing where phpbb forums add &sid=hash to every link? archive.org seems to struggle with it, every thread link here goes nowhere: https://web.archive.org/web/20230402125320/https://www.freestompboxes.org/viewforum.php?f=1&sid=d29688f6831c923e7a7ec107ad150803
12:07:54 I think the ?archiveteam URL fudgery on archivebot crawls is there to suppress that
12:41:07 fireonlive: How do I patch KDE2 under FreeBSD?
13:10:28 Wysp will be delayed another day, got sidetracked
13:17:43 alright, keep us posted :)
13:33:40 rewby: nice! checking it out
13:33:55 fyi all: VickoSaviour is offline, but i am grabbing progaming.ba per some previous discussion
13:34:06 OrIdow6: do you have a channel name idea? :) i believe an idea was posted here before too
13:34:33 rewby: how did you collect this list?
13:41:14 arkiver: Not really, may be able to do something with will-o'-the-wisps or whispers
13:43:36 Part of the issue is that the obvious puns are so straightforward as to be uncreative
14:07:01 arkiver: It's the list from the forward DNS section of https://bgp.tools/prefix/69.5.0.0/19#dns (which in turn is certificate transparency logs and other magic that I don't recall)
14:07:24 Worth noting I didn't write it out manually, I asked the developer of the site to run a DB query for me
14:24:43 dx: To expand on masterX244's reply: What we do is start the crawl from https://example.org/?archiveteam. That request sets the cookies, and then pages loaded after that won't have the sid params in links. It's a separate URL so that when the homepage gets loaded later, the cookies are already in place and browsing will work naturally. Once it has grabbed a few URLs, we ignore any URL with an sid param. The
14:24:49 '?archiveteam' suffix has no special meaning; it just has to be a unique URL so the actual homepage is retrieved with cookies later.
14:25:35 This isn't perfect though. Eventually, the session cookie might expire, and then the crawl gets another page with sid param links, which would get ignored, so coverage might be slightly incomplete. Unless the forums are very broken, that shouldn't be a significant fraction though.
15:01:42 JAA, masterX244: thank you!
15:02:53 I have a massive list of URLs (90K) for a website that might shut down any day now; not all of them exist, so I have a script that checks which ones return status 200 and then mirrors those
15:03:16 when I ran the script, my computer started lagging and explorer did some very strange things, so I had to restart my computer
15:03:48 can someone else run the script for me?
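[editor's note: the cookie-then-ignore approach described above can be sketched in a few lines of Python. This is not ArchiveBot's actual code; the forum URL is hypothetical, and only the two essential pieces are shown: a shared cookie jar seeded by the `?archiveteam` request, and a predicate for skipping sid-bearing links.]

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlparse, parse_qs

# Hypothetical phpBB forum; any unique seed path works, '?archiveteam'
# just has to differ from the real homepage URL.
SEED = "https://example.org/?archiveteam"

# One cookie jar shared across all requests: the seed request sets the
# session cookie, after which phpBB stops appending sid= to links.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def has_sid(url: str) -> bool:
    """True if the URL carries a phpBB session-id query parameter."""
    return "sid" in parse_qs(urlparse(url).query)

# A crawler would do opener.open(SEED) once, then skip any discovered
# URL for which has_sid(url) is True.
```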
after running the script, you can run a dir command and do a find-and-replace to turn the files it mirrored into URLs
15:04:12 then whoever runs the script can just put it into a spreadsheet and let IA save the URLs
15:08:14 if you're still looking for channel name ideas, i propose #wispaway... "wisp away" is semi-commonly misused instead of "whisk away", which means to take away suddenly
15:12:04 nighthnh099_: We have our own tooling that can archive things much more efficiently and quickly than feeding them to IA. I can take a look. Which site is it? And please upload the list to https://transfer.archivete.am/ .
15:14:49 https://transfer.archivete.am/2mctU/urls.txt the URLs start at 4000 because that's as far as I got before I had to restart; basically the URLs are a bunch of game scripts for an app, not all of the URLs exist though; I might need help with finding the upper limit of the list because I forgot to do that
15:18:54 Yeah, the upper limit is definitely higher.
15:32:11 Quickly poked the APK but didn't see anything of relevance. Might need DEX decompiling.
15:35:00 I already did all of that
15:35:22 oh wait, do you need the script I mentioned? sorry I forgot to ask
15:36:41 No need, I'll run the list through ArchiveBot. But need to find the upper bound first.
15:37:37 archivebot skips 404s?
15:39:30 No, they'll just get archived as well.
15:40:11 oh, that's kinda messy haha
15:40:26 Well, depends on how you look at it.
15:40:34 Archiving them records that they didn't exist.
15:40:55 Whereas if you only archive the ones that exist, a future archaeologist won't know whether they were simply missed.
15:41:26 oh, my reason for not archiving them would be that it's hard to filter through them when someone in the future decides to make a local server for the game
15:42:04 It's trivial to filter that out.
15:42:29 oh how? I don't know haha
15:42:31 Especially when you work with the WARC file ArchiveBot will produce.
15:43:22 Well, the tooling for it is currently suboptimal, but it can be done with warcio and a 10-line Python script or so.
15:43:45 It'll be easier once I finish the thing I've been working on for far too long now.
15:43:59 Anyway... so how do we find the upper limit?
15:44:56 Actually, I just checked 100000 to 100099, no hits there, so I'll do up to 100k.
15:49:14 It's running, current ETA is 5-6 hours.
15:52:09 JAA: I think 97017 is the upper limit
15:52:52 thanks for running it! also worth noting that it needs to be http, not https; everything is a 404 on https for some reason
15:53:01 Yeah, I noticed.
15:53:12 Just another badly configured web server. :-)
15:54:57 oh wait a second, the mention on the site of a shutdown is just the name of a story someone uploaded
15:55:09 well, that doesn't make anything less urgent I guess
15:55:26 the app itself has been gone since 2020, so the site could shut down any day now
15:56:25 Yeah, given how small it is, no reason not to archive it anyway.
16:04:11 JAA: I have to log out now, so I guess I'll just see those URLs in the CDX at some point?
16:05:07 also, maybe you can zip up the files it mirrored and send them to me? I want a copy myself haha
16:05:26 will probably just ping once I open irc again
16:05:29 thuban: :3
16:10:03 nighthnh099_: Yes, they'll appear in the WBM eventually. The WARCs will be listed at https://archive.fart.website/archivebot/viewer/job/61ha7 eventually. We don't produce plain files, so I can't simply create a ZIP for you.
16:10:38 oh wait, I wasn't joined to archivebot, oh okay
16:11:18 warcat allows you to "unpack" WARCs though, if you need the plain files inside
16:11:28 thanks
16:11:48 Yeah, not sure how that would handle the 404s though.
17:10:19 pabs: that thing definitely goes brr too
17:16:31 RIP LBRY? https://twitter.com/LBRYcom/status/1678866789407551489
17:18:24 Not sure how much content there is to save from this
17:28:20 "30,000,000 pieces of content" interesting... hm.
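[editor's note: the upper-limit hunt above (checking 100000 to 100099 and finding no hits) amounts to scanning upward in blocks until a whole block misses. A hedged sketch of that idea; `exists(i)` stands in for an HTTP check of URL number i (e.g. "does it return 200?"), so a sparse tail with a gap wider than one block would be missed, just as in the manual probe.]

```python
def find_upper_bound(exists, start=1, block=100, max_id=1_000_000):
    """Scan IDs upward in blocks of `block`; once at least one hit has
    been seen, stop at the first block with no hits at all.
    Returns the highest ID for which exists() was True, or None."""
    last_hit = None
    i = start
    while i <= max_id:
        hits = [j for j in range(i, i + block) if exists(j)]
        if hits:
            last_hit = hits[-1]
        elif last_hit is not None:
            break  # a full empty block after a hit: assume we're past the end
        i += block
    return last_hit
```

In real use `exists` would issue a HEAD or GET request per ID; wiring that up (and rate-limiting it) is left out of the sketch.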
it’s blockchain stuff so idk lol
17:34:22 why should we worry, it's decentralized, right? :P
17:35:46 🤐
17:36:18 (it's probably centralized and only using blockchain for regulation evasion purposes)
17:38:11 i seem to recall public companies just changing their names to include AI or blockchain and their stock prices instantly shooting up
17:38:14 semi related lol
19:42:29 this is not a drill, we have a dying site https://forums.terraria.org/index.php?threads/gfycat-shutting-down-this-september.120070/
19:44:42 also, it appears twitter no longer requires account login for seeing posts? if so, i guess a warrior project is once again possible
19:45:16 only individual posts afaik
19:45:28 you can't see replies to them or what they're replying to
19:45:39 still better than the nothing that was there before
19:45:39 #deadcat for Gfycat
20:50:46 JAA: those domains rewby|backup found - do you think AB is enough for that?
20:53:19 1555 domains, might be feasible, but not sure.
20:53:31 Ryz has been feeding domains from that platform in, I think?
20:53:37 I haven't been paying a whole lot of attention.
20:54:59 looks like these sites may not be very large?
20:55:20 upcoming projects this month are:
20:55:27 Wysp ( OrIdow6 )
20:55:30 Skyblog
20:55:35 Stitcher
20:55:38 Xuite
20:55:40 and Gfycat
20:56:04 if stitcher is not huge we'll get it in with AB
20:57:01 OrIdow6: #wyspedaway for wysp
20:58:31 i'm not sure how we can help with this, but https://github.com/grossartig/vanmoof-encryption-key-exporter
20:58:48 "The Bluetooth connection between your smartphone and your VanMoof is encrypted for security purposes. Each time you log into your VanMoof account, this encryption key is being downloaded from VanMoof’s server."
20:59:09 https://kolektiva.social/@phill⊙mnc/110701490653058697 "Little birdies tell me VanMoof has officially collapsed. They'll be making a statement shortly.
20:59:10 If you own one of their bikes now is the time to grab your encryption keys before their servers go offline"
20:59:22 isn't the future great?
21:01:55 oh, i recognise the shape of the bike, so i've probably seen them. but i wasn't aware of the brand until now.
21:03:19 hah
21:03:32 sounds like all the 'smart' things about that bike will soon stop functioning
21:03:54 i wonder what smart things you need on a bike...
21:04:04 A helmet?
21:04:05 predictive braking?
21:04:08 an icecream container?
21:04:09 Ah yes, the internet of shit.
21:04:10 on your head
21:04:22 flashfire42: why would you need one of those?
21:04:30 cycling is really quite safe.
21:04:30 Magpies
21:04:36 flashfire42: avoid .au then.
21:04:44 Bit hard when I live there
21:05:17 how i avoid being swooped... i live on another continent.
21:05:24 Yts98 created Games/Engines, Platforms and Hostings (+2012, Created page with "== Engines == *…): https://wiki.archiveteam.org/?title=Games/Engines%2C%20Platforms%20and%20Hostings
21:06:23 Yts98 edited Games (+90): https://wiki.archiveteam.org/?diff=50165&oldid=46613
21:06:39 > The VanMoof S5 & A5 will just keep getting better. And better. Via over-the-air updates, we can continuously improve your bike long after your first ride. From the Halo Ring Interface to Hi-Vis Lights, this bike has revolution, built in.
21:06:43 ...
21:06:57 Off to -ot for that I guess.
21:06:59 i hope they'll change the tyres etc.
21:07:48 smart shit is a PITA; or anything with firmware. (it's rare to find a firmware updater that allows local files instead of only connecting straight to a server, and for those that do allow local files: always back up those files)
21:12:44 #stallmanwasright
21:31:28 FireonLive edited Current Projects (+38, add IRC channel for Wysp): https://wiki.archiveteam.org/?diff=50166&oldid=50156
21:32:28 FireonLive edited Wysp (+19, add IRC channel): https://wiki.archiveteam.org/?diff=50167&oldid=50158
21:35:27 Barto: a stopped clock etc.
21:49:32 Yts98 edited Games/Engines, Platforms and Hostings (+263): https://wiki.archiveteam.org/?diff=50168&oldid=50164
22:42:29 Hello JAA and arkiver, I'm basing the archiving of FutureQuest-run domains on what flashfire42 has fed me with https://bgp.tools/prefix/69.5.0.0/19#dns
22:42:53 There is a more complete list, also from bgp.tools, see above.
22:43:24 I can throw it all into queueh2ibot if it's suitable for that.
22:43:33 the one that rewby posted
22:43:44 Arkiver edited YouTube (+120, Change YouTube rules): https://wiki.archiveteam.org/?diff=50169&oldid=49723
22:44:26 here's rewb\y's list via bgp.tools (thanks rewb\y!): https://transfer.archivete.am/rgTXc/domains.txt
22:44:44 Arkiver edited YouTube (+95): https://wiki.archiveteam.org/?diff=50170&oldid=50169
22:44:48 JAA, I'm not too certain about running it automatically via queueh2ibot, because I've encountered some oddball websites that would need to be treated with that particular pipeline, and others are just geo-restricted
22:45:45 Arkiver edited YouTube (+7): https://wiki.archiveteam.org/?diff=50171&oldid=50170
22:45:46 Arkiver edited YouTube (+9, Fix formatting): https://wiki.archiveteam.org/?diff=50172&oldid=50171
22:45:59 I mean, if you'd like to run the 1555 domains manually, that's also fine with me, but it's a lot of work.
22:46:40 That is unfortunately true... ><;
22:52:46 FireonLive edited YouTube (+209, Update infoboxes): https://wiki.archiveteam.org/?diff=50173&oldid=50172
23:00:49 FireonLive edited YouTube (+20, use 2=YouTube to make infobox not so weird): https://wiki.archiveteam.org/?diff=50174&oldid=50173
23:01:38 Hmm, how about this, JAA: while I work on the one that flashfire42 gave me for now, queueh2ibot can process https://transfer.archivete.am/rgTXc/domains.txt
23:09:57 so, how do I archive 5-10GB files such that they appear in the WBM, with inter-URL payload deduplication?
23:10:42 Ryz: Sure, if you give me that list to filter out duplicates.
23:10:46 nicolas17: wget-at or qwarc
23:10:58 "such that they appear in the WBM"
23:11:02 is the core issue
23:11:07 archivebot doesn't deduplicate; qwarc would work, but then I need Approval™ to make my uploaded WARCs appear in the WBM
23:11:14 Right
23:11:20 Here it is JAA, what flashfire42 has fed me: https://bgp.tools/prefix/69.5.0.0/19#dns
23:11:33 Which includes the odd URLs that for some reason end with '.' oo;
23:11:49 those are "fully qualified"
23:11:50 Technically, all domains end with a dot.
23:11:53 ye
23:13:04 That has 1999 domains...?
23:13:51 Arkiver edited YouTube (+59, Allow archiving ads that are actually used as…): https://wiki.archiveteam.org/?diff=50175&oldid=50174
23:18:51 Arkiver edited YouTube (+0): https://wiki.archiveteam.org/?diff=50176&oldid=50175
23:19:04 anarcat: dunno if you have any bandwidth for AB jobs, but you might be interested in these new projects https://wiki.archiveteam.org/?title=Bugzilla https://wiki.archiveteam.org/?title=IRC/Logs also https://wiki.archiveteam.org/?title=Mailman2
23:20:15 rewby: See above, I'm getting 1999 results on https://bgp.tools/prefix/69.5.0.0/19#dns , i.e. more than the 1555 in your list that comes directly from the DB? Something's not right there.
23:20:55 No dupes with the trailing dot either.
23:23:03 There's little overlap, too.
23:23:34 1238 domains appear in the DB list but not on the page. 1683 domains appear on the page but not in the DB list.
23:23:52 Arkiver edited YouTube (+9): https://wiki.archiveteam.org/?diff=50177&oldid=50176
23:23:53 So only a bit over 300 overlap.
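[editor's note: the list comparison above (1238 only in the DB list, 1683 only on the page, a bit over 300 in both) boils down to set differences after normalizing the trailing dot of fully qualified names. A sketch with toy data; the function names and the example domains are made up for illustration.]

```python
def normalize(domain: str) -> str:
    """Lowercase and strip the trailing dot of a fully qualified domain name."""
    return domain.strip().lower().rstrip(".")

def compare(db_list, page_list):
    """Return (only in DB list, only on page, in both), dot-normalized."""
    db = {normalize(d) for d in db_list if d.strip()}
    page = {normalize(d) for d in page_list if d.strip()}
    return db - page, page - db, db & page

# Toy example: 'foo.example.org.' and 'FOO.EXAMPLE.ORG' count as the same domain.
only_db, only_page, both = compare(
    ["example.com", "foo.example.org."],
    ["FOO.EXAMPLE.ORG", "bar.example.net."],
)
# only_db == {"example.com"}, only_page == {"bar.example.net"}, both == {"foo.example.org"}
```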