08:20:43 euhm.. is there a simple command line tool to decompress these .warc.zst files with embedded dictionaries? I know I can read it in code, but I miss a simple cli tool to go through the content
09:46:47 hello.
09:46:53 hi
09:47:06 How do I propose a site to get saved?
09:47:14 you propose
09:47:17 go ahead and propose
09:47:26 OK, just a second.
09:47:37 https://www.hoax-slayer.net/notice-hoax-slayer-is-closing-down/
09:47:52 Hoax-Slayer, a debunking website, is going away at the end of the month.
09:47:53 looks like a blog basically
09:48:07 we can handle it with archivebot
09:49:13 It's also this site, https://www.hoax-slayer.com/
09:49:25 I think that's where a lot of the debunkings still are.
09:50:01 ah ok, we can do both domains
09:50:12 Great!
09:50:26 it's in progress -- if you're interested you can watch it at http://dashboard.at.ninjawedding.org/3 -- if not stand back and relax, it'll make its way into the wayback machine in a few days
09:50:44 Wonderful. That was a lot easier than i thought. Hooray for scripts.
09:50:53 Thank you.
09:52:16 thanks for bringing it to our attention!
11:07:10 what's the difference between bs and ot?
11:08:46 See topic
11:08:52 #archiveteam i announcements, -bs is on-topic discussion, -ot is off-topic discussion
11:08:59 I agree the nomenclature isn't exactly straightforward
11:26:02 ddg sure does a lot of work to get search results that still aren't very good
11:29:56 what's the correct channel for discussing grab-site?
11:40:55 probably here or -dev LeighR
11:41:02 ok
12:22:40 https://techcrunch.com/2021/05/03/private-equity-firm-apollo-agrees-to-buy-verizon-media-assets-for-5-billion/ (sorry if this has already been mentioned, but I didn't see it!)
12:29:43 mentioned it in #noanswers but it should also be mentioned here
14:03:42 Jake: what a mess
14:09:23 yup. quite the mess.
14:33:31 JAA: thanks, that sure explains why turning it up didn't seem to make it go faster :)
14:35:36 fortunately the data's not gone yet and i think it'll finish today, or even this morning
18:58:40 serx: (from #archiveteam) I threw http://meristation.as.com/zonaforo/ into AB to get something at least, but I agree that AB will not be able to get everything by the deadline.
19:00:52 This would need a different solution to get it all
19:39:38 Oh boy.
19:48:22 yep
19:50:10 Hold my beer.
19:57:35 You do have a bunch of people who would love to be going fuller-throttle on Y!A...
19:59:30 if Yahoo would let us
20:03:17 JAA: please no
20:03:27 please no qwarc on y!a
20:04:09 I mean, people who are probably willing to redirect that energy to whatever JAA comes up with
20:04:33 for meristation
20:05:48 HCross: I'm looking into MeriStation, not Y!A.
20:05:54 ahh
20:06:41 I am willing to look closely at logs and tell what I see
20:06:43 Have it working but seems very slow. Like 5+ second response time per page. I'll look a bit closer later.
21:34:27 https://twitter.com/olesovhcom/status/1389330380411523076 Hopefully this one won't catch on fire!
21:48:12 hi, I would like to suggest a podcast to consider for archival please, if this is the right channel for suggestions. it's called Hong Kong Connection, a weekly video documentary series by Radio Television Hong Kong (RTHK), a government news outlet. https://podcast.rthk.hk/podcast/item.php?pid=280&lang=en-US its archival extends back to 2010, but
21:48:12 there was an announcement today that the agency is moving to remove videos over 1 year old from their site, to align with their Youtube presence
21:49:47 that podcast is one of several programmes on the site, but it is arguably one of the more iconic series with historical value, as it has footage of political protests from 2019, etc.
21:52:02 news article about potential removal: https://hongkongfp.com/2021/05/03/hong-kong-broadcaster-rthk-to-delete-shows-over-a-year-old-from-internet-as-viewers-scramble-to-save-backups/
21:56:58 the show has also been subject to episodes being cut recently due to its coverage of issues perceived as sensitive by new leadership at the broadcaster
21:59:36 the Chinese version of the series has more videos: https://podcast.rthk.hk/podcast/item.php?pid=244&lang=zh-CN xml feeds with mp4 links: https://podcast.rthk.hk/podcast/hkconnection_en_i.xml (English) https://podcast.rthk.hk/podcast/hongkongconnection_i.xml (Chinese)
22:02:01 thanks for considering
22:10:42 sounds doable. i'm thinking download all video + upload to ia as one item for each language. opinions? do we want anything from the web presence? (there seems to be very little content other than the videos themselves--they don't, e.g., have individual descriptions.)
22:13:09 thuban: the xml feeds seem to provide more details for each video than the main site itself
22:13:09 https://podcast.rthk.hk/podcast/item.php?pid=280&eid=180380&year=2021&lang=en-US
22:13:18 click description
22:13:58 there only seems to be a single description for the entire podcast
22:14:30 thanks very much! yeah, the main content are the videos themselves, the site is mainly for format context (i.e. it once lived on this page and it had a dropdown menu to select year)
22:16:33 it was loading blank on 1 video and that on other, had only looked at 2 till your reply. Yea that makes it easier to just need video and title
22:16:59 I'd say that right now, any old news programs in HK are in danger of disappearing
22:17:30 LeighR: thanks, for some reason ff's rss viewer doesn't seem to want to print the details
22:18:07 that said, a lot of them seem to be cut off; if there's an authoritative source they're being drawn from i'm not sure where it is
22:18:11 in the English version, some of the details are stuck inside [CDATA] custom tags
22:19:53 LeighR: indeed, especially given the new leadership and the recent report about the agency needing "reform"
22:20:07 wait, clicking "this episode" right under video does give diff descriptions like this short one. Are the descripts in the feed?
22:20:09 返回
22:20:09 After the Lockdown
22:20:10 2021-04-18
22:20:10 The government imposed lockdown in four streets in the Jordan district in late January and conducted mandatory Covid-19 tests on the residents. This episode records this unprecedented operation.
22:21:20 ooh
22:21:28 yes - under itunes:subtitle and itunes:summary, stuck inside [CDATA ] tags
22:21:32 those are the cut-off descriptions in the feed
22:22:04 thuban: you're right
22:22:43 so need web for full ones or they are all cut off?
22:22:57 need web
22:23:24 they seem to be in the source rather than xhr'd from an api, but that's ok
22:23:29 i can pull those
22:24:23 is there anything I can do to help? stuff like trying to grab all of Y!A is fun, but this is actually important.
23:02:49 the back episodes of Letter to Hong Kong might be worth saving: https://podcast.rthk.hk/podcast/item.php?pid=162&lang=en-US it's an audio podcast of short audio clips, public figures (politicians from different parties, current affairs critics, academics, etc.) reading a letter addressed to the people. for example, this one is from a former
for example, this one is from a former 23:02:49 pan-democratic legislator who is now in prison commenting on the events leading up to his trial: https://podcast.rthk.hk/podcast/item.php?pid=162&eid=171170&year=2020&list=1&lang=en-US 23:16:05 (background: he was convicted on charges along the lines of obstruction of justice for attempting to reason with a group that rushed into a subway station and started attacking commuters. he got injured while trying to help people on the train. the incident was all over the local news with questions over the official narrative of the incident, and 23:16:06 he was later arrested.)) 23:35:08 arkiver: So I had an idea for how to save Aimix-BBS ("shut down" a few days ago, still up), based on the point that it blocks seemingly everyone but typically takes a few hundred requests to do it 23:35:16 I thought it might work to have items be single URLs, and then have the warriors queue on-targeted extracted URLs to backfeed (there should be a few million at most, not a big site) instead of running them themselves 23:35:17 This would basically turn it into a big, distributed ArchiveBot, where you have to make a project update instead of give a command in IRC to add or remove an ignore 23:36:51 nuroten: ok, i'm currently pulling thumbnails + videos + descriptions and formatting metadata in a way IA will understand 23:37:10 downloading will take a while, but everything will be set to upload once it's finished! 23:37:24 thuban: thanks so much! :) 23:37:41 no problem! 23:37:47 someone remind me to update this every six months or something 23:38:09 i'll take a look at _letter to hong kong_ later 23:39:12 if I had to choose only one thing to be saved from the RTHK website, that would be it (HK Connection). it's one of the most recognised local shows. 
23:40:09 i was going to suggest earlier that if they also delete old material from their youtube--https://www.youtube.com/channel/UC6of7UYhctnYmqABjUqzuxw--that might be worth looking into as well, but it seems like a _lot_ of content. did our news project ever extend to video?
23:41:53 thanks! anything you can save from either would be great, the announcement seems to apply to anything in their archive older than 1 year, but if I had to guess, current affairs shows would be on the main chopping board
23:55:42 OrIdow6: sounds good with me
23:56:44 but
23:57:06 JAA: didnt we have some setup with a ton of IPs that could be used for those purposes?
23:57:12 OrIdow6: how permanent are the band