00:04:25 (oh, the other thing is that i included external identifiers so that videos/thumbnails could be matched without relying on titles or filenames, which i don't think the other uploader did. i don't think that's a big deal, though)
00:06:23 oh FUCK i forgot to set the mediatype
00:08:24 arkiver: is there any way to get around this? i'd like to be able to use the identifier i originally wanted :/
00:14:40 I *love* working with 40 GB of JSON. Just *wonderful*!
00:16:03 thuban: i'll change it for you
00:16:15 ping me the item when it's uploaded!
00:16:20 arkiver: ah, thank you so much :)
00:16:27 change the mediatype that is, not the identifier
00:16:38 https://archive.org/details/rthk-podcast-hkconnection-en-thumbnails
00:16:51 i interrupted the upload when i realized what i'd done
00:18:41 Oh, only 35.7 GB in the end. Phew, that's much better... Please shoot me.
00:34:29 JAA any luck on the arcpublishing backup? it sounds like it worked? :)
00:35:46 I'm still shooting myself in the foot. :-)
00:37:42 jodizzle: https://transfer.archivete.am/TV0oq/hk.appledaily.com-archive-article-urls.zst
00:38:29 Ignore the video-leaf URLs, I think. Still need to look into that more closely. But otherwise, that should be all articles from the /archive/ listings.
00:43:31 The video-leaf stuff are unique IDs for the videos, not paths on the website, it seems. Not all video-leaf 'URLs' are in that list.
00:50:06 jodizzle: 201466 videos according to my extraction (which is definitely crude).
00:50:42 And almost all of them, namely 193142, are MP4.
00:52:30 They also conveniently provide a 'filesize' field in the JSON. Not sure how accurate it is for M3U8, but summing that up gets me to 1.93 TB.
00:54:53 rewby: I love how your HK Apple Daily article URL list is called sorted_urls.txt and is anything but sorted. :-P
00:58:01 Looks like the /archive/ iteration is incomplete. :-( For example, /sports/20150707/UPVFKXHDPUZPUU4UJTHHQCSFCI/ does not appear on https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/archive/20150707/
00:58:46 Hello. Sorry, was night time in Hong Kong, so probably missed a lot after I posted that txt of websites.
00:59:26 orly: https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292224
01:00:09 JAA, thanks
01:07:37 1.65M URLs in my article URLs list, 3.22M in rewby's list from the sitemaps. :-|
01:42:57 But 98k (90k without video-leaf) URLs in my list that aren't in the sitemaps.
02:31:28 Folks, is there any way to scrape this? https://playboard.co/en/channel/UCeqUUXaM75wrK5Aalo6UorQ/videos
02:31:46 A lot of the videos aren't up anymore, but scraping metadata too is still good
02:43:07 pagination is js-triggered and server-side by repeated POSTs, so would not work in ab/wbm
02:43:31 but a custom scraper could do the pagination, then generate video urls to feed to archivebot
02:44:20 between the json and the video pages should get all the metadata
02:46:53 JAA (or others): is that worth doing or would it be purely duplicative of the youtube archive work on this channel? i haven't been in #down-the-tube
03:05:12 were you guys able to download all the youtube videos from apple daily?
03:05:45 aaaaa: definitely a lot of them, but 'all' is unclear https://wiki.archiveteam.org/index.php/Apple_Daily#Others
03:21:59 i wrote that pager just in case; it's running now
03:36:42 https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst
04:30:10 https://workspaceupdates.googleblog.com/2021/06/drive-file-link-updates.html
06:20:26 JAA: If I'm doing this right, it looks like my m3u8 collector only actually found m3u8s from 9097 articles. Of those, 2219 are present in the hk.appledaily.com-archive-article-urls list you sent.
07:23:09 Hiya. I just realised I missed one university students' press when I posted that txt of Hong Kong stuff yesterday. It's the Chinese University Campus Radio. Facebook: cuhkcampusradio Youtube: UCg-D5uUTXfTSolC_zY5KXCg
07:23:29 Is there a channel for Google drive currently?
07:25:03 orly: mention the youtube channel in #down-the-tube ?
07:25:45 Sure thing. But they mainly upload politically sensitive stuff on their Facebook. Their Youtube is basically just an archive of their meeting recordings.
08:30:25 I started another appledaily video crawl via article URLs pointed at https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/, this time getting mp4s as well.
08:45:30 Could you make sure to get https://appledaily-hk-appledaily-prod.cdn.arcpublishing.com/video/lifestyle/ first? We missed that YouTube channel. Thanks.
09:32:30 jodizzle: How did you get mp4s? Can you give me a few pointers?
10:51:11 avoozl it's going well with viva forum grab and will probably be done before the deadline
11:16:22 To add to that: The quick threads-only grab of the viva forum finished yesterday (https://archive.fart.website/archivebot/viewer/job/34m7c), so all messages should be in there. wessel's grab will be more complete with outgoing links, redirects from direct links to individual posts etc.
11:24:45 wessel1512: cool, are there any warcs I can pick up and take a look at?
11:25:09 alard: awesome
11:26:40 I'll kick off some downloads and see if my parser still works
11:26:57 i don't know if they have been uploaded to AI yet
11:27:08 the links in the website alard pasted are working
11:27:36 it'll take a day or so to get those here, but at least things are moving
11:28:28 you can download the warcs from https://archive.fart.website/archivebot/viewer/job/ade35
11:28:56 I'll add them to the list
11:29:21 your list ?
11:29:37 of things to download so I can start parsing
11:29:54 I've been building a warc parser that extracts structured posts/threads from it so that it can be easily searched
11:30:36 it produces json chunks like this https://archive.org/details/warceater_yahooanswers and has a little tool for building search indices on top of them and hosting them with a generic forum skin
11:30:47 then it's better to use https://archive.fart.website/archivebot/viewer/job/34m7c
11:31:04 the second one has outlinks, the first one is just the threads, right?
11:31:24 in reverse
11:31:46 check
11:31:51 the first one has outlinks
11:31:54 ok looking forward to having some time to play with this
11:31:58 thanks
16:28:30 Are there any tools that can scan whether your file contains any personally identifiable metadata (ex: IP address, location, browser fingerprint, etc)? Would like to use one before uploading things
16:29:17 For example, if you download YT metadata files through yt-dl, your ip is shown in the metadata .json files
17:50:53 ...Oh, ooh, Windows 11 to support Android apps via Amazon's appstore (wondering how it would affect archiving operations): https://www.theverge.com/2021/6/24/22548428/microsoft-windows-11-android-apps-support-amazon-store
18:53:52 jodizzle: Oops, forgot to upload this last night, here's my list of video streams from the archive pagination: https://transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst
18:54:16 I'll throw the MP4s into AB now but will let you handle the M3U8.
18:56:15 (There are some duplicates in this list.)
19:00:32 JAA: thoughts on the playboard metadata? (conversation between me and aaaaa above)
19:14:28 thuban: fixed your item to mediatype image
19:15:11 arkiver: thank you!
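Regarding the 16:28:30 question about spotting an IP address in metadata .json files before uploading: no specific tool is named in the log, so the following is only a minimal sketch of one possible check, not an established utility. It walks a parsed JSON document and reports IPv4-looking strings. The function names (`find_ipv4`, `scan_json`, `scan_file`) are made up for this illustration, the check covers IPv4 only, and it will also flag harmless dotted values such as four-part version numbers, so hits need manual review.

```python
import ipaddress
import json
import re

# Loose pattern for IPv4-looking tokens; candidates are validated below.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_ipv4(text):
    """Return the set of syntactically valid IPv4 addresses found in text."""
    found = set()
    for match in IPV4_CANDIDATE.finditer(text):
        try:
            found.add(str(ipaddress.IPv4Address(match.group(0))))
        except ipaddress.AddressValueError:
            pass  # e.g. 999.1.1.1: looks like an address but is not one
    return found


def scan_json(obj, path="$"):
    """Walk a parsed JSON value; yield (path, address) for each hit in a string."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from scan_json(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from scan_json(value, f"{path}[{index}]")
    elif isinstance(obj, str):
        for address in sorted(find_ipv4(obj)):
            yield path, address


def scan_file(filename):
    """Load one JSON file (hypothetical helper) and list its hits."""
    with open(filename, encoding="utf-8") as handle:
        return list(scan_json(json.load(handle)))
```

Calling `scan_file` on a metadata file returns a list of `(json_path, address)` pairs, which can then be compared against your own public IP before the file is uploaded.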
19:17:38 if you're still planning on creating a collection for the show, it may be a good idea to move the thumbs there as well as the episode items
19:18:34 (the same user has also uploaded runs of a couple of other shows, which likewise don't have collections)
19:18:41 HK Apple Daily MP4s are now running through AB, should be about 1.5 TB.
19:19:58 thuban: Playboard is a metadata aggregator for YouTube it seems? Certainly can't hurt to grab the metadata, although we likely already have much of it in #youtubearchive.
19:20:36 Metadata is also generally tiny, so duplication isn't much of a problem there.
19:22:01 JAA: that's right. the zst i uploaded (https://transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst) has all the page urls and should be archivebot-ready.
19:23:30 Hmm, only 9.8k videos?
19:24:04 to all appearances that's as many as playboard knew about
19:24:25 Yeah, apparently. Weak. :-P
19:25:06 i'm checking now to see whether there's anything in the pagination json that isn't in the video page source
19:25:46 Running through AB now.
19:32:39 ^ the only thing missing is the channel's profile image's youtube url: https://yt3.ggpht.com/ytc/AKedOLSXcaGJFYW3dY0xIp9WOOx1JJtrDHyj909W38XbQw
22:36:47 man
22:36:57 I was just writing '#archiveteam-bs before we get shouted at'
22:37:00 and THERE WE GO
22:37:04 :-)
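The list comparison mentioned at 01:07:37 and 01:42:57 (URLs in the /archive/ list that are not in the sitemap list, with and without video-leaf entries) can be sketched as a plain set difference. This is an illustration only, not the method actually used in the channel; the function names are invented here, and it assumes the lists have already been decompressed to one URL per line.

```python
def load_urls(lines):
    """Normalise an iterable of lines into a set of non-empty, stripped URLs."""
    return {line.strip() for line in lines if line.strip()}


def missing_from(candidate_urls, reference_urls, skip_substring=None):
    """Return URLs in candidate_urls that are absent from reference_urls.

    If skip_substring is given, candidates containing it are left out,
    mirroring the "without video-leaf" count quoted in the log.
    """
    missing = candidate_urls - reference_urls
    if skip_substring is not None:
        missing = {url for url in missing if skip_substring not in url}
    return missing
```

With two open text files, `missing_from(load_urls(archive_file), load_urls(sitemap_file), skip_substring="video-leaf")` gives the set whose size corresponds to the second figure; a few million URLs fit comfortably in memory as Python sets.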