-
thuban
(oh, the other thing is that i included external identifiers so that videos/thumbnails could be matched without relying on titles or filenames, which i don't think the other uploader did. i don't think that's a big deal, though)
-
thuban
oh FUCK i forgot to set the mediatype
-
thuban
arkiver: is there any way to get around this? i'd like to be able to use the identifier i originally wanted :/
-
JAA
I *love* working with 40 GB of JSON. Just *wonderful*!
-
arkiver
thuban: i'll change it for you
-
arkiver
ping me the item when it's uploaded!
-
thuban
arkiver: ah, thank you so much :)
-
arkiver
change the mediatype that is, not the identifier
-
thuban
-
thuban
i interrupted the upload when i realized what i'd done
-
JAA
Oh, only 35.7 GB in the end. Phew, that's much better... Please shoot me.
-
aaaaa
JAA any luck on the arcpublishing backup? it sounds like it worked? :)
-
JAA
I'm still shooting myself in the foot. :-)
-
JAA
-
JAA
Ignore the video-leaf URLs, I think. Still need to look into that more closely. But otherwise, that should be all articles from the /archive/ listings.
-
JAA
The video-leaf stuff are unique IDs for the videos, not paths on the website, it seems. Not all video-leaf 'URLs' are in that list.
-
JAA
jodizzle: 201466 videos according to my extraction (which is definitely crude).
-
JAA
And almost all of them, namely 193142, are MP4.
-
JAA
They also conveniently provide a 'filesize' field in the JSON. Not sure how accurate it is for M3U8, but summing that up gets me to 1.93 TB.
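A rough sketch of the kind of summation described above. It assumes the extracted metadata is stored as one JSON object per line and that 'filesize' is a byte count; the file layout, the helper names, and the sample records are all hypothetical, only the 'filesize' field name comes from the discussion.

```python
import json


def iter_json_lines(path):
    """Yield one parsed object per non-empty line (assumed JSON Lines layout)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def total_filesize(records):
    """Sum the 'filesize' field, skipping records where it is missing or not a number."""
    total = 0
    for record in records:
        size = record.get("filesize")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(size, (int, float)) and not isinstance(size, bool):
            total += size
    return total


# Hypothetical sample records, not real Apple Daily data.
sample = [
    {"id": "a", "type": "mp4", "filesize": 1_500_000},
    {"id": "b", "type": "m3u8", "filesize": 250_000},
    {"id": "c", "type": "mp4"},  # no filesize reported
]

print(total_filesize(sample) / 1e12, "TB")
```

For a real dump, `total_filesize(iter_json_lines(path))` would stream the file instead of loading it all at once. As noted, the reported sizes may not be accurate for M3U8 entries, so the total is an estimate.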
-
JAA
rewby: I love how your HK Apple Daily article URL list is called sorted_urls.txt and is anything but sorted. :-P
-
JAA
Looks like the /archive/ iteration is incomplete. :-( For example, /sports/20150707/UPVFKXHDPUZPUU4UJTHHQCSFCI/ does not appear on
appledaily-hk-appledaily-prod.cdn.arcpublishing.com/archive/20150707
-
orly
Hello. Sorry, was night time in Hong Kong, so probably missed a lot after I posted that txt of websites.
-
JAA
-
orly
JAA, thanks
-
JAA
1.65M URLs in my article URLs list, 3.22M in rewby's list from the sitemaps. :-|
-
JAA
But 98k (90k without video-leaf) URLs in my list that aren't in the sitemaps.
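The comparison above boils down to set differences between the two URL lists. A minimal sketch, assuming both lists are already normalised the same way (scheme, trailing slash) and that video-leaf entries can be recognised by the substring 'video-leaf'; the function name and sample URLs are made up for illustration.

```python
def compare_url_lists(mine, sitemap):
    """Count URLs unique to each list and URLs present in both."""
    mine_set, sitemap_set = set(mine), set(sitemap)
    only_mine = mine_set - sitemap_set
    return {
        "only_mine": len(only_mine),
        # Assumption: video-leaf entries contain the literal string 'video-leaf'.
        "only_mine_without_video_leaf": sum("video-leaf" not in u for u in only_mine),
        "only_sitemap": len(sitemap_set - mine_set),
        "common": len(mine_set & sitemap_set),
    }


# Hypothetical sample data.
mine = [
    "https://example.invalid/a",
    "https://example.invalid/b",
    "https://example.invalid/video-leaf/x",
]
sitemap = [
    "https://example.invalid/b",
    "https://example.invalid/c",
]

stats = compare_url_lists(mine, sitemap)
print(stats)
```

If the two lists differ only in normalisation, the 'only' counts will be inflated, so it is worth spot-checking a few entries from each difference.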
-
aaaaa
-
aaaaa
A lot of the videos aren't up anymore, but scraping metadata too is still good
-
thuban
pagination is js-triggered and done server-side via repeated POSTs, so it would not work in ab/wbm
-
thuban
but a custom scraper could do the pagination, then generate video urls to feed to archivebot
-
thuban
between the json and the video pages should get all the metadata
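A sketch of what such a pager could look like. The actual endpoint, request body, and response shape are not given here, so the request is passed in as a function; the 'items', 'id', and 'cursor' keys and the URL pattern are placeholders, not the site's real API.

```python
def collect_video_urls(post_page, make_url, max_pages=10_000):
    """Call post_page(cursor) until no next cursor is returned.

    post_page: callable that performs one POST and returns a dict with
               'items' (list of dicts with an 'id') and 'cursor' (hypothetical shape).
    make_url:  callable that turns a video id into a page URL.
    """
    urls = []
    seen = set()
    cursor = None
    for _ in range(max_pages):  # hard cap so a looping cursor cannot run forever
        page = post_page(cursor)
        for item in page.get("items", []):
            video_id = item["id"]
            if video_id not in seen:  # pages may overlap, so deduplicate
                seen.add(video_id)
                urls.append(make_url(video_id))
        cursor = page.get("cursor")
        if not cursor:
            break
    return urls


# Fake two-page response standing in for the real POST endpoint.
_FAKE_PAGES = {
    None: {"items": [{"id": "v1"}, {"id": "v2"}], "cursor": "p2"},
    "p2": {"items": [{"id": "v2"}, {"id": "v3"}], "cursor": None},
}


def fake_post_page(cursor):
    return _FAKE_PAGES[cursor]


urls = collect_video_urls(
    fake_post_page, lambda vid: f"https://example.invalid/video/{vid}"
)
print("\n".join(urls))
```

The printed list, one URL per line, is the sort of plain-text file that can then be fed to archivebot.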
-
thuban
JAA (or others): is that worth doing or would it be purely duplicative of the youtube archive work on this channel? i haven't been in #down-the-tube
-
aaaaa
were you guys able to download all the youtube videos from apple daily?
-
thuban
aaaaa: definitely a lot of them, but 'all' is unclear
wiki.archiveteam.org/index.php/Apple_Daily#Others
-
thuban
i wrote that pager just in case; it's running now
-
thuban
-
Ajay1
-
jodizzle
JAA: If I'm doing this right, it looks like my m3u8 collector only actually found m3u8s from 9097 articles. Of those, 2219 are present in the hk.appledaily.com-archive-article-urls list you sent.
-
orly
Hiya. I just realised I missed one university students' press when I posted that txt of Hong Kong stuff yesterday. It's the Chinese University Campus Radio. Facebook: cuhkcampusradio Youtube: UCg-D5uUTXfTSolC_zY5KXCg
-
Ajay1
Is there a channel for Google drive currently?
-
thuban
orly: mention the youtube channel in #down-the-tube ?
-
orly
Sure thing. But they mainly upload politically sensitive stuff on their Facebook. Their Youtube is basically just archive of their meeting recordings.
-
jodizzle
I started another appledaily video crawl via article URLs pointed at
appledaily-hk-appledaily-prod.cdn.arcpublishing.com, this time getting mp4s as well.
-
achivarin
Could you make sure to get
appledaily-hk-appledaily-prod.cdn.arcpublishing.com/video/lifestyle first? We missed that YouTube channel. Thanks.
-
achivarin
jodizzle: How did you get mp4s? Can you give me a few pointers?
-
wessel1512
avoozl it's going well with viva forum grab and will probably be done before the deadline
-
alard
To add to that: The quick threads-only grab of the viva forum finished yesterday (
archive.fart.website/archivebot/viewer/job/34m7c), so all messages should be in there. wessel's grab will be more complete with outgoing links, redirects from direct links to individual posts etc.
-
avoozl
wessel1512: cool, are there any warcs I can pick up and take a look at?
-
avoozl
alard: awesome
-
avoozl
I'll kick off some downloads and see if my parser still works
-
wessel1512
i don't know if they have been uploaded to IA yet
-
avoozl
the links in the website alard pasted are working
-
avoozl
it'll take a day or so to get those here, but at least things are moving
-
wessel1512
-
avoozl
I'll add them to the list
-
wessel1512
your list ?
-
avoozl
of things to download so I can start parsing
-
avoozl
I've been building a warc parser that extracts structured posts/threads from it so that it can be easily searched
-
avoozl
it produces json chunks like this
archive.org/details/warceater_yahooanswers and has a little tool for building search indices on top of them and host them with a generic forum skin
-
wessel1512
-
avoozl
the second one has outlinks, the first one is just the threads, right?
-
wessel1512
in reverse
-
avoozl
check
-
wessel1512
the first one has outlinks
-
avoozl
ok looking forward to having some time to play with this
-
avoozl
thanks
-
aaaaa
Are there any tools that can scan whether your file contains any personally identifiable metadata (ex: IP address, location, browser fingerprint, etc)? Would like to use one before uploading things
-
aaaaa
For example, if you download YT metadata files through yt-dl, your ip is shown in the metadata .json files
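Not a full PII scanner, but a minimal sketch of one check: looking for public IPv4 addresses in a text or JSON file before uploading. It only covers IPv4, will also flag dotted version strings that happen to parse as addresses, and says nothing about location or fingerprint data. The function name and sample text are made up for illustration.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_public_ipv4(text):
    """Return distinct publicly routable IPv4 addresses found in text, in order."""
    found = []
    for match in _IPV4_CANDIDATE.finditer(text):
        candidate = match.group()
        try:
            addr = ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a valid address
        # Skip private, loopback, and other non-global ranges.
        if addr.is_global and candidate not in found:
            found.append(candidate)
    return found


# Hypothetical metadata snippet.
sample_text = (
    '{"url": "https://example.invalid/videoplayback?ip=8.8.8.8&id=abc", '
    '"local": "192.168.1.10", "upload_date": "2021.06.24"}'
)

hits = find_public_ipv4(sample_text)
print(hits)
```

Anything this reports would still need to be reviewed by hand to decide whether it is actually the uploader's own address.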
-
Ryz
...Oh, ooh, Windows 11 to support Android apps via Amazon's appstore (wondering how it would affect archiving operations):
theverge.com/2021/6/24/22548428/mic…1-android-apps-support-amazon-store
-
JAA
jodizzle: Oops, forgot to upload this last night, here's my list of video streams from the archive pagination:
transfer.archivete.am/IJGRu/hk.appledaily.com-archive-videos.zst
-
JAA
I'll throw the MP4s into AB now but will let you handle the M3U8.
-
JAA
(There are some duplicates in this list.)
-
thuban
JAA: thoughts on the playboard metadata? (conversation between me and aaaaa above)
-
arkiver
thuban: fixed your item to mediatype image
-
thuban
arkiver: thank you!
-
thuban
if you're still planning on creating a collection for the show, it may be a good idea to move the thumbs there as well as the episode items
-
thuban
(the same user has also uploaded runs of a couple of other shows, which likewise don't have collections)
-
JAA
HK Apple Daily MP4s are now running through AB, should be about 1.5 TB.
-
JAA
thuban: Playboard is a metadata aggregator for YouTube it seems? Certainly can't hurt to grab the metadata, although we likely already have much of it in #youtubearchive.
-
JAA
Metadata is also generally tiny, so duplication isn't much of a problem there.
-
thuban
JAA: that's right. the zst i uploaded (
transfer.archivete.am/P369h/appledaily_videos_playboard_urls.txt.zst) has all the page urls and should be archivebot-ready.
-
JAA
Hmm, only 9.8k videos?
-
thuban
to all appearances that's as many as playboard knew about
-
JAA
Yeah, apparently. Weak. :-P
-
thuban
i'm checking now to see whether there's anything in the pagination json that isn't in the video page source
-
JAA
Running through AB now.
-
thuban
^ the only thing missing is the channel's profile image's youtube url:
yt3.ggpht.com/ytc/AKedOLSXcaGJFYW3dY0xIp9WOOx1JJtrDHyj909W38XbQw
-
Kaz
man
-
Kaz
I was just writing '#archiveteam-bs before we get shouted at'
-
Kaz
and THERE WE GO
-
JAA
:-)