-
arkiver
JAA: i like the explanation with "classical" in it better :)
-
arkiver
:P*
-
OrIdow6
kpcyrd: What is sks?
-
thuban
ugh, i'm at the 'staring at packet dumps and recompiling curl' stage of reverse engineering and i'm not getting anywhere
-
thuban
cloudflare lets me through every time on two different browsers _and_ javascript's fetch/xhr but is batting a thousand 403ing command-line tools, and i don't know how, because every header is _the same_. it can't be one-time keys, because i can replay the same request in a browser, cache-free, after failing from curl, and it'll _still_ work.
-
thuban
what are they snooping? the handshake protocol? the http2 settings? the frame batching???
-
OrIdow6
thuban: What is this in the context of?
-
thuban
OrIdow6: a browser game i wanted to enumerate some asset urls from
-
OrIdow6
thuban: Oh
-
OrIdow6
Yeah, those sound like good ideas
-
OrIdow6
Also timings
-
thuban
i thought about suggesting that but it sounded excessive even as a joke :<
-
wizards
is the tool making a request to robots.txt
-
thuban
haha no
-
thuban
i guess i might finally be forced into trying selenium. o this age of brass
-
OrIdow6
Could try to narrow it down by trying some more of the few non-Chrome-based browsers still around
-
Jake
I'd give it a try if you want to DM me the game
-
wizards
would you be willing to share a link to the game publicly?
-
thuban
i'm gonna see if i can get some decrypted dumps from another browser first
-
thuban
yes, i can; no, no obvious answers. i think i'll think about this some more
-
pcr
thuban: doesn't look like it's been updated recently but this could be a little help
github.com/Anorov/cloudflare-scrape
-
wizards
pcr: that's not maintained anymore
Anorov/cloudflare-scrape #406
-
pcr
Yeah, it'll need an update to work, but it's a starting point
-
masterX244
Good that china can't mess with AT pulling a backup of stuff they don't like....
-
Vista2003
RIP Apple Daily 1995 - 2021
-
Vista2003
Upcoming deadlines:
-
Vista2003
Tomorrow - end of updates from Apple Daily
-
Vista2003
"No later" than the 26th - end of Apple Daily's site
-
kpcyrd
-
h3ndr1k
I don't think we can do anything about sks-keyservers. To my knowledge you cannot list all keys from a keyserver, and it is unlikely they will provide a dump, as it seems they shut down because of too many GDPR requests.
-
grawity
hmm, actually, I was fairly sure a few operators *do* provide dumps as a way to bootstrap a new server
-
grawity
but I'm not sure if it's really at risk -- it's not the keyservers that were shut down, and most of them aren't run by the pool's operator
-
h3ndr1k
huh, interesting. So just the pool operator shut down? I could not find much information about it. The website's SSL certificate expired in April or so and there is only a notice that some pool DNS records were removed.
-
h3ndr1k
maybe someone can just run these urls and the website through archivebot.
-
EggplantN
done
-
Vista2003
-
EggplantN
yep
-
Vista2003
What's the status of the Apple Daily backup?
-
EggplantN
"enough"
-
Vista2003
And what does "enough" include?
-
EggplantN
-
Vista2003
ah ok
-
Vista2003
hk.appledaily.com/local/20210623/WSI6PSB2EFCO5JAUMLLZOP4RGM The website shutdown date is today at 23:59 HKT or 15:59UTC
-
Jake
-
achivarin
Have the outlinks on this page been saved?
hk.appledaily.com/member Some of those are on different domains like nextdigital.com.hk
-
EggplantN
Yep Jake. That was why we had an outage yesterday
-
EggplantN
trackerproxy relies on BunnyCDN
-
» HCross ducks
-
HCross
and runs away
-
Jake
thought so! interesting postmortem.
-
EggplantN
We used CloudFlare but they’re not up to what we need
-
JAA
thuban: No idea if Cloudflare does it as well, but I believe Google analyses TLS at the bit level to detect different implementations. Since browsers use their own libraries, they behave ever so slightly differently than curl, wget, etc. with OpenSSL or GnuTLS.
-
JAA
Just as another idea of what could be going on.
-
kpcyrd
-
grawity
thuban: cloudflare does profile the TLS handshake, and might block you if yours is significantly different than what it expects from the User-Agent
-
grawity
thuban: e.g. I've discovered that if you're using python-requests, it deliberately disables "session tickets" for its TLS connections, and together with a fake User-Agent it might trip the block -- as in
upbit/pixivpy #171#issuecomment-860264788
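For illustration only (not from grawity or Cloudflare): JA3 is one published scheme for condensing a TLS ClientHello into a fingerprint. It joins the TLS version, cipher suites, extensions, supported groups and point formats as decimal values and takes an MD5 of the result. Whether Cloudflare uses this exact scheme is an assumption, and the numbers below are made-up sample values. The sketch only shows why dropping a single extension, such as session tickets (extension 35), yields a different fingerprint.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string: five comma-separated fields, values dash-joined."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(*fields):
    """MD5 hex digest of the JA3-style string."""
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


# Hypothetical ClientHello values, chosen only for the demonstration.
with_tickets = (771, [4865, 4866], [0, 23, 35, 10, 11, 13], [29, 23], [0])
# Same hello with the session_ticket extension (35) left out.
without_tickets = (771, [4865, 4866], [0, 23, 10, 11, 13], [29, 23], [0])

print(ja3_string(*with_tickets))
print(ja3_hash(*with_tickets))
print(ja3_hash(*without_tickets))
```

The two hashes differ even though every HTTP header could be identical, which is the kind of mismatch a server could compare against the claimed User-Agent.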
-
hexa-
whew
-
JAA
OrIdow6: 16 hours remaining until GREE does whatever they'll be doing.
-
h3ndr1k
kpcyrd: Thanks, might read it later
-
nuroten
hi, is archiveteam already aware of the Hong Kong-based Apple Daily newspaper closing?
-
grawity
looks like it, based on them having a whole wiki page
wiki.archiveteam.org/index.php/Apple_Daily
-
nuroten
grawity: great, thanks :)
-
nuroten
"June 21 2021, the newspaper announced that it is likely to shut down soon" - fwiw, it has been announced the last print edition is this thursday
-
Jake
yup, we haven't updated the wiki page yet.
-
nuroten
I don't know how long the online version will be up after that, given the asset freeze means they are having trouble paying their vendors
-
nuroten
thanks archiveteam!
-
orly
Apple Daily's youtube channel just 404'd.
-
nuroten
hope the account wasn't ... compromised
-
rewby
I think we've already run them through some emergency archiving
-
nuroten
is it possible / okay to suggest a youtube channel or a website (mainly for text and images) for contingency archiving? I have 2 websites in mind that are safe for now, but after Apple Daily, they may eventually be targeted
-
rewby
You can suggest them
-
rewby
I think we're watching the last moments of appledaily's online presence:
en.appledaily.com
-
nuroten
Youtube channel: D100 - they are a listener/donor-supported public radio in Cantonese, d100.net is the website. they frequently interview local political commentators, professionals, pro-democracy legislators (well, former now) and activists
-
rewby
Yep, hk. also just went down
-
nuroten
rip Apple Daily
-
nuroten
English-speaking digital news: Hong Kong Free Press
hongkongfp.com
-
nuroten
Apple Daily was basically the last major pro-democracy news outlet ... online media would most likely be the next targets
-
nuroten
alongside HKFP, there's Stand News
thestandnews.com/english
-
rewby
You don't have to pick just english things
-
rewby
We archive pretty much anything
-
orly
If I may, a few suggestions: Stand News, another big target (website, youtube, and facebook: thestandnews.com); SocREC, loads of in-the-crowd livestreams with little commentary (multiple youtube channels, one per reporter, e.g. UCg1-HnZBBnpB82g6saKc5FQ); Polymer, one of the bigger publications of the more localist side of the spectrum (website and
-
orly
facebook: polymerhk.com); Hong Kong Free Press, formed from people leaving SCMP (website and facebook: hongkongfp.com)
-
arkiver
nuroten: orly: list all you know!
-
orly
Do you want full URLs or would just names suffice?
-
rewby
Ah neat, I can get a sitemap out of hongkongfp. Generating a list of urls right now...
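For illustration only (not rewby's actual tool, which isn't shown in the channel): a minimal sketch of pulling URLs out of a sitemap, assuming the site serves a standard sitemaps.org XML document. The sample document and its example.org URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return every <loc> value from a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; a real run would fetch the site's sitemap instead.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/2021/06/23/first-post/</loc></url>
  <url><loc>https://example.org/2021/06/24/%E9%A6%99%E6%B8%AF/</loc></url>
</urlset>"""

urls = sitemap_locs(sample)
print("\n".join(urls))
```

A sitemap index uses the same `<loc>` element for its child sitemaps, so the same function can be applied once more to each returned URL's content.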
-
arkiver
URLs to websites, easier than names. if the list is large, you can upload a txt file to transfer.archivete.am and post the URL here
-
arkiver
thanks rewby
-
nuroten
okay :) though I also suggested the English ones because more people can read and understand the contents
-
nuroten
a number of former journalists/radio show hosts, pro-democracy legislators, etc. have youtube channels and patreon accounts. as orly pointed out, facebook is where a lot of it is (I don't do facebook myself, but maybe I can look up some youtube channels if that's of interest?)
-
nuroten
-
rewby
That's the urls I get out of sitemaps
-
rewby
I should really turn this into an IRC bot
-
nuroten
it's a current affairs show, the first 2/3 talks about the day's big news/topics and usually they interview 1-2 guests for commentary/industry insight depending on topic, the last 1/3 is a listener phone-in segment
-
nuroten
rewby: fantastic :)
-
rewby
I don't claim these are complete
-
rewby
I just claim that's what I get out of sitemaps
-
rewby
These urls are a fun mess of unicode
-
rewby
I wonder if that's gonna break anything
-
EggplantN
want these yeeting into urls?
-
rewby
Maybe just in case?
-
rewby
Can't hurt to have them archived
-
arkiver
yeah lets put them in #//
-
nuroten
Wall-fare is a group formed to help incarcerated people and their families, they provide services and raise awareness of poor prison conditions. mentioning it as a historical thing as it was formed in response to the influx of pro-democracy people being incarcerated and handles letters sent to them by the public
-
nuroten
-
nuroten
I haven't checked, but their facebook page posts may have some perspective on prison conditions, what happened to the activists who were convicted and so on
-
orly
Right. I've got a long list. Including online press, online radio, and student press from various universities.
-
orly
-
nuroten
Civil Human Rights Front - not sure how long their website will be up, it was disbanded recently after being investigated for potential NSL violation, the main convener himself has a few lawsuits ongoing
civilhrfront.org
-
nuroten
(this is an alliance of individuals and groups that organise the annual July 1 marches)
-
nuroten
sorry, not lawsuits, cases/charges rather. the CHRF organised peaceful protests to petition for democracy and all that
-
nuroten
orly: you're very organised :) ... I only have a bunch of names and things in my head
-
rewby
I'll go through the list to find some urls
-
rewby
More URLs!
-
rewby
EggplantN: Can you yeet those into //?
-
nuroten
alliance.org.hk this org organises the annual June 4th vigil, leader was arrested shortly before June 4th this year and released on bail, no idea how long the org will be around
-
nuroten
64museum.blogspot.com the website of the museum they run. museum was forced to close (temporarily?) on account of not having a permit or something. one on wordpress and one on blogspot, may be safe but don't know if they will force a takedown like what happened with the hkcharter website and wix
-
nuroten
-
EggplantN
rewby what are the tar.gz files
-
nuroten
that's all for now, thanks! orly's list is a pretty good one to go on already
-
rewby
EggplantN: raw output from my tools. You only care about the urls.txt files
-
EggplantN
ok plz dont name them all urls.txt in future
-
EggplantN
lol
-
EggplantN
ok added
-
EggplantN
waiting for backfeed/the websocket to show me they went through
-
EggplantN
done rewby orly nuroten :)
-
EggplantN
apart from those last 3
-
nuroten
EggplantN: thanks muchly! :D
-
rewby
It's not every site you mentioned
-
EggplantN
-
rewby
Just the ones with sitemaps
-
EggplantN
go brrrrrrrrrrrrrr
-
EggplantN
aight
-
jodizzle
Did anyone put the articles from tw.appledaily.com through #//?
-
thuban
JAA, grawity: thanks for the comments
-
nuroten
jodizzle, in progress according to the wiki page
-
JAA
No, the AB crawl is in progress. Not aware of anyone having thrown it into #//.
-
thuban
seems like maintaining a bypass might be a full-time project. i don't need speed for this particular application, so i punted and used selenium/chromedriver (need to strip "Headless" from the user agent if you're running headless, but that's all)
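For illustration only (not thuban's actual script): a minimal sketch of the user-agent fix described above. It assumes the `selenium` package and a matching chromedriver are installed. The helper names are hypothetical, and the sample user agent string is just an example of what headless Chrome reports.

```python
def strip_headless(user_agent):
    """Remove the 'Headless' marker that headless Chrome adds to its user agent."""
    return user_agent.replace("Headless", "")


def make_headless_driver():
    """Start headless Chrome that reports a regular Chrome user agent.

    Not called below, so the example runs without selenium installed.
    """
    from selenium import webdriver

    # First, ask a throwaway headless browser what it reports by default.
    probe_options = webdriver.ChromeOptions()
    probe_options.add_argument("--headless")
    probe = webdriver.Chrome(options=probe_options)
    default_ua = probe.execute_script("return navigator.userAgent")
    probe.quit()

    # Then start the real browser with the cleaned-up user agent.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--user-agent=" + strip_headless(default_ua))
    return webdriver.Chrome(options=options)


sample_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "HeadlessChrome/91.0.4472.114 Safari/537.36"
)
print(strip_headless(sample_ua))
```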
-
nuroten
oh sorry, misread
-
jodizzle
tw.appledaily.com doesn't appear to have sitemaps, annoyingly. Would have to get the articles from /archive/, maybe.
-
Frogging101
youtube is really kicking the goose lately
-
Frogging101
First the age gating, now this unlisted→private thing
-
Frogging101
bullshit
-
Ryz
RIP that particular website that keeps track of YouTube unlisted videos s:
-
thuban
nuroten and/or orly: i still have the downloader i wrote for RTHK podcasts; any i should be working on now?
-
EggplantN
Google seems to be on a general cleanup right now
-
nuroten
thuban: is it specifically tailored for RTHK podcasts?
-
thuban
nuroten: yes
-
thuban
(i can of course write scrapers for other publishers' media if it's important, but this is what i happen to have on hand)
-
nuroten
Headliner, that show may or may not disappear soon. production's been axed and the producers on contract didn't get a renewal (read: fired)
podcast.rthk.hk/podcast/item.php?pid=272&lang=zh-CN
-
nuroten
it's a parody current affairs show, but apparently the team does rigorous fact-checking
-
thuban
nuroten: do you have aurl for the rss feed? i've forgotten where it's hidden
-
thuban
*a url
-
nuroten
-
thuban
thanks!
-
nuroten
(it's in a menu after pressing the orange button under the show title/square icon)
-
nuroten
Hong Kong Letters (CN version) was another one someone requested in that spreadsheet from a while back
podcast.rthk.hk/podcast/item.php?pid=42&lang=zh-CN podcast.rthk.hk/podcast/hkletter.xml
-
thuban
running Headliner now
-
nuroten
those are the two main ones aside from HK Connection, if you think it's valuable, there's also the news in sign language
podcast.rthk.hk/podcast/tv_newsreview_i.xml
-
thuban
looks like video download is working fine out of the box, but i'll keep an eye on it in case older eps use yet another format
-
thuban
the only potential issue is that i'm getting the title from the rss xml and the description from the episode page html, and there seems to be an encoding difference... fortunately i can just leave it running and re-grab the metadata later
-
nuroten
sounds good
-
nuroten
this mini-site might be worth backing up, it has video clips of history from 50s to present. it follows a different format and so on from the RTHK Podcasts section so may be better to download as a regular site
app4.rthk.hk/special/rthkmemory
-
thuban
fixed the encoding issue! it was my bad
-
thuban
looks like that site is pretty js-heavy, so would not work well in ab. i can take a look at it later though
-
nuroten
yeah, whatever you can pull is fine, specifically the clips in the Major Events category. there are some other cultural things that might be nice to have but maybe not the first thing I would grab personally
-
nuroten
-
nuroten
thanks :)
-
nuroten
clips from the first 3 episodes of Headliner ever
app4.rthk.hk/special/rthkmemory/programme/34
-
nuroten
for music fans, top 10 popular songs starting from the 70s and 80s
app4.rthk.hk/special/rthkmemory/programme/33 (these last 3 links are from the innovation/programme category)
-
OrIdow6
Would it be possible to have a very quick project for GREE set up in 20 minutes or so?
-
HCross
GREE?
-
OrIdow6
Japanese social network
-
OrIdow6
Among other things
-
OrIdow6
See deathwatch, date was apparently moved up
-
OrIdow6
So less time than I thought
-
AK
-
AK
Time to archive the madman
-
thuban
why, what happened now
-
AK
"Spanish media reporting that John McAfee committed suicide in a Spanish jail cell after he was cleared to be extradited to the U.S."
-
AK
That
-
AK
Umm
-
AK
Didn't expect it to be that honestly
-
AK
I was expecting bitcoin and cocaine
-
AK
-
OrIdow6
HCross: Am I right in saying that arkiver is needed to do backfeed? It's not essential here since it seems that most/all of the publicly-accessible pages are in robots and the list page anyhow, but nice to have
-
HCross
yes
-
OrIdow6
Ok
-
arkiver
OrIdow6: are you already working on this?
-
OrIdow6
arkiver: Yes, mostly done
-
arkiver
else I will try to setup a project for that now
-
OrIdow6
Since it's fairly simple
-
arkiver
alright ping me when it's somewhere
-
OrIdow6
Ok
-
arkiver
OrIdow6: from what i see, all posts are under a username
-
nuroten
thuban: there's a series called Hong Kong Stories (CN: 香港故事) with a different theme each season. it's about everyday people, some of them artisans, farmers, small business owners, etc. there are 10+ of them if you search the CN title, but here's one about food (the subtitle translates roughly to "thinking of the taste of home")
-
nuroten
-
arkiver
so i guess just discovery of users while crawling is needed
-
arkiver
i see some accounts are behind a login
-
OrIdow6
arkiver: Most users seem to be private
-
OrIdow6
We have 2 lists of what may or may not be all public users
-
arkiver
AK: damn, didnt expect that either
-
arkiver
OrIdow6: perfect
-
arkiver
will get it setup and started as soon as you have it ready
-
Kaz
john mcafee.
-
AK
Yep
-
KRG
extradition seemed to be too much for him
-
Kaz
understandable
-
HCross
let me know when
-
HCross
will go hard
-
HCross
and fast
-
arkiver
HCross: will ping
-
arkiver
OrIdow6: do you have the list of users somewhere?
-
OrIdow6
arkiver: It was in the form of URLs
-
OrIdow6
Let me find them
-
OrIdow6
transfer.archivete.am/2Bw3b/gree_all.txt transfer.archivete.am/HYWQ4/gree.txt , URLs from sitemap and from scraping the user list page, respectively, neither done by me, still need to be parsed
-
OrIdow6
If someone other than me wants to do it, format is user:username
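For illustration only (not the parsing that was actually run): a minimal sketch of turning a URL list into `user:username` items. The URL shape is an assumption. It treats the first path segment of a `gree.jp` URL as the username, and the real lists may look different. Lines from other hosts are skipped.

```python
from urllib.parse import urlsplit


def to_items(urls, host="gree.jp"):
    """Convert profile URLs into sorted, de-duplicated 'user:<name>' items.

    Assumes (hypothetically) that URLs look like https://gree.jp/<username>/...
    """
    items = set()
    for url in urls:
        parts = urlsplit(url.strip())
        if parts.hostname != host:
            continue  # not a GREE URL, ignore it
        segments = [s for s in parts.path.split("/") if s]
        if segments:
            items.add("user:" + segments[0])
    return sorted(items)


# Hypothetical input lines; only the username comes from the channel.
sample_urls = [
    "https://gree.jp/kakei_toshio",
    "https://gree.jp/kakei_toshio/blog/entry/1",
    "http://example.wikidot.com/some-page",
    "https://gree.jp/",
]
print(to_items(sample_urls))
```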
-
arkiver
yeah i'll parse them
-
arkiver
thanks
-
OrIdow6
Thank you
-
AK
Can someone with voice in ab do McAfee's twitter?
-
AK
Do we archive articles about people's deaths?
-
arkiver
yes
-
arkiver
or is this a 'how' question?
-
AK
Naah it was a do we, I think I've worked out what to do now
-
arkiver
right, so policy question. answer is yes!
-
arkiver
OrIdow6: if gree can handle high load (we'll know when HCross is on the project), i'll put the URLs in #// as well most likely
-
arkiver
though warrior project first
-
HCross
warrior first please
-
HCross
I don't like running // unless I have to
-
arkiver
like i said :)
-
OrIdow6
arkiver: Feeling is that it'll be rickety
-
OrIdow6
I think this was something that had its heyday about 13 years ago or so
-
arkiver
well warrior first, so we'll see
-
HCross
are we talking pentium 4 servers in a closet somewhere
-
HCross
in Japan
-
HCross
cc rewby
-
OrIdow6
-
arkiver
thanks OrIdow6
-
AK
If you want to put stuff in #// I can spin up some stuff on that
-
arkiver
we can do a channel, not sure if it's needed
-
OrIdow6
A good test item is user:kakei_toshio
-
arkiver
OrIdow6: is it just me or is there a lot of wikidot stuff in there
-
arkiver
will filter that out now
-
arkiver
should be running in a few
-
HCross
arkiver: let me know when code is ready
-
OrIdow6
Yeah, looks like I did leave a bit in
-
HCross
and I'll get underway
-
arkiver
OrIdow6: no worries, checking it now
-
arkiver
HCross: yeah, rewby EggplantN for target
-
arkiver
archiveteam_gree_
-
arkiver
gree_
-
arkiver
Archive Team GREE:
-
Jake
do we have a channel for GREE or just sticking to -bs?
-
arkiver
may be good yeah
-
arkiver
ideas?
-
rewby
We need targets?
-
AK
#greedy
-
arkiver
nice
-
EggplantN
Rewby or deploy FMT
-
EggplantN
I’m afk
-
rewby
EggplantN: nvme is overloaded and I've not got SSH on any of your boxes
-
rewby
We've deployed two CPX31s
-
rewby
Hopefully enough
-
SketchTheCow
Hey, people
-
SketchTheCow
I'm doing this game event thing today, then I turn back to general high focus.
-
SketchTheCow
Arkiver's getting most of the IA-Archiveteam integration/work done these days, but I'm around.
-
aaa
is this channel for backing up apple daily?
-
rewby
Apple Daily is mostly down already. We grabbed what we could.
-
aaa
I think there are still some links that are up
-
aaa
That can still be backed up
-
JAA
Any examples?
-
rewby
If you list them we'll do our best to get 'em
-
aaa
1 min
-
aaa
were you guys able to download the vids in this txt file here?
-
aaa
Videos (M3U8 and TS segments) from article pages extracted by User:Jodizzle in several parts: Part 1 Saved! with ArchiveBot on 2021-06-22:
transfer.archivete.am/15b9yl/hk.appledaily.com-m3u8s-expanded.1.txt and job:atm5u7fjmgegw508c90ty32wi Part 2 Saved! with ArchiveBot on 2021-06-22/23:
-
aaa
transfer.archivete.am/RZHFJ/hk.appledaily.com-m3u8s-expanded.2.txt and job:183qpki4h8e40cswj2035wqmf Part 3 Saved! with ArchiveBot on 2021-06-23:
transfer.archivete.am/OIkBX/hk.appledaily.com-m3u8s-expanded.3.txt and job:5ue8wjnyg1gbg1g7x420b5gpg Part 4 Saved! with ArchiveBot on 2021-06-23:
-
aaa
transfer.archivete.am/CKU9J/hk.appledaily.com-m3u8s-expanded.4.txt and job:91rs9mykjwyxmj5vekc8ol1qf Part 5 In progress... with ArchiveBot on 2021-06-23:
transfer.archivete.am/11wYuN/hk.appledaily.com-m3u8s-expanded.5.txt and job:cxp0gi0dive8hio7t156y9o3r More parts Upcoming...
-
aaa
thanks a lot for your support btw
-
JAA
I mean, it says 'Saved!' and 'In progress...' there.
-
aaa
Yeah I was confused if it meant the links are saved or the actual vid .ts files are saved.
-
JAA
The videos are.
-
aaa
oh wonderful
-
JAA
Part 5 actually finished by now as far as I can see.
-
nuroten
orly: FactWire
factwire.org/?lang=en and HKPORI
pori.hk/?lang=en will probably be safe for longer, but who knows ... in case you want to add it to your list
-
nuroten
612 Humanitarian Fund offers legal advice, financial assistance, etc. for people charged from the 2019 protests
612fund.hk/en/home facebook.com/612Fund/reviews
-
nuroten
after what happened with CHRF some of these orgs on the front lines of providing assistance, organising vigils, etc. might be at risk
-
AK
Added the website now, gonna look at facebook too
-
nuroten
AK: fantastic, thanks
-
aaa
This apple daily site is still up:
nextdigital.com.hk
-
aaa
JAA rewby ^
-
rewby
Didn't we throw that into archivebot?
-
rewby
Ah it appears we have not done so
-
rewby
But I don't have the perms to do it
-
rewby
So leaving that for JAA / other AB operators
-
rewby
Oh wait no we did put it in
-
rewby
and it's finished
-
rewby
Helps if I don't typo into /grep
-
JAA
Yup, added to the wiki page.
-
aaa
-
JAA
Hmm
-
AK
nuroten, website is done :)
-
nuroten
AK: whoa, thanks :D
-
JAA
jodizzle: I should soon have a complete list of all videos on hk.
-
JAA
aaa: Nice find, thanks. That's the complete site it seems. :-)
-
aaa
JAA any help you guys need to archive that site? happy to help as much as I can
-
arkiver
if someone here knows about physics material in Hong Kong being removed, please ping me!
-
arkiver
especially if you are able to get access to the material that will be discarded
-
JAA
I'm crawling through the /archive section right now to collect all articles and videos. Apparently nobody did that before, or at least I haven't seen anyone mention it.
-
arkiver
of course, safety first
-
JAA
That's on HK Apple Daily.
-
aaa
arkiver what physics material are you referring to?
-
arkiver
ugh
-
arkiver
physical
-
arkiver
newspapers, books, DVDs, whatever
-
arkiver
but it's probably too late to get stuff like that out of the country
-
arkiver
but if you know of that, please let me know
-
aaa
Yeah good point, haven't heard anything about that happening yet, but cannot rule anything out of course with the current situation
-
arkiver
i can imagine that with various institutions closing, there would be archives and small (company) libraries closing as well
-
aaa
Yeah it's definitely in the realm of (high) possibilities
-
aaa
Is there any hope to download YT videos that are still online, but private?
-
arkiver
ping me if you find out anything!
-
arkiver
also nuroten orly on physical material ^
-
nuroten
I don't have access, maybe people on LIHKG would know or have something
lihkg.com
-
nuroten
I heard a lot of people will be grabbing the last print edition of Apple Daily first thing on thursday, 1 M copies to be printed
-
aaa
JAA you're crawling through /archive section of the arcpublishing site?
-
JAA
aaa: Yes
-
aaa
thank you!
-
JAA
Half-way done or so.
-
aaa
Do you guys have a guide on how you do these archives? for future reference haha
-
arkiver
nuroten: im hoping we may still be able to take archives out of hong kong
-
JAA
I'm using qwarc for this, which is completely undocumented.
-
arkiver
nuroten: oof, yeah lihkg is a bit hard to read
-
arkiver
what is lihkg?
-
aaa
arkiver popular hk forum, think hk's version of reddit
-
jodizzle
JAA: Great! I'm probably not iterating the articles at all fast enough on my setup.
-
JAA
jodizzle: The /archive/YYYYMMDD/ pages have all the M3U8 URLs. :-)
-
JAA
So no need to retrieve each article page.
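For illustration only (JAA's crawl used qwarc and is not shown in the channel): a minimal sketch of enumerating the `/archive/YYYYMMDD/` pages for a date range. The `https` scheme and the function name are assumptions. The host and path pattern come from the messages above.

```python
from datetime import date, timedelta


def archive_urls(start, end, host="hk.appledaily.com"):
    """Yield one /archive/YYYYMMDD/ URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield f"https://{host}/archive/{day:%Y%m%d}/"
        day += timedelta(days=1)


for url in archive_urls(date(2021, 6, 22), date(2021, 6, 24)):
    print(url)
```

Each of those pages can then be fetched once and searched for M3U8 URLs, instead of requesting every article individually.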
-
JAA
How many videos did you discover?
-
jodizzle
Ahh okay. I think I had noticed there being a 'digest' on tw.appledaily.com /archive/YYYYMMDD/ pages, but I didn't think to look in detail for videos, or on hk.appledaily.com. Nice!
-
jodizzle
It looks like my five lists have 27,438 m3u8s. But many of those are different qualities of the same video.
-
jodizzle
I'll leave my thing collecting until you confirm that you have everything.
-
JAA
Well, I'm not sure I trust them that /archive/ lists everything. But I can get you a list of article IDs listed there to compare.
-
jodizzle
Sounds good
-
aaa
There may be some things here that can still be downloaded: ml-welcome01.nxtdig.com.hk/stage/
-
aaa
Ex: ml-welcome01.nxtdig.com.hk/stage/f54332d768439dfbf720661800624b42e7244fb2
-
arkiver
aaa: if you happen to see something interesting on lihkg, can you please let us know here? also regarding physical materials
-
JAA
Surprisingly large. Just a WARC of the /archive/ pages from 2000 to today is 3.8 GB after compression.
-
aaa
arkiver sure
-
thuban
aaa: you mentioned lihkg; do you read chinese?
-
aaa
yh
-
nuroten
-
nuroten
it's a food show by NEXT TV apparently
-
aaa
nuroten that is not apple daily hk
-
aaa
that is a taiwan outlet, so no need to backup
-
thuban
aaa: they have some threads and documents of rthk stuff they're trying to save (since the deleting-old-material warning went up a little while back)
-
thuban
i wrote a high-quality scraper and i'd like to coordinate, but the lists are pretty hard to navigate through google translate. one sec, i'll grab the links
-
nuroten
aaa: oh okay, good to know, thanks ... someone on LIHKG suggested two other channels to back up, that was one of them
-
thuban
-
nuroten
(and yeah, it's not Apple Daily, I wasn't sure how long the tw-produced content will stay up, even with tw-based partner)
-
aaa
thuban those spreadsheets are for RTHK, some controversial RTHK content has already been deleted as of 1 mth ago
-
thuban
aaa: yeah, i got the entire english-language hong kong connection backlog at that time.
-
aaa
nice
-
thuban
i realize that it's not as time-sensitive as apple daily content now, but i still have the downloader and i'd like to get anything else that may be in danger in the future if i can
-
nuroten
aaa: thuban helped save some of the RTHK podcasts back in May when RTHK announced it was taking down content older than 1 year so their website and social media are aligned (whatever that meant)
-
nuroten
a few of us were worried at the time it will also affect their archives, not just youtube
-
aaa
yeah it was just a bullshit excuse lol nuroten
-
nuroten
yeah ... so now if they decide to quietly drop content, we will hopefully be more prepared
-
JAA
Ewwwww. These /archive/ pages on hk.appledaily.com each contain a ~5 MB JSON object. And some of the keys are themselves JSON strings. Disgusting...
-
thuban
yo dawg, i heard you like...
-
JAA
{ ..., "{\"feedOffset\":0,\"feedQuery\":\"taxonomy.primary_section._id:\\\"%2Fdaily%2Fentertainment\\\"+AND+type:story+AND+(editor_note:\\\"20180111\\\"+OR+display_date:[2018-01-11T16:00:00Z||-24h+TO+2018-01-11T16:00:00Z])\",\"feedSize\":100,\"sort\":\"location:asc\"}": { ... } }
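For illustration only (not JAA's code): a minimal sketch of unpacking an object whose keys are themselves JSON strings, as in the snippet above. The sample data is a shortened, hypothetical stand-in for the real ~5 MB page object, and the function name is made up.

```python
import json


def decode_json_keys(obj):
    """Return (decoded_key, value) pairs for every key that is a JSON object string."""
    decoded = []
    for key, value in obj.items():
        try:
            parsed = json.loads(key)
        except json.JSONDecodeError:
            continue  # ordinary key, not embedded JSON
        if isinstance(parsed, dict):
            decoded.append((parsed, value))
    return decoded


# Build a small sample with json.dumps so the escaping is generated, not hand-typed.
inner_key = json.dumps({"feedOffset": 0, "feedSize": 100, "sort": "location:asc"})
page = json.loads(json.dumps({"plain": 1, inner_key: {"items": []}}))

for query, payload in decode_json_keys(page):
    print(query["feedSize"], query["sort"], payload)
```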
-
Jake
I wonder if zstd would compress it nicely...
-
JAA
I'm sure I could train a dict that would absolutely shred it.
-
thuban
hm, looks like someone else has uploaded hkc (the entire run? not sure, but would bet) to ia as individual episodes
-
thuban
i was planning to upload everything as one item; should i still?
-
arkiver
thuban: how many videos?
-
thuban
arkiver: have i? 297 (with thumbnails)
-
arkiver
from appledaily?
-
nuroten
are the ones uploaded also the English version?
-
thuban
no, this is the rthk stuff we were doing last month
-
thuban
they appear to be
-
nuroten
I kind of like the idea of them as one item where it generates a playlist and can be viewed sequentially, is that also available if they're all individual inside a series collection?
-
thuban
no idea. they're not currently in a collection, though
-
nuroten
but not picky as long as there are copies ... same resolution?
-
thuban
looks like, yeah
-
nuroten
okay ... I remembered there was some weird thing with some of them having different res depending on whether they were from akamai or archive
-
nuroten
they make more sense to me as 1 item (or some way to group them as a set) but yeah, up to you, thanks for saving those :)
-
thuban
yeah. i did a convenience sample of one very new one and one very old one; both were the same size as mine. i think there was an intermediate phase between thos ebut don't recall the details (and it looks like we used very similar methods)
-
thuban
*those but
-
arkiver
thuban: if it's 297 videos, then let's do individual items if it's not too much more work
-
arkiver
let me know when they're up and I'll put them in some AT collection
-
arkiver
make sure you have good/correct metadata
-
thuban
arkiver: someone else already upped them as individual episodes
archive.org/details/@kwc114
-
thuban
metadata looks good but they're not in a collection--iirc only the uploader can put them in one; is that right?
-
arkiver
or I can
-
thuban
oh, cool
-
arkiver
or someone else at IA
-
thuban
i do note that they don't seem to have grabbed the original thumbnails. ia generates its own, so not a big deal from a usability perspective, but maybe i should upload them for preservation?
-
arkiver
yes
-
arkiver
but, in one item then
-
thuban
ok, will do
-
thuban
once i remember how the cli works lol