-
Ryz
Heya folks, I may need some help on finding the much older parts of CNN because of
old.reddit.com/r/forgottenwebsites/…to_update_some_parts_of_its_website - one of which is
cnn.com/FOOD/resources (which I'm throwing into AB)
-
Ryz
It seems the rest of the links are dead or redirected back to CNN modern stuff s:
-
Ryz
There's also apparently this version of that page,
edition.cnn.com/FOOD/resources
-
Ryz
It would be harder to find those kinds of links since they're not really subdomains
-
katocala
Ryz there is also
cnn.com/FOOD/news, /restaurants/, /key.ingredient/cardamom/
-
Ryz
Ooo, ooo, katocala, keep 'em coming if you manage to find more of 'em~
-
JAA
Ryz: 'More key ingredients' selector half-way down the page on that last one has some more. Also 'related stories' at the end of the main text goes to another section.
-
abcde
Hi folks. LIHKG (lihkg.com), Hong Kong's version of Reddit, is now at risk of being shut down - the HK government singled them out as *the* site that needs to be investigated for "endangering national security" (
unwire.hk/2021/07/07/hk-socialmedia/life-tech). Is there any way to start backing it up?
-
thuban
very javascript-heavy, seems likely to need a dedicated warrior project.
-
thuban
- do you have a rough idea of how big the site is?
-
JAA
Current thread IDs are just over 2.6 million.
-
thuban
oh, they do seem to be sequential
-
thuban
that's good
-
thuban
- what's the important content? threads seem straightforward; is there anything else (like user profiles) we should try and get?
-
abcde
Threads are pretty much the only thing that needs to be backed up imo; user profiles aren't important, and apart from those two there are no other features on LIHKG
-
thuban
looks like thread data is at
lihkg.com/api_v2/thread<n>/page/<n>?order=reply_time
-
thuban
response text is in .response.item_data[<n>].msg; contains inline html (including images which would be nice to get)
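
A minimal sketch of reading that response shape, assuming only what is stated above: each entry of `.response.item_data` carries a `msg` field with inline HTML. The names `ImgCollector` and `extract_posts` and the sample payload are hypothetical, and treating `<img src=...>` as the only way images are embedded is an assumption.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_posts(payload):
    """Return a (msg_html, image_urls) pair for each entry in .response.item_data."""
    posts = []
    for item in payload["response"]["item_data"]:
        msg = item.get("msg", "")
        collector = ImgCollector()
        collector.feed(msg)
        collector.close()
        posts.append((msg, collector.sources))
    return posts


# Hypothetical payload with the shape described above; not real site data.
sample = {
    "response": {
        "item_data": [
            {"msg": 'hello <img src="https://example.com/a.png">'},
            {"msg": "no images here"},
        ]
    }
}
```

The image URLs collected this way could be queued next to the JSON URLs so the embedded media ends up in the same crawl.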
-
thuban
ugh, behind cloudflare protection. that's bad :/
-
thuban
JAA, thoughts? i'm not sure what options we have besides contacting the operators and/or jury-rigging a webdriver setup
-
JAA
I'll have a look when I'm awake again. What rate limiting are you seeing?
-
thuban
"Error 1020: Access denied" on straight attempts to access json in the browser; captchas on copying the legit request as curl.
-
JAA
Mhm
-
thuban
(.response.total_page is all we should need for pagination)
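
A sketch of how pagination could be driven from `.response.total_page` alone. Assumptions not confirmed in the chat: the `https://` scheme, the slash between `thread` and the ID (the path as pasted above has none), pages numbered from 1, and `total_page` being an integer or a numeric string. `page_urls` is a hypothetical helper name.

```python
# Assumed URL shape; see the assumptions listed above.
URL_TEMPLATE = "https://lihkg.com/api_v2/thread/{thread_id}/page/{page}?order=reply_time"


def page_urls(thread_id, first_page_payload):
    """List every page URL of a thread, given the parsed JSON of its first page."""
    total = int(first_page_payload["response"]["total_page"])
    return [
        URL_TEMPLATE.format(thread_id=thread_id, page=page)
        for page in range(1, total + 1)
    ]
```

For example, `page_urls(123, {"response": {"total_page": 3}})` gives three URLs, for pages 1 through 3.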
-
OrIdow6
J A A knows all
-
JAA
Lately, I've been pretty dumbfounded by Buttflare's bullshit. They're getting more annoying to deal with. Not that they were pleasant before...
-
OrIdow6
Because of the political sensitivity of this, I think it might be nice to try to get it without contacting them
-
OrIdow6
abcde: Any idea on when it may be shut down?
-
OrIdow6
Any general ideas on how to solve CAPTCHAS, specifically CloudFlare ones?
-
OrIdow6
Specifically as applicable to the previously discussed, but presumably could be useful in the future
-
thuban
ideally we could mimic browser behavior successfully enough not to receive captchas. (this might be via brozzler/webdriver+warcprox, or might be via a very low-level http implementation--the latter presumably much faster but much more difficult to keep updated.)
-
thuban
but i don't know whether that would be a complete solution or cloudflare sometimes throws captchas anyway just to keep its hand in; some hybrid approach might be needed...
-
thuban
(if the latter there's always the 'farm it out' option like we did with yahoo groups. captcha-solving services are kinda sketchy, but depending on how often captchas show up/how well we manage overall throughput, it might be practical to work just with volunteers)
-
OrIdow6
This is assuming it gives captchas
-
OrIdow6
Which is almost bound to happen even if you run headless
-
OrIdow6
And IIRC some sites have set themselves to always have captchas no matter what
-
thuban
mmm
-
OrIdow6
(Though it might be questionable to scrape those then, but that can be dealt with when it happens)
-
OrIdow6
Yeah
-
OrIdow6
Might be best to try to cruise just under the captcha rate on most?
-
rewby
That would require people be very careful about running too many of them
-
OrIdow6
"Hybrid approach" sounds nice
-
AK
Another option (which is not gonna be easy) is to take advantage of Privacy Pass:
support.cloudflare.com/hc/en-us/art…-Using-Privacy-Pass-with-Cloudflare
-
AK
You can get passes for completing the captchas, that then allow you to automatically get through other captchas
-
AK
Might work
-
thuban
"To help mitigate malicious usage of this, we automatically disable Privacy Pass anytime a domain is placed into 'I'm Under Attack!' mode."
-
AK
If we're in under attack mode, everyone gets a captcha, no matter how slow we go
-
thuban
ah, i guess there are intermediate levels of protection. is the documentation for that accessible?
-
rewby
I'm thinking that if we can solve the captcha issue, that lihkg website should be archivable by wget-at?
-
thuban
not recursively, but if we generated and piped in the json & resource urls, then sure--but i doubt the premise; i don't see a path to bypassing captcha that is compatible with making requests through wget
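
A rough sketch of the "generate and pipe in the URLs" idea, using the sequential thread IDs mentioned earlier. It only produces a seed list and does nothing about Cloudflare. Assumptions: the `https://` scheme and the slash after `thread` in the path, and that IDs are simply counted up; `seed_urls` and `write_seeds` are hypothetical names, and how a downloader would consume the list is left open.

```python
import sys

# Assumed URL shape for the first page of a thread (scheme and slash are guesses).
FIRST_PAGE = "https://lihkg.com/api_v2/thread/{thread_id}/page/1?order=reply_time"


def seed_urls(first_id, last_id):
    """Yield the first-page API URL for every thread ID in [first_id, last_id]."""
    for thread_id in range(first_id, last_id + 1):
        yield FIRST_PAGE.format(thread_id=thread_id)


def write_seeds(first_id, last_id, stream=sys.stdout):
    """Write one URL per line to the stream and return how many were written."""
    count = 0
    for url in seed_urls(first_id, last_id):
        stream.write(url + "\n")
        count += 1
    return count
```

Later pages and image URLs are only known once each first page has been fetched, so they would have to be fed back in as a second batch.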
-
AK
I don't think we'll be able to bypass the captcha either. Cloudflare have worked pretty hard to make that very hard except by having humans complete the captcha
-
thuban
i think it's _doable_, just impractical at scale with our resources
-
arkiver
how long are CAPTCHA cookies working until a new one is required?
-
arkiver
is this based on seconds or number of requests?
-
» AK shrugs
-
AK
Wouldn't be surprised if it's a mix of both
-
AK
As well as looking at other requests from ips+asns
-
nuroten
thuban: how are the saves of websites/Facebook pages of political parties going? there's been a wave of resignations following an announcement of potential wage clawback for those who risk being disqualified from their positions in the district or legislative councils. a lot of the parties likely won't be around as they are much longer
-
nuroten
(heard the sad news about LIHKG, thanks for doing whatever you all can to save it)
-
nuroten
the writing is on the wall for democrats in Macau, 21 candidates barred from upcoming legislative elections. source:
hongkongfp.com/2021/07/10/macau-ban…emocrats-from-legislative-elections putting it out there if it would be worthwhile preemptively saving the corresponding party websites/socials of those candidates
-
nuroten
(I can look up the urls if so)
-
duce1337
would it be possible to archive roblox.com?
-
duce1337
the catalog, some games <100 players and more?
-
Jake
I don't believe it's at risk of going away? Any specific reason we should?
-
JAA
thuban: The problem isn't just a low-level HTTP implementation. It's TLS as well these days.
-
JAA
AK, thuban: 'I'm Under Attack' mode is not captchas but only a JS challenge. Which is also a blocker at the moment but can be circumvented, at least in theory.
-
thuban
nuroten: unfortunately, there is very little we can do about facebook at this time
-
thuban
i think the facebook rate limiting is just generous enough that with careful monitoring and a slow pace, it should be possible to access some recent posts from each page of interest, but i don't know if we can save them as warcs in a way that ia could accept.
-
thuban
(viz., archivebot, #//, etc are b&. i for one am willing to sit here with warcprox or whatever and make requests by hand if it comes to that, but i'm not whitelisted for wayback machine ingestion--and i'm not sure ia whitelists anyone for such artisanal setups)
-
thuban
as for websites, they're chugging along. we're mostly still on media outlets, though.
-
thuban
speaking of, i see the inmediahk job was aborted. anyone know whether we got decent coverage first? (was ddos protection always on or did they activate it in response to us?)
-
thuban
JAA: you are of course correct; i misspoke
-
nuroten
thuban: all right, thanks :)
-
thuban
nuroten: i'm about to do another round of checking on jobs and adding stuff to the hong kong media wiki page; if you care to grab those macau urls and dump them (in that same etherpad, maybe?) we'll see about archiving them as well
-
JAA
thuban: inmediahk.net blocked AB shortly after the job started. Buttflare wasn't enabled at the time. curl on the same machine worked fine, even with identical headers...
-
thuban
JAA: gotcha, thanks
-
JAA
That last part in particular is why I think TLS has started to matter recently. I'm not actually sure it's TLS, but when curl and AB send the exact same HTTP request, down to the header order, it's the only thing I can think of.
-
duce1337
><Jake> I don't believe it's at risk of going away? Any specific reason we should?
-
duce1337
no, but just in case to preserve history
-
JAA
Yup
-
thuban
looks like the passiontimes job probably needs a high delay or abort as well :/
-
thuban
thanks to whoever took care of that!
-
EggplantN
-> #nevermind
-
thuban
nuroten: did you cross out the twitter links in the 'larger parties' section, and if so, why? they look okay to me
-
nuroten
thuban: not me
-
thuban
huh
-
nuroten
thuban: Macau links at the very bottom of the pad. as I'm unfamiliar with the situation in Macau, maybe someone with knowledge of things there will come by and amend/add to it
-
Megame
Crossed out twitter links was prob me. Just meant I grabbed them in AB
-
Jake
duce1337: sure, but projects require a lot of time and storage space. Roblox, I imagine, would take quite a bit of both. The OPs here will obviously consider it.
-
duce1337
ok
-
JAA
I'm not opposed to it (I mean...
transfer.archivete.am/inline/bG4mu/aatt.png ), but lower priority than a bunch of other things.
-
thuban
thank you, nuroten and Megame
-
thuban
AK: what's the story with the ab jobs you ran for
612fund.hk on the 23rd? should we revisit or is the second one good?
-
Jake
I agree