00:05:52 Heya folks, I may need some help finding the much older parts of CNN because of https://old.reddit.com/r/forgottenwebsites/comments/4xxcxj/cnn_forgot_to_update_some_parts_of_its_website/ - one of which is http://www.cnn.com/FOOD/resources/ (which I'm throwing into AB)
00:06:06 It seems the rest of the links are dead or redirected back to CNN modern stuff s:
00:07:16 There's also apparently this version of that page, http://edition.cnn.com/FOOD/resources/
00:08:31 It will be harder to find those kinds of links since they're not really subdomains
03:04:38 Ryz: there is also http://www.cnn.com/FOOD/news/, /restaurants/, /key.ingredient/cardamom/
03:14:57 Ooo, ooo, katocala, keep 'em coming if you manage to find more of 'em~
03:21:28 Ryz: 'More key ingredients' selector half-way down the page on that last one has some more. Also 'related stories' at the end of the main text goes to another section.
05:37:11 Hi folks. LIHKG (lihkg.com), Hong Kong's version of Reddit, is now at risk of being shut down - the HK government singled them out as *the* site that needs to be investigated for "endangering national security" (https://unwire.hk/2021/07/07/hk-socialmedia/life-tech/). Is there any way to start backing it up?
05:41:55 very javascript-heavy, seems likely to need a dedicated warrior project.
05:42:12 - do you have a rough idea of how big the site is?
05:43:07 Current thread IDs are just over 2.6 million.
05:43:37 oh, they do seem to be sequential
05:43:42 that's good
05:44:58 - what's the important content? threads seem straightforward; is there anything else (like user profiles) we should try and get?
05:46:32 Threads are pretty much the only thing that needs to be backed up imo; user profiles aren't important, and apart from these two, there are no other features on LIHKG
05:48:40 looks like thread data is at https://lihkg.com/api_v2/thread//page/?order=reply_time
05:50:57 response text is in .response.item_data[].msg; contains inline html (including images which would be nice to get)
05:52:36 ugh, behind cloudflare protection. that's bad :/
05:55:22 JAA, thoughts? i'm not sure what options we have besides contacting the operators and/or jury-rigging a webdriver setup
05:57:09 I'll have a look when I'm awake again. What rate limiting are you seeing?
05:58:40 "Error 1020: Access denied" on straight attempts to access json in the browser; captchas on copying the legit request as curl.
05:59:01 Mhm
06:00:56 (.response.total_page is all we should need for pagination)
06:05:37 J A A knows all
06:07:21 Lately, I've been pretty dumbfounded by Buttflare's bullshit. They're getting more annoying to deal with. Not that they were pleasant before...
06:20:14 Because of the political sensitivity of this, I think it might be nice to try to get it without contacting them
06:23:50 abcde: Any idea on when it may be shut down?
10:47:23 Any general ideas on how to solve CAPTCHAs, specifically Cloudflare ones?
10:48:48 Specifically as applicable to the previously discussed, but presumably could be useful in the future
10:53:06 ideally we could mimic browser behavior successfully enough not to receive captchas. (this might be via brozzler/webdriver+warcprox, or might be via a very low-level http implementation--the latter presumably much faster but much more difficult to keep updated.)
10:53:17 but i don't know whether that would be a complete solution or cloudflare sometimes throws captchas anyway just to keep its hand in; some hybrid approach might be needed...
11:00:08 (if the latter there's always the 'farm it out' option like we did with yahoo groups.
captcha-solving services are kinda sketchy, but depending on how often captchas show up/how well we manage overall throughput, it might be practical to work just with volunteers)
11:02:45 This is assuming it gives captchas
11:02:55 Which is almost bound to happen even if you run headless
11:03:12 And IIRC some sites have set themselves to always have captchas no matter what
11:03:27 mmm
11:03:33 (Though it might be questionable to scrape those then, but that can be dealt with when it happens)
11:03:38 Yeah
11:04:10 Might be best to try to cruise just under the captcha rate on most?
11:04:41 That would require people be very careful about running too many of them
11:04:42 'Hybrid approach' sounds nice
11:41:44 Another option (which is not gonna be easy) is to take advantage of Privacy Pass: https://support.cloudflare.com/hc/en-us/articles/115001992652-Using-Privacy-Pass-with-Cloudflare
11:42:11 You can get passes for completing the captchas, that then allow you to automatically get through other captchas
11:42:24 Might work
11:42:44 "To help mitigate malicious usage of this, we automatically disable Privacy Pass anytime a domain is placed into 'I'm Under Attack!' mode."
11:50:05 If we're in under attack mode, everyone gets a captcha, no matter how slow we go
11:50:57 ah, i guess there are intermediate levels of protection. is the documentation for that accessible?
12:33:23 https://support.cloudflare.com/hc/en-us/articles/200170056-Understanding-the-Cloudflare-Security-Level Some here
12:59:32 I'm thinking that if we can solve the captcha issue, that lihkg website should be archivable by wget-at?
13:08:38 not recursively, but if we generated and piped in the json & resource urls, then sure--but i doubt the premise; i don't see a path to bypassing captcha that is compatible with making requests through wget
13:14:00 I don't think we'll be able to bypass the captcha either.
Cloudflare have worked pretty hard to make that very hard except by having humans complete the captcha
13:16:56 i think it's _doable_, jut impractical at scale with our resources
13:17:00 *just
13:28:17 how long do CAPTCHA cookies work before a new one is required?
13:28:25 is this based on seconds or number of requests?
13:28:41 * AK shrugs
13:30:59 Wouldn't be surprised if it's a mix of both
13:31:18 As well as looking at other requests from ips+asns
17:18:23 thuban: how are the saves of websites/Facebook pages of political parties going? there's been a wave of resignations following an announcement of potential wage clawback for those who risk being disqualified from their positions in the district or legislative councils. a lot of the parties likely won't be around as they are much longer
17:19:58 (heard the sad news about LIHKG, thanks for doing whatever you all can to save it)
17:34:46 the writing is on the wall for democrats in Macau, 21 candidates barred from upcoming legislative elections. source: https://hongkongfp.com/2021/07/10/macau-bans-21-democrats-from-legislative-elections/ putting it out there if it would be worthwhile preemptively saving the corresponding party websites/socials of those candidates
17:35:24 (I can look up the urls if so)
18:15:57 would it be possible to archive roblox.com?
18:16:18 the catalog, some games <100 players and more?
18:26:16 I don't believe it's at risk of going away? Any specific reason we should?
19:14:08 thuban: The problem isn't just a low-level HTTP implementation. It's TLS as well these days.
19:14:41 AK, thuban: 'I'm Under Attack' mode is not captchas but only JS challenge. Which is also a blocker at the moment but can be circumvented at least in theory.
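(Note on the LIHKG JSON fields mentioned earlier in this log: a minimal Python sketch of pulling the post bodies, inline image URLs, and page count out of one page of thread JSON. Only `.response.item_data[].msg` and `.response.total_page` come from the discussion; the function names, the sample payload, and the assumption that images appear as plain `<img src=...>` tags are illustrative guesses. It does not fetch anything and does nothing about the Cloudflare block.)

```python
import json
from html.parser import HTMLParser


class _ImageCollector(HTMLParser):
    """Records the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing <img ... /> via handle_startendtag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def image_sources(fragment):
    """Return the image URLs referenced by one post body (inline HTML)."""
    collector = _ImageCollector()
    collector.feed(fragment)
    collector.close()
    return collector.sources


def summarize_page(raw_json):
    """Return (total_page, messages, image_urls) for one page of thread JSON.

    Assumes the layout described in the log: the post bodies live in
    .response.item_data[].msg and the page count in .response.total_page.
    """
    response = json.loads(raw_json)["response"]
    messages = [item["msg"] for item in response["item_data"]]
    images = [url for msg in messages for url in image_sources(msg)]
    return response["total_page"], messages, images


if __name__ == "__main__":
    # Hypothetical payload shaped like the fields named above.
    sample = json.dumps({
        "response": {
            "total_page": 3,
            "item_data": [
                {"msg": 'hello <img src="https://example.invalid/a.png">'},
                {"msg": "no images here"},
            ],
        }
    })
    total, messages, images = summarize_page(sample)
    print(total, len(messages), images)
```

(The image URLs collected this way are the kind of "resource urls" that could be listed alongside the JSON URLs for a non-recursive fetch, if the access problem is ever solved.)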
20:00:54 nuroten: unfortunately, there is very little we can do about facebook at this time
20:01:12 i think the facebook rate limiting is just generous enough that with careful monitoring and a slow pace, it should be possible to access some recent posts from each page of interest, but i don't know if we can save them as warcs in a way that ia could accept.
20:01:22 (viz., archivebot, #//, etc are b&. i for one am willing to sit here with warcprox or whatever and make requests by hand if it comes to that, but i'm not whitelisted for wayback machine ingestion--and i'm not sure ia whitelists anyone for such artisanal setups)
20:01:37 as for websites, they're chugging along. we're mostly still on media outlets, though.
20:01:44 speaking of, i see the inmediahk job was aborted. anyone know whether we got decent coverage first? (was ddos protection always on or did they activate it in response to us?)
20:02:12 JAA: you are of course correct; i misspoke
20:02:22 thuban: all right, thanks :)
20:04:37 nuroten: i'm about to do another round of checking on jobs and adding stuff to the hong kong media wiki page; if you care to grab those macau urls and dump them (in that same etherpad, maybe?) we'll see about archiving them as well
20:04:41 thuban: inmediahk.net blocked AB shortly after the job started. Buttflare wasn't enabled at the time. curl on the same machine worked fine, even with identical headers...
20:05:21 JAA: gotcha, thanks
20:06:39 That last part in particular is why I think TLS matters since recently. I'm not actually sure it's TLS, but when curl and AB send the exact same HTTP request down to the header order, it's the only thing I can think of.
20:08:26 > I don't believe it's at risk of going away? Any specific reason we should?
20:08:37 no, but just in case to preserve history
20:10:55 JAA: see discussion on 23 june https://hackint.logs.kiska.pw/archiveteam-bs/20210623#c292082
20:11:37 Yup
20:17:36 looks like the passiontimes job probably needs a high delay or abort as well :/
20:34:30 thanks to whoever took care of that!
20:35:40 xtube is shutting down https://www.ynot.com/xtube-close-abruptly-after-13-years/
20:37:37 -> #nevermind
20:43:03 nuroten: did you cross out the twitter links in the 'larger parties' section, and if so, why? they look okay to me
20:43:21 thuban: not me
20:44:40 huh
21:18:51 thuban: Macau links at the very bottom of the pad. as I'm unfamiliar with the situation in Macau, maybe someone with knowledge of things there will come by and amend/add to it
21:30:22 Crossed out twitter links was prob me. Just meant I grabbed them in AB
21:37:46 duce1337: sure, but projects require a lot of time and storage space. Roblox, I imagine would take quite a bit of both. The OPs here will obviously consider it.
21:39:41 ok
21:40:33 I'm not opposed to it (I mean... https://transfer.archivete.am/inline/bG4mu/aatt.png ), but lower priority than a bunch of other things.
21:43:43 thank you, nuroten and Megame
21:43:46 AK: what's the story with the ab jobs you ran for https://612fund.hk/ on the 23rd? should we revisit or is the second one good?
21:44:49 I agree