00:18:09 duolingo is killing their forums at the end of march, public announcement to come but a ton of user-created language guides and resources will be gone
00:44:12 We should rename ArchiveTeam to ForumTeam if this keeps up
00:45:17 Anything public so far scowlee or is this inside info? And if the latter when can we expect a public announcement?
00:51:43 OrIdow6: We should remain Archive Team, but Forum Team should specialize in forums.
00:52:28 On the Wiki home, it doesn't mention Fandom.
00:55:43 jamesp: FWIW there has been a never-implemented idea to turn #msgbored into something like that, hence its topic
01:55:46 Beware, Duolingo sometimes returns inaccurate 200s (with body "500 Internal Server Error"), I suspect there's other status code weirdness too
01:55:52 *Duolingo forums API
01:56:22 Doing a quick estimate - does look like there are a few 10m posts
02:04:40 !sa https://youtu.be/lqYTX7parRw
06:25:41 The MapleTip Forums have vanished in the past few hours. The AB job did not manage to grab everything in time, but it looks like a good majority of the content was covered.
14:15:23 OrIdow6: i think it should be announced within a week or so
15:29:40 JAA, looked into using tapatalk for getting missing / machine-readable content from the technologyguide forums some more, turns out we *can* use unauthenticated GET requests for everything
15:31:17 unless there's stuff only visible when logged in? all I've seen that requires login is attachment data (metadata/thumbnails are open)
15:42:14 I'd be about ready to write a script for myself, is there any interest in running this as an "ArchiveTeam crawl", even though the HTML (sans broken pages) is already done? (I probably couldn't grab everything, nor have it end up in WBM)
15:49:25 Arkiver uploaded File:Pinger-logo.png: https://wiki.archiveteam.org/?title=File%3APinger-logo.png
16:20:45 rewby: can we get a target for pinger.pl ?
16:20:51 it would be archiveteam_pinger
16:20:52 pinger_
16:20:58 Archive Team pinger:
16:36:39 arkiver: Sure. What kind of file size you thinking and what kind of rate? + Is there a channel for this?
16:49:33 rewby: i think will not be large at all
16:49:44 no channel at the moment, but we can think of one
16:50:01 I set the target in the project
16:50:11 *tracker
16:50:20 yep, already pushed first into into it
17:29:32 No docker image yet for pinger?
17:31:23 ThreeHM: I'll go make one
17:31:59 Thanks!
17:33:18 It's building, give it a few minutes and it'll be at the usual address
17:34:19 ThreeHM: Build done
17:51:23 Pinger just started returning a ton of 400's
17:51:44 arkiver ^
17:51:59 403?
17:52:18 Some, but a was seeing a wall of 400's with a few 403's and 200's
17:52:21 looks fine to me
17:52:24 hmm
17:52:40 the site pretty unstable yeah :/
17:52:50 Just picked back up
17:53:01 Yeah, that was my thought
17:53:25 lets hope it stays online a little after the 31st
17:53:31 will see about contacting then
17:53:34 them*
17:54:30 anyone have ideas for pinger channel name?
17:54:43 pingas
17:55:13 sorry, I thought at first it would be a project for long term pings... idk lol
17:57:49 Not on Deathwatch?
17:58:13 #pinged maybe?
17:58:31 #pingedout
17:58:39 lets do #pinged
17:58:46 saw yours too late OrIdow6 :P
18:11:53 OrIdow6 edited Deathwatch (+112, /* 2022 */ Add pinger.pl): https://wiki.archiveteam.org/?diff=48225&oldid=48215
18:13:58 Did anything happen to forum.chip.de after the AB job got banned? Looks like they've made their change
18:14:36 I'm going to move its category anyhow
18:16:51 I archived it fully, and it completed ten minutes before they added rules to their Buttflare configuration blocking most automated access.
18:16:54 OrIdow6 edited Deathwatch (+12, Forum.chip.de has made its changes): https://wiki.archiveteam.org/?diff=48226&oldid=48225
18:17:01 Oh, good
18:58:28 Hi, I joined today and my Warrior VM has been running for over 6hrs.
I'd like to backup forums threads which are of interest to me on NotebookReview forums, which is closing for good in 2days. Is there a way to selectivey apply my Warrior VM to this and contribute to NBR archiving? @JAA work on this I believe archiving already and I'd like to
18:58:29 help. Thank you
19:05:26 DLoa: There is no distributed project for TechnologyGuide, so no, you can't. I have already archived (nearly) the entire four forums, only a few dozen threads missing that I will be looking into tonight.
20:02:41 JAA, do you want to grab those threads yourself? I've written down my notes here https://gist.github.com/drdaxxy/b7731fb4217a56604956bcaa45641648
20:07:03 daxxy: Brilliant, thanks! Sorry for the delay, didn't have time to look into it yet.
20:08:55 no worries :) what sorta resources / time did the HTML crawl take?
20:11:22 About a day for all four forums with decent parallelism and multiple IPs. Not sure whether the IPs were actually needed or not.
20:11:44 Also, yes, there are threads that require logging in. I'm not sure whether they're accessible to normal users or only mods or similar though.
20:11:52 We generally only archive things that are publicly accessible.
20:12:41 returnHtml=1 on get_thread renders the BBCode as HTML.
20:13:06 Well, partially, anyway. [url=...] is not transformed apparently.
20:14:13 neither is [quote]
20:15:00 nor img, so I have no idea if they actually render any BBCode or just newlines and maybe entities :v
20:15:43 I'm seeing some stuff as well.
20:15:55 But yeah, it's weird.
20:16:11 Smilies aren't translated into img tags either.
20:28:52 okay, at least [b] just gets removed if returnHtml=0, see post 540654 in thread 75253 for example
20:32:52 Aw, there's a get_raw_post method, but that only works for users who can edit the post (i.e. poster/mods).
20:42:15 yeah, I saw that, but now that you say it...
I should talk to the mods, they seem interested in archival
20:43:39 but since I figure this definitely isn't the place for crawling with a mod account - would you recommend the *-grab template for "outsiders" right now, or would I likely be better off hacking something together on my own?
20:45:26 The -grab template is really only applicable to distributed projects, which is a major part of AT but not the only thing we do. I used my own tool (qwarc) for archiving the forums, but I can't recommend it to anyone as it's very much not user-friendly.
20:45:47 And yeah, crawling with a mod account is not going to happen.
20:45:56 (... here)
20:48:38 I think I'll regrab all threads with get_thread, probably with returnHtml=0 but haven't decided yet.
20:52:10 Trying to figure out where that transformation happens, but haven't quite found it.
20:56:09 library/Tapatalk/Bridge.php, library/Tapatalk/BbCode/Formatter/Tapatalk.php, mobiquo/mbqClass/lib/read/MbqRdEtForumPost.php are the relevant places I've found
20:56:39 Ah, push/TapatalkPush.php cleanPost, but it delegates to Tapatalk_BbCode_Formatter_Tapatalk which isn't in the plugin.
20:57:16 it's in the archive, you may have only extracted the mobiquo folder
20:58:01 Oh, right. I was grepping inside mobiquo, yeah.
20:59:16 Wow, this code is a mess.
20:59:19 hah
20:59:31 Random indentation is exactly why I love Python.
21:00:50 python2*
21:01:09 * JAA slaps hexa- around a bit with a large trout
21:01:26 * hexa- slaps JAA back with python2.7 … BEST BEFORE 2Y AGO
21:01:42 Great, thanks, now I have food poisoning. :-(
21:01:55 I'm burnt, I do a lot of python packaging in NixOS :(
21:03:24 [b] and [i] get stripped, [u] gets converted to , [color] becomes a font tag, [img] should get stripped in both settings if I'm reading the code correctly.
21:05:57 img stripped? where are you seeing that?
21:07:39 Nevermind, it gets treated specially it seems.
21:07:46 library/Tapatalk/BbCode/Formatter/Tapatalk.php is what I'm looking at.
21:08:04 Specifically the getTags function.
21:13:07 tbh, I don't think there's a need to analyze this properly right now -- we're not gonna get a lossless copy anyway, and clearly they only leave bbcode in that matches the parser in their app
21:13:49 (the android app uses returnHtml=1, btw)
21:15:10 Hmm, it would be neat if we could archive it in a way that someone could simply plug a Wayback Machine URL into the app and it all plays back correctly. But getting that to work would be quite a challenge.
21:15:24 And it'd probably break anyway.
21:15:43 you mean the tapatalk app?
21:16:16 definitely not gonna work
21:19:14 for one, unless there's a way to force it into using the JSON API (doubt it, since the JSON API is newer, it ought to be preferred if client and server support it already), it POSTs to the xml-rpc interface and there's no way to make it request different URLs for different content
21:23:45 Right
21:26:22 writing a new (entirely client-side) webapp that reads everything from WBM (plus an externally hosted search index file, if you wanna get fancy) would work though, and not even take that much effort I think
21:28:02 when you're not supporting 2 protocols in 8 codebases over 3 inheritance levels, this does not have to be complex software :P
21:28:08 :-)
21:29:48 It would have to be in the WBM though due to (the lack of) CORS.
21:30:04 yeah, wasn't sure about that
21:30:23 Anyway, that's something for the future. First step is getting the data.
21:30:34 ...but then you can always just put your site into WBM, right? ^^
21:31:05 Also, someone here was working on a forum archive ingestion thingy a while ago. Not sure what happened to that idea.
21:31:27 Yes, that's what I did with the Picosong data finder thingy.
21:34:31 I'm going with returnHtml=0. As far as I can tell, it preserves a bit more data than =1 does, and the conversion should be easy enough.
21:34:52 huh, what does it preserve that =1 doesn't?
21:35:02 [b] [i]
21:35:07 hang on
21:35:56 no, =1 transforms them to HTML, but =0 strips them completely
21:36:16 Huh
21:37:31 Oh
21:39:48 any idea about the timeframe? if I (get the mods to) grab anything more, I'd rather do that after you've done your thing (especially with the missing posts) so my traffic won't get in your way
21:42:38 Ok yeah, =1 it is I guess.
21:44:33 I need to leave for a bit but will get it up and running in the next 1-2 hours.
21:45:09 nice
23:20:11 DLoa: JAA This is great news. I was having trouble with Python3 script using wkhtmlto/pdfkit to save to pdf each pages of the threads of interest without running into issues after one or two pages (blocked?). I hope that it'll be possible to download to pdf for offline use. I have an account and can look into the remaining threads that could
23:20:11 be saved.
23:25:40 daxxy: The Tapatalk API has an ... interesting behaviour on thread redirects, e.g. thread 826132 on NotebookReview which redirects to 795536. It returns the data for the merged(?) thread and a positive total_post_num but no actual posts.
23:25:58 huh
23:26:29 I was wondering what it'd do with merged/moved threads but hadn't come across any, thanks
23:27:17 Also, on threads that require logging in, it returns a 'Need valid topic id!' error, e.g. 247631 on NotebookReview.
23:28:26 I can log in on NBR if it helps.
23:49:16 daxxy: Welp: http://forum.notebookreview.com/mobiquo/tapatalk.php?method_name=get_thread&topicId=763489&returnHtml=1&page=1&perPage=100
23:50:01 That one works fine through the website: http://forum.notebookreview.com/threads/why-arent-laptop-gpus-officially-sold.763489/
23:51:02 weird
23:51:10 There are plenty more like that, it seems. Just running a little test with random IDs right now and immediately hit three like it.
23:51:32 738640 and 742521 are the other two.
23:51:53 Their WAF is very, very odd.
23:53:02 Haven't documented this anywhere yet, but any request containing 'temp' as a word gets blocked.
Same for 'tmp' and one other I can't remember right now. And anything with 'nessus' results in a connection reset.
23:54:35 But yeah, can't even get everything through the API. WTF?
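
Editor's sketch, not part of the log: a rough Python illustration of the Tapatalk get_thread request discussed above. The URL and its query parameters (method_name, topicId, returnHtml, page, perPage) are copied from the 23:49:16 message, and total_post_num is the field named at 23:25:40. Everything else is an assumption made for illustration: the helper names are hypothetical, the response is assumed to be already decoded into a dict whose post list sits under a "posts" key, and the way the WAF splits a request into "words" is a guess.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in the log.
BASE = "http://forum.notebookreview.com/mobiquo/tapatalk.php"

# Words reported in the log as tripping the WAF (a third one was not recalled).
BLOCKED_WORDS = ("temp", "tmp", "nessus")


def get_thread_url(topic_id, page=1, per_page=100, return_html=True):
    """Build a get_thread URL in the same shape as the one in the log."""
    params = {
        "method_name": "get_thread",
        "topicId": topic_id,
        "returnHtml": 1 if return_html else 0,
        "page": page,
        "perPage": per_page,
    }
    return f"{BASE}?{urlencode(params)}"


def waf_trigger_words(text):
    """Return reported WAF trigger words that appear as whole words in text.

    Assumption: a "word" is a run of ASCII letters or digits.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [w for w in BLOCKED_WORDS if w in words]


def looks_like_redirect_stub(thread):
    """Flag the redirect oddity: positive total_post_num but no posts.

    Assumption: thread is a decoded response dict and the post list
    lives under the hypothetical key "posts".
    """
    return thread.get("total_post_num", 0) > 0 and not thread.get("posts")
```

For example, get_thread_url(763489) rebuilds the URL from the 23:49:16 message, and waf_trigger_words could be run over thread slugs before a crawl to predict which requests might be refused. Whether that prediction matches the real WAF is untested here.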