00:18:09 <scowlee> duolingo is killing their forums at the end of march, public announcement to come but a ton of user-created language guides and resources will be gone
00:44:12 <OrIdow6> We should rename ArchiveTeam to ForumTeam if this keeps up
00:45:17 <OrIdow6> Anything public so far scowlee or is this inside info? And if the latter when can we expect a public announcement?
00:51:43 <jamesp> OrIdow6: We should remain Archive Team, but Forum Team should specialize in forums.
00:52:28 <jamesp> On the Wiki home, it doesn't mention Fandom.
00:55:43 <OrIdow6> jamesp: FWIW there has been a never-implemented idea to turn #msgbored into something like that, hence its topic
01:55:46 <OrIdow6> Beware, Duolingo sometimes returns inaccurate 200s (with body "500 Internal Server Error"), I suspect there's other status code weirdness too
01:55:52 <OrIdow6> *Duolingo forums API
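A minimal sketch of how a crawler might guard against the misleading 200s described above. The only detail taken from the log is the body text "500 Internal Server Error"; the function name `effective_status` and the assumption that such bodies always look like "<code> <reason phrase>" are hypothetical.

```python
import re

# Assumption: a "soft error" body is just a 4xx/5xx code followed by a
# reason phrase, like the "500 Internal Server Error" body quoted in the log.
_SOFT_ERROR = re.compile(r"^\s*([45]\d\d)\s+[A-Za-z][A-Za-z \-]*\s*$")


def effective_status(status: int, body: str) -> int:
    """Return the status a response really represents.

    A 200 whose body is only an error line is reported as that error code;
    every other response keeps its original status.
    """
    if status == 200:
        match = _SOFT_ERROR.match(body)
        if match:
            return int(match.group(1))
    return status
```

A grab script could retry any item for which `effective_status` differs from the HTTP status instead of storing the response as a success.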
01:56:22 <OrIdow6> Doing a quick estimate - it does look like there are a few tens of millions of posts
02:04:40 <jamesp> !sa https://youtu.be/lqYTX7parRw
06:25:41 <JAA> The MapleTip Forums have vanished in the past few hours. The AB job did not manage to grab everything in time, but it looks like a good majority of the content was covered.
14:15:23 <scowlee> OrIdow6: i think it should be announced within a week or so
15:29:40 <daxxy> JAA, looked into using tapatalk for getting missing / machine-readable content from the technologyguide forums some more, turns out we *can* use unauthenticated GET requests for everything
15:31:17 <daxxy> unless there's stuff only visible when logged in? all I've seen that requires login is attachment data (metadata/thumbnails are open)
15:42:14 <daxxy> I'd be about ready to write a script for myself, is there any interest in running this as an "ArchiveTeam crawl", even though the HTML (sans broken pages) is already done? (I probably couldn't grab everything, nor have it end up in WBM)
15:49:25 <h2ibot> Arkiver uploaded File:Pinger-logo.png: https://wiki.archiveteam.org/?title=File%3APinger-logo.png
16:20:45 <arkiver> rewby: can we get a target for pinger.pl ?
16:20:51 <arkiver> it would be archiveteam_pinger
16:20:52 <arkiver> pinger_
16:20:58 <arkiver> Archive Team pinger:
16:36:39 <rewby> arkiver: Sure. What kind of file size you thinking and what kind of rate? + Is there a channel for this?
16:49:33 <arkiver> rewby: i think it will not be large at all
16:49:44 <arkiver> no channel at the moment, but we can think of one
16:50:01 <rewby> I set the target in the project
16:50:11 <rewby> *tracker
16:50:20 <arkiver> yep, already pushed the first items into it
17:29:32 <ThreeHM> No docker image yet for pinger?
17:31:23 <rewby> ThreeHM: I'll go make one
17:31:59 <ThreeHM> Thanks!
17:33:18 <rewby> It's building, give it a few minutes and it'll be at the usual address
17:34:19 <rewby> ThreeHM: Build done
17:51:23 <Craigle> Pinger just started returning a ton of 400's
17:51:44 <Craigle> arkiver ^
17:51:59 <arkiver> 403?
17:52:18 <Craigle> Some, but I was seeing a wall of 400's with a few 403's and 200's
17:52:21 <arkiver> looks fine to me
17:52:24 <arkiver> hmm
17:52:40 <arkiver> the site is pretty unstable yeah :/
17:52:50 <Craigle> Just picked back up
17:53:01 <Craigle> Yeah, that was my thought
17:53:25 <arkiver> let's hope it stays online a little after the 31st
17:53:31 <arkiver> will see about contacting then
17:53:34 <arkiver> them*
17:54:30 <arkiver> anyone have ideas for pinger channel name?
17:54:43 <Sanqui> pingas
17:55:13 <Sanqui> sorry, I thought at first it would be a project for long term pings...  idk lol
17:57:49 <OrIdow6> Not on Deathwatch?
17:58:13 <monika> #pinged maybe?
17:58:31 <OrIdow6> #pingedout
17:58:39 <arkiver> let's do #pinged
17:58:46 <arkiver> saw yours too late OrIdow6 :P
18:11:53 <h2ibot> OrIdow6 edited Deathwatch (+112, /* 2022 */ Add pinger.pl): https://wiki.archiveteam.org/?diff=48225&oldid=48215
18:13:58 <OrIdow6> Did anything happen to forum.chip.de after the AB job got banned? Looks like they've made their change
18:14:36 <OrIdow6> I'm going to move its category anyhow
18:16:51 <JAA> I archived it fully, and it completed ten minutes before they added rules to their Buttflare configuration blocking most automated access.
18:16:54 <h2ibot> OrIdow6 edited Deathwatch (+12, Forum.chip.de has made its changes): https://wiki.archiveteam.org/?diff=48226&oldid=48225
18:17:01 <OrIdow6> Oh, good
18:58:28 <DLoa> Hi, I joined today and my Warrior VM has been running for over 6 hrs. I'd like to back up forum threads which are of interest to me on the NotebookReview forums, which are closing for good in 2 days. Is there a way to selectively apply my Warrior VM to this and contribute to NBR archiving? I believe @JAA is already working on archiving this and I'd like to
18:58:29 <DLoa> help. Thank you
19:05:26 <JAA> DLoa: There is no distributed project for TechnologyGuide, so no, you can't. I have already archived (nearly) the entire four forums, only a few dozen threads missing that I will be looking into tonight.
20:02:41 <daxxy> JAA, do you want to grab those threads yourself? I've written down my notes here https://gist.github.com/drdaxxy/b7731fb4217a56604956bcaa45641648
20:07:03 <JAA> daxxy: Brilliant, thanks! Sorry for the delay, didn't have time to look into it yet.
20:08:55 <daxxy> no worries :) what sorta resources / time did the HTML crawl take?
20:11:22 <JAA> About a day for all four forums with decent parallelism and multiple IPs. Not sure whether the IPs were actually needed or not.
20:11:44 <JAA> Also, yes, there are threads that require logging in. I'm not sure whether they're accessible to normal users or only mods or similar though.
20:11:52 <JAA> We generally only archive things that are publicly accessible.
20:12:41 <JAA> returnHtml=1 on get_thread renders the BBCode as HTML.
20:13:06 <JAA> Well, partially, anyway. [url=...] is not transformed apparently.
20:14:13 <daxxy> neither is [quote]
20:15:00 <daxxy> nor img, so I have no idea if they actually render any BBCode or just newlines and maybe entities :v
20:15:43 <JAA> I'm seeing some <i> stuff as well.
20:15:55 <JAA> But yeah, it's weird.
20:16:11 <JAA> Smilies aren't translated into img tags either.
20:28:52 <daxxy> okay, at least [b] just gets removed if returnHtml=0, see post 540654 in thread 75253 for example
20:32:52 <JAA> Aw, there's a get_raw_post method, but that only works for users who can edit the post (i.e. poster/mods).
20:42:15 <daxxy> yeah, I saw that, but now that you say it... I should talk to the mods, they seem interested in archival
20:43:39 <daxxy> but since I figure this definitely isn't the place for crawling with a mod account - would you recommend the *-grab template for "outsiders" right now, or would I likely be better off hacking something together on my own?
20:45:26 <JAA> The -grab template is really only applicable to distributed projects, which is a major part of AT but not the only thing we do. I used my own tool (qwarc) for archiving the forums, but I can't recommend it to anyone as it's very much not user-friendly.
20:45:47 <JAA> And yeah, crawling with a mod account is not going to happen.
20:45:56 <JAA> (... here)
20:48:38 <JAA> I think I'll regrab all threads with get_thread, probably with returnHtml=0 but haven't decided yet.
20:52:10 <JAA> Trying to figure out where that transformation happens, but haven't quite found it.
20:56:09 <daxxy> library/Tapatalk/Bridge.php, library/Tapatalk/BbCode/Formatter/Tapatalk.php, mobiquo/mbqClass/lib/read/MbqRdEtForumPost.php are the relevant places I've found
20:56:39 <JAA> Ah, push/TapatalkPush.php cleanPost, but it delegates to Tapatalk_BbCode_Formatter_Tapatalk which isn't in the plugin.
20:57:16 <daxxy> it's in the archive, you may have only extracted the mobiquo folder
20:58:01 <JAA> Oh, right. I was grepping inside mobiquo, yeah.
20:59:16 <JAA> Wow, this code is a mess.
20:59:19 <daxxy> hah
20:59:31 <JAA> Random indentation is exactly why I love Python.
21:00:50 <hexa-> python2*
21:01:09 * JAA slaps hexa- around a bit with a large trout
21:01:26 * hexa- slaps JAA back with python2.7 … BEST BEFORE 2Y AGO
21:01:42 <JAA> Great, thanks, now I have food poisoning. :-(
21:01:55 <hexa-> I'm burnt, I do a lot of python packaging in NixOS :(
21:03:24 <JAA> [b] and [i] get stripped, [u] gets converted to <u>, [color] becomes a font tag, [img] should get stripped in both settings if I'm reading the code correctly.
21:05:57 <daxxy> img stripped? where are you seeing that?
21:07:39 <JAA> Nevermind, it gets treated specially it seems.
21:07:46 <JAA> library/Tapatalk/BbCode/Formatter/Tapatalk.php is what I'm looking at.
21:08:04 <JAA> Specifically the getTags function.
21:13:07 <daxxy> tbh, I don't think there's a need to analyze this properly right now -- we're not gonna get a lossless copy anyway, and clearly they only leave bbcode in that matches the parser in their app
21:13:49 <daxxy> (the android app uses returnHtml=1, btw)
21:15:10 <JAA> Hmm, it would be neat if we could archive it in a way that someone could simply plug a Wayback Machine URL into the app and it all plays back correctly. But getting that to work would be quite a challenge.
21:15:24 <JAA> And it'd probably break anyway.
21:15:43 <daxxy> you mean the tapatalk app?
21:16:16 <daxxy> definitely not gonna work
21:19:14 <daxxy> for one, unless there's a way to force it into using the JSON API (doubt it, since the JSON API is newer, it ought to be preferred if client and server support it already), it POSTs to the xml-rpc interface and there's no way to make it request different URLs for different content
21:23:45 <JAA> Right
21:26:22 <daxxy> writing a new (entirely client-side) webapp that reads everything from WBM (plus an externally hosted search index file, if you wanna get fancy) would work though, and not even take that much effort I think
21:28:02 <daxxy> when you're not supporting 2 protocols in 8 codebases over 3 inheritance levels, this does not have to be complex software :P
21:28:08 <JAA> :-)
21:29:48 <JAA> It would have to be in the WBM though due to (the lack of) CORS.
21:30:04 <daxxy> yeah, wasn't sure about that
21:30:23 <JAA> Anyway, that's something for the future. First step is getting the data.
21:30:34 <daxxy> ...but then you can always just put your site into WBM, right? ^^
21:31:05 <JAA> Also, someone here was working on a forum archive ingestion thingy a while ago. Not sure what happened to that idea.
21:31:27 <JAA> Yes, that's what I did with the Picosong data finder thingy.
21:34:31 <JAA> I'm going with returnHtml=0. As far as I can tell, it preserves a bit more data than =1 does, and the conversion should be easy enough.
21:34:52 <daxxy> huh, what does it preserve that =1 doesn't?
21:35:02 <JAA> [b] [i]
21:35:07 <daxxy> hang on
21:35:56 <daxxy> no, =1 transforms them to HTML, but =0 strips them completely
21:36:16 <JAA> Huh
21:37:31 <JAA> Oh
21:39:48 <daxxy> any idea about the timeframe? if I (get the mods to) grab anything more, I'd rather do that after you've done your thing (especially with the missing posts) so my traffic won't get in your way
21:42:38 <JAA> Ok yeah, =1 it is I guess.
21:44:33 <JAA> I need to leave for a bit but will get it up and running in the next 1-2 hours.
21:45:09 <daxxy> nice
23:20:11 <DLoa> DLoa: JAA This is great news. I was having trouble with a Python3 script using wkhtmlto/pdfkit to save each page of the threads of interest to PDF; it ran into issues after one or two pages (blocked?). I hope that it'll be possible to download to PDF for offline use. I have an account and can look into the remaining threads that could
23:20:11 <DLoa> be saved.
23:25:40 <JAA> daxxy: The Tapatalk API has an ... interesting behaviour on thread redirects, e.g. thread 826132 on NotebookReview which redirects to 795536. It returns the data for the merged(?) thread and a positive total_post_num but no actual posts.
23:25:58 <daxxy> huh
23:26:29 <daxxy> I was wondering what it'd do with merged/moved threads but hadn't come across any, thanks
23:27:17 <JAA> Also, on threads that require logging in, it returns a 'Need valid topic id!' error, e.g. 247631 on NotebookReview.
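The two API quirks JAA describes could be told apart roughly as below. This is a sketch, not the tool actually used: `total_post_num` and the 'Need valid topic id!' message come from the log, while the `posts` and `result_text` field names, the function name, and the returned labels are assumptions for illustration.

```python
def classify_thread_response(data: dict) -> str:
    """Label a decoded get_thread response according to the quirks above."""
    # Assumed field name for the error message: "result_text".
    error = data.get("result_text") or ""
    if "Need valid topic id" in error:
        # Seen on threads that require logging in (e.g. 247631).
        return "need_valid_topic_id"

    total = int(data.get("total_post_num") or 0)
    # Assumed field name for the list of posts: "posts".
    posts = data.get("posts") or []
    if total > 0 and not posts:
        # Thread metadata present but no posts: the redirect/merge case
        # (e.g. 826132 redirecting to 795536).
        return "redirect_or_merged"
    if posts:
        return "ok"
    return "empty"
```

Responses labelled `redirect_or_merged` would need a second request against the target thread ID, which the API response as described here does not obviously provide.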
23:28:26 <DLoa> I can log in on NBR if it helps.
23:49:16 <JAA> daxxy: Welp: http://forum.notebookreview.com/mobiquo/tapatalk.php?method_name=get_thread&topicId=763489&returnHtml=1&page=1&perPage=100
23:50:01 <JAA> That one works fine through the website: http://forum.notebookreview.com/threads/why-arent-laptop-gpus-officially-sold.763489/
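For reference, the get_thread URL above can be assembled like this. The path and parameter names are copied from the URL JAA posted; the helper name `get_thread_url` and its defaults are hypothetical.

```python
from urllib.parse import urlencode


def get_thread_url(base, topic_id, page=1, per_page=100, return_html=True):
    """Build an unauthenticated Tapatalk get_thread GET URL for one page."""
    query = urlencode({
        "method_name": "get_thread",
        "topicId": topic_id,
        "returnHtml": 1 if return_html else 0,
        "page": page,
        "perPage": per_page,
    })
    return f"{base.rstrip('/')}/mobiquo/tapatalk.php?{query}"


# Reproduces the URL from the log for thread 763489.
print(get_thread_url("http://forum.notebookreview.com", 763489))
```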
23:51:02 <daxxy> weird
23:51:10 <JAA> There are plenty more like that, it seems. Just running a little test with random IDs right now and immediately hit three like it.
23:51:32 <JAA> 738640 and 742521 are the other two.
23:51:53 <JAA> Their WAF is very, very odd.
23:53:02 <JAA> Haven't documented this anywhere yet, but any request containing 'temp' as a word gets blocked. Same for 'tmp' and one other I can't remember right now. And anything with 'nessus' results in a connection reset.
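A rough pre-check for the WAF behaviour described above, so that affected URLs can be queued separately. It only covers what the log states ('temp' and 'tmp' as whole words, 'nessus' anywhere); the third blocked word is unknown and deliberately left out. Case-insensitive matching, treating any non-alphanumeric character as a word boundary, and the example URLs in the usage note are all assumptions.

```python
import re
from urllib.parse import unquote

# 'temp' / 'tmp' only count when they stand alone as a word.
# Assumption: anything that is not a letter or digit separates words.
_BLOCKED_WORDS = re.compile(
    r"(?<![A-Za-z0-9])(temp|tmp)(?![A-Za-z0-9])", re.IGNORECASE
)


def waf_risk(url: str) -> str:
    """Guess how the WAF would treat a request URL: 'reset', 'blocked' or 'ok'."""
    text = unquote(url)
    if "nessus" in text.lower():
        return "reset"      # connection reset, per the log
    if _BLOCKED_WORDS.search(text):
        return "blocked"    # request blocked, per the log
    return "ok"
```

With a hypothetical thread slug such as `/threads/high-temp-on-idle.1/` this returns `blocked`, while `/threads/temperature-question.2/` returns `ok`. It does not explain failures like thread 763489, which contains none of these words.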
23:54:35 <JAA> But yeah, can't even get everything through the API. WTF?