-
scowlee
duolingo is killing their forums at the end of march, public announcement to come but a ton of user-created language guides and resources will be gone
-
OrIdow6
We should rename ArchiveTeam to ForumTeam if this keeps up
-
OrIdow6
Anything public so far scowlee or is this inside info? And if the latter when can we expect a public announcement?
-
jamesp
OrIdow6: We should remain Archive Team, but Forum Team should specialize in forums.
-
jamesp
On the Wiki home, it doesn't mention Fandom.
-
OrIdow6
jamesp: FWIW there has been a never-implemented idea to turn #msgbored into something like that, hence its topic
-
OrIdow6
Beware, Duolingo sometimes returns inaccurate 200s (with body "500 Internal Server Error"), I suspect there's other status code weirdness too
-
OrIdow6
*Duolingo forums API
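A crawler can guard against this by checking the body as well as the status code. The following is a minimal sketch, assuming the misleading responses always carry the literal text quoted above; the function names are hypothetical and not part of any Archive Team tooling.

```python
def looks_like_soft_error(status: int, body: str) -> bool:
    """Return True if a 200 response body is really an error page."""
    if status != 200:
        return False
    # The body reported in the chat was "500 Internal Server Error".
    text = body.strip().lower()
    return text.startswith("500 internal server error")


def should_retry(status: int, body: str) -> bool:
    """Retry on real 5xx codes and on 200s that carry an error body."""
    return status >= 500 or looks_like_soft_error(status, body)
```

Other status code oddities would need their own checks once they are observed.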
-
OrIdow6
Doing a quick estimate - does look like there are a few tens of millions of posts
-
JAA
The MapleTip Forums have vanished in the past few hours. The AB job did not manage to grab everything in time, but it looks like a good majority of the content was covered.
-
scowlee
OrIdow6: i think it should be announced within a week or so
-
daxxy
JAA, looked into using tapatalk for getting missing / machine-readable content from the technologyguide forums some more, turns out we *can* use unauthenticated GET requests for everything
-
daxxy
unless there's stuff only visible when logged in? all I've seen that requires login is attachment data (metadata/thumbnails are open)
-
daxxy
I'd be about ready to write a script for myself, is there any interest in running this as an "ArchiveTeam crawl", even though the HTML (sans broken pages) is already done? (I probably couldn't grab everything, nor have it end up in WBM)
-
arkiver
rewby: can we get a target for pinger.pl ?
-
arkiver
it would be archiveteam_pinger
-
arkiver
pinger_
-
arkiver
Archive Team pinger:
-
rewby
arkiver: Sure. What kind of file size you thinking and what kind of rate? + Is there a channel for this?
-
arkiver
rewby: i think will not be large at all
-
arkiver
no channel at the moment, but we can think of one
-
rewby
I set the target in the project
-
rewby
*tracker
-
arkiver
yep, already pushed the first items into it
-
ThreeHM
No docker image yet for pinger?
-
rewby
ThreeHM: I'll go make one
-
ThreeHM
Thanks!
-
rewby
It's building, give it a few minutes and it'll be at the usual address
-
rewby
ThreeHM: Build done
-
Craigle
Pinger just started returning a ton of 400's
-
Craigle
arkiver ^
-
arkiver
403?
-
Craigle
Some, but I was seeing a wall of 400's with a few 403's and 200's
-
arkiver
looks fine to me
-
arkiver
hmm
-
arkiver
the site is pretty unstable yeah :/
-
Craigle
Just picked back up
-
Craigle
Yeah, that was my thought
-
arkiver
lets hope it stays online a little after the 31st
-
arkiver
will see about contacting then
-
arkiver
them*
-
arkiver
anyone have ideas for pinger channel name?
-
Sanqui
pingas
-
Sanqui
sorry, I thought at first it would be a project for long term pings... idk lol
-
OrIdow6
Not on Deathwatch?
-
monika
#pinged maybe?
-
OrIdow6
#pingedout
-
arkiver
lets do #pinged
-
arkiver
saw yours too late OrIdow6 :P
-
h2ibot
OrIdow6 edited Deathwatch (+112, /* 2022 */ Add pinger.pl):
wiki.archiveteam.org/?diff=48225&oldid=48215
-
OrIdow6
Did anything happen to forum.chip.de after the AB job got banned? Looks like they've made their change
-
OrIdow6
I'm going to move its category anyhow
-
JAA
I archived it fully, and it completed ten minutes before they added rules to their Buttflare configuration blocking most automated access.
-
h2ibot
OrIdow6 edited Deathwatch (+12, Forum.chip.de has made its changes):
wiki.archiveteam.org/?diff=48226&oldid=48225
-
OrIdow6
Oh, good
-
DLoa
Hi, I joined today and my Warrior VM has been running for over 6 hrs. I'd like to back up forum threads which are of interest to me on the NotebookReview forums, which are closing for good in 2 days. Is there a way to selectively apply my Warrior VM to this and contribute to NBR archiving? @JAA is working on archiving this already, I believe, and I'd like to
-
DLoa
help. Thank you
-
JAA
DLoa: There is no distributed project for TechnologyGuide, so no, you can't. I have already archived (nearly) the entire four forums, only a few dozen threads missing that I will be looking into tonight.
-
daxxy
JAA, do you want to grab those threads yourself? I've written down my notes here
gist.github.com/drdaxxy/b7731fb4217a56604956bcaa45641648
-
JAA
daxxy: Brilliant, thanks! Sorry for the delay, didn't have time to look into it yet.
-
daxxy
no worries :) what sorta resources / time did the HTML crawl take?
-
JAA
About a day for all four forums with decent parallelism and multiple IPs. Not sure whether the IPs were actually needed or not.
-
JAA
Also, yes, there are threads that require logging in. I'm not sure whether they're accessible to normal users or only mods or similar though.
-
JAA
We generally only archive things that are publicly accessible.
-
JAA
returnHtml=1 on get_thread renders the BBCode as HTML.
-
JAA
Well, partially, anyway. [url=...] is not transformed apparently.
-
daxxy
neither is [quote]
-
daxxy
nor img, so I have no idea if they actually render any BBCode or just newlines and maybe entities :v
-
JAA
I'm seeing some <i> stuff as well.
-
JAA
But yeah, it's weird.
-
JAA
Smilies aren't translated into img tags either.
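The tags mentioned above as left untouched ([url=...], [quote], [img]) could be converted after the fact. This is a rough sketch only: it assumes the input is the already partly rendered HTML from returnHtml=1, it handles just these three tags, and the function name is hypothetical. It is not the Tapatalk formatter's logic, and smilies are not covered.

```python
import html
import re

IMG_RE = re.compile(r"\[img\](.*?)\[/img\]", re.I | re.S)
URL_RE = re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.I | re.S)
QUOTE_OPEN_RE = re.compile(r"\[quote[^\]]*\]", re.I)
QUOTE_CLOSE_RE = re.compile(r"\[/quote\]", re.I)


def _attr(value: str) -> str:
    """Strip optional quotes and escape a value for use in an HTML attribute."""
    return html.escape(value.strip().strip("\"'"), quote=True)


def leftover_bbcode_to_html(text: str) -> str:
    """Convert leftover [img], [url=...] and [quote] tags; keep existing HTML."""
    text = IMG_RE.sub(lambda m: f'<img src="{_attr(m.group(1))}">', text)
    text = URL_RE.sub(
        lambda m: f'<a href="{_attr(m.group(1))}">{m.group(2)}</a>', text
    )
    # Opening and closing quote tags are replaced separately so nesting survives.
    text = QUOTE_OPEN_RE.sub("<blockquote>", text)
    return QUOTE_CLOSE_RE.sub("</blockquote>", text)
```

Existing markup such as <i> passes through unchanged, since the input is treated as HTML rather than escaped.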
-
daxxy
okay, at least [b] just gets removed if returnHtml=0, see post 540654 in thread 75253 for example
-
JAA
Aw, there's a get_raw_post method, but that only works for users who can edit the post (i.e. poster/mods).
-
daxxy
yeah, I saw that, but now that you say it... I should talk to the mods, they seem interested in archival
-
daxxy
but since I figure this definitely isn't the place for crawling with a mod account - would you recommend the *-grab template for "outsiders" right now, or would I likely be better off hacking something together on my own?
-
JAA
The -grab template is really only applicable to distributed projects, which is a major part of AT but not the only thing we do. I used my own tool (qwarc) for archiving the forums, but I can't recommend it to anyone as it's very much not user-friendly.
-
JAA
And yeah, crawling with a mod account is not going to happen.
-
JAA
(... here)
-
JAA
I think I'll regrab all threads with get_thread, probably with returnHtml=0 but haven't decided yet.
-
JAA
Trying to figure out where that transformation happens, but haven't quite found it.
-
daxxy
library/Tapatalk/Bridge.php, library/Tapatalk/BbCode/Formatter/Tapatalk.php, mobiquo/mbqClass/lib/read/MbqRdEtForumPost.php are the relevant places I've found
-
JAA
Ah, push/TapatalkPush.php cleanPost, but it delegates to Tapatalk_BbCode_Formatter_Tapatalk which isn't in the plugin.
-
daxxy
it's in the archive, you may have only extracted the mobiquo folder
-
JAA
Oh, right. I was grepping inside mobiquo, yeah.
-
JAA
Wow, this code is a mess.
-
daxxy
hah
-
JAA
Random indentation is exactly why I love Python.
-
hexa-
python2*
-
» JAA slaps hexa- around a bit with a large trout
-
» hexa- slaps JAA back with python2.7 … BEST BEFORE 2Y AGO
-
JAA
Great, thanks, now I have food poisoning. :-(
-
hexa-
I'm burnt, I do a lot of python packaging in NixOS :(
-
JAA
[b] and [i] get stripped, [u] gets converted to <u>, [color] becomes a font tag, [img] should get stripped in both settings if I'm reading the code correctly.
-
daxxy
img stripped? where are you seeing that?
-
JAA
Nevermind, it gets treated specially it seems.
-
JAA
library/Tapatalk/BbCode/Formatter/Tapatalk.php is what I'm looking at.
-
JAA
Specifically the getTags function.
-
daxxy
tbh, I don't think there's a need to analyze this properly right now -- we're not gonna get a lossless copy anyway, and clearly they only leave bbcode in that matches the parser in their app
-
daxxy
(the android app uses returnHtml=1, btw)
-
JAA
Hmm, it would be neat if we could archive it in a way that someone could simply plug a Wayback Machine URL into the app and it all plays back correctly. But getting that to work would be quite a challenge.
-
JAA
And it'd probably break anyway.
-
daxxy
you mean the tapatalk app?
-
daxxy
definitely not gonna work
-
daxxy
for one, unless there's a way to force it into using the JSON API (doubt it, since the JSON API is newer, it ought to be preferred if client and server support it already), it POSTs to the xml-rpc interface and there's no way to make it request different URLs for different content
-
JAA
Right
-
daxxy
writing a new (entirely client-side) webapp that reads everything from WBM (plus an externally hosted search index file, if you wanna get fancy) would work though, and not even take that much effort I think
-
daxxy
when you're not supporting 2 protocols in 8 codebases over 3 inheritance levels, this does not have to be complex software :P
-
JAA
:-)
-
JAA
It would have to be in the WBM though due to (the lack of) CORS.
-
daxxy
yeah, wasn't sure about that
-
JAA
Anyway, that's something for the future. First step is getting the data.
-
daxxy
...but then you can always just put your site into WBM, right? ^^
-
JAA
Also, someone here was working on a forum archive ingestion thingy a while ago. Not sure what happened to that idea.
-
JAA
Yes, that's what I did with the Picosong data finder thingy.
-
JAA
I'm going with returnHtml=0. As far as I can tell, it preserves a bit more data than =1 does, and the conversion should be easy enough.
-
daxxy
huh, what does it preserve that =1 doesn't?
-
JAA
[b] [i]
-
daxxy
hang on
-
daxxy
no, =1 transforms them to HTML, but =0 strips them completely
-
JAA
Huh
-
JAA
Oh
-
daxxy
any idea about the timeframe? if I (get the mods to) grab anything more, I'd rather do that after you've done your thing (especially with the missing posts) so my traffic won't get in your way
-
JAA
Ok yeah, =1 it is I guess.
-
JAA
I need to leave for a bit but will get it up and running in the next 1-2 hours.
-
daxxy
nice
-
DLoa
JAA: This is great news. I was having trouble with a Python 3 script using wkhtmlto/pdfkit to save each page of the threads of interest to PDF; it ran into issues after one or two pages (blocked?). I hope that it'll be possible to download to PDF for offline use. I have an account and can look into the remaining threads that could
-
DLoa
be saved.
-
JAA
daxxy: The Tapatalk API has an ... interesting behaviour on thread redirects, e.g. thread 826132 on NotebookReview which redirects to 795536. It returns the data for the merged(?) thread and a positive total_post_num but no actual posts.
-
daxxy
huh
-
daxxy
I was wondering what it'd do with merged/moved threads but hadn't come across any, thanks
-
JAA
Also, on threads that require logging in, it returns a 'Need valid topic id!' error, e.g. 247631 on NotebookReview.
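Both quirks (a redirected thread reporting a positive total_post_num with no posts, and the 'Need valid topic id!' error on login-only threads) can be told apart when post-processing responses. A minimal sketch, assuming the response has already been decoded into a dict: total_post_num and the error text come from the messages above, while the `posts` and `result_text` key names and the function name are assumptions for illustration.

```python
def classify_thread_response(resp: dict) -> str:
    """Sort a decoded get_thread response into a coarse category."""
    # Assumed key: the error message is carried in "result_text".
    if "Need valid topic id" in str(resp.get("result_text", "")):
        return "login_required_or_missing"
    # Assumed key: the post list is carried in "posts".
    posts = resp.get("posts") or []
    if resp.get("total_post_num", 0) > 0 and not posts:
        # Positive count but no posts: the redirect behaviour described above.
        return "redirect_suspected"
    if posts:
        return "ok"
    return "empty"
```

Threads flagged as `redirect_suspected` would still need their target ID resolved some other way, for example from the HTML crawl.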
-
DLoa
I can log in on NBR if it helps.
-
daxxy
weird
-
JAA
There are plenty more like that, it seems. Just running a little test with random IDs right now and immediately hit three like it.
-
JAA
738640 and 742521 are the other two.
-
JAA
Their WAF is very, very odd.
-
JAA
Haven't documented this anywhere yet, but any request containing 'temp' as a word gets blocked. Same for 'tmp' and one other I can't remember right now. And anything with 'nessus' results in a connection reset.
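Requests likely to trip these rules can be flagged before they are sent. A minimal sketch, assuming "as a word" means a regex word boundary and that matching is case-insensitive (neither is confirmed); the third blocked word is not known and is left out, and the function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: "as a word" corresponds to \b word boundaries.
WORD_BLOCKED_RE = re.compile(r"\b(?:temp|tmp)\b", re.I)


def waf_risk(request_text: str) -> Optional[str]:
    """Return 'reset', 'blocked', or None for a URL or request string."""
    if "nessus" in request_text.lower():
        # Anything containing 'nessus' reportedly gets a connection reset.
        return "reset"
    if WORD_BLOCKED_RE.search(request_text):
        return "blocked"
    return None
```

For example, `waf_risk("https://example.com/threads/temp-readings.1/")` is flagged, while a URL containing "temperature" is not.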
-
JAA
But yeah, can't even get everything through the API. WTF?