-
Terbium
TheTechRobo: wpull isn't that difficult to install i would say
-
JAA
As long as you have a supported (i.e. long EOL) version of Python.
-
JAA
And you deal with the broken dependencies.
-
JAA
And you deal with the broken CLI.
-
Terbium
I'm using Python 3.12 with it, with the latest Tornado and dependencies :)
-
JAA
But other than *that*, it's just fine. :-)
-
JAA
Yes, but not the regular wpull. :-)
-
Terbium
Close enough for me (it's just missing the "ludios_" prefix in the name haha)
-
Terbium
Back to working on my PR into ludios_wpull, hopefully will get a Python 3.11+ version into master branch within the next 1-2 weeks
-
Terbium
wpull++
-
eggdrop
[karma] 'wpull' now has 1 karma!
-
TheTechRobo
Wget-AT is love, Wget-AT is life
-
fireonlive
oh! you're the one committing into the python 3.11 branch of pull
-
fireonlive
wpull
-
fireonlive
nice to meet you Terbium :3
-
Terbium
Nice to meet you (again) :)
-
fireonlive
ye again x3
-
fireonlive
:)
-
Terbium
I mostly lurk hehe
-
fireonlive
:3
-
fireonlive
a watchful eye
-
JAA
Ah, now that makes sense. :-)
-
fireonlive
we were wondering who the mysterious committer was
-
fireonlive
:p
-
fireonlive
(#at-changes)
-
Terbium
oh what?
-
ScenarioPlanet
Here's a question about wget-at: is it possible to send & save a POST request with some formdata with a single command? Or does it need any .lua script for that?
-
Terbium
lol, I've been talking to Ivan on and off about it, didn't realize it was stirring up some confusion
-
fireonlive
there's a newly founded channel (thanks to nulldata) that posts new commits in ArchiveTeam repos
-
fireonlive
and stuff related to issues/PRs
-
fireonlive
oh and docker images/wiki changes
-
Terbium
I had a private fork with numerous changes/modernizations for wpull along with grab-site, with CI/CD + docker. Just recently got some time to work on migrating some changes to the mainline repo
-
fireonlive
oh awesome :)
-
fireonlive
nice to see :3
-
fireonlive
oh hey, our wiki got linked
news.ycombinator.com/item?id=34734177#34737936 (google alert came in)
-
fireonlive
well, one of our secondary wikis
-
Terbium
still a shame this hasn't moved along in the 2 years i've been watching it:
facebook/zstd #2349
-
Terbium
zstandard's pretty nice, thought about converting my WARC datasets to zstd, but never got around to it
-
fireonlive
:(
-
» fireonlive pokes the developers of archivebox too
-
Terbium
archivebox is great, shame it doesn't do recursive crawling
-
JAA
And shame its WARCs are weird because wget.
-
fireonlive
ye, need to switch to wget-at i suppose
-
fireonlive
recursive would be cool :)
-
JAA
I was just wondering how hard it would be to make that work.
-
fireonlive
i do have my instance still up, though i've barely used it since JAA pooped on it
-
fireonlive
:P
-
JAA
:-P
-
JAA
I mean, at least wget doesn't corrupt data or similar.
-
JAA
:-)
-
Terbium
JAA you should use a toilet instead of fireonlive's archivebox instance :)
-
fireonlive
xP
-
» fireonlive attempts to remember wget's warc faults
-
Terbium
JS rendering is still a big pain to deal with, I'm resorting to Chromium browsers for archiving since wpull and wget-at don't cut it for those
-
JAA
Yeah, brozzler I guess for that.
-
fireonlive
yeah, the new world of javascript for everything and soon HTTP/2 and HTTP/3
-
Terbium
currently using warcprox with chromium containers in Kubernetes
-
fireonlive
ooh :3
-
JAA
Neat
-
Terbium
brozzler was a bit too vertically integrated for my liking when i looked at it 3-4 years ago
-
fireonlive
is cloudflare happy with you?
-
Terbium
using a couple proxies and captcha solvers to work around buttflare. Still a pain to deal with
-
fireonlive
ahh
-
fireonlive
prowlarr had an integration for one of those pay as you go services, thought it quite neat
-
Terbium
"neat"ly burning a hole in my wallet :P
-
fireonlive
:P
-
fireonlive
gotta set up a free site with "something people want" and in order to get access to said content they solve captchas for you
-
fireonlive
:D
-
Terbium
We could make it an AT project, outsourcing captchas to volunteers lol
-
fireonlive
haha
-
Terbium
But in all seriousness, hcaptchas are pretty difficult compared to recaptcha
-
fireonlive
>_<
-
fireonlive
wonder if they deal with the same
-
fireonlive
also oops, meant to offtopic that lol
-
qwertyasdfuiopghjkl
something like a captcha solving leaderboard
-
fireonlive
gotta gamify it :3
-
thuban
you jest, but we did have a leaderboard for joining yahoo groups
-
Terbium
Yep, I was there for yahoo groups lol
-
Terbium
That one was insane
-
fireonlive
ooh, that sounds fun lol
-
fireonlive
'fun'
-
flashfire42|m
Fireonlive constantly doing captchas? And not even getting paid. Not my idea of fun
-
fireonlive
ah were you a yahooligan?
-
Ryz
Hmm, outsourcing ReCAPTCHAs to volunteers; I partipicated in that madness before with Yahoo Groups...I had some interesting image finds from doing the solving~
-
Ryz
*participated
-
Terbium
Yahoo Groups was a big pain due to all the private groups :(
-
Ryz
I wouldn't mind doing reCAPTCHA stuff for internet archiving purposes, similar to the Yahoo Groups stuff; do 'em when I'm bored or something
-
Ryz
I do ponder if something like that would be possible in general, via #// ...?
-
Ryz
Hmm, interesting, Mwmbl has a Firefox extension that when installed and enabled, uses computer resources for web crawling,
addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler
-
Pedrosso
twitter.com/Mineteria the twitter of a Minecraft server which years back got merged into another. I'm surprised the twitter account is still up
-
Pedrosso
will do
-
fireonlive
in which fireonlive types too many words on the etherpad
-
fireonlive
too much pad not enough ether 🥴
-
Doranwen
Oh, the Yahoo Groups captchas bring back so many memories… all that training AI to read rumble strips as crosswalks, lol.
-
Doranwen
The other top people who worked on the fandom side of the project were revisiting the history of it the other day and remembering some of the captcha discussion and joking. Me, I think about YG all the time, but that's because I'm still sorting metadata. So many weird groups that used to exist.
-
thuban
i still have a folder full of screenshots of 'ceci n'est pas un pipe' situations
-
thuban
*une
-
Ryz
Wouldn't mind doing more of the Captcha stuff <#>;
-
Ryz
Saying this because I recall years ago, there's a website that uses Google reCAPTCHA for the sole purpose of doing it for a high score~
-
flashfire42
You realise that was probably aiding a spam operation?
-
Ryz
It was like years ago, like, I think a decade ago?
-
Ryz
It was back when reCAPTCHAs were all about transcribing text from books, because of Google Books at the time
-
Ryz
I do know that after solving them enough times, it gives a much harder version, I'm assuming because of having to solve them a bunch in a row
-
JAA
Yeah, there are lots of services where you can spend your time as a Mechanical Turk to earn pennies and help spam bots break captchas.
-
fireonlive
tfw mturk banned me
-
fireonlive
>:(
-
Ryz
Aww yeah, this is what I used to see when I was doing it for a high score:
3.bp.blogspot.com/-SnVfcK0v9Lc/Ur4F…jFTU/s1600/reCAPTCHA+don't+type.jpg
-
ShadowJonathan
Heya, could I request two websites for a re-scrape? One is showing signs of bit rot (and is behind cloudflare), while the other I had already requested a scrape of while the site was having a rough period, but since it's stable now, I'd wanna request it again to make sure all pages got captured
-
thuban
ShadowJonathan: go ahead, but be aware that we may not be able to do much through cloudflare protection
-
ShadowJonathan
ait, the CF-protected fanfic site is www.fanfiction.net
-
thuban
ah
-
ShadowJonathan
the head domain fanfiction.net stopped resolving, and thats why im kinda panicking
-
ShadowJonathan
or well, its a signal of bitrot and neglect, for me
-
ShadowJonathan
the other site, the well-working one, is www.cyoc.net, a "choose your own adventure" submission website, but NSFW
-
thuban
yeah, we've discussed ffn on a number of occasions (including when that started happening)
-
thuban
but cloudflare's a bitch
-
ShadowJonathan
ah
-
ShadowJonathan
alrighty then :(
-
thuban
there are theoretical plans of attack, but it's a lot of dev work that nobody's had the time to do :(
-
thuban
someone should be along to queue the other site in a bit
-
ShadowJonathan
alrighty, thanks
-
Vokun
The art of second person story telling is underutilized. They should make a non nsfw site. This is an interesting concept
-
Doranwen
ShadowJonathan: I've taken the precautions of downloading all the fics I might ever want to read, via fichub-cli, but that's as good as I know how to do.
-
Doranwen
My "I wish this were being actively worked on" is LiveJournal but #recordedjournal hasn't had any activity in a long while. The only archiving tool out there currently is something someone (not an AT person) cooked up to use that requires Excel macros. I run Linux and don't have MSOffice so can't use it at all, alas.
-
pokechu22
ShadowJonathan: I'm pretty sure our job for cyoc.net is complete (or was complete when it was done a few months ago) - I did check to make sure all pages were captured after the fact
-
ShadowJonathan
ah alright
-
ShadowJonathan
i might've forgotten that, or that might've slipped my mind
-
ShadowJonathan
i still remember the anxiety of trying to download it, so maybe thats that
-
pokechu22
Specifically I think I did a second job when the site started being faster where I saved all of the user pages, and then I checked to make sure all of the stories linked from those had been saved by the first job
-
pokechu22
... and then did one additional job that covered the missed ones (which were mainly new chapters posted afterwards:
archive.fart.website/archivebot/vie…am-www.cyoc.net_missed_chapters.txt)
-
ShadowJonathan
alrighty, thanks :)
-
h2ibot
OrIdow6 edited FanFiction.Net (+565, On false negatives during replay due to…):
wiki.archiveteam.org/?diff=51414&oldid=48810