00:13:57 TheTechRobo: wpull isn't that difficult to install i would say
00:15:00 As long as you have a supported (i.e. long EOL) version of Python.
00:15:07 And you deal with the broken dependencies.
00:15:12 And you deal with the broken CLI.
00:15:31 I'm using Python 3.12 with it, with the latest Tornado and dependencies :)
00:15:35 But other than *that*, it's just fine. :-)
00:15:45 Yes, but not the regular wpull. :-)
00:16:15 Close enough for me (it's just missing the "ludios_" prefix in the name haha)
00:19:25 Back to working on my PR into ludios_wpull, hopefully will get a Python 3.11+ version into the master branch within the next 1-2 weeks
00:21:14 wpull++
00:21:15 -eggdrop- [karma] 'wpull' now has 1 karma!
00:21:35 Wget-AT is love, Wget-AT is life
00:22:33 oh! you're the one committing into the python 3.11 branch of pull
00:22:35 wpull
00:22:40 nice to meet you Terbium :3
00:22:58 Nice to meet you (again) :)
00:25:16 ye again x3
00:25:18 :)
00:25:48 I mostly lurk hehe
00:26:14 :3
00:26:20 a watchful eye
00:30:46 Ah, now that makes sense. :-)
00:31:13 we were wondering who the mysterious committer was
00:31:16 :p
00:31:26 (#at-changes)
00:32:56 oh what?
00:33:21 Here's a question about wget-at: is it possible to send & save a POST request with some form data with a single command? Or does it need a .lua script for that?
00:33:27 lol, I've been talking to Ivan on and off about it, didn't realize it was stirring up some confusion
00:34:19 there's a newly founded channel (thanks to nulldata) that posts new commits in ArchiveTeam repos
00:34:36 and stuff related to issues/PRs
00:34:51 oh and docker images/wiki changes
00:34:56 I had a private fork with numerous changes/modernizations for wpull along with grab-site, with CI/CD + docker.
just recently got some time to work on migrating some changes to the mainline repo
00:35:07 oh awesome :)
00:35:24 nice to see :3
00:36:20 oh hey, our wiki got linked https://news.ycombinator.com/item?id=34734177#34737936 (google alert came in)
00:36:34 well, one of our secondary wikis
00:40:20 still a shame this hasn't moved along in the 2 years i've been watching it: https://github.com/facebook/zstd/pull/2349
00:40:59 zstandard's pretty nice, thought about converting my WARC datasets to zstd, but never got around to it
00:47:41 :(
00:48:17 * fireonlive pokes the developers of archivebox too
00:48:51 archivebox is great, shame it doesn't do recursive crawling
00:49:13 And shame its WARCs are weird because wget.
00:49:37 ye, need to switch to wget-at i suppose
00:49:46 recursive would be cool :)
00:49:49 I was just wondering how hard it would be to make that work.
00:50:06 i do have my instance still up, though i've barely used it since JAA pooped on it
00:50:09 :P
00:50:15 :-P
00:50:26 I mean, at least wget doesn't corrupt data or similar.
00:50:30 :-)
00:50:34 JAA you should use a toilet instead of fireonlive's archivebox instance :)
00:51:20 xP
00:51:57 * fireonlive attempts to remember wget's warc faults
00:52:01 JS rendering is still a big pain to deal with, I'm resorting to Chromium browsers for archiving since wpull and wget-at don't cut it for those
00:52:27 Yeah, brozzler I guess for that.
00:52:36 ah! https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem
00:53:18 yeah, the new world of javascript for everything and soon HTTP/2 and HTTP/3
00:53:37 currently using warcprox with chromium containers in Kubernetes
00:54:01 ooh :3
00:54:10 Neat
00:54:12 brozzler was a bit too vertically integrated for my liking when i looked at it 3-4 years ago
00:54:14 is cloudflare happy with you?
00:55:04 using a couple proxies and captcha solvers to work around buttflare.
Still a pain to deal with
00:55:52 ahh
00:56:25 prowlarr had an integration for one of those pay-as-you-go services, thought it quite neat
00:58:05 "neat"ly burning a hole in my wallet :P
00:58:54 :P
00:59:27 gotta set up a free site with "something people want" and in order to get access to said content they solve captchas for you
00:59:33 :D
01:00:43 We could make it an AT project, outsourcing captchas to volunteers lol
01:00:58 haha
01:01:24 But in all seriousness, hcaptchas are pretty difficult compared to recaptcha
01:01:37 >_<
01:03:13 -+rss/#hackernews- Indexing a Billion Pages: https://blog.mwmbl.org/articles/indexing-a-billion/ https://news.ycombinator.com/item?id=38744224
01:03:22 wonder if they deal with the same
01:03:32 also oops, meant to offtopic that lol
01:04:42 something like a captcha solving leaderboard
01:04:58 gotta gamify it :3
01:05:54 you jest, but we did have a leaderboard for joining yahoo groups
01:07:32 Yep, I was there for yahoo groups lol
01:07:37 That one was insane
01:17:11 ooh, that sounds fun lol
01:17:15 'fun'
02:39:02 Fireonlive constantly doing captchas? And not even getting paid. Not my idea of fun
02:39:26 ah were you a yahooligan?
03:16:55 Hmm, outsourcing ReCAPTCHAs to volunteers; I partipicated in that madness before with Yahoo Groups... I had some interesting image finds from doing the solving~
03:17:05 *participated
03:17:22 Yahoo Groups was a big pain due to all the private groups :(
03:23:30 I wouldn't mind doing reCAPTCHA stuff for internet archiving purposes, similar to the Yahoo Groups stuff; do 'em when I'm bored or something
03:31:42 I do ponder if something like that would be possible in general, via #// ...?
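On the 00:33 wget-at question (sending and saving a POST request with form data in a single command): stock GNU wget, which Wget-AT extends, accepts `--post-data` alongside `--warc-file`, so a Lua script shouldn't be needed just to send the request. A hedged sketch only; the URL and form fields are placeholders, and how faithfully the POST body ends up recorded in the WARC is not something this log confirms:

```shell
# Hypothetical single-command POST capture; both flags are inherited
# from GNU wget.  --post-data supplies the form body, --warc-file
# writes the exchange to example.warc.gz.  URL and fields are placeholders.
wget --post-data='name=value&other=value2' \
     --warc-file=example \
     'https://example.com/form-endpoint'
```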
04:17:28 Hmm, interesting, Mwmbl has a Firefox extension that, when installed and enabled, uses computer resources for web crawling: https://addons.mozilla.org/en-GB/firefox/addon/mwmbl-web-crawler/
04:34:02 https://twitter.com/Mineteria the twitter of a Minecraft server which years back got merged into another. I'm surprised the twitter account is still up
04:34:03 nitter: https://nitter.net/Mineteria
05:09:55 Pedrosso: add it to https://pad.notkiska.pw/p/archivebot-twitter
05:10:43 will do
06:02:40 in which fireonlive types too many words on the etherpad
06:03:05 too much pad not enough ether 🥴
06:39:48 Oh, the Yahoo Groups captchas bring back so many memories… all that training AI to read rumble strips as crosswalks, lol.
06:41:38 The other top people who worked on the fandom side of the project were revisiting the history of it the other day and remembering some of the captcha discussion and joking. Me, I think about YG all the time, but that's because I'm still sorting metadata. So many weird groups that used to exist.
06:44:08 i still have a folder full of screenshots of 'ceci n'est pas un pipe' situations
06:45:11 *une
06:47:53 Wouldn't mind doing more of the Captcha stuff <#>;
06:49:59 Saying this because I recall, years ago, there was a website that used Google reCAPTCHA for the sole purpose of doing it for a high score~
06:50:40 You realise that was probably aiding a spam operation?
06:52:12 It was years ago, like, I think a decade ago?
06:52:44 It was back then, when reCAPTCHAs were all about transcribing text from books because of Google Books at the time
06:53:29 I do know that after solving them enough times, it gives a much harder version, I'm assuming because of having to solve them a bunch in a row
06:57:06 Yeah, there are lots of services where you can spend your time as a Mechanical Turk to earn pennies and help spam bots break captchas.
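Returning to the 00:40 aside about converting WARC datasets to zstd: the simplest form is whole-stream recompression. A hedged sketch, with placeholder file names; note that tooling built around the `.warc.zst` convention (each record in its own zstd frame, optionally with an embedded dictionary, as Wget-AT produces) may not accept a single-stream file like this, so this is only a storage-savings sketch:

```shell
# Hypothetical recompression of a gzipped WARC into a single zstd stream.
# -19 trades CPU for ratio; --long=27 enables long-distance matching
# with a 128 MiB window, which tends to help on repetitive web data.
zcat example.warc.gz | zstd -19 --long=27 > example.warc.zst
```

Decompression would need the matching window flag (`zstd -d --long=27`).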
06:57:35 tfw mturk banned me
06:57:38 >:(
06:58:04 Aww yeah, this is what I used to see when I was doing it for a high score: https://3.bp.blogspot.com/-SnVfcK0v9Lc/Ur4F_bIMyEI/AAAAAAAACcc/S5inQO5jFTU/s1600/reCAPTCHA+don't+type.jpg
10:04:40 Heya, could I request two websites for a re-scrape? One is showing signs of bit rot (and is behind cloudflare), while for the other I had already requested a scrape while the site was having a rough period; since it's stable now, I wanna request it again so it's assured all pages got captured
11:51:11 ShadowJonathan: go ahead, but be aware that we may not be able to do much through cloudflare protection
11:51:32 ait, the CF-protected fanfic site is www.fanfiction.net
11:51:41 ah
11:51:47 the head domain fanfiction.net stopped resolving, and that's why i'm kinda panicking
11:51:56 or well, it's a signal of bitrot and neglect, for me
11:52:23 the other site, the well-working one, is www.cyoc.net, a "choose your own adventure" submission website, but NSFW
11:53:04 yeah, we've discussed ffn on a number of occasions (including when that started happening)
11:53:15 but cloudflare's a bitch
11:53:38 ah
11:53:40 alrighty then :(
11:56:46 there are theoretical plans of attack, but it's a lot of dev work that nobody's had the time to do :(
11:57:17 someone should be along to queue the other site in a bit
12:19:17 alrighty, thanks
17:07:50 The art of second-person storytelling is underutilized. They should make a non-NSFW site. This is an interesting concept
18:09:04 ShadowJonathan: I've taken the precaution of downloading all the fics I might ever want to read, via fichub-cli, but that's as good as I know how to do.
18:12:44 My "I wish this were being actively worked on" is LiveJournal, but #recordedjournal hasn't had any activity in a long while. The only archiving tool out there currently is something someone (not an AT person) cooked up that requires Excel macros.
I run Linux and don't have MS Office, so I can't use it at all, alas.
19:14:08 ShadowJonathan: I'm pretty sure our job for cyoc.net is complete (or was complete when it was done a few months ago) - I did check to make sure all pages were captured after the fact
19:14:41 ah alright
19:14:50 i might've forgotten that, or that might've slipped my mind
19:15:08 i still remember the anxiety of trying to download it, so maybe that's that
19:16:08 Specifically, I think I did a second job when the site started being faster, where I saved all of the user pages, and then I checked to make sure all of the stories linked from those had been saved by the first job
19:17:35 ... and then did one additional job that covered the missed ones (which were mainly new chapters posted afterwards): https://archive.fart.website/archivebot/viewer/domain/urls-transfer.archivete.am-www.cyoc.net_missed_chapters.txt
19:30:35 alrighty, thanks :)
21:15:41 OrIdow6 edited FanFiction.Net (+565, On false negatives during replay due to…): https://wiki.archiveteam.org/?diff=51414&oldid=48810
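The coverage check described at 19:16 (verifying that every story linked from the saved user pages was captured by the first job) is essentially a set difference between an expected URL list and a capture log. A minimal sketch; the file names and example.net URLs are toy placeholders, not the real job's lists:

```shell
# Build two toy URL lists: what we expected to capture, and what the
# (hypothetical) first job's log says was actually captured.
printf '%s\n' 'https://example.net/story/1' 'https://example.net/story/2' > expected-urls.txt
printf '%s\n' 'https://example.net/story/1' > captured-urls.txt

# comm needs sorted input.
sort expected-urls.txt > expected.sorted
sort captured-urls.txt > captured.sorted

# -13 suppresses lines unique to the first file and lines common to
# both, leaving only expected URLs absent from the capture.
comm -13 captured.sorted expected.sorted
# prints: https://example.net/story/2
```

The output is exactly the list you would feed into a follow-up job, which matches the "missed chapters" URL list linked above.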