01:23:50 Flashfire42 edited List of websites excluded from the Wayback Machine (+25): https://wiki.archiveteam.org/?diff=51396&oldid=51390
01:34:52 OrIdow6 edited Google Drive (+1594, Make some of my research useful for future…): https://wiki.archiveteam.org/?diff=51397&oldid=50420
01:55:58 OrIdow6++
01:55:59 -eggdrop- [karma] 'OrIdow6' now has 1 karma!
01:56:34 sites do privately appear in folders at least, but hm
02:00:57 JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=51398&oldid=51396
02:15:55 ETA for OneHallyu is 4 days, 3 hours. Probably not going to finish in time.
02:22:01 OrIdow6 edited Google Drive (+86, New discoveries involving Sites in Drive): https://wiki.archiveteam.org/?diff=51399&oldid=51397
02:53:35 * fireonlive waits
02:54:00 cron pls
02:54:07 FireonLive edited Current Projects (+95, attempt to clean up/make easier to read the…): https://wiki.archiveteam.org/?diff=51400&oldid=51243
02:54:10 cc JAA/arkiver
02:56:06 looks good fireonlive
02:56:19 =]
02:56:31 i'm not sure if we still need the ukraine/russian sites project there; it hasn't been running in a long time
02:56:56 ah good point
02:57:35 Long-term, perpetual projects?
02:57:41 did we have a word for 'basically forever'
02:57:49 an internal word, that is
02:58:07 "occurring repeatedly; so frequent as to seem endless and uninterrupted."
02:58:09 that works
02:58:30 i'd just say long term
02:58:40 can't promise we'll keep them running forever
02:58:43 I've used 'continuous' before, but it doesn't really say much.
02:58:50 Yeah, 'long term' is good.
02:59:46 ah ok
03:00:10 Nothing will keep running forever. The heat death of the universe will consume it all.
03:00:17 yep :)
03:00:28 so we might as well fade out now?
03:00:28 50/50 on leaving a blank section, but will leave an empty medium for now, to show that it 'can' exist
03:00:34 arkiver: that's my dream
03:00:55 ouch
03:01:02 '(none currently)'?
03:01:15 ah that works
03:01:27 Rather than just an empty section, which may look weird.
03:01:56 I plan on hanging out on this channel until the heat death. :-)
03:02:06 :)
03:02:15 have we had a scripts-only project in the past N years
03:02:33 *removes commented-out section*
03:04:12 FireonLive edited Current Projects (-133): https://wiki.archiveteam.org/?diff=51401&oldid=51400
03:05:07 arkiver: "2019-202? coronavirus outbreak: Documenting and preserving data, events, and impacts of the virus on society. IRC Channel #coronarchive (on hackint)" < would you call this one not running as well?
03:05:19 yes
03:05:22 kk
03:06:16 FireonLive edited Current Projects (-167, remove coronavirus): https://wiki.archiveteam.org/?diff=51402&oldid=51401
03:06:36 wow i cured the world i guess
03:06:37 :p
03:07:10 Photobucket did the purge *long* ago, right?
03:07:14 Should that be removed from upcoming?
03:07:47 also I feel like some of these hiatuses will never be unhiatused (is that a word?)
03:07:50 e.g. Audit 2014
03:08:28 finally, https://wiki.archiveteam.org/index.php/NewsGrabber has been largely replaced with #//, right? should the wiki page be updated with that info?
03:08:29 there was a line on the Audit 2014 hiatus bullet saying it would be done in 2016, which i removed a few months ago
03:08:47 oh, it does say under project status
03:08:50 re: NewsGrabber, it does say "Archiving status: Project superseded by URLs"
03:08:51 ye
03:09:03 lemme just...
03:10:18 TheTechRobo edited NewsGrabber (+51, replaced with #//): https://wiki.archiveteam.org/?diff=51403&oldid=50757
03:10:43 hey you
03:10:48 conflicting my edit
03:10:52 .
03:11:19 FireonLive edited NewsGrabber (+74): https://wiki.archiveteam.org/?diff=51404&oldid=51403
03:11:39 oops, forgot a message lol
03:13:19 TheTechRobo edited NewsGrabber (+1, Add a period): https://wiki.archiveteam.org/?diff=51405&oldid=51404
03:13:20 TheTechRobo edited URLs (+222, Add urls-sources): https://wiki.archiveteam.org/?diff=51406&oldid=50427
03:14:19 FireonLive edited Current Projects (+0, alphabetize "on hiatus"): https://wiki.archiveteam.org/?diff=51407&oldid=51402
03:14:19 oh yeah, NewsGrabber was kind of our predecessor to #//
03:15:01 https://wiki.archiveteam.org/index.php/Project_Newsletter < neat idea
03:15:36 > yt-dlp can be used to download article URLs, making it possible to preserve news in video form just as well as news in text form.
03:15:36 I don't think we have that in URLs, do we? I suppose the storage would get unwieldy
03:15:52 Might be nice for high-value stuff, though
03:16:17 ArchiveBot used to use youtube-dl (before the fork), but not any longer
03:16:42 Yeah
03:16:52 That integration was always jank IIRC though
03:16:59 we don't use yt-dlp in any project, except for the bot in #down-the-tube to discover videos of a channel for queuing
03:17:06 err
03:17:09 arkiver: I thought yt-dlp was replaced in the bot?
03:17:20 any Warrior project, i should say
03:17:24 arkiver: is the bot on git :p
03:17:29 TheTechRobo: only partially replaced
03:17:40 fireonlive: no, it has keys that i didn't separate out yet
03:17:44 but yes, i should get it on git
03:17:44 ahh np
03:17:47 * TheTechRobo asked about that before :P
03:17:52 no rushy
03:18:03 Yes please :-)
03:18:03 arkiver: should Photobucket be removed from upcoming/proposed? or is it still planned?
03:18:09 just have to free up some time for that
03:18:13 Can we have the tracker next?
03:18:19 TheTechRobo: i don't think it's planned at the moment
03:18:24 good luck with the tracker :P
03:19:00 TheTechRobo: the current tracker is so very duct-taped together (with sensitive stuff spread across it) that it will likely not be released publicly any time soon
03:19:01 I've been asking ever since I touched Seesaw. Universal-tracker is, despite the name, not very universal
03:19:01 arkiver: oh, one more q: are the IDs it generates stored in a database or something somewhere, alongside the explanation provided/project/etc.?
03:19:10 i believe the old tracker on GitHub should still somewhat work?
03:19:16 or is it mainly for IRC logs?
03:19:17 arkiver: Somewhat is right.
03:19:26 fireonlive: I also have the same question about `-e`
03:19:37 fireonlive: the bot for queuing, you mean? they are currently only in the logs
03:19:43 arkiver: No backfeed, slow, no offloader, etc.
03:19:43 together with the explanation, only in the logs
03:19:45 ye indeed
03:19:51 ah ok :)
03:19:57 TheTechRobo: yeah
03:20:14 eventually ™
03:20:17 :D
03:20:21 i guess :/
03:20:23 tracker is more understandable
03:20:32 so i don't hold that one against y'all lol
03:20:46 :)
03:20:49 :)
03:20:50 The lack of an offloader was the main reason I never archived very much of Strawpoll. Whenever the tracker was running, even idle, it used ~4GB of RAM because everything was in memory
03:21:39 i could have set up a project for that, if i was aware
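An aside on the yt-dlp idea quoted above: besides its CLI, yt-dlp ships a documented Python API, so "download article URLs in video form" can be a very short script. A minimal sketch follows; the URL list and output template here are placeholders I chose, not anything from the log.

    # Sketch: fetch video-form news articles with yt-dlp's Python API.
    # The URL below is a hypothetical placeholder.
    from yt_dlp import YoutubeDL

    article_urls = [
        "https://example.com/news/some-video-story",  # placeholder
    ]

    opts = {
        "outtmpl": "%(id)s.%(ext)s",  # name output files by video ID
        "ignoreerrors": True,         # keep going if one URL fails
    }

    with YoutubeDL(opts) as ydl:
        ydl.download(article_urls)

As noted in the discussion, storage for video is the real constraint, not tooling.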
03:21:41 Maybe a project for 2024. Building Universal-tracker 3 :P
03:21:52 arkiver: No shutdown notice, I just felt like archiving it
03:21:58 ah okey
03:22:03 it's still online?
03:22:05 No
03:22:07 i'm sure someone will set something big on fire in 2024
03:22:14 went offline without a shutdown notice?
03:22:18 well, a lot of somethings
03:22:24 arkiver: No idea
03:22:33 I thought "y'know, maybe I should continue archiving Strawpoll" and it was ded
03:22:47 fireonlive: maybe. i expected more to burn down with higher interest rates; maybe that will still come next year, as rates stay somewhat high and companies need to refinance
03:22:59 TheTechRobo: sad :/
03:23:03 Yeah
03:23:16 https://support.fandom.com/hc/en-us/articles/7951865547671-August-2022-StrawPoll-me-closure
03:23:16 updated 2023-02-09, closed 2022-08
03:23:18 I did get a bunch of polls, but nowhere near everything :/
03:25:07 are they on IA?
03:25:54 https://nitter.net/StrawPollme
03:25:58 arkiver: i think? this was when I was very new to AT
03:25:59 their twitter kinda died lol
03:26:10 ah: https://archive.org/details/strawpoll-my-grab
03:26:22 TheTechRobo edited Strawpoll.me (+2, Update info): https://wiki.archiveteam.org/?diff=51408&oldid=49804
03:26:24 apparently they were having technical issues? and i guess didn't want to spend resources on fixing it
03:26:35 fireonlive: lol
03:26:58 https://old.reddit.com/r/NoStupidQuestions/comments/rzwnpk/is_strawpollme_going_to_be_broken_forever/
03:27:19 i vaguely remember others saying not to use the .me version as well
03:28:44 ooh, dark IA items with poll data :3
03:29:26 fireonlive: Yeah, not sure what's up with that
03:41:49 TheTechRobo: how did you archive it?
03:41:57 meaning, how was the WARC created
03:43:29 looks like v1.20.3-at of https://github.com/archiveteam/wget-lua
03:43:39 from https://git.thetechrobo.ca/TheTechRobo/strawpoll-grab/src/branch/master/get-wget-lua.sh anyhow
03:45:20 https://git.thetechrobo.ca/TheTechRobo/strawpoll-grab/src/branch/master/pipeline.py#L161
03:50:56 thanks
03:51:11 TheTechRobo: i've moved the strawpoll item to archiveteam-fire; it will soon be in the Wayback Machine
03:51:43 i have my own collection?
03:51:47 :D
03:52:07 also, sweet news :)
03:53:21 hah, i guess so :)
03:53:54 :3
04:00:55 arkiver: Holy shit lmao
04:04:21 TheTechRobo: ?
04:05:04 arkiver: My shitty code made it into the WBM! :P
04:05:56 well, as long as the records are fine, it should be good :)
04:06:41 I don't even think Wget-AT *lets* you write invalid records :P
04:07:00 Well, I guess you could override DNS
04:07:02 yeah :)
04:07:09 for DNS, yes, i guess
04:07:11 But you could do that anyway
04:08:51 Wget-AT is amazing
04:09:00 Wget-AT++
04:09:01 -eggdrop- [karma] 'Wget-AT' now has 2 karma!
04:09:16 thanks :)
04:09:21 many improvements coming up!
04:09:35 =]
04:09:52 * arkiver is preparing a response to the recent responses from the TLS working group on our proposed MIME types and URIs for SSL/TLS
04:10:58 good luck with those IETF types
04:11:25 I'd also suggest adding some sort of unit testing
04:12:02 i'm sure it's in mind
04:12:11 Yeah
04:26:40 thanks...
04:44:08 🆕 !tell now supports hostmasks (nick!user@host), e.g. !tell *!*@balls.example hello
04:44:21 (with wildcards)
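The matching behind the new !tell hostmask feature isn't shown in the log; as an assumption, IRC bots commonly implement masks like this with simple glob matching. A sketch in Python (the function name and the lowercasing are my choices, not eggdrop's actual code):

    import fnmatch

    def mask_matches(mask: str, hostmask: str) -> bool:
        """Match an IRC hostmask like 'nick!user@host' against a mask
        with * and ? wildcards, case-insensitively."""
        return fnmatch.fnmatch(hostmask.lower(), mask.lower())

    # The example from the announcement above:
    assert mask_matches("*!*@balls.example", "someone!ident@balls.example")

Caveat: fnmatch also interprets [seq] character classes, which strict IRC mask semantics don't have; a stricter version would translate only * and ? into a regex.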
05:20:27 Mmm, welp: from randomly checking links to ignore on some ArchiveBot jobs, sadly https://forum.mobilelegends.com/ shut down earlier this year, on April 30
05:20:50 ...I don't think we have many people in the mobile-game area of things who feel strongly about it :c
18:09:45 Hi Jason,
18:09:45 Sorry to bother you, but Joe Baugher has died. He wrote up so much about aviation throughout the years and his articles are invaluable. Would you mind asking the Archive team to archive his home page one last time?
18:09:45 https://www.joebaugher.com/
18:09:47 All the best,
18:09:50 Chris
18:44:25 SketchCow: I'm not Jason, but i think this is something we can do :)
18:45:45 oh wait, you're Jason >.<
18:47:08 it's queued :)
18:49:57 is there something easy to quickly (multi-connection) download a list of URLs (without following links) into a WARC?
18:51:48 Nulo|m: i'm not as experienced as other users here (who might have better answers for you), but you should just be able to use wget for that
18:52:04 i guess i just have to make a script to run many wgets, right?
18:52:17 also, i can't find a wget flag to not also save files to disk when i'm already writing them into a WARC
18:53:54 Nulo|m: well, it has to download them, but there's the --delete-after flag, which gets rid of them once they're in the WARC
18:54:33 thanks!
18:54:35 wget has a --background mode, but i would assume they then cannot write into the same WARC file
18:54:47 if you have multiple running, i mean
18:56:15 there's also wpull, a wget fork, which ArchiveBot also uses to download things. that one supports concurrency, but depending on your Python version it might be a little fiddly to set up: https://github.com/ArchiveTeam/wpull
18:57:01 correction: it's not a fork, just another tool. my bad
18:57:39 what do you need the WARC for, if i may ask?
18:58:00 i'm downloading product pages to then scrape them offline
18:58:51 why the WARC then, and not just the pages themselves?
18:59:27 because if i need to pull more info later that i wasn't scraping before, i can still just pull it from the WARC
18:59:46 also, my scraper is kind of hacky, so if it's bad i can just re-run it on the WARCs
19:00:05 i see, that makes sense.
19:00:07 also, i should be able to run the scraper on WARCs from archive.org or other sources :)
19:02:51 ok. apart from wpull, i am running out of ideas. hopefully someone else can give you better answers when they're back :)
19:02:59 did you have a look at https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem already?
19:04:13 no, thank you!
19:04:22 i think i'll make a script based on wget though
19:04:49 there's wget-at too :)
19:05:25 yeah, but wget works fine for me, and i believe wget-at doesn't have multi-connection, just improved WARC stuff?
19:05:40 ah, that would probably be the fork that i confused wpull with earlier
19:06:19 improved WARC stuff sounds pretty paramount :3
19:07:05 hehe, but the WARCs generated by GNU wget work fine with warcio.js, which is what i'm using, so 👍️
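On the multi-connection question above: GNU Wget uses one connection per process, and as noted, multiple processes can't safely write into the same WARC file, so the usual workaround is exactly the script Nulo guessed at — split the URL list across a few wget processes, each writing its own WARC. A minimal sketch in Python, assuming a WARC-capable GNU Wget (≥ 1.14) on PATH; the file names and the concurrency level are arbitrary choices:

    import subprocess

    N = 4  # number of parallel wget processes (arbitrary)

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    procs = []
    for i in range(N):
        chunk = urls[i::N]  # round-robin split of the URL list
        if not chunk:
            continue
        procs.append(subprocess.Popen([
            "wget",
            f"--warc-file=grab-{i}",  # each process writes grab-<i>.warc.gz
            "--delete-after",         # discard the plain copies; keep the WARC
            "--no-directories",
            *chunk,                   # plain URL fetches, no recursion
        ]))

    for p in procs:
        p.wait()

For very long lists, writing each chunk to a file and passing --input-file avoids argv length limits; wpull, mentioned above, is the route to real in-process concurrency.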
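And for the re-scraping plan above, this is roughly what the read-back loop looks like. The speaker used warcio.js; the sketch below uses the analogous documented Python warcio API, and the WARC filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate response records in a WARC and hand each body to a scraper.
    with open("grab-0.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()  # decoded HTTP payload
            # ...parse `body` for product data here...

The same loop works on WARCs from archive.org or any other source, which is the portability the speaker was after.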
20:53:19 I'm averaging 5k OneHallyu topics per hour now. They went read-only at 2023-12-20T11:23Z or so (the date of the last post by an admin). If they shut it down at the same time of day, I expect to have covered about 81% of the topics.
20:54:09 more parallelism/IPs unlikely to help?
20:54:27 Their potato is too slow.
20:54:39 6-second average response time.
20:59:45 Let's see what happens if I throw more at it...
21:00:19 * Barto observes an explosion on the horizon
21:00:41 also try less; if there's resource contention on the server, it could have weird effects
21:02:15 Can't easily go lower, but yeah, I might if this makes it worse.
21:02:33 ("half the threads, 2-second response time" would be a net win, though unlikely)
21:02:34 Average response time now: 8404 ms ._.
21:02:40 x_x
21:03:14 Throughput still went up a bit though.
21:04:00 hm, how's your network-layer latency to their server?
21:04:51 They hide behind Buttflare, so no idea.
21:05:08 oh :|
21:05:33 that latency is also irrelevant if they're on CF
21:06:33 Depends on what their backend looks like, but the point is rather that I can't measure it anyway.
21:07:11 if there wasn't CF, doing the crawl from somewhere closer could help
21:07:51 Possibly, although it can usually be balanced by higher concurrency.
21:54:30 I'm back down to the same throughput from before I increased the concurrency.
23:21:34 "Bluesky makes web view public, login no longer required to read posts" https://news.ycombinator.com/item?id=38739130
23:22:38 nicolas17: feel free to use #fire-spam for testing
23:23:03 everyone got to witness the bee movie, so what's a bit more :p
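A back-of-the-envelope on the OneHallyu numbers above: by Little's law, throughput ≈ concurrency / response time, which is why raising concurrency mostly pushed the response time up until throughput settled back where it started. The implied concurrency below is derived, not stated in the log:

    # Little's law: concurrency = throughput * response time.
    topics_per_hour = 5000        # stated rate
    resp_s = 6.0                  # stated average response time

    throughput = topics_per_hour / 3600        # ~1.39 topics/s
    implied_concurrency = throughput * resp_s  # ~8.3 requests in flight

    # At the observed 8404 ms response time, the same throughput would
    # need ~11.7 in-flight requests -- consistent with throughput going
    # "up a bit" after the concurrency increase, then settling back down.
    print(round(implied_concurrency, 1))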