00:24:22 wow
00:24:25 my system got all laggy
00:24:30 turns out it was swapping
00:24:42 because the archivebot.com tab was using 3GB of RAM and growing
00:26:51 Haha. It tends to do that
00:27:41 How long did you have it open?
00:27:53 like a minute or two, idk what's up with that
00:28:04 I've had it open for hours, and it uses under 200 MB.
00:28:05 I expanded the log of a job
00:28:13 which may have affected it
00:29:24 ~300 MB when expanding all logs, although it disappeared from about:performance for a bit. lol
00:30:17 Firefox, by the way.
00:32:12 I should change my support.apple.com scraping script to not trash my SSD...
00:32:30 /dev/shm <3
00:32:47 instead of "rm data/*; download everything; if ! git diff --quiet; then commit; fi"
00:33:03 I should read the existing file and compare it with what was downloaded, if it's the same then don't write anything
00:36:03 Ah
00:56:35 it's like 480MB, I wouldn't want to keep that in memory between runs
01:10:26 Oh yeah not from like a minute or two, at least for me
03:36:24 Is there anything else to save from Evernote? Is there user content to get? Considering https://techcrunch.com/2023/11/29/its-official-evernote-will-restrict-free-users-to-50-notes/
03:40:05 Ryz: I think it's all private content sooo
03:41:25 Hey were the videos from funnyordie.com saved?
03:41:25 they're not on the site anymore (since years ago), but seem to all still be on S3. the old cdn used to redirect, now 404s.
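A minimal sketch of the idea from 00:33:03 above (read the existing file, compare it with what was downloaded, and skip the write when they match). The helper name write_if_changed is hypothetical and not taken from the actual scraping script, and the use of Python is an assumption as well.

```python
from pathlib import Path


def write_if_changed(path, new_data: bytes) -> bool:
    """Write new_data to path only if it differs from what is already on disk.

    Returns True if the file was (re)written, False if it was left untouched.
    """
    path = Path(path)
    try:
        # Reading is cheap compared to rewriting every file on each run.
        if path.read_bytes() == new_data:
            return False
    except FileNotFoundError:
        # No previous copy, so fall through and create the file.
        pass
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_data)
    return True
```

Calling this for each downloaded file in place of "rm data/*; download everything" would leave unchanged files alone, so only real changes reach the disk and git still sees them as modifications.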
03:41:35 https://web.archive.org/cdx/search/cdx?url=vo.fod4.com/v/*&limit=1000
03:41:41 https://web.archive.org/cdx/search/cdx?url=http://s3.amazonaws.com/production.videos.funnyordie.com/v/*&limit=1000
03:42:11 archive.org 403s
11:53:12 Hi, I am here with a request similar to the EndOfTerm archive but for Argentina, as the incoming government has already stated its intent to dismantle most agencies
11:54:10 We are already working on archiving the data by downloading it, as we understand archive.org doesn't necessarily automatically index all pages. We understand ArchiveBot helps with this
11:54:36 The new term starts on 10th of December
11:55:20 We have compiled a list of sites which is not exhaustive
11:55:36 I found argentina.gob.ar and educ.ar in the archive but there are quite a few more that are not
11:56:33 It would be preferable to have the sites in archive.org rather than just downloading for preservation, as this ensures public access to all whereas the distribution aspect after downloading is complex
12:06:42 Some of the content is multimedia and we are having a hard time knowing how to archive it
12:06:58 Example https://www.cont.ar/
12:14:42 Inti83: Hello, please stick around, we're definitely able to help with this.
12:15:27 ArchiveBot is able to crawl and download many websites, which then get uploaded to the Internet Archive and become possible to browse in the Wayback Machine
12:15:35 It has its limitations though
12:15:54 A good start would be to create a page on the wiki with a list of websites, then we can make notes for if individual websites were successful to crawl with ArchiveBot
12:16:11 (BTW, I can't access cont.ar at all over here at Europe. I'm getting a Cloudflare block page.)
12:17:13 Hi, Cloudflare may be a problem, we encountered some problems with this even at this end
12:17:32 OK, I'll get started on a wiki page
12:33:13 What is a good nomenclature for such a page?
13:45:21 Hi, I keep getting disconnected. I wrote earlier about an End Of Term archive for Argentina. I was wondering how to follow nomenclature norms in order to start a new page and add the links as suggested?
13:45:44 There's a Government Backup page but it is US based
13:48:06 Inti83: "Argentina" is fine (we have a number of country pages, and can always add other sections/subpages as needed)
13:48:21 ok
14:18:45 Hey, OK. I sent the page for review
14:19:13 It has a list of pages we have compiled so far as relevant, although we have put out a call so we will most likely be adding more
14:20:55 I'll likely get disconnected again soon but I will connect again when poss
15:01:02 Inti83: I've approved your page.
15:01:09 Thank you!
15:01:29 Inti83 created Argentina (+4734, Add cultural links & YT): https://wiki.archiveteam.org/?title=Argentina
15:05:40 Thanks, we are going to test using grab-site to save these. If this works, do we consider the site saved? Or how do you usually proceed in these cases? This tool: https://github.com/ArchiveTeam/grab-site
15:28:01 Inti83: Generally, WARCs from most people won't get added to the Wayback Machine as there is the possibility of tampering. But if it works with grab-site, it will almost certainly work with ArchiveBot as they share the same crawling code
16:42:03 https://abcnews.go.com/Business/google-begins-process-deleting-inactive-gmail-accounts/story?id=105281283
16:42:24 Has anyone noticed this? Google will start deleting inactive accounts.
16:49:50 (I wasn't in #googlecrash, so I can't see history)
18:15:26 Naruyoko - Yeah, that is what prompted the grab for Blogger. #frogger
18:25:15 I see
18:26:54 I have a list of ~2000 or so cdn.discordapp.com urls from wiki.tockdom.com, would someone mind adding them to archivebot for me? https://transfer.archivete.am/j5cLW/tockdom_discord_urls.txt
18:26:55 inline (for browser viewing): https://transfer.archivete.am/inline/j5cLW/tockdom_discord_urls.txt
18:27:21 or is there something else I should do with them?
18:27:49 soap: Sure, I'll throw them in.
18:28:41 thanks!
19:24:21 JAA: what are the requirements for an archivebot pipeline?
19:25:18 if there's more .ar sites blocked so they only work from Argentina, it could be a problem
19:25:30 "I can't access cont.ar at all over here at Europe. I'm getting a Cloudflare block page."
19:26:33 cont.ar and cine.ar have user-only content which may be why
19:27:49 nicolas17: Right. Stable machine with uptime measured at least in the months. Clean network. For the hardware, SSDs are basically required to operate at an acceptable speed, but otherwise, things can be scaled to fit what's available; the ideal machine would have a good number of CPU cores/threads.
19:28:21 RAM is rarely relevant, but more is better for caching.
19:28:31 running an archivebot crawler from a .ar IP would help with those cases, I doubt I can offer hardware for that but maybe I (or inti83) can find people who can?
19:29:08 For a more targeted project rather than a general pipeline, the uptime requirement would be less strict, I suppose.
19:29:24 is that something like grab-site?
19:29:27 Since we'll want to archive these things within weeks anyway.
19:29:35 yes; i think i can find people
19:29:41 what do i need to do?
19:29:42 grab-site is essentially a local version of ArchiveBot.
19:30:56 'Local' as in 'not distributed'; AB has a control node to coordinate the different machines (pipelines).
19:31:15 how would we run the archivebot from here?
19:32:03 I think I know people with servers inside the Cabase IXP :D
19:33:54 cool let me know and i can ask people who are on this whether they have the hardware capacity, may be possible
19:34:21 So the AB setup is fairly messy, and the install notes aren't entirely complete I think. If I can get access to a suitable server provided by a trustworthy party, I can set it up.
19:35:03 I assume this would be set up as a matchonly pipeline (though probably without matchonly in the name so that -p matchonly doesn't hit it), to avoid long-running jobs accidentally ending up on it?
19:35:48 Yes
20:17:20 I just found something interesting for future data-analysis purposes, archive.org has "access-control-allow-origin: *", so you can make client-side JS code to e.g. get a cdx file and process it and return the extracted data, and do distributed computing by just giving people a link, kind of like the imgur bruteforce thing :D
20:45:07 do you have any tips on archiving atom archives? we are having some trouble: https://share.riseup.net/#G_1seXPsbK1wKVUwdMCNpw
20:50:37 so many links
20:51:24 That probably needs ignores of some sort but I don't have any specific recommendations
20:53:28 yeah, sadly this endpoint is used for everything: it always goes through it :/
20:53:52 It looks like there is filter faceting, but that might not be the only thing.
22:49:31 Woot possibly more AB pipelines?
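A rough sketch of the "get a cdx file and process it" step from 20:17:20. The chat describes doing this in client-side JS, which the CORS header makes possible; this sketch uses Python instead and only shows fetching and processing, not the browser side. The function names are hypothetical, and the field list assumes the CDX server's default plain-text output (no fl= parameter), as in the CDX URLs posted at 03:41:35.

```python
import urllib.parse
import urllib.request
from collections import Counter

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

# Assumed default field order of the plain-text CDX output.
CDX_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype",
    "statuscode", "digest", "length",
)


def parse_cdx(text):
    """Turn space-separated CDX lines into a list of dicts."""
    records = []
    for line in text.splitlines():
        parts = line.split(" ")
        if len(parts) != len(CDX_FIELDS):
            continue  # skip blank or malformed lines
        records.append(dict(zip(CDX_FIELDS, parts)))
    return records


def count_by_status(records):
    """Example of 'extracted data': number of captures per HTTP status code."""
    return Counter(r["statuscode"] for r in records)


def fetch_cdx(url_pattern, limit=1000):
    """Download one page of CDX results for a URL pattern (network access needed)."""
    query = urllib.parse.urlencode({"url": url_pattern, "limit": limit})
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{query}", timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, count_by_status(parse_cdx(fetch_cdx("vo.fod4.com/v/*"))) would summarise the captures for the first CDX URL from 03:41:35 by status code. The parsing is kept separate from the download so the same processing could be ported to browser JS or run on a saved CDX file.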