00:24:22 <nicolas17> wow
00:24:25 <nicolas17> my system got all laggy
00:24:30 <nicolas17> turns out it was swapping
00:24:42 <nicolas17> because the archivebot.com tab was using 3GB of RAM and growing
00:26:51 <Pedrosso> Haha. It tends to do that
00:27:41 <JAA> How long did you have it open?
00:27:53 <nicolas17> like a minute or two, idk what's up with that
00:28:04 <JAA> I've had it open for hours, and it uses under 200 MB.
00:28:05 <nicolas17> I expanded the log of a job
00:28:13 <nicolas17> which may have affected it
00:29:24 <JAA> ~300 MB when expanding all logs, although it disappeared from about:performance for a bit. lol
00:30:17 <JAA> Firefox, by the way.
00:32:12 <nicolas17> I should change my support.apple.com scraping script to not trash my SSD...
00:32:30 <JAA> /dev/shm <3
00:32:47 <nicolas17> instead of "rm data/*; download everything; if git diff --quiet; then commit; fi"
00:33:03 <nicolas17> I should read the existing file and compare it with what was downloaded, if it's the same then don't write anything
00:36:03 <JAA> Ah
00:56:35 <nicolas17> it's like 480MB, I wouldn't want to keep that in memory between runs
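
[Editor's sketch, not part of the log: one way the "compare before writing" idea above could look. It assumes each downloaded page arrives as a (filename, bytes) pair and lives directly under a flat data directory. The names `write_if_changed` and `sync_dir` are hypothetical and do not come from nicolas17's actual script. Only one file is held in memory at a time, so the ~480MB total never needs to be kept around between runs.]

```python
import os
from pathlib import Path
from typing import Iterable


def write_if_changed(path: Path, new_data: bytes) -> bool:
    """Write new_data to path only if it differs from the file on disk.

    Returns True when a write happened, False when the file was left alone.
    """
    try:
        if path.read_bytes() == new_data:
            return False  # identical content: skip the write entirely
    except FileNotFoundError:
        pass  # new file, fall through and create it
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(new_data)
    os.replace(tmp, path)  # swap the new file into place in one step
    return True


def sync_dir(data_dir: Path, downloads: Iterable[tuple[str, bytes]]):
    """Update data_dir from (name, content) pairs, touching only what changed.

    Files that were not seen in this run are deleted, which stands in for
    the old "rm data/*" step. Returns (written_names, removed_names).
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    seen = set()
    written = []
    for name, data in downloads:
        seen.add(name)
        if write_if_changed(data_dir / name, data):
            written.append(name)
    removed = []
    for existing in sorted(data_dir.iterdir()):
        if existing.is_file() and existing.name not in seen:
            existing.unlink()
            removed.append(existing.name)
    return written, removed
```

[If both returned lists are empty, nothing on disk changed and the commit step can be skipped without running git diff at all.]
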
01:10:26 <Pedrosso> Oh yeah not from like a minute or two, at least for me
03:36:24 <Ryz> Is there anything else to save from Evernote? Is there user content to get? Considering https://techcrunch.com/2023/11/29/its-official-evernote-will-restrict-free-users-to-50-notes/
03:40:05 <nicolas17> Ryz: I think it's all private content sooo
03:41:25 <lindowsME> Hey were the videos from funnyordie.com saved?
03:41:25 <lindowsME> they're not on the site anymore (since years ago), but seem to all still be on S3. the old cdn used to redirect, now 404s.
03:41:35 <lindowsME> https://web.archive.org/cdx/search/cdx?url=vo.fod4.com/v/*&limit=1000
03:41:41 <lindowsME> https://web.archive.org/cdx/search/cdx?url=http://s3.amazonaws.com/production.videos.funnyordie.com/v/*&limit=1000
03:42:11 <lindowsME> archive.org 403s
11:53:12 <Inti83> Hi, I am here with a request similar to EndOfTerm archive but for Argentina, as the incoming government has already stated its intent to dismantle most agencies
11:54:10 <Inti83> We are already working on archiving the data by downloading it as we understand archive.org doesn't necessarily automatically index all pages. We understand ArchiveBot helps with this
11:54:36 <Inti83> The new term starts on 10th of December
11:55:20 <Inti83> We have compiled a list of sites which is not exhaustive
11:55:36 <Inti83> I found argentina.gob.ar and educ.ar in the archive but there are quite a few more that are not
11:56:33 <Inti83> It would be preferable to have the sites in archive.org rather than just downloading for preservation as this ensures public access to all whereas the distribution aspect after downloading is complex
12:06:42 <Inti83> Some of the content is multimedia and we are having a hard time knowing how to archive it
12:06:58 <Inti83> Example https://www.cont.ar/
12:14:42 <Sanqui> Inti83: Hello, please stick around, we're definitely able to help with this.
12:15:27 <Sanqui> ArchiveBot is able to crawl and download many websites, which then get uploaded to the Internet Archive and become possible to browse in the Wayback Machine
12:15:35 <Sanqui> It has its limitations though
12:15:54 <Sanqui> A good start would be to create a page on the wiki with a list of websites, then we can make notes on whether individual websites were crawled successfully with ArchiveBot
12:16:11 <Sanqui> (BTW, I can't access cont.ar at all over here at Europe.  I'm getting a Cloudflare block page.)
12:17:13 <Inti83> Hi, cloudflare may be a problem, we encountered some problems with this even at this end
12:17:32 <Inti83> OK, I'll get started on a wiki page
12:33:13 <Inti83> What is a good nomenclature for such a page?
13:45:21 <Inti83> Hi, I keep getting disconnected. I wrote earlier about an End Of Term archive for Argentina. I was wondering how to follow nomenclature norms in order to start a new page and add the links as suggested?
13:45:44 <Inti83> There's a Government Backup page but it is US based
13:48:06 <thuban> Inti83: "Argentina" is fine (we have a number of country pages, and can always add other sections/subpages as needed)
13:48:21 <Inti83> ok
14:18:45 <Inti83> Hey, OK. I sent the page for review
14:19:13 <Inti83> It has a list of pages we have compiled so far as relevant, although we have put out a call so will most likely be adding more
14:20:55 <Inti83> I'll likely get disconnected again soon but I will connect again when poss
15:01:02 <rewby|backup> Inti83: I've approved your page.
15:01:09 <Inti83> Thank you!
15:01:29 <h2ibot> Inti83 created Argentina (+4734, Add cultural links & YT): https://wiki.archiveteam.org/?title=Argentina
15:05:40 <Inti83> Thanks, we are going to test using grab-site to save these. If this works, do we consider the site saved? Or how do you usually proceed in these cases? This tool: https://github.com/ArchiveTeam/grab-site
15:28:01 <TheTechRobo> Inti83: Generally, WARCs from most people won't get added to the Wayback Machine as there is the possibility of tampering. But if it works with grab-site, it will almost certainly work with ArchiveBot as they share the same crawling code
16:42:03 <Naruyoko> https://abcnews.go.com/Business/google-begins-process-deleting-inactive-gmail-accounts/story?id=105281283
16:42:24 <Naruyoko> Has anyone noticed this? Google will start deleting inactive accounts.
16:49:50 <Naruyoko> (I wasn't in #googlecrash, so I can't see history)
18:15:26 <nulldata> Naruyoko - Yeah, that is what prompted the grab for Blogger. #frogger
18:25:15 <Naruyoko> I see
18:26:54 <soap> I have a list of ~2000 or so cdn.discordapp.com urls from wiki.tockdom.com, would someone mind adding them to archivebot for me? https://transfer.archivete.am/j5cLW/tockdom_discord_urls.txt
18:26:55 <eggdrop> inline (for browser viewing): https://transfer.archivete.am/inline/j5cLW/tockdom_discord_urls.txt
18:27:21 <soap> or is there something else I should do with them?
18:27:49 <JAA> soap: Sure, I'll throw them in.
18:28:41 <soap> thanks!
19:24:21 <nicolas17> JAA: what are the requirements for an archivebot pipeline?
19:25:18 <nicolas17> if there are more .ar sites blocked so they only work from Argentina, it could be a problem
19:25:30 <nicolas17> "I can't access cont.ar at all over here at Europe.  I'm getting a Cloudflare block page."
19:26:33 <inti83> cont.ar and cine.ar have user only content which may be why
19:27:49 <JAA> nicolas17: Right. Stable machine with uptime measured at least in the months. Clean network. For the hardware, SSDs are basically required to operate at an acceptable speed, but otherwise, things can be scaled to fit what's available; the ideal machine would have a good number of CPU cores/threads.
19:28:21 <JAA> RAM is rarely relevant, but more is better for caching.
19:28:31 <nicolas17> running an archivebot crawler from a .ar IP would help with those cases, I doubt I can offer hardware for that but maybe I (or inti83) can find people who can?
19:29:08 <JAA> For a more targeted project rather than a general pipeline, the uptime requirement would be less strict, I suppose.
19:29:24 <inti83> is that something like grab-site?
19:29:27 <JAA> Since we'll want to archive these things within weeks anyway.
19:29:35 <inti83> yes; i think i can find people
19:29:41 <inti83> what do i need to do?
19:29:42 <JAA> grab-site is essentially a local version of ArchiveBot.
19:30:56 <JAA> 'Local' as in 'not distributed'; AB has a control node to coordinate the different machines (pipelines).
19:31:15 <inti83> how would we run the archivebot from here?
19:32:03 <nicolas17> I think I know people with servers inside the Cabase IXP :D
19:33:54 <inti83> cool let me know and i can ask people who are on this whether they have the hardware capacity, may be possible
19:34:21 <JAA> So the AB setup is fairly messy, and the install notes aren't entirely complete I think. If I can get access to a suitable server provided by a trustworthy party, I can set it up.
19:35:03 <pokechu22> I assume this would be set up as a matchonly pipeline (though probably without matchonly in the name so that -p matchonly doesn't hit it), to avoid long-running jobs accidentally ending up on it?
19:35:48 <JAA> Yes
20:17:20 <nicolas17> I just found something interesting for future data-analysis purposes, archive.org has "access-control-allow-origin: *", so you can make client-side JS code to eg. get a cdx file and process it and return the extracted data, and do distributed computing by just giving people a link, kind of like the imgur bruteforce thing :D
20:45:07 <inti83> do you have any tips on archiving atom archives? we are having some trouble: https://share.riseup.net/#G_1seXPsbK1wKVUwdMCNpw
20:50:37 <inti83> so many links
20:51:24 <pokechu22> That probably needs ignores of some sort but I don't have any specific recommendations
20:53:28 <inti83> yeah, sadly this endpoint is used for everything: it always goes through it :/
20:53:52 <JAA> It looks like there is filter faceting, but that might not be the only thing.
22:49:31 <AK> Woot possibly more AB pipelines?