-
nicolas17
wow
-
nicolas17
my system got all laggy
-
nicolas17
turns out it was swapping
-
nicolas17
because the archivebot.com tab was using 3GB of RAM and growing
-
Pedrosso
Haha. It tends to do that
-
JAA
How long did you have it open?
-
nicolas17
like a minute or two, idk what's up with that
-
JAA
I've had it open for hours, and it uses under 200 MB.
-
nicolas17
I expanded the log of a job
-
nicolas17
which may have affected it
-
JAA
~300 MB when expanding all logs, although it disappeared from about:performance for a bit. lol
-
JAA
Firefox, by the way.
-
nicolas17
I should change my support.apple.com scraping script to not trash my SSD...
-
JAA
/dev/shm <3
-
nicolas17
instead of "rm data/*; download everything; if ! git diff --quiet; then commit; fi"
-
nicolas17
I should read the existing file and compare it with what was downloaded, if it's the same then don't write anything
-
JAA
Ah
-
nicolas17
it's like 480MB, I wouldn't want to keep that in memory between runs
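
The compare-before-write idea described above could be sketched roughly as follows. This is a minimal illustration, not the actual scraping script: the helper name `write_if_changed` and the idea of handling one downloaded file at a time are assumptions. Only the file currently being checked is held in memory, so the full ~480MB never has to stay resident between runs.

```python
from pathlib import Path


def write_if_changed(path: Path, new_data: bytes) -> bool:
    """Write new_data to path only if it differs from the existing file.

    Returns True if the file was written, False if it was left untouched.
    (Hypothetical helper, named here for illustration only.)
    """
    try:
        # Reading is cheap compared with rewriting; identical content
        # means no write is issued at all.
        if path.read_bytes() == new_data:
            return False
    except FileNotFoundError:
        # No previous copy exists, so fall through and create it.
        pass
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_data)
    return True
```

Files that no longer exist upstream would still need to be deleted separately, since this sketch replaces the "rm data/*" step only for files that are still downloaded.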
-
Pedrosso
Oh yeah not from like a minute or two, at least for me
-
Ryz
Is there anything else to save from Evernote? Is there user content to get? Considering
techcrunch.com/2023/11/29/its-offic…ill-restrict-free-users-to-50-notes
-
nicolas17
Ryz: I think it's all private content sooo
-
lindowsME
Hey were the videos from funnyordie.com saved?
-
lindowsME
they're not on the site anymore (since years ago), but seem to all still be on S3. the old cdn used to redirect, now 404s.
-
lindowsME
archive.org 403s
-
Inti83
Hi, I am here with a request similar to the EndOfTerm archive but for Argentina, as the incoming government has already stated its intent to dismantle most agencies
-
Inti83
We are already working on archiving the data by downloading it, as we understand archive.org doesn't necessarily index all pages automatically. We understand ArchiveBot helps with this
-
Inti83
The new term starts on 10th of December
-
Inti83
We have compiled a list of sites, which is not exhaustive
-
Inti83
I found argentina.gob.ar and educ.ar in the archive but there are quite a few more that are not
-
Inti83
It would be preferable to have the sites in archive.org rather than just downloading them for preservation, as this ensures public access for all, whereas distributing the data after downloading is complex
-
Inti83
Some of the content is multimedia and we are having a hard time knowing how to archive it
-
Sanqui
Inti83: Hello, please stick around, we're definitely able to help with this.
-
Sanqui
ArchiveBot is able to crawl and download many websites, which then get uploaded to the Internet Archive and become possible to browse in the Wayback Machine
-
Sanqui
It has its limitations though
-
Sanqui
A good start would be to create a page on the wiki with a list of websites, then we can note whether individual websites were crawled successfully with ArchiveBot
-
Sanqui
(BTW, I can't access cont.ar at all over here in Europe. I'm getting a Cloudflare block page.)
-
Inti83
Hi, cloudflare may be a problem, we encountered some problems with this even at this end
-
Inti83
OK, I'll get started on a wiki page
-
Inti83
What is a good nomenclature for such a page?
-
Inti83
Hi, I keep getting disconnected. I wrote earlier about an End Of Term archive for Argentina. I was wondering how to follow nomenclature norms in order to start a new page and add the links as suggested?
-
Inti83
There's a Government Backup page but it is US based
-
thuban
Inti83: "Argentina" is fine (we have a number of country pages, and can always add other sections/subpages as needed)
-
Inti83
ok
-
Inti83
Hey, OK. I sent the page for review
-
Inti83
It has a list of pages we have compiled so far as relevant, although we have put out a call, so we will most likely be adding more
-
Inti83
I'll likely get disconnected again soon but I will connect again when poss
-
rewby|backup
Inti83: I've approved your page.
-
Inti83
Thank you!
-
h2ibot
Inti83 created Argentina (+4734, Add cultural links & YT):
wiki.archiveteam.org/?title=Argentina
-
Inti83
Thanks, we are going to test using grab-site to save these. If this works, do we consider the site saved? Or how do you usually proceed in these cases? This tool:
github.com/ArchiveTeam/grab-site
-
TheTechRobo
Inti83: Generally, WARCs from most people won't get added to the Wayback Machine as there is the possibility of tampering. But if it works with grab-site, it will almost certainly work with ArchiveBot as they share the same crawling code
-
Naruyoko
Has anyone noticed this? Google will start deleting inactive accounts.
-
Naruyoko
(I wasn't in #googlecrash, so I can't see history)
-
nulldata
Naruyoko - Yeah, that is what prompted the grab for Blogger. #frogger
-
Naruyoko
I see
-
soap
I have a list of ~2000 or so cdn.discordapp.com urls from wiki.tockdom.com, would someone mind adding them to archivebot for me?
transfer.archivete.am/j5cLW/tockdom_discord_urls.txt
-
soap
or is there something else I should do with them?
-
JAA
soap: Sure, I'll throw them in.
-
soap
thanks!
-
nicolas17
JAA: what are the requirements for an archivebot pipeline?
-
nicolas17
if there are more .ar sites blocked so they only work from Argentina, it could be a problem
-
nicolas17
"I can't access cont.ar at all over here in Europe. I'm getting a Cloudflare block page."
-
inti83
cont.ar and cine.ar have user-only content, which may be why
-
JAA
nicolas17: Right. Stable machine with uptime measured at least in months. Clean network. For the hardware, SSDs are basically required to operate at an acceptable speed, but otherwise, things can be scaled to fit what's available; the ideal machine would have a good number of CPU cores/threads.
-
JAA
RAM is rarely relevant, but more is better for caching.
-
nicolas17
running an archivebot crawler from a .ar IP would help with those cases, I doubt I can offer hardware for that but maybe I (or inti83) can find people who can?
-
JAA
For a more targeted project rather than a general pipeline, the uptime requirement would be less strict, I suppose.
-
inti83
is that something like grab-site?
-
JAA
Since we'll want to archive these things within weeks anyway.
-
inti83
yes; i think i can find people
-
inti83
what do i need to do?
-
JAA
grab-site is essentially a local version of ArchiveBot.
-
JAA
'Local' as in 'not distributed'; AB has a control node to coordinate the different machines (pipelines).
-
inti83
how would we run the archivebot from here?
-
nicolas17
I think I know people with servers inside the Cabase IXP :D
-
inti83
cool let me know and i can ask people who are on this whether they have the hardware capacity, may be possible
-
JAA
So the AB setup is fairly messy, and the install notes aren't entirely complete I think. If I can get access to a suitable server provided by a trustworthy party, I can set it up.
-
pokechu22
I assume this would be set up as a matchonly pipeline (though probably without matchonly in the name so that -p matchonly doesn't hit it), to avoid long-running jobs accidentally ending up on it?
-
JAA
Yes
-
nicolas17
I just found something interesting for future data-analysis purposes, archive.org has "access-control-allow-origin: *", so you can make client-side JS code to eg. get a cdx file and process it and return the extracted data, and do distributed computing by just giving people a link, kind of like the imgur bruteforce thing :D
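
As a rough illustration of the "get a cdx file and process it" step, here is a small Python sketch of the processing part only; the browser version would do the equivalent in JS after fetching the file. It assumes the classic space-delimited CDX layout whose first line is a legend such as ` CDX N b a m s k r M S V g`, where `s` is the response code. The function names are hypothetical, and real files may use a different field set.

```python
from collections import Counter


def parse_cdx(text: str) -> list[dict[str, str]]:
    """Parse space-delimited CDX text into one dict per record.

    Assumes the first line is a legend like ' CDX N b a m s k r M S V g';
    the single-letter codes from that legend become the dict keys.
    """
    lines = text.splitlines()
    if not lines or not lines[0].split() or lines[0].split()[0] != "CDX":
        raise ValueError("missing CDX legend line")
    fields = lines[0].split()[1:]
    records = []
    for line in lines[1:]:
        parts = line.split()
        # Skip blank or malformed lines instead of guessing at their layout.
        if len(parts) != len(fields):
            continue
        records.append(dict(zip(fields, parts)))
    return records


def count_status_codes(records: list[dict[str, str]]) -> Counter:
    """Tally records by the 's' (response code) field."""
    return Counter(record["s"] for record in records)
```

Each participant would run something like this over a different file and report back only the small extracted summary.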
-
inti83
do you have any tips on archiving atom archives? we are having some trouble:
share.riseup.net/#G_1seXPsbK1wKVUwdMCNpw
-
inti83
so many links
-
pokechu22
That probably needs ignores of some sort, but I don't have any specific recommendations
-
inti83
yeah, sadly this endpoint is used for everything: it always goes through it :/
-
JAA
It looks like there is filter faceting, but that might not be the only thing.
-
AK
Woot possibly more AB pipelines?