#archiveteam-bs

00:24

nicolas17

wow
00:24

nicolas17

my system got all laggy
00:24

nicolas17

turns out it was swapping
00:24

nicolas17

because the archivebot.com tab was using 3GB of RAM and growing
00:26

Pedrosso

Haha. It tends to do that
00:27

JAA

How long did you have it open?
00:27

nicolas17

like a minute or two, idk what's up with that
00:28

JAA

I've had it open for hours, and it uses under 200 MB.
00:28

nicolas17

I expanded the log of a job
00:28

nicolas17

which may have affected it
00:29

JAA

~300 MB when expanding all logs, although it disappeared from about:performance for a bit. lol
00:30

JAA

Firefox, by the way.
00:32

nicolas17

I should change my support.apple.com scraping script to not trash my SSD...
00:32

JAA

/dev/shm <3
00:32

nicolas17

instead of "rm data/*; download everything; if git diff --quiet; then commit; fi"
00:33

nicolas17

I should read the existing file and compare it with what was downloaded, if it's the same then don't write anything
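
A minimal sketch of the compare-before-write idea described above, assuming each downloaded page is available as bytes in memory one file at a time. The helper name `write_if_changed` is hypothetical and not taken from the actual scraping script.

```python
from pathlib import Path


def write_if_changed(path: Path, new_data: bytes) -> bool:
    """Write new_data to path only if it differs from the current contents.

    Returns True if the file was written, False if it was left untouched.
    """
    try:
        if path.read_bytes() == new_data:
            return False  # identical content: skip the write entirely
    except FileNotFoundError:
        pass  # no existing file yet, so this is a new download
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_data)
    return True
```

Only one file is held in memory at a time, and unchanged files cost a read rather than a write. Since this replaces the "rm data/*" step, files that have disappeared upstream would need to be deleted separately.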
00:36

JAA

Ah
00:56

nicolas17

it's like 480MB, I wouldn't want to keep that in memory between runs
01:10

Pedrosso

Oh yeah not from like a minute or two, at least for me
03:36

Ryz

Is there anything else to save from Evernote? Is there user content to get? Considering techcrunch.com/2023/11/29/its-offic…ill-restrict-free-users-to-50-notes
03:40

nicolas17

Ryz: I think it's all private content sooo
03:41

lindowsME

Hey were the videos from funnyordie.com saved?
03:41

lindowsME

they're not on the site anymore (since years ago), but seem to all still be on S3. the old cdn used to redirect, now 404s.
03:41

lindowsME

web.archive.org/cdx/search/cdx?url=vo.fod4.com/v/*&limit=1000
03:41

lindowsME

web.archive.org/cdx/search/cdx?url=…ideos.funnyordie.com/v/*&limit=1000
03:42

lindowsME

archive.org 403s
11:53

Inti83

Hi, I am here with a request similar to the EndOfTerm archive but for Argentina, as the incoming government has already stated its intent to dismantle most agencies
11:54

Inti83

We are already working on archiving the data by downloading it, as we understand archive.org doesn't necessarily automatically index all pages. We understand ArchiveBot helps with this
11:54

Inti83

The new term starts on 10th of December
11:55

Inti83

We have compiled a list of sites which is not exhaustive
11:55

Inti83

I found argentina.gob.ar and educ.ar in the archive but there are quite a few more that are not
11:56

Inti83

It would be preferable to have the sites in archive.org rather than just downloading them for preservation, as this ensures public access for all, whereas distribution after downloading is complex
12:06

Inti83

Some of the content is multimedia and we are having a hard time knowing how to archive it
12:06

Inti83

Example cont.ar
12:14

Sanqui

Inti83: Hello, please stick around, we're definitely able to help with this.
12:15

Sanqui

ArchiveBot is able to crawl and download many websites, which then get uploaded to the Internet Archive and become possible to browse in the Wayback Machine
12:15

Sanqui

It has its limitations though
12:15

Sanqui

A good start would be to create a page on the wiki with a list of websites, then we can make notes for if individual websites were successful to crawl with ArchiveBot
12:16

Sanqui

(BTW, I can't access cont.ar at all over here at Europe. I'm getting a Cloudflare block page.)
12:17

Inti83

Hi, cloudflare may be a problem, we encountered some problems with this even at this end
12:17

Inti83

OK, I'll get started on a wiki page
12:33

Inti83

What is a good nomenclature for such a page?
13:45

Inti83

Hi, I keep getting disconnected. I wrote earlier about an End Of Term archive for Argentina. I was wondering how to follow nomenclature norms in order to start a new page and add the links as suggested?
13:45

Inti83

There's a Government Backup page but it is US based
13:48

thuban

Inti83: "Argentina" is fine (we have a number of country pages, and can always add other sections/subpages as needed)
13:48

Inti83

ok
14:18

Inti83

Hey, OK. I sent the page for review
14:19

Inti83

It has a list of pages we have compiled so far as relevant, although we have issued out a call so will be most likely adding more
14:20

Inti83

I'll likely get disconnected again soon but I will connect again when poss
15:01

rewby|backup

Inti83: I've approved your page.
15:01

Inti83

Thank you!
15:01

h2ibot

Inti83 created Argentina (+4734, Add cultural links & YT): wiki.archiveteam.org/?title=Argentina
15:05

Inti83

Thanks, we are going to test using grab-site to save these. If this works, do we consider the site saved? Or how do you usually proceed in these cases? This tool: github.com/ArchiveTeam/grab-site
15:28

TheTechRobo

Inti83: Generally, WARCs from most people won't get added to the Wayback Machine as there is the possibility of tampering. But if it works with grab-site, it will almost certainly work with ArchiveBot as they share the same crawling code
16:42

Naruyoko

abcnews.go.com/Business/google-begi…e-gmail-accounts/story?id=105281283
16:42

Naruyoko

Has anyone noticed this? Google will start deleting inactive accounts.
16:49

Naruyoko

(I wasn't in #googlecrash, so I can't see history)
18:15

nulldata

Naruyoko - Yeah, that is what prompted the grab for Blogger. #frogger
18:25

Naruyoko

I see
18:26

soap

I have a list of ~2000 or so cdn.discordapp.com urls from wiki.tockdom.com, would someone mind adding them to archivebot for me? transfer.archivete.am/j5cLW/tockdom_discord_urls.txt
18:26

eggdrop

inline (for browser viewing): transfer.archivete.am/inline/j5cLW/tockdom_discord_urls.txt
18:27

soap

or is there something else I should do with them?
18:27

JAA

soap: Sure, I'll throw them in.
18:28

soap

thanks!
19:24

nicolas17

JAA: what are the requirements for an archivebot pipeline?
19:25

nicolas17

if there's more .ar sites blocked so they only work from Argentina, it could be a problem
19:25

nicolas17

"I can't access cont.ar at all over here at Europe. I'm getting a Cloudflare block page."
19:26

inti83

cont.ar and cine.ar have user only content which may be why
19:27

JAA

nicolas17: Right. Stable machine with uptime measured at least in the months. Clean network. For the hardware, SSDs are basically required to operate at an acceptable speed, but otherwise, things can be scaled to fit what's available; the ideal machine would have a good number of CPU cores/threads.
19:28

JAA

RAM is rarely relevant, but more is better for caching.
19:28

nicolas17

running an archivebot crawler from a .ar IP would help with those cases, I doubt I can offer hardware for that but maybe I (or inti83) can find people who can?
19:29

JAA

For a more targeted project rather than a general pipeline, the uptime requirement would be less strict, I suppose.
19:29

inti83

is that something like grab-site?
19:29

JAA

Since we'll want to archive these things within weeks anyway.
19:29

inti83

yes; i think i can find people
19:29

inti83

what do i need to do?
19:29

JAA

grab-site is essentially a local version of ArchiveBot.
19:30

JAA

'Local' as in 'not distributed'; AB has a control node to coordinate the different machines (pipelines).
19:31

inti83

how would we run the archivebot from here?
19:32

nicolas17

I think I know people with servers inside the Cabase IXP :D
19:33

inti83

cool let me know and i can ask people who are on this whether they have the hardware capacity, may be possible
19:34

JAA

So the AB setup is fairly messy, and the install notes aren't entirely complete I think. If I can get access to a suitable server provided by a trustworthy party, I can set it up.
19:35

pokechu22

I assume this would be set up as a matchonly pipeline (though probably without matchonly in the name so that -p matchonly doesn't hit it), to avoid long-running jobs accidentally ending up on it?
19:35

JAA

Yes
20:17

nicolas17

I just found something interesting for future data-analysis purposes, archive.org has "access-control-allow-origin: *", so you can make client-side JS code to eg. get a cdx file and process it and return the extracted data, and do distributed computing by just giving people a link, kind of like the imgur bruteforce thing :D
20:45

inti83

do you have any tips on archiving atom archives? we are having some trouble: share.riseup.net/#G_1seXPsbK1wKVUwdMCNpw
20:50

inti83

so many links
20:51

pokechu22

That probably needs ignores of some sort but I don't have any specific recommendations
20:53

inti83

yeah, sadly this endpoint is used for everything: it always goes through it :/
20:53

JAA

It looks like there is filter faceting, but that might not be the only thing.
22:49

AK

Woot possibly more AB pipelines?
