-
JAA
boet: Sounds fun, and comprehensive DNS datasets are hard to come by! So please release the raw data in bulk, too. I'd assume it won't be very large, and being textual, it'd compress very well, so an upload to IA would likely be a good idea.
-
pabs
-
that_lurker
off
-
that_lurker
s/off/oof
-
OrIdow6
!tell kerim What do you want to tell us about animekalesi.com?
-
eggdrop
[tell] ok, I'll tell kerim when they join next
-
andybak
I wonder if I could get some guidance. I'm trying to retrieve 150 x 50 GB warc.gz files from archive.org and it's going very slowly. Also, the extraction from the WARCs is super slow (lots of small files). We're trying to make all of Google Poly available again, and this is one of our roadblocks.
-
andybak
I'm not entirely sure what I'm asking - but is there anything I could be doing differently?
-
andybak
(For clarification, there are two related projects, polygone.art and poly.pizza, which were involved in the initial scrape, but they've chosen to make only a subset of the files available, and we specifically want to make them available more comprehensively.)
-
arkiver
andybak: are you downloading this concurrently?
-
andybak
arkiver - no. I'm using aria2c which I think is using 16 separate connections
-
andybak
aria_cmd = "aria2c -c -s 16 -x 16 {0}"
-
andybak
Hmmmm. I just switched to an SSD on USB 3.1 and it seems a lot better. Might just be a crappy USB port or spinning disk.
-
nimaje
As far as I know, using multiple connections is downloading concurrently.
-
JAA
You may get better throughput if you download multiple files in parallel rather than a single file with multiple connections. I'm not sure if aria2c supports the former at all, but with the options above, it definitely does the latter.
-
andybak
I can always launch multiple instances of aria2. I'll play around.
-
JAA
Right, --max-concurrent-downloads aka -j.
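(A rough sketch of what that could look like, assuming the WARC URLs have been collected into a hypothetical warc-urls.txt, one URL per line; -c resumes interrupted downloads, -j sets how many files download at once, and -x/-s cap connections per file:

    aria2c -c -i warc-urls.txt -j 4 -x 4 -s 4

With several files in flight, a smaller -x per file tends to work at least as well as 16 connections on a single file.)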
-
andybak
Is grabbing the warcs themselves the right thing to do here? Instead of - I dunno - grabbing the file contents directly from wayback urls?
-
JAA
IA has two copies of each item, and each copy is on a single HDD. So by going highly parallel, those two HDDs get very sad with seeking.
-
JAA
At that magnitude, downloading the WARCs is the right approach. Whether unpacking them makes sense depends on how you'll use them.
-
andybak
what are the alternatives to unpacking them? treating them like a virtual file system and grabbing files as needed?
-
andybak
I hadn't even thought of that. I guess I need to test the overhead of that approach.
-
JAA
Yeah, a custom self-hosted Wayback Machine, if you will.
-
andybak
I'm usually iterating through 1000s of small files rapidly collating metadata.
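(For that kind of metadata pass, one option is to read the WARC records in place rather than unpacking them first; a minimal sketch using the warcio library, with *.warc.gz as a placeholder for the downloaded files:

    import glob
    from warcio.archiveiterator import ArchiveIterator

    # Walk every response record in the downloaded WARCs without extracting anything to disk.
    for path in glob.glob('*.warc.gz'):
        with open(path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                uri = record.rec_headers.get_header('WARC-Target-URI')
                length = record.rec_headers.get_header('Content-Length')
                print(uri, length)
                # record.content_stream().read() gives the payload if it's ever needed

Whether this beats unpacking depends on how often the same payloads need to be re-read.)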
-
JAA
There's pywb and openwayback, but not sure they're appropriate for this use case.
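(If replay does turn out to be the right fit, a rough sketch of standing pywb up over the downloaded WARCs, assuming a collection name of "poly":

    pip install pywb
    wb-manager init poly
    wb-manager add poly /path/to/warcs/*.warc.gz
    wayback

That serves a local wayback-style interface at http://localhost:8080/poly/ by default.)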
-
andybak
Yeah, I think I'm OK now I've realised that the bottleneck isn't actually the download! I've never had broadband fast enough before that it wasn't the limiting factor.
-
yarrow_alt
AppleVis feels like a particularly important closure: 1) it's an underserved community that has relied heavily on this resource, 2) there's no clear alternative, 3) the site was influential with Apple employees and even management, and 4) since the community consists of blind and visually impaired users, special care may need to be taken to ensure the archive of the site works with screen readers and other accessibility tools.
-
pokechu22
Relating to that, I assume downloading a WARC from an item where other WARCs are currently being uploaded would run into that same seeking problem?
-
pokechu22
or derived, I guess
-
JAA
Derives run on a separate machine. With archive.php tasks, it could happen, yeah.
-
fireonlive
-
eggdrop
-
JAA
-
JAA
Well, one lot of 10, I guess.
-
fireonlive
ah damn, gone.
-
fireonlive
JAA++
-
eggdrop
[karma] 'JAA' now has 87 karma!
-
fireonlive
i now have more channel space
-
OrIdow6
yarrow_alt: We're already archiving it
-
OrIdow6
What do you mean that "special care may need to be taken to ensure the archive of the site works..."?
-
OrIdow6
What specifically?
-
fireonlive
Oh right, I should compile a list and remove them from eggdrop's channelfile and firebot's database...
-
fireonlive
(eggdrop will keep trying to join forever)
-
h2ibot
JustAnotherArchivist edited Operation London Bridge (-1):
wiki.archiveteam.org/?diff=53145&oldid=48983
-
h2ibot
JustAnotherArchivist edited BuzzVideo (+23):
wiki.archiveteam.org/?diff=53147&oldid=49421
-
h2ibot
JustAnotherArchivist edited Pandora.tv (+23):
wiki.archiveteam.org/?diff=53148&oldid=49429