-
JAA
boet: Sounds fun, and comprehensive DNS datasets are hard to come by! So please release the raw data in bulk, too. I'd assume it won't be very large, and being textual, it'd compress very well, so an upload to IA would likely be a good idea.
-
pabs
-
that_lurker
off
-
that_lurker
s/off/oof
-
OrIdow6
!tell kerim What do you want to tell us about animekalesi.com?
-
eggdrop
[tell] ok, I'll tell kerim when they join next
-
andybak
I wonder if I could get some guidance. I'm trying to retrieve 150 x 50 GB warc.gz files from archive.org and it's going very slowly. Also, the extraction from the WARCs is super slow (lots of small files). We're trying to make all of Google Poly available again, and this is one of our roadblocks.
-
andybak
I'm not entirely sure what I'm asking - but is there anything I could be doing differently?
-
andybak
(For clarification, there are two related projects, polygone.art and poly.pizza, which were involved in the initial scrape, but they've chosen to make only a subset of the files available, and we specifically want to make them available more comprehensively.)
-
arkiver
andybak: are you downloading this concurrently?
-
andybak
arkiver - no. I'm using aria2c which I think is using 16 separate connections
-
andybak
aria_cmd = "aria2c -c -s 16 -x 16 {0}"
-
andybak
Hmmmm. I just switched to an SSD on USB 3.1 and it seems a lot better. Might just be a crappy USB port or spinning disk.
-
nimaje
As far as I know, using multiple connections is downloading concurrently.
-
JAA
You may get better throughput if you download multiple files in parallel rather than a single file with multiple connections. I'm not sure if aria2c supports the former at all, but with the options above, it definitely does the latter.
-
andybak
I can always launch multiple instances of aria2. I'll play around.
-
JAA
Right, --max-concurrent-downloads aka -j.
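(A rough sketch of what that could look like, assuming the WARC URLs have been collected into a hypothetical warc-urls.txt, one URL per line; -c resumes interrupted downloads, -j sets how many files download at once, and -x/-s cap connections per file:

    aria2c -c -i warc-urls.txt -j 4 -x 4 -s 4

With several files in flight, a smaller -x per file tends to work at least as well as 16 connections on a single file.)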
-
andybak
Is grabbing the warcs themselves the right thing to do here? Instead of - I dunno - grabbing the file contents directly from wayback urls?
-
JAA
IA has two copies of each item, and each copy is on a single HDD. So by going highly parallel, those two HDDs get very sad with seeking.
-
JAA
At that magnitude, downloading the WARCs is the right approach. Whether unpacking them makes sense depends on how you'll use them.
-
andybak
what are the alternatives to unpacking them? treating them like a virtual file system and grabbing files as needed?
-
andybak
I hadn't even thought of that. I guess I need to test the overhead of that approach.
-
JAA
Yeah, a custom self-hosted Wayback Machine, if you will.
-
andybak
I'm usually iterating through 1000s of small files rapidly collating metadata.
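(For that kind of metadata pass, one option is to read the WARC records in place rather than unpacking them first; a minimal sketch using the warcio library, with *.warc.gz as a placeholder for the downloaded files:

    import glob
    from warcio.archiveiterator import ArchiveIterator

    # Walk every response record in the downloaded WARCs without extracting anything to disk.
    for path in glob.glob('*.warc.gz'):
        with open(path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                uri = record.rec_headers.get_header('WARC-Target-URI')
                length = record.rec_headers.get_header('Content-Length')
                print(uri, length)
                # record.content_stream().read() gives the payload if it's ever needed

Whether this beats unpacking depends on how often the same payloads need to be re-read.)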
-
JAA
There's pywb and openwayback, but not sure they're appropriate for this use case.
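(If replay does turn out to be the right fit, a rough sketch of standing pywb up over the downloaded WARCs, assuming a collection name of "poly":

    pip install pywb
    wb-manager init poly
    wb-manager add poly /path/to/warcs/*.warc.gz
    wayback

That serves a local wayback-style interface at http://localhost:8080/poly/ by default.)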
-
andybak
Yeah, I think I'm OK now I've realised that the bottleneck isn't actually the download! I've never had broadband fast enough before that it wasn't the limiting factor.
-
yarrow_alt
AppleVis feels like a particularly important closure: 1) it's an underserved community that has relied heavily on this resource, 2) there's no clear alternative, 3) the site was influential with Apple employees and even management, and 4) since the community consists of blind and visually impaired users, special care may need to be taken to ensure the archive of the site works with screen readers and other accessibility tools.
-
pokechu22
Relating to that, I assume downloading a WARC from an item where other WARCs are currently being uploaded would run into that same seeking problem?
-
pokechu22
or derived, I guess
-
JAA
Derives run on a separate machine. With archive.php tasks, it could happen, yeah.
-
fireonlive
-
eggdrop
-
JAA
-
JAA
Well, one lot of 10, I guess.
-
fireonlive
ah damn, gone.
-
fireonlive
JAA++
-
eggdrop
[karma] 'JAA' now has 87 karma!
-
fireonlive
i now have more channel space
-
OrIdow6
yarrow_alt: We're already archiving it
-
OrIdow6
What do you mean that "special care may need to be taken to ensure the archive of the site works..."?
-
OrIdow6
What specifically?
-
fireonlive
Oh right, I should compile a list and remove them from eggdrop's channelfile and firebot's database...
-
fireonlive
(eggdrop will keep trying to join forever)
-
h2ibot
JustAnotherArchivist edited Operation London Bridge (-1):
wiki.archiveteam.org/?diff=53145&oldid=48983
-
h2ibot
JustAnotherArchivist edited BuzzVideo (+23):
wiki.archiveteam.org/?diff=53147&oldid=49421
-
h2ibot
JustAnotherArchivist edited Pandora.tv (+23):
wiki.archiveteam.org/?diff=53148&oldid=49429