-
project10
so I went looking at my 135G zowa warc on IA. Found it at
archive.org/download/archiveteam_zowa_20230923012400_df2de1d0 but also at
archive.org/download/archiveteam_zowa_20230924040422_7fbffef8. Why would there be two copies, uploaded on different days with different filenames/timestamps?
-
JAA
Probably the item was reclaimed and completed twice (or more times).
-
project10
oh, interesting. I assume IA won't dedupe/reap these and they will show on the WBM as captures on different days?
-
JAA
Yes
-
project10
ok, good to know the total size displayed on the tracker is not necessarily indicative of the amount shipped to IA
-
anarcat
so this debian developer died
abrahamraji.in
-
anarcat
i'm going to crawl that site and
wiki.abrahamraji.in
-
anarcat
there's also
youtube.com/@abrahamraji3699 i'm not sure what to do with
-
anarcat
oh looks like pabs already did it
-
pabs
anarcat: yeah, well covered
-
pabs
anarcat: did the youtube in #down-the-tube
-
anarcat
thanks
-
anarcat
so sad
-
pabs
the mastodon I don't think can be saved, too much JS and AT doesn't save fediverse I thought
-
anarcat
ack
-
pabs
if we wanted to, this could be repurposed for that
github.com/jwilk/zygolophodon
-
flashfire42
I dunno what happened but I am seeing a lot more movement across the warrior projects
-
Exdetransitioner
does anybody have access to genspect's chatroom?
-
Exdetransitioner
they claim to run a semi-secret forum where they discuss anti-trans extremist talking points
-
thunder_steak
how is it decided how often a website will be crawled/snapshotted? e.g.
zwisler.de
-
pabs
thunder_steak: in what context? for ArchiveBot, usually when the site is closing or there is another reason for doing it
-
thunder_steak
pabs e.g.
zwisler.de has been snapshotted multiple times but with no constant frequency
-
pabs
I guess you mean in web.archive.org. if you click the "About this capture" thing on the top right, you can get some idea
-
pabs
as you can see here, zero of those were ArchiveTeam ArchiveBot snapshots:
archive.fart.website/archivebot/viewer/?q=zwisler.de
-
JAA
My Canucks forums topic page qwarc grab finished earlier today without any obvious issues.
-
JAA
196068 We could not find that topic.
-
JAA
21026 You do not have permission to view this topic.
-
JAA
122653 There are no posts to show
-
JAA
The rest of the 409104 topic IDs were retrieved.
-
JAA
I got approximately 6007327 posts, which matches the homepage. :-)
-
JAA
I might try to grab new posts as they're being made until the shutdown if I have time to set that up.
-
JAA
Although the post URLs require a topic ID, it doesn't have to be correct; you can do something like
forum.canucks.com/topic/0-x/?do=findComment&comment=16942183 instead.
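A minimal sketch of building such a post URL from a comment ID alone, using the placeholder topic segment `0-x` from the example above. The function name `post_url` and the `https://` scheme are assumptions for illustration; only the path and query format come from the message.

```python
from urllib.parse import urlencode


def post_url(comment_id: int, topic_slug: str = "0-x") -> str:
    """Build a findComment URL; the topic slug does not need to be the real one."""
    # Query parameters in the same order as the example URL.
    query = urlencode({"do": "findComment", "comment": comment_id})
    # The https:// scheme is an assumption; the chat message gives no scheme.
    return f"https://forum.canucks.com/topic/{topic_slug}/?{query}"


if __name__ == "__main__":
    print(post_url(16942183))
```

This would make it possible to enumerate posts by comment ID without first knowing which topic each one belongs to.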
-
JAA
FOIAonline completion rate has slowed down due to larger items, now at about a third done and an estimated 3 TiB total. ETA is still on time but only just (a bit over 4 days).
-
JAA
(That's based on the rate of the past 6 hours.)
-
JAA
Actually, probably closer to 4 TiB.
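A rough sketch of how estimates like these can be derived: extrapolate the total size from the fraction completed, and compute the ETA from the completion rate over a recent window. The function names and all numbers in the example are hypothetical, not the actual FOIAonline figures.

```python
def estimate_total_size(bytes_done: float, fraction_done: float) -> float:
    """Linearly extrapolate total size from the share of items completed."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return bytes_done / fraction_done


def eta_days(total_items: int, done_items: int,
             done_in_window: int, window_hours: float) -> float:
    """Days remaining, assuming the rate of the recent window continues."""
    if done_in_window <= 0 or window_hours <= 0:
        raise ValueError("need a positive completion rate")
    rate_per_hour = done_in_window / window_hours
    return (total_items - done_items) / rate_per_hour / 24


if __name__ == "__main__":
    # Hypothetical: 1 TiB fetched with a third of the items done.
    print(estimate_total_size(1.0, 1 / 3), "TiB estimated total")
    # Hypothetical: 100 of 300 items done, 12 finished in the past 6 hours.
    print(eta_days(300, 100, 12, 6), "days remaining")
```

Both estimates are linear, so they will run low if the remaining items are larger or slower than the ones already finished.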
-
thuban
hm, rough--chronological ordering suggests sizes will continue to increase
-
JAA
Yeah
-
JAA
I can try throwing more concurrency at it. My machine is nowhere near its limits.
-
JAA
And I haven't seen any rate limiting or blocks whatsoever so far, just some random timeouts.
-
thuban
seems wise, especially if you can adjust on the fly. what tooling are you using?
-
JAA
qwarc
-
JAA
I can't adjust the concurrency of running processes, but I can add more processes. :-)
-
thuban
>:?
-
JAA
(I'd have to stop them, ideally gracefully, for the former.)
-
JAA
I originally had one process at 25 concurrency, but that was far from ideal because it got blocked sometimes by large downloads.
-
JAA
So now it's 5 processes with 5 concurrency each.
-
thuban
ah, i forgot qwarc runs off a database and everything. it's sufficiently self-organizing that you can just tell new processes to jump in, then?
-
JAA
Yep, each process just takes items from the DB, processes them, and writes the new status back (plus any new items it might've discovered, not relevant in this case).
-
thuban
neat
-
JAA
It really is pretty much like a local tracker in that respect. That's what I modelled it after conceptually, anyway.
-
JAA
Also, some of the timeouts I'm seeing are actually due to large downloads taking time to process, similar to the problems in wpull.
-
JAA
Eventually™, I'll refactor that so the actual HTTP stuff happens in a separate thread.
-
Rootliam
I got a response from Jason Scott about yahoo video with "all I can say is all the data is up there, one way or another. There's no other stores out there."
-
Rootliam
I'm not really sure if that means it could have been mixed up with something else or if it wasn't uploaded then it's gone forever
-
thuban
Rootliam: did you ever open that github issue?
-
Rootliam
No but I guess I should do that soon
-
flashfire42
Wait we are completely clogged? Like completely?
-
flashfire42
So is it Optane9 again rewby or is the transfer stuck?
-
flashfire42
Ok looks like it's optane9 that needs a kick, if you have access to it JAA. I did a test: Mediafire uses a separate target and one of them went through fine
-
JAA
flashfire42: Please stop.
-
JAA
Targets are doing target things as well as they can. The situation isn't great, and everyone's aware of it.