01:17:35 so I went looking at my 135G zowa warc on IA. Found it at https://archive.org/download/archiveteam_zowa_20230923012400_df2de1d0 but also at https://archive.org/download/archiveteam_zowa_20230924040422_7fbffef8. Why would there be two copies, uploaded on different days with different filenames/timestamps?
01:18:10 Probably the item was reclaimed and completed twice (or more times).
01:18:59 oh, interesting. I assume IA won't dedupe/reap these and they will show on the WBM as captures on different days?
01:19:10 Yes
01:20:14 ok, good to know the total size displayed on the tracker is not necessarily indicative of the amount shipped to IA
03:06:04 so this debian developer died https://abrahamraji.in/
03:06:39 i'm going to crawl that site and https://wiki.abrahamraji.in/
03:06:50 there's also https://www.youtube.com/@abrahamraji3699/ which i'm not sure what to do with
03:07:34 there's also https://gitlab.com/avron https://aana.site/@avronr - same
03:08:46 oh looks like pabs already did it
03:09:04 anarcat: yeah, well covered
03:09:21 anarcat: did the youtube in #down-the-tube
03:09:38 thanks
03:09:41 so sad
03:10:02 the mastodon I don't think can be saved, too much JS, and I thought AT doesn't save fediverse
03:19:13 ack
03:20:03 if we wanted to, this could be repurposed for that https://github.com/jwilk/zygolophodon
08:10:10 I dunno what happened but I am seeing a lot more movement across the warrior projects
09:35:02 does anybody have access to genspect's chatroom?
09:35:06 https://www.dailydot.com/debug/genspect/
09:35:28 they claim to run a semi-secret forum where they discuss anti-trans extremist talking points
10:29:40 how is it decided how often a website will be crawled/snapshotted? e.g. http://zwisler.de/
13:42:41 thunder_steak: in what context? for ArchiveBot, usually when the site is closing or there is another reason for doing it
14:17:24 pabs: e.g. http://zwisler.de/ has been snapshotted multiple times but at no constant frequency
14:24:25 I guess you mean on web.archive.org. if you click the "About this capture" thing in the top right, you can get some idea
14:25:00 as you can see here, zero of those were ArchiveTeam ArchiveBot snapshots: https://archive.fart.website/archivebot/viewer/?q=zwisler.de
14:39:45 My Canucks forums topic page qwarc grab finished earlier today without any obvious issues.
14:43:20 196068 We could not find that topic.
14:43:21 21026 You do not have permission to view this topic.
14:43:21 122653 There are no posts to show
14:43:40 The rest of the 409104 topic IDs were retrieved.
15:31:52 I got approximately 6007327 posts, which matches the homepage. :-)
15:34:28 I might try to grab new posts as they're being made until the shutdown, if I have time to set that up.
15:35:01 Although the post URLs require a topic ID, it doesn't have to be correct; you can do something like https://forum.canucks.com/topic/0-x/?do=findComment&comment=16942183 instead.
18:32:07 FOIAonline completion rate has slowed down due to larger items, now at about a third done and an estimated 3 TiB total. ETA is still on time but only just (a bit over 4 days).
18:32:21 (That's based on the rate of the past 6 hours.)
18:34:32 Actually, probably closer to 4 TiB.
18:35:27 hm, rough -- chronological ordering suggests sizes will continue to increase
18:41:59 Yeah
18:44:03 I can try throwing more concurrency at it. My machine is nowhere near its limits.
18:44:21 And I haven't seen any rate limiting or blocks whatsoever so far, just some random timeouts.
18:46:54 seems wise, especially if you can adjust on the fly. what tooling are you using?
18:49:59 qwarc
18:50:24 I can't adjust the concurrency of running processes, but I can add more processes. :-)
18:50:48 >:?
18:50:52 (I'd have to stop them, ideally gracefully, for the former.)
18:51:46 I originally had one process at 25 concurrency, but that was far from ideal because it sometimes got blocked by large downloads.
18:51:52 So now it's 5 processes with 5 concurrency each.
18:56:53 ah, i forgot qwarc runs off a database and everything. it's sufficiently self-organizing that you can just tell new processes to jump in, then?
18:57:57 Yep, each process just takes items from the DB, processes them, and writes the new status back (plus any new items it might've discovered, not relevant in this case).
18:58:24 neat
19:01:04 It really is pretty much like a local tracker in that respect. That's what I modelled it after conceptually, anyway.
19:02:36 Also, some of the timeouts I'm seeing are actually due to large downloads taking time to process, similar to the problems in wpull.
19:03:14 Eventually™, I'll refactor that so the actual HTTP stuff happens in a separate thread.
19:36:50 I got a response from Jason Scott about yahoo video: "all I can say is all the data is up there, one way or another. There's no other stores out there."
19:37:20 I'm not really sure if that means it could have been mixed up with something else, or if it wasn't uploaded then it's gone forever
19:37:54 Rootliam: did you ever open that github issue?
19:38:06 No, but I guess I should do that soon
20:52:05 Wait, we are completely clogged? Like completely?
23:29:38 So is it Optane9 again, rewby, or is the transferring stuck?
23:32:29 flashfire42: Please stop.
23:32:19 Ok, looks like it's optane9 that needs a kick if you have access to it, JAA. I did a test and Mediafire uses a separate target, and one of them went through fine
23:33:05 Targets are doing target things as well as they can. The situation isn't great, and everyone's aware of it.
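[Editor's note: the "local tracker" worker model described in the 18:56–19:01 exchange (each process claims pending items from a shared database, processes them, and writes the new status back, so extra processes can join at any time) can be sketched roughly as below. This is not qwarc's actual code; the table and column names are illustrative assumptions.]

```python
import sqlite3

def claim_item(conn):
    """Atomically claim one 'todo' item, or return None if none are left."""
    # Take the write lock up front so two workers can't grab the same row.
    conn.execute("BEGIN IMMEDIATE")
    row = conn.execute(
        "SELECT id, url FROM items WHERE status = 'todo' LIMIT 1"
    ).fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute("UPDATE items SET status = 'claimed' WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row

def worker(db_path):
    # isolation_level=None -> autocommit mode; transactions are managed
    # explicitly in claim_item() above.
    conn = sqlite3.connect(db_path, isolation_level=None)
    while True:
        item = claim_item(conn)
        if item is None:
            break  # nothing left to do; workers can join or leave at any time
        item_id, url = item
        # ... fetch `url` and write WARC records here ...
        conn.execute("UPDATE items SET status = 'done' WHERE id = ?", (item_id,))
```

Because all coordination happens through the database, "adding more concurrency" is just starting another `worker()` process pointed at the same file, which matches how extra qwarc processes were added mid-run above.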