02:04:24 arkiver: ^
03:08:00 AK: we do have some stashes here and there, but I consider them different from the stashes we just finished here
03:08:23 the stashes we just finished here were literally stashed away from this project in times of lack of capacity to get them done
03:08:38 when we restart #shreddit a bunch of URLs will be added here as well again
03:08:55 qwertyasdfuiopghjkl: maybe yeah, i'll look into exporting those queued to a file for further processing
03:34:26 imer: we'll be queuing the PDFs!
03:34:55 will do some checks
03:36:52 together with the URL extraction we have from PDFs, this is going to give some very nice data
03:38:27 ^_^
03:38:58 the discord ones will 'die forever' (to the public) at some point; unless they can be rehydrated.. but even then they'll have different parameters sadly
04:50:18 WBM wildcard search would still be able to find them if they can be rehydrated, wouldn't it?
08:39:35 if anyone has lists of discord stuff left, we could look into queuing it
08:39:39 before they kill the old URLs
08:39:52 and especially URLs linked to from non-discord sources would be interesting to archive
09:37:58 I imagine it would take more storage than is plausible, but it would be nice if we could keep a big list of all URLs/links seen in our projects/CC WARCs/whatever, crawled or not, for situations like this
09:39:44 a first batch of PDFs is going in
09:40:25 OrIdow6: agreed. and i had something planned for that here, but at the time it didn't go through, for performance reasons i believe
09:40:29 but it's worth doing another try
09:45:02 imer: there's a lot of PDFs... this may take a while, will queue them in batches
09:53:36 sorry to all CPUs working on this project :P
09:53:49 when this batch is queued, i'll move it to todo:secondary
10:15:19 arkiver: Anywhere I can find more info on that?
11:09:16 no, it was simply dumping any URLs found here (also those not queued) to a separate project, and dumping data from that project to disk for a daily item on IA with URLs
11:09:42 the bloom filter would be cleared monthly, so we don't spend too many resources on that
11:09:56 OrIdow6: so relatively easy, but i need to check if backfeed can handle the extra load
11:22:11 arkiver just noticed a "not in list" error https://transfer.archivete.am/inline/MESyR/not_in_list
11:22:47 interesting, thank you
11:23:07 oh the port
13:28:32 arkiver: nice! :D load average: 346.62, 340.99, 338.43
16:51:57 arkiver: also a reminder to move to secondary, looks like queuing is done?
19:17:58 WBM wildcard search would still be able to find them if they can be rehydrated, wouldn't it? < i think so yeah
19:18:23 though it wouldn't 'just work' for embeds etc.
19:22:18 In theory, the WBM could ignore those signature query parameters for the relevant domain(s).
19:27:54 ah true
19:28:14 there are some hostnames that need some love and care (manual intervention)
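
The chat doesn't spell out how "the URL extraction we have from PDFs" works; below is a minimal sketch of one way to do it, assuming pdfminer.six for the text layer. The function name is illustrative, and link annotations (/URI entries) in the PDF are not covered here, so this is a lower bound on what a PDF actually links to.

```python
# Sketch: pull URLs out of a PDF's text layer using pdfminer.six.
import re
import sys

from pdfminer.high_level import extract_text

# Deliberately loose pattern: stop at whitespace and common URL-ending
# characters, then clean up trailing punctuation afterwards.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)

def urls_from_pdf(path: str) -> set[str]:
    text = extract_text(path)
    # Text extraction reflows lines, so strip punctuation that usually
    # belongs to the sentence rather than the URL.
    return {m.group(0).rstrip(".,;") for m in URL_RE.finditer(text)}

if __name__ == "__main__":
    for url in sorted(urls_from_pdf(sys.argv[1])):
        print(url)
```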
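
The 11:09 messages describe the planned "all URLs seen" dump: every discovered URL (queued or not) goes to a side project, with a bloom filter in front so duplicates aren't re-dumped, and the filter cleared monthly to bound resource use. A minimal sketch of that dedup idea; the sizing numbers and class are illustrative assumptions, not the project's actual code.

```python
# Sketch: bloom filter gating a "URLs seen" dump, swapped out monthly.
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m bits and k hash functions for the target
        # capacity and false-positive rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from one digest.
        d = hashlib.blake2b(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str) -> bool:
        """Add item; return True if it was (probably) seen before."""
        seen = True
        for p in self._positions(item):
            byte, bit = divmod(p, 8)
            if not self.bits[byte] >> bit & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

# One filter per calendar month; dropping the old one is the "cleared
# monthly" from the chat, trading some re-dumped duplicates for bounded RAM.
```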
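
The "WBM wildcard search" question can be checked against the public Wayback Machine CDX API, which supports prefix matching on URLs (the same effect as a trailing `*` in the web UI). A sketch, with a placeholder Discord attachment prefix; the helper name is made up for illustration.

```python
# Sketch: list Wayback captures under a URL prefix via the CDX API.
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def captures_under(prefix: str, limit: int = 50) -> list[tuple[str, str]]:
    params = {
        "url": prefix,
        "matchType": "prefix",   # prefix match, like a trailing * wildcard
        "output": "json",
        "fl": "timestamp,original",
        "limit": str(limit),
    }
    resp = requests.get(CDX, params=params, timeout=60)
    rows = resp.json() if resp.text.strip() else []
    return [tuple(r) for r in rows[1:]]  # rows[0] is the header row

# e.g. captures_under("cdn.discordapp.com/attachments/1234/5678/")
# would match captures regardless of the query string they were saved with.
```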
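
The 19:22 suggestion is that the WBM could ignore the signature query parameters on Discord CDN URLs, so a rehydrated link (fresh signature values) would still resolve to the old capture. A sketch of that canonicalization, assuming the `ex`/`is`/`hm` parameter names current Discord attachment links carry; the host list is an assumption and not exhaustive.

```python
# Sketch: strip Discord CDN signature parameters so old and rehydrated
# URLs collapse to the same lookup key.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SIGNED_HOSTS = {"cdn.discordapp.com", "media.discordapp.net"}
SIGNATURE_PARAMS = {"ex", "is", "hm"}  # expiry, issued-at, HMAC signature

def canonical(url: str) -> str:
    parts = urlsplit(url)
    if parts.hostname not in SIGNED_HOSTS:
        return url
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in SIGNATURE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

This is only a playback-side idea; as noted at 19:28, some hostnames would still need manual intervention.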