-
pabs
arkiver: ^
-
arkiver
AK: we do have some stashes here and there, but I consider them different from the stashes we just finished here
-
arkiver
the stashes we just finished here were literally stashed away from this project in times of lack of capacity to get them done
-
arkiver
when we restart #shreddit a bunch of URLs will be added here as well again
-
arkiver
qwertyasdfuiopghjkl: maybe yeah, i'll look into exporting those queued to a file for further processing
-
arkiver
imer: we'll be queuing the PDFs!
-
arkiver
will do some checks
-
arkiver
together with URL extraction we have from PDFs, this is going to give some very nice data
-
fireonlive
^_^
-
fireonlive
the discord ones will 'die forever' (to the public) at some point; unless they can be rehydrated.. but even then they'll have different parameters sadly
-
DigitalDragons
WBM wildcard search would still be able to find them if they can be rehydrated, wouldn't it?
-
arkiver
if anyone has lists of discord stuff left, we could look into queuing it
-
arkiver
before they kill the old URLs
-
arkiver
and especially URLs linked to from non-discord sources would be interesting to archive
-
OrIdow6
I imagine it would take more storage than is plausible but it would be nice if we could keep a big list of all URLs/links seen in our projects/CC WARCs/whatever, crawled or not, for situations like this
-
arkiver
a first batch of PDFs is going in
-
arkiver
OrIdow6: agreed. and i had something planned for that here, but at the time that didn't go through for performance reasons i believe
-
arkiver
but it's worth doing another try
-
arkiver
imer: there's a lot of PDFs... this may take a while, will queue them in batches
-
arkiver
sorry to all CPUs working on this project :P
-
arkiver
when this batch is queued, i'll move it to todo:secondary
-
OrIdow6
arkiver: Anywhere I can find more info on that?
-
arkiver
no, it was simply dumping any URLs found here (also those not queued) to a separate project, and dumping data from that project to disk for a daily item on IA with URLs
-
arkiver
the bloom filter would be cleared monthly, so we don't spend too many resources on that
-
arkiver
OrIdow6: so relatively easy, but i need to check if backfeed can handle the extra load
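
[Editor's sketch] The idea arkiver describes (dump every URL seen, deduplicate with a bloom filter, clear the filter monthly) could look roughly like the following. This is only an illustration under assumptions: the class name, function name, filter size, and hash scheme are all hypothetical and are not taken from the actual tracker or backfeed code.

```python
import hashlib
from datetime import date


class MonthlyBloomFilter:
    """Tiny Bloom filter that clears itself when the calendar month changes.

    Hypothetical helper for illustration; sizes are arbitrary assumptions.
    """

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size_bits = size_bits  # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)
        self.month = None  # (year, month) of the data currently held

    def _positions(self, url):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def _roll_over(self, today):
        # Clear the filter once per month so it never grows stale or full.
        if self.month != (today.year, today.month):
            self.bits = bytearray(self.size_bits // 8)
            self.month = (today.year, today.month)

    def add_if_new(self, url, today):
        """Record url; return True if it was probably not seen this month."""
        self._roll_over(today)
        new = False
        for pos in self._positions(url):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        return new


def dump_new_urls(urls, bloom, today):
    """Return the URLs that would go into today's dump, skipping repeats."""
    return [u for u in urls if bloom.add_if_new(u, today)]


if __name__ == "__main__":
    bloom = MonthlyBloomFilter()
    seen = ["https://example.com/a", "https://example.com/a"]
    print(dump_new_urls(seen, bloom, date(2024, 1, 15)))
```

A bloom filter can report false positives, so a small fraction of genuinely new URLs would be skipped until the next monthly reset; that is the trade-off for keeping memory use fixed.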
-
DLoader
-
arkiver
interesting, thank you
-
arkiver
oh the port
-
imer
arkiver: nice! :D load average: 346.62, 340.99, 338.43
-
imer
arkiver: also reminder to move to secondary, looks like queuing is done?
-
fireonlive
<DigitalDragons> WBM wildcard search would still be able to find them if they can be rehydrated, wouldn't it? < i think so yeah
-
fireonlive
though it wouldn't 'just work' for embeds etc
-
JAA
In theory, the WBM could ignore those signature query parameters for the relevant domain(s).
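
[Editor's sketch] What JAA suggests amounts to dropping the signature query parameters before computing the lookup key for certain hosts. The snippet below is a minimal illustration, not how the WBM actually canonicalises URLs. The hostnames and the parameter names (`ex`, `is`, `hm`) are assumptions added for the example and are not stated in the discussion; `SIGNATURE_PARAMS` and `lookup_key` are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: per-host sets of query parameters that only carry a signature
# or expiry and do not identify the resource itself.
SIGNATURE_PARAMS = {
    "cdn.discordapp.com": {"ex", "is", "hm"},
    "media.discordapp.net": {"ex", "is", "hm"},
}


def lookup_key(url):
    """Return url with signature parameters removed for the listed hosts."""
    parts = urlsplit(url)
    ignored = SIGNATURE_PARAMS.get(parts.hostname or "", set())
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    # The fragment is dropped; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(lookup_key("https://cdn.discordapp.com/attachments/1/2/file.png?ex=aa&is=bb&hm=cc"))
```

With a key like this, two captures of the same attachment under different signatures would map to the same entry, while URLs on other hosts keep their query strings untouched.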
-
fireonlive
ah true
-
fireonlive
there are some hostnames that need some love and care (manual intervention)