-
nicolas17
pypy?
-
OrIdow6
I don't think the Python regex crate is written in Python
-
OrIdow6
Umm
-
OrIdow6
Ignore the confusion with Rust
-
fireonlive
hm
-
fireonlive
i think there's a couple things from here that won't go to projects per se, but could be wrong: telegram, youtube, google?
-
fireonlive
youtube for unlisted, telegram for channels, and google for a potential google drive/docs/etc project
-
fireonlive
not sure if others
-
datechnoman
kiska OrIdow6 - I've already processed 80% of common crawl links and have them all stored and compressed for future projects.
-
datechnoman
My understanding is that we do not want to feed Common Crawl data in here at random, as we are trying to archive URLs that people are using/sharing, whereas Common Crawl crawls websites. It's also worth noting that Common Crawl already uploads their WARCs into the WBM, so we would be rearchiving
-
OrIdow6
datechnoman: Did you use that for Discord CDN links?
-
datechnoman
Obviously the pages most likely have changed since they were last crawled
-
datechnoman
OrIdow6 - i did for all discord, mediafire, telegram, pastebin, blogger and pdfs already.
-
datechnoman
I have about 30TB of urls stored on my home server from their releases. You're more than welcome to rerun them if you really want to
-
OrIdow6
Also curious how big that was, and how you managed to get around the issues that people were having during the Imgur project
-
datechnoman
People were processing the WARCs during the imgur project. You should instead be processing the WAT files, as they are the stripped-down version of the WARCs and hold the metadata extracted from the HTML content, including the outlinks we are after
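A minimal sketch of pulling outlinks from a single WAT record, assuming the usual Common Crawl layout where link metadata sits under `Envelope` > `Payload-Metadata` > `HTTP-Response-Metadata` > `HTML-Metadata` > `Links`. The function name `wat_outlinks` and the sample record are made up for illustration; this is not the script that was actually used.

```python
import json


def wat_outlinks(record_json: str) -> list[str]:
    """Return the link URLs found in one WAT metadata record (a JSON string)."""
    envelope = json.loads(record_json).get("Envelope", {})
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    # Each entry in "Links" is a dict; entries without a "url" key are skipped.
    return [link["url"] for link in html_meta.get("Links", []) if "url" in link]


# Hypothetical record, shaped like the assumed layout above.
sample_record = json.dumps({
    "Envelope": {
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "https://example.org/doc.pdf"},
                        {"path": "IMG@/src", "url": "/static/logo.png"},
                        {"path": "A@/href", "text": "no url here"},
                    ]
                }
            }
        }
    }
})

print(wat_outlinks(sample_record))
```

Records that are not HTML responses simply yield an empty list, so the same function can be run over every record in a WAT file.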
-
datechnoman
I used multiple servers and hundreds of dollars using python
-
datechnoman
Slow connection rates to avoid the rate limiting and a few different ip's
-
datechnoman
And it's taken me about 4 months so far
-
datechnoman
I would be checking with arkiver if we want to be dumping that kind of data into this project as I believe it falls out of scope
-
datechnoman
What I have done at the request of arkiver for the last week is run through the dumps I have created and extract the many millions of PDF links that are contained in their dumps, as they drop anything larger than 1 MB in their WARCs, and spot testing confirms this. They also don't process the outlinks in the PDFs they find, so we will be more effective at
-
datechnoman
grabbing the sites related to the documents
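A rough sketch of filtering PDF links out of a URL dump. It assumes a link counts as a PDF when its path ends in `.pdf`; the chat does not say how the links were actually selected, and `is_pdf_link` and `pdf_links` are hypothetical names.

```python
from urllib.parse import urlsplit


def is_pdf_link(url: str) -> bool:
    """True when the URL path (ignoring query and fragment) ends in .pdf."""
    return urlsplit(url).path.lower().endswith(".pdf")


def pdf_links(urls):
    """Yield each PDF URL once, in first-seen order."""
    seen = set()
    for url in urls:
        if is_pdf_link(url) and url not in seen:
            seen.add(url)
            yield url


# Made-up input for illustration.
dump = [
    "https://example.org/a/report.PDF?download=1",
    "https://example.org/pdf-guide.html",
    "https://example.org/a/report.PDF?download=1",
    "https://example.net/paper.pdf#page=2",
]
print(list(pdf_links(dump)))
```

Matching on the extension misses PDFs served from extensionless URLs; catching those would need the Content-Type recorded in the crawl metadata.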
-
OrIdow6
Did you get CC-MAIN-2022-49?
-
datechnoman
Yes. I started from latest and am back to 2020 archives atm
-
Ryz
Unsure if worth adding wodnews.com to this project as a recurring thing, since it doesn't look like it's been processed all that often when checking web.archive.org/web/20240000000000*/https://wodnews.com
-
imer
arkiver: can we get a refill? :)
-
imer
also seem to have stumbled into some porn stuff: 16.43/s pornodavid.com, 10.86/s hd-sexfilme.com. doesn't look spammy at a glance at least? sites don't look happy though, bunch of 500s/timeouts
-
imer
I'm also suspicious of how much of the tracker scrolling by is me
-
datechnoman
All about that porn :P
-
datechnoman
I'm sure fireonlive would have something to say about it
-
datechnoman
I think in the past we have filtered porn out as there is soooo much of it and it isn't as important as the lle focused outlinks we archive
-
imer
looks like that just needs a filter on the main page with ?ver
-
imer
porn seems to be mostly gone
-
arkiver
imer: good on the porn stuff - i think we should be careful with adding new filters
-
arkiver
on the number of jobs you are completing - did you scale up?
-
arkiver
i'll do a new check of the URLs we archived recently and what we may be able to ignore
-
imer
nope, didn't change anything about my deployment, seem to have stayed fairly consistent where other people seem to have dropped off, maybe the porn sites didn't work on hetzner or something?
-
imer
or I'm less cpu bottlenecked atm
-
arkiver
CPU could be it, no PDFs or sitemaps at the moment
-
imer
srhc.org.au one is still around
-
arkiver
yep, i saw it
-
arkiver
i want to know where the URLs come from
-
imer
good :)
-
arkiver
so i know how to best filter them out
-
arkiver
AK logs to the rescue :)
-
arkiver
usually i just download a log from AK and go through it
-
arkiver
and see what funny stuff is going on in there
-
arkiver
imer: found the source
-
arkiver
imer: an update is pushed that i think will handle this generally
-
arkiver
so also for other sites
-
arkiver
version updated
-
arkiver
imer: i don't see it anymore, i think it's solved itself (was not a loop in that case)
-
arkiver
i see some stuff from it now actually
-
arkiver
imer: being filtered now
-
imer
still seeing srhc.org.au around, can't spot loops on my end though - have you forced the version yet or are these maybe just old workers?
-
imer
and thanks
-
imer
kfz.osl-online.de looks like the same, arkiver; not sure if that needs a different filter
-
arkiver
imer: a wider filter is in for the kfz one
-
arkiver
there may still be srhc yes, but i hope the loop is resolved
-
arkiver
yeah
-
arkiver
imer: nd timestamp loop is out
-
imer
nice :)
-
arkiver
forced new version 20240408.04
-
arkiver
been going through a CDX, seeing another potential loop, but want to check it again in a day
-
arkiver
also yes, every loop imer saw i saw confirmed in the CDX as well, so very nicely found imer !
-
arkiver
two more loops gone, they were in the CDX. another update is out and enforced
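One way to look for timestamp loops in a list of URLs pulled from a CDX, sketched under assumptions: a loop is taken to show up as many URLs that differ only in a long run of digits (for example an epoch timestamp in a query parameter). This is a guess at a useful heuristic, not the filter that was deployed; `loop_candidates`, the 10-digit threshold, and the sample URLs are all hypothetical.

```python
import re
from collections import Counter

# Runs of 10 or more digits are treated as timestamps (assumption).
LONG_DIGITS = re.compile(r"\d{10,}")


def loop_candidates(urls, min_variants=3):
    """Return URL patterns that occur with at least min_variants distinct timestamps."""
    counts = Counter(LONG_DIGITS.sub("<ts>", url) for url in set(urls))
    return sorted(pattern for pattern, n in counts.items() if n >= min_variants)


# Made-up URLs for illustration.
cdx_urls = [
    "https://example.org/page?t=1712570000",
    "https://example.org/page?t=1712570001",
    "https://example.org/page?t=1712570002",
    "https://example.org/about",
]
print(loop_candidates(cdx_urls))
```

Patterns flagged this way still need a manual look, since some sites legitimately publish many timestamped URLs.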
-
fireonlive
datechnoman: :3! more gay porn :D
-
arkiver
fireonlive: yay :P
-
fireonlive
:P
-
katia
π
-
nyany
HOORAY FOR DICK
-
nyany
i mean ARCHIVING
-
katia
π³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβ
-
katia
ππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβππ³οΈβπ
-
qwertyasdfuiopghjkl
^ Interesting, it split an emoji in half
-
katia
yeah :D
-
katia
i think my client might have done that
-
nicolas17
more like "it doesn't know how to render this two-part emoji"
-
fireonlive
nyany: it's the best ;D
-
arkiver
increased max tries to 5
-
arkiver
(it was 3)
-
JAA
nicolas17: No, this is on the client doing the splitting. But doing it Correctly™ is *hard*. The clients that do split on graphemes always fail on some things, so many clients don't even bother with trying to implement it.
-
nicolas17
oh you mean into two messages
-
JAA
Yeah
-
nicolas17
I see two separate graphemes for all of them :D
-
nicolas17
and I thought qwerty was talking about the same
-
JAA
🏳️‍ at the end of one message and 🌈 at the start of the other is the one being talked about.
-
JAA
Although all clients could detect ZWJ sequences, so that one can actually be solved somewhat easily. It's a much harder problem in the general case.
-
JAA
I looked into it when I wrote http2irc. I got lost in a dark forest and decided to nope the fuck out of there. So http2irc only does word and codepoint splitting.
-
imer
you're all in luck, there's more porn: www.halloporno.net :p transfer.archivete.am/mcRbb/halloporno.net.log maybe a rate limit so we don't wreck the site as much?
-
imer
JAA/arkiver: ^
-
imer
I say that and it goes away >.>
-
nyany
JAA: " I got lost in a dark forest and decided to nope the fuck out of there." HAH