00:22:54 <nicolas17> pypy?
01:00:57 <OrIdow6> I don't think the Python regex crate is written in Python
01:01:07 <OrIdow6> Umm
01:01:22 <OrIdow6> Ignore the confusion with Rust
01:30:16 <fireonlive> hm
01:30:48 <fireonlive> i think there's a couple things from here that won't go to projects per se, but could be wrong: telegram, youtube, google?
01:31:16 <fireonlive> youtube for unlisted, telegram for channels, and google for a potential google drive/docs/etc project
01:31:26 <fireonlive> not sure if others
01:49:34 <datechnoman> kiska OrIdow6 - I've already processed 80% of common crawl links and have them all stored and compressed for future projects.
01:50:56 <datechnoman> My understanding is that we do not want to feed common crawl data just randomly in here, as we are trying to archive urls that people are using/sharing, whereas common crawl is crawling websites. It's also worth noting that common crawl uploads their warc's into the WBM already, so we would be rearchiving
01:51:24 <OrIdow6> datechnoman: Did you use that for Discord CDN links?
01:51:28 <datechnoman> Obviously the pages most likely have changed since they were last crawled
01:52:14 <datechnoman> OrIdow6 - i did for all discord, mediafire, telegram, pastebin, blogger and pdfs already.
01:53:24 <datechnoman> I have about 30TB of urls stored on my home server from their releases. You're more than welcome to rerun them if you really want to
01:53:28 <OrIdow6> Also curious how big that was, and how you managed to get around the issues that people were having during the Imgur project
01:54:58 <datechnoman> People were processing the warc's during the imgur project. You should instead be processing the WAT files, as they are the stripped down version of the warc's and still have the part of the HTML content we are after: the outlinks
01:55:13 <datechnoman> I used multiple servers and hundreds of dollars, using python
01:55:36 <datechnoman> Slow connection rates to avoid the rate limiting and a few different ip's
01:55:45 <datechnoman> And it's taken me about 4 months so far
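A minimal sketch of the WAT approach described above, not datechnoman's actual tooling (which the log does not show). It assumes the usual Common Crawl WAT layout, where each metadata record carries a JSON payload with the links under Envelope > Payload-Metadata > HTTP-Response-Metadata > HTML-Metadata > Links. The function name `wat_outlinks` is made up for illustration, and reading the records out of the compressed WAT file is left out.

```python
import json
from urllib.parse import urljoin


def wat_outlinks(record_json):
    """Return the outlinks of one WAT metadata record (given as a JSON string).

    Relative links are resolved against the record's WARC-Target-URI.
    Assumes the Common Crawl WAT layout described in the lead-in.
    """
    envelope = json.loads(record_json).get("Envelope", {})
    base = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    outlinks = []
    for link in links:
        url = link.get("url")  # entries without a "url" key are skipped
        if url:
            outlinks.append(urljoin(base, url))
    return outlinks
```

Filtering the result for a given host or for names ending in `.pdf` would then be a plain list comprehension over the returned URLs.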
01:57:05 <datechnoman> I would be checking with arkiver if we want to be dumping that kind of data into this project as I believe it falls out of scope
02:02:59 <datechnoman> What I have done at the request of arkiver for the last week is run through the dumps i have created and extract the many millions of pdf links that are contained in their dumps, as they drop anything larger than 1 MB in their warc's and spot testing confirms this. They also don't process the outlinks in the pdfs they find so we will be more effective at
02:02:59 <datechnoman> grabbing the sites related to the documents
02:13:37 <OrIdow6> Did you get CC-MAIN-2022-49?
02:15:43 <datechnoman> Yes. I started from latest and am back to 2020 archives atm
04:35:36 <Ryz> Unsure if worth adding https://wodnews.com/ to this project as a recurring thing; since it doesn't look like it's been processed all that often when checking https://web.archive.org/web/20240000000000*/https://wodnews.com/
08:57:27 <imer> arkiver: can we get a refill? :)
08:59:18 <imer> also seem to have stumbled into some porn stuff 16.43/s pornodavid.com  10.86/s hd-sexfilme.com doesn't look spammy at a glance at least? sites don't look happy though, bunch of 500's/timeouts
09:19:44 <imer> I'm also suspicious of how much of the tracker scrolling by is me
09:23:24 <datechnoman> All about that porn :P
09:24:13 <datechnoman> Im sure fireonlive would have something to say about it
09:28:48 <datechnoman> I think in the past we have filtered porn out as there is soooo much of it and it isn't as important as the lle focused outlinks we archive
09:51:07 <imer> JAA/arkiver: loop Queuing for parent URL https://srhc.org.au/?ver=12.5.0.1712565966. [...] Queuing URL https://srhc.org.au/?ver=6.5.1712566508.
09:51:08 <imer> looks like that just needs a filter on the main page with ?ver
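A rough sketch of the kind of filter suggested above, assuming the loop comes from a page linking back to itself with a fresh `ver` value each time. This is not the project's actual filter, which the log does not show; `strip_volatile`, `should_queue` and the `volatile` parameter list are hypothetical names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_volatile(url, volatile=("ver",)):
    """Return url without the query parameters named in `volatile`.

    The result is only meant for comparison, since the remaining
    query string is re-encoded.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in volatile
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def should_queue(parent, child, volatile=("ver",)):
    """False when child is the parent again with only a volatile value changed."""
    return strip_volatile(parent, volatile) != strip_volatile(child, volatile)
```

With the two URLs from the log line above, `should_queue` returns False, so the second one would not be queued again. Other changing parameters can be covered by passing a longer `volatile` tuple.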
09:51:55 <imer> porn seems to be mostly gone
13:41:09 <arkiver> imer: good on the porn stuff - i think we should be careful with adding new filters
13:41:29 <arkiver> on the number of jobs you are completing - did you scale up?
13:41:48 <arkiver> i'll do a new check of the URLs we archived recently and what we may be able to ignore
14:48:01 <imer> nope, didnt change anything about my deployment, seem to have stayed fairly consistent where other people seem to have dropped off, maybe the porn sites didnt work on hetzner or something?
14:48:35 <imer> or I'm less cpu bottlenecked atm
14:50:11 <arkiver> CPU could be it, no PDFs or sitemaps at the moment
14:50:22 <imer> srhc.org.au one is still around
14:50:38 <arkiver> yep, i saw it
14:50:44 <arkiver> i want to know where the URLs come from
14:50:53 <imer> good :)
14:50:56 <arkiver> so i know how to best filter them out
14:51:14 <arkiver> AK logs to the rescue :)
14:51:41 <arkiver> usually i just download a log from AK and go through it
14:51:52 <arkiver> and see what funny stuff is going on in there
14:57:44 <arkiver> imer: found the source
15:06:40 <arkiver> imer: an update is pushed that i think will handle this generally
15:06:44 <arkiver> so also for other sites
15:07:40 <arkiver> version updated
16:29:28 <arkiver> imer: i don't see it anymore, i think it's solved itself (was not a loop in that case)
16:30:20 <arkiver> i see some stuff from it now actually
16:33:14 <arkiver> imer: being filtered now
16:53:29 <imer> still seeing srhc.org.au around, cant spot loops on my end though - have you forced the version yet or are these maybe just old workers?
16:53:38 <imer> and thanks
16:54:28 <imer> kfz.osl-online.de looks like the same, arkiver, not sure if that needs a different filter
16:59:31 <arkiver> imer: a wider filter is in for the kfz one
16:59:41 <arkiver> there may still be srhc yes, but i hope the loop is resolved
16:59:42 <imer> Queuing for parent URL https://www.pompestichting.nl/images/news/all/81.?nd=1712548905. [...] Queuing URL https://www.pompestichting.nl/images/news/all/81.?nd=1712595502. another timestamp loop
17:00:05 <arkiver> yeah
17:01:10 <imer> https://transfer.archivete.am/14tx9K/www.pompestichting.nl.log
17:01:10 <eggdrop> inline (for browser viewing): https://transfer.archivete.am/inline/14tx9K/www.pompestichting.nl.log
17:04:37 <arkiver> imer: nd timestamp loop is out
17:04:44 <imer> nice :)
17:05:15 <arkiver> forced new version 20240408.04
17:34:17 <arkiver> been going through a CDX, seeing another potential loop, but want to check it again in a day
17:35:07 <arkiver> also yes, every loop imer saw i saw confirmed in the CDX as well, so very nicely found imer !
17:42:01 <arkiver> two more loops gone, they were in the CDX. another update is out and enforced
17:42:07 <fireonlive> datechnoman: :3! more gay porn :D
17:42:17 <arkiver> fireonlive: yay :P
17:42:22 <fireonlive> :P
17:43:59 <katia> 👀
18:15:32 <nyany> HOORAY FOR DICK
18:15:35 <nyany> i mean ARCHIVING
18:18:37 <katia> 🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍
18:18:39 <katia> 🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈
18:21:26 <qwertyasdfuiopghjkl> ^ Interesting, it split an emoji in half
18:28:22 <katia> yeah :D
18:28:30 <katia> i think my client might have done that
19:00:20 <nicolas17> more like "it doesn't know how to render this two-part emoji"
19:21:01 <fireonlive> nyany: it's the best ;D
19:51:27 <arkiver> increased max tries to 5
19:51:32 <arkiver> (it was 3)
21:48:24 <JAA> nicolas17: No, this is on the client doing the splitting. But doing it Correctly™ is *hard*. The clients that do split on graphemes always fail on some things, so many clients don't even bother with trying to implement it.
21:48:44 <nicolas17> oh you mean into two messages
21:48:48 <JAA> Yeah
21:48:57 <nicolas17> I see two separate graphemes for all of them :D
21:49:05 <nicolas17> and I thought qwerty was talking about the same
21:49:07 <JAA> 🏳️‍ at the end of one message and 🌈 at the start of the other is the one being talked about.
21:49:46 <JAA> Although all clients could detect ZWJ sequences, so that one can actually be solved somewhat easily. It's a much harder problem in the general case.
21:50:29 <JAA> I looked into it when I wrote http2irc. I got lost in a dark forest and decided to nope the fuck out of there. So http2irc only does word and codepoint splitting.
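A small sketch of the easier case mentioned above: splitting a message under a byte limit without cutting a ZWJ sequence in half. This is not http2irc's code, and it is deliberately simplified. It only keeps ZWJ (U+200D), variation selectors and combining marks attached to their neighbours, and does not implement full grapheme segmentation (regional-indicator flags and skin-tone modifiers, for example, are not handled). The names `clusters` and `split_message` are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"


def clusters(text):
    """Group code points so that ZWJ sequences, variation selectors and
    combining marks stay attached to the code point before them."""
    out = []
    for ch in text:
        attach = out and (
            ch == ZWJ
            or out[-1].endswith(ZWJ)
            or unicodedata.combining(ch)
            or "\ufe00" <= ch <= "\ufe0f"
        )
        if attach:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def split_message(text, limit):
    """Split text into chunks of at most `limit` UTF-8 bytes, breaking only
    between clusters. A single cluster longer than `limit` is emitted whole."""
    chunks, current = [], ""
    for cluster in clusters(text):
        if current and len((current + cluster).encode("utf-8")) > limit:
            chunks.append(current)
            current = ""
        current += cluster
    if current:
        chunks.append(current)
    return chunks
```

The rainbow flag is the four code points U+1F3F3 U+FE0F U+200D U+1F308, 14 bytes in UTF-8, so with this sketch a run of flags is only ever broken between whole flags.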
21:52:57 <qwertyasdfuiopghjkl> nicolas17: a screenshot of what it looks like for me: https://transfer.archivete.am/inline/g9yBw/Screenshot_2024-04-09%20hackint%20-%20webirc.png
22:18:31 <imer> you're all in luck, there's more porn: www.halloporno.net :p https://transfer.archivete.am/mcRbb/halloporno.net.log maybe a rate limit so we don't wreck the site as much?
22:18:32 <eggdrop> inline (for browser viewing): https://transfer.archivete.am/inline/mcRbb/halloporno.net.log
22:18:49 <imer> JAA/arkiver: ^
22:19:46 <imer> I say that and it goes away >.>
22:23:35 <imer> https://transfer.archivete.am/1Zcxf/www.deejayrvparkcampground.com.log seems spammy.
22:23:36 <eggdrop> inline (for browser viewing): https://transfer.archivete.am/inline/1Zcxf/www.deejayrvparkcampground.com.log
22:32:48 <nyany> JAA: " I got lost in a dark forest and decided to nope the fuck out of there." HAH