00:22:54 pypy?
01:00:57 I don't think the Python regex crate is written in Python
01:01:07 Umm
01:01:22 Ignore the confusion with Rust
01:30:16 hm
01:30:48 i think there's a couple things from here that won't go to projects per se, but could be wrong: telegram, youtube, google?
01:31:16 youtube for unlisted, telegram for channels, and google for a potential google drive/docs/etc project
01:31:26 not sure if others
01:49:34 kiska OrIdow6 - I've already processed 80% of common crawl links and have them all stored and compressed for future projects.
01:50:56 My understanding is that we do not want to feed common crawl data just randomly in here as we are trying to archive urls that people are using/sharing where common crawl is crawling websites. It's also worth noting that common crawl upload their warc's into the WBM already so we would be rearchiving
01:51:24 datechnoman: Did you use that for Discord CDN links?
01:51:28 Obviously the pages most likely have changed since they were last crawled
01:52:14 OrIdow6 - i did for all discord, mediafire, telegram, pastebin, blogger and pdfs already.
01:53:24 I have about 30TB of urls stored on my home server from their releases. You're more than welcome to rerun them if you really want to
01:53:28 Also curious how big that was, and how you managed to get around the issues that people were having during the Imgur project
01:54:58 People were processing the warc's during the imgur project.
You should instead be processing the WAT files as they are the stripped down version of the warc's which have all of the HTML content being the outlinks we are after
01:55:13 I used multiple servers and hundreds of dollars using python
01:55:36 Slow connection rates to avoid the rate limiting and a few different ip's
01:55:45 And it's taken me about 4 months so far
01:57:05 I would be checking with arkiver if we want to be dumping that kind of data into this project as I believe it falls out of scope
02:02:59 What I have done at the request of arkiver for the last week is run of the dumps i have created and extract the many millions of pdf links that are contained in their dumps as they drop anything larger than 1mb in their warc's and spot testing confirms this. They also don't process the outlinks in the pdfs they find so we will be more effective at
02:02:59 grabbing the sites related to the documents
02:13:37 Did you get CC-MAIN-2022-49?
02:15:43 Yes. I started from latest and am back to 2020 archives atm
04:35:36 Unsure if worth adding https://wodnews.com/ to this project as a recurring thing; since it doesn't look like it's been processed all that often when checking https://web.archive.org/web/20240000000000*/https://wodnews.com/
08:57:27 arkiver: can we get a refill? :)
08:59:18 also seem to have stumbled into some porn stuff 16.43/s pornodavid.com 10.86/s hd-sexfilme.com doesn't look spammy at a glance at least? sites don't look happy though, bunch of 500's/timeouts
09:19:44 I'm also suspicious of how much of the tracker scrolling by is me
09:23:24 All about that porn :P
09:24:13 I'm sure fireonlive would have something to say about it
09:28:48 I think in the past we have filtered porn out as there is soooo much of it and it isn't as important as the lle focused outlinks we archive
09:51:07 JAA/arkiver: loop Queuing for parent URL https://srhc.org.au/?ver=12.5.0.1712565966. [...] Queuing URL https://srhc.org.au/?ver=6.5.1712566508.
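The srhc.org.au entries above show the shape of the loop: the same page comes back with a fresh Unix-timestamp suffix in `ver`, and that new URL gets queued again. A minimal sketch of one way to spot that pattern, in Python. This is not the project's actual filter; the function names and the ten-digit timestamp heuristic are assumptions made for illustration, and the check is deliberately loose, so it can flag legitimate URLs that happen to carry a timestamp.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: a 10-digit run starting with 1 is a Unix timestamp in seconds
# (roughly 2001 through 2033).
TIMESTAMP_RE = re.compile(r"(?<!\d)1\d{9}(?!\d)")


def _query_shape(url):
    """Return (scheme, host, path, parameter names, has_timestamp) for a URL."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    names = tuple(name for name, _ in pairs)
    has_ts = any(TIMESTAMP_RE.search(value) for _, value in pairs)
    return parts.scheme, parts.netloc, parts.path, names, has_ts


def is_timestamp_loop(parent_url, child_url):
    """True if both URLs are the same page with the same parameter names
    and each carries a timestamp-like query value."""
    if parent_url == child_url:
        return False
    parent, child = _query_shape(parent_url), _query_shape(child_url)
    # Compare scheme, host, path and parameter names; then require a
    # timestamp-like value on both sides.
    return parent[:4] == child[:4] and parent[4] and child[4]
```

With the two URLs from the log (minus the trailing full stop of the log sentence), `is_timestamp_loop("https://srhc.org.au/?ver=12.5.0.1712565966", "https://srhc.org.au/?ver=6.5.1712566508")` is true, so the child would not be queued.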
09:51:08 looks like that just needs a filter on the main page with ?ver
09:51:55 porn seems to be mostly gone
13:41:09 imer: good on the porn stuff - i think we should be careful with adding new filters
13:41:29 on the number of jobs you are completing - did you scale up?
13:41:48 i'll do a new check of the URLs we archived recently and what we may be able to ignore
14:48:01 nope, didnt change anything about my deployment, seem to have stayed fairly consistent where other people seem to have dropped off, maybe the porn sites didnt work on hetzner or something?
14:48:35 or I'm less cpu bottlenecked atm
14:50:11 CPU could be it, no PDFs or sitemaps at the moment
14:50:22 srhc.org.au one is still around
14:50:38 yep, i saw it
14:50:44 i want to know where the URLs come from
14:50:53 good :)
14:50:56 so i know how to best filter them out
14:51:14 AK logs to the rescue :)
14:51:41 usually i just download a log from AK and go through it
14:51:52 and see what funny stuff is going on in there
14:57:44 imer: found the source
15:06:40 imer: an update is pushed that i think will handle this generally
15:06:44 so also for other sites
15:07:40 version updated
16:29:28 imer: i don't see it anymore, i think it's solved itself (was not a loop in that case)
16:30:20 i see some stuff from it now actually
16:33:14 imer: being filtered now
16:53:29 still seeing srhc.org.au around, cant spot loops on my end though - have you forced the version yet or are these maybe just old workers?
16:53:38 and thanks
16:54:28 kfz.osl-online.de looks like the same arkiver, not sure if that needs a different filter
16:59:31 imer: a wider filter is in for the kfz one
16:59:41 there may still be srhc yes, but i hope the loop is resolved
16:59:42 Queuing for parent URL https://www.pompestichting.nl/images/news/all/81.?nd=1712548905. [...] Queuing URL https://www.pompestichting.nl/images/news/all/81.?nd=1712595502.
another timestamp loop
17:00:05 yeah
17:01:10 https://transfer.archivete.am/14tx9K/www.pompestichting.nl.log
17:01:10 inline (for browser viewing): https://transfer.archivete.am/inline/14tx9K/www.pompestichting.nl.log
17:04:37 imer: nd timestamp loop is out
17:04:44 nice :)
17:05:15 forced new version 20240408.04
17:34:17 been going through a CDX, seeing another potential loop, but want to check it again in a day
17:35:07 also yes, every loop imer saw i saw confirmed in the CDX as well, so very nicely found imer !
17:42:01 two more loops gone, they were in the CDX. another update is out and enforced
17:42:07 datechnoman: :3! more gay porn :D
17:42:17 fireonlive: yay :P
17:42:22 :P
17:43:59 👀
18:15:32 HOORAY FOR DICK
18:15:35 i mean ARCHIVING
18:18:37 🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍
18:18:39 🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈🏳️‍🌈
18:21:26 ^ Interesting, it split an emoji in half
18:28:22 yeah :D
18:28:30 i think my client might have done that
19:00:20 more like "it doesn't know how to render this two-part emoji"
19:21:01 nyany: it's the best ;D
19:51:27 increased max tries to 5
19:51:32 (it was 3)
21:48:24 nicolas17: No, this is on the client doing the splitting. But doing it Correctly™ is *hard*. The clients that do split on graphemes always fail on some things, so many clients don't even bother with trying to implement it.
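The split above (a rainbow flag cut between two messages) is what happens when a long line is broken at an arbitrary codepoint. The flag is a ZWJ sequence: U+1F3F3 U+FE0F U+200D U+1F308. A rough sketch, in Python, of a splitter that backs off so it never cuts next to a zero-width joiner. It is not how any particular client works and it is not full grapheme segmentation; the function name and the byte limit are illustrative assumptions.

```python
ZWJ = "\u200d"   # zero-width joiner
VS16 = "\ufe0f"  # emoji variation selector


def split_message(text, max_bytes):
    """Split text into chunks of at most max_bytes UTF-8 bytes,
    avoiding cuts directly before or after a ZWJ where possible."""
    chunks = []
    start, n = 0, len(text)
    while start < n:
        # Greedily take codepoints while the chunk still fits.
        end, size = start, 0
        while end < n:
            cp_len = len(text[end].encode("utf-8"))
            if size + cp_len > max_bytes:
                break
            size += cp_len
            end += 1
        if end == start:
            raise ValueError("max_bytes is smaller than a single codepoint")
        if end < n:
            # Back off while the cut would land inside a ZWJ sequence.
            cut = end
            while cut > start and (text[cut] in (ZWJ, VS16) or text[cut - 1] == ZWJ):
                cut -= 1
            # If the whole chunk is one oversized sequence, fall back to
            # the plain codepoint split rather than making no progress.
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each flag is 14 bytes in UTF-8, so with three flags and a 39-byte limit a plain codepoint split would end the first chunk on the joiner; this version moves the cut back to the flag boundary and returns two flags followed by one.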
21:48:44 oh you mean into two messages
21:48:48 Yeah
21:48:57 I see two separate graphemes for all of them :D
21:49:05 and I thought qwerty was talking about the same
21:49:07 🏳️‍ at the end of one message and 🌈 at the start of the other is the one being talked about.
21:49:46 Although all clients could detect ZWJ sequences, so that one can actually be solved somewhat easily. It's a much harder problem in the general case.
21:50:29 I looked into it when I wrote http2irc. I got lost in a dark forest and decided to nope the fuck out of there. So http2irc only does word and codepoint splitting.
21:52:57 nicolas17: a screenshot of what it looks like for me: https://transfer.archivete.am/inline/g9yBw/Screenshot_2024-04-09%20hackint%20-%20webirc.png
22:18:31 you're all in luck, there's more porn: www.halloporno.net :p https://transfer.archivete.am/mcRbb/halloporno.net.log maybe a rate limit so we don't wreck the site as much?
22:18:32 inline (for browser viewing): https://transfer.archivete.am/inline/mcRbb/halloporno.net.log
22:18:49 JAA/arkiver: ^
22:19:46 I say that and it goes away >.>
22:23:35 https://transfer.archivete.am/1Zcxf/www.deejayrvparkcampground.com.log seems spammy.
22:23:36 inline (for browser viewing): https://transfer.archivete.am/inline/1Zcxf/www.deejayrvparkcampground.com.log
22:32:48 JAA: " I got lost in a dark forest and decided to nope the fuck out of there." HAH