00:09:25 !a https://transfer.archivete.am/OZNTH/goo-gl.2023-06-29-02-17-02.txt
00:09:26 datechnoman: Registering CPB2reGi for '!a https://transfer.archivete.am/OZNTH/goo-gl.2023-06-29-02-17-02.txt'
00:14:57 datechnoman: Skipped 31516 invalid URLs: https://transfer.archivete.am/Hdqwu/goo-gl.2023-06-29-02-17-02.txt.bad-urls.txt (CPB2reGi)
00:14:58 datechnoman: Deduplicating and queuing 9968402 items. (CPB2reGi)
00:26:19 datechnoman: Deduplicated and queued 9968402 items. (CPB2reGi)
01:23:28 Now i get why my request was declined, should have read the wiki page 🙁
04:27:38 !a https://transfer.archivete.am/tVXwR/goo-gl.2023-07-02-20-17-02.txt
04:27:39 datechnoman: Registering U6Jj7i7F for '!a https://transfer.archivete.am/tVXwR/goo-gl.2023-07-02-20-17-02.txt'
04:32:41 datechnoman: Skipped 31976 invalid URLs: https://transfer.archivete.am/4r3gN/goo-gl.2023-07-02-20-17-02.txt.bad-urls.txt (U6Jj7i7F)
04:32:42 datechnoman: Deduplicating and queuing 9967898 items. (U6Jj7i7F)
04:35:07 !a https://transfer.archivete.am/TBnMU/goo-gl.2023-07-06-20-17-02.txt
04:35:08 datechnoman: Registering EWZIoYOf for '!a https://transfer.archivete.am/TBnMU/goo-gl.2023-07-06-20-17-02.txt'
04:41:46 datechnoman: Skipped 30039 invalid URLs: https://transfer.archivete.am/8zpIp/goo-gl.2023-07-06-20-17-02.txt.bad-urls.txt (EWZIoYOf)
04:41:47 datechnoman: Deduplicating and queuing 9969829 items. (EWZIoYOf)
04:47:31 datechnoman: Deduplicated and queued 9967898 items. (U6Jj7i7F)
04:53:35 datechnoman: Deduplicated and queued 9969829 items. (EWZIoYOf)
05:03:05 !a https://transfer.archivete.am/s2jIM/goo-gl.2023-07-10-10-17-02.txt
05:03:06 datechnoman: Registering agmbVXRJ for '!a https://transfer.archivete.am/s2jIM/goo-gl.2023-07-10-10-17-02.txt'
05:08:28 datechnoman: Skipped 31521 invalid URLs: https://transfer.archivete.am/Q84Wz/goo-gl.2023-07-10-10-17-02.txt.bad-urls.txt (agmbVXRJ)
05:08:29 datechnoman: Deduplicating and queuing 9968409 items. (agmbVXRJ)
05:08:57 here are three lists of scanlation group homepages scraped from three different sites: https://transfer.archivete.am/f58yQ/mangadex_groups_sorted.txt https://transfer.archivete.am/12mHh8/mangaupdates_groups_sorted.txt https://transfer.archivete.am/KLjR3/vatoto_groups_sorted.txt
05:08:57 inline (for browser viewing): https://transfer.archivete.am/inline/f58yQ/mangadex_groups_sorted.txt https://transfer.archivete.am/inline/12mHh8/mangaupdates_groups_sorted.txt https://transfer.archivete.am/inline/KLjR3/vatoto_groups_sorted.txt
05:10:47 i'm not 100% sure whether they're suitable here, as there are a few hosts with a fair number of urls and i haven't done any filtering, but it would be nice to get them if we can
05:11:11 thuban - If we were to run the website urls you provided through this channel it would only grab the homepage, sitemap urls and assets on those pages. We can easily do that but will miss all of the data you are after
05:11:36 i'm aware, it's fine
05:12:07 (i've already submitted appropriate lists to projects that accept them)
05:13:23 roger no worries. Have you queued up the blogger/blogspot ones in that project channel etc?
05:13:34 If that is what you are referring to above, sorry
05:13:44 (Just double checking)
05:15:44 yes, that's what i meant (np)
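The bot messages above follow one pattern: each submitted list gets a job ID, obviously invalid URLs are split off into a .bad-urls.txt report, and the remainder is deduplicated before queuing. The project's actual tooling is not shown in the log; the following is only a minimal sketch of that validate/dedup step, and the validity rule used here (require an http/https scheme and a hostname) is an assumption, not the bot's real check.

```python
from urllib.parse import urlsplit

def split_and_dedup(lines):
    """Split a submitted URL list into (queued, bad), roughly mirroring the
    'Skipped N invalid URLs' / 'Deduplicating and queuing M items' messages.
    The validity rule here is an assumption, not the bot's actual check."""
    queued, bad, seen = [], [], set()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)      # would end up in the .bad-urls.txt report
        elif url not in seen:    # simple exact-string dedup
            seen.add(url)
            queued.append(url)
    return queued, bad

if __name__ == "__main__":
    sample = ["https://example.com/a", "https://example.com/a",
              "notaurl", "ftp://example.com/x"]
    good, rejected = split_and_dedup(sample)
    print(len(good), "queued;", len(rejected), "skipped")
```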
05:15:49 !a https://transfer.archivete.am/f58yQ/mangadex_groups_sorted.txt
05:15:50 datechnoman: Registering x56ocfmB for '!a https://transfer.archivete.am/f58yQ/mangadex_groups_sorted.txt'
05:15:55 datechnoman: Skipped 47 invalid URLs: https://transfer.archivete.am/8PuOJ/mangadex_groups_sorted.txt.bad-urls.txt (x56ocfmB)
05:15:56 datechnoman: Deduplicating and queuing 10068 items. (x56ocfmB)
05:15:57 datechnoman: Deduplicated and queued 10068 items. (x56ocfmB)
05:15:58 !a https://transfer.archivete.am/12mHh8/mangaupdates_groups_sorted.txt
05:15:59 datechnoman: Registering dJKSbgxJ for '!a https://transfer.archivete.am/12mHh8/mangaupdates_groups_sorted.txt'
05:16:02 datechnoman: Skipped 12 invalid URLs: https://transfer.archivete.am/U4XEJ/mangaupdates_groups_sorted.txt.bad-urls.txt (dJKSbgxJ)
05:16:03 datechnoman: Deduplicating and queuing 19721 items. (dJKSbgxJ)
05:16:04 datechnoman: Deduplicated and queued 19721 items. (dJKSbgxJ)
05:16:06 !a https://transfer.archivete.am/KLjR3/vatoto_groups_sorted.txt
05:16:07 datechnoman: Registering wdzNUPi9 for '!a https://transfer.archivete.am/KLjR3/vatoto_groups_sorted.txt'
05:16:09 datechnoman: Skipped 2 invalid URLs: https://transfer.archivete.am/boVO5/vatoto_groups_sorted.txt.bad-urls.txt (wdzNUPi9)
05:16:10 datechnoman: Deduplicating and queuing 4953 items. (wdzNUPi9)
05:16:11 datechnoman: Deduplicated and queued 4953 items. (wdzNUPi9)
05:16:43 thuban - ^^^^ queued
05:16:57 thank you!
05:17:20 No worries :)
05:20:04 datechnoman: Deduplicated and queued 9968409 items. (agmbVXRJ)
05:23:50 !a https://transfer.archivete.am/NgNlU/test_pdf_links.txt
05:23:51 datechnoman: Registering 44ECTPXv for '!a https://transfer.archivete.am/NgNlU/test_pdf_links.txt'
05:23:52 datechnoman: Deduplicating and queuing 19 items. (44ECTPXv)
05:23:53 datechnoman: Deduplicated and queued 19 items. (44ECTPXv)
05:42:45 !a https://transfer.archivete.am/i62E4/pdf_urls_cleaned_1.txt
05:43:47 datechnoman: Registering RNck25fz for '!a https://transfer.archivete.am/i62E4/pdf_urls_cleaned_1.txt'
05:46:29 datechnoman: Skipped 29953 invalid URLs: https://transfer.archivete.am/CSKTl/pdf_urls_cleaned_1.txt.bad-urls.txt (RNck25fz)
05:46:30 datechnoman: Deduplicating and queuing 9970047 items. (RNck25fz)
05:54:02 datechnoman: did you find out what was wrong with the previous lists?
05:54:12 i did not have a look yet (if you don't know, i will still have a look)
05:54:36 arkiver - I'm going on a hunch and believe it's due to really malformed urls or something like that
05:54:51 I've created a url cleaning process to properly clean them before throwing them at the bot and testing it atm
05:55:48 I'm also throwing a stack more workers in to pick up the workload
05:55:52 datechnoman: Deduplicated and queued 9970047 items. (RNck25fz)
05:55:59 hmm
05:56:04 but that cleaning process should be done by the bot
05:56:05 Well that successfully queued everything there
05:56:29 when i have time i'll check your list and make the bot able to handle whatever is problematic in there
05:56:32 Odd. Only thing I did differently was properly split the urls as some were doubled up on each line and stuff
05:56:49 Appreciate it mate. Love your work :)
05:56:58 you too :)
05:57:13 Just working towards a common goal :)
05:57:21 Should get us some nice clean data
05:57:38 Did spot checking and some of it was already picked up by this project. It's amazing the reach that this project has
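The only detail given about datechnoman's "url cleaning process" is that some input lines contained more than one URL run together. A minimal sketch of that kind of pre-cleaning, assuming the doubled URLs can be separated at each http(s):// boundary (the regex and line-per-URL output are assumptions, not the actual process), could be:

```python
import re
import sys

# Match each URL starting at an http(s):// boundary and running until the
# next boundary, whitespace, or end of line. This is one plausible way to
# split doubled-up URLs, not the cleaning datechnoman actually ran.
URL_RE = re.compile(r"https?://.+?(?=https?://|\s|$)", re.IGNORECASE)

def clean(lines):
    """Yield one URL per line, splitting lines that contain several URLs."""
    for line in lines:
        for url in URL_RE.findall(line):
            yield url.strip()

if __name__ == "__main__":
    # Usage: python clean_urls.py < raw_list.txt > cleaned_list.txt
    for cleaned in clean(sys.stdin):
        print(cleaned)
```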
06:08:19 i'm experimenting with and then pushing out the update to archive outlinks from news sites
06:08:39 this may create some loops, as in news->news->news->news URLs, but should be fine
06:10:04 ack no worries. I remember you mentioning this a few days ago. Good news (pardon the pun) is that news sites are typically hosted behind CDNs so we smash through them pretty fast and easily with minimal CPU overhead, so shouldn't be too much of an issue
06:10:16 yeah!
06:10:23 I personally think it's well worth the effort. News is important no matter where you are in the world!
06:35:23 datechnoman: i'm going to move your queued items to secondary
06:35:28 and the current secondary to redu
06:35:30 redo*
06:35:37 ack no worries
06:35:49 Do what you gotta do. We did notice a fair bit of spam in the secondary FYI
06:36:01 It's floating around "out" atm
06:36:12 Hopefully will die out with the multiple retries
06:37:41 Discovering lots of new urls from the PDF documents :)
06:43:37 yeah :)
07:08:08 alright my implementation seems to be working
07:08:14 datechnoman: can you wait please with queuing more big lists?
07:08:32 i want to push this out when your lists are gone, to better see the effects of the change
07:18:40 For sure arkiver. I have a stack of lists and have been slow feeding so more than happy to hold off :)
07:20:37 thanks!
07:20:42 going to be exciting to get this update in
07:21:33 No worries at all. I guess with everything being in redo and secondary you could enable it right? As it will all feed to the backlog and we can use that as the metric?
07:23:41 i'd rather not, currently backfeed stays large due to the large lists that were fed in
07:24:09 so it's more difficult to estimate what part of URLs queued to todo:backfeed comes from the update, and which part from your lists
07:25:30 Ack fair call. I'll stay spun up the next 24 hours to smash through it all so you can roll out your update :)
07:25:58 or are you fine with me stashing your queued lists away for a bit?
07:26:00 datechnoman: ^
07:26:08 i'll be off for an hour and then do that if you are fine with it
07:26:30 i'll also stash todo:redo away then
07:33:09 This is your show mate so do as you please arkiver. All I would say is that I'd like it to be requeued once we smash through the news site outlinks
07:33:14 They are more important anyway
07:39:55 hah no no, it's our show!
07:46:17 arkiver: can you look into filtering out skinlookingyounger.com? https://transfer.archivete.am/iNVsl/skinlookingyounger.com.log they don't seem to be successful for me, but there's a lot of it (~37% of urls on my end)
07:46:17 inline (for browser viewing): https://transfer.archivete.am/inline/iNVsl/skinlookingyounger.com.log
07:47:37 also this vilinkv.shop, same volume: https://transfer.archivete.am/GBCF0/vilinkv.shop.log
07:47:37 inline (for browser viewing): https://transfer.archivete.am/inline/GBCF0/vilinkv.shop.log
07:48:22 todo is growing quite rapidly too
07:51:56 imer: added a filter shortly before you messaged :)
07:52:04 for skinlookingyounger
07:52:07 not yet vilinkv
07:52:34 the filter may not be working then, unless you mean in code
07:52:51 but good :)
07:53:27 i'm back in an hour
07:54:15 Can confirm both are spamming up my workers to the point that I don't see many other urls from other domains coming through
07:54:28 See you when you get back!
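imer's per-domain logs and the "~37% of urls on my end" figure at 07:46:17 come from tallying which hosts dominate a worker's recent URLs. The exact tooling is not shown in the log; a minimal sketch of that kind of tally, assuming a plain input file with one URL per line, might be:

```python
import sys
from collections import Counter
from urllib.parse import urlsplit

def host_histogram(lines, top=10):
    """Count URLs per hostname so runaway domains (like the ~37% case above)
    stand out. The one-URL-per-line input format is an assumption."""
    counts = Counter()
    total = 0
    for line in lines:
        url = line.strip()
        if not url:
            continue
        host = urlsplit(url).hostname or "<unparsable>"
        counts[host] += 1
        total += 1
    for host, n in counts.most_common(top):
        print(f"{n:>8}  {100 * n / total:5.1f}%  {host}")

if __name__ == "__main__":
    # Usage: python host_histogram.py < recent_urls.txt
    host_histogram(sys.stdin)
```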
07:55:19 ah
07:55:23 though custom items
07:55:52 through*
07:57:42 skinlookingyounger is out now
07:57:57 the vilinkv.shop one looks similar to an older pattern, need to look closer at that when i'm back
07:58:11 thanks!
07:58:34 paused until then
07:58:36 can confirm that's doing something - speed is going way up
07:58:40 probably a good idea
07:58:43 yeo
07:58:44 yep
08:01:50 nice, i checked a random PDF and found an open directory
08:02:01 we should start attempting to find open directories here perhaps
08:02:33 arkiver: i love that idea
08:02:47 i think we can it, and relatively easily too
08:02:51 do it*
08:02:57 anyway i'm off for an hour
08:03:02 fireonlive: yeah :)
08:03:10 :D
08:03:18 i'm off to bed
08:03:20 ttyt :)
08:09:55 Good night fireonlive!
08:20:01 Def worth pausing the project. Was exploding
09:44:28 Quickly spun down my fleet as I reckon I'll need to re-jig them for high density and IO processing for the news sites. arkiver - ping me when we go live and I'll get spun up with the correct profile
09:44:42 (was spun up for PDF/sitemap processing)
09:54:05 back
09:54:50 welcome back :)
09:55:18 thanks
10:11:18 datechnoman: imer: if you
10:13:04 datechnoman: imer: if you're interested, the skinlookingyounger loop was actually very similar to previous loops, with a small difference. i have now added support for it by adding https://github.com/ArchiveTeam/urls-grab/commit/c0f44fadd20f0ef2098edff30ec8fb9b40f1c65f
10:13:32 also 33k news sites :) https://github.com/ArchiveTeam/urls-grab/commit/56983c4484e64ff0e3cc17f91d9393c34d2ca620
10:13:34 good stuff
10:14:49 i removed the filters for skinlookingyounger, since they'll be handled in the code now. means they may still be handed out, but won't create a loop anymore (and will take very little resources)
10:16:17 the loop for vilinkv.shop is interesting
10:16:25 Awesome thanks so much! arkiver
10:16:50 it's due to URLs like https://tiib.vilinkv.shop/.well-known/openid-configuration redirecting to a different domain, which then gets the various 'special URLs' queued, which link to other domains, etc.
10:17:31 we'll want to support that in the code as well to prevent future loops like this (which will surely occur)
10:18:18 There is always something ey :/
10:18:29 well
10:18:35 in the beginning there were a lot of loops
10:18:44 but now there are not a ton of them
10:18:46 Mind you, I like your tactic of actually blocking the pattern that sites use instead of filtering
10:19:08 really the more we support in the code, the less we have to fix as we move along
10:19:15 I can see lots of things skipped these days so the filtering works (skipped by the workers)
10:19:21 yep
10:19:23 Yeah exactly!
10:19:35 Much more efficient and solves the greater issue
10:19:49 Also keeps the bloom filter and backfeed happy
10:20:01 and turns out spam sites tend to use the same "spam software" (? or just the same owner), so blocking patterns helps
10:20:05 indeed!
10:20:07 arkiver++
10:20:07 -eggdrop- [karma] 'arkiver' now has 19 karma!
10:20:12 :P
10:20:22 datechnoman++
10:20:23 arkiver++
10:20:24 -eggdrop- [karma] 'datechnoman' now has 7 karma!
10:20:26 -eggdrop- [karma] 'arkiver' now has 20 karma!
10:20:44 imer: your logs helped a lot, btw
10:20:55 imer++
10:20:55 -eggdrop- [karma] 'imer' now has 2 karma!
10:21:02 lol
10:21:09 that_lurker also helped a lot by adding karmas
10:21:24 that_lurker++
10:21:24 -eggdrop- [karma] self karma is a selfish pursuit.
10:21:27 damn :P
10:21:32 :)
10:23:04 that_lurker++
10:23:04 -eggdrop- [karma] 'that_lurker' now has 4 karma!
10:23:07 I got you mate
10:23:18 No one left behind
10:23:28 https://lounge.kuhaon.fun/folder/63d0a64919a7452d/karma.gif
10:27:48 eggdrop++
10:27:49 -eggdrop- [karma] 'eggdrop' now has 16 karma!
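The vilinkv.shop loop described at 10:16:50 (a /.well-known/openid-configuration URL redirecting off-domain, which then gets that new domain's own 'special URLs' queued, and so on) can be broken by refusing to queue derived URLs whose fetch ends up on a different host. The actual fix lives in the urls-grab code; the following is only a hedged Python sketch of the idea, and the list of "special" paths is an assumption, not the project's real list.

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical per-host "special" paths; the project's real set is not shown
# in the log.
SPECIAL_PATHS = ("/robots.txt", "/.well-known/openid-configuration")

def queue_special_urls(seed_url, queue):
    """Queue per-host discovery URLs, but skip any whose response was
    redirected to a different host, the pattern behind the vilinkv.shop loop."""
    seed_host = urlsplit(seed_url).hostname
    for path in SPECIAL_PATHS:
        candidate = f"https://{seed_host}{path}"
        try:
            with urllib.request.urlopen(candidate, timeout=10) as resp:
                final_host = urlsplit(resp.geturl()).hostname
        except Exception:
            continue  # unreachable or erroring candidates are simply skipped here
        if final_host == seed_host:
            queue.append(candidate)
        # else: redirected off-domain, so don't feed the loop

if __name__ == "__main__":
    todo = []
    queue_special_urls("https://example.com/", todo)
    print(todo)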
10:32:45 vilinkv.shop loop handled now too
10:32:53 moving out current todo:redo and todo:secondary
10:40:22 Roger. Moment of silence for the data
10:40:26 :(
10:40:27 Lol
11:28:35 how goes the moving? :)
11:29:54 Haha I was going to ask the same thing. Wondering if I go to bed or stay up for a few if we get rolling
11:44:19 arkiver - How are we lookin?
12:07:16 Well I'm gonna get some rest. Will look into this tomorrow morning. Night all!
12:19:36 good night
12:20:04 heading out soon myself as well (not to bed)
13:05:51 is this project currently paused
13:15:07 ping arkiver did something break?
13:31:42 that_lurker: the project is likely paused right now while they deal with the above issues
13:36:29 yep paused while arkiver moves things around, seems to have disappeared though
13:41:21 yup lol
13:41:58 The good news is that while urls is currently undergoing surgery, we desperately need more 1x1 workers on roblox
14:40:50 time to get rolling
14:41:37 this new method may give us some new loops that need eliminating
14:58:19 running
14:59:40 oooooh
15:05:33 with this new "outlinks from news sites" feature, we're also getting a lot of social media share URLs. i'm going to look into pushing those into the 'one-time lists', so they don't go into the bloom filter
15:06:30 hmm looking at 26 TiB/day currently
15:06:31 which is a lot
15:06:52 i think this is an initial wave of URLs though, like we've seen before with new features, so this number should go down
15:08:25 backfeed going down in size now, good, might have been an initial bump
15:08:46 the stuff in the main todo queue is requeued items from claims which we have a filter in place for.
15:08:55 (i want to get these out of the way from claims)
15:10:11 all is looking very good!
15:10:56 todo:backfeed is near 0 now
15:15:48 lol
15:15:52 i wonder why
15:16:12 i don't see any serious loops
15:17:26 Sigh. petition to rename this project to whatgoesaroundcumsaround because PORN
15:19:23 👀
15:20:42 nyany: i don't see much of it now?
15:21:00 lol, sorry, that was an off the record remark
15:34:23 rates are going down now as expected :)
15:34:44 if this keeps looking good in the coming days, we'll also turn it on for political and government sites!
16:09:20 i did finally decide i have given up on getting on hetzner's good side so boxes i still have with them are running this again
16:10:00 fuzzy8021: ah :/ sorry to hear
16:10:20 were the problems back then mostly about IP addresses in the ranges we now block by default?
16:13:13 well we have found some PDFs (PDFs leading to more PDFs, etc.), hopefully it won't last too long
16:16:01 yeah lots of science related PDFs
16:17:39 an example: just saw https://journals.biologists.com/toolbox/downloadcombinedarticleandsupplmentpdf?resourceid=272582&multimediaid=2110223&pdfurl=/cob/content_public/journal/jcs/135/5/10.1242_jcs.259365/1/jcs259365.pdf getting archived, and it had 181 URLs extracted and queued back - most of which were doi.org URLs. so those will be resolved, leading to more PDFs, etc.
16:17:51 but that cycle should end at some point, i don't see 'bad looking' loops
16:25:09 paused for a bit as i investigate why scholar.google.com is not getting URLs discovered
16:32:41 solved
16:54:25 datechnoman: :)
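The "one-time lists" remark at 15:05:33 refers to archiving certain URLs (such as social media share links) without remembering them in the bloom filter that normally deduplicates the queue. A minimal sketch of that routing, with a plain Python set standing in for the real (probabilistic) bloom filter and a hypothetical share-URL pattern, might be:

```python
import re

# Hypothetical pattern for social media share links; the project's actual
# classification is not shown in the log.
SHARE_RE = re.compile(
    r"https?://(www\.)?(twitter\.com/intent/|facebook\.com/sharer|"
    r"linkedin\.com/share|reddit\.com/submit)", re.IGNORECASE)

seen = set()  # stand-in for the bloom filter; the real one is probabilistic

def route(url, queue, one_time_list):
    """Send share URLs to a one-time list so they are fetched without being
    added to the dedup filter; queue everything else through the filter."""
    if SHARE_RE.match(url):
        one_time_list.append(url)   # archived once, never remembered
        return
    if url not in seen:             # bloom-filter membership check in real life
        seen.add(url)
        queue.append(url)

if __name__ == "__main__":
    q, once = [], []
    for u in ["https://example.org/article",
              "https://twitter.com/intent/tweet?url=https://example.org/article",
              "https://example.org/article"]:
        route(u, q, once)
    print("queued:", q)
    print("one-time:", once)
```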
17:08:28 Have there been any cases of the URLs project DDoSing a website?
17:12:47 That's possible with any one of our DPoS projects
17:22:09 https://usercontent.irccloud-cdn.com/file/EbAJcAyl/image.png
17:22:21 I get all my most important things from that website
17:27:30 uh
17:27:36 i could improve some stuff there yeah
17:27:41 it's due to PDF extraction
17:28:05 with improve, all i mean is get rid of the repeated .
17:28:18 i _think_ the URL is technically still valid with repeated . taken out
17:28:38 i'm tired now and might make mistakes, so will make that update tomorrow
22:32:18 Good Morning All. Everything seems to be running very smoothly this morning :D great work arkiver!
22:43:38 Also great to hear we can support scholar.google.com
22:44:17 That is something we definitely want to support :)
23:07:29 Mmm, would https://video.sindonews.com/ be archived and checked more frequently here? Don't think I see much frequency checking via https://web.archive.org/web/20240000000000*/https://video.sindonews.com/
23:47:33 nyany: The internet is really, really great...
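The "repeated ." issue at 17:28:05 comes from PDF text extraction, where line-wrapped URLs pick up stray dots. A hedged sketch of the cleanup arkiver describes could look like the following; whether a single dot or nothing should remain is an assumption, and (as he notes) this only works where the de-dotted URL is still the intended one, so the hostname is left untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_repeated_dots(url):
    """Collapse runs of '.' in the path, query, and fragment of a
    PDF-extracted URL. Collapsing to a single dot is an assumption; the
    netloc is never touched so real domains are not altered."""
    parts = urlsplit(url)

    def fix(segment):
        return re.sub(r"\.{2,}", ".", segment)

    return urlunsplit((parts.scheme, parts.netloc,
                       fix(parts.path), fix(parts.query), fix(parts.fragment)))

if __name__ == "__main__":
    # Hypothetical example of the kind of URL PDF extraction produces.
    print(collapse_repeated_dots("https://example.com/files/report...final..pdf"))
```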