00:54:11 urls is doing more bandwidth than usual because there's a ton of pdfs in queue, I believe
00:57:29 we are discovering tonnes of new url's and sources so the queue is growing rapidly
00:57:31 Nothing we cant handle
00:57:39 *cracks out the credit card* lol
00:58:32 discovery rate is nearly 1x (at which point the calculated ETA becomes infinity)
00:59:43 discovering 101 items for every 100 items completed
01:12:36 * fireonlive watches over datechnoman debt ratio
01:30:11 fireonlive i need someone like you to look out for me. It's an addiction....
01:30:32 i feel you :(
01:31:36 We will eventually get to a point where we discover less lol... eventually...
01:31:49 my debt utilization ratio is in the 90s
01:31:52 :x
01:31:59 debt->credit
01:32:16 eventually :p
01:32:33 one whole internet archived
01:32:43 Haha well if we are including the mortgage then I'm fk'ed lol
01:33:15 Even with cheeky thousands of dollars we would be blocked by IA ingest haha
01:33:39 The real question is, when is WBM going to hit 1 trillions pages
01:33:40 :P we'll set mortgages and cars aside :D
01:33:43 ooh indeed
01:34:05 i better start posting more memes for JAA to archive
01:34:08 :3
01:35:31 :-)
01:49:24 Need to setup a bot that auto grabs any link and !a it into the correct channel lol
01:49:48 eg; imgur link will be !a in imgur channel, other links !a in #//
01:55:24 But you can't !a individual URLs here, only lists, I think.
02:16:38 !a https://dl.fireon.live/404
02:16:38 fireonlive: Registering SkBQiE2b for '!a https://dl.fireon.live/404'
02:16:39 fireonlive: Not a transfer.archivete.am URL. (SkBQiE2b)
02:16:40 fireonlive: Something went wrong. (SkBQiE2b)
02:16:42 indeed
02:17:09 ...why is my 404 a 200
02:17:25 ...why is anything a 200
02:17:45 :|
02:18:07 >:-(
02:19:23 * fireonlive takes caddy behind the barn
02:44:32 * TheTechRobo throws a planet.osm at datechnoman
02:47:23 Yeah, that happened before on AB with a blind !ao.
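[editor's note] The 00:58:32 remark about the ETA becoming infinite at a discovery rate of 1x can be illustrated with a small sketch. This is not the tracker's actual code; the function name, its parameters, and the formula (queue size divided by net drain rate) are assumptions made only for illustration.

```python
import math


def queue_eta_seconds(queue_size, completed_per_sec, discovery_ratio):
    """Rough time until the queue is empty (hypothetical helper).

    discovery_ratio is the number of new items discovered per item
    completed, e.g. 101/100 for "101 items for every 100 completed".
    """
    # Each completed item removes one item but adds discovery_ratio new ones,
    # so the queue only shrinks at completed_per_sec * (1 - discovery_ratio).
    net_drain_per_sec = completed_per_sec * (1.0 - discovery_ratio)
    if net_drain_per_sec <= 0:
        # At 1x or above the queue never empties, so there is no finite ETA.
        return math.inf
    return queue_size / net_drain_per_sec
```

With the ratio from 00:59:43, `queue_eta_seconds(50_000, 20.0, 101 / 100)` returns `math.inf`, since the queue grows faster than it drains; the queue size and completion rate in that call are made-up numbers.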
02:58:07 yeah, that was fun
02:58:12 Didn't that crash a few jobs?
02:58:18 Or am I misremembering?
02:58:54 Would be nice if pipelines could set a per-file size maximum that would fail the URL if it started to exceed that
03:02:22 *non existent memories of --large*
03:03:19 I think the proposal was to catch the out of space error and retry later when space was made or something like that
03:19:13 Not sure how many other jobs it affected, but it definitely crashed at least the !ao < job.
03:19:26 And yes, that's the idea, make the error non-fatal.
07:49:26 Looks like we hit the curve and are gaining now!
10:30:06 rewby: we're going to launch a URLs Tor project! :) can we have a target for it? it can be on the same machine as the regular URLs project - i don't expect a ton of data
10:30:15 it would have
10:30:22 archiveteam_urlstor_
10:30:24 urlstor_
10:30:30 Archive Team URLs Tor:
11:40:22 and the project is urls-onion on the tracker
11:58:29 cool!
12:01:07 Plenty of http://www.nudetubesex.com/eeQ9Yv3.php and http://www.nudetubesex.com/Wpcne.php?JvSeYeO.xml URLs in my logs, but these all re-direct to one specific page: https://171kj.cc
12:01:30 Can we filter these out?
12:02:02 (oh, and no nudity)
12:15:40 I did notice that. Just a bunch of SPAM JAA ^^^^
13:31:39 datechnoman: please ping me too
13:31:58 i feel like some of these can be filtered out differently than adding a pattern, but not sure either
13:32:25 problem about patterns is that they stay in kind of "forever" - so resource use due to them increases over time as more are added
13:32:42 maybe some can be merged together or deleted later on, but not entirely sure yet how to check that
13:32:43 Anyone elses containers acting up? (looking at stats thats a yes) have some that are seemingly stuck
13:33:02 imer: stuck on what?
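[editor's note] The 02:58:54 idea of a per-file size maximum that fails only the offending URL, combined with the 03:19:26 point about making the error non-fatal, could look roughly like the sketch below. It does not use the real pipeline or Wget-AT interfaces; `FileTooLarge`, `copy_with_cap` and `fetch_item` are hypothetical names, and the chunk iterable stands in for a streamed response body.

```python
import io


class FileTooLarge(Exception):
    """Raised when one response body would exceed the per-file cap."""


def copy_with_cap(chunks, out, max_bytes):
    """Write chunks to out, aborting as soon as the cap would be exceeded."""
    written = 0
    for chunk in chunks:
        if written + len(chunk) > max_bytes:
            raise FileTooLarge(f"body exceeds {max_bytes} bytes")
        out.write(chunk)
        written += len(chunk)
    return written


def fetch_item(url, chunks, max_bytes):
    """Fail only this URL when it is too large, instead of crashing the job."""
    buf = io.BytesIO()
    try:
        size = copy_with_cap(chunks, buf, max_bytes)
    except FileTooLarge:
        # Non-fatal: report the URL as failed and let the job continue.
        return {"url": url, "status": "failed", "reason": "too_large"}
    return {"url": url, "status": "ok", "bytes": size}
```

The check runs before each write, so an oversized download such as a planet.osm dump is abandoned once it crosses the cap rather than after the disk has filled.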
13:33:05 i can have a look in a bit of time
13:33:06 2024-03-27T13:17:06.563547423Z Starting MoveFiles for Item and then silence
13:33:24 oh crap
13:33:28 sorry, left something in :/
13:35:41 imer: fixed
13:35:45 oh, good. thought something was broken on my end
13:35:55 there was a debug sleep of 1000 seconds in there, that i forgot to take out
13:36:00 oops haha
13:36:23 this large recent update is for the introduction of the urls-tor-grab project, it will be based on urls-grab with some stuff replaced simply in pipeline.py
13:49:52 tor urls: cool. Does it need an external tor proxy? will i2p be done eventually?
14:18:57 Wonder if there's some way of tracking the number that get filtered out. And then maybe that being an opt in thing we could run on some of them. e.g. "This filter hasn't actually filtered anything for 30 days, we can probably remove it".
14:19:04 Would only work for filters added to workers though
15:08:44 immibis: not sure about i2p
15:09:30 AK: yeah the thing is though that many of these filters are in to prevents expanding loops. once one URL goes through and is not filtered out, it expands in 2 URLs, 4, etc. (or more at each step)
16:15:44 arkiver: Poking Drone so urls-tor-grab will build once pushed.
16:18:43 thanks JAA :)
16:55:46 can the way back machine handle tor?
17:01:00 yes
17:03:40 Running the tor worker, anything we need to do or is it just run the workers and they handle all the tor bit for us?
17:05:26 just run docker and you're all done
17:05:31 it will run tor for you
17:07:07 Alright, gimme a ping when you've got a built image and I'll spin up a (low concurrency to start) few workers
17:07:48 sounds good :)
17:07:50 we'll start shortly!
17:07:54 working on the final bits
17:12:46 some Tor sites have custom captchas, like I think dread has one
17:13:19 it probably won't be a terribly complete archive.
well it's only targeted urls right, should be ok
17:13:49 we would not get the content behind captchas
17:13:59 which is the same as for the general web, we currently don't attempt to solve captchas
17:24:53 i think we're ready
17:27:07 FYI AK coming up
17:28:36 it's up!
17:29:14 almost
17:29:22 exciting news!
17:30:51 Hello, keep getting this error on the project and its looping, not sure if anyone has reported yet https://transfer.archivete.am/137OT4/error.txt
17:30:52 inline (for browser viewing): https://transfer.archivete.am/inline/137OT4/error.txt
17:31:41 up!
17:32:12 Legend arkiver
17:32:38 Darken: i can't replicate :/
17:32:46 Darken: this is urls-grab right? (not Tor)
17:32:51 You fixed it
17:32:52 and yes it was
17:32:57 when you said up! it fixed
17:33:08 not sure what fixed it but good :P
17:36:38 pausing tor project
17:38:00 why is there non-onion stuff in the project
17:38:16 onion sites linking to clear net?
17:38:49 Seeing a fair few `Failed ConnectTor for Item` on some of mine, not sure if that's a me problem
17:39:55 fireonlive: no those would be queued back to the urls-grab project
17:40:01 ahh ok
17:41:34 is there a separate tracker leaderboard for urls-tor?
17:41:34 found the problem
17:42:50 yeah urls-onion, maybe i should change that to urls-tor
17:43:50 ahh onion
17:44:17 ye tor would match the image/title i suppose
17:46:57 tor: getting a timeout in checkip after supposedly connecting
18:04:39 oh, it worky
18:06:40 imer: did it take many tries?
18:07:22 i think a new route needs to be requested upon a timeout, will get that in later this week if it is a solution indeed
18:07:38 arkiver: after update it just worked first try, dont see it submitting the results though.. it just kinda gave up? running on conc 1: https://transfer.archivete.am/iXZmR/2024-03-27_18-07-34.txt
18:07:39 inline (for browser viewing): https://transfer.archivete.am/inline/iXZmR/2024-03-27_18-07-34.txt
18:08:19 oh yeah we don't have a target yet
18:08:30 (bad error message - it's actually saying there's no target
18:08:33 aah
18:08:42 all good then
18:08:44 rewby: i changed the name on the tracker of the project from urls-onion to urls-tor FYI
18:09:15 imer: did you cut some lines out of that log?
18:09:20 yes
18:09:24 uh, no
18:09:28 I truncated the length
18:09:36 can you post the full log?
18:09:39 sure
18:10:21 arkiver: https://transfer.archivete.am/MCT2l/urls-tor.log
18:10:22 inline (for browser viewing): https://transfer.archivete.am/inline/MCT2l/urls-tor.log
18:10:38 ah okey looking fine
18:10:48 45s for checkip, might need a longer timeout?
18:10:52 yeah
18:10:56 will make it 120 seconds
18:11:06 we also have a 120 second timeout on retrieving a URL with Wget-AT
18:13:02 imer: done, 120 second now
18:14:05 will these be visible via WBM then or just in a special collection?
18:15:01 they'll be in the archiveteam_urlstor collection
18:15:11 and they will be visible in the Wayback Machine
18:15:19 cool
18:19:09 :3
18:47:31 project is unpaused
18:47:38 (i had to clean up a mess i made)
18:51:05 enjoy :)
18:51:11 i'll be off now
18:51:17 when we have a target, the party can truly start
19:06:21 🥳
19:06:27 arkiver++
19:06:28 -eggdrop- [karma] 'arkiver' now has 18 karma!
19:18:15 hm
19:18:30 my VPS is already running a tor daemon, I don't think I can afford the RAM for a second :P
19:20:49 🤔oO( nicolas17 + tor = ? )
19:21:44 what are you speculating about
19:22:13 your use cases :D
19:23:02 ah
19:23:18 at one point it was
19:23:37 "digitalocean gives me 1TB/mo upload and I barely use it, I'll get my money's worth by letting a tor relay burn through the rest"
19:24:33 now I'm also checking for changes in opensource.samsung.com file lists, and my VPS IP has been banned like a month ago already :P
19:51:19 oof 1TB/mo egress is smol
19:52:58 compared to hetzner yes. compared to aws no.
19:53:39 ahh :)
19:54:12 digitalocean--
19:54:13 -eggdrop- [karma] 'digitalocean' now has -1 karma!
19:54:15 aws--
19:54:16 -eggdrop- [karma] 'aws' now has -1 karma!
20:15:13 but digitalocean has the best peering ever :P
20:20:42 https://transfer.archivete.am/PdvK3/archiveteam_torrior.png suggested icon
20:20:43 inline (for browser viewing): https://transfer.archivete.am/inline/PdvK3/archiveteam_torrior.png
20:26:26 very nice
20:31:20 Nice
20:31:30 I was going to suggest Shrek. :-P
20:37:35 Shrek--
20:37:36 -eggdrop- [karma] 'Shrek' now has -1 karma!
20:47:37 aws--
20:47:38 -eggdrop- [karma] 'aws' now has -2 karma!
21:06:27 azure--
21:06:27 -eggdrop- [karma] 'azure' now has -1 karma!
21:06:28 gcp--
21:06:29 -eggdrop- [karma] 'gcp' now has -1 karma!
21:06:50 cloudflare--
21:15:00 is there any provider we ++?
21:15:01 :3
21:15:08 we might be slowing down expertini.com, there's also some regional? redirect stuff going on from the looks of it https://transfer.archivete.am/14XnYB/2024-03-27_21-14-46.txt some 500s, lots of redirects
21:15:08 inline (for browser viewing): https://transfer.archivete.am/inline/14XnYB/2024-03-27_21-14-46.txt
21:15:14 hetzner++
21:15:14 -eggdrop- [karma] 'hetzner' now has 1 karma!
21:17:54 ah, http -> https redirects
21:27:08 cloudflare--
21:27:08 -eggdrop- [karma] 'cloudflare' now has -1 karma!
22:18:20 any thoughts on if there would be abuse messages on tor stuff?
22:46:40 what do you mean?
22:47:14 some fascist hosting providers prohibit all use of tor. the rest don't care unless they get an abuse message, which only happens to exit nodes, which you aren't
23:41:06 good deal. thanks
23:54:22 Man my cluster was cooked after those code changes. All the containers were hung from the pipeline issue. Just rolled everything and it appears to be working again as normal
23:58:47 Terbium: I pay $6/mo, and as I said in normal conditions I don't even use most of that 1TB