01:27:56 Just finished parsing the forum woman.ru and, using regex, processed the whole dataset and extracted a list of links to external sites. I think they will be useful here - https://transfer.archivete.am/LWPGL/womanru.txt.zst
01:42:06 !a https://transfer.archivete.am/LWPGL/womanru.txt
01:42:07 datechnoman: Registering YROzuz94 for '!a https://transfer.archivete.am/LWPGL/womanru.txt'
01:42:22 datechnoman: Skipped 307 invalid URLs: https://transfer.archivete.am/YOMRi/womanru.txt.bad-urls.txt (YROzuz94)
01:42:23 datechnoman: Deduplicating and queuing 167708 items. (YROzuz94)
01:42:30 datechnoman: Deduplicated and queued 167708 items. (YROzuz94)
01:43:00 ^^^ nyuuzyou - I've uploaded your script into the tracker which will be processed when the target is back online :) ^^^
01:43:11 Not script, urls sorry
01:43:14 thanks
01:43:27 No worries. Thanks for sharing
05:34:33 !a https://transfer.archivete.am/flVVG/urls_batch_0000_000.txt
05:34:33 datechnoman: Registering PBLGmrlF for '!a https://transfer.archivete.am/flVVG/urls_batch_0000_000.txt'
05:38:13 arkiver JAA - It's only early days (nearly 24 hours of being down) but worth asking if either of you have a contact method for rewby|backup?
05:40:27 I cannot breathe properly when #// isn't chewing urls
datechnoman: ouch i see
05:44:01 let's give them a bit more time, i hope this will be fixed soon
05:44:47 Ack no worries. I need to get another hobby >.<
05:44:58 Will keep chewing away urls either way :)
05:46:04 this is a great hobby to have though :) we're saving very important data
05:46:24 as i have come to find out personally when occasionally checking for more recent stuff in the Wayback Machine
05:50:39 55 million hits this week alone on #// items
05:50:45 The stats speak for themselves
05:51:08 235 million in the last month
05:51:47 years from now this will likely simply be the most valuable web collection
05:52:29 CommonCrawl is trying to gain on us but we have got them beat >;)
05:53:04 It will be invaluable. My wife still doesn't get what or why I do this but I really enjoy it and have learnt so much :)
05:53:36 hah :) i check those stats regularly as well
05:53:59 datechnoman: that is very nice, especially on the learning part too
05:54:15 \me pretty much got into programming because of Archive Team
05:54:19 * arkiver *
05:54:22 :P sad
05:54:27 * arkiver pretty much got into programming because of Archive Team
05:54:36 did a little before too, but not serious
05:57:32 A decent portion of the stuff being added now is from CommonCrawl anyways no? It's just going to be in a much more accessible format
06:02:08 They do not extract / ingest .pdf files etc where we do
06:02:13 They are much more focused on their grabs
06:02:22 They want/like to sample websites
06:02:34 Which is what we do but we are much more widespread
06:02:42 And we grab certain outlinks etc
06:04:10 We are very much the same in certain aspects but they don't crawl all the different outlinks from say blogger, telegram etc
06:07:27 datechnoman: Skipped 2763 invalid URLs: https://transfer.archivete.am/2BpDs/urls_batch_0000_000.txt.bad-urls.txt (PBLGmrlF)
06:07:28 datechnoman: Deduplicating and queuing 24997237 items. (PBLGmrlF)
06:14:03 !a https://transfer.archivete.am/WVCvq/urls_batch_0000_001.txt
06:14:04 datechnoman: Registering BOQ8dvzU for '!a https://transfer.archivete.am/WVCvq/urls_batch_0000_001.txt'
06:21:25 datechnoman: Skipped 436 invalid URLs: https://transfer.archivete.am/2LZcp/urls_batch_0000_001.txt.bad-urls.txt (BOQ8dvzU)
06:21:26 datechnoman: Deduplicating and queuing 3416579 items. (BOQ8dvzU)
06:24:27 datechnoman: Deduplicated and queued 3416579 items. (BOQ8dvzU)
06:30:32 datechnoman: Deduplicated and queued 24997237 items. (PBLGmrlF)
13:06:36 https://transfer.archivete.am/8k3DW/bordaruurls.txt from https://huggingface.co/datasets/nyuuzyou/bordaru-posts
13:06:38 inline (for browser viewing): https://transfer.archivete.am/inline/8k3DW/bordaruurls.txt
13:50:53 nyuuzyou: thank you!!
13:50:57 !a https://transfer.archivete.am/8k3DW/bordaruurls.txt
13:50:58 arkiver: Registering bOrp62lm for '!a https://transfer.archivete.am/8k3DW/bordaruurls.txt'
13:51:18 arkiver: Skipped 262 invalid URLs: https://transfer.archivete.am/MVZUX/bordaruurls.txt.bad-urls.txt (bOrp62lm)
13:51:19 arkiver: Deduplicating and queuing 212690 items. (bOrp62lm)
13:51:30 arkiver: Deduplicated and queued 212690 items. (bOrp62lm)