04:57:04 !a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt
04:57:04 datechnoman: Registering cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'
05:00:13 datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/RYy57/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
05:00:15 datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/DsXwU/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
05:00:16 datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
05:25:43 arkiver: could I hassle you to look at the bot? It doesn't look like it's queuing the jobs into the tracker. I noticed on my last 2 jobs that there was no increase in the todo queue and no "deduplicated and queued XXXXX items" message.
05:27:44 The first job is a duplicate upload, so it won't produce any new items (I didn't notice at the time). The second job isn't, and I'd expect a lot of new items from it. I threw in a bunch of new workers to process through it, so don't worry about that :)
07:17:50 It uploaded to other channels/bots without issue, so it might be a character in one of the files causing issues, or the bot just not being happy :(
11:31:39 Also, if you don't mind topping up the queue with some stash URLs, that would be great. I'm going away for the next 3 days and will leave the extra workers running, so please feel free to keep them busy :)
12:06:01 datechnoman: i will put some extra in tonight
12:06:12 might move that out when i make updates to the code again
12:06:39 which will get any outlinks (so to a different domain) from web pages on sites that we have identified as news sites
12:07:25 ... which could go a little deep, as we would also queue outlinks from one news site to another, from where more outlinks are queued, etc.
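The bot messages above show its pre-queue pipeline: skip invalid URLs, skip very long URLs, then deduplicate what remains before queuing. A minimal sketch of that flow, assuming a hypothetical `partition_urls` helper, an assumed length cutoff (`MAX_URL_LEN = 2000`), and a simplified validity regex; the real bot's rules are not documented here:

```python
import re

MAX_URL_LEN = 2000  # assumed cutoff for "very long URLs"; the real limit is unknown
URL_RE = re.compile(r"^https?://\S+$")  # simplified validity check, an assumption

def partition_urls(lines):
    """Split raw input lines into (queued, invalid, too_long).

    Mirrors the bot's reported steps: invalid URLs and overlong URLs are
    set aside into their own lists (like the .bad-urls.txt / .skipped.txt
    reports), and the rest are deduplicated in input order before queuing.
    """
    invalid, too_long, queued = [], [], []
    seen = set()
    for line in lines:
        url = line.strip()
        if not URL_RE.match(url):
            invalid.append(url)
        elif len(url) > MAX_URL_LEN:
            too_long.append(url)
        elif url not in seen:  # dedupe while preserving first occurrence
            seen.add(url)
            queued.append(url)
    return queued, invalid, too_long
```

Note that deduplication here is only within one upload; the "Deduplicating and queuing" step in the log may also dedupe against items already in the tracker, which this sketch does not model.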
12:07:38 so that might explode, and then i'll make changes to only get outlinks to non-news sites
12:07:57 but it may also not explode, and then we'll get significantly better news site coverage and news site sources coverage
12:19:41 All good. I shall leave it in your capable hands :)
12:20:18 Archiving news is very important historically, so it's well worth the compute/resources
12:21:12 It will most likely be a large expansion of the URLs being queued, but we will cut down on them as we process through them, like the monthly sitemaps
12:21:32 Sounds well worth it
12:26:18 absolutely!
12:26:54 regularly i come across some government pdf document on the Internet and wonder "is it saved?". so i look it up in the wayback machine, and really in 8/10 cases it is saved, and the only capture is this project
12:52:37 Honestly doing some really great work here 👏
12:54:54 I did some really rough testing of the unique PDFs in the second job and found a bunch missing, so hopefully that helps! They aren't targeted for things like news and government, just PDF outlinks from Twitter posts.
18:48:06 Queuing bot shutting down.
18:48:15 Queuing bot started.
18:48:17 datechnoman: Restarting unfinished job isAJoDlg for '!a https://transfer.archivete.am/gsq2b/unique_pdfs_output.txt'.
18:48:18 datechnoman: Restarting unfinished job cmKPUONH for '!a https://transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
18:48:19 datechnoman: Restarting unfinished job cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'.
18:48:25 indeed datechnoman ^
18:49:38 it may be a character, yeah, i'm not immediately sure
18:51:11 datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/QH7Xd/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
18:51:16 datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/16f2lB/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
18:51:18 datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
18:54:42 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/i5NDE/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
18:54:43 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/oMQvL/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
18:54:44 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/f9q9g/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
18:54:45 datechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
18:59:25 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/uPUIK/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
18:59:26 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/J7XI2/filtered_.pdf_output.txt.not-printable.txt (cmKPUONH)
18:59:28 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/GvhN6/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
18:59:29 datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
19:02:25 Archiving news sources would be great!
19:02:43 (It'd work even better if news outlets actually linked their sources more frequently.)
19:03:22 We still haven't found that IS Telegram channel, have we?
19:06:29 Hmm, came across https://www.eurosport.com/football/premier-league/2015-2016/99-of-manchester-united-fans-want-wayne-rooney-dropped-do-you_sto5859298/story.shtml - but it's geoblocked; does this project prioritise routing geoblocked content to other Warriors that may have access to it?
19:06:41 Seeing that https://www.eurosport.com/ is constantly processed here
19:30:37 Woah, that's a very long URLs:
19:37:18 There is no prioritisation. Stuff just gets retried randomly by any worker. Hopefully one of them isn't getting geoblocked.
19:38:10 Eurosport should be geoblocked to ~Europe, and we have lots of workers there, so unless they block Hetzner, that particular case should mostly be fine, probably maybe.
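The outlink discussion above describes two policies: queue any cross-domain outlink from a known news site, or, if that explodes, queue only outlinks that lead to non-news domains (so news-to-news chains stop expanding). A minimal sketch of the second policy, assuming a hypothetical `outlinks_to_queue` function and an illustrative `NEWS_DOMAINS` seed set; the project's actual news-site list and matching rules are not shown in the log:

```python
from urllib.parse import urlsplit

# Illustrative seed set; the real project list is much larger and not shown here.
NEWS_DOMAINS = {"eurosport.com", "example-news.org"}

def domain(url):
    """Extract the registrable-ish host, dropping a leading 'www.' for matching."""
    host = urlsplit(url).hostname or ""
    return host.removeprefix("www.")

def outlinks_to_queue(page_url, outlinks, skip_news_targets=True):
    """Select which outlinks from a news-site page get queued.

    Same-domain links are ignored (they are not outlinks in this sense).
    With skip_news_targets=True, links pointing at other known news sites
    are also dropped, preventing unbounded news -> news -> news expansion.
    """
    src = domain(page_url)
    keep = []
    for link in outlinks:
        dst = domain(link)
        if dst == src:
            continue  # same site: not a cross-domain outlink
        if skip_news_targets and dst in NEWS_DOMAINS:
            continue  # news-to-news link: skip to avoid the "explosion"
        keep.append(link)
    return keep
```

Running with `skip_news_targets=False` corresponds to the first, broader policy that "might explode"; flipping the flag is the fallback change described at 12:07:38.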
21:50:53 Somehow I'm always finding the bugs in the system :P
21:59:24 !a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt
21:59:24 datechnoman: Registering dSoiEwEl for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'
21:59:26 datechnoman: Your job is waiting for a slot. (dSoiEwEl)