04:57:04 <datechnoman> !a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt
04:57:04 <h2ibot> datechnoman: Registering cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'
05:00:13 <h2ibot> datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/RYy57/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
05:00:15 <h2ibot> datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/DsXwU/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
05:00:16 <h2ibot> datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
05:25:43 <datechnoman> arkiver could I hassle you to look at the bot? Doesn't looks like its queuing the jobs into the tracker. Noticed on my last 2 jobs that there was no increase in the todo queue and no "deduplicated and queued XXXXX items
05:27:44 <datechnoman> The first job is a duplicate upload so it won't produce any new items (didn't notice at the time). The second job isn't and I'd expect a lot of new items from it. Threw in a bunch of new workers to process through it so don't worry about that :)
07:17:50 <datechnoman> Uploaded to other channels/bots without issue so might be a character in one of the files causing issues or the bot not being happy :(
11:31:39 <datechnoman> Also if you dont mind topping up the queue with some stash urls that would be great. Going away for the next 3 days and will leave the extra workers running so please feel free to keep them busy :)
12:06:01 <arkiver> datechnoman: i will put some extra in tonight
12:06:12 <arkiver> might move that out when i make updates to the code again
12:06:39 <arkiver> which well get any outlinks (so to a different domain) from web pages on sites that we have identified as news sites
12:07:25 <arkiver> ... which could go a little deep as we would also queue outlinks from one news site to another, from where more outlinks are queued, etc.
12:07:38 <arkiver> so that might explode and then i'll make changes to only get outlinks to non-news sites
12:07:57 <arkiver> but it may also not explode, and then we'll get significantly better news site coverage and news site sources coverage
12:19:41 <datechnoman> All good. I shall leave it in your capable hands :)
12:20:18 <datechnoman> archiving news is very important and historical so well worth the compute/resources
12:21:12 <datechnoman> It will most likely be a large expansion of urls being queued but we will cut down on them as we process through them like the monthly sitemaps
12:21:32 <datechnoman> Sounds well worth it
12:26:18 <arkiver> absolutely!
12:26:54 <arkiver> regularly i come across some government pdf document on the Internet, and wonder "is it saved?". so i look it up in the wayback machine, and really in 8/10 cases it is saved, and the only capture is this project
12:52:37 <datechnoman> Honestly doing some really great work here 👏
12:54:54 <datechnoman> I did some really rough testing of the unique pdfs in the second job and found a bunch missing so hoepfully that helps! They aren't targeted for things like news and government. Just pdf outlinks from Twitter posts
18:48:06 <h2ibot> Queuing bot shutting down.
18:48:15 <h2ibot> Queuing bot started.
18:48:17 <h2ibot> datechnoman: Restarting unfinished job isAJoDlg for '!a https://transfer.archivete.am/gsq2b/unique_pdfs_output.txt'.
18:48:18 <h2ibot> datechnoman: Restarting unfinished job cmKPUONH for '!a https://transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
18:48:19 <h2ibot> datechnoman: Restarting unfinished job cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'.
18:48:25 <arkiver> indeed datechnoman ^
18:49:38 <arkiver> it may be a characters yeah, i'm not immediately sure
18:51:11 <h2ibot> datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/QH7Xd/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
18:51:16 <h2ibot> datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/16f2lB/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
18:51:18 <h2ibot> datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
18:54:42 <h2ibot> datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/i5NDE/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
18:54:43 <h2ibot> datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/oMQvL/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
18:54:44 <h2ibot> datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/f9q9g/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
18:54:45 <h2ibot> datechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
18:59:25 <h2ibot> datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/uPUIK/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
18:59:26 <h2ibot> datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/J7XI2/filtered_.pdf_output.txt.not-printable.txt (cmKPUONH)
18:59:28 <h2ibot> datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/GvhN6/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
18:59:29 <h2ibot> datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
19:02:25 <JAA> Archiving news sources would be great!
19:02:43 <JAA> (It'd work even better if news outlets actually linked their sources more frequently.)
19:03:22 <JAA> We still haven't found that IS Telegram channel, have we?
19:06:29 <Ryz> Hmm, came across https://www.eurosport.com/football/premier-league/2015-2016/99-of-manchester-united-fans-want-wayne-rooney-dropped-do-you_sto5859298/story.shtml - but it's geoblocked; does this project take account of prioritizing geoblocked content to other Warriors that may have access to it?
19:06:41 <Ryz> Seeing that https://www.eurosport.com/ is constantly processed here
19:30:37 <Vokun> Woah, that's a very long URLs:
19:37:18 <JAA> There is no prioritisation. Stuff just gets retried randomly by any worker. Hopefully one of them isn't getting geoblocked.
19:38:10 <JAA> Eurosport should be geoblocked to ~Europe, and we have lots of workers there, so unless they block Hetzner, that particular case should mostly be fine, probably maybe.
21:50:53 <datechnoman> Somehow I'm always finding the bugs in the system :P
21:59:24 <datechnoman> !a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt
21:59:24 <h2ibot> datechnoman: Registering dSoiEwEl for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'
21:59:26 <h2ibot> datechnoman: Your job is waiting for a slot. (dSoiEwEl)