01:00:08 arkiver: Hrm... I am going to try and process the Common Crawl data, would you like me to queue up the list when I can get some URLs out of them?
01:00:23 Keyword is "try"
18:29:30 datechnoman: i'll queue https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt manually
18:29:31 inline (for browser viewing): https://transfer.archivete.am/inline/4EloA/goo-gl.2023-06-10-10-17-02.txt
18:29:38 so we have some stuff in the queue
18:29:45 until i can do more tomorrow
18:47:14 oh
18:47:38 Queuing bot shutting down.
18:48:12 Queuing bot started.
18:48:15 datechnoman: Restarting unfinished job isAJoDlg for '!a https://transfer.archivete.am/gsq2b/unique_pdfs_output.txt'.
18:48:16 datechnoman: Restarting unfinished job cmKPUONH for '!a https://transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
18:48:17 datechnoman: Restarting unfinished job cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'.
18:48:18 datechnoman: Restarting unfinished job dSoiEwEl for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'.
18:48:20 !a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt
18:48:21 arkiver: Registering TglcqGtY for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'
18:51:30 datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/ibK9i/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
18:51:32 datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/33VsW/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
18:51:33 datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
18:59:48 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/RSxtz/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
18:59:50 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/r2gUS/filtered_.pdf_output.txt.not-printable.txt (cmKPUONH)
18:59:52 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/M8b3d/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
18:59:53 datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
19:00:08 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/strAy/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
19:00:09 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/TYn6Y/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
19:00:10 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/L2lIW/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
19:00:12 datechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
19:12:10 datechnoman: Skipped 31684 invalid URLs: https://transfer.archivete.am/a9iQ5/goo-gl.2023-06-10-10-17-02.txt.bad-urls.txt (dSoiEwEl)
19:12:11 datechnoman: Deduplicating and queuing 9968210 items. (dSoiEwEl)
19:15:54 arkiver: Skipped 31684 invalid URLs: https://transfer.archivete.am/kgaOs/goo-gl.2023-06-10-10-17-02.txt.bad-urls.txt (TglcqGtY)
19:15:55 arkiver: Deduplicating and queuing 9968210 items. (TglcqGtY)
19:27:08 datechnoman: Deduplicated and queued 9968210 items. (dSoiEwEl)
19:29:25 arkiver: Deduplicated and queued 9968210 items. (TglcqGtY)
21:45:26 Thanks for the manual queuing. Should buy us some time until more can be queued when you have a moment :)
23:00:23 kiska: I'm not arkiver obviously but what does processing involve?
I was thinking recently (I know I haven't been active lately, but a mess came up elsewhere in my life) about trying to store all outlinks from everywhere
23:01:00 Here is my regex for doing the Common Crawl stuff
23:01:00 url_pattern = re.compile(r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?")
23:01:04 In Python...
23:01:18 Here is my test file...
23:01:18 ./CC-MAIN-20240220211055-20240221001055-00000.warc.gz: 61%|██████▏ | 21465/34998 [18:39:26<3:51:05, 1.02s/record]
23:01:28 I am going to die of old age before this file completes itself
23:37:07 What are you doing with the URLs kiska? Also your URL regex doesn't support $ (also what's the bottleneck?)
23:37:27 Python :D
23:38:24 OrIdow6: I am planning to feed the CC extracted URLs here :D
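For context, a minimal sketch of how a regex like the one above might be run over a single Common Crawl WARC file using warcio. The response-record filter, the extract_outlinks helper, and the deduplication loop are illustrative assumptions, not how kiska's actual script works.

import re

from warcio.archiveiterator import ArchiveIterator

# Regex from the chat above, compiled once.
url_pattern = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b"
    r"(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?"
)

def extract_outlinks(warc_path):
    """Yield every URL the regex finds in the response records of a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # Decode the payload leniently; Common Crawl pages use many encodings.
            body = record.content_stream().read().decode("utf-8", errors="replace")
            yield from url_pattern.findall(body)

if __name__ == "__main__":
    # Filename taken from the progress bar in the chat; any .warc.gz path works.
    seen = set()
    for url in extract_outlinks("CC-MAIN-20240220211055-20240221001055-00000.warc.gz"):
        if url not in seen:
            seen.add(url)
            print(url)

At roughly one second per record, running findall over large HTML payloads in a single Python process is a plausible bottleneck; splitting the WARC files across a multiprocessing pool (or moving extraction out of pure Python) is the usual way such a pass is sped up.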