01:00:08 arkiver: Hrm... I am going to try and process the Common Crawl data, would you like me to queue up the list when I can get some URLs out of them?
01:00:23 Keyword is "try"
18:29:30 datechnoman: i'll queue https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt manually
18:29:31 inline (for browser viewing): https://transfer.archivete.am/inline/4EloA/goo-gl.2023-06-10-10-17-02.txt
18:29:38 so we have some stuff in the queue
18:29:45 until i can do more tomorrow
18:47:14 oh
18:47:38 Queuing bot shutting down.
18:48:12 Queuing bot started.
18:48:15 datechnoman: Restarting unfinished job isAJoDlg for '!a https://transfer.archivete.am/gsq2b/unique_pdfs_output.txt'.
18:48:16 datechnoman: Restarting unfinished job cmKPUONH for '!a https://transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
18:48:17 datechnoman: Restarting unfinished job cYpn43AW for '!a https://transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'.
18:48:18 datechnoman: Restarting unfinished job dSoiEwEl for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'.
18:48:20 !a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt
18:48:21 arkiver: Registering TglcqGtY for '!a https://transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'
18:51:30 datechnoman: Skipped 199 invalid URLs: https://transfer.archivete.am/ibK9i/filtered_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
18:51:32 datechnoman: Skipped 32 very long URLs: https://transfer.archivete.am/33VsW/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
18:51:33 datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
18:59:48 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/RSxtz/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
18:59:50 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/r2gUS/filtered_.pdf_output.txt.not-printable.txt (cmKPUONH)
18:59:52 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/M8b3d/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
18:59:53 datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
19:00:08 datechnoman: Skipped 4203 invalid URLs: https://transfer.archivete.am/strAy/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
19:00:09 datechnoman: Fixed 1 unprintable URLs: https://transfer.archivete.am/TYn6Y/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
19:00:10 datechnoman: Skipped 1 very long URLs: https://transfer.archivete.am/L2lIW/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
19:00:12 datechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
19:12:10 datechnoman: Skipped 31684 invalid URLs: https://transfer.archivete.am/a9iQ5/goo-gl.2023-06-10-10-17-02.txt.bad-urls.txt (dSoiEwEl)
19:12:11 datechnoman: Deduplicating and queuing 9968210 items. (dSoiEwEl)
19:15:54 arkiver: Skipped 31684 invalid URLs: https://transfer.archivete.am/kgaOs/goo-gl.2023-06-10-10-17-02.txt.bad-urls.txt (TglcqGtY)
19:15:55 arkiver: Deduplicating and queuing 9968210 items. (TglcqGtY)
19:27:08 datechnoman: Deduplicated and queued 9968210 items. (dSoiEwEl)
19:29:25 arkiver: Deduplicated and queued 9968210 items. (TglcqGtY)
21:45:26 Thanks for the manual queuing. Should buy us some time until more can be queued when you have a moment :)
23:00:23 kiska: I'm not arkiver obviously but what does processing involve?
I was thinking recently (I know I haven't been active lately, but a mess came up elsewhere in my life) about trying to store all outlinks from everywhere
23:01:00 Here is my regex for doing the Common Crawl stuff
23:01:00 url_pattern = re.compile(r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?")
23:01:04 In Python...
23:01:18 Here is my test file...
23:01:18 ./CC-MAIN-20240220211055-20240221001055-00000.warc.gz: 61%|██████▏ | 21465/34998 [18:39:26<3:51:05, 1.02s/record]
23:01:28 I am going to die of old age before this file completes itself
23:37:07 What are you doing with the URLs kiska? Also your URL regex doesn't support $ (also what's the bottleneck?)
23:37:27 Python :D
23:38:24 OrIdow6: I am planning to feed the CC extracted URLs here :D
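For context, a minimal sketch of how a regex like the one above might be run over a single Common Crawl WARC file using warcio. The response-record filter, the extract_outlinks helper, and the deduplication loop are illustrative assumptions, not how kiska's actual script works.

import re

from warcio.archiveiterator import ArchiveIterator

# Regex from the chat above, compiled once.
url_pattern = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b"
    r"(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?"
)

def extract_outlinks(warc_path):
    """Yield every URL the regex finds in the response records of a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # Decode the payload leniently; Common Crawl pages use many encodings.
            body = record.content_stream().read().decode("utf-8", errors="replace")
            yield from url_pattern.findall(body)

if __name__ == "__main__":
    # Filename taken from the progress bar in the chat; any .warc.gz path works.
    seen = set()
    for url in extract_outlinks("CC-MAIN-20240220211055-20240221001055-00000.warc.gz"):
        if url not in seen:
            seen.add(url)
            print(url)

At roughly one second per record, running findall over large HTML payloads in a single Python process is a plausible bottleneck; splitting the WARC files across a multiprocessing pool (or moving extraction out of pure Python) is the usual way such a pass is sped up.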