-
<kiska> arkiver: Hrm... I am going to try and process the Common Crawl data, would you like me to queue up the list when I can get some URLs out of them?
-
<kiska> Keyword is "try"
-
<arkiver> datechnoman: i'll queue transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt manually
-
<eggdrop> inline (for browser viewing): transfer.archivete.am/inline/4EloA/goo-gl.2023-06-10-10-17-02.txt
-
<arkiver> so we have some stuff in the queue
-
<arkiver> until i can do more tomorrow
-
<arkiver> oh
-
<h2ibot> Queuing bot shutting down.
-
<h2ibot> Queuing bot started.
-
<h2ibot> datechnoman: Restarting unfinished job isAJoDlg for '!a transfer.archivete.am/gsq2b/unique_pdfs_output.txt'.
-
<h2ibot> datechnoman: Restarting unfinished job cmKPUONH for '!a transfer.archivete.am/HTbgw/filtered_.pdf_output.txt'.
-
<h2ibot> datechnoman: Restarting unfinished job cYpn43AW for '!a transfer.archivete.am/nIY9O/filtered_pdf_files_unique.txt'.
-
<h2ibot> datechnoman: Restarting unfinished job dSoiEwEl for '!a transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'.
-
<arkiver>
-
<h2ibot> arkiver: Registering TglcqGtY for '!a transfer.archivete.am/4EloA/goo-gl.2023-06-10-10-17-02.txt'
-
<h2ibot> datechnoman: Skipped 199 invalid URLs: transfer.archivete.am/ibK9i/filtere…d_pdf_files_unique.txt.bad-urls.txt (cYpn43AW)
-
<h2ibot> datechnoman: Skipped 32 very long URLs: transfer.archivete.am/33VsW/filtered_pdf_files_unique.txt.skipped.txt (cYpn43AW)
-
<h2ibot> datechnoman: Deduplicating and queuing 5516002 items. (cYpn43AW)
-
<h2ibot> datechnoman: Skipped 4203 invalid URLs: transfer.archivete.am/RSxtz/filtered_.pdf_output.txt.bad-urls.txt (cmKPUONH)
-
<h2ibot> datechnoman: Fixed 1 unprintable URLs: transfer.archivete.am/r2gUS/filtere…d_.pdf_output.txt.not-printable.txt (cmKPUONH)
-
<h2ibot> datechnoman: Skipped 1 very long URLs: transfer.archivete.am/M8b3d/filtered_.pdf_output.txt.skipped.txt (cmKPUONH)
-
<h2ibot> datechnoman: Deduplicating and queuing 9401326 items. (cmKPUONH)
-
<h2ibot> datechnoman: Skipped 4203 invalid URLs: transfer.archivete.am/strAy/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
-
<h2ibot> datechnoman: Fixed 1 unprintable URLs: transfer.archivete.am/TYn6Y/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
-
<h2ibot> datechnoman: Skipped 1 very long URLs: transfer.archivete.am/L2lIW/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
-
<h2ibot> datechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
-
<h2ibot> datechnoman: Skipped 31684 invalid URLs: transfer.archivete.am/a9iQ5/goo-gl.…023-06-10-10-17-02.txt.bad-urls.txt (dSoiEwEl)
-
<h2ibot> datechnoman: Deduplicating and queuing 9968210 items. (dSoiEwEl)
-
<h2ibot> arkiver: Skipped 31684 invalid URLs: transfer.archivete.am/kgaOs/goo-gl.…023-06-10-10-17-02.txt.bad-urls.txt (TglcqGtY)
-
<h2ibot> arkiver: Deduplicating and queuing 9968210 items. (TglcqGtY)
-
<h2ibot> datechnoman: Deduplicated and queued 9968210 items. (dSoiEwEl)
-
<h2ibot> arkiver: Deduplicated and queued 9968210 items. (TglcqGtY)
-
<datechnoman> Thanks for the manual queuing. Should buy us some time until more can be queued when you have a moment :)
-
<OrIdow6> kiska: I'm not arkiver obviously, but what does processing involve? I was thinking recently (I know I haven't been active lately, but a mess came up elsewhere in my life) about trying to store all outlinks from everywhere
-
<kiska> Here is my regex for doing the Common Crawl stuff
-
<kiska> url_pattern = re.compile(r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?")
-
<kiska> In python...
-
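A minimal sketch of how kiska's pattern could be used to pull URLs out of arbitrary text. The regex is copied verbatim from the log; the helper function and the sample string are illustrative, not part of kiska's actual pipeline.

```python
import re

# kiska's URL-extraction regex, verbatim from the log
url_pattern = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b"
    r"(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?"
)

def extract_urls(text):
    """Return every URL-like substring the pattern finds in `text`.

    The pattern uses only non-capturing groups, so findall() yields
    whole matches rather than group tuples.
    """
    return url_pattern.findall(text)

sample = "See https://example.com/page?x=1 and http://www.test.org for details."
print(extract_urls(sample))
# → ['https://example.com/page?x=1', 'http://www.test.org']
```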
<kiska> Here is my test file...
-
<kiska> ./CC-MAIN-20240220211055-20240221001055-00000.warc.gz: 61%|██████▏ | 21465/34998 [18:39:26<3:51:05, 1.02s/record]
-
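The progress line above suggests iterating over a `.warc.gz` one record at a time (at ~1 s/record, likely a tqdm loop over a WARC reader such as warcio). As a rough, stdlib-only sketch of the same streaming idea — decompress the gzip incrementally and run the regex per line, so a multi-gigabyte archive never has to fit in memory. The in-memory gzip buffer here stands in for a real Common Crawl file; a real run would parse WARC records properly instead of scanning raw lines.

```python
import gzip
import io
import re

# kiska's regex, verbatim from the log
url_pattern = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b"
    r"(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?"
)

def scan_gzip_stream(fileobj):
    """Stream-decompress a gzip file object and yield URLs line by line.

    gzip.open() decompresses lazily, so memory use stays flat no matter
    how large the archive is.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield from url_pattern.findall(line)

# Simulated compressed input (a real run would open the CC-MAIN-... file)
raw = "WARC-Target-URI: https://example.com/a\nbody text http://foo.net/x\n"
buf = io.BytesIO(gzip.compress(raw.encode("utf-8")))
print(list(scan_gzip_stream(buf)))
# → ['https://example.com/a', 'http://foo.net/x']
```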
<kiska> I am going to die of old age before this file completes
-
<OrIdow6> What are you doing with the URLs, kiska? Also, that regex doesn't support $ in URLs (also, what's the bottleneck?)
-
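OrIdow6's objection is easy to verify: `$` appears in neither character class of the pattern, so a URL containing it gets silently truncated at the `$`. The example URL below is hypothetical, chosen only to show the truncation.

```python
import re

# kiska's regex, verbatim from the log
url_pattern = re.compile(
    r"https?://(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z]{1,}\b"
    r"(?:/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?"
)

# '$' is absent from the path character class, so the match
# stops just before it and part of the URL is lost.
m = url_pattern.search("https://example.com/items$view=list")
print(m.group(0))
# → 'https://example.com/items'
```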
<kiska> Python :D
-
<kiska> OrIdow6: I am planning to feed the CC extracted URLs here :D