01:33:44 What's the difference between the (locked) ".os.cdx.gz" file and the (downloadable) ".cdx.gz" file here? https://archive.org/download/archiveteam_urls_20231116194819_c386be7f
01:37:04 DigitalDragons: the os.cdx.gz files have the CDX data for each WARC (there is one os.cdx.gz file for each WARC). the available .cdx.gz files have all the os.cdx.gz files combined and sorted
01:37:35 i could make the .os.cdx.gz file available as well if you'd like
01:45:06 ah! thank you
01:46:53 the combined one is fine - I just saw the file sizes were different and thought they may have contained different data, but I guess it must be some compression savings from sorting or something like that
01:54:30 !a https://transfer.archivete.am/zYwva/discord_urls.txt
01:54:37 datechnoman: Skipped 1 invalid URLs: https://transfer.archivete.am/GX9cx/discord_urls.txt.bad-urls.txt (for 'https://transfer.archivete.am/zYwva/discord_urls.txt')
01:54:38 datechnoman: Deduplicating and queuing 104923 items. (for 'https://transfer.archivete.am/zYwva/discord_urls.txt')
01:54:51 datechnoman: Deduplicated and queued 104923 items. (for 'https://transfer.archivete.am/zYwva/discord_urls.txt')
02:20:35 DigitalDragons: that must be it yes, because `diff` returns nothing for me for the uncompressed data
02:20:40 (just checked)
03:07:57 * pabs ponders about getting URLs from public channel topics of all IRC networks into #//
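
The "combined and sorted" step from 01:37:04 and the `diff` check from 02:20:35 could look roughly like the sketch below. It is illustrative only, not the tooling actually used for these items: the function names `merge_cdx` and `same_uncompressed` are hypothetical, and it assumes each per-WARC file starts with a field-legend line beginning with " CDX" that should appear once in the output.

```python
import gzip


def merge_cdx(per_warc_paths, combined_path):
    """Combine per-WARC CDX files into one sorted, gzipped CDX file.

    Returns the number of records written (the header is not counted).
    """
    header = None
    records = []
    for path in per_warc_paths:
        with gzip.open(path, "rb") as fh:
            for line in fh:
                line = line.rstrip(b"\r\n")
                if not line:
                    continue
                if line.startswith(b" CDX"):
                    # Assumed field-legend line; keep only the first one seen.
                    if header is None:
                        header = line
                    continue
                records.append(line)

    # Plain byte-wise sort, so the result does not depend on input file order.
    records.sort()

    with gzip.open(combined_path, "wb") as out:
        if header is not None:
            out.write(header + b"\n")
        for record in records:
            out.write(record + b"\n")
    return len(records)


def same_uncompressed(path_a, path_b):
    """Return True if two gzipped files decompress to identical bytes."""
    with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
        return a.read() == b.read()
```

Comparing the decompressed bytes rather than the `.gz` files is the point here: two archives can hold the same records and still differ in compressed size.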