-
DigitalDragons: What's the difference between the (locked) ".os.cdx.gz" file and the (downloadable) ".cdx.gz" file here? archive.org/download/archiveteam_urls_20231116194819_c386be7f
-
arkiver: DigitalDragons: the os.cdx.gz files have the CDX data for each WARC (there is one os.cdx.gz file for each WARC). the available .cdx.gz files have all the os.cdx.gz files combined and sorted
-
arkiver: i could make the .os.cdx.gz file available as well if you'd like
-
DigitalDragons: ah! thank you
-
DigitalDragons: the combined one is fine - I just saw the file sizes were different and thought they may have contained different data, but I guess it must be some compression savings from sorting or something like that
-
datechnoman
-
h2ibot: datechnoman: Skipped 1 invalid URLs: transfer.archivete.am/GX9cx/discord_urls.txt.bad-urls.txt (for 'transfer.archivete.am/zYwva/discord_urls.txt')
-
h2ibot: datechnoman: Deduplicating and queuing 104923 items. (for 'transfer.archivete.am/zYwva/discord_urls.txt')
-
h2ibot: datechnoman: Deduplicated and queued 104923 items. (for 'transfer.archivete.am/zYwva/discord_urls.txt')
-
arkiver: DigitalDragons: that must be it yes, because `diff` returns nothing for me for the uncompressed data
-
arkiver: (just checked)
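
The size difference discussed above can be reproduced with a small sketch. This is an illustration under stated assumptions, not the tooling actually used for the item: the record layout, the `make_records` and `gz_size` helpers, and the four-way split into "per-WARC" chunks are all hypothetical, and the data is synthetic. It compresses the same records twice, once as separate unsorted per-WARC files and once as a single sorted file, then checks that the uncompressed contents match once ordering is ignored.

```python
import gzip
import hashlib
import random


def make_records(hosts=2000, captures_per_host=10, seed=0):
    """Build synthetic CDX-like lines in random order (made-up layout, not real data)."""
    rng = random.Random(seed)
    records = []
    for h in range(hosts):
        # Hypothetical SURT-style key with a long host-specific component.
        key = "com,example," + hashlib.sha1(str(h).encode()).hexdigest() + ")/"
        for c in range(captures_per_host):
            ts = f"20231116{rng.randrange(24):02d}{rng.randrange(60):02d}{rng.randrange(60):02d}"
            records.append(f"{key}page{c} {ts} text/html 200\n")
    rng.shuffle(records)  # captures arrive in crawl order, not key order
    return records


def gz_size(lines):
    """Size in bytes of the gzip-compressed concatenation of the given lines."""
    return len(gzip.compress("".join(lines).encode("utf-8")))


records = make_records()

# Assumption: four WARCs, each with its own unsorted CDX file.
chunk = len(records) // 4
per_warc = [records[i:i + chunk] for i in range(0, len(records), chunk)]

# The combined file holds every per-WARC line, sorted.
combined = sorted(line for part in per_warc for line in part)

per_warc_size = sum(gz_size(part) for part in per_warc)
combined_size = gz_size(combined)

# Same records on both sides once ordering is ignored.
same_data = sorted(line for part in per_warc for line in part) == combined

print("per-WARC total:", per_warc_size, "bytes")
print("combined+sorted:", combined_size, "bytes")
print("same data:", same_data)
```

Sorting places lines with the same URL key next to each other, so gzip can encode the shared prefix as a short back-reference instead of repeating it, which is why the combined file can be smaller while holding identical data.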
-
» pabs ponders about getting URLs from public channel topics of all IRC networks into #//