-
Jake
(sorry for the disconnects y'all)
-
h2ibot
Paulmorriss edited Flickr (+1371, /* 2020 pricing and campaign */):
wiki.archiveteam.org/?diff=48811&oldid=47723
-
h2ibot
Entartet edited List of websites excluded from the Wayback Machine (+54, Added farfrommoscow.com and picrew.me.):
wiki.archiveteam.org/?diff=48812&oldid=48803
-
h2ibot
Magmaus3 edited URLTeam (+231, /* Alive */ Add cutt.ly):
wiki.archiveteam.org/?diff=48813&oldid=48748
-
h2ibot
ElijahPepe created LGTM.com (+803, Created page with "{{Infobox project | title =…):
wiki.archiveteam.org/?title=LGTM.com
-
h2ibot
DJDunsie edited Deathwatch (+227, /* 2022 */ Lexico):
wiki.archiveteam.org/?diff=48815&oldid=48808
-
Maakuth|m
what timezone is arkiver on? should I try to reach them at US evening hours?
-
TheTechRobo
arkiver: ^
-
Maakuth|m
I'm UTC+03:00 myself
-
arkiver
just leave me a message
-
systwi_
Is it safe to simply `cat example.com-00000.warc example.com-00001.warc > example.com.warc`? Do I need to take note of the original input WARC filesizes in case I want to split them up again?
-
JAA
Assuming the two input files are valid WARCs, yes, that is safe.
-
systwi_
The AT wiki mentions [megawarc](github.com/alard/megawarc) but I'm not sure if it's still needed.
-
TheTechRobo
Megawarc is also broken for me, at least for warc.gz.
-
systwi_
Thanks for the info. I suppose it's probably best to store the original filesizes anyway; that's what, ~2 KB?
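
(Aside, not part of the chat: a minimal Python sketch of the idea discussed here, concatenating WARCs byte-for-byte while recording each input's name and size so the combined file can be split again later. The function names `concat_with_manifest` and `split_by_manifest` and the manifest layout are hypothetical, chosen only for illustration; this is not how megawarc stores its metadata.)

```python
import shutil
from pathlib import Path


def concat_with_manifest(inputs, output):
    """Concatenate files byte-for-byte, like `cat a b > out`.

    Returns a list of {"name", "size"} dicts in input order.
    """
    manifest = []
    with open(output, "wb") as out:
        for path in map(Path, inputs):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
            manifest.append({"name": path.name, "size": path.stat().st_size})
    return manifest


def split_by_manifest(combined, manifest, out_dir):
    """Recreate the original files by copying `size` bytes per manifest entry."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(combined, "rb") as src:
        for entry in manifest:
            remaining = entry["size"]
            with open(out_dir / entry["name"], "wb") as dst:
                while remaining > 0:
                    chunk = src.read(min(remaining, 1024 * 1024))
                    if not chunk:
                        raise ValueError("combined file is shorter than the manifest")
                    dst.write(chunk)
                    remaining -= len(chunk)
```

The returned manifest is plain data, so it could be saved next to the combined WARC with `json.dump` and loaded again before splitting.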
-
JAA
megawarc is useful when you need to merge a larger number of WARCs, I guess. It keeps track of the original files and in theory allows extracting to that again (I think). The version on the AT org should work fine as that's used for all projects. It also does error checks and puts the broken files into a tar.
-
TheTechRobo
JAA: It's possible it's broken for me because of my Python version. I filed an issue a few months ago:
ArchiveTeam/megawarc #5
-
TheTechRobo
Interesting, "move to Python 3" is an Issue. I can't remember if I tried Python 2 or not.
-
JAA
TheTechRobo: That doesn't sound right, and I'm pretty sure the targets run Py 3.
-
TheTechRobo
Yep, it's set to use python 2.
-
TheTechRobo
JAA: Weird.
-
JAA
Issue 5 would indicate that you're giving it something that's neither a .gz nor a .zst file.
-
JAA
But also, you should only give it WARC files, not a .warc.os.cdx.gz file.
-
TheTechRobo
Isn't the point of Megawarc that it converts a directory tree into a WARC, a tar, and a metadata file?
-
JAA
converts a collection of WARCs into*, yes
-
JAA
It doesn't handle other file formats, nor does it check the file format.
-
TheTechRobo
From the README:
-
TheTechRobo
FILE.warc.gz is the concatenated .warc.gz
-
TheTechRobo
FILE.tar contains any non-warc files from the .tar
-
TheTechRobo
FILE.json.gz contains metadata
-
JAA
I'm pretty sure it never checks whether the file is actually a WARC, only whether it decompresses correctly.
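
(Aside, not part of the chat: a rough Python 3 sketch of what a "does it decompress correctly" check on a .warc.gz could look like. This is an assumption-based illustration, not megawarc's actual code; the function name `decompresses_cleanly` is hypothetical. Like the check described above, it says nothing about whether the decompressed bytes are valid WARC records.)

```python
import gzip
import zlib


def decompresses_cleanly(path, chunk_size=1024 * 1024):
    """Return True if the gzip data in `path` decompresses without error.

    gzip.open reads through concatenated gzip members, which is how
    .warc.gz files are usually laid out (one member per record).
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the output; only errors matter here.
            while f.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or corrupt compressed data.
        return False
    return True
```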
-
TheTechRobo
Well, it crashes when `Checking 1652740309829c5a3e1fc0bf20-1_1652740337.315711/funeralhome-1934cdbeadc09ac1a98713bb2b1d8ca41f8f2ec1-20220516-223149.warc.gz`, which should be a valid WARC.
-
JAA
Ok, correction, the packer does indeed still use Python 2.7. Eww...
-
JAA
And true, it should append other files to the tar. test_gz would fail, but that only runs for .warc.gz and .warc.zst (the latter with further filename pattern restrictions for $reasons).
-
JAA
I'd help with debugging, but I banished Python 2 from my systems a long while ago.
-
TheTechRobo
Understandable.
-
TheTechRobo
I did that, until I ran into legacy software with no modern alternative. :/
-
TheTechRobo
Such as megawarc.
-
systwi_
Thanks for the info!