00:55:15 JustAnotherArchivist edited Template:Infobox project sandbox (+449, Add irc_network parameter (cf. edit 41606) and…): https://wiki.archiveteam.org/?diff=47287&oldid=31242
01:48:30 I'm assuming this is the correct place to ask.
01:48:34 I'm trying to save a webpage with `grab-site` under Debian Bullseye, and I need to import a cookies.txt so the page is grabbed as if I were logged in.
01:48:39 I'm using the following command: grab-site --1 --wpull-args='--load-cookies=/data/cookies.txt' 'https://example.com/'
01:49:07 But for some reason the grabbed page does not show me as signed in.
01:49:43 The cookies.txt was exported from Librewolf using https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/
01:50:15 Is the relevant cookie in the first line of cookies.txt?
01:51:06 The site in question writes multiple cookies. The first line in particular is "# Netscape HTTP Cookie File"
01:51:47 Hmm, ok, so not that wpull bug then.
01:53:06 Passing the same user agent and using the same IP as when I logged into the site in the browser didn't make a difference.
01:53:49 Are the cookies in the request record in the WARC?
01:54:13 I see a cookies.txt inside the output directory, but it's significantly smaller than the one I specified.
01:54:19 If that's what you meant.
01:54:26 ---> #archiveteam-bs, now.
01:54:35 That's where we are, wickedplayer494. lol
01:54:36 oops, thought this was #archiveteam, nvm
01:55:22 systwi: Open the .warc.gz file with zless and look for the first `WARC-Type: request` record. It should have some `Cookie: X` line.
01:58:02 It does have a line like that, yes.
01:59:33 Well, then at least the cookie loading itself works, I guess.
01:59:55 It looks like some cookies match, but there are also new cookies in the WARC not present in the cookies.txt. Maybe from grabbing outlinks.
02:00:24 Yes, and grab-site also has some default cookies, I think. Not sure whether those get loaded if you specify your own --load-cookies though.
02:10:35 For context, I'm trying to grab a Quizlet page.
02:11:21 Looking it over more closely, the WARC seems to contain every cookie that the cookies.txt specifies.
02:14:30 JustAnotherArchivist edited Template:Infobox project sandbox (+214, archiving_type: s/warrior/dpos/, add archivebot…): https://wiki.archiveteam.org/?diff=47288&oldid=47287
02:43:28 Know of anything else I could check/try?
14:42:09 systwi: Have you considered whether the website is doing something like personalizing with JavaScript instead of sending you different pages?
19:24:15 systwi: Play around with the storage inspector of your browser.
19:24:51 Or, one thing I find really useful is Firefox's "Copy as cURL" action in the network inspector, then eliminate curl args until you reproduce it.
19:25:02 That's assuming you've already done what rewby said.
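Two checks follow from the advice above. First, a minimal sketch (not from the chat) for validating the exported cookies.txt with Python's standard library, using the /data/cookies.txt path from the grab-site command. One common gotcha with browser exporters is that session cookies get written with an expiry of 0, and strict loaders silently drop them unless told otherwise:

    # Sanity-check a Netscape-format cookies.txt export before handing it to grab-site.
    # Standard library only; the path is the one from the grab-site command above.
    from http.cookiejar import MozillaCookieJar

    jar = MozillaCookieJar('/data/cookies.txt')
    # ignore_discard/ignore_expires keep session cookies that exporters often
    # write with a 0 expiry; without these flags, the login cookies can vanish.
    jar.load(ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        print(cookie.domain, cookie.name, 'expires:', cookie.expires)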
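Second, a programmatic version of the zless check suggested at 01:55:22, assuming the warcio library (not mentioned in the chat; pip install warcio) and a hypothetical output file name:

    # Print the Cookie header of every request record in the WARC.
    from warcio.archiveiterator import ArchiveIterator

    with open('grab.warc.gz', 'rb') as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == 'request':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                print(uri, '->', record.http_headers.get_header('Cookie'))

If the right cookies show up here but the page still renders logged out, the server is likely keying on something else (user agent, IP, JavaScript-side storage), which is where the later suggestions point.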
21:32:45 What is a warc.zst in the warrior projects, and why is it different from a gz?
21:34:47 zstandard compression instead of gzip. gzip does decent compression, zstd is black magic.
21:35:19 Do you mean a better ratio?
21:35:23 We also use a dictionary with the zstd WARCs (would be possible with gzip but not well supported by the tooling), which makes it *much* more efficient.
21:35:27 Yes, better compression ratio.
21:35:43 Ah
21:36:32 Here's my little script for decompressing .warc.zst: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat
21:37:02 (Requires the zstd tools in PATH.)
21:37:12 On a related note, I find lrzip works wonders for compressing warc.gz files for archival.
High compression windows are great! Wouldn't work here, though.
21:37:37 They won't be compressed properly though. Each record needs to be compressed individually to allow for random access.
21:37:48 Exactly, that's why it wouldn't work here
21:37:50 And that's what wrecks the compression ratio.
21:38:11 Even if that did work, lrzip only really works well with large files (>50 MB), which most WARC entries won't be
21:38:22 The custom dictionary on the zstd WARCs fixes this, because that way you *can* compress the similar parts between records by shoving them into the dictionary.
21:38:23 so it's only good for compressing whole WARCs
21:39:10 Unfortunately, tooling for zstd WARCs so far is ... scarce.
21:41:13 Really, wget-at is the only tool that can write them, and IA's CDX-Writer and related software is the only thing that can read them.
21:41:28 really? wpull can't write?
21:41:46 I don't think there has been a commit to the wpull repo since we invented .warc.zst. lol
21:42:04 Good point
21:42:10 wpull only produces gzipped WARCs.
21:42:22 I was tempted to use ludios_wpull for my project since it's the only one that's decently maintained
21:42:56 Ended up just compiling Python 3.6 and using wpull with it
21:44:54 pyenv ftw :-)
21:45:22 I was going to, but I can't be bothered to install it :P
21:46:45 IIRC I gave up on ludios_wpull because it didn't download anything
21:47:13 Probably would have worked if I had fallen back to my 3.7 install (Debian Buster ftw), but I like to live on the edge with 3.9 :P
21:47:27 I'll have to recompile soon tho, 3.10 just came out
21:49:06 cd ~/.pyenv; git pull; pyenv install 3.10
21:49:07 Done
21:49:08 :-P
21:49:39 3.10.0 *
21:50:18 Shouldn't 3.10 be aliased to the latest 3.10.*?
21:52:43 I don't think pyenv has such aliases, but I'm not sure.
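For reference, a rough Python equivalent of the zstdwarccat script linked at 21:36:32 — a sketch only, assuming the python-zstandard package (a recent version, for read_across_frames) and assuming the layout wget-at writes: a leading zstd skippable frame holding the dictionary, which may itself be zstd-compressed, followed by one compressed frame per record:

    # Decompress a dictionary-compressed .warc.zst to stdout.
    # Sketch based on the description in the chat; pip install zstandard.
    import io
    import struct
    import sys

    import zstandard

    path = sys.argv[1]
    with open(path, 'rb') as f:
        magic, size = struct.unpack('<II', f.read(8))
        # Skippable frames use magic numbers 0x184D2A50 through 0x184D2A5F.
        if not 0x184D2A50 <= magic <= 0x184D2A5F:
            sys.exit('no embedded dictionary frame found')
        dict_bytes = f.read(size)
        if dict_bytes[:4] == b'\x28\xb5\x2f\xfd':  # dictionary itself zstd-compressed
            dict_bytes = zstandard.ZstdDecompressor().stream_reader(
                io.BytesIO(dict_bytes)).read()
        dctx = zstandard.ZstdDecompressor(
            dict_data=zstandard.ZstdCompressionDict(dict_bytes))
        # read_across_frames=True keeps reading past each per-record frame.
        reader = dctx.stream_reader(f, read_across_frames=True)
        while chunk := reader.read(1 << 16):
            sys.stdout.buffer.write(chunk)

Usage would be along the lines of: python3 zstdwarccat.py crawl.warc.zst > crawl.warc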