00:55:15 JustAnotherArchivist edited Template:Infobox project sandbox (+449, Add irc_network parameter (cf. edit 41606) and…): https://wiki.archiveteam.org/?diff=47287&oldid=31242
01:48:30 I'm assuming this is the correct place to ask.
01:48:34 I'm trying to save a webpage with `grab-site` under Debian Bullseye, and I need to import a cookies.txt so the page is grabbed as if I were logged in.
01:48:39 I'm using the following command: grab-site --1 --wpull-args='--load-cookies=/data/cookies.txt' 'https://example.com/'
01:49:07 But for some reason the grabbed page does not show me as signed in.
01:49:43 The cookies.txt was exported from Librewolf using https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/
01:50:15 Is the relevant cookie in the first line of cookies.txt?
01:51:06 The site in question writes multiple cookies. The first line in particular is "# Netscape HTTP Cookie File"
01:51:47 Hmm, ok, so not that wpull bug then.
01:53:06 Passing the same user agent and using the same IP as when I logged into the site in the browser didn't make a difference.
01:53:49 Are the cookies in the request record in the WARC?
01:54:13 I see a cookies.txt inside the output directory, but it's significantly smaller than the one I specified.
01:54:19 If that's what you meant.
01:54:26 ---> #archiveteam-bs, now.
01:54:35 That's where we are, wickedplayer494. lol
01:54:36 oops, thought this was #archiveteam, nvm
01:55:22 systwi: Open the .warc.gz file with zless and look for the first `WARC-Type: request` record. It should have some `Cookie: X` line.
01:58:02 It does have a line like that, yes.
01:59:33 Well, then at least the cookie loading itself works, I guess.
01:59:55 It looks like some cookies match, but there are also new cookies in the WARC not present in the cookies.txt. Maybe from grabbing outlinks.
02:00:24 Yes, and grab-site also has some default cookies, I think. Not sure whether those get loaded if you specify your own --load-cookies though.
02:10:35 For context, I'm trying to grab a Quizlet page.
02:11:21 Looking it over more closely, the WARC seems to contain every cookie that the cookies.txt specifies.
02:14:30 JustAnotherArchivist edited Template:Infobox project sandbox (+214, archiving_type: s/warrior/dpos/, add archivebot…): https://wiki.archiveteam.org/?diff=47288&oldid=47287
02:43:28 Know of anything else I could check/try?
14:42:09 systwi: Have you considered whether the website is doing something like personalizing with JavaScript instead of sending you different pages?
19:24:15 systwi: Play around with the storage inspector of your browser.
19:24:51 Or, one thing I find really useful is Firefox's "Copy as cURL" action in the network inspector, then eliminate curl args until you reproduce it.
19:25:02 That's assuming you've already done what rewby said.
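Two checks follow from the advice above. First, a minimal sketch (not from the chat) for validating the exported cookies.txt with Python's standard library, using the /data/cookies.txt path from the grab-site command. One common gotcha with browser exporters is that session cookies get written with an expiry of 0, and strict loaders silently drop them unless told otherwise:

    # Sanity-check a Netscape-format cookies.txt export before handing it to grab-site.
    # Standard library only; the path is the one from the grab-site command above.
    from http.cookiejar import MozillaCookieJar

    jar = MozillaCookieJar('/data/cookies.txt')
    # ignore_discard/ignore_expires keep session cookies that exporters often
    # write with a 0 expiry; without these flags, the login cookies can vanish.
    jar.load(ignore_discard=True, ignore_expires=True)
    for cookie in jar:
        print(cookie.domain, cookie.name, 'expires:', cookie.expires)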
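Second, a programmatic version of the zless check suggested at 01:55:22, assuming the warcio library (not mentioned in the chat; pip install warcio) and a hypothetical output file name:

    # Print the Cookie header of every request record in the WARC.
    from warcio.archiveiterator import ArchiveIterator

    with open('grab.warc.gz', 'rb') as stream:  # hypothetical file name
        for record in ArchiveIterator(stream):
            if record.rec_type == 'request':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                print(uri, '->', record.http_headers.get_header('Cookie'))

If the right cookies show up here but the page still renders logged out, the server is likely keying on something else (user agent, IP, JavaScript-side storage), which is where the later suggestions point.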
21:32:45 What is a warc.zst in the warrior projects, and why is it different from a gz?
21:34:47 zstandard compression instead of gzip. gzip does decent compression, zstd is black magic.
21:35:19 Do you mean a better ratio?
21:35:23 We also use a dictionary with the zstd WARCs (would be possible with gzip but not well supported by the tooling), which makes it *much* more efficient.
21:35:27 Yes, better compression ratio.
21:35:43 Ah
21:36:32 Here's my little script for decompressing .warc.zst: https://gitea.arpa.li/JustAnotherArchivist/little-things/src/branch/master/zstdwarccat
21:37:02 (Requires the zstd tools in PATH.)
21:37:12 On a related note, I find lrzip works wonders for compressing warc.gz files for archival.
High compression windows are great! Wouldn't work here, though.
21:37:37 They won't be compressed properly though. Each record needs to be compressed individually to allow for random access.
21:37:48 Exactly, that's why it wouldn't work here
21:37:50 And that's what wrecks the compression ratio.
21:38:11 Even if that did work, lrzip only really works well with large files (>50 MB), which most WARC entries won't be
21:38:22 The custom dictionary on the zstd WARCs fixes this, because that way you *can* compress the similar parts between records by shoving them into the dictionary.
21:38:23 so it's only good for compressing whole WARCs
21:39:10 Unfortunately, tooling for zstd WARCs so far is ... scarce.
21:41:13 Really, wget-at is the only tool that can write them, and IA's CDX-Writer and related software is the only thing that can read them.
21:41:28 really? wpull can't write?
21:41:46 I don't think there has been a commit to the wpull repo since we invented .warc.zst. lol
21:42:04 Good point
21:42:10 wpull only produces gzipped WARCs.
21:42:22 I was tempted to use ludios_wpull for my project since it's the only one that's decently maintained
21:42:56 Ended up just compiling Python 3.6 and using wpull with it
21:44:54 pyenv ftw :-)
21:45:22 I was going to, but I can't be bothered to install it :P
21:46:45 IIRC I gave up on ludios_wpull because it didn't download anything
21:47:13 Probably would have worked if I had fallen back to my 3.7 install (Debian Buster ftw), but I like to live on the edge with 3.9 :P
21:47:27 I'll have to recompile soon tho, 3.10 just came out
21:49:06 cd ~/.pyenv; git pull; pyenv install 3.10
21:49:07 Done
21:49:08 :-P
21:49:39 3.10.0 *
21:50:18 Shouldn't 3.10 be aliased to the latest 3.10.*?
21:52:43 I don't think pyenv has such aliases, but I'm not sure.
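For reference, a rough Python equivalent of the zstdwarccat script linked at 21:36:32 — a sketch only, assuming the python-zstandard package (a recent version, for read_across_frames) and assuming the layout wget-at writes: a leading zstd skippable frame holding the dictionary, which may itself be zstd-compressed, followed by one compressed frame per record:

    # Decompress a dictionary-compressed .warc.zst to stdout.
    # Sketch based on the description in the chat; pip install zstandard.
    import io
    import struct
    import sys

    import zstandard

    path = sys.argv[1]
    with open(path, 'rb') as f:
        magic, size = struct.unpack('<II', f.read(8))
        # Skippable frames use magic numbers 0x184D2A50 through 0x184D2A5F.
        if not 0x184D2A50 <= magic <= 0x184D2A5F:
            sys.exit('no embedded dictionary frame found')
        dict_bytes = f.read(size)
        if dict_bytes[:4] == b'\x28\xb5\x2f\xfd':  # dictionary itself zstd-compressed
            dict_bytes = zstandard.ZstdDecompressor().stream_reader(
                io.BytesIO(dict_bytes)).read()
        dctx = zstandard.ZstdDecompressor(
            dict_data=zstandard.ZstdCompressionDict(dict_bytes))
        # read_across_frames=True keeps reading past each per-record frame.
        reader = dctx.stream_reader(f, read_across_frames=True)
        while chunk := reader.read(1 << 16):
            sys.stdout.buffer.write(chunk)

Usage would be along the lines of: python3 zstdwarccat.py crawl.warc.zst > crawl.warc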