-
h2ibot
JustAnotherArchivist edited Template:Infobox project sandbox (+449, Add irc_network parameter (cf. edit 41606) and…):
wiki.archiveteam.org/?diff=47287&oldid=31242
-
systwi
I'm assuming this is the correct place to ask.
-
systwi
I'm trying to save a webpage with `grab-site` under Debian Bullseye, and I need to import a cookies.txt so the page is grabbed as if I were logged in.
-
systwi
I try using the following command: grab-site --1 --wpull-args='--load-cookies=/data/cookies.txt' 'example.com'
-
systwi
But for some reason the page grabbed does not show me signed in.
-
systwi
The cookies.txt was exported from Librewolf using
addons.mozilla.org/en-US/firefox/addon/cookies-txt
-
JAA
Is the relevant cookie in the first line of cookies.txt?
-
systwi
The site in question writes multiple cookies. The first line in particular is "# Netscape HTTP Cookie File"
-
JAA
Hmm, ok, so not that wpull bug then.
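
One way to sanity-check an exported cookies.txt before handing it to a crawler is to load it with Python's standard `http.cookiejar` module and list which cookies apply to the target host. This is only a minimal sketch: the helper name `cookie_names_for`, the sample cookies, and the hosts are made up for illustration, and it says nothing about how wpull itself parses the file.

```python
import http.cookiejar
import os
import tempfile

def cookie_names_for(cookies_path, host):
    """Return sorted names of cookies in a Netscape cookies.txt that apply to `host`."""
    jar = http.cookiejar.MozillaCookieJar()
    # Keep session and expired cookies too, so nothing is silently dropped.
    jar.load(cookies_path, ignore_discard=True, ignore_expires=True)
    names = []
    for cookie in jar:
        domain = cookie.domain.lstrip(".")
        if host == domain or host.endswith("." + domain):
            names.append(cookie.name)
    return sorted(names)

# Hypothetical sample file; fields are tab-separated:
# domain, subdomain flag, path, secure flag, expiry, name, value.
sample = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tFALSE\t2147483647\tsession\tabc123\n"
    "other.test\tFALSE\t/\tFALSE\t2147483647\ttracker\txyz\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
names = cookie_names_for(tmp.name, "www.example.com")
os.unlink(tmp.name)
print(names)
```

If the login cookie does not show up in that list, the export is the likely culprit rather than the crawler.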
-
systwi
Passing through the same user agent and using the same IP as how I logged into the site through the browser didn't make a difference.
-
JAA
Are the cookies in the request record in the WARC?
-
systwi
I see a cookies.txt inside the output directory, but it's significantly smaller than the one I specified.
-
systwi
If that's what you meant.
-
wickedplayer494
---> #archiveteam-bs, now.
-
JAA
That's where we are, wickedplayer494. lol
-
wickedplayer494
oops thought this was #archiveteam nvm
-
JAA
systwi: Open the .warc.gz file with zless and look for the first `WARC-Type: request` record. It should have some `Cookie: X` line.
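
The same check can be scripted instead of paging through the file by hand. The sketch below scans a gzipped WARC for the first `WARC-Type: request` record and returns its `Cookie` header. It is a line-based heuristic rather than a real WARC parser, and the function name and the simplified sample records are made up for illustration.

```python
import gzip
import io

def first_request_cookie(warc_gz):
    """Return the Cookie header of the first request record, or None.

    `warc_gz` may be a path or a binary file object. Record boundaries are
    guessed from lines starting with "WARC/", so this is a heuristic only.
    """
    in_request = False
    with gzip.open(warc_gz, "rb") as stream:
        for raw in stream:
            line = raw.rstrip(b"\r\n")
            lowered = line.lower()
            if line.startswith(b"WARC/"):
                if in_request:
                    return None  # first request record ended without a Cookie line
            elif lowered.startswith(b"warc-type:"):
                in_request = line.split(b":", 1)[1].strip() == b"request"
            elif in_request and lowered.startswith(b"cookie:"):
                return line.split(b":", 1)[1].strip().decode("latin-1")
    return None

# Simplified, made-up records (not fully valid WARC), one gzip member each.
warcinfo = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nsoftware: example\r\n\r\n\r\n"
request = (
    b"WARC/1.0\r\nWARC-Type: request\r\n"
    b"WARC-Target-URI: https://example.com/\r\n\r\n"
    b"GET / HTTP/1.1\r\nHost: example.com\r\nCookie: session=abc123\r\n\r\n\r\n\r\n"
)
sample = gzip.compress(warcinfo) + gzip.compress(request)
found = first_request_cookie(io.BytesIO(sample))
print(found)
```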
-
systwi
It does have a line like that, yes.
-
JAA
Well, then at least the cookie loading itself works I guess.
-
systwi
It looks like some cookies match, but there are also new cookies in the WARC not present in the cookies.txt. Maybe from grabbing outlinks.
-
JAA
Yes, and grab-site also has some default cookies I think. Not sure if those get loaded if you specify your own --load-cookies though.
-
systwi
For context, I'm trying to grab a Quizlet page.
-
systwi
Looking it over closer, the WARC seems like it has every cookie specified that cookies.txt has.
-
h2ibot
JustAnotherArchivist edited Template:Infobox project sandbox (+214, archiving_type: s/warrior/dpos/, add archivebot…):
wiki.archiveteam.org/?diff=47288&oldid=47287
-
systwi
Know of anything else I could check/try?
-
rewby
systwi: Have you considered whether the website is doing something like personalizing with javascript instead of sending you different pages?
-
OrIdow6
systwi: Play around with the storage inspector of your browser
-
OrIdow6
Or, one thing I find really useful is Firefox's "copy as curl" action in the network inspector, then eliminate curl args until you reproduce it
-
OrIdow6
That's assuming you've already done what rewby's said
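
The "eliminate args until you reproduce it" approach can be sketched as a small loop: drop one header at a time and keep the removal only if the page still looks logged in. Everything here is hypothetical, including the header values and the `fake_check` predicate, which stands in for a real request plus a check for a logged-in marker in the response.

```python
def minimize_headers(headers, still_works):
    """Greedily drop headers that are not needed for `still_works` to stay true."""
    needed = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if still_works(trial):
            needed = trial  # this header was not required, leave it out
    return needed

def fake_check(headers):
    # Stand-in for "send the request and see whether the page shows me signed in".
    return "Cookie" in headers and "User-Agent" in headers

copied = {
    "Accept": "*/*",
    "Cookie": "session=abc123",
    "User-Agent": "Mozilla/5.0",
    "Referer": "https://example.com/",
}
minimal = minimize_headers(copied, fake_check)
print(sorted(minimal))
```

Whatever survives the loop is the set of headers the crawler would also need to send.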
-
TheTechRobo
What is a warc.zst in the warrior projects and why is it different to a gz?
-
JAA
zstandard compression instead of gzip. gzip does decent compression, zstd is black magic.
-
TheTechRobo
Do you mean a better ratio?
-
JAA
We also use a dictionary with the zstd WARCs (would be possible with gzip but not well-supported by the tooling), which makes it *much* more efficient.
-
JAA
Yes, better compression ratio.
-
TheTechRobo
Ah
-
JAA
-
JAA
(Requires the zstd tools in PATH.)
-
TheTechRobo
On a related note, I find lrzip works wonders for compressing warc.gz files for archival. High compression windows are great! Wouldn't work here, though.
-
JAA
They won't be compressed properly though. Each record needs to be compressed individually to allow for random access.
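
To illustrate the random-access point: if every record is its own gzip member, a reader that knows a record's byte offset can decompress just that record without touching the rest of the file. A minimal sketch with made-up record contents, using only the standard library:

```python
import gzip
import zlib

# Made-up stand-ins for WARC records.
records = [f"record {i}: ".encode() + b"x" * 50 for i in range(5)]

# Compress each record as a separate gzip member and remember where it starts.
blob = b""
offsets = []
for record in records:
    offsets.append(len(blob))
    blob += gzip.compress(record)

def read_record_at(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    decomp = zlib.decompressobj(wbits=31)  # 16 + 15 selects the gzip container
    # Decompression stops at the end of this member; the rest lands in unused_data.
    return decomp.decompress(data[offset:])

third = read_record_at(blob, offsets[3])
print(third[:12])
```

The concatenation is still a valid multi-member gzip file, so `gzip.decompress(blob)` returns all records back to back.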
-
TheTechRobo
Exactly, that's why it wouldn't work here
-
JAA
And that's what wrecks the compression ratio.
-
TheTechRobo
Even if that did work, lrzip only really works well with large files (>50MB) which most warc entries won't be
-
JAA
The custom dictionary on the zstd WARCs fixes this because that way you *can* compress the similar parts between records by shoving them in the dictionary.
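
The effect of a shared dictionary on per-record compression can be shown with the standard library alone. The sketch below uses zlib's preset-dictionary support as a stand-in for zstd, which is an assumption made purely so the example runs without extra packages. The records are made up, and the "dictionary" is just a template record rather than a trained one.

```python
import zlib

def make_record(ident):
    """Build a made-up record; only the identifier differs between records."""
    return (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: https://example.com/page/{ident}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        "\r\n"
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        "\r\n"
        f"<html><body>Page {ident}</body></html>\r\n"
    ).encode()

def deflate(data, zdict=None):
    """Compress `data` as one zlib stream, optionally primed with a dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate(blob, zdict):
    """Decompress a stream that was written with the same preset dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

records = [make_record(i) for i in range(200)]
dictionary = make_record("TEMPLATE")  # the parts shared between records

whole_size = len(deflate(b"".join(records)))                 # no random access
per_record_size = sum(len(deflate(r)) for r in records)      # random access, poor ratio
dict_size = sum(len(deflate(r, dictionary)) for r in records)  # random access plus dictionary

print(whole_size, per_record_size, dict_size)
```

Each record stays independently decompressible in the dictionary case, provided the reader has the same dictionary.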
-
TheTechRobo
so it's only good for compressing whole warcs
-
JAA
Unfortunately, tooling for zstd WARCs so far is ... scarce.
-
JAA
Really, wget-at is the only tool that can write them, and IA's CDX-Writer and related software are the only things that can read them.
-
TheTechRobo
really? wpull can't write?
-
JAA
I don't think there has been a commit to the wpull repo since we invented .warc.zst. lol
-
TheTechRobo
Good point
-
JAA
wpull only produces gzipped WARCs.
-
TheTechRobo
I was tempted to use ludios_wpull for my project since it's the only one decently maintained
-
TheTechRobo
Ended up just compiling python 3.6 and using wpull with it
-
JAA
pyenv ftw :-)
-
TheTechRobo
I was going to, but I can't be bothered to install it :P
-
TheTechRobo
IIRC I gave up on ludios_wpull because it didn't download anything
-
TheTechRobo
Probably would have worked if I had fallen back to my 3.7 install (debian buster ftw) but I like to live on the edge with 3.9 :P
-
TheTechRobo
I'll have to recompile soon tho, 3.10 just came out
-
JAA
cd ~/.pyenv; git pull; pyenv install 3.10
-
JAA
Done
-
JAA
:-P
-
JAA
3.10.0 *
-
TheTechRobo
Shouldn't 3.10 be aliased to the latest 3.10.*?
-
JAA
I don't think pyenv has such aliases, but not sure.