-
h2ibot
JustAnotherArchivist edited Onlyfiles (+432, It's dead, Jim):
wiki.archiveteam.org/?diff=49404&oldid=49375
-
h2ibot
JustAnotherArchivist edited Onlyfiles (+59):
wiki.archiveteam.org/?diff=49405&oldid=49404
-
audrooku|m
Re: WBM CDX Stuffs
-
audrooku|m
JAA: Thanks for the response, I really appreciate it.
-
audrooku|m
Do even privileged users in archiveteam have access to the wbm grab items or is that something I have no chance of touching?
-
JAA
I'm sure Jason has access, but he works at IA, so...
-
JAA
Otherwise, I'm almost certain the answer is no.
-
h2ibot
JustAnotherArchivist edited Deathwatch (+164, /* 2023 */ Add MyWOT forum):
wiki.archiveteam.org/?diff=49407&oldid=49406
-
JAA
Haha qwarc goes brrrrr :-)
-
JAA
Looks good, I should have a complete copy of the topic pages in 20-ish minutes.
-
JAA
Done, and apart from one topic with an SQL error, it ran almost suspiciously smoothly.
-
pabs
for sites with a bunch of JS links on one page, is it possible to pass AB multiple entry points for !a instead of just one?
-
JAA
pabs: !a < exists, but it has various sharp edges that can backfire badly with regards to recursion, which is why it's undocumented. Depending on the URL list, site structure, and retrieval order, it might recurse further or less far than you'd think.
-
pabs
for eg I wanted to archive
hungry.com but
hungry.com/people only has JS links
-
pokechu22
In this case, an !a < list would probably work, since it's (at least assumed to be) safe if you can do e.g.
hungry.com/~alves hungry.com/~beers etc (all URLs without a slash in the path)
-
pokechu22
Looks like
hungry.com/hungries/hungries.js stores all of that stuff. AB does sometimes directly extract links from javascript, but I'm not sure if it would extract these or not (I think it will extract the images like /hungries/watkins0.jpg but I'm less sure if it extracts full URLs for whatever reason)
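The kind of link extraction being discussed usually comes down to pattern-matching quoted strings in the JavaScript source. A crude sketch of that idea (this is not ArchiveBot's actual extractor, and the sample JS below is made up):

```python
import re

# Crude sketch of regex-based link extraction from JavaScript source;
# this is NOT ArchiveBot's actual extractor, and the sample JS is made up.
JS_SAMPLE = 'img.src = "/hungries/watkins0.jpg"; var home = "http://hungry.com/~alves/";'

def extract_links(js_source):
    # Pull quoted strings that look like absolute URLs or root-relative paths.
    return re.findall(r'["\'](https?://[^"\']+|/[^"\']+)["\']', js_source)

links = extract_links(JS_SAMPLE)
```

A real extractor has to decide which quoted strings are links at all, which is why behavior can differ between relative paths like `/hungries/watkins0.jpg` and full URLs.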
-
audrooku|m
Re: wbm cdx stuff again
-
audrooku|m
Thanks JAA, figures haha, didn't realize Jason was considered part of AT
-
JAA
audrooku|m: Well, he founded it, so yeah. :-) He isn't around much these days though.
-
pokechu22
JAA: How does qwarc handle threads with multiple pages?
-
audrooku|m
Ah huh, didn't know haha, makes a lot of sense though ;)
-
pokechu22
(my impression is that it just indiscriminately downloads a series of URLs, and you'd generate an incremental list of thread IDs and download those, in which case you'd need special logic to know if multiple pages exist. but I might be wrong on this)
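Generating that kind of incremental URL list is trivial; a sketch, assuming sequential thread IDs (the URL pattern below is hypothetical, not MyWOT's real one):

```python
# Hypothetical sketch of the "incremental list of thread IDs" approach;
# the URL pattern is made up, not a real forum's.
BASE = "https://forum.example.com/t/{}"

def thread_urls(first_id, last_id):
    return [BASE.format(i) for i in range(first_id, last_id + 1)]

urls = thread_urls(1, 5)
```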
-
JAA
pokechu22: qwarc's just a framework for grabbing stuff with very little overhead. It doesn't do anything on its own. You need to write code (called a spec file, I might change that term at some point) to tell it what to actually do. And yes, my code for MyWOT handled topic pagination (and session IDs and the language switcher).
-
pokechu22
Ah, it's not just a version of !ao < list that doesn't extract images, it actually does know how to look at the response to extract more links (if you tell it how to). OK, good to know
-
JAA
No, it does not. You need to write that code to make it do that.
-
pokechu22
Right, but I was assuming that it was *just* capable of doing lists without any way of expanding on it
-
JAA
It provides an interface for 'fetch this URL' with a callback to decide what to do with the response (accept, retry; write to WARC or not). And it handles the HTTP and WARC writing. Plus there's a CLI for actually running it, with concurrency and parallel processes and stuff.
-
JAA
All other logic is user-provided in the spec file.
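The division of labour described above, framework fetches and writes WARCs, user callback decides what happens, can be sketched like this. This is a generic illustration of the pattern, not qwarc's actual API; every name here is hypothetical:

```python
import asyncio

# Hypothetical sketch of the fetch-with-callback split described above;
# none of these names come from qwarc's actual API.

WARC_RECORDS = []  # stands in for the WARC file

async def http_get(url):
    # Placeholder for the real HTTP layer.
    return f"<html>body of {url}</html>"

def write_warc(url, body):
    WARC_RECORDS.append((url, body))

async def fetch(url, decide, attempts=3):
    """Framework side: retrieve the URL, then let the user-supplied
    callback decide what to do with the response."""
    for _ in range(attempts):
        body = await http_get(url)
        action = decide(url, body)  # 'accept', 'skip', or 'retry'
        if action == "retry":
            continue
        if action == "accept":
            write_warc(url, body)
        return body
    return None

# User side (the "spec file"): all per-site logic lives in the callback.
def decide(url, body):
    return "accept" if "body of" in body else "retry"

result = asyncio.run(fetch("https://example.com/", decide))
```

Pagination, session handling, and recursion would all live on the user side of this split, which matches the description of the MyWOT spec file.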
-
JAA
I did write a spec file a long while ago that does recursive crawls. That's what I was referring to in #archivebot earlier.
-
JAA
But on its own, qwarc doesn't even look at the HTTP body at all. It just downloads it and (unless prevented) writes it to the WARC.
-
JAA
That's also part of why it's so efficient.
-
JAA
This is what the spec file for the MyWOT Forums looks like:
transfer.archivete.am/GDKC3/forum.mywot.com.py
-
arkiver
JAA: as always feel free to put the outlinks from that site in #// :)
-
JAA
arkiver: Yup, and I'll have a bunch from others as well. :-)
-
arkiver
awesome!
-
pabs
pokechu22: hmm, how would one find out? guess I could just try !a
hungry.com/hungries/hungries.js ? and separately !a the other stuff
-
pokechu22
I think the extraction behaves differently if you save the js file itself versus saving a page containing it, so I'm not sure exactly. Probably the easiest thing to do would be just !a
hungry.com and watch the log to see whether it finds those pages or not
-
pokechu22
In any case, I'm going to sleep
-
arkiver
did we run all known twu.net sites through AB?
-
arkiver
I never got a reply from them, btw
-
arkiver
oh
-
arkiver
i see a reply in my spam saying the email to support⊙tn could not be delivered
-
ano
i'd have thought there'd be more available logs for ircs around here -- must be a good sum of wisdom, resources, and history that passes through
-
audrooku|m
I thought so too
-
Ryz
Hmm, is there an ongoing project where there's constant archiving of Pastebin and Pastebin-like services?
-
Retrofan
Hi
-
Retrofan
I was wondering... what is the (Readline timed out) error?
-
Retrofan
and how to fix it?
-
pokechu22
That just means that the site didn't respond in time. If it's happening for everything on the site, the site might have blocked you. If it's happening somewhat randomly, and in particular the site has lots of large files, try setting the concurrency to 1
-
Retrofan
Oh, yeah, it's happened randomly. I'll give your solution a shot
-
Retrofan
pokechu22: Thank you, that made it work better... but I've noticed that some "soft 404" pages (or redirects) on other sites return "Readline timeout"
-
pokechu22
Hmm, that probably depends on the site as well
-
pokechu22
At least with archivebot, pages that give an error like that are automatically retried after everything finishes (and will be attempted up to 3 times total)
-
Retrofan
It would be better if it archived those "soft error" pages so that the website experience (via WARC) is more similar to the original...
-
pokechu22
If it's an actual 404 (or 3XX or similar) error code, it'll be saved. An example of an actual readline timeout is... e.g.
holypeak.com/talent/voiceactor/shuka_saito.html which isn't something that it would make sense to save
-
pokechu22
hmm, actually, that one eventually redirected to a cloudflare error page (after a minute), not a perfect example as that could be saved, but archivebot gives up after 30 seconds (IIRC)
-
pokechu22
ja.net/company/policies/aup.html might be a better example then?
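The distinction being drawn here is that a real HTTP error status still produces a response that can be saved, while a readline timeout produces nothing at all. A generic sketch of that triage logic (not ArchiveBot's internals):

```python
# Generic sketch, not ArchiveBot's internals: a response with a real
# status code (even a 404 or a 3xx redirect) can be written to the
# WARC, but a readline timeout means nothing was received, so there
# is nothing to save and the URL is retried instead (up to 3 attempts
# total, per the discussion above).

def classify(status=None, timed_out=False):
    if timed_out:
        return "retry"   # no response at all; try again later
    if status is not None:
        return "save"    # 404s and 3xx redirects are still saved
    return "skip"
```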
-
Retrofan
Can I extend the archive bot's timeout?
-
pokechu22
I don't know. I assume you're using grab-site, which may or may not let you change that easily; I'm not familiar with it. (#archivebot via IRC doesn't support changing the timeout though)
-
Retrofan
I am not using grab-site; I am using the full archive bot system (with pipelines)
-
Retrofan
Anyway, I was thinking of making archivebot check the website (perhaps for a keyword or something) and archive it if the check passes, and skip it if it fails.
-
Retrofan
but I don't know how to do that...
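The keyword-check idea could be sketched like this, outside of ArchiveBot itself. All names and URLs here are hypothetical, and the fetching step is stubbed out as an already-retrieved URL-to-body mapping:

```python
# Hypothetical sketch of the keyword pre-check idea described above;
# the fetching step is stubbed out, and all names and URLs are made up.

def should_archive(page_body, keyword):
    return keyword in page_body

def triage(pages, keyword):
    # Keep only the URLs whose body contains the keyword.
    return [url for url, body in pages.items() if should_archive(body, keyword)]

pages = {
    "https://example.com/a": "forum powered by Discourse",
    "https://example.com/b": "parked domain",
}
selected = triage(pages, "Discourse")
```

In practice the pre-check would have to fetch each candidate page first, which is exactly the part ArchiveBot itself doesn't expose a hook for.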