-
pabs
hi all. what's the easiest way to extract a site from web.archive.org? it's penta.debconf.org, which was static HTML with pages that shouldn't change after creation; the server was lost and there are probably no backups
-
OrIdow6
If it uses relative links, easy way is to do wget or similar with an allow regex for web.archive.org/[0-9]+id_/(URL matching site)
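The allow-regex approach above can be sketched roughly like this; the snapshot timestamp, start URL, and exact wget flags are illustrative assumptions, not from the log (`id_` asks the Wayback Machine for the page without its rewriting banner):

```shell
# Allow regex in the shape described above (site and timestamp are examples)
PATTERN='https?://web\.archive\.org/web/[0-9]+id_/http://penta\.debconf\.org/'
START='https://web.archive.org/web/20200101000000id_/http://penta.debconf.org/'

# The crawl itself (not executed here) would be roughly:
#   wget --recursive --accept-regex "$PATTERN" "$START"

# Sanity check: the start URL passes its own allow regex
echo "$START" | grep -qE "$PATTERN" && echo "start URL matches"
```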
-
pabs
probably the cdx API is a better option than that?
-
pabs
-
OrIdow6
Oh, that too
-
OrIdow6
That may not be the "easiest"
-
OrIdow6
If you are not adept with a CLI
-
OrIdow6
curl "https://web.archive.org/cdx/search/cdx?url=query" | grep " 200 " | awk '{ printf("https://web.archive.org/web/%sim_/%s\n", $2, $3) }' | wget --no-directories --retry-on-http-error=429 --input-file=- is what I've used if you are
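For reference, the default CDX output columns that pipeline is parsing are `urlkey timestamp original mimetype statuscode digest length`, so `$2` is the timestamp and `$3` the original URL. A self-contained illustration with a made-up row (not real CDX output):

```shell
# Fabricated example row in the default CDX column order:
# urlkey timestamp original mimetype statuscode digest length
sample='org,debconf,penta)/ 20100615000000 http://penta.debconf.org/ text/html 200 ABCDEF1234567890 4242'

# Matching on the status-code field directly is a bit stricter than grep " 200 "
echo "$sample" |
  awk '$5 == 200 { printf("https://web.archive.org/web/%sim_/%s\n", $2, $3) }'
# -> https://web.archive.org/web/20100615000000im_/http://penta.debconf.org/
```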
-
OrIdow6
But recursive wget is the easiest
-
pabs
I was assuming there would be an existing command-line tool that would use the API and could download everything related to the domain
-
OrIdow6
Not to my knowledge
-
OrIdow6
But what I've put should do the same thing
-
OrIdow6
Unless it's a very big site, in which case you'll have to use the CDX pagination API
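A rough sketch of what paging through the CDX API looks like: request page 0, 1, 2, ... until a page comes back empty. The `cdx_page` function below is a stub standing in for the real HTTP request (something like `curl "https://web.archive.org/cdx/search/cdx?url=penta.debconf.org/*&page=$n"`), so the rows are placeholders:

```shell
# Stub for the per-page CDX request; pages 0 and 1 have rows, later pages are empty
cdx_page() {
  case "$1" in
    0) echo "row from page 0" ;;
    1) echo "row from page 1" ;;
    *) ;;  # empty output signals we are past the last page
  esac
}

# Fetch pages until one comes back empty
n=0
while rows=$(cdx_page "$n"); [ -n "$rows" ]; do
  printf '%s\n' "$rows"
  n=$((n + 1))
done
```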
-
anarcat
-
OrIdow6
OH
-
anarcat
which may or may not use those APIs
-
OrIdow6
*Oh
-
anarcat
i suspect the answer is "not"
-
pabs
anarcat: could you say more about the WARC thing you mention on another channel?
-
anarcat
yes
-
pabs
IA has got the "backup" part of "backup the Internet" down, but they seem to be lacking in the "restore" part :)
-
anarcat
archiveteam operates a bot called archivebot which crawls websites and throws that into WARC files, which is a "standard" (yeah, those) for archiving websites
-
anarcat
if the site was crawled by archivebot, it's a good way to extract the entirety of that crawl
-
anarcat
it can also be used to perform authenticated crawls, with proxies and plugins and so on
-
OrIdow6
I wouldn't call this a deficiency on archive.org's part; it would not be possible for them to support all uses as niche as this
-
anarcat
because then you can reproduce a logged in user
-
anarcat
pabs: it's an "archive", not a "backup" :)
-
anarcat
anyways
-
anarcat
we're archiveteam here, not IA :)
-
pabs
:)
-
OrIdow6
Yeah #internetarchive (also unofficial) might be best for future questions like these
-
anarcat
right
-
anarcat
blame me
-
anarcat
i pointed pabs here :)
-
anarcat
then he blamed me for asking another question in the wrong channel, so we're even
-
anarcat
:)
-
anarcat
anyways, gn
-
pabs
later :)
-
OrIdow6
I don't think (though I don't hold the strongest of opinions here) that it's wrong to put it in -ot, so much as that it's on-topic and may as well be made more prominent in the right channel
-
Wayward-
With the surface densities of a 1.44 MB 3-1/2" diskette, or 1.2 MB 5-1/4" floppy... how would they scale up to 12 TB capacity in inches?
-
Wayward-
will also accept units in miles.
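Taking the question at face value (hedged back-of-envelope: capacity scales with surface area at fixed areal density, so diameter scales with the square root of the capacity ratio; hubs, track pitch, and everything else that makes this silly are ignored):

```shell
# Scale a 3.5" 1.44 MB diskette up to 12 TB at the same areal density
awk 'BEGIN {
  ratio  = 12e12 / 1.44e6        # 12 TB vs 1.44 MB: ~8.3 million x the area
  scale  = sqrt(ratio)           # diameter scales with sqrt of area
  inches = 3.5 * scale           # scaled-up 3.5-inch diskette diameter
  miles  = inches / 63360        # 63360 inches per mile
  printf("%.0f inches (~%.2f miles) across\n", inches, miles)
}'
# -> 10104 inches (~0.16 miles) across
```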
-
Doranwen
lol
-
G4te_Keep3r
yikes, seek time. Really wish LaserDiscs had taken off better; imagine one with Blu-ray density, so it holds... ~200 GB?
-
JAA
Don't worry. Holographic storage will be here soon!!1!
-
Wayward-
any decade now
-
Wayward-
an LD had ~6.25x the surface area of a CD/DVD/BD btw, so we might have had LD-Rs with a capacity of 312 GB
-
IDK
Hi
-
AK
ping ping ping ping
-
nyany
AK: ping ping pong
-
m0nika
🏓