-
pabs
hi all. what's the easiest way to extract a site from web.archive.org? it's penta.debconf.org, which was static HTML with pages that shouldn't change after creation; the server was lost and there are probably no backups
-
OrIdow6
If it uses relative links, easy way is to do wget or similar with an allow regex for web.archive.org/[0-9]+id_/(URL matching site)
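The allow-regex approach above can be sketched roughly like this; the snapshot timestamp, start URL, and exact wget flags are illustrative assumptions, not from the log (`id_` asks the Wayback Machine for the page without its rewriting banner):

```shell
# Allow regex in the shape described above (site and timestamp are examples)
PATTERN='https?://web\.archive\.org/web/[0-9]+id_/http://penta\.debconf\.org/'
START='https://web.archive.org/web/20200101000000id_/http://penta.debconf.org/'

# The crawl itself (not executed here) would be roughly:
#   wget --recursive --accept-regex "$PATTERN" "$START"

# Sanity check: the start URL passes its own allow regex
echo "$START" | grep -qE "$PATTERN" && echo "start URL matches"
```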
-
pabs
probably the cdx API is a better option than that?
-
pabs
-
OrIdow6
Oh, that too
-
OrIdow6
That may not be the "easiest"
-
OrIdow6
If you are not adept with a CLI
-
OrIdow6
curl "https://web.archive.org/cdx/search/cdx?url=query" | grep " 200 " | awk '{ printf("https://web.archive.org/web/%sim_/%s\n", $2, $3) }' | wget --no-directories --retry-on-http-error=429 --input-file=- is what I've used if you are
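For reference, the default CDX output columns that pipeline is parsing are `urlkey timestamp original mimetype statuscode digest length`, so `$2` is the timestamp and `$3` the original URL. A self-contained illustration with a made-up row (not real CDX output):

```shell
# Fabricated example row in the default CDX column order:
# urlkey timestamp original mimetype statuscode digest length
sample='org,debconf,penta)/ 20100615000000 http://penta.debconf.org/ text/html 200 ABCDEF1234567890 4242'

# Matching on the status-code field directly is a bit stricter than grep " 200 "
echo "$sample" |
  awk '$5 == 200 { printf("https://web.archive.org/web/%sim_/%s\n", $2, $3) }'
# -> https://web.archive.org/web/20100615000000im_/http://penta.debconf.org/
```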
-
OrIdow6
But recursive wget is the easiest
-
pabs
I was assuming there would be an existing command-line tool that would use the API and could download everything related to the domain
-
OrIdow6
Not to my knowledge
-
OrIdow6
But what I've put should do the same thing
-
OrIdow6
Unless it's a very big site, in which case you'll have to use the CDX pagination API
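A rough sketch of what paging through the CDX API looks like: request page 0, 1, 2, ... until a page comes back empty. The `cdx_page` function below is a stub standing in for the real HTTP request (something like `curl "https://web.archive.org/cdx/search/cdx?url=penta.debconf.org/*&page=$n"`), so the rows are placeholders:

```shell
# Stub for the per-page CDX request; pages 0 and 1 have rows, later pages are empty
cdx_page() {
  case "$1" in
    0) echo "row from page 0" ;;
    1) echo "row from page 1" ;;
    *) ;;  # empty output signals we are past the last page
  esac
}

# Fetch pages until one comes back empty
n=0
while rows=$(cdx_page "$n"); [ -n "$rows" ]; do
  printf '%s\n' "$rows"
  n=$((n + 1))
done
```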
-
anarcat
-
OrIdow6
OH
-
anarcat
which may or may not use those APIs
-
OrIdow6
*Oh
-
anarcat
i suspect the answer is "not"
-
pabs
anarcat: could you say more about the WARC thing you mention on another channel?
-
anarcat
yes
-
pabs
IA has got the "backup" part of "backup the Internet" down, but they seem to be lacking in the "restore" part :)
-
anarcat
archiveteam operates a bot called archivebot which crawls websites and throws that into WARC files, which is a "standard" (yeah, those) for archiving websites
-
anarcat
if the site was crawled by archivebot, it's a good way to extract the entirety of that crawl
-
anarcat
it can also be used to perform authenticated crawls, with proxies and plugins and so on
-
OrIdow6
I wouldn't call this a deficiency on archive.org's part; it would not be possible for them to support all uses as niche as this
-
anarcat
because then you can reproduce a logged in user
-
anarcat
pabs: it's an "archive", not a "backup" :)
-
anarcat
anyways
-
anarcat
we're archiveteam here, not IA :)
-
pabs
:)
-
OrIdow6
Yeah #internetarchive (also unofficial) might be best for future questions like these
-
anarcat
right
-
anarcat
blame me
-
anarcat
i pointed pabs here :)
-
anarcat
then he blamed me for asking another question in the wrong channel, so we're even
-
anarcat
:)
-
anarcat
anyways, gn
-
pabs
later :)
-
OrIdow6
I don't think (though I don't hold the strongest of opinions here) that it's wrong to put it in -ot, so much as that it's on-topic and may as well be made more prominent in the right channel
-
Wayward-
With the surface densities of a 1.44 MB 3-1/2" diskette, or 1.2 MB 5-1/4" floppy... how would they scale up to 12 TB capacity in inches?
-
Wayward-
will also accept units in miles.
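Taking the question at face value (hedged back-of-envelope: capacity scales with surface area at fixed areal density, so diameter scales with the square root of the capacity ratio; hubs, track pitch, and everything else that makes this silly are ignored):

```shell
# Scale a 3.5" 1.44 MB diskette up to 12 TB at the same areal density
awk 'BEGIN {
  ratio  = 12e12 / 1.44e6        # 12 TB vs 1.44 MB: ~8.3 million x the area
  scale  = sqrt(ratio)           # diameter scales with sqrt of area
  inches = 3.5 * scale           # scaled-up 3.5-inch diskette diameter
  miles  = inches / 63360        # 63360 inches per mile
  printf("%.0f inches (~%.2f miles) across\n", inches, miles)
}'
# -> 10104 inches (~0.16 miles) across
```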
-
Doranwen
lol
-
G4te_Keep3r
yikes, seek time. Really wish LaserDiscs had taken off better; imagine one with Blu-ray density, so it holds... ~200 GB?
-
JAA
Don't worry. Holographic storage will be here soon!!1!
-
Wayward-
any decade now
-
Wayward-
an LD had ~6.25x the surface area of a CD/DVD/BD btw, so we might have had LD-Rs with a capacity of 312 GB
-
IDK
Hi
-
AK
ping ping ping ping
-
nyany
AK: ping ping pong
-
m0nika
🏓