00:48:32 hi all. whats the easiest way to extract a site from web.archive.org? its penta.debconf.org, which was static HTML with pages that shouldn't change after creation and where the server was lost and there are probably no backups 00:51:52 If it uses relative links, easy way is to do wget or similar with an allow regex for https://web.archive.org/[0-9]+id_/(URL matching site) 00:57:15 probably the cdx API is a better option than that? 00:57:28 https://archive.org/help/wayback_api.php 00:59:16 Oh, that too 00:59:23 That may not be the "easiest" 00:59:42 If you are not adept with a CLI 01:01:31 curl "https://web.archive.org/cdx/search?url=query" | grep " 200 " | awk '{ printf("https://web.archive.org/web/%sim_/%s\n", $2, $3) }' | wget --no-directories --retry-on-http-error=429 --input-file=- is what I've used if you are 01:01:37 But recursive wget is the easiest 01:02:23 I was assuming there would be an existing command-line tool that would use the API and could download everything related to the domain 01:02:46 Not to my knowledge 01:02:59 But what I've put should do the same thing 01:03:18 Unless it's a very big site, in which case you'll have to use the CDX pagination API 01:03:20 for completeness, i know of https://github.com/jsvine/waybackpack https://github.com/hartator/wayback-machine-downloader 01:03:34 OH 01:03:34 which may or may not use those APIs 01:03:36 *Oh 01:03:40 i suspect the answer is "not" 01:03:50 anarcat: could you say more about the WARC thing you mention on another channel? 01:04:12 yes 01:04:22 IA has got the "backup" part of "backup the Internet" down, but they seem to be lacking in the "restore" part :) 01:04:40 archiveteam operates a bot called archivebot which crawls websites and throws that into WARC files which is a "standard" (yeah those) for archiving websites 01:04:59 if the site was crawled by archivebot, it's a good way to extract the entirety of that crawl 01:05:13 it can also be used to perform authenticated crawls, wich proxies and plugins and so on 01:05:18 I wouldn't call this a deficiency on archive.org's part; it would not be possible for them to support all uses as niche as this 01:05:22 because then you can reproduce a logged in user 01:05:37 pabs: it's an "archive", not a "backup" :) 01:05:40 anyways 01:05:45 we're archiveteam here, not IA :) 01:05:49 :) 01:06:12 Yeah #internetarchive (also unofficial) might be best for future questions liek these 01:06:20 right 01:06:23 blame me 01:06:26 i pointed pabs here :) 01:06:50 then he blamed me for asking another question in the wrong channel, so we're even 01:06:55 :) 01:06:57 anyways, gn 01:07:08 later :) 01:15:26 I don't think (though I'm not of the strongest of opinions here) that it's wrong to put it in -ot, so much that it's on-topic and may as well be made more prominent in the right channel 01:26:07 With the surface densities of a 1.44 MB 3-1/2" diskette, or 1.2 MB 5-1/4" floppy... how would they scale up to 12 TB capacity in inches? 01:27:04 will also accept units in miles. 01:42:35 lol 02:13:10 yikes seek time. Really wish laserdisks would have taken off better-imagine one but with blueray density so holds...~200gb? 03:25:49 Don't worry. Holographic storage will be here soon!!1! 03:34:06 any decade now 03:42:10 an LD had ~6.25x the surface area of a CD/DVD/BR btw, so might have had LD-R's with a capacity of 312 GB 15:43:14 Hi 19:45:05 ping ping ping ping 21:22:43 AK: ping ping pong 21:26:37 🏓