06:45:40 If a question in #archiveteam could be resolved in one or a few answers I don't think sending them to #bs is necessary
06:45:42 *messages
06:48:16 I directed them to #bs because I wasn't sure if an answer would be available anytime soon (and so far it seems like nobody's had an answer) - better to let them know to check elsewhere immediately than to have it just be silent for a while and like an hour later have someone redirect them IMO
06:50:51 Well I see the point with making it clear to them that someone in the room's not dead
06:53:00 Channel for that is I believe #nintendone BTW, no substantial activity for over a year, but status on that specific service I don't know
10:25:14 #nintendone was originally for Super Mario Maker, rather than the eShop. My understanding: the eShop is copyrighted games that aren't publicly accessible, so I'm not sure there's anything that we can save there.
15:13:50 Hey there, hope this is allowed! I'm trying to get a single photo from the Panoramio archive for my father.
15:13:57 I know the photo ID and I generally see how the warc files are stored in sets of 10 > panoramio-photos_xxxxx0-xxxxx9
15:14:07 So I was just wondering if there is any kind of searchable database or something cause right now I just manually open the collection pages
15:14:12 Sometimes I get close (it's a 9 digit number), but yeah it's slow
15:14:18 Once again sorry if it's not allowed here
15:18:08 gydgyd: absolutely allowed
15:18:21 you could PM me the ID privately?
15:19:03 Yes!
16:37:15 so I notice that this page doesn't have a paywall, at least not with my cookies/ip: https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7
16:37:43 but the archive.org page shows a paywall: http://web.archive.org/web/20230302015333/https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7
16:38:01 I get a similar paywall if I wget the url
16:39:18 I wonder if this reveals a shortcoming in the warc format?
if I can record my interaction with a web site, I should be able to emulate that web site by serving some archive file, no?
16:40:25 perhaps it's not an issue with the archive format, maybe the crawler is just not getting the non-paywalled version due to its IP being known?
16:41:28 I believe getting past the paywall requires running some JS and possibly allowing the JS to make requests; do archive.org or other archive tools record that part of the interaction?
16:43:10 web.archive.org does run JS when saving; archivebot doesn't
16:43:34 archive.is does a better job for this article: https://archive.is/7MyEE
16:44:25 so what's the difference between the methods of web.archive.org and archive.is?
16:44:43 if the former runs JS, archive.is is doing something more to save the page
16:45:39 It's possible that ft.com has archive.org blacklisted to always get the paywall, not sure
16:45:58 (or they have some kind of per-IP limit, and archive.org has hit that while archive.is hasn't)
16:47:10 archive.is seems to store the page in a rendered form that doesn't include e.g. the sticky navigation bar or the cookie banner
16:47:46 Yeah, they don't store it in WARC and it's not suitable for a normal replay
16:47:53 I'm getting a paywall on that ft.com link and I have JS enabled. Maybe it's a geoblocking thing
16:47:56 They're also signed in to accounts on some sites (e.g.
github)
16:48:11 kind of seems like archive.is converts the state of the dom into html so that you can reproduce the page without running the same JS
16:48:14 archive.org does have an option to save a screenshot: https://web.archive.org/web/20230317164443/http://web.archive.org/screenshot/https://www.ft.com/content/80818949-cbf6-4830-8703-0e561e2fead7 - but it looks like that didn't help
16:49:25 I would like to investigate the archive.is storage format but the "download .zip" function 404s
16:52:39 idk what heuristics ft uses, but it shows paywall to me as well; archive.is used to go to great lengths to avoid getting blocked by facebook etc, so maybe they just know which ip to access ft from to get the content
17:03:50 since I'm able to get the content in my browser, I wonder if there's an archive tool I could run on my computer to produce an archive with the page content?
17:41:45 cm: yes. the tool we most commonly recommend for this purpose is warcprox: https://github.com/internetarchive/warcprox
17:42:57 once you've recorded a warc, you can view it in a player like replayweb.page: https://replayweb.page/, https://github.com/webrecorder/replayweb.page
18:22:43 Jake: I believe there was talk of using it for other Nintendo-related things
18:24:17 But what you say on the EShop sounds right
19:18:36 Remember when we backed up all of the Intel Download Center? Is there a convenient way to search that? I tried putting the model number into archive.org, but I'm not getting useful results.
19:19:23 wb machine, I mean.
19:34:22 https://web.archive.org/web/*/example.com/* might help for that depending on how the URLs are structured, but it also might not (and there's a 10k limit for that view)
19:39:48 Looks like someone SPNd the list pages e.g. https://web.archive.org/web/20190901145929/https://downloadcenter.intel.com/product/80939/Graphics-Drivers
19:40:07 To the effect that the "show more" thing works
19:42:34 Specifically looking for updated BIOS for DCP847SKE (aka.
DCP847DYE for some reason). I found a newer one on some download site, but with no documentation, and I have no idea if it's the latest :(
19:43:30 Axing all of that was such a brain damaged move by Intel.
19:57:01 There's no convenient way for searching, but we should have everything that existed at the time of archival. IIRC, we previously saw that things had been removed already.
19:59:02 https://web.archive.org/web/20190901155237/https://downloadcenter.intel.com/product/71620/Intel-NUC-Board-DCP847SKE ?
20:29:47 OrIdow6: You did what I could not. Thanks.
20:36:27 https://twitter.com/lancereddick he passed away today, dunno if he has an official site too
22:31:45 Hey, is there a way to archive someone else's Twitter account, including media?
22:33:08 socialbot can save tweets, things linked from tweets, and images in tweets (but not videos in tweets) and upload it to web.archive.org automatically. It uses snscrape internally which is generally available
22:33:49 socialbot lives in #archivebot except it's down for maintenance right now.
22:34:46 Plus you need to be voiced to use it, but I'd be happy to run the command for you when it's back online
22:36:25 Is it able to at least flag which videos are there? I can feed a list to yt-dlp
22:36:45 Anyway, thanks, I didn't know about snscrape
22:37:02 snscrape extracts videos (but doesn't download them itself). socialbot does not.
22:38:52 Oh whoops, I wonder how "reacts" are translated here
22:39:36 https://github.com/JustAnotherArchivist/snscrape for reference
23:27:12 thuban: I tried wget with the --warc-file option, since it's packaged for my distro
23:27:32 but the result does not include the page content
23:28:14 that's why i suggested warcprox
23:28:18 cm: Upstream wget's WARC code is buggy.
23:28:20 do you think warcprox is more likely to be able to fetch the page content?
23:28:27 JAA: still?
23:28:31 Yes, still.
23:28:32 yes, because it relies on your browser to decide what to request
23:28:51 The bump to WARC/1.1 never happened, sadly.
23:29:05 is it still going to?
23:29:13 And they're reluctant to fix their misleading WARC/1.0 output.
23:29:43 Not a clue. I've poked darnir about it a couple times without success.
23:30:27 thuban: ahh that makes sense
23:30:29 Most work is going towards wget2 now, I think, which doesn't support WARC yet at all.
23:31:14 so there's not really a command line tool that will pretend to be a browser and run JS in order to get the full version of a page?
23:31:42 anyone know how archive.org does it?
23:31:49 web.archive.org does run JS when saving; archivebot doesn't
23:32:12 I've heard of something called brozzler or something but I'm not sure if that's what they use or any details
23:32:42 it is https://github.com/internetarchive/brozzler
23:33:18 Yeah, the current iteration of SPN uses brozzler.
23:33:26 (As far as I know, anyway.)
23:33:28 (it's possible to run this locally, but this requires rethinkdb and i would not call it convenient)
23:33:57 Everything that comes from the webrecorder people should be avoided for capturing traffic.
23:34:17 the webrecorder people?
23:34:54 So yeah, brozzler or something with a similar approach (browser with automation + a MITM proxy that writes to WARC) is the way to go. I'm not aware of anything other than brozzler that doesn't come from webrecorder.
23:35:05 https://github.com/webrecorder
23:35:29 cool I see
23:35:35 https://github.com/webrecorder/warcio/issues/created_by/JustAnotherArchivist for a selection of reasons why their stuff needs to be avoided.
23:35:59 They write inaccurate WARCs and don't seem to care too much about it.
23:36:53 hm I see
23:37:23 unfortunately, it's not really possible to "run JS in order to get the full version of a page" in the general case, due to site-specific interactive elements, etc.
23:37:23 thanks for helping me understand how this stuff works
23:38:00 I think I might email the archive.is guy to ask how he generates his archives
23:39:06 (although the general approaches are different, both archive.today and archiveteam's own projects involve some degree of manual tailoring per site.)
23:40:27 example: http://archive.today/2023.03.16-070651/https://github.com/ is signed in
23:40:48 WTF is that date format? lol
23:41:23 It also apparently supports a saner one but that's the one you get from the share link :/
23:42:00 (https://archive.ph/20230316070651/https://github.com - just get rid of the punctuation)
23:42:03 Also interesting, that leaks some private repos. Hmmm...
23:42:38 You've also got http://archive.today/20230302022032/https://github.com/notifications?query=is:unread
23:43:33 lol
23:43:48 does it leak the code for the site somewhere?
23:44:46 Aw: https://archive.ph/4Niz4
23:45:19 what's volth
23:45:38 The account used for those logged-in snapshots.
23:58:50 I guess there is _some_ justification to keep things like archive.today closed source, lest certain sites take counter-measures against the various methods of obtaining content
23:59:14 I wonder if archiveteam is ever worried about that?
23:59:38 ripcord, the unauthorized discord/slack client, is in a similar position
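
For reference, the timestamp conversion mentioned at 23:42:00 ("just get rid of the punctuation") can be sketched in Python. This is only a minimal sketch: it assumes the share-link timestamp always has the form YYYY.MM.DD-HHMMSS, as in the one example from the log, and the function name normalize_archive_today_url is hypothetical, not part of any archive.today tooling.

```python
import re

# Assumed share-link timestamp shape: /YYYY.MM.DD-HHMMSS/ (taken from the
# single example in the log; other shapes are left untouched).
_SHARE_TS = re.compile(r"/(\d{4})\.(\d{2})\.(\d{2})-(\d{6})/")


def normalize_archive_today_url(url: str) -> str:
    """Drop the punctuation from the first share-link timestamp in the URL."""
    return _SHARE_TS.sub(lambda m: "/" + "".join(m.groups()) + "/", url, count=1)


if __name__ == "__main__":
    print(normalize_archive_today_url(
        "http://archive.today/2023.03.16-070651/https://github.com/"
    ))
```

URLs that already use the compact 14-digit form pass through unchanged, since the pattern only matches the dotted variant.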