12:48:57 So URLs involved in fetching gallery pages in Wysp are tied to window size, as I have spent quite a bit of time figuring out
12:49:22 So playback will not work properly on those, but it should still be fine on individual images
15:05:50 someone archivebot https://geomaticblog.net please, thanks :) no date given at https://geomaticblog.net/2023/07/06/retiring-geomaticblog.net/ but should be a quick job so no issue
15:08:24 spirit: it's in
15:08:29 cheers!
15:13:14 spirit: https://github.com/jsanz/geomaticblog/ and https://jorgesanz.net/ is also being taken care of :)
15:29:11 Barto: <3
18:13:50 Hello?
18:14:17 I have no idea what im doing i wanted to find a deleted or privated youtube video
18:23:59 I'm not an expert at that, but does the video work on web.archive.org?
18:24:25 havent tried it yet lemme check
18:24:56 Yeah its not there
18:26:08 What's the video URL?
18:27:10 Wait a min
18:27:38 https://www.youtube.com/watch?v=8gR1Vm3yoMQ
18:27:43 this is the one
18:28:18 #youtubearchive has a copy. You can ask there and wait patiently until someone has time to pull it from storage.
18:29:25 Thanks. how long will that take?
18:29:32 any estimates?
18:30:42 I'm pretty sure it's a manual process so it could take a few hours to a day depending on who's available (but I'm not part of that project so I don't know the details)
18:31:26 Oh Thanks man
18:32:49 Yep, something like that.
18:39:22 hello
18:39:38 anyone know how to get http/s proxy working with grab-site or wpull?
18:39:46 i have an http proxy working, but having trouble with https proxy
18:40:34 wpull keeps sending a bad request to the proxy. i think its a cert error, but am not sure how to debug it
18:40:46 anyone try this before and know how to get it working?
18:41:16 I know that HTTPS proxying is pretty broken in wpull 2.x. Not overly familiar with the changes in ludios_wpull (which is what grab-site uses), so can't comment on whether it applies there as well.
18:42:13 ahh that sucks, thought it must have been on wpulls end, tried a few different proxies and none worked with https
18:42:32 I think you should get a relatively clear error though, not 400 or similar.
18:42:49 'CONNECT is intentionally not supported' should appear somewhere.
18:42:51 I get: code 400, message Bad request version
18:43:37 Oh wait, you're trying to use an external proxy, not wpull's proxy.
18:43:39 Nevermind then.
18:45:15 yeah trying to use an external proxy
18:46:26 i have a pretty specific crawl im trying to do, and need to modify the logic of wpull to do it. thought the easy way would be to run it through a proxy and have the proxy handle that logic
18:47:53 Hmm, what kind of logic?
18:50:12 Im trying to download all pages from a site that were published from a specific date range
18:50:41 my first idea was to use igsets and see if the date is available in the url path
18:52:03 its not though unfortunately, so I have to crawl a few index pages, and use those pages to find pages between date ranges
19:02:22 hey guys, can someone please archive https://www.progaming.ba because it is shutting down and i want it to be archived.
19:03:14 spirit: saved
19:03:19 yaaay
19:03:29 !a https://www.progaming.ba --igset blogs,badvideos -e 'for VickoSaviour'
19:03:34 wrong place lol
19:03:45 now it's at the right place :-)
19:04:41 VickoSaviour: looks like it's login walled, cant do much
19:04:56 oh fk
19:05:13 so how is it bad?
19:05:34 no account, no data
19:05:49 welp. shit.
19:05:56 :(
19:06:43 and also who tf uses login wall...
19:07:32 VickoSaviour: if you have an account, you can save it yourself by giving your login cookies to https://github.com/ArchiveTeam/grab-site/ (or another spidering program)
19:07:47 OH YES
19:07:56 i have a acc already
19:08:14 that can't go in the wayback machine, but it's better than nothing, and you could upload it to the internet archive if you want
19:13:55 fucking sourceforge broke my https://github.com/SpiritQuaddicted/sourceforge-file-download download script =(
19:20:20 qq44: sometimes a crude selfwritten program for enumerating and then crawling a url list without recursion works for pages like that
19:27:24 fixed, i think
19:57:27 update: progaming.ba is gutted already, nothing to do but contact the admins :(
20:03:22 VickoSaviour: If you do upload it to archive.org, remember that the cookies you pass it will be stored inside the WARC
20:09:48 also any personal data will be saved in the WARC as well such as your username if it’s returned in the pages
20:10:55 true, but moot
20:11:52 :)
20:12:29 just a note I guess to those grab siteing ao3 or something ig
20:15:42 qq44|m: I'm not sure how a proxy would help you there, unless you mangle data there (in which case I sure hope you aren't producing WARCs). I'd do it with a wpull plugin.
20:16:44 Or well, I'd really do it with my own stuff (qwarc) instead, but that's not really user-friendly especially since there's zero documentation.
20:23:42 JAA: I want to preserve a specific directory structure in the WARC. The proxy in this case would take the URL, do some fetches to find the relevant URL within the date range, and return that page to wpull
20:23:50 unless im misunderstanding how proxies work
20:24:27 also in some cases I want the proxy to modify the contents of the page that its sending back to wpull
20:24:56 As long as you don't write that to WARC, that's fine. WARC is supposed to be an exact reproduction of what the target server sent.
20:30:52 i do want to write it to the warc, but i know im misusing warc in this case
20:32:38 i save a lot of documentation, and in those cases I mostly care about having a usable copy of the documentation as opposed to a faithful copy of the web pages
20:33:28 Then I'd recommend at least adding a custom WARC header explaining that in detail. Not sure what I'd call that header, but probably something with an X- prefix.
20:33:40 --warc-header on wpull
20:48:23 what's the progress on reddit.com website? is the content earlier than January of 2021 saved?
20:51:36 100 seconds, longer than some other people.
20:55:59 we should have a leaderboard at some point, JAA
20:57:54 I'd rather spend my time saving shit. :-)
20:58:37 before the shredders reach the data
20:59:04 :)
21:01:24 it's like a conveyer belt we're trapped on, constantly running, with a meat shredder screaming at the end
21:06:26 We should have a bot that makes someone pass a "welcome to IRC" quiz before voicing them...
21:07:01 i have seen such a long time ago lol
21:07:13 read the rules at and enter the password hidden in the rules
21:07:18 but it din't help a lot
21:07:31 s/rules/faq/
21:08:08 it was a game of find password asap, ask question already answered :3
21:09:10 'you do understand you might have to wait minutes or hours for this right' 'yes yes get out of my way i want to type'
21:09:17 :D
21:09:36 I mean, we generally want people to be able to reach us with as few barriers as possible in general.
21:13:02 that too
21:13:19 sometimes there are gems that come in, sometimes you get me :3
22:22:51 https://news.ycombinator.com/item?id=36657829
22:23:21 "InfluxDB Cloud shuts down in Belgium; some weren't notified before data deletion"
22:23:24 Oof
22:24:07 JAA: I tried https proxy with grab-site, but get the same error as wpull 2.0.3. Do you know of any other archiving tools similar to wpull or grab-site that works with https proxies?
22:26:45 qq44|m: No idea, I don't use proxies for archival precisely because of the potential for data corruption.
22:28:58 wget works with proxy, do you know if there is a way to download page requisites with wget?
22:29:16 when ive used it in the past it only downloaded files from the first party domain, no third party files
23:40:57 Came across https://radar.cloudflare.com/domains which has a top 1 million domains list sourced from users of Cloudflare's DNS. Lots of other interesting information on that site including a list of known bots.
23:41:48 in csv format, too!
23:41:50 =]
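
[Editor's note on the date-range crawl discussed between 18:50 and 19:20] The suggestion to enumerate URLs first and then crawl the list without recursion could look roughly like the sketch below. It is only an illustration, not anything the participants wrote: the index page layout (each `<a href>` followed by a `<time datetime="...">` element), the sample HTML, the example.org URLs, and the names `DatedLinkParser` and `urls_in_range` are all assumptions made for the example. A real site would need its own parsing rules.

```python
from datetime import date
from html.parser import HTMLParser
from urllib.parse import urljoin


class DatedLinkParser(HTMLParser):
    """Collect (url, published_date) pairs from an index page.

    Assumes a hypothetical layout where every article link is followed
    by a <time datetime="YYYY-MM-DD..."> element.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.last_href = None  # most recent link still waiting for a date
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the index page URL.
            self.last_href = urljoin(self.base_url, attrs["href"])
        elif tag == "time" and self.last_href and attrs.get("datetime"):
            # Keep only the date part of an ISO 8601 timestamp.
            published = date.fromisoformat(attrs["datetime"][:10])
            self.entries.append((self.last_href, published))
            self.last_href = None


def urls_in_range(index_html, base_url, start, end):
    """Return article URLs whose date falls within [start, end]."""
    parser = DatedLinkParser(base_url)
    parser.feed(index_html)
    return [url for url, published in parser.entries if start <= published <= end]


# Made-up index page used only to exercise the sketch.
SAMPLE_INDEX = """
<ul>
  <li><a href="post-a">A</a> <time datetime="2023-05-30">May 30</time></li>
  <li><a href="post-b">B</a> <time datetime="2023-06-15T08:00:00Z">Jun 15</time></li>
  <li><a href="/docs/post-c">C</a> <time datetime="2023-07-02">Jul 2</time></li>
</ul>
"""

if __name__ == "__main__":
    selected = urls_in_range(
        SAMPLE_INDEX, "https://example.org/blog/", date(2023, 6, 1), date(2023, 6, 30)
    )
    for url in selected:
        print(url)
```

The printed list can be saved to a file and handed to a crawler as a fixed URL list with recursion turned off. Since the filtering happens before the crawl, the responses themselves are fetched unmodified, which avoids the concern raised at 20:24:56 about writing altered data into a WARC.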