01:33:47 [tell] Inti83: [2023-12-03T19:51:39Z] https://hackint.logs.kiska.pw/archiveteam-bs/20231203#c393229
01:34:07 Thanks XD was just reading logs
01:36:24 thanks - we did see that cont.ar and cine.ar have Cloudflare; someone here said they may have a local contact at the Cabase IXP in Argentina
01:37:00 We are doing all we can with grab-site - thanks for the testing tool, it helps!
01:38:08 Also, we were wondering if there is anything that can keep grab-site from starting over from the beginning when it fails - some flag to resume from where it left off; we couldn't find anything in the docs
01:38:58 That is a *long*-standing wishlist entry: https://github.com/ArchiveTeam/grab-site/issues/58
01:39:04 So, no.
03:52:30 nicolas17: #gitgud and archive.softwareheritage.org for GitHub repos :) (and #codearchiver for other git repos)
03:54:14 Inti83 edited Argentina (-4, /* Guidelines for Adding Websites */): https://wiki.archiveteam.org/?diff=51252&oldid=51242
04:44:06 https://twitter.com/kogekidogso This account is now inactive. It looks like it was saved with ArchiveBot when I brought this up last week, but the associated querie.me account might not be.
04:44:07 nitter: https://nitter.net/kogekidogso
04:45:13 (And peing)
04:45:21 Yes, it was run through ArchiveBot, but that was only a very superficial crawl. I'll rerun it as soon as there's space.
04:45:36 Thank you
04:45:51 Querie.me looks scripty.
04:47:07 I can't seem to get to any account page there...? Or is it just a matter of following the links in their tweets?
04:48:39 You can get a list of their answers as "recent answers" here: https://querie.me/user/r1OYTzyfrTY0Fn4ZIBXI4nEJPs63/recent
04:48:52 However, they use an infinite scrolling thing
04:49:09 Yeah, that page is entirely useless without JavaScript, so archiving it is going to be difficult.
04:51:48 hm
04:51:57 I'm now seeing if I can load the list by holding the down arrow key
04:52:44 I'm trying to do some curl magic.
04:52:52 JAA: suppose I write code to archive a querie.me page, by parsing the JS crap if needed to figure out what URLs to recurse into
04:53:01 They only load 5 answers per request by default, but you can do far more.
04:53:26 1000 is slow but works. :-)
04:53:34 we're not mass-archiving the entire site so this is not a DPoS project, just for one-off pages
04:53:54 how should I write that code? would a wget-at lua script be appropriate anyway?
04:53:55 No extra URLs need to be fetched for the individual answers, it seems.
04:54:14 So I'll just do the user page API crap and then throw the answer page URLs into AB.
04:54:51 I see, loading 1000 at once is much more efficient than me scrolling down endlessly
04:55:06 ah hm, I guess it could be a script in any technology that produces a URL list for ArchiveBot
04:58:23 There are over 2000 answers, so yes. :-)
04:59:23 The very technologically advanced extraction:
04:59:31 `function querie { pp="$1"; curl "https://querie.me/api/qas?kind=recent&count=1000&userId=r1OYTzyfrTY0Fn4ZIBXI4nEJPs63${pp}" -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/120.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://querie.me/user/r1OYTzyfrTY0Fn4ZIBXI4nEJPs63/recent' --compressed -s | tee
04:59:37 "/tmp/querie-json-${pp}" | jq -r '.[] | .id' >"/tmp/querie-ids-${pp}"; ls -al /tmp/querie-{json,ids}-"${pp}"; pp="&startAfterId=$(tail -n 1 "/tmp/querie-ids-${pp}")"; printf '%q\n' "$pp"; }`
05:00:10 No loop because I wasn't sure how it'd behave at the end.
05:00:22 (It returns an empty array then.)
05:05:39 It looks like you'll get an empty array as the response at the end, from observing another user
05:06:12 there must be some prize we can award JAA for that one-liner
05:06:32 arkiver: This isn't even my final form.
05:06:42 oh no
05:08:53 This is the one-liner I'm most proud of so far: `try-except` in a single line in Python: https://web.archive.org/web/20230311201616/https://bpa.st/657RU
05:08:56 Meanwhile, peing.net simply has a page number (3/page)
05:09:39 i think we now have a bot for queuing for all long-term channels except #shreddit (which by default gets everything)
05:10:13 JAA: well... congrats i guess :P
05:10:32 Naruyoko: what is up with peing.net?
05:10:49 The person has an account there too
05:13:19 I don't know how well the individual pages save, since it's excluded
05:13:39 arkiver: I *think* you can rewrite any Python code as a single line. Pattern matching and exception groups are hard, but it should be possible. I have this idea of writing a tool to do the conversion. Maybe when I'm retired or something. :-P
05:14:02 JAA: conversion to one-liners?
05:14:04 or from
05:14:12 To
05:14:15 oh no
05:14:17 :-)
05:14:24 thisisfine
05:14:34 no god, please, no...
05:14:42 We'll be able to ship our pipeline.py as a single line! Imagine the savings from not having to store all those LF characters!
05:14:49 79 chars please!
05:15:04 The 80s called, they want their monitor back. :-P
05:15:08 i feel an old discussion coming up
05:15:13 :-)
05:15:37 * arkiver and JAA have fundamental differences when it comes to Python line length
05:15:46 Nah, it's obviously just for fun to prove that it's possible. The Python grammar very much requires separate lines in several places. `try-except` is one of them.
05:15:55 Hence that complicated one-liner to still do it.
05:16:03 maybe we'll make JAA into one line
05:16:12 one-dimensional JAA
05:16:16 no more 3 dimensions
05:16:31 just for fun :)
05:16:48 Like your Python code is basically one-dimensional because it has no width? :-)
05:17:14 like my Python code is basically one-dimensional because it has no width, exactly!
05:17:53 No depth to it either, I guess. :-P
05:18:56 yes
05:19:04 nice clean Python code without personality
05:24:29 Here's the API data from Querie for that account as JSONL, because I had it anyway: https://transfer.archivete.am/whGJs/querie.me_user_r1OYTzyfrTY0Fn4ZIBXI4nEJPs63.jsonl.zst
05:25:28 Produced by concatenating the querie-json-* files in the right order + `jq -c '.[]'`
05:25:43 (Yes, this could be done better, and I would if I had to do this more than once.)
05:27:13 The job for those 2003 answers is running now.
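The shell function above deliberately has no loop (04:59 and 05:00); since the API turned out to return an empty array at the end, the pagination can be automated. Below is a minimal Python sketch of the same extraction, assuming only what is visible in the curl command and jq filter: the /api/qas endpoint with its kind, count, userId and startAfterId parameters, the `id` field on each answer, and an empty array marking the end. The output filename is illustrative.

```python
# Minimal sketch of looping the querie.me extraction shown above until the
# API returns an empty array. Only the parameters visible in the curl
# command are assumed; the output filename is illustrative.
import json

import requests

API = "https://querie.me/api/qas"
USER_ID = "r1OYTzyfrTY0Fn4ZIBXI4nEJPs63"


def fetch_all_answers(user_id, page_size=1000):
    """Yield every answer object, paginating with startAfterId."""
    start_after = None
    while True:
        params = {"kind": "recent", "count": page_size, "userId": user_id}
        if start_after is not None:
            params["startAfterId"] = start_after
        batch = requests.get(API, params=params, timeout=60).json()
        if not batch:  # empty array = no more answers, as observed above
            return
        yield from batch
        start_after = batch[-1]["id"]  # same field the jq filter extracts


if __name__ == "__main__":
    # Produce the same JSONL that was assembled by hand with jq -c '.[]'
    with open("querie.jsonl", "w") as out:
        for answer in fetch_all_answers(USER_ID):
            out.write(json.dumps(answer) + "\n")
```

The collected answer IDs could then be turned into the answer page URLs that get thrown into ArchiveBot, as was done for the 2003 answers above.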
06:42:24 https://transfer.archivete.am/STyXI/peing.net_18kogekisoccer.txt
06:42:25 inline (for browser viewing): https://transfer.archivete.am/inline/STyXI/peing.net_18kogekisoccer.txt
07:03:07 Naruyoko: Thanks, done.
13:56:51 http://fileformats.archiveteam.org/ is down
14:37:39 thanks qwertyasdfuiopghjkl, i reported it to someone who may be able to fix it
15:08:30 perhaps we could add it to #nodeping?
15:32:42 Megame edited Deathwatch (+130, http://www.baanboard.com/ - Dec 31): https://wiki.archiveteam.org/?diff=51253&oldid=51232
17:15:00 https://www.fz.se/ has supposedly existed since 1996; might it be a good idea to do a proactive grab?
18:00:07 digitize.archiveteam.org is also (still) down
18:01:05 digitize.archiveteam.org is permanently down, and its contents were integrated into the main wiki years ago, from what I've been told.
19:03:03 Hi, can anyone please teach me how to add a webpage to the wayback machine crawls? At least the main website every 6 hours. If the linked pages from it could be auto-archived also, even better. Maybe a check for each page if anything significant has changed instead of making too many duplicates
19:06:04 why do you want it to be crawled so often
19:10:37 yesterday savepagenow had a global backlog of like 18 hours lol
19:14:05 e853962747e3759: Were you here a couple days ago already?
19:14:50 god dammit you spooked them
19:14:57 :P
19:15:07 lol
19:15:29 I was so ready to give them my "instructions.jpg" :P
19:18:03 ;d
19:18:30 webchat needs like a 'btw, if you leave this in a background tab you'll disconnect' warning
19:18:42 gone are the times of tabs just having fun 24/7
20:38:34 Hi, can anyone please teach me how to add a webpage to the wayback machine crawls? At least the main website every 6 hours. If the linked pages from it could be auto-archived also, even better. Maybe a check for each page if anything significant has changed instead of making too many duplicates
20:38:46 e853962747e3759: Were you here a couple days ago already?
20:41:05 so is this some sort of faux intellectual elitest thing? I am not worthy of the knowledge to be able to archive a website?
20:42:15 just trying not to answer the same question a dozen times to people who will ignore the answers and ask again
20:45:19 faux intellectual elitest...+
20:45:20 ?*'
20:45:36 What's that even supposed to mean?
20:45:53 the last comment i saw here started with "yesterday save page now..." Can someone please copy-paste the explanation if there was one
20:45:54 just 2 comments, no explanation
20:46:07 it means "i'm a troll", as if the cat-on-keyboard nickname didn't give it away
20:46:24 I don't understand how, but I'll take your word for it
20:47:07 https://hackint.logs.kiska.pw/%2F%2F/20231202 probably?
20:47:13 they left
20:47:30 oh
20:47:33 Yeah, probably
20:47:45 I thought you sent a link to the explanation they were asking for
20:47:58 I don't think we have anything that can poll every 6 hours
20:48:39 They may be 'trolling', but it's a good idea to at least grab it once, right?
20:48:59 "It is an important and significant news aggregator"
20:50:59 Adding it to #// makes the most sense.
20:51:27 does #// grab entire sites? I thought those were just an equivalent to !ao
20:51:29 it has a news sources thing iirc
20:52:04 JAA: oh btw, how should I handle new/updated support.apple.com articles?
20:52:28 It does not, but their question was to grab the homepage regularly plus the links on it, which is exactly what #// already does for lots of news sites and other things.
20:52:55 nicolas17: AB !ao?
20:53:19 Ahhh
20:53:27 I don't need something to grab 8000 pages periodically because I'm already doing it, but I can give a list of the changes I did find
20:54:24 If you already grab them, grab them as WARCs and upload that? That also creates a direct record that the unchanged pages did indeed not change.
20:54:35 Rather than changes being missed, for example.
20:55:05 hrmmm grabbing them as WARCs would need significant changes :P I have a git repo of file content alone atm
20:59:11 ah i see
21:00:26 oh right, I'm even mangling the data I store (apparently the tags linking to other languages get regularly shuffled, so I strip them out to get readable diffs)
21:01:09 Is this a joke? I thought the Internet Archive was a normal organization that works with volunteers and archivists to archive the internet
21:01:31 You disconnected...
21:01:38 Also, we are not the Internet Archive.
21:03:06 And yes, we do try to work with everyone, but if you keep disconnecting, it's hard to communicate.
21:03:12 lol
21:03:15 Case in point...
21:03:56 hard to answer your questions if you keep disconnecting
21:04:48 How do I prevent it from disconnecting? I am also having significant problems with the chat box here. Did I miss any comments or explanations?
21:05:27 Keep the webchat tab in the foreground or move it to a separate window.
21:06:29 https://hackint.logs.kiska.pw/archiveteam-bs/20231204#c393345
21:07:35 JAA: "grab them as WARCs and upload that?" That won't appear in the WBM, will it?
21:08:25 nicolas17: We can make that happen.
21:08:38 qwarc might work for this...
21:12:01 e7269535e6632: In case you have not disconnected yet: why do you want the site to be archived so often, and what is the site? Also, connecting through IRC would be better if you are having trouble with the webirc client
21:22:13 (╯°□°)╯︵ ┻━┻
21:25:21 lol
21:29:22 christ lol
21:31:50 !tell e7269535e6632 "can anyone please teach me how to add a webpage to the wayback machine crawls?" To archive a site on your own, which is what it sounds like you're asking for, use https://github.com/ArchiveTeam/grab-site and upload to IA through an item. It won't show up on the Wayback Machine, but it will be saved, which is really the point.
21:31:50 To have it in the Wayback Machine, it'd have to be queued to AT's ArchiveBot, one of their other projects. I'm not sure what other ways there are.
21:31:51 -eggdrop- [tell] ok, I'll tell e7269535e6632 when they join next
21:32:20 inb4 they join with a different nickname next time
21:32:26 Haha
21:32:32 !tell e7269535e6632 To have it in the Wayback Machine, it'd have to be queued to AT's ArchiveBot, one of their other projects. I'm not sure what other ways there are.
21:32:32 -eggdrop- [tell] ok, I'll tell e7269535e6632 when they join next
21:32:37 (cut in two)
21:32:38 Thank
21:32:41 :)
21:32:42 I bet they may, but they didn't last time
21:32:49 (about joining with a different nick)
21:33:17 they may have been a7427a63 from the 2nd, but unknown
21:33:30 They could also hopefully be reading the logs and see that, if they are in fact having issues with the webchat
21:33:42 was e853962747e3759 earlier today
21:34:06 and most likely
Maybe they at least got the log link and are reading there. If so, hi. :-)
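On the 20:54 suggestion to grab the support.apple.com articles as WARCs rather than keeping a git repo of extracted file contents: here is a minimal sketch using warcio's capture_http (not the qwarc tool mentioned at 21:08), which records each request/response pair into a WARC file. The URL-list and output filenames are made up for the example.

```python
# Sketch: fetch a list of URLs and record the HTTP traffic into a WARC
# with warcio's capture_http. Filenames here are illustrative.
from warcio.capture_http import capture_http

import requests  # per warcio's docs, import requests after capture_http


def grab_to_warc(url_list="apple_support_urls.txt",
                 warc_out="support.apple.com.warc.gz"):
    with open(url_list) as f:
        urls = [line.strip() for line in f if line.strip()]
    with capture_http(warc_out):
        for url in urls:
            # Every request/response pair lands in the WARC, including
            # pages that did not change, which is exactly the point: the
            # WARC is a direct record that they did not change.
            requests.get(url, timeout=60)


if __name__ == "__main__":
    grab_to_warc()
```

A WARC produced this way can be uploaded to IA as an item; as discussed above, it only ends up in the Wayback Machine if that is arranged explicitly.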
21:56:25 supppppppp
22:18:34 https://argenteam.net/ this website provides crowdsourced subtitles in Spanish, mainly for uhh questionably-obtained movies
22:19:04 you can find subtitles by the torrent infohash, so you know they sync properly with the exact video you have
22:19:08 it's shutting down at the end of the year
22:19:34 they said they will soon publish a torrent with all 100k subtitles they have done
22:20:31 oh nice
22:21:19 there's also a forum with 127475 threads, but it seems to be login-walled, so that could be complicated to archive
22:25:36 hm, some forums are open: https://foro.argenteam.net/viewforum.php?f=11
22:29:04 i would think that subtitles work no matter how you got a copy of a movie, so there's no need to call them questionably-obtained
22:29:41 immibis: well, the website actually has magnet: links to the video the subtitles were made for >.>
22:30:08 so I'm sure many people use it primarily as a torrent search index too
22:30:23 Would having someone sign up for a throwaway account and give the cookie for archival be something that'd work? For the login walls, I mean
22:30:51 Pedrosso: I signed up and I can see all forums normally
22:31:23 I'm just not sure if that can be used for archival
22:31:23 Can ArchiveBot use custom cookies?
22:31:34 No
22:31:45 and well, my username shows up on every page :P
22:32:04 Things archived with accounts also can't go into the WBM, generally speaking.
22:35:03 I see
22:35:20 Then how are such things generally saved?
22:37:14 I've done some with wpull and cookies. The WARCs are somewhere, either in IA just for download and local playback or still sitting in my pile of stuff to upload.
22:38:35 https://web.archive.org/web/20150416181917/https://www.furaffinity.net/view/1/ seems to have a user "~smaugit"
22:39:12 "generally speaking", so it was just a special case?
22:40:18 forums (viewforum.php?f=) 4, 11, 35, 46, 55, 64 are publicly accessible
22:40:36 https://www.furaffinity.net/user/smaugit/ < people have nice things to say about the account haha
22:40:43 yep, they sure do
22:42:59 forums 1, 4, 11, 14, 27, 35, 46, 55, 63, 64, 66, 67, 73 are accessible on a brand new account
22:44:04 of the remaining IDs, when logged in, some return "forum doesn't exist" and others return "you're not authorized to see this forum" (probably private stuff for trusted translators, moderators, etc.)
22:45:11 FireonLive edited Issuu (+115, move to partially saved to now, can be changed…): https://wiki.archiveteam.org/?diff=51254&oldid=50096
22:47:07 (Also, since the last one was in 2015, another proactive grab of furaffinity might be warranted, maybe?)
22:47:38 Pedrosso: I can only think of one or two cases where such archives went into the WBM. For SPUF, Valve people gave us an account to continue archiving past the shutdown deadline, allowing us to cover everything. And I think there was another one that I can't remember right now.
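The publicly accessible forum IDs above (22:40 and 22:42) were evidently checked by hand. Below is a sketch of probing foro.argenteam.net anonymously to produce a URL list of publicly readable forums, e.g. for ArchiveBot. The restricted/nonexistent-forum markers are guesses at stock phpBB wording (and the board may well use Spanish strings), so treat them as placeholders, not verified values.

```python
# Sketch: probe which viewforum.php?f=N pages are readable without logging
# in. The marker strings below are assumptions about stock phpBB messages
# (the board may localise them); adjust them after inspecting a response.
import requests

BASE = "https://foro.argenteam.net/viewforum.php?f={}"
BLOCKED_MARKERS = (
    "You are not authorised",              # restricted forum (assumed wording)
    "The requested forum does not exist",  # nonexistent ID (assumed wording)
    'name="login"',                        # login form shown instead of the forum
)


def public_forum_ids(max_id=80):
    """Return forum IDs that render without any of the blocked markers."""
    ids = []
    for fid in range(1, max_id + 1):
        r = requests.get(BASE.format(fid), timeout=30)
        if r.status_code == 200 and not any(m in r.text for m in BLOCKED_MARKERS):
            ids.append(fid)
    return ids


if __name__ == "__main__":
    for fid in public_forum_ids():
        print(BASE.format(fid))
```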
22:49:10 I could make a more anonymous account :P
22:49:18 but anyway
22:49:26 there's a few public forums
22:49:38 and there's the main site to deal with
23:08:52 oh fun, the pages are not deterministic
23:09:07 "Codec info = AVC Baseline@L2 | V_MPEG4/ISO/AVC"
23:09:24 gets turned into a Cloudflare email-protection placeholder, and the cfemail field changes across requests
23:17:12 anyway I'm doing a simplistic wget of all movie IDs now
23:17:30 because they have a <link rel="canonical"> with the canonical URL
23:25:31 -+rss- YouTuber who intentionally crashed plane is sentenced to 6 months in prison: https://twitter.com/bnonews/status/1731748816250974335 https://news.ycombinator.com/item?id=38523704
23:25:31 nitter: https://nitter.net/bnonews/status/1731748816250974335
23:57:10 should take me 30 minutes to get all IDs
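The non-determinism at 23:08 is Cloudflare's email obfuscation: anything containing an "@" (here the codec profile "Baseline@L2") gets replaced by a placeholder whose data-cfemail value changes on every request. The encoding is the widely known scheme where the first byte of the hex string is an XOR key for the rest, so it can be undone before diffing or storing the pages. A sketch, assuming the site serves the stock markup:

```python
# Sketch: undo Cloudflare's data-cfemail obfuscation so snapshots of the
# same page compare equal even though the hex value differs per request.
# Decoding: the first byte of the hex string is an XOR key for the rest.
import re

CFEMAIL_RE = re.compile(
    r'<(?:a|span)[^>]+data-cfemail="([0-9a-fA-F]+)"[^>]*>.*?</(?:a|span)>',
    re.S,
)


def decode_cfemail(hexstr: str) -> str:
    data = bytes.fromhex(hexstr)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def deobfuscate(html: str) -> str:
    # Replace each protected placeholder with the decoded original text,
    # e.g. restoring "Baseline@L2" in the codec info field.
    return CFEMAIL_RE.sub(lambda m: decode_cfemail(m.group(1)), html)
```

Other per-request tokens would still need their own handling, but this removes the cfemail churn from the grabbed pages.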