00:11:27 thuban: not necessarily suggesting this one for consideration, but to maybe convey a sense of what the site has that could be lost: https://podcast.rthk.hk/podcast/item.php?pid=205&year=2014&lang=en-US news commentary about The Umbrella movement, 25th anniversary of June 4th (a topic censored in mainland China)
00:21:03 along with a few social topics like housing strategy, academic freedom. as you mentioned, there's a lot of content, so whatever else (if anything) you decide might be interesting or worthwhile
00:23:47 unfortunately their server seems to be very, very slow (at least for me)
00:24:04 a lot of the podcasts listed run for maybe a year or two and are complete/discontinued
00:25:08 maybe their servers are being flooded with people trying to save bits and pieces of old shows :)
00:25:11 the first video apparently downloaded for 40 minutes and then hit an ECONNRESET
00:25:57 maybe a geo thing? anyone have servers nearby?
00:29:19 hm, or maybe i should be faking a useragent; it didn't seem to be this bad in the browser
00:31:28 think they're streaming from akamai, at least in the browser
00:34:18 the site uses akamai; the xml feed links to a file (not a playlist) on archive.rthk.hk--but i was able to load a video from it in the browser earlier
00:34:24 can't seem to now, though
00:34:43 or maybe a little, just really incredibly slow
00:35:18 i guess i can rewrite to grab the akamai version
00:40:49 T-minus 3:20 until the Y!A shutdown right?
00:56:35 arkiver: The thing you're probably thinking of hasn't been in operation for a while now, unfortunately.
00:58:09 i see
02:36:28 MeriStation Comunidad Zonaforo qwarc grab has started. I'm only retrieving the thread pages. Their servers are horribly slow at an average response time of 4 seconds, so we'll see how that goes.
02:38:21 arkiver: From what I've experienced (have not systematically tested it), there's a short ban of between 10 hours and a day; then if you continue after that, it's permanent (or long enough that I haven't been unbanned yet)
03:14:28 I can't go very hard at MeriStation. Starting to see timeouts and DB errors at only 200 connections. Average response time also increased to 6.5 seconds. This is the most I can get out of it I think.
03:15:51 Gives an ETA of 46 hours or so. Not fast enough, sadly.
03:16:24 Less than 2 days of lead time after 21 years... :-|
03:20:53 :(
05:44:47 hey avoozl, how's your xenforo support?
05:45:48 Currently, non-existent, but adding new parsers is pretty doable
05:47:49 i have some warcs if you want raw material
05:51:49 thuban: I'm currently working first on getting it to go a bit faster so I can build the yahoo answers index, but pointers are always welcome.
05:52:12 thuban: if you want to take a look at how the parsers are currently implemented... the league of legends forum parser looks like this: https://paste.ofcode.org/QHnHH4ErUvsnmCptW4SiH2
05:52:30 thuban: basically a bunch of selectors to get the right bits from the page, construct them into a Post object, and the indexer takes it from there
05:53:39 whoa, go :o
05:54:00 yeah I figured for something self-hosted that'd be easiest and most compact
05:54:03 been meaning to get into that, i'll have a look
05:55:00 that html sanitization is probably going to be removed here, I'll make that a task of the front-end serving the html instead. It is all still pretty much in motion
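A rough sketch of the selector-to-Post pattern described at 05:52:30, in Python rather than the Go the actual warceater parsers use; every selector, class name, and field below is hypothetical, not taken from the real code:

    # Minimal sketch of the "selectors -> Post object" parser pattern mentioned
    # above. All selectors and field names are made-up examples.
    from dataclasses import dataclass
    from bs4 import BeautifulSoup

    @dataclass
    class Post:
        author: str
        timestamp: str
        body_html: str

    def parse_thread_page(html: str) -> list:
        soup = BeautifulSoup(html, "html.parser")
        posts = []
        for node in soup.select("div.post"):                  # one node per forum post
            author = node.select_one(".author").get_text(strip=True)
            timestamp = node.select_one("time")["datetime"]
            body = str(node.select_one(".post-body"))          # leave sanitization to the front-end
            posts.append(Post(author, timestamp, body))
        return posts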
05:55:32 * thuban nods
05:55:52 for yahoo answers things are much more complex, as there are json payloads and other odd things in there (multiple payload types that each refer to the same type of data)
06:26:29 I'm parsing the yahoo scrape at around 7MB/sec, most of that is spent on cpu-limited tasks.. and my download speed from archive.org is pretty low so I'm still 9 days behind (20210423 is downloaded)
07:11:03 Re: ArchiveBot - is it better for me to do an initial run with grab-site to see if there's some giant forum archive off to the side, or if a blind pull of the site ends up pulling a lot of external, uninteresting content (like an awful lot of 3rd party Wordpress controls or gstatic.com fonts?)
07:12:12 Because one of the sites jo-dizzle put in for me just seems to be ballooning
07:13:04 if you have voice (which you can probably just ask nicely for) you can alter ignoresets on the fly as with grab-site
07:14:04 Yes, we regularly check in on crawls and add ignores as appropriate
07:15:05 Leaving the forum in that job was intentional LeighR, but you're right that it probably needs some ignores
07:15:53 ok
07:18:46 is it better in cases like this to instruct AB to ignore off-site links?
07:19:50 at least in the forums?
07:20:17 I know that it needs them to make the site itself display correctly
07:25:37 For very large forums it's usually better to ignore off-site links. I guess we'll see if this requires that.
07:26:50 Note though that if using `--no-offsite-links` when launching the job, AB will still pick up off-site page dependencies, like stylesheets and such
07:36:11 for something like dwiggie.com, offsite links to individual files (images) would have made sense to keep, but probably not the complete contents needed to render an Amazon page for a book someone in a forum post recommended
07:39:05 is there a way to make AB ignore offsite links for specific paths, or to only allow them for specific paths?
07:39:26 or should a site be broken into separate jobs?
07:47:00 you can use regex ignores to manage offsite links (but it won't be as simple as applying one set of rules to same-domain urls and another to offsites)
07:53:28 would it make more sense to break a site like this into "not-forum" and "forums"?
07:53:46 the stuff under "not-forum" is the important part
08:28:53 thuban: just in case, I've added a #warceater channel for anything related to this code I'm building
12:09:58 Is there a dashboard to view the staging server progress anywhere?
12:11:31 http://fos.textfiles.com/pipeline.html doesnt seem to contain anything yahoo related afaik
12:12:29 If you're looking for Yahoo Answers related status, then no. There isn't a statuspage that shows the upload progress.
12:23:21 yeah we dont have that public on any projects
12:38:21 does anyone know how large the yahoo answers set will be in total? I may have to clean out some space for that
12:39:18 Tracker says 4.75TiB compressed for the new project, and 30TB (uncompressed) for the 2016 project.
12:42:28 4.75TiB sounds good. I've got around 2.5 downloaded so I will need to create some extra space
12:42:32 thanks
12:43:02 I'll probably need around 3TiB for the index as well. this will be interesting, juggling some free space
13:45:46 arkiver: isolario is on its way to the IA :)
13:48:52 so is a random 3TB of webs, 3.8TB of bintray (once I have vars)
13:48:57 and im sure im about to find more crap
13:49:41 bets on finding a folder of G+ somewhere?
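To illustrate the kind of regex-based ignore logic mentioned at 07:47:00 for managing offsite links, here's a rough Python sketch of one set of rules for same-domain URLs and a small whitelist of offsite paths; the domain and patterns are made-up examples, not an actual ArchiveBot ignore set:

    # Sketch: keep everything on the target site, keep offsite links only when
    # they look like individual image files, ignore the rest. Hypothetical rules.
    import re
    from urllib.parse import urlsplit

    SITE = "dwiggie.com"
    OFFSITE_ALLOW = [
        re.compile(r"^https?://[^/]+/.*\.(?:jpe?g|png|gif)$"),  # individual offsite images
    ]

    def should_fetch(url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if host == SITE or host.endswith("." + SITE):
            return True                      # always keep same-domain URLs
        return any(p.match(url) for p in OFFSITE_ALLOW)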
13:50:21 i dont think i have anything that old
14:01:24 Google+ was a nice one
14:01:28 sounds good HCross!
14:42:07 here's another favicon idea with a 3.5" floppy: https://i.imgur.com/ChCYwKs.png https://i.imgur.com/XDfSEOv.png
15:41:46 So that MeriStation archive didn't go well... Slowed to a crawl, then I got banned.
15:48:07 Looks like I only got maybe 7% of it up to that point.
16:06:20 the meristation case is literally incredible
17:04:00 thuban: is there a way to feed the rss xml to wget or some app and have it download the links inside, then format it for upload to IA? I tried to download a few audio podcast episodes manually (leaving aside for a moment that the file descriptions in the xml are cropped, so I still have to find a way to fetch those)
17:07:18 the AT wiki page on wget has a command for webpages, which I'm trying as well, but I haven't managed to adjust it to narrow fetching down to just the pages related to a single podcast
17:29:39 Hi! I hope this is the right channel for this question--the Clash of Clans forums are being shut down in a couple months, and I'm wondering how I can archive them. I saw that there was a script for getting Yahoo Answers on the Internet Archive based off its sitemap, does anyone know where to find that script?
17:33:18 nschmeller: what is the URL for the forum
17:33:59 https://forum.supercell.com/
17:34:07 https://forum.supercell.com/showthread.php/1953693-End-of-the-Official-Supercell-Forums
17:34:59 Read-only in June, shutdown in August
17:36:12 Yup, ^^
17:36:36 looks like sequential IDs
17:36:45 If i'm reading correctly, someone with permissions will have to point the archive bot at the main webpage and it'll get everything?
17:36:58 even the members have sequential IDs
17:37:14 JAA: is this small enough for archivebot?
17:37:14 Yup, standard forum, but with session ID hell.
17:37:51 What is session ID hell?
17:38:00 Too big for AB, but I can do it with qwarc.
17:39:08 nschmeller: When you access it without cookies, it adds an 's' parameter to every link. As the session expires after a while, it inevitably devolves into a huge mess of different session IDs being crawled etc.
17:40:13 Interesting, sounds annoying. Does that mean that the same page might be archived multiple times once a session expires?
17:41:01 Yeah
17:41:12 It would keep recursing through the site endlessly.
17:41:19 Doesn't sound good
17:41:37 What can I do to help?
17:48:12 I'll get this sorted. :-)
17:50:52 Awesome!! I'm surprised I haven't come across this group earlier, I've been religiously contributing to the IA since 2016
18:26:11 Uhhh, should we do a proactive archiving of Giant Bomb? https://www.giantbomb.com/ More and more people are leaving Giant Bomb, 3 notable people are Vinny Caravella, Alex Navarro, and Brad Shoemaker
18:26:42 Ever since being acquired and bought away from CBS Interactive, it has been bleeding talent over time :S
18:27:07 Apparently, there's only 2 notable people left :/
18:36:08 nuroten: that is more or less what i'm doing
18:36:51 the trouble is that their video _and_ their web pages _and_ apparently their CDN are all a bit flaky, so there's a lot of retrying involved
18:38:21 thuban: nice. yeah, their servers are slow
18:39:24 did you manage to get the equivalent akamai urls? not that it's less flaky, hoped it would be a bit faster
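One possible answer to the 17:04:00 question about feeding the RSS XML to a downloader: pull the enclosure URLs out of the podcast feed and write them to a plain list that wget -i (or !ao < LISTURL) can consume. A minimal sketch; the feed URL below is just a placeholder derived from the page URL mentioned earlier, and the real feed location and element layout may differ:

    # Sketch: extract podcast enclosure URLs from an RSS feed into episodes.txt.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://podcast.rthk.hk/podcast/item.php?pid=205&lang=en-US"  # assumed feed location

    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.fromstring(resp.read())

    with open("episodes.txt", "w") as out:
        for item in root.iter("item"):
            enclosure = item.find("enclosure")
            if enclosure is not None and enclosure.get("url"):
                out.write(enclosure.get("url") + "\n")
    # afterwards: wget --input-file=episodes.txt, or upload the list and !ao < LISTURL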
18:39:53 i did
18:42:02 the 2020 ones were fine but 2019 (and presumably earlier) are giving streamlink problems; can't investigate now but will look at it this evening
18:42:02 that's good, is that exposed/extractable via browser inspector? I saw some m3u8 playlist files with *.ts fragments but not sure how to put it back together (or maybe that's not it)
18:43:01 there are tools to handle those but, as i say, problems
18:43:56 okay ... wouldn't be too surprised if the 2019 ones are flaky, it was one of the more eventful years
18:45:14 thanks a lot for your work on this!
18:47:16 I still have to check Youtube, if the quality is identical maybe grabbing from there is another option
18:52:17 thuban: are you archiving those RTHK videos?
18:53:07 arkiver: yes
18:53:28 thuban: alright, any details on what is being archived exactly and how?
18:54:24 podcast episodes + thumbnails + metadata (scraped from xml feed and episode pages)
18:54:42 and videos?
18:54:49 or are those videos
18:55:02 that's what i meant by "episodes"
18:55:06 i can throw the episode pages into archivebot too if we want provenance
18:55:33 yeah try to get everything into the Wayback Machine at least
18:55:42 k, will do
18:55:43 that is also the audio/video files themselves
18:56:27 that is likely to be problematic but i will generate the list
18:59:24 right i see, podcast.rthk.hk
19:22:20 oof, i see the problem: episodes more than a year old aren't on their cdn at all; they also come in a playlist version, but it's self-hosted as well. if i can get one down i will compare the quality to the 'archive' mp4 and act accordingly
19:38:03 Is all of their stuff not on youtube?
19:38:18 I just checked a recent video and it's just a youtube embed
19:41:14 these videos are not youtube embeds
19:41:18 I don't see any indication that the site is going anywhere but it's good to get a backup
19:41:46 yeah, better to back up stuff than to be forced into an emergency rescue
19:41:55 they have a playlist for "hong kong connection" (this show), but many, many of the videos are unavailable https://www.youtube.com/playlist?list=PLuwJy35eAVaJ-DaWHYe8PK6Yg-cyEMVo1
19:42:19 Ryz: Yes re: Giant Bomb.
19:43:03 Giant Bomb has forums, a wiki, and premium content (requires a subscription to access that kind of content)
19:43:14 On top of being a news and media website for video games
19:43:39 This should expand to the other related websites that are under Red Ventures
19:43:42 https://www.giantbomb.com/shows/returnal/2970-21070
19:43:59 So checking their recent video lists, I'd say 3/4 of them are on youtube
19:44:09 And some of them on the site are youtube embeds, such as ^
19:45:12 This isn't the first time calls for this stuff to be archived have been echoed; Jason Scott put out a message via Twitter encouraging ArchiveTeam to do such an archiving
19:45:28 But yes there are some that are not on youtube, such as https://www.giantbomb.com/shows/4-30-2021-g-is-for-golden/2970-21074
19:46:29 I can get their recent twitch videos as a low-res backup copy, as they will most likely end up on youtube as higher-res copies, and hard drives are expensive now :-\
19:48:17 ouch, youtube-dl does not like that link. It has a GiantBomb extractor but maybe it's unmaintained/broken?
19:48:39 ytdl is not known for keeping up with its prs; try youtube-dlc?
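On the 18:42:02 question about m3u8 playlists with *.ts fragments: MPEG-TS segments can simply be concatenated in playlist order. A bare-bones Python sketch of that idea with a placeholder playlist URL; real tools (streamlink, ffmpeg, yt-dlp) additionally handle master playlists, encryption, and retries:

    # Sketch: reassemble a plain HLS media playlist by concatenating its .ts segments.
    import urllib.request
    from urllib.parse import urljoin

    PLAYLIST_URL = "https://example.akamaized.net/show/episode/index.m3u8"  # placeholder URL

    playlist = urllib.request.urlopen(PLAYLIST_URL).read().decode()
    segments = [line.strip() for line in playlist.splitlines()
                if line.strip() and not line.startswith("#")]

    with open("episode.ts", "wb") as out:
        for seg in segments:
            seg_url = urljoin(PLAYLIST_URL, seg)   # segment paths are usually relative
            out.write(urllib.request.urlopen(seg_url).read())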
19:49:06 Isn't there another one besides that that is even more up to date
19:49:27 youtube-dlc hasn't had a commit to master since October
19:50:18 *December
19:50:41 yt-dlp
19:51:28 there is a download link, but it's only for the audio; the video just seems to be a placeholder
19:53:13 https://github.com/yt-dlp/yt-dlp
20:10:00 from #archiveteam:
20:10:03 Game site with ~13 years of history has 3 of its founders leaving after ~13 years. No word yet on if videos are going anywhere. videos hosted on their site as well as youtube.com, in most cases. tons and tons of 2h+ video. As a fan, i think the biggest risk is that the site jettisons some of its less visible/profitable features, like its
20:10:03 extensive wiki. old videos (older ones may not be on youtube?) may also get deleted for storage reasons. https://www.resetera.com/threads/vinny-caravella-alex-navarro-brad-shoemaker-announce-theyre-leaving... goodtime 15:03:38 one of the people leaving: "We are still a website... in a time when websites kind of don't exist anymore". storm clouds on
20:10:04 the wiki "Are they gonna be on our forum? Are they gonna be on discords?" founder, still staying: "Do we still need a website? I've been asking for 5 years"
20:10:26 tldr old videos (not on youtube) and non-videos are the highest risk imo
20:37:05 Probably easiest to list the web pages to scrape and then get a listing of all the videos and download them somehow
20:37:56 Holy Cow did I have an instinct for a site at risk - pemberley.com is unresponsive, and its old IP address is a parking page
20:38:34 Got it just in time?
20:38:50 apparently!
20:39:37 there wasn't anything on the site that announced it was going away, so this might just be a temporary hiccup, but given how unresponsive it was, I felt its days were numbered
20:40:05 Hope AB didn't knock it out (I don't seriously think AB knocked it out)
20:41:15 And if someone writes code to get a listing of GB's pages, that should be put on GitHub and linked on the wiki so it can be rerun in the future :)
20:44:10 Did something similar for the TM-exchange. Dumped the URLLists to archive.org and added the source code of the tool into that item, too. Better to have the code at multiple locations
20:44:52 URLList dump makes it easier to do an incremental update since replays don't need redownload after initial download, and no need to redo the POST search if you already got the IDs
20:55:44 aside from downloading the whole WARC myself, is there a way to spot-check some URLs? Most of the stories on that site were indexed in a single, slightly mangled table that was de-mangled for viewers one page at a time
20:56:52 (site is back up, but still slow as heck)
20:58:10 each WARC has a cdx which is like a ToC
21:16:07 WRPlayer choked on the metadata WARC
21:17:53 downloading the WARC from https://archive.fart.website/archivebot/viewer/job/b8mfh isn't eating into someone's monthly bandwidth allotment?
21:19:18 It's just an index for the AB collection on IA.
21:19:22 oh, good
21:21:42 if those pages end up not being in there, what is the best way to archive the list of URLs I parse from the slightly mangled list?
21:22:52 how is it mangled?
21:23:51 https:\/\/pemberley.com\/derby\/ariane1.cim.html
21:23:51 sidenote: Just noticed that on the Wikiteam dump the last upload was 2016.
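For the 20:55:44 spot-check question, the CDX index mentioned at 20:58:10 can be checked against a URL list without downloading the WARC itself. A naive Python sketch with placeholder filenames; it only does substring checks against the CDX lines rather than parsing the field layout, which is enough for a quick yes/no per URL:

    # Sketch: report URLs from a list that don't appear anywhere in the CDX index.
    with open("pemberley.cdx") as f:
        cdx = f.read()

    with open("story_urls.txt") as f:
        for url in (line.strip() for line in f):
            if url and url not in cdx:
                print("missing:", url)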
21:24:07 grep them all out and replace \/ with /
21:24:08 no big deal to clean up in PowerShell or whatever
21:24:20 yeah, scripting or some quick C# code is the best way sometimes
21:24:20 (to pull out of the table)
21:25:02 *last upload of wikimedia commons
21:25:03 sed 's,\\/,/,g'
21:25:05 I thought it would be some serious JS BS, but no, I can see them all clear as day when I pull that page with curl
21:25:35 Slashes are often unnecessarily escaped in JS strings (including embedded JSON).
21:25:41 they're stuck in a table, but a regular enough pattern. Not sure if ArchiveBot would have caught this.
21:25:59 probably not, since the backslashes hide it
21:26:13 unless it has some unmangling code for that
21:26:30 but easiest to verify by crosschecking that list with the cdx of the WARC file
21:26:34 I get the feeling that some of this might have been done to prevent just the sort of thing we just did
21:27:02 still better than __doPostBack aspx pagination that doesnt use the URL
21:27:14 but their main fear was probably the stories being posted on fanfiction.net or the like under different authors' names
21:27:19 If it's JS, wpull handles that by calling json.loads.
21:27:26 nice
21:27:44 whats the initial URL where the table resides?
21:28:00 https://pemberley.com/?page_id=5270
21:30:30 if it turns out that AB didn't get them, I'll clean them up and put them in a list - no reason for y'all to bother
21:31:14 just curious about the fuckery hidden in that page
21:34:09 it's a site that was started before Google was
21:34:38 all I can guess is that it's some effort to prevent low-effort web scraping
21:35:03 script tag with a CDATA wrapper around it; not sure if wpull expects a variable assignment containing the essential data
21:37:19 what's the polite way to get AB to pull a list of links that are all on the same site, but aren't the only thing on that site?
21:38:12 Oh, I see, it's HTML in JS strings. Yeah, that isn't processed by wpull I think.
21:38:27 you probably don't want several hundred !ao messages in the channel
21:39:06 Create a file containing one URL per line, upload that to https://transfer.archivete.am/ (with a good filename!), then use !ao < LISTURL.
21:39:31 and you don't need several hundred copies of that obnoxious background image
21:39:46 that was probably very classy in 1997
21:39:53 great!
21:39:55 is transfer.archivete.am required, or does any deeplinkable host work?
21:40:33 ?
21:40:54 Anything works. Anything with good filenames (e.g. not Pastebin) is acceptable. transfer.archivete.am is strongly recommended.
21:41:01 I need to check, but I think some of them might just be the first chapter of multi-chaptered stories, linked in who knows what pattern
21:41:16 (This might change in the future, we'll see.)
21:41:35 also: got this link https://app.box.com/s/6b9wmjvr582c95uzma1136exumk6p989/folder/136698646305 via this tweet: https://twitter.com/simoncarless/status/1389297530341519362
21:42:23 Apple Vs Epic Lawsuit Extended stuff. (not directly in the RECAP archive which pipes to archive.org)
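The same unmangling as the sed one-liner above, as a rough Python sketch: it pulls the backslash-escaped pemberley.com links out of the index page and writes one URL per line, ready for transfer.archivete.am and !ao < LISTURL. The regex is a guess based on the sample shown at 21:23:51 and may need adjusting for the real page:

    # Sketch: extract escaped "https:\/\/pemberley.com\/..." links and unescape them.
    import re
    import urllib.request

    page = urllib.request.urlopen("https://pemberley.com/?page_id=5270").read().decode("utf-8", "replace")

    urls = sorted(set(
        m.replace("\\/", "/")
        for m in re.findall(r"https:\\/\\/pemberley\.com\\/[^\"'\s]+", page)
    ))

    with open("pemberley_stories.txt", "w") as out:
        out.write("\n".join(urls) + "\n")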
22:31:36 LeighR: if we know of any people, it would be good to get in contact with them
22:31:54 https://thediplomat.com/2021/04/hong-kongs-activists-in-exile/
22:32:35 but those are perhaps not as archive-oriented
22:35:19 I remember some folks in college who were from Taiwan (important because they and HKers can read the full Traditional Chinese character set, while the mainland uses Simplified Chinese)
22:37:03 This group would probably be delighted with your help: https://www.2021hkcharter.com/
22:41:47 I'll do some more looking into who might be able to make the best use of AT's help
22:42:18 for Giant Bomb we could probably amass a collection of premium subscribers who want to make sure the content is archived. premium subs get download URLs which are supposedly checked for abuse (i.e. no mass downloads, i think an api key is involved)