00:11:27 thuban: not necessarily suggesting this one for consideration, but to maybe convey a sense of what the site has that could be lost: https://podcast.rthk.hk/podcast/item.php?pid=205&year=2014&lang=en-US news commentary about The Umbrella movement, 25th anniversary of June 4th (a topic censored in mainland China)
00:21:03 along with a few social topics like housing strategy, academic freedom. as you mentioned, there's a lot of content, so whatever else (if anything) you decide might be interesting or worthwhile
00:23:47 unfortunately their server seems to be very, very slow (at least for me)
00:24:04 a lot of the podcasts listed run for maybe a year or two and are complete/discontinued
00:25:08 maybe their servers are being flooded with people trying to save bits and pieces of old shows :)
00:25:11 the first video apparently downloaded for 40 minutes and then hit an ECONNRESET
00:25:57 maybe a geo thing? anyone have servers nearby?
00:29:19 hm, or maybe i should be faking a useragent; it didn't seem to be this bad in the browser
00:31:28 think they're streaming from akamai, at least in the browser
00:34:18 the site uses akamai; the xml feed links to a file (not a playlist) on archive.rthk.hk--but i was able to load a video from it in the browser earlier
00:34:24 can't seem to now, though
00:34:43 or maybe a little, just really incredibly slow
00:35:18 i guess i can rewrite to grab the akamai version
00:40:49 T-minus 3:20 until the Y!A shutdown right?
00:56:35 arkiver: The thing you're probably thinking of hasn't been in operation for a while now, unfortunately.
00:58:09 i see
02:36:28 MeriStation Comunidad Zonaforo qwarc grab has started. I'm only retrieving the thread pages. Their servers are horribly slow at an average response time of 4 seconds, so we'll see how that goes.
02:38:21 arkiver: From what I've experienced (have not systematically tested it), there's a short ban of between 10 hours and a day; then if you continue after that, it's permanent (or long enough that I haven't been unbanned yet)
03:14:28 I can't go very hard at MeriStation. Starting to see timeouts and DB errors at only 200 connections. Average response time also increased to 6.5 seconds. This is the most I can get out of it I think.
03:15:51 Gives an ETA of 46 hours or so. Not fast enough, sadly.
03:16:24 Less than 2 days of lead time after 21 years... :-|
03:20:53 :(
05:44:47 hey avoozl, how's your xenforo support?
05:45:48 Currently, non-existent, but adding new parsers is pretty doable
05:47:49 i have some warcs if you want raw material
05:51:49 thuban: I'm currently working first on getting it to go a bit faster so I can build the yahoo answers index, but pointers are always welcome.
05:52:12 thuban: if you want to take a look at how the parsers are currently implemented... the league of legends forum parser looks like this: https://paste.ofcode.org/QHnHH4ErUvsnmCptW4SiH2
05:52:30 thuban: basically a bunch of selectors to get the right bits from the page, construct them into a Post object, and the indexer takes it from there
05:53:39 whoa, go :o
05:54:00 yeah I figured for something self-hosted that'd be easiest and most compact
05:54:03 been meaning to get into that, i'll have a look
05:55:00 that html sanitization is probably going to be removed here, I'll make that a task of the front-end serving the html instead. It is all still pretty much in motion
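A rough sketch of the selector-to-Post pattern described at 05:52:30, in Python rather than the Go the actual warceater parsers use; every selector, class name, and field below is hypothetical, not taken from the real code:

    # Minimal sketch of the "selectors -> Post object" parser pattern mentioned
    # above. All selectors and field names are made-up examples.
    from dataclasses import dataclass
    from bs4 import BeautifulSoup

    @dataclass
    class Post:
        author: str
        timestamp: str
        body_html: str

    def parse_thread_page(html: str) -> list:
        soup = BeautifulSoup(html, "html.parser")
        posts = []
        for node in soup.select("div.post"):                  # one node per forum post
            author = node.select_one(".author").get_text(strip=True)
            timestamp = node.select_one("time")["datetime"]
            body = str(node.select_one(".post-body"))          # leave sanitization to the front-end
            posts.append(Post(author, timestamp, body))
        return posts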
05:55:32 * thuban nods
05:55:52 for yahoo answers things are much more complex, as there are json payloads and other odd things in there (multiple payload types that each refer to the same type of data)
06:26:29 I'm parsing the yahoo scrape at around 7MB/sec, most of that is spent on cpu-limited tasks.. and my download speed from archive.org is pretty low so I'm still 9 days behind (20210423 is downloaded)
07:11:03 Re: ArchiveBot - is it better for me to do an initial run with grab-site to see if there's some giant forum archive off to the side, or if a blind pull of the site ends up pulling a lot of external, uninteresting content (like an awful lot of 3rd party Wordpress controls or gstatic.com fonts?)
07:12:12 Because one of the sites jo-dizzle put in for me just seems to be ballooning
07:13:04 if you have voice (which you can probably just ask nicely for) you can alter ignoresets on the fly as with grab-site
07:14:04 Yes, we regularly check in on crawls and add ignores as appropriate
07:15:05 Leaving the forum in that job was intentional LeighR, but you're right that it probably needs some ignores
07:15:53 ok
07:18:46 is it better in cases like this to instruct AB to ignore off-site links?
07:19:50 at least in the forums?
07:20:17 I know that it needs them to make the site itself display correctly
07:25:37 For very large forums it's usually better to ignore off-site links. I guess we'll see if this requires that.
07:26:50 Note though that if using `--no-offsite-links` when launching the job, AB will still pick up off-site page dependencies, like stylesheets and such
07:36:11 for something like dwiggie.com, offsite links to individual files (images) would have made sense to keep, but probably not the complete contents needed to render an Amazon page for a book someone in a forum post recommended
07:39:05 is there a way to make AB ignore offsite links for specific paths, or to only allow them for specific paths?
07:39:26 or should a site be broken into separate jobs?
07:47:00 you can use regex ignores to manage offsite links (but it won't be as simple as applying one set of rules to same-domain urls and another to offsites)
07:53:28 would it make more sense to break a site like this into "not-forum" and "forums"?
07:53:46 the stuff under "not-forum" is the important part
08:28:53 thuban: just in case, I've added a #warceater channel for anything related to this code I'm building
12:09:58 Is there a dashboard to view the staging server progress anywhere?
12:11:31 http://fos.textfiles.com/pipeline.html doesnt seem to contain anything yahoo related afaik
12:12:29 If you're looking for Yahoo Answers related status, then no. There isn't a statuspage that shows the upload progress.
12:23:21 yeah we dont have that public on any projects
12:38:21 does anyone know how large the yahoo answers set will be in total? I may have to clean out some space for that
12:39:18 Tracker says 4.75TiB compressed for the new project, and 30TB (uncompressed) for the 2016 project.
12:42:28 4.75TiB sounds good. I've got around 2.5 downloaded so I will need to create some extra space
12:42:32 thanks
12:43:02 I'll probably need around 3TiB for the index as well. this will be interesting, juggling some free space
13:45:46 arkiver: isolario is on its way to the IA :)
13:48:52 so is a random 3TB of webs, 3.8TB of bintray (once I have vars)
13:48:57 and im sure im about to find more crap
13:49:41 bets on finding a folder of G+ somewhere?
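To illustrate the kind of regex-based ignore logic mentioned at 07:47:00 for managing offsite links, here's a rough Python sketch of one set of rules for same-domain URLs and a small whitelist of offsite paths; the domain and patterns are made-up examples, not an actual ArchiveBot ignore set:

    # Sketch: keep everything on the target site, keep offsite links only when
    # they look like individual image files, ignore the rest. Hypothetical rules.
    import re
    from urllib.parse import urlsplit

    SITE = "dwiggie.com"
    OFFSITE_ALLOW = [
        re.compile(r"^https?://[^/]+/.*\.(?:jpe?g|png|gif)$"),  # individual offsite images
    ]

    def should_fetch(url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if host == SITE or host.endswith("." + SITE):
            return True                      # always keep same-domain URLs
        return any(p.match(url) for p in OFFSITE_ALLOW)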
13:50:21 i dont think i have anything that old
14:01:24 Google+ was a nice one
14:01:28 sounds good HCross!
14:42:07 here's another favicon idea with a 3.5" floppy: https://i.imgur.com/ChCYwKs.png https://i.imgur.com/XDfSEOv.png
15:41:46 So that MeriStation archive didn't go well... Slowed to a crawl, then I got banned.
15:48:07 Looks like I only got maybe 7% of it up to that point.
16:06:20 the meristation case is literally incredible
17:04:00 thuban: is there a way to feed the rss xml to wget or some app and have it download the links inside, then format it for upload to IA? I tried to download a few audio podcast episodes manually (leaving aside for a moment that the file descriptions in the xml are cropped, so I still have to find a way to fetch those)
17:07:18 the AT wiki page on wget has a command for webpages, which I'm trying as well, but I haven't managed to adjust it to narrow fetching down to just the pages related to a single podcast
17:29:39 Hi! I hope this is the right channel for this question--the Clash of Clans forums are being shut down in a couple months, and I'm wondering how I can archive them. I saw that there was a script for getting Yahoo Answers on the Internet Archive based off its sitemap, does anyone know where to find that script?
17:33:18 nschmeller: what is the URL for the forum
17:33:59 https://forum.supercell.com/
17:34:07 https://forum.supercell.com/showthread.php/1953693-End-of-the-Official-Supercell-Forums
17:34:59 Read-only in June, shutdown in August
17:36:12 Yup, ^^
17:36:36 looks like sequential IDs
17:36:45 If i'm reading correctly, someone with permissions will have to point the archive bot at the main webpage and it'll get everything?
17:36:58 even the members have sequential IDs
17:37:14 JAA: is this small enough for archivebot?
17:37:14 Yup, standard forum, but with session ID hell.
17:37:51 What is session ID hell?
17:38:00 Too big for AB, but I can do it with qwarc.
17:39:08 nschmeller: When you access it without cookies, it adds an 's' parameter to every link. As the session expires after a while, it inevitably devolves into a huge mess of different session IDs being crawled etc.
17:40:13 Interesting, sounds annoying. Does that mean that the same page might be archived multiple times once a session expires?
17:41:01 Yeah
17:41:12 It would keep recursing through the site endlessly.
17:41:19 Doesn't sound good
17:41:37 What can I do to help?
17:48:12 I'll get this sorted. :-)
17:50:52 Awesome!! I'm surprised I haven't come across this group earlier, I've been religiously contributing to the IA since 2016
18:26:11 Uhhh, should we do a proactive archiving of Giant Bomb? https://www.giantbomb.com/ More and more people are leaving Giant Bomb, 3 notable people are Vinny Caravella, Alex Navarro, and Brad Shoemaker
18:26:42 Ever since being acquired and bought away from CBS Interactive, it has been bleeding talent over time :S
18:27:07 Apparently, there's only 2 notable people left :/
18:36:08 nuroten: that is more or less what i'm doing
18:36:51 the trouble is that their video _and_ their web pages _and_ apparently their CDN are all a bit flaky, so there's a lot of retrying involved
18:38:21 thuban: nice. yeah, their servers are slow
18:39:24 did you manage to get the equivalent akamai urls? not that it's less flaky, hoped it would be a bit faster
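One possible answer to the 17:04:00 question about feeding the RSS XML to a downloader: pull the enclosure URLs out of the podcast feed and write them to a plain list that wget -i (or !ao < LISTURL) can consume. A minimal sketch; the feed URL below is just a placeholder derived from the page URL mentioned earlier, and the real feed location and element layout may differ:

    # Sketch: extract podcast enclosure URLs from an RSS feed into episodes.txt.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://podcast.rthk.hk/podcast/item.php?pid=205&lang=en-US"  # assumed feed location

    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.fromstring(resp.read())

    with open("episodes.txt", "w") as out:
        for item in root.iter("item"):
            enclosure = item.find("enclosure")
            if enclosure is not None and enclosure.get("url"):
                out.write(enclosure.get("url") + "\n")
    # afterwards: wget --input-file=episodes.txt, or upload the list and !ao < LISTURL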
18:39:53 i did
18:42:02 the 2020 ones were fine but 2019 (and presumably earlier) are giving streamlink problems; can't investigate now but will look at it this evening
18:42:02 that's good, is that exposed/extractable via browser inspector? I saw some m3u8 playlist files with *.ts fragments but not sure how to put it back together (or maybe that's not it)
18:43:01 there are tools to handle those but, as i say, problems
18:43:56 okay ... wouldn't be too surprised if the 2019 ones are flaky, it was one of the more eventful years
18:45:14 thanks a lot for your work on this!
18:47:16 I still have to check Youtube, if the quality is identical maybe grabbing from there is another option
18:52:17 thuban: are you archiving those RTHK videos?
18:53:07 arkiver: yes
18:53:28 thuban: alright, any details on what is being archived exactly and how?
18:54:24 podcast episodes + thumbnails + metadata (scraped from xml feed and episode pages)
18:54:42 and videos?
18:54:49 or are those videos
18:55:02 that's what i meant by "episodes"
18:55:06 i can throw the episode pages into archivebot too if we want provenance
18:55:33 yeah try to get everything into the Wayback Machine at least
18:55:42 k, will do
18:55:43 that is also the audio/video files themselves
18:56:27 that is likely to be problematic but i will generate the list
18:59:24 right i see, podcast.rthk.hk
19:22:20 oof, i see the problem: episodes more than a year old aren't on their cdn at all; they also come in a playlist version, but it's self-hosted as well. if i can get one down i will compare the quality to the 'archive' mp4 and act accordingly
19:38:03 Is all of their stuff not on youtube?
19:38:18 I just checked a recent video and it's just a youtube embed
19:41:14 these videos are not youtube embeds
19:41:18 I don't see any indication that the site is going anywhere but it's good to get a backup
19:41:46 yeah, better to back up stuff than to be forced into an emergency rescue
19:41:55 they have a playlist for "hong kong connection" (this show), but many, many of the videos are unavailable https://www.youtube.com/playlist?list=PLuwJy35eAVaJ-DaWHYe8PK6Yg-cyEMVo1
19:42:19 Ryz: Yes re: Giant Bomb.
19:43:03 Giant Bomb has forums, a wiki, and premium content (requires a subscription to access that kind of content)
19:43:14 On top of being a news and media website for video games
19:43:39 This should expand to the other related websites that are under Red Ventures
19:43:42 https://www.giantbomb.com/shows/returnal/2970-21070
19:43:59 So checking their recent video lists, I'd say 3/4 of them are on youtube
19:44:09 And some of them on the site are youtube embeds, such as ^
19:45:12 This isn't the first time calls for this stuff to be archived have been echoed; Jason Scott put out a message via Twitter encouraging ArchiveTeam to do such an archiving
19:45:28 But yes there are some that are not on youtube, such as https://www.giantbomb.com/shows/4-30-2021-g-is-for-golden/2970-21074
19:46:29 I can get their recent twitch videos as a low-res backup copy, as they will most likely end up on youtube as higher-res copies, and hard drives are expensive now :-\
19:48:17 ouch, youtube-dl does not like that link. It has a GiantBomb extractor but maybe it's unmaintained/broken?
19:48:39 ytdl is not known for keeping up with its prs; try youtube-dlc?
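On the 18:42:02 question about m3u8 playlists with *.ts fragments: MPEG-TS segments can simply be concatenated in playlist order. A bare-bones Python sketch of that idea with a placeholder playlist URL; real tools (streamlink, ffmpeg, yt-dlp) additionally handle master playlists, encryption, and retries:

    # Sketch: reassemble a plain HLS media playlist by concatenating its .ts segments.
    import urllib.request
    from urllib.parse import urljoin

    PLAYLIST_URL = "https://example.akamaized.net/show/episode/index.m3u8"  # placeholder URL

    playlist = urllib.request.urlopen(PLAYLIST_URL).read().decode()
    segments = [line.strip() for line in playlist.splitlines()
                if line.strip() and not line.startswith("#")]

    with open("episode.ts", "wb") as out:
        for seg in segments:
            seg_url = urljoin(PLAYLIST_URL, seg)   # segment paths are usually relative
            out.write(urllib.request.urlopen(seg_url).read())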
19:49:06 Isn't there another one besides that that is even more up to date
19:49:27 youtube-dlc hasn't had a commit to master since October
19:50:18 *December
19:50:41 yt-dlp
19:51:28 there is a download link, but it's only for the audio; the video just seems to be a placeholder
19:53:13 https://github.com/yt-dlp/yt-dlp
20:10:00 from #archiveteam:
20:10:03 Game site with ~13 years of history has 3 of its founders leaving after ~13 years. No word yet on if videos are going anywhere. videos hosted on their site as well as youtube.com, in most cases. tons and tons of 2h+ video. As a fan, i think the biggest risk is that the site jettisons some of its less visible/profitable features, like its
20:10:03 extensive wiki. old videos (older ones may not be on youtube?) may also get deleted for storage reasons. https://www.resetera.com/threads/vinny-caravella-alex-navarro-brad-shoemaker-announce-theyre-leaving... goodtime 15:03:38 one of the people leaving: "We are still a website... in a time when websites kind of don't exist anymore". storm clouds on
20:10:04 the wiki "Are they gonna be on our forum? Are they gonna be on discords?" founder, still staying: "Do we still need a website? I've been asking for 5 years"
20:10:26 tldr old videos (not on youtube) and non-videos are the highest risk imo
20:37:05 Probably easiest to list the web pages to scrape and then get a listing of all the videos and download them somehow
20:37:56 Holy Cow did I have an instinct for a site at risk - pemberley.com is unresponsive, and its old IP address is a parking page
20:38:34 Got it just in time?
20:38:50 apparently!
20:39:37 there wasn't anything on the site that announced it was going away, so this might just be a temporary hiccup, but given how unresponsive it was, I felt its days were numbered
20:40:05 Hope AB didn't knock it out (I don't seriously think AB knocked it out)
20:41:15 And if someone writes code to get a listing of GB's pages, that should be put on GitHub and linked on the wiki so it can be rerun in the future :)
20:44:10 Did something similar for the TM-exchange. Dumped the URLLists to archive.org and added the source code of the tool into that item, too. Better to have the code at multiple locations
20:44:52 URLList dump makes it easier to do an incremental update since replays don't need redownload after initial download, and no need to redo the POST search if you already got the IDs
20:55:44 aside from downloading the whole WARC myself, is there a way to spot-check some URLs? Most of the stories on that site were indexed in a single, slightly mangled table that was de-mangled for viewers one page at a time
20:56:52 (site is back up, but still slow as heck)
20:58:10 each WARC has a cdx which is like a ToC
21:16:07 WRPlayer choked on the metadata WARC
21:17:53 downloading the WARC from https://archive.fart.website/archivebot/viewer/job/b8mfh isn't eating into someone's monthly bandwidth allotment?
21:19:18 It's just an index for the AB collection on IA.
21:19:22 oh, good
21:21:42 if those pages end up not being in there, what is the best way to archive the list of URLs I parse from the slightly mangled list?
21:22:52 how is it mangled?
21:23:51 https:\/\/pemberley.com\/derby\/ariane1.cim.html
21:23:51 sidenote: Just noticed that on the Wikiteam dump the last upload was 2016.
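For the 20:55:44 spot-check question, the CDX index mentioned at 20:58:10 can be checked against a URL list without downloading the WARC itself. A naive Python sketch with placeholder filenames; it only does substring checks against the CDX lines rather than parsing the field layout, which is enough for a quick yes/no per URL:

    # Sketch: report URLs from a list that don't appear anywhere in the CDX index.
    with open("pemberley.cdx") as f:
        cdx = f.read()

    with open("story_urls.txt") as f:
        for url in (line.strip() for line in f):
            if url and url not in cdx:
                print("missing:", url)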
21:24:07 grep them all out and replace \/ with /
21:24:08 no big deal to clean up in PowerShell or whatever
21:24:20 yeah, scripting or some quick C# code is the best way sometimes
21:24:20 (to pull out of the table)
21:25:02 *last upload of wikimedia commons
21:25:03 sed 's,\\/,/,g'
21:25:05 I thought it would be some serious JS BS, but no, I can see them all clear as day when I pull that page with curl
21:25:35 Slashes are often unnecessarily escaped in JS strings (including embedded JSON).
21:25:41 they're stuck in a table, but a regular enough pattern. Not sure if ArchiveBot would have caught this.
21:25:59 probably not, since the backslashes hide it
21:26:13 unless it has some unmangling code for that
21:26:30 but easiest to verify by crosschecking that list with the cdx of the WARC file
21:26:34 I get the feeling that some of this might have been done to prevent just the sort of thing we just did
21:27:02 still better than __doPostBack aspx pagination that doesnt use the URL
21:27:14 but their main fear was probably the stories being posted on fanfiction.net or the like under different authors' names
21:27:19 If it's JS, wpull handles that by calling json.loads.
21:27:26 nice
21:27:44 whats the initial URL where the table resides?
21:28:00 https://pemberley.com/?page_id=5270
21:30:30 if it turns out that AB didn't get them, I'll clean them up and put them in a list - no reason for y'all to bother
21:31:14 just curious about the fuckery hidden in that page
21:34:09 it's a site that was started before Google was
21:34:38 all I can guess is that it's some effort to prevent low-effort web scraping
21:35:03 script tag with a CDATA wrapper around it; not sure if wpull expects a variable assignment containing the essential data
21:37:19 what's the polite way to get AB to pull a list of links that are all on the same site, but aren't the only thing on that site?
21:38:12 Oh, I see, it's HTML in JS strings. Yeah, that isn't processed by wpull I think.
21:38:27 you probably don't want several hundred !ao messages in the channel
21:39:06 Create a file containing one URL per line, upload that to https://transfer.archivete.am/ (with a good filename!), then use !ao < LISTURL.
21:39:31 and you don't need several hundred copies of that obnoxious background image
21:39:46 that was probably very classy in 1997
21:39:53 great!
21:39:55 is transfer.archivete.am required, or does any deeplinkable host work?
21:40:33 ?
21:40:54 Anything works. Anything with good filenames (e.g. not Pastebin) is acceptable. transfer.archivete.am is strongly recommended.
21:41:01 I need to check, but I think some of them might just be the first chapter of multi-chaptered stories, linked in who knows what pattern
21:41:16 (This might change in the future, we'll see.)
21:41:35 also: got this link https://app.box.com/s/6b9wmjvr582c95uzma1136exumk6p989/folder/136698646305 via this tweet: https://twitter.com/simoncarless/status/1389297530341519362
21:42:23 Apple Vs Epic Lawsuit Extended stuff. (not directly in the RECAP archive which pipes to archive.org)
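The same unmangling as the sed one-liner above, as a rough Python sketch: it pulls the backslash-escaped pemberley.com links out of the index page and writes one URL per line, ready for transfer.archivete.am and !ao < LISTURL. The regex is a guess based on the sample shown at 21:23:51 and may need adjusting for the real page:

    # Sketch: extract escaped "https:\/\/pemberley.com\/..." links and unescape them.
    import re
    import urllib.request

    page = urllib.request.urlopen("https://pemberley.com/?page_id=5270").read().decode("utf-8", "replace")

    urls = sorted(set(
        m.replace("\\/", "/")
        for m in re.findall(r"https:\\/\\/pemberley\.com\\/[^\"'\s]+", page)
    ))

    with open("pemberley_stories.txt", "w") as out:
        out.write("\n".join(urls) + "\n")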
22:31:36 LeighR: if we know of any people, it would be good to get in contact with them
22:31:54 https://thediplomat.com/2021/04/hong-kongs-activists-in-exile/
22:32:35 but those are perhaps not as archive-oriented
22:35:19 I remember some folks in college who were from Taiwan (important because they and HKers can read the full Traditional Chinese character set, while the mainland uses Simplified Chinese)
22:37:03 This group would probably be delighted with your help: https://www.2021hkcharter.com/
22:41:47 I'll do some more looking into who might be able to make the best use of AT's help
22:42:18 for Giant Bomb we could probably amass a collection of premium subscribers who want to make sure the content is archived. premium subs get download URLs which are supposedly checked for abuse (i.e. no mass downloads, i think an api key is involved)