-
nuroten
thuban: not necessarily suggesting this one for consideration, but to maybe convey a sense of what the site has that could be lost:
podcast.rthk.hk/podcast/item.php?pid=205&year=2014&lang=en-US news commentary about The Umbrella movement, 25th anniversary of June 4th (a topic censored in mainland China)
-
nuroten
along with a few social topics like housing strategy, academic freedom. as you mentioned, there's a lot of content, so whatever else (if anything) you decide might be interesting or worthwhile
-
thuban
unfortunately their server seems to be very, very slow (at least for me)
-
nuroten
a lot of the podcasts listed run for maybe a year or two and are complete/discontinued
-
nuroten
maybe their servers are being flooded with people trying to save bits and pieces of old shows :)
-
thuban
the first video apparently downloaded for 40 minutes and then hit an ECONNRESET
-
thuban
maybe a geo thing? anyone have servers nearby?
-
thuban
hm, or maybe i should be faking a useragent; it didn't seem to be this bad in the browser
-
nuroten
think they're streaming from akamai, at least in the browser
-
thuban
the site uses akamai; the xml feed links to a file (not a playlist) on archive.rthk.hk--but i was able to load a video from it in the browser earlier
-
thuban
can't seem to now, though
-
thuban
or maybe a little, just really incredibly slow
-
thuban
i guess i can rewrite to grab the akamai version
-
Inhonion
T-minus 3:20 until the Y!A shutdown right?
-
JAA
arkiver: The thing you're probably thinking of hasn't been in operation for a while now, unfortunately.
-
arkiver
i see
-
JAA
MeriStation Comunidad Zonaforo qwarc grab is started. I'm only retrieving the thread pages. Their servers are horribly slow at an average response time of 4 seconds, so we'll see how that goes.
-
OrIdow6
arkiver: From what I've experienced (have not systematically tested it), there's a short ban of between 10 hours and a day; then if you continue after that, it's permanent (or long enough that I haven't been unbanned yet)
-
JAA
I can't go very hard at MeriStation. Starting to see timeouts and DB errors at only 200 connections. Average response time also increased to 6.5 seconds. This is the most I can get out of it I think.
-
JAA
Gives an ETA of 46 hours or so. Not fast enough, sadly.
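For a rough sense of scale, the figures quoted above (200 connections, 6.5 second average response time, 46 hour ETA) can be turned into an approximate request count. This is only a back-of-the-envelope sketch: it assumes every connection stays busy with one request at a time and that the response time holds steady, neither of which is confirmed in the log. The function name is made up for illustration.

```python
def estimate_requests(connections: int, avg_response_s: float, eta_hours: float) -> float:
    """Approximate number of requests left, assuming steady-state throughput."""
    # Each connection completes one request every avg_response_s seconds.
    throughput = connections / avg_response_s  # requests per second
    return throughput * eta_hours * 3600


throughput = 200 / 6.5
remaining = estimate_requests(200, 6.5, 46)
print(f"{throughput:.1f} req/s, ~{remaining / 1e6:.1f}M requests")
```

Under those assumptions the crawl runs at roughly 31 requests per second, which implies on the order of five million thread pages still to fetch.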
-
JAA
Less than 2 days of lead time after 21 years... :-|
-
jodizzle
:(
-
thuban
hey avoozl, how's your xenforo support?
-
avoozl
Currently, non-existent, but adding new parsers is pretty doable
-
thuban
i have some warcs if you want raw material
-
avoozl
thuban: I currently am working first on getting it to go a bit faster so I can build the yahoo answers index, but pointers are always welcome.
-
avoozl
thuban: if you want to take a look how the parsers are currently implemented... the league of legends forum parser looks like this:
paste.ofcode.org/QHnHH4ErUvsnmCptW4SiH2
-
avoozl
thuban: basically a bunch of selectors to get the right bits from the page, construct them into a Post object, and the indexer takes it from there
-
thuban
whoa, go :o
-
avoozl
yeah I figured for something self-hosted that'd be easiest and most compact
-
thuban
been meaning to get into that, i'll have a look
-
avoozl
that html sanitization is probably going to be removed here, I'll make that a task of the front-end serving the html instead. It is all still pretty much in motion
-
» thuban nods
-
avoozl
for yahoo answers things are much more complex, as there are json payloads and other odd things in there (multiple payload types that each refer to the same type of data)
-
avoozl
I'm parsing the yahoo scrape at around 7MB/sec, most of that is spent on cpu limited tasks.. and my download speed from archive.org is pretty low so I'm still 9 days behind (20210423 is downloaded)
-
LeighR
Re: ArchiveBot - is it better for me to do an initial run with grab-site to see if there's some giant forum archive off to the side, or if a blind pull of the site ends up pulling a lot of external, uninteresting content (like an awful lot of 3rd party Wordpress controls or gstatic.com fonts?)
-
LeighR
Because one of the sites jo-dizzle put in for me just seems to be ballooning
-
thuban
if you have voice (which you can probably just ask nicely for) you can alter ignoresets on the fly as with grab-site
-
jodizzle
Yes, we regularly check in on crawls and add ignores as appropriate
-
jodizzle
Leaving the forum in that job was intentional LeighR, but you're right that it probably needs some ignores
-
LeighR
ok
-
LeighR
is it better in cases like this to instruct AB to ignore off-site links?
-
LeighR
at least in the forums?
-
LeighR
I know that it needs them to make the site itself display correctly
-
jodizzle
For very large forums it's usually better to ignore off-site links. I guess we'll see if this requires that.
-
jodizzle
Note though that if using `--no-offsite-links` when launching the job, AB will still pick up off-site page dependencies, like stylesheets and such
-
LeighR
for something like dwiggie.com, offsite links to individual files (images) would have made sense to keep, but probably not the complete contents needed to render an Amazon page for a book someone in a forum post recommended
-
LeighR
is there a way to make AB ignore offsite links for specific paths, or to only allow them for specific paths?
-
LeighR
or should a site be broken into separate jobs?
-
thuban
you can use regex ignores to manage offsite links (but it won't be as simple as applying one set of rules to same-domain urls and another to offsites)
-
LeighR
would it make more sense to break a site like this into "not-forum" and "forums"?
-
LeighR
the stuff under "not-forum" is the important part
-
avoozl
thuban: just in case, I've added a #warceater channel for anything related to this code I'm building
-
s-crypt
Is there a dashboard to view the staging server progress anywhere?
-
s-crypt
fos.textfiles.com/pipeline.html doesnt seem to contain anything yahoo related afaik
-
rewby
If you're looking for Yahoo Answers related status, then no. There isn't a statuspage that shows the upload progress.
-
EggplantN
yeah we dont have that public on any projects
-
avoozl
does anyone know how large the yahoo answers set will be in total? I may have to clean out some space for that
-
Jake
Tracker says 4.75TiB compressed for the new project, and 30TB (uncompressed) for the 2016 project.
-
avoozl
4.75TiB sounds good. I've got around 2.5 downloaded so I will need to create some extra space
-
avoozl
thanks
-
avoozl
I'll probably need around 3TiB for the index as well. this will be interesting juggling some free space
-
HCross
arkiver: isolario is on its way to the IA :)
-
EggplantN
so is a random 3TB of webs, 3.8TB of bintray (once I have vars)
-
EggplantN
and im sure im about to find more crap
-
HCross
bets on finding a folder of G+ somewhere?
-
EggplantN
i dont think i have anything that old
-
arkiver
Google+ was a nice one
-
arkiver
sounds good HCross!
-
hilda
here's another favicon idea with a 3.5" floppy:
i.imgur.com/ChCYwKs.png i.imgur.com/XDfSEOv.png
-
JAA
So that MeriStation archive didn't go well... Slowed to a crawl, then I got banned.
-
JAA
Looks like I only got maybe 7 % of it up to that point.
-
serx
the meristation case is literally incredible
-
nuroten
thuban: is there a way to feed the rss xml to wget or some app and have it download the links inside, then format it for upload to IA? I tried to download a few audio podcast episodes manually (leaving aside for a moment the file descriptions in the xml being cropped so still have to find a way to fetch those)
-
nuroten
the AT wiki page on wget has a command for webpages I'm trying as well, but haven't managed to adjust it to narrow down fetching to just the pages related to a single podcast
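One possible way to approach the question above is to pull the media links out of the feed first and hand the resulting list to a downloader. The sketch below assumes the feed is standard RSS 2.0 with `<enclosure url="...">` elements inside each `<item>`; the actual RTHK feed may be laid out differently. The sample feed and the `example.invalid` URL are placeholders, not real data.

```python
import xml.etree.ElementTree as ET


def enclosure_urls(rss_xml: str) -> list[str]:
    """Return the enclosure URL of every <item> that has one."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            urls.append(enclosure.get("url"))
    return urls


# Placeholder feed for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Episode 1</title>
<enclosure url="https://example.invalid/ep1.mp3" type="audio/mpeg" length="0"/></item>
<item><title>Episode 2</title></item>
</channel></rss>"""

for url in enclosure_urls(SAMPLE_FEED):
    print(url)
```

Saving the printed URLs to a text file gives something `wget -i` can read, one URL per line. Titles and descriptions could be collected from the same `<item>` elements, though as noted above the descriptions in this feed appear to be cropped.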
-
nschmeller
Hi! I hope this is the right channel for this question--the Clash of Clans forums are being shut down in a couple months, and I'm wondering how I can archive them. I saw that there was a script for getting Yahoo Answers on the Internet Archive based off its sitemap, does anyone know where to find that script?
-
arkiver
nschmeller: what is the URL for the forum
-
JAA
Read-only in June, shutdown in August
-
nschmeller
Yup, ^^
-
arkiver
looks like sequential IDs
-
nschmeller
If i'm reading correctly, someone with permissions will have to point the archive bot at the main webpage and it'll get everything?
-
arkiver
even the members have sequential IDs
-
arkiver
JAA: is this small enough for archivebot?
-
JAA
Yup, standard forum, but with session ID hell.
-
nschmeller
What is session ID hell?
-
JAA
Too big for AB, but I can do it with qwarc.
-
JAA
nschmeller: When you access it without cookies, it adds an 's' parameter to every link. As the session expires after a while, it inevitably devolves into a huge mess of different session IDs being crawled etc.
-
nschmeller
Interesting, sounds annoying. Does that mean that the same page might be archived multiple times once a session expires?
-
JAA
Yeah
-
JAA
It would keep recursing through the site endlessly.
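To illustrate why the session parameter matters for deduplication, here is a minimal sketch that strips an `s` query parameter so that two URLs differing only in session ID compare equal. It is not how qwarc or ArchiveBot handle this; the function name and the `forum.example.invalid` URLs are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url: str, param: str = "s") -> str:
    """Drop one query parameter (default 's') and keep the rest in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two hypothetical links to the same thread under different sessions.
a = strip_session_id("https://forum.example.invalid/showthread.php?s=abc123&t=42")
b = strip_session_id("https://forum.example.invalid/showthread.php?s=def456&t=42")
print(a == b, a)
```

Without this kind of normalisation (or a cookie jar that keeps the session alive), a crawler treats every new session ID as a new set of pages, which is the endless recursion described above.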
-
nschmeller
Doesn't sound good
-
nschmeller
What can I do to help?
-
JAA
I'll get this sorted. :-)
-
nschmeller
Awesome!! I'm surprised I haven't come across this group earlier, I've been religiously contributing to the IA since 2016
-
Ryz
Uhhh, should we do a proactive archiving of Giant Bomb?
giantbomb.com More and more people are leaving Giant Bomb, 3 notable people are Vinny Caravella, Alex Navarro, and Brad Shoemaker
-
Ryz
Ever since being acquired and bought away from CBS Interactive, there has been bleeding talent over time S:
-
Ryz
Apparently, there's only 2 notable people left :/
-
thuban
nuroten: that is more or less what i'm doing
-
thuban
the trouble is that their video _and_ their web pages _and_ apparently their CDN are all a bit flaky, so there's a lot of retrying involved
-
nuroten
thuban: nice. yeah, their servers are slow
-
nuroten
did you manage to get the equivalent akamai urls? not that it's less flaky, hoped it would be a bit faster
-
thuban
i did
-
thuban
the 2020 ones were fine but 2019 (and presumably earlier) are giving streamlink problems; can't investigate now but will look at it this evening
-
nuroten
that's good, is that exposed/extractable via browser inspector? I saw some m3u8 playlist files with *.ts fragments but not sure how to put it back together (or maybe that's not it)
-
thuban
there are tools to handle those but, as i say, problems
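For context on the m3u8 question: in an HLS media playlist, lines starting with `#` are tags or comments and every other non-empty line is a segment URI, usually relative to the playlist URL. The sketch below only lists the segment URLs in order; it assumes a plain media playlist (not a master playlist pointing at variants) with unencrypted segments, and the sample playlist and CDN host are placeholders.

```python
from urllib.parse import urljoin


def segment_urls(playlist_text: str, playlist_url: str) -> list[str]:
    """Resolve every segment URI in a media playlist against the playlist URL."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Tags and comments start with '#'; anything else is a segment URI.
        if line and not line.startswith("#"):
            urls.append(urljoin(playlist_url, line))
    return urls


# Placeholder playlist for illustration only.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""

print(segment_urls(SAMPLE_PLAYLIST, "https://cdn.example.invalid/show/index.m3u8"))
```

Dedicated tools such as streamlink or ffmpeg do the fetching and reassembly themselves and are usually the better choice; this just shows what the playlist contains.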
-
nuroten
okay ... wouldn't be too surprised if the 2019 ones are flaky, it was one of the more eventful years
-
nuroten
thanks a lot for your work on this!
-
nuroten
I still have to check Youtube, if the quality is identical maybe grabbing from there is another option
-
arkiver
thuban: are you archiving those RTHK videos?
-
thuban
arkiver: yes
-
arkiver
thuban: alright, any details on what is being archived exactly and how?
-
thuban
podcast episodes + thumbnails + metadata (scraped from xml feed and episode pages)
-
arkiver
and videos?
-
arkiver
or are those videos
-
thuban
that's what i meant by "episodes"
-
thuban
i can throw the episode pages into archivebot too if we want provenance
-
arkiver
yeah try to get everything into the Wayback Machine at least
-
thuban
k, will do
-
arkiver
that is also the audio/video files themselves
-
thuban
that is likely to be problematic but i will generate the list
-
arkiver
right i see podcast/rthk.hk
-
thuban
oof, i see the problem: episodes more than a year old aren't on their cdn at all; they also come in a playlist version, but it's self-hosted as well. if i can get one down i will compare the quality to the 'archive' mp4 and act accordingly
-
mgrandi
Is all of their stuff not on youtube?
-
mgrandi
I just checked a recent video and it's just a youtube embed
-
thuban
these videos are not youtube embeds
-
mgrandi
I don't see any indication that the site is going anywhere but it's good to get a backup
-
masterX244
yeah, better to back up stuff than being forced into an emergency rescue
-
thuban
they have a playlist for "hong kong connection" (this show), but many, many of the videos are unavailable
youtube.com/playlist?list=PLuwJy35eAVaJ-DaWHYe8PK6Yg-cyEMVo1
-
JAA
Ryz: Yes re: Giant Bomb.
-
Ryz
Giant Bomb has forums, a wiki, and has premium content (requires a subscription to access that kind of content)
-
Ryz
On top of being a news and media website for video games
-
Ryz
This should expand to the other related websites that are under Red Ventures
-
mgrandi
So checking their recent video lists, I'd say 3/4 of them are on youtube
-
mgrandi
And some of them on the site are youtube embeds, such as ^
-
Ryz
This isn't the first time calls for this stuff being archived was echoed, as Jason Scott gave a message via Twitter on encouraging ArchiveTeam to do such an archiving
-
mgrandi
But yes there are some that are not on youtube, such as
giantbomb.com/shows/4-30-2021-g-is-for-golden/2970-21074
-
mgrandi
I can get their recent twitch videos as a low res backup copy as they most likely will end up in youtube at a higher res copy and hard drives are expensive now :-\
-
lunik1
ouch, youtube-dl does not like that link. Has a GiantBomb extractor but maybe it's unmaintained/broken?
-
thuban
ytdl is not known for keeping up with its prs; try youtube-dlc?
-
mgrandi
Isn't there another one besides that that is even more up to date
-
lunik1
youtube-dlc hasn't had a commit to master since October
-
lunik1
*December
-
Ajay
yt-dlp
-
lunik1
there is a download link but it's only for the audio, but the video just seems to be a placeholder
-
goodtime
from #archiveteam:
-
goodtime
Game site with ~13 years of history has 3 of its founders leaving after ~13 years. No word yet on if videos are going anywhere. videos hosted on their site as well as youtube.com , in most cases. tons and tons of 2h + video. As a fan, i think the biggest risk is that the site jettisons some of its less visible/ profitable features, like its
-
goodtime
extensive wiki. old videos (older ones may not be on youtube?) may also get deleted for storage reasons.
resetera.com/threads/vinny-caravell…heyre-leaving...
-
goodtime
one of the people leaving: "We are still a website... in a time when websites kind of don't exist anymore". storm clouds on
-
goodtime
the wiki "Are they gonna be on our forum? Are they gonna be on discords?" founder, still staying: "Do we still need a website? I've been asking for 5 years"
-
goodtime
tldr old videos (not on youtube) and non videos are the highest risk imo
-
mgrandi
Probably easiest to list the web pages to scrape and then get a listing of all the videos and download them somehow
-
LeighR
Holy Cow did I have an instinct for site at risk - pemberley.com is unresponsive, and its old IP address is a parking page
-
masterX244
Got it just in time?
-
LeighR
apparently!
-
LeighR
there wasn't anything on the site that announced it going away, so this might just be a temporary hiccup, but given how unresponsive it was, I felt its days were numbered
-
LeighR
Hope AB didn't knock it out (I don't seriously think AB knocked it out)
-
mgrandi
And if someone writes code to get a listing of GB's pages , that should be put on GitHub and linked on the wiki so it can be rerun in the future :)
-
masterX244
Did something similar for the TM-exchange. Dumped the URLLists to archive.org and added the source code of the tool into that item, too. Better to have the code at multiple locations
-
masterX244
URLList dump makes it easier to do an incremental update since replays don't need redownload after initial download, and no need to redo the POST search if you already got the IDs
-
LeighR
aside from downloading the whole WARC myself, is there a way to spot-check some URLs? Most of the stories in that site were indexed in a single, slightly mangled table that was de-mangled for viewers one page at a time
-
LeighR
(site is back up, but still slow as heck)
-
masterX244
each WARC has a cdx which is like a ToC
-
LeighR
WRPlayer choked on the metadata WARC
-
LeighR
downloading the WARC from
archive.fart.website/archivebot/viewer/job/b8mfh isn't eating into someone's monthly bandwidth allotment?
-
JAA
It's just an index for the AB collection on IA.
-
LeighR
oh, good
-
LeighR
if those pages end up not being in there, what is the best way to archive the list of URLs I parse from the slightly mangled list?
-
masterX244
how is it mangled?
-
LeighR
https:\\/pemberley.com\/derby\/ariane1.cim.html
-
masterX244
sidenote: Just noticed that on the Wikiteam dump the last upload was 2016.
-
masterX244
grep all out and replace \/ with /
-
LeighR
no big deal to clean up in PowerShell or whatever
-
masterX244
yeah, scripting or some quick C# code is the best way sometimes
-
LeighR
(to pull out of the table)
-
masterX244
*last upload of wikimedia commons
-
JAA
sed 's,\\/,/,g'
-
LeighR
I thought it would be some serious JS BS, but no, I can see them all clear as day when I pull that page with curl
-
JAA
Slashes are often unnecessarily escaped in JS strings (including embedded JSON).
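The same cleanup as the `sed` one-liner can be sketched in Python, with URL extraction added on top. The sample string is a guess at what the embedded table might look like (escaped quotes and slashes inside a JS string); the real page source may differ, and the regex is deliberately loose.

```python
import re


def extract_urls(page_source: str) -> list[str]:
    """Unescape '\\/' to '/' and return the unique http(s) URLs found."""
    unescaped = page_source.replace("\\/", "/")
    # Stop at whitespace, quotes, angle brackets or a stray backslash.
    return sorted(set(re.findall(r"https?://[^\s\"'<>\\]+", unescaped)))


# Hypothetical fragment of HTML embedded in a JS string.
sample = r'<a href=\"https:\/\/pemberley.com\/derby\/ariane1.cim.html\">'
print(extract_urls(sample))
```

The output can be written to a file with one URL per line for later use.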
-
LeighR
they're stuck in a table, but a regular enough pattern. Not sure if ArchiveBot would have caught this.
-
masterX244
probably nope since the backslashes hide it
-
masterX244
unless it got some unmangling code for that
-
masterX244
but easiest to verify by crosschecking that list with the cdx of the WARC file
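A cross-check along those lines might look like the sketch below. It assumes the classic space-separated CDX layout, where the first line is a header such as ` CDX N b a m s k r M S V g` and the `a` field is the original URL; if the index uses a different layout the column lookup will need adjusting. The sample CDX lines, hashes, and filenames are invented for illustration.

```python
def cdx_urls(cdx_text: str) -> set[str]:
    """Collect the original-URL column ('a') from a space-separated CDX file."""
    lines = cdx_text.splitlines()
    header = lines[0].split()        # e.g. ['CDX', 'N', 'b', 'a', ...]
    column = header.index("a") - 1   # the 'CDX' marker is not a data column
    return {line.split()[column] for line in lines[1:] if line.strip()}


# Invented sample index for illustration only.
SAMPLE_CDX = """ CDX N b a m s k r M S V g
invalid,example)/a 20210504000000 https://example.invalid/a text/html 200 AAAA - - 100 0 job.warc.gz
invalid,example)/b 20210504000001 https://example.invalid/b text/html 200 BBBB - - 100 100 job.warc.gz
"""

wanted = {"https://example.invalid/a", "https://example.invalid/c"}
missing = wanted - cdx_urls(SAMPLE_CDX)
print(sorted(missing))
```

Note that this compares URLs as exact strings, so differences in scheme, trailing slashes, or percent-encoding would show up as false misses.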
-
LeighR
I get the feeling that some of this might have been done to prevent just the sort of thing we just did
-
masterX244
still better than __doPostBack aspx pagination that doesnt use the URL
-
LeighR
but their main fear was probably the stories being posted on fanfiction.net or the like under different authors' names
-
JAA
If it's JS, wpull handles that by calling json.loads.
-
LeighR
nice
-
masterX244
whats the initial URL where the table resides?
-
LeighR
if it turns out that AB didn't get them, I'll clean them up and put them in a list - no reason for y'all to bother
-
masterX244
just curious on the fuckery hidden in that page
-
LeighR
it's a site that was started before Google was
-
LeighR
all I can guess is that it's some effort to prevent low-effort web scraping
-
masterX244
script tag with a CDATA wrapper around, not sure if wpull expects a variable assignment containing the essential data
-
LeighR
what's the polite way to get AB to pull a list of links that are all on the same site, but aren't the only thing on that site?
-
JAA
Oh, I see, it's HTML in JS strings. Yeah, that isn't processed by wpull I think.
-
LeighR
you probably don't want several hundred !ao messages in the channel
-
JAA
Create a file containing one URL per line, upload that to
transfer.archivete.am (with a good filename!), then use !ao < LISTURL.
-
LeighR
and you don't need several hundred copies of that obnoxious background image
-
LeighR
that was probably very classy in 1997
-
LeighR
great!
-
masterX244
the transfer.archivete.am required or any deeplinkable host working
-
masterX244
?
-
JAA
Anything works. Anything with good filenames (e.g. not Pastebin) is acceptable. transfer.archivete.am is strongly recommended.
-
LeighR
I need to check, but I think some of them might just be the first chapter of multi-chaptered stories, linked in who knows what pattern
-
JAA
(This might change in the future, we'll see.)
-
masterX244
Apple Vs Epic Lawsuit Extended stuff. (not directly in the RECAP archive which pipes to archive.org)
-
arkiver
LeighR: if we know of any people, would be good to get in contact with
-
LeighR
but those are perhaps not as archive-oriented
-
LeighR
I remember some folks in college who were from Taiwan (important because they and HKers can read the full Traditional Chinese character set, while the mainland uses Simplified Chinese)
-
LeighR
This group would probably be delighted with your help:
2021hkcharter.com
-
LeighR
I'll do some more looking for who might be able to make best use of AT's help
-
goodtime
for Giant Bomb we could probably amass a collection of premium subscribers who want to make sure the content is archived. premium subs get download URLs which are supposedly checked for abuse (i.e. no mass downloads, i think an api key is involved)