#archiveteam-bs

00:11

nuroten

thuban: not necessarily suggesting this one for consideration, but to maybe convey a sense of what the site has that could be lost: podcast.rthk.hk/podcast/item.php?pid=205&year=2014&lang=en-US news commentary about The Umbrella movement, 25th anniversary of June 4th (a topic censored in mainland China)
00:21

nuroten

along with a few social topics like housing strategy, academic freedom. as you mentioned, there's a lot of content, so whatever else (if anything) you decide might be interesting or worthwhile
00:23

thuban

unfortunately their server seems to be very, very slow (at least for me)
00:24

nuroten

a lot of the podcasts listed run for maybe a year or two and are complete/discontinued
00:25

nuroten

maybe their servers are being flooded with people trying to save bits and pieces of old shows :)
00:25

thuban

the first video apparently downloaded for 40 minutes and then hit an ECONNRESET
00:25

thuban

maybe a geo thing? anyone have servers nearby?
00:29

thuban

hm, or maybe i should be faking a useragent; it didn't seem to be this bad in the browser
00:31

nuroten

think they're streaming from akamai, at least in the browser
00:34

thuban

the site uses akamai; the xml feed links to a file (not a playlist) on archive.rthk.hk--but i was able to load a video from it in the browser earlier
00:34

thuban

can't seem to now, though
00:34

thuban

or maybe a little, just really incredibly slow
00:35

thuban

i guess i can rewrite to grab the akamai version
00:40

Inhonion

T-minus 3:20 until the Y!A shutdown right?
00:56

JAA

arkiver: The thing you're probably thinking of hasn't been in operation for a while now, unfortunately.
00:58

arkiver

i see
02:36

JAA

MeriStation Comunidad Zonaforo qwarc grab is started. I'm only retrieving the thread pages. Their servers are horribly slow at an average response time of 4 seconds, so we'll see how that goes.
02:38

OrIdow6

arkiver: From what I've experienced (have not systematically tested it), there's a short ban of between 10 hours and a day; then if you continue after that, it's permanent (or long enough that I haven't been unbanned yet)
03:14

JAA

I can't go very hard at MeriStation. Starting to see timeouts and DB errors at only 200 connections. Average response time also increased to 6.5 seconds. This is the most I can get out of it I think.
03:15

JAA

Gives an ETA of 46 hours or so. Not fast enough, sadly.
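
A rough sketch of how an ETA like this can be estimated, assuming the crawl is in steady state so that throughput is roughly connections divided by average response time. The connection count and response time come from the messages above; the total request count is a hypothetical placeholder, since the real number of thread pages is not stated in the log.

```python
def throughput_rps(connections: int, avg_response_s: float) -> float:
    # Steady-state requests per second (Little's law: L = lambda * W).
    return connections / avg_response_s


def eta_hours(total_requests: int, connections: int, avg_response_s: float) -> float:
    # Time to finish total_requests at the steady-state rate, in hours.
    return total_requests / throughput_rps(connections, avg_response_s) / 3600


# 200 connections at 6.5 s average response time (figures from the log).
rate = throughput_rps(200, 6.5)
# Hypothetical total; NOT a figure from the actual grab.
total = 5_000_000
print(f"{rate:.1f} req/s, ETA {eta_hours(total, 200, 6.5):.1f} h")
```

Raising the connection count only helps while the response time stays flat; here it rose from 4 to 6.5 seconds, which eats most of the gain.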
03:16

JAA

Less than 2 days of lead time after 21 years... :-|
03:20

jodizzle

:(
05:44

thuban

hey avoozl, how's your xenforo support?
05:45

avoozl

Currently, non-existent, but adding new parsers is pretty doable
05:47

thuban

i have some warcs if you want raw material
05:51

avoozl

thuban: I currently am working first on getting it to go a bit faster so I can build the yahoo answers index, but pointers are always welcome.
05:52

avoozl

thuban: if you want to take a look how the parsers are currently implemented... the league of legends forum parser looks like this: paste.ofcode.org/QHnHH4ErUvsnmCptW4SiH2
05:52

avoozl

thuban: basically a bunch of selectors to get the right bits from the page, construct them into a Post object, and the indexer takes it from there
05:53

thuban

whoa, go :o
05:54

avoozl

yeah I figured for something self-hosted that'd be easiest and most compact
05:54

thuban

been meaning to get into that, i'll have a look
05:55

avoozl

that html sanitization is probably going to be removed here, I'll make that a task of the front-end serving the html instead. It is all still pretty much in motion
05:55

» thuban nods
05:55

avoozl

for yahoo answers things are much more complex, as there are json payloads and other odd things in there (multiple payload types that each refer to the same type of data)
06:26

avoozl

I'm parsing the yahoo scrape at around 7MB/sec, most of that is spent on CPU-limited tasks.. and my download speed from archive.org is pretty low so I'm still 9 days behind (20210423 is downloaded)
07:11

LeighR

Re: ArchiveBot - is it better for me to do an initial run with grab-site to see if there's some giant forum archive off to the side, or if a blind pull of the site ends up pulling a lot of external, uninteresting content (like an awful lot of 3rd party Wordpress controls or gstatic.com fonts?)
07:12

LeighR

Because one of the sites jo-dizzle put in for me just seems to be ballooning
07:13

thuban

if you have voice (which you can probably just ask nicely for) you can alter ignoresets on the fly as with grab-site
07:14

jodizzle

Yes, we regularly check in on crawls and add ignores as appropriate
07:15

jodizzle

Leaving the forum in that job was intentional LeighR, but you're right that it probably needs some ignores
07:15

LeighR

ok
07:18

LeighR

is it better in cases like this to instruct AB to ignore off-site links?
07:19

LeighR

at least in the forums?
07:20

LeighR

I know that it needs them to make the site itself display correctly
07:25

jodizzle

For very large forums it's usually better to ignore off-site links. I guess we'll see if this requires that.
07:26

jodizzle

Note though that if using `--no-offsite-links` when launching the job, AB will still pick up off-site page dependencies, like stylesheets and such
07:36

LeighR

for something like dwiggie.com, offsite links to individual files (images) would have made sense to keep, but probably not the complete contents needed to render an Amazon page for a book someone in a forum post recommended
07:39

LeighR

is there a way to make AB ignore offsite links for specific paths, or to only allow them for specific paths?
07:39

LeighR

or should a site be broken into separate jobs?
07:47

thuban

you can use regex ignores to manage offsite links (but it won't be as simple as applying one set of rules to same-domain urls and another to offsites)
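
A minimal sketch of the kind of regex this could involve, assuming the goal is "ignore off-site URLs unless they look like image files". The host name, the extension list, and the helper name `should_ignore` are all hypothetical, and this only tests the pattern in Python; it is not ArchiveBot's own ignore syntax or command set.

```python
import re

# Hypothetical host of the site being archived; substitute the real one.
SITE_HOST = "www.example.org"

# Lookahead 1: the host is NOT the site's own host
# (followed by port, path, query, fragment, or end of string).
host_part = rf"(?!{re.escape(SITE_HOST)}(?:[:/?#]|$))"
# Lookahead 2: the part before any query/fragment does NOT end in an image extension.
image_part = r"(?![^?#]*\.(?:jpe?g|png|gif)(?:[?#]|$))"

OFFSITE_NON_IMAGE = re.compile("^https?://" + host_part + image_part, re.IGNORECASE)


def should_ignore(url: str) -> bool:
    # True for off-site URLs that are not image files.
    return OFFSITE_NON_IMAGE.match(url) is not None


for url in (
    "https://www.example.org/forum/thread/1",  # on-site: kept
    "https://i.imgur.com/cover.png",           # off-site image: kept
    "https://www.amazon.com/dp/0141439513",    # off-site page: ignored
):
    print(url, "->", "ignore" if should_ignore(url) else "keep")
```

Note that a single pattern like this applies site-wide; it cannot know which page a link was found on, which is why per-path rules for off-site links are not as simple.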
07:53

LeighR

would it make more sense to break a site like this into "not-forum" and "forums"?
07:53

LeighR

the stuff under "not-forum" is the important part
08:28

avoozl

thuban: just in case, I've added a #warceater channel for anything related to this code I'm building
12:09

s-crypt

Is there a dashboard to view the staging server progress anywhere?
12:11

s-crypt

fos.textfiles.com/pipeline.html doesn't seem to contain anything yahoo related afaik
12:12

rewby

If you're looking for Yahoo Answers related status, then no. There isn't a statuspage that shows the upload progress.
12:23

EggplantN

yeah we dont have that public on any projects
12:38

avoozl

does anyone know how large the yahoo answers set will be in total? I may have to clean out some space for that
12:39

Jake

Tracker says 4.75TiB compressed for the new project, and 30TB (uncompressed) for the 2016 project.
12:42

avoozl

4.75TiB sounds good. I've got around 2.5 downloaded so I will need to create some extra space
12:42

avoozl

thanks
12:43

avoozl

I'll probably need around 3TiB for the index as well. this will be interesting juggling some free space
13:45

HCross

arkiver: isolario is on its way to the IA :)
13:48

EggplantN

so is a random 3TB of webs, 3.8TB of bintray (once I have vars)
13:48

EggplantN

and im sure im about to find more crap
13:49

HCross

bets on finding a folder of G+ somewhere?
13:50

EggplantN

i dont think i have anything that old
14:01

arkiver

Google+ was a nice one
14:01

arkiver

sounds good HCross!
14:42

hilda

here's another favicon idea with a 3.5" floppy: i.imgur.com/ChCYwKs.png i.imgur.com/XDfSEOv.png
15:41

JAA

So that MeriStation archive didn't go well... Slowed to a crawl, then I got banned.
15:48

JAA

Looks like I only got maybe 7 % of it up to that point.
16:06

serx

the meristation case is literally incredible
17:04

nuroten

thuban: is there a way to feed the rss xml to wget or some app and have it download the links inside, then format it for upload to IA? I tried to download a few audio podcast episodes manually (leaving aside for a moment the file descriptions in the xml being cropped so still have to find a way to fetch those)
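
One possible approach, sketched under the assumption that the feed is a standard RSS 2.0 document whose media files are listed in `<enclosure url="...">` elements. The sample feed and its URLs are made up for illustration; the real RTHK feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed would be downloaded first.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://media.example.invalid/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://media.example.invalid/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>
"""


def enclosures(feed_xml: str) -> list[tuple[str, str]]:
    # Return (episode title, media URL) pairs for every item with an enclosure.
    root = ET.fromstring(feed_xml)
    result = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            result.append((item.findtext("title", default=""), enclosure.get("url")))
    return result


for title, url in enclosures(SAMPLE_FEED):
    print(url)
```

Writing the printed URLs to a file, one per line, gives a list that `wget -i` can read; packaging the results for upload to IA is a separate step not covered here.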
17:07

nuroten

the AT wiki page on wget has a command for webpages I'm trying as well, but haven't managed to adjust it to narrow down fetching to just the pages related to a single podcast
17:29

nschmeller

Hi! I hope this is the right channel for this question--the Clash of Clans forums are being shut down in a couple months, and I'm wondering how I can archive them. I saw that there was a script for getting Yahoo Answers on the Internet Archive based off its sitemap, does anyone know where to find that script?
17:33

arkiver

nschmeller: what is the URL for the forum
17:33

JAA

forum.supercell.com
17:34

JAA

forum.supercell.com/showthread.php/…nd-of-the-Official-Supercell-Forums
17:34

JAA

Read-only in June, shutdown in August
17:36

nschmeller

Yup, ^^
17:36

arkiver

looks like sequential IDs
17:36

nschmeller

If i'm reading correctly, someone with permissions will have to point the archive bot at the main webpage and it'll get everything?
17:36

arkiver

even the members have sequential IDs
17:37

arkiver

JAA: is this small enough for archivebot?
17:37

JAA

Yup, standard forum, but with session ID hell.
17:37

nschmeller

What is session ID hell?
17:38

JAA

Too big for AB, but I can do it with qwarc.
17:39

JAA

nschmeller: When you access it without cookies, it adds an 's' parameter to every link. As the session expires after a while, it inevitably devolves into a huge mess of different session IDs being crawled etc.
17:40

nschmeller

Interesting, sounds annoying. Does that mean that the same page might be archived multiple times once a session expires?
17:41

JAA

Yeah
17:41

JAA

It would keep recursing through the site endlessly.
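
To illustrate why the session parameter causes trouble: the same page shows up under many different URLs that differ only in `s`. A minimal sketch of normalising such URLs, assuming the parameter is a plain query parameter named `s`; the example URLs are hypothetical, and this is not how qwarc or ArchiveBot actually handle it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url: str, param: str = "s") -> str:
    # Drop the session query parameter so equivalent URLs compare equal.
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical URLs: one thread seen under two different sessions.
seen = {
    strip_session_id("https://forum.example.invalid/showthread.php?s=abc123&t=42"),
    strip_session_id("https://forum.example.invalid/showthread.php?s=def456&t=42"),
}
print(len(seen), "unique URL(s):", seen)
```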
17:41

nschmeller

Doesn't sound good
17:41

nschmeller

What can I do to help?
17:48

JAA

I'll get this sorted. :-)
17:50

nschmeller

Awesome!! I'm surprised I haven't come across this group earlier, I've been religiously contributing to the IA since 2016
18:26

Ryz

Uhhh, should we do a proactive archiving of Giant Bomb? giantbomb.com More and more people are leaving Giant Bomb, 3 notable people are Vinny Caravella, Alex Navarro, and Brad Shoemaker
18:26

Ryz

Ever since being acquired and bought away from CBS Interactive, there has been bleeding talent over time S:
18:27

Ryz

Apparently, there's only 2 notable people left :/
18:36

thuban

nuroten: that is more or less what i'm doing
18:36

thuban

the trouble is that their video _and_ their web pages _and_ apparently their CDN are all a bit flaky, so there's a lot of retrying involved
18:38

nuroten

thuban: nice. yeah, their servers are slow
18:39

nuroten

did you manage to get the equivalent akamai urls? not that it's less flaky, hoped it would be a bit faster
18:39

thuban

i did
18:42

thuban

the 2020 ones were fine but 2019 (and presumably earlier) are giving streamlink problems; can't investigate now but will look at it this evening
18:42

nuroten

that's good, is that exposed/extractable via browser inspector? I saw some m3u8 playlist files with *.ts fragments but not sure how to put it back together (or maybe that's not it)
18:43

thuban

there are tools to handle those but, as i say, problems
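
For reference, a media playlist is just a text file: lines starting with `#` are tags, and every other non-empty line is a segment URI relative to the playlist's own URL. A minimal sketch of listing the segments in order, assuming a simple unencrypted media playlist; the playlist text and URLs are invented, and real playlists (master playlists, encryption, byte ranges) need a proper tool such as streamlink or ffmpeg.

```python
from urllib.parse import urljoin

# Invented sample media playlist.
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
seg0.ts
#EXTINF:10.0,
seg1.ts
#EXT-X-ENDLIST
"""


def segment_urls(playlist_text: str, playlist_url: str) -> list[str]:
    # Return absolute segment URLs in playback order.
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and tags/comments
        urls.append(urljoin(playlist_url, line))
    return urls


for url in segment_urls(SAMPLE_PLAYLIST, "https://cdn.example.invalid/show/index.m3u8"):
    print(url)
```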
18:43

nuroten

okay ... wouldn't be too surprised if the 2019 ones are flaky, it was one of the more eventful years
18:45

nuroten

thanks a lot for your work on this!
18:47

nuroten

I still have to check Youtube, if the quality is identical maybe grabbing from there is another option
18:52

arkiver

thuban: are you archiving those RTHK videos?
18:53

thuban

arkiver: yes
18:53

arkiver

thuban: alright, any details on what is being archived exactly and how?
18:54

thuban

podcast episodes + thumbnails + metadata (scraped from xml feed and episode pages)
18:54

arkiver

and videos?
18:54

arkiver

or are those videos
18:55

thuban

that's what i meant by "episodes"
18:55

thuban

i can throw the episode pages into archivebot too if we want provenance
18:55

arkiver

yeah try to get everything into the Wayback Machine at least
18:55

thuban

k, will do
18:55

arkiver

that is also the audio/video files themselves
18:56

thuban

that is likely to be problematic but i will generate the list
18:59

arkiver

right i see podcast.rthk.hk
19:22

thuban

oof, i see the problem: episodes more than a year old aren't on their cdn at all; they also come in a playlist version, but it's self-hosted as well. if i can get one down i will compare the quality to the 'archive' mp4 and act accordingly
19:38

mgrandi

Is all of their stuff not on youtube?
19:38

mgrandi

I just checked a recent video and it's just a youtube embed
19:41

thuban

these videos are not youtube embeds
19:41

mgrandi

I don't see any indication that the site is going anywhere but it's good to get a backup
19:41

masterX244

yeah, better to back up stuff than be forced into an emergency rescue
19:41

thuban

they have a playlist for "hong kong connection" (this show), but many, many of the videos are unavailable youtube.com/playlist?list=PLuwJy35eAVaJ-DaWHYe8PK6Yg-cyEMVo1
19:42

JAA

Ryz: Yes re: Giant Bomb.
19:43

Ryz

Giant Bomb has forums, a wiki, and has premium content (requires a subscription to access that kind of content)
19:43

Ryz

On top of being a news and media website for video games
19:43

Ryz

This should expand to the other related websites that are under Red Ventures
19:43

mgrandi

giantbomb.com/shows/returnal/2970-21070
19:43

mgrandi

So checking their recent video lists, I'd say 3/4 of them are on youtube
19:44

mgrandi

And some of them on the site are youtube embeds, such as ^
19:45

Ryz

This isn't the first time there have been calls to archive this stuff; Jason Scott posted a message on Twitter encouraging ArchiveTeam to do it
19:45

mgrandi

But yes there are some that are not on youtube, such as giantbomb.com/shows/4-30-2021-g-is-for-golden/2970-21074
19:46

mgrandi

I can get their recent twitch videos as a low res backup copy as they most likely will end up in youtube at a higher res copy and hard drives are expensive now :-\
19:48

lunik1

ouch, youtube-dl does not like that link. Has a GiantBomb extractor but maybe it's unmaintained/broken?
19:48

thuban

ytdl is not known for keeping up with its prs; try youtube-dlc?
19:49

mgrandi

Isn't there another one besides that that is even more up to date
19:49

lunik1

youtube-dlc hasn't had a commit to master since October
19:50

lunik1

*December
19:50

Ajay

yt-dlp
19:51

lunik1

there is a download link but it's only for the audio, but the video just seems to be a placeholder
19:53

mgrandi

github.com/yt-dlp/yt-dlp
20:10

goodtime

from #archiveteam:
20:10

goodtime

Game site with ~13 years of history has 3 of its founders leaving after ~13 years. No word yet on if videos are going anywhere. videos hosted on their site as well as youtube.com , in most cases. tons and tons of 2h + video. As a fan, i think the biggest risk is that the site jettisons some of its less visible/ profitable features, like its
20:10

goodtime

extensive wiki. old videos (older ones may not be on youtube?) may also get deleted for storage reasons. resetera.com/threads/vinny-caravell…heyre-leaving... goodtime 15:03:38 one of the people leaving: "We are still a website... in a time when websites kind of don't exist anymore". storm clouds on
20:10

goodtime

the wiki "Are they gonna be on our forum? Are they gonna be on discords?" founder, still staying: "Do we still need a website? I've been asking for 5 years"
20:10

goodtime

tldr old videos (not on youtube) and non videos are the highest risk imo
20:37

mgrandi

Probably easiest to list the web pages to scrape and then get a listing of all the videos and download them somehow
20:37

LeighR

Holy Cow did I have an instinct for site at risk - pemberley.com is unresponsive, and its old IP address is a parking page
20:38

masterX244

Got it just in time?
20:38

LeighR

apparently!
20:39

LeighR

there wasn't anything on the site that announced it going away, so this might just be a temporary hiccup, but given how unresponsive it was, I felt its days were numbered
20:40

LeighR

Hope AB didn't knock it out (I don't seriously think AB knocked it out)
20:41

mgrandi

And if someone writes code to get a listing of GB's pages , that should be put on GitHub and linked on the wiki so it can be rerun in the future :)
20:44

masterX244

Did something similar for the TM-exchange. Dumped the URLLists to archive.org and added the source code of the tool into that item, too. Better to have the code at multiple locations
20:44

masterX244

URLList dump makes it easier to do an incremental update since replays don't need redownload after initial download, and no need to redo the POST search if you already got the IDs
20:55

LeighR

aside from downloading the whole WARC myself, is there a way to spot-check some URLs? Most of the stories in that site were indexed in a single, slightly mangled table that was de-mangled for viewers one page at a time
20:56

LeighR

(site is back up, but still slow as heck)
20:58

masterX244

each WARC has a cdx which is like a ToC
21:16

LeighR

WRPlayer choked on the metadata WARC
21:17

LeighR

downloading the WARC from archive.fart.website/archivebot/viewer/job/b8mfh isn't eating into someone's monthly bandwidth allotment?
21:19

JAA

It's just an index for the AB collection on IA.
21:19

LeighR

oh, good
21:21

LeighR

if those pages end up not being in there, what is the best way to archive the list of URLs I parse from the slightly mangled list?
21:22

masterX244

how is it mangled?
21:23

LeighR

https:\\/pemberley.com\/derby\/ariane1.cim.html
21:23

masterX244

sidenote: Just noticed that on the Wikiteam dump the last upload was 2016.
21:24

masterX244

grep all out and replace \/ with /
21:24

LeighR

no big deal to clean up in PowerShell or whatever
21:24

masterX244

yeah, scripting or some quick C# code is the best way sometimes
21:24

LeighR

(to pull out of the table)
21:25

masterX244

*last upload of wikimedia commons
21:25

JAA

sed 's,\\/,/,g'
21:25

LeighR

I thought it would be some serious JS BS, but no, I can see them all clear as day when I pull that page with curl
21:25

JAA

Slashes are often unnecessarily escaped in JS strings (including embedded JSON).
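
The same clean-up can be sketched in Python. The first helper mirrors the sed one-liner; the second shows that a JSON parser undoes the `\/` escaping on its own. The sample strings and the helper name `unescape_slashes` are made up for illustration.

```python
import json


def unescape_slashes(text: str) -> str:
    # Same effect as the sed expression: every backslash-slash pair becomes a slash.
    return text.replace("\\/", "/")


# Made-up sample of a URL with escaped slashes, as found in JS strings.
mangled = r"https:\/\/example.invalid\/derby\/story1.html"
print(unescape_slashes(mangled))

# If the data is valid JSON, decoding it removes the escapes without any regex.
embedded = r'{"url": "https:\/\/example.invalid\/derby\/story1.html"}'
print(json.loads(embedded)["url"])
```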
21:25

LeighR

they're stuck in a table, but a regular enough pattern. Not sure if ArchiveBot would have caught this.
21:25

masterX244

probably nope since the backslashes hide it
21:26

masterX244

unless it got some unmangling code for that
21:26

masterX244

but easiest to verify by crosschecking that list with the cdx of the WARC file
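
A minimal sketch of such a cross-check, assuming a space-separated CDX file whose header line names the columns with one-letter codes and uses `a` for the original URL. The sample CDX lines and URLs are invented; the actual files for a given job may use a different column set, and URLs may differ in scheme or `www.` prefix, so a real check should normalise both sides first.

```python
# Invented two-record CDX sample with the common 11-column header.
SAMPLE_CDX = """ CDX N b a m s k r M S V g
invalid,example)/derby/story1.html 20210503210000 https://example.invalid/derby/story1.html text/html 200 AAAA - - 1000 0 job.warc.gz
invalid,example)/derby/story2.html 20210503210001 https://example.invalid/derby/story2.html text/html 200 BBBB - - 1200 1000 job.warc.gz
"""


def archived_urls(cdx_text: str) -> set[str]:
    # Collect the original-URL column ("a") from every CDX record.
    lines = cdx_text.splitlines()
    header = lines[0].split()        # e.g. ["CDX", "N", "b", "a", ...]
    column = header[1:].index("a")   # position of the original URL field
    return {line.split()[column] for line in lines[1:] if line.strip()}


def missing(wanted: list[str], cdx_text: str) -> list[str]:
    # URLs from the wanted list that do not appear in the CDX.
    have = archived_urls(cdx_text)
    return [url for url in wanted if url not in have]


wanted = [
    "https://example.invalid/derby/story1.html",
    "https://example.invalid/derby/story3.html",
]
print(missing(wanted, SAMPLE_CDX))
```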
21:26

LeighR

I get the feeling that some of this might have been done to prevent just the sort of thing we just did
21:27

masterX244

still better than __doPostBack aspx pagination that doesn't use the URL
21:27

LeighR

but their main fear was probably the stories being posted on fanfiction.net or the like under different authors' names
21:27

JAA

If it's JS, wpull handles that by calling json.loads.
21:27

LeighR

nice
21:27

masterX244

what's the initial URL where the table resides?
21:28

LeighR

pemberley.com/?page_id=5270
21:30

LeighR

if it turns out that AB didn't get them, I'll clean them up and put them in a list - no reason for y'all to bother
21:31

masterX244

just curious on the fuckery hidden in that page
21:34

LeighR

it's a site that was started before Google was
21:34

LeighR

all I can guess is that it's some effort to prevent low-effort web scraping
21:35

masterX244

script tag with a CDATA wrapper around, not sure if wpull expects a variable assignment containing the essential data
21:37

LeighR

what's the polite way to get AB to pull a list of links that are all on the same site, but aren't the only thing on that site?
21:38

JAA

Oh, I see, it's HTML in JS strings. Yeah, that isn't processed by wpull I think.
21:38

LeighR

you probably don't want several hundred !ao messages in the channel
21:39

JAA

Create a file containing one URL per line, upload that to transfer.archivete.am (with a good filename!), then use !ao < LISTURL.
21:39

LeighR

and you don't need several hundred copies of that obnoxious background image
21:39

LeighR

that was probably very classy in 1997
21:39

LeighR

great!
21:39

masterX244

is transfer.archivete.am required, or does any deep-linkable host work
21:40

masterX244

?
21:40

JAA

Anything works. Anything with good filenames (e.g. not Pastebin) is acceptable. transfer.archivete.am is strongly recommended.
21:41

LeighR

I need to check, but I think some of them might just be the first chapter of multi-chaptered stories, linked in who knows what pattern
21:41

JAA

(This might change in the future, we'll see.)
21:41

masterX244

also: got this link app.box.com/s/6b9wmjvr582c95uzma1136exumk6p989/folder/136698646305 via this tweet: twitter.com/simoncarless/status/1389297530341519362
21:42

masterX244

Apple Vs Epic Lawsuit Extended stuff. (not directly in the RECAP archive which pipes to archive.org)
22:31

arkiver

LeighR: if we know of any people, would be good to get in contact with
22:31

LeighR

thediplomat.com/2021/04/hong-kongs-activists-in-exile
22:32

LeighR

but those are perhaps not as archive-oriented
22:35

LeighR

I remember some folks in college who were from Taiwan (important because they and HKers can read the full Traditional Chinese character set, while the mainland uses Simplified Chinese)
22:37

LeighR

This group would probably be delighted with your help: 2021hkcharter.com
22:41

LeighR

I'll do some more looking for who might be able to make best use of AT's help
22:42

goodtime

for Giant Bomb we could probably amass a collection of premium subscribers who want to make sure the content is archived. premium subs get download URLs which are supposedly checked for abuse (i.e. no mass downloads, i think an api key is involved)
