-
pabs
I heard that someone inside Google has been trying to get rid of feedburner for years.
-
pabs
asked them to do a proper transition and also contact us before it goes away
-
h2ibot
JustAnotherArchivist edited Zippyshare.com (+172, Update infobox):
wiki.archiveteam.org/?diff=49604&oldid=49575
-
OrIdow6^2
Shutterfly share sites is dead
-
OrIdow6^2
I did not figure out in time how to generate image URLs
-
OrIdow6^2
Appears that the DNS trick works to some extent
-
OrIdow6^2
Well, I should go to sleep now, but if it's still "up" in the morning I guess I'll just take a best guess at that image URL generation
-
OrIdow6^2
Which probably isn't that hard, but I was trying to be a perfectionist :|
-
thuban
grab-site works with a pyenv 3.8 venv but not with a system 3.7 venv, because the latter looks for libre2.so.9 and chokes on libre2.so.10. why? idk -_-
-
Sophira
Hi there. I'm the owner of a site that's been running for the last 10-11 years or so dedicated to the TV Tropes ARG "The Wall Will Fall" (
twwf.info and its linked subdomains ). I intended to shut it down in December but haven't been able to bring myself to do so yet. The domain expires in a week, though, and I would prefer not to renew it if possible. Not all of the forum is archived on
-
Sophira
web.archive.org as for much of its life the forums had restrictive robots.txt files. I removed them a while back but there's a lot that still isn't in the archive. Is it possible to request that the sites be archived?
-
Sophira
As the owner I'm willing to help in any way I can for this to happen.
-
Sophira
(I was one of the original puppetmasters on the ARG.)
-
Sophira
Actually no my mistake, I think the forums had some kind of bot detection IIRC.
-
thuban
Sophira: yes, certainly! do you have a sitemap you can provide?
-
Sophira
thuban: I don't. I should be able to create one, I think, though it might take a while. Would a list of URLs suffice?
-
thuban
a list of urls would be perfect
-
Sophira
Also bear in mind that this will cover several different hostnames, though they're all under the umbrella domain of twwf.info.
-
thuban
that should be fine
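A minimal sketch of how such a URL list could be sorted by hostname before handing it over, assuming one absolute URL per entry. The function name and the sample paths are hypothetical; only the hostnames come from the conversation.

```python
from collections import defaultdict
from urllib.parse import urlsplit

UMBRELLA = "twwf.info"  # umbrella domain named in the chat


def group_by_host(urls):
    """Group URLs by hostname; return (groups, urls_outside_umbrella).

    URLs need a scheme (http://...), otherwise urlsplit finds no hostname
    and the entry lands in the "outside" list.
    """
    groups = defaultdict(list)
    outside = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host == UMBRELLA or host.endswith("." + UMBRELLA):
            groups[host].append(url)
        else:
            outside.append(url)
    return dict(groups), outside


# Hypothetical sample list; the paths are made up for illustration.
sample = [
    "http://forum.twwf.info/viewtopic.php?t=1",
    "http://romeo.ezblog.twwf.info/",
    "http://forum.twwf.info/viewtopic.php?t=2",
    "http://example.com/unrelated",
]
groups, outside = group_by_host(sample)
```

Grouping first makes it easy to see which hostnames the list covers and to spot entries that fall outside twwf.info.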
-
Sophira
Okay. I'll do what I can, then! It might take a while though, as I say. Is there any kind of special processing you would normally do for phpBB forums and Wordpress sites?
-
pokechu22
wordpress and phpbb can both be done with archivebot without much issue
-
pokechu22
(in that the annoying stuff has mostly already been solved with some standard ignoresets)
-
Sophira
Awesome. One thing to bear in mind is that many of the sites will link to each other in forum posts and blog comments, so those 'external' links will need to be rewritten accordingly.
-
pokechu22
Yeah, archivebot doesn't do that super well - it only recurses within a single domain and saves individual outlinks. If each of the wordpress/phpBB forums has a front page where everything can be accessed that won't be as much of a problem though
-
pokechu22
there isn't any super good way to rewrite them with archivebot as-is :/
-
thuban
i think there's a miscommunication here--archivebot (like archiveteam) does not rewrite anything
-
Sophira
Oh, even links within the same site?
-
thuban
archivebot _follows_ links, but it won't alter anything. if an old post on foo.twwf.info links to bar.com (which is now copied at bar.twwf.info), it will be saved exactly as it is, including the link to bar.com
-
pokechu22
twwf.info says that links like that are already rewritten though so that might not be a problem
-
Sophira
Yeah, I use mod_filter on the server in order to do domain substitution like thata.
-
Sophira
^that/.
-
Sophira
...pretend I typed that correctly.
-
Sophira
But yes, all links to sites that have been archived to a subdomain beneath twwf.info are rewritten automatically.
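The domain substitution described here can be illustrated roughly as below. This is a plain Python sketch of the idea only, not the actual mod_filter configuration; the mapping reuses the hypothetical bar.com example from earlier in the chat, and the function name is made up.

```python
import re

# Hypothetical mapping from original hostnames to archived subdomains.
SUBSTITUTIONS = {
    "bar.com": "bar.twwf.info",
    "www.bar.com": "bar.twwf.info",
}

# Longest hostnames first so "www.bar.com" wins over "bar.com".
_HOSTS = sorted(SUBSTITUTIONS, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<![\w.-])("                       # not in the middle of a longer hostname
    + "|".join(re.escape(h) for h in _HOSTS)
    + r")(?![\w-])(?!\.\w)"               # not followed by more hostname labels
)


def rewrite_hosts(html: str) -> str:
    """Replace each known original hostname with its archived subdomain."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1)], html)
```

With this mapping, `rewrite_hosts('<a href="http://bar.com/post/1">')` yields a link to `http://bar.twwf.info/post/1`, while unrelated hosts such as `foobar.com` or `bar.com.au` are left alone.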
-
thuban
what did you then mean by "those 'external' links will need to be rewritten"?
-
Sophira
I mean 'external' in that, for example, some users making comments on Romeo's blog site, romeo.ezblog.twwf.info, have links to Juliet's blog site, which are rewritten to juliet.ezblog.twwf.info automatically. From your point of view, juliet.ezblog.twwf.info will be a different site from romeo.ezblog.twwf.info, right?
-
Sophira
That's what I mean by 'external', and that's why I put the word in quotes - because they're still under twwf.info, but from the point of view that they use two different hostnames, they could be considered two different sites.
-
Sophira
twwf.info
-
Sophira
Er.
-
thuban
ah. so by "rewrite" you only mean 'consider as part of the same site'.
-
Sophira
The main page at
twwf.info (sorry, no HTTPS) links to all the various sites and they should all be accessible.
-
pokechu22
Yes, but that won't cause issues with doing two separate jobs that recurse over all of
romeo.ezblog.twwf.info and
juliet.ezblog.twwf.info (the pages that are linked between them would get saved twice, but that's probably fine)
-
pokechu22
It'd be an issue for
xovr.twwf.info though and any deep links that aren't reachable from the front page
-
thuban
archivebot's subdomain handling is complicated™, but a complete sitemap will render it moot
-
thuban
(or, failing that (eg for the forums), a complete list of subdomains)
-
pokechu22
My thought is that doing an !a on each of the forums and blogs would get good enough coverage of those; wordpress and phpbb are usually fine for discovering pages even without a sitemap (though wordpress generally generates a sitemap anyways; seems like there isn't one in this case (too old?))
-
pokechu22
I might as well just try it and see how it goes... Sophira, any parameters on rate-limiting? Archivebot's default is 3 concurrency sets of requests where after each request it waits 250-375 milliseconds
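For a rough sense of what that default means for the server, here is a back-of-the-envelope sketch. It assumes "3 concurrency" means three parallel workers that each wait independently, and it ignores response time, so real rates would be lower.

```python
def request_rate_ceiling(workers, min_delay_s, max_delay_s):
    """Return (ceiling at longest delay, ceiling at shortest delay) in requests/second.

    Each worker can issue at most one request per delay interval, so the
    figures are upper bounds that ignore how long each response takes.
    """
    return workers / max_delay_s, workers / min_delay_s


# Defaults quoted in the chat: 3 workers, 250-375 ms wait after each request.
slow, fast = request_rate_ceiling(3, 0.250, 0.375)
print(f"at most about {slow:.0f} to {fast:.0f} requests per second")
```

Under those assumptions the ceiling is somewhere between 8 and 12 requests per second.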
-
Sophira_
Okay. An example of a page on xovr.twwf.info, btw, would be
xovr.twwf.info/i_xukb3tnd.php . Entering the password "Gurt" (case-sensitive) would then show an image. I assume in these cases I should give both the pages themselves and the image URLs.
-
Sophira
I'm not sure if my last message sent because of the ping timeout, so:
-
Sophira
Okay. An example of a page on xovr.twwf.info, btw, would be
xovr.twwf.info/i_xukb3tnd.php . Entering the password "Gurt" (case-sensitive) would then show an image. I assume in these cases I should give both the pages themselves and the image URLs.
-
Sophira
(also, the last thing I saw was thuban saying subdomain handling is complicated™.)
-
thuban
-
Sophira
Ah, thank you! Odd that my message sent but I didn't see anybody else's. Oh well. As for rate-limiting, I imagine that'll be fine. The sites themselves aren't really used any more so there won't really be any disturbances.
-
Sophira
Subdomain-wise, I *think* all the subdomains are on twwf.info's front page. Let me just double-check.
-
Sophira
Yeah, they're all listed, I believe. Also just to note, the only sites that are Wordpress/phpBB-based are watchthefootage.twwf.info, forum.watchthefootage.twwf.info, and all the *.ezblog.twwf.info subdomains.
-
Sophira
Actually, that said, I would also like to archive the phpBB forum at forum.twwf.info. It's not listed in the main table because it only became a thing after the ARG itself, but it has a lot on it.
-
Sophira
(not so active any more though, heh)
-
pokechu22
Yeah, I can do that too. Last active Dec 31, 2022 is fairly good as far as inactive forums go :P
-
pokechu22
I've started on the ezblog ones
-
Sophira
Awesome, thank you <3
-
Sophira
Heee. I like the "and not" in the User-Agent string.
-
Sophira
So does this mean that with regard to the site map that I don't need to bother with grabbing all the post URLs and such from the databases?
-
Sophira
Or should I do that anyway?
-
pokechu22
For the wordpress ones? It's probably not necessary
-
pokechu22
It might be useful after everything's been saved to verify that it's actually complete, though (but that would have to be in a few days)
-
Sophira
That makes sense. Okay.
-
pokechu22
Based on
watchthefootage.twwf.info there's also several twitter accounts linked with it - I can save those via socialbot. Is there a more complete list than the ones in the sidebar?
-
Sophira
One moment...
-
kiska
I hate npm... I broke etherpad :(
-
Sophira
I can't think of any other Twitter accounts to archive. I think it's complete.
-
kiska
Fuck that was annoying... pad.notkiska.pw is back online
-
qwertyasdfuiopghjkl
Sophira: From searching for links to twitter.com on
tvtropes.org/pmwiki/pmwiki.php/Recap/TheWallWillFall and the other tvtropes wiki pages linked from that, I found
twitter.com/DeadCatInABox ,
twitter.com/GurtTheLimeMan and
twitter.com/RADIOVOIDREBEL that look related but weren't listed on the sidebar
-
OrIdow6^2
Shutterfly share sites is indeed usable with the DNS trick
-
driib
Hi, I've been running a warrior on the telegrab project for some short while and got interested to check what kind of data ends up published on IA. I tried to download and inspect one package from
archive.org/details/archiveteam_telegram but ran into some issues. 1) I cannot unzstd the megawarc file due to a "Decoding error (36) :
-
driib
Dictionary mismatch"; the internet says it's due to an external dictionary use but I can't seem to find one on
archive.org/download/archiveteam_telegram_20230327203637_2cf0eb8f, for example. 2)
github.com/internetarchive/warctools does not seem to include tools to deal with megawarcs or zstd, what CLI tools do you recommend if I
-
driib
want to look into the payload of a single item? Thank you all for your patience with my noob questions! Hope I put em in the right channel too.
-
pokechu22
Pretty sure this is the right channel but I don't have an answer beyond that
-
OrIdow6^2
The dict is in a skippable frame at the beginning of the zstd
-
OrIdow6^2
I wrote an awful tool to extract them a while back, I believe someone else wrote a better one, but if no one comes around with that in a bit I can give you the old one
-
OrIdow6^2
(It's in the skippable frame, and furthermore it itself is compressed with vanilla zstd)
-
OrIdow6^2
driib: Alright, actually I've made a little new one without the dependency issue
transfer.archivete.am/sW9PL/get_zstd_dict_simple.py
-
OrIdow6^2
This takes the name of the warc.zst as its argument and writes the compressed dict to stdout
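For reference, a minimal stdlib-only sketch of the idea described above, not the linked script itself. It assumes the file starts with a zstd skippable frame (magic 0x184D2A50 to 0x184D2A5F, then a 4-byte little-endian length) whose payload is the dictionary; the function names are made up.

```python
import struct

SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"  # payload is itself zstd-compressed
ZSTD_DICT_MAGIC = b"\x37\xa4\x30\xec"   # payload is a raw zstd dictionary


def read_leading_skippable_frame(stream):
    """Return the payload of a skippable frame at the start of a binary stream."""
    header = stream.read(8)
    if len(header) != 8:
        raise ValueError("file too short to contain a frame header")
    magic, size = struct.unpack("<II", header)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("file does not start with a skippable frame")
    payload = stream.read(size)
    if len(payload) != size:
        raise ValueError("skippable frame is truncated")
    return payload


def describe_payload(payload):
    """Guess whether the extracted dictionary is compressed or raw."""
    if payload[:4] == ZSTD_FRAME_MAGIC:
        return "zstd-compressed dictionary"
    if payload[:4] == ZSTD_DICT_MAGIC:
        return "raw dictionary"
    return "unknown"
```

Opening the megawarc in binary mode and passing the file object to `read_leading_skippable_frame` would give the dictionary bytes. If `describe_payload` reports a compressed dictionary, it still needs a pass through a regular zstd decoder before it can be used to decompress the rest of the file.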