-
thuban
ah yeah, i heard something about it being moved to the same servers as the sister site fictionpress
-
thuban
man, it looks like github.com/ArchiveTeam/ffnet-grab wasn't even warrior-based?
-
Doranwen
Fanfiction.net is a massive bit of history for fandom, so that's a huge priority for us
-
thuban
looks like there's never been a scrape of the forums, either
-
OrIdow6
This does sound to me like another panic-site-is-shutting-down-over-a-single-social-media-post
-
thuban
not really
-
OrIdow6
Obviously from how it's going, it seems less than likely it'll last another year or so, but AFAICT the thing in the last ~2 days does not reflect a days-left state before shutdown?
-
thuban
the thing is that most people have _already_ fled ffn, because it's done multiple rounds of purges in the past and it's no longer trusted in fandom circles--that post isn't so much a news announcement as a psa that gets repeated occasionally
-
Doranwen
no, I don't think it's imminent - but it's not looking *great*, and fandom would not like to take *any* chances
-
» Doranwen nods
-
thuban
anyway: sequential user and story ids, max ~15m (!) and ~14m respectively, can be accessed without slugs (not a redirect, just same page content). forums are user-created and do seem to require slugs; topics don't require slugs but do include the forum id in the url and appear to use a single counter across all forums, so both would probably have to be enumerated through the pagination
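
A minimal sketch of what enumerating over sequential ids could look like. The URL patterns `/s/<id>` and `/u/<id>` are assumptions (the conversation only says ids are sequential and pages load without slugs), and `enumerate_urls` and `chunked` are hypothetical helper names.

```python
from typing import Iterator

# Assumed URL patterns; not confirmed anywhere in the conversation.
STORY_URL = "https://www.fanfiction.net/s/{id}"
USER_URL = "https://www.fanfiction.net/u/{id}"


def enumerate_urls(template: str, start: int, stop: int) -> Iterator[str]:
    """Yield one slug-less URL per id in [start, stop)."""
    for item_id in range(start, stop):
        yield template.format(id=item_id)


def chunked(start: int, stop: int, size: int) -> Iterator[range]:
    """Split [start, stop) into ranges of at most `size` ids, e.g. as work items."""
    for lo in range(start, stop, size):
        yield range(lo, min(lo + size, stop))
```

Splitting the id space into fixed-size ranges keeps each work item small, which matters when roughly 14 to 15 million ids have to be spread over many workers.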
-
Doranwen
there's still a lot of very good stuff on there - and the reviews are very valuable too
-
thuban
"communities" are just user-curated story collections; there's a separate beta-reader profile system that isn't linked from the main user profiles (not all users are "registered beta readers") but uses the same ids. i think that's about it
-
cm
I know this isn't archive.org, but does anyone know how to determine the size of a page that is saved on the wayback machine?
-
Doranwen
this is more the channel for questions like that, though I can't answer it, hopefully someone can!
-
cm
with a live web page you can look at the bandwidth usage in firefox dev tools, but on the wayback machine you also pick up stuff from archive.org
-
JAA
cm: Define 'size'? Byte size of the original HTML? Total size with images etc.? Something else?
-
cm
JAA: the bandwidth it takes to load the page completely
-
JAA
In the WBM? Then you need to include the WBM's own things. And it rewrites links and references as well, which will also change the size.
-
JAA
What are you trying to do?
-
cm
compare an updated site to an older version of the same site
-
JAA
If it's just about the difference, including the WBM's scripts etc. shouldn't matter.
-
JAA
Since they'll be included in both.
-
cm
ah yeah I can just use the wayback machine version of both
-
cm
good idea
-
cm
ah that doesn't work though, since the wayback machine is now IP blocked by the site I want to measure :(
-
JAA
hook54321: lists.mozilla.org was discontinued last month, apparently. Found that while looking into news.mozilla.org. :-|
-
hook54321
i thought we got that one
-
JAA
At least all the content's still there it seems.
-
JAA
Did we?
-
JAA
Grepping my logs only yielded an AB job from 2017.
-
hook54321
ah
-
hook54321
must be that i'm thinking of
-
webdownload
Why would Mozilla do such a thing?
-
JAA
They've been doing this for at least a couple years now.
-
webdownload
They don't seem like the type.
-
JAA
Inb4 Google Groups shuts down and they move back to their own infra.
-
Doranwen
thuban: also a note - the default index pages for each category exclude the M-rated fics
-
Doranwen
for fanfiction.net
-
Doranwen
one has to change that setting to see all of them
-
Doranwen
it's a simple string that is added to the URL, so that's not hard to apply, but it's a consideration, because it'd be easy to just go through the default ones and miss those in the index pages
-
JAA
But those stories themselves are accessible normally, right? It's just excluded from the lists?
-
thuban
not relevant for enumeration over fic ids
-
thuban
yeah
-
thuban
do we usually get that sort of pagination data?
-
JAA
Does something similar exist in the forums, where we can't just enumerate topics?
-
JAA
Usually not, no. It'd be nice to preserve the entire website 'experience', but it's not easily possible often, and the unique content is way more important obviously.
-
Doranwen
yeah, it's only really a consideration for the WBM, I think
-
thuban
^^ site policy is "All forum posts must be suitable for teens", and topics don't have ratings, so i presume not
-
Doranwen
that's what I saw on a Reddit thread discussing whether this latest round of paranoia has any substance - and the general consensus was something like "ff.n has been dying for ages, it's not imminent at all but eventually it may go", but there was one fandom user (who, incidentally, helped us with the Yahoo Groups project) that was really bothered that the WBM never got those fics in the index pages
-
JAA
The index can always be rebuilt from the stories anyway.
-
JAA
Pagination is particularly horrible to properly archive on websites that are still live.
-
thuban
yeah, was thinking that myself
-
JAA
You virtually always end up with an incomplete list because things get shifted around while you're iterating through the pages.
-
JAA
So some stories would appear twice and others would be missing.
-
Doranwen
yeah, if there was a way to reverse the order so the oldest appeared first, one *can* set it to sort by publish date instead of update date - but any new story posted will throw it off
-
Doranwen
AO3 is nicer in that you *can* set that
-
JAA
Stories sometimes get deleted, and then it'll shift everything anyway.
-
Doranwen
oof, yeah
-
JAA
Offset-based pagination always has that problem.
-
JAA
You need cursor-based pagination instead, but that's messier to implement, so many smaller sites don't use it.
-
JAA
What you do is grab all the stories and then generate an independent index from that.
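
The shifting problem described above can be reproduced with a small simulation. This is only an illustrative sketch: the list of ids stands in for a newest-first story listing, and `crawl_by_offset`, `crawl_by_cursor`, and the `mutate` callback are hypothetical names, not anything the site exposes.

```python
from typing import Callable, List, Optional

Mutation = Optional[Callable[[List[int]], None]]


def crawl_by_offset(live: List[int], per_page: int, mutate: Mutation = None) -> List[int]:
    """Fetch pages 0, 1, 2, ... as slices; `mutate` changes the list after page 0."""
    seen: List[int] = []
    page = 0
    while True:
        batch = live[page * per_page:(page + 1) * per_page]
        if not batch:
            return seen
        seen.extend(batch)
        if page == 0 and mutate is not None:
            mutate(live)  # the site changes while the crawl is in progress
        page += 1


def crawl_by_cursor(live: List[int], per_page: int, mutate: Mutation = None) -> List[int]:
    """Fetch the next `per_page` ids below the last id seen (newest-first list)."""
    seen: List[int] = []
    cursor: Optional[int] = None
    while True:
        batch = [s for s in live if cursor is None or s < cursor][:per_page]
        if not batch:
            return seen
        seen.extend(batch)
        if cursor is None and mutate is not None:
            mutate(live)  # same mid-crawl change, applied after the first batch
        cursor = batch[-1]
```

With ids 10 down to 1 and three per page, deleting story 9 after the first page makes the offset crawl skip story 7, and inserting a new story 11 makes it return story 8 twice. The cursor crawl returns each of the ten original ids exactly once in both cases.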
-
JAA
Anyway, the primary concern is making sure that the unique data, i.e. the actual stories, are safe.
-
thuban
yeah. (i don't think ffn will show you the exact timestamp, unfortunately)
-
JAA
It doesn't display it, but it's in the HTML in a data-xutime attribute.
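
A rough sketch of pulling those values out of a page with the standard library. It assumes the attribute holds Unix epoch seconds, which the conversation does not state; `XutimeParser` and `extract_times` are hypothetical names.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from typing import List


class XutimeParser(HTMLParser):
    """Collect the value of every data-xutime attribute in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.timestamps: List[int] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Skip missing or non-numeric values instead of guessing.
            if name == "data-xutime" and value and value.isdigit():
                self.timestamps.append(int(value))


def extract_times(html: str) -> List[datetime]:
    """Return the timestamps as UTC datetimes (assumes epoch seconds)."""
    parser = XutimeParser()
    parser.feed(html)
    return [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in parser.timestamps]
```

Keeping the raw integer alongside the parsed datetime would be a safer choice if the epoch-seconds assumption turns out to be wrong.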
-
thuban
oh! i should have checked
-
JAA
What is the URL tweak needed for the M-rated stories in lists?
-
thuban
it's the 'r' parameter
-
JAA
Ah, found it in the filters. r=10 param
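
A small sketch of applying that parameter to a list URL. Only the `r=10` value comes from the conversation; the helper name `with_rating_filter` and any category URL passed to it are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_rating_filter(url: str, value: str = "10") -> str:
    """Return `url` with the `r` query parameter set to `value`, replacing any existing one."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "r"]
    query.append(("r", value))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Replacing an existing `r` rather than appending a second one avoids ambiguous URLs when the starting link already carries a filter.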
-
thuban
if you click the "Filters"--yeah
-
JAA
Looks like fanfiction.net/j/0/0/0 always includes them. That's good.
-
JAA
Thinking about how to continuously cover the site until it inevitably dies.
-
JAA
Hmm, shouldn't all entries on fanfiction.net/j/0/2/0 ('Updated Stories') also be in fanfiction.net/j/0/0/0 ('All Types')?
-
thuban
i think "Updated Stories" excludes newly published one-shots
-
thuban
wait, no it doesn't.
-
JAA
Hmm, I'm sensing some caching bugs there.
-
JAA
But I suppose the best strategy would be to retrieve all five of the 'Just In' pages regularly to collect story IDs that need to be regrabbed.
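
A minimal sketch of that collection step, working on page HTML that has already been fetched. The link pattern `/s/<id>/...` is an assumption, and `story_ids` and `collect_regrab_queue` are hypothetical names.

```python
import re
from typing import Iterable, Set

# Assumption: story links in the list pages look like /s/<id>/<chapter>/<slug>.
STORY_LINK = re.compile(r"""href=["']/s/(\d+)""")


def story_ids(html: str) -> Set[int]:
    """Return every story id linked from one list page."""
    return {int(match) for match in STORY_LINK.findall(html)}


def collect_regrab_queue(pages: Iterable[str], already_queued: Set[int]) -> Set[int]:
    """Union the ids from all fetched list pages and drop those already queued."""
    found: Set[int] = set()
    for html in pages:
        found |= story_ids(html)
    return found - already_queued
```

Taking the union across all pages means a story that shows up on only one of the lists still ends up in the queue.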
-
thuban
the first entry in "Updated Stories" as i'm looking at it right now says "Updated: Oct 17, 2020" but actually clicking on the fic shows "5m ago"
-
thuban
caching bugs indeed :)
-
JAA
Yeah
-
JAA
But there's a 'Ghost of Love (Reylo Fanfic)' on Updated now that doesn't appear on All while the two around it do.
-
thuban
there are also rss feeds, but i think they're only per-category and they appear to be equally flaky
-
Doranwen
yeah, you can customize them quite a bit, I think
-
Doranwen
like, I think they generate a feed based on whatever filter you have set to browse with
-
nerdguy1138
JAA: i'm actually working on this! i can send you a list of inprogress fics
-
nerdguy1138
ive been archiving fics for years now
-
nerdguy1138
fanfiction.net has really been amping up the cloudflare blocking recently. ive moved on to AO3, wattpad, and quotev.
-
JAA
That's disappointing. What kind of rate limit are we looking at there?
-
nerdguy1138
somewhere between 5-10 seconds, the script i was using just completely gave up, and honestly if they want to consign themselves to the dustbin of history that badly, i'm inclined to let them. i already have millions of stories from there.
-
nerdguy1138
they only started doing this in the last few months, afaik
-
JAA
Like, one request per 5-10 seconds? Ew.
-
OrIdow6
Question is whether it works based on total load
-
OrIdow6
Though of course that does mean that it's out of QWarc's reach
-
nerdguy1138
i was focused on saving as many stories as i could, i got almost 9 million, so i think i did pretty well.
-
JAA
OrIdow6: Nothing is out of qwarc's reach. :-)
-
JAA
All it needs is a bunch of IPs and a bit of iptables magic. :-P
-
OrIdow6
JAA: Oh, I suppose
-
OrIdow6
QWarc does seem pretty nice
-
JAA
Thanks, just don't look too closely or you'll see the nasty intestines. :-P
-
OrIdow6
Well, that's all useful software
-
OrIdow6
*That's a feature of all useful software (original makes sense pronounced, not written)
-
JAA
Guess so, but this is particularly bad. Monkey-patching internal parts of a third-party package. :-P
-
JAA
(Also, I'm a bit picky in that regard, but the correct spelling of the name is qwarc, not QWarc.)
-
Ryz
Oh good, that means I can continue pronouncing it as 'qwark' x3
-
JAA
While I'm at it, the correct pronunciation is exactly like quark (the subatomic particle). :-)
-
rewby
JAA: re "All it needs is a bunch of IPs" HCross or me can likely assist with that if you need
-
HCross
not for a few days
-
HCross
that box is going in for open-server-surgery later tonight
-
rewby
Ah, well I have a /24 on hand anyway
-
HCross
(the fear of dread dawns as I realise it involves a fight with _that_ KVM)
-
rewby
Need me to help?
-
HCross
Potentially tomorrow
-
rewby
Lmk
-
JAA
rewby: Thanks, might get back to one of you about that sometime. Scanning all FFN stories at one request per 10 seconds per IP from a /24 would take about a week. Some would have pagination of course, but seems feasible enough.
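
The estimate can be checked with a few lines of arithmetic. This sketch uses the ~14 million story ids mentioned earlier and counts all 256 addresses of a /24; `scan_days` is a hypothetical helper, and pagination within stories is ignored.

```python
def scan_days(items: int, seconds_per_request: float, ips: int) -> float:
    """Wall-clock days to make one request per item, spread evenly over all IPs."""
    total_seconds = items * seconds_per_request / ips
    return total_seconds / 86_400


# ~14 million story ids, one request per 10 s per IP, 256 addresses in a /24.
print(round(scan_days(14_000_000, 10, 256), 1))  # 6.3
```

That comes to a little over six days, consistent with "about a week" before any extra requests for multi-chapter stories.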
-
rewby
Aight! Let us know.
-
EggplantN
I can provide lots of IPs JAA
-
EggplantN
If needed before HCross’s box is back
-
JAA
Not particularly urgent and needs some more investigation and code writing first anyway, so we'll see. But sounds great, thanks.
-
s-crypt
Is there a way to download quite a bit of larger videos from dropbox? I cannot download all of them because it says "the zip is too large"
-
s-crypt
I have this that would be nice to archive but I really dont want to go and download all of them manually
-
HiccupJul
does archivebot function differently to the wayback machine's "save page"/built-in scraping features?
-
HiccupJul
if so, how are those two datasets reconciled when an archivebot warc is added to the wayback machine?
-
masterX244
Wayback uses the closest WARC record to the given timestamp from a WARC file
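
As a rough illustration of "closest record to the given timestamp", the sketch below picks the nearest capture from a list of 14-digit timestamps in the format used in Wayback URLs. The capture list and the `closest_capture` name are hypothetical, and the tie-breaking rule is a choice made for the example, not documented Wayback behavior.

```python
from datetime import datetime
from typing import List

FMT = "%Y%m%d%H%M%S"  # 14-digit timestamp format used in Wayback URLs


def closest_capture(captures: List[str], requested: str) -> str:
    """Return the capture timestamp nearest to `requested` (ties go to the earlier one)."""
    target = datetime.strptime(requested, FMT)
    # sorted() puts earlier captures first, and min() keeps the first minimum it sees.
    return min(sorted(captures), key=lambda ts: abs(datetime.strptime(ts, FMT) - target))
```

With two captures a few seconds apart, which one is shown depends only on which timestamp the request is nearer to, regardless of which tool produced the record.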
-
HiccupJul
so if there was one made by the wayback machine itself, that maybe missed some sort of dynamic content, and then one a few seconds afterwards made with archivebot, which might have better support for some dynamic content (i guess), then changing the datetime you are currently viewing would switch between a "complete" and "incomplete" page?
-
HiccupJul
also, i want to archive a reddit thread, but neither archive.org nor archive.is get anything apart from the top 100 or so comments. what's a good tool for archiving a whole thread?
-
EggplantN
maybe #archivebot ?
-
EggplantN
#shreddit may have already archived it
-
thuban
s-crypt: script github.com/dropbox/dbxcli to get them one at a time automatically?
-
HiccupJul
i'll try archivebot
-
HiccupJul
is there a known reason why archivebot works better on some pages? is it a purposeful decision by the internet archive to limit the wayback machine? (i know Archive Team isn't the wayback machine, but IA stopped responding to emails)
-
thuban
what do you mean exactly by "works better"?
-
HiccupJul
i think i misremembered something about archivebot working better on dynamic content.
-
HiccupJul
*possibly misremembered
-
thuban
i think so. "Save Page Now" is based on brozzler, a browser emulator (which can execute javascript), while archivebot is based on wpull (which can't), so spn is generically going to be better for sites that depend on dynamic content. (it's not perfect, though, so sites that serve a dynamic version to spn and a non-dynamic version (eg because of the user-agent) to archivebot may work better in the latter)
-
thuban
warcs from archiveteam _projects_ do include dynamic content, since they're manually scripted to do so
-
HiccupJul
ah i see
-
HiccupJul
thanks, that makes things much clearer
-
thuban
so no, i don't think archivebot would get you the entire thread automatically. i was going to suggest that you peek at the 'more comments' network requests and submit all of them to ab, since that should make them play back properly in the wbm, but it seems that they're POSTs. iirc wbm can handle those but archivebot can't
-
HiccupJul
the more comment links seem to use form data and/or cookies
-
HiccupJul
so it doesn't seem like the url is enough
-
HiccupJul
url is just "reddit.com/api/morechildren" for expanding sub-comments
-
thuban
that's what i said, yes :)
-
HiccupJul
ah right "Form Data" = POST
-
wessel1512
is there a project that can help me archive the home.xs4all.nl homepages
-
thuban
can you describe the problem in more detail? (do you already have a list of urls?)
-
wessel1512
a small first one yes
-
thuban
how big? do the urls represent entire "homepages" or just the front pages of small sites? what would you need to do to get a more complete list?
-
wessel1512
i believe that there are probably upwards of a half million sites
-
thuban
in total or in your list
-
wessel1512
in total
-
wessel1512
just began scraping
-
wessel1512
my first list has 350 urls
-
wessel1512
the sites can be very big
-
wessel1512
but most of them are pretty small
-
thuban
ok. that puts it out of scope for urls (no recursion) but it should be doable with archivebot. is that right, JAA?
-
wessel1512
archivebot is off limits for this project
-
thuban
why's that?
-
wessel1512
probably because it would clog the system
-
thuban
i'm not sure what you mean. did someone tell you that?
-
wessel1512
yea jaa said no
-
JAA
I'm sure 350 is nowhere near the actual number.
-
JAA
Also, let's take this to #webroasting because that's exactly what the channel exists for.