-
h2ibot
Manu edited Discourse (+195, add various active discourse forums):
wiki.archiveteam.org/?diff=50143&oldid=49768
-
fireonlive
i love how it's very tech biased :D
-
h2ibot
FireonLive edited Discourse (+54, add NTP Pool Community):
wiki.archiveteam.org/?diff=50144&oldid=50143
-
fireonlive
I assume Discourse is in #msgbored as well yeah?
-
JAA
Yep
-
fireonlive
:)
-
Video
in retrospect it was probably not a good idea to name myself Video
-
fireonlive
lemme kinda just janky wanky that into there
-
fireonlive
Video: xD
-
fireonlive
i wonder if myself thinks similar
-
fireonlive
wiki getting rid of /index.php/ when? :D
-
h2ibot
FireonLive edited Discourse (+218, slap in a 'place to gather at' for now):
wiki.archiveteam.org/?diff=50145&oldid=50144
-
h2ibot
FireonLive edited Current Projects (+146, let's keep recently finished around a bit longer):
wiki.archiveteam.org/?diff=50146&oldid=50119
-
h2ibot
FireonLive edited Discourse (+28, add logo :)):
wiki.archiveteam.org/?diff=50148&oldid=50145
-
fireonlive
ok no more wiki spam from me for now :P
-
nulldata
-
fireonlive
.....
-
nulldata
What an awful response to deleting customer data without ensuring they are aware outside of an Email
-
fireonlive
yes indeed
-
fireonlive
reminds me of Heroku, just being like well we emailed you, time to shred all the free tier stuff
-
fireonlive
iirc that's how it went down
-
JAA
TIL 'scream test'
-
fireonlive
news.ycombinator.com/item?id=36660869 makes a good 'what could have been done'
-
fireonlive
ye it's a fun term :D
-
fireonlive
news.ycombinator.com/item?id=36660626 "After that we make best efforts but if people can’t respond to vendors they pay money to, we’re really at a loss."
-
fireonlive
seems to be the only two comments he's made there
-
nicolas17
yikes
-
fireonlive
oh that nicolas17 and his pings
-
BPCZ
Damn they really should have scream tested that instead of just pulling the plug. 1 month of no access would have saved anyone worth saving
-
pabs
tech234a: pass that domains thing to #// (the URLs project)?
-
tech234a
could be nice to get the homepages of a bunch of domains
-
JAA
So regarding the EPA Archive, they've removed the shutdown date from the homepage at some point in March.
-
fireonlive
o_O
-
h2ibot
PaulWise edited Mailman2 (+24, lists.sucs.org done):
wiki.archiveteam.org/?diff=50149&oldid=50133
-
nstrom|m
btw if anyone has any spare IPs to throw at #wuciyuan, project has 2 days left before site shutdown and still lots to go
-
nstrom|m
with caveats that a) worker IPs will show up in saved webpages and b) site seems to block on anything higher than 1 concurrency
-
nstrom|m
that said we're not even archiving images yet so it's barely any bandwidth usage right now
-
Exorcism|m
<nstrom|m> "with caveats that a) worker..." <- even with 1 concurrency it's difficult lol 😭
-
systwi
JAA: My script that collects IRC URLs omits #archivebot , #// and #Y . I had considered omitting channels such as #imgone but I thought there was a possibility of the bot generated lists not being explicitly saved through AB.
-
systwi
Since I've seen you save things manually before.
-
systwi
I thought, in case you were to happen to miss something, my script would have that covered.
-
JAA
systwi: Yeah, could happen, but duplicating most things is not a good approach there. I do try to cover everything. I sometimes process things in batches if there are too many to handle semi-manually.
-
systwi
If queueh2ibot does this automatically, at least WRT #imgone bot-generated lists, I can omit #imgone as well, or any other channels at your request.
-
JAA
Well, 'everything'. I generally omit things for which we have specific projects (no point in trying to !ao a mediafire.com URL when it gets archived by #mediaonfire) or for which AB doesn't work (e.g. YouTube).
-
JAA
I do this for all AT channels.
-
systwi
Okay, I think I'm following. So channels such as #mediaonfire and #imgone are likely safe to ignore.
-
JAA
Perhaps I should automate it completely, but sometimes there are URLs that need manual treatment. I.e. where simple direct archival causes actual harm.
-
OrIdow6
Believe I have found (what should have been a very obvious) way to enumerate Wysp
-
systwi
JAA: Direct archival as in !ao example.com/theActualThing or including the actual thing in the IRC URLs list, too?
-
JAA
systwi: Either. The URL needs to be modified first.
-
JAA
E.g. i.postimg.cc URLs, when !ao'd directly or as part of a list, don't archive the images because of their hotlink prevention.
-
JAA
So you get broken snapshots in the WBM etc.
-
systwi
I see, okay. So, hmm... My original plan was just to grab everything from that day, en masse, and awesome if it happens to work. If not, yeah that's a shame, but my thought process was the "grab things out of a burning building" approach, considering I've seen URLs die in as little as a couple hours (thankfully with one WBM capture, courtesy of AB).
-
systwi
I can change the behaviour of what my script does to certain URLs beforehand, if that is what you were thinking.
-
systwi
I already have it grep Imgur images into their own list for #imgone .
-
systwi
s/images/URLs/
-
systwi
JAA: If you have any specific suggestions, feel free to send them my way. I'm touching up my script a bit so I can share it later. I have a blacklist with key words that would be better kept in a separate file. For the time being/until further instruction, this is the only change I'm making for now.
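
A minimal Python sketch of the filtering described above, not systwi's actual bash script: it skips the channels named in the discussion, drops URLs matching a keyword blacklist kept in a separate file, removes duplicates, and routes Imgur URLs into their own list. The function names, the `(channel, text)` input shape, and the blacklist file format are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical names throughout; the real script is a bash one-liner that grew.
SKIPPED_CHANNELS = {"#archivebot", "#//", "#Y"}  # channels named in the chat
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def load_blacklist(path):
    """Read one keyword per line from a separate file, ignoring blanks and # comments."""
    with open(path, encoding="utf-8") as fh:
        lines = (line.strip() for line in fh)
        return [line for line in lines if line and not line.startswith("#")]


def collect_urls(messages, blacklist=()):
    """Split URLs found in (channel, text) pairs into (general, imgur) lists."""
    general, imgur = [], []
    seen = set()
    for channel, text in messages:
        if channel in SKIPPED_CHANNELS:
            continue
        for url in URL_RE.findall(text):
            url = url.rstrip(".,)")  # trailing punctuation is rarely part of the URL
            if url in seen:
                continue  # avoid queueing the same URL twice
            seen.add(url)
            if any(word in url for word in blacklist):
                continue
            host = (urlsplit(url).hostname or "").lower()
            if host == "imgur.com" or host.endswith(".imgur.com"):
                imgur.append(url)  # destined for the Imgur project's list
            else:
                general.append(url)
    return general, imgur
```

Deduplicating against what is already in the #archivebot logs, the harder problem JAA raises, is not covered here; this only avoids duplicates within one day's collected list.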
-
JAA
systwi: Well, anything that avoids duplicates. But it's hard to coordinate if those URLs aren't in the #archivebot logs.
-
JAA
Part of why I stopped doing list jobs.
-
systwi
Hmm, yeah, that would be a bit tricky to pull off (also doesn't help that I haven't slept and am thinking about this).
-
systwi
When I am able to send over a copy of my script maybe you will have some ideas on how to implement that.
-
systwi
It's nothing too fancy. It's a bash script that started as a oneliner, and could probably do fine as a Python script, too, if I was as proficient in that language.
-
manu|m
Not sure if y'all caught this, but Nitter is working again
-
Exorcism|m
oh really ?
-
manu|m
I just checked, at least for nitter.net it does. The fix even made it on the front-page of orange site.
-
yetanotherarchiver|m
-
yetanotherarchiver|m
(with Updated tag)
-
vokunal|m
On banciyuan, is Bad SSR data kind of like a rate limit or is it expected?
-
JAA
→ #wuciyuan
-
vokunal|m
ah ty
-
vokunal|m
Idk how, but i've had #wuciyuanr this whole time :P
-
cm
not sure if this is the right channel but would anyone be able to help me get brozzler running?
-
cm
so far I am stuck running pip3 install brozzler[easy]
-
cm
it keeps hanging at the markupsafe dependency, when run with and without a venv
-
cm
this is on debian bullseye
-
pokechu22
I assume that beyond the contents of futurequest.net, we'd also need to identify what sites are hosted by them and save those
-
tzt
there's this bgp.he.net/net/69.5.0.0/19#_dns but it is truncated
-
Exorcism|m
anyways, I'm adding this to deathwatch
-
pokechu22
... and, that forum is mostly restricted, it seems
-
Exorcism|m
darn :c
-
Exorcism|m
Exorcism|m: added
-
pokechu22
based on subdomainfinder.c99.nl/scans/2023-07-04/futurequest.net there's a bunch of internal services that aren't accessible (giving 401s). I've thrown the ones that do something else into AB
-
tzt
-
flashfire42
I will get started on these I guess
-
fireonlive
-
fireonlive
(cont'd from #archivebot)
-
fireonlive
bgp.tools/prefix/69.5.0.0/19#dns has more entries ('show forward DNS') but it's truncated per host unless you happen to own an AS or otherwise have a login
-
fireonlive
s/host/IP/
-
nulldata
-
fireonlive
lmao
-
nulldata
Man if I was an InfluxDB Cloud customer I'd be getting my data out ASAP - even if the data deletion didn't affect my region
-
fireonlive
yeah, they really fucked the dog with that one
-
nulldata
InfluxDB: "Sorry we purposely deleted your data with only an Email notice that may or may not have been actually sent. Feel free to sign-up again in a different region!"
-
fireonlive
they're in the tech space too, so would have probably heard of the heroku shredding free DBs as well
-
nulldata
That kind of shit is why I steer clear of SaaS for business production stuff if I can
-
JAA
-bs and -ot are swapped now. :-|
-
nulldata
Sorry!
-
fireonlive
it's opposite day!
-
Doranwen
LOL
-
rewby
arkiver: Anyway, as I was going to say: Looks like they don't entirely mind people doing archivism? Quote: "However, we don't have a policy against responsible data collection — such as those done by academic researchers, fans backing up works to Wayback Machine or Google's search indexing."
-
rewby
Notably that part about the WBM
-
rewby
Given that our crap usually ends up there
-
Doranwen
Anyway, the original purpose for the login wall with AO3 was just privacy in general - reduced the chances of someone seeing your explicit fic if they found your nick, for instance, unless they also had an account.
-
Doranwen
But now more people are locking theirs because of OpenAI, yeah.
-
rewby
I think if we figure out a way to dump the whole thing into some darkened items
-
rewby
Probably would be fine?
-
rewby
Although we might consider emailing them
-
Doranwen
Though my experience with LOTR books was something like 11k unlocked, 3k locked? Granted, I had a few filters on so it wasn't the raw fandom without exclusions, but…
-
Doranwen
And the ao3downloader script is really an excellent way to grab stuff as it uses their api and respects requests for breaks.
-
rewby
The way I see it: It's small enough that I doubt this will hit the 1T mark after zstd
-
rewby
But could very well be worth preserving
-
rewby
If nothing else it's a timecapsule of how zeitgeist and popularity changes over time
-
Doranwen
It'll say "ao3 has requested a 206 second break" and give the timestamps for pausing and resuming.
-
rewby
I'm sure there's some fun stats to be made about popularity of fandoms over time
-
Doranwen
Oh definitely.
-
Doranwen
The breaks do vary, though, so it's not a static thing, must respond to the general load on the site at the time.
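
A rough sketch of what "respecting requests for breaks" can look like in a downloader. It is not ao3downloader's code; it assumes the server communicates the wait as a number of seconds (for instance in an HTTP `Retry-After` header, which is an assumption here and not something stated in the chat). All function names are hypothetical.

```python
import time
from datetime import datetime, timedelta


def parse_break_seconds(value, default=60):
    """Interpret a server-supplied wait value as whole seconds, else use a default."""
    try:
        seconds = int(value)
    except (TypeError, ValueError):
        return default
    return seconds if seconds >= 0 else default


def plan_break(seconds, now=None):
    """Return (pause_time, resume_time) for a requested break of `seconds`."""
    now = now or datetime.now()
    return now, now + timedelta(seconds=seconds)


def take_break(value, sleep=time.sleep):
    """Log the pause and resume timestamps, then wait for the requested time."""
    seconds = parse_break_seconds(value)
    pause, resume = plan_break(seconds)
    print(f"server has requested a {seconds} second break")
    print(f"pausing at {pause:%H:%M:%S}, resuming at {resume:%H:%M:%S}")
    sleep(seconds)
```

Because the wait is read from each response rather than hard-coded, the pause length follows whatever the site asks for at that moment, which matches the varying breaks described above.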
-
rewby
I mean, if we just ask them for collaboration, maybe they'd be open to it?
-
rewby
And we can work out the best way to get as much as possible
-
arkiver
rewby: sounds like it's time to rebrand Archive Team as fanclub :P
-
Doranwen
LOL
-
Doranwen
Well, fandom is what got me saving Yahoo Groups, after all…
-
rewby
arkiver: I don't get it
-
arkiver
"fans backing up works to the Wayback Machine" is allowed
-
Doranwen
There's definitely an intersection of archivists and fandom.
-
rewby
I mean, just sending an email going "Hi, we do a lot of archival stuff and we'd like to archive the works on your site. Is this okay with you all and can we collaborate?"
-
rewby
Would probably do a lot
-
pokechu22
After looking at futurequest.net some more it seems like the rest of the forums do still work - futurequest.net/forums/forumdisplay.php?f=1 is empty, but futurequest.net/forums/forumdisplay…rt=lastpost&order=desc&daysprune=-1 exists just fine.
-
rewby
Especially since this will involve bypassing a login wall
-
rewby
But if the admins are okay with it, we could probably just get an account just for this purpose
-
arkiver
how do you bypass the login wall?
-
arkiver
hmm
-
arkiver
not sure about account
-
JAA
Or perhaps they could whitelist a UA or similar.
-
arkiver
data may not go into the Wayback Machine
-
rewby
Yeah
-
arkiver
yes they could whitelist
-
arkiver
i'm not a fan of an account
-
rewby
The reason for the account is as posted in -ot: To prevent OpenAI from grabbing stuff they forced a bunch of stuff behind logins
-
rewby
But they're completely open to archiving things as far as I can tell
-
rewby
So we could probably do something like a UA whitelist
-
rewby
Or something
-
rewby
I don't really see the problem with just having an account we use for the archive and then disable after we get stuff
-
rewby
Like, yeah good luck
-
rewby
* doing anything with that
-
fireonlive
the official ArchiveTeam AO3 account
-
fireonlive
x3
-
Doranwen
Rather than backing it all up to the WBM - which it already is, to some extent - I'd suggest just grabbing the files they already provide.
-
rewby
Could do
-
Doranwen
There's no user-identifiable data in them, even the locked ones.
-
fireonlive
user-identifiable being the author?
-
Doranwen
And then you can use already-made tools like ao3downloader (it's not the only one, but I think it's probably the best out there).
-
Doranwen
Well, that's identifiable, lol.
-
Doranwen
The user browsing the files, I meant.
-
fireonlive
ah ok
-
Doranwen
There's no way to tell, from looking at a file you downloaded, that *you* were the one who downloaded it.
-
fireonlive
i thought it strange that would include author/tags/etc :p
-
fireonlive
s/include/exclude/
-
Doranwen
The fics always have author, tags, etc. on them.
-
Doranwen
As well as the link of the fic they came from.
-
fireonlive
ah
-
Doranwen
So they could always be used afterward to grab links for the WBM if so desired.
-
tzt
anyone know any better way to get reverse IP data to find futurequest sites
-
Doranwen
For a general file estimate, for one of the fandoms I grabbed fics for, I have 3,899 epubs which total 172.4 MB.
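
The back-of-the-envelope arithmetic behind that estimate, as a sketch. It assumes decimal megabytes and that this one fandom's sample is representative, neither of which the chat establishes; the total number of works on the site is not given, so the last figure only shows how many files of this average size would add up to 1 TB.

```python
EPUB_COUNT = 3_899   # epubs in the sample quoted above
TOTAL_MB = 172.4     # their combined size, taken as decimal megabytes (assumption)

avg_bytes = TOTAL_MB * 1_000_000 / EPUB_COUNT
avg_kb = avg_bytes / 1_000

# Number of average-sized epubs that would fit in one decimal terabyte.
works_per_tb = 1e12 / avg_bytes

print(f"average epub size: {avg_kb:.1f} kB")
print(f"epubs per TB at that average: {works_per_tb / 1e6:.1f} million")
```

This is for uncompressed epubs only; a crawl of the HTML pages would have a different size profile.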
-
fireonlive
tzt: like IP Y hosts domains example.com, example.net?
-
tzt
yes
-
fireonlive
ah, there was some more on bgp.tools, but need a login sadly
-
fireonlive
(each host is elided)
-
fireonlive
after about 2-3
-
rewby
bgp.tools mostly uses certificate transparency for the domain -> ip mapping IIRC
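
A sketch of the "which domains sit on which IP" step, assuming a list of (domain, IP) pairs has already been obtained from somewhere (a database extract, resolved certificate transparency hostnames, or a passive DNS export). It does not query bgp.tools or any other service; it only keeps pairs inside the 69.5.0.0/19 prefix mentioned earlier and groups them by address. The function name and input shape are hypothetical.

```python
import ipaddress
from collections import defaultdict

# The prefix discussed above for futurequest.
PREFIX = ipaddress.ip_network("69.5.0.0/19")


def group_by_ip(pairs, prefix=PREFIX):
    """Group domains by IP address, keeping only addresses inside `prefix`."""
    hosted = defaultdict(set)
    for domain, ip in pairs:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # skip rows whose IP field is not a valid address
        if addr in prefix:
            # Normalise case and any trailing dot so duplicates collapse.
            hosted[str(addr)].add(domain.lower().rstrip("."))
    return {ip: sorted(names) for ip, names in hosted.items()}
```

The sorted domain lists per IP could then be flattened into a plain list of homepages to hand to a URL project.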
-
Doranwen
Two of the fics were each just under 11, another was 9.6. Only a couple dozen were over 1. Over 1,800 of them were under 10 kB.
-
flashfire42
I am going to go through them over my day today but I will have a lot of stuff to queue and also a few chores to do so will see how I go
-
fireonlive
ahh ok
-
fireonlive
there's companies that offer 'passive dns' lookups... but usually charge, securitytrails.com/dns-trails
-
rewby
I could look up how bgp.tools does it, but not tonight
-
fireonlive
though that found nothing for 69.5.3.189 lol
-
fireonlive
ah it found one
-
rewby
fireonlive: It does?
-
rewby
I see like 9 sites?
-
fireonlive
it showed nothing, then now it just shows 'pop.agaveguides.com'
-
fireonlive
weird
-
rewby
I have a full bgp.tools account, shows up fine
-
fireonlive
oh was talking about securitytrails
-
fireonlive
bgp.tools is better for that
-
rewby
Ah ST
-
rewby
I mean, if you care much I can go ask for a quick db extract
-
fireonlive
tzt/pokechu22 could probably find that useful
-
fireonlive
i need to sacrifice a goat to benjojo for a login one of these days i guess :p
-
rewby
I'll just at ben and get him to dump the db for me
-
pokechu22
I don't think I have time to do anything more elaborate with futurequest beyond what I've just done with the forums
-
fireonlive
oh hey i remembered who owned it
-
fireonlive
:3
-
tzt
i'm trying to get a list of all the sites hosted with them so it could be run in #// or #Y since it's shutting down in 4 days
-
fireonlive
tzt 🤝 rewby
-
fireonlive
:)
-
rewby
tzt: I've asked ben to dump me the data from bgp.tools for their ip range
-
tzt
rewby: thank you
-
rewby
He's asleep as far as I can tell, so probably tomorrow
-
rewby
I could go figure out how bgptools gets its data, but I wanna go sleep too
-
fireonlive
night rewby :)
-
fireonlive
don't let the targets bite
-
fireonlive
:p
-
rewby
Ben's a decent coder but dear god bgptools' code can be a pain due to just the sheer stupidity of half the things it needs to talk to
-
rewby
So I'm not doing that tonight
-
fireonlive
xD
-
TheTechRobo
Oh, JAA, here's what happens every so often with `docker logs` on the project containers:
-
TheTechRobo
b9310074d~tplv-banciyuan-sq90.image?x-expires=1705101090&x-signature=gOL%2Bbdr9v
-
TheTechRobo
mHDm%2Fxu%2FDcl9%2FX44uA%3D
-
TheTechRobo
error from daemon in stream: Error grabbing logs: unexpected EOF