-
fireonlive
we've maybe seen the nintendo stuff earlier yeah?
-
Pedrosso
So... What services does Nintendo hold with that network?
-
fireonlive
friend was complaining that all the player-made/shared levels for the first game would be unavailable
-
fireonlive
unsure if there's a way to do anything with that sadly
-
Pedrosso
Was a grab of Super Mario Maker games ever done? What other sorts of things could be of interest for archival?
-
Pedrosso
courses*
-
JAA
Assuming OneHallyu stays up, the topic retries should be done in a bit over 2 hours. I'll then run another similar thing for the remaining topics that are being done sequentially since that's so slow. Also one more topic failed with timeouts.
-
JAA
Some of these topics have well over 10k pages, pretty insane.
-
Terbium
another forum bites the dust, forums are disappearing rapidly :(
-
fireonlive
everyone loves fucking discord these days
-
Terbium
sadly the case, forums are moving to the free, easy-to-use walled garden known as discord :/
-
Terbium
wrote a discord archiver recently to archive discord servers into a database, hopefully their API doesn't change too much in the coming months
-
fireonlive
i was using DiscordChatExporter but sadly it doesn't quite support those 'new fangled' 'forum' channels (and threads in normal channels) yet
-
Pedrosso
a discord archiver you say?
-
fireonlive
OwO
-
Terbium
Yeah, DiscordChatExporter didn't really suit my needs (large scale distributed crawling with database store)
-
Terbium
Decided to write up a basic crawler
-
Terbium
Currently doesn't grab attachments, but planned as a feature soon
-
fireonlive
:)
-
fireonlive
sounds like you do some fun stuff
-
fireonlive
:p
-
Terbium
More like preparing for Discord's inevitable demise :P
-
fireonlive
true true xP
-
Pedrosso
How is the crawler you wrote different than DiscordChatExporter?
-
fireonlive
at the very least i would assume discord hates it more
-
fireonlive
:p
-
Terbium
Python based (no .NET thankfully), dumps everything to database as fast as possible, distributed (can use multiple instances to allocate servers/channels to crawl with different accounts/IPs)
-
Terbium
No attachment downloading right now (can backfill later)
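
A rough sketch of the two ideas described above (splitting channels across several crawler instances, and dumping everything into a database), not Terbium's actual code. The hashing scheme, the `messages` table layout, and the function names `assign_worker`, `open_store` and `store_messages` are all assumptions made for illustration; SQLite stands in for whatever database is really used.

```python
import hashlib
import json
import sqlite3
from typing import Dict, List


def assign_worker(channel_id: int, num_workers: int) -> int:
    """Map a channel to a worker index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(str(channel_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) messages table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, "
        "channel_id INTEGER NOT NULL, "
        "raw_json TEXT NOT NULL)"
    )
    return conn


def store_messages(conn: sqlite3.Connection, channel_id: int, messages: List[Dict]) -> None:
    """Insert raw messages; rows already present are skipped via the primary key."""
    rows = [(int(m["id"]), channel_id, json.dumps(m)) for m in messages]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO messages (id, channel_id, raw_json) VALUES (?, ?, ?)",
            rows,
        )
```

Each instance would only crawl the channels where `assign_worker(channel_id, num_workers)` equals its own index, and storing the raw JSON keeps the door open for backfilling attachments later.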
-
fireonlive
(no .NET thank you so much)
-
Pedrosso
what's so bad about .NET ?
-
fireonlive
it's not python ™
-
Pedrosso
True
-
Terbium
I saw .NET, I gagged so hard I ended up writing my own crawler
-
fireonlive
:D
-
fireonlive
any issues foreseen with discord requiring 'the parameters' soon?
-
fireonlive
for earlier grabs/crawls that might not have them
-
Terbium
It's simple enough to rewrite in go or rust, but I don't really care as it's not performance intensive (all IO bound)
-
fireonlive
(and i guess they'll expire at some point on the attachment urls?)
-
fireonlive
if we really wanted developers here, we'd just need to make a few posts around the internet saying 'rust would never be able to be up to the task of ArchiveTeam's needs'
-
Terbium
I believe you can regenerate the links with refresh tokens
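
On the expiry question, a minimal sketch of checking whether a stored attachment URL needs regenerating. It assumes the signed URLs carry an `ex` query parameter holding the expiry time as a hexadecimal Unix timestamp; that format is an assumption here, and the helper names `attachment_expiry` and `needs_refresh` are made up for the example. URLs from earlier grabs without the parameter are simply treated as needing a refresh.

```python
import time
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def attachment_expiry(url: str) -> Optional[int]:
    """Return the expiry as a Unix timestamp, or None if there is no usable `ex` value.

    Assumption: `ex` is a hexadecimal Unix timestamp.
    """
    values = parse_qs(urlsplit(url).query).get("ex")
    if not values:
        return None
    try:
        return int(values[0], 16)
    except ValueError:
        return None


def needs_refresh(url: str, now: Optional[float] = None, margin: int = 300) -> bool:
    """True if the URL has no expiry parameter or expires within `margin` seconds."""
    expiry = attachment_expiry(url)
    if expiry is None:
        return True  # older grab without the parameters: assume it must be regenerated
    current = time.time() if now is None else now
    return expiry - margin <= current
```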
-
fireonlive
and the RETF would descend hell on here
-
fireonlive
(rust evangelism task force) to prove us wrong
-
Pedrosso
o.o
-
Terbium
DiscordChatExporter is great for personal exporting for the average user, just didn't suit my needs. It's not a bad app for the casual archiver
-
Pedrosso
I feel like this fits in #discard lol
-
Terbium
I know Sanqui and TheTechRobo were working on Discard for MITM based crawling. I think that stalled
-
Pedrosso
what does DiscordChatExporter do badly, other than not being able to handle the new features?
-
Terbium
Mostly scalability
-
Terbium
Not as easy to use 50 Discord accounts. Multiple accounts are needed due to server cap for each account (unless you leave/rejoin to swap in and out servers)
-
Pedrosso
I see I see. I think I'd leave this to the non-casual archivers. Hah. How've you been using it so far?
-
fireonlive
i think if I read it correctly as well Terbium's can do 'follow-up's very easily
-
fireonlive
i.e. get new messages since last visit
-
fireonlive
(or maybe has it built in already to continuously do so)
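
A sketch of how "get new messages since last visit" could work, assuming the crawler remembers the highest message ID it has stored per channel. `fetch_page` is a hypothetical stand-in for whatever actually asks Discord for messages after a given ID; no real API call is made here. The snowflake helper relies on message IDs encoding a millisecond timestamp counted from the start of 2015 in their upper bits.

```python
from typing import Callable, Dict, List

DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z in Unix milliseconds


def snowflake_to_unix_ms(snowflake: int) -> int:
    """Extract the creation time (Unix ms) from a snowflake ID."""
    return (snowflake >> 22) + DISCORD_EPOCH_MS


def fetch_new_messages(
    fetch_page: Callable[[int, int], List[Dict]],
    last_seen_id: int,
    page_size: int = 100,
) -> List[Dict]:
    """Collect every message newer than `last_seen_id`, oldest first.

    `fetch_page(after_id, limit)` is hypothetical: it must return up to `limit`
    messages with IDs greater than `after_id`, in any order.
    """
    collected: List[Dict] = []
    cursor = last_seen_id
    while True:
        page = sorted(fetch_page(cursor, page_size), key=lambda m: int(m["id"]))
        if not page:
            break
        collected.extend(page)
        cursor = int(page[-1]["id"])  # continue after the newest message seen
        if len(page) < page_size:
            break  # a short page means the channel is caught up
    return collected
```

Passing `last_seen_id=0` would fetch a channel from the beginning, so the same loop covers both the first crawl and later follow-ups.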
-
Terbium
Just started, so slowly scaling up (trying to find large lists of discord servers), and throwing 10 accounts and IPs at them
-
fireonlive
:)
-
Pedrosso
Does it go recursively too? As in if it finds an invite link does it try it and then go from there?
-
Pedrosso
hm, but I suppose that doesn't work for servers which you have to manually figure out stuff like roles for viewing channels..
-
Terbium
Nope, it's very dumb right now, just crawls the servers the account has access to
-
Terbium
Yeah, the roles stuff causes lots of problems for me
-
Terbium
Especially "Verify your phone number" and all that nonsense
-
fireonlive
ah yeah the phone number thing :/
-
Terbium
Would be nice if we had direct access to their ScyllaDB clusters
-
Pedrosso
I suppose a dumb bot will get lots of content still.
-
Pedrosso
Especially since we have very long lists of servers
-
Pedrosso
Terbium: Please do keep us up to date with this
-
Terbium
*disappears from AT for another 12 months*
-
JAA
Ok, I should have all OneHallyu topic pages now, I think.
-
fireonlive
:D
-
fireonlive
are attachments to be attempted?
-
JAA
Do you have an example? I couldn't find any.
-
fireonlive
oh, i don't
-
fireonlive
oh! i meant media i guess
-
fireonlive
i think you said you skipped .. something
-
JAA
The only things I saw hosted on OneHallyu itself were avatars. But maybe I just didn't look in the right place.
-
fireonlive
ah ok :)
-
fireonlive
faulty memories!
-
JAA
I did this with qwarc. qwarc doesn't care about HTML. So no page requisite extraction or similar.
-
JAA
qwarc fetches a URL you give it and writes it to WARC. Basically everything else is left as an exercise to the user.
-
fireonlive
:)
-
JAA
Oh, two topics failed. One is a 'count to a million' forum game, the other just a random small discussion.
-
JAA
The former doesn't even have 5k pages, but it's extremely slow.
-
arkiver
some inefficient pagination i guess
-
JAA
No, there are far larger topics that are faster.
-
JAA
Largest I saw had 18k pages.
-
fireonlive
damn
-
JAA
(I didn't systematically check though, so maybe that isn't even the largest one that exists.)
-
JAA
Anyway, it's getting grabbed now, whether the server likes it or not. :-)
-
fireonlive
👀
-
JAA
Ah, now the response time is actually decent.
-
fireonlive
:D
-
JAA
That topic is done as well now, and that should be everything that's accessible. (I saw a small number of 403s.)
-
JAA
src extraction is running but will take a little while.
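
For illustration, a minimal sketch of what pulling `src` values out of saved pages can look like, using only the Python standard library. This is not JAA's actual extraction; the `SrcExtractor` class and `extract_src` function are hypothetical names, and reading the HTML back out of the WARC files is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class SrcExtractor(HTMLParser):
    """Collect the value of every `src` attribute, resolved against the page URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        for name, value in attrs:
            if name == "src" and value:
                self.urls.append(urljoin(self.base_url, value))


def extract_src(html: str, base_url: str) -> List[str]:
    """Return the unique absolute src URLs found in `html`, in order of appearance."""
    parser = SrcExtractor(base_url)
    parser.feed(html)
    parser.close()
    seen = set()
    ordered = []
    for url in parser.urls:
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered
```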
-
arkiver
outlinks going to #// ?
-
JAA
Possibly later. Just focusing on onsite stuff now since that'll vanish very soon.
-
arkiver
got it
-
arkiver
sounds good
-
SketchCow
Merry Christmas, maniacs
-
fireonlive
thanks sketchy
-
pabs
publicwww.com — a search engine for stuff in websites' source code.
-
fireonlive
ooh neat
-
pabs
paid beyond alexa rank 1mil, higher costs for further down the ranking
-
pabs
ouch, $499/month for all URLs
-
» pabs wonders how that compares to shodan
-
pabs
er 3mil not 1mil
-
pabs
and $49/month gets you all URLs, but only 100 searches/day up to 100K rows
-
pabs
hmm, I think the 1mil was without an account
-
qwertyasdfuiopghjkl
JAA: for OneHallyu, did you save the user profiles? (You can do onehallyu.com/profile/1-- onehallyu.com/profile/2-- onehallyu.com/profile/3-- etc. and get redirected to the correct name. It looks like there are also different tabs on each profile page that need to be requested separately.)
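
The enumeration described above could be sketched like this. The `https://` scheme and the function name `profile_urls` are assumptions, the upper ID bound is unknown, and the per-profile tabs are not covered because their URLs are not given here.

```python
from typing import Iterator


def profile_urls(first_id: int, last_id: int, base: str = "https://onehallyu.com") -> Iterator[str]:
    """Yield numeric profile URLs; the site is said to redirect each to the named profile."""
    for user_id in range(first_id, last_id + 1):
        yield f"{base}/profile/{user_id}--"
```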
-
JAA
qwertyasdfuiopghjkl: No, only topics.
-
JAA
I started an AB job for the OneHallyu src values I managed to extract, but it looks like the site is dying now and returning HTTP 522 (Buttflare's code for connection timeout to the upstream server) for a lot of things.
-
JAA
So maybe they took the server online and only what remains in Buttflare's cache is still around.
-
JAA
offline*
-
that_lurker
Could someone grab the upcoming Finnish presidential election candidates' websites?
lounge.kuhaon.fun/folder/65908e5765…hPresidentialElectionCandidates.txt
-
JAA
that_lurker: #vooterbooter
-
JAA
I'll run them later if nobody beats me to it.
-
that_lurker
ooh did not know there was a channel for this
-
that_lurker
also thanks :-)