-
fireonlive
JAA: you looking at forward dns right?
-
JAA
fireonlive: No, because Ryz isn't either.
-
JAA
(Based on the AB jobs, at least.)
-
JAA
But good point. :-)
-
Ryz
oo;
-
fireonlive
:3
-
Ryz
Forward DNS? Can you two clarify?
-
fireonlive
the lack of linking directly to that page doesn't help
-
fireonlive
hm, reverse DNS is like what you see when people connect to IRC
-
fireonlive
like the hostname of an IP
-
fireonlive
forward DNS is like www.google.com to an IP
-
fireonlive
some services like bgp.tools record which hostnames they've seen resolving to which IPs on their travels and provide 'forward dns' lookups
-
fireonlive
or 'what's hosted on this IP'
-
Ryz
Yeah, unfortunately I was a bit miffed on that and had to make those links access for me; then again, I was given what flashfire42 provided
-
flashfire42
Sorry
-
fireonlive
google.com is 142.250.179.206 for my server but 142.250.179.206's 'hostname' (or reverse DNS) is ams15s42-in-f14.1e100.net
-
fireonlive
so if you just looked up 142.250.179.206 you'd only see the latter but not necessarily the former
-
fireonlive
(unless you used a special service)
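
To make the forward/reverse distinction concrete, here is a minimal Python sketch using only the standard library. The function names are illustrative and not from the discussion; the IP is the one quoted above, and the two lookup functions need network access, so what they return depends on your resolver.

```python
from __future__ import annotations

import ipaddress
import socket


def ptr_name(ip: str) -> str:
    """Return the DNS name that a reverse (PTR) lookup for `ip` actually queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def forward_lookup(hostname: str) -> set[str]:
    """Forward DNS: hostname -> set of addresses (requires network access)."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def reverse_lookup(ip: str) -> str | None:
    """Reverse DNS: IP -> hostname from its PTR record, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


if __name__ == "__main__":
    # Pure string manipulation, no network needed.
    print(ptr_name("142.250.179.206"))
```

A reverse lookup only ever returns whatever the owner of the IP block put in the PTR record, which is why it cannot tell you which other domains point at that address.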
-
JAA
The rDNS section is also easy to check: you just do a bunch of DNS queries for each IP.
-
JAA
Forward DNS requires knowledge of the domains that resolve to an IP in the block, which is exactly what we're looking for.
-
JAA
The rDNS section data is old, might be worth rerunning that.
-
fireonlive
indeed :)
-
JAA
I see a bunch of domains there that no longer resolve to FutureQuest IPs.
-
fireonlive
there are various sources for fDNS but it's much harder to populate
-
fireonlive
ye, also rDNS has no verification
-
fireonlive
i could set my rDNS to free-bitcoin.google.com and that would be 100% ok
-
JAA
I suspect it's just outdated data in this case, but yeah.
-
JAA
I used to set my rDNS to a .invalid domain on a provider that let me. :-)
-
fireonlive
that too :)
-
fireonlive
ah :D
-
fireonlive
or maybe they migrated but didn't bother updating rDNS on the way out the door
-
fireonlive
and the going out of business host didn't bother either
-
JAA
Yeah, that's what I'm thinking.
-
JAA
Well, or it did get updated, but bgp.tools never refetched it.
-
fireonlive
mm
-
JAA
> Data Age between: 2022-10-15T11:32:55Z UTC and 2020-06-16T10:52:28Z UTC
-
fireonlive
ah there you go
-
JAA
So who wants to do a few thousand rDNS queries? :-)
-
fireonlive
ok uwu
-
fireonlive
8192 * 2 potential dns queries since i'll also check forward dns lol
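
A rough sketch of what such a scan could look like in Python, assuming the block in question is 69.5.0.0/19 (the prefix linked later in this log), which is where the 8192 comes from. This is not fireonlive's actual script; `scan` and its parameters are made up for illustration. Each address gets a PTR query, and each PTR hit gets a forward query to see whether the name points back at the same IP.

```python
from __future__ import annotations

import ipaddress
import socket


def reverse_lookup(ip: str) -> str | None:
    """Return the PTR hostname for `ip`, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname: str) -> set[str]:
    """Return the IPv4 addresses `hostname` resolves to (empty set on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def scan(prefix: str, reverse=reverse_lookup, forward=forward_lookup):
    """Yield (ip, hostname, confirmed) for every address in `prefix` that has a PTR.

    `confirmed` is True when the hostname resolves back to the same IP.
    The resolvers are parameters so they can be swapped out or rate-limited.
    """
    for ip in ipaddress.ip_network(prefix):
        name = reverse(str(ip))
        if name is None:
            continue
        yield str(ip), name, str(ip) in forward(name)


if __name__ == "__main__":
    for row in scan("69.5.0.0/19"):
        print(*row)
```

Running this sequentially without a sleep, as mentioned below, is usually tolerable for DNS, but a local caching resolver is kinder to upstream servers.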
-
JAA
Thanks
-
fireonlive
:)
-
fireonlive
i didn't add a sleep or anything but it's like DNS so should be ok lol
-
fireonlive
well a rough check
-
fireonlive
fdns = 69.5.* lol
-
fireonlive
979 matches so far
-
fireonlive
(on x.x.22.227 atm)
-
fireonlive
leela.futurequest.net genesis.futurequest.net. neon.futurequest.net. evangelion.futurequest.net. eva.futurequest.net.
-
fireonlive
nerds :P
-
JAA
Thanks!
-
JAA
1103 domains in there after excluding *.futurequest.net, which seem to be noise.
-
fireonlive
:)
-
fireonlive
JAA: there seems to be a lot from securecnc.net as well
-
fireonlive
mqs0042.securecnc.net mqs0043.securecnc.net etc
-
JAA
Hmm yeah
-
fireonlive
seem to share some names with futurequest as well
-
fireonlive
leela exists in both, genesis does too
-
JAA
1009 after kicking those out.
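
The filtering described here can be sketched as below. This is an assumption about the approach, not JAA's actual commands: the suffix list contains just the two provider domains named above, and the helper names are invented. It also normalises case and the trailing dot that DNS tools tend to print.

```python
# Provider-internal hostnames mentioned in the log; treated as noise.
NOISE_SUFFIXES = ("futurequest.net", "securecnc.net")


def normalise(hostname: str) -> str:
    """Lower-case a hostname and drop the trailing root dot, if any."""
    return hostname.strip().rstrip(".").lower()


def is_noise(hostname: str) -> bool:
    """True if the hostname is one of the noise domains or a subdomain of one."""
    name = normalise(hostname)
    return any(name == s or name.endswith("." + s) for s in NOISE_SUFFIXES)


def drop_noise(hostnames):
    """Return the sorted, de-duplicated hostnames that are not noise."""
    return sorted({normalise(h) for h in hostnames if not is_noise(h)})
```

Matching on `"." + suffix` rather than a bare `endswith(suffix)` avoids throwing out unrelated domains that merely end in the same letters.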
-
fireonlive
=]
-
Ryz
Yeah, I'm actually maybe getting a bit sick of constantly running jobs and checking the content myself, oof; I'm pondering queueh2ibot, but the problem is working out which ones are the real links, since I had to manually check whether they were actually dead or HTTP-only :/
-
Ryz
I think that's part of the reason I wasn't too sure on using that bot, JAA
-
JAA
HTTP vs HTTPS check can easily be automated.
-
Ryz
WWW and non-WWW too?
-
fireonlive
curl -m or something i suppose
-
Ryz
And I'm assuming a combination of the 4?
-
Ryz
JAA, there is unfortunately one other reason I went for the non-automated route at the time: spotting the jobs that failed because the site doesn't work on one pipeline but could on another, since I don't think the bot can detect something like that at all
-
Ryz
The occasional 'Connection closed' and 'Connection refused' entries on some of the jobs S:
-
JAA
Correct, someone would need to watch the dashboard and requeue those manually.
-
JAA
It's obviously still less work if the queueing happens automatically.
-
JAA
Also, ain't nobody got time for manually queueing a thousand websites.
-
Ryz
Mhm, even me eventually, since I would wanna spend more leisure time, or at least make jobs finish faster myself since that part isn't easily automated~
-
Ryz
I'll give you the list of what I have left
-
nicolas17
at least you need to batch them... send a hundred and *then* check how they go, instead of pasting them into IRC one by one and switching windows to the pipeline status for each and every one etc
-
Ryz
Each automated job should have concurrency 2, for leeway reasons, and ignoreset badvideos, because annoyingly, in some of the jobs I thought would be safe from New York Times videos... they reared their ugly head :/
-
Ryz
Here is the remaining stuff (except for the really big jobs, which I put in a separate, small list):
transfer.archivete.am/zr086/remaining-list - again, this is based on
bgp.tools/prefix/69.5.0.0/19#dns that flashfire42 provided; I did some cleanup, mainly the '.' at the end of the URLs, although there are still some of them left;
-
Ryz
I removed the futurequest.net domains because I did a sampling of a few and...they don't respond or exist S:
-
fireonlive
if you have vim you can type :%s/\.$//<enter> to remove the dots at the end
-
Ryz
Heh, I went around maybe 400-500 links before I burnt out, lol
-
fireonlive
ah
-
JAA
Ugh, their TLS is pretty broken.
-
JAA
All kinds of weak key and signature errors.
-
JAA
Or well, I guess it's the customers and their ancient setups, but same difference.
-
fireonlive
:(
-
JAA
Insecure OPENSSL_CONF time...
-
fireonlive
if they don't use TLSv1.3 it's not worth archiving
-
fireonlive
:p
-
» JAA offers fireonlive an SSLv2 server with an MD5 signature.
-
JAA
Best I can do.
-
fireonlive
*dies*
-
JAA
Scan finished, need to process it into something that can be queued, but too tired for that now.
-
fireonlive
would you like some coffee
-
fireonlive
re: oceangate; looks like every url except images/logo-offwhite-600.png returns the same HTML
-
fireonlive
wow it's literally named oceangate i just realized
-
fireonlive
anything ending in gate is just doomed
-
fireonlive
anyways was just checking if a sitemap or something existed still :D
-
fireonlive
oh, some other files too. (manifest, etc) i guess they just rewrote all 404s
-
pabs
or is the viewer not showing them up
-
h2ibot
PaulWise edited Bugzilla (+22, kde bugzilla):
wiki.archiveteam.org/?diff=50178&oldid=50163
-
h2ibot
PaulWise edited Mailman2 (+57, afrinic lists):
wiki.archiveteam.org/?diff=50179&oldid=50159
-
h2ibot
PaulWise edited Mailman2 (+11, twisted legacy archives):
wiki.archiveteam.org/?diff=50180&oldid=50179
-
h2ibot
OrIdow6 edited Wysp (+571, Initial remarks on grab):
wiki.archiveteam.org/?diff=50181&oldid=50167
-
imer
just going to repost this here so it doesn't get lost in the depths of #archivebot
-
imer
10:25 <mexat2> does archivebot have space for 7891 mini-blog entries? they're hosted on a site that has had no/very little activity for the past 3 years and may shut down anytime
-
imer
10:26 <imer> mexat2: maybe! do you have a list of urls/sites ready? (kindly upload to
transfer.archivete.am if you do)
-
imer
10:27 <imer> someone with permission (= not me) will look at it, might take a bit for someone to get to it though
-
imer
10:29 <mexat2> the whole forum needs to be archived as it's one of the few remaining giants of the Arabic web. the forum already has a sitemap up and ready for crawling.
-
imer
10:30 <mexat2> I try to archive whatever I can, but it takes forever using the Wayback Machine browser extension.
-
flashfire42
I will start on it on sunday
-
imer
ah, great
-
VickoSaviour
damn banciyuan images progress is rolling!
-
anarcat
pabs: not much, i'm afraid
-
rewby
Ryz, JAA: Is the list I provided not useful? (It's forward dns and not reverse.)
-
h2ibot
Yts98 edited Games/Engines, Platforms and Hostings (+429, Added ăăăȘăš):
wiki.archiveteam.org/?diff=50184&oldid=50168
-
JAA
rewby: No, your list is useful, I was just confused because I didn't notice the rDNS/fDNS switch on the page. I'll combine the rDNS list from fireonlive with yours, filter out what Ryz did, and run that through AB.
-
h2ibot
Yts98 edited Discourse (+14, Format adjustment):
wiki.archiveteam.org/?diff=50185&oldid=50161
-
rewby|backup
JAA: Ah yeah, that switch trips people up. The fdns data is usually only (fully) available via login and even then not designed for scraping. And logins can only really be gotten by network administrators. I have a full login, but I just asked the ben to do a backend lookup for me.
-
rewby|backup
It's at least partially based on CT logs
-
pokechu22
imer: re
transfer.archivete.am/inline/qzcm2/mexatblog - looks like that's suitable as just an !ao < list job, but the s= parameter is a session ID, so it'd probably be better to do !a
mexat.com/vb?archiveteam instead and I'm pretty sure it'd find everything
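
For anyone who does want to work from a URL list anyway, a session parameter like `s=` can be stripped before queueing so that the same page is not listed under many different URLs. A minimal sketch with the standard library follows; the function name and the example URL shape are hypothetical, not taken from the actual list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_param(url: str, name: str = "s") -> str:
    """Return `url` with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def dedupe(urls):
    """Strip the session parameter from each URL and drop duplicates, keeping order."""
    return list(dict.fromkeys(strip_param(u) for u in urls))
```

Two URLs that differ only in their session ID collapse to a single entry after this step.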
-
pokechu22
Fortunately everything is all on the same domain, so flashfire42 doesn't need to manually run through a list of 7891 posts
-
pokechu22
I think -i forums should cover everything fairly well
-
h2ibot
Yts98 created ZOWA (+1353, Create ZOWA):
wiki.archiveteam.org/?title=ZOWA
-
h2ibot
Yts98 edited Deathwatch (-22, Update ZOWA):
wiki.archiveteam.org/?diff=50189&oldid=50152
-
lennier1
Wow, Twitter is actually suing data scrapers. It sounds like they had a script to automatically sign up for new accounts.
theverge.com/2023/7/13/23794163/elo…awsuit-data-scraping-twitter-x-corp
-
nicolas17
if they don't know their identities how do they know they profited?
-
Jake
also laughing that somehow 4 IPs can overload Twitter's servers.
-
upintheairsheep
Thanks.
-
nicolas17
the IPs in question are from linode
-
JAA
And as the comments explain, it was nothing new then either.
-
JAA
Ugh, why do they leave immediately?
-
nicolas17
block them if they keep doing that (?)
-
Jake
looks like they are joining from web, so they're probably closing the tab
-
JAA
No, they parted the channel rather than closed the connection.
-
JAA
Still connected, in fact.
-
TheTechRobo
JAA: couldn't the connection have timed out? i don't think they're connected anymore
-
JAA
TheTechRobo: Possible, but they did explicitly /part this channel (or however you do that in the web UI with clicky things).
-
nicolas17
and it's not the first time
-
JAA
^
-
JAA
They constantly appear, drop something, and leave again.
-
TheTechRobo
yeah
-
JAA
rewby: FYI, quite a few of the domains in your list do not in fact resolve to FutureQuest IPs. I don't know if that many sites were migrated in the past week, but yeah, skipping anything that isn't still in that range.
-
flashfire42
That's a good and smart idea. I love grabbing random sites, but when on a deadline it's best to put them in the do-later pile
-
JAA
Getting about 851 out of 1555 still in 69.5.0.0/16 (too lazy to restrict it properly to /19).
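
Restricting the check to the /19 instead of the /16 is cheap with Python's `ipaddress` module. A small sketch, with constant and function names chosen here for illustration:

```python
import ipaddress

# The announced prefix from the bgp.tools link, and the lazier superset.
FQ_19 = ipaddress.ip_network("69.5.0.0/19")
FQ_16 = ipaddress.ip_network("69.5.0.0/16")


def in_prefix(ip: str, prefix=FQ_19) -> bool:
    """True if `ip` falls inside `prefix`."""
    return ipaddress.ip_address(ip) in prefix


if __name__ == "__main__":
    # The /19 spans 69.5.0.0 through 69.5.31.255.
    print(FQ_19[0], FQ_19[-1])
```

An address such as 69.5.40.1 would pass the /16 filter but fall outside the /19, so the looser check can let through a few hosts that are not in the announced block.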
-
fireonlive
(same lol)
-
JAA
So, quick explanation about what I'm doing: I curl'd each domain with HTTP/HTTPS and non-www/www, and I'm ignoring anything for now that doesn't return HTTP 200. Then I'm filtering out any domain which has been done by Ryz and taking the 'best' remaining URL for each domain, preferring HTTPS over HTTP and non-www over www.
-
JAA
Where the starting point for 'each domain' is the combination of fireonlive's rDNS list and rewby's fDNS list (filtered to domains that still resolve to there now).
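
The selection logic described above can be sketched roughly as follows. This is not JAA's actual tooling: the function names are invented, and the assumption that the scheme preference outranks the www preference (so `https://www.` beats plain `http://`) is a guess, since the log does not say which wins. The status check is passed in as a callable; `curl_status` shows one way to get it with curl, without following redirects.

```python
import os
import subprocess


def candidate_urls(domain: str) -> list:
    """The four variants in preference order: HTTPS before HTTP, non-www before www."""
    bare = domain[4:] if domain.startswith("www.") else domain
    return [
        f"{scheme}://{host}/"
        for scheme in ("https", "http")
        for host in (bare, "www." + bare)
    ]


def curl_status(url: str, timeout: int = 10) -> int:
    """HTTP status code as reported by curl; 0 if the request failed."""
    result = subprocess.run(
        ["curl", "-s", "-o", os.devnull, "-m", str(timeout),
         "-w", "%{http_code}", url],
        capture_output=True,
        text=True,
    )
    try:
        return int(result.stdout.strip())
    except ValueError:
        return 0


def pick_best(domain: str, status_of=curl_status, already_done=frozenset()):
    """Return the most preferred variant that answers HTTP 200, or None.

    Domains in `already_done` (e.g. ones already archived by hand) are skipped.
    """
    if domain in already_done:
        return None
    for url in candidate_urls(domain):
        if status_of(url) == 200:
            return url
    return None
```

Requiring exactly 200 means sites that only answer with a redirect are skipped, which is one of the ways this approach is crude and will miss some things.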
-
JAA
This is crude and will miss some things, but it's enough work that it should keep AB busy for a while.
-
JAA
Results in a total of 1022 jobs to run.
-
JAA
queueh2ibot has been unleashed.