01:07:13 JAA: you looking at forward dns right?
01:12:17 fireonlive: No, because Ryz isn't either.
01:12:36 (Based on the AB jobs, at least.)
01:12:50 But good point. :-)
01:12:52 oo;
01:14:39 :3
01:15:11 Forward DNS? Can you two clarify?
01:15:18 the lack of linking directly to that page doesn't help
01:15:48 hm, reverse DNS is like what you see when people connect to IRC
01:15:52 like the hostname of an IP
01:15:59 forward DNS is like www.google.com to an IP
01:16:34 some services like bgp.tools report all the things they come across among their travels of what hostnames resolve to what IPs and provide 'forward dns' lookups
01:16:41 or 'what's hosted on this IP'
01:17:01 Yeah, unfortunately I was a bit miffed on that and had to make those links access for me; then again, I was given what flashfire42 provided
01:17:13 Sorry
01:17:43 google.com is 142.250.179.206 for my server but 142.250.179.206's 'hostname' (or reverse DNS) is ams15s42-in-f14.1e100.net
01:17:57 so if you just looked up 142.250.179.206 you'd only see the latter but not nesc. the former
01:18:14 (unless you used a special service)
01:21:04 The rDNS section is also easy to check: you just do a bunch of DNS queries for each IP.
01:21:25 Forward DNS requires knowledge of the domains that resolve to an IP in the block, which is exactly what we're looking for.
01:21:49 The rDNS section data is old, might be worth rerunning that.
01:21:54 indeed :)
01:23:05 I see a bunch of domains there that no longer resolve to FutureQuest IPs.
01:23:11 there's various sources to do fDNS but it's much harder to populate
01:23:19 ye, also rDNS has no verification
01:23:34 i could set my rDNS to free-bitcoin.google.com and that would be 100% ok
01:23:38 I suspect it's just outdated data in this case, but yeah.
01:23:52 I used to set my rDNS to a .invalid domain on a provider that let me. :-)
01:23:52 that too :)
01:23:55 ah :D
01:24:10 or maybe they migrated but didn't bother updating rDNS on the way out the door
01:24:19 and the going out of business host didn't bother either
01:24:29 Yeah, that's what I'm thinking.
01:24:45 Well, or it did get updated, but bgp.tools never refetched it.
01:24:50 mm
01:24:51 > Data Age between: 2022-10-15T11:32:55Z UTC and 2020-06-16T10:52:28Z UTC
01:24:56 ah there you go
01:25:17 So who wants to do a few thousand rDNS queries? :-)
01:31:33 ok uwu
01:31:55 8192 * 2 potential dns queries since i'll also check forward dns lol
01:32:04 Thanks
01:32:17 :)
01:32:34 i didn't add a sleep or anything but it's like DNS so should be ok lol
01:34:29 well a rough check
01:34:39 fdns = 69.5.* lol
02:00:45 979 matches so far
02:01:01 (on x.x.22.227 atm)
02:13:36 leela.futurequest.net genesis.futurequest.net. neon.futurequest.net. evangelion.futurequest.net. eva.futurequest.net.
02:13:39 nerds :P
02:14:53 here's what I got cc JAA: https://transfer.archivete.am/inline/Acexz/rdns-matches.txt
02:22:05 Thanks!
02:24:28 1103 domains in there after excluding *.futurequest.net, which seem to be noise.
02:26:39 :)
02:26:44 JAA: there seems to be a lot from securecnc.net as well
02:26:55 mqs0042.securecnc.net mqs0043.securecnc.net etc
02:27:22 Hmm yeah
02:27:29 seem to share some names with futurequest as well
02:27:39 leela exists in both, genesis does too
02:27:50 1009 after kicking those out.
02:28:07 =]
02:34:25 Yeah, I'm actually maybe getting a bit sick of running constant jobs and checking the content myself, oof; I'm pondering on regarding queueh2ibot, the problem which ones are the real links since I had to manually check if they were actually dead or if it's HTTP only :/
02:35:55 I think that's part of the reason I wasn't too sure on using that bot, JAA
02:36:32 HTTP vs HTTPS check can easily be automated.
02:36:55 WWW and non-WWW too?
02:36:56 curl -m or something i suppose
02:37:18 And I'm assuming a combination of the 4?
02:44:40 JAA, there is unfortunately only other reason I went for the non-automated route at the time, is spotting the jobs that failed because it doesn't work on pipeline but it could on another pipeline, since I don't think the bot can detect something like that all
02:44:58 The occasional 'Connection closed' and 'Connection refused' entries on some of the jobs S:
02:45:10 Correct, someone would need to watch the dashboard and requeue those manually.
02:46:54 It's obviously still less work if the queueing happens automatically.
02:47:08 Also, ain't nobody got time for manually queueing a thousand websites.
02:47:46 Mhm, even me eventually, since I would wanna spend more leisure time, or at least make jobs finish faster myself since that part isn't easily automated~
02:47:59 I'll give you the list of what I have left
02:48:47 at least you need to batch them... send a hundred and *then* check how they go, instead of pasting them into IRC one by one and switching windows to the pipeline status for each and everyone etc
02:48:57 For each automated job, should have concurrency 2, for leeway reasons, and ignoreset badvideos because annoyingly some of the jobs I thought would be safe from New York Times videos...pushed it's ugly head up :/
02:52:22 Here is the remaining stuff (except for the really big jobs that I put them in a separate list, which is small): https://transfer.archivete.am/zr086/remaining-list - again, this is based on https://bgp.tools/prefix/69.5.0.0/19#dns that flashfire42, I did some cleanup, mainly the '.' stuff at the end of the URL, although there's still some of them;
02:52:43 I removed the futurequest.net domains because I did a sampling of a few and...they don't respond or exist S:
02:55:03 if you have vim you can type :%s/\.$// to remove the dots at the end
02:55:54 Heh, I went around maybe 400-500 links before I burnt out, lol
02:56:13 ah 😅
03:05:05 Ugh, their TLS is pretty broken.
03:05:23 All kinds of weak key and signature errors.
03:06:09 Or well, I guess it's the customers and their ancient setups, but same difference.
03:06:32 :(
03:07:20 Insecure OPENSSL_CONF time...
03:32:07 if they don't use TLSv1.3 it's not worth archiving
03:32:11 :p
03:47:00 * JAA offers fireonlive an SSLv2 server with an MD5 signature.
03:47:03 Best I can do.
03:47:29 *dies*
03:47:47 Scan finished, need to process it into something that can be queued, but too tired for that now.
03:48:13 would you like some coffee
03:48:29 re: oceangate; looks like every url except images/logo-offwhite-600.png returns the same HTML
03:48:45 wow it's literally named oceangate i just realized
03:48:51 anything ending in gate is just doomed
03:49:03 anyways was just checking if a sitemap or something existed still :D
03:49:48 oh, some other files too. (manifest, etc) i guess they just rewrote all 404s
07:24:50 hmm, does AB not upload meta WARCs any more? https://archive.fart.website/archivebot/viewer/job/202306040627542dvm3
07:27:08 or is the viewer not showing them up
07:28:15 ah, indeed the meta warc is on https://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20230606040557_ca293687
07:39:28 PaulWise edited Bugzilla (+22, kde bugzilla): https://wiki.archiveteam.org/?diff=50178&oldid=50163
10:18:58 PaulWise edited Mailman2 (+57, afrinic lists): https://wiki.archiveteam.org/?diff=50179&oldid=50159
10:23:58 PaulWise edited Mailman2 (+11, twisted legacy archives): https://wiki.archiveteam.org/?diff=50180&oldid=50179
10:38:01 OrIdow6 edited Wysp (+571, Initial remarks on grab): https://wiki.archiveteam.org/?diff=50181&oldid=50167
10:58:05 OrIdow6 edited Wysp (+473, On auth): https://wiki.archiveteam.org/?diff=50182&oldid=50181
10:59:05 OrIdow6 edited Wysp (+2): https://wiki.archiveteam.org/?diff=50183&oldid=50182
11:23:29 just going to repost is here so it doesnt get lost in the depths of #archivebot
11:23:29 10:25 does archivebot have space for 7891 mini-blog entries? they're hosted on a site that have no/very little activity for the past 3 years and may shutdown anytime
11:23:29 10:26 mexat2: maybe! do you have a list of urls/sites ready? (kindly upload to https://transfer.archivete.am if you do)
11:23:29 10:27 someone with permission (= not me) will look at it, might take a bit for someone to get to it though
11:23:29 10:27 https://transfer.archivete.am/qzcm2/mexatblog
11:23:30 10:29 the whole forum needs to be archived as it's one of the few remaining giants in Arabic web. the forum already does have sitemap up and ready for crawling.
11:23:30 10:30 I try to archive whatever I can, but it takes forever using the Wayback Machine browser extension.
11:24:37 I will start on it on sunday
11:25:10 ah, great
11:29:44 damn banciyuan images progress is rolling!
12:15:52 pabs: not much, i'm afraid
12:42:04 Ryz, JAA: Is the list I provided not useful? (It's forward dns and not reverse.)
14:48:52 Yts98 edited Games/Engines, Platforms and Hostings (+429, Added ノベリス): https://wiki.archiveteam.org/?diff=50184&oldid=50168
14:57:50 rewby: No, your list is useful, I was just confused because I didn't notice the rDNS/fDNS switch on the page. I'll combine the rDNS list from fireonlive with yours, filter out what Ryz did, and run that through AB.
15:05:55 Yts98 edited Discourse (+14, Format adjustment): https://wiki.archiveteam.org/?diff=50185&oldid=50161
15:13:56 JAA: Ah yeah, that switch trips people up. The fdns data is usually only (fully) available via login and even then not designed for scraping. And logins can only really be gotten by network administrators. I have a full login, but I just asked the ben to do a backend lookup for me.
15:14:42 It's at least partially based on CT logs
15:16:16 imer: re https://transfer.archivete.am/inline/qzcm2/mexatblog - looks like that's suitable as just an !ao < list job, but the s= parameter is a session ID, so it'd probably be better to do !a http://www.mexat.com/vb?archiveteam instead and I'm pretty sure it'd find everything
15:16:42 Fortunately everything is all on the same domain, so flashfire42 doesn't need to manually run through a list of 7891 posts
15:17:01 I think -i forums should cover everything fairly well
16:06:07 Yts98 uploaded File:ZOWA-icon.png: https://wiki.archiveteam.org/?title=File%3AZOWA-icon.png
16:07:07 Yts98 uploaded File:ZOWA-logo.png: https://wiki.archiveteam.org/?title=File%3AZOWA-logo.png
16:08:07 Yts98 created ZOWA (+1353, Create ZOWA): https://wiki.archiveteam.org/?title=ZOWA
16:11:08 Yts98 edited Deathwatch (-22, Update ZOWA): https://wiki.archiveteam.org/?diff=50189&oldid=50152
21:04:02 Wow, Twitter is actually suing data scrapers. It sounds like they had a script to automatically sign up for new accounts. https://www.theverge.com/2023/7/13/23794163/elon-musk-lawsuit-data-scraping-twitter-x-corp
21:05:12 if they don't know their identities how do they know they profited?
21:05:52 Have you heard: https://cdn.discordapp.com/attachments/1002873478980046858/1003203199513149510/Screenshot_20220731-103049_Reddit.jpg
21:06:08 also laughing that somehow 4 IPs can overload Twitter's servers.
21:07:46 upintheairsheep: That Reddit thread is from almost three years ago. https://old.reddit.com/r/DataHoarder/comments/js7rou/meganz_will_delete_your_files_now/
21:08:00 Thanks.
21:08:17 the IPs in question are from linode
21:08:22 And as the comments explain, it was nothing new then either.
21:08:30 Ugh, why do they leave immediately?
21:09:02 block them if they keep doing that (?)
21:09:47 looks like they are joining from web, so they're probably closing the tab
21:10:04 No, they parted the channel rather than closed the connection.
21:10:24 Still connected, in fact.
21:16:24 JAA: couldn't the connection have timed out? i don't think they're connected anymore
21:17:50 TheTechRobo: Possible, but they did explicitly /part this channel (or however you do that in the web UI with clicky things).
21:18:05 and it's not the first time
21:18:14 ^
21:18:26 They constantly appear, drop something, and leave again.
21:18:36 yeah
23:07:05 rewby: FYI, quite a few of the domains in your list do not in fact resolve to FutureQuest IPs. I don't know if that many sites were migrated in the past week, but yeah, skipping anything that isn't still in that range.
23:08:41 Thats a good and smart idea I love grabbing random sites but when on a deadline best to put them in the do later pile
23:10:14 Getting about 851 out of 1555 still in 69.5.0.0/16 (too lazy to restrict it properly to /19).
23:11:51 (same lol)
23:30:12 So, quick explanation about what I'm doing: I curl'd each domain with HTTP/HTTPS and non-www/www, and I'm ignoring anything for now that doesn't return HTTP 200. Then I'm filtering out any domain which has been done by Ryz and taking the 'best' remaining URL for each domain, preferring HTTPS over HTTP and non-www over www.
23:31:24 Where the starting point for 'each domain' is the combination of fireonlive's rDNS list and rewby's fDNS list (filtered to domains that still resolve to there now).
23:31:38 This is crude and will miss some things, but it's enough work that it should keep AB busy for a while.
23:33:42 Results in a total of 1022 jobs to run.
23:45:54 queueh2ibot has been unleashed.
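
[Editor's note] The selection procedure described at 23:30:12 can be sketched roughly as follows. This is not the script that was actually run; it is a minimal illustration that assumes the DNS lookups and curl checks have already been collected into two plain mappings. All function names (`in_range`, `candidate_urls`, `best_url`, `select_jobs`) are hypothetical, and applying the HTTPS preference before the non-www preference is an assumption, since the log does not say which takes priority. The sketch filters on the /19 from the bgp.tools link, whereas the log notes the actual filter was the wider /16.

```python
import ipaddress

# Prefix from the bgp.tools link in the log; the real run used 69.5.0.0/16.
FUTUREQUEST_NET = ipaddress.ip_network("69.5.0.0/19")


def in_range(ip, network=FUTUREQUEST_NET):
    """Return True if the IP address string falls inside the network."""
    return ipaddress.ip_address(ip) in network


def candidate_urls(domain):
    """List the four scheme/host variants for a domain, most preferred first."""
    # Drop the trailing root dot and any leading "www." to get the bare name.
    bare = domain.rstrip(".").lower()
    if bare.startswith("www."):
        bare = bare[4:]
    # Assumed ordering: HTTPS before HTTP, then non-www before www.
    return [
        f"https://{bare}/",
        f"https://www.{bare}/",
        f"http://{bare}/",
        f"http://www.{bare}/",
    ]


def best_url(domain, statuses):
    """Pick the most preferred variant that returned HTTP 200, or None."""
    for url in candidate_urls(domain):
        if statuses.get(url) == 200:
            return url
    return None


def select_jobs(resolved, statuses, already_done=frozenset()):
    """Build the job list from {domain: ip} and {url: status} mappings."""
    jobs = []
    for domain, ip in sorted(resolved.items()):
        # Skip domains already archived or no longer hosted in the range.
        if domain in already_done or not in_range(ip):
            continue
        url = best_url(domain, statuses)
        if url is not None:
            jobs.append(url)
    return jobs
```

Here `resolved` would hold the current A record for each domain from the combined rDNS and fDNS lists, `statuses` the HTTP status observed for each of the four variants, and `already_done` the domains Ryz had already queued. Like the original, this skips anything that does not answer with HTTP 200, so sites that only redirect are missed.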