00:46:48 Switchnode edited Deathwatch (+217, /* 2024 */ add 123guestbook): https://wiki.archiveteam.org/?diff=52359&oldid=52345
01:51:33 someone with common crawl indices want to grep for *.123guestbook.com? (imer?)
01:51:59 sure, can start that. will take a day or two
01:52:21 thanks!
05:20:46 here are some i got from the latest cc web graph: https://transfer.archivete.am/NtANm/123guestbook_subdomains_cc-main-2024-feb-apr-may.txt (this is the case where `!a <` should be fine, fwiw)
05:20:48 inline (for browser viewing): https://transfer.archivete.am/inline/NtANm/123guestbook_subdomains_cc-main-2024-feb-apr-may.txt
05:21:36 i tried some brute-forcing, but i hit it too hard (fell over at first, some 429s when i scaled back but not enough)
05:28:16 (interesting: deleted and not found are both 404, but they render differently https://treasure.123guestbook.com/ / https://travelers.123guestbook.com/)
05:54:59 Are there any easy ways for someone to find their Google+ page from the grab or just if they happen to have the profile URL?
09:46:51 https://freshcut.gg/ is shutting down tomorrow
10:24:08 Exorcism edited LEGO Insiders Community (+4): https://wiki.archiveteam.org/?diff=52360&oldid=52344
10:46:56 IDK, it's running on archivebot now but i'm not sure it's crawling more than just the user profiles / no media seems to be gotten from those links - those are all via JS to a graphql endpoint :/
10:47:20 oh there are some mp4s now hm :D
11:11:34 iirc the things are under storage.googleapis.com and a cdn domain
11:15:05 And it's probably not even mp4s, it's ts
11:21:48 tiki flashback lol
12:00:30 Common crawl indices?
12:02:03 Think they meant indexes
12:04:53 More indexes of the links from common crawl? Or lists of the URLs that they've crawled? (And if the latter case what's the advantage over IA CDX?)
12:28:21 "More indexes of the links from..." <- if you got them locally you can crunch the data much faster
12:28:51 (did that at imgone times to hunt imgur links out of laion5B and friends)
15:27:19 Has anyone tried to AB https://osintukraine.com/ yet?
16:39:23 https://9to5google.com/2024/06/12/youtube-ad-injection/, does this affect #down-the-tube?
20:52:15 So Triage (a malware analysis platform) has started going more corporate and is slowly getting rid of their free users and deleting the analyses. they have a sitemap with every single url on triage, is there a way we can get this archived with archivebot or something? https://tria.ge/sitemap.xml
21:59:37 ^looks like katia has thrown it in AB
23:46:17 masterx244|m: So it's the URLs that CC captured?
23:46:57 If so how is that specifically better than IA CDX?
23:54:21 IA cdx is a superset actually since it includes common crawl data too (recent comparison with JAA for #webroasting I found one URL the IA index didn't have.. which was a 404 incidentally)
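
(editor's note) The 01:51 request to grep for *.123guestbook.com and the 12:28 remark about crunching local data could look roughly like the sketch below. It is only an illustration, not what anyone in the channel actually ran. It assumes the local data is plain text lines containing URLs or hostnames; the function name `find_subdomains` and the sample lines are hypothetical, and a real Common Crawl index dump may need its own parsing first.

```python
import re
from typing import Iterable, Set


def find_subdomains(lines: Iterable[str], domain: str = "123guestbook.com") -> Set[str]:
    """Collect hostnames under `domain` that appear anywhere in the input lines.

    Assumption: input is plain text (URLs or bare hostnames), one or more per line.
    """
    # One or more DNS labels followed by the target domain. The lookarounds
    # reject matches embedded in a longer hostname, e.g.
    # "www.123guestbook.com.evil.example" or "not123guestbook.com".
    pattern = re.compile(
        r"(?<![A-Za-z0-9.-])"
        r"((?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+"
        + re.escape(domain)
        + r")(?![A-Za-z0-9.-])",
        re.IGNORECASE,
    )
    found: Set[str] = set()
    for line in lines:
        for match in pattern.finditer(line):
            # Hostnames are case-insensitive, so normalise before deduplicating.
            found.add(match.group(1).lower())
    return found


if __name__ == "__main__":
    sample = [
        "https://treasure.123guestbook.com/",
        "travelers.123guestbook.com",
        "TREASURE.123guestbook.com/page",
        "http://www.123guestbook.com.evil.example/",
    ]
    for host in sorted(find_subdomains(sample)):
        print(host)
```

Streaming the file line by line (for example `find_subdomains(open(path, encoding="utf-8", errors="replace"))`) keeps memory use flat on large dumps. Note that a hostname immediately followed by a period, as at the end of a sentence, is skipped by this pattern.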