-
h2ibotSwitchnode edited Deathwatch (+217, /* 2024 */ add 123guestbook): wiki.archiveteam.org/?diff=52359&oldid=52345
-
thubansomeone with common crawl indices want to grep for *.123guestbook.com? (imer?)
-
imersure, can start that. will take a day or two
-
thubanthanks!
-
thubanhere are some i got from the latest cc web graph: transfer.archivete.am/NtANm/123gues…omains_cc-main-2024-feb-apr-may.txt (this is the case where `!a <` should be fine, fwiw)
-
eggdropinline (for browser viewing): transfer.archivete.am/inline/NtANm/…omains_cc-main-2024-feb-apr-may.txt
-
thubani tried some brute-forcing, but i hit it too hard (fell over at first, some 429s when i scaled back but not enough)
-
thuban(interesting: deleted and not found are both 404, but they render differently treasure.123guestbook.com / travelers.123guestbook.com)
-
hook54321Are there any easy ways for someone to find their Google+ page from the grab or just if they happen to have the profile URL?
-
IDKfreshcut.gg is shutting down tomorrow
-
h2ibotExorcism edited LEGO Insiders Community (+4): wiki.archiveteam.org/?diff=52360&oldid=52344
-
katiaIDK, it's running on archivebot now but i'm not sure it's crawling more than just the user profiles / no media seems to be gotten from those links - those are all via JS to a graphql endpoint :/
-
katiaoh there are some mp4s now hm :D
-
IDKiirc the things are under storage.googleapis.com and a cdn domain
-
IDKAnd it's probably not even mp4s, it's ts
-
IDKtiki flashback lol
-
OrIdow6Common crawl indices?
-
datechnomanThink they meant indexes
-
OrIdow6More indexes of the links from common crawl? Or lists of the URLs that they've crawled? (And if the latter case what's the advantage over IA CDX?)
-
masterx244|m<OrIdow6> "More indexes of the links from..." <- if you got them locally you can crunch the data much faster
-
masterx244|m(did that at imgone times to hunt imgur links out of laion5B and friends)
-
that_lurkerHas anyone tried to AB osintukraine.com yet?
-
IDK9to5google.com/2024/06/12/youtube-ad-injection, does this affect #down-the-tube?
-
JaffaCakes118So Triage (a malware analysis platform) has started going more corporate and is slowly getting rid of their free users and deleting the analyses. They have a sitemap with every single URL on Triage; is there a way we can get this archived with archivebot or something? tria.ge/sitemap.xml
-
nulldata^looks like katia has thrown it in AB
-
OrIdow6masterx244|m: So it's the URLs that CC captured?
-
OrIdow6If so how is that specifically better than IA CDX?
-
imerIA cdx is a superset actually since it includes common crawl data too (in a recent comparison with JAA for #webroasting I found one URL the IA index didn't have.. which was a 404 incidentally)
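-
(editor's sketch) The grep for *.123guestbook.com requested earlier in the log, run against a URL list that is already held locally, could look roughly like the following. This is a minimal sketch, not what imer or thuban actually ran: the function name `subdomains_of`, the `sample` list, and the assumption that the input is one URL or bare hostname per entry are all hypothetical.

```python
from urllib.parse import urlsplit


def subdomains_of(urls, parent="123guestbook.com"):
    """Return the sorted, de-duplicated hostnames that sit under `parent`.

    The bare parent domain itself is excluded, mirroring a `*.parent` glob.
    """
    suffix = "." + parent
    hosts = set()
    for url in urls:
        # urlsplit only recognises the host when a scheme or "//" is present,
        # so bare "host/path" entries get a "//" prefix first.
        if "://" not in url:
            url = "//" + url
        host = urlsplit(url).hostname  # lowercased by urllib, None if absent
        if host and host.endswith(suffix):
            hosts.add(host)
    return sorted(hosts)


# Hypothetical sample input: two real-looking subdomains plus two near misses.
sample = [
    "http://treasure.123guestbook.com/",
    "travelers.123guestbook.com/index.html",
    "https://www.example.com/?ref=123guestbook.com",
    "http://not123guestbook.com/",
]
print(subdomains_of(sample))
```

Matching on the parsed hostname with a leading dot, rather than a plain substring search, keeps out URLs that only mention the domain in a query string and lookalike domains such as not123guestbook.com.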