#archiveteam-bs

00:46

h2ibot

Switchnode edited Deathwatch (+217, /* 2024 */ add 123guestbook): wiki.archiveteam.org/?diff=52359&oldid=52345
01:51

thuban

someone with common crawl indices want to grep for *.123guestbook.com? (imer?)
01:51

imer

sure, can start that. will take a day or two
01:52

thuban

thanks!
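
For readers following along, the "grep for *.123guestbook.com" request above can be sketched roughly as below. This is only an illustration, not what imer actually ran: it assumes a plain text input with one hostname per line, and the function name `match_subdomains` is made up here. Real Common Crawl index or web graph files have their own formats and would need parsing first.

```python
from typing import Iterable, Iterator


def match_subdomains(lines: Iterable[str], suffix: str = "123guestbook.com") -> Iterator[str]:
    """Yield each unique hostname that is `suffix` itself or a subdomain of it.

    Assumption: every input line holds exactly one hostname.
    """
    suffix = suffix.lower()
    dotted = "." + suffix
    seen = set()
    for line in lines:
        # Normalise case and drop whitespace and any trailing root dot.
        host = line.strip().lower().rstrip(".")
        if not host:
            continue
        # Require a dot boundary so "not123guestbook.com" is rejected.
        if host == suffix or host.endswith(dotted):
            if host not in seen:
                seen.add(host)
                yield host


if __name__ == "__main__":
    sample = [
        "treasure.123guestbook.com",
        "Travelers.123guestbook.com",
        "not123guestbook.com",
        "treasure.123guestbook.com",
    ]
    for host in match_subdomains(sample):
        print(host)
```

Matching on the dot boundary rather than a bare substring avoids picking up unrelated domains that merely end in the same characters.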
05:20

thuban

here are some i got from the latest cc web graph: transfer.archivete.am/NtANm/123gues…omains_cc-main-2024-feb-apr-may.txt (this is the case where `!a <` should be fine, fwiw)
05:20

eggdrop

inline (for browser viewing): transfer.archivete.am/inline/NtANm/…omains_cc-main-2024-feb-apr-may.txt
05:21

thuban

i tried some brute-forcing, but i hit it too hard (fell over at first, some 429s when i scaled back but not enough)
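
A common way to avoid hammering a site that answers with 429s is capped exponential backoff between retries. The sketch below only computes the wait schedule; it is not what thuban used, and the name `backoff_delays` and the default values are assumptions chosen for illustration.

```python
import random
from typing import List


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0,
                   jitter: bool = False) -> List[float]:
    """Return the wait time in seconds before each retry attempt.

    The delay doubles on every attempt and never exceeds `cap`.
    With jitter=True, each delay is drawn uniformly from [0, delay]
    so that parallel workers do not retry in lockstep.
    """
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        if jitter:
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    # Print the schedule for eight retries without jitter.
    print(backoff_delays(8))
```

A caller would sleep for the next value in the list each time the server responds with 429, and reset to the start of the schedule after a successful request.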
05:28

thuban

(interesting: deleted and not found are both 404, but they render differently treasure.123guestbook.com / travelers.123guestbook.com)
05:54

hook54321

Are there any easy ways for someone to find their Google+ page from the grab or just if they happen to have the profile URL?
09:46

IDK

freshcut.gg is shutting down tomorrow
10:24

h2ibot

Exorcism edited LEGO Insiders Community (+4): wiki.archiveteam.org/?diff=52360&oldid=52344
10:46

katia

IDK, it's running on archivebot now but i'm not sure it's crawling more than just the user profiles / no media seems to be gotten from those links - those are all via JS to a graphql endpoint :/
10:47

katia

oh there are some mp4s now hm :D
11:11

IDK

iirc the things are under storage.googleapis.com and a cdn domain
11:15

IDK

And it's probably not even mp4s, it's ts
11:21

IDK

tiki flashback lol
12:00

OrIdow6

Common crawl indices?
12:02

datechnoman

Think they meant indexes
12:04

OrIdow6

More indexes of the links from common crawl? Or lists of the URLs that they've crawled? (And if the latter case what's the advantage over IA CDX?)
12:28

masterx244|m

<OrIdow6> "More indexes of the links from..." <- if you got them locally you can crunch the data much faster
12:28

masterx244|m

(did that at imgone times to hunt imgur links out of laion5B and friends)
15:27

that_lurker

Has anyone tried to AB osintukraine.com yet?
16:39

IDK

9to5google.com/2024/06/12/youtube-ad-injection, does this affect #down-the-tube?
20:52

JaffaCakes118

So Triage (a malware analysis platform) has started going more corporate and is slowly getting rid of their free users and deleting the analyses. They have a sitemap with every single URL on Triage, is there a way we can get this archived with archivebot or something? tria.ge/sitemap.xml
21:59

nulldata

^looks like katia has thrown it in AB
23:46

OrIdow6

masterx244|m: So it's the URLs that CC captured?
23:46

OrIdow6

If so how is that specifically better than IA CDX?
23:54

imer

IA CDX is a superset actually, since it includes common crawl data too (in a recent comparison with JAA for #webroasting I found one URL the IA index didn't have... which was a 404, incidentally)
