00:11:02 haha
00:15:44 So no Tom & Jerry gay porn was captured?
00:16:32 uwu
00:18:13 now i'm curious...
00:18:15 brb
00:19:33 ..huh
00:19:34 ok
00:22:01 i ran it overnight from 2 machines outside CCC and woke up to 16TB of data most of it porn
00:22:40 my plan right now is to take the Stuff That Is Interesting To Me and delete the rest
02:52:35 https://forum.tailscale.com/t/tailscale-forum-announcement/5768 < "I’m here to share some important news regarding the Tailscale forum. After almost three years, we have made the decision to sunset this platform. Starting on July 15, 2023 the forum will go into read-only mode: Posts will continue to be available to read, but users will not be
02:52:35 able to start new threads or reply to existing ones. While our forum has served as a valuable platform for discussions and knowledge sharing, we believe that concentrating our resources on our managed support channels will enable us to better meet the needs of our community."
02:52:44 dunno if we covered this
02:57:25 fart says no
02:58:20 fireonlive: will you AB?
02:58:36 ah thanks pabs, sure
02:58:49 https://archive.fart.website/archivebot/viewer/?q=forum.tailscale.com
02:59:17 hmm 11k posts
02:59:23 might ignore individual post urls
02:59:39 (but keep topic ones)
03:00:00 Nah if its good to go with no rate limits we can probably keep individual posts
03:00:03 actually that's not too much
03:00:09 assuming you mean 11k posts not 11k threads
03:00:09 asterisk was 300k
03:00:17 https://forum.tailscale.com/about
03:00:23 yeah 1.9k topics, 11.0k posts
03:02:19 11.0k is pretty small
03:03:21 started akane3b8pelvuc53swdsdwjqr
03:03:40 no-offsite due to con/delay discourse needed but we can run those later
03:03:46 needed->needs
03:06:22 keeping indiv. urls too :)
04:36:27 assuming argenteam actually shuts down Jan 1st, I was going to do another crawl
04:36:52 turns out they removed the subtitles already D:
04:42:25 PaulWise edited Jira (+27, add one more): https://wiki.archiveteam.org/?diff=51441&oldid=51353
05:23:36 looks like some argenteam subtitle zips are actually misnamed rars
05:26:21 ofc
05:26:30 what's a file extension for anyways?
06:22:42 I considered feeding argenteam API responses into archivebot too but it's too late now
06:22:49 the API responses no longer have the subtitle URLs either
06:31:07 :/
06:35:47 webpages work tho
06:35:58 https://web.archive.org/web/20231207132633/https://argenteam.net/movie/149055/Thereaposs.Something.About.Mary.%281998%29 the subtitle download link here works
06:36:05 because I archivebot'd all webpages and subtitles
06:37:11 and I do have API responses saved, but not as pristine WARCs, rather as concatenated ndjson
07:23:28 https://sobre.arquivo.pt/en/savepagenow-to-record-webpages-immediately-on-arquivo-pt/
07:23:49 "Arquivo.pt launched a new version, called Francisco, on the 19th of January 2022. The SavePageNow function stands out, allowing anyone to save a Web page to be preserved by Arquivo.pt. It is only necessary to enter a page’s address and browse through its contents. Arquivo.pt SavePageNow was inspired on the Internet Archive Save Page Now and
07:23:49 implemented using webrecorder pywb."
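Regarding the misnamed argenteam archives (05:23:36): since the extension can't be trusted, the reliable check is the file's magic bytes, ZIPs start with PK\x03\x04 and RARs with Rar!\x1a\x07. A minimal sketch in Python; the directory argument and the *.zip glob are assumptions, not anything from the log:

    #!/usr/bin/env python3
    # Flag files whose .zip extension does not match their actual magic bytes.
    # Hypothetical helper; the directory layout passed on the CLI is assumed.
    import sys
    from pathlib import Path

    SIGNATURES = {
        b"PK\x03\x04": "zip",             # regular ZIP local file header
        b"Rar!\x1a\x07\x00": "rar",       # RAR 4.x
        b"Rar!\x1a\x07\x01\x00": "rar5",  # RAR 5.x
    }

    def sniff(path: Path) -> str:
        with path.open("rb") as fh:
            head = fh.read(8)
        for magic, kind in SIGNATURES.items():
            if head.startswith(magic):
                return kind
        return "unknown"

    for path in Path(sys.argv[1]).rglob("*.zip"):
        kind = sniff(path)
        if kind != "zip":
            print(f"{path}: extension says zip, contents look like {kind}")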
07:23:55 (oof)
07:23:59 (but also TIL)
07:24:03 cc JAA/arkiver
07:25:39 they use /wayback/ in their urls too xP
07:25:56 https://arquivo.pt/wayback/19961013183946/http:/www.yellow.com/
07:27:08 (ofc in 2022 yellow is some crypto bullshit now)
11:32:08 What'd be my "first" step if i've spotted a potentially important website that's possibly going to go down within the next month?
11:34:32 I'd be happy to develop the tools myself, I've checked and it doesn't have a sitemap.xml, but it runs mediawiki, anybody got pointers for enumerating pages?
12:27:36 GhostOverflow256: the first step is to drop the link; we already have specialized tools (#wikiteam/#wikibot) for archiving wikis
12:29:01 @thuban I've just downloaded the xml backup of all pages, but the page is `https://ddosecrets.com/wiki/Distributed_Denial_of_Secrets`. I saw that the last uploaded backup onto the internet archive was 3 years ago, so should i upload it?
12:30:41 you can if you want, but wikibot handles that as well
12:33:53 According to the github of wikibot, it's "Note: this bot is NOT currently set up to be installed by other people (hardcoded Discord channels, google cloud bucket names, etc), though it is on my todo list. Dragona be here (not just me)!"
12:34:03 So that kinda discouraged me from trying to use it lol
12:35:38 you don't need to run your own! just submit a job to ours
12:36:45 (might require permissions; idk as i don't normally handle wikis. if it does someone will do it for you shortly)
12:39:27 one way to find out i guess
13:34:28 GhostOverflow256: was already dumped in 2021 https://archive.org/search?query=originalurl%3A%28%2Addosecrets.com%2A%29
13:34:34 does it need a new dump?
13:35:54 and I guess we should save the HTML to web.archive.org too
13:36:27 ah, HTML was saved in 2021 too https://archive.fart.website/archivebot/viewer/?q=ddosecrets.com
13:48:23 guess we will save it again since there were some more changes
13:48:27 and subdomains too
13:50:02 hmm, https://data.ddosecrets.com/ is quite huge...
13:54:26 I archived data.ddosecrets.com before. Agree it'd be nice to rearchive, but it's huge indeed.
13:57:41 JAA: seems there are lots of 2023 updates on it
13:57:50 * pabs threw it in with -c 1
13:59:42 JAA: oh, it has 32.1 TB of Parler videos, that is maybe a bit much for AB?
14:04:40 Yes, it's far too much for AB.
14:04:52 * pabs aborted for now
14:04:57 Also was at the time I archived it.
14:05:20 I also wonder whether the Parler data is extracted from our project.
14:05:30 Duplicating that would be a bit silly.
14:09:54 "it now has over 190 datasets and 60 terabytes of data!"
14:36:00 if it was already archived, I'd assume only new data gets stored and everything else is deduplicated?
14:38:26 there's still the traffic archivebot needs to download and reupload ofc
14:40:32 would AB need 32TB storage for it to work or does it upload stuff to IA as it goes?
14:41:37 the latter IIRC
14:46:47 kpcyrd: We don't currently have software that can dedupe against previous WARCs. (wget's CDX reading is broken IIRC.) It's an important factor in the design of the new WARC library I've been working on (slowly).
14:47:23 katia: It uploads as it goes, but it'd take a very long time and couldn't dedupe.
14:48:07 I don't remember how I grabbed it at the time, but I believe there were parallel processes.
15:13:35 could archivebot save hashes of current running stuff somewhere to dedupe without having the file locally?
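On enumerating pages of a MediaWiki with no sitemap.xml (11:34:32): the MediaWiki API's list=allpages module pages through every title in a namespace. A minimal sketch, assuming the wiki exposes its API at /api.php (the endpoint path varies per install, often /w/api.php, and the namespace choice is an assumption too):

    # Enumerate all page titles from a MediaWiki via its API (list=allpages).
    import requests

    API = "https://ddosecrets.com/api.php"  # assumed endpoint path

    def all_pages(namespace=0):
        """Yield every page title in the given namespace, following API continuation."""
        params = {
            "action": "query",
            "list": "allpages",
            "apnamespace": namespace,
            "aplimit": "max",
            "format": "json",
        }
        session = requests.Session()
        while True:
            data = session.get(API, params=params, timeout=60).json()
            for page in data["query"]["allpages"]:
                yield page["title"]
            cont = data.get("continue")
            if not cont:
                break
            params.update(cont)  # carries apcontinue into the next request

    for title in all_pages():
        print(title)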
15:27:39 Entartet created Arhivach (+1785, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=Arhivach
16:01:46 Entartet created M.Dvach (+893, Created page with "{{Infobox project| | URL =…): https://wiki.archiveteam.org/?title=M.Dvach
16:02:46 Entartet edited Arhivach (+54): https://wiki.archiveteam.org/?diff=51444&oldid=51442
16:12:48 Exorcism uploaded File:M.Dvach-screenshot.png: https://wiki.archiveteam.org/?title=File%3AM.Dvach-screenshot.png
16:13:48 Exorcism edited M.Dvach (+33): https://wiki.archiveteam.org/?diff=51446&oldid=51443
16:18:49 Exorcism uploaded File:Arhivach-screenshot.png: https://wiki.archiveteam.org/?title=File%3AArhivach-screenshot.png
16:18:50 Exorcism edited Arhivach (+34): https://wiki.archiveteam.org/?diff=51448&oldid=51444
18:51:44 katia: It could be implemented in theory. It's not practical at scale though.
18:53:14 Currently the way I dedupe is load all CDX entries in a PostgreSQL database and add a hook in wpull to query hashes against it at crawl time
18:53:33 It's not perfect, but works at smaller scales
18:56:31 Terbium: Uh, wpull doesn't have a hook for that, does it?
18:57:38 It's a custom written hook I added to my non-mainline wpull fork to work with the rest of my ingest pipeline. It's not part of wpull or wpull_ludios natively
18:57:45 Right
18:58:04 Did you also modify the name so it's obvious the WARCs were produced with the fork?
18:58:58 The URLTable does in fact support revisits, but it's not implemented I think. Plus it's all broken on non-SQLite anyway.
18:59:01 It has a bumped version number, but no name change. The WARCs I generate are unlikely to ever be redistributed
18:59:21 What version numbers are you using?
18:59:49 Yep URLTable has basic local dedupe support, but it didn't suit my needs as I need to dedupe against a large number of concurrent crawlers on a central DB
19:00:17 Makes sense. I was also tinkering with that sort of thing before.
19:00:50 I'm using version 4.0.0
19:01:21 postgres 😍
19:01:37 So we'll need to skip 4.x upstream as well, ok...
19:01:54 I'm going to bring ludios_wpull up to 5
19:02:50 Found regressions in 3.11 that broke some of the socket code, so will likely end up pushing it to 3.12, and jump from v3 to v5
19:04:39 wpull standard is 2.0.3, ludios_wpull is at 3.0.9 currently
19:05:58 Please change the name it writes to the warcinfo record on that bump.
19:13:15 huh, looks like ludios_wpull retains the "Wpull" name in warcinfo. I'll change the name to ludios_wpull
19:13:41 Yeah, it does, and it's annoyed me ever since that patch. :-)
19:14:09 Ideally, grab-site would also appear in there when it's invoked that way, but that may be trickier.
19:15:50 hmm, that would be a pain, i guess i can use importlib to extract the grab-site version
19:16:40 i guess we're going into -dev territory lol
19:17:41 Yeah :-)
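A rough illustration of the crawl-time dedupe idea described at 18:53:14, CDX payload digests loaded into PostgreSQL and queried before writing a full response record. This is not the actual hook from the wpull fork mentioned above; the table name, columns, and integration point are all assumptions:

    # Sketch of central-DB dedupe: check a response's SHA-1 payload digest
    # against CDX entries stored in PostgreSQL; on a hit, the crawler would
    # write a revisit record instead of a full response record.
    import base64
    import hashlib

    import psycopg2

    conn = psycopg2.connect("dbname=cdx_dedupe")  # hypothetical database

    def payload_digest(body: bytes) -> str:
        # CDX files record the payload digest as base32-encoded SHA-1.
        return base64.b32encode(hashlib.sha1(body).digest()).decode()

    def find_duplicate(body: bytes):
        """Return (original URI, WARC date) if this payload was captured before, else None."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT original_uri, warc_date FROM cdx_entries "
                "WHERE payload_digest = %s LIMIT 1",
                (payload_digest(body),),
            )
            return cur.fetchone()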
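And on pulling the grab-site version via importlib (19:15:50): importlib.metadata can read an installed distribution's version without importing the package. A tiny sketch; the distribution names are assumptions (what grab-site and ludios_wpull actually register under may differ), and how the string would reach the warcinfo record is left out:

    # Build a "software" string from installed distribution versions.
    from importlib.metadata import version, PackageNotFoundError

    def software_string() -> str:
        parts = []
        for dist in ("grab-site", "ludios_wpull"):  # assumed distribution names
            try:
                parts.append(f"{dist}/{version(dist)}")
            except PackageNotFoundError:
                pass  # not installed under that name
        return " ".join(parts) or "unknown"

    print(software_string())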