03:54:21 What's the best way to compress a large list of URLs using zstd? Like, for the highest compression ratio with a reasonable amount of RAM (~16 GB)?
04:35:55 hello
04:38:28 i need help with something
04:43:55 don’t ask to ask, just ask
04:46:03 does anyone have "eene edited parody - the ed touchables" saved? because it's on the Internet Archive but with a notice saying that it "has been archived but cannot be played"
04:46:33 got a link?
04:48:12 https://web.archive.org/web/20190215031325/https://www.youtube.com/watch?v=zOdesdNis3w
04:48:32 … 🪦
04:49:17 bah, I was just looking on https://findyoutubevideo.thetechrobo.ca/
04:54:39 ugh, subdomain.center is quite broken, the API host no longer resolves
05:25:27 maybe we need an automated warning to webirc users that backgrounding the tab might disconnect them or something
06:28:15 So, mineteria is a dead Minecraft server with a unique enough name that it's basically the only thing that shows up in searches. I'm curious if there's any tool that could collect all / many of the URLs from search engine searches & YouTube searches?
06:31:39 the main problem with that is that search engines limit the number of results returned. Google is like 300, Bing 2000
06:31:58 but anyway, JAA has bing-scrape, and I have a couple of hacky JS snippets you can run from browser consoles
06:32:18 bing-scrape is in https://gitea.arpa.li/JustAnotherArchivist/little-things
06:34:42 https://transfer.archivete.am/6WZGP/bing-domain-scraper.js
06:34:42 inline (for browser viewing): https://transfer.archivete.am/inline/6WZGP/bing-domain-scraper.js
06:34:46 https://transfer.archivete.am/11Enzl/google-domain-scraper.js
06:35:30 bing-domain-scraper.js: you run it over and over again, then concatenate all the urls.txt files it downloads
06:36:11 google-domain-scraper.js: you run it once (or more) with which=1, then set which=0 to save the URLs
06:44:38 I see
06:54:00 thanks
13:45:13 Pedrosso: Re zstd, I don't think I've seen it use more than a few GB unless you use multi-threaded mode. To maximise the compression ratio, you'll want single-threaded anyway. So `zstd --ultra -22 --long=31` would give the highest compression ratio. You'll pay for it in CPU cycles.
14:36:07 probably also pre-sort the URLs for a better ratio and faster compression
14:38:11 That can help, yeah, although simple sorting might not be optimal. It'd be neat if there was a 'rearrange the lines to maximise compression ratio'-type option/tool. :-)
14:43:27 If you have enough RAM, you can crank up --long to fit the entire file in the window
14:44:18 Won't help too much if everything is already properly sorted, but it can eke out a small % in ratio
14:49:29 --long=31 as above is the maximum I believe (and only on 64-bit machines).
14:50:36 That's a 2 GiB window.
14:56:53 you're right, odd, I was reading the format spec as well, which states: The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
14:57:22 I guess 2 GiB is the real-world limit for 64-bit, just that the spec goes higher
15:01:51 Huh
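[A minimal sketch of the approach discussed above: pre-sort the URL list, then compress with `zstd --ultra -22 --long=31`. The file names are placeholders, and the `zstd` CLI is assumed to be installed; only the flags come from the discussion.]

```python
# Sketch: maximise zstd ratio on a URL list, per the settings discussed above.
# "urls.txt" and "urls.sorted.txt" are placeholder file names.
import subprocess

# Pre-sorting groups URLs that share long prefixes, which tends to help the ratio.
with open("urls.txt") as f:
    urls = sorted(set(f.read().splitlines()))

with open("urls.sorted.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

# --ultra -22 is the highest level; --long=31 allows a 2 GiB window (64-bit only).
# Single-threaded by default, which maximises the ratio at the cost of CPU time.
subprocess.run(
    ["zstd", "--ultra", "-22", "--long=31",
     "urls.sorted.txt", "-o", "urls.sorted.txt.zst"],
    check=True,
)
```

[Note that a file made with --long=31 also needs `zstd -d --long=31` to decompress, since the decoder's default window limit is smaller.]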
17:30:07 Hello there, I had some questions regarding archiving and URL fetching and was redirected here
17:30:28 Feel free to ask your questions here
17:32:01 Ok, so there are a few things. The first one is this: I was once a user of the Gothic-l Yahoo group. As everyone knows, Yahoo Groups shut down, but I have exported my emails from that group and have all of them in a file. I don't know how to efficiently put them into a website format or strip everything around them that isn't part of the
17:32:02 messages, so that it's possible to get an archive of all the messages from that group. All the messages from before I was a member were archived on another website which still exists
17:34:01 The second one is this: I am working on a personal project where I am building a search engine for older websites (so that one can browse the old internet, more or less). A few of these exist, but I do some things differently, as I have a repository of URLs in a database which I use. The problem is that I don't know what the most efficient way is to
17:34:01 do this, since I glued together some Python scripts for URL fetching with descriptions and titles and converting them to a CSV file, but this would burden the server of the Wayback Machine too much, so I wonder if there is a more friendly way to do this with which I can efficiently fetch URLs with their metadata from only older websites (not modern
17:34:02 ones, since some websites didn't exist yet back around the 2000s)
17:35:45 i presume the emails from Yahoo Groups are probably in a standard format; you can try a Python script with an XML parser to extract the core contents of each email, then generate an HTML file from them
17:36:51 Is your only source of old websites the WBM?
17:37:53 It is not the only source; some websites are not archived on there, so for some websites I also use Arquivo.pt or some other archive websites. The Neopets website, for example, is almost entirely functionally preserved on a different archive website
17:38:49 Thank you for the suggestion on the emails, this is what the files look like: https://ibb.co/g4L1cBt
17:39:37 I don't know if any archive group is interested in getting them? They are not publicly available anymore, since the only people who have them now, since the shutdown of Yahoo Groups, are people who were subscribed, and I am apparently the only one who exported them into a file format so that we have all the messages
17:41:44 https://listserv.linguistlist.org/pipermail/gothic-l/ This website has all the messages up to March 2015; I have everything from April 2015 until the shutdown on my hard drive, exported from my emails.
17:56:09 Assuming they are HTML-based emails and not plaintext, it should be simple to parse each email to obtain the important text
17:56:34 Yahoo probably generates the emails in a standard format
17:57:44 The description and title are unlikely to be available separately in a condensed format, unfortunately. I don't think the WBM exposes that information separately from the actual page body
17:58:26 Oh no, I used Mozilla Thunderbird to export all the emails from the Yahoo mailing list
17:59:44 Well yes, what I was advised to do was to export the Wikipedia reference dump which is available for 2010. That works since my project is intended for websites from before around 2008; they seem to contain titles and websites in the reference list, but not the descriptions, so if I want those, they still need to be fetched
18:00:23 I looked into BeautifulSoup and did some tests with it, but I don't immediately know how to let it fetch multiple URLs in one GET, instead of looping through an array and doing one GET per URL, which burdens a web server
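[For the title-fetching question above, a gentler variant of the per-URL loop is about the best one can do with plain HTTP. A minimal sketch, assuming `urls.txt` as input; the courtesy delay, User-Agent string, and file names are assumptions, not something prescribed in the discussion.]

```python
# Sketch: fetch each archived page one at a time with a delay, extract the
# <title>, and write url,title rows to a CSV. All names here are placeholders.
import csv
import time

import requests
from bs4 import BeautifulSoup

DELAY_SECONDS = 5  # assumed courtesy delay so the archive server isn't hammered

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("titles.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for url in urls:
        try:
            resp = requests.get(
                url, timeout=30,
                headers={"User-Agent": "personal-research-script"},
            )
            soup = BeautifulSoup(resp.text, "html.parser")
            title = soup.title.string.strip() if soup.title and soup.title.string else ""
        except requests.RequestException:
            title = ""  # leave the title blank on network errors and move on
        writer.writerow([url, title])
        time.sleep(DELAY_SECONDS)
```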
18:01:14 (The emails are exported in .eml format, so it's basically each email saved as a file. I guess there should be some Python script to convert .eml files to an XML format or something similar)
18:09:34 might be useful: https://pypi.org/project/eml-parser/
18:10:22 you'll probably need lxml (or an HTML parser) once you extract the email body using the eml parser
18:10:50 Ah thank you, that should be useful
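[For the .eml conversion, here is a minimal sketch using only Python's standard library `email` module rather than the eml-parser package linked above (a deliberate swap, since the stdlib parser handles .eml files directly). It pulls the subject, sender, date, and body out of each exported file and writes a very plain HTML page; the directory names and output layout are assumptions.]

```python
# Sketch: turn a folder of Thunderbird-exported .eml files into one plain HTML
# page per message. "exported/" and "html_out/" are placeholder directory names.
import html
import pathlib
from email import policy
from email.parser import BytesParser

src = pathlib.Path("exported")
dst = pathlib.Path("html_out")
dst.mkdir(exist_ok=True)

for eml_path in sorted(src.glob("*.eml")):
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # Prefer the plain-text part; fall back to the HTML part if that's all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part else ""

    page = (
        "<html><body>"
        f"<h1>{html.escape(msg['Subject'] or '(no subject)')}</h1>"
        f"<p>From: {html.escape(str(msg['From']))}<br>"
        f"Date: {html.escape(str(msg['Date']))}</p>"
        f"<pre>{html.escape(body)}</pre>"
        "</body></html>"
    )
    (dst / (eml_path.stem + ".html")).write_text(page, encoding="utf-8")
```

[Escaping the body into a `<pre>` block is crude but keeps the sketch short; a real conversion would keep HTML bodies as HTML, which is where lxml or another HTML parser comes in, as suggested above.]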