03:54:21 What's the best way to compress a large list of URLs using zstd? Like, for the highest compression ratio with a reasonable amount of RAM (~16 GB)?
04:35:55 hello
04:38:28 i need help with something
04:43:55 don’t ask to ask, just ask
04:46:03 does anyone have "eene edited parody - the ed touchables" saved? because it's on the Internet Archive but with a notice saying that it "has been archived but cannot be played"
04:46:33 got a link?
04:48:12 https://web.archive.org/web/20190215031325/https://www.youtube.com/watch?v=zOdesdNis3w
04:48:32 … 🪦
04:49:17 bah, I was just looking on https://findyoutubevideo.thetechrobo.ca/
04:54:39 ugh, subdomain.center is quite broken, the API host no longer resolves
05:25:27 maybe we need an automated warning to webirc users that backgrounding the tab might disconnect them or something
06:28:15 So, mineteria is a dead Minecraft server with a unique enough name that it's basically the only thing that shows up in searches. I'm curious if there's any tool that could collect all / many of the URLs from search engine searches & YouTube searches?
06:31:39 the main problem with that is that search engines limit the number of results returned. Google is like 300, Bing 2000
06:31:58 but anyway, JAA has bing-scrape, and I have a couple of hacky JS snippets you can run from browser consoles
06:32:18 bing-scrape is in https://gitea.arpa.li/JustAnotherArchivist/little-things
06:34:42 https://transfer.archivete.am/6WZGP/bing-domain-scraper.js
06:34:42 inline (for browser viewing): https://transfer.archivete.am/inline/6WZGP/bing-domain-scraper.js
06:34:46 https://transfer.archivete.am/11Enzl/google-domain-scraper.js
06:35:30 bing-domain-scraper.js: you run it over and over again, then concatenate all the urls.txt files it downloads
06:36:11 google-domain-scraper.js: you run it once (or more) with which=1, then set which=0 to save the URLs
06:44:38 I see
06:54:00 thanks
13:45:13 Pedrosso: Re zstd, I don't think I've seen it use more than a few GB unless you use multi-threaded mode. To maximise the compression ratio, you'll want single-threaded anyway. So `zstd --ultra -22 --long=31` would give the highest compression ratio. You'll pay for it in CPU cycles.
14:36:07 probably also pre-sort the URLs for a better ratio and faster compression
14:38:11 That can help, yeah, although simple sorting might not be optimal. It'd be neat if there was a 'rearrange the lines to maximise compression ratio'-type option/tool. :-)
14:43:27 If you have enough RAM, you can crank up --long to fit the entire file in the window
14:44:18 Won't help too much if everything is already properly sorted, but it can eke out a small % in ratio
14:49:29 --long=31 as above is the maximum I believe (and only on 64-bit machines).
14:50:36 That's a 2 GiB window.
14:56:53 you're right, odd, I was reading the format spec as well, which states: The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
14:57:22 I guess 2 GiB is the real-world limit for 64-bit, just that the spec goes higher
15:01:51 Huh
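[A minimal sketch of the approach discussed above: pre-sort the URL list, then compress with `zstd --ultra -22 --long=31`. The file names are placeholders, and the `zstd` CLI is assumed to be installed; only the flags come from the discussion.]

```python
# Sketch: maximise zstd ratio on a URL list, per the settings discussed above.
# "urls.txt" and "urls.sorted.txt" are placeholder file names.
import subprocess

# Pre-sorting groups URLs that share long prefixes, which tends to help the ratio.
with open("urls.txt") as f:
    urls = sorted(set(f.read().splitlines()))

with open("urls.sorted.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

# --ultra -22 is the highest level; --long=31 allows a 2 GiB window (64-bit only).
# Single-threaded by default, which maximises the ratio at the cost of CPU time.
subprocess.run(
    ["zstd", "--ultra", "-22", "--long=31",
     "urls.sorted.txt", "-o", "urls.sorted.txt.zst"],
    check=True,
)
```

[Note that a file made with --long=31 also needs `zstd -d --long=31` to decompress, since the decoder's default window limit is smaller.]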
17:30:07 Hello there, I had some questions regarding archiving and URL fetching and was redirected here
17:30:28 Feel free to ask your questions here
17:32:01 Ok, so there are a few things. The first one is this: I was once a user of the Gothic-l Yahoo group. As everyone knows, Yahoo Groups shut down, but I have exported my emails from that group and have all of them in a file. I don't know how to efficiently put them into a website format or strip everything around them that isn't part of the
17:32:02 messages, so that it's possible to get an archive of all the messages from that group. All the messages from before I was a member were archived on another website which still exists
17:34:01 The second one is this: I am working on a personal project where I am building a search engine for older websites (so that one can browse the old internet, more or less). A few of these exist, but I do some things differently, as I have a repository of URLs in a database which I use. The problem is that I don't know what the most efficient way is to
17:34:01 do this, since I glued together some Python scripts for URL fetching with descriptions and titles and converting them to a CSV file, but this would burden the server of the Wayback Machine too much, so I wonder if there is a more friendly way to do this with which I can efficiently fetch URLs with their metadata from only older websites (not modern
17:34:02 ones, since some websites didn't exist yet back around the 2000s)
17:35:45 i presume the emails from Yahoo Groups are probably in a standard format; you can try a Python script with an XML parser to extract the core contents of each email, then generate an HTML file from them
17:36:51 Is your only source of old websites the WBM?
17:37:53 It is not the only source; some websites are not archived on there, so for some websites I also use Arquivo.pt or some other archive websites. The Neopets website, for example, is almost entirely functionally preserved on a different archive website
17:38:49 Thank you for the suggestion on the emails, this is what the files look like: https://ibb.co/g4L1cBt
17:39:37 I don't know if any archive group is interested in getting them? They are not publicly available anymore, since the only people who have them now, since the shutdown of Yahoo Groups, are people who were subscribed, and I am apparently the only one who exported them into a file format so that we have all the messages
17:41:44 https://listserv.linguistlist.org/pipermail/gothic-l/ This website has all the messages up to March 2015; I have everything from April 2015 until the shutdown on my hard drive, exported from my emails.
17:56:09 Assuming they are HTML-based emails and not plaintext, it should be simple to parse each email to obtain the important text
17:56:34 Yahoo probably generates the emails in a standard format
17:57:44 The description and title are unlikely to be available separately in a condensed format, unfortunately. I don't think the WBM exposes that information separately from the actual page body
17:58:26 Oh no, I used Mozilla Thunderbird to export all the emails from the Yahoo mailing list
17:59:44 Well yes, what I was advised to do was to export the Wikipedia reference dump which is available for 2010. That works since my project is intended for websites from before around 2008; they seem to contain titles and websites in the reference list, but not the descriptions, so if I want those, they still need to be fetched
18:00:23 I looked into BeautifulSoup and did some tests with it, but I don't immediately know how to let it fetch multiple URLs in one GET, instead of looping through an array and doing one GET per URL, which burdens a web server
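[For the title-fetching question above, a gentler variant of the per-URL loop is about the best one can do with plain HTTP. A minimal sketch, assuming `urls.txt` as input; the courtesy delay, User-Agent string, and file names are assumptions, not something prescribed in the discussion.]

```python
# Sketch: fetch each archived page one at a time with a delay, extract the
# <title>, and write url,title rows to a CSV. All names here are placeholders.
import csv
import time

import requests
from bs4 import BeautifulSoup

DELAY_SECONDS = 5  # assumed courtesy delay so the archive server isn't hammered

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("titles.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for url in urls:
        try:
            resp = requests.get(
                url, timeout=30,
                headers={"User-Agent": "personal-research-script"},
            )
            soup = BeautifulSoup(resp.text, "html.parser")
            title = soup.title.string.strip() if soup.title and soup.title.string else ""
        except requests.RequestException:
            title = ""  # leave the title blank on network errors and move on
        writer.writerow([url, title])
        time.sleep(DELAY_SECONDS)
```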
18:01:14 (The emails are exported in .eml format, so it's basically each email saved as a file. I guess there should be some Python script to convert .eml files to an XML format or something similar)
18:09:34 might be useful: https://pypi.org/project/eml-parser/
18:10:22 you'll probably need lxml (or an HTML parser) once you extract the email body using the eml parser
18:10:50 Ah thank you, that should be useful
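[For the .eml conversion, here is a minimal sketch using only Python's standard library `email` module rather than the eml-parser package linked above (a deliberate swap, since the stdlib parser handles .eml files directly). It pulls the subject, sender, date, and body out of each exported file and writes a very plain HTML page; the directory names and output layout are assumptions.]

```python
# Sketch: turn a folder of Thunderbird-exported .eml files into one plain HTML
# page per message. "exported/" and "html_out/" are placeholder directory names.
import html
import pathlib
from email import policy
from email.parser import BytesParser

src = pathlib.Path("exported")
dst = pathlib.Path("html_out")
dst.mkdir(exist_ok=True)

for eml_path in sorted(src.glob("*.eml")):
    with open(eml_path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    # Prefer the plain-text part; fall back to the HTML part if that's all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part else ""

    page = (
        "<html><body>"
        f"<h1>{html.escape(msg['Subject'] or '(no subject)')}</h1>"
        f"<p>From: {html.escape(str(msg['From']))}<br>"
        f"Date: {html.escape(str(msg['Date']))}</p>"
        f"<pre>{html.escape(body)}</pre>"
        "</body></html>"
    )
    (dst / (eml_path.stem + ".html")).write_text(page, encoding="utf-8")
```

[Escaping the body into a `<pre>` block is crude but keeps the sketch short; a real conversion would keep HTML bodies as HTML, which is where lxml or another HTML parser comes in, as suggested above.]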