-
Pedrosso
What's the best way to compress a large list of URLs using zstd? Like for highest compression ratio with a reasonable amount of RAM (~16 GB)?
-
rasterofmandomness
hello
-
rasterofmandomness
i need help with something
-
fireonlive
don’t ask to ask, just ask
-
rasterofmandomness
does anyone have "eene edited parody - the ed touchables" saved? Because it's on the Internet Archive, but with a notice saying that it "has been archived but cannot be played"
-
pabs
got a link?
-
rasterofmandomness
-
fireonlive
… 🪦
-
pabs
-
pabs
ugh subdomain.center is quite broken, API host no longer resolves
-
fireonlive
maybe we need an automated warning to webirc users that backgrounding the tab might disconnect them or something
-
Pedrosso
So, mineteria is a dead Minecraft server with a unique enough name that it's basically the only thing that shows up in searches. I'm curious if there's any tool that could collect all (or at least many) of the URLs from search engine and YouTube searches?
-
pabs
the main problem with that is that search engines limit the number of results returned. Google is like 300, Bing 2000
-
pabs
but anyway, JAA has bing-scrape, and I have a couple of hacky JS snippets you can run from browser consoles
-
pabs
-
pabs
-
eggdrop
-
pabs
-
pabs
bing-domain-scraper.js: you run it over and over again, then concatenate all the urls.txt files it downloads
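A minimal sketch of that concatenate step, with de-duplication thrown in, assuming the downloads all match urls*.txt (file names are placeholder assumptions):

```python
from pathlib import Path

# Merge every downloaded urls.txt ("urls (1).txt", "urls (2).txt", ...)
# into one file, dropping duplicates while keeping first-seen order.
seen = set()
with open("all-urls.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path(".").glob("urls*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line and line not in seen:
                seen.add(line)
                out.write(line + "\n")
```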
-
pabs
google-domain-scraper.js: you run it once (or more) with which=1, then change it to which=0 to save the URLs
-
Pedrosso
I see
-
Pedrosso
thanks
-
JAA
Pedrosso: Re zstd, I don't think I've seen it use more than a few GB unless you use multi-threaded mode. To maximise the compression ratio, you'll want to use single-threaded anyway. So `zstd --ultra -22 --long=31` would be the highest compression ratio. You'll pay for it in CPU cycles.
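A minimal sketch of driving that exact invocation from a script, assuming `zstd` is on PATH and the list lives in `urls.txt` (both placeholder names):

```python
import subprocess

# --ultra unlocks levels above 19, -22 is the maximum level, and
# --long=31 enables long-distance matching with a 2 GiB window.
# -T1 forces single-threaded mode, as discussed, for the best ratio.
subprocess.run(
    ["zstd", "--ultra", "-22", "--long=31", "-T1", "urls.txt", "-o", "urls.txt.zst"],
    check=True,
)

# Note: decompression needs a matching window hint, e.g. `zstd -d --long=31`.
```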
-
Terbium
probably also pre-sort the URLs for better ratio and faster compression
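A trivial sketch of that pre-sort, assuming the whole list fits in memory (which ~16 GB of RAM should allow for most URL lists; file names are placeholders):

```python
# Sorting puts URLs that share a scheme/host/path prefix next to each
# other, giving the compressor longer and closer matches to exploit.
with open("urls.txt", encoding="utf-8") as f:
    urls = f.read().splitlines()

urls.sort()

with open("urls-sorted.txt", "w", encoding="utf-8") as f:
    f.writelines(u + "\n" for u in urls)
```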
-
JAA
That can help, yeah, although simple sorting might not be optimal. It'd be neat if there was a 'rearrange the lines to maximise compression ratio'-type option/tool. :-)
-
Terbium
If you have enough RAM, you can crank up --long to fit the entire file in the window
-
Terbium
Won't help too much if everything is already properly sorted, but can eke out a small % in ratio
-
JAA
--long=31 as above is the maximum I believe (and only on 64-bit machines).
-
JAA
That's a 2 GiB window.
-
Terbium
you're right, odd. I was reading the format spec as well, which states: The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
-
Terbium
I guess 2 GiB is the real-world limit for 64-bit, it's just that the spec goes higher
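For what it's worth, the spec's figure is quick to check (plain arithmetic, nothing zstd-specific):

```python
# Maximum Window_Size per the zstd format spec, versus the --long=31 cap.
spec_max = (1 << 41) + 7 * (1 << 38)
print(spec_max)               # 4123168604160 bytes
print(spec_max / (1 << 40))   # 3.75, so the spec's "3.75 TB" is 3.75 TiB
print((1 << 31) / (1 << 30))  # 2.0 GiB, the largest window --long=31 allows
```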
-
JAA
Huh
-
Verta
Hello there, I had some questions regarding archiving and url fetching and was redirected here
-
Terbium
Feel free to ask your questions here
-
Verta
Ok, so there are a few things. The first one is this: I was once a user of the Gothic-l Yahoo group. As everyone knows, Yahoo Groups shut down, but I exported my emails from that group and have all of them in a file. I don't know how to efficiently put them into a website format, or strip out everything that isn't part of the messages, so that it's possible to get an archive of all the messages from that group. All the messages from before I was a member were archived on another website which still exists.
-
Verta
The second one is this: I am working on a personal project where I am building a search engine for older websites (so that one can browse the old internet, more or less). A few of these exist, but I do some things differently, as I have a repository of URLs in a database which I use. The problem is that I don't know the most efficient way to do this. I glued together some Python scripts for fetching URLs with descriptions and titles and converting them to a CSV file, but this would burden the Wayback Machine's servers too much, so I wonder if there is a friendlier way to efficiently fetch URLs with their metadata from only older websites (not modern ones, since some websites didn't exist yet back around the 2000s).
-
Terbium
I presume the emails from Yahoo Groups are in a standard format; you can try a Python script with an XML parser to extract the core contents of each email, then generate an HTML file from them
-
Terbium
Is your only source of old websites the WBM?
-
Verta
It is not the only source; some websites are not archived on there, so for some websites I also use Arquivo.pt or other archive sites. The Neopets website, for example, is almost entirely functionally preserved on a different archive website
-
Verta
Thank you for the suggestion on the emails, this is what the files look like:
ibb.co/g4L1cBt
-
Verta
I don't know if any archive group is interested in getting them? They are not publicly available anymore: since the shutdown of Yahoo Groups, the only people who have them are those who were subscribed, and I am apparently the only one who exported them into a file format so that we have all the messages
-
Verta
listserv.linguistlist.org/pipermail/gothic-l has all the messages up to March 2015; I have everything from April 2015 until the shutdown on my hard drive, exported from my emails.
-
Terbium
Assuming they are HTML-based emails and not plaintext, it should be simple to parse each email to obtain the important text
-
Terbium
Yahoo probably generates the emails in a standard format
-
Terbium
The description and title are unlikely to be available separately in a condensed format unfortunately. I don't think WBM exposes that information separately from the actual page body
-
Verta
Oh no, I used Mozilla Thunderbird to export all the emails from the Yahoo Mailing list
-
Verta
Well yes, the advice I got was to export the Wikipedia reference dump, which is available for 2010. That works, since my project is intended for websites from before around 2008. The dumps seem to contain titles and websites in the reference list, but not the descriptions, so if I want those, they still need to be fetched
-
Verta
I looked into BeautifulSoup and did some tests with it, but I don't immediately know how to make it fetch multiple URLs in one GET, instead of looping through an array and doing one GET per URL, which burdens the web server
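For reference, a minimal sketch of the loop-per-URL pattern being described, with a fixed delay to stay gentle on the server (the URL list, delay, and title-only scraping are all placeholder assumptions):

```python
import time

import requests
from bs4 import BeautifulSoup

urls = ["http://example.com/old-page"]  # placeholder list

session = requests.Session()  # reuses the connection between requests
for url in urls:
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    print(url, title)
    time.sleep(2)  # fixed pause between requests to avoid hammering the host
```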
-
Verta
(The emails are exported in .eml format, so it's basically each email as a file. I guess there should be some Python script to convert .eml files to an XML format or something similar.)
-
Terbium
-
Terbium
you'll probably need lxml (or an HTML parser) once you extract the email body using the eml parser
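A minimal sketch of that pipeline using only the standard library's email parser, assuming a folder of .eml files (all paths are placeholders; it prefers the plain-text body, so an HTML-only email would still need lxml or similar to strip tags):

```python
import html
from email import policy
from email.parser import BytesParser
from pathlib import Path

sections = []
for eml in sorted(Path("gothic-l-eml").glob("*.eml")):
    with eml.open("rb") as f:
        # policy.default enables the modern EmailMessage API (get_body etc.).
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    sections.append(
        "<article><h2>%s</h2><p>%s, %s</p><pre>%s</pre></article>" % (
            html.escape(str(msg["subject"] or "")),
            html.escape(str(msg["from"] or "")),
            html.escape(str(msg["date"] or "")),
            html.escape(text),
        )
    )

Path("gothic-l.html").write_text(
    "<!doctype html>\n<meta charset='utf-8'>\n" + "\n".join(sections),
    encoding="utf-8",
)
```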
-
Verta
Ah thank you, that should be useful