-
Pedrosso
What's the best way to compress a large list of URLs using zstd? Like for highest compression ratio with a reasonable amount of RAM (~16 GB)?
-
rasterofmandomness
hello
-
rasterofmandomness
i need help with something
-
fireonlive
don’t ask to ask, just ask
-
rasterofmandomness
does anyone have "eene edited parody - the ed touchables" saved? Because it's on the Internet Archive, but with a notice saying that it "has been archived but cannot be played"
-
pabs
got a link?
-
rasterofmandomness
-
fireonlive
… 🪦
-
pabs
-
pabs
ugh subdomain.center is quite broken, API host no longer resolves
-
fireonlive
maybe we need an automated warning to webirc users that backgrounding the tab might disconnect them or something
-
Pedrosso
So, mineteria is a dead Minecraft server with a unique enough name that it's basically the only thing that shows up in searches. I'm curious if there's any tool that could collect all (or at least many) of the URLs from search engine and YouTube searches?
-
pabs
the main problem with that is that search engines limit the number of results returned. Google is like 300, Bing 2000
-
pabs
but anyway, JAA has bing-scrape, and I have a couple of hacky JS snippets you can run from browser consoles
-
pabs
-
pabs
-
eggdrop
-
pabs
-
pabs
bing-domain-scraper.js: you run it over and over again, then concatenate all the urls.txt files it downloads
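A minimal sketch of that concatenate step, with de-duplication thrown in, assuming the downloads all match urls*.txt (file names are placeholder assumptions):

```python
from pathlib import Path

# Merge every downloaded urls.txt ("urls (1).txt", "urls (2).txt", ...)
# into one file, dropping duplicates while keeping first-seen order.
seen = set()
with open("all-urls.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path(".").glob("urls*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line and line not in seen:
                seen.add(line)
                out.write(line + "\n")
```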
-
pabs
google-domain-scraper.js: you run it once (or more) with which=1, then change it to which=0 to save the URLs
-
Pedrosso
I see
-
Pedrosso
thanks
-
JAA
Pedrosso: Re zstd, I don't think I've seen it use more than a few GB unless you use multi-threaded mode. To maximise the compression ratio, you'll want to use single-threaded anyway. So `zstd --ultra -22 --long=31` would be the highest compression ratio. You'll pay for it in CPU cycles.
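A minimal sketch of driving that exact invocation from a script, assuming `zstd` is on PATH and the list lives in `urls.txt` (both placeholder names):

```python
import subprocess

# --ultra unlocks levels above 19, -22 is the maximum level, and
# --long=31 enables long-distance matching with a 2 GiB window.
# -T1 forces single-threaded mode, as discussed, for the best ratio.
subprocess.run(
    ["zstd", "--ultra", "-22", "--long=31", "-T1", "urls.txt", "-o", "urls.txt.zst"],
    check=True,
)

# Note: decompression needs a matching window hint, e.g. `zstd -d --long=31`.
```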
-
Terbium
probably also pre-sort the URLs for better ratio and faster compression
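A trivial sketch of that pre-sort, assuming the whole list fits in memory (which ~16 GB of RAM should allow for most URL lists; file names are placeholders):

```python
# Sorting puts URLs that share a scheme/host/path prefix next to each
# other, giving the compressor longer and closer matches to exploit.
with open("urls.txt", encoding="utf-8") as f:
    urls = f.read().splitlines()

urls.sort()

with open("urls-sorted.txt", "w", encoding="utf-8") as f:
    f.writelines(u + "\n" for u in urls)
```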
-
JAA
That can help, yeah, although simple sorting might not be optimal. It'd be neat if there was a 'rearrange the lines to maximise compression ratio'-type option/tool. :-)
-
Terbium
If you have enough RAM, you can crank up --long to fit the entire file in the window
-
Terbium
Won't help too much if everything is already properly sorted, but can eke out a small % in ratio
-
JAA
--long=31 as above is the maximum I believe (and only on 64-bit machines).
-
JAA
That's a 2 GiB window.
-
Terbium
you're right, odd. I was reading the format spec as well, which states: The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
-
Terbium
I guess 2 GiB is the real-world limit for 64-bit, it's just that the spec goes higher
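For what it's worth, the spec's figure is quick to check (plain arithmetic, nothing zstd-specific):

```python
# Maximum Window_Size per the zstd format spec, versus the --long=31 cap.
spec_max = (1 << 41) + 7 * (1 << 38)
print(spec_max)               # 4123168604160 bytes
print(spec_max / (1 << 40))   # 3.75, so the spec's "3.75 TB" is 3.75 TiB
print((1 << 31) / (1 << 30))  # 2.0 GiB, the largest window --long=31 allows
```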
-
JAA
Huh
-
Verta
Hello there, I had some questions regarding archiving and url fetching and was redirected here
-
Terbium
Feel free to ask your questions here
-
Verta
Ok, so there are a few things. The first one is this: I was once a user of the Gothic-l Yahoo group. As everyone knows, Yahoo Groups shut down, but I exported my emails from that group and have all of them in a file. I don't know how to efficiently put them into a website format, or strip out everything that isn't part of the messages, so that it's possible to get an archive of all the messages from that group. All the messages from before I was a member were archived on another website which still exists.
-
Verta
The second one is this: I am working on a personal project where I am building a search engine for older websites (so that one can browse the old internet, more or less). A few of these exist, but I do some things differently, as I have a repository of URLs in a database which I use. The problem is that I don't know the most efficient way to do this. I glued together some Python scripts for fetching URLs with descriptions and titles and converting them to a CSV file, but this would burden the Wayback Machine's servers too much, so I wonder if there is a friendlier way to efficiently fetch URLs with their metadata from only older websites (not modern ones, since some websites didn't exist yet back around the 2000s).
-
Terbium
I presume the emails from Yahoo Groups are in a standard format; you can try a Python script with an XML parser to extract the core contents of each email, then generate an HTML file from them
-
Terbium
Is your only source of old websites the WBM?
-
Verta
It is not the only source; some websites are not archived on there, so for some websites I also use Arquivo.pt or other archive sites. The Neopets website, for example, is almost entirely functionally preserved on a different archive website
-
Verta
Thank you for the suggestion on the emails, this is what the files look like:
ibb.co/g4L1cBt
-
Verta
I don't know if any archive group is interested in getting them? They are not publicly available anymore: since the shutdown of Yahoo Groups, the only people who have them are those who were subscribed, and I am apparently the only one who exported them into a file format so that we have all the messages
-
Verta
listserv.linguistlist.org/pipermail/gothic-l has all the messages up to March 2015; I have everything from April 2015 until the shutdown on my hard drive, exported from my emails.
-
Terbium
Assuming they are HTML-based emails and not plaintext, it should be simple to parse each email to obtain the important text
-
Terbium
Yahoo probably generates the emails in a standard format
-
Terbium
The description and title are unlikely to be available separately in a condensed format unfortunately. I don't think WBM exposes that information separately from the actual page body
-
Verta
Oh no, I used Mozilla Thunderbird to export all the emails from the Yahoo Mailing list
-
Verta
Well yes, the advice I got was to export the Wikipedia reference dump, which is available for 2010. That works, since my project is intended for websites from before around 2008. The dumps seem to contain titles and websites in the reference list, but not the descriptions, so if I want those, they still need to be fetched
-
Verta
I looked into BeautifulSoup and did some tests with it, but I don't immediately know how to make it fetch multiple URLs in one GET, instead of looping through an array and doing one GET per URL, which burdens the web server
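For reference, a minimal sketch of the loop-per-URL pattern being described, with a fixed delay to stay gentle on the server (the URL list, delay, and title-only scraping are all placeholder assumptions):

```python
import time

import requests
from bs4 import BeautifulSoup

urls = ["http://example.com/old-page"]  # placeholder list

session = requests.Session()  # reuses the connection between requests
for url in urls:
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    print(url, title)
    time.sleep(2)  # fixed pause between requests to avoid hammering the host
```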
-
Verta
(The emails are exported in .eml format, so it's basically each email as a file. I guess there should be some Python script to convert .eml files to an XML format or something similar.)
-
Terbium
-
Terbium
you'll probably need lxml (or an HTML parser) once you extract the email body using the eml parser
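A minimal sketch of that pipeline using only the standard library's email parser, assuming a folder of .eml files (all paths are placeholders; it prefers the plain-text body, so an HTML-only email would still need lxml or similar to strip tags):

```python
import html
from email import policy
from email.parser import BytesParser
from pathlib import Path

sections = []
for eml in sorted(Path("gothic-l-eml").glob("*.eml")):
    with eml.open("rb") as f:
        # policy.default enables the modern EmailMessage API (get_body etc.).
        msg = BytesParser(policy=policy.default).parse(f)
    body = msg.get_body(preferencelist=("plain", "html"))
    text = body.get_content() if body is not None else ""
    sections.append(
        "<article><h2>%s</h2><p>%s, %s</p><pre>%s</pre></article>" % (
            html.escape(str(msg["subject"] or "")),
            html.escape(str(msg["from"] or "")),
            html.escape(str(msg["date"] or "")),
            html.escape(text),
        )
    )

Path("gothic-l.html").write_text(
    "<!doctype html>\n<meta charset='utf-8'>\n" + "\n".join(sections),
    encoding="utf-8",
)
```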
-
Verta
Ah thank you, that should be useful