-
nicolas17
ok I whipped up a script to do that now
-
nicolas17
I'm definitely getting a speedup but not *that* good
-
nicolas17
went through 24GiB of the tar file in 6 minutes
-
nicolas17
which would be 71MiB/s if I was downloading the full thing, which seems impossible to get from archive.org :p
-
nicolas17
but... it could be better
-
nicolas17
there we go, I almost halved the number of requests needed, 100MiB/s equivalent now :D
-
nicolas17
"exception: connection aborted" nooo, time to add retries
-
nicolas17
I should have done this script earlier lol
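A minimal sketch of the kind of retry wrapper mentioned above, assuming the failures surface as Python `ConnectionError` or `TimeoutError` exceptions. The name `with_retries` and its parameters are hypothetical and are not taken from the actual script.

```python
import time


def with_retries(func, attempts=5, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

`ConnectionAbortedError` is a subclass of `ConnectionError`, so it is covered by the default `retry_on`. Errors raised by an HTTP library may use other exception types, which would need to be added to that tuple.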
-
Rootliam
And now my program downloads 256kb at once and caches it, is this officially a race now or :P
-
nicolas17
I'm benchmarking :o
-
nicolas17
I tried readahead of 128KB, 256KB, and 512KB, the speed difference was completely lost in the noise
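A rough sketch of the approach being discussed, assuming the script indexes a remote tar by fetching only the 512-byte member headers with HTTP Range requests and skipping the file contents. The names `ReadaheadReader`, `index_tar`, and `http_fetch` are hypothetical, and only plain ustar headers are handled (no GNU long names, pax extensions, prefix field, or base-256 sizes).

```python
import urllib.request

BLOCK = 512  # tar headers and data padding are both 512-byte aligned


class ReadaheadReader:
    """Serves small reads out of one larger ranged fetch (the readahead)."""

    def __init__(self, fetch, readahead=256 * 1024):
        self.fetch = fetch  # fetch(offset, length) -> bytes
        self.readahead = readahead
        self.start = 0
        self.buf = b""
        self.requests = 0

    def read(self, offset, length):
        end = offset + length
        cached = self.start <= offset and end <= self.start + len(self.buf)
        if not cached:
            # One request pulls in the header plus whatever follows it.
            self.buf = self.fetch(offset, max(length, self.readahead))
            self.start = offset
            self.requests += 1
        rel = offset - self.start
        return self.buf[rel:rel + length]


def index_tar(reader, total_size):
    """Return (name, data_offset, size) for each member, skipping contents."""
    entries = []
    offset = 0
    while offset + BLOCK <= total_size:
        header = reader.read(offset, BLOCK)
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break  # short read or end-of-archive marker
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)  # octal ASCII
        entries.append((name, offset + BLOCK, size))
        # Jump over the data, which is padded to a whole number of blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return entries


def http_fetch(url):
    """Build a fetch(offset, length) callable backed by Range requests."""
    def fetch(offset, length):
        byte_range = f"bytes={offset}-{offset + length - 1}"
        req = urllib.request.Request(url, headers={"Range": byte_range})
        # A server that ignores Range would send the whole file instead.
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch
```

With small members, several consecutive headers fall inside one readahead window, so they cost a single request. With large members (videos), the next header is far away and each one costs a new request, which would make latency the limiting factor.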
-
JAA
> Cult of the Lamb dev says it will delete the game on January 1
-
JAA
... due to the Unity changes
-
fireonlive
:\
-
JAA
I was going to suggest 'discord' if we wanted to create a channel, but...
-
» fireonlive blinks
-
fireonlive
the article's source was also two posts from their shitpost-social-media-account so ¯\_(ツ)_/¯
-
JAA
Ah indeed :-)
-
nicolas17
IA download speeds are way too variable to test this properly
-
fireonlive
we need to get nicolas17 a 10Gig interconnect to IA
-
nicolas17
suddenly dropped to 800KiB/s *despite* skipping chunks
-
project10
I think optane9 has a better part of 10g to IA, it's on a network with peering to IA at the SFMIX. Maybe run it there? :D
-
nicolas17
this is all over the place...
-
nicolas17
64KB: 32s 35s 43s
-
nicolas17
256KB: 11s 17s 18s
-
nicolas17
1024KB: 14s 25s 50s
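To make the noise in the timings above easier to see, here is a small sketch that computes the median and the spread (slowest minus fastest) for each readahead size. The numbers are the ones quoted above; the helper name `summarize` is made up for illustration.

```python
from statistics import median

# Timings in seconds, copied from the three messages above.
runs = {
    "64KB": [32, 35, 43],
    "256KB": [11, 17, 18],
    "1024KB": [14, 25, 50],
}


def summarize(samples):
    """Median and spread (max - min) of a list of timings."""
    return {"median": median(samples), "spread": max(samples) - min(samples)}


summary = {label: summarize(times) for label, times in runs.items()}
for label, stats in summary.items():
    print(f"{label:>7}: median {stats['median']}s, spread {stats['spread']}s")
```

With only three runs each, the 36-second spread at 1024KB is larger than any gap between the medians, so these samples alone cannot rank 256KB against 1024KB.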
-
h2ibot
PaulWise created MoinMoin (+2087, create MoinMoin project page):
wiki.archiveteam.org/?title=MoinMoin
-
h2ibot
PaulWise edited MoinMoin (+5058, add more moinmoin wikis from google/bing):
wiki.archiveteam.org/?diff=50769&oldid=50768
-
h2ibot
PaulWise edited MoinMoin (+145, another strategem):
wiki.archiveteam.org/?diff=50770&oldid=50769
-
h2ibot
PaulWise edited MoinMoin (+3477, more, sorted):
wiki.archiveteam.org/?diff=50771&oldid=50770
-
nicolas17
43.1GiB tar file indexed in 3m22s :D
-
Rootliam
how the hell
-
nicolas17
I had others, especially those with few videos and mostly html pages, taking longer than just downloading the entire tar
-
nicolas17
so it depends on the tar content *and* on the speed of the particular IA server I hit
-
nicolas17
especially latency more than throughput...
-
h2ibot
PaulWise edited MoinMoin (+5010, more, sorted):
wiki.archiveteam.org/?diff=50772&oldid=50771
-
nicolas17
just finished a big one, 255GiB in 49m30s
-
nicolas17
I have another of a similar size with an ETA of 4 hours -.-
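For reference, the "equivalent" download speed of an indexing run is just the tar size divided by the elapsed time. A small sketch of that arithmetic, using the two runs quoted above (the function name is hypothetical):

```python
def equivalent_mib_per_s(size_gib, minutes, seconds=0):
    """Speed needed to download the whole tar in the same elapsed time."""
    elapsed = minutes * 60 + seconds
    return size_gib * 1024 / elapsed


# 43.1 GiB indexed in 3m22s
print(round(equivalent_mib_per_s(43.1, 3, 22), 1))
# 255 GiB indexed in 49m30s
print(round(equivalent_mib_per_s(255, 49, 30), 1))
```

These come out to roughly 218 MiB/s and 88 MiB/s respectively.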
-
pabs
anyone got any scripts/something to automate (browser-based?) searching using Bing?
-
fireonlive
flashfire42?
-
fireonlive
or do you manually rawdog that
-
nicolas17
phrasing
-
fireonlive
:3
-
nicolas17
indexing 8 tar files at the same time; to match this speed by downloading each whole .tar instead, I would need to pull a combined 433 MB/s from IA >:3
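One way to run several latency-bound jobs side by side is a thread pool. This is only a sketch: `index_one` is a hypothetical stand-in that sleeps instead of talking to IA, and nothing here reflects how the actual script schedules its work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def index_one(name):
    """Hypothetical stand-in for indexing one remote tar file."""
    time.sleep(0.05)  # pretend to wait on network round trips
    return name, "done"


def index_many(names, workers=8):
    """Index several tar files concurrently, one worker thread per file."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(index_one, names))


if __name__ == "__main__":
    print(index_many([f"item{i}.tar" for i in range(8)]))
```

Threads suit this workload because each job spends most of its time waiting on the network. Spread over 8 files, the 433 MB/s figure works out to about 54 MB/s per file.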
-
nicolas17
pabs: it seems they closed in 2020, but it sucks that the announcement doesn't have a date
-
pabs
website got an AB in 2020, no subdomains though, inc the wiki
-
pabs
started some jobs
-
Rootliam
literally how
-
Rootliam
my program is taking 20 seconds just to get 5 MB into the file, downloading 256 or 512 KB at a time, and it's written in C ._.
-
Rootliam
and that's also with downloading html files turned off
-
Doomaholic
taaffeite: What kind of errors were you getting?
-
Doomaholic
Is it running at all or is it just a problem with the page you're trying to download
-
taaffeite
I'm receiving several warnings and errors: EBADENGINE is an unsupported engine, npm ERR! path /usr/local/lib/node_modules/mwoffliner/node_modules/sharp command failed, Installation error: Expected Node.js version >=14.15.0 but found 12.22.9.
-
taaffeite
So an outdated Node.js version?
-
Doomaholic
Yeah that's likely the issue
-
Doomaholic
How did you install it?
-
Doomaholic
Usually the repo in your distribution is outdated
-
taaffeite
I followed the instructions on the GitHub page. Using the latest version of Linux Mint. 'npm i -g mwoffliner'
-
Doomaholic
I mean how did you install Node?
-
taaffeite
Perhaps I didn't actually. I just installed the redis-server.
-
Doomaholic
I see
-
taaffeite
I downloaded the Node.js binary from their site, but couldn't install that.
-
Doomaholic
Well you should try to install Node then
-
Doomaholic
sudo apt install nodejs
-
Doomaholic
Then run nodejs -v and see what version it gave you
-
taaffeite
'nodejs is already the newest version (12.22.9~dfsg-1ubuntu3)'
-
taaffeite
The apt package is out of date?
-
Doomaholic
Oh okay, yes
-
Doomaholic
I haven't tried this myself but apparently there is a Node package that will update it for you
-
Doomaholic
npm install -g n
-
Doomaholic
Then run:
-
Doomaholic
n stable
-
Doomaholic
And it should update
-
Doomaholic
If that doesn't work you'll have to reinstall with a newer version
-
taaffeite
Okay that worked. It's now v18.17.1. I'll try running mwoffliner again.
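The installation error above comes down to a simple version comparison. A small sketch of that check, assuming plain dotted versions only (no pre-release tags, no distro suffixes like `~dfsg-1ubuntu3`, and no full semver range syntax); the function names are made up for illustration:

```python
def parse_version(text):
    """Turn '12.22.9' or 'v18.17.1' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def satisfies_minimum(found, minimum):
    """True if the found version is at least the required minimum."""
    return parse_version(found) >= parse_version(minimum)


print(satisfies_minimum("12.22.9", "14.15.0"))   # the apt-packaged Node
print(satisfies_minimum("v18.17.1", "14.15.0"))  # after running `n stable`
```

Comparing tuples of integers avoids the string-comparison trap where "9.0.0" would sort after "14.15.0".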
-
Doomaholic
Nice
-
Doomaholic
Give it a try
-
taaffeite
A wall of deprecation and unmaintained warnings, then 'added 1191 packages in 2m', '117 packages are looking for funding'.
-
Doomaholic
That's normal for npm :P
-
Doomaholic
If you just see that when installing packages then it's likely fine
-
taaffeite
I'm getting a help page from mwoffliner so I think we're good.
-
Doomaholic
Sweet
-
taaffeite
I was told by someone working on the ZIM project that this script is undergoing maintenance and might not work until it's been repaired sometime in the next several months. But I'll give it a go. Thanks for the help.
-
Doomaholic
You're welcome :)
-
AK
Hmm anyone thought about archiving parts of the Unity forums? This is 128 pages of comments to the new pricing changes that should probably be saved (In case they delete it like they did their GitHub)
forum.unity.com/threads/unity-plan-…-packaging-updates.1482750/page-128
-
h2ibot
PaulWise edited Mailman2 (+1125, add new lists, move not done lists to the right…):
wiki.archiveteam.org/?diff=50773&oldid=50767
-
JAA
pabs: little-things/bing-scrape, though I haven't used it in some time.
-
Ryz
joepie91|m and others regarding interest in Unity, need to voraciously start finding Unity stuff :C
-
icedice
You guys are going to start archiving Unity games?
-
JAA
Exorcism: WordPress works well with simple recursive crawling like AB. Are you proposing a large-scale project?
-
Exorcism
JAA: let's say yzqzss launched its own project
github.com/saveweb/wordpress-rss-archiver then I don't know if you really want to, that's why I'm asking 👀
-
JAA
The README sounds like this is a 'run this to continuously submit new posts to SPN' thing...?
-
pokechu22
Isn't there also some kind of wordpress push notification system that's tied into IA?
-
pokechu22
related to
developer.wordpress.com/docs/firehose (though I think it includes stuff not hosted by wordpress.com)?
-
JAA
The README claims you have to pay for that.
-
pokechu22
Right, but I think IA does?
-
JAA
Ah nice, I've long wanted to look into leveraging Jetpack for that.
-
Exorcism
<JAA> "The README sounds like this is a..." <- yep, that's it :p
-
icedice
You might want to add
mangadex.com to the list if you're going to be archiving WordPress sites
-
icedice
It's a scanlation group site hosting service run by MangaDex and it uses WordPress
-
icedice
(mangadex.org is the domain used by MangaDex's manga reader)
-
Exorcism
👌🏻
-
JAA
A continuous thing for select blogs would fit into #//. Duplicating the IA's project, i.e. doing that for all blogs with Jetpack, probably makes little sense, assuming they're achieving decent coverage there.
-
JAA
And as mentioned, one-off archival works very well with AB.
-
pokechu22
Oh, also, arkiver - what data would you need for a DPoS project for orange? I can build a list of pages that are known to exist (e.g. website front pages, possibly deeper ones too) based on the AB jobs, but I'm not sure what else is needed
-
arkiver
pokechu22: we need all the links you know about
-
arkiver
Exorcism: what is this about?
-
pokechu22
I've got 2GB of assorted links (some dead but existed in the past via CDX data, some alive, some already saved via AB); I can try to organize that into something actually usable
-
pokechu22
one other thing is that there are several kinds of links that will need to be remapped into other links because sites link to older domains that no longer work, but I imagine that's pretty easy to do with a script
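A sketch of what such a remapping script could look like, assuming the remap is a plain host-to-host substitution. The domains in `HOST_REMAP` are placeholders, not the real ones, and `remap_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder mapping from dead hostnames to their current equivalents.
HOST_REMAP = {
    "old.example.com": "new.example.org",
}


def remap_url(url, host_map=HOST_REMAP):
    """Rewrite the host of a URL if it is in the map, else return it as-is."""
    parts = urlsplit(url)
    new_host = host_map.get(parts.hostname or "")
    if new_host is None:
        return url
    # Replacing netloc wholesale drops any port or userinfo in the original.
    return urlunsplit(parts._replace(netloc=new_host))


print(remap_url("http://old.example.com/page?id=1"))
```

Path and query string are carried over unchanged. Remaps that also change the path layout would need per-site rules on top of this.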
-
arkiver
pokechu22: can you gz or zst the list up and post it?
-
arkiver
on transfer.archivete.am
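A minimal sketch of packing a link list into a gzip file with the Python standard library, deduplicating and sorting along the way. The function name and file name are hypothetical; zstd would need a third-party module, and the upload step is not shown.

```python
import gzip


def write_url_list(urls, path):
    """Write unique, sorted URLs to a gzip-compressed text file."""
    unique = sorted({u.strip() for u in urls if u.strip()})
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for url in unique:
            out.write(url + "\n")
    return len(unique)


if __name__ == "__main__":
    sample = ["http://example.com/b", "http://example.com/a", "http://example.com/b"]
    print(write_url_list(sample, "links.txt.gz"))
```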
-
arkiver
let's do a channel for orange!
-
arkiver
any ideas for an orange channel? :)
-
pokechu22
#webroasting already exists, not sure if we need a dedicated one
-
arkiver
ah
-
arkiver
alright we'll use that
-
arkiver
Exorcism: for wordpress, could you just use #archivebot, and for regularly getting a set of wordpress RSS feeds we could (as JAA suggests) just use #// indeed
-
arkiver
I'm not in favor of Archive Team using SPN on a large scale, SPN is not made for that
-
arkiver
honestly behind the scenes, SPN is quite busy and regularly has too much to do, so queuing complete wordpress blogs through is maybe not the best way
-
arkiver
(plus indeed IA already does something with wordpress)
-
fireonlive
archivebot best bot :)
-
arkiver
:)
-
h2ibot
Arkiver edited CNET Forums (-33, Reverted edits by…):
wiki.archiveteam.org/?diff=50776&oldid=50445
-
Exorcism
👍🏻
-
arkiver
i reverted a sneaky spammy edit ^
-
fireonlive
huh, odd
-
fireonlive
should block that user I suppose
-
arkiver
yeah i marked them as spammer
-
fireonlive
ah :)
-
arkiver
they tried to get in another edit (it was in the mod queue)
-
fireonlive
ahh
-
fireonlive
gotta love spammers...
-
arkiver
it was in the CNET announcement, which went like "blabla... Thanks, CNET team"
-
arkiver
and they added "BLABLA... Thanks [spam link], CNET team"
-
arkiver
sneaky
-
arkiver
:P
-
fireonlive
indeed :3
-
arkiver
they were caught though
-
arkiver
Exorcism: or do you have different thoughts about that?
-
pokechu22
plcp - you should probably join #webroasting
-
Exorcism
<arkiver> "Exorcism: or do you have..." <- not really, I just prefer to use wordpress archiver, that's it haha
-
arkiver
right
-
fireonlive
"ACE Takes Aim at Zoro.to Successor Aniwatch.to" "Below is a list of all domains targeted by MPA/ACE in a recent DMCA subpoena wave"
-
Peroniko
Moved from #archiveteam-ot: Any idea how feasible it is to archive rateyourmusic.com, considering that they seem to block Wayback Machine IPs, probably because of the amount of traffic? They are a great place for music discovery, and their forum is around 20 years old. Most of the pages are unarchived, and many of those that are archived just display a block notice because of unusual activity (e.g.
web.archive.org/web/20230909224447/https://rateyourmusic.com/~Fooftilly). Their image CDN isn't blocked though.
-
flashfire42
Well with 2 new trackers coming up I may switch to AT Choice when I head to work today
-
Peroniko
Which new ones are coming up?
-
imer
Peroniko: #zowch and one under #webroasting for orange