-
h2ibot
Pokechu22 edited List of website hosts (+28, /* F */ freeservers also has 8m.net):
wiki.archiveteam.org/?diff=51495&oldid=51494
-
h2ibot
Pokechu22 edited List of website hosts (+55, /* 0-9 */ 20m.com):
wiki.archiveteam.org/?diff=51496&oldid=51495
-
fireonlive
I think what is more surprising is that angel fire is still around
-
h2ibot
FireonLive edited YouTube (+166, extend IP warning to the wiki as well):
wiki.archiveteam.org/?diff=51497&oldid=51413
-
fireonlive
the wording on that sucks musty asshole, but i figured i'd add it before i forgot
-
h2ibot
FireonLive edited YouTube (-14, no more ChromeBot):
wiki.archiveteam.org/?diff=51498&oldid=51497
-
fireonlive
(the whole page needs a good deep loving but i'm too tired)
-
Pedrosso
I had asked about this in december but the end-of-year shutdowns had to come first. Should there be another DPoS of furaffinity? I don't doubt the site's stability but a lot of user content is still regularly deleted much like reddit, and the original grab was way back in 2015
-
fireonlive
i believe there was general support for that last time; could even be a continuous thing :3
-
fireonlive
cc arkiver too
-
arkiver
hi
-
arkiver
Pedrosso: can you make data deletion clear in some way?
-
arkiver
for people not familiar to understand the significance/scale of it
-
audrooku|m
hey, I have 145 small sites I would like to request be crawled, ideally some of them should have their domain crawled, and some of them I would like to crawl all children of the given path (recursively but only matching the parent path, usually because it is a subdomain or specific blog on a site), is this possible?
-
project10
JAA: do you know the bloom filter properties (capacity, error rate, bits per hash) for the tracker bloom implementation?
-
thuban
audrooku|m: it's possible. why should they be archived? also, are any of the sites likely to link to one another?
-
arkiver
project10: 1/1000000 false positive rate
-
arkiver
or well, that is the maximum false positive rate
-
arkiver
the bloom filter expands to keep a maximum of 1/1000000 (1 over a million) false positive rate
-
project10
that's useful to know! I think that in combination with the (maximum?) set size defines bits_per_hash
-
arkiver
expand here means double in size
-
project10
oh, it's a Scalable Bloom Filter?
-
arkiver
yes
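
A minimal sketch of the idea discussed above, assuming the general scalable Bloom filter design rather than the tracker's actual code (which is not shown in this log). The class names, the initial capacity, and the tightening ratio of 0.5 are illustrative assumptions; only the 1/1000000 ceiling and the doubling on expansion come from the conversation. Each new stage doubles its capacity and gets a tighter error budget so that the sum over all stages stays below the ceiling.

```python
import hashlib
import math


class BloomStage:
    """One fixed-size Bloom filter sized for `capacity` items at `error` rate."""

    def __init__(self, capacity: int, error: float):
        self.capacity = capacity
        self.count = 0
        # Classic sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = math.ceil(-capacity * math.log(error) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class ScalableBloom:
    """Chain of stages; a new, twice-as-large stage is added when the last one fills."""

    def __init__(self, initial_capacity: int = 1000, max_error: float = 1e-6,
                 tightening: float = 0.5):
        self.max_error = max_error
        self.tightening = tightening
        # Stage i gets error max_error * (1 - r) * r**i, which sums to < max_error.
        self.stages = [BloomStage(initial_capacity, max_error * (1 - tightening))]

    def __contains__(self, item: bytes) -> bool:
        return any(item in stage for stage in self.stages)

    def add(self, item: bytes) -> bool:
        """Add item; return True if it was (probably) already present."""
        if item in self:
            return True
        last = self.stages[-1]
        if last.count >= last.capacity:
            error = (self.max_error * (1 - self.tightening)
                     * self.tightening ** len(self.stages))
            last = BloomStage(last.capacity * 2, error)
            self.stages.append(last)
        last.add(item)
        return False


sbf = ScalableBloom()
for n in range(5000):
    sbf.add(f"item-{n}".encode())
```

Old stages are never discarded, since lookups have to consult all of them, which matches the "grow and grow" behaviour mentioned a few messages later.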
-
project10
thanks arkiver! very helpful
-
arkiver
:)
-
arkiver
it's been working pretty well for the projects
-
arkiver
but it does grow and grow... and grow
-
project10
now I'm halfway curious as to the size of the 69.5B #// bloom filter :)
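
As a rough back-of-the-envelope check on that question, the textbook sizing formula for a single, optimally configured Bloom filter can be evaluated directly. This is only a sketch: it assumes one classic filter at exactly the 1/1000000 rate, so it is a lower bound, and a scalable filter with tighter per-stage error budgets would be larger. The function names are hypothetical and the result is not a measurement of the real tracker.

```python
import math


def bloom_bits_per_item(false_positive_rate: float) -> float:
    """Bits per stored item for an optimally sized classic Bloom filter."""
    return -math.log(false_positive_rate) / math.log(2) ** 2


def bloom_size_bytes(num_items: float, false_positive_rate: float) -> int:
    """Total size in bytes for num_items at the given false positive rate."""
    return math.ceil(num_items * bloom_bits_per_item(false_positive_rate) / 8)


bits_per_item = bloom_bits_per_item(1e-6)
size_bytes = bloom_size_bytes(69.5e9, 1e-6)
print(f"{bits_per_item:.2f} bits/item, about {size_bytes / 1e9:.0f} GB")
```

Under those assumptions this works out to a little under 29 bits per item, or roughly 250 GB for 69.5 billion items.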
-
audrooku|m
thuban: thanks for the response, I will submit a proper request soon with that information (including requested justification), and it's possible that they will link to each other, yes
-
audrooku|m
very likely
-
Pedrosso
arkiver: It's hard to approximate since FA gives an HTTP code of 200 even on missing items. However, they use a simple enumeration system, furaffinity.net/view/*, and going through that one can find that a lot of posts are missing. I can't give a complete estimate though. I'm suggesting the DPoS because I have found links I've sent to
-
Pedrosso
people stop working as content is deleted, although if only a portion
-
thuban
audrooku|m: ah, that makes it a bit more complicated--they would have to be in separate archivebot jobs
-
thuban
(so either someone manually feeds in all 145, or an op sets up a queue)
-
OrIdow6^2
On Furaffinity, range samples like that may be complicated if there are spam removals in there
-
Pedrosso
That's true,
-
Pedrosso
So it's difficult to get an approximation there
-
Pedrosso
There are a lot of niche topics on FA, making the deletion of one artist's work possibly lead to an entire topic (or at least the majority of it) being nuked, notably unique characters and settings that aren't explored in many other media. An example would be the niche topic of multis (characters with more (usually)limbs than a human & how those
-
Pedrosso
creatures would live & interact) which itself has many tiny interesting subgroups
-
h2ibot
OrIdow6 edited FurAffinity (+81, /* Archives */ furarchiver.net):
wiki.archiveteam.org/?diff=51499&oldid=49859
-
nicolas17
arkiver: are your pings working now? :P
-
fireonlive
seems to be
-
arkiver
nicolas17: they are working
-
arkiver
nicolas17: you mean on what Pedrosso wrote?
-
nicolas17
(a few days ago you said they weren't, and I pinged you on something else and got no response, but I don't know your uhh effective timezone)
-
arkiver
nicolas17: did i miss something?
-
arkiver
i believe they are working
-
nicolas17
-
arkiver
nicolas17: oh yeah the 800 GB of open source data rigt?
-
arkiver
right*
-
nicolas17
I don't think it can go to WBM in any way because requests use POST and a one-time token, so it will have to be an IA item
-
arkiver
i believe you estimated 5k files, the 800 GB could go into a single item, but not sure if that is the nicest thing to do
-
arkiver
2448 results i see
-
arkiver
nicolas17: shall we do one item for each result?
-
nicolas17
"-" is not a comprehensive search string :)
-
arkiver
do i see a 27 MB NOTICE.html file? :P
-
nicolas17
yes
-
arkiver
nicolas17: if we do one item for each result, how many items would we end up with roughly?
-
nicolas17
I used sequential numbers on e.g. opensource.samsung.com/downAnnMPop?uploadId=11931 to get number and size of files
-
nicolas17
and it seemed there's ~2500
-
arkiver
shall we do one item per result?
-
nicolas17
yeah I think that would work
-
fireonlive
would probably be cleanest
-
nicolas17
but then each item needs a decent description
-
nicolas17
I'm not currently looking at search results at all
-
arkiver
nicolas17: i'd do a link to where you got it from, combined with any descriptive information you can find
-
arkiver
there doesn't seem to be a whole lot of descriptive information
-
nicolas17
but I'll have to, as I think the device model and all that is only on the search result row, not in downSrcMPop/downAnnMPop or in the files themselves
-
arkiver
the "announcement" is just the multi-MB NOTICE.html
-
nicolas17
I downloaded *all* the NOTICEs
-
arkiver
is there information in those perhaps for what you look for?
-
nicolas17
and "tar | zstd -19" compresses them to like 1%
-
arkiver
nice
-
nicolas17
1. it's text, 2. there's redundant data in each file, 3. there's significant redundant data *across* files
-
nicolas17
if most or all the html files have an entire copy of the GPL... yeah :P
-
arkiver
i guess each item would have the corresponding NOTICE.html and the files belonging to the ID?
-
arkiver
you don't have to compress the NOTICE.html files for upload to IA - the total size is still not shockingly huge, and not compressing will make possible use of it easier
-
arkiver
use directly on IA for example - by loading the NOTICE.html
-
nicolas17
yeah, and there's a few with multiple source files or multiple ann files
-
arkiver
yeah in that case the item would have multiple
-
arkiver
Pedrosso: looking into it now
-
arkiver
Pedrosso: i read 18+ content is only available with login
-
arkiver
is most of the continuously deleted data behind a login wall?
-
nicolas17
compressing each individual NOTICE would not give 99% savings anyway (that's only when compressing the whole tar at once)
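
The effect described here (redundancy across files only pays off when they are compressed as one stream) can be illustrated with a small sketch. Everything in it is an assumption for illustration: the data is synthetic rather than real NOTICE files, and Python's standard `lzma` module stands in for `zstd -19`, since the point is only the difference between per-file and whole-archive compression.

```python
import hashlib
import lzma


def pseudo_random_bytes(seed: bytes, length: int) -> bytes:
    """Deterministic, poorly compressible filler built from a SHA-256 chain."""
    out = bytearray()
    block = seed
    while len(out) < length:
        block = hashlib.sha256(block).digest()
        out += block
    return bytes(out[:length])


# Ten synthetic "notices": one large shared block plus a short unique tail,
# loosely mimicking many files that each embed the same licence text.
shared_text = pseudo_random_bytes(b"shared licence text", 20_000)
documents = [shared_text + f"unique notice {i}".encode() for i in range(10)]

# Compressing each file on its own cannot exploit what the files have in common.
separate_total = sum(len(lzma.compress(doc)) for doc in documents)

# Compressing the concatenation lets later copies refer back to the first one.
combined_total = len(lzma.compress(b"".join(documents)))

print(f"separate: {separate_total} bytes, combined: {combined_total} bytes")
```

Real HTML also compresses well within a single file, so actual ratios would differ; the sketch only isolates the cross-file part of the saving.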
-
nicolas17
so yeah, compression is not worth the annoyance for use
-
nicolas17
funny thing, with all the problems of IA upload sometimes being slow, plus my Internet connection (like most residential ISPs) having much faster download than upload... it will *still* be slower to download from samsung than to upload to IA :D
-
arkiver
ouch :P
-
nicolas17
I don't know if it's awful routing or intentional throttling but I get ~200KB/s
-
nicolas17
parallelism helps
-
DigitalDragons
arkiver: yes, 18+ content needs an account
-
DigitalDragons
it will be silently hidden from everywhere if you aren't logged in, or present an error if you try to view a direct link to nsfw content
-
nicolas17
okay will bikeshed specifics tomorrow
-
OrIdow6^2
Also some stuff is gated behind a login but not 18+, just because the user elects to make it so
-
OrIdow6^2
I would like at one point to gather the links from there and similar sites that have uploads but don't support all types, lots of stuff in Dropbox/Google Drive/similar
-
OrIdow6^2
If there does end up being a proactive ongoing project could be good for that, Dropbox in particular seems to have changed their URL format a few years ago and broken a bunch of downloads
-
fireonlive
ooh good idea; should start collecting google drive/dropbox links too
-
OrIdow6^2
The dropbox links I encounter can be nicely saved by changing the GET parameter "dl" from "0" to "1", but sadly the domain or relevant path prefix is excluded from the WBM
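
The parameter change described above could be sketched like this. The function name and the example URLs are hypothetical; only the `dl` parameter and its `0`/`1` values come from the message, and whether a given link still resolves is a separate question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_dropbox_download(url: str) -> str:
    """Return the URL with its `dl` query parameter set to 1 (added if absent)."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if any(key == "dl" for key, _ in pairs):
        # Keep the parameter order, only swap the value.
        pairs = [(key, "1" if key == "dl" else value) for key, value in pairs]
    else:
        pairs.append(("dl", "1"))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


# Hypothetical share link, for illustration only.
print(force_dropbox_download("https://www.dropbox.com/s/abc123/file.zip?dl=0"))
```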
-
Pedrosso
arkiver: "is most of the continuously deleted data behind a login wall?" I don't know about 18+ content, I don't view such. But I know that some users' content is only available logged-in regardless of its sfw/nsfw status.
-
Pedrosso
I also know the old 2015 grab included 18+ content
-
Pedrosso
I'd assume that the majority of content which is not 18+ is publicly accessible
-
fireonlive
>sadly the domain or relevant path prefix is excluded from the WBM
-
fireonlive
:(
-
Barto
fireonlive: video playback seems to be broken in nitter, a fix was upstreamed, i'm gonna deploy zedeus/nitter 52db03b shortly
-
Barto
fireonlive: deployed
-
fireonlive
Barto: :D thanks
-
audrooku|m
-
thuban
audrooku|m: thanks!
-
audrooku|m
based on what I read on the archivebot wiki page the default behavior is what I desire for both url lists (only crawl pages that match the seed url)
-
thuban
correct
-
thuban
those aren't bad, actually; my understanding (and i just went and double-checked the code) is that the six -children items will need their own jobs, but -domains can be one big `!a <` (since you want the entirety of each site)
-
audrooku|m
is this something I need to do or an OP needs to do?
-
thuban
*seven
-
thuban
audrooku|m: only voiced users can submit jobs, and only ops can run `!a <` jobs or set up queues (not sure which one would be preferred here)
-
audrooku|m
Thuban: Alright thanks
-
thuban
someone will probably get to it in the next day or two; thanks for the suggestions :)
-
betamax
I am very much regretting not uploading my archive of 2022 US midterm campaign sites sooner....
-
betamax
Pulled out the drive today to check something else on it and it no longer powers up, and there's damage to the PCB
-
betamax
it's.... not going to work again :(
-
Pedrosso
:(
-
Pedrosso
There's nothing that can be done?
-
nicolas17
betamax: is it a magnetic hard disk?
-
betamax
yup, 1TB magnetic disk
-
betamax
I can see a chip out of the PCB, no idea how that happened
-
nicolas17
that's fixable
-
nicolas17
maybe only by professional data recovery companies but fixable
-
betamax
yeah, but at what cost
-
betamax
I'm going to label the HDD with what was on it and what is wrong with it, and put it into a box
-
nicolas17
also
-
betamax
then if someone in the future really wants to know what a campaign website looked like (and it's not on wayback) I can revisit it
-
betamax
(thankfully I have a full list of the sites that were on it)
-
nicolas17
if a disk head gets damaged, I would *want* professional data recovery to open it in a professional cleanroom to replace it (I have had to deal with that before)
-
nicolas17
but PCB damage is more accessible to DIY
-
nicolas17
not necessarily by yourself but like, hardware-nerd friends
-
betamax
it's not something I have the time or skills for now, but I'll keep the drive around just in case
-
nicolas17
yeah
-
nicolas17
it doesn't have to be here and now, I was just giving optimism to the "not going to work again" :)
-
betamax
thanks!
-
betamax
It'll go with the other dead HDD of lost material :\
-
nicolas17
JAA: yes we know about Hobbes... but is anything being done about it?
-
nicolas17
oh there's an 18GB tar
-
nicolas17
(why didn't they use bittorrent...)
-
JAA
nicolas17: According to Jason, there are also already multiple copies of Hobbes.
-
fireonlive
there was also an AB job started
-
h2ibot
JustAnotherArchivist edited Deathwatch (+255, /* 2024 */ Add RuneScape forums):
wiki.archiveteam.org/?diff=51500&oldid=51491
-
Vokun
Is the RuneScape forum more suitable for AB or DPoS?
-
betamax
On the subject of broken drives (losing the drive earlier has made me realise I need to sort out my "old HDD pile" ASAP :D )
-
betamax
I have a USB 1TB HDD that started giving me I/O errors. As soon as that happened I disconnected it and put it in the "deal with it later" pile
-
fireonlive
you have some bad luck!
-
betamax
that drive was *ancient*, I'm just bad at upgrading ("oh, no, this 12-year-old drive is fine for <critical thing>")
-
fireonlive
ah :)
-
betamax
Is the best way to try and recover it, to (1) connect it but not mount it, (2) use dd / ddrescue to make an image of it?
-
betamax
Or is there a better way? (excluding paying for recovery, which I don't want to do yet)
-
JAA
Yeah, ddrescue is what I'd try, I think.
-
JAA
Or retrieve it from the backups. ;-)
-
fireonlive
back-what?
-
fireonlive
and yeah ddrescue
-
fireonlive
i saw a post on reddit where someone had a HDD where they accidentally some data.. and used recovery software... to put the deleted data back onto the HDD where they deleted the data.
-
fireonlive
:|
-
fireonlive
not like 'ok i got what i can let's move it back' but 'let's find from and then write to the same drive'
-
JAA
Yeah, great idea.
-
fireonlive
it's what data recovery companies recommend xP
-
nicolas17
/o\
-
Terbium
re: RuneScape forums, another forum gets swallowed by Discord
-
JAA
No no, Discord, Reddit and Twitter!!1!
-
nulldata
Totally sad news everyone, GameStop is ending their NFT marketplace. nft.gamestop.com lounge.nulldata.foo/uploads/253d2886be251e91/image.png
-
h2ibot
Nulldata edited Deathwatch (+257, /* 2024 */ Added GameStop NFT Marketplace):
wiki.archiveteam.org/?diff=51501&oldid=51500
-
lumidify
betamax: (re: ddrescue) make sure to use the mapfile option (optional third argument) - that lets you pause the recovery or restart it later with other options.
-
betamax
lumidify: just started it now, and yup, am doing the mapfile (it was the first thing the man page said!)
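
A common way to use the mapfile is a two-pass run: a quick first pass that skips the slow scraping of bad areas, then a retry pass that resumes from the same mapfile. The sketch below only builds the command lines and does not run anything. The device path, file names, and retry count are placeholders, and the two-pass order is a widely used convention rather than something stated in this conversation; `-n`, `-d` and `-r` are GNU ddrescue options.

```python
import shlex


def ddrescue_commands(device: str, image: str, mapfile: str,
                      retries: int = 3) -> list[list[str]]:
    """Build argv lists for a two-pass GNU ddrescue run sharing one mapfile."""
    return [
        # Pass 1: copy everything that reads easily, skip scraping bad areas.
        ["ddrescue", "-n", device, image, mapfile],
        # Pass 2: direct disc access, retry the areas the mapfile marks as bad.
        ["ddrescue", "-d", f"-r{retries}", device, image, mapfile],
    ]


# /dev/sdX is a placeholder; the real (unmounted) device name must be checked first.
commands = ddrescue_commands("/dev/sdX", "old_drive.img", "old_drive.map")
for argv in commands:
    print(shlex.join(argv))
```

Because both passes name the same mapfile, the run can be interrupted and resumed without re-reading areas that were already recovered.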
-
betamax
Continuing my broken drives / backups questions:
-
betamax
I'm (finally!) getting round to building a "proper" data storage system (rather than just a pile of HDDs on a shelf)
-
betamax
Being a cheapskate, my thought is to get 3x used 8TB HDDs off ebay (but from business sellers with 1 year warranty) and create a Truenas mirrored setup
-
betamax
Main questions: (1) Is there any real danger in going with used drives if I have it in a 3 drive mirror?
-
betamax
(2) Is 3-drive mirror significantly better than 2 drive (there's a couple of posts online from people saying that in a 2 drive mirror, if one drive fails the strain on copying everything to the replacement drive can cause the remaining one to fail - really?!)
-
betamax
I initially thought I'd get 2x new drives but in my mind 3x used is better redundancy, and if it turns out I buy duds that fail quickly the worst it's done is cause me extra expense - not put the data at risk
-
audrooku|m
Betamax: I don't think the seller's one year warranties are binding in any way, so you're gambling that they will honor it
-
betamax
fair, but I'm looking at business sellers with 99% feedback and 10k+ items sold, and only ones with 5+ years on ebay
-
JAA
Yes, such cascading failures are a significant risk, especially if the drives you have are all of similar age or, worse, from the same batch.
-
lumidify
Just my 2c: I would never trust data that's only stored on two drives to be safe. Three copies are the bare minimum for important data (although two copies are sadly still better than what most people have).
-
JAA
I do two copies for most data I'd like to keep, three copies for anything I really care about.
-
betamax
Yeah, 3 drives feels a lot safer to me than 2
-
betamax
JAA: I assume then that I am still at risk of cascading failures if I buy 3x drives from the same seller then?
-
betamax
(given they've probably been pulled from the same system)
-
JAA
Potentially, yeah.
-
JAA
Personally, I mix HDD manufacturers, too. Or at least models.
-
JAA
So, if I have the choice, one copy on WD drives and one copy on Seagate drives.
-
JAA
No experience with Toshiba, but I'll consider them on the next purchase.
-
betamax
The seller has a mix of manufacturers, I was planning on asking them to provide a mix if possible
-
betamax
but I guess cascading failures are still a risk if the drives came from the same NAS / CCTV setup / etc.
-
betamax
The obvious fix is to buy new, I may compromise that by buying used from two separate sellers (and having 3x drives)
-
h2ibot
Blankie edited List of websites excluded from the Wayback Machine (+28, Add pendantaudio.com/):
wiki.archiveteam.org/?diff=51502&oldid=51490
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=51503&oldid=51502
-
h2ibot
Nulldata edited Deathwatch (+241, /* Pining for the Fjords (Dying) */ Added…):
wiki.archiveteam.org/?diff=51504&oldid=51501
-
h2ibot
Nulldata edited Deathwatch (-3, /* 2024 */ Correct Artifact URL):
wiki.archiveteam.org/?diff=51505&oldid=51504
-
nulldata
Doesn't appear to be a way into Artifact from a browser - all previous links just redirect to the notice. The app still works, however, at the moment I can't find any of the 'social' features.
-
nulldata
The AI summary function still works...
-
nulldata
-
nulldata
Can someone please throw mosaic.co into AB? See -ot: assets were bought and all employees let go