-
pabs
re mastodon, here is a command-line client that requires no JS and dumps info to the terminal
github.com/jwilk/zygolophodon
-
pabs
probably could take the guts of it to make an archiver for mastodon
-
rktk
"transgender phenomenon" it's not a phenomenon lmao
-
rktk
is this stuff that's being archived? I feel like why not let that content die?
-
SketchCow
Want to have that debate now? It is -bs
-
SketchCow
Have to warn you, though, no matter what you say, I'm doing it anyway.
-
rktk
debate on transgender being a phenomenon or archiving content like that
-
SketchCow
Archiving content.
-
flashfire42
I feel it's important to have that stuff archived for future historians and the like. But I personally will not be going out of my way to archive it. But I agree it's important to do so.
-
rktk
Indeed, I didn't consider the historical research perspective
-
nstrom|m
those who forget history are doomed to repeat it 🤷
-
flashfire42
I am now just scared at the absolute fucking wall of text sketchcow must be typing
-
qyxojzh|m
<nstrom|m> "those who forget history are..." <- It's about what _to_ do and what _not_ to do =P
-
qyxojzh|m
Gotta have both
-
rktk
thankfully i am going to bed but I am on znc so.
-
rktk
SketchCow, feel free to PM me if you want instead :P
-
arkiver
we're not letting any content die
-
arkiver
as with any site, if it is shutting down, an effort can be made in #archivebot or a custom project to archive that website
-
arkiver
(or webpage/etc.)
-
h2ibot
Yts98 edited Mobile Phone Applications (+186, Clean up dead links, add APKPure):
wiki.archiveteam.org/?diff=50714&oldid=48489
-
flashfire42
Windows Subsystem for Android may come in handy for archival of Android stuff
-
h2ibot
Yts98 edited Mobile Phone Applications (+50):
wiki.archiveteam.org/?diff=50715&oldid=50714
-
h2ibot
Yts98 edited BlackBerry World (+596, Update project status):
wiki.archiveteam.org/?diff=50716&oldid=46619
-
fireonlive
RIP
-
pabs
-
pabs
"Gizmodo’s owner shuts down Spanish language site in favor of AI translations"
-
pabs
seems the site stays up but every article is Google-translated (poorly)
-
nicolas17
w t f
-
pabs
I should have said every *new* article
-
nicolas17
bad human translations are better because bad grammar can be *noticed*
-
nicolas17
the better the AI translation the harder it is to notice when it screws up :/
-
» pabs wonders whether to archive the site or not...
-
fireonlive
wat
-
fireonlive
pabs: might be worthy to grab the old articles i guess...
-
fireonlive
in case they just scrap the thing altogether in the future
-
pabs
-
fireonlive
was thrown in DTT
-
fireonlive
website failed in AB due to how overloaded/slow it is
-
fireonlive
old.reddit.com/r/DataHoarder/comments/169q4cp claims 'multiple copies have been made' as well
-
nicolas17
4900 videos, seems most are not too long though
-
pabs
hmm, the site has a ton of subdomains
-
pabs
pretty broken in a browser
-
nicolas17
my yahoo videos indexing seems to be going well
-
fireonlive
pabs: wonder if it's relying on www.* for styles/js?
-
nicolas17
a .tar download from IA has been going for 26 continuous hours without getting interrupted, amazing
-
pabs
www. seems broken indeed
-
h2ibot
Yts98 edited Duelyst (+40, Clean up formatting):
wiki.archiveteam.org/?diff=50717&oldid=46610
-
fireonlive
response times for things from there seem to take multiple seconds at times
-
pabs
in AB I just got a 403 and then a timeout :(
-
pabs
hmm, in the browser all the subdomains are the same but different to the main site
-
pabs
-
h2ibot
Yts98 edited URLTeam/Dead (-82, untiny.me (untiny.com) is an unshortner, not a…):
wiki.archiveteam.org/?diff=50718&oldid=50695
-
h2ibot
Yts98 edited URLTeam (+597, Filter out dead table rows):
wiki.archiveteam.org/?diff=50719&oldid=50694
-
fireonlive
yts98: nice work lately :)
-
yts98
;)
-
fireonlive
;o
-
that_lurker
-
qwertyasdfuiopghjkl
I found this url shortener + pastebin site
fars.ee that has a notice saying "All data may be erased without notifications". Maybe something that should be archived? IDs are 1 to 4 characters long, a-zA-Z0-9_- so might be possible to just !ao < a list of all possible combinations.
-
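The brute-force keyspace qwertyasdfuiopghjkl describes (IDs 1 to 4 characters from a-zA-Z0-9_-) can be sketched like this; `all_ids` is a hypothetical helper for generating an !ao input list, not an existing ArchiveTeam tool:

```python
import itertools
import string

# The 64-character alphabet stated in the chat: a-z, A-Z, 0-9, _ and -
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits + "_-"

def all_ids(max_len=4):
    """Yield every possible ID of length 1..max_len over ALPHABET."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)

# Total URLs to queue: 64 + 64^2 + 64^3 + 64^4 = 17,043,520
total = sum(len(ALPHABET) ** n for n in range(1, 5))
```

At ~17 million candidate URLs, most of which will 404, this is the kind of list that would need discussion before feeding it to a bot.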
nicolas17
download interrupted for 46GiB .tar.bz2 from yahoo videos :( at least the others seem to be still going well
-
arkiver
nicolas17: Wget allows resuming downloads
-
arkiver
and IA allows download a range of bytes
-
arkiver
downloading*
-
nicolas17
arkiver: I'm using "curl | tar tv", but I'll switch to wget which I *think* will auto-resume on errors even when outputting to stdout
-
arkiver
right, within the same Wget process it might
-
fireonlive
would be interesting to know
-
nicolas17
for this one I can't "wget -c" now because there's no file to resume from
-
nicolas17
I think I like the wget progress indicator better anyway :p
-
fireonlive
me toooooo
-
fireonlive
wish i could shove that into curl
-
nicolas17
there is "curl --progress-bar" to get a... bar, but it only shows % and no throughput or MiB or ETA
-
fireonlive
:(
-
nicolas17
Length: 484259014419 (451G) [application/octet-stream]
-
nicolas17
yikes
-
fireonlive
ooof lol
-
fireonlive
that’ll take a second
-
nicolas17
110.20M 907KB/s eta 6d 14h
-
» FireFly . o O ( if it takes a second that's some _very_ impressive downlink (and uplink) :p )
-
arkiver
you could split it up in pieces and download concurrently using the bytes ranges
-
nicolas17
arkiver: that would require 451GB of disk space ;) I'm piping into tar -tv
-
fireonlive
FireFly: :P
-
nicolas17
it could be interesting to make a tool that does multithreaded downloads on smaller chunks and outputs them to stdout as they finish
-
FireFly
with some kind of container so they can be rearranged and reassembled afterward? (or synchronisation to make sure they're output in-order?)
-
nicolas17
yeah internal buffering and output in order
-
FireFly
yeah could be interesting
-
JAA
Basically a reverse ia-upload-stream. It does exactly that, just in the other direction (reading from stdin, uploading in chunks in parallel in order).
-
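The chunked, in-order downloader nicolas17 and FireFly describe can be sketched roughly as below. This is a minimal illustration, not ia-upload-stream's actual code: the chunk size is an arbitrary assumption and `fetch_range` is a hypothetical helper using plain urllib Range requests.

```python
import concurrent.futures
import urllib.request

CHUNK = 8 * 1024 * 1024  # assumed chunk size: 8 MiB per range request

def fetch_range(url, start, end):
    """Download bytes [start, end] (inclusive) via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def download_ordered(url, size, out, fetch=fetch_range, workers=4):
    """Fetch chunks in parallel but write them to `out` strictly in order."""
    ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url, s, e) for s, e in ranges]
        # Iterating in submission order gives in-order output; chunks that
        # finish early simply sit buffered in their Future until it's their turn.
        for fut in futures:
            out.write(fut.result())
```

Piping `out` to `tar -tv` would then give the parallel-download-to-stdout behaviour discussed above, at the cost of buffering up to `workers` chunks in memory.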
nicolas17
JAA: you mentioned chunked uploads to IA have drawbacks, right?
-
JAA
Yeah, the processing on IA's side is inefficient.
-
JAA
It copies the chunks to the backup server, then assembles them in a separate task and copies the assembled file over again.
-
JAA
And because that's always a separate task, snowballing doesn't work well for uploading multiple files to the same item.
-
JAA
Neither of these should affect chunked downloads, of course.
-
nicolas17
hm
-
nicolas17
if I upload a file to an existing item, it goes to the same server as the other files, right?
-
JAA
Yes
-
JAA
An entire item is always on a single server (plus its mirror at the other facility). Even on a single disk in that server, I think.
-
JAA
Or well, single FS at least.
-
nicolas17
what happens if I upload multiple files, in parallel, targeting the same non-existent item name? will they end up in the same server? when is the server "assigned" to an item?
-
JAA
Maybe there's some RAID in place, no idea.
-
JAA
You can't. All but one upload will fail with some weird error message IIRC.
-
nicolas17
well that's good to know
-
nicolas17
I have uploaded multiple files in parallel to get better speed
-
nicolas17
which worked great
-
nicolas17
good to know I shouldn't do that for the *first* file...
-
JAA
Yeah, at least one upload has to be done individually, afterwards you can go parallel, even if the archive.php task for that first upload hasn't run yet I believe.
-
JAA
But it's also worth mentioning that IA generally discourages parallel uploads to individual items.
-
nicolas17
why?
-
nicolas17
too much load on a single server?
-
JAA
I believe it has to do with task limits. I.e. you're more likely to run into rate limiting errors.
-
nicolas17
ah
-
nicolas17
that may be more relevant if it was hundreds of files I guess?
-
nicolas17
I was uploading like, 5 files, each of them >1GB
-
JAA
Yeah, or at least several dozen.
-
nicolas17
with IA's current ingestion problems I would probably upload one at a time and let it take as long as it wants tho... don't add to the problem ^^
-
JAA
Agreed
-
nicolas17
even if I have 3Mbps upstream
-
fireonlive
in theory, would it be possible to, say, take all (wiki)pages that have an infobox and pull attributes from them onto a page? e.g. 'every infobox's IRC channel'
-
JAA
Anything is possible!
-
JAA
Retrieving the page contents is easy enough. But then you have to parse MediaWiki syntax probably...
-
fireonlive
the eldritch horror of mediawiki :D
-
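The infobox-scraping idea above usually splits into two parts: fetching raw wikitext via the MediaWiki API (`action=query&prop=revisions&rvprop=content`), then parsing out the template parameter. A deliberately naive sketch of the parsing half, with a made-up sample page (real template syntax needs a proper parser such as mwparserfromhell):

```python
import re

def infobox_param(wikitext, param):
    """Naively pull one infobox parameter value out of raw wikitext.

    This only handles the simple `| name = value` one-per-line case;
    nested templates, multi-line values etc. will defeat it.
    """
    m = re.search(r"\|\s*" + re.escape(param) + r"\s*=\s*([^\n|]*)", wikitext)
    return m.group(1).strip() if m else None

# Hypothetical wikitext, shaped like a typical wiki infobox
sample = (
    "{{Infobox project\n"
    "| URL = http://example.com\n"
    "| IRC channel = #archiveteam-bs\n"
    "}}"
)
```

Running this over every page in an infobox category would give the "every infobox's IRC channel" listing fireonlive asked about.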
fireonlive
oh! i meant to ping you JAA - there's a page I can't quite edit:
wiki.archiveteam.org/index.php?title=Main_Page&action=edit
-
fireonlive
"Monday, Nov. 09, 2009"
-
fireonlive
no rushy :3
-
JAA
Yup, protected page.
-
fireonlive
ye, bad phrasing
-
fireonlive
could you please datetime-ify that for me =]
-
JAA
Ah
-
h2ibot
JustAnotherArchivist edited Main Page (+10, Datetimeify):
wiki.archiveteam.org/?diff=50720&oldid=48497
-
fireonlive
:D thanks
-
nicolas17
sooo any news on IA ingestion?
-
JAA
Frame 6 of 6
-
nicolas17
x_x I know what you're referencing
-
nicolas17
I guess temp storage already has its hat on fire too
-
flashfire42|m
Most likely
-
arkiver
nicolas17: there is internal progress on resolving it. it's not resolved yet
-
flashfire42|m
This is archiveteam, what else do you expect
-
arkiver
i don't have an ETA
-
arkiver
i'm hoping within a month... but i don't know, i'm not the one handling this
-
nicolas17
that's fine
-
nicolas17
I'm not like "what's taking so long?!", more like "by any chance did I miss news while I was offline?"
-
nicolas17
do we have a month worth of temp storage? x_x
-
JAA
Depends on if any urgent projects come up, probably. Without, we're not far away with the throttled projects. But getting #shreddit up again would be good and would change the equation.
-
flashfire42|m
Looks like #zowch isn’t happening then…..
-
arkiver
it is happening
-
arkiver
#zowch ^
-
arkiver
nicolas17: completely understood! i had no negative (or necessarily positive) reading of your question
-
arkiver
just a question, i answered :P
-
arkiver
(reading back - may have come off as annoyed/harsh? not my intention)
-
nicolas17
a few days ago I complained because people saw #telegrab was idle and started adding items "to keep things busy"
-
JAA
flashfire42|m: ZOWA is small enough to not be an issue.
-
arkiver
nicolas17: if you have large lists of channels, feel free to pass them to me
-
flashfire42|m
Yeah that was my fault partially and I apologise for that Nicolas17
-
arkiver
best is in the format of channel:CHANNELNAME lines in a file, then I can queue it directly
-
nicolas17
flashfire42|m: yeah don't worry, you already apologized at the time
-
nicolas17
I was also a bit unsure if it's *really* such a big deal to add large channels like that
-
nicolas17
like *how* conservative should we be adding items? are we in "emergency mode - low capacity - only archive if it's truly at risk", or is that too extreme? :p
-
arkiver
the rules were much more relaxed
-
arkiver
then a ton of big news channels were put in, we added like 50 TB of youtube to IA for a few days
-
arkiver
i noticed too late
-
nicolas17
arkiver: oh I was talking about telegram, a few days ago
-
arkiver
and we got in trouble at IA. we're known as a trusted, responsible organisation, and dumping 100s of TBs of youtube content that will likely not be deleted any time soon into IA does not fit the "responsible" flag we have
-
arkiver
oh
-
arkiver
uh
-
flashfire42
Telegram items are a lot smaller but there were people including myself throwing in "busywork" so to speak. Stuff that could be useful but like also probably not super important. A few million crypto faucet items
-
flashfire42
I think nicolas17 is trying to ask what limits should be in place for that with the limited storage
-
flashfire42
I won't be adding any more except for the ones I scrape off the wiki because we are at like 40 million to do as of right now.
-
nicolas17
arkiver: I was like "is that channel actually important to archive? or are we adding stuff just to keep workers busy? I don't think we need to 'stay busy' while we have limited capacity"
-
nicolas17
but it's not really my place to judge that if I don't even know how much capacity we have or how long it will take for IA issues to resolve
-
arkiver
i just need to do those checks for reddit and we can restart
-
nicolas17
also, if I ask "is that item actually important" it's probably a genuine question and not judging that it's not important, maybe it is :)
-
flashfire42
Because yeah when we have the free flow telegram is a free for all but is it needed to be more selective right now for that project
-
nicolas17
my stats on telegram say: avg item 1.7MB, success rate 50.1% (completed ÷ dequeued), estimated data remaining in queue 34TB
-
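nicolas17's 34 TB figure lines up with the ~40 million queued items flashfire42 mentioned earlier; a back-of-envelope check, assuming those chat numbers are accurate:

```python
queue_items = 40_000_000   # flashfire42's rough to-do count
avg_item_mb = 1.7          # average item size from nicolas17's stats
success_rate = 0.501       # completed / dequeued; failures add no data

# Only successful items contribute data; 1 TB taken as 1e6 MB here.
remaining_tb = queue_items * success_rate * avg_item_mb / 1_000_000
# ≈ 34 TB, matching the in-channel estimate
```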
arkiver
not bad
-
JAA
We have around 180 TiB of remaining offload capacity currently.
-
nicolas17
it's very hard to give an "ETA for queue empty" because it seems we're hitting target "max connections (-1)" errors, so the speed goes up and down a lot
-
nicolas17
imgur has so many items failing that I estimate like 1TB left lol
-
fireonlive
rip the i.imgur.com refuge
-
nicolas17
I started a telegram worker using a ramdisk for data, and it's now on request 5357 with 104MB of data total /o\
-
fireonlive
active channels go brrrrrrrrrrrrr