-
JAA
I guess we'll need LaTeX in our wiki soon.
-
TheTechRobo
mfw you need a math degree to read an archiveteam article
-
TheTechRobo
:P
-
rootliam
Hello, is there someone knowledgeable about the yahoo video archives who could help me find which tar file a specific user ID's videos are in? I've tried downloading file lists generated by the internet archive through a script, but they seem to be incomplete
-
nicolas17
where are those?
-
nicolas17
rootliam: are user IDs numeric?
-
rootliam
Yeah, the one I'm looking for is 375869 and according to the wiki they got 0,300,000 - 0,400,000 so it should be in there somewhere
-
nicolas17
I can't find that range :|
-
rootliam
It seems like they got split into a bunch of files but the ranges on some of them don't even make sense like 00045002-07439897
-
rootliam
Maybe the only option is to just download all of them but I don't have the space or bandwidth to
-
flashfire42
How much temp storage do we have left? I wanna be aware if I can feed some more telegram stuff in
-
nicolas17
rewby:
-
thuban
rootliam: you could shoot underscor a message? most recently active account i can find is
github.com/ab2525
-
nicolas17
only the giant 00000024-02442192 and 00045002-07439897 ranges would fit that number
-
thuban
it's not in either of those, they both just have stray extras at the beginning--the 'real' ranges are 2400026-2442192 and 7400007-7439897 respectively
-
nicolas17
I'll get their full file listings anyway, since I already started :P
-
nicolas17
yahoovideo-00000024-02442192.tar will take 8h to download oof
-
fireonlive
oof
-
nicolas17
these items are such a mess
-
nicolas17
omg 256GB in a single .tar
-
nicolas17
will a single continuous 256GB download from IA survive to completion?
-
nicolas17
we'll find out... in 48h or so
-
rootliam
I actually downloaded the extensionless/bz2 ones and made a program to extract the html files/file positions
-
rootliam
I would like to try to make a search engine for the archives, beyond just finding this specific user's uploads, but that 2TB was all I had space for and it took multiple months to download. If you'd like, I could upload the program and the data I extracted tomorrow
-
nicolas17
for now I'm doing "curl | tar tv" to get the file listings
-
fireonlive
nicolas17: aria2 and multiple threads? haha
-
nicolas17
fireonlive: that would require actually having the disk space to store the whole tar ;)
-
fireonlive
ahh :D
-
fireonlive
true :3
-
nicolas17
which I can do for some of these
-
nicolas17
but not the 256GB beasts
-
rootliam
Does 'tar tv' store the position of the file inside the tar itself? My idea was to have JavaScript get the flv out of the tar with a range request
-
nicolas17
thuban: for tars it seems to be highly incomplete
-
nicolas17
for zips I can use remotezip to get the file list without downloading the whole thing
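remotezip can do this because the zip format keeps its index (the central directory) at the end of the file, so a couple of small range requests are enough to list every member. A rough sketch of that trick, with an in-memory zip and a byte-slicing function standing in for real HTTP Range requests (the member names are made up):

```python
import io
import struct
import zipfile

def make_test_zip() -> bytes:
    """Build a small zip in memory to stand in for a remote archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("videos/000123.flv", b"x" * 1000)
        z.writestr("videos/000456.flv", b"y" * 2000)
    return buf.getvalue()

def range_read(blob: bytes, start: int, length: int) -> bytes:
    """Stand-in for an HTTP Range request (bytes=start-(start+length-1))."""
    return blob[start:start + length]

def list_zip_names(blob: bytes) -> list:
    """List member names while reading only the tail of the archive."""
    size = len(blob)
    # The end-of-central-directory record sits within the last 65536+22 bytes
    # (22-byte fixed record plus an up-to-64-KiB archive comment).
    tail_len = min(size, 65536 + 22)
    tail = range_read(blob, size - tail_len, tail_len)
    eocd = tail.rfind(b"PK\x05\x06")
    # Central-directory size and offset live at bytes 12..20 of the record.
    cd_size, cd_offset = struct.unpack("<II", tail[eocd + 12:eocd + 20])
    cd = range_read(blob, cd_offset, cd_size)
    names, pos = [], 0
    while cd[pos:pos + 4] == b"PK\x01\x02":  # central-directory file header
        name_len, extra_len, comment_len = struct.unpack(
            "<HHH", cd[pos + 28:pos + 34])
        names.append(cd[pos + 46:pos + 46 + name_len].decode())
        pos += 46 + name_len + extra_len + comment_len
    return names
```

The real library does the same thing over HTTP and can also extract individual members. Tar has no such index, which is why tar listings need a full linear pass.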
-
thuban
oh, the tar listing is itself incomplete? i thought the problem was just that not all the yahoo video data made it to ia
-
nicolas17
that *too* :P
-
thuban
fucked up
-
nicolas17
I think view_archive has to do the same linear process as 'tar tv' so it takes forever and soon gives up?
-
thuban
quite possible
-
thuban
regardless, it would be nice to track down the rest of that dataset...
-
nicolas17
absolutely
-
nicolas17
someone also mentioned repacking the tars into a more usable format, but I don't know if that was about yahoo video or about another project of a similar era
-
arkiver
i see some lists are running through AB now - is that all the lists? so, anything not finished here is still 'to be done'?
-
arkiver
or is there now more elsewhere still?
-
thuban
arkiver: it's running now, job 9t2wjsmi7nb9izhv58mnj4tnf
-
thuban
that's all the lists, yes
-
arkiver
oh next time it might be good to run that through with --no-offsite-links
-
arkiver
what response do you get if it rate limits you?
-
thuban
yeah, i wondered about that :S
-
thuban
(iirc you can do it the hacky way with a negative-lookahead ignore, but i don't have voice anyway, so)
-
arkiver
6 AM here, so I really need to get some sleep, but we'll get a little emergency project up; hopefully it'll finish in time
-
arkiver
thuban: maybe JAA knows about that
-
» arkiver has very little AB experience
-
thuban
don't recall about the rate-limiting, i'm afraid, but i'm pretty sure it's not 429
-
thuban
pokechu22 should know
-
pokechu22
You get timeouts for 24 hours
-
pokechu22
and/or refused connections
-
pokechu22
with no warning ahead of time (but it's generally OK with you running above the rate limit for a bit if you stop afterwards, it seems)
-
pokechu22
I didn't use --no-offsite-links because I expect it interacts with !a < list on multiple domains in weird ways
-
pokechu22
I'll add an ignore that prevents it from using URLs without orange in the domain
-
nicolas17
hmm, I think given the size of every file inside a tar, I can calculate the absolute position of each file
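The arithmetic is simple if the tar is a plain ustar-style archive: each member is a 512-byte header followed by its data, padded up to the next 512-byte boundary. A minimal sketch of that calculation (assuming no GNU long-name or pax extension records, which would insert extra blocks):

```python
def tar_member_offsets(members):
    """Map name -> (data_offset, size) for (name, size) pairs in archive order.

    Assumes a plain ustar-style tar: a 512-byte header per member, then the
    data padded up to the next 512-byte boundary. GNU long-name or pax
    extension records would insert extra blocks and break this arithmetic.
    """
    offsets = {}
    pos = 0
    for name, size in members:
        pos += 512                         # the member's header block
        offsets[name] = (pos, size)        # data starts right after it
        pos += (size + 511) // 512 * 512   # data, padded to a block boundary
    return offsets
```

With those offsets, JavaScript could fetch one flv out of the tar with a single Range header, e.g. `bytes=512-1511` for a 1000-byte first member.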
-
pokechu22
added ^https?://(?![^/]*(orange|wanadoo)[^/]*)[^/]*/
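A quick sanity check of that ignore (assuming ArchiveBot ignores are Python-style regexes matched against the full URL; the example URLs are illustrative):

```python
import re

# pokechu22's ignore: skip any URL whose host contains neither "orange"
# nor "wanadoo". A match means the URL gets *ignored*.
ignore = re.compile(r"^https?://(?![^/]*(orange|wanadoo)[^/]*)[^/]*/")

def skipped(url: str) -> bool:
    """True if ArchiveBot would skip this URL under the ignore above."""
    return ignore.match(url) is not None
```

As thuban points out right after, this also skips woopic.com, Orange's CDN, since that host contains neither keyword.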
-
thuban
pokechu22: you also need woopic.com :/
-
pokechu22
oh
-
thuban
it's their cdn
-
pokechu22
well let's just hope that nothing much was missed by that - if there's still time, we can recheck skipped URLs (and we might as well do outlinks afterwards anyways)
-
thuban
was thinking the same thing
-
thuban
ty for fix :)
-
pokechu22
It's possible to requeue skipped URLs but that's a difficult process
-
pabs
do we have a way of saving Twitter at the moment?
nitter.net/steveharwell died
-
fireonlive
i've seen people using nitter.net or nitter.cz but last i heard they're rate limiting so have to go slow
-
nicolas17
I could get a tar file listing somewhat faster by skipping over the actual file data
-
nicolas17
but it seems there's a yahoo-videos tar where that would be *especially* beneficial, because the videos inside are all like 100MB+, but I can't do it because it's .bz2 /o\
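For an uncompressed tar, the skip-ahead listing described here is straightforward: read each 512-byte header, parse the octal size field, and seek past the padded data instead of reading it. A minimal sketch, assuming plain ustar members (no long-name/pax records) and a seekable file, which is exactly what a .bz2 stream doesn't give you:

```python
import io

def fast_tar_listing(f):
    """List (name, size) pairs from a seekable tar, skipping the file data."""
    entries = []
    while True:
        header = f.read(512)
        if len(header) < 512 or header == b"\0" * 512:
            break  # end-of-archive marker (or truncated input)
        name = header[0:100].rstrip(b"\0").decode()
        size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)
        if header[156:157] in (b"0", b"\0"):  # regular file
            entries.append((name, size))
        f.seek((size + 511) // 512 * 512, io.SEEK_CUR)  # skip padded data
    return entries
```

Over HTTP the same idea would translate to one small Range request per header, which only pays off when members are large, as with the 100MB+ videos mentioned here.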
-
h2ibot
Wessel1512 edited Deathwatch (+333, /* 2023 */):
wiki.archiveteam.org/?diff=50713&oldid=50710
-
pabs
interesting, a Google referrer makes twitter user pages public
-
flashfire42
pabs does that include privated accounts?
-
thuban
orange.fr is dead
-
AntoninDelFabbro|m
I confirm.
-
pabs
flashfire42: not sure, got an example account?
-
plcp
hi
-
plcp
pokechu22, thuban, someone from orange wrote me back to say something like "orange's pagesperso has been suspended today, but we've acknowledged your request"
-
plcp
"we'll try to re-up the pages until the 30th of september, and then, they will be down forever"
-
plcp
I don't have much more info, but we may have got more time
-
pabs
flashfire42, thuban: ^
-
imer
Nice, good thing you asked :)
-
plcp
it's not done yet (fingers crossed)
-
qwertyasdfuiopghjkl
mastodon.0011.lt is shutting down tomorrow (2023-09-06):
mastodon.0011.lt/@au0/110939313852541543 . JAA, arkiver: what's the current status on whether or not to archive mastodon instances (asking you since you were the ones who edited the wiki page)?
-
arkiver
qwertyasdfuiopghjkl: let's put it in #archivebot
-
qwertyasdfuiopghjkl
arkiver: to clarify, do you mean the conversation or the site?
-
arkiver
if it's shutting down, let's put it in AB
-
qwertyasdfuiopghjkl
ok, thanks :)
-
qwertyasdfuiopghjkl
arkiver: Looks like ArchiveBot no longer works for mastodon instances due to JS :(
-
qwertyasdfuiopghjkl
Is there any other way to archive it?
-
arkiver
orange is down :(
-
imer
arkiver: See above (9:32), might come back
-
pokechu22
It seems like it's partially online again?
-
pokechu22
ah, or just redirects to a dead domain - I can't load it currently :|
-
arkiver
plcp: that would be great news!
-
arkiver
we can make a proper full copy then
-
fireonlive
fingers crossed :)
-
arkiver
imer: thanks for letting me know!
-
arkiver
yeah :)
-
pokechu22
I'm currently letting the archivebot jobs drain rather than saving the redirects to the dead domain
-
arkiver
would probably be good to pause them and resume when the sites are back?
-
pokechu22
I guess - I thought the image domain was still alive so it would be useful to do those, but it's not
-
plcp
yeah dead redirs to end.pagesperso-orange.fr for now
-
pokechu22
-
plcp
first is 302 as it was an existing site before, second is 404 as it doesn't exist?
-
pokechu22
Yeah, that's what I'd expect
-
plcp
I hope that orange guy didn't give me false hopes tbh
-
pokechu22
so probably somewhat useful to leave the job that didn't finish iterating through all the possible sites alone, just to record whether a site existed or not, even if it's not going to get anything
-
plcp
can only wait & see
-
pokechu22
Unrelated to that - what's the deal with free.fr? Is that ISP hosting as well, or is it something else?
-
plcp
it is ISP hosting
-
plcp
basically the same deal as orange's pagesperso: at some point they will pull the plug, but for now it's staying alive
-
pokechu22
Alright, we should probably do something about that sooner rather than later then
-
plcp
yup, lots of old personal webpages there
-
plcp
and most active orange pagesperso sites migrated to free.fr btw :o)
-
plcp
also saw several sitew.com
-
plcp
as well as other misc web hosting services (most of the time these "create your website for free" ones)
-
qwertyasdfuiopghjkl
arkiver: (mentioning it again in case you missed it since the shutdown is tomorrow) ArchiveBot couldn't get any posts or users on
mastodon.0011.lt , even when starting from
mastodon.0011.lt/about , as recommended in
wiki.archiveteam.org/index.php/Mastodon . I'm guessing Mastodon previously worked without JS, and the wiki page is outdated.
-
pokechu22
I do remember hearing that mastodon worked without JS in the past and they removed that at some point
-
fireonlive
me too
-
katia
mastodon seems clunky
-
fireonlive
lots of new things don't really care about not having javascript because 'who turns it off'
-
rewby
hackint's not having a good day, is it
-
JAA
nicolas17: Cute. There's a 650 GiB tar from me on IA somewhere.
-
JAA
qwertyasdfuiopghjkl: Yeah, I think Mastodon would require special tools now thanks to the JS without fallback.
-
flashfire42|m
How is ingestion going?
-
Exdetransitioner
what about archiving gender-critical websites? it would be useful for scholars to study anti-trans websites of today in the future
-
nicolas17
what about it?
-
nicolas17
give specific websites and someone will add them to archivebot I guess
-
SketchCow
I am not sure anti-trans websites are what "gender-critical" means
-
audrooku|m
^
-
SketchCow
Regardless, bigotry-a-go-go sites get archived all the time
-
audrooku|m
^ to give websites
-
Exdetransitioner
ovarit.com
-
Exdetransitioner
supposedly "women-centered" but in reality it's a gatekept echo chamber thru invite codes
-
Exdetransitioner
there was also kiwi farms but i think it would have to have all the personal information redacted by AI
-
nicolas17
yeah I don't think it's worth doing any specific effort to archive kf
-
Exdetransitioner
nicolas17, all it really was is pure noise
-
thuban
the operator has announced he'll release an archive if it ever shuts down
-
nicolas17
archiving the specific doxxing information would be counter-productive, and redacting specific parts of the content is.. not something we do, and would be way too much effort anyway
-
Exdetransitioner
nicolas17, yeah it would be harder than organizing elections
-
Exdetransitioner
you would need multiple people to verify everything
-
nicolas17
I mean, besides the "effort" part, we archive pristine HTTP responses including headers; if you need to modify the html to redact some content then it probably doesn't belong on the Wayback Machine
-
Exdetransitioner
another site that might be of interest is SEGM.org - an activist organization that's overplaying the negatives of gender-affirming treatments while underplaying the positives
-
Exdetransitioner
generally saying anything from this list would fit the bill:
rationalwiki.org/wiki/RationalWiki:Webshites/Gender
-
nicolas17
lol webshites
-
Exdetransitioner
It's cited as "really bad sources of information - or really good sources of bad information"
-
SketchCow
Oh yes, archiving kiwifarms, nobody has ever discussed that.
-
nicolas17
maybe it needs a wiki page to avoid rehashing the argument whenever it comes up