-
hlgs|m
would it be possible to save a bunch of image links via archivebot, but only if they're either a) not archived at all, or b) have been archived, but the latest archive has a specific title pattern? to get more specific: save page now has been breaking on tumblr images and all the broken ones have a title that's "[something]: Image". i've got a list of about 60k image urls and i'd like to only save the ones that are broken, or that haven't been
-
hlgs|m
saved at all, just to not spend too many resources saving what's already saved okay
-
nicolas17
I'm not sure but it might take more resources to look up 60k images on the wayback machine than to just archive them again
-
hlgs|m
right, good to know. shame about the storage cost for the WBM itself but that might be the quickest/simplest option for me (i'd like to get these saved as soon as possible as images were being removed entirely lately)
-
audrooku|m
What about just listing all saved images using the cdx api?
-
hlgs|m
i don't have any experience with that, can you explain?
-
hlgs|m
the key thing would be identifying which images are broken by looking at the title they have in the wayback machine (as in, the title the tab/window shows when it's open in the WBM)
-
hlgs|m
that's the only consistent tell i've found in my research, other than it all being by save page now and recent, but i can't tell how recent
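As a rough sketch of what audrooku|m's CDX idea could look like (assumptions: the standard CDX API endpoint at web.archive.org/cdx/search/cdx, and that the broken wrapper captures show up with mimetype text/html; checking the tab title itself would need an extra fetch of each snapshot):

```python
import json
import urllib.parse
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"

def latest_capture(url):
    """Return (timestamp, mimetype, statuscode) of the newest capture of
    `url`, or None if it has never been archived."""
    query = (CDX + "?url=" + urllib.parse.quote(url, safe="")
             + "&output=json&limit=-1&fl=timestamp,mimetype,statuscode")
    with urllib.request.urlopen(query) as resp:
        rows = json.load(resp)
    if len(rows) < 2:          # first row is the field-name header
        return None
    return tuple(rows[-1])

def looks_broken(capture):
    """An image URL whose newest capture came back as text/html is almost
    certainly Tumblr's wrapper page rather than the image itself."""
    return capture is None or capture[1] == "text/html"
```

Running `looks_broken(latest_capture(u))` over the 60k list would then pick out only the URLs worth re-saving, at the cost of one CDX request per URL.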
-
nicolas17
do you have the URL of the image, or of the page-containing-the-image?
-
JAA
Are they actual images when saved correctly or shitty page wrappers?
-
nicolas17
anyway send us the 60k list, you can use
transfer.archivete.am
-
hlgs|m
the direct urls of all the images
-
nicolas17
so they got mis-saved as html pages?
-
nicolas17
a jpeg file doesn't have a "title"
-
hlgs|m
save page now has been saving the weird page wrapper things tumblr has been doing lately, but archivebot isn't having that issue, so i'm basically wanting to redo a ton of images i saved using a SPN script recently
-
hlgs|m
let me get an example
-
hlgs|m
-
JAA
Yup, that's what Tumblr does.
-
hlgs|m
yeah
-
JAA
I'm not sure ArchiveBot is able to archive it correctly when given a direct image URL.
-
hlgs|m
SPN doesn't save the actual image at the moment (i've reported it as a bug but it's still being worked on it seems)
-
JAA
It works on the running Tumblr jobs because those send an appropriate Referer header.
-
hlgs|m
hmm, really? could test it. so far, i haven't noticed any broken images when saved by archivebot
-
nicolas17
might depend on user agent too, curl gives me a png
-
hlgs|m
oh interesting
-
JAA
Ah right, yeah, and the Accept header might also matter.
-
nicolas17
yeah seems it's Accept, not UA
-
hlgs|m
what i find fascinating is that, with non-gifs, i can right click and open the image in a new tab and get the actual image, but the url stays the same
-
JAA
Yes, the URL is not the only thing determining how something gets loaded.
-
nicolas17
my browser requests that URL with "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" and gets a webpage
-
nicolas17
the webpage has an <img> pointing at the same URL
-
hlgs|m
interesting
-
JAA
Even just refreshing after 'open image in new tab' loads the page again.
-
nicolas17
the browser then requests the same URL with "Accept: image/avif,image/webp,*/*" and gets an image
-
hlgs|m
yeah, i've noticed that too
-
nicolas17
seems tumblr only cares that html is *not* in the list?
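To reproduce what nicolas17 is describing, a minimal sketch like this should do (the Tumblr image URL is left as a placeholder, and the expected Content-Types are taken from the observations above rather than re-verified here):

```python
import urllib.request

# The two Accept headers from the browser behaviour described above.
BROWSER_ACCEPT = ("text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,*/*;q=0.8")
IMAGE_ACCEPT = "image/avif,image/webp,*/*"

def fetch_content_type(url, accept):
    """GET `url` with the given Accept header and return its Content-Type."""
    req = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type()

# Hypothetical usage against a 64.media.tumblr.com image URL:
#   fetch_content_type(img_url, BROWSER_ACCEPT)  ->  'text/html'  (wrapper page)
#   fetch_content_type(img_url, IMAGE_ACCEPT)    ->  'image/jpeg' (actual image)
```

If Tumblr really only checks whether `text/html` appears in the list, stripping it from an otherwise browser-like Accept header should also return the raw image.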
-
hlgs|m
(tumblr has been getting... really hard to archive properly lately. this and then the www blog urls and now the permalinks of the previous reblog just being gone in the www blog view... ugh)
-
nicolas17
hlgs|m: do you have an example of a tumblr image that *did* get archived properly?
-
hlgs|m
let me see, i can find one
-
hlgs|m
-
hlgs|m
-
hlgs|m
oh, this is interesting, this one actually got saved by SPN outlinks, not archivebot. but i know i've seen archivebot ones that were saved, i'll find one
-
hlgs|m
ah, although, this one doesn't get the html wrapper thing when i open it separately
-
h2ibot
JustAnotherArchivist edited Deathwatch (+139, /* 2023 */ Add Ragtag Archive):
wiki.archiveteam.org/?diff=49876&oldid=49874
-
hlgs|m
i'll try to find something
-
JAA
Remember that images saved by SPN might have been saved because someone saved the page they're embedded on rather than the image URL directly.
-
hlgs|m
that's what i've been doing, but pretty much all the images within posts that i've tried saving via SPN have ended up broken
-
hlgs|m
okay, here's an archivebot one
-
hlgs|m
-
hlgs|m
-
hlgs|m
has the html wrapper thing when opened live
-
JAA
Ah right
-
h2ibot
JustAnotherArchivist edited Deathwatch (+76, /* 2023 */ Add kitsune tweet about Ragtag Archive):
wiki.archiveteam.org/?diff=49877&oldid=49876
-
JAA
SPN does some weird things with images sometimes. If they're loaded by JS, they might not get archived on the initial SPN but only later when you access the snapshot.
-
JAA
Which changes how they get accessed, which can cause this Tumblr nonsense.
-
hlgs|m
really? i think i've accessed snapshots several times without seeing the images update, but i can try again now
-
nicolas17
I think there's multiple snapshots of the same URL, some with the page, some with the image
-
JAA
I've seen it happen on Imgur for example. SPN itself doesn't archive the actual image when you give it an image page or album.
-
nicolas17
which makes things harder
-
JAA
That can happen, but each URL only gets saved once per 45 minutes.
-
JAA
By SPN, anyway.
-
hlgs|m
just checked one i'd checked before and the images are still broken
-
nicolas17
-
nicolas17
-
hlgs|m
oh, interesting. let me see what saved those
-
hlgs|m
the working ones are archivebot
-
hlgs|m
the broken one is SPN, i did that one earlier today
-
nicolas17
the problem here is that if you archive an image "properly", then
web.archive.org/https://64.media.tumblr.com/whatever/whatever.jpg will take you to the latest snapshot and will look correct, *until* something else causes it to get archived again x_x
-
hlgs|m
it says no collection info, i did it with the addon though i think
-
hlgs|m
yeahhh
-
nicolas17
and then the latest snapshot is the stupid wrapper page again
-
hlgs|m
the wrapper pages are useful because they have the URL of the original post, but they shouldn't be the last saved copy of an image because it then doesn't display, and it breaks embedding
-
hlgs|m
ideally i'd make sure the last saved copy is one that's not the broken wrapper, and then prevent further saving if it's going to save the wrapper over it...
-
hlgs|m
* the wrapper version over it...
-
hlgs|m
not sure if that's possible though
-
JAA
That still doesn't help, because when you load a page, it will embed the snapshot of the image that's closest in time to the page's.
-
JAA
So you can still end up with broken pages everywhere.
-
hlgs|m
ah right, ughh
-
JAA
It's fun when you SPN something, it looks fine, and then some days later it breaks. Because it was actually embedding an old working snapshot of something (like an image or stylesheet), and in the following days, that something got rearchived in a broken state.
-
hlgs|m
in which case, ideally i'd just keep one copy of the wrapper because of the usefulness of the original post being linked, and convert all the rest into proper images or just wipe them all (and somehow keep the wrapper copy further away from the post than any images...)
-
hlgs|m
damn. does archivebot not have this issue?
-
JAA
It's purely a WBM issue on playback because it mixes various data sources and uses that 'closest timestamp' stuff.
-
hlgs|m
well, for me what matters is just that the images and other data is somewhere on the archive and that i can access it with some inspect element digging from the post
-
JAA
Isolated archives from AB don't have this problem, but when they're in the WBM, it can still happen.
-
alexshpilkin
imer: in any case thank you :)
-
hlgs|m
makes sense
-
JAA
Worth mentioning that AB does breadth-first recursion, so embedded images are sometimes archived *much* later than the page.
-
hlgs|m
so... i suppose i could just run my list of urls that may-or-may-not be broken through archivebot to make sure there's at least one working backup somewhere? and hope the WBM team figures out some solution for this later...
-
JAA
Like, can be weeks later.
-
hlgs|m
good to know
-
JAA
Again, not a problem in isolation, can be a problem in the WBM or if the embedded things vanish in the meantime.
-
» alexshpilkin just went to a NixOS channel for a second and ended up investigating *a bug in bash* of all things for two hours, sorry imer
-
hlgs|m
i took the time to get the direct urls so they'd be prioritised now as they're most at risk (aside from people just deleting posts before i can get to them, which is annoying)
-
nicolas17
JAA: for a moment I thought, wouldn't breadth first get the images before recursing deep into links? but that's assuming the page is the root of the tree...
-
JAA
alexshpilkin: Heh, I've encountered a bunch of weird stuff in Bash that turned out to be intentional/correct behaviour, but I found my first bug a couple weeks ago that also made me bang my head against the wall for hours. (Still need to write an email to bug-bash though.)
-
hlgs|m
for the moment then... could i get some help running the url list through archivebot? not sure how long it'll take and how many resources for 60k direct image urls, hopefully not that much. i've got to leave town again tomorrow so i can't get started on setting anything up myself sadly
-
JAA
nicolas17: There is no distinction between links and page requisites as far as the recursion is concerned.
-
JAA
They both just get added to the end of the queue.
-
nicolas17
yeah
-
nicolas17
(maybe there should be a distinction)
-
JAA
hlgs|m: Well, as nicolas17 said, upload a list. :-)
-
hlgs|m
-
nicolas17
I just meant, if you start on a page, you'd soon get its images (and links), before going into a rabbit hole following links
-
imer
alexshpilkin: no worries, still waiting on an ftp listing to finish (that had some random uiuc.edu mirrors) and then i'm out of leads unfortunately
-
hlgs|m
thank you all so much for being so helpful, by the way. been stressful doing so much emergency archival over the past month but you here have taken some weight off my shoulders
-
JAA
nicolas17: That's true at the beginning, but when the queue is already in the millions, well, it'll take a while until it gets to those images.
-
nicolas17
but that assumes that page is where you *start*, if you're several levels deep it won't work that way, it has to get all the level n links from all sorts of unrelated pages before it even starts with n+1 where the image is :)
-
JAA
But URL prioritisation is something I've partially implemented, and prioritising page requisites is high on the wishlist.
-
JAA
Soon™
-
hlgs|m
woo
-
alexshpilkin
imer: that’s honestly $leads leads more than I expected, so cheers
-
hlgs|m
okay, thanks for the help, going afk for a while now
-
alexshpilkin
the csrd.uiuc.edu seems to have had different subdomains under that over the years fwiw
-
alexshpilkin
* FTP seems
-
alexshpilkin
a note from 2000 mentions sp2.csrd.uiuc.edu for example
-
fireonlive
SketchCow: just a heads up the discord invite link expired, unsure if that's intentional tho
-
alexshpilkin
JAA: the secret ingredient is getting someone else to write and send the email for you :)
-
alexshpilkin
(to be fair, that person discovered the bug)
-
JAA
And miss out on all that street cred‽
-
alexshpilkin
... send and cc you on it :P
-
Rotietip
Hello all, a few weeks ago I uploaded
archive.org/details/epsonianos which contains a WARC file from epsonianos.com, but when I checked in
web.archive.org/web/collections/20180000000000*/http://epsonianos.com it seems that the content of it has not been indexed yet. Why is this? Because I made sure to indicate "mediatype:web" when I created the item.
-
nicolas17
WARCs uploaded by regular users to regular collections don't appear in the WBM
-
nicolas17
as said earlier today here, "items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections."
-
Rotietip
Well, how do I make them appear or who do I have to contact for that?
-
nicolas17
how do people know your WARC is a legitimate and accurate archive of the website? :)
-
Rotietip
Perhaps by checking the file type and reviewing the first few lines of the file (in addition to the CDX file)?
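That quick sanity check can be done locally with the stdlib alone (a sketch: it only peeks at the version line and header block of the first record, it does not validate the whole WARC):

```python
import gzip

def first_record_headers(path):
    """Read the version line and header block of the first record in a
    .warc.gz file -- enough to confirm it starts like a real WARC."""
    headers = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        version = f.readline().strip()      # e.g. 'WARC/1.0'
        for line in f:
            line = line.strip()
            if not line:                    # blank line terminates the headers
                break
            key, _, value = line.partition(":")
            headers[key.strip()] = value.strip()
    return version, headers
```

As JAA and TheTechRobo note below, though, this only shows the file is well-formed, not that its contents faithfully reflect what the live site served.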
-
vokunal|m
Ragtag actually sounds kind of fun to archive (as someone that doesn't have to code it). I've been liking the mediafire archive for a while since it's much heavier on filesize than other projects. Makes it feel like i'm contributing more
-
TheTechRobo
Rotietip: Unfortunately the WBM only allows certain people to put WARCs into Wayback Machine-ingesting collections. That's because there's no good way to tell if a WARC file has been modified.
-
TheTechRobo
If they let just anyone put stuff into the WBM, then someone could fake a snapshot.
-
vokunal|m
I'm not sure how hard it would be to set up something that could grab these quickly, but this spreadsheet has links to direct downloads for every mkv in their database apparently.
ragtag.link/archive-videos
-
vokunal|m
Is that something that could get sent into urls in small batches, or to AB in batches?
-
JAA
We're not going to archive 1.4 PB through AB.
-
JAA
(Amazing that I actually have to type that out.)
-
JAA
All of AB's 9-year crawls are only 3.1 PiB...
-
vokunal|m
yeah that was a dumb question
-
nicolas17
for starters 1.4PB is way into "formally ask Internet Archive for approval" range
-
JAA
More into 'we need to filter this down to something that's actually reasonable' territory.
-
nicolas17
the imgur project is at 520TB
-
SketchCow
I'll probably make the discord link perm soon.
-
JAA
Do we know the size of the videos that are no longer on YouTube?
-
fireonlive
thanks sketch
-
vokunal|m
At a glance, it seems every video in that list is either private or unlisted. It'll take me a minute, but i can try to see if I can narrow it down to only private or deleted videos
-
vokunal|m
The total size of all unlisted and private videos seems to be 49072 GB
-
nicolas17
aaaaugh how do I find someone's tweet after he disabled his twitter account because of Musk? >_<
-
Rotietip
TheTechRobo, nicolas17 had mentioned that an item must be in certain collections in order to be indexed in Wayback Machine. Is there a way to contact the owners of some of those collections to ask them to add an item?
-
nicolas17
JAA: we need help explaining this :p
-
nicolas17
"Accepting WARCs from random people would make the WBM useless because anyone could insert manipulated data. You can still upload them to IA, but they won't be in the WBM."
-
Rotietip
That's why I was asking if there is a way to request permission or something like that.
-
vokunal|m
That'd still be a random person asking verified person to upload it to IA for them. Same problem
-
Rotietip
Anyway another approach occurs to me. Do you know any online viewer for WARC files? I tried with
replayweb.page but when I try to upload the file from Internet Archive I get this error: "An unexpected error occured: TypeError: Failed to fetch"
-
nicolas17
Rotietip: if someone could give you permission, how would they know they can trust you and your data?
-
Rotietip
In the case of epsonianos.com just check the CDX, there you can see that it is a forum that I downloaded in 2018 and that currently appears the default page of the hosting.
-
pabs
-
pabs
-
nicolas17
it has been talked about in #shreddit
-
nicolas17
pabs:
news.ycombinator.com/item?id=36192312 "My Reddit account was banned after adding my subs to the protest"
-
pabs
ah
-
vokunal|m
JAA:
transfer.archivete.am/uK5k0/ragtag-non-200-videos.csv Here's a csv of every video that's privated or deleted from that list, its equivalent youtube id, and its filesize. Total is 38,574GB. And here's a plain txt of all the direct download urls,
transfer.archivete.am/HA8XK/ragtag_deleted_videos.txt
-
vokunal|m
It took a little longer than a minute, but I don't actually know python. Just a chatgpt wizard apparently
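The filtering step described above can be sketched in a few lines of plain Python (the column names are guesses, not the Ragtag dump's actual header; adjust them to the real CSV):

```python
import csv

def total_unavailable_bytes(csv_path, status_col="status", size_col="filesize"):
    """Sum file sizes over rows whose YouTube status is not 200, i.e. videos
    that are private or deleted on YouTube but still held by Ragtag.

    `status_col` and `size_col` are assumed column names."""
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[status_col].strip() != "200":
                total += int(row[size_col])
    return total
```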
-
Rotietip
Just out of curiosity, can ArchiveBot be given a referrer and a cookie when downloading something?
-
pabs
no for the cookie, I assume also no for the referrer.
-
rewby|backup
Grabsite can do cookies
-
rewby|backup
But if you wanna do huge archives like 1.4pb. that'd require talking to the IA first of all
-
rewby|backup
And it's well within DPoS territory at that point
-
rewby|backup
No way AB can do it
-
flashfire42
!a
1945.melbourne --concurrency 1
-
rewby|backup
Wrong channel
-
pabs
does ArchiveTeam have any way to archive rsync servers? for eg these datasets: rsync sourceware.org::
-
Rotietip
The question is that I would like to archive some boards from
8chan.moe to be accessible from Wayback Machine, but this can only be achieved by sending
8chan.moe as referrer and the cookie disclaimer2=1. How feasible would it be to do this? Because otherwise it shows a fucking disclaimer like
web.archive.org/web/20230413194626/https://8chan.moe and all the images and thumbnails look the same as
8chan.moe/.media/t_1e724d164bee05ea1d9c2c069172b916212f6742e07b6230194d3a4bb34f953a
-
Rotietip
According to my estimations what I am interested in archiving would be between 70 and 80 GB (although if you want to archive the whole site I won't stop you).
-
masterx244|m
those sites are the worst. and a WARC to download on archive.org is better than no archive at all.
-
audrooku|m
Would it be redundant to mention the ragtag archive in #archiveteam at this point?
-
vokunal|m
not sure. I pasted a list of every private or deleted url from their site
-
vokunal|m
I think Jaa'll be on it when they get the time
-
arkiver
i didn't follow this
-
arkiver
can someone please give me a tl;dr on ragtag?
-
BigBrain
vtuber archive, around 1.4PB, has a lot of lost media
-
BigBrain
all of it yt i think
-
BigBrain
shutting down on or before july 24, dumped full database with metadata and has "compiled a list of videos that are no longer available on YouTube, but are still available in Ragtag Archive" in a csv
-
arkiver
thank you
-
BigBrain
np
-
JAA
arkiver: And ~38 TB for the stuff on Ragtag Archive that's no longer on YouTube, which would be more feasible.