-
hlgs|m
would it be possible to save a bunch of image links via archivebot, but only if they're either a) not archived at all, or b) have been archived, but the latest archive has a specific title pattern? to get more specific: save page now has been breaking on tumblr images and all the broken ones have a title that's "[something]: Image". i've got a list of about 60k image urls and i'd like to only save the ones that are broken, or that haven't been
-
hlgs|m
saved at all, just to not spend too many resources saving what's already saved okay
-
nicolas17
I'm not sure but it might take more resources to look up 60k images on the wayback machine than to just archive them again
-
hlgs|m
right, good to know. shame about the storage cost for the WBM itself but that might be the quickest/simplest option for me (i'd like to get these saved as soon as possible as images were being removed entirely lately)
-
audrooku|m
What about just listing all saved images using the cdx api?
-
hlgs|m
i don't have any experience with that, can you explain?
-
hlgs|m
the key thing would be identifying which images are broken by looking at the title they have in the wayback machine (as in, the title the tab/window shows when it's open in the WBM)
-
hlgs|m
that's the only consistent tell i've found in my research, other than it all being by save page now and recent, but i can't tell how recent
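As a rough sketch of what audrooku|m's CDX idea could look like (assumptions: the standard CDX API endpoint at web.archive.org/cdx/search/cdx, and that the broken wrapper captures show up with mimetype text/html; checking the tab title itself would need an extra fetch of each snapshot):

```python
import json
import urllib.parse
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"

def latest_capture(url):
    """Return (timestamp, mimetype, statuscode) of the newest capture of
    `url`, or None if it has never been archived."""
    query = (CDX + "?url=" + urllib.parse.quote(url, safe="")
             + "&output=json&limit=-1&fl=timestamp,mimetype,statuscode")
    with urllib.request.urlopen(query) as resp:
        rows = json.load(resp)
    if len(rows) < 2:          # first row is the field-name header
        return None
    return tuple(rows[-1])

def looks_broken(capture):
    """An image URL whose newest capture came back as text/html is almost
    certainly Tumblr's wrapper page rather than the image itself."""
    return capture is None or capture[1] == "text/html"
```

Running `looks_broken(latest_capture(u))` over the 60k list would then pick out only the URLs worth re-saving, at the cost of one CDX request per URL.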
-
nicolas17
do you have the URL of the image, or of the page-containing-the-image?
-
JAA
Are they actual images when saved correctly or shitty page wrappers?
-
nicolas17
anyway send us the 60k list, you can use
transfer.archivete.am
-
hlgs|m
the direct urls of all the images
-
nicolas17
so they got mis-saved as html pages?
-
nicolas17
a jpeg file doesn't have a "title"
-
hlgs|m
save page now has been saving the weird page wrapper things tumblr has been doing lately, but archivebot isn't having that issue, so i'm basically wanting to redo a ton of images i saved using a SPN script recently
-
hlgs|m
let me get an example
-
hlgs|m
-
JAA
Yup, that's what Tumblr does.
-
hlgs|m
yeah
-
JAA
I'm not sure ArchiveBot is able to archive it correctly when given a direct image URL.
-
hlgs|m
SPN doesn't save the actual image at the moment (i've reported it as a bug but it's still being worked on it seems)
-
JAA
It works on the running Tumblr jobs because those send an appropriate Referer header.
-
hlgs|m
hmm, really? could test it. so far, i haven't noticed any broken images when saved by archivebot
-
nicolas17
might depend on user agent too, curl gives me a png
-
hlgs|m
oh interesting
-
JAA
Ah right, yeah, and the Accept header might also matter.
-
nicolas17
yeah seems it's Accept, not UA
-
hlgs|m
what i find fascinating is that, with non-gifs, i can right click and open the image in a new tab and get the actual image, but the url stays the same
-
JAA
Yes, the URL is not the only thing determining how something gets loaded.
-
nicolas17
my browser requests that URL with "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" and gets a webpage
-
nicolas17
the webpage has an <img> pointing at the same URL
-
hlgs|m
interesting
-
JAA
Even just refreshing after 'open image in new tab' loads the page again.
-
nicolas17
the browser then requests the same URL with "Accept: image/avif,image/webp,*/*" and gets an image
-
hlgs|m
yeah, i've noticed that too
-
nicolas17
seems tumblr only cares that html is *not* in the list?
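To reproduce what nicolas17 is describing, a minimal sketch like this should do (the Tumblr image URL is left as a placeholder, and the expected Content-Types are taken from the observations above rather than re-verified here):

```python
import urllib.request

# The two Accept headers from the browser behaviour described above.
BROWSER_ACCEPT = ("text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,*/*;q=0.8")
IMAGE_ACCEPT = "image/avif,image/webp,*/*"

def fetch_content_type(url, accept):
    """GET `url` with the given Accept header and return its Content-Type."""
    req = urllib.request.Request(url, headers={"Accept": accept})
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get_content_type()

# Hypothetical usage against a 64.media.tumblr.com image URL:
#   fetch_content_type(img_url, BROWSER_ACCEPT)  ->  'text/html'  (wrapper page)
#   fetch_content_type(img_url, IMAGE_ACCEPT)    ->  'image/jpeg' (actual image)
```

If Tumblr really only checks whether `text/html` appears in the list, stripping it from an otherwise browser-like Accept header should also return the raw image.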
-
hlgs|m
(tumblr has been getting... really hard to archive properly lately. this and then the www blog urls and now the permalinks of the previous reblog just being gone in the www blog view... ugh)
-
nicolas17
hlgs|m: do you have an example of a tumblr image that *did* get archived properly?
-
hlgs|m
let me see, i can find one
-
hlgs|m
-
hlgs|m
-
hlgs|m
oh, this is interesting, this one actually got saved by SPN outlinks, not archivebot. but i know i've seen archivebot ones that were saved, i'll find one
-
hlgs|m
ah, although, this one doesn't get the html wrapper thing when i open it separately
-
h2ibot
JustAnotherArchivist edited Deathwatch (+139, /* 2023 */ Add Ragtag Archive):
wiki.archiveteam.org/?diff=49876&oldid=49874
-
hlgs|m
i'll try to find something
-
JAA
Remember that images saved by SPN might have been saved because someone saved the page they're embedded on rather than the image URL directly.
-
hlgs|m
that's what i've been doing, but pretty much all the images within posts that i've tried saving via SPN have ended up broken
-
hlgs|m
okay, here's an archivebot one
-
hlgs|m
-
hlgs|m
-
hlgs|m
has the html wrapper thing when opened live
-
JAA
Ah right
-
h2ibot
JustAnotherArchivist edited Deathwatch (+76, /* 2023 */ Add kitsune tweet about Ragtag Archive):
wiki.archiveteam.org/?diff=49877&oldid=49876
-
JAA
SPN does some weird things with images sometimes. If they're loaded by JS, they might not get archived on the initial SPN but only later when you access the snapshot.
-
JAA
Which changes how they get accessed, which can cause this Tumblr nonsense.
-
hlgs|m
really? i think i've accessed snapshots several times without seeing the images update, but i can try again now
-
nicolas17
I think there's multiple snapshots of the same URL, some with the page, some with the image
-
JAA
I've seen it happen on Imgur for example. SPN itself doesn't archive the actual image when you give it an image page or album.
-
nicolas17
which makes things harder
-
JAA
That can happen, but each URL only gets saved once per 45 minutes.
-
JAA
By SPN, anyway.
-
hlgs|m
just checked one i'd checked before and the images are still broken
-
nicolas17
-
nicolas17
-
hlgs|m
oh, interesting. let me see what saved those
-
hlgs|m
the working ones are archivebot
-
hlgs|m
the broken one is SPN, i did that one earlier today
-
nicolas17
the problem here is that if you archive an image "properly", then
web.archive.org/https://64.media.tumblr.com/whatever/whatever.jpg will take you to the latest snapshot and will look correct, *until* something else causes it to get archived again x_x
-
hlgs|m
it says no collection info, i did it with the addon though i think
-
hlgs|m
yeahhh
-
nicolas17
and then the latest snapshot is the stupid wrapper page again
-
hlgs|m
the wrapper pages are useful because they have the URL of the original post, but they shouldn't be the last saved copy of an image because it then doesn't display, and it breaks embedding
-
hlgs|m
ideally i'd make sure the last saved copy is one that's not the broken wrapper, and then prevent further saving if it's going to save the wrapper over it...
-
hlgs|m
* the wrapper version over it...
-
hlgs|m
not sure if that's possible though
-
JAA
That still doesn't help, because when you load a page, it will embed the snapshot of the image that's closest in time to the page's.
-
JAA
So you can still end up with broken pages everywhere.
-
hlgs|m
ah right, ughh
-
JAA
It's fun when you SPN something, it looks fine, and then some days later it breaks. Because it was actually embedding an old working snapshot of something (like an image or stylesheet), and in the following days, that something got rearchived in a broken state.
-
hlgs|m
in which case, ideally i'd just keep one copy of the wrapper because of the usefulness of the original post being linked, and convert all the rest into proper images or just wipe them all (and somehow keep the wrapper copy further away from the post than any images...)
-
hlgs|m
damn. does archivebot not have this issue?
-
JAA
It's purely a WBM issue on playback because it mixes various data sources and uses that 'closest timestamp' stuff.
-
hlgs|m
well, for me what matters is just that the images and other data is somewhere on the archive and that i can access it with some inspect element digging from the post
-
JAA
Isolated archives from AB don't have this problem, but when they're in the WBM, it can still happen.
-
alexshpilkin
imer: in any case thank you :)
-
hlgs|m
makes sense
-
JAA
Worth mentioning that AB does breadth-first recursion, so embedded images are sometimes archived *much* later than the page.
-
hlgs|m
so... i suppose i could just run my list of urls that may-or-may-not be broken through archivebot to make sure there's at least one working backup somewhere? and hope the WBM team figures out some solution for this later...
-
JAA
Like, can be weeks later.
-
hlgs|m
good to know
-
JAA
Again, not a problem in isolation, can be a problem in the WBM or if the embedded things vanish in the meantime.
-
» alexshpilkin just went to a NixOS channel for a second and ended up investigating *a bug in bash* of all things for two hours, sorry imer
-
hlgs|m
i took the time to get the direct urls so they'd be prioritised now as they're most at risk (aside from people just deleting posts before i can get to them, which is annoying)
-
nicolas17
JAA: for a moment I thought, wouldn't breadth first get the images before recursing deep into links? but that's assuming the page is the root of the tree...
-
JAA
alexshpilkin: Heh, I've encountered a bunch of weird stuff in Bash that turned out to be intentional/correct behaviour, but I found my first bug a couple weeks ago that also made me bang my head against the wall for hours. (Still need to write an email to bug-bash though.)
-
hlgs|m
for the moment then... could i get some help running the url list through archivebot? not sure how long it'll take and how many resources for 60k direct image urls, hopefully not that much. i've got to leave town again tomorrow so i can't get started on setting anything up myself sadly
-
JAA
nicolas17: There is no distinction between links and page requisites as far as the recursion is concerned.
-
JAA
They both just get added to the end of the queue.
-
nicolas17
yeah
-
nicolas17
(maybe there should be a distinction)
-
JAA
hlgs|m: Well, as nicolas17 said, upload a list. :-)
-
hlgs|m
-
nicolas17
I just meant, if you start on a page, you'd soon get its images (and links), before going into a rabbit hole following links
-
imer
alexshpilkin: no worries, still waiting on an ftp listing to finish (that had some random uiuc.edu mirrors) and then i'm out of leads unfortunately
-
hlgs|m
thank you all so much for being so helpful, by the way. been stressful doing so much emergency archival over the past month but you here have taken some weight off my shoulders
-
JAA
nicolas17: That's true at the beginning, but when the queue is already in the millions, well, it'll take a while until it gets to those images.
-
nicolas17
but that assumes that page is where you *start*, if you're several levels deep it won't work that way, it has to get all the level n links from all sorts of unrelated pages before it even starts with n+1 where the image is :)
-
JAA
But URL prioritisation is something I've partially implemented, and prioritising page requisites is high on the wishlist.
-
JAA
Soon™
-
hlgs|m
woo
-
alexshpilkin
imer: that’s honestly $leads leads more than I expected, so cheers
-
hlgs|m
okay, thanks for the help, going afk for a while now
-
alexshpilkin
the csrd.uiuc.edu seems to have had different subdomains under that over the years fwiw
-
alexshpilkin
* FTP seems
-
alexshpilkin
a note from 2000 mentions sp2.csrd.uiuc.edu for example
-
fireonlive
SketchCow: just a heads up the discord invite link expired, unsure if that's intentional tho
-
alexshpilkin
JAA: the secret ingredient is getting someone else to write and send the email for you :)
-
alexshpilkin
(to be fair, that person discovered the bug)
-
JAA
And miss out on all that street cred‽
-
alexshpilkin
... send and cc you on it :P
-
Rotietip
Hello all, a few weeks ago I uploaded
archive.org/details/epsonianos which contains a WARC file from epsonianos.com, but when I checked in
web.archive.org/web/collections/20180000000000*/http://epsonianos.com it seems that the content of it has not been indexed yet. Why is this? Because I made sure to indicate "mediatype:web" when I created the item.
-
nicolas17
WARCs uploaded by regular users to regular collections don't appear in the WBM
-
nicolas17
as said earlier today here, "items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections."
-
Rotietip
Well, how do I make them appear or who do I have to contact for that?
-
nicolas17
how do people know your WARC is a legitimate and accurate archive of the website? :)
-
Rotietip
Perhaps by checking the file type and reviewing the first few lines of the file (in addition to the CDX file)?
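That quick sanity check can be done locally with the stdlib alone (a sketch: it only peeks at the version line and header block of the first record, it does not validate the whole WARC):

```python
import gzip

def first_record_headers(path):
    """Read the version line and header block of the first record in a
    .warc.gz file -- enough to confirm it starts like a real WARC."""
    headers = {}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        version = f.readline().strip()      # e.g. 'WARC/1.0'
        for line in f:
            line = line.strip()
            if not line:                    # blank line terminates the headers
                break
            key, _, value = line.partition(":")
            headers[key.strip()] = value.strip()
    return version, headers
```

As JAA and TheTechRobo note below, though, this only shows the file is well-formed, not that its contents faithfully reflect what the live site served.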
-
vokunal|m
Ragtag actually sounds kind of fun to archive (as someone that doesn't have to code it). I've been liking the mediafire archive for a while since it's much heavier on filesize than other projects. Makes it feel like i'm contributing more
-
TheTechRobo
Rotietip: Unfortunately the WBM only allows certain people to put WARCs into Wayback Machine-ingesting collections. That's because there's no good way to tell if a WARC file has been modified.
-
TheTechRobo
If they let just anyone put stuff into the WBM, then someone could fake a snapshot.
-
vokunal|m
I'm not sure how hard it would be to set up something that could grab these quickly, but this spreadsheet has links to direct downloads for every mkv in their database apparently.
ragtag.link/archive-videos
-
vokunal|m
Is that something that could get sent into urls in small batches, or to AB in batches?
-
JAA
We're not going to archive 1.4 PB through AB.
-
JAA
(Amazing that I actually have to type that out.)
-
JAA
All of AB's 9-year crawls are only 3.1 PiB...
-
vokunal|m
yeah that was a dumb question
-
nicolas17
for starters 1.4PB is way into "formally ask Internet Archive for approval" range
-
JAA
More into 'we need to filter this down to something that's actually reasonable' territory.
-
nicolas17
the imgur project is at 520TB
-
SketchCow
I'll probably make the discord link perm soon.
-
JAA
Do we know the size of the videos that are no longer on YouTube?
-
fireonlive
thanks sketch
-
vokunal|m
At a glance, it seems every video in that list is either private or unlisted. It'll take me a minute, but i can try to see if I can narrow it down to only private or deleted videos
-
vokunal|m
The total size of all unlisted and private videos seems to be 49072 GB
-
nicolas17
aaaaugh how do I find someone's tweet after he disabled his twitter account because of Musk? >_<
-
Rotietip
TheTechRobo, nicolas17 had mentioned that an item must be in certain collections in order to be indexed in Wayback Machine. Is there a way to contact the owners of some of those collections to ask them to add an item?
-
nicolas17
JAA: we need help explaining this :p
-
nicolas17
"Accepting WARCs from random people would make the WBM useless because anyone could insert manipulated data. You can still upload them to IA, but they won't be in the WBM."
-
Rotietip
That's why I was asking if there is a way to request permission or something like that.
-
vokunal|m
That'd still be a random person asking verified person to upload it to IA for them. Same problem
-
Rotietip
Anyway another approach occurs to me. Do you know any online viewer for WARC files? I tried with
replayweb.page but when I try to upload the file from Internet Archive I get this error: "An unexpected error occured: TypeError: Failed to fetch"
-
nicolas17
Rotietip: if someone could give you permission, how would they know they can trust you and your data?
-
Rotietip
In the case of epsonianos.com just check the CDX, there you can see that it is a forum that I downloaded in 2018 and that currently appears the default page of the hosting.
-
pabs
-
pabs
-
nicolas17
it has been talked about in #shreddit
-
nicolas17
pabs:
news.ycombinator.com/item?id=36192312 "My Reddit account was banned after adding my subs to the protest"
-
pabs
ah
-
vokunal|m
JAA:
transfer.archivete.am/uK5k0/ragtag-non-200-videos.csv Here's a csv of every video that's privated or deleted from that list, its equivalent youtube id, and its filesize. Total is 38,574GB. And here's a plain txt of all the direct download urls,
transfer.archivete.am/HA8XK/ragtag_deleted_videos.txt
-
vokunal|m
It took a little longer than a minute, but I don't actually know python. Just a chatgpt wizard apparently
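The filtering step described above can be sketched in a few lines of plain Python (the column names are guesses, not the Ragtag dump's actual header; adjust them to the real CSV):

```python
import csv

def total_unavailable_bytes(csv_path, status_col="status", size_col="filesize"):
    """Sum file sizes over rows whose YouTube status is not 200, i.e. videos
    that are private or deleted on YouTube but still held by Ragtag.

    `status_col` and `size_col` are assumed column names."""
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row[status_col].strip() != "200":
                total += int(row[size_col])
    return total
```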
-
Rotietip
Just out of curiosity, can ArchiveBot be given a referrer and a cookie when downloading something?
-
pabs
no for the cookie, I assume also no for the referrer.
-
rewby|backup
Grabsite can do cookies
-
rewby|backup
But if you wanna do huge archives like 1.4pb. that'd require talking to the IA first of all
-
rewby|backup
And it's well within DPoS territory at that point
-
rewby|backup
No way AB can do it
-
flashfire42
!a
1945.melbourne --concurrency 1
-
rewby|backup
Wrong channel
-
pabs
does ArchiveTeam have any way to archive rsync servers? for eg these datasets: rsync sourceware.org::
-
Rotietip
The question is that I would like to archive some boards from
8chan.moe to be accessible from Wayback Machine, but this can only be achieved by sending
8chan.moe as referrer and the cookie disclaimer2=1. How feasible would it be to do this? Because otherwise it shows a fucking disclaimer like
web.archive.org/web/20230413194626/https://8chan.moe and all the images and thumbnails look the same as
8chan.moe/.media/t_1e724d164bee05ea1d9c2c069172b916212f6742e07b6230194d3a4bb34f953a
-
Rotietip
According to my estimations what I am interested in archiving would be between 70 and 80 GB (although if you want to archive the whole site I won't stop you).
-
masterx244|m
those sites are the worst. and a WARC to download on archive.org is better than no archive at all.
-
audrooku|m
Would it be redundant to mention the ragtag archive in #archiveteam at this point?
-
vokunal|m
not sure. I pasted a list of every private or deleted url from their site
-
vokunal|m
I think Jaa'll be on it when they get the time
-
arkiver
i didn't follow this
-
arkiver
can someone please give me a tl;dr on ragtag?
-
BigBrain
vtuber archive, around 1.4PB, has a lot of lost media
-
BigBrain
all of it yt i think
-
BigBrain
shutting down on or before july 24, dumped full database with metadata and has "compiled a list of videos that are no longer available on YouTube, but are still available in Ragtag Archive" in a csv
-
arkiver
thank you
-
BigBrain
np
-
JAA
arkiver: And ~38 TB for the stuff on Ragtag Archive that's no longer on YouTube, which would be more feasible.