00:00:07 would it be possible to save a bunch of image links via archivebot, but only if they're either a) not archived at all, or b) have been archived, but the latest archive has a specific title pattern? to get more specific: save page now has been breaking on tumblr images and all the broken ones have a title that's "[something]: Image". i've got a list of about 60k image urls and i'd like to only save the ones that are broken, or that haven't been
00:00:07 saved at all, just to not spend too many resources saving what's already saved okay
00:03:45 I'm not sure but it might take more resources to look up 60k images on the wayback machine than to just archive them again
00:04:43 right, good to know. shame about the storage cost for the WBM itself but that might be the quickest/simplest option for me (i'd like to get these saved as soon as possible as images were being removed entirely lately)
00:06:41 What about just listing all saved images using the cdx api?
00:07:13 i don't have any experience with that, can you explain?
00:08:23 the key thing would be identifying which images are broken by looking at the title they have in the wayback machine (as in, the title the tab/window shows when it's open in the WBM)
00:08:49 that's the only consistent tell i've found in my research, other than it all being by save page now and recent, but i can't tell how recent
00:11:08 do you have the URL of the image, or of the page-containing-the-image?
00:11:12 Are they actual images when saved correctly or shitty page wrappers?
00:12:04 anyway send us the 60k list, you can use https://transfer.archivete.am/
00:14:12 the direct urls of all the images
00:14:37 so they got mis-saved as html pages?
00:14:47 a jpeg file doesn't have a "title"
00:14:51 save page now has been saving the weird page wrapper things tumblr has been doing lately, but archivebot isn't having that issue, so i'm basically wanting to redo a ton of images i saved using a SPN script recently
00:14:57 let me get an example
00:15:11 https://64.media.tumblr.com/377948577d35abb1be9e2be2dc9f2897/tumblr_o98nqnSmqh1s4dx9ko4_r5_1280.png
00:15:27 Yup, that's what Tumblr does.
00:15:33 yeah
00:15:42 I'm not sure ArchiveBot is able to archive it correctly when given a direct image URL.
00:15:45 SPN doesn't save the actual image at the moment (i've reported it as a bug but it's still being worked on it seems)
00:15:54 It works on the running Tumblr jobs because those send an appropriate Referer header.
00:16:15 hmm, really? could test it. so far, i haven't noticed any broken images when saved by archivebot
00:16:16 might depend on user agent too, curl gives me a png
00:16:26 oh interesting
00:16:40 Ah right, yeah, and the Accept header might also matter.
00:16:49 yeah seems it's Accept, not UA
00:16:49 what i find fascinating is that, with non-gifs, i can right click and open the image in a new tab and get the actual image, but the url stays the same
00:17:29 Yes, the URL is not the only thing determining how something gets loaded.
00:17:30 my browser requests that URL with "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" and gets a webpage
00:17:46 the webpage has an <img> tag pointing at the same URL
00:17:58 interesting
00:18:04 Even just refreshing after 'open image in new tab' loads the page again.
00:18:14 the browser then requests the same URL with "Accept: image/avif,image/webp,*/*" and gets an image
00:18:14 yeah, i've noticed that too
00:18:15 seems tumblr only cares that html is *not* in the list?
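A quick way to confirm the Accept-header behaviour described above is to request the same Tumblr media URL twice with different Accept values and compare the returned Content-Type. This is only an illustrative sketch using the `requests` library, based on the headers quoted in the discussion; Tumblr's behaviour may of course change.

```python
import requests

# Example media URL from the discussion above.
URL = "https://64.media.tumblr.com/377948577d35abb1be9e2be2dc9f2897/tumblr_o98nqnSmqh1s4dx9ko4_r5_1280.png"

# Browser-style Accept header including text/html -> Tumblr serves the HTML wrapper page.
html_accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
# Image-only Accept header (what the browser sends for <img> requests) -> Tumblr serves the actual image.
image_accept = "image/avif,image/webp,*/*"

for accept in (html_accept, image_accept):
    resp = requests.get(URL, headers={"Accept": accept}, timeout=30)
    print(f"Accept: {accept!r}")
    print(f"  -> Content-Type: {resp.headers.get('Content-Type')}, {len(resp.content)} bytes")
```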
00:18:37 (tumblr has been getting... really hard to archive properly lately. this and then the www blog urls and now the permalinks of the previous reblog just being gone in the www blog view... ugh)
00:19:27 hlgs|m: do you have an example of a tumblr image that *did* get archived properly?
00:19:38 let me see, i can find one
00:21:33
00:21:33
00:21:33 oh, this is interesting, this one actually got saved by SPN outlinks, not archivebot. but i know i've seen archivebot ones that were saved, i'll find one
00:21:51 ah, although, this one doesn't get the html wrapper thing when i open it separately
00:22:01 JustAnotherArchivist edited Deathwatch (+139, /* 2023 */ Add Ragtag Archive): https://wiki.archiveteam.org/?diff=49876&oldid=49874
00:22:05 i'll try to find something
00:23:21 Remember that images saved by SPN might have been saved because someone saved the page they're embedded on rather than the image URL directly.
00:23:56 that's what i've been doing, but pretty much all the images within posts that i've tried saving via SPN have ended up broken
00:24:29 okay, here's an archivebot one
00:24:29 https://web.archive.org/web/20230526221101/https://unrestedjade.tumblr.com/post/667862452973780992/bluehedron-the-downside-of-having-a-host-friend
00:24:29 https://web.archive.org/web/20230526221058/https://64.media.tumblr.com/d2d9086c62bb92fc78d0494a98addf2d/tumblr_paoqjnQrrp1wv21vuo1_r1_1280.jpg
00:24:38 has the html wrapper thing when opened live
00:24:43 Ah right
00:25:02 JustAnotherArchivist edited Deathwatch (+76, /* 2023 */ Add kitsune tweet about Ragtag Archive): https://wiki.archiveteam.org/?diff=49877&oldid=49876
00:25:10 SPN does some weird things with images sometimes. If they're loaded by JS, they might not get archived on the initial SPN but only later when you access the snapshot.
00:25:26 Which changes how they get accessed, which can cause this Tumblr nonsense.
00:25:39 really? i think i've accessed snapshots several times without seeing the images update, but i can try again now
00:26:06 I think there's multiple snapshots of the same URL, some with the page, some with the image
00:26:08 I've seen it happen on Imgur for example. SPN itself doesn't archive the actual image when you give it an image page or album.
00:26:25 which makes things harder
00:26:25 That can happen, but each URL only gets saved once per 45 minutes.
00:26:31 By SPN, anyway.
00:26:33 just checked one i'd checked before and the images are still broken
00:26:59 https://web.archive.org/web/20230527003724/https://64.media.tumblr.com/5b7a205a9cff48315c7b1d72e5ec6315/365671295079d02a-5f/s1280x1920/425ba3301aa42c89e9610c308b96960583b2b47c.jpg
00:27:00 https://web.archive.org/web/20230604220454/https://64.media.tumblr.com/5b7a205a9cff48315c7b1d72e5ec6315/365671295079d02a-5f/s1280x1920/425ba3301aa42c89e9610c308b96960583b2b47c.jpg
00:27:44 oh, interesting. let me see what saved those
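Following up on the CDX API suggestion earlier: one way to filter the 60k list down to URLs that are either unsaved or whose newest capture is the HTML wrapper is to ask the CDX server for the most recent capture of each URL and look at its MIME type (image/* vs text/html), rather than scraping wrapper-page titles. This is a rough sketch only; the endpoint and field names are the publicly documented CDX ones, but rate limits, revisit records, and exact behaviour should be checked before running it over 60k URLs.

```python
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def needs_rearchiving(url):
    """True if the URL has no capture at all, or its newest capture is not served as an image."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,mimetype,statuscode",
        "limit": "-1",  # negative limit = return only the newest capture(s)
    }
    resp = requests.get(CDX, params=params, timeout=60)
    rows = resp.json() if resp.text.strip() else []
    if len(rows) < 2:        # only the header row (or nothing) -> never archived
        return True
    _, mimetype, _ = rows[-1]
    # Note: revisit records ("warc/revisit") may need extra handling in practice.
    return not mimetype.startswith("image/")   # text/html -> the broken wrapper page

with open("tumblr_media_urls.txt") as f:
    for line in f:
        url = line.strip()
        if url and needs_rearchiving(url):
            print(url)
```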
00:28:00 the working ones are archivebot
00:28:10 the broken one is SPN, i did that one earlier today
00:28:22 the problem here is that if you archive an image "properly", then https://web.archive.org/https://64.media.tumblr.com/whatever/whatever.jpg will take you to the latest snapshot and will look correct, *until* something else causes it to get archived again x_x
00:28:24 it says no collection info, i did it with the addon though i think
00:28:33 yeahhh
00:28:39 and then the latest snapshot is the stupid wrapper page again
00:29:20 the wrapper pages are useful because they have the URL of the original post, but they shouldn't be the last saved copy of an image because it then doesn't display, and it breaks embedding
00:29:42 ideally i'd make sure the last saved copy is one that's not the broken wrapper, and then prevent further saving if it's going to save the wrapper over it...
00:29:46 * the wrapper version over it...
00:29:52 not sure if that's possible though
00:31:52 That still doesn't help, because when you load a page, it will embed the snapshot of the image that's closest in time to the page's.
00:32:06 So you can still end up with broken pages everywhere.
00:32:16 ah right, ughh
00:33:23 It's fun when you SPN something, it looks fine, and then some days later it breaks. Because it was actually embedding an old working snapshot of something (like an image or stylesheet), and in the following days, that something got rearchived in a broken state.
00:33:27 in which case, ideally i'd just keep one copy of the wrapper because of the usefulness of the original post being linked, and convert all the rest into proper images or just wipe them all (and somehow keep the wrapper copy further away from the post than any images...)
00:33:43 damn. does archivebot not have this issue?
00:34:11 It's purely a WBM issue on playback because it mixes various data sources and uses that 'closest timestamp' stuff.
00:34:25 well, for me what matters is just that the images and other data is somewhere on the archive and that i can access it with some inspect element digging from the post
00:34:30 Isolated archives from AB don't have this problem, but when they're in the WBM, it can still happen.
00:34:37 imer: in any case thank you :)
00:34:37 makes sense
00:35:38 Worth mentioning that AB does breadth-first recursion, so embedded images are sometimes archived *much* later than the page.
00:35:41 so... i suppose i could just run my list of urls that may-or-may-not be broken through archivebot to make sure there's at least one working backup somewhere? and hope the WBM team figures out some solution for this later...
00:35:43 Like, can be weeks later.
00:35:53 good to know
00:36:05 Again, not a problem in isolation, can be a problem in the WBM or if the embedded things vanish in the meantime.
00:36:15 * alexshpilkin just went to a NixOS channel for a second and ended up investigating *a bug in bash* of all things for two hours, sorry imer
00:36:26 i took the time to get the direct urls so they'd be prioritised now as they're most at risk (aside from people just deleting posts before i can get to them, which is annoying)
00:37:34 JAA: for a moment I thought, wouldn't breadth first get the images before recursing deep into links? but that's assuming the page is the root of the tree...
00:38:00 alexshpilkin: Heh, I've encountered a bunch of weird stuff in Bash that turned out to be intentional/correct behaviour, but I found my first bug a couple weeks ago that also made me bang my head against the wall for hours. (Still need to write an email to bug-bash though.)
00:38:04 for the moment then... could i get some help running the url list through archivebot? not sure how long it'll take and how many resources for 60k direct image urls, hopefully not that much. i've got to leave town again tomorrow so i can't get started on setting anything up myself sadly
00:38:38 nicolas17: There is no distinction between links and page requisites as far as the recursion is concerned.
00:39:20 They both just get added to the end of the queue.
00:39:25 yeah
00:39:32 (maybe there should be a distinction)
00:39:44 hlgs|m: Well, as nicolas17 said, upload a list. :-)
00:39:48 https://transfer.archivete.am/gTKal/tumblr_media_urls.txt
00:40:03 I just meant, if you start on a page, you'd soon get its images (and links), before going into a rabbit hole following links
00:40:05 alexshpilkin: no worries, still waiting on an ftp listing to finish (that had some random uiuc.edu mirrors) and then i'm out of leads unfortunately
00:40:29 thank you all so much for being so helpful, by the way. been stressful doing so much emergency archival over the past month but you here have taken some weight off my shoulders
00:40:48 nicolas17: That's true at the beginning, but when the queue is already in the millions, well, it'll take a while until it gets to those images.
00:40:50 but that assumes that page is where you *start*, if you're several levels deep it won't work that way, it has to get all the level n links from all sorts of unrelated pages before it even starts with n+1 where the image is :)
00:41:20 But URL prioritisation is something I've partially implemented, and prioritising page requisites is high on the wishlist.
00:41:30 Soon™
00:42:43 woo
00:42:50 imer: that's honestly $leads leads more than I expected, so cheers
00:43:16 okay, thanks for the help, going afk for a while now
00:43:22 the csrd.uiuc.edu seems to have had different subdomains under that over the years fwiw
00:43:32 * FTP seems
00:44:28 a note from 2000 mentions sp2.csrd.uiuc.edu for example
01:24:38 SketchCow: just a heads up the discord invite link expired, unsure if that's intentional tho
01:41:02 JAA: the secret ingredient is getting someone else to write and send the email for you :)
01:41:26 (to be fair, that person discovered the bug)
01:42:26 And miss out on all that street cred‽
02:03:16 ... send and cc you on it :P
02:35:05 Hello all, a few weeks ago I uploaded https://archive.org/details/epsonianos which contains a WARC file from epsonianos.com, but when I checked https://web.archive.org/web/collections/20180000000000*/http://epsonianos.com/ it seems that its content has not been indexed yet. Why is this? I made sure to set "mediatype:web" when I created the item.
02:38:30 WARCs uploaded by regular users to regular collections don't appear in the WBM
02:39:26 as said earlier today here, "items need to go into special collections to be ingested into the WBM. Because allowing anyone to ingest WARCs allows anyone to fake WBM snapshots, you need special permissions for those collections."
02:40:38 Well, how do I make them appear or who do I have to contact for that?
02:41:14 how do people know your WARC is a legitimate and accurate archive of the website? :)
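To illustrate the breadth-first point above: in a simple breadth-first crawler, newly discovered URLs (whether they are links or page requisites like images) all go to the back of a single FIFO queue, so an image discovered deep in a big crawl is only fetched after everything queued before it. This is a toy sketch, not ArchiveBot's actual code, just a minimal illustration of the queueing behaviour being described.

```python
from collections import deque

def breadth_first_crawl(start_url, fetch, extract_links, extract_requisites):
    """Toy breadth-first crawler: links and page requisites share one FIFO queue."""
    queue = deque([start_url])
    seen = {start_url}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        # Links and requisites (images, stylesheets, ...) are treated identically:
        # both are appended to the end of the queue. Once millions of URLs are
        # already queued, an image discovered now may not be fetched until much later.
        for new_url in extract_links(page) + extract_requisites(page):
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
```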
02:42:44 Perhaps by checking the file type and reviewing the first few lines of the file (in addition to the CDX file)?
02:55:00 Ragtag actually sounds kind of fun to archive (as someone that doesn't have to code it). I've been liking the mediafire archive for a while since it's much heavier on filesize than other projects. Makes it feel like i'm contributing more
03:00:27 Rotietip: Unfortunately the WBM only allows certain people to put WARCs into Wayback Machine-ingesting collections. That's because there's no good way to tell if a WARC file has been modified.
03:00:42 If they let just anyone put stuff into the WBM, then someone could fake a snapshot.
03:08:53 I'm not sure how hard it would be to set up something that could grab these quickly, but this spreadsheet has links to direct downloads for every mkv in their database apparently. https://ragtag.link/archive-videos
03:14:24 Is that something that could get sent into urls in small batches, or to AB in batches?
03:28:49 We're not going to archive 1.4 PB through AB.
03:29:12 (Amazing that I actually have to type that out.)
03:29:46 All of AB's 9-year crawls are only 3.1 PiB...
03:29:48 yeah that was a dumb question
03:29:57 for starters 1.4 PB is way into "formally ask Internet Archive for approval" range
03:30:21 More into 'we need to filter this down to something that's actually reasonable' territory.
03:30:33 the imgur project is at 520 TB
03:30:35 I'll probably make the discord link perm soon.
03:31:00 Do we know the size of the videos that are no longer on YouTube?
03:32:53 thanks sketch
03:44:55 At a glance, it seems every video in that list is either private or unlisted. It'll take me a minute, but i can try to see if I can narrow it down to only private or deleted videos
03:50:34 The total size of all unlisted and private videos seems to be 49072 GB
03:57:59 aaaaugh how do I find someone's tweet after he disabled his twitter account because of Musk? >_<
04:22:58 TheTechRobo, nicolas17 had mentioned that an item must be in certain collections in order to be indexed in the Wayback Machine. Is there a way to contact the owners of some of those collections to ask them to add an item?
04:25:54 JAA: we need help explaining this :p
04:29:58 "Accepting WARCs from random people would make the WBM useless because anyone could insert manipulated data. You can still upload them to IA, but they won't be in the WBM."
04:35:33 That's why I was asking if there is a way to request permission or something like that.
04:38:09 That'd still be a random person asking a verified person to upload it to IA for them. Same problem
04:38:36 Anyway, another approach occurs to me. Do you know any online viewer for WARC files? I tried https://replayweb.page/ but when I try to upload the file from Internet Archive I get this error: "An unexpected error occured: TypeError: Failed to fetch"
04:39:17 Rotietip: if someone could give you permission, how would they know they can trust you and your data?
04:42:36 In the case of epsonianos.com, just check the CDX; there you can see that it is a forum that I downloaded in 2018 and that what currently appears is the hosting's default page.
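A rough sketch of the kind of filtering described above: going through the Ragtag list, keeping only videos that no longer return HTTP 200 from YouTube, and summing their sizes. The column names (`video_id`, `size_bytes`) are hypothetical placeholders and would need to be adjusted to whatever the actual spreadsheet/CSV export uses; the oEmbed check is just one plausible way to test availability, not necessarily how the list in the chat was produced.

```python
import csv
import requests

INPUT = "ragtag_videos.csv"          # assumed columns: video_id, size_bytes, download_url
OUTPUT = "ragtag-non-200-videos.csv"

def youtube_status(video_id):
    """HTTP status of YouTube's oEmbed endpoint; 200 means the video is still publicly available."""
    r = requests.get(
        "https://www.youtube.com/oembed",
        params={"url": f"https://www.youtube.com/watch?v={video_id}", "format": "json"},
        timeout=30,
    )
    return r.status_code

total_bytes = 0
with open(INPUT, newline="") as infile, open(OUTPUT, "w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if youtube_status(row["video_id"]) != 200:   # private/deleted/etc.
            writer.writerow(row)
            total_bytes += int(row["size_bytes"])

print(f"Total size of unavailable videos: {total_bytes / 1e9:.0f} GB")
```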
05:41:43 Reddit's organising a strike https://news.ycombinator.com/item?id=36187705
05:41:55 https://old.reddit.com/r/LifeProTips/comments/140b6q6/rlifeprotips_will_be_going_dark_from_june_1214_in/
05:42:27 it has been talked about in #shreddit
05:42:38 pabs: https://news.ycombinator.com/item?id=36192312 "My Reddit account was banned after adding my subs to the protest"
05:43:02 ah
06:42:05 JAA: https://transfer.archivete.am/uK5k0/ragtag-non-200-videos.csv Here's a csv of every video that's privated or deleted from that list, its equivalent youtube id, and its filesize. Total is 38,574 GB. And here's a plain txt of all the direct download urls, https://transfer.archivete.am/HA8XK/ragtag_deleted_videos.txt
06:43:13 It took a little longer than a minute, but I don't actually know python. Just a chatgpt wizard apparently
08:15:24 Just out of curiosity, can ArchiveBot be given a referrer and a cookie when downloading something?
08:15:59 no for the cookie, I assume also no for the referrer.
08:20:18 Grab-site can do cookies
08:21:00 But if you wanna do huge archives like 1.4 PB, that'd require talking to the IA first of all
08:22:42 And it's well within DPoS territory at that point
08:22:48 No way AB can do it
08:26:51 !a http://1945.melbourne/ --concurrency 1
08:27:08 Wrong channel
08:28:36 does ArchiveTeam have any way to archive rsync servers? e.g. these datasets: rsync sourceware.org::
08:39:12 The thing is, I would like to archive some boards from https://8chan.moe/ so they're accessible from the Wayback Machine, but this can only be achieved by sending https://8chan.moe as the referrer and the cookie disclaimer2=1. How feasible would it be to do this? Because otherwise it shows a fucking disclaimer like http://web.archive.org/web/20230413194626/https://8chan.moe/ and all the images and thumbnails look the same as https://8chan.moe/.media/t_1e724d164bee05ea1d9c2c069172b916212f6742e07b6230194d3a4bb34f953a
08:39:12 According to my estimates, what I am interested in archiving would be between 70 and 80 GB (although if you want to archive the whole site I won't stop you).
09:19:34 those sites are the worst. and a WARC to download on archive.org is better than no archive at all.
17:34:46 Would it be redundant to mention the ragtag archive in #archiveteam at this point?
17:50:32 not sure. I pasted a list of every private or deleted url from their site
17:51:03 I think JAA'll be on it when they get the time
17:52:12 i didn't follow this
17:54:27 can someone please give me a tl;dr on ragtag?
17:56:27 vtuber archive, around 1.4 PB, has a lot of lost media
17:56:47 all of it yt i think
17:59:16 shutting down on or before july 24, dumped full database with metadata and has "compiled a list of videos that are no longer available on YouTube, but are still available in Ragtag Archive" in a csv
17:59:31 thank you
17:59:48 np
18:57:06 arkiver: And ~38 TB for the stuff on Ragtag Archive that's no longer on YouTube, which would be more feasible.
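Regarding the 8chan.moe question above: a quick way to check whether the Referer header and the disclaimer2=1 cookie are really all that's needed is to fetch a board with and without them and compare the responses. A minimal sketch; the board URL and the "disclaimer" substring check are assumptions for illustration, and this says nothing about how ArchiveBot or grab-site would actually be configured.

```python
import requests

# Example board URL; adjust as needed.
URL = "https://8chan.moe/b/"

# Plain request: expected to return the disclaimer page.
plain = requests.get(URL, timeout=30)

# Request with the Referer header and the disclaimer2=1 cookie mentioned above.
with_cookie = requests.get(
    URL,
    headers={"Referer": "https://8chan.moe/"},
    cookies={"disclaimer2": "1"},
    timeout=30,
)

for name, resp in (("plain", plain), ("with referer + cookie", with_cookie)):
    print(f"{name}: HTTP {resp.status_code}, {len(resp.content)} bytes,"
          f" looks like disclaimer page: {'disclaimer' in resp.text.lower()}")
```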