00:10:13 I guess we'll need LaTeX in our wiki soon.
00:19:53 😏
00:22:00 mfw when to read an archiveteam article you need a math degree
00:22:01 :P
00:25:45 TheTechRobo: here's one more for you https://wiki.archiveteam.org/index.php/Famous_Internet_videos
00:46:19 Hello, is there someone knowledgeable about the yahoo video archives who could help me find which tar file a specific user ID's videos are in? I've tried downloading file lists generated by the Internet Archive through a script, but they seem to be incomplete
01:09:51 where are those?
01:11:37 rootliam: are user IDs numeric?
01:33:27 Yeah, the one I'm looking for is 375869, and according to the wiki they got 0,300,000 - 0,400,000, so it should be in there somewhere
01:44:56 I can't find that range :|
01:59:26 It seems like they got split into a bunch of files, but the ranges on some of them don't even make sense, like 00045002-07439897
02:01:10 Maybe the only option is to just download all of them, but I don't have the space or bandwidth for that
02:01:23 .
02:03:04 How much temp storage do we have left? I wanna be aware if I can feed some more telegram stuff in
02:03:23 rewby:
02:05:18 rootliam: you could shoot underscor a message? most recently active account i can find is https://github.com/ab2525
02:06:23 only the giant 00000024-02442192 and 00045002-07439897 ranges would fit that number
02:07:50 it's not in either of those; they both just have stray extras at the beginning--the 'real' ranges are 2400026-2442192 and 7400007-7439897 respectively
02:10:28 I'll get their full file listings anyway, since I already started :P
02:10:57 yahoovideo-00000024-02442192.tar will take 8h to download oof
02:15:04 of
02:15:06 oof
02:57:40 https://archive.org/download/ARCHIVETEAM-YV-008650000-008699944 no file extension?
02:57:43 these items are such a mess
03:10:36 omg 256GB in a single .tar
03:19:15 will a single continuous 256GB download from IA survive to completion?
03:19:29 we'll find out... in 48h or so
03:21:43 I actually downloaded the extensionless/bz2 ones and made a program to extract the html files/file positions
03:27:19 I would like to try and make a search engine for the archives, besides just finding this specific user's uploads, but that 2TB was all I had space for and it took multiple months to download, so if you'd like I could upload the program and the data I extracted tomorrow
03:31:03 for now I'm doing "curl | tar tv" to get the file listings
03:31:45 nicolas17: aria2 and multiple threads? haha
03:32:02 fireonlive: that would require actually having the disk space to store the whole tar ;)
03:32:10 ahh :D
03:32:13 true :
03:32:14 :3
03:32:31 which I can do for some of these
03:32:34 but not the 256GB beasts
03:35:09 i mean... is there something wrong with eg https://ia800407.us.archive.org/view_archive.php?archive=/0/items/ARCHIVETEAM-YV-0000000-0001000/ARCHIVETEAM-YV-0000000-0001000.zip ?
03:35:14 Does tar tv store the position of the file inside the tar itself? My idea was to have javascript get the flv out of the tar with a range request
03:36:39 thuban: for tars it seems to be highly incomplete
03:37:02 for zips I can use remotezip to get the file list without downloading the whole thing
03:37:30 oh, the tar listing is itself incomplete? i thought the problem was just that not all the yahoo video data made it to ia
03:37:37 that *too* :P
03:37:51 fucked up
03:38:30 I think view_archive has to do the same linear process as 'tar tv', so it takes forever and soon gives up?
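A note on the zip side of this (03:37:02): the zip format keeps its central directory at the end of the archive, so remotezip can fetch a complete file list with a couple of HTTP range requests instead of a full download, which is why the zips are easy to list while the tars need a linear scan. A minimal sketch in Python, assuming the standard archive.org direct-download URL for the item linked at 03:35:09:

```python
# Minimal sketch of the remotezip approach (pip install remotezip).
# The direct-download URL is assumed from the view_archive.php link above.
from remotezip import RemoteZip

url = ("https://archive.org/download/ARCHIVETEAM-YV-0000000-0001000/"
       "ARCHIVETEAM-YV-0000000-0001000.zip")

with RemoteZip(url) as z:  # reads only the central directory, via range requests
    for info in z.infolist():
        print(info.filename, info.file_size)
```

Since RemoteZip subclasses zipfile.ZipFile, a single member can also be pulled out with z.open(name) without downloading the rest of the archive.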
03:38:42 quite possible
03:39:04 regardless, it would be nice to track down the rest of that dataset...
03:39:09 absolutely
03:39:44 someone also mentioned repacking the tars into a more usable format, but I don't know if that was about yahoo video or about another project of a similar era
04:02:11 thuban: pokechu22: so https://transfer.archivete.am/l2Sws/orangefr_pagespro_scrubbed.txt.zst was also already run through AB?
04:02:31 i see some lists are running through AB now - is that all the lists? so, anything not finished here is still 'to be done'?
04:02:39 or is there now more elsewhere still?
04:03:01 arkiver: it's running now, job 9t2wjsmi7nb9izhv58mnj4tnf
04:03:13 that's all the lists, yes
04:04:34 oh next time it might be good to run that though with --no-offsite-links
04:04:37 through*
04:05:20 what response do you get if it rate limits you?
04:05:24 yeah, i wondered about that :S
04:05:55 (iirc you can do it the hacky way with a negative-lookahead ignore, but i don't have voice anyway, so)
04:06:05 6 AM here, so really need to get some sleep, but we'll get a little emergency project up, hopefully it'll finish in time
04:06:11 thuban: maybe JAA knows about that
04:06:22 * arkiver has very little AB experience
04:06:43 don't recall about the rate-limiting, i'm afraid, but i'm pretty sure it's not 429
04:06:47 pokechu22 should know
04:13:52 You get timeouts for 24 hours
04:14:03 and/or refused connections
04:14:43 with no warning ahead of time (but it's generally OK with you running above the rate limit for a bit if you stop afterwards, it seems)
04:15:19 I didn't use --no-offsite-links because I expect it interacts with !a < list on multiple domains in weird ways
04:17:32 I'll add an ignore that prevents it from using URLs without orange in the domain
04:18:05 hmm, I think having the size of every file inside a tar, I can calculate the absolute position of files
04:21:35 added ^https?://(?![^/]*(orange|wanadoo)[^/]*)[^/]*/
04:25:34 pokechu22: you also need woopic.com :/
04:27:37 oh
04:28:37 it's their cdn
04:28:52 well, let's just hope that nothing much was missed by that - if there's still time, we can recheck skipped URLs (and we might as well do outlinks afterwards anyways)
04:29:19 was thinking the same thing
04:30:10 ty for the fix :)
04:31:13 It's possible to requeue skipped URLs but that's a difficult process
05:04:21 do we have a way of saving Twitter at the moment? https://nitter.net/steveharwell died
05:08:16 i've seen people using nitter.net or nitter.cz but last i heard they're rate limiting so you have to go slow
05:09:15 I could get a tar file listing somewhat faster by skipping over the actual file data
05:10:04 but it seems there's a yahoo-videos tar where that would be *especially* beneficial, because the videos inside are all like 100MB+, but I can't do it because it's .bz2 /o\
07:13:35 Wessel1512 edited Deathwatch (+333, /* 2023 */): https://wiki.archiveteam.org/?diff=50713&oldid=50710
08:42:36 interesting, a Google referrer makes twitter user pages public
08:44:50 pabs: does that include privated accounts?
08:46:13 orange.fr is dead
08:47:03 I confirm. ☚️
08:51:48 flashfire42: not sure, got an example account?
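On the idea from 03:35:14 and 04:18:05 (computing absolute file positions inside a tar so a single flv can be fetched with a range request): tar has no index, but every member is a 512-byte header followed by data padded to a 512-byte boundary, so the ordered name/size pairs from a "tar tv" listing are enough to reconstruct every offset. A sketch of the arithmetic, assuming plain ustar entries; GNU/PAX long-name extensions insert extra 512-byte records that a listing alone won't reveal:

```python
# Sketch: recover absolute data offsets from an ordered (name, size) listing,
# assuming plain ustar members with no GNU/PAX long-name records.
BLOCK = 512

def member_offsets(members):
    """Yield (name, data_offset, size) for each member in archive order."""
    pos = 0
    for name, size in members:
        yield name, pos + BLOCK, size  # data starts right after the header block
        pos += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK  # header + padded data

# One video could then be fetched with an HTTP range request, e.g.:
#   requests.get(tar_url, headers={"Range": f"bytes={off}-{off + size - 1}"})
```

The same layout is what makes the 05:09:15 trick work (seek past each data region and read only the headers); a .bz2 wrapper defeats it because the compressed stream isn't seekable.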
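The ignore added at 04:21:35 works by negative lookahead: after the scheme, the pattern only matches (and therefore only ignores) URLs whose host contains neither "orange" nor "wanadoo". A quick check of its behavior; the wanadoo and woopic paths here are made-up examples:

```python
import re

# ArchiveBot ignore from 04:21:35: skip URLs whose host contains
# neither "orange" nor "wanadoo".
IGNORE = re.compile(r"^https?://(?![^/]*(orange|wanadoo)[^/]*)[^/]*/")

for url in [
    "http://laugerie.basse.pagesperso-orange.fr/",  # fetched: host contains "orange"
    "http://perso.wanadoo.fr/some-site/",           # fetched: host contains "wanadoo"
    "http://woopic.com/some-image.jpg",             # ignored: the CDN gap noted at 04:25:34
]:
    print(url, "->", "ignored" if IGNORE.match(url) else "fetched")
```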
09:32:04 hi
09:33:13 pokechu22, thuban: someone from orange wrote me back, to say something like "orange's pagesperso has been suspended today, but we've acknowledged your request"
09:33:34 "we'll try to re-up the pages until the 30th of september, and then they will be down forever"
09:34:12 I don't have much more info, but we may have got more time
10:30:41 flashfire42, thuban: ^
12:11:15 Nice, good thing you asked :)
12:13:27 it's not done yet (fingers crossed)
12:14:08 https://mastodon.0011.lt is shutting down tomorrow (2023-09-06): https://mastodon.0011.lt/@au0/110939313852541543 . JAA, arkiver: what's the current status on whether or not to archive mastodon instances (asking you since you were the ones who edited the wiki page)?
13:37:31 qwertyasdfuiopghjkl: let's put it in #archivebot
13:53:43 arkiver: to clarify, do you mean the conversation or the site?
14:32:37 if it's shutting down, let's put it in AB
14:39:19 ok, thanks :)
14:55:07 arkiver: Looks like ArchiveBot no longer works for mastodon instances due to JS :(
15:14:07 Is there any other way to archive it?
16:03:17 orange is down :(
16:15:27 arkiver: See above (9:32), might come back
16:23:04 It seems like it's partially online again?
16:29:40 ah, or it just redirects to a dead domain - I can't load it currently :|
16:41:24 plcp: that would be great news!
16:41:31 we can make a proper full copy then
16:41:35 fingers crossed :)
16:41:36 imer: thanks for letting me know!
16:41:42 yeah :)
16:56:27 I'm currently letting the archivebot jobs drain rather than saving the redirects to the dead domain
16:56:50 would probably be good to pause them and resume when the sites are back?
16:57:34 I guess - I thought the image domain was still alive so it would be useful to do those, but it's not
16:59:57 yeah, dead redirs to end.pagesperso-orange.fr for now
17:00:40 Can you check the difference between http://laugerie.basse.pagesperso-orange.fr/ and http://laughing.gif.pagesperso-orange.fr/ for instance?
17:02:21 first is 302 as it was an existing site before, second is 404 as it doesn't exist?
17:02:42 Yeah, that's what I'd expect
17:03:07 I hope that orange guy didn't give me false hopes tbh
17:03:09 so it's probably somewhat useful to leave alone the job that didn't finish iterating through all the possible sites, just to record whether a site existed or not, even if it's not going to get anything
17:03:10 can only wait & see
17:06:27 Unrelated to that - what's the deal with free.fr? Is that ISP hosting as well, or is it something else?
17:07:07 it is ISP hosting
17:07:38 basically the same deal as orange's pagesperso: at some point they will pull the plug, but for now it's staying alive
17:07:45 Alright, we should probably do something about that sooner rather than later then
17:07:58 yup, lots of old personal webpages there
17:08:19 and most active orange pagesperso users migrated to free.fr btw :o)
17:08:36 also saw several sitew.com ones
17:09:12 as well as other misc web hosting services (most of the time these "create your website for free" ones)
18:29:32 arkiver: (mentioning it again in case you missed it, since the shutdown is tomorrow) ArchiveBot couldn't get any posts or users on https://mastodon.0011.lt/ , even when starting from https://mastodon.0011.lt/about , as recommended in https://wiki.archiveteam.org/index.php/Mastodon . I'm guessing Mastodon previously worked without JS, and the wiki page is outdated.
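The 17:00:40 comparison is easy to script: a pagesperso site that existed now answers 302 (redirecting to the end.pagesperso-orange.fr shutdown page), while a name that never existed answers 404, so the status code alone records whether a site was ever there. A minimal sketch with Python's requests:

```python
import requests

# Existence check from 17:00:40-17:02:21: former sites 302 to the
# shutdown domain, never-existing names 404.
for url in ("http://laugerie.basse.pagesperso-orange.fr/",
            "http://laughing.gif.pagesperso-orange.fr/"):
    r = requests.get(url, allow_redirects=False, timeout=30)
    print(url, r.status_code, r.headers.get("Location", ""))
```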
18:59:44 I do remember hearing that mastodon worked without JS in the past and that they removed that at some point
18:59:56 me too
19:04:22 https://github.com/mastodon/mastodon/issues/19953
19:05:11 as well as https://github.com/mastodon/mastodon/issues/23153
19:07:16 mastodon seems clunky
19:12:52 lots of new things don't really care about not having javascript, because 'who turns it off'
20:49:16 hackint's not having a good day, is it
21:54:03 nicolas17: Cute. There's a 650 GiB tar from me on IA somewhere.
21:57:40 qwertyasdfuiopghjkl: Yeah, I think Mastodon would require special tools now, thanks to the JS without fallback.
23:03:55 How is ingestion going?
23:26:02 what about archiving gender-critical websites? it would be useful for scholars in the future to study the anti-trans websites of today
23:29:47 what about it?
23:30:03 give specific websites and someone will add them to archivebot, I guess
23:30:11 I am not sure anti-trans websites are what "gender-critical" means
23:30:14 ^
23:30:22 Regardless, bigotry-a-go-go sites get archived all the time
23:30:37 ^ to give websites
23:30:43 ovarit.com
23:31:26 supposedly "women-centered" but in reality it's an echo chamber gatekept through invite codes
23:34:43 there was also kiwi farms, but i think it would have to have all the personal information redacted by AI
23:35:44 yeah, I don't think it's worth making any specific effort to archive kf
23:36:43 nicolas17: all it really was was pure noise
23:37:23 the operator has announced he'll release an archive if it ever shuts down
23:37:44 archiving the specific doxxing information would be counter-productive, and redacting specific parts of the content is... not something we do, and would be way too much effort anyway
23:38:08 nicolas17: yeah, it would be harder than organizing elections
23:38:20 you would need multiple people to verify everything
23:39:21 I mean, besides the "effort" part, we archive pristine HTTP responses including headers; if you need to modify the html to redact some content then it probably doesn't belong on the Wayback Machine
23:40:14 another site that might be of interest is SEGM.org - an activist organization that overplays the negatives of gender-affirming treatments while underplaying the positives
23:40:43 generally, anything from this list would fit the bill: https://rationalwiki.org/wiki/RationalWiki:Webshites/Gender
23:41:45 lol webshites
23:42:15 It's described as "really bad sources of information - or really good sources of bad information"
23:43:27 Oh yes, archiving kiwifarms, nobody has ever discussed that.
23:44:44 maybe it needs a wiki page to avoid rehashing the argument whenever it comes up
23:45:24 might as well put in homophobic sites at the same time: https://rationalwiki.org/wiki/RationalWiki:Webshites/Sexuality
23:49:40 that one is pretty vile: https://www.ihmistenkirjo.net/english