00:00:45 JAABot edited CurrentWarriorProject (+2): https://wiki.archiveteam.org/?diff=51228&oldid=51227
00:45:43 smol change: the twitter2nitter, transferinliner, and karma systems now ignore lines starting with !, so they won't go off if you're using a bot command (thanks project10); also, 'known bots' (h2ibot, botifico, and Aramaki) are skipped by them
06:33:15 Sanqui: Just a brief update, all of those linked Webzdarma jobs total 4.5 TiB, so it'll take a while to download them all, even at the 60 MB/s I'm getting from right next to IA.
06:36:10 Arctic Circle System edited Alive... OR ARE THEY (+383, /* Endangered */ Added Kirby's Rainbow Resort): https://wiki.archiveteam.org/?diff=51229&oldid=51031
09:51:44 Thanks JAA. Problem is the ones that had offsite links; sadly, not enough foresight there. In the long term we will be making and keeping our own copies
14:04:10 Sanqui: would you like me to run that through the Common Crawl CDX? I have it lying around, and from a quick spot check there are some matching links in there
14:51:30 imer: yes please, ^https?://(www.)?(uloz.to|ulozto.cz|ulozto.sk|ulozto.net|zachowajto.pl)
14:58:31 I could try the fdns data set I have
16:00:54 Sanqui: ack, it will be a few days to run through it all
16:26:01 imer: the deadline is tomorrow, so probably no need then
16:26:08 thanks though
16:26:19 maybe if it's possible to run it on a subset of .cz sites
16:26:22 (and .sk)
16:26:24 it would make sense
16:27:02 oh. oops
16:27:18 i'll toss the partial results over to you as I get them, then
17:28:18 Sanqui: Sometime in the future, all AB jobs' databases should be kept, and then this wouldn't be an issue. wpull still extracts all links when running with --no-offsite-links; it just ignores them silently, so they only appear in the DB.
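
For reference, a minimal sketch of the Common Crawl CDX filtering discussed above (14:04–16:27), assuming a local directory of gzipped CDXJ shards where each line is "<SURT> <timestamp> <JSON>" with the original URL inside the JSON blob. The paths are placeholders, and the dots from Sanqui's pattern are escaped here, since the pattern as pasted would also match stray characters in place of the dots:

    #!/usr/bin/env python3
    # Filter local Common Crawl CDXJ shards against Sanqui's uloz.to
    # pattern (14:51:30). Directory layout and file naming are assumptions.
    import gzip
    import json
    import re
    import sys
    from pathlib import Path

    PATTERN = re.compile(
        r"^https?://(www\.)?(uloz\.to|ulozto\.cz|ulozto\.sk|ulozto\.net|zachowajto\.pl)"
    )

    def scan(cdx_dir: str) -> None:
        for path in sorted(Path(cdx_dir).glob("*.gz")):
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    try:
                        record = json.loads(line.split(" ", 2)[2])
                    except (IndexError, json.JSONDecodeError):
                        continue  # skip malformed lines
                    url = record.get("url", "")
                    if PATTERN.match(url):
                        print(url)

    if __name__ == "__main__":
        scan(sys.argv[1])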
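Similarly, a sketch of what "they only appear in the DB" (17:28:18) could look like in practice: pulling the silently-ignored offsite links back out of a wpull SQLite database. The table name, column names, and the 'skipped' status value are assumptions (wpull's schema differs between versions), so check the real layout with `sqlite3 job.db .schema` first:

    #!/usr/bin/env python3
    # Dump offsite links that wpull recorded but never fetched when run
    # with --no-offsite-links. Schema details below are assumptions.
    import sqlite3
    import sys

    def dump_skipped(db_path: str) -> None:
        con = sqlite3.connect(db_path)
        for (url,) in con.execute("SELECT url FROM urls WHERE status = 'skipped'"):
            print(url)
        con.close()

    if __name__ == "__main__":
        dump_skipped(sys.argv[1])
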
17:35:25 Can these sorts of links be put into AB? This person passed away, and if possible I'd like to have these pages saved. Also, can AB grab a YouTube channel? Just the pages, not the videos. I already put it into downthetube
17:35:25 https://www.instagram.com/chesyarts
17:35:25 https://ko-fi.com/chesyarts
17:35:25 https://www.tiktok.com/@chesyarts0w0
17:35:25 https://www.youtube.com/@chesyarts1691
17:37:11 Vokun: I don't think any of those work properly in AB; all of those sites have strict rate-limiting and are JS-based, and AB will only get 429s
17:37:56 rip
17:39:33 youtube can go to #down-the-tube as long as it's in scope https://wiki.archiveteam.org/index.php/YouTube#Scope (someone dying is)
17:48:46 I put it in. Thanks
17:50:00 :)
18:30:45 Pokechu22 edited DokuWiki (+472, mention taskrunner): https://wiki.archiveteam.org/?diff=51230&oldid=51010
19:14:06 the archiveteam wiki page on Bluesky is very short, has anything been done about that?
19:22:39 hello everyone. I might have something for the archivebot if anyone has time to put it in the queue: https://www.summoners-inn.de is the biggest and probably one of the oldest German League of Legends news sites, with articles going back to 2013. today, they announced the end of Summoner's Inn after their parent company Freaks4U lost their partnership
19:22:40 to host the official German League of Legends broadcast.
19:25:05 polduran: I've queued it, though I'm not sure how well it'll run, as they don't seem to have a sitemap
19:26:10 I also queued https://www.freaks4u.de
19:30:31 let's hope for the best^^ thank you. and yeah, good idea ^-^" maybe also the german LoL league?
https://www.primeleague.gg/ not sure if there is anything interesting on there, or whether the situation also affects it, but the website is hosted and copyrighted by freaks4u
19:32:24 Alright
19:33:05 thanks again and have a nice day :D
19:45:01 continuing the discussion from #//; imer: what would be the best way to handle this JS mess?
19:45:27 I can probably write a scraper that'll generate a list of URLs from these downloaders; there isn't much metadata to be saved anyway, so IMO saving just the ZIPs is a good starting point
19:48:09 imer: hey, also, can you verify whether downloader3.html still works? I.. think I crashed it
19:48:28 did you check with devtools how the EULA acceptance is handled?
19:48:30 checked from two IPs and several browsers, no dice
19:48:39 masterX244: on some of them there's no EULA at all
19:48:48 with some luck that can be faked with some headers/constant request stuff
19:48:49 so I'm focusing on that right now
19:49:57 had a site once that had an ad-intercept on the first download under a session; fooled it by "wasting" that request on a URL-parametered URL before the real crawl started
19:50:27 https://f.sakamoto.pl/UwUMicKuA.png ,_,
19:51:01 2 "wasted" requests in the WARC, but better than a lost one. POST sucks for archivebot though
19:52:04 masterX244: no, no; i'm not getting any responses anymore
19:52:07 oh, it's back now
19:52:26 so what I did was.. I tried a wildcard instead of the version number, just to check what would happen
19:52:41 ahh, poking around for shortcuts
19:52:43 and it seems that it crashed their entire API for a solid minute
19:52:56 so. uh. we need to be careful around this one XD
19:54:28 cockroach-infested area :(, that sucks
19:55:39 btw, how does WARC work? I know that I can run a mitm proxy for myself, but how would I go about handing it over to IA? what are the steps/precautions/who do I need to talk to...? :p
19:55:54 "you don't"
19:56:44 you can upload WARC files to archive.org, but they won't be used by web.archive.org, because there's no way to know whether they actually match the website you mirrored or whether you messed with the content (accidentally or intentionally)
19:56:54 yes, that I know
19:57:25 i was more asking about... what steps do I take to actually get the content preserved with y'all's help?
20:04:00 a project/mini-project proposal, let's say :3
20:05:56 figured out how the EULA stuff works! it's a static JS function that takes params from the current URL
20:06:05 so this is very much possible to automate
20:06:27 function in question: https://pastebin.com/9bsxLDLu
20:38:02 sdomi: sorry, stepped away for a bit, I have not the slightest idea how to do this - although I am probably not the person to ask haha
20:38:15 imer: writing a scraper as we speak :p
20:38:18 nice
20:57:52 could you help me find an archive of this video https://www.youtube.com/watch?v=V3gbrP2U10A ?
21:05:05 #youtubearchive would be a fitting channel for that question
21:06:26 alright, thank you!
23:22:56 https://pastebin.com/gAwF2bwc URLs
23:34:55 https://f.sakamoto.pl/nvidia_rescue.tar.gz here's the code I wrote
23:36:49 turns out that most docs URLs are completely dead already, or point to generic sites that have likely been archived for ages. i'm downloading the real "data" locally right now, gonna upload it as an item onto IA later ^-^
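
A sketch of masterX244's ad-intercept workaround from 19:49:57: burn the session's first download on a throwaway URL-parametered request so the real fetch goes through clean. The site behavior and the query parameter name are illustrative assumptions, not taken from any specific target:

    #!/usr/bin/env python3
    # "Waste" the first request of a session on a junk variant of the URL
    # so a first-download ad-intercept fires on that instead of the real
    # fetch. Both requests still end up in the capture, as noted at 19:51:01.
    import requests

    def fetch_with_waste(real_url: str) -> bytes:
        session = requests.Session()  # one session = one ad-intercept to burn
        # Naive: assumes real_url has no query string of its own.
        session.get(real_url + "?warmup=1", timeout=30)
        response = session.get(real_url, timeout=30)
        response.raise_for_status()
        return response.content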
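On the 19:55:39 question of how WARC works: a WARC file is a sequence of records (request, response, metadata, ...), each with its own WARC headers, and the warcio library can write them directly. Below is a minimal sketch wrapping one fetched page into a response record; the output file name is arbitrary, and, as stated at 19:56:44, such self-made WARCs can be uploaded to archive.org but won't be played back by web.archive.org:

    #!/usr/bin/env python3
    # Write a single fetched page into a gzipped WARC response record
    # using warcio (https://github.com/webrecorder/warcio).
    import requests
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    def save_one_page(url: str, out_path: str = "example.warc.gz") -> None:
        resp = requests.get(url, stream=True)
        with open(out_path, "wb") as fh:
            writer = WARCWriter(fh, gzip=True)
            http_headers = StatusAndHeaders(
                f"{resp.status_code} {resp.reason}",
                resp.raw.headers.items(),
                protocol="HTTP/1.1",
            )
            record = writer.create_warc_record(
                url, "response", payload=resp.raw, http_headers=http_headers
            )
            writer.write_record(record)

    if __name__ == "__main__":
        save_one_page("https://example.com/")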
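And the rough shape of the EULA automation sdomi worked out at 20:05:56. The actual acceptance function is the JS at https://pastebin.com/9bsxLDLu and is not reproduced here; token_for() and the header it feeds are hypothetical stand-ins marking where a Python port of that function would slot into a scraper:

    #!/usr/bin/env python3
    # Outline: since the EULA acceptance is a pure function of the current
    # URL's parameters, a scraper can compute it offline. token_for() is a
    # placeholder for a port of the real JS; the header name is invented.
    from urllib.parse import parse_qs, urlsplit

    import requests

    def token_for(params: dict) -> str:
        # Port the static JS function from the pastebin here.
        raise NotImplementedError

    def download(downloader_url: str) -> bytes:
        query = urlsplit(downloader_url).query
        params = {k: v[0] for k, v in parse_qs(query).items()}
        response = requests.get(
            downloader_url,
            headers={"X-Eula-Token": token_for(params)},  # hypothetical header
            timeout=60,
        )
        response.raise_for_status()
        return response.content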