00:58:32 (resending unless it was archived in a different channel) can someone get these for me, thanks: https://transfer.archivete.am/vABs9/honkaiwiki-newlinks.txt https://transfer.archivete.am/7axZa/honkaiwiki-newfiles.txt
00:59:24 Ivan226: I did those earlier today in #archivebot
00:59:48 ah got it
01:07:40 tomodachi94: in #archivebot, `socialbot: snscrape twitter-profile foo` works a reasonable amount of the time. it can gather the 3200 most recent tweets. the twitter-user option doesn't work at the moment, but there is a fix in snscrape git that isn't released yet
01:18:50 apparently archive.org is being hammered by thousands of AWS instances "downloading the OCR text from our materials"
01:32:23 Doranwen: The Dropbox link recovered, and I'm pulling a copy.
01:36:27 IA S3 stats show a massive drop in uploads about 8 hours ago... did our targets finally catch up or what? :P
02:05:08 oh, does IA store stuff in AWS?
02:05:53 Of course not, but there's an S3-ish interface.
02:06:46 ah! i see :)
02:06:51 i thought that rather odd haha
02:18:36 pabs: thanks!
02:18:44 Can someone snag this one too? https://transfer.archivete.am/1E4dG/tos.txt
02:19:12 archivebot can't do individual posts, but we can snag the user
02:19:44 I just did this: socialbot: snscrape twitter-profile TheOrderofSith
02:20:03 Ah okay, much appreciated.
02:27:21 JAA: Thanks! Glad it did :)
02:38:08 Oh lovely...... (full message at )
06:11:33 can grab-site be resumed in-place from another server? would like to move off this (higher priced) one at some point
06:12:17 'oh i'll only have this server like 4 days... "up 20 days"' where did the time gooo
10:13:22 the last two days have been chaotic for me. i may have missed important messages in channels; if you think I missed something, please ping again
10:25:37 Tomodachi94 created Prnt.sc (+439, Create page): https://wiki.archiveteam.org/?title=Prnt.sc
10:25:38 Manu edited Coronavirus (+159, Add some German case, vaccination (+more) data…): https://wiki.archiveteam.org/?diff=49850&oldid=48266
10:42:36 i've now gone through my youtube archive and picked out all videos that aren't on yt anymore (either private or missing completely) - 4.9k videos / 1.3 TB
10:42:36 mainly looking for some guidance on if/how I should submit those to IA? (the yt-dl command used mostly matches the one on the wiki, except for the info-json which I didn't add at the time)
12:49:26 imer: join #down-the-tube
12:49:50 oh, you said no longer on YT, oops
12:51:48 pabs: well, it's in there as well now haha, probably more appropriate either way :)
12:51:54 thanks
14:16:29 JAA: do you know if we got the hl2dm.net forum?
14:16:38 closing may 31
14:16:42 according to deathwatch
14:25:29 arkiver: pokechu22 did it, according to https://archive.fart.website/archivebot/viewer/job/64nco
15:26:08 pabs: perfect, thank you
19:00:55 Alright, here's a second dump of 14452 URLs related to that group, sourced from a dump of their Discord: https://transfer.archivete.am/KlnAM/urls.txt
19:01:11 It's mostly images.
19:40:16 Entartet edited List of websites excluded from the Wayback Machine (+30, Added patrickcollison.com.): https://wiki.archiveteam.org/?diff=49852&oldid=49833
20:02:41 If I want to run a Warrior on a dedicated machine at home (headlessly via Docker), what would be reasonable specs for it?
20:06:03 manu|m: There's no general answer as it depends entirely on the project. Some projects are CPU-intensive (e.g. sitemap parsing on URLs), some projects require significant disk space (anything with videos, e.g. YouTube), some require a lot of RAM due to recursion (e.g. Enjin, I think)... If you want to run multiple projects at once, consider using the project images rather than the warrior.
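(For reference, a minimal sketch of both approaches mentioned above, based on the run commands the wiki publishes; the image names, the published port, and the --concurrent/nick arguments are assumptions to double-check against the Warrior page and the individual project pages:)

    # Warrior appliance: one container, web UI on port 8001 where the project is picked
    docker run -d --name archiveteam-warrior --restart=unless-stopped \
        --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile

    # Single project image (no web UI); "example-grab" and YOURNICK are placeholders
    docker run -d --name at-example --restart=unless-stopped \
        atdr.meo.ws/archiveteam/example-grab --concurrent 1 YOURNICK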
20:09:06 so different projects/pipelines will pick the warriors they use based on their specs? I'd just like to make use of my internet connection when I don't need it for myself
20:10:32 No, either the warrior runs a specific selected project, or it runs the default project that we set on the tracker side, which is the same for all warriors set to 'ArchiveTeam's choice'.
20:11:35 oh okay, thanks
20:11:59 i'll check out the project pages then
20:13:33 If you want a 'set it up once and forget about it' thing, you'll want the warrior set to AT's choice.
20:13:53 But a dedicated machine for that is a bit overkill.
20:15:39 i'm not getting a second tower or a server rack for that, i just thought it might be a good idea to have it running on a machine that draws a bit less power than my desktop setup
20:20:41 another question: once or twice a year (when there isn't a pandemic going on) i'm attending Chaos events where there are 4-7 days of practically unlimited bandwidth available, and where it's possible to colocate machines. would it be useful to bring a warrior (or more) there to crunch through larger projects, or would that be counter-productive?
20:22:59 depends on the project; sometimes the website being archived has per-IP limits, so having more bandwidth doesn't actually help
21:14:12 Is there a way to make archivebot login-walled content? I want to back up redump.org using archivebot, but some of the content is walled behind the requirement to submit a few discs to the site. It's not exactly public, but it's not exactly private either.
21:14:27 *archivebot archive login-walled content
21:16:04 Archivebot can't do that, no :/ (I think there might be other tools that can (e.g. grab-site installed locally), but I haven't worked with those)
21:16:27 ... hmm, and redump.org isn't loading for me at all, that's not a good sign :|
21:17:01 Yeah, it goes down occasionally, the admin is unresponsive, and the admin is against public backups.
21:17:22 so I think it'd be a good thing to have it in the wayback machine
21:18:08 is there any tool that supports login-walled sites that can be used to ingest data into the wayback machine?
21:18:28 There have been some backups in the past, but without being logged in
21:19:25 yeah, it's just that those miss a lot of data
21:19:34 Looks like the last full run of redump.org was on 2022-05-10, and forum.redump.org was last run on 2023-04-29, but both of those wouldn't have been logged in
21:19:58 For the main redump.org, it'd be missing data for a few more recent systems, and also revision history, right? While forum.redump.org would be missing basically everything, to my understanding
21:19:59 i.e. it misses modern systems, change history, dump submission sub-forums
21:20:22 hah, we said almost the same thing
21:20:42 so yeah, we are on the same page
21:21:33 all that stuff is pretty important to the continued operation of the site, imo
21:22:05 My understanding is that grab-site produces WARCs, and those *can* be ingested into web.archive.org but won't necessarily be by default
21:22:34 https://github.com/webrecorder/browsertrix-crawler supports logins as profiles: https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles
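(A rough sketch of the browser-profile workflow from that README; example.com, the collection name, and the mounted paths are placeholders, and the exact flags and profile path can differ between browsertrix-crawler versions, so verify against the linked docs:)

    # 1. log in interactively once and save the browser profile
    docker run -it -v "$PWD/crawls/profiles:/crawls/profiles" \
        webrecorder/browsertrix-crawler create-login-profile \
        --url "https://example.com/login"

    # 2. crawl with that profile so login-walled pages are captured
    docker run -it -v "$PWD/crawls:/crawls" \
        webrecorder/browsertrix-crawler crawl \
        --url "https://example.com/" --collection example-logged-in \
        --profile /crawls/profiles/profile.tar.gz --generateWACZ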
21:24:49 i assume wayback doesn't ingest stuff made by random people
21:25:30 only things archived by archive team or IA services, or stuff from companies like alexa
21:26:23 My understanding is that yeah, that's roughly the case. Most outsider stuff ends up in https://archive.org/details/warczone
21:26:48 One other aspect to consider is that if you do save login-walled content, every single page will show you being logged in
21:28:02 Data behind logins doesn't make it into the Wayback Machine in general.
21:28:59 That's a relatively hard rule with few exceptions.
21:32:17 i guess i should look into using something like archivebot, and then just hosting the static pages on free hosting so they can be browsed, in addition to the WARCs
21:32:27 (And the exceptions are of a historical nature, e.g. our SPUF project in 2017.)
21:33:53 That sounds reasonable. grab-site is basically like AB but local, and you can give it cookies.
21:34:11 someone did run grab-site, but i think it was a pretty long process
21:34:17 and hard to get working
21:34:53 i should check if the output of that was okay, then maybe i can just set that up to run every week and upload to IA (and github pages/neocities for a browsable version)
21:35:17 Shouldn't be hard to get working unless there's annoying 'DDoS protection' stuff in the way or extensive use of JS, but it certainly won't be fast, yeah.
21:35:59 seems like that grab-site run was incomplete
21:36:05 so the issues with it apparently weren't resolved
21:36:09 Note that making the WARCs publicly accessible might allow others to hijack your account.
21:36:19 i think i'll make a dummy account for it
21:36:53 actually, i believe i know someone who has one like that already
21:37:04 redump.org doesn't support https; I doubt it has proper DDoS protection :P
21:37:28 but yeah, i don't think i want to do it with my account, if only to avoid my name being plastered over it
21:37:58 I assume you'd be grabbing the submission history subforums but not the dumpers subforum, then?
21:38:16 the account could probably get dumper access
21:38:36 i have plenty of low-priority discs (e.g. already verified ps2 shovelware) that i can use
21:39:13 actually, i can probably ask a moderator just to promote an account even without any disc submissions
21:41:26 this command was used for the grab-site attempt: https://bpa.st/6EQWS
21:41:34 the ignore pattern was: http://redump.org/discs/.*?/dumper/.*?
21:41:59 took 33 hours, not too bad
21:42:34 ah, this bug prevented forum attachments from being saved: https://github.com/ArchiveTeam/wpull/issues/291
21:43:49 HTTP/0.9? Eww.
21:44:23 Technically, that can't go into WARC either.
21:45:01 technically?
21:46:43 The spec only permits HTTP/1.1, strictly speaking.
21:49:01 hm, as long as it works, i guess it'd be fine
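(If someone retries this while logged in, a rough sketch of what a grab-site invocation might look like — not the actual command from the bpa.st paste. --import-ignores and --wpull-args are grab-site options as far as I know, but whether wpull accepts a wget-style --load-cookies should be verified first; cookies.txt is a placeholder export from the dummy account:)

    # reuse the same dumper-page ignore as the earlier run, via an ignore file
    echo 'http://redump.org/discs/.*?/dumper/.*?' > redump-ignores
    grab-site 'http://redump.org/' \
        --import-ignores redump-ignores \
        --wpull-args='--load-cookies cookies.txt'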
23:07:34 HiccupJul: I see you got IRC working. Yes, redump sucks in terms of tech
23:08:22 i think hackint may have been down for a bit, or maybe it was some system clock fluke that caused a certificate error
23:11:24 hackint hasn't been down in a good while, but the webchat thingy was broken for about a week recently.