00:58:32 (resending unless it was archived in a different channel) can someone get these for me, thanks: https://transfer.archivete.am/vABs9/honkaiwiki-newlinks.txt https://transfer.archivete.am/7axZa/honkaiwiki-newfiles.txt
00:59:24 Ivan226: I did those earlier today in #archivebot
00:59:48 ah got it
01:07:40 tomodachi94: in #archivebot, `socialbot: snscrape twitter-profile foo` works a reasonable amount of the time. it can gather the 3200 most recent tweets. the twitter-user option doesn't work at the moment, but there is a fix in snscrape git that isn't released yet
01:18:50 apparently archive.org is being hammered by thousands of AWS instances "downloading the OCR text from our materials"
01:32:23 Doranwen: The Dropbox link recovered, and I'm pulling a copy.
01:36:27 IA S3 stats show a massive drop in uploads about 8 hours ago... did our targets finally catch up or what? :P
02:05:08 oh, does IA store stuff in AWS?
02:05:53 Of course not, but there's an S3-ish interface.
02:06:46 ah! i see :)
02:06:51 i thought that rather odd haha
02:18:36 pabs: thanks!
02:18:44 Can someone snag this one too? https://transfer.archivete.am/1E4dG/tos.txt
02:19:12 archivebot can't do individual posts, but we can snag the user
02:19:44 I just did this: socialbot: snscrape twitter-profile TheOrderofSith
02:20:03 Ah okay, much appreciated.
02:27:21 JAA: Thanks! Glad it did :)
02:38:08 Oh lovely...... (full message at )
06:11:33 can grab-site be resumed in-place from another server? would like to move off this (higher priced) one at some point
06:12:17 'oh i'll only have this server like 4 days... "up 20 days"' where did the time gooo
10:13:22 the last two days have been chaotic for me. i may have missed important messages in channels; if you think I missed something, please ping again
10:25:37 Tomodachi94 created Prnt.sc (+439, Create page): https://wiki.archiveteam.org/?title=Prnt.sc
10:25:38 Manu edited Coronavirus (+159, Add some German case, vaccination (+more) data…): https://wiki.archiveteam.org/?diff=49850&oldid=48266
10:42:36 i've now gone through my youtube archive and picked out all videos that aren't on yt anymore (either private or missing completely) - 4.9k videos / 1.3 TB
10:42:36 mainly looking for some guidance on if/how I should submit those to IA? (the yt-dl command used mostly matches the one on the wiki, except for the info-json which I didn't add at the time)
12:49:26 imer: join #down-the-tube
12:49:50 oh, you said no longer on YT, oops
12:51:48 pabs: well, it's in there as well now haha, probably more appropriate either way :)
12:51:54 thanks
14:16:29 JAA: do you know if we got the hl2dm.net forum?
14:16:38 closing may 31
14:16:42 according to deathwatch
14:25:29 arkiver: pokechu22 did it, according to https://archive.fart.website/archivebot/viewer/job/64nco
15:26:08 pabs: perfect, thank you
19:00:55 Alright, here's a second dump of 14452 URLs related to that group, sourced from a dump of their Discord: https://transfer.archivete.am/KlnAM/urls.txt
19:01:11 It's mostly images.
19:40:16 Entartet edited List of websites excluded from the Wayback Machine (+30, Added patrickcollison.com.): https://wiki.archiveteam.org/?diff=49852&oldid=49833
20:02:41 If I want to run a Warrior on a dedicated machine at home (headlessly via Docker), what would be reasonable specs for it?
20:06:03 manu|m: There's no general answer as it depends entirely on the project. Some projects are CPU-intensive (e.g. sitemap parsing on URLs), some projects require significant disk space (anything with videos, e.g. YouTube), some require a lot of RAM due to recursion (e.g. Enjin, I think)... If you want to run multiple projects at once, consider using the project images rather than the warrior.
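(For reference, a minimal sketch of both approaches mentioned above, based on the run commands the wiki publishes; the image names, the published port, and the --concurrent/nick arguments are assumptions to double-check against the Warrior page and the individual project pages:)

    # Warrior appliance: one container, web UI on port 8001 where the project is picked
    docker run -d --name archiveteam-warrior --restart=unless-stopped \
        --publish 8001:8001 atdr.meo.ws/archiveteam/warrior-dockerfile

    # Single project image (no web UI); "example-grab" and YOURNICK are placeholders
    docker run -d --name at-example --restart=unless-stopped \
        atdr.meo.ws/archiveteam/example-grab --concurrent 1 YOURNICK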
20:09:06 so different projects/pipelines will pick the warriors they use based on their specs? I'd just like to make use of my internet connection when I don't need it for myself
20:10:32 No, either the warrior runs a specific selected project, or it runs the default project that we set on the tracker side, which is the same for all warriors set to 'ArchiveTeam's choice'.
20:11:35 oh okay, thanks
20:11:59 i'll check out the project pages then
20:13:33 If you want a 'set it up once and forget about it' thing, you'll want the warrior set to AT's choice.
20:13:53 But a dedicated machine for that is a bit overkill.
20:15:39 i'm not getting a second tower or a server rack for that, i just thought it might be a good idea to have it running on a machine that draws a bit less power than my desktop setup
20:20:41 another question: once or twice a year (when there isn't a pandemic going on) i'm attending Chaos events where there are 4-7 days of practically unlimited bandwidth available, and where it's possible to colocate machines. would it be useful to bring a warrior (or more) there to crunch through larger projects, or would that be counter-productive?
20:22:59 depends on the project; sometimes the website being archived has per-IP limits, so having more bandwidth doesn't actually help
21:14:12 Is there a way to make archivebot login-walled content? I want to back up redump.org using archivebot, but some of the content is walled behind the requirement to submit a few discs to the site. It's not exactly public, but it's not exactly private either.
21:14:27 *archivebot archive login-walled content
21:16:04 Archivebot can't do that, no :/ (I think there might be other tools that can (e.g. grab-site installed locally), but I haven't worked with those)
21:16:27 ... hmm, and redump.org isn't loading for me at all, that's not a good sign :|
21:17:01 Yeah, it goes down occasionally, the admin is unresponsive, and the admin is against public backups.
21:17:22 so I think it'd be a good thing to have it in the wayback machine
21:18:08 is there any tool that supports login-walled sites that can be used to ingest data into the wayback machine?
21:18:28 There have been some backups in the past, but without being logged in
21:19:25 yeah, it's just that those miss a lot of data
21:19:34 Looks like the last full run of redump.org was on 2022-05-10, and forum.redump.org was last run on 2023-04-29, but both of those wouldn't have been logged in
21:19:58 For the main redump.org, it'd be missing data for a few more recent systems, and also revision history, right? While forum.redump.org would be missing basically everything, to my understanding
21:19:59 i.e. it misses modern systems, change history, dump submission sub-forums
21:20:22 hah, we said almost the same thing
21:20:42 so yeah, we are on the same page
21:21:33 all that stuff is pretty important to the continued operation of the site, imo
21:22:05 My understanding is that grab-site produces WARCs, and those *can* be ingested into web.archive.org but won't necessarily be by default
21:22:34 https://github.com/webrecorder/browsertrix-crawler supports logins as profiles: https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles
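(A rough sketch of the browser-profile workflow from that README; example.com, the collection name, and the mounted paths are placeholders, and the exact flags and profile path can differ between browsertrix-crawler versions, so verify against the linked docs:)

    # 1. log in interactively once and save the browser profile
    docker run -it -v "$PWD/crawls/profiles:/crawls/profiles" \
        webrecorder/browsertrix-crawler create-login-profile \
        --url "https://example.com/login"

    # 2. crawl with that profile so login-walled pages are captured
    docker run -it -v "$PWD/crawls:/crawls" \
        webrecorder/browsertrix-crawler crawl \
        --url "https://example.com/" --collection example-logged-in \
        --profile /crawls/profiles/profile.tar.gz --generateWACZ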
21:24:49 i assume wayback doesn't ingest stuff made by random people
21:25:30 only things archived by archive team or IA services, or stuff from companies like alexa
21:26:23 My understanding is that yeah, that's roughly the case. Most outsider stuff ends up in https://archive.org/details/warczone
21:26:48 One other aspect to consider is that if you do save login-walled content, every single page will show you being logged in
21:28:02 Data behind logins doesn't make it into the Wayback Machine in general.
21:28:59 That's a relatively hard rule with few exceptions.
21:32:17 i guess i should look into using something like archivebot, and then just hosting the static pages on free hosting so they can be browsed, in addition to the WARCs
21:32:27 (And the exceptions are of a historical nature, e.g. our SPUF project in 2017.)
21:33:53 That sounds reasonable. grab-site is basically like AB but local, and you can give it cookies.
21:34:11 someone did run grab-site, but i think it was a pretty long process
21:34:17 and hard to get working
21:34:53 i should check if the output of that was okay, then maybe i can just set that up to run every week and upload to IA (and github pages/neocities for a browsable version)
21:35:17 Shouldn't be hard to get working unless there's annoying 'DDoS protection' stuff in the way or extensive use of JS, but it certainly won't be fast, yeah.
21:35:59 seems like that grab-site run was incomplete
21:36:05 so the issues with it apparently weren't resolved
21:36:09 Note that making the WARCs publicly accessible might allow others to hijack your account.
21:36:19 i think i'll make a dummy account for it
21:36:53 actually, i believe i know someone who has one like that already
21:37:04 redump.org doesn't support https; I doubt it has proper DDoS protection :P
21:37:28 but yeah, i don't think i want to do it with my account, if only to avoid my name being plastered over it
21:37:58 I assume you'd be grabbing the submission history subforums but not the dumpers subforum, then?
21:38:16 the account could probably get dumper access
21:38:36 i have plenty of low-priority discs (e.g. already verified ps2 shovelware) that i can use
21:39:13 actually, i can probably ask a moderator just to promote an account even without any disc submissions
21:41:26 this command was used for the grab-site attempt: https://bpa.st/6EQWS
21:41:34 the ignore pattern was: http://redump.org/discs/.*?/dumper/.*?
21:41:59 took 33 hours, not too bad
21:42:34 ah, this bug prevented forum attachments from being saved: https://github.com/ArchiveTeam/wpull/issues/291
21:43:49 HTTP/0.9? Eww.
21:44:23 Technically, that can't go into WARC either.
21:45:01 technically?
21:46:43 The spec only permits HTTP/1.1, strictly speaking.
21:49:01 hm, as long as it works, i guess it'd be fine
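(If someone retries this while logged in, a rough sketch of what a grab-site invocation might look like — not the actual command from the bpa.st paste. --import-ignores and --wpull-args are grab-site options as far as I know, but whether wpull accepts a wget-style --load-cookies should be verified first; cookies.txt is a placeholder export from the dummy account:)

    # reuse the same dumper-page ignore as the earlier run, via an ignore file
    echo 'http://redump.org/discs/.*?/dumper/.*?' > redump-ignores
    grab-site 'http://redump.org/' \
        --import-ignores redump-ignores \
        --wpull-args='--load-cookies cookies.txt'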
23:07:34 HiccupJul: I see you got IRC working. Yes, redump sucks in terms of tech
23:08:22 i think hackint may have been down for a bit, or maybe it was some system clock fluke that caused a certificate error
23:11:24 hackint hasn't been down in a good while, but the webchat thingy was broken for about a week recently.