00:14:07 Thanks
00:15:54 VerifiedJ: Submitted to AB.
00:16:43 @JAA: thanks
00:17:28 Also threw in the /json URLs. That should at least fetch some of the map data.
00:24:24 http://halo.bungie.net/ is (finally) shutting down February 9th. Coverage looks pretty good (last run in AB 2014), but might be worth another run? https://www.windowscentral.com/bungie-original-halo-website-being-taken-down-february-9
00:25:26 Whew, blast from the past.
00:31:37 yeah
00:56:09 Ah yeah, I archived their forums a while ago.
00:56:18 That's why it seemed so familiar. :-)
00:56:38 I'll throw it into AB.
00:59:21 Anyone want to save ~124 million Halo screenshots?
00:59:48 Is #findelmondo dead? 9.75m left but I'm the only one in the room haha
01:00:39 That's because it's #findelmundo with a u.
01:01:04 Thanks JAA! :)
01:01:32 JAA man this is like a throwback to one of the first projects I took part in
01:01:37 the halo match records
01:01:53 because there was a big argument over why the fuck we were doing it
01:02:12 Yeah indeed, I just remembered those as well.
01:02:16 BlameGithub :P
01:04:28 that reminds me is astrid still around?
01:06:02 no
01:06:27 oh
01:08:53 http://halo.bungie.net/online/communityfiles.aspx?tags=&datefilter=7&sort=0&page=0
01:09:04 124 million files, mostly screenshots of a couple hundred KiB.
01:09:30 Well, 124 million IDs, no idea whether they all exist.
01:09:38 So I take it that AB farms the work out to dedis? Or does it take the urls and throw them into the urls queue?
01:10:08 There are also videos, which appear to actually be game data saves that were then rerendered by the server. Rendering is broken, and the raw data can't be downloaded. :-/
01:10:19 124 million... wow, did we get all of that back in 2014?
01:10:43 E.g. http://halo.bungie.net/Online/Halo3UserContentDetails.aspx?h3fileid=123990982 -> http://halo.bungie.net/Stats/Halo3/RenderToVideoWindowUI.aspx?h3fileid=123990982
01:12:07 Jake: Possibly.
Certainly not through AB, but there was also a DPoS project it seems.
01:12:31 Yeah, looks like that was for these files. :-)
01:12:32 http://tracker.archiveteam.org/halo/#show-all
01:13:11 didnt we already get most of this?
01:13:32 yeah
01:14:40 Endomondo can probably be removed off the tracker homepage now, same with Flash I'm guessing?
01:14:48 Looks like there are large gaps though.
01:14:53 E.g. https://web.archive.org/web/*/http://halo.bungie.net/Online/Halo3UserContentDetails.aspx?h3fileid=123900*
01:15:14 JAA we agreed to grab the first million and last million last time
01:15:19 something like that
01:15:24 flashfire42: That was something else.
01:15:27 Oh ok
01:15:39 My bad ignore me
01:15:40 First/last/random million was game stats. These are the shared files.
01:15:57 no endomondo cannot be removed yet
01:16:06 there's some API not archived yet
01:16:21 Oh actually, those nine-digit files are all not from the project.
01:16:38 I'm fine with setting up a project for halo if we figure out what we dont have yet
01:16:45 Or the WBM APIs aren't working correctly again, or something went wrong on the indexing, or whatever.
01:17:07 indexing problems seem plausible
01:17:12 arkiver: ahh didn't realise. The tracker is only giving profile stuff at the moment which does seem to be dead
01:18:06 https://github.com/ArchiveTeam/halo-items/tree/master/halo3file/ADDED suggests that only up to 120 million might've been attempted.
01:18:37 And https://web.archive.org/web/*/http://halo.bungie.net/Online/Halo3UserContentDetails.aspx?h3fileid=119900* looks much better indeed.
01:19:58 So I guess we might need a project for the IDs above 120 million.
01:20:53 Specifically, 120000100 and up if I'm reading the code and -items correctly.
01:23:17 Is it worth cancelling the AB job if it's potentially gonna be done by us anyway?
01:23:50 Not sure how AB hands out its tasks and whether it actually affects anything to have lots of jobs running
01:24:41 AB does recursive crawls, a distributed project would be much more specific. It makes sense to do another recursive crawl.
01:24:55 Ahh I see
01:27:04 Also, nice bug: https://github.com/ArchiveTeam/halo-grab/blob/93195fe56d6b5ec22c89f4586699deac1f28602e/pipeline.py#L197
01:27:51 lol
01:27:54 early days
01:28:06 missed a , there
01:28:09 I wonder what the lone file outside of ADDED in -items is about.
01:28:27 https://github.com/ArchiveTeam/halo-items
01:28:37 no idea
01:28:37 Looks like those were indeed not grabbed.
01:28:48 At least based on WBM prefix search.
01:28:50 0-99999?
01:28:57 Yeah
01:29:06 https://web.archive.org/web/*/http://halo.bungie.net/Online/Halo3UserContentDetails.aspx?h3fileid=99000* only returns 8-digit IDs.
01:29:14 so lets update the project and bring it back to life :P
01:29:19 with all above 120 million
01:29:21 and those 0-99999
01:29:29 Sounds good :-)
01:30:10 Let's reopen #yolohalo ?
01:30:13 yep
01:38:34 -purplebot- Deathwatch edited by JustAnotherArchivist (+207, /* 2021 */ Add Halo) just now -- https://www.archiveteam.org/?diff=46174&oldid=46170
02:31:34 -purplebot- Halo edited by JustAnotherArchivist (+1872, Complete overhaul, 2018 additions, …) just now -- https://www.archiveteam.org/?diff=46175&oldid=28888
02:43:14 https://images.nga.gov/ 'will no longer be available as of January 1, 2021' my arse.
03:00:36 yano: I found a few more things on PirateBox. Also, they stated that the tracker might go down earlier than the rest of the stuff in https://forum.piratebox.cc/read.php?9,23070, so would be best to grab all of that and upload to IA as soon as possible. If you could dump the .torrent file for each of the magnet links, that would also be useful for preservation purposes.
05:02:34 -purplebot- Current Projects edited by Wickedplayer494 (-318, Go home everyone) just now -- https://www.archiveteam.org/?diff=46176&oldid=46156
05:03:53 i finished up this: https://github.com/mgrandi/archive_pogchamp_emote which will help me archive the daily PogChamp emote
05:06:33 i should probably write something to help generate urls for all the twitch emotes, plus the information about them (streamer / shortcode ), since the https://archiveteam.org/index.php?title=Twitch.tv page says the last run didn't have those
05:42:34 -purplebot- Current Projects edited by Wickedplayer494 (+168, Halo's back, bitches) 20 minutes ago -- https://www.archiveteam.org/?diff=46177&oldid=46176
12:09:00 Hi. I made some modifications to warrior Docker build and forked some grabs. Unfortunately, I wasn't aware that unofficial versions are discouraged. I am wondering how I can contribute my changes to upstream.
12:09:38 To start with, here is my warrior Dockerfile https://github.com/SrihariThalla/archive-team-warrior/blob/337f0602d8b47f06161ab4c9990467b51fc7c1da/Dockerfile
12:10:50 In the meantime, I have stopped my instance of warrior
12:16:13 The main issue I wanted to resolve with my version of Dockerfile is the wget-at. I see that there is a Warrior Extras Installer. Since it would be mostly installing all the dependencies, I wanted to move to the warrior container itself
12:18:58 Thus I built wget-at (and zstd from Github) and forked some grabs to make them use the common wget-at. For ex: https://github.com/ArchiveTeam/domains-grab/pull/3
13:13:03 JAA: i'm using qbittorrent; i'll check but i don't think i have the .torrent files anymore, can the IA not ingest magnet links for torrents?
13:14:46 oh nice, it looks like i am saving them
13:30:11 JAA: https://archive.org/details/PirateBox-Bittorrent-Files
13:30:23 i didn't know where to upload them so i put them there
13:30:25 🤷
13:31:07 lol, it creates a bittorrent of the bittorrent files lol
13:31:59 torrent ception?
13:32:11 We need to go deeper.
13:32:29 hehe
13:32:47 yano: yes IA can do magnet links
13:32:49 i like qbittorrent because when you add a torrent of torrent files it asks if you want to start download on the subsequent torrents
13:32:59 make a .torrent with the magnet: URI as content
13:33:02 arkiver: oh, i could have avoided uploading the .torrent to IA then
13:33:03 upload to IA and let is derive
13:33:04 oh well
13:33:09 it*
13:33:10 anyways, the torrents are now on IA
13:33:28 Are the old versions still seeded?
13:33:36 some of them are
13:33:51 i got them in my seedbox trying to find them on dht
13:34:06 also, yay my first upload to IA
13:35:17 congrats
13:35:24 many more to follow :)
13:36:45 hehe :D
13:36:54 i mean, i've uploaded through AT, but this is my first direct upload :D
13:39:35 Yay :-)
13:45:17 DaxServer, please stay around. I'm sure one of the ops will get back to you. Most people run the docker image workers these days. Instructions are at https://archiveteam.org/index.php?title=Running_Archive_Team_Projects_with_Docker
13:45:50 and thanks for stopping by :)
13:46:27 Thanks :)
13:48:17 DaxServer: people should not edit our code and run it 'in production'
13:48:32 in this case some Lua dependencies have not been installed on the warrior - which causes issues
13:48:48 we might have indeed missed data due to the edits
13:51:48 Can you please mark all of my uploads for a redo? So sorry about the data loss
14:48:41 Someone is uploading all of thingiverse to IA
14:49:33 Damn how big is that going to be?
15:09:26 I thought we already did that
15:10:34 -purplebot- Webzdarma edited by Sanqui (+25, /* ArchiveBot jobs */ job 20) just now -- https://www.archiveteam.org/?diff=46178&oldid=46093
15:11:59 There was a project in 2015 it seems.
15:12:14 The whole dutch parliament just handed in their resignation
15:12:40 RIP
15:13:28 lol, I didn't even know that was possible. Rather than dissolving parliament, calling a new general election, etc.
15:13:31 and the netherlands is without a parliament now till mid March 2021
15:14:17 Does the government also go?
15:14:35 not sure
15:14:51 https://www.houseofrepresentatives.nl/members_of_parliament/members_of_parliament
15:14:59 Most of their MPs have been in for what seems like ages
15:15:28 Or "seniority"
15:15:55 just happened 2 hours ago
15:16:20 Yeah it just seems like most of them have been in for ages, tho tbh I expect the numbers are similar for UK mps
15:16:51 Uh, what I'm reading, it's the government that resigned, not parliament?
15:17:04 yeah that seems more sane
15:17:09 https://nltimes.nl/2021/01/15/rutte-confirms-cabinet-resignation-says-covid-crisis-management-wont-change
15:17:43 Yeah
15:17:48 we have every 4 years general election for parliament
15:19:45 Election's in a few months anyway. I do love the videos of Rutte going to the king on his bike. Because of course he did that.
15:23:25 not sure tho what this means for the upcoming elections as most of the candidates are the same as the current Government
15:26:49 I mean, the cabinet isn't elected anyway, right?
15:28:20 Not directly no
15:28:34 I just realised we're in -bs. Maybe we should move this to -ot?
15:28:45 Yeah
16:42:38 So, I am being informed that we only grabbed the Halo 3 stats from the Halo Forerunner Project.
16:42:46 Now, this is going to lead to some sackings
16:43:01 But before then, can someone fire up a raspberry pi and get the rest of halo
16:45:37 Only Halo 3 files*, not stats, as far as I can see.
16:45:42 Screenshots and stuff
16:46:07 I updated the wiki page last night with my understanding of what was covered.
16:46:23 And yes, #yolohalo is back in business.
16:47:36 JAA: enjoying bing-scrape. do you have a little thing for URL derivation (i.e. for https://example.com/a/b.html, derive https://example.com/a/ and https://example.com/)?
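[Editor's aside: the derivation asked about at 16:47:36 can be sketched as follows. This is only an illustration using Python's standard urllib.parse, not any of the scripts mentioned in this log; the function name parent_urls is hypothetical, and the sketch assumes query strings and fragments should be dropped and that empty path segments (e.g. from "//") can be collapsed.]

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the parent directory URLs of `url`, nearest first, ending at the site root.

    Hypothetical helper for illustration; the input URL itself is not included.
    """
    parts = urlsplit(url)
    # Split the path into segments, dropping empty strings from leading/trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Remove one trailing segment at a time until only the root remains.
    while segments:
        segments.pop()
        path = "/" + "/".join(segments) + ("/" if segments else "")
        # Query and fragment are deliberately left empty (an assumption of this sketch).
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


if __name__ == "__main__":
    for parent in parent_urls("https://example.com/a/b.html"):
        print(parent)
```

[For the example in the question, this prints https://example.com/a/ followed by https://example.com/. A URL that is already the site root yields no output.]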
16:47:42 if not I may want to contribute it
16:47:55 just dunno how to call it
16:48:04 suffix-stripping-derivation
16:48:12 Sanqui: I have written something for that before, yeah.
16:48:28 I've written it too but it's nowhere specific. I'm bad at keeping my oneliners organized
16:48:41 Same
17:02:02 Sanqui: awk -F/ 'BEGIN { OFS="/" } /\/$/ { --NF } { for (i = NF; i > 3; --i) { --NF; print $0 OFS; } }'
17:02:25 frith, you awklord
17:02:27 cheers
17:03:05 Doesn't print the original URL, add an extra `{ print }` before `/\/$/` if you want that.
17:03:59 And just in case you intend to use this for AB !a <, well, might not work as intended.
17:18:12 Sanqui: Actually, hold on, writing a better version. :-)
17:24:04 Sanqui: https://git.kiska.pw/JustAnotherArchivist/little-things/src/branch/master/parent-urls
17:24:57 ah, ain't it beautiful when things get done with minimal effort from my side
17:25:01 thanks a bunch!!
17:27:08 :-)
18:09:33 What's the lua regex for all the instagram login pages? (https://www.instagram.com/accounts/login/?next=/reel/CJ6Yx9KJEP9/ As in it can have anything after the /login)
18:13:49 ^https?://www%.instagram%.com/accounts/login.-$ Would that do it? (I'm brand new to Lua matching)
19:12:35 I think we should increase our efforts to save online forums (i.e. #msgbored ) . I feel that user-generated content everywhere is under threat, and forums, especially old forums, are pretty much just that. Content written in them years ago may be viewed in different lights now. https://lauren.vortex.com/2021/01/15/moderating-ugc
19:26:39 +1
19:26:48 I'm working on some czech forums but yeah
19:27:17 ++
19:30:03 BugTraq archives are shutting down at the end of the month: https://www.securityfocus.com/archive/1/542247/30/0/threaded (Thanks, gb in #urlteam)
19:31:41 So, I report here as well that BugTraq's archives will be shut down on January 31st, 2021. See https://www.securityfocus.com/archive/1/542247/30/0/threaded . I wasn't able to find existing mirrors.
19:32:08 Since it's a mailing list, there are definitely mirrors. E.g. https://seclists.org/bugtraq/
19:32:32 But we should archive the original site anyway.
19:33:31 Ah you're right, I checked quite poorly then
19:37:42 The archive at securityfocus.com is actually much more limited than the seclists.org one, it only goes back to 2002 (https://www.securityfocus.com/cgi-bin/index.cgi?offset=52380&limit=30&c=11&op=display_threads&ListID=1&mode=threaded&expand_all=false)
19:38:56 Interesting.
19:39:14 I've started an ArchiveBot job for https://www.securityfocus.com/. We'll see how that goes.
19:39:36 It's one of those sites with an absolutely disgusting URL structure.
19:40:31 Wow that was fast, thanks!
19:46:34 -purplebot- Deathwatch edited by JustAnotherArchivist (+310, /* 2021 */ Add BugTraq) just now -- https://www.archiveteam.org/?diff=46180&oldid=46174
19:46:34 -purplebot- 99.se edited by Flashfire42 (+42) just now -- https://www.archiveteam.org/?diff=46181&oldid=37553
19:49:10 They also have unauthenticated "Post message" links which you should be wary about, although theoretically the list stopped accepting posts since 2020/02
19:51:15 There's also this vulnerability database, not just the mailing list: https://www.securityfocus.com/bid/
19:52:14 I guess most of that information is in the mailing list archives anyway, but still nice to have.
19:52:34 -purplebot- Comcast Personal Web Pages edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46183&oldid=28777
19:53:34 -purplebot- Hackpad edited by Flashfire42 (+0) just now -- https://www.archiveteam.org/?diff=46184&oldid=29345
19:54:57 Ah right, with "archive" they probably mean that too! That one goes back a lot further, to (!)
1980: https://www.securityfocus.com/cgi-bin/index.cgi?o=102300&l=30&c=12&op=display_list&vendor=&version=&title=&CVE=
19:59:19 So starting from the last pages of that vulnerability archive is probably a good idea, at those times there weren't CVE or stuff yet so it's a little more likely that there aren't references elsewhere
20:03:01 True. In principle, the recursive crawl should get to those, but only towards the end.
20:06:19 The dates on the early bugs seem sketchy though.
20:06:35 https://www.securityfocus.com/bid/2053 lists Debian 2.3 for example, which obviously didn't exist in 1980.
20:07:36 'This vulnerability was first announced by vort-fu on December 5, 2000.'
20:07:43 Heh, just 20 years off.
20:13:34 -purplebot- 99.se edited by JustAnotherArchivist (+98) 24 minutes ago -- https://www.archiveteam.org/?diff=46182&oldid=46181
20:13:54 Yeah, it did seem fishy... "Bugtraq was created on November 5, 1993"
20:14:41 But maybe that database started earlier anyway
20:55:20 adding to the user-generated content comment...if there are major changes to Section 230...that could lead to a whole wave of site closures. Not unlike fallout from GDPR of sites feeling they might be noncompliant.
20:57:14 Ok I have to leave, thank you and keep up the good work, At some point hopefully I'll manage to lend a hand. Bye
21:11:03 https://www.pcgamer.com/bungies-vast-library-of-halo-stats-goes-offline-next-month/
21:12:47 this'd be good to make a warrior for? :p
21:14:50 I see someone's already added it to wiki, good
21:17:11 Billy549 #yolohalo
21:17:16 ty for the channel name
21:17:27 JAA: from the Halo wiki page: "The forums were archived in full shortly before the shutdown at the end of June 2018." i take it this means something other than the still-operational http://carnage.bungie.org/haloforum/halo.forum.pl ?
21:20:01 (not now at the halo. subdomain, but used to be)
21:21:00 oh god nvm, they're bungie.org and always have been.
21:22:50 yeah bungie.org isnt bungie
21:23:44 shows what i think of the ~official~ sites :P
22:39:32 I'll try to finalize the (i.e. remove all the SmackJeeves references from) So-Net U+ script soon, think that might be a relatively slow site
22:39:40 (Closes the 28th)
22:41:11 OrIdow6: what is that one about?
22:41:49 arkiver: It's a Japanese personal webpage host that had its heyday in (I'd say) around 2006
22:42:03 thats awesome
22:42:24 looking at some sites, pretty nice
22:42:41 well ping me when you have something, will be a nice project
22:42:45 Ok
22:43:10 For now I think I might actually try to do something about CrowdMap
22:43:14 In the 1.5 hours I have here
22:43:37 But that's small enough that even I should have enough capacity
22:44:00 got a list of everything?
22:44:11 getting item lists is sometimes the biggest problem
22:44:16 Verified J made a scrape
22:44:24 Or are you talking about So-Net U+?
22:44:39 both
22:44:39 orldow6 you aiming to have a project up tonight?!
22:44:48 if so please do let me know ASAP
22:45:32 EggplantN: Next few days, maybe, tonight (my tonight, your morning) if everything else in the day takes me a very low amount of time
22:46:09 kk if you let me know ahead of time to get a target up we can blitz through it like SJ
22:47:17 arkiver: I haven't checked yet that the CrowdMap scrape is comprehensive, but I think it may be; you're right that So-Net U+ will probably benefit from something more extensive than the WBM CDX for discovery
22:47:55 how was the crowdmap scrape done?
22:48:21 will check with so-net u as well
22:49:18 EggplantN: Will do, was nice working with you last time, and I don't think this will be a stressfully huge amount of data
22:49:34 thats fine by me, just holla and i'll be helpful
22:49:46 arkiver: To my knowledge everything that's happened to Crowdmap thus far is that someone (don't remember who) ran some simple AB jobs
22:53:41 contacting someone at IA now who can read japanese
22:53:44 about so-net u
22:53:54 OrIdow6: or do we have someone already? ^
22:54:05 i want to ask if he sees any contact info that we could use for this
22:54:10 maybe can help with a japanese email as well
22:55:04 arkiver: I don't speak Japanese, if that's what you're asking; there's someone who does who comes on sporadically who told me about this
22:56:05 alright
22:56:08 will contact
22:56:58 "taka", last online January 2 UTC
22:57:10 arkiver: Ok, would be nice to have a complete list
22:58:44 Well, online the 10th, but didn't say anything
23:17:22 thuban: It's the forums that used to be at http://halo.bungie.net/Forums/default.aspx (until 2018).
23:19:37 OrIdow6, arkiver: Re Crowdmap, the AB job for the JSONs failed on a few very slow responses. Not sure if the reports pages were grabbed (links are in the JSONs), and there is some data on those that isn't in the JSONs.
23:23:37 thuban: I wonder if we should grab those forums you linked though. I can't imagine that they'll stay around a lot longer.
23:26:47 i wondered that myself. (that said, they're _so_ old that i wonder whether the lindy effect kicks in--they've been going since 1999)
23:28:21 i did some scripting to extract urls from the very similar marathon story forums, when we threw into archivebot last year
23:29:26 don't remember the details but i think grabbing posts from this one would not be too hard, will do when i get a chance. should probably hit the oni forums as well
23:29:29 Unfortunately, the archived posts are on http://library.bungie.org/ instead.
23:29:44 s/when/which/
23:29:48 Which then links to yet another domain, forums.bungie.org.
23:30:32 'The HBO Forum Archive is maintained with WebBBS 4.33.' - 'THE PERL SCRIPTS ARE NO LONGER BEEING SUPPORTED'
23:30:35 :-)
23:30:41 thisisfine.png
23:30:44 I would put very little trust in digital lindy effects. Political pressure on Section 230 could change matters overnight.
23:32:03 Can't quickly find a version history, but WebBBS 5.0 dates back to before 2002.
23:32:27 JAA: i wouldn't worry about the domains if we just feed the bot individual post urls like last time
23:33:08 I'm worried about keeping the archives accessible and browsable.
23:33:22 But yeah, for the content itself, you're right.
23:36:58 individual post pages link to the other posts in their thread, and we could include search result pages to get all-posts indexes (think i did that last time too). would be awkward to find a specific post if you didn't have the url, but usable
23:38:41 WebBBS 4.33 is from mid-2000 and has a vulnerability that's been known since mid-2002. lol
23:39:23 please don't pwn the forums before we've saved them :(
23:47:22 I've thrown carnage and library into AB.
23:47:48 carnage actually returns a list of all (non-archived) posts on the homepage when using the AB UA. That's handy.
23:51:49 There are also a bunch of other old forums hosted on carnage. Might be worth digging into sometime.