05:16:32 https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
05:17:12 (GitLab to delete inactive projects from users on the free tier)
05:29:06 Highlights: "the policy is scheduled to come into force in September 2022"; "GitLab... will... give users weeks or months of warning"
05:34:35 Looking into discovery
05:47:09 Will look into discovery tomorrow
05:47:12 Anyhow thanks pabs
05:47:40 thanks
06:48:07 The gitlab one could be a big one
06:49:16 Yup and incredibly sad if they actually end up going through with it
06:49:41 Somewhat ironic to me that saving $1M/y is worth sacrificing years of goodwill
06:49:58 "GitLab is aware of the potential for angry opposition to the plan", sounds like they know it will piss people off, but I guess they're hoping it will make it worth it
06:54:26 GitLab's on-call person/SRE is about to have a bad couple of nights...
06:55:39 well, we'd hope to not cause any huge problems...
07:00:26 I wonder if the cost in network egress will go above the $1M in storage costs...
07:00:35 (also compute...)
07:01:02 (also SRE salary/time / project planning in response to the new load)
07:01:29 Oh I mean, probably! Seems incredibly shortsighted to me, but /shrug.
07:03:09 tme to pull out this article again - https://articles.uie.com/beans-and-noses/
07:03:12 *time
07:04:56 looks like GitLab runs under cloudflare... could end up being annoying.
07:07:35 trying to find public info on any rate limits, but they seem to be generous right now.
08:12:45 https://gitlab.com/explore/projects?sort=latest_activity_asc
08:13:28 https://gitlab.com/explore/projects?sort=created_asc
08:13:41 First project in that second list is ID 450
08:13:52 second one is 526
08:14:30 IDs appear sequential
08:14:51 Current max ID is approximately 38337314
08:15:01 based on https://gitlab.com/explore/projects?sort=created_desc
08:17:15 API endpoint for the ID, e.g.: https://gitlab.com/api/v4/projects/526
08:18:07 can also use ID without API: https://gitlab.com/projects/526
08:19:58 so probably ~38.4 million items, with many deleted/private
08:20:06 OrIdow6: ^
08:21:38 returns 404 for a private repo
08:23:09 I see a generous rate limit on that call of 2K/minute.
08:24:28 19.2K minutes if we use the full rate limit every minute on an IP, or about 13 days I think. Should be possible.
08:27:47 also as a note: in addition to public and private, there is also the now-discontinued "internal" visibility setting that some repos might still have. "Internal" means visible to any logged-in user.
08:30:10 found project ID 143 as the new lowest; it was archived, so it wasn't originally visible in the project listing
08:35:29 API for listing https://gitlab.com/api/v4/projects?sort=asc&order_by=id
08:38:08 Docs for this: https://docs.gitlab.com/ee/api/projects.html#list-all-projects
08:39:08 Also note that keyset-based pagination is likely needed to avoid running into a limit https://docs.gitlab.com/ee/api/projects.html#pagination-limits
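For reference, a minimal, untested sketch of walking that per-ID endpoint at the observed rate limit. The endpoint, the lowest/highest known IDs, and the 2K/minute figure come from the discussion above; the pacing and unauthenticated access are assumptions, not a tested client:

```python
import time
import requests

API = "https://gitlab.com/api/v4/projects/{}"   # per-ID endpoint mentioned above
RATE_PER_MIN = 2000                              # observed rate limit (see 08:23:09)

def probe(project_id, session):
    """Return the project JSON, or None if it is private/deleted (404)."""
    r = session.get(API.format(project_id))
    if r.status_code == 404:
        return None          # private, internal (when logged out), or deleted
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    s = requests.Session()
    for pid in range(143, 38_337_315):           # lowest known ID .. approx. current max
        meta = probe(pid, s)
        if meta is not None:
            print(pid, meta.get("path_with_namespace"))
        time.sleep(60 / RATE_PER_MIN)            # stay at or under ~2K requests/minute
    # ~38.4M IDs / 2000 per minute ≈ 19,200 minutes ≈ 13.3 days from a single IP
```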
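And a similarly rough sketch of enumerating via the list endpoint with the keyset pagination the docs describe; it assumes the documented behaviour of returning the next page URL in the Link header (which requests exposes via r.links). Keyset pagination matters here because offset-based paging hits a hard limit long before 38 million rows:

```python
import requests

# keyset pagination per https://docs.gitlab.com/ee/api/projects.html#pagination-limits
url = ("https://gitlab.com/api/v4/projects"
       "?pagination=keyset&per_page=100&order_by=id&sort=asc")

with requests.Session() as s:
    while url:
        r = s.get(url)
        r.raise_for_status()
        for project in r.json():
            print(project["id"], project["path_with_namespace"])
        # GitLab returns the next keyset page in the Link header; stop when it's absent
        url = r.links.get("next", {}).get("url")
```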
11:09:05 would it be possible for someone here to archive vtda.org? it's got a ton of useful retrocomputing materials
15:45:24 thuban: Continuing from #archiveteam-ot, the page I wanted to grab is not very large: https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/
15:46:27 OrIdow6: Continuing from #archiveteam-ot, yeah, that website ^ uses Clownflare and it's only accessible through either a real browser or a curl-impersonate request.
15:50:12 if you're not too worried about the header integrity issues that we retired chromebot over, you could try using its engine, crocoite: https://github.com/PromyLOPh/crocoite
15:52:25 i've never tried to run this myself and i don't know if it still works, but if it does it would be much more convenient than writing your own script/wpull plugin
15:53:43 It's worth a shot. I'll try installing it and giving it a try.
15:53:59 *I'll install it and give it a try.
16:01:08 Oh, if anyone already has an HTTrack instance up and running, could you please give that cyberciti.biz page I mentioned a try?
17:46:56 crocoite didn't work. :-/
17:53:03 well, wget and curl don't 'compose', and having looked at the wpull plugin api i don't think you can override the actual request implementation (is that an architectural limitation or just an oversight?)
17:54:19 Should the GitLab project be under #gitgud (used for GitHub) or #gitlost? I'm thinking merge with #gitgud
17:56:06 systwi_: so i guess in your position i would write a little python script that calls curl-impersonate, saves the results, and, after extracting links/URLs from appropriate MIME types, derelativizes, dedupes, and queues them
17:56:27 the naïve implementation won't scale, but you don't need scale
18:04:16 I'm thinking that's the next best step. I can already access the correct page with curl-impersonate, so it's up to me to do the rest.
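A sketch of the kind of script thuban describes at 17:56: shell out to one of curl-impersonate's wrapper scripts, save the body, extract and derelativize links, dedupe, and queue. The wrapper name (curl_chrome104) and the output layout are assumptions, and it skips the MIME-type filtering mentioned above; it stays deliberately naïve, since it only has to cover a handful of pages:

```python
import re
import subprocess
from collections import deque
from urllib.parse import urljoin, urldefrag

CURL = "curl_chrome104"          # one of curl-impersonate's wrapper scripts (assumed name)
HREF = re.compile(r'(?:href|src)=["\']([^"\']+)["\']', re.I)

def fetch(url):
    """Fetch a URL with an impersonated Chrome TLS/header fingerprint."""
    return subprocess.run([CURL, "-sL", url], capture_output=True, text=True).stdout

def crawl(start, prefix, limit=100):
    seen, queue = set(), deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        with open(f"page-{len(seen):04d}.html", "w") as f:   # naïve "save the results"
            f.write(body)
        for link in HREF.findall(body):                      # no MIME check in this sketch
            absolute, _ = urldefrag(urljoin(url, link))      # derelativize, drop fragments
            if absolute.startswith(prefix):
                queue.append(absolute)                       # dedupe happens via `seen`
    return seen

if __name__ == "__main__":
    crawl("https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/",
          "https://www.cyberciti.biz/")
```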
18:06:22 seems damn silly to reimplement a whole spider/scraper, though. you'd think we'd have something for this
18:08:01 Is the need to use tools like curl-impersonate on some pages relatively new? I never recall encountering such situations even five months ago.
18:08:34 No, Buttflare has always been a pain in the butt.
18:08:39 But yeah, it would be nice to see this in a nicer tool than something crummy that I'd make. :-P
18:09:00 But apart from using a headless browser etc., there wasn't any tooling for it until curl-impersonate emerged.
18:09:45 Yeah, but I mean in the sense of the TLS fingerprinting that Cloudflare does.
18:09:55 I see.
18:10:39 i personally ran into it at least a year ago
18:10:58 They did increase the fingerprinting, but that was some time ago. I think around the same time they rolled out the new JS challenge.
18:11:14 JAA: got any insight on the wpull plugin question?
18:11:42 systwi_: using a bit of an experimental crawler I've been working on, we added https://github.com/refraction-networking/utls, which impersonates Chrome's ClientHello. I captured that site in WARC for you here: https://jakel.rocks/up/b96ca2096dbcbeb2/ZENO-20220804175848709-00001-ATHENA.warc.gz
18:12:30 :0
18:13:29 Woah, thanks so much, Jake! :-D
18:13:35 Extremely helpful.
18:14:03 thuban: It's complicated. Yeah, you can't override the actual requesting. But wpull is highly modular. In principle, it should be relatively easy to e.g. use just its scraping/link extraction part. I doubt this is documented much though.
18:15:52 (And actually, you probably can override the requesting via a plugin, except that's currently broken due to a bug in the plugin system...)
18:16:05 (?)
18:19:32 np! Always happy to help. :)
18:19:33 Doing that would be incredibly messy, but it should be possible. You can replace individual components of wpull in a plugin, see e.g. here: https://github.com/ArchiveTeam/ArchiveBot/blob/4a672dbff49597dd8a1f53d95ee60f6ff17a5c87/pipeline/archivebot/wpull/plugin.py#L71
18:20:08 And the relevant bug is this: https://github.com/ArchiveTeam/wpull/issues/383
18:20:28 (I really should get that giant PR merged already, huh?)
18:22:40 i read the first post that pr and went 'but what if they need references to objects that don't exist yet?' then i read the second post on the pr :<
18:23:08 *post on
18:23:51 (yes. what were the "various issues" with 2.0.1?)
18:25:00 Here's a selection from the laundry list: https://github.com/ArchiveTeam/wpull/pull/393
18:27:19 nice
18:27:39 (That's the giant PR I was talking about.)
18:32:31 speaking of wpull, i was thinking again about the subdomains issue from the other day, and it occurred to me that it might be nice to add 'subdomains' to the `--span-hosts-allow` options and that this would be fairly simple from a code standpoint.
18:35:31 do you think i should put in a pr (assuming i can figure out the test suite)? chfoo's comments on https://github.com/ArchiveTeam/wpull/issues/373 suggest that `--span-hosts-allow` has a limited future given its ambiguities / the lack of power in its implementation (reservations i'm sympathetic to), but it's been five years, so...
18:41:36 or, hm, i guess there's been no development to speak of since 2019. https://hackint.logs.kiska.pw/archiveteam-bs/20201028#c40220 i forgot we already talked about this
18:42:01 Yeah
18:42:11 That idea is something worth considering, I guess.
18:46:51 Although in general I think the focus should be on fixing the bugs first.
18:46:58 Anyway, we're well into -dev territory. :-)
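On the 'subdomains' idea for `--span-hosts-allow` above: the core acceptance test is just a suffix match of a URL's hostname against the hosts given on the command line. A standalone illustration of that check (a hypothetical helper, not actual wpull code; a real patch would still have to wire this into wpull's existing span-hosts filtering and its test suite):

```python
def is_subdomain_of(hostname, allowed_hosts):
    """True if hostname equals, or is a subdomain of, any allowed host."""
    hostname = hostname.lower().rstrip(".")
    for allowed in allowed_hosts:
        allowed = allowed.lower().rstrip(".")
        if hostname == allowed or hostname.endswith("." + allowed):
            return True
    return False

# e.g. a hypothetical --span-hosts-allow=subdomains on example.com would accept:
assert is_subdomain_of("www.example.com", ["example.com"])
assert is_subdomain_of("a.b.example.com", ["example.com"])
assert not is_subdomain_of("notexample.com", ["example.com"])
```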
21:20:10 Hello Everyone!
21:30:58 I note that IA's description of ArchiveTeam's ArchiveBot still mentions EFNet even though you moved to hackint: https://archive.org/details/archivebot
21:32:51 hi Ruk8, welcome! Did you have a question?
21:36:06 Nothing in particular; like a few days ago, I have a list of URLs that need to be archived... Today's list is mainly composed of Italian scientific journals, and for the rest there are some videos hosted on a CDN.
21:37:21 (I'm the guy that requested the adobe archival of framemaker/robohelp installers)
21:41:00 yes, i remember. if you upload your list to transfer.archivete.am and paste the link here again, someone will queue it into archivebot
21:44:10 Here's the list: https://transfer.archivete.am/10GlW3/urls.txt
21:48:17 thanks! someone will start the job soon
21:48:59 is there an advantage to using AB for that rather than SPN?
21:50:32 Thanks everyone! I'm glad to offer some help
21:54:40 pabs: ime, reliability, mostly. spn can be a bit flaky under load, especially for bulk submissions
21:55:48 ok. I've found the SPN email interface to be reasonable for those, the couple of times I used it
22:01:34 there are also (as always) some edge cases--e.g. tumblr's image rewriting breaks most images on tumblr pages under spn, but archivebot can disable it with `--user-agent-alias=curl`
22:32:50 all good. :) now oocities... I don't believe _we_ run it? Let me double check if we have any more info on that
22:33:18 no you don't run it.. I thought since you are in archive circles, that you may have heard something
22:39:05 on the oocities FAQ (under "can I download oocities.org"), they mention that they are interested in working with others who want to make backups of the content. If they didn't lose a server or something, maybe archiveteam could get in touch with them to make a backup at some point? https://web.archive.org/web/20220619181802/http://www.oocities.org/geocities-archive/faq.html
22:40:30 out of all the geocities archives, they had the most content. geocities.ws is down permanently, so we already lost one archive. reocities is still up. I'd hate to see this information lost.. it's all in the hands of a single group
22:42:42 i think the archiveteam torrent was more comprehensive
22:44:08 is everything in the torrent on the wayback machine? most of what I've tried to pull up in the past on WBM hasn't been archived
22:49:03 i don't think so--this was ages ago; afaict the project wasn't even in warc
22:50:00 yes, sorry. I was trying to see if we had any more information written down about oocities, but I can't seem to locate anything. This project wasn't WARC, so it won't be on the WBM, but rather in torrents.
22:50:01 boo... no seeders on the torrent
22:50:36 good to know thanks
22:50:44 https://archive.org/details/archiveteam-geocities
22:51:35 yeah, I believe we should have a full copy on IA somewhere, even if the torrent isn't seeding anymore.
22:51:50 yeah I have that page pulled up already, but there's no text search
22:52:11 https://wiki.archiveteam.org/index.php/GeoCities#How_can_I_find_a_page_or_website_I'm_looking_for?
22:53:33 thanks I'll do some more reading
22:54:39 I actually find more "real" information on old geocities archives now than I do on modern search engines, it seems. so much info scrubbed, or deranked, or pushed aside by AI-generated garbage. you'd be surprised how great a geocities search is in 2022... lol.
22:57:36 (by the way, tbp's peer numbers are often unreliable; there were definitely seeders on the torrent quite recently)
22:59:01 I'll dig out a spare drive and give it a shot.. thanks
23:42:30 https://twitter.com/gitlab/status/1555325376687226883
23:42:36 This sounds... less critical than before.
23:43:14 (as long as it's still accessible by the public, and not archived just for the repo owner...)
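One aside on the GitLab inactivity policy discussed earlier: the projects API response includes a last_activity_at timestamp, so if prioritising stale projects ever matters for the archiving effort, something along these lines could flag candidates. The one-year cutoff is only a guess at what GitLab might count as inactive, not anything they have published:

```python
from datetime import datetime, timedelta, timezone
import requests

def is_stale(project_id, max_age=timedelta(days=365)):
    """Rough check: has the project seen no activity within max_age?"""
    r = requests.get(f"https://gitlab.com/api/v4/projects/{project_id}")
    r.raise_for_status()
    # last_activity_at is an ISO 8601 timestamp, e.g. "2013-09-30T13:46:02.000Z"
    last = datetime.fromisoformat(
        r.json()["last_activity_at"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - last > max_age

print(is_stale(526))   # project ID 526, mentioned above
```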