05:16:32 https://www.theregister.com/2022/08/04/gitlab_data_retention_policy/
05:17:12 (GitLab to delete inactive projects from users on the free tier)
05:29:06 Highlights: "the policy is scheduled to come into force in September 2022"; "GitLab... will... give users weeks or months of warning"
05:34:35 Looking into discovery
05:47:09 Will look into discovery tomorrow
05:47:12 Anyhow thanks pabs
05:47:40 thanks
06:48:07 The gitlab one could be a big one
06:49:16 Yup and incredibly sad if they actually end up going through with it
06:49:41 Somewhat ironic to me that saving $1M/y is worth sacrificing years of goodwill
06:49:58 "GitLab is aware of the potential for angry opposition to the plan", sounds like they know it will piss people off, but I guess they're hoping it will make it worth it
06:54:26 GitLab's on-call person/SRE is about to have a bad couple of nights...
06:55:39 well, we'd hope to not cause any huge problems...
07:00:26 I wonder if the cost in network egress will go above the $1M in storage costs...
07:00:35 (also compute...)
07:01:02 (also SRE salary/time / project planning in response to the new load)
07:01:29 Oh I mean, probably! Seems incredibly shortsighted to me, but /shrug.
07:03:09 tme to pull out this article again - https://articles.uie.com/beans-and-noses/
07:03:12 *time
07:04:56 looks like GitLab runs under cloudflare... could end up being annoying.
07:07:35 trying to find public info on any rate limits, but they seem to be generous right now.
08:12:45 https://gitlab.com/explore/projects?sort=latest_activity_asc
08:13:28 https://gitlab.com/explore/projects?sort=created_asc
08:13:41 First project in that second list is ID 450
08:13:52 second one is 526
08:14:30 IDs appear sequential
08:14:51 Current max ID is approximately 38337314
08:15:01 based on https://gitlab.com/explore/projects?sort=created_desc
08:17:15 API endpoint for the ID, e.g.: https://gitlab.com/api/v4/projects/526
08:18:07 can also use ID without API: https://gitlab.com/projects/526
08:19:58 so probably ~38.4 million items, with many deleted/private
08:20:06 OrIdow6: ^
08:21:38 returns 404 for a private repo
08:23:09 I see a generous rate limit on that call of 2K/minute.
08:24:28 19.2K minutes if we use the full rate limit every minute on an IP, or about 13 days I think. Should be possible.
08:27:47 also as a note: in addition to public and private, there is also the now-discontinued "internal" visibility setting that some repos might still have. "Internal" means visible to any logged-in user.
08:30:10 found project ID 143 as the new lowest; it was archived, so it wasn't originally visible in the project listing
08:35:29 API for listing https://gitlab.com/api/v4/projects?sort=asc&order_by=id
08:38:08 Docs for this: https://docs.gitlab.com/ee/api/projects.html#list-all-projects
08:39:08 Also note that keyset-based pagination is likely needed to avoid running into a limit https://docs.gitlab.com/ee/api/projects.html#pagination-limits
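For reference, a minimal, untested sketch of walking that per-ID endpoint at the observed rate limit. The endpoint, the lowest/highest known IDs, and the 2K/minute figure come from the discussion above; the pacing and unauthenticated access are assumptions, not a tested client:

```python
import time
import requests

API = "https://gitlab.com/api/v4/projects/{}"   # per-ID endpoint mentioned above
RATE_PER_MIN = 2000                              # observed rate limit (see 08:23:09)

def probe(project_id, session):
    """Return the project JSON, or None if it is private/deleted (404)."""
    r = session.get(API.format(project_id))
    if r.status_code == 404:
        return None          # private, internal (when logged out), or deleted
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    s = requests.Session()
    for pid in range(143, 38_337_315):           # lowest known ID .. approx. current max
        meta = probe(pid, s)
        if meta is not None:
            print(pid, meta.get("path_with_namespace"))
        time.sleep(60 / RATE_PER_MIN)            # stay at or under ~2K requests/minute
    # ~38.4M IDs / 2000 per minute ≈ 19,200 minutes ≈ 13.3 days from a single IP
```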
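And a similarly rough sketch of enumerating via the list endpoint with the keyset pagination the docs describe; it assumes the documented behaviour of returning the next page URL in the Link header (which requests exposes via r.links). Keyset pagination matters here because offset-based paging hits a hard limit long before 38 million rows:

```python
import requests

# keyset pagination per https://docs.gitlab.com/ee/api/projects.html#pagination-limits
url = ("https://gitlab.com/api/v4/projects"
       "?pagination=keyset&per_page=100&order_by=id&sort=asc")

with requests.Session() as s:
    while url:
        r = s.get(url)
        r.raise_for_status()
        for project in r.json():
            print(project["id"], project["path_with_namespace"])
        # GitLab returns the next keyset page in the Link header; stop when it's absent
        url = r.links.get("next", {}).get("url")
```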
11:09:05 would it be possible for someone here to archive vtda.org? it's got a ton of useful retrocomputing materials
15:45:24 thuban: Continuing from #archiveteam-ot, the page I wanted to grab is not very large: https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/
15:46:27 OrIdow6: Continuing from #archiveteam-ot, yeah, that website ^ uses Clownflare and it's only accessible through either a real browser or a curl-impersonate request.
15:50:12 if you're not too worried about the header integrity issues that we retired chromebot over, you could try using its engine, crocoite: https://github.com/PromyLOPh/crocoite
15:52:25 i've never tried to run this myself and i don't know if it still works, but if it does it would be much more convenient than writing your own script/wpull plugin
15:53:43 It's worth a shot. I'll try installing it and giving it a try.
15:53:59 *I'll install it and give it a try.
16:01:08 Oh, if anyone already has an HTTrack instance up and running, could you please give that cyberciti.biz page I mentioned a try?
17:46:56 crocoite didn't work. :-/
17:53:03 well, wget and curl don't 'compose', and having looked at the wpull plugin api i don't think you can override the actual request implementation (is that an architectural limitation or just an oversight?)
17:54:19 Should the GitLab project be under #gitgud (used for GitHub) or #gitlost? I'm thinking merge with #gitgud
17:56:06 systwi_: so i guess in your position i would write a little python script that calls curl-impersonate, saves the results, and, after extracting links/URLs from appropriate MIME types, derelativizes, dedupes, and queues them
17:56:27 the naïve implementation won't scale, but you don't need scale
18:04:16 I'm thinking that's the next best step. I can already access the correct page with curl-impersonate, so it's up to me to do the rest.
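A sketch of the kind of script thuban describes at 17:56: shell out to one of curl-impersonate's wrapper scripts, save the body, extract and derelativize links, dedupe, and queue. The wrapper name (curl_chrome104) and the output layout are assumptions, and it skips the MIME-type filtering mentioned above; it stays deliberately naïve, since it only has to cover a handful of pages:

```python
import re
import subprocess
from collections import deque
from urllib.parse import urljoin, urldefrag

CURL = "curl_chrome104"          # one of curl-impersonate's wrapper scripts (assumed name)
HREF = re.compile(r'(?:href|src)=["\']([^"\']+)["\']', re.I)

def fetch(url):
    """Fetch a URL with an impersonated Chrome TLS/header fingerprint."""
    return subprocess.run([CURL, "-sL", url], capture_output=True, text=True).stdout

def crawl(start, prefix, limit=100):
    seen, queue = set(), deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        with open(f"page-{len(seen):04d}.html", "w") as f:   # naïve "save the results"
            f.write(body)
        for link in HREF.findall(body):                      # no MIME check in this sketch
            absolute, _ = urldefrag(urljoin(url, link))      # derelativize, drop fragments
            if absolute.startswith(prefix):
                queue.append(absolute)                       # dedupe happens via `seen`
    return seen

if __name__ == "__main__":
    crawl("https://www.cyberciti.biz/faq/debian-linux-install-openssh-sshd-server/",
          "https://www.cyberciti.biz/")
```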
18:06:22 seems damn silly to reimplement a whole spider/scraper, though. you'd think we'd have something for this
18:08:01 Is the need to use tools like curl-impersonate on some pages relatively new? I never recall encountering such situations even five months ago.
18:08:34 No, Buttflare has always been a pain in the butt.
18:08:39 But yeah, it would be nice to see this in a nicer tool than something crummy that I'd make. :-P
18:09:00 But apart from using a headless browser etc., there wasn't any tooling for it until curl-impersonate emerged.
18:09:45 Yeah, but I mean in the sense of the TLS fingerprinting that Cloudflare does.
18:09:55 I see.
18:10:39 i personally ran into it at least a year ago
18:10:58 They did increase the fingerprinting, but that was some time ago. I think around the same time they rolled out the new JS challenge.
18:11:14 JAA: got any insight on the wpull plugin question?
18:11:42 systwi_: using a bit of an experimental crawler I've been working on, we added https://github.com/refraction-networking/utls, which impersonates Chrome's ClientHello. I captured that site in WARC for you here: https://jakel.rocks/up/b96ca2096dbcbeb2/ZENO-20220804175848709-00001-ATHENA.warc.gz
18:12:30 :0
18:13:29 Woah, thanks so much, Jake! :-D
18:13:35 Extremely helpful.
18:14:03 thuban: It's complicated. Yeah, you can't override the actual requesting. But wpull is highly modular. In principle, it should be relatively easy to e.g. use just its scraping/link extraction part. I doubt this is documented much though.
18:15:52 (And actually, you probably can override the requesting via a plugin, except that's currently broken due to a bug in the plugin system...)
18:16:05 (?)
18:19:32 np! Always happy to help. :)
18:19:33 Doing that would be incredibly messy, but it should be possible. You can replace individual components of wpull in a plugin, see e.g. here: https://github.com/ArchiveTeam/ArchiveBot/blob/4a672dbff49597dd8a1f53d95ee60f6ff17a5c87/pipeline/archivebot/wpull/plugin.py#L71
18:20:08 And the relevant bug is this: https://github.com/ArchiveTeam/wpull/issues/383
18:20:28 (I really should get that giant PR merged already, huh?)
18:22:40 i read the first post that pr and went 'but what if they need references to objects that don't exist yet?' then i read the second post on the pr :<
18:23:08 *post on
18:23:51 (yes. what were the "various issues" with 2.0.1?)
18:25:00 Here's a selection from the laundry list: https://github.com/ArchiveTeam/wpull/pull/393
18:27:19 nice
18:27:39 (That's the giant PR I was talking about.)
18:32:31 speaking of wpull, i was thinking again about the subdomains issue from the other day, and it occurred to me that it might be nice to add 'subdomains' to the `--span-hosts-allow` options and that this would be fairly simple from a code standpoint.
18:35:31 do you think i should put in a pr (assuming i can figure out the test suite)? chfoo's comments on https://github.com/ArchiveTeam/wpull/issues/373 suggest that `--span-hosts-allow` has a limited future given its ambiguities / the lack of power in its implementation (reservations i'm sympathetic to), but it's been five years, so...
18:41:36 or, hm, i guess there's been no development to speak of since 2019. https://hackint.logs.kiska.pw/archiveteam-bs/20201028#c40220 i forgot we already talked about this
18:42:01 Yeah
18:42:11 That idea is something worth considering, I guess.
18:46:51 Although in general I think the focus should be on fixing the bugs first.
18:46:58 Anyway, we're well into -dev territory. :-)
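On the 'subdomains' idea for `--span-hosts-allow` above: the core acceptance test is just a suffix match of a URL's hostname against the hosts given on the command line. A standalone illustration of that check (a hypothetical helper, not actual wpull code; a real patch would still have to wire this into wpull's existing span-hosts filtering and its test suite):

```python
def is_subdomain_of(hostname, allowed_hosts):
    """True if hostname equals, or is a subdomain of, any allowed host."""
    hostname = hostname.lower().rstrip(".")
    for allowed in allowed_hosts:
        allowed = allowed.lower().rstrip(".")
        if hostname == allowed or hostname.endswith("." + allowed):
            return True
    return False

# e.g. a hypothetical --span-hosts-allow=subdomains on example.com would accept:
assert is_subdomain_of("www.example.com", ["example.com"])
assert is_subdomain_of("a.b.example.com", ["example.com"])
assert not is_subdomain_of("notexample.com", ["example.com"])
```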
21:20:10 Hello Everyone!
21:30:58 I note that IA's description of ArchiveTeam's ArchiveBot still mentions EFNet even though you moved to hackint: https://archive.org/details/archivebot
21:32:51 hi Ruk8, welcome! Did you have a question?
21:36:06 Nothing in particular; like a few days ago, I have a list of URLs that need to be archived... Today's list is mainly composed of Italian scientific journals, and for the rest there are some videos hosted on a CDN.
21:37:21 (I'm the guy that requested the adobe archival of framemaker/robohelp installers)
21:41:00 yes, i remember. if you upload your list to transfer.archivete.am and paste the link here again, someone will queue it into archivebot
21:44:10 Here's the list: https://transfer.archivete.am/10GlW3/urls.txt
21:48:17 thanks! someone will start the job soon
21:48:59 is there an advantage to using AB for that rather than SPN?
21:50:32 Thanks everyone! I'm glad to offer some help
21:54:40 pabs: ime, reliability, mostly. spn can be a bit flaky under load, especially for bulk submissions
21:55:48 ok. I've found the SPN email interface to be reasonable for those, the couple of times I used it
22:01:34 there are also (as always) some edge cases--e.g. tumblr's image rewriting breaks most images on tumblr pages under spn, but archivebot can disable it with `--user-agent-alias=curl`
22:32:50 all good. :) now oocities... I don't believe _we_ run it? Let me double check if we have any more info on that
22:33:18 no you don't run it.. I thought since you are in archive circles, that you may have heard something
22:39:05 on the oocities FAQ (under "can I download oocities.org"), they mention that they are interested in working with others who want to make backups of the content. If they didn't lose a server or something, maybe archiveteam could get in touch with them to make a backup at some point? https://web.archive.org/web/20220619181802/http://www.oocities.org/geocities-archive/faq.html
22:40:30 out of all the geocities archives, they had the most content. geocities.ws is down permanently, so we already lost one archive. reocities is still up. I'd hate to see this information lost.. it's all in the hands of a single group
22:42:42 i think the archiveteam torrent was more comprehensive
22:44:08 is everything in the torrent on the wayback machine? most of what I've tried to pull up in the past on WBM hasn't been archived
22:49:03 i don't think so--this was ages ago; afaict the project wasn't even in warc
22:50:00 yes, sorry. I was trying to see if we had any more information written down about oocities, but I can't seem to locate anything. This project wasn't WARC, so it won't be on the WBM, but rather in torrents.
22:50:01 boo... no seeders on the torrent
22:50:36 good to know thanks
22:50:44 https://archive.org/details/archiveteam-geocities
22:51:35 yeah, I believe we should have a full copy on IA somewhere, even if the torrent isn't seeding anymore.
22:51:50 yeah I have that page pulled up already, but there's no text search
22:52:11 https://wiki.archiveteam.org/index.php/GeoCities#How_can_I_find_a_page_or_website_I'm_looking_for?
22:53:33 thanks I'll do some more reading
22:54:39 I actually find more "real" information on old geocities archives now than I do on modern search engines, it seems. so much info scrubbed, or deranked, or pushed aside by AI-generated garbage. you'd be surprised how great a geocities search is in 2022... lol.
22:57:36 (by the way, tbp's peer numbers are often unreliable; there were definitely seeders on the torrent quite recently)
22:59:01 I'll dig out a spare drive and give it a shot.. thanks
23:42:30 https://twitter.com/gitlab/status/1555325376687226883
23:42:36 This sounds... less critical than before.
23:43:14 (as long as it's still accessible by the public, and not archived just for the repo owner...)
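One aside on the GitLab inactivity policy discussed earlier: the projects API response includes a last_activity_at timestamp, so if prioritising stale projects ever matters for the archiving effort, something along these lines could flag candidates. The one-year cutoff is only a guess at what GitLab might count as inactive, not anything they have published:

```python
from datetime import datetime, timedelta, timezone
import requests

def is_stale(project_id, max_age=timedelta(days=365)):
    """Rough check: has the project seen no activity within max_age?"""
    r = requests.get(f"https://gitlab.com/api/v4/projects/{project_id}")
    r.raise_for_status()
    # last_activity_at is an ISO 8601 timestamp, e.g. "2013-09-30T13:46:02.000Z"
    last = datetime.fromisoformat(
        r.json()["last_activity_at"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - last > max_age

print(is_stale(526))   # project ID 526, mentioned above
```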