-
pabs
-
pabs
(GitLab to delete inactive projects from users on the free tier)
-
OrIdow6
Highlights: "the policy is scheduled to come into force in September 2022"; "GitLab... will... give users weeks or months of warning"
-
OrIdow6
Looking into discovery
-
OrIdow6
Will look into discovery tomorrow
-
OrIdow6
Anyhow thanks pabs
-
pabs
thanks
-
AK
The gitlab one could be a big one
-
Jake
Yup and incredibly sad if they actually end up going through with it
-
Jake
Somewhat ironic to me that saving 1M/y is worth sacrificing years of goodwill
-
AK
"GitLab is aware of the potential for angry opposition to the plan", sounds like they know it will piss people off, but I guess they're hoping it will make it worth it
-
ghuntley
GitLab's on-call person/SRE is about to have a bad couple of nights...
-
Jake
well, we'd hope to not cause any huge problems...
-
ghuntley
I wonder if the cost in network egress will go above $1M in storage costs...
-
ghuntley
(also compute...)
-
ghuntley
(also SRE salary/time / project planning in response to the new load)
-
Jake
Oh I mean, probably! Seems incredibly shortsighted to me, but /shrug.
-
ghuntley
tme to pull out this article again -
articles.uie.com/beans-and-noses
-
ghuntley
*time
-
Jake
looks like GitLab runs under cloudflare... could end up being annoying.
-
Jake
trying to find public info on any ratelimits, but they seem to be generous right now.
-
tech234a
-
tech234a
-
tech234a
First project in that second list is ID 450
-
tech234a
second one is 526
-
tech234a
IDs appear sequential
-
tech234a
Current max ID is approximately 38337314
-
tech234a
-
tech234a
-
tech234a
can also use ID without API:
gitlab.com/projects/526
-
tech234a
so probably about ~38.4 million items with many deleted/private
-
tech234a
OrIdow6: ^
-
tech234a
returns 404 for a private repo
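
A minimal sketch of the ID-based discovery idea above, assuming project IDs are sequential and that `gitlab.com/projects/<id>` answers 404 for private projects. The helper names `candidate_urls` and `classify` are hypothetical, and treating 429 as "retry later" and every other status as "worth a closer look" is an assumption, not something confirmed in the channel. It makes no network calls.

```python
from typing import Iterator

# URL pattern quoted in the channel: gitlab.com/projects/526
BASE_URL = "https://gitlab.com/projects/{project_id}"


def candidate_urls(first_id: int, last_id: int) -> Iterator[str]:
    """Yield one URL per project ID in the inclusive range."""
    for project_id in range(first_id, last_id + 1):
        yield BASE_URL.format(project_id=project_id)


def classify(status: int) -> str:
    """Map an HTTP status to a coarse discovery outcome (assumed mapping)."""
    if status == 404:
        # Private projects return 404; deleted ones presumably look the same.
        return "private_or_missing"
    if status == 429:
        return "retry_later"
    return "candidate"


if __name__ == "__main__":
    for url in candidate_urls(526, 528):
        print(url)
```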
-
Jake
I see a generous rate limit on that call of 2K/minute.
-
Jake
19.2K minutes if we use the full rate limit every minute on an ip or about 13 days I think. Should be possible.
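
The estimate above can be checked with a few lines, assuming roughly 38.4 million IDs and the observed limit of 2K requests per minute per IP. The function name `scan_time` and the `ips` parameter are illustrative only.

```python
def scan_time(total_ids: int, per_minute: int, ips: int = 1) -> tuple[float, float]:
    """Return (minutes, days) needed to request every ID once."""
    minutes = total_ids / (per_minute * ips)
    days = minutes / (60 * 24)
    return minutes, days


if __name__ == "__main__":
    minutes, days = scan_time(38_400_000, 2_000)
    print(f"{minutes:,.0f} minutes, about {days:.1f} days")
```

With a single IP this gives 19,200 minutes, or a little over 13 days, which matches the figure quoted in the channel.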
-
tech234a
also as a note: in addition to public and private, there is also a now-discontinued "internal" visibility setting that some repos might still have. "Internal" means visible to any logged-in user.
-
tech234a
found a project ID 143 as new lowest, it was archived so it wasn't originally visible in the project listing
-
tech234a
-
tech234a
-
tech234a
Also make note that keyset-based pagination is likely needed to avoid running into a limit
docs.gitlab.com/ee/api/projects.html#pagination-limits
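
A rough sketch of how keyset pagination against the projects API might be driven: request the first page with `pagination=keyset`, then keep following the `rel="next"` URL from the `Link` response header. The helper names are hypothetical, and the code only builds and parses URLs; the HTTP client is left out.

```python
import re
from typing import Optional
from urllib.parse import urlencode

API = "https://gitlab.com/api/v4/projects"

# Matches entries such as: <https://...>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def first_page_url(per_page: int = 100) -> str:
    """Build the URL of the first keyset-paginated page, ordered by ID."""
    params = {
        "pagination": "keyset",
        "per_page": per_page,
        "order_by": "id",
        "sort": "asc",
    }
    return f"{API}?{urlencode(params)}"


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Extract the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None
```

A caller would loop: fetch `first_page_url()`, store the results, pass the response's `Link` header to `next_page_url`, and stop when it returns `None`.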
-
apache2_
would it be possible for someone here to archive vtda.org? it's got a ton of useful retrocomputing materials
-
systwi_
thuban: Continuing from #archiveteam-ot, the page I wanted to grab is not very large:
cyberciti.biz/faq/debian-linux-install-openssh-sshd-server
-
systwi_
OrIdow6: Continuing from #archiveteam-ot, yeah, that website ^ uses Clownflare and it's only accessible through either a real browser or a curl-impersonator request.
-
thuban
if you're not too worried about the header integrity issues that we retired chromebot over, you could try using its engine, crocoite:
github.com/PromyLOPh/crocoite
-
thuban
i've never tried to run this myself and i don't know if it still works, but if it does it would be much more convenient than writing your own script/wpull plugin
-
systwi_
It's worth a shot. I'll try installing it and giving it a try.
-
systwi_
*I'll install it and give it a try.
-
systwi_
Oh, if anyone already has an HTTrack instance up and running, could you please give that cyberciti.biz page I mentioned a try?
-
systwi_
crocoite didn't work. :-/
-
thuban
well, wget and curl don't 'compose', and having looked at the wpull plugin api i don't think you can override the actual request implementation (is that an architectural limitation or just an oversight?)
-
jamesp
Should the GitLab project be under #gitgud (used for GitHub) or #gitlost? I'm thinking merge with #gitgud
-
thuban
systwi_: so i guess in your position i would write a little python script that calls curl-impersonate, saves the results, and after extracting links/urls from appropriate mime types derelativizes, dedupes, and queues them
-
thuban
the naïve implementation won't scale, but you don't need scale
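
One way the script described above could look, as a naive sketch only: fetch a page, save it, extract links from HTML, derelativize, dedupe, and queue same-host URLs. The binary name `curl_chrome104` is an assumption about how curl-impersonate is installed locally, and all function names are hypothetical. The fetcher is passed in as a parameter so the crawl logic can be exercised without network access.

```python
import subprocess
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Tuple
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect raw href/src attribute values from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(base_url: str, html: str) -> List[str]:
    """Return absolute http(s) URLs found in html, with fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urldefrag(urljoin(base_url, link)).url for link in parser.links)
    return [u for u in absolute if urlsplit(u).scheme in ("http", "https")]


def fetch_with_curl(url: str, binary: str = "curl_chrome104") -> Tuple[str, bytes]:
    """Fetch url via a curl-impersonate wrapper; binary name is an assumption."""
    result = subprocess.run(
        [binary, "-sS", "-L", "-w", "\n%{content_type}", url],
        capture_output=True,
        check=True,
    )
    # curl appends the -w output after the body, so split on the last newline.
    body, _, mime = result.stdout.rpartition(b"\n")
    return mime.decode("ascii", errors="replace"), body


def crawl(
    start_url: str,
    fetch: Callable[[str], Tuple[str, bytes]],
    max_pages: int = 50,
) -> Dict[str, bytes]:
    """Breadth-first crawl of one host; returns {url: body}."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages: Dict[str, bytes] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        mime, body = fetch(url)
        pages[url] = body
        if not mime.startswith("text/html"):
            continue  # only HTML is scanned for further links
        for link in extract_links(url, body.decode("utf-8", errors="replace")):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl(start_url, fetch_with_curl)` would run it against a live site. Note that this keeps only response bodies in memory and does not write WARC, so it is not a substitute for a proper archival crawler.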
-
systwi_
I'm thinking that's the next best step. I can already access the correct page with curl-impersonate, so it's up to me to do the rest.
-
thuban
seems damn silly to reimplement a whole spiderer/scraper, though. you'd think we'd have something for this
-
systwi_
Is the need to use tools like curl-impersonate on some pages relatively new? I never recall encountering such situations even five months ago.
-
JAA
No, Buttflare has always been a pain in the butt.
-
systwi_
But yeah, it would be nice to see this in a nicer tool than something crummy that I'd make. :-P
-
JAA
But apart from using a headless browser etc., there wasn't any tooling for it until curl-impersonate emerged.
-
systwi_
Yeah, but I mean in the sense of TLS fingerprinting that Cloudflare does.
-
systwi_
I see.
-
thuban
i personally ran into it at least a year ago
-
JAA
They did increase the fingerprinting, but that was some time ago. I think around the same time they rolled out the new JS challenge.
-
thuban
JAA: got any insight on the wpull plugin question?
-
Jake
systwi_: using a bit of an experimental crawler I've been working on, we added
github.com/refraction-networking/utls which impersonates Chrome's ClientHello. I captured that site in WARC for you here:
jakel.rocks/up/b96ca2096dbcbeb2/ZEN…20804175848709-00001-ATHENA.warc.gz
-
thuban
:0
-
systwi_
Woah, thanks so much, Jake! :-D
-
systwi_
Extremely helpful.
-
JAA
thuban: It's complicated. Yeah, you can't override the actual requesting. But wpull is highly modular. In principle, it should be relatively easy to e.g. use just its scraping/link extraction part. I doubt this is documented much though.
-
JAA
(And actually, you probably can override the requesting via a plugin, except that's currently broken due to a bug in the plugin system...)
-
thuban
(?)
-
Jake
np! Always happy to help. :)
-
JAA
Doing that would be incredibly messy, but it should be possible. You can replace individual components of wpull in a plugin, see e.g. here:
github.com/ArchiveTeam/ArchiveBot/b…line/archivebot/wpull/plugin.py#L71
-
JAA
And the relevant bug is this:
ArchiveTeam/wpull #383
-
JAA
(I really should get that giant PR merged already, huh?)
-
thuban
i read the first post that pr and went 'but what if they need references to objects that don't exist yet?' then i read the second post on the pr :<
-
thuban
*post on
-
thuban
(yes. what were the "various issues" with 2.0.1?)
-
JAA
Here's a selection from the laundry list:
ArchiveTeam/wpull #393
-
thuban
nice
-
JAA
(That's the giant PR I was talking about.)
-
thuban
speaking of wpull, i was thinking again about the subdomains issue from the other day, and it occurred to me that it might be nice to add 'subdomains' to the `--span-hosts-allow` options and that this would be fairly simple from a code standpoint.
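
For illustration, the core of a 'subdomains' check could be as small as the following. This is a hypothetical standalone helper, not wpull's code or its filter API, and it ignores complications such as public suffixes and IDNA encoding.

```python
def same_or_subdomain(candidate_host: str, root_host: str) -> bool:
    """True if candidate_host equals root_host or is a subdomain of it."""
    candidate = candidate_host.lower().rstrip(".")
    root = root_host.lower().rstrip(".")
    # Require a dot boundary so "notexample.com" does not match "example.com".
    return candidate == root or candidate.endswith("." + root)
```

The dot-boundary comparison is the important design choice here: a plain suffix match would also accept unrelated hosts that merely end with the same characters.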
-
thuban
do you think i should put in a pr (assuming i can figure out the test suite)? chfoo's comments on
ArchiveTeam/wpull #373 suggest that `--span-hosts-allow` has a limited future given its ambiguities / the lack of power in its implementation (reservations i'm sympathetic to), but it's been five years, so...
-
thuban
or, hm, i guess there's been no development to speak of since 2019.
hackint.logs.kiska.pw/archiveteam-bs/20201028#c40220 i forgot we already talked about this
-
JAA
Yeah
-
JAA
That idea is something worth considering, I guess.
-
JAA
Although in general I think the focus should be on fixing the bugs first.
-
JAA
Anyway, we're well into -dev territory. :-)
-
Ruk8
Hello Everyone!
-
pabs
I note that IA's description of ArchiveTeam's ArchiveBot still mentions EFNet even though you moved to hackint:
archive.org/details/archivebot
-
pabs
hi Ruk8, welcome! Did you have a question?
-
Ruk8
Nothing in particular. Like a few days ago, I have a list of URLs that need to be archived... Today's list is mainly composed of Italian scientific journals, and the rest are some videos hosted on a CDN.
-
Ruk8
(I'm the guy that requested the adobe archival of framemaker/robohelp installers)
-
thuban
yes, i remember. if you upload your list to transfer.archivete.am and paste the link here again, someone will queue it into archivebot
-
Ruk8
-
thuban
thanks! someone will start the job soon
-
pabs
is there an advantage of using AB for that rather than SPN?
-
Ruk8
Thanks everyone! I'm glad to offer some help
-
thuban
pabs: ime, reliability, mostly. spn can be a bit flaky under load, especially for bulk submissions
-
pabs
ok. I've found the SPN email interface to be reasonable for that the couple of times I used it
-
thuban
there are also (as always) some edge cases--e.g. tumblr's image rewriting breaks most images on tumblr pages under spn, but archivebot can disable it with `--user-agent-alias=curl`
-
Jake
all good. :) now oocities... I don't believe _we_ run it? Let me double check if we have any more info on that
-
cowsay-moo
no you don't run it.. I thought since you are in archive circles, that you may have heard something
-
cowsay-moo
on the oocities FAQ (under "can I download oocities.org"), they mention that they are interested in working with others who want to make backups of the content. If they didn't lose a server or something, maybe archiveteam could get with them sometime to make a backup at some point?
web.archive.org/web/20220619181802/…ties.org/geocities-archive/faq.html
-
cowsay-moo
out of all the geocities archives, they had the most content. geocities.ws is down permanently, so we already lost one archive. reocities is still up. I'd hate to see this information lost.. it's all in the hands of a single group
-
thuban
i think the archiveteam torrent was more comprehensive
-
cowsay-moo
is everything in the torrent on the wayback machine? most of what I've tried to pull up in the past on WBM hasn't been archived
-
thuban
i don't think so--this was ages ago; afaict the project wasn't even in warc
-
Jake
yes, sorry. I was trying to see if we had any more information written down about oocities, but I can't seem to locate anything. This project wasn't WARC, so it won't be on the WBM, but rather in torrents.
-
cowsay-moo
boo... no seeders on the torrent
-
cowsay-moo
good to know thanks
-
thuban
-
Jake
yeah, I believe we should have a full copy on IA somewhere, even if the torrent isn't seeding anymore.
-
cowsay-moo
yeah I have that page pulled up already, but there's no text search
-
thuban
-
cowsay-moo
thanks I'll do some more reading
-
cowsay-moo
I actually find more "real" information on old geocities archives now than I do on modern search engines, it seems. so much info scrubbed, or deranked, or pushed aside by AI-generated garbage. you'd be surprised how great a geocities search is in 2022... lol.
-
thuban
(by the way, tbp's peer numbers are often unreliable; there were definitely seeders on the torrent quite recently)
-
cowsay-moo
I'll dig out a spare drive and give it a shot.. thanks
-
Jake
-
Jake
This sounds... less critical than before.
-
Jake
(as long as it's still accessible by the public, and not archived just for the repo owner...)