-
JAA
(That's French for PB/TB, for the unaware.)
-
nico_32
yes octets = bytes
-
JAA
I was never worried about the source code. Such a popular project is bound to have hundreds of up-to-date copies everywhere at any time.
-
JAA
The issues, pull requests, etc. though...
-
nico_32
also git lfs
-
nico_32
i am not sure everyone has it
-
JAA
Yeah, quite rare I believe.
-
thuban
JAA: did you get the google-cache-over-multiple-ab-pipelines working? (is it worth setting up as a warrior job?) i agree, those issues are important
-
JAA
thuban: Nope, only two jobs, so ~12 days each expected.
-
thuban
(and i'm not just saying that because i've had a perfectly good pr languishing open for years)
-
JAA
Worth mentioning that the Google Cache certainly won't have all comments on long discussions because GitHub hides them behind a button. I strongly doubt Google fetches that.
-
hurricos
Hi folks. Is anyone working on recovering the youtube-dl issue tracker?
-
thuban
yes
-
thuban
but it's likely not to be complete
-
hurricos
Is there anything I can do to help? I do database work by day and have a basic understanding of the postgresql db behind the gitlab issue tracker so I may be able to help with the restore process
-
thuban
the current method is archiving google's cached version of the issue pages, which is slow, likely not to cover every issue, and likely not to show all comments, but it's better than nothing
-
hurricos
Please also archive the list of watchers of the Github repository. Depending on the exact setup a watcher has a complete recollection of issues and comments.
-
hurricos
Where is this data being stored right now, if you don't mind me asking? Is it possible to set up an SFTP server for read access to the result?
-
hurricos
I'd like to take a swing at beginning converting it into a Gitlab-importable dump
-
jodizzle
hurricos: Our methods generally store content on the Internet Archive
-
jodizzle
Different topic: has there been any discussion of Twitch streamers deleting their vods? Came across this:
old.reddit.com/r/LivestreamFail/com…es_4_years_worth_of_vods_says_lirik
-
hurricos
jodizzle: how is that usually structured? (do you have a link?)
-
thuban
it looks like the subscriber list (github.com/ytdl-org/youtube-dl/watchers / api.github.com/repos/ytdl-org/youtube-dl/subscribers) is blocked along with the rest of the repository. wbm doesn't have the api, but it does have the first page (only) of the webpage
-
thuban
-
OrIdow6
-
OrIdow6
Or in this case, multiple warc files
-
thuban
hurricos: what would the "setup" that covers all issues and comments look like? are we talking about hoping for undeleted notification emails, or is there a more structured option?
-
JAA
Google doesn't have watchers cached either.
-
icedice
It might be worth trying to get what you can of youtube-dlc as well:
github.com/blackjack4494/youtube-dlc
-
icedice
It was a fork that was more active on fixing pull requests
-
JAA
Seems small, will do.
-
hurricos
`curl api.github.com/user/48381040/events | jq -c '.[] | select(.type=="IssueCommentEvent")' | wc -l`
-
hurricos
The user API on Github exposes those issues despite the repo being taken down.
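A rough Python equivalent of the jq filter above, as a sketch only: it counts events of one type in an events list that has already been downloaded. The `count_events` helper and the sample data are made up for illustration; only the `type` field and the `IssueCommentEvent` value come from the command.

```python
import json


def count_events(events, event_type="IssueCommentEvent"):
    """Count events whose "type" field equals event_type."""
    return sum(1 for event in events if event.get("type") == event_type)


# Hypothetical sample shaped like a response from the events endpoint.
sample = json.loads("""
[
  {"id": "1", "type": "IssueCommentEvent"},
  {"id": "2", "type": "PushEvent"},
  {"id": "3", "type": "IssueCommentEvent"}
]
""")

print(count_events(sample))
```

Note that the comparison has to be an equality test; in jq, `=` inside `select` assigns instead of comparing, so every event would pass the filter.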
-
JAA
Ooh, nice.
-
hurricos
We need a list of users
-
thuban
phenomenal!
-
thuban
i'll go through and pull out the user ids
-
hurricos
It's only a couple of days worth but we should use it while we can
-
hurricos
I can set up an SFTP server to upload files to
-
hurricos
if anyone is interested
-
hurricos
thuban: how are you doing that
-
hurricos
(what are you using to grab lists of users)
-
icedice
-
hurricos
-
hurricos
yes!
-
hurricos
I have 13TB free storage, quickly composing a makefile
-
hurricos
note that if you have individual JSON objects and a lot of cores you can very easily separate them into individual files and use `parallel -P $jobfile jq ...`
-
hurricos
I use this for data conversions for healthcare data from CCDAs
-
JAA
Google cache of youtube-dlc issues and PRs running now in AB.
-
thuban
-
thuban
pagination with ?page=<i>, last page in "link" header (9), so ~250 events, some of which are issues or prs
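A minimal sketch of that pagination scheme, assuming the `Link` header follows the usual `<url>; rel="last"` shape. The `last_page` and `page_urls` helpers are hypothetical, and the network events URL in the example is only a guess at which endpoint is being paged.

```python
import re


def last_page(link_header):
    """Return the page number of the rel="last" link, or None if absent."""
    for part in link_header.split(","):
        match = re.search(r'<([^>]*)>\s*;\s*rel="last"', part)
        if match:
            page = re.search(r"[?&]page=(\d+)", match.group(1))
            return int(page.group(1)) if page else None
    return None


def page_urls(base_url, last):
    """Build one URL per page, from page 1 through the last page."""
    return [f"{base_url}?page={i}" for i in range(1, last + 1)]


# Assumed endpoint and header, for illustration only.
base = "https://api.github.com/networks/ytdl-org/youtube-dl/events"
header = f'<{base}?page=2>; rel="next", <{base}?page=9>; rel="last"'

for url in page_urls(base, last_page(header)):
    print(url)
```

The resulting list is the kind of thing that could be handed to AB as a URL list.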
-
thuban
JAA, want me to generate those urls for AB or can you do it
-
hurricos
thuban: are you suggesting that
api.github.com/networks/ytdl-org/youtube-dl/events can be descended as a tree somehow?
-
thuban
it includes events from every repo in the network, and the root is one of the repos in the network
-
thuban
some of the events in there are from the youtube-dl repo
-
hurricos
gotcha
-
hurricos
I've started downloading Github
-
hurricos
(from gharchive.org)
-
hurricos
plan is to use some jq once I've got enough to scrape from
-
thuban
that only goes back up to 90 days or 300 events, though
-
JAA
thuban: Done
-
hurricos
I think gharchive contains all of the open-source events
-
hurricos
I can bring some rack servers to the lab if I need to distribute the load ultimately but I don't think it's that much data
-
thuban
i'm looking at gharchive now, tho i'm not about to beg google for bigquery access
-
hurricos
:P
-
thuban
ytdl-org/youtube-dl repo id is 1039520
-
thuban
hm, or not?
-
thuban
seem to be multiple repos with different names and that "id"
-
thuban
lol, no there aren't. my bad
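One way to use that repo id, sketched under the assumption that each GH Archive hour is a gzipped file of newline-delimited JSON events with a top-level `repo.id` field. `events_for_repo` is a hypothetical helper, not part of any existing tool.

```python
import gzip
import json

YTDL_REPO_ID = 1039520  # repo id quoted in the chat


def events_for_repo(path, repo_id=YTDL_REPO_ID):
    """Yield events for one repo from a gzipped, newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            # Assumption: the repo id lives at event["repo"]["id"].
            if (event.get("repo") or {}).get("id") == repo_id:
                yield event
```

Matching on the numeric id rather than the name should keep working across repository renames, if the id is stable as assumed here.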
-
cadence
hi o/ i'm told there's youtube-dl things happening here?
-
thuban
cadence: yes
-
hurricos
what was the group ID for youtube-dl's org?
-
cadence
it was called ytdl-org, I remember that much
-
Ajay
-
Ajay
> pagination with ?page=<i>, last page in "link" header (9), so ~250 events, some of which are issues or prs
-
thuban
hurricos: is group id different from org id?
-
hurricos
It's the same. I'm just bad at IRC and invariably close pidgin.
-
thuban
org id is 48381040
-
hurricos
thus losing track. I've got my makefile sorted out for pulling down from gharchive, just want to start writing .jq files now
-
hurricos
thanks
-
thuban
you said you know something about importing to gitlab?
-
hurricos
I work with Postgres at work, run a Gitlab server personally and know how to poke at the schemas well enough to figure it out

-
hurricos
I don't know what import formats it can digest but I can try seeing what my instance can do. Ultimately getting the data is more important right now
-
hurricos
ingest*
-
hurricos
getting forms of the data and verifying that an issue history can be constructed from it anyways.
-
hurricos
`curl data.gharchive.org/2020-10-22-04.json.gz` fails
-
hurricos
looks like everything from 0 to 9 fails
-
hurricos
(hours per day)
-
thuban
odd
-
thuban
every day?
-
hurricos
every day
-
thuban
hm
-
thuban
~90k events per hour, so that's a pretty big loss
-
thuban
wonder if it's intentional
-
hurricos
base64'd xz of the Makefile I use: /Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AETAMddADIYSu7u0hpd9dYp9fTAQfIbYxiU1p8TfBs5IdFuAC1VPGqUizyTyQM3kjKkgdZY02qc+hU1hEBGQY9LotHqn75UWyE8SUZ6ocwBYFo6ib3adfUNy2PeI410+x4Sg70/kOi1QRT4jD6JA0iwy3ZKXV8VRWaOaunwT8rJuLh/4buyanFESS59pD26yc56mT+y9i1JhUwRNWvPdC5xWqB+ozXkmCHYl2v2+bDvMo8LwEiqIsJpQAyw5LJvZunaIs6M0KSdOKQRIAAAAAyYFJJrcDR+AAHjAZQCAADcNkNfscRn+wIAAAAABFla
-
hurricos
(to download)
-
hurricos
Oh I see what's going on.
-
hurricos
I'm pulling 09 and not 9
-
thuban
lol
-
hurricos
oh well, I'll fix that in the recipe later
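A small sketch of the fixed URL generation, assuming the hourly files are named with an unpadded hour (`-9`, not `-09`) as found above. The `gharchive_urls` helper is hypothetical and the `https://` scheme is an assumption.

```python
from datetime import date


def gharchive_urls(day):
    """Build the 24 hourly archive URLs for one day; the hour is not zero-padded."""
    return [
        f"https://data.gharchive.org/{day:%Y-%m-%d}-{hour}.json.gz"
        for hour in range(24)
    ]


urls = gharchive_urls(date(2020, 10, 22))
print(urls[4])
```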
-
hurricos
I think gharchive should have it, could you pastebin some sample messages I should expect to be in these archives?
-
hurricos
or I can pull actually from the page I posted with curl
-
thuban
i just grabbed a random hour and i definitely see ytdl issues/comments complete with bodies
-
thuban
-
hurricos
`find . -type f ! -empty -print0 | parallel -0 -P 20 zgrep -we 13954170828` ...
-
hurricos
just looking to see if a specific one I pulled from api.github.com/user/48381040/events is in the archives
-
hurricos
It's not
-
thuban
event id?
-
hurricos
so if you curl that you can grep for 13954170828
-
hurricos
that id is not in the respective file on 2020-10-23T19:13:57Z
-
hurricos
Nope, I'm a dumbass. I did not pull that hour
-
hurricos
it's in 19 :D
-
Ajay
Would this data include pull request branches?
-
hurricos
OK, so we get the same data from gharchive as from the api.github.com/user/${id}/events
-
hurricos
check out the event_types from the developer.github.com documentation
-
hurricos
I can give a list of what I see in the JSONs though
-
hurricos
maybe not the content of the PRs but the issue titles and first post, yes
-
Ajay
yea, I wonder if it is just the posts
-
thuban
Ajay: gharchive logs _all_ public events, so yes, the data is there (unless they're private)
-
hurricos
perhaps even a transactional history of what and how things were tagged. Let me get a list of event types
-
hurricos
awesome
-
Ajay
ooh, that's great
-
hurricos
I just pulled october, it's 26g
-
hurricos
without 0-9AM
-
hurricos
so
-
hurricos
I should resize the volume and kick this off for an overnight run
-
hurricos
seems like this can be processed into a SQLite database with some good parallel code tbf
-
hurricos
napkin math shows gharchive until 2015 is 3TB
-
hurricos
sounds small. that can't contain public comment history?
-
hurricos
are you sure it's *all* public events @ajay?
-
hurricos
well it is gzipped though.
-
Ajay
you mean thuban
-
hurricos
yeah sorry
-
hurricos
thuban:
-
cadence_
Git objects are already compressed
-
hurricos
not git objects, the public comments in issues etc.
-
hurricos
that's what needs recovered
-
cadence_
Hmm. Then maybe. Still seems kinda small
-
cadence_
Sorry for misunderstanding
-
Ajay
the git object for pull requests would also be important to recover
-
hurricos
hmm
-
Ajay
youtube-dl has tons of un-merged pull requests in limbo
-
cadence_
I think youtube-dlc got a lot of them, but yeah
-
hurricos
youtube-dlc? we have backups of it though?
-
hurricos
the PRs
-
Ajay
oh, we do?
-
cadence_
youtube-dlc is a fork that merged a bunch of those pull requests
-
Ajay
youtube-dlc was a fork meant to try to go through the pull request backlog and merge stuff
-
cadence_
I have it cloned but nothing else from it
-
hurricos
I wonder if it got pulled down at the same time. If not someone may have a record
-
Ajay
it seemed to have been pulled at the same time
-
hurricos
gotcha
-
Ajay
as well as many other forks
-
cadence_
Yeah, cause it was a fork.
-
hurricos
Well, the PRs can be reestablished by users when a new site goes up.
-
cadence_
I have it cloned which therefore means I probably have a bunch of youtube-dl's unmerged prs.
-
hurricos
also checking my math again, it's looking closer to 6TB
-
cadence_
The youtube-dlc people are discussing similar things on gitter, IMO it would be good to organise with them
-
hurricos
which gitter?
-
cadence_
The only reason -c existed was because the youtube-dl maintainers were uncooperative. And now that they're out of the way…
-
cadence_
One sec while I get it
-
thuban
i think that's a little hasty
-
cadence_
`youtube-dlc/community`
-
hurricos
-
hurricos
I'm there
-
hurricos
I've been there already :)
-
cadence_
IMO it would be good to collab & community outreach & all that cause you both have the same goals now
-
hurricos
I'm literally in that gitter though. The gitter just reflects
app.element.io/#/room/!xbOjHLEQzPJBXjeTWo:matrix.org
-
cadence_
Ah, cool ^.^
-
hurricos
really I just need a place to coordinate this. I have a homeserver running a Gitlab instance that I'd rather not use for this because I like having home internet and I bet I'd get DDoSed
-
hurricos
but I *could* do it there.
-
thuban
aha! ok, so, the PullRequestEvent _does_ have references to the git repository and commit id of the pull request _from the fork_
-
thuban
so, still accessible if the repo being pr'd into is hidden
-
thuban
.payload.pull_request.head.repo.git_url and .payload.pull_request.head.sha
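A sketch of pulling those two fields out of already-parsed events. `pull_request_heads` is a hypothetical helper, the sample event is invented, and treating a missing `head.repo` as "fork no longer available" is an assumption.

```python
def pull_request_heads(events):
    """Return (number, git_url, sha) for each distinct PR head that names its fork."""
    heads = {}
    for event in events:
        if event.get("type") != "PullRequestEvent":
            continue
        pr = (event.get("payload") or {}).get("pull_request") or {}
        head = pr.get("head") or {}
        repo = head.get("repo") or {}  # assumed to be null when the fork is gone
        if repo.get("git_url") and head.get("sha"):
            heads[(repo["git_url"], head["sha"])] = pr.get("number")
    return [(number, url, sha) for (url, sha), number in heads.items()]


# Invented event; the fork URL and sha are placeholders.
sample = [{
    "type": "PullRequestEvent",
    "payload": {"pull_request": {
        "number": 1,
        "head": {
            "sha": "0123abc",
            "repo": {"git_url": "git://github.com/example-user/youtube-dl.git"},
        },
    }},
}]

for number, url, sha in pull_request_heads(sample):
    print(number, url, sha)
```

Each pair could then be fetched from the fork directly, since the fork stays reachable even while the target repo is hidden.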
-
thuban
JAA or anyone else in a position to judge: do we want to mirror GHArchive?
-
thuban
as in any/all of (a) downloading its existing backlog, (b) downloading it on an ongoing basis, or (c) doing similar and/or redundant work as part of #gitgud
-
thuban
(i asked over there whether the web portion of the project included api results, but it's pretty quiet)
-
hurricos
I can create .torrent files of the whole thing
-
hurricos
... and seed
-
hurricos
Laboratory B has server infrastructure, that is, we have a single 12-bay R510 with 13.5TB RAID10, 150Mb symmetrical
-
hurricos
sorry, never mind. I'm not even part of archiveteam. Just gonna focus on the youtube-dl thing.
-
Arcorann
There's nothing stopping you from helping out with other archiveteam projects after this gets resolved
-
hurricos
I don't have the time :( I distribute laptops for local community
-
hurricos
if I could get paid doing that I'd be happy to, I've just already overcommitted and I need to get better at cleaning up house before I start anything like that
-
hurricos
but I personally rely on youtube-dl, so
-
hurricos
I just want to see them get the basic stuff to start back up with.
-
Arcorann
Fair enough
-
thuban
hurricos: downloading large amounts of data is a problem we already have a lot of infrastructure for--i think if you really want to see them back up and running soon the hard part is the export.
-
thuban
exporting archived data to a gitlab-importable format is something that's been in our long-term plans for a while, but nobody's made a start on it yet--do you want to?
-
hurricos
I'll take a swing. Not something I was planning to do already but I am planning on doing some work on fresh copies of a Gitlab instance so I might as well try
-
thuban
great! if you poke us here or in #gitgud i'm sure people will contribute once the ball is rolling
-
hurricos
I remember now. Gitlab actually lets you import issue lists as CSVs. Comment history perhaps not directly. It would be good to centralize that work around an issue tracker already, does the archiveteam have one? Kanban board of any kind?
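A sketch of turning archived issue-open events into a CSV, assuming the importer wants a header row with `title` and `description` columns; that should be checked against the target Gitlab version. `issues_to_csv` is a hypothetical helper and the event field names (`payload.action`, `payload.issue.title`, `payload.issue.body`) are assumed from the events data discussed earlier.

```python
import csv
import io


def issues_to_csv(events):
    """Render opened-issue events as CSV text with title and description columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "description"])
    for event in events:
        payload = event.get("payload") or {}
        if event.get("type") == "IssuesEvent" and payload.get("action") == "opened":
            issue = payload.get("issue") or {}
            writer.writerow([issue.get("title") or "", issue.get("body") or ""])
    return out.getvalue()
```

Comments would still need a separate path, as noted above.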
-
thuban
generally just the github issue tracker for each code repository
-
thuban
(which is now seeming like maybe not the best policy ;) but i believe we have copies)
-
thuban
non-directly-code-related plans and issues are coordinated over irc or on the wiki
-
hurricos
chicken-and-egg
-
thuban
?
-
hurricos
Lol, just having an issue tracker for things hosted on the place you'd like to make sure is safe from them
-
thuban
mm
-
hurricos
not having readline installed (from within a docker container) *sucks*
-
hurricos
OK, got it. Gitlab uses an `issues` table, comments are in the `notes` table. `notes` points to `issues` by `noteable_type` and `noteable_id` fields. The state of the issue is in the `todos` table; a `todos` can point to an `issues` via the `target_type` and `target_id` fields.
-
hurricos
everything else, e.g. attachments, links on as you might expect, but it's a fairly loose, object-oriented framework
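The `notes` to `issues` link described above can be sketched with a toy schema. This uses in-memory SQLite as a stand-in for Postgres, keeps only the columns named in the chat plus invented `title` and `note` columns, and assumes `noteable_type` holds the string `'Issue'` for issue comments.

```python
import sqlite3


def build_demo_db():
    """Create a toy in-memory database with trimmed issues and notes tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE notes (
            id INTEGER PRIMARY KEY,
            noteable_type TEXT,
            noteable_id INTEGER,
            note TEXT
        );
        INSERT INTO issues VALUES (1, 'Example issue');
        INSERT INTO notes VALUES (10, 'Issue', 1, 'First comment');
        INSERT INTO notes VALUES (11, 'MergeRequest', 1, 'Unrelated comment');
    """)
    return conn


def comments_for_issue(conn, issue_id):
    """Return the comment bodies attached to one issue, oldest note id first."""
    rows = conn.execute(
        "SELECT n.note FROM notes AS n "
        "JOIN issues AS i ON n.noteable_id = i.id "
        "WHERE n.noteable_type = 'Issue' AND i.id = ? "
        "ORDER BY n.id",
        (issue_id,),
    )
    return [row[0] for row in rows]


print(comments_for_issue(build_demo_db(), 1))
```

The type filter matters because `noteable_id` alone is ambiguous: the same id can refer to an issue or to some other kind of record.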
-
hurricos
that's 11.3 CE, I haven't updated in a while :upside-down-face:
-
hurricos
I'll go into gitgud and ask about repositories
-
mgrandi
so are you folks just trying to recreate the youtube-dl issues from the gharchive data?
-
hurricos
Yes
-
mgrandi
good luck!
-
hurricos
I'll need it :(
-
mgrandi
@Jean-Fred luckily, PS's store has an all games option 👀 makes this much easier
-
Ryz
-
HP_Archivist
Heh ^^
-
HP_Archivist
'If you see something, save it'
-
purplebot
FileFormats created by JesseW (+21, Redirected page to [[Formats]]) just now --
archiveteam.org/?diff=45703&oldid=0
-
wessel1512
i have found a python script that crawls websites for keywords
-
wessel1512
-
wessel1512
-
wessel1512
only i dont know how to fix it
-
wessel1512
and i like to filter things like: .jpg .png and .js files out
-
purplebot
-
Jean-Fred
mgrandi Awesome! Thanks :-) Are you archiving the us-en store only? Asking because there are 5 stores (EMEA and others, America (North and South), Asia, Japan and China), and some stores with many subdomains per country/lang (eg German, French etc). And some information (like local content ratings eg USK or PEGI) might be only in one sub-store. I
-
Jean-Fred
got that list of domains − justpaste.it/93kgd , not sure how exhaustive it is but that’s already a start
-
icedice
-
icedice
It's a more up to date version of youtube-dl
-
icedice
The dev fixes pull requests a lot faster there
-
mgrandi
@Jean-Fred: I can see if I can get the other language stores too
-
Jean-Fred
mgrandi Thanks for looking ; I heard from others that the store pages are gone for many/most people now :-(
-
mgrandi
i think you have to log in?
-
Jean-Fred
Aaaaah maybe?
-
mgrandi
-
Jean-Fred
(Also, looking at
web.archive.org/web/sitemap/https://store.playstation.com , things seem to have been well crawled in the past)
-
mgrandi
going to store.playstation.com seems to be a white page that seems new but going to that link i just posted seems like the old game page
-
mgrandi
i'll work on this now then
-
Jean-Fred
Thank you so much, really :)
-
Jean-Fred
(ok off to sleep here)
-
JAA
thuban: Size of GH Archive?