00:02:06 (That's French for PB/TB, for the unaware.)
00:04:06 yes, octets = bytes
00:06:35 I was never worried about the source code. Such a popular project is bound to have hundreds of up-to-date copies everywhere at any time.
00:06:43 The issues, pull requests, etc. though...
00:07:38 also git lfs
00:07:49 i am not sure everyone has it
00:08:09 Yeah, quite rare I believe.
02:11:53 JAA: did you get the google-cache-over-multiple-ab-pipelines working? (is it worth setting up as a warrior job?) i agree, those issues are important
02:12:24 thuban: Nope, only two jobs, so ~12 days each expected.
02:12:42 (and i'm not just saying that because i've had a perfectly good pr languishing open for years)
02:13:50 Worth mentioning that the Google Cache certainly won't have all comments on long discussions because GitHub hides them behind a button. I strongly doubt Google fetches that.
02:19:54 Hi folks. Is anyone working on recovering the youtube-dl issue tracker?
02:20:11 yes
02:20:24 but it's likely not to be complete
02:21:02 Is there anything I can do to help? I do database work by day and have a basic understanding of the postgresql db behind the gitlab issue tracker, so I may be able to help with the restore process
02:21:57 the current method is archiving google's cached version of the issue pages, which is slow, likely not to cover every issue, and likely not to show all comments, but it's better than nothing
02:22:46 Please also archive the list of watchers of the Github repository. Depending on the exact setup, a watcher has a complete recollection of issues and comments.
02:24:48 Where is this data being stored right now, if you don't mind me asking? Is it possible to set up an SFTP server for read access to the result?
02:25:11 I'd like to take a swing at converting it into a Gitlab-importable dump
02:26:07 hurricos: Our methods generally store content on the Internet Archive
02:27:14 Different topic: has there been any discussion of Twitch streamers deleting their vods? Came across this: https://old.reddit.com/r/LivestreamFail/comments/jgt86c/pokimane_deletes_4_years_worth_of_vods_says_lirik/
02:27:26 jodizzle: how is that usually structured? (do you have a link?)
02:29:17 it looks like the subscriber list (https://github.com/ytdl-org/youtube-dl/watchers / https://api.github.com/repos/ytdl-org/youtube-dl/subscribers) is blocked along with the rest of the repository. wbm doesn't have the api, but it does have the first page (only) of the webpage
02:29:49 last grab: https://web.archive.org/web/20201002044053/https://github.com/ytdl-org/youtube-dl/watchers
02:30:45 hurricos: It will generally go into a warc file (https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem http://fileformats.archiveteam.org/wiki/WARC), which holds copies of webpages
02:30:52 Or in this case, multiple warc files
02:32:08 hurricos: what would the "setup" that covers all issues and comments look like? are we talking about hoping for undeleted notification emails, or is there a more structured option?
02:32:17 Google doesn't have watchers cached either.
02:32:17 It might be worth trying to get what you can of youtube-dlc as well: https://github.com/blackjack4494/youtube-dlc
02:32:36 It was a fork that was more active on merging pull requests
02:33:09 Seems small, will do.
02:33:35 `curl https://api.github.com/user/48381040/events | jq -c '.[] | select(.type == "IssueCommentEvent")' | wc -l`
02:33:52 The user API on Github exposes those issues despite the repo being taken down.
02:33:59 Ooh, nice.
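For reference, a minimal sketch of that command turned into a small scrape. The user id 48381040 is from the messages above; the page count and output file name are illustrative assumptions.

```sh
# Hypothetical sketch: page through the ytdl-org account's public events and
# keep only the issue-comment events exposed there.
for page in $(seq 1 10); do
  curl -s "https://api.github.com/user/48381040/events?page=${page}"
done | jq -c '.[] | select(.type == "IssueCommentEvent")' > ytdl-issue-comments.ndjson
```

The events API only reaches back roughly 90 days / 300 events (as noted further down), so this is a stopgap rather than a full recovery.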
02:34:00 We need a list of users
02:34:10 phenomenal!
02:34:51 i'll go through and pull out the user ids
02:35:19 It's only a couple of days' worth but we should use it while we can
02:35:31 I can set up an SFTP server to upload files to
02:35:35 if anyone is interested
02:35:53 thuban: how are you doing that
02:35:58 (what are you using to grab lists of users)
02:37:53 https://www.gharchive.org/
02:39:17 https://www.gharchive.org/
02:39:18 yes!
02:39:31 I have 13TB free storage, quickly composing a makefile
02:39:57 note that if you have individual JSON objects and a lot of cores, you can very easily separate them into individual files and use `parallel -P $jobfile jq ...`
02:40:05 I use this for data conversions for healthcare data from CCDAs
02:42:34 Google cache of youtube-dlc issues and PRs running now in AB.
02:42:55 https://api.github.com/repos/ytdl-org/youtube-dl/events is blocked, but https://api.github.com/networks/ytdl-org/youtube-dl/events is NOT... and it includes the root.
02:45:03 pagination with ?page=, last page in "link" header (9), so ~250 events, some of which are issues or prs
02:46:20 JAA, want me to generate those urls for AB or can you do it
02:47:38 thuban: are you suggesting that https://api.github.com/networks/ytdl-org/youtube-dl/events can be descended as a tree somehow?
02:48:05 it includes events from every repo in the network, and the root is one of the repos in the network
02:48:23 some of the events in there are from the youtube-dl repo
02:48:26 gotcha
02:48:42 I've started downloading Github
02:48:52 (from gharchive.org)
02:49:02 plan is to use some jq once I've got enough to scrape from
02:49:04 that only goes back up to 90 days or 300 events, though
02:49:08 thuban: Done
02:49:26 I think gharchive contains all of the open-source events
02:49:41 I can bring some rack servers to the lab if I need to distribute the load, but ultimately I don't think it's that much data
02:50:13 i'm looking at gharchive now, tho i'm not about to beg google for bigquery access
02:50:22 :P
02:51:37 ytdl-org/youtube-dl repo id is 1039520
02:52:15 hm, or not?
02:52:59 there seem to be multiple repos with different names and that "id"
02:55:07 lol, no there aren't. my bad
03:01:35 hi o/ i'm told there's youtube-dl things happening here?
03:02:52 cadence: yes
03:06:26 what was the group ID for youtube-dl's org?
03:06:50 it was called ytdl-org, I remember that much
03:07:18 cadence: they are referring to the results from https://api.github.com/networks/ytdl-org/youtube-dl/events
03:07:54 > pagination with ?page=, last page in "link" header (9), so ~250 events, some of which are issues or prs
03:08:08 hurricos: is group id different from org id?
03:08:41 It's the same. I'm just bad at IRC and invariably close pidgin.
03:08:54 org id is 48381040
03:09:08 thus losing track. I've got my makefile sorted out for pulling down from gharchive, just want to start writing .jq files now
03:09:15 thanks
03:09:39 you said you know something about importing to gitlab?
03:10:19 I work with Postgres at work, run a Gitlab server personally, and know how to poke at the schemas well enough to figure it out
03:10:48 I don't know what import formats it can ingest but I can try seeing what my instance can do. Ultimately getting the data is more important right now
03:11:53 getting forms of the data and verifying that an issue history can be constructed from it, anyway.
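A minimal sketch of the jq step being planned here, assuming the downloaded GH Archive hour files sit in the current directory and that each event carries a `.repo.name` field (true in the samples discussed below, but treat the field name as an assumption):

```sh
# Hypothetical sketch: GH Archive hour files are gzipped NDJSON, one event per
# line, so a single jq pass keeps everything touching the ytdl-org repo.
zcat 2020-10-*.json.gz \
  | jq -c 'select(.repo.name == "ytdl-org/youtube-dl")' \
  > ytdl-events.ndjson
```

The `parallel` trick mentioned above applies here too if a single-threaded pass turns out to be too slow.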
03:11:55 `curl data.gharchive.org/2020-10-22-04.json.gz` fails
03:12:10 looks like everything from 0 to 9 fails
03:12:13 (hours per day)
03:12:23 odd
03:12:25 every day?
03:12:39 every day
03:12:57 hm
03:13:31 ~90k events per hour, so that's a pretty big loss
03:13:40 wonder if it's intentional
03:14:29 base64'd xz of the Makefile I use: /Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AETAMddADIYSu7u0hpd9dYp9fTAQfIbYxiU1p8TfBs5IdFuAC1VPGqUizyTyQM3kjKkgdZY02qc+hU1hEBGQY9LotHqn75UWyE8SUZ6ocwBYFo6ib3adfUNy2PeI410+x4Sg70/kOi1QRT4jD6JA0iwy3ZKXV8VRWaOaunwT8rJuLh/4buyanFESS59pD26yc56mT+y9i1JhUwRNWvPdC5xWqB+ozXkmCHYl2v2+bDvMo8LwEiqIsJpQAyw5LJvZunaIs6M0KSdOKQRIAAAAAyYFJJrcDR+AAHjAZQCAADcNkNfscRn+wIAAAAABFla
03:14:48 (to download)
03:16:02 Oh I see what's going on.
03:16:07 I'm pulling 09 and not 9
03:16:22 lol
03:16:23 oh well, I'll fix that in the recipe later
03:16:46 I think gharchive should have it, could you pastebin some sample messages I should expect to be in these archives?
03:16:57 or I can actually pull from the page I posted with curl
03:19:06 i just grabbed a random hour and i definitely see ytdl issues/comments complete with bodies
03:19:19 (event types documentation: https://developer.github.com/v3/activity/event_types/)
03:20:34 `find . -type f ! -empty -print0 | parallel -0 -P 20 zgrep -we 13954170828` ...
03:21:24 just checking whether a specific one I pulled from https://api.github.com/user/48381040/events is in the archives
03:21:48 It's not
03:22:04 event id?
03:22:08 so if you curl that you can grep for 13954170828
03:23:03 that id is not in the respective file for 2020-10-23T19:13:57Z
03:23:24 Nope, I'm a dumbass. I did not pull that hour
03:24:24 it's in 19 :D
03:24:40 Would this data include pull request branches?
03:24:41 OK, so we get the same data from gharchive as from the api.github.com/user/${id}/events
03:24:57 check out the event_types from the developer.github.com documentation
03:25:06 I can give a list of what I see in the JSONs though
03:25:17 maybe not the content of the PRs, but the issue titles and first post, yes
03:25:27 yea, I wonder if it is just the posts
03:25:28 Ajay: gharchive logs _all_ public events, so yes, the data is there (unless they're private)
03:25:28 perhaps even a transactional history of what and how things were tagged. Let me get a list of event types
03:25:42 awesome
03:25:50 ooh, that's great
03:25:54 I just pulled October, it's 26G
03:25:59 without 0-9AM
03:26:01 so
03:26:09 I should resize the volume and kick this off for an overnight run
03:26:28 seems like this can be processed into a SQLite database with some good parallel code tbf
03:27:44 napkin math shows gharchive until 2015 is 3TB
03:27:53 sounds small. that can't contain public comment history?
03:28:00 are you sure it's *all* public events @ajay?
03:28:15 well it is gzipped though.
03:28:18 you mean thuban
03:28:22 yeah sorry
03:28:24 thuban:
03:28:28 Git objects are already compressed
03:28:41 not git objects, the public comments in issues etc.
03:28:52 that's what needs to be recovered
03:28:58 Hmm. Then maybe. Still seems kinda small
03:29:07 Sorry for misunderstanding
03:29:30 the git objects for pull requests would also be important to recover
03:30:00 hmm
03:30:33 youtube-dl has tons of un-merged pull requests in limbo
03:30:52 I think youtube-dlc got a lot of them, but yeah
03:31:11 youtube-dlc? we have backups of it though?
03:31:14 the PRs
03:31:31 oh, we do?
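Circling back to the failing GH Archive URLs from the start of this stretch: the hour segment is not zero-padded (2020-10-22-4.json.gz, not -04). A rough sketch of a corrected fetch loop, with illustrative dates and directory names:

```sh
# Hypothetical sketch: GH Archive days are zero-padded, hours (0-23) are not.
mkdir -p gharchive
for day in $(seq -w 1 31); do
  for hour in $(seq 0 23); do
    f="2020-10-${day}-${hour}.json.gz"
    curl -sf -o "gharchive/${f}" "https://data.gharchive.org/${f}" \
      || echo "missing ${f}" >&2
  done
done
```

The ytdl-org jq filter sketched earlier can then run over gharchive/*.json.gz once the 0-9 hour gaps are filled in.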
03:31:50 youtube-dlc is a fork that merged a bunch of those pull requests
03:31:59 youtube-dlc was a fork meant to try to go through the pull request backlog and merge stuff
03:32:05 I have it cloned but nothing else from it
03:32:18 I wonder if it got pulled down at the same time. If not, someone may have a record
03:32:43 it seems to have been pulled at the same time
03:32:47 gotcha
03:32:50 as well as many other forks
03:32:52 Yeah, cause it was a fork.
03:33:18 Well, the PRs can be reestablished by users when a new site goes up.
03:33:25 I have it cloned, which means I probably have a bunch of youtube-dl's unmerged PRs.
03:34:04 also, checking my math again, it's looking closer to 6TB
03:34:04 The youtube-dlc people are discussing similar things on gitter, IMO it would be good to organise with them
03:34:12 which gitter?
03:34:35 The only reason -c existed was because the youtube-dl maintainers were uncooperative. And now that they're out of the way…
03:34:43 One sec while I get it
03:34:57 i think that's a little hasty
03:34:57 `youtube-dlc/community`
03:35:22 https://gitter.im/youtube-dlc/community
03:35:24 I'm there
03:35:31 I've been there already :)
03:35:58 IMO it would be good to collab & do community outreach & all that, cause you both have the same goals now
03:36:29 I'm literally in that gitter though. The gitter just reflects https://app.element.io/#/room/!xbOjHLEQzPJBXjeTWo:matrix.org
03:36:48 Ah, cool ^.^
03:36:54 really I just need a place to coordinate this. I have a homeserver running a Gitlab instance that I'd rather not use for this, because I like having home internet and I bet I'd get DDoSed
03:37:00 but I *could* do it there.
03:37:22 aha! ok, so, the PullRequestEvent _does_ have references to the git repository and commit id of the pull request _from the fork_
03:37:44 so, still accessible if the repo being pr'd into is hidden
03:39:25 .payload.pull_request.head.repo.git_url and .payload.pull_request.head.sha
03:40:33 JAA or anyone else in a position to judge: do we want to mirror GHArchive?
03:40:37 as in any/all of (a) downloading its existing backlog, (b) downloading it on an ongoing basis, or (c) doing similar and/or redundant work as part of #gitgud
03:41:40 (i asked over there whether the web portion of the project included api results, but it's pretty quiet)
03:41:46 I can create .torrent files of the whole thing
03:42:03 ... and seed
03:42:39 Laboratory B has server infrastructure, that is, we have a single 12-bay R510 with 13.5TB RAID10, 150Mbps symmetrical
03:43:18 sorry, never mind. I'm not even part of archiveteam. Just gonna focus on the youtube-dl thing.
03:44:40 There's nothing stopping you from helping out with other archiveteam projects after this gets resolved
03:44:55 I don't have the time :( I distribute laptops for the local community
03:45:47 if I could get paid doing that I'd be happy to, I've just already overcommitted and I need to get better at cleaning house before I start anything like that
03:45:58 but I personally rely on youtube-dl, so
03:46:10 I just want to see them get the basic stuff to start back up with.
03:47:30 Fair enough
03:47:44 hurricos: downloading large amounts of data is a problem we already have a lot of infrastructure for--i think if you really want to see them back up and running soon, the hard part is the export.
03:48:40 exporting archived data to a gitlab-importable format is something that's been in our long-term plans for a while, but nobody's made a start on it yet--do you want to?
03:49:20 I'll take a swing.
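A sketch of what using those two PullRequestEvent fields could look like, run over the ytdl-events.ndjson file from the earlier filter step. File and directory names are illustrative, and forks that were pulled down along with the repo will simply fail to clone:

```sh
# Hypothetical sketch: list each PR's fork repo and head commit, then try to
# recover the branches from whichever forks are still up.
jq -r 'select(.type == "PullRequestEvent")
       | [.payload.pull_request.head.repo.git_url, .payload.pull_request.head.sha]
       | @tsv' ytdl-events.ndjson | sort -u > pr-heads.tsv

while IFS=$'\t' read -r repo sha; do
  # git:// URLs may need rewriting to https:// depending on the setup.
  dir="prs/$(printf '%s' "$repo" | tr -c 'A-Za-z0-9' '_')"
  git clone --bare "$repo" "$dir" 2>/dev/null || continue
  git -C "$dir" cat-file -e "$sha" && echo "recovered ${sha} from ${repo}"
done < pr-heads.tsv
```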
Not something I was planning to do already, but I am planning on doing some work on fresh copies of a Gitlab instance, so I might as well try
03:49:53 great! if you poke us here or in #gitgud i'm sure people will contribute once the ball is rolling
03:50:23 I remember now. Gitlab actually lets you import issue lists as CSVs. Comment history perhaps not directly. It would be good to centralize that work around an issue tracker already, does the archiveteam have one? Kanban board of any kind?
03:52:20 generally just the github issue tracker for each code repository
03:52:44 (which is now seeming like maybe not the best policy ;) but i believe we have copies)
03:53:36 non-directly-code-related plans and issues are coordinated over irc or on the wiki
03:54:04 chicken-and-egg
03:54:32 ?
03:56:10 Lol, just that the issue tracker for this stuff would be hosted on the very place you'd like to make sure things are safe from
03:56:23 mm
03:57:51 not having readline installed (from within a docker container) *sucks*
04:06:23 OK, got it. Gitlab uses an `issues` table, comments are in the `notes` table. `notes` points to `issues` by the `noteable_type` and `noteable_id` fields. The state of the issue is in the `todos` table; a `todos` row can point to an `issues` row via the `target_type` and `target_id` fields.
04:06:48 everything else, e.g. attachments, links together as you might expect, but it's a fairly loose, object-oriented framework
04:07:03 that's 11.3 CE, I haven't updated in a while :upside-down-face:
04:07:33 I'll go into #gitgud and ask about repositories
04:23:59 so are you folks just trying to recreate the youtube-dl issues from the gharchive data?
04:24:25 Yes
04:26:29 good luck!
04:26:44 I'll need it :(
04:41:00 @Jean-Fred luckily, PS's store has an all-games option 👀 makes this much easier
08:43:16 Potential archiving inspiration? https://old.reddit.com/r/AskReddit/comments/jgv6iq/the_internet_is_scheduled_to_go_down_forever_you/
09:05:13 Heh ^^
09:05:33 'If you see something, save it'
11:37:19 -purplebot- FileFormats created by JesseW (+21, Redirected page to [[Formats]]) just now -- https://www.archiveteam.org/?diff=45703&oldid=0
12:49:53 i have found a python script that keyword-crawls websites
12:50:56 but it crashes after just a few URLs captured https://transfer.notkiska.pw/11P1Ep/website_keyword_crawl_error.txt
12:51:21 https://github.com/wessel1512/website-keyword-crawler/blob/master/website_keyword_crawl.py
12:52:03 only i don't know how to fix it
13:00:28 and i'd like to filter things like .jpg, .png and .js files out
14:15:19 -purplebot- File:Albumee-logo.gif uploaded by Arkiver (+0) just now -- https://www.archiveteam.org/?diff=45704&oldid=0
15:32:38 mgrandi Awesome! Thanks :-) Are you archiving the us-en store only? Asking because there are 5 stores (EMEA and others, America (North and South), Asia, Japan and China), and some stores with many subdomains per country/lang (eg German, French etc). And some information (like local content ratings eg USK or PEGI) might be only in one sub-store. I got that list of domains − not sure how exhaustive it is but that's already a start: https://justpaste.it/93kgd
19:25:28 youtube-dlc is back: https://github.com/blackjack4494/yt-dlc
19:25:43 It's a more up-to-date version of youtube-dl
19:26:09 The dev handles pull requests a lot faster there
20:30:36 @Jean-Fred: I can see if I can do the other language stores too
22:27:17 mgrandi Thanks for looking; I heard from others that the store pages are gone for many/most people now :-(
22:28:45 i think you have to log in?
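Picking up the GitLab import thread from earlier in the afternoon, a sketch of turning the filtered GH Archive stream into a CSV for GitLab's issue importer. The title,description header is how I understand that importer's format (treat it as an assumption), file names are illustrative, and comment history would still have to land in the `notes` table some other way:

```sh
# Hypothetical sketch: one CSV row per "issue opened" event, title + body only.
{
  echo '"title","description"'
  jq -r 'select(.type == "IssuesEvent" and .payload.action == "opened")
         | [.payload.issue.title, (.payload.issue.body // "")]
         | @csv' ytdl-events.ndjson
} > ytdl-issues.csv
```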
22:28:59 Aaaaah maybe?
22:29:27 https://store.playstation.com/en-us/grid/STORE-MSF77008-ALLGAMES/1?PlatformPrivacyWs1=exempt&direction=asc&psappver=19.15.0&scope=sceapp&smcid=psapp%3Alink%20menu%3Astore&sort=release_date
22:29:33 (Also, looking at https://web.archive.org/web/sitemap/https://store.playstation.com/ , things seem to have been well crawled in the past)
22:29:53 going to store.playstation.com gives a white page that looks new, but the link i just posted still looks like the old game page
22:30:27 i'll work on this now then
22:30:41 Thank you so much, really :)
22:31:09 (ok off to sleep here)
22:44:16 thuban: Size of GH Archive?