02:09:09 So the anniversary of the death of Comicogs & Co. is approaching, and they *still* haven't managed to upload the dumps... *facepalm* https://comics.discogs.com/
02:21:42 You know the vid dot me stuff? There's a non-zero chance the companies that have embedded links to that website will just delete the articles instead.
02:22:09 Not much we can do about that though.
02:26:13 could use a search engine to find links to vid.me and archive the articles?
02:28:48 They're embeds, not links. Is there any search engine that indexes those?
02:30:57 hmm
02:31:30 Also, scraping search engines is hard to impossible. They don't like that at all. Only Bing kind of tolerates a slow speed.
02:31:45 So... $£€$£€
04:34:33 Bauerbach edited SCP Foundation (+100): https://wiki.archiveteam.org/?diff=47007&oldid=46701
08:02:08 Sanqui created Sweb.cz (+215, Created page with "Czech freehost provided by…"): https://wiki.archiveteam.org/?title=Sweb.cz
19:05:22 FloydHub's a JS hellhole, so archiving it is tricky. The search accepts an empty query: https://www.floydhub.com/search/projects?page=0&query=
19:07:10 11806 pages on the empty search for projects
19:07:25 Fun
19:08:08 5738 pages of datasets
19:09:26 8644 pages of users
19:09:31 lol
19:09:35 It's all in their main JS file.
19:09:42 Because SPA.
19:10:11 Oh wait no, those are other hits. Odd.
19:10:38 SPA :(
19:12:13 wait, who is @JAA?
19:12:20 Jake: How did you find those numbers so quickly?
19:12:33 just casually went through it, starting with bigger numbers
19:12:57 jamesp: He's just another archivist.
19:13:04 :-)
19:13:35 I'm just wondering about textfiles. Does he come on?
19:13:53 He's around. Usually.
19:14:12 Don't ping him though, that's like poking a sleeping bear.
19:14:31 he isn't on
19:14:47 His IRC username isn't textfiles.
19:15:28 then what is it? put a space before the last character, like james p
19:16:57 The projects API endpoint is this: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=15 (that's the second page). Limit maxes out at a bit over 1000.
19:17:24 1023 is the maximum allowed limit, to be precise.
19:18:13 Some of FloydHub seems partially broken already: at https://www.floydhub.com/fastai/projects/lesson1_dogs_cats the project is 'empty', but it has a few jobs, some of which have files.
19:18:29 Yeah, I haven't found any non-empty project yet, actually.
19:19:02 I think they are all displaying as empty for some odd reason; this job seems to show the code for the project: https://www.floydhub.com/fastai/projects/lesson1_dogs_cats/13/code
19:21:20 Max offsets: projects: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=177090 ; datasets: https://www.floydhub.com/api/v1/datasets/search?query=&limit=15&offset=86070 ; users: https://www.floydhub.com/api/v1/profile/search?query=&limit=15&offset=129660
19:23:59 I'll run through all the datasets and get a size estimate real quick.
19:25:38 Datasets are separate from job outputs, it seems?
19:26:28 I believe so
19:26:30 But I imagine the datasets will be much larger.
19:32:12 JustAnotherArchivist edited Deathwatch (+141, /* 2021 */ Add FloydHub): https://wiki.archiveteam.org/?diff=47009&oldid=47002
19:37:02 script started. might be a bit.
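For reference, walking that pagination could look roughly like the sketch below. Only the endpoint URL and the 1023 limit cap are taken from the conversation; the response shape (assumed here to be a JSON list of project objects) and the stop-on-empty condition are assumptions.

```python
# Sketch: enumerate every FloydHub project via the search API quoted above.
# The endpoint and the limit cap of 1023 come from the log; the response
# shape (assumed to be a JSON list of project objects) is a guess.
import json
import time
import urllib.request

API = 'https://www.floydhub.com/api/v1/projects/search'
LIMIT = 1023  # maximum the API accepts, per the log

def fetch(offset):
    url = f'{API}?query=&limit={LIMIT}&offset={offset}'
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

projects = []
offset = 0
while True:
    batch = fetch(offset)
    if not batch:  # ran past the last result, stop
        break
    projects.extend(batch)
    offset += LIMIT
    time.sleep(1)  # keep the crawl polite

print(len(projects), 'projects found')
```

Paging until an empty batch would also sidestep having to probe for the max offsets mentioned above.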
20:00:02 Tumblr is taking a page from Patreon: account posts that can only be accessed by paying: https://cdn.discordapp.com/attachments/455120412460974104/868582804609458197/E68AvqbVcAI2FRt.png - https://cdn.discordapp.com/attachments/455120412460974104/868582834409984010/E68AzrMVgAEOLlE.png
20:02:38 Alright, FloydHub datasets: I got 35363216376832 bytes for the total size, or around 35 terabytes.
20:03:53 Will extract a full list of URLs to use later in a minute.
20:04:36 https://techcrunch.com/2021/07/21/tumblr-debuts-post-a-subscription-service-for-gen-z-creators/ - https://techcrunch.com/2021/07/22/tumblr-community-lash-out-post-plus-subscription/
20:15:13 35 TB doesn't sound too bad. I wonder how much duplication there is.
20:15:23 Wow... the videos are still up!
20:15:28 Dude...
20:16:15 sorry, wrong channel
20:17:23 I also used totalSizeBytes rather than latestSizeBytes, so that might count however many versions exist; most seem to have just one version, though. I'll do another run with latest as well.
21:01:13 Looks like there is a fair bit of duplication: there are ~1K forks(?) of the dog-breed-images dataset, and a version of that dataset is 700 MB.
21:01:41 Lists of users, datasets, and projects: https://verifiedjoseph.com/archiveteam/website-discovery/floydhub.com/
21:07:47 Beat me to it! :-)
21:08:12 Total size with just the latest version is 29575463216128 bytes, or about 29.6 TB.
21:10:04 My version: https://transfer.archivete.am/c9djz/dataset_ids as well as all of the JSON from the datasets: https://transfer.archivete.am/TID3N/dataset_full_json
21:24:28 There's obviously no good way to detect duplicates just from this, but summing up unique latestSizeBytes over 1 GiB gives 8.2 TiB.
21:25:51 Well, all datasets over 1 GiB are 8.5 TiB though, so I guess there's not too much duplication, maybe.
21:29:26 I checked the tracker and I don't see it moving. What's happening?
21:30:47 jamesp: Still the wrong channel.
21:31:06 oops, I keep forgetting
23:45:55 hello, I can't rsync my warrior anymore
23:46:05 I'm getting this from my client
23:46:24 It's being worked on.
23:47:12 ah ok... cool... wasn't sure if I borked something up
23:47:25 thanks... I'll leave it running then
23:47:26 thanks
23:49:06 cool, looks like it's working now... the port changed on the rsync... nice! Thanks a lot!
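The size estimates in this exchange could be reproduced from the dumped JSON with something like the sketch below, including the crude duplicate check via unique latestSizeBytes over 1 GiB. The field names totalSizeBytes and latestSizeBytes appear in the log and the filename matches the transfer.archivete.am dump; the assumption that the file holds one JSON object per line is mine.

```python
# Sketch: redo the size sums from the dumped dataset JSON. Field names come
# from the log; the one-JSON-object-per-line file format is an assumption.
import json

GIB = 1024 ** 3
total_all = total_latest = 0
unique_large = set()

with open('dataset_full_json') as f:
    for line in f:
        ds = json.loads(line)
        total_all += ds.get('totalSizeBytes', 0)
        latest = ds.get('latestSizeBytes', 0)
        total_latest += latest
        if latest > GIB:
            unique_large.add(latest)  # crude dedup: treat equal sizes as copies

print(f'all versions:   {total_all / 1e12:.1f} TB')
print(f'latest only:    {total_latest / 1e12:.1f} TB')
print(f'unique > 1 GiB: {sum(unique_large) / 2**40:.1f} TiB')
```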