02:09:09 So the anniversary of the death of Comicogs & Co. is approaching, and they *still* haven't managed to upload the dumps... *facepalm* https://comics.discogs.com/
02:21:42 You know the vid dot me stuff? There's a non-zero chance the companies that have embedded links to that website will just delete the articles instead.
02:22:09 Not much we can do about that though.
02:26:13 could use a search engine to find links to vid.me and archive the articles?
02:28:48 They're embeds, not links. Is there any search engine that indexes those?
02:30:57 hmm
02:31:30 Also, scraping search engines is hard to impossible. They don't like that at all. Only Bing kind of tolerates a slow speed.
02:31:45 So... $£€$£€
04:34:33 Bauerbach edited SCP Foundation (+100): https://wiki.archiveteam.org/?diff=47007&oldid=46701
08:02:08 Sanqui created Sweb.cz (+215, Created page with "Czech freehost provided by…"): https://wiki.archiveteam.org/?title=Sweb.cz
19:05:22 FloydHub's a JS hellhole, so archiving it is tricky. The search accepts an empty query: https://www.floydhub.com/search/projects?page=0&query=
19:07:10 11806 pages on the empty search for projects
19:07:25 Fun
19:08:08 5738 pages of datasets
19:09:26 8644 pages of users
19:09:31 lol
19:09:35 It's all in their main JS file.
19:09:42 Because SPA.
19:10:11 Oh wait no, those are other hits. Odd.
19:10:38 SPA :(
19:12:13 wait, who is @JAA?
19:12:20 Jake: How did you find those numbers so quickly?
19:12:33 just casually went through it, starting with bigger numbers
19:12:57 jamesp: He's just another archivist.
19:13:04 :-)
19:13:35 I'm just wondering about textfiles. Does he come on?
19:13:53 He's around. Usually.
19:14:12 Don't ping him though, that's like poking a sleeping bear.
19:14:31 he isn't on
19:14:47 His IRC username isn't textfiles.
19:15:28 then what is it? put a space before the last character, like james p
19:16:57 The projects API endpoint is this: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=15 (that's the second page). Limit maxes out at a bit over 1000.
19:17:24 1023 is the maximum allowed limit, to be precise.
19:18:13 Some of FloydHub seems partially broken already: at https://www.floydhub.com/fastai/projects/lesson1_dogs_cats the project is 'empty', but it has a few jobs, some of which have files.
19:18:29 Yeah, I haven't found any non-empty project yet, actually.
19:19:02 I think they are all displaying as empty for some odd reason; this job seems to show the code for the project: https://www.floydhub.com/fastai/projects/lesson1_dogs_cats/13/code
19:21:20 Max offsets: projects: https://www.floydhub.com/api/v1/projects/search?query=&limit=15&offset=177090 ; datasets: https://www.floydhub.com/api/v1/datasets/search?query=&limit=15&offset=86070 ; users: https://www.floydhub.com/api/v1/profile/search?query=&limit=15&offset=129660
19:23:59 I'll run through all the datasets and get a size estimate real quick.
19:25:38 Datasets are separate from job outputs, it seems?
19:26:28 I believe so
19:26:30 But I imagine the datasets will be much larger.
19:32:12 JustAnotherArchivist edited Deathwatch (+141, /* 2021 */ Add FloydHub): https://wiki.archiveteam.org/?diff=47009&oldid=47002
19:37:02 script started. might be a bit.
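For reference, walking that pagination could look roughly like the sketch below. Only the endpoint URL and the 1023 limit cap are taken from the conversation; the response shape (assumed here to be a JSON list of project objects) and the stop-on-empty condition are assumptions.

```python
# Sketch: enumerate every FloydHub project via the search API quoted above.
# The endpoint and the limit cap of 1023 come from the log; the response
# shape (assumed to be a JSON list of project objects) is a guess.
import json
import time
import urllib.request

API = 'https://www.floydhub.com/api/v1/projects/search'
LIMIT = 1023  # maximum the API accepts, per the log

def fetch(offset):
    url = f'{API}?query=&limit={LIMIT}&offset={offset}'
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

projects = []
offset = 0
while True:
    batch = fetch(offset)
    if not batch:  # ran past the last result, stop
        break
    projects.extend(batch)
    offset += LIMIT
    time.sleep(1)  # keep the crawl polite

print(len(projects), 'projects found')
```

Paging until an empty batch would also sidestep having to probe for the max offsets mentioned above.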
20:00:02 Tumblr is taking a page from Patreon: account posts that can only be accessed by paying: https://cdn.discordapp.com/attachments/455120412460974104/868582804609458197/E68AvqbVcAI2FRt.png - https://cdn.discordapp.com/attachments/455120412460974104/868582834409984010/E68AzrMVgAEOLlE.png
20:02:38 Alright, FloydHub datasets: I got 35363216376832 bytes for the total size, or around 35 terabytes.
20:03:53 Will extract a full list of URLs to use later in a minute.
20:04:36 https://techcrunch.com/2021/07/21/tumblr-debuts-post-a-subscription-service-for-gen-z-creators/ - https://techcrunch.com/2021/07/22/tumblr-community-lash-out-post-plus-subscription/
20:15:13 35 TB doesn't sound too bad. I wonder how much duplication there is.
20:15:23 Wow... the videos are still up!
20:15:28 Dude...
20:16:15 sorry, wrong channel
20:17:23 I also used totalSizeBytes rather than latestSizeBytes, so that might count however many versions exist; most seem to have just one version, though. I'll do another run with latest as well.
21:01:13 Looks like there is a fair bit of duplication: there are ~1K forks(?) of the dog-breed-images dataset, and a version of that dataset is 700 MB.
21:01:41 Lists of users, datasets, and projects: https://verifiedjoseph.com/archiveteam/website-discovery/floydhub.com/
21:07:47 Beat me to it! :-)
21:08:12 Total size with just the latest version is 29575463216128 bytes, or about 29.6 TB.
21:10:04 My version: https://transfer.archivete.am/c9djz/dataset_ids as well as all of the JSON from the datasets: https://transfer.archivete.am/TID3N/dataset_full_json
21:24:28 There's obviously no good way to detect duplicates just from this, but summing up unique latestSizeBytes over 1 GiB gives 8.2 TiB.
21:25:51 Well, all datasets over 1 GiB are 8.5 TiB though, so I guess there's not too much duplication, maybe.
21:29:26 I checked the tracker and I don't see it moving. What's happening?
21:30:47 jamesp: Still the wrong channel.
21:31:06 oops, I keep forgetting
23:45:55 hello, I can't rsync my warrior anymore
23:46:05 I'm getting this from my client
23:46:24 It's being worked on.
23:47:12 ah ok... cool... wasn't sure if I borked something up
23:47:25 thanks... I'll leave it running then
23:47:26 thanks
23:49:06 cool, looks like it's working now... the port changed on the rsync... nice! Thanks a lot!
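The size estimates in this exchange could be reproduced from the dumped JSON with something like the sketch below, including the crude duplicate check via unique latestSizeBytes over 1 GiB. The field names totalSizeBytes and latestSizeBytes appear in the log and the filename matches the transfer.archivete.am dump; the assumption that the file holds one JSON object per line is mine.

```python
# Sketch: redo the size sums from the dumped dataset JSON. Field names come
# from the log; the one-JSON-object-per-line file format is an assumption.
import json

GIB = 1024 ** 3
total_all = total_latest = 0
unique_large = set()

with open('dataset_full_json') as f:
    for line in f:
        ds = json.loads(line)
        total_all += ds.get('totalSizeBytes', 0)
        latest = ds.get('latestSizeBytes', 0)
        total_latest += latest
        if latest > GIB:
            unique_large.add(latest)  # crude dedup: treat equal sizes as copies

print(f'all versions:   {total_all / 1e12:.1f} TB')
print(f'latest only:    {total_latest / 1e12:.1f} TB')
print(f'unique > 1 GiB: {sum(unique_large) / 2**40:.1f} TiB')
```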