00:48:17 !a https://transfer.archivete.am/uRFK3/discord-Gate.io
00:48:20 TheTechRobo: Skipped 1 unprintable URLs: https://transfer.archivete.am/13gURD/discord-Gate.io.not-printable.txt
00:48:21 TheTechRobo: Deduplicating and queuing 8252 items.
00:48:25 TheTechRobo: Deduplicated and queued 8252 items.
00:52:54 !a https://transfer.archivete.am/PY1mr/discord-GateChain
00:52:58 TheTechRobo: Skipped 1 bad URLs: https://transfer.archivete.am/DQQ6E/discord-GateChain.bad-urls.txt
00:52:59 TheTechRobo: Skipped 1 unprintable URLs: https://transfer.archivete.am/7hXDt/discord-GateChain.not-printable.txt
00:53:00 TheTechRobo: Deduplicating and queuing 7409 items.
00:53:01 TheTechRobo: Deduplicated and queued 7409 items.
01:14:06 TheTechRobo: awesome stuff :)
01:31:50 arkiver: it,ll take me awhile to get motivated for the twitch outlinks extractor :P, in the meantime could you check how well twitch's servers can handle it?
01:57:53 TheTechRobo: could i check how well twitch's servers can handle it?
01:57:55 what do you mean?
01:58:05 Make sure we don't ddos them ig?
01:58:06 handle what?
01:58:12 all our requests
01:58:12 we're talking about outlinks
01:58:30 yes but there are profile pictures, emoji, etc
01:58:31 so that is links from twitch to something not twitch
01:58:38 are we not getting those?
01:58:56 we're talking outlinks from the comments right?
01:59:02 from the chat, yes
01:59:03 so if the comments as an outlink, save it here
01:59:14 comment has*
01:59:32 so no icons, images, etc?
01:59:55 this is not supposed to be a general twitch archiving program
02:00:09 do you have examples of icons and images?
02:03:20 https://transfer.archivete.am/inline/SMfb2/a.png
02:03:37 medal: Bits Leader (I think that icon is universal across all channels that have a bits leader)
02:03:43 there's also 2nd place, 3rd place, etc
02:03:45 and other badges for bits
02:03:56 triangle prism: cheer
02:03:56 i guess there is a very limited set of those right?
02:03:59 Yeah
02:04:06 okey yeah those can be put in here
02:04:07 The only one there are a lot of is emojis like the one there
02:04:13 since channels can add their own
02:04:26 what do you mean by "images"?
02:04:27 that one I'm not sure about
02:04:57 by "images" i meant the emotes
02:05:06 ah
02:05:21 okey I think there'll be a very limited number, so those can go in here
02:05:26 do you have examples with outlinks?
02:05:44 lots of people also paste links to clips in Twitch chat, so that's another thing we have to consider
02:06:04 i guess those have 'twitch.tv' in the URL?
02:06:18 or perhaps some short version
02:06:29 (if twitch has that - i never used twitch)
02:06:38 like, how reddit has reddit.com, and redd.it
02:06:40 here's a clip: https://clips.twitch.tv/RelentlessSucculentGrouseJonCarnage-L9OEViii3DOw58P8
02:07:07 right let's not get out
02:07:11 right let's not get those
02:07:21 only real outlinks (so whatever points to something else than twitch)
02:07:32 so only non-twitch links?
02:07:36 yes
02:07:48 Ok
02:08:06 do you have example of a comment with an outlinks (non-twitch?)
02:08:17 twitch links from twitch are not real outlinks anyway
02:08:23 right
02:10:09 I don't have one on hand
02:11:05 alright
02:11:09 also
02:11:11 hang on i'm trying grep
02:11:30 roughly 5% of visited WARC records in the Wayback Machine is now this project!
02:11:37 wow!
02:12:01 :)
02:12:33 ooh there are several sizes for emotes, should we just get largest or smallest or something?
02:12:56 https://pastebin.com/Ft02kVDR
02:13:45 arkiver: ^
02:14:19 I vote for smallest
02:14:32 whatever is used in the comment I'd say
02:14:41 no they're all provided in the api json
02:14:56 but which one is actually rendered in the browser?
02:14:57 appears size 1 is used in chat
02:15:04 or does that depends on something
02:15:07 right
02:15:09 let's do size 1
02:15:10 probably depends on something
02:15:16 but it looks like Most of the time it's size 1
02:15:22 (otherwise why'd there be other sizes?)
02:15:39 oh here's an outlink example
02:16:12 https://transfer.archivete.am/inline/DKfRQ/a.png
02:16:24 nice!
02:16:29 yeah we want all the real outlinks
02:16:35 also found another
02:17:15 I don't see any parsed outlink list in the json :/
02:17:20 looks like I'll have to regex 'em out
02:17:37 can i see the JSON?
02:18:06 here's a long one. lots of patreon links: https://transfer.archivete.am/9Ce5k/chat.json.zst
02:19:01 nice!
02:19:09 yeah these are exactly the type of URLs we want here
02:19:20 (they're all the same btw :P)
02:19:29 (the channel set up a bot to periodically remind people)
02:19:37 parsing out sounds good - perhaps you can check the code of twitch to see what they use exactly to detect URLs
02:19:40 does patreon need javascript?
02:19:52 > perhaps you can check the code of twitch to see what they use exactly to detect URLs | I was thinking that, but they use a lot of JS
02:20:08 "the code" is the JS in that case
02:20:15 yes
02:20:17 but they use a LOT
02:20:40 hmm
02:21:01 well parsing out with regex would work
02:21:27 or I create a command here to which you can queue text (on transfer.archivete.am) and it'll parse out the URLs
02:21:47 I've already done a bit of work on that with my discord-urls-extractor
02:21:59 sounds good
02:22:05 well i'll be off for the night
02:22:07 https://github.com/TheTechRobo/discord-urls-extractor/blob/main/src/main.rs#L180 <- feel free to contribute
02:22:09 just ping me in case of anything
02:22:20 wait
02:22:21 wrong one lol
02:22:31 here we go: https://github.com/TheTechRobo/discord-urls-extractor/blob/main/src/main.rs#L397
02:22:38 probably should go to bed :P
02:24:09 maybe this works differently in this language
02:24:55 nvm
02:25:27 looks decent i think
02:26:58 the bot here will filter out anything with a bad domain
02:27:05 yeah
02:45:51 !a https://transfer.archivete.am/pm9jk/discord-Anchor
02:46:00 TheTechRobo: Invalid command message.
02:46:21 oh my gosh these urls are bad
02:46:28 > http://localhost:3001/static/js/bundle.js:11881:16)\n
02:46:37 arkiver: ^
12:43:29 TheTechRobo: thanks
12:43:38 we should be able to handle any bad URLs though - looking into this
18:38:46 !a https://transfer.archivete.am/Tl5Ho/twitter-employee-group-accounts-outlinks
18:38:48 JAA: Deduplicating and queuing 3589 items.
18:38:50 JAA: Deduplicated and queued 3589 items.
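
Editor's note: the approach discussed in this log (regex the URLs out of chat text, keep only non-twitch links, and drop anything with a bad domain such as the localhost example) could be sketched roughly as below. This is a minimal illustration, not the code in discord-urls-extractor and not what Twitch or the bot actually use. The pattern `URL_RE`, the `TWITCH_DOMAINS` list, and the function names are all hypothetical, and "bad domain" is assumed here to mean simply "hostname has no dot".

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately loose pattern. Twitch's own URL detection
# is not known from the log, so this is only an approximation.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Assumed list of Twitch-owned domains; the log only mentions twitch.tv.
TWITCH_DOMAINS = ("twitch.tv",)

# Punctuation that often trails a pasted URL in chat. Stripping ")" can
# truncate URLs that legitimately end in a parenthesis.
TRAILING = ".,;:!?)]}'\""


def is_twitch_host(host):
    """True if host is a Twitch domain or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in TWITCH_DOMAINS)


def extract_outlinks(message):
    """Return URLs in a chat message that point somewhere other than Twitch."""
    outlinks = []
    for match in URL_RE.finditer(message):
        url = match.group(0).rstrip(TRAILING)
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:
            continue  # unparseable URL, skip it
        if "." not in host:
            continue  # assumed "bad domain" rule: drops localhost, empty hosts
        if is_twitch_host(host):
            continue  # twitch-to-twitch links are not real outlinks
        outlinks.append(url)
    return outlinks
```

Under these assumptions the clip link pasted at 02:06:40 is dropped, the localhost stack-trace fragment from 02:46:28 is dropped, and a link to another site is kept. A production version would likely also need an allowlist of valid TLDs and handling for schemeless links.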