02:07:37 Hi! I'm looking for a specific fanfic but don't really have the disk space or the internet speed to comb the WARC batches for it, is this the proper channel to ask if someone has it in an archive?
02:48:58 negativegray: which archive are you looking at?
02:49:10 most official archiveteam stuff is in the wayback machine
02:56:19 I don't know how to effectively search there, I'm looking at these: https://archive.org/details/archiveteam_fanfiction
02:58:38 From the collection info (https://archive.org/details/archiveteam_fanfiction?tab=about): "Fanfiction.net Safety Download is a single 2 GB tar file containing epub files, which may be easier to extract."
03:02:56 negativegray: ^
03:03:37 I've checked that, it doesn't have the complete thing I'm pretty sure
03:04:40 TheTechRobo: or I'm very bad at searching through it
03:07:02 negativegray: Assuming it's not in the Wayback Machine (that grab was long before I joined AT so I don't know), you can look through the items' CDX
03:07:10 For example, for https://archive.org/download/archiveteam-fanfiction-warc-07, there are several cdx.gz files
03:07:20 Not sure which is the "correct" one but they're a lot smaller than the WARC
03:07:33 They basically list the WARC's contents, e.g. urls, capture time iirc
03:08:06 they also list which WARC contains the resource
03:08:34 TheTechRobo: oooh, thank you! How do I open a cdx?
03:09:10 negativegray: It's just text, and there's plenty of documentation.
03:09:13 let me see if I can find some.
03:10:28 negativegray: Here you go! The first line of CDX is the legend, and it has letters that correspond to what the value is representing. I think it's space separated.
03:10:31 Here's the letter list: https://archive.org/web/researcher/cdx_legend.php
03:10:48 TheTechRobo: thank you!
03:10:49 Not all letters will be present.
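The CDX format described above can be read with a few lines of code: the legend on the first line gives one field letter per space-separated column (per the archive.org legend page, "a" is the original URL, "b" the capture timestamp, "g" the WARC filename). A minimal sketch; the sample record below is invented for illustration:

```python
# Minimal CDX reader: the first line is the legend ("CDX" followed by one field
# letter per column), and every later line is one capture record.

def parse_cdx(lines):
    fields = lines[0].split()[1:]  # field letters, one per column
    for line in lines[1:]:
        yield dict(zip(fields, line.split()))

# Invented sample; real files are gzipped (open with gzip.open(path, "rt")).
sample = [
    "CDX N b a m s k r V g",
    "com,example)/story 20051108093000 http://example.com/story text/html 200 ABC123 - 456 data.warc.gz",
]
record = next(parse_cdx(sample))
print(record["a"])  # http://example.com/story
print(record["g"])  # data.warc.gz
```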
03:12:12 TheTechRobo: I tried reading the .cdx and it did not help me, even with the legend
03:12:34 Hang on let me download it my internet is slow
03:12:38 I may have to go before it finishes
03:13:46 okay!
03:15:06 negativegray: while it downloads, what information do you have about the fanfic?
03:15:16 do you have the URL? or do you need a full-text search?
03:15:32 if the latter, CDX won't work for you
03:18:15 TheTechRobo: yeah I need a full-text search. I have the author's name and the fanfic's name
03:18:18 or title
03:18:28 In that case, yeah, CDX probably won't help you. :/
03:18:36 Unless the url contains the title or something.
03:19:12 yeah
03:19:15 ty though
03:20:11 I'm not sure what you can do in that case. Does anybody have the fanfic warcs downloaded?
03:20:18 I have to go to bed btw, good night!
03:20:38 good night!
03:23:32 negativegray: out of curiosity, what fandom?
03:23:38 Harry Potter
03:23:58 Ah, I wouldn't have it. Could ask some friends of mine, though.
03:24:24 We've got a Discord server where we share info on deleted fics we have.
03:25:52 oh!
03:25:57 That'd be great!
03:26:06 It is in Portuguese, though
04:04:22 okay! I got a URL for the author and the fic!
04:06:07 I can only access the first chapter, though
04:20:35 gods, being so close hurts. I managed to get to the wayback machine page of the first chapter, but it seems to be the only one that's cached
10:15:38 Can you share a link,
10:16:41 Damn it, didn't mean to send so early.
10:16:41 negativegray: can you share a link please?
10:29:06 Arkiver uploaded File:Buzzvideo-logo.png: https://wiki.archiveteam.org/?title=File%3ABuzzvideo-logo.png
10:30:06 Arkiver uploaded File:Buzzvideo-icon.png: https://wiki.archiveteam.org/?title=File%3ABuzzvideo-icon.png
12:28:18 (They left hours ago.)
12:39:43 ah. is that something only admins see?
12:40:29 I have no idea what Matrix does with that information, but on IRC, anyone can see it.
12:41:21 hm, weird.
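The "unless the url contains the title" idea above can be automated: fanfiction.net story URLs often carry a title slug after the chapter number, so scanning the CDX's original-URL column for the slug is a crude substitute for full-text search. A sketch, assuming the legend uses "a" for the original URL; the sample records are invented:

```python
def find_by_slug(cdx_lines, needle):
    """Yield original URLs from CDX records whose URL contains the needle."""
    fields = cdx_lines[0].split()[1:]
    url_col = fields.index("a")  # "a" = original URL in the CDX legend
    for line in cdx_lines[1:]:
        cols = line.split()
        if needle.lower() in cols[url_col].lower():
            yield cols[url_col]

# Invented sample; for the real cdx.gz files, read lines via gzip.open(path, "rt").
sample = [
    "CDX N b a m s k g",
    "net,fanfiction)/s/1888034/1/some-title 20051108093000 http://www.fanfiction.net/s/1888034/1/Some-Title text/html 200 ABC data.warc.gz",
    "net,fanfiction)/s/999/1/other 20051109094500 http://www.fanfiction.net/s/999/1/Other text/html 200 DEF data.warc.gz",
]
hits = list(find_by_slug(sample, "some-title"))
```

This only works when the deleted fic's title actually made it into the captured URL, of course.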
12:41:28 I've been noticing that parts aren't bridging correctly lately
12:41:31 I suspect a bridge bug
14:47:57 Is yt-dlp able to download a YouTube channel that has more videos than the page limit?
15:29:14 can you provide an example channel?
15:30:34 schwarzkatz|m: They were looking for https://www.fanfiction.net/s/1888034/1/. It's not in the FanficRepack_Redux collection, which a friend of mine suggested looking in.
16:01:12 Doranwen: Do we have any idea when it was deleted?
16:01:22 The WBM snapshot is from 2005.
18:14:32 Hello, I would like to know what tool https://archive.org/details/TikTok?tab=about was scraped with
18:15:07 internal, not related to IA
18:15:13 I know a lot about the comment API and the replies API
18:15:42 So the ArchiveTeam is not behind it?
18:15:45 no
18:17:05 NEXT!
18:17:34 :P
18:18:47 As a reminder, TikTok is going to remove videos related to tanning after warnings from medical experts. https://www.theguardian.com/technology/2022/dec/01/tiktok-to-ban-videos-that-encourage-sunburn-and-tanning-after-alarm-from-medical-experts
18:22:03 Tag and videos still seem to be up: https://www.tiktok.com/tag/sunburnchallenge?lang=en
18:22:51 Other tags of interest: https://www.tiktok.com/tag/sunburn https://www.tiktok.com/tag/tanning https://www.tiktok.com/tag/sunbathing
19:34:12 How do you reverse engineer the requests that a Steam game makes? I was thinking of a proxy, but as far as I'm aware you can't configure one.
19:34:25 Wireshark's fine but it captures ALL traffic...
19:35:04 it has powerful filtering tho
19:36:50 I don't know how to use it xD
19:37:47 I might be able to guess at the domain name, though. Is there a way to do that in Wireshark?
19:37:54 Or guess at part of the domain name, at least.
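On filtering by domain: BPF capture filters (tcpdump and Wireshark's capture filters) match IP addresses and ports, not hostnames, so one workaround is to resolve the guessed domains up front and build a `host … or host …` expression. A sketch; the domain list is a placeholder:

```python
import socket

def capture_filter(domains):
    """Resolve hostnames and build a BPF filter like "host 1.2.3.4 or host 5.6.7.8"."""
    ips = set()
    for domain in domains:
        for info in socket.getaddrinfo(domain, None):
            ips.add(info[4][0])  # the resolved address
    return " or ".join(f"host {ip}" for ip in sorted(ips))

# e.g. tcpdump -w capture.pcap '<result of capture_filter(...)>'
print(capture_filter(["localhost"]))
```

For a running capture, Wireshark's *display* filters can match hostnames directly via the TLS SNI field (`tls.handshake.extensions_server_name`), though that only shows the handshake packets, not the encrypted payload.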
19:38:01 (I know both the company and game name)
19:38:31 related documentation:
19:38:31 https://www.wireshark.org/docs/wsug_html_chunked/ChCapCaptureFilterSection.html
19:38:31 https://www.tcpdump.org/manpages/pcap-filter.7.html
19:38:35 Wireshark also isn't great for HTTP because it just gets the raw TCP data, no? There's likely ssl.
19:39:23 you'd need to use https://docs.mitmproxy.org/stable/ then I guess :D
19:40:43 Depending on how the game validates TLS certs, it might be messy though.
19:41:05 If it has its own cert store or hardcoded fingerprints or similar, for example.
19:41:36 Then you'll need to either replace that (have fun) or use something like tcpdump/Wireshark and extract the master key (also fun).
19:41:47 pre-master key*
19:42:02 if everything goes through mitmproxy though, why would it be messy?
19:42:30 Because the client (game) needs to trust mitmproxy's CA cert for that to work.
19:46:57 If it uses the system trust store, that's easy, but if it doesn't, mess.
19:47:42 Is there a linux way to get the traffic from a specific process given its PID?
19:47:45 See also: you can't make browsers accept mitmproxy by adding the CA cert to the system trust store. Need to do it separately in the browser.
19:48:07 that... sucks. I thought it was system wide.
19:50:04 TheTechRobo: Maybe some iptables magic would help here, but not sure.
19:53:20 Stack Exchange suggests strace, network namespaces, and iptables: https://askubuntu.com/questions/11709/how-can-i-capture-network-traffic-of-a-single-process
20:22:59 hiya. i'm looking for some tool recommendations. so i've been trying to archive all static assets from some websites i'm interested in for personal curiosity. i decided to finally give archiving user content from one of them a shot, but it kinda breaks my normal workflow of "try many URLs and git commit whatever i found" due to the sheer # of files
20:23:34 i've started using git lfs but the reason i'm using git is mainly to actually see how much progress i've made/new things found each time i try something
20:24:21 i'm wondering if there's a better tool to track progress with recovered files -- i'm also committing tooling for guessing filenames at the same time
20:24:49 i guess i could use S3 but I still like being able to see what's new with `git status` and so on.
20:25:40 also git lfs kinda duplicates objects into .git/lfs so double disk space
20:32:56 Yeah, you'll want to get away from 'one file per asset' anyway, probably. It just doesn't scale. Eventually, your file system will be sad as well.
20:34:03 eh, yeah, you're right -- key-based object storage is probably much better for this stuff
20:34:04 One route is WARC, but accessibility isn't exactly great with it.
20:34:45 i've been thinking about building a little ceph server in my basement for a while for that purpose (instead of using Amazon)
20:35:12 You get extra metadata and a technically more accurate capture that way, too.
20:35:30 I suppose that would work as well, yeah.
22:29:57 JAA: No, he never mentioned that. Left his Reddit nick with me but that's all I've got. Oh well, lol.
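The `git status`-style progress tracking discussed above doesn't actually need git: a manifest of content hashes written after each run, diffed against the previous run's manifest, gives the same "what's new / what changed" view without duplicating objects into .git/lfs. A sketch under those assumptions (layout and naming are invented):

```python
import hashlib
import os

def manifest(root):
    """Map each file's path (relative to root) to its sha256 digest."""
    out = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                out[os.path.relpath(path, root)] = hashlib.sha256(f.read()).hexdigest()
    return out

def diff(old, new):
    """Return (added, changed, removed) paths between two manifest dicts."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, changed, removed
```

Dumping each run's manifest to a dated JSON file keeps a history of progress, and the same scheme carries over unchanged if the files later move into key-based object storage.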