00:00:34 we need evidence that it's good, not "I can't think why it would be bad"
00:01:07 Yep, ideally a comprehensive test suite we can also run continuously in the future on builds.
00:01:29 But no such test suite for WARCs exists in general, and it's a lot of work.
00:01:57 as someone who tried to write a crawler that outputs WARC... I decided WARC is just garbage and I wrote my own format
00:02:12 🤨
00:02:19 tell us more about this format of yours
00:02:27 WARC isn't great, but it's the least terrible format out there. ARC is far worse.
00:02:46 The WARC spec has a number of issues though, and implementing it is tricky to get right.
00:03:09 My format is just a folder full of timestamped uncompressed HTTP request and response payloads, with folders named based on the request URL/path
00:03:16 It gets the job done for what I need
00:05:01 * fireonlive blunks
00:05:39 what do you use it for mainly?
00:06:46 Web scraping, saving contents of web sites as I go that I might want to process later... The main idea is: say I have an image web site I want to download, I write a script which saves the images all in a directory, but I also output that raw data in case later on I find out there was some vital information in every HTML page that I forgot to download (say a description of the image, or the author, or something)
00:07:00 then I can go back and process those files and match them up with the other data I downloaded to augment it
00:07:01 ahh
00:08:57 Personal mirrors are fair game for anything. Use wget with link rewriting for all I care. :-P
00:09:24 For proper archival, that'd be missing some metadata. It also doesn't scale, and repeated retrievals of the same URL get fun.
00:09:44 What'd it be missing?
00:10:33 hm
00:10:41 JAA: I just thought of a tool that would be handy to have
00:10:49 dedup a WARC after the fact
00:11:18 HTTP headers, IP, transfer encoding (although that one's debatable) come to mind.
00:11:51 nicolas17: Yes, that was a key design part of the thing I've been working on.
00:11:52 well the request/response data include HTTP headers :p
00:12:01 everything after the TCP socket
00:12:08 Ah, ok. 'Payload' means something specific in HTTP. :-)
00:12:52 afaik qwarc does deduplication between different URLs in one archival task, but if I rerun it next month, it won't deduplicate files that didn't change vs the previous archival
00:13:01 Actually, RFC 9110 deprecated the word, I guess. But it was the body without encoding prior to that.
00:13:19 archivebot doesn't dedup anything I think?
00:13:24 It's also moderately annoying that every tool that generates WARC files seems to be absurdly complicated for no reason
00:13:40 I have been a software engineer and sysadmin for 12 years and I still feel like I need a PhD to understand most of these
00:14:16 nicolas17: Both correct. In fact, qwarc only dedupes within a single process. When you spread a single archival across multiple processes or restart the process to fix the memory 'leak' (fragmentation), that also leads to duplication.
00:14:35 and I think wget dedups across time, but only if they have the same URL
00:15:14 appledash: warcio's interface is reasonable, but unfortunately warcio itself sucks. warcprox would allow you to use whatever HTTP client you want via MITM proxying, which is neat.
00:15:40 nicolas17: Correct, and you can also write and load CDXs. wget-at supports URL-agnostic dedupe.
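[Editor's note] The "dedup a WARC after the fact" tool discussed above would, roughly, rewrite duplicate response records as revisit records pointing at the first capture with the same payload digest. A minimal, untested sketch of that idea using warcio (not qwarc, wget-at, or any other tool mentioned here; it dedupes URL-agnostically, keyed only on WARC-Payload-Digest):

```python
# Sketch only: rewrite duplicate response records in a WARC as revisit records.
# Assumes warcio's create_revisit_record/write_record behave as documented;
# request records and non-HTTP records are copied through unchanged.
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def dedupe_warc(in_path, out_path):
    seen = {}  # WARC-Payload-Digest -> (target URI, WARC-Date) of first capture
    with open(in_path, 'rb') as inf, open(out_path, 'wb') as outf:
        writer = WARCWriter(outf, gzip=True)
        for record in ArchiveIterator(inf):
            digest = record.rec_headers.get_header('WARC-Payload-Digest')
            uri = record.rec_headers.get_header('WARC-Target-URI')
            date = record.rec_headers.get_header('WARC-Date')
            if record.rec_type == 'response' and digest:
                if digest in seen:
                    ref_uri, ref_date = seen[digest]
                    # Duplicate payload: write a revisit record that refers
                    # back to the first capture instead of the full body.
                    revisit = writer.create_revisit_record(
                        uri, digest, ref_uri, ref_date,
                        http_headers=record.http_headers)
                    writer.write_record(revisit)
                    continue
                seen[digest] = (uri, date)
            writer.write_record(record)

dedupe_warc('input.warc.gz', 'deduped.warc.gz')
```

A real tool would also read and write CDX indexes so the digest table survives across runs, which is what makes dedupe across separate archival jobs (the qwarc/wget limitation discussed above) possible.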
00:15:45 I remember having some issue with warcprox
00:15:50 that was my first try I think
00:16:07 I'm not terribly surprised. I have no experience with it myself.
00:16:33 yeah so it would be nice to have a tool that can run afterwards to replace WARC records with dedup pointers
00:16:54 Yeah, soon™. :-P
00:18:35 do you know anything about the HAR format?
00:21:34 browser dev tools can export requests to HAR and it *might* be complete enough to be convertible to WARC but I'm not sure yet
00:30:05 It isn't.
00:30:23 It doesn't preserve the headers verbatim, and it doesn't preserve transfer encoding.
00:30:26 appledash: do you know what issue you were having?
00:30:44 i was using archivebox then JAA went and rained on my parade
00:30:46 😢
00:30:55 ⛈️
00:31:03 pywarc when
00:31:37 (rightly)
00:32:59 I do not remember :(
00:33:01 It was a while ago
02:23:05 PaulWise edited Mailman2 (+806, add more mailman2 instances, corpit.ru done): https://wiki.archiveteam.org/?diff=50587&oldid=50487
02:27:06 PaulWise edited Bugzilla (+53, add more bugzilla instances): https://wiki.archiveteam.org/?diff=50588&oldid=50488
02:57:11 PaulWise edited Mailman2 (+67, started some jobs, one instance already gone): https://wiki.archiveteam.org/?diff=50589&oldid=50587
07:39:44 about 5.5 days and wowturkey archival still going full blast
08:14:02 in wowturkey archivals, you might have seen "DNS resolution failed: [Errno -2] Name or service not known http://www.reklam_link.com/d/news/433509.jpg"
08:14:09 those are the links wowturkey censors
08:14:27 they replace censored hostnames with reklam_link
12:54:56 Okay, hi. On the chance that I'm in the right place here: I have around 20 or 40 TB of archived YouTube channels I'd like to put up on IA, however the videos are sorted into subfolders named after the playlists, and I'd like to keep it that way when uploading. I know the web uploader supports folder creation, but I want to use the CLI on a headless server, and I can't find any way to do this with the ia CLI
12:55:02 utility. And putting hundreds of video files all in one root is extremely stupid. Is there any way to achieve this outcome?
13:03:08 No way to cd your way through it?
13:03:20 (Never used the IA CLI, sorry)
13:08:43 I specifically need a CLI solution. I'm asking on the chance that I overlooked something or there is another tool or script.
13:08:59 "No way to cd your way through..." <- Is this not possible?
13:09:17 i.e. navigate to or create different directories and then upload to those
13:09:56 I see no way to do this with the ia CLI utility.
13:10:08 How odd
13:10:28 It lets me specify an identifier and that's it
13:10:46 Annoying ngl
13:11:19 Would be handy to "directorize" the archive or at least allow uploading directorized archives to it
13:12:03 Yes, but I hope someone here has a good idea how to handle this
13:20:56 you can do directories
13:21:09 say you have a folder named `a`, then you put a file in it
13:21:38 you can do `ia upload a` and it will upload all files and subdirectories in `a` to the item
13:21:53 be sure you're not using a trailing slash, or it will upload everything to the root!
13:21:56 Perfect!
13:22:13 TheTechRobo: How?
13:22:43 Oh so
13:22:43 `a/b` uploads folder `b`
13:22:43 `a/b/` uploads contents of `b` without the folder
13:23:08 * Oh so
13:23:08 `a/b` uploads folder `b` and therefore its contents
13:23:08 `a/b/` uploads contents of `b` without the folder
13:24:42 yes
13:25:12 don't ask me why
13:26:15 TheTechRobo: Nah it makes sense tbh
13:26:32 `a/b` = target `b`
13:26:32 `a/b/` = target `b/*`
13:27:03 * target `b/*` (but not `b`)
13:27:30 Why is this site excluded from WBM? https://www.11alive.com/article/news/special-reports/ga-trump-investigation/donald-trump-mug-shot-when-it-will-be-released/85-38d22a92-057c-461d-951e-4331f74b8c4d
13:31:17 403 on that from the UK
13:33:23 kaz: Works on my end
13:33:30 are you in the uk
13:33:33 No, US
13:33:41 ok then
13:33:50 Works here from Canada
13:34:03 WBM excludes the site though, for some reason
13:34:19 I've seen them exclude certain patterns
13:34:31 EU legislation issues, methinks
13:34:36 Try VPNing?
13:37:38 if you see anything saying "reklam_link" in wowturkey archives, those are censored links
13:38:04 wowturkey censors links to certain sites by replacing their hostname with "reklam_link"
13:38:42 Reklam = ad
13:38:48 from French réclame
13:39:12 is that right?
13:40:35 yes
13:40:46 ad_link :)
13:40:49 So yeah, makes sense
13:41:21 status, t.me and a few more are amongst the censored ones
13:42:03 Makes sense tbh
13:42:14 At the same time it opens up some questionable stuff
13:43:42 hmm, what if i open a website called reklam_link.com *
13:43:58 i'd make tons of ad revenue tbh
13:44:09 and it's not only me who thought of doing this
13:45:01 Might be taken, doğru mu? (= right?)
13:45:55 no
13:46:00 Underscore isn't valid in a domain name
13:46:07 no such domain registered
13:46:11 Ah so the underscore is the key
13:46:12 ah
13:47:04 No way to register it either
13:47:29 hmm
13:47:31 Ne yazık
13:47:51 * Ne yazık (= what a pity)
13:47:53 register reklam-link.com and rewrite all reklam_link.com to reklam-link.com client side
13:47:55 :joy:
13:48:18 maybe MITM /j
14:28:58 transfer is dead due to an incident at Scaleway.
15:09:44 JAA: hoping it's not an SBG2 :/
15:12:59 it's an "our blade chassis is dead" I think
15:14:05 ahj
15:14:11 ahh*
15:14:23 Currently working out if I may invite my darling Aroy, she made an archival tool I think would be greatly useful here
15:20:15 i read that as tracker at first and was much more concerned
15:20:28 😅
16:10:55 What is the *actual* url for this image? https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fdonald-trumps-mugshot-v0-td10trboe5kb1.png%3Fauto%3Dwebp%26s%3D60b1dd8fc794db49169cd0c892f09277ab58faaa
16:11:05 WBM doesn't like this link
16:11:24 Okay, that works. Thanks!
16:13:37 Thought it was this but leads to an error https://preview.redd.it/donald-trumps-mugshot-v0-td10trboe5kb1.png
16:15:00 Hm. Looks like it's this, I guess https://preview.redd.it/donald-trumps-mugshot-v0-td10trboe5kb1.png?auto=webp&s=60b1dd8fc794db49169cd0c892f09277ab58faaa
16:18:30 Odd. Still redirects to the scrambled URI https://web.archive.org/web/20230826160557/https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fdonald-trumps-mugshot-v0-td10trboe5kb1.png%3Fauto%3Dwebp%26s%3D60b1dd8fc794db49169cd0c892f09277ab58faaa&rdt=56478
16:20:56 qwertyasdfuiopghjkl: Mind taking a look at this? ^
16:24:08 HP_Archivist: https://i.redd.it/td10trboe5kb1.png (found with https://addons.mozilla.org/en-US/firefox/addon/image-max-url/ )
16:24:55 fireonlive: A lot of stuff depends on transfer since that's where the zstd dicts are stored, so eventually it would still stall everything.
16:25:21 indeed
16:25:47 HP_Archivist: Yeah, the i.redd.it URL is it, but if you just access that directly, you won't get the image. They started doing that bullshit quite recently, like in the last few months.
16:27:09 Thanks qwertyasdfuiopghjkl - It still redirects in the browser. And JAA, yeah, I've never had a problem capturing Reddit images from posts before now. What nonsense.
16:27:46 last i checked curl on i.redd.it got the full image but what a pain
16:28:37 Still not showing in WBM https://web.archive.org/web/20230826162738/https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2Ftd10trboe5kb1.png&rdt=49036
16:29:13 What's odd is that when I crawled this actual post last night in SPN, it captured the page but not the image (which is kinda the point of the crawl)
16:31:50 Archive.is captured the page and image just fine though
16:40:26 Maybe you can try saving a (different) page that embeds it as an image, but idk if that would work
16:40:44 quit
16:41:04 sorry wrong terminal window
17:02:01 Hi, would it be against the TOS or perhaps the law to download and host some or maybe all picosong files?
17:04:01 Currently working on an app that allows users to filter through picosong entries, get a details preview, and download the file. But downloading and previewing from archive.org itself is extremely slow and sometimes doesn't work at all.
17:04:16 Thinking of downloading some files and hosting them on my server
17:08:43 Rynav: Obviously, virtually all of it is copyrighted content. Whether the artists/copyright holders will care is not a question we can answer.
17:09:50 JAA Well yeah, you are right, I wonder why I hadn't figured that out. Thank you!!
17:15:01 I've tried to get a list of URLs for the Orange website with wget, but (oh surprise) I got a 403 from Google and failed on http://annuaire-pp.orange.fr/
17:15:01 Is there a repository where I can already paste some URLs?
17:57:37 If you've got a file you want to share you can upload it to https://transfer.archivete.am/
17:59:14 well, you can't, since transfer is currently offline, but that would be the usual place
18:08:58 😆 alright, nice, thank you!
18:47:56 i'd say bpa.st but the spam filters are a "big oof". you can use paste.debian.net in the meantime though if you'd like to dump and run
18:48:08 but if you're around a bit i'd just wait for the transfer
19:56:11 transfer is back.
19:59:39 Nice! Okay, quick question: I have to get URLs from a website (that uses JS…); which tool would you point me to? wget?
20:02:55 wowturkey archival still going strong
20:44:45 Hmm, I have an FTP server which seems to be telling me to connect to a LAN IP address whenever I initiate a transfer from it. What'd be the best way to transfer data from it? I'm going to make the assumption that if I just connect to the FTP server's WAN address instead of the LAN address it gives me, it'll work. But is there any way to tell wget to ignore the address the FTP server tells me and use a given one?
20:44:56 The control connection works fine, it just fails to open the data connection
20:55:52 Maybe active mode would work, where the FTP server opens a connection to your machine? (That's the older mode so it should be fairly well supported)
21:04:35 I was thinking about that, but there's a catch to it... The FTP server is Russian, and something between me (Canada) and the FTP server is blocking my connection, so I have to proxy through a Russian VPS
21:04:51 I would need to forward the active mode through the VPS as well I guess
21:11:54 Ah, then yeah, you'd need to do something special to trick that :|
23:33:09 Vokunal edited Frequently Asked Questions (+0): https://wiki.archiveteam.org/?diff=50590&oldid=50586
23:33:10 Cooljeanius edited Twitter (+56, /* External links */ add relevant GitHub repo): https://wiki.archiveteam.org/?diff=50591&oldid=50555
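[Editor's note] On the `ia upload` directory question earlier (13:20–13:27): the `ia` CLI is a front end for the `internetarchive` Python library, which also accepts a mapping of remote names to local paths, and slashes in the remote names create the "directories" inside the item. A hedged sketch of that approach (the identifier and local paths below are placeholders, and the dict form of `files` is assumed to behave as the library documents it):

```python
# Sketch: upload a local tree to an IA item while keeping the subfolder layout.
# Assumes the internetarchive library's dict form of `files`
# (remote name -> local path); identifier and local_root are made up.
import os
from internetarchive import upload

identifier = 'my-youtube-channel-archive'   # placeholder item identifier
local_root = 'channel'                      # e.g. channel/<playlist>/<video>.mp4

files = {}
for dirpath, _dirnames, filenames in os.walk(local_root):
    for name in filenames:
        local = os.path.join(dirpath, name)
        # Relative path with forward slashes becomes the in-item "directory".
        remote = os.path.relpath(local, local_root).replace(os.sep, '/')
        files[remote] = local

responses = upload(identifier, files=files)
print([r.status_code for r in responses])
```

This mirrors the trailing-slash behaviour described in the chat (contents of the folder, without the folder itself); keep the `channel/` prefix in `remote` if the top-level folder name should appear in the item as well.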
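[Editor's note] On the FTP question at 20:44: the server is advertising a private LAN address in its PASV reply, and the guess made there is the standard workaround, i.e. ignore the advertised address and reuse the control connection's address for the data connection. Recent Python 3.x releases of ftplib do exactly that by default (via the `trust_server_pasv_ipv4_address` attribute, normally False). A minimal sketch, with host and path as placeholders and the Russian-VPS proxying left out:

```python
# Sketch: download over passive-mode FTP from a server whose PASV reply
# advertises a private LAN IP. Recent ftplib ignores that IP by default and
# reconnects to the control-connection host instead. Host/path are placeholders.
from ftplib import FTP

ftp = FTP('ftp.example.ru', timeout=60)
ftp.login()                                   # anonymous; pass user/passwd otherwise
ftp.trust_server_pasv_ipv4_address = False    # the default; shown for clarity
ftp.set_pasv(True)                            # passive mode: client opens data connection

with open('file.bin', 'wb') as out:
    ftp.retrbinary('RETR /path/to/file.bin', out.write)
ftp.quit()
```

Tunnelling this through the Russian VPS (e.g. a SOCKS proxy or an SSH port forward) is a separate problem and isn't covered here.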