03:32:56 !a https://transfer.archivete.am/WMRvV/discord_urls.txt
03:33:06 datechnoman: Skipped 2 invalid URLs: https://transfer.archivete.am/PIZ87/discord_urls.txt.bad-urls.txt (for 'https://transfer.archivete.am/WMRvV/discord_urls.txt')
03:33:07 datechnoman: Deduplicating and queuing 200925 items. (for 'https://transfer.archivete.am/WMRvV/discord_urls.txt')
03:33:26 datechnoman: Deduplicated and queued 200925 items. (for 'https://transfer.archivete.am/WMRvV/discord_urls.txt')
03:34:29 AK: checking
03:40:37 interesting one on the whitespaces
03:41:07 i would think they're probably in the URL as given in the HTML, and normally browsers trim them off, except Wget-AT
03:48:01 arkiver: I have just subjected myself to the hell that is modern web specifications and yes it seems browsers trim leading and trailing whitespace from href attributes
03:50:05 I wonder what wget does with ''.
03:50:23 Or other whitespace before the protocol.
03:55:27 nicolas17: yeah i think Wget-AT may not
03:55:55 i'll queue URLs in #// with trailing whitespace with and without that trailing whitespace
03:56:00 JAA: no idea...
03:59:26 Well, I found part of the answer: https://github.com/ArchiveTeam/wget-lua/blob/01bae48b489b93efe26fee97f10f6f5b5ba4583e/src/html-parse.c#L1016-L1028
04:00:28 hah :)
04:00:41 eugh
04:00:55 it could probably be handled better yeah
04:01:09 I don't think that's what the specs say, but I don't feel like wading through that right now.
04:01:10 i'll create an issue and get it fixed soon
04:01:35 but there is probably a reason they do that in Wget-AT
04:01:43 yeah, this would be a different part of the spec altogether
04:01:54 so we might run into further problems if we start treating this differently
04:01:58 I looked at the part that says how to handle a click to
04:02:08 Yeah, this, I think: https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state
04:02:14 once you already have " foo " parsed as the value of the attribute
04:02:28 urghh
04:02:34 I get ANGRY every time I look at whatwg specs
04:02:37 arkiver: My suspicion would be 'because that's how we've always been doing it' with a dash of 'the early internet was hell'.
04:02:50 this is not a spec, this is a browser implementation, written in the programming language known as "English"
04:02:52 maybe yeah
04:03:07 do we know how usual it is for this to occur today?
04:03:08 There are so many unspecified things in earlier HTML it isn't even funny.
04:03:09 the leading \n
04:03:29 I feel like the relevant question should probably be: what do browsers do?
04:03:42 nowadays, browsers do what the spec does
04:03:49 Yeah
04:03:57 Except when they don't. :-P
04:03:59 because if major browsers agree to handle broken code in the same way, they change the spec to match as well
04:04:00 browsers probably do a lot more complex parsing than Wget-AT does
04:04:09 which means we could run into problems with Wget-AT
04:04:15 Browsers implement the entire HTML standard, yes.
04:04:18 but will make an issue anyway and see
04:04:45 add a unit test >.>
04:04:54 Hmm: https://github.com/ArchiveTeam/wget-lua/blob/01bae48b489b93efe26fee97f10f6f5b5ba4583e/src/html-parse.c#L1042-L1043
04:06:41 oh integration tests are written in perl
04:06:43 pain
04:12:28 whatwg loves you :3
04:12:38 w3c was way too slow
04:18:10 AK: i don't see namesuppressed.com in recent WARCs
04:24:05 those mentioned IPs don't do pipelines do they
08:52:03 Interesting, I wonder what it was then
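
Note on the whitespace discussion above: a minimal Python sketch of the browser-side trimming and of the "queue with and without trailing whitespace" idea. It assumes the preprocessing described in the WHATWG URL Standard's parser (strip leading and trailing C0 control or space characters, then remove ASCII tabs and newlines). This is not Wget-AT's code, and the names browser_style_trim and variants_to_queue are hypothetical, made up for illustration.

```python
# Sketch only: approximates how a browser cleans an href value before
# resolving it. Not taken from Wget-AT; function names are hypothetical.

# "C0 control or space" in the URL Standard: code points U+0000 through U+0020.
C0_CONTROL_OR_SPACE = "".join(chr(c) for c in range(0x21))

# "ASCII tab or newline": tab, line feed, carriage return.
ASCII_TAB_OR_NEWLINE = "\t\n\r"


def browser_style_trim(href: str) -> str:
    """Strip leading/trailing C0 control or space, then drop tabs/newlines anywhere."""
    trimmed = href.strip(C0_CONTROL_OR_SPACE)
    return trimmed.translate({ord(c): None for c in ASCII_TAB_OR_NEWLINE})


def variants_to_queue(href: str) -> list[str]:
    """Return the raw href, plus the trimmed form when the two differ."""
    trimmed = browser_style_trim(href)
    return [href] if trimmed == href else [href, trimmed]


if __name__ == "__main__":
    for raw in ["https://example.com/", "https://example.com/ ", "\n https://example.com/"]:
        print(repr(raw), "->", variants_to_queue(raw))
```

Queuing both forms keeps the URL exactly as it appeared in the HTML while also covering what a browser would most likely request; which form a given crawler emits depends on its own parser.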