01:38:28 So turns out that some files aren't downloadable from FOIAonline:
01:38:33 > Message The request was rejected because the URL contained a potentially malicious String "%3B"
01:38:40 The filename contains a semicolon...
01:39:07 No, replacing %3B with a literal semicolon doesn't work either, nor does double-encoding it.
01:40:18 This appears to be the problem for many of the items that failed.
01:56:57 rogue WAF strikes again
01:57:58 ………….
01:58:04 fucking christ
02:03:00 lmao
02:27:37 Oh, it's even stupider than I thought.
02:27:54 So those document download links contain the filename, presumably for the Content-Disposition header.
02:28:06 But you can just change it (as long as you still send the right Referer etc.).
02:29:04 Guess I'll add detection for this org.springframework.security.web.firewall.StrictHttpFirewall.rejectedBlacklistedUrls thing and just replace the filename with something generic for those.
02:29:57 The bruteforcing still didn't happen, by the way. Machine was busy with the known existing requests and uploading.
02:41:36 Since I need to retry these items yet again, not sure if it will happen I'm afraid.
02:43:50 I did briefly sample some of the 'missing' IDs, and those all seemed to fail (i.e. 403 on the API), so hopefully there isn't much missing.
03:03:20 I'll also skip over files that can't be downloaded. Until now, it would fail the entire item (= request).
03:12:19 I also bumped the timeout since some API requests and some large file downloads ran into that.
03:12:26 This is probably as good as it'll get.
03:25:06 Oops, I played with a bruteforce sample and got myself banned it seems. Let's hope it doesn't last long.
03:26:58 Actually, can't connect from elsewhere either. Uh oh...
03:28:23 Ok, it's back.
03:30:22 Needless to say that bruteforcing won't work if that's what happens.
03:30:29 But also, not a single hit on that sample.
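[Editor's note] The workaround described at 02:27:54 to 02:29:04 (detect the firewall rejection, then swap the filename in the download link for something generic) could look roughly like the sketch below. This is not the actual archiving script. The log does not say where in the URL the filename sits, so the sketch assumes it is the last path segment; the function names, the example.org URL, and the generic name "file" are all hypothetical. Per the log, the real request also needs the right Referer etc., which is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit, unquote

# Text taken from the rejection message quoted in the log at 01:38:33.
REJECTION_MARKER = "potentially malicious String"


def is_firewall_rejection(body: str) -> bool:
    """Return True if a response body looks like the firewall rejection."""
    return REJECTION_MARKER in body


def with_generic_filename(url: str, generic: str = "file") -> str:
    """Replace the last path segment if it contains a semicolon.

    Assumption: the filename is the last path segment of the download URL.
    URLs without a (literal or percent-encoded) semicolon in that segment
    are returned unchanged.
    """
    parts = urlsplit(url)
    head, sep, last = parts.path.rpartition("/")
    if ";" not in unquote(last):
        return url
    return urlunsplit(parts._replace(path=head + sep + generic))


if __name__ == "__main__":
    # Hypothetical URL, only for illustration.
    print(with_generic_filename(
        "https://example.org/download/123/report%3Bfinal.pdf?x=1"))
```

Only the path segment changes; the query string and the rest of the URL are kept, on the assumption that the server identifies the document by something other than the filename (as the log suggests).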
14:36:48 https://www.nytimes.com/2023/09/29/business/media/letterboxd-new-owner.html
14:57:02 While I love Letterboxd, it is ultimately a reskin of TMDB data with a few social media features. I would love it if they implemented better guidelines about what is a review and what is a comment. There is too many one liners on any decently popular film
14:57:58 Still better than Goodreads though. Amazon ruined that site
17:51:44 fyi it appears discord is changing how cdn links work, this is likely to break the other half of the uploads that didn't get lit ablaze by dropbox dropping the box or imgur failing to image...
17:52:27 considering how much important knowledge is in non-crawled, almost certainly non-backed-up guilds there, i wonder if a proper project would be worthwhile...
17:57:55 #discard
18:29:04 Turns out that there are entries on FOIAonline which can't be found by the search (at least with how I used it), but they aren't in my bruteforce list either. Two examples: https://foiaonline.gov/foiaonline/action/public/submissionDetails?trackingNumber=DOI-FWS-2023-003849&type=Request https://foiaonline.gov/foiaonline/action/public/submissionDetails?trackingNumber=DOJ-2020-000763&type=Request
18:29:50 Probably not much that can be done about that. :-/
18:31:01 They don't even show up when you specifically search for those tracking numbers.
18:49:03 about the discord cdn shenanigans... it appears it means all cdn links to discord will break outside of discord...
18:49:31 and i have seen many MANY cases of a discord cdn link being used for a download that ought to be persistent...
18:50:51 And this is still not the channel to discuss it.
18:51:26 -> #discard
19:00:09 JustAnotherArchivist edited FOIAonline (+2477, Document site quirks): https://wiki.archiveteam.org/?diff=50912&oldid=50898
20:47:46 Something broke at FOIAonline about 15 minutes ago. Getting a lot more errors now.
22:03:36 FOIAonline is offline now, happened sometime in the past hour or so.
22:09:58 I was hoping it'd last a bit longer since they said that today would be the last day of access and it'd be inaccessible tomorrow, but oh well.
22:10:11 I got the vast majority of discoverable content, I think.
22:15:35 🪦 rip
22:15:49 thanks JAA
22:15:54 JustAnotherArchivist edited FOIAonline (-44, It's dead, Jim.): https://wiki.archiveteam.org/?diff=50913&oldid=50912
22:21:59 good work, JAA!
22:31:00 for sure