00:08:11 the fuck
00:16:43 Made a table of Steam Workshops (the table starting out collapsed for obvious reasons)
00:20:15 nice
00:20:21 ye
00:50:15 >283 kilobytes
02:02:58 https://anon.cafe/ is shutting down on March 15
02:03:19 https://anon.cafe/meta/res/16466.html announcement
02:07:32 what is it?
02:15:14 It's an imageboard, part of a webring. Shutting down because of operating costs https://anon.cafe/meta/res/16467.html#16486
02:16:33 Oh sorry that's not the board owner, it's just speculation they don't know.
02:41:03 pokechu22: a jira https://jira.ecmwf.int
02:41:33 Pokechu22 edited Jira (+23, /* Not yet archived */ https://jira.ecmwf.int): https://wiki.archiveteam.org/?diff=51661&oldid=51655
02:41:33 thanks
02:41:40 I'm going to try to get something started on those soon
02:42:38 I'm pretty sure the database doesn't actually need to be saved to get attachments, as the same URL extraction issue that causes a bunch of junk relative URLs for attachments means that all attachments get logged...
so that simplifies things a bit
04:23:52 JustAnotherArchivist edited Current Projects (+1, Fix date): https://wiki.archiveteam.org/?diff=51662&oldid=51658
04:30:54 FireonLive edited Current Projects (+16, move Blogger to long-term to reflect new…): https://wiki.archiveteam.org/?diff=51663&oldid=51662
05:09:48 Weaveworks is shutting down - https://www.linkedin.com/posts/richardsonalexis_hi-everyone-i-am-very-sad-to-announce-activity-7160295096825860096-ZS67 https://news.ycombinator.com/item?id=39262650
05:35:05 Pokechu22 edited Jira (+50, /* Not yet archived */…): https://wiki.archiveteam.org/?diff=51664&oldid=51661
06:06:10 Pokechu22 edited Jira (+244, /* Strategy */ database isn't needed; link script): https://wiki.archiveteam.org/?diff=51665&oldid=51664
06:07:10 Pokechu22 edited Jira (+26, /* Not yet archived */ https://bugs.openjdk.org/): https://wiki.archiveteam.org/?diff=51666&oldid=51665
09:57:00 Exorcism edited Vbox7 (+64): https://wiki.archiveteam.org/?diff=51667&oldid=51648
15:32:12 Switchnode edited Deathwatch (+390, /* 2024 */ add world of tanks forums): https://wiki.archiveteam.org/?diff=51668&oldid=51649
18:31:46 Entartet edited Deathwatch (+231, Added thebillionscompanion.net.): https://wiki.archiveteam.org/?diff=51669&oldid=51668
19:30:57 Pokechu22 edited Games/Engines, Platforms and Hostings (+12, /* PC and Web */ [[Steam]]): https://wiki.archiveteam.org/?diff=51670&oldid=50184
20:23:56 Hmm, `(echo a; echo b; echo c) | zstdgrep -e 'a' -e 'b'` gives no output for me but `zstdgrep -e 'a'` does as does `zgrep -e 'a' -e 'b'` or `grep -e 'a' -e 'b'` - this also happened when I used zstdgrep on a .gz file. Is this a bug or have I misunderstood something about zstdgrep?
20:25:00 This is a bug.
20:25:19 https://github.com/facebook/zstd/issues/2064
20:25:42 zstdless has similar issues with option parsing: https://github.com/facebook/zstd/issues/2880
20:26:07 Oof
20:26:53 Er, zstdless had*, although I haven't verified whether everything behaves correctly now.
20:27:01 I didn't even intend to type zstdgrep the first time, glad I noticed the missing output (I was verifying that extracting JIRA attachments from junk that gets logged in the meta-warc would work by comparing it with one where we extracted it from the DB)
20:27:31 Yeah, zstdgrep is fine for very simple cases, but if in doubt, it's better to use `zstdcat | grep ...` instead.
20:44:02 ... ok, new problem, and this seems like it's not a grep one: from view-source:https://web.archive.org/web/20230929192111id_/https://bugs.mojang.com/browse/MC-180529 archivebot saw data-downloadurl="application/zip:Normal_Font_TT_v3.zip:https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip" and extracted
20:44:04 https://bugs.mojang.com/browse/application/zip:Normal_Font_TT_v3.zip:https:/bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip but it *didn't* do anything with data-downloadurl="text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log"
20:44:32 both https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log and https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip ended up in the database though
20:45:49 It doesn't seem to have extracted anything along the lines of browse/text.*\.log:
20:46:08 but it did accept https://bugs.mojang.com/browse/text/plain:crash.log.txt:https:/bugs.mojang.com/secure/attachment/71965/crash.log.txt
20:47:03 hmm, it also didn't extract any .nbt or .dat files - does archivebot have a list of extensions it'll assume might be files when doing extraction from data attributes?
20:49:24 This would be on wpull, not AB.
20:51:01 https://github.com/ArchiveTeam/wpull/blob/cfa5bcc571e7ff2d5175d8299e90651955c72df5/wpull/scraper/html.py#L618-L621
20:51:57 And https://github.com/ArchiveTeam/wpull/blob/cfa5bcc571e7ff2d5175d8299e90651955c72df5/wpull/scraper/util.py#L136-L217
20:52:49 That should pass `is_likely_link`.
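[Editorial sketch, not part of the log] The `data-downloadurl` values quoted in the messages above appear to follow a `mimetype:filename:url` layout. Assuming that layout holds, the real attachment URL can be recovered by splitting on the first two colons. The helper name `parse_download_url` is hypothetical and is not taken from wpull, ArchiveBot, or the script linked on the wiki.

```python
def parse_download_url(value):
    """Split a 'mimetype:filename:url' string into its three parts.

    Assumes neither the MIME type nor the filename contains a colon;
    a filename with a colon in it would be split incorrectly.
    """
    parts = value.split(":", 2)  # at most two splits, so the URL keeps its colons
    if len(parts) != 3:
        raise ValueError(f"unexpected data-downloadurl value: {value!r}")
    mime, filename, url = parts
    return mime, filename, url


# The two attribute values quoted in the log.
samples = [
    "application/zip:Normal_Font_TT_v3.zip:https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip",
    "text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log",
]

for sample in samples:
    mime, filename, url = parse_download_url(sample)
    print(mime, filename, url)
```

Splitting the attribute directly avoids relying on how the crawler resolves the whole value as a relative link, which is where the junk `browse/...` URLs come from.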
20:53:35 Oh hmm, unless it's the `mimetype.guess` check.
20:53:43 `mimetype.guess_type` *
20:54:59 Yeah, it fails the `is_likely_link` check.
20:56:09 alright, I guess we do need the database after all :|
20:56:38 Yep, `mimetypes.guess_type` doesn't know about `.log`.
20:56:56 It wouldn't be in the DB either.
20:57:23 `mimetypes.guess_type('text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log', strict=False)` → `(None, None)`
20:57:37 That wouldn't, but the correct URL (https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip or https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log) will be; they're just not saved due to the no-parent rule
20:57:47 Ah
20:59:03 this also means I need to find the database for hub.spigotmc.org which we ran a while back and saved the DB for, but I don't think I ever extracted outlinks from
20:59:49 I'll start !a < list jobs for several of the JIRA instances since we are running low on time, and then ping you for the DBs to be saved
21:00:17 fwiw on 3.11 `mimetypes.guess_type('text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log', strict=False)` → `('text/plain', None)`
21:01:19 It probably still won't like .dat or .nbt though
21:01:49 indeed not
21:02:05 thuban: I'm still getting `(None, None)` on 3.11.
21:02:51 I think the `mimetypes` module does some discovery stuff in /usr/share or something like that.
21:02:57 So it can differ from system to system.
21:03:41 ah, so it does
21:04:19 https://github.com/python/cpython/blob/831b95d9b970901a39c64b5f261f379a490c64fb/Lib/mimetypes.py#L48-L58
21:04:32 Not /usr/share but same concept. :-)
21:04:42 you beat me to it, new github is awful v_v
21:05:27 It sure is, I do more and more stuff locally with a clone instead.
21:05:42 Especially since code search is loginwalled anyway.
21:08:53 anyway, perhaps the ab pipelines should be fitted with local mimetype files?
21:10:59 Perhaps wpull should ship its own list and init the `mimetypes` module with that.
21:12:23 ah! i didn't see that option. yes, that would simplify things
21:13:16 Apache's list doesn't even have .gz and .zst...
21:29:41 Looks like they're open to changes: https://github.com/apache/httpd/pull/372
22:12:32 Pokechu22 edited Jira (+163, the database is still needed): https://wiki.archiveteam.org/?diff=51671&oldid=51666
23:55:50 Pokechu22 edited Jira (+0, update script): https://wiki.archiveteam.org/?diff=51672&oldid=51671
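[Editorial sketch, not part of the log] The idea of shipping a fixed list so that `guess_type` stops depending on the host's mime.types files could look roughly like the following. This is a minimal sketch under assumptions, not wpull's code: `EXTRA_TYPES` and `build_guesser` are hypothetical names, and the MIME types picked for `.nbt` and `.dat` are placeholders. It relies on a fresh `mimetypes.MimeTypes()` instance being populated from the module's built-in tables only, which is how recent CPython versions behave as far as I can tell.

```python
import mimetypes

# Hypothetical extra mappings, for illustration only. The types chosen for
# .nbt and .dat are assumptions, not something wpull or Apache defines.
EXTRA_TYPES = {
    ".log": "text/plain",
    ".nbt": "application/octet-stream",
    ".dat": "application/octet-stream",
}


def build_guesser(extra_types=EXTRA_TYPES):
    """Return a MimeTypes instance that does not consult system files."""
    # The instance starts from the built-in defaults; system mime.types
    # files are loaded into the module-level database, not into this object.
    guesser = mimetypes.MimeTypes()
    for ext, mime in extra_types.items():
        guesser.add_type(mime, ext, strict=True)
    return guesser


guesser = build_guesser()

# The attribute value that came back as (None, None) on one system in the log.
value = (
    "text/plain:hs_err_pid9900.log:"
    "https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log"
)
print(guesser.guess_type(value, strict=False))  # ('text/plain', None)
```

The module-level alternative would be `mimetypes.init(files=[...])` with a bundled file, which matches the "init the `mimetypes` module" wording more literally. A private instance has the advantage of not changing behaviour for other code in the same process.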