00:08:11 <flashfire42> the fuck
00:16:43 <Pedrosso> Made a table of Steam Workshops (the table starting out collapsed for obvious reasons)
00:20:15 <fireonlive> nice
00:20:21 <Pedrosso> ye
00:50:15 <nicolas17> >283 kilobytes
02:02:58 <DJ> https://anon.cafe/ is shutting down on March 15
02:03:19 <DJ> https://anon.cafe/meta/res/16466.html announcement
02:07:32 <nicolas17> what is it?
02:15:14 <DJ> It's an imageboard, part of a webring. Shutting down because of operating costs https://anon.cafe/meta/res/16467.html#16486
02:16:33 <DJ> Oh, sorry, that's not the board owner; it's just speculation, they don't know.
02:41:03 <pabs> pokechu22: a jira https://jira.ecmwf.int
02:41:33 <h2ibot> Pokechu22 edited Jira (+23, /* Not yet archived */ https://jira.ecmwf.int): https://wiki.archiveteam.org/?diff=51661&oldid=51655
02:41:33 <pokechu22> thanks
02:41:40 <pokechu22> I'm going to try to get something started on those soon
02:42:38 <pokechu22> I'm pretty sure the database doesn't actually need to be saved to get attachments, as the same URL extraction issue that causes a bunch of junk relative URLs for attachments means that all attachments get logged... so that simplifies things a bit
04:23:52 <h2ibot> JustAnotherArchivist edited Current Projects (+1, Fix date): https://wiki.archiveteam.org/?diff=51662&oldid=51658
04:30:54 <h2ibot> FireonLive edited Current Projects (+16, move Blogger to long-term to reflect new…): https://wiki.archiveteam.org/?diff=51663&oldid=51662
05:09:48 <fireonlive> Weaveworks is shutting down - https://www.linkedin.com/posts/richardsonalexis_hi-everyone-i-am-very-sad-to-announce-activity-7160295096825860096-ZS67 https://news.ycombinator.com/item?id=39262650
05:35:05 <h2ibot> Pokechu22 edited Jira (+50, /* Not yet archived */…): https://wiki.archiveteam.org/?diff=51664&oldid=51661
06:06:10 <h2ibot> Pokechu22 edited Jira (+244, /* Strategy */ database isn't needed; link script): https://wiki.archiveteam.org/?diff=51665&oldid=51664
06:07:10 <h2ibot> Pokechu22 edited Jira (+26, /* Not yet archived */ https://bugs.openjdk.org/): https://wiki.archiveteam.org/?diff=51666&oldid=51665
09:57:00 <h2ibot> Exorcism edited Vbox7 (+64): https://wiki.archiveteam.org/?diff=51667&oldid=51648
15:32:12 <h2ibot> Switchnode edited Deathwatch (+390, /* 2024 */ add world of tanks forums): https://wiki.archiveteam.org/?diff=51668&oldid=51649
18:31:46 <h2ibot> Entartet edited Deathwatch (+231, Added thebillionscompanion.net.): https://wiki.archiveteam.org/?diff=51669&oldid=51668
19:30:57 <h2ibot> Pokechu22 edited Games/Engines, Platforms and Hostings (+12, /* PC and Web */ [[Steam]]): https://wiki.archiveteam.org/?diff=51670&oldid=50184
20:23:56 <pokechu22> Hmm, `(echo a; echo b; echo c) | zstdgrep -e 'a' -e 'b'` gives no output for me, but `zstdgrep -e 'a'` does, as do `zgrep -e 'a' -e 'b'` and `grep -e 'a' -e 'b'`. This also happened when I used zstdgrep on a .gz file. Is this a bug or have I misunderstood something about zstdgrep?
20:25:00 <JAA> This is a bug.
20:25:19 <JAA> https://github.com/facebook/zstd/issues/2064
20:25:42 <JAA> zstdless has similar issues with option parsing: https://github.com/facebook/zstd/issues/2880
20:26:07 <pokechu22> Oof
20:26:53 <JAA> Er, zstdless had*, although I haven't verified whether everything behaves correctly now.
20:27:01 <pokechu22> I didn't even intend to type zstdgrep the first time, glad I noticed the missing output (I was verifying that extracting JIRA attachments from junk that gets logged in the meta-warc would work by comparing it with one where we extracted it from the DB)
20:27:31 <JAA> Yeah, zstdgrep is fine for very simple cases, but if in doubt, it's better to use `zstdcat | grep ...` instead.
20:44:02 <pokechu22> ... ok, new problem, and this seems like it's not a grep one: from view-source:https://web.archive.org/web/20230929192111id_/https://bugs.mojang.com/browse/MC-180529 archivebot saw data-downloadurl="application/zip:Normal_Font_TT_v3.zip:https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip" and extracted
20:44:04 <pokechu22> https://bugs.mojang.com/browse/application/zip:Normal_Font_TT_v3.zip:https:/bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip but it *didn't* do anything with data-downloadurl="text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log"
20:44:32 <pokechu22> both https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log and https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip ended up in the database though
20:45:49 <pokechu22> It doesn't seem to have extracted anything along the lines of browse/text.*\.log:
20:46:08 <pokechu22> but it did accept https://bugs.mojang.com/browse/text/plain:crash.log.txt:https:/bugs.mojang.com/secure/attachment/71965/crash.log.txt
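The junk URL quoted above can be reproduced with Python's standard `urllib.parse.urljoin` by treating the whole `data-downloadurl` value as a relative link. This is only a sketch: it does not go through wpull's own link handling, and `split_download_url` is a hypothetical helper that assumes the attribute has the form `mime:filename:url` with no colon in the first two fields.

```python
from urllib.parse import urljoin

# Values copied from the log above.
PAGE_URL = "https://bugs.mojang.com/browse/MC-180529"
DOWNLOAD_URL_ATTR = (
    "application/zip:Normal_Font_TT_v3.zip:"
    "https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip"
)


def resolve_as_relative(page_url: str, attr_value: str) -> str:
    """Treat the whole attribute value as a relative link (the junk case)."""
    return urljoin(page_url, attr_value)


def split_download_url(attr_value: str) -> tuple[str, str, str]:
    """Hypothetical helper: split 'mime:filename:url' on the first two colons.

    Assumes neither the MIME type nor the filename contains a colon.
    """
    mime, filename, url = attr_value.split(":", 2)
    return mime, filename, url


junk = resolve_as_relative(PAGE_URL, DOWNLOAD_URL_ATTR)
mime, filename, real_url = split_download_url(DOWNLOAD_URL_ATTR)
print(junk)      # relative resolution against /browse/
print(real_url)  # the attachment URL embedded in the attribute
```

Because "application/zip" contains a slash, it cannot be parsed as a URL scheme, so the value is resolved as a path under `/browse/`, and the empty path segment in `https://` is dropped when the path is rejoined.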
20:47:03 <pokechu22> hmm, it also didn't extract any .nbt or .dat files - does archivebot have a list of extensions it'll assume might be files when doing extraction from data attributes?
20:49:24 <JAA> This would be on wpull, not AB.
20:51:01 <JAA> https://github.com/ArchiveTeam/wpull/blob/cfa5bcc571e7ff2d5175d8299e90651955c72df5/wpull/scraper/html.py#L618-L621
20:51:57 <JAA> And https://github.com/ArchiveTeam/wpull/blob/cfa5bcc571e7ff2d5175d8299e90651955c72df5/wpull/scraper/util.py#L136-L217
20:52:49 <JAA> That should pass `is_likely_link`.
20:53:35 <JAA> Oh hmm, unless it's the `mimetype.guess` check.
20:53:43 <JAA> `mimetype.guess_type` *
20:54:59 <JAA> Yeah, it fails the `is_likely_link` check.
20:56:09 <pokechu22> alright, I guess we do need the database after all :|
20:56:38 <JAA> Yep, `mimetypes.guess_type` doesn't know about `.log`.
20:56:56 <JAA> It wouldn't be in the DB either.
20:57:23 <JAA> `mimetypes.guess_type('text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log', strict=False)` → `(None, None)`
20:57:37 <pokechu22> That wouldn't, but the correct URL (https://bugs.mojang.com/secure/attachment/286387/Normal_Font_TT_v3.zip or https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log) will be; they're just not saved due to the no-parent rule
20:57:47 <JAA> Ah
20:59:03 <pokechu22> this also means I need to find the database for hub.spigotmc.org which we ran a while back and saved the DB for, but I don't think I ever extracted outlinks from
20:59:49 <pokechu22> I'll start !a < list jobs for several of the JIRA instances since we are running low on time, and then ping you for the DBs to be saved
21:00:17 <thuban> fwiw on 3.11 `mimetypes.guess_type('text/plain:hs_err_pid9900.log:https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log', strict=False)` → `('text/plain', None)`
21:01:19 <pokechu22> It probably still won't like .dat or .nbt though
21:01:49 <thuban> indeed not
21:02:05 <JAA> thuban: I'm still getting `(None, None)` on 3.11.
21:02:51 <JAA> I think the `mimetypes` module does some discovery stuff in /usr/share or something like that.
21:02:57 <JAA> So it can differ from system to system.
21:03:41 <thuban> ah, so it does
21:04:19 <JAA> https://github.com/python/cpython/blob/831b95d9b970901a39c64b5f261f379a490c64fb/Lib/mimetypes.py#L48-L58
21:04:32 <JAA> Not /usr/share but same concept. :-)
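To see why the two results above can differ between machines, it may help to list which of the candidate `mime.types` files the `mimetypes` module looks for actually exist locally. A minimal sketch using only `mimetypes.knownfiles` from the standard library; the variable name `existing` is just for illustration.

```python
import mimetypes
import os

# mimetypes.knownfiles lists candidate mime.types paths; only the ones
# that exist on this machine can contribute extra extension mappings.
existing = [path for path in mimetypes.knownfiles if os.path.isfile(path)]

print("candidate files:", mimetypes.knownfiles)
print("present here:", existing)
```

If `existing` is empty on one system and not on another, lookups for extensions such as `.log` can plausibly return different answers on the two.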
21:04:42 <thuban> you beat me to it, new github is awful v_v
21:05:27 <JAA> It sure is, I do more and more stuff locally with a clone instead.
21:05:42 <JAA> Especially since code search is loginwalled anyway.
21:08:53 <thuban> anyway, perhaps the ab pipelines should be fitted with local mimetype files?
21:10:59 <JAA> Perhaps wpull should ship its own list and init the `mimetypes` module with that.
21:12:23 <thuban> ah! i didn't see that option. yes, that would simplify things
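One way the "ship its own list" idea could look, as a rough sketch rather than anything wpull actually does: instead of re-initialising the module globally, this uses a private `mimetypes.MimeTypes` instance and registers extra extensions on it with `add_type`. The `BUNDLED_TYPES` table and the `build_guesser` helper are hypothetical, and the MIME types chosen for `.dat` and `.nbt` are placeholders.

```python
import mimetypes

# Hypothetical bundled list; these extension/type pairs are illustrative only.
BUNDLED_TYPES = {
    ".log": "text/plain",
    ".dat": "application/octet-stream",
    ".nbt": "application/octet-stream",
}


def build_guesser(extra_types: dict[str, str]) -> mimetypes.MimeTypes:
    """Return a private MimeTypes instance extended with extra_types."""
    guesser = mimetypes.MimeTypes()
    for ext, mime in extra_types.items():
        guesser.add_type(mime, ext)
    return guesser


guesser = build_guesser(BUNDLED_TYPES)

# The attribute value from the log that previously produced (None, None).
value = (
    "text/plain:hs_err_pid9900.log:"
    "https://bugs.mojang.com/secure/attachment/286386/hs_err_pid9900.log"
)
log_guess = guesser.guess_type(value, strict=False)
nbt_guess = guesser.guess_type("level.nbt", strict=False)
print(log_guess)
print(nbt_guess)
```

With the extensions registered explicitly, the lookup no longer depends on which `mime.types` files happen to be installed on the pipeline host.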
21:13:16 <JAA> Apache's list doesn't even have .gz and .zst...
21:29:41 <JAA> Looks like they're open to changes: https://github.com/apache/httpd/pull/372
22:12:32 <h2ibot> Pokechu22 edited Jira (+163, the database is still needed): https://wiki.archiveteam.org/?diff=51671&oldid=51666
23:55:50 <h2ibot> Pokechu22 edited Jira (+0, update script): https://wiki.archiveteam.org/?diff=51672&oldid=51671