01:23:32 on that note, what about world of tanks?
05:48:31 Manu edited Mailman/2 (+50, /* http://calypso.tux.org/pipermail/ lost */): https://wiki.archiveteam.org/?diff=52173&oldid=52170
07:27:39 thuban: should I repost those messages here?
07:28:22 tourist: for the benefit of log-readers, sure
07:29:29 [reposting from #archiveteam]
07:29:30 Hi, just want to discuss before editing Deathwatch because it's a bit vague:
07:29:40 booru.org is a site which allowed people to host their own tag-based 'booru' imageboards. Some are basically archives themselves for fandoms or special interests.
07:29:48 There are about 3000 boorus hosted; eighty have over 10,000 images, ten have over 100,000 images, and two exceptional boorus have 1.5 and 1.7 million images respectively.
07:29:56 I propose it be placed on Deathwatch due to this post a couple of weeks ago from the site admin: https://forum.booru.org/viewtopic.php?t=14193
07:30:04 >The project is closed and winding down, resources for search functionality etc. at peak times will get tapped out.
07:30:12 Does it seem like this site would require a dedicated project, or should it be added to the Deathwatch page as normal, with 'Unknown' date?
07:30:17 [/end repost]
07:31:20 tourist: in theory sites with troubling vital signs but no clear shutdown announcement should go on 'fire drill' rather than deathwatch, but that page is a bit of a mess and i've been meaning to clean it up for some time, so i think deathwatch is ok for now
07:32:41 and we do add sites to deathwatch even if they get their own wiki pages/dedicated projects (although my guess is that it won't be necessary in this case)
07:33:11 Alright, I'll add it to the list now. Thanks :)
07:34:29 tourist: you're welcome! do you know whether there's a way to get a list of all the boorus, and/or whether booru creation/activity has been disabled?
07:36:13 Booru creation is closed. Boorus are still active.
07:38:31 List of boorus can be found at https://booru.org/top but you can only grab up to 200 per page.
07:39:08 that's fine if it's a complete list; let's see
07:39:45 yep, looks like it
07:39:52 thanks!
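
A minimal sketch of grabbing the complete booru list from https://booru.org/top as discussed above. The 'page' query parameter and the idea that each hosted booru appears as a link to a *.booru.org subdomain are assumptions, not confirmed details of the site:

    import re
    import time
    import requests

    found = set()
    page = 0
    while True:
        # 'page' parameter is hypothetical; the log only says the list
        # shows at most 200 boorus per page.
        resp = requests.get("https://booru.org/top", params={"page": page}, timeout=30)
        resp.raise_for_status()
        # Assumption: hosted boorus are linked as *.booru.org subdomains.
        names = set(re.findall(r"https?://([a-z0-9-]+)\.booru\.org", resp.text))
        if not names - found:
            break  # no new boorus on this page; end of the list
        found |= names
        page += 1
        time.sleep(1)  # be polite to a site that is already winding down

    print(f"{len(found)} boorus found")
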
09:16:46 should we do something for opensubtitles.org? they have been restricting access greatly lately
09:16:55 Ryz: no updates, and no reply from them
09:17:17 perhaps at this point the best option is just gathering lists of abload.de URLs and pushing them through AB, if it's not an extreme number of URLs
09:18:02 huh, is opensubtitles.org completely behind login now?
09:18:39 what... looks like it, same for others?
09:18:44 :(
09:18:45 arkiver: no, not for me
09:19:04 thuban: do you have an example of a subtitle URL?
09:19:17 that is not behind a login for you
09:19:39 sure: https://www.opensubtitles.org/en/subtitleserve/sub/10010159 / https://www.opensubtitles.com/en/subtitles/legacy/10010159/download ('beta')
09:20:01 sends me to a login form
09:20:12 let me VPN this
09:21:48 thuban: hmm, from a different location i get no login scen
09:21:50 screen*
09:22:15 i feel like opensubtitles.org is becoming more shitty fast though
09:22:31 arkiver: X-Forwarded-For trick work?
09:25:01 thuban: no, i think
09:43:07 arkiver: yes, very much becoming more shitty fast
09:44:43 i think at one point (some months ago) it also tried to make me log in, but doesn't do it now
09:44:51 i think we'll launch a project for them
09:44:55 better archive it before it's too late
09:45:04 now they apparently have a forced login for some IPs
09:45:53 ---
09:45:53 -eggdrop- [karma] '-' now has -2 karma!
09:45:59 WHAT
09:46:09 ---
09:46:10 -eggdrop- [karma] '-' now has -3 karma!
09:46:13 - --
09:46:14 -eggdrop- [karma] '-' now has -4 karma!
09:46:34 what magic is this
09:47:01 sourcery
09:47:04 strip trailing '--', trim remainder of message
09:47:19 -- -
09:47:27 no pre-decrement
09:47:39 anyway
09:47:45 so something i came across
09:48:12 this one is apparently going away in June: http://www.bedfordregiment.org.uk/ - clearly some simple site made by probably a single enthusiastic person
09:48:17 (i put it in AB)
09:48:34 they have a list of sources/links to similar sites at http://www.bedfordregiment.org.uk/links.html
09:49:05 near the bottom of the page they have
09:49:09 > A Northamptonshire family history site worth knowing, which carries a wide array of [...]
09:49:24 with a link to http://www.familyhistorynorthants.co.uk/ , which is a blog about gambling.
09:50:00 but looking the front page up in the wayback machine, one finds a beautiful, simple little site rich with information... gone and taken over by some gambling/scam business recently
09:51:15 i wonder if we can find these sites easily somehow and get them all archived; it's sad to see how some of these end up. i bet many of these are maintained by old enthusiastic people, who may pass away in the coming years, after which their sites go down and tons of information gets lost
09:52:37 arkiver: i was just thinking along the same lines (i love sites like this and run them trough ab whenever i come across them)
09:52:44 thuban: yeah!
09:53:04 marginalia.nu is not a bad source for these; i think the index is available somewhere
09:53:21 we should get them all
09:54:03 perhaps we can also contact several of these sites and let them know that we can archive these types of sites
09:54:43 perhaps they could spread the word, and people behind these sites could submit lists of sites like these that they know about
09:55:04 there's maybe forums with enthusiasts around these kinds of subjects?
09:55:57 thuban: did you ever contact marginalia? maybe we should contact them?
09:56:52 arkiver: https://downloads.marginalia.nu/exports/ ! i think 'domains' is what we would want?
09:57:17 or 'urls', depending on how we handle it
09:57:33 ^ *through
09:58:34 thuban: love it, yeah! i don't know much about marginalia, do they only collect these types of little home-made sites?
10:00:55 i don't know that much more, but there's a fair amount of writing about the project(s) and philosophy on the site
10:02:29 see also https://www.marginalia.nu/marginalia-search/about/#similar-projects
10:03:28 thuban: i love it
10:03:38 just looking at search.marginalia.nu
10:03:52 i need to get #Y up and running really
10:03:52 so we can get all these domains
10:04:44 Manu edited Mailman/2 (+46, /* https://datacast.hu/mailman/listinfo saved */): https://wiki.archiveteam.org/?diff=52174&oldid=52173
10:05:01 http://www.bedfordregiment.org.uk is in the list!
10:05:29 http://www.familyhistorynorthants.co.uk/ too
10:06:07 yeah, we need to get this archive, amazing!
10:06:24 arkiver: are you looking for a crawled index of individual pages, or seed URLs?
10:06:29 c3manu: any
10:06:31 https://github.com/MarginaliaSearch/PublicData/tree/master/sets
10:06:41 this is where people can submit urls :)
10:06:58 ..or what people submitted
10:07:00 lovely!
10:07:05 yeah, we should get that too
10:07:20 feel free to extend the wiki page ;)
10:07:21 https://wiki.archiveteam.org/index.php/Marginalia_Search
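
A sketch of pulling seed domains from the exports mentioned above. The exact filename under https://downloads.marginalia.nu/exports/ is an assumption (check the directory listing for the real 'domains' / 'urls' files), as is the one-domain-per-line format:

    import urllib.request

    # Hypothetical filename; browse https://downloads.marginalia.nu/exports/
    # for the actual exports discussed in the log.
    url = "https://downloads.marginalia.nu/exports/domains.txt"
    with urllib.request.urlopen(url) as resp:
        data = resp.read().decode("utf-8", errors="replace")

    # Assumption: one domain per line; adjust if the export is TSV/CSV.
    domains = [line.strip() for line in data.splitlines() if line.strip()]
    print(f"{len(domains)} candidate domains to feed into ArchiveBot")
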
10:08:05 i also think webrings would be good for indices. in the indie corners of the internet those are getting popular again
10:09:03 just look at this: https://webring.xxiivv.com/
10:09:18 in theory yes; in practice it might be difficult to identify webring to/from links (since they can be formatted arbitrarily)
10:09:55 ah, a central index :)
10:09:59 yeah, that's definitely not going to be fun ^^
10:12:18 perhaps it's more something for marginalia to find these sites through those ^ and list them online?
10:12:28 i will send marginalia.nu an email about this awesomeness
10:13:10 do we have a pipeline on AB that can handle a 180 GB file?
10:13:17 i want to throw https://downloads.marginalia.nu/ into it
10:13:25 ^^ sounds good, i'm not sure whether the index is really curated or the search engine is doing the heavy lifting
10:13:35 i approve re awesomeness email :)
10:13:56 thuban: i guess they do some checks on the website front page to see if it is "old style" and include it only if it is
10:21:14 I feel like this is going to be a problem https://server8.kiska.pw/uploads/0bc070b5366d602c/image.png
10:21:20 Limited to 10 per day...
10:48:03 kiska: yeah, it would be a very long-term effort
11:15:39 arkiver: depends on how it's done
11:15:51 if it's IP-based, sure, we're metaphorically screwed
11:16:01 but if it's SESSION-based... (JAA's favorite)
11:16:45 i.e. store session with 24h expiry as cookie object, thus enabling easy bypassing if one were to simply ignore cookies
11:22:01 arkiver: Space-wise, the new pipelines like firepipe should fit 100+ gig files easily
11:27:38 that_lurker: thanks! i put it on firepipe-f
11:28:19 Interesting
11:28:34 Never heard of marginalia before now
11:41:32 nyany: looks to be IP-based
11:43:02 does look like www.opensubtitles.org supports ipv6 though
12:00:36 That 10/day limit only appears on their new "beta" site for me, I can still download as much as I want through the regular/older one
12:01:18 Yet another reason to archive it before they change that, I guess
12:13:09 My friend has a file analysis site and would like all his current reports archived. I have a list of almost 400k links and was wondering if someone could start the archivebot archive of them please - https://transfer.archivete.am/ffsKw/neiki%20analytics%20links.txt
12:13:10 inline (for browser viewing): https://transfer.archivete.am/inline/ffsKw/neiki%20analytics%20links.txt
12:14:06 site's currently running cloudflare with the "essentially off" setting enabled, but I can get them to disable cloudflare completely if needed; don't think it will be needed though
12:18:24 JaffaCakes118: your friend's reports depend on javascript-initiated requests; archivebot will be useless
12:20:13 i suppose it might work if we generated the corresponding api url for every page
12:20:22 thuban: the links are able to be archived perfectly through the save page now page
12:20:27 is it not the same for archivebot?
12:20:33 save page now is not archivebot
12:20:35 no
12:20:47 save page now runs a browser, archivebot doesn't
12:21:17 ah ok
12:21:39 is there any way we can still archive it? My friend of course will be willing to make changes
12:21:41 but seems we'd just need https://api.neiki.dev/analyze/reports?sha256=...
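
A sketch of the "generate the corresponding api url for every page" idea: pull anything that looks like a SHA-256 hash out of each report link in the submitted list and emit the matching API URL. Only the endpoint itself is from the log; that each report link embeds the hash, and the filenames used here, are assumptions:

    import re

    API = "https://api.neiki.dev/analyze/reports?sha256={}"

    with open("neiki analytics links.txt") as links, open("api_urls.txt", "w") as out:
        for line in links:
            # Assumption: each report URL contains the file's SHA-256 hash.
            m = re.search(r"\b[0-9a-fA-F]{64}\b", line)
            if m:
                out.write(API.format(m.group(0).lower()) + "\n")
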
12:22:05 yeah, he said save the api instead
12:22:10 and it will return the data of it
12:22:12 well, alongside
12:22:54 I will get a list of links now for the api.neiki.dev
12:23:04 no need
12:23:06 oh ok
13:58:48 wickerz: I'm sure you saw my little post in ab, but that site should be all set for you; it's on the bot per c3manu
13:59:32 looks like a success too
14:00:10 ty nyany
14:16:34 Hey, i want to create a sitemap for my website to display all links that could be archived every month, for example. what should i do for that?
14:16:34 just a big txt file with all links?
14:40:56 there's an xml standard for sitemaps, see https://www.sitemaps.org/protocol.html
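
For log-readers, a minimal generator for the sitemaps.org format linked above. The URL list is a placeholder for the site's real pages, and <changefreq>monthly</changefreq> matches the archive-every-month idea:

    from datetime import date

    # Placeholder URLs; replace with the site's real page list.
    urls = ["https://example.org/", "https://example.org/about.html"]

    today = date.today().isoformat()
    entries = "\n".join(
        f"  <url><loc>{u}</loc><lastmod>{today}</lastmod>"
        f"<changefreq>monthly</changefreq></url>"
        for u in urls
    )
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.write(entries + "\n</urlset>\n")
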
16:08:01 Manu edited Mailman/2 (-15, /* https://ffmpeg.org/mailman/listinfo saved by…): https://wiki.archiveteam.org/?diff=52175&oldid=52174
16:16:02 Manu edited Mailman/2 (-2, /* https://listas.softwarelivre.org/ saved */): https://wiki.archiveteam.org/?diff=52176&oldid=52175
16:19:02 Manu edited Mailman/2 (+25, /* no archives for…): https://wiki.archiveteam.org/?diff=52177&oldid=52176
17:17:13 Manu edited Mailman/2 (+37, /* http://list.ehu.eus/ saved */): https://wiki.archiveteam.org/?diff=52178&oldid=52177
18:03:11 arkiver: Speaking of Marginalia Search, looks like it can also save the pages it crawls as WARC (although this is not enabled by default due to storage space reasons, and from the article it sounds like it might currently be done incorrectly): https://www.marginalia.nu/log/94_warc_warc/ https://www.marginalia.nu/release-notes/v2024-01-0/
18:03:12 https://github.com/MarginaliaSearch/MarginaliaSearch/pull/62
18:04:15 JAA: ^ Possibly some more warc software to mention on https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem ?
18:15:33 FYI: https://www.bbc.com/news/world-middle-east-68961753
18:23:28 qwertyasdfuiopghjkl: Are they faking headers?
18:23:44 > In this case jwarc was a bit of an awkward fit, not a fault of the library, just a minor incompatibility with the level of operations, where much of Marginalia’s crawling works at a higher abstraction level and access to http protocol details isn’t always very easy, meaning some of the headers and handshakes is re-constructed after the fact.
18:25:25 The pull request says:
18:25:25 > A caveat is that it's not possible to fully record every aspect of the crawl due to incompatibilities of design and operation between the crawler and the expectations by the designers of the warc format, but a record of crawling is constructed after the fact. It may be possible to reconcile the two in the future, but this is outside of the scope of this change.
18:26:16 They also seem to be having trouble with the size of WARC files. Maybe we can point out the warc.zst format?
18:28:57 TheTechRobo: I don't know anything about the actual coding stuff, but the article seemed to implied some stuff is saved incorrectly. JAA would probably be a better person to ask about that.
18:29:13 *seemed to imply
18:33:18 (if the inaccuracies can be fixed, maybe it could eventually be a good new source of data for the WBM on small web stuff?)
20:04:55 Is there a channel for Facebook?
20:14:35 arkiver: We have archived https://downloads.marginalia.nu/ a couple times recently.
20:17:21 qwertyasdfuiopghjkl: I feel like I've heard about Marginalia's WARCs when it was first announced, and yeah, it sounds like they're faking it, so not good WARCs.
20:18:22 Cc arkiver
20:31:24 Could someone AB https://wega-vinduer.dk/ - they are soon to be declared bankrupt
20:43:43 wickerz: pokechu22 did it
20:49:27 ugh, bad WARCs
21:21:32 now we just need proper warcs in archivebox...
21:23:06 Xaft edited List of websites excluded from the Wayback Machine (+37): https://wiki.archiveteam.org/?diff=52179&oldid=52122
21:24:06 MrScottyPieey edited Sploder (+106, The site shut down.): https://wiki.archiveteam.org/?diff=52180&oldid=51300
21:24:07 MrScottyPieey edited Template:Internet history (-15): https://wiki.archiveteam.org/?diff=52181&oldid=46691
21:24:08 MrScottyPieey edited Me at the zoo (+17, Wikipedia only allows UTC time zones.): https://wiki.archiveteam.org/?diff=52182&oldid=50561
21:24:09 MrScottyPieey edited YouTube (+33): https://wiki.archiveteam.org/?diff=52183&oldid=52147
21:24:10 MrScottyPieey created 2021 (+20, Created page with "{{Internet history}}"): https://wiki.archiveteam.org/?title=2021
21:24:11 MrScottyPieey uploaded File:YouTube screenshot 2024 May 4 2024 (cropped).png: https://wiki.archiveteam.org/?title=File%3AYouTube%20screenshot%202024%20May%204%202024%20%28cropped%29.png
21:24:12 BooruUser edited Deathwatch (+310, added booru.org): https://wiki.archiveteam.org/?diff=52186&oldid=52136
21:27:15 uh
21:27:31 2021?
21:34:08 JustAnotherArchivist edited Sploder (+164, Restore shutdown announcement verbatim; add…): https://wiki.archiveteam.org/?diff=52187&oldid=52180
21:34:10 Yeah, not a big fan of those {{Internet history}} pages.
21:41:10 JustAnotherArchivist edited Me at the zoo (+9, Then it's a good thing we aren't Wikipedia and…): https://wiki.archiveteam.org/?diff=52188&oldid=52182
21:43:10 JustAnotherArchivist edited Me at the zoo (+105, Update views count, add comments count): https://wiki.archiveteam.org/?diff=52189&oldid=52188
21:57:37 JAA: same here, it's one of a number of pages i feel are quite out of scope for us
21:58:18 /better managed elsewhere
22:00:14 JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=52190&oldid=52179
22:02:03 Because of Google's policy, can someone see if you can scrape this? Google doesn't allow scrapers tho: https://www.google.com/search?q=site%3A*.drv.tw
23:06:24 we got some coverage(?) for subscene: https://www.techworm.net/2024/05/subscene-shutdown.html
23:07:54 90 GB of reddit archived, is it possible to get it on WBM?
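
Appendix to the rate-limit discussion at 11:16: a sketch of the cookie-ignoring bypass described there. If a 10/day counter lives in a session cookie, never sending cookies back yields a fresh session on every request; it does nothing against IP-based limits, which is what www.opensubtitles.org appears to use per the log. The URL is a placeholder:

    import requests

    url = "https://example.org/en/subtitleserve/sub/12345"  # placeholder URL

    for i in range(3):
        # A bare requests.get() keeps no cookie jar between calls, so any
        # session cookie the server sets is discarded: each request starts
        # a fresh session, and a session-based counter never advances.
        resp = requests.get(url, timeout=30)
        print(i, resp.status_code)
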