00:04:26 But yeah I'll take a look tonight and see if it works for portal maps
00:27:35 Great. I wouldn't like to be the one to do it but if nobody else would at the moment, then perhaps
00:32:43 Poke me if I forget but I'll spend some time tonight documenting it and put it on GitHub
01:48:28 https://en.wikipedia.org/wiki/Republic_of_Artsakh has just ceased to exist in the last few days
01:50:26 Fortunately thanks to gooshka we've been running https://artsakhlib.am/ for a while (but that site's also super slow and errors out if you run it faster)
01:50:39 I think we might have run some of their other sites too, but it'd be good to re-run them
07:27:01 i'm looking into a project for hardware info since they have crazy strict rate limits
07:30:46 if there are any "official" sites or youtube channels of https://en.wikipedia.org/wiki/Republic_of_Artsakh we should archive them in #archivebot (sites) and #down-the-tube (youtube)
07:32:05 https://www.spyur.am/
07:34:38 arkiver: hardware info?
07:42:59 fireonlive: went read only on january 1, see deathwatch
07:43:39 ah! thanks
07:43:51 i went google instead for some reason -_-
07:44:01 i of all people should know the wiki :D
07:51:43 https://www.spyur.am seems to have strict cloudflare unfortunately :/
07:52:02 (though it also sounds like it's Armenia in general, not Artsakh)
08:52:03 manu|m & others here: looks like c3 is going to be deleting a number of matrix channels for the event (via the irc bridge to #37c3-hall-1 and others)
08:52:19 unsure if there's a way to save matrix channels & its attachments/threads/etc or ?
08:52:33 message: PSA: This channel is a candidate for deletion. If you think this is a mistake, please let us know by replying to this message. Otherwise we are going to delete the channel in a few days. Thanks for using the matrix event chat, we are happy to hear your feedback:
08:52:33 https://events.ccc.de/congress/2023/hub/wiki/Feedback/
08:53:14 account for that seems to be @admin:events.ccc.de
08:53:19 (/whois)
09:02:51 (just in bed, but just got a ping that they're doing this for data privacy reasons; before we rush into this)
09:03:07 (will respond/ask qs tomorrow)
09:17:53 JAA: what directory contained the bulk of the size of archive.mozilla.org from your recent scan?
09:18:04 (CC Ryz )
15:03:10 Hi all, I'm trying to get an old download from the Microsoft Download Center, which no longer seems to be available. I stumbled upon this page (https://wiki.archiveteam.org/index.php/Microsoft_Download_Center) which states that everything was archived. I found the file I'm looking for in the index (msxml6_SDK.msi), and the way I understand it, that
15:03:10 file should be findable in https://archive.org/details/archiveteam_microsoft_download?sort=title . However, I am completely confused as to how to find the file there. It seems to me that a number of files are bunched together into large downloads, but I can't figure out for the life of me in which one of those large downloads the file I'm looking
15:03:11 for is located. Is there any documentation or something that I'm missing?
15:06:26 BenjaminKrausseDB: You probably want to download the index https://archive.org/details/microsoft_download_center_html_index_2020-08 which will tell you which warc contains which URL (file), and then use something like pywb to replay the warc and extract the file.
15:11:07 Thanks for the link, I found the file I'm looking for in there, I'm just not sure where to go from there. Or is it the ID I'm looking for?
15:12:13 Essentially this is what I found:
15:12:22 ~~~•Microsoft Core XML Services (MSXML) 6.0
MSXML 6.0 (MSXML6) has improved reliability, security, conformance with the XML 1.0 and XML Schema 1.0 W3C Recommendations, and compatibility with System.Xml 2.0.
15:12:22 href="https://web.archive.org/web/20200801/https://www.microsoft.com/en-us/download/details.aspx?id=3988">Original page)
15:12:23 href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6.msi">msxml6.msi (1.5MB)
15:12:23 href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi">msxml6_ia64.msi (3.6MB)~~~
15:13:11 Those archive.org links seem to work for me and start a download
15:13:18 so I guess that's exactly what you want!
15:14:12 BenjaminKrausseDB2: <@Sanqui> Those archive.org links seem to work for me and start a download
15:14:12 <@Sanqui> so I guess that's exactly what you want!
15:14:51 OK, weird, they're not working here. I'll try those links on a different device...
15:16:15 if the download doesn't start, try putting "id_" after the timestamp in the url, as such:
15:16:20 https://web.archive.org/web/20200803205234id_/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi
15:16:27 might have better compatibility
15:18:18 yes that goes directly to a 3MB binary file
15:18:51 OK, it worked on my phone. I suspect my work network is blocking something (although usually it says something, not sure what my IT department pulled off this time). Thanks for the help!
15:19:22 No prob, good luck getting that Itanic working!
15:26:08 Thanks, I think I'll need the luck the way this has been going up until now '=D
15:40:01 Got it working! Thanks for the help and all the work you guys do!
16:21:26 ^_^
17:53:42 FireonLive edited Deathwatch (+371, add bear.community): https://wiki.archiveteam.org/?diff=51457&oldid=51455
17:53:49 that was fast
17:54:23 luck of the cron
17:57:42 FireonLive edited Current Projects (+78, add pastebin): https://wiki.archiveteam.org/?diff=51458&oldid=51407
17:57:43 FireonLive edited Pastebin (+24, DPoS): https://wiki.archiveteam.org/?diff=51459&oldid=47706
17:59:42 FireonLive edited Pastebin (+23, add CTA, make more secure): https://wiki.archiveteam.org/?diff=51460&oldid=51459
18:50:17 speaking of pastebin, i've noticed that the project code makes no attempt to extract outlinks from paste content. is that a deliberate choice?
18:55:47 hmmm. lots of spam there, but i think it's an older project so maybe not?
18:56:33 yeah, hence my uncertainty
18:58:43 arkiver?
19:00:06 could be a good source for links to filesharing projects (like mediafire or zippyshare) since it's often used as an aggregator
19:01:28 (i know of at least one subreddit that bans download links, to avoid the attention of site admins, but tacitly encourages pastebins of same)
19:12:53 speaking of hidden URLs, have projects ever made an effort to catch base64-encoded urls?
19:14:07 using rot13 or base64, some file sharing communities hide mega, mediafire URLs from bots that issue DMCA takedowns
19:15:33 I question if those particular links are the kind of thing we want to archive >.>
19:16:20 sure
19:18:15 bocci_: no, afaik no projects have ever implemented that kind of filter-evasion matching
19:18:18 (there's some attempt to repair broken urls, but mainly for accidental syntax-mangling)
19:19:02 thanks, i just wanted to know/make it known
19:19:56 an example of a history of these encoded links being used:
19:19:57 https://warosu.org/ic/thread/6960541#p6960541
19:20:16 nicolas17: it can be legit. i remember doing a bunch of those manually during the zippyshare project--they were video game mods from some forum crawl
19:28:09 !tell Doranwen do you have a wiki account?
19:28:09 -eggdrop- [tell] ok, I'll tell Doranwen when they join next
19:28:43 ah yeah, base64 has been used a lot in /r/piracy wiki i think?
19:28:47 or some reddit wiki
19:29:22 for the record, the strings aren't random or encrypted
19:29:35 a base64-encoded https link always starts with aHR0cHM6Ly
19:30:11 and mediafire links aren't hard to spot once you memorize the pattern
19:30:25 https://www.mediafire.com/file/not-real
19:30:31 https://www.mediafire.com/file/some-file
19:30:42 aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg==
19:30:47 aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL3NvbWUtZmlsZQo=
19:31:03 i guess you'd want to look for aHR0cHM6Ly8 and aHR0cDovLw (https:// and http://)
19:31:10 oh no 8
19:31:25 interesting idea though i like it
19:33:10 would miss protocol-stripped links, but you'd have to get really aggressively heuristic to catch the general case, soz
19:33:16 interesting, i concur
19:35:40 i think you can find protocol-stripped links automatically without some crazy heuristic
19:35:47 if you limit yourself to some hosts
19:36:32 d3d3Lm1lZGlhZmlyZS5jb20K = www.mediafire.com
19:37:08 it's such a specific string, you wouldn't have any false positives
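The prefix-matching idea discussed above (anchoring on the base64 encodings of "https://" and "http://") can be sketched roughly as follows. This is an illustrative Python sketch, not code from any Archive Team project; the function name and regex are made up for the example:

```python
import base64
import re

# Sketch: find base64-encoded http(s) URLs in free text.
# A base64 encoding of "https://" always begins with "aHR0cHM6Ly",
# and of "http://" with "aHR0cDovL", so those prefixes anchor the search.
B64_URL_RE = re.compile(r'\b(?:aHR0cHM6Ly|aHR0cDovL)[A-Za-z0-9+/=]+')

def find_encoded_urls(text):
    urls = []
    for match in B64_URL_RE.finditer(text):
        token = match.group(0)
        # Pad to a multiple of 4 so b64decode accepts unpadded tokens.
        token += '=' * (-len(token) % 4)
        try:
            decoded = base64.b64decode(token).decode('utf-8', errors='replace')
        except Exception:
            continue  # truncated or malformed match; skip it
        urls.append(decoded.strip())
    return urls
```

Note this only catches URLs encoded standalone and starting at a 3-byte boundary; a URL embedded mid-stream in a larger base64 blob would encode to a different byte alignment and escape the prefix match.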
19:37:21 correct, but due to the way we backfeed discovered urls between projects, that could get awkward to maintain
19:38:27 i have no idea about that
19:41:18 i suppose for pastebin itself someone could make something bespoke to scrape the warcs
19:48:07 fireonlive: someone has :P
19:51:24 by which i mean JAA's done a horrible one-liner a couple of times.
19:54:39 bocci_: basically, if a project discovers outlinks, it sends them to the general urls project (#//), which checks them against the list of site-specific projects and forwards them appropriately if there's a match
19:54:48 if every project were to discover obfuscated outlinks to a specific list of hosts, then every project would need the list of site-specific projects
19:55:42 and keeping an n:n system consistent is hell compared to 1:n
19:55:44 ah :D
19:56:51 hmmmm. i guess you could use those 'indicators' for b64 http/https and do further local processing if found?
19:57:00 then ship it to urls as normal?
19:57:14 right
19:57:25 sounds fun :)
20:14:41 -+rss- Niklaus Wirth Passed Away: https://twitter.com/Bertrand_Meyer/status/1742613897675178347 https://news.ycombinator.com/item?id=38858012
20:14:42 nitter: https://nitter.net/Bertrand_Meyer/status/1742613897675178347
20:16:25 You would also need to account for all the different possible capitalizations of http:// and https:// since that would change the base64
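To illustrate the capitalization point: each casing of the scheme encodes to a completely different base64 prefix, so a single-prefix match only catches the all-lowercase form. A quick demonstration:

```python
import base64

# The same scheme in different capitalizations yields
# entirely different base64 prefixes.
for scheme in ("https://", "HTTPS://", "Https://"):
    print(scheme, "->", base64.b64encode(scheme.encode()).decode())
# https:// -> aHR0cHM6Ly8=
# HTTPS:// -> SFRUUFM6Ly8=
# Https:// -> SHR0cHM6Ly8=
```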
21:17:26 iOS 17.3 beta 2 was released today, and soon it was discovered that it caused iPhones with a certain feature enabled to boot-loop, so 3 hours later it was pulled from the update server
21:18:27 they *might* delete the actual files from the CDN too... sum of all variants is 239GB, is this too much? would it work on AB or urls?
21:18:29 JAA: ^
21:23:59 dumb question: what's wrong with just downloading the files and uploading to an archive.org collection if you wish to archive them
21:24:48 I could, and I have done that for files that were *already* deleted but I recovered from elsewhere
21:25:07 but then it won't work on WBM
21:25:23 oh
21:26:03 i've felt wrong for using the WBM for large files
21:26:04 and with my Internet it would take 20 hours to upload, but upload speeds *to IA* are usually worse
21:27:35 i kinda had the sense that directly hitting images/files on the WBM was an unintended effect of saving web pages
21:28:05 wayback machine is for webpages
21:28:12 i think im wrong
21:28:21 idk, that's why I'm asking first :P
21:35:26 bocci_: nothing wrong with having files in the wbm--in fact it's good, because it's more authoritative _and_ more discoverable than just having them somewhere on archive.org
21:35:30 (if you find a link somewhere and it's dead, it's a lot easier to plug the url into the wbm than to search around and maybe find a relevant item and maybe find the file within the item and hope it's correct)
21:35:34 buuut there's a lot of duct tape involved, so idk how large is too large either
21:36:37 it's 34 files from 6363 MiB to 7756 MiB
21:37:28 in total or each?
21:38:02 as I said total is 239GB x_x
21:39:00 MiB|url: https://paste.debian.net/1302977/
21:43:38 nicolas17: doing it via AB is probably fine
21:44:25 just got to make sure it ends up on firepipe (1.44 TiB free) or addax (524 GiB free) per http://archivebot.com/pipelines
21:45:55 an !ao < list of https://transfer.archivete.am/inline/zkuP2/ios_17.3_beta_2_cdn_urls.txt (which deliberately includes that paste at the top as a small file) should be fine, I'll run it unless you've got a different plan
22:59:37 arkiver: Re archive.mozilla.org, I don't remember, but I believe I posted the link to the full JSONL scan output here some weeks ago.
23:00:47 Is jsonl the same as ndjson?
23:02:26 thuban: Can confirm, have written such horrible one-liners. 60% of the time, they work every time!
23:03:05 audrooku|m: Yes
23:04:33 Also referred to as 'JSON Lines' and some other variations. But .jsonl is the common file extension, and application/jsonl is the proposed media type.
23:05:13 Also 'Line-Delimited JSON', which has absolutely no potential of confusion with the entirely unrelated JSON-LD.
23:06:14 nicolas17, pokechu22: Yes, fine with AB. Large pipeline's a good idea, but if all pipelines are full, !ao < should end up on firepipe-ao anyway (unless that's full as well, didn't check).
23:06:41 (Of course, firepipe-ao won't run jobs queued with --pipeline.)
23:07:20 <@JAA> arkiver: [...] I believe I posted the link to the full JSONL scan output here some weeks ago.
23:07:21 It looked good as of an hour ago (I also see you got rid of addax-ao, which I guess makes sense because firepipe-ao receives jobs much faster)
23:07:23 https://transfer.archivete.am/a0mjU/archive.mozilla.org-files.jsonl.zst
23:07:29 (https://hackint.logs.kiska.pw/archiveteam-bs/20231118#c390573)
23:08:56 pokechu22: Yeah, that's why. jap-addax-ao was taking a minute or more to dequeue a job, just horrendous.
23:10:08 It's running (ab job ew2dbtuft08uz2xe0tf4lhlcv)
23:12:54 :-)
23:13:26 ^_^
23:13:56 JAA, any thoughts on the wiki changes suggested in #//?
23:27:34 https://www.polygon.com/24024266/kim-kardashian-mobile-game-shutting-down-glu-mobile
23:30:51 https://www.eurogamer.net/stray-souls-developer-shuts-down-following-publishers-closure-cyberbullying-and-poor-sales
23:31:41 Doesn't look like Stray Souls has a website anymore, but they do have a Twitter if someone could throw it in AB. https://twitter.com/jukaistudio
23:31:42 nitter: https://nitter.net/jukaistudio
23:35:42 added it to next on the pad for when one of the two active finish
23:44:39 https://developer.apple.com/documentation/ios-ipados-release-notes/ios-ipados-17_3-release-notes now finally acknowledging the issue
23:47:24 archivebotted