00:04:26 But yeah I'll take a look tonight and see if it works for portal maps
00:27:35 Great. I wouldn't like to be the one to do it but if nobody else would at the moment, then perhaps
00:32:43 Poke me if I forget but I'll spend some time tonight documenting it and put it on GitHub
01:48:28 https://en.wikipedia.org/wiki/Republic_of_Artsakh has just ceased to exist in the last few days
01:50:26 Fortunately thanks to gooshka we've been running https://artsakhlib.am/ for a while (but that site's also super slow and errors out if you run it faster)
01:50:39 I think we might have run some of their other sites too, but it'd be good to re-run them
07:27:01 i'm looking into a project for hardware info since they have crazy strict rate limits
07:30:46 if there are any "official" sites or youtube channels of https://en.wikipedia.org/wiki/Republic_of_Artsakh we should archive them in #archivebot (sites) and #down-the-tube (youtube)
07:32:05 https://www.spyur.am/
07:34:38 arkiver: hardware info?
07:42:59 fireonlive: went read only on january 1, see deathwatch
07:43:39 ah! thanks
07:43:51 i went google instead for some reason -_-
07:44:01 i of all people should know the wiki :D
07:51:43 https://www.spyur.am seems to have strict cloudflare unfortunately :/
07:52:02 (though it also sounds like it's Armenia in general, not Artsakh)
08:52:03 manu|m & others here: looks like c3 is going to be deleting a number of matrix channels for the event (via the irc bridge to #37c3-hall-1 and others)
08:52:19 unsure if there's a way to save matrix channels & its attachments/threads/etc or ?
08:52:33 message: PSA: This channel is a candidate for deletion. If you think this is a mistake, please let us know by replying to this message. Otherwise we are going to delete the channel in a few days. Thanks for using the matrix event chat, we are happy to hear your feedback:
08:52:33 https://events.ccc.de/congress/2023/hub/wiki/Feedback/
08:53:14 account for that seems to be @admin:events.ccc.de
08:53:19 (/whois)
09:02:51 (just in bed, but just got a ping that they're doing this for data privacy reasons; before we rush into this)
09:03:07 (will respond/ask qs tomorrow)
09:17:53 JAA: what directory contained the bulk of the size of archive.mozilla.org from your recent scan?
09:18:04 (CC Ryz )
15:03:10 Hi all, I'm trying to get an old download from the Microsoft Download Center, which no longer seems to be available. I stumbled upon this page (https://wiki.archiveteam.org/index.php/Microsoft_Download_Center) which states that everything was archived. I found the file I'm looking for in the index (msxml6_SDK.msi), and the way I understand it, that
15:03:10 file should be findable in https://archive.org/details/archiveteam_microsoft_download?sort=title . However, I am completely confused as to how to find the file there. It seems to me that a number of files are bunched together into large downloads, but I can't figure out for the life of me in which one of those large downloads the file I'm looking
15:03:11 for is located. Is there any documentation or something that I'm missing?
15:06:26 BenjaminKrausseDB: You probably want to download the index https://archive.org/details/microsoft_download_center_html_index_2020-08 which will tell you which warc contains which URL (file), and then use something like pywb to replay the warc and extract the file.
15:11:07 Thanks for the link, I found the file I'm looking for in there, I'm just not sure where to go from there. Or is it the ID I'm looking for?
15:12:13 Essentially this is what I found:
15:12:22 ~~~•Microsoft Core XML Services (MSXML) 6.0
MSXML 6.0 (MSXML6) has improved reliability, security, conformance with the XML 1.0 and XML Schema 1.0 W3C Recommendations, and compatibility with System.Xml 2.0.
15:12:22 href="https://web.archive.org/web/20200801/https://www.microsoft.com/en-us/download/details.aspx?id=3988">Original page)
15:12:23 href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6.msi">msxml6.msi (1.5MB)
15:12:23 href="https://web.archive.org/web/20200801/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi">msxml6_ia64.msi (3.6MB)~~~
15:13:11 Those archive.org links seem to work for me and start a download
15:13:18 so I guess that's exactly what you want!
15:14:12 BenjaminKrausseDB2: <@Sanqui> Those archive.org links seem to work for me and start a download
15:14:12 <@Sanqui> so I guess that's exactly what you want!
15:14:51 OK, weird, they're not working here. I'll try those links on a different device...
15:16:15 if the download doesn't start, try putting "id_" after the timestamp in the url, as such:
15:16:20 https://web.archive.org/web/20200803205234id_/https://download.microsoft.com/download/2/e/0/2e01308a-e17f-4bf9-bf48-161356cf9c81/msxml6_ia64.msi
15:16:27 might have better compatibility
15:18:18 yes that goes directly to a 3MB binary file
15:18:51 OK, it worked on my phone. I suspect my work network is blocking something (although usually it says something, not sure what my IT department pulled off this time). Thanks for the help!
15:19:22 No prob, good luck getting that Itanic working!
15:26:08 Thanks, I think I'll need the luck the way this has been going up until now '=D
15:40:01 Got it working! Thanks for the help and all the work you guys do!
16:21:26 ^_^
17:53:42 FireonLive edited Deathwatch (+371, add bear.community): https://wiki.archiveteam.org/?diff=51457&oldid=51455
17:53:49 that was fast
17:54:23 luck of the cron
17:57:42 FireonLive edited Current Projects (+78, add pastebin): https://wiki.archiveteam.org/?diff=51458&oldid=51407
17:57:43 FireonLive edited Pastebin (+24, DPoS): https://wiki.archiveteam.org/?diff=51459&oldid=47706
17:59:42 FireonLive edited Pastebin (+23, add CTA, make more secure): https://wiki.archiveteam.org/?diff=51460&oldid=51459
18:50:17 speaking of pastebin, i've noticed that the project code makes no attempt to extract outlinks from paste content. is that a deliberate choice?
18:55:47 hmmm. lots of spam there, but i think it's an older project so maybe not?
18:56:33 yeah, hence my uncertainty
18:58:43 arkiver?
19:00:06 could be a good source for links to filesharing projects (like mediafire or zippyshare) since it's often used as an aggregator
19:01:28 (i know of at least one subreddit that bans download links, to avoid the attention of site admins, but tacitly encourages pastebins of same)
19:12:53 speaking of hidden URLs, have projects ever made an effort to catch base64-encoded urls?
19:14:07 using rot13 or base64, some file sharing communities hide mega, mediafire URLs from bots that issue DMCA takedowns
19:15:33 I question if those particular links are the kind of thing we want to archive >.>
19:16:20 sure
19:18:15 bocci_: no, afaik no projects have ever implemented that kind of filter-evasion matching
19:18:18 (there's some attempt to repair broken urls, but mainly for accidental syntax-mangling)
19:19:02 thanks, i just wanted to know/make it known
19:19:56 an example of a history of these encoded links being used:
19:19:57 https://warosu.org/ic/thread/6960541#p6960541
19:20:16 nicolas17: it can be legit. i remember doing a bunch of those manually during the zippyshare project--they were video game mods from some forum crawl
19:28:09 !tell Doranwen do you have a wiki account?
19:28:09 -eggdrop- [tell] ok, I'll tell Doranwen when they join next
19:28:43 ah yeah, base64 has been used a lot in /r/piracy wiki i think?
19:28:47 or some reddit wiki
19:29:22 for the record, the strings aren't random or encrypted
19:29:35 a base64-encoded https link always starts with aHR0cHM6Ly
19:30:11 and mediafire links aren't hard to spot once you memorize the pattern
19:30:25 https://www.mediafire.com/file/not-real
19:30:31 https://www.mediafire.com/file/some-file
19:30:42 aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg==
19:30:47 aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL3NvbWUtZmlsZQo=
19:31:03 i guess you'd want to look for aHR0cHM6Ly8 and aHR0cDovLw (https:// and http://)
19:31:10 oh no 8
19:31:25 interesting idea though i like it
19:33:10 would miss protocol-stripped links, but you'd have to get really aggressively heuristic to catch the general case, soz
19:33:16 interesting, i concur
19:35:40 i think you can find protocol-stripped links automatically without some crazy heuristic
19:35:47 if you limit yourself to some hosts
19:36:32 d3d3Lm1lZGlhZmlyZS5jb20K = www.mediafire.com
19:37:08 it's such a specific string, you wouldn't have any false positives
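The prefix-matching idea discussed above (anchoring on the base64 encodings of "https://" and "http://") can be sketched roughly as follows. This is an illustrative Python sketch, not code from any Archive Team project; the function name and regex are made up for the example:

```python
import base64
import re

# Sketch: find base64-encoded http(s) URLs in free text.
# A base64 encoding of "https://" always begins with "aHR0cHM6Ly",
# and of "http://" with "aHR0cDovL", so those prefixes anchor the search.
B64_URL_RE = re.compile(r'\b(?:aHR0cHM6Ly|aHR0cDovL)[A-Za-z0-9+/=]+')

def find_encoded_urls(text):
    urls = []
    for match in B64_URL_RE.finditer(text):
        token = match.group(0)
        # Pad to a multiple of 4 so b64decode accepts unpadded tokens.
        token += '=' * (-len(token) % 4)
        try:
            decoded = base64.b64decode(token).decode('utf-8', errors='replace')
        except Exception:
            continue  # truncated or malformed match; skip it
        urls.append(decoded.strip())
    return urls
```

Note this only catches URLs encoded standalone and starting at a 3-byte boundary; a URL embedded mid-stream in a larger base64 blob would encode to a different byte alignment and escape the prefix match.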
19:37:21 correct, but due to the way we backfeed discovered urls between projects, that could get awkward to maintain
19:38:27 i have no idea about that
19:41:18 i suppose for pastebin itself someone could make something bespoke to scrape the warcs
19:48:07 fireonlive: someone has :P
19:51:24 by which i mean JAA's done a horrible one-liner a couple of times.
19:54:39 bocci_: basically, if a project discovers outlinks, it sends them to the general urls project (#//), which checks them against the list of site-specific projects and forwards them appropriately if there's a match
19:54:48 if every project were to discover obfuscated outlinks to a specific list of hosts, then every project would need the list of site-specific projects
19:55:42 and keeping an n:n system consistent is hell compared to 1:n
19:55:44 ah :D
19:56:51 hmmmm. i guess you could use those 'indicators' for b64 http/https and do further local processing if found?
19:57:00 then ship it to urls as normal?
19:57:14 right
19:57:25 sounds fun :)
20:14:41 -+rss- Niklaus Wirth Passed Away: https://twitter.com/Bertrand_Meyer/status/1742613897675178347 https://news.ycombinator.com/item?id=38858012
20:14:42 nitter: https://nitter.net/Bertrand_Meyer/status/1742613897675178347
20:16:25 You would also need to account for all the different possible capitalizations of http:// and https:// since that would change the base64
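To illustrate the capitalization point: each casing of the scheme encodes to a completely different base64 prefix, so a single-prefix match only catches the all-lowercase form. A quick demonstration:

```python
import base64

# The same scheme in different capitalizations yields
# entirely different base64 prefixes.
for scheme in ("https://", "HTTPS://", "Https://"):
    print(scheme, "->", base64.b64encode(scheme.encode()).decode())
# https:// -> aHR0cHM6Ly8=
# HTTPS:// -> SFRUUFM6Ly8=
# Https:// -> SHR0cHM6Ly8=
```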
21:17:26 iOS 17.3 beta 2 was released today, and soon it was discovered that it caused iPhones with a certain feature enabled to boot-loop, so 3 hours later it was pulled from the update server
21:18:27 they *might* delete the actual files from the CDN too... sum of all variants is 239GB, is this too much? would it work on AB or urls?
21:18:29 JAA: ^
21:23:59 dumb question: what's wrong with just downloading the files and uploading to an archive.org collection if you wish to archive them
21:24:48 I could, and I have done that for files that were *already* deleted but I recovered from elsewhere
21:25:07 but then it won't work on WBM
21:25:23 oh
21:26:03 i've felt wrong for using the WBM for large files
21:26:04 and with my Internet it would take 20 hours to upload, but upload speeds *to IA* are usually worse
21:27:35 i kinda had the sense that directly hitting images/files on the WBM was an unintended effect of saving web pages
21:28:05 wayback machine is for webpages
21:28:12 i think im wrong
21:28:21 idk, that's why I'm asking first :P
21:35:26 bocci_: nothing wrong with having files in the wbm--in fact it's good, because it's more authoritative _and_ more discoverable than just having them somewhere on archive.org
21:35:30 (if you find a link somewhere and it's dead, it's a lot easier to plug the url into the wbm than to search around and maybe find a relevant item and maybe find the file within the item and hope it's correct)
21:35:34 buuut there's a lot of duct tape involved, so idk how large is too large either
21:36:37 it's 34 files from 6363 MiB to 7756 MiB
21:37:28 in total or each?
21:38:02 as I said total is 239GB x_x
21:39:00 MiB|url: https://paste.debian.net/1302977/
21:43:38 nicolas17: doing it via AB is probably fine
21:44:25 just got to make sure it ends up on firepipe (1.44 TiB free) or addax (524 GiB free) per http://archivebot.com/pipelines
21:45:55 an !ao < list of https://transfer.archivete.am/inline/zkuP2/ios_17.3_beta_2_cdn_urls.txt (which deliberately includes that paste at the top as a small file) should be fine, I'll run it unless you've got a different plan
22:59:37 arkiver: Re archive.mozilla.org, I don't remember, but I believe I posted the link to the full JSONL scan output here some weeks ago.
23:00:47 Is jsonl the same as ndjson?
23:02:26 thuban: Can confirm, have written such horrible one-liners. 60% of the time, they work every time!
23:03:05 audrooku|m: Yes
23:04:33 Also referred to as 'JSON Lines' and some other variations. But .jsonl is the common file extension, and application/jsonl is the proposed media type.
23:05:13 Also 'Line-Delimited JSON', which has absolutely no potential of confusion with the entirely unrelated JSON-LD.
23:06:14 nicolas17, pokechu22: Yes, fine with AB. Large pipeline's a good idea, but if all pipelines are full, !ao < should end up on firepipe-ao anyway (unless that's full as well, didn't check).
23:06:41 (Of course, firepipe-ao won't run jobs queued with --pipeline.)
23:07:20 <@JAA> arkiver: [...] I believe I posted the link to the full JSONL scan output here some weeks ago.
23:07:21 It looked good as of an hour ago (I also see you got rid of addax-ao, which I guess makes sense because firepipe-ao receives jobs much faster)
23:07:23 https://transfer.archivete.am/a0mjU/archive.mozilla.org-files.jsonl.zst
23:07:29 (https://hackint.logs.kiska.pw/archiveteam-bs/20231118#c390573)
23:08:56 pokechu22: Yeah, that's why. jap-addax-ao was taking a minute or more to dequeue a job, just horrendous.
23:10:08 It's running (ab job ew2dbtuft08uz2xe0tf4lhlcv)
23:12:54 :-)
23:13:26 ^_^
23:13:56 JAA, any thoughts on the wiki changes suggested in #//?
23:27:34 https://www.polygon.com/24024266/kim-kardashian-mobile-game-shutting-down-glu-mobile
23:30:51 https://www.eurogamer.net/stray-souls-developer-shuts-down-following-publishers-closure-cyberbullying-and-poor-sales
23:31:41 Doesn't look like Stray Souls has a website anymore, but they do have a Twitter if someone could throw it in AB. https://twitter.com/jukaistudio
23:31:42 nitter: https://nitter.net/jukaistudio
23:35:42 added it to next on the pad for when one of the two active finish
23:44:39 https://developer.apple.com/documentation/ios-ipados-release-notes/ios-ipados-17_3-release-notes now finally acknowledging the issue
23:47:24 archivebotted