-
mgrandi
But yeah I'll take a look tonight and see if it works for portal maps
-
Pedrosso
Great. I wouldn't like to be the one to do it but if nobody else would at the moment, then perhaps
-
mgrandi
Poke me if I forget but I'll spend some time tonight documenting it and put it on GitHub
-
flashfire42
en.wikipedia.org/wiki/Republic_of_Artsakh has just ceased to exist in the last few days
-
pokechu22
Fortunately thanks to gooshka we've been running artsakhlib.am for a while (but that site's also super slow and errors out if you run it faster)
-
pokechu22
I think we might have run some of their other sites too, but it'd be good to re-run them
-
arkiver
i'm looking into a project for hardware info since they have crazy strict rate limits
-
arkiver
if there are any "official" sites or youtube channels of en.wikipedia.org/wiki/Republic_of_Artsakh we should archive them in #archivebot (sites) and #down-the-tube (youtube)
-
fireonlive
arkiver: hardware info?
-
arkiver
fireonlive: went read only on january 1, see deathwatch
-
fireonlive
ah! thanks
-
fireonlive
i went google instead for some reason -_-
-
fireonlive
i of all people should know the wiki :D
-
pokechu22
spyur.am seems to have strict cloudflare unfortunately :/
-
pokechu22
(though it also sounds like it's Armenia in general, not Artsakh)
-
fireonlive
manu|m & others here: looks like c3 is going to be deleting a number of matrix channels for the event (via the irc bridge to #37c3-hall-1 and others)
-
fireonlive
unsure if there's a way to save matrix channels & its attachments/threads/etc or ?
-
fireonlive
message: <admintechnicaladministrationnoc> PSA: This channel is a candidate for deletion. If you think this is a mistake, please let us know by replying to this message. Otherwise we are going to delete the channel in a few days. Thanks for using the matrix event chat, we are happy to hear your feedback:
-
fireonlive
-
fireonlive
account for that seems to be @admin:events.ccc.de
-
fireonlive
(/whois)
-
fireonlive
(just in bed, but just got a ping they're doing this for data privacy reasons before we rush into this)
-
fireonlive
(will respond/ask qs tomorrow)
-
arkiver
JAA: what directory contained the bulk of the size of archive.mozilla.org from your recent scan?
-
arkiver
(CC Ryz )
-
BenjaminKrausseDB
Hi all, I'm trying to get an old download from the Microsoft Download Center, which no longer seems to be available. I stumbled upon this page (wiki.archiveteam.org/index.php/Microsoft_Download_Center) which states that everything was archived. I found the file I'm looking for in the index (msxml6_SDK.msi), and the way I understand it, that
-
BenjaminKrausseDB
file should be findable in archive.org/details/archiveteam_microsoft_download?sort=title . However, I am completely confused as to how to find the file there. It seems to me that a number of files are bunched together into large downloads, but I can't figure out for the life of me in which one of those large downloads the file I'm looking
-
BenjaminKrausseDB
for is located. Is there any documentation or something that I'm missing?
-
Sanqui
BenjaminKrausseDB: You probably want to download the index archive.org/details/microsoft_download_center_html_index_2020-08 which will tell you which warc contains which URL (file), and then use something like pywb to replay the warc and extract the file.
-
BenjaminKrausseDB
Thanks for the link, I found the file I'm looking for in there, I'm just not sure where to go from there. Or is it the ID I'm looking for?
-
BenjaminKrausseDB
Essentially this is what I found:
-
BenjaminKrausseDB
~~~<h3 id="3988"><a href="#3988">•</a>Microsoft Core XML Services (MSXML) 6.0 </h3><p>MSXML 6.0 (MSXML6) has improved reliability, security, conformance with the XML 1.0 and XML Schema 1.0 W3C Recommendations, and compatibility with System.Xml 2.0.</p>
-
BenjaminKrausseDB
-
BenjaminKrausseDB
-
BenjaminKrausseDB
-
Sanqui
Those archive.org links seem to work for me and start a download
-
Sanqui
so I guess that's exactly what you want!
-
Sanqui
BenjaminKrausseDB2: <@Sanqui> Those archive.org links seem to work for me and start a download
-
Sanqui
<@Sanqui> so I guess that's exactly what you want!
-
BenjaminKrausseDB
OK, weird, they're not working here. I'll try those links on a different device...
-
Sanqui
if the download doesn't start, try putting "id_" after the timestamp in the url, as such:
-
Sanqui
-
Sanqui
might have better compatibility
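
The "id_" tip above can be sketched as a small helper. This is only an illustration: the regular expression assumes the usual web.archive.org/web/<timestamp>/<original URL> layout, and the function name and the example.com URL are made up for the example, not taken from the conversation.

```python
import re

# Assumed layout: https://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
_WAYBACK_RE = re.compile(
    r"^(https?://web\.archive\.org/web/)(\d{1,14})([a-z]{2}_)?/(.+)$"
)

def to_raw_capture_url(url: str) -> str:
    """Return the capture URL with "id_" placed directly after the timestamp."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a Wayback Machine capture URL: {url!r}")
    prefix, timestamp, _old_modifier, original = match.groups()
    # Any existing two-letter modifier (for example "if_") is replaced by "id_".
    return f"{prefix}{timestamp}id_/{original}"

# Hypothetical capture URL, used only to show the rewrite.
example = "https://web.archive.org/web/20200801000000/https://example.com/file.msi"
print(to_raw_capture_url(example))
```

With "id_" in place the Wayback Machine serves the stored bytes without rewriting the content, which is why it tends to behave better for binary downloads.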
-
nicolas17
yes that goes directly to a 3MB binary file
-
BenjaminKrausseDB
OK, it worked on my phone. I suspect my work network is blocking something (although usually it says something, not sure what my IT department pulled off this time). Thanks for the help!
-
Sanqui
No prob, good luck getting that Itanic working!
-
BenjaminKrausseDB
Thanks, I think I'll need the luck the way this has been going up until now '=D
-
BenjaminKrausseDB
Got it working! Thanks for the help and all the work you guys do!
-
fireonlive
^_^
-
h2ibot
FireonLive edited Deathwatch (+371, add bear.community):
wiki.archiveteam.org/?diff=51457&oldid=51455
-
fireonlive
that was fast
-
fireonlive
luck of the cron
-
h2ibot
FireonLive edited Current Projects (+78, add pastebin):
wiki.archiveteam.org/?diff=51458&oldid=51407
-
h2ibot
FireonLive edited Pastebin (+23, add CTA, make more secure):
wiki.archiveteam.org/?diff=51460&oldid=51459
-
thuban
speaking of pastebin, i've noticed that the project code makes no attempt to extract outlinks from paste content. is that a deliberate choice?
-
fireonlive
hmmm. lots of spam there, but i think it's an older project so maybe not?
-
thuban
yeah, hence my uncertainty
-
fireonlive
arkiver?
-
thuban
could be a good source for links to filesharing projects (like mediafire or zippyshare) since it's often used as an agglomerator
-
thuban
(i know of at least one subreddit that bans download links, to avoid the attention of site admins, but tacitly encourages pastebins of same)
-
bocci_
speaking of hidden URLs, have projects ever made an effort to catch base64-encoded urls?
-
bocci_
using rot13 or base64, some file sharing communities hide mega, mediafire URLs from bots that issue DMCA takedowns
-
nicolas17
I question if those particular links are the kind of thing we want to archive >.>
-
bocci_
sure
-
thuban
bocci_: no, afaik no projects have ever implemented that kind of filter-evasion matching
-
thuban
(there's some attempt to repair broken urls, but mainly for accidental syntax-mangling)
-
bocci_
thanks, i just wanted to know/make it known
-
bocci_
an example of a history of these encoded links being used:
-
bocci_
-
thuban
nicolas17: it can be legit. i remember doing a bunch of those manually during the zippyshare project--they were video game mods from some forum crawl
-
fireonlive
!tell Doranwen do you have a wiki account?
-
eggdrop
[tell] ok, I'll tell Doranwen when they join next
-
fireonlive
ah yeah, base64 has been used a lot in /r/piracy wiki i think?
-
fireonlive
or some reddit wiki
-
bocci_
for the record, the strings aren't random or encrypted
-
bocci_
a base64-encoded https link always starts with aHR0cHM6Ly
-
bocci_
and mediafire links aren't hard to spot once you memorize the pattern
-
bocci_
-
bocci_
-
bocci_
aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg==
-
bocci_
aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL3NvbWUtZmlsZQo=
-
fireonlive
ig you'd want to look for aHR0cHM6Ly8 and aHR0cDovLw (https:// and http://)
-
fireonlive
oh no 8
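
A short sketch of why the trailing character matters: base64 packs 6 bits per character, so only the characters that lie entirely inside the scheme's bits are the same for every URL. The helper name below is invented for illustration; it derives the markers that do not depend on what follows "https://" or "http://".

```python
import base64

def stable_b64_prefix(plain: bytes) -> str:
    """Base64 characters fully determined by `plain`, whatever bytes follow it."""
    full_chars = (len(plain) * 8) // 6  # each base64 character carries 6 bits
    # Append filler bytes so the encoder emits enough characters, then keep
    # only the ones that contain no filler bits.
    encoded = base64.b64encode(plain + b"\x00\x00").decode("ascii")
    return encoded[:full_chars]

HTTPS_MARKER = stable_b64_prefix(b"https://")
HTTP_MARKER = stable_b64_prefix(b"http://")
print(HTTPS_MARKER, HTTP_MARKER)

# One of the sample strings posted above, checked against both candidates.
sample = "aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg=="
print(sample.startswith(HTTPS_MARKER), sample.startswith("aHR0cHM6Ly8"))
```

The character after the stable part mixes the last bits of the scheme with the first bits of the host, so it changes with the host name; the "8" only shows up when very little follows the scheme.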
-
fireonlive
interesting idea though i like it
-
thuban
would miss protocol-stripped links, but you'd have to get really aggressively heuristic to catch the general case, soz
-
thuban
interesting, i concur
-
bocci_
i think you can find protocol-stripped links automatically without some crazy heuristic
-
bocci_
if you limit yourself to some hosts
-
bocci_
d3d3Lm1lZGlhZmlyZS5jb20K = www.mediafire.com
-
bocci_
it's such a specific string, you wouldn't have any false positives
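
One caveat for the protocol-stripped case is that the encoded form of a host depends on where it sits in the plaintext: base64 works on 3-byte groups, so there are three possible alignments. A rough sketch, with an invented helper name, that produces one search fragment per alignment:

```python
import base64

def b64_fragments(needle: bytes) -> list[str]:
    """One base64 fragment per byte alignment (offset mod 3) of `needle`.

    Each fragment contains only characters fully determined by the needle,
    so it appears whenever the needle occurs at that alignment.
    """
    fragments = []
    for offset in range(3):
        padded = b"\x00" * offset + needle
        encoded = base64.b64encode(padded + b"\x00\x00\x00").decode("ascii")
        start = (offset * 8 + 5) // 6   # skip characters touched by the filler
        end = (len(padded) * 8) // 6    # stop before bits of whatever follows
        fragments.append(encoded[start:end])
    return fragments

fragments = b64_fragments(b"www.mediafire.com")
for fragment in fragments:
    print(fragment)

# The host sits at byte offset 8 in "https://www.mediafire.com/...",
# so the offset-2 fragment is the one that matches the sample posted above.
sample = "aHR0cHM6Ly93d3cubWVkaWFmaXJlLmNvbS9maWxlL25vdC1yZWFsCg=="
print(fragments[2] in sample)
```

Searching for all three fragments covers a host at any position in the encoded text, at the cost of three patterns per host instead of one.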
-
thuban
correct, but due to the way we backfeed discovered urls between projects, that could get awkward to maintain
-
bocci_
i have no idea about that
-
fireonlive
i suppose for pastebin itself someone could make something bespoke to scrape the warcs
-
thuban
fireonlive: someone has :P
-
thuban
by which i mean JAA's done a horrible one-liner a couple of times.
-
thuban
bocci_: basically, if a project discovers outlinks, it sends them to the general urls project (#//), which checks them against the list of site-specific projects and forwards them appropriately if there's a match
-
thuban
if every project were to discover obfuscated outlinks to a specific list of hosts, then every project would need the list of site-specific projects
-
thuban
and keeping an n:n system consistent is hell compared to 1:n
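
The 1:n arrangement described above can be pictured roughly as follows. Everything here is a hypothetical sketch: the table contents, project names, and function are invented for illustration and are not the actual backfeed code.

```python
from urllib.parse import urlsplit

# Hypothetical routing table. In the 1:n design only the central URLs
# project holds this list; other projects just hand over raw outlinks.
SITE_PROJECTS = {
    "mediafire.com": "mediafire",
    "pastebin.com": "pastebin",
}

def route(url: str, default: str = "urls") -> str:
    """Pick the queue for a discovered outlink.

    Returns the site-specific project when the host matches the table,
    otherwise the general URLs project.
    """
    host = (urlsplit(url).hostname or "").lower()
    for domain, project in SITE_PROJECTS.items():
        if host == domain or host.endswith("." + domain):
            return project
    return default

print(route("https://www.mediafire.com/file/some-file"))
print(route("https://example.org/page"))
```

If every project had to recognise obfuscated links to specific hosts, each one would need its own copy of a table like SITE_PROJECTS, and every new site-specific project would mean updating all of them.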
-
fireonlive
ah :D
-
fireonlive
hmmmm. i guess you could use those 'indicators' for b64 http/https and do further local processing if found?
-
fireonlive
then ship it to urls as normal?
-
thuban
right
-
fireonlive
sounds fun :)
-
fireonlive
-
eggdrop
-
qwertyasdfuiopghjkl
You would also need to account for all the different possible capitalizations of http:// and https:// since that would change the base64
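
To get a feel for how many markers the capitalization point adds, here is a small sketch that enumerates every upper/lower-case spelling of a scheme and derives the stable base64 prefix for each. The function names are made up for the example.

```python
import base64
from itertools import product

def case_variants(scheme: str) -> list[str]:
    """All upper/lower-case spellings of `scheme`; non-letters stay as they are."""
    choices = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in scheme]
    return ["".join(combo) for combo in product(*choices)]

def markers(scheme: str) -> set[str]:
    """Stable base64 prefixes for every capitalization of `scheme`."""
    found = set()
    for variant in case_variants(scheme):
        raw = variant.encode("ascii")
        stable = (len(raw) * 8) // 6  # characters fully determined by the scheme
        encoded = base64.b64encode(raw + b"\x00\x00").decode("ascii")
        found.add(encoded[:stable])
    return found

print(len(markers("http://")), len(markers("https://")))
```

That is 16 prefixes for http:// and 32 for https://. An alternative that avoids the enumeration is to decode any plausible base64 run and compare the decoded scheme case-insensitively.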
-
nicolas17
iOS 17.3 beta 2 was released today, and soon it was discovered that it caused iPhones with a certain feature enabled to boot-loop, so 3 hours later it was pulled from the update server
-
nicolas17
they *might* delete the actual files from the CDN too... sum of all variants is 239GB, is this too much? would it work on AB or urls?
-
nicolas17
JAA: ^
-
bocci_
dumb question: what's wrong with just downloading the files and uploading to an archive.org collection if you wish to archive them
-
nicolas17
I could, and I have done that for files that were *already* deleted but I recovered from elsewhere
-
nicolas17
but then it won't work on WBM
-
bocci_
oh
-
bocci_
i've felt wrong for using the WBM for large files
-
nicolas17
and with my Internet it would take 20 hours to upload, but upload speeds *to IA* are usually worse
-
bocci_
i kinda had the sense that directly hitting images/files on the WBM was an unintended effect of saving web pages
-
bocci_
wayback machine is for webpages
-
bocci_
i think im wrong
-
nicolas17
idk, that's why I'm asking first :P
-
thuban
bocci_: nothing wrong with having files in the wbm--in fact it's good, because it's more authoritative _and_ more discoverable than just having them somewhere on archive.org
-
thuban
(if you find a link somewhere and it's dead, it's a lot easier to plug the url into the wbm than to search around and maybe find a relevant item and maybe find the file within the item and hope it's correct)
-
thuban
buuut there's a lot of duct tape involved, so idk how large is too large either
-
nicolas17
it's 34 files from 6363 MiB to 7756 MiB
-
bocci_
in total or each?
-
nicolas17
as I said total is 239GB x_x
-
pokechu22
nicolas17: doing it via AB is probably fine
-
pokechu22
just got to make sure it ends up on firepipe (1.44 TiB free) or addax (524 GiB free) per archivebot.com/pipelines
-
pokechu22
an !ao < list of transfer.archivete.am/inline/zkuP2/ios_17.3_beta_2_cdn_urls.txt (which deliberately includes that paste at the top as a small file) should be fine, I'll run it unless you've got a different plan
-
JAA
arkiver: Re archive.mozilla.org, I don't remember, but I believe I posted the link to the full JSONL scan output here some weeks ago.
-
audrooku|m
Is jsonl the same as ndjson?
-
JAA
thuban: Can confirm, have written such horrible one-liners. 60% of the time, they work every time!
-
JAA
audrooku|m: Yes
-
JAA
Also referred to as 'JSON Lines' and some other variations. But .jsonl is the common file extension, and application/jsonl is the proposed media type.
-
JAA
Also 'Line-Delimited JSON', which has absolutely no potential of confusion with the entirely unrelated JSON-LD.
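
For anyone unfamiliar with the format, JSONL is just one JSON value per line, so it can be read with the standard json module. The sketch below also sums sizes per top-level directory, in the spirit of the archive.mozilla.org question; the "url" and "size" field names and the sample records are assumptions for illustration, not the actual scan output.

```python
import io
import json
from collections import Counter
from urllib.parse import urlsplit

def read_jsonl(stream):
    """Yield one parsed value per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

def size_by_top_dir(records):
    """Sum the (assumed) "size" field per first path segment of "url"."""
    totals = Counter()
    for record in records:
        segments = urlsplit(record["url"]).path.split("/")
        top = segments[1] if len(segments) > 1 else ""
        totals[top] += record["size"]
    return totals

# Made-up sample data standing in for a scan output file.
sample = io.StringIO(
    '{"url": "https://example.org/pub/a.zip", "size": 40}\n'
    '{"url": "https://example.org/pub/b.zip", "size": 2}\n'
    '\n'
    '{"url": "https://example.org/data/c.bin", "size": 5}\n'
)
totals = size_by_top_dir(read_jsonl(sample))
print(totals.most_common())
```

Because each line is parsed on its own, a large file can be processed as a stream without loading it all into memory.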
-
JAA
nicolas17, pokechu22: Yes, fine with AB. Large pipeline's a good idea, but if all pipelines are full, !ao < should end up on firepipe-ao anyway (unless that's full as well, didn't check).
-
JAA
(Of course, firepipe-ao won't run jobs queued with --pipeline.)
-
thuban
<@JAA> arkiver: [...] I believe I posted the link to the full JSONL scan output here some weeks ago.
-
pokechu22
It looked good as of an hour ago (I also see you got rid of addax-ao, which I guess makes sense because firepipe-ao receives jobs much faster)
-
thuban
-
thuban
-
JAA
pokechu22: Yeah, that's why. jap-addax-ao was taking a minute or more to dequeue a job, just horrendous.
-
pokechu22
It's running (ab job ew2dbtuft08uz2xe0tf4lhlcv)
-
JAA
:-)
-
fireonlive
^_^
-
thuban
JAA, any thoughts on the wiki changes suggested in #//?
-
nulldata
-
nulldata
-
nulldata
Doesn't look like Stray Souls has a website anymore, but they do have a Twitter if someone could throw it in AB.
twitter.com/jukaistudio
-
fireonlive
added it to next on the pad for when one of the two active jobs finishes
-
fireonlive
archivebotted