00:33:42 wouldn't put it past x to do something dumb, JAA
13:55:57 -rss/#hackernews- Noam Chomsky 'no longer able to talk' after 'medical event': https://www.independent.co.uk/arts-entertainment/books/news/noam-chomsky-health-update-tributes-b2559831.html https://news.ycombinator.com/item?id=40641361
17:12:46 damn, I forgot *how* busy things get in apple-hacking-land after WWDC
17:14:04 there are 5 files from yesterday on https://data.nicolas17.xyz/samsung-grab/
17:14:09 and I'll be adding 6 more after lunch
20:00:39 nicolas17: https://img.kuhaon.fun/u/aIZKzJ.gif
21:33:54 [remind] JAA: Remove {[[PANDA]]} unless someone spoke up.
21:35:10 Oh yeah, the random curly brackets.
21:43:23 JustAnotherArchivist deleted PANDA (This does not deserve a wiki page here;…)
22:06:04 Hello! I have a personal data science project that might generate a unique set of URLs, and I may have the means to run the machines to collect all the content myself. Would this be the right chat to ask questions and maybe get some (emotional) support?
22:07:19 the data in question is content generated by politicians and political institutions. it is produced by a personal open(?) data side project that got 'out of hand' ;)
22:07:45 sebs: sounds like you're in the right place, yes
22:07:52 <3
22:08:46 in a nutshell, what I (partly intend to) do:
22:08:56 sounds like we might even be interested in archiving that data if it's publicly available (but that's not my call to make, so take that with a bucket of salt)
22:09:11 this is why I am here
22:10:14 it's only politics (official politician pages, official bodies, etc.). I am trying to map the network of (German and European) politics (who links to whom, etc.), and as a side product I produce the big URL set
22:11:02 at some point I need the data of the pages, but that prompts the question of content changes. sketching this out on a napkin leads to 'the archive does most of that'
22:12:01 the plan is to collect about 200 million content URLs in about 5 years
22:12:58 question: would such a project fill gaps in the Wayback Machine's data, or is it there already anyway?
22:14:41 question 2: can I run my own 'warrior' to archive the stuff myself and help out the project that way? It strikes me that this is a much better method to 're-index' the data and look for changes than writing it myself (so many hard parts, so lazy sebs)
22:16:41 Do you have a sample of the URLs? Are the URLs on a single website or on multiple? Most likely this could be done through ArchiveBot, but there is a possibility the URLs project could fit this.
22:16:52 I don't think we have tooling to only re-archive pages that have changed; that might have to happen externally
22:20:07 that_lurker: how big do you want the sample to be? the URLs are multiple URLs on multiple websites.
22:20:47 imer: did not know that. I thought that was kind of the purpose of this project. Never thought about how the timeline feature is built.
22:22:07 it's just separate copies that happened to have been archived again (for reasons); we probably don't want to re-archive all those millions of URLs constantly
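To make the change-detection point concrete: the Wayback Machine exposes a public CDX API (https://web.archive.org/cdx/search/cdx) that reports a base32 SHA-1 digest for each capture, so "only re-archive what changed" logic could live outside ArchiveTeam entirely, as imer suggests. A minimal Python sketch, assuming the `requests` library; the negative `limit` parameter and the naive digest comparison are the fragile parts, and dynamic pages will show spurious differences:

```python
import base64
import hashlib

import requests

CDX_API = "https://web.archive.org/cdx/search/cdx"


def latest_wayback_digest(url):
    """Base32 SHA-1 digest of the newest Wayback capture of url, or None."""
    params = {
        "url": url,
        "output": "json",          # first row is a header, later rows are captures
        "fl": "timestamp,digest",  # only fetch the fields we compare on
        "limit": "-1",             # negative limit = newest capture; can be slow
    }
    rows = requests.get(CDX_API, params=params, timeout=30).json()
    return rows[1][1] if len(rows) > 1 else None


def live_digest(url):
    """Base32 SHA-1 of the live page body, matching the CDX digest format."""
    body = requests.get(url, timeout=30).content
    return base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")


def looks_changed(url):
    """True if url has no capture yet or differs from the newest capture."""
    return latest_wayback_digest(url) != live_digest(url)
```

At the scale sebs mentions (200 million URLs) these lookups would need batching and respect for IA's rate limits, and the CDX digest covers the payload as captured, so transfer-encoding differences alone can hide a true match.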
22:22:21 if they are on different websites, that could be something datechnoman would maybe like to take a look at :-)
22:22:49 sebs: wrt your first question, it depends. government bodies tend to be well-covered; individual politicians, especially local politicians, less so (and we do a lot of that through #archivebot, so your suggestions would be welcome there)
22:23:07 wrt your second question, warriors only work on warrior projects, so you couldn't use a warrior to archive arbitrary websites; however, you could suggest arbitrary websites for archival in #archivebot (or URLs in #//, with some caveats).
22:23:38 Maybe this could even be a #Y thing?
22:23:38 that said, if you're doing serious data processing, that might not actually have the advantages you'd think it does, because (a) neither archiveteam (as imer said) nor the internet archive do change detection, and (b) you would have to get the data back out of the internet archive, which is slow
22:24:01 yeah... if #Y ever happens ;)
22:27:56 thuban: for fast access I would download the data for myself; I am assuming that. if I can fill some gaps at first, that would already be worth the effort for me.
22:28:15 thanks for the hint about the warrior projects
22:28:43 what is #Y?
22:30:08 sebs: i'm not sure what you mean exactly by "fill some gaps". are you suggesting that you would download websites and upload them to the internet archive yourself?
22:30:54 sebs: Basically a Warrior project where the intention is to make a distributed ArchiveBot with the possibility for custom code, without making an entirely new project. Unfortunately, it's currently dead.
22:31:02 #Y is https://wiki.archiveteam.org/index.php/Distributed_recursive_crawls, a hypothetical project that's been in the planning stages for years
22:31:20 (ninja'd)
22:32:00 thuban: filling in some gaps means I think I will soon find some parts of the political space that are missing from the archive. Especially the aforementioned local politicians. And yes, if I could do it, I would at least give running the infrastructure myself a try instead of stressing already-stressed resources.
22:33:36 TheTechRobo: My spidey senses tingled at custom code ;) Having tried a bunch of things in that area: that is no easy feat, and I have an idea why something like this stalls.
22:34:13 sebs: while you're of course welcome to upload your own web archives to the internet archive, note that IA does not index third-party WARCs into the Wayback Machine (since there's no way to guarantee their correctness).
22:34:52 WARCs from ArchiveBot (like those from other ArchiveTeam projects) _do_ go into the WBM, so please don't hesitate to make suggestions there
22:37:11 thuban: I did read about the WARC format; I have to do my homework there, as I only discovered the ArchiveTeam wiki in the middle of the night. Thanks for the clarification about being in the WBM or not; I was not aware.
22:55:30 Anyway: thanks for teaching me a lot and taking the time. very much appreciated
23:09:59 you're welcome!
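A footnote on the WARC homework mentioned at 22:37:11: the shape of a WARC response record is easy to see with warcio, a Python library widely used in the web-archiving ecosystem. A minimal sketch following warcio's documented usage; the output filename and target URL are placeholders, and, per thuban's note above, a WARC produced this way would still not be indexed into the WBM:

```python
import requests
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)

    # 'identity' keeps the payload uncompressed so the record stores the plain body
    resp = requests.get("https://example.com/",
                        headers={"Accept-Encoding": "identity"},
                        stream=True)

    # rebuild the HTTP response headers from urllib3's raw header view
    http_headers = StatusAndHeaders("200 OK", resp.raw.headers.items(),
                                    protocol="HTTP/1.1")

    record = writer.create_warc_record("https://example.com/", "response",
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```

The result is a standard gzipped WARC that tools like warcio's ArchiveIterator can read back, which is the format ArchiveBot and the other ArchiveTeam projects feed into the WBM.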