20:52:31 we're discovering zippyshare.com URLs now and queuing those to the zippyshare-urls project (for later processing)
20:52:32 also
20:52:39 ideas needed!
20:53:02 i think it'll be good to archive some especially interesting pages from websites monthly.
20:54:30 once a month we'd simply archive all URLs found on a website's front page whose path contains one of the keywords "context", "terms", "dmca", "conditions", "about", "privacy"
20:54:55 what are opinions here on this? anyone else have ideas? of course, we might need versions of these keywords in multiple languages
20:55:52 I like this. I'd add 'contact' and 'imprint', too.
20:56:36 The latter is primarily relevant in other languages I guess, where there are laws requiring (some) sites to have an imprint, e.g. Germany.
20:57:51 + 'disclaimer'?
20:58:46 Also "Site map" and "Sitemap"? I see those on some websites - no sitemap.xml, but they do have a sitemap
20:59:08 I think Postmedia does it at least
20:59:39 (Looks like Postmedia also does sitemap.xml, but not all websites do that - I'd find an example but I don't remember)
21:01:18 Yeah, good one. In the URL, maybe `site\W?map` (or whatever incarnation that turns into in Lua)?
21:02:44 That's a great idea
21:03:44 'security' and 'status' perhaps?
21:03:53 Just looking at some random sites and what they have in their footer.
21:04:26 "Press"?
21:04:42 also maybe "support"?
21:06:50 also would "Careers" and "Jobs" be good for this? since those might change a lot
21:20:45 TheTechRobo: nice one
21:20:51 JAA: adding those terms!
21:25:17 Some German terms: agb, rechtliches, nutzungsbedingungen, datenschutz, über uns (probably like `(ü|ue)ber\W?uns`), kontakt, hilfe, impressum, karriere, medien
21:25:49 (I'm assuming case insensitivity for everything.)
21:30:53 are there enough sites with a 'canary' or similar page to be worth looking for them?
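
The keyword idea above can be sketched in a few lines. This is only an illustrative Python sketch, not the project's actual code (the discussion suggests the real filter would be written in Lua). The names `KEYWORD_PATTERNS`, `is_interesting` and `select_interesting` are hypothetical, and the term list is just what was proposed in the conversation. It assumes plain case-insensitive substring matching against the percent-decoded URL path only.

```python
import re
from urllib.parse import unquote, urlsplit

# Terms proposed in the discussion; each entry is a regex fragment.
# This list is an assumption drawn from the chat, not a final project list.
KEYWORD_PATTERNS = [
    "context", "terms", "dmca", "conditions", "about", "privacy",
    "contact", "imprint", "disclaimer", r"site\W?map",
    "security", "status", "press", "support", "careers", "jobs",
    # German terms
    "agb", "rechtliches", "nutzungsbedingungen", "datenschutz",
    r"(?:ü|ue)ber\W?uns", "kontakt", "hilfe", "impressum",
    "karriere", "medien",
]

# One combined, case-insensitive pattern ("assuming case insensitivity
# for everything").
KEYWORD_RE = re.compile("|".join(KEYWORD_PATTERNS), re.IGNORECASE)


def is_interesting(url: str) -> bool:
    """Return True if the URL's path contains any keyword."""
    # Only the path is checked; decode %C3%BC to "ü" before matching.
    path = unquote(urlsplit(url).path)
    return KEYWORD_RE.search(path) is not None


def select_interesting(urls):
    """Keep only the front-page links that match a keyword."""
    return [u for u in urls if is_interesting(u)]
```

Because this is substring matching, short terms such as "agb" or "press" will also hit unrelated paths (for example "wordpress"), so a real filter would probably want word-boundary checks or a cap on matches per site.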
21:42:16 Well it appears Hetzner have had enough of me running this project on their cloud with the abuse notices :(
21:42:32 My account has been locked for a month and is under review
21:43:53 http://transfer.datechnoman.com/rb6WHI73jg/hetzner.jpg
21:43:57 Oof
21:44:30 Tried getting it unlocked and spoke with them but they won't make any exceptions :/
21:46:03 At this time my current workloads are still running but if they are stopped or deleted they cannot be replaced
21:46:26 That sucks from them :/
21:46:37 Not gonna lie I'm pretty bummed atm but that is the risk of running this
21:47:12 Very sorry to hear datechnoman :/
21:47:34 RIP
21:48:14 arkiver - not your fault at all! I'm the one running them on my account. Will be just a matter of deleting the VMs shortly and leaving the account empty for the month so hopefully there is a chance of getting it unlocked
21:48:48 datechnoman: on the other hand - if the account gets through review fine, there may be a note added to the account so that next time it'll not be flagged this fast (or for the same reason)
21:52:34 I feel like I've been on their radar for quite some time so I'd like to tell myself that would be the case but most likely there will be no special treatment
21:53:55 Either way I'm not giving up on this project. There are always other ways
21:55:24 perhaps I'll hold off with adding the idea from earlier for some time until we're running stable with low queues (near 0)
21:56:35 Throw it in. Won't create much more load. Much less than sitemaps
21:58:09 I have Scaleway and will see what I can do to keep things running (just a little bit slower) as Scaleway are more expensive
21:59:21 This project is too important to be scared off by one person getting their account locked
21:59:34 I'll move some queues around
21:59:45 going to move the todo:backfeed queue to todo:secondary
22:00:09 so that we can clearly see if we're in good state. good state means todo:backfeed staying near 0
22:00:25 (we'll then be slowly chipping away at the other queues)
22:01:26 Filtering is ramping up also so that will decrease it more rapidly
22:01:30 yeah
22:01:47 i'm pausing this temporarily while moving todo:backfeed to todo:secondary
22:02:20 190 million items moving around :)
22:02:38 Pew Pew Pew
22:02:42 Movin stuff round lol
23:13:24 datechnoman ya that's roughly a message i got quite a while back (best i remember it) but haven't gotten around to trying to get unlocked
23:18:18 Ahhh so that is why you haven't been running this project anymore fuzzy8021
23:18:22 I thought it might have been cost
23:18:28 Hetzner don't like us :(
23:19:29 i have 20 small boxes with them yet that i never deleted so just been using those on other projects
23:20:08 Like we process millions of URLs and only get a few abuse messages from stupid websites and it wrecks it for all :(
23:20:10 >:(
23:21:31 arkiver - we far off going live again?
23:28:38 resumed!
23:28:40 datechnoman: ^
23:29:17 hmm
23:29:21 maybe too early
23:31:10 swear it was faster moving them around in the past
23:31:14 weird
23:31:21 I'm just impatient lol
23:31:32 Going to test some things out on Scaleway for this project
23:31:43 no, in the past the amounts were smaller :P
23:31:50 21 million left
23:32:17 All good! Sorry for the poke
23:32:45 there was a joke? :P
23:34:07 Na poke! You were keeping an eye on things
23:36:06 hah
23:36:07 oops
23:38:48 There are no breaks on the #// project train :P
23:39:06 brakes****
23:40:45 datechnoman: resumed! :)
23:40:56 <3 thanks!