-
arkiver
we're discovering zippyshare.com URLs now and queuing those to zippyshare-urls project (for later processing)
-
arkiver
also
-
arkiver
ideas needed!
-
arkiver
i think it'll be good to archive some especially-interesting pages monthly from websites.
-
arkiver
once a month, we'd simply archive all URLs found on a website's front page whose path contains one of the keywords "context", "terms", "dmca", "conditions", "about", "privacy"
-
arkiver
what are opinions here on this? anyone else have ideas? of course, we might need the versions of these keywords in multiple languages
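
A rough sketch of the selection step described above, in Python rather than the project's Lua, and not the project's actual code. The helper names `LinkCollector` and `interesting_links` are hypothetical, and plain case-insensitive substring matching on the URL path is an assumption. It only picks the URLs; scheduling the monthly run is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Keywords exactly as listed in the message above.
KEYWORDS = ("context", "terms", "dmca", "conditions", "about", "privacy")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def interesting_links(front_page_url, html):
    """Return absolute URLs linked from the front page whose path contains a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        url = urljoin(front_page_url, href)  # resolve relative links
        path = urlsplit(url).path.lower()    # only the path is checked, not the query
        if any(keyword in path for keyword in KEYWORDS) and url not in found:
            found.append(url)
    return found
```

Only the path is inspected, so a link like `/?q=terms` is not selected.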
-
JAA
I like this. I'd add 'contact' and 'imprint', too.
-
JAA
The latter is primarily relevant in other languages I guess, where there are laws requiring (some) sites to have an imprint, e.g. Germany.
-
JAA
+ 'disclaimer'?
-
TheTechRobo
Also "Site map" and "Sitemap"? I see those on some websites - no sitemap.xml, but they do have a sitemap
-
TheTechRobo
I think Postmedia does it at least
-
TheTechRobo
(Looks like Postmedia also does sitemap.xml, but not all websites do that - I'd find an example but I don't remember)
-
JAA
Yeah, good one. In the URL, maybe `site\W?map` (or whatever incarnation that turns into in Lua)?
-
datechnoman
That's a great idea
-
JAA
'security' and 'status' perhaps?
-
JAA
Just looking at some random sites and what they have in their footer.
-
TheTechRobo
"Press"?
-
TheTechRobo
also maybe "support"?
-
TheTechRobo
also would "Careers" and "Jobs" be good for this? since those might change a lot
-
arkiver
TheTechRobo: nice one
-
arkiver
JAA: adding those terms!
-
JAA
Some German terms: agb, rechtliches, nutzungsbedingungen, datenschutz, über uns (probably like `(ü|ue)ber\W?uns`), kontakt, hilfe, impressum, karriere, medien
-
JAA
(I'm assuming case insensitivity for everything.)
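
For illustration, the patterns suggested above could be combined into one case-insensitive regex, sketched here in Python rather than Lua. `TERMS` holds only a subset of the terms from the discussion, `path_matches` is a hypothetical helper, and unanchored substring matching is an assumption (short terms like `agb` will also match inside longer words).

```python
import re
from urllib.parse import unquote, urlsplit

# A subset of the terms suggested in the discussion, as regex fragments.
TERMS = [
    r"contact", r"imprint", r"disclaimer", r"site\W?map",
    r"agb", r"rechtliches", r"nutzungsbedingungen", r"datenschutz",
    r"(?:ü|ue)ber\W?uns", r"kontakt", r"hilfe", r"impressum",
    r"karriere", r"medien",
]

# One alternation, compiled case-insensitively.
PATTERN = re.compile("|".join(TERMS), re.IGNORECASE)


def path_matches(url):
    """True if the percent-decoded URL path contains any of the terms."""
    path = unquote(urlsplit(url).path)
    return PATTERN.search(path) is not None
```

Note that `\W` does not match an underscore, so `ueber_uns` would not be caught by `(?:ü|ue)ber\W?uns` as written.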
-
myself
are there enough sites with a 'canary' or similar page, to be worth looking for them?
-
datechnoman
Well it appears Hetzner have had enough of me running this project on their cloud with the abuse notices :(
-
datechnoman
My account has been locked for a month and is under review
-
JAA
Oof
-
datechnoman
Tried getting it unlocked and spoke with them but they won't make any exceptions :/
-
datechnoman
At this time my current workloads are still running but if they are stopped or deleted they cannot be replaced
-
arkiver
That sucks from them :/
-
datechnoman
Not gonna lie im pretty bummed atm but that is the risk of running this
-
arkiver
Very sorry to hear datechnoman :/
-
TheTechRobo
RIP
-
datechnoman
arkiver - not your fault at all! I'm the one running them on my account. Will be just a matter of deleting the VMs shortly and leaving the account empty for the month so hopefully there is a chance of getting it unlocked
-
arkiver
datechnoman: on the other hand - if the account gets through review fine, there may be a note added to the account so that next time it'll not be flagged this fast (or for the same reason)
-
datechnoman
I feel like I've been on their radar for quite some time so I'd like to tell myself that would be the case but most likely there will be no special treatment
-
datechnoman
Either way im not giving up on this project. There are always other ways
-
arkiver
perhaps I'll hold off with adding the idea from earlier for some time until we're running stable with low queues (near 0)
-
datechnoman
Throw it in. Won't create much more load. Much less than sitemaps
-
datechnoman
I have scaleway and will see what I can do to keep things running (just a little bit slower) as Scaleway are more expensive
-
datechnoman
This project is too important to be scared off by one person getting their account locked
-
arkiver
I'll move some queues around
-
arkiver
going to move the todo:backfeed queue to todo:secondary
-
arkiver
so that we can clearly see if we're in good state. good state means todo:backfeed staying near 0
-
arkiver
(we'll then be slowly chipping away at the other queues)
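
A toy sketch of the queue move and the "good state" check, using in-memory deques as a stand-in. How the tracker actually stores its queues is not stated here, so `move_queue`, `in_good_state`, the batch size, and the threshold for "near 0" are all hypothetical.

```python
from collections import deque


def move_queue(queues, source, target, batch_size=1000):
    """Move every item from queues[source] to the end of queues[target].

    Items are moved in batches and keep their order. Returns the count moved.
    """
    src = queues[source]
    dst = queues.setdefault(target, deque())
    moved = 0
    while src:
        for _ in range(min(batch_size, len(src))):
            dst.append(src.popleft())
            moved += 1
    return moved


def in_good_state(queues, name="todo:backfeed", threshold=100):
    """'Good state' here means the named queue stays near 0 (threshold is assumed)."""
    return len(queues.get(name, ())) <= threshold
```

Usage would look like `move_queue(queues, "todo:backfeed", "todo:secondary")`, after which `in_good_state(queues)` reports whether newly backfed items are being kept up with.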
-
datechnoman
Filtering is ramping up also so that will decrease it more rapidly
-
arkiver
yeah
-
arkiver
i'm pausing this temporarily while moving todo:backfeed to todo:secondary
-
arkiver
190 million items moving around :)
-
datechnoman
Pew Pew Pew
-
datechnoman
Movin stuff round lol
-
fuzzy8021
datechnoman ya thats roughly a message i got quite a while back (best i remember it) but havent gotten around to trying to get unlocked
-
datechnoman
Ahhh so that is why you havent been running this project anymore fuzzy8021
-
datechnoman
I thought it might have been cost
-
datechnoman
Hetzner dont like us :(
-
fuzzy8021
i have 20 small boxes with them yet that i never deleted so just been using those on other projects
-
datechnoman
Like we process millions of URLs and only get a few abuse messages from stupid websites and it wrecks it for all :(
-
datechnoman
>:(
-
datechnoman
arkiver - we far off going live again?
-
arkiver
resumed!
-
arkiver
datechnoman: ^
-
arkiver
hmm
-
arkiver
maybe too early
-
datechnoman
swear it was faster moving them around in the past
-
datechnoman
weird
-
datechnoman
Im just impatient lol
-
datechnoman
Going to test some things out on scaleway for this project
-
arkiver
no, in the past the amounts were smaller :P
-
arkiver
21 million left
-
datechnoman
All good! Sorry for the poke
-
arkiver
there was a joke? :P
-
datechnoman
Na poke! You were keeping an eye on things
-
arkiver
hah
-
arkiver
oops
-
datechnoman
There are no breaks on the #// project train :P
-
datechnoman
brakes****
-
arkiver
datechnoman: resumed! :)
-
datechnoman
<3 thanks!