-
arkiver
the URLs project is online on github at
github.com/ArchiveTeam/urls-grab
-
arkiver
with this we'll archive random discovered outlinks from various sources
-
arkiver
it also uses the newest Wget-AT
-
arkiver
-
arkiver
kiska: we could use a target if you have something available :)
-
kiska
Just woke up
-
kiska
Hrm?
-
arkiver
for the URLs project
-
arkiver
it needs the newest megaWARC factory though
-
kiska
-
arkiver
wooh
-
kiska
-
kiska
:D
-
arkiver
:P
-
arkiver
which do you want
-
arkiver
haha
-
arkiver
also we can push to
-
arkiver
archiveteam_urls_
-
arkiver
Archive Team URLs:
-
kiska
Either I am making fun of how you pronounce URLs :D
-
JAA
Make sure it's the latest factory.
-
kiska
archiveteam@rsync ~/archiveteam-megawarc-factory $ git pull
-
kiska
Already up to date.
-
kiska
archiveteam@rsync ~/archiveteam-megawarc-factory $ cd megawarc/
-
kiska
archiveteam@rsync ~/archiveteam-megawarc-factory/megawarc $ git pull
-
kiska
Already up to date.
-
kiska
:D
-
kiska
I did that 12h ago
-
JAA
:-)
-
kiska
And then went to sleep without creating the rsync endpoint
-
kiska
Ooops?
-
kiska
IA_ITEM_TITLE="Archive Team URLs:"
-
kiska
IA_ITEM_PREFIX="archiveteam_urls_"
-
kiska
FILE_PREFIX="urls_"
-
kiska
Yes?
-
arkiver
kiska: yes!
-
jodizzle
Is there a separate channel?
-
kiska
Now I am going to catch up on some messages
-
arkiver
project started!
-
JAA
Wheeeeee
-
arkiver
200 million URLs/items being queued
-
arkiver
20 items per job
-
arkiver
(first use of 'multi-items')
-
arkiver
scripts updated already
-
arkiver
20201031.04
-
JAA
4 versions in less than an hour? Impressive.
-
kiska
:D
-
arkiver
yeah forgot some stuff, and fat fingered something
-
arkiver
feel free to go wild :P this is a ton of different domains
-
fuzzy8021
is there a site to see new docker images like
hub.docker.com/u/warcforceone now that everything is built as atdr.meo.ws/archiveteam?
-
JAA
Fusl: ^
-
Fusl
fuzzy8021: atdr.meo.ws/archiveteam/<project>-grab
-
Fusl
*everything* that contains a Dockerfile is there
-
JAA
I think fuzzy8021 was asking about a web page listing the images.
-
fuzzy8021
ya
-
Fusl
there is none
-
Fusl
github is the source of truth for this
-
fuzzy8021
k thanks
-
fuzzy8021
that works
-
arkiver
this is working well
-
nico_32
hum
-
nico_32
my warrior is not able to find wget-at
-
nico_32
let's see if i can update the container
-
JAA
nico_32: This project probably won't work in the warrior.
-
mgrandi
i have time this weekend, can we just...update that docker image
-
mgrandi
literally all it needs is a newer version of ubuntu
-
mgrandi
and ubuntu-20.10 just came out
-
Arcorann
Agreed, it seems to me like most projects don't work in the warrior nowadays
-
Arcorann
What's the latest LTS again?
-
mgrandi
well, in this case, wget-at needs `libzstd` and the one in the version the docker image is using (ubuntu 18.X? ) is too old
-
mgrandi
ffff i wish that wasn't a feature in irccloud
-
mgrandi
Ubuntu 20.04.1 LTS
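A minimal sketch of the version check implied above: wget-at's zstd support needs a sufficiently new libzstd, and the copy in the 18.04-based image is too old. The 1.4.4 threshold is an assumption for illustration, not a figure from the discussion.

```shell
# Succeed if the available libzstd version ($1) is at least the required one ($2).
# Dotted-version compare via sort -V: if the required version sorts first (or
# equal), the available one is new enough.
zstd_new_enough() {
  have="$1"; need="$2"
  [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]
}
```

Usage would look like `zstd_new_enough "$(pkg-config --modversion libzstd)" 1.4.4` inside the image being tested.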
-
nico_32
oh
-
JAA
mgrandi: Feel free to try, but test it against a test system, not the live tracker. (Note that the Lua script also talks to the prod system on many current projects.)
-
nico_32
# Use phusion/baseimage as base image.
-
nico_32
FROM phusion/baseimage:0.11
-
mgrandi
i have to learn how to docker first
-
nico_32
FROM phusion/baseimage:18.04-1.0.0
-
nico_32
& retry
-
mgrandi
why are we basing it off of that
-
nico_32
JAA: is there any test tracker?
-
mgrandi
is it smaller?
-
nico_32
that's the current config
-
nico_32
-
JAA
nico_32: Not sure. I never worked much with the tracker.
-
nico_32
(or we could build wget-at static and just add it to ArchiveTeam/warrior-code2)
-
mgrandi
that works too
-
JAA
We already have a static binary (I think) on the project images.
-
mgrandi
.oh lol
-
mgrandi
we should just github it and then have the script `wget` it then
-
nico_32
],
-
nico_32
['./wget-at']
-
JAA
-
nico_32
-
nico_32
it will always use the local copy
-
mgrandi
"what is your purpose", "to download wget-at" "oh my god"
-
mgrandi
/s/your/my
-
mgrandi
yes, there is a lot of duplication of logic, we could make a separate git repo that has a script that tries to download static / compile / throw an error if nothing else works and just submodule that or something
-
mgrandi
(rather than including get-wgetat.sh in every project)
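The shared helper mgrandi describes could look roughly like this: prefer an already-present binary, then try to fetch a prebuilt static one, and fail loudly otherwise. Everything here is a hedged sketch; the function name is made up and no real release URL is given in the log, so the download location is left as an unset placeholder.

```shell
# Hypothetical stand-in for the proposed shared get-wgetat.sh:
# use an existing wget-at binary, else download a static build, else error out.
obtain_wget_at() {
  dest="$1"
  url="${WGET_AT_URL:-}"   # placeholder; no release URL appears in the log
  if [ -x "$dest" ]; then
    echo "using existing $dest"
  elif [ -n "$url" ] && curl -fsSL "$url" -o "$dest"; then
    chmod +x "$dest"
  else
    echo "could not obtain wget-at" >&2
    return 1
  fi
}
```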
-
Fusl
there's the next person wasting brain cycles on implementing something that will be made obsolete
-
Fusl
-> #warrior
-
Fusl
it will be entirely rewritten from scratch
-
Fusl
no ubuntu
-
Fusl
no debian
-
mgrandi
ok
-
mgrandi
Also: i have been working on the store.playstation.com stuff, and i was wondering if its worth to set up a full blown warrior project for this
-
mgrandi
Turns out the site did not get shut down on the 28th, but the new site was merely deployed, and you can still access the old site
-
mgrandi
Some errata: we don't have all the URLs yet, that is in progress. Sony is also...inconsistent with banning and i cannot for the life of me figure out what their strategy is
-
Fusl
the last time someone wasted braincycles around here was when they were trying to rewrite the tracker in python and getting rid of redis, we have since made progress on rewriting it in lua and improving performance by a factor of literally 300000%
-
mgrandi
i ran 20 boxes to do HTML scrapes with wget-at and 10 of them were banned but 10 weren't
-
mgrandi
(what made the perf go up that much? o.o)
-
Fusl
lua
-
mgrandi
what was it written in before?
-
Fusl
ruby
-
mgrandi
hmm, i guess lua is neat then
-
Fusl
and many redis single command calls
-
JAA
It's more about how the code is written than what language.
-
mgrandi
Anyway, store.PSN project, the urls are just simple URLs, potential banning, the URL list is not complete yet but the VGPC discord is working on it
-
Fusl
it's now pretty much doing a lot of the magic in redis for locking items and scanning through all the queues
-
mgrandi
not sure if everyone is too busy to set up another project
-
mgrandi
but thought i should bring it up here before i try to make something myself to do it
-
nico_32
hum
-
nico_32
there is a new version of the warrior ova
-
nico_32
wget-at built and installed successfully!
-
nico_32
i recommend adding a note telling users to upgrade their warrior to v3.1
-
nico_32
success! i can run the docker image for google-sites-grab on my synology nas
-
Kaz
Sean Connery has died
-
Kaz
pls archivebot all the things
-
Arcorann
Is there a channel for the URL project?
-
wessel1512
and is ther a docker image for URL project?
-
wessel1512
-
Kaz
we no longer use dockerhub
-
Kaz
docker image is available at atdr.meo.ws/archiveteam/urls-grab
-
Kaz
apart from the fact it appears to be broken
-
Arcorann
For that matter is there a page which describes the goals/etc. of the project in more detail?
-
Kaz
should probably make a wiki page I guess
-
Kaz
tl;dr I think is basically 'a project we can throw a collection of random urls at and have them grabbed'
-
Kaz
distributed archivebot, in a very limited form
-
wessel1512
archivebot lite
-
wessel1512
it works atdr.meo.ws/archiveteam/urls-grab
-
Kaz
ah perfect
-
Kaz
yeah, we're moving pretty much everything over to drone
-
Kaz
to avoid dockerhub limits
-
Arcorann
I've been wondering if it was feasible to archive every page linked from
novelupdates.com (novel translation link aggregation site)
-
Kaz
sure it is
-
Kaz
spam arkiver enough to see if he'll do it
-
wessel1512
also maybe add a web gui to the docker registry
-
Kaz
there is one - I think you have to log in to it though
-
Kaz
the simple assumption is that every new project should be auto-built there
-
wessel1512
-
wessel1512
so that normal plebs like me can see what docker images are available
-
kiska
-
Kaz
-
Kaz
repos are public, you just can't list them
-
Kaz
if you log in via github it may let you
-
wessel1512
let you what ?
-
Kaz
list repos
-
arkiver
mgrandi: how many URLs/apps?
-
JAA
Aw, RIP.
seanconnery.com is dead, couldn't find anything else. I set up a monitor for the site.
-
JAA
seanconnery.com is back up since a few minutes and has been replaced by a message about his death.
-
arkiver
ah
-
arkiver
there's quite some utm_* params in urls-grab
-
arkiver
we'll queue those again to backfeed
-
Kaz
it's also just an image, which is an interesting way to go about making a static site
-
arkiver
yeah so upcoming is:
-
arkiver
- don't abort all items if one of the multi items is bad
-
arkiver
- queue back without utm_* params
-
arkiver
later on we'll also queue some page requisites, but need to add some good checks for that
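Stripping `utm_*` parameters before re-queuing, as described above, can be sketched like this. This is only an illustration in shell, not the project's actual code (the real pipeline logic lives in the Lua script).

```shell
# Drop any utm_* key=value pairs from a URL's query string, then tidy up the
# separators: collapse doubled '&', strip a trailing '?' or '&', and fix '?&'.
strip_utm() {
  printf '%s\n' "$1" \
    | sed -E 's/([?&])utm_[^&]*/\1/g; s/&&+/\&/g; s/[?&]$//; s/\?&/?/'
}
```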
-
jodizzle
JAA: Do you have a method for getting Disqus comments? I think I remember you mentioning something about that a while back. I was thinking about getting the Disqus comments on
afreshcup.com
-
jodizzle
Hm, but it seems like the Disqus comments there aren't loading at all, at least for me.
-
jodizzle
Would be good to know anyway, though.
-
JAA
jodizzle: Er, I never actually finished that. But yes, I have something that kind of works, mostly.
-
jodizzle
Oh?
-
JAA
Reminds me that I wanted to grab the Picosong and Kotaku/Gizmodo/Lifehacker UK discussions from there though.
-
JAA
As for grabbing the comments on the site, that's a disgusting mess.
-
jodizzle
Darn, that sounds really annoying
-
jodizzle
Can someone confirm, by the way, that the Disqus comments aren't loading for them on
afreshcup.com/? I want to know if it's just my browser/connection.
-
JAA
The 'disgusting mess', by the way, is that the Disqus URL includes the <title> of the page embedding it.
-
JAA
And it includes it twice, and the two might differ IIRC.
-
JAA
So yeah...
-
JAA
It basically has to be done with specific code for each site on which you want to archive those to emulate what the site's JS does with the <title>.
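One piece of the site-specific work JAA describes, extracting the page's `<title>` so it can be fed into the embed URL, might be sketched like this. The actual Disqus embed URL format is deliberately not reproduced here; this only shows the title-extraction step, and it assumes the `<title>` element sits on a single line.

```shell
# Read an HTML page on stdin and print the contents of the first <title>
# element found on a single line.
page_title() {
  sed -n 's@.*<title>\(.*\)</title>.*@\1@p' | head -n1
}
```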
-
JAA
I did some of this on Picosong, I believe.
-
JAA
And yes, I can confirm that it doesn't work. Looks like the forum no longer exists:
disqus.com/home/forum/a-fresh-cup
-
nico_32
Kaz:
pix.milkywan.fr/B0RCcmXh.png <= it looks like that on Synology's docker app
-
Kaz
hmm, neat
-
nico_32
yes
-
nico_32
but the app is a little buggy
-
jodizzle
JAA: Thanks
-
Wingy
JAA apparently I don't need an AT wiki password reset lol
-
Wingy
It was in my password manager the whole time
-
Wingy
just no URL
-
Wingy
so not in the "this tab" category
-
Wingy
and I couldn't remember it because it was random
-
JAA
lol
-
JAA
Well, great. :-)
-
OrIdow6
Sounds like this URL project is basically warriorbot
-
arkiver
sort of
-
arkiver
updates coming up
-
Fusl
Out of memory: Kill process 7067 (redis-server) score 447 or sacrifice child
-
Fusl
RIP redis
-
Zerote
is there a wiki page and/or dedicated channel for the URL project?
-
JAA
Nope
-
arkiver
let's make a channel!
-
arkiver
code is updated, this will speed things up a lot (not aborting 20 URLs when one is bad)
-
arkiver
update asap please, will set as minimum version soon
-
arkiver
20201031.06
-
arkiver
any ideas for a channel?
-
OrIdow6
Pretty hard to "mock" URLs as a general concept
-
OrIdow6
#emailaddress
-
arkiver
-
arkiver
#noslashes ?
-
arkiver
not sure if I like it :P
-
OrIdow6
#//
-
arkiver
fuck yes
-
arkiver
#// works
-
arkiver
nice
-
arkiver
any strong opinions about that channel anyone?
-
Fusl
nope
-
Fusl
i like it
-
JAA
+1
-
arkiver
HCross: I see you just went faster, but with the old version
-
HCross
… Maybe because nobody told me there was an update
-
arkiver
well I did write here, but didn't ping you
-
arkiver
pinged you now though :)
-
arkiver
#// for the URLs project!