-
tzt
ge.tt is shutting down 28th of February
-
kiska
Here is a test file I uploaded:
ge.tt/6CMOUZB3
-
kiska
Seems to be 8 chars with 0-9 and A-Z
-
kiska
Here is another
ge.tt/3u8QUZB3
-
kiska
So 8 characters of 0-9 A-Za-z
-
kiska
arkiver: fun times?
-
purplebot
Deathwatch edited by Kiska (+128, Add ge.tt) just now --
archiveteam.org/?diff=46365&oldid=46359
-
SCSi
doesn't give us much time to archive
-
JAA
[0-9a-zA-Z]{8} = ~218 trillion combinations. Yeah, not going to happen.
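
A quick check of the keyspace figure above, as a minimal Python sketch. The alphabet and ID length come from the messages above; the request rate is a hypothetical number, used only to show why brute-force enumeration is impractical.

```python
import string

# Alphabet taken from the log: digits plus upper- and lower-case ASCII letters.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
ID_LENGTH = 8


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of distinct IDs of the given length over the alphabet."""
    return alphabet_size ** length


def days_to_enumerate(total: int, requests_per_second: float) -> float:
    """Days needed to try every ID once at a fixed request rate."""
    return total / requests_per_second / 86_400


total = keyspace(len(ALPHABET), ID_LENGTH)
print(total)  # 218340105584896, i.e. roughly 218 trillion

# 10,000 requests/second is an assumed rate, not a measured one.
print(f"{days_to_enumerate(total, 10_000):.0f} days at 10k requests/s")
```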
-
Ryz
It may be shutting down on 2021 February 28, but the files will only become unavailable on 2021 March 10,
-
Ryz
So that's more time than just 2 days
-
SCSi
so when are we rallying the troops and knocking this one out
-
SCSi
i wanna show up on the leaderboards not in the bottom 50
-
tech234a
still not enough but possibly helpful
-
arkiver
hi
-
arkiver
ah
-
arkiver
fun fun fun
-
arkiver
lets go 218 trillion URLs :)
-
arkiver
JAA: could you do a quick twitter scan?
-
arkiver
i'll send them an email ow
-
arkiver
no
-
arkiver
ow
-
arkiver
now*
-
atphoenix
multiple files can be behind one base URL, e.g.
ge.tt/64SMJpf has 5 files including
ge.tt/64SMJpf/v/81
-
atphoenix
^ url found via a search engine searching for site:ge.tt
-
tech234a
Only 53 million files
-
atphoenix
the example I gave is for SweetFX ReShade files. I wonder if we can find forums that liked to use this file hoster.
-
tech234a
Plenty of URLs already in Google and CDX
-
tech234a
Bing too
-
atphoenix
duckduckgo too. My URL example is used here:
forum.doom9.org/showthread.php?t=170357
-
arkiver
53 million is not much, but these may be large
-
atphoenix
there are a bunch of forums and reddit posts that reference my example.
duckduckgo.com/?q=http%3A%2F%2Fge.tt%2F64SMJpf . Hopefully they can just give us the list of 53m...
-
purplebot
URLTeam edited by Aarchi (+245, Update git.io) just now --
archiveteam.org/?diff=46366&oldid=46362
-
OrIdow6^2
So NicoNino now has about a day and a half before deleting past comments
-
OrIdow6^2
As far as I know, it is completely done except for zstd and a version number bump
-
OrIdow6^2
There are about 23 million videos this would need to cover to be complete, and it is heavily rate-limited during the day in Japan
-
OrIdow6^2
Can one of the ops look at this? I can add a static dictionary if need be.
-
EggplantN
jfc tech234a thats a lotta items 🙃
-
kiska
arkiver: See OrIdow6^2 pls
-
EggplantN
oh hey OrIdow6^2
-
EggplantN
are you here?
-
atphoenix
if you want a project channel, may I suggest nicotino (going up in smoke) or NicoNinoNoMore or
-
EggplantN
whatcha need from us ops
-
EggplantN
need a tracker + items loading? or need someone to do a code review.
-
EggplantN
Tracker is up
-
EggplantN
set min version to 20210227.01
-
EggplantN
target added (rsync only)
-
EggplantN
also re: ZSTD, was this something required, or just a nice-to-have?
-
EggplantN
also r.e rate limiting we can deploy 2k+ IPs + hetzner cloud
-
OrIdow6^2
Away for like 20 more minutes
-
EggplantN
okie fyi zstd isnt much use on video afaik
-
EggplantN
so I believe we can ignore that (?)
-
EggplantN
gonna go for a shower will be back similar time
-
kiska
-
EggplantN
Is this what we're archiving?
-
kiska
no this is from the IA talk that Jonah did
-
EggplantN
Okay few questions for OrIdow6^2 when you're back
-
EggplantN
a) are we archiving video or just text
-
EggplantN
b) if video then not much point for zstd?
-
EggplantN
c) if video any reason for the multi item size of 30?
-
OrIdow6^2
EggplantN: Zstd is useful, because it's not getting videos themselves, just metadata pages & comments
-
OrIdow6^2
So (a) is "just text"
-
EggplantN
okie all is fine
-
EggplantN
we would need to likely wait for arkiver for zstd + deploy a different target
-
OrIdow6^2
Increasing the multi-item size may actually be useful
-
OrIdow6^2
Hm
-
EggplantN
Do you have the item list at least for me to load that into the tracker ready?
-
kiska
Any reason why we aren't archiving video?
-
OrIdow6^2
That's why I wondered why a static dict was possible
-
OrIdow6^2
EggplantN: No, going to work on that, need to decide whether to try all possible IDs or get a subset from the search page
-
EggplantN
We can just throw all possible in no problem
-
OrIdow6^2
kiska: Because it's not being deleted; only old comments and metadata of deleted videos are being removed
-
kiska
I see...
-
EggplantN
okay no worries. The target can zstd the uploaded WARCs, but if you want to do zstd on the worker, that needs arkiver to configure it on researcher7
-
OrIdow6^2
Thinking about it, it may not be too big with gzip
-
EggplantN
How big would per item be?
-
OrIdow6^2
I'm going to say something on the order of 30 KB
-
OrIdow6^2
Good chance it's less than that
-
EggplantN
at 30KB 23mil items its only 660GB
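
The estimate above can be reproduced with a short Python sketch. The item count and per-item size are the rough figures quoted in the log; whether "KB" means 1000 or 1024 bytes is an assumption, so both readings are shown.

```python
def total_bytes(items: int, bytes_per_item: int) -> int:
    """Total archive size if every item takes the same number of bytes."""
    return items * bytes_per_item


ITEMS = 23_000_000  # approximate video count quoted in the log

binary = total_bytes(ITEMS, 30 * 1024)   # reading "30 KB" as 30 KiB
decimal = total_bytes(ITEMS, 30 * 1000)  # reading "30 KB" as 30 kB

print(f"{binary / 2**30:.0f} GiB")   # 658 GiB
print(f"{decimal / 10**9:.0f} GB")   # 690 GB
```

Either way the total stays under 1 TB, which is the point being made.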
-
OrIdow6^2
Hard to tell exactly, because I don't know the distribution of comments/video and used IDs
-
OrIdow6^2
Yeah
-
OrIdow6^2
Running item generation now
-
kiska
Lets put this on atr3 if you're going to do gz or with a static dict
-
OrIdow6^2
As EggplantN has made me realize, probably just gz
-
EggplantN
perfect
-
EggplantN
Please PR the code to AT GH repo when ready, kiska has prepared docker to build. target is ready to just accept data (not send to IA yet). Send me the items or make a GH repo called niconino-items (preferred) and I'll fork to AT
-
EggplantN
and add to tracker
-
EggplantN
I'll also default this project on the warrior
-
EggplantN
wait hold up OrIdow6^2
-
OrIdow6^2
What
-
EggplantN
is this project called niconico
-
EggplantN
or niconino
-
EggplantN
wiki says niconico
-
OrIdow6^2
Oh, did I misspell it again?
-
EggplantN
lmao we've set everything for niconino lol
-
EggplantN
I can move it over to niconico
-
OrIdow6^2
I'll change the repo
-
OrIdow6^2
I did this before, caught it myself though
-
EggplantN
lol I already called it niconino on the tracker; only noticed just now, when getting the logo, that it's niconico
-
EggplantN
okie added to tracker frontpage
-
EggplantN
peoples notifications will be alerting shortly
-
EggplantN
-
EggplantN
there we go
-
purplebot
Niconico edited by Tglass (+47, Update IRC channel) just now --
archiveteam.org/?diff=46367&oldid=46321
-
OrIdow6^2
Good thing you caught that
-
OrIdow6^2
-
OrIdow6^2
Should probably rename the repo, too
-
EggplantN
yeah
-
EggplantN
lemme do that
-
EggplantN
merged changed name
-
AK
Is there a niconinoniconino channel yet?
-
OrIdow6
Thank you
-
EggplantN
Not yet AK
-
EggplantN
just here for now ;)
-
OrIdow6
Having to take a bit of a detour with items
-
EggplantN
okie lemme know if you wanna just yeet all possible in
-
OrIdow6
There is a more efficient way than all possible
-
OrIdow6
But it will take time, and I don't know whether this will go fast enough that prioritization isn't necessary
-
OrIdow6
-
OrIdow6
Go slow at first, and it's possible that I omitted something with multiitems
-
EggplantN
Okie I'll test it first real quick
-
OrIdow6
Ok
-
OrIdow6
By the way, don't test with the very early IDs - sm8 has millions of comments
-
EggplantN
its running and redis is random so i'll get whatever it gives
-
EggplantN
-
EggplantN
Server returned 503 (RETRFINISHED). Sleeping.
-
OrIdow6
That's niconico's version of a 429
-
EggplantN
Ah right
-
EggplantN
so 503 is 429 here. jesus, I'm only running 1 concurrent
-
EggplantN
it slept for the right amount of time though
-
OrIdow6
I have it set for 1 minute on 503s
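
A minimal Python sketch of the behaviour described above: treat 503 as a rate-limit signal and sleep for a minute before retrying. This is not the project's actual code; `fetch_with_backoff` and the retry cap are hypothetical, and only the 503 status and the one-minute sleep come from the log.

```python
import time
from typing import Callable

RATE_LIMIT_STATUS = 503  # niconico's stand-in for 429, per the log
SLEEP_SECONDS = 60       # "1 minute on 503s", per the log
MAX_TRIES = 5            # hypothetical cap, not from the log


def fetch_with_backoff(
    fetch: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    max_tries: int = MAX_TRIES,
) -> int:
    """Call fetch() until it returns a non-503 status or tries run out.

    fetch is any callable returning an HTTP status code. Other codes,
    including 404, are returned immediately without sleeping.
    """
    status = fetch()
    for _ in range(max_tries - 1):
        if status != RATE_LIMIT_STATUS:
            break
        sleep(SLEEP_SECONDS)
        status = fetch()
    return status


# Usage with a fake fetcher and a fake sleep, so nothing actually waits.
responses = iter([503, 503, 200])
slept = []
status = fetch_with_backoff(lambda: next(responses), sleep=slept.append)
print(status, slept)  # 200 [60, 60]
```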
-
EggplantN
perfect
-
EggplantN
so this is a 1-2 concurrent per IP?
-
OrIdow6
Probably
-
EggplantN
its bloody slow lol
-
AK
I get a 503 on that and I haven't visited niconico before
-
AK
I think that one might actually be a 503
-
AK
*404
-
EggplantN
yeah 404's are fine
-
EggplantN
but 503 = 429 here
-
EggplantN
🙃
-
kiska
OOF
-
EggplantN
least they give error codes and not 200's
-
EggplantN
for everything
-
kiska
At least its better than halo
-
EggplantN
okay it worked OrIdow6
-
EggplantN
3.46MiB for 30 items
-
OrIdow6
Good
-
EggplantN
~110kb but thats a small data set
-
EggplantN
You happy for us to run with this?
-
kiska
Maybe we should do 10 multi items if they give 503 for 429 :D
-
EggplantN
i only got 1 503 kiska
-
kiska
I see
-
EggplantN
for a multi item of 30
-
OrIdow6
Can you look at your logs? How many "watch_dll" and "watch_app" scripts did it get?
-
EggplantN
instead
-
EggplantN
-
EggplantN
enjoy the whole log
-
OrIdow6
Oh, thanks
-
OrIdow6
So I think this looks like it's running ok for now
-
EggplantN
ready for me to open the flood gates?
-
OrIdow6
I think so
-
OrIdow6
Start slowly
-
OrIdow6
The main site has a lot of capacity (at least at certain times of day), but I'm not sure about the history endpoint
-
HCross
If you can hold off 20 minutes I can scale
-
OrIdow6
I.e. all the
nmsg.nicovideo.jp/api requests
-
EggplantN
ah right i've not put a rate limit on
-
EggplantN
but we will as needed
-
EggplantN
deployed 50 concurrent
-
AK
Error response from daemon: manifest for atdr.meo.ws/archiveteam/niconico-grab:latest not found: manifest unknown: manifest unknown
-
AK
It's niconico right?
-
EggplantN
ah crap
-
EggplantN
kiska
-
kiska
Yes?
-
EggplantN
i renamed repo
-
EggplantN
can you redeploy drone
-
EggplantN
also OrIdow6 they're intelligent and rate limiting me using a /24
-
EggplantN
🙃
-
kiska
Sync'ing
-
kiska
One moment
-
kiska
Building
-
EggplantN
hrm
-
OrIdow6
EggplantN: Is it possible it could be by UA or something?
-
EggplantN
its possible
-
EggplantN
want me to try?
-
OrIdow6
But I suppose /24 rate limiting is what I'd expect from them
-
OrIdow6
Not yet
-
EggplantN
i tried doing 50 concurrent 1/IP from the same /24
-
HCross
I’ll deploy the usual solution in a bit
-
EggplantN
even fucking 8 IPs
-
EggplantN
instant 503
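
If the limit really is applied per /24, spreading workers across IPs in the same block will not help. A small Python sketch of how one might count workers per /24; the addresses are documentation-range examples, not real worker IPs, and `workers_per_subnet` is a hypothetical helper.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def subnet_24(ip: str) -> str:
    """Return the /24 network containing an IPv4 address, e.g. '192.0.2.0/24'."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def workers_per_subnet(ips: Iterable[str]) -> Counter:
    """Count how many worker IPs fall into each /24."""
    return Counter(subnet_24(ip) for ip in ips)


# RFC 5737 documentation addresses, purely illustrative.
worker_ips = ["192.0.2.10", "192.0.2.11", "192.0.2.200", "198.51.100.7"]
for subnet, count in workers_per_subnet(worker_ips).items():
    print(subnet, count)
```

A rate limiter keyed on the /24 would see the first three addresses as one client.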
-
OrIdow6
Looking into it
-
AK
Exception: No usable Wget+At found.
-
AK
Looking for Wget+AT in ./wget-at
-
AK
./wget-at: Incorrect Wget+AT version (want ['GNU Wget 1.20.3-at.20210212.02']).
-
AK
Welp
-
EggplantN
was that the docker CT?
-
AK
Yep
-
EggplantN
uh
-
EggplantN
that's the latest version
-
AK
-
EggplantN
gimme a sec hrm,
-
EggplantN
also OrIdow6 if you want an example of random UA's see reddit
-
OrIdow6
Ok
-
OrIdow6
Is anyone besides Eggplant running yet? Would help narrowing down the cause of the 503s
-
EggplantN
yeah AK seems docker is using the old wget-at?
-
AK
Ooh
-
AK
grab-base vs grab-base-df?
-
AK
reddit uses
-
AK
FROM atdr.meo.ws/archiveteam/grab-base:gnutls
-
AK
This uses FROM atdr.meo.ws/archiveteam/grab-base-df
-
AK
Urls-grab uses FROM atdr.meo.ws/archiveteam/grab-base
-
EggplantN
ah yes
-
EggplantN
lemme update that
-
EggplantN
we're just debugging this now AK
-
EggplantN
please hold off
-
AK
No worries, happy for me to leave them running with watchtower? Or turn them off?
-
EggplantN
watchtower is good
-
arkiver
OrIdow6: why is it running? i thought we wanted to wait for zstd and all
-
arkiver
ah crap a day and a half
-
OrIdow6
arkiver: It was pointed out to me that it will likely be under a TB
-
arkiver
OrIdow6: please feel free to spam me next time :)
-
OrIdow6
arkiver: Will do :)
-
arkiver
i see the static user_id is still in, is this correct OrIdow6 ?
-
arkiver
223 in Lua
-
kiska
arkiver
-
kiska
:D
-
arkiver
hi
-
EggplantN
hey arkiver we've gone ahead and just started seems smallish dealing with bugs right now with OrIdow6
-
kiska
Would you like me to ping you once an hour for time critical things for this?
-
EggplantN
also you'll be glad to know some more people deserve to go to hell, arkiver
-
EggplantN
randomly returning 302/200 instead of 404's
-
arkiver
yeah TB is fine without ZSTD
-
Jake
-
Jake
OrIdow6 ^
-
OrIdow6
Will look
-
Jake
Thank you!
-
EggplantN
will need more items also soon OrIdow6
-
EggplantN
:D
-
arkiver
JAA: idea for your bot, how about it warns us for approaching deadlines on the deathwatch pages? :)
-
Jake
I'm not sure it's doing comment threads correctly?
-
kiska
Ping at least 2 months in advance
-
kiska
And once per day
-
OrIdow6
Jake: Do the other getwaybackkey requests work? I.e., is it just this one time it's failing for you, or is it doing this all the time?
-
OrIdow6
And what do you think the problem with comment threads is?
-
OrIdow6
Wow, that was fast
-
Jake
I'm seeing very few getwaybackkey requests, maybe actually only the ones with the lua error?
-
OrIdow6
It is nighttime in Japan, so they have excess capacity
-
Jake
maybe every 100th video id it's calling getwaybackkey?
-
OrIdow6
That's expected
-
Jake
well, they are all erroring out like that paste I posted above.
-
OrIdow6
Anyone else having this problem?
-
OrIdow6
EggplantN: How many items to a batch?
-
EggplantN
uh
-
EggplantN
1 mil?
-
OrIdow6
Ok
-
arkiver
OrIdow6: how many items in total?
-
OrIdow6
arkiver: 20-30 million
-
arkiver
guess we can add all of them
-
arkiver
take the upper range
-
arkiver
upper limit of the range
-
OrIdow6
That's what's going on right now
-
OrIdow6
They're not all sequential (there are other types besides "sm"), but most are
-
OrIdow6
vid:sm[N] where [N] is from 1 to 38400400
-
OrIdow6
So do you want one big file, or 39 small ones?
-
OrIdow6
Anyhow, some more nuanced discovery will be needed for non-SMs
-
arkiver
what are the non-SMs
-
arkiver
OrIdow6: whatever is easier for you, i'd keep it to ~1 million per file?
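
A rough Python sketch of splitting the sequential "sm" IDs into files of about 1 million items each. The `vid:sm[N]` format and the upper bound are from the messages above; the function names are hypothetical, and file naming and writing are left out.

```python
from typing import Iterator, List, Tuple


def item_names(first: int, last: int, prefix: str = "vid:sm") -> Iterator[str]:
    """Yield item names such as 'vid:sm1' for every ID in [first, last]."""
    for n in range(first, last + 1):
        yield f"{prefix}{n}"


def chunk_bounds(first: int, last: int, per_file: int) -> List[Tuple[int, int]]:
    """Inclusive (start, end) ID ranges, each covering at most per_file IDs."""
    return [
        (start, min(start + per_file - 1, last))
        for start in range(first, last + 1, per_file)
    ]


bounds = chunk_bounds(1, 38_400_400, 1_000_000)
print(len(bounds))  # 39
print(bounds[-1])   # (38000001, 38400400)

# Each (start, end) pair would feed item_names() to produce one file's lines.
print(list(item_names(1, 3)))  # ['vid:sm1', 'vid:sm2', 'vid:sm3']
```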
-
OrIdow6
Video IDs have prefixes, the most common type are "sm" (submitted by regular user)
-
OrIdow6
Ok
-
OrIdow6
The page on the AT wiki discussed it
-
EggplantN
also OrIdow6 yes I saw the bug Jake did
-
OrIdow6
Looks like it's something transient
-
Jake
I got it again as well, more full logs:
paste.ubuntu.com/p/42MtG6R3bX
-
OrIdow6
How often is this happening?
-
OrIdow6
And is it necessary that I generate all 38 million items, versus someone loading them into the tracker directly?
-
Jake
from my limited logs, every time it tries to do getwaybackkey
-
OrIdow6
650 MB of sequential numbers
-
OrIdow6
Then retrying isn't useful
-
OrIdow6
EggplantN: When this happens to you, does it happen all the time?
-
OrIdow6
Or just occasionally?
-
EggplantN
occasionally
-
OrIdow6
Should there be a channel for this?
-
OrIdow6
How about #niconino
-
arkiver
OrIdow6: please do generate all of them and put them in -items
-
arkiver
for logs
-
OrIdow6
-
OrIdow6
(I'm sure Github loves this kind of commit)
-
OrIdow6
Also, I see that 00 has not been marked added
-
OrIdow6
Anyhow, I seem to have a habit of starting "small" projects that spam up -bs, please come to #niconino
-
arkiver
we're cooperating with github and IA on #gitgud
-
arkiver
also
-
arkiver
feel free to ZSTD things :P
-
OrIdow6
Are static dicts fine in the future?
-
arkiver
no
-
arkiver
do we need dicts here?
-
OrIdow6
No
-
OrIdow6
Oh, I see what you mean, just for the compression savings
-
OrIdow6
Will do
-
OrIdow6
*just for the algorithm
-
OrIdow6
And sorry for just declaring that people should "come to #niconino", but I am going (going away from IRC, that is) here soon
-
OrIdow6
If you pick a more interesting channel name... just send me the logs
-
arkiver
queuing
-
EggplantN
i added 00_vid
-
OrIdow6
I mean moved to "added" in the items repo
-
OrIdow6
Though for all I know that's not something usual
-
arkiver
00_vid is added?
-
arkiver
ah ok
-
arkiver
adding everything
-
arkiver
nice speed EggplantN
-
EggplantN
>_>
-
arkiver
also HCross is gaining on you
-
arkiver
i think
-
EggplantN
oh this is a single Dual E5
-
EggplantN
and 1 /24
-
OrIdow6
Please reduce your rate
-
EggplantN
yes
-
OrIdow6
Site as a whole is having problems
-
OrIdow6
I don't want to have disrupted anyone's actual use of this site
-
taka
I think we should reduce the connection bandwidth to Niconico. They have blocked connections from foreign IPs in the past when we experienced a lot of foreign access.
-
kiska
Ah good lemme go and get some jp IPs
-
OrIdow6
Why the non-multi items?
-
arkiver
because of limits OrIdow6
-
OrIdow6
Oh
-
arkiver
10k items/min
-
wessel1512
what is the Niconico docker link
-
wessel1512
atdr.meo.ws/archiveteam/niconico-grab:latest isn't working
-
wessel1512
arkiver can you help
-
arkiver
kiska: does it need to be poked? ^
-
kiska
Sorry?
-
JAA
arkiver: Twitter scrape for ge.tt running now. Going back to 2010 since it appears that the domain was registered in October 2010.
-
wessel1512
the docker container cant be pulled
-
kiska
One moment lemme try it
-
kiska
-
wessel1512
it works here as well
-
wessel1512
got 407 couple of minutes ago
-
arkiver
JAA: thank you!
-
JAA
Older uploads on ge.tt have 7-char IDs. Also, I see lots of IDs with the same few trailing characters (note that kiska's examples from last night both end with UZB3). Don't think it's enough to make bruteforcing feasible though.
-
JAA
-
JAA
-
arkiver
that was fast...
-
arkiver
thanks!
-
JAA
parallel snscrape ftw. :-)
-
EggplantN
Nico limit now 25k
-
EggplantN
So what's next arkiver mediafire Google sites periscope or something off deathwatch :P
-
arkiver
periscope
-
arkiver
:)
-
kiska
arkiver: ge.tt first pls
-
kiska
Its terminating service on the 10 Marhc
-
kiska
March*
-
arkiver
#microscope for periscope btw
-
arkiver
yeah JAA gave the list
-
SCSi
meh is the docker image not setup for webs?
-
Craigle
SCSi: atdr.meo.ws/archiveteam/webs-grab
-
SCSi
gotcha, thx
-
Craigle
SCSi: Forgot to mention, the project channel is #webbed as well
-
SCSi
rgr
-
mgrandi
griddy.com is shutting down
-
mgrandi
(famous for: the texas weather apocalypse )
-
mgrandi
looks like from the logs (thanks JAA) that it was run on feb 18th
-
mgrandi
it looks like we didn't get their zendesk instance (
griddy.zendesk.com/hc/en-us ) though, i'll download their youtube
-
JAA
-
JAA
Zendesk's been annoying lately, but I'll try.
-
mgrandi
i can download their android app, maybe someone can attempt to grab their ios app?
-
mgrandi
i also don't see a link to their mobile apps so maybe we can throw
play.google.com/store/apps/details?id=com.app.griddy&hl=en_US&gl=US in ? i have no idea how to prevent that from infinitely recursing though
-
JAA
I've thrown
apkpure.com/griddy/com.app.griddy in, which includes past versions.
-
mgrandi
-
mgrandi
i was more thinking about the comments, it seems to only have a subset and then you have to scroll down / click a button to show more, i can just, right click -> save webpage as, unless there is a handy way to do this as a warc somewhere
-
JAA
Browser with warcprox, brozzler, crocoite.
-
mgrandi
is that at all easy to set up
-
JAA
Not really, no.
-
mgrandi
lol
-
mgrandi
is warcprox sufficient maybe?
-
mgrandi
i can just manually scroll down
-
JAA
Yeah, but you need to be careful not to also capture all your other browser traffic, which is virtually impossible due to all the crap that gets transmitted in the background these days.
-
mgrandi
luckily i have 5 browsers installed
-
mgrandi
i'll just configure only one to use the proxy
-
taka
Currently, Niconico is having difficulty viewing videos, probably due to the AT load, and they are performing emergency maintenance.
-
taka
-
taka
Can you make the load a little smaller?
-
Jake
EggplantN: ^^ or maybe someone else with tracker access?
-
EggplantN
LOL
-
EggplantN
sec
-
Jake
yup sorry ;)
-
EggplantN
halved
-
Jake
<3
-
susu
Hi all... I have launched a docker runner for webbed. I would like to check the crawl data. Is it possible to rsync a copy of the output folder (RSYNC_SRV/NICKNAME) back to my local filesystem to check the WARC, without making a mess?
-
susu
(I put it here because it may apply to other projects in the future)