-
mgrandi
So, is there any way I can get help with some ps store scraping stuff
-
mgrandi
I'm gonna have a crap ton of URLs and I don't think it's feasible to set up a warrior project this late
-
OrIdow6
What are you trying to do, and what's your deadline?
-
OrIdow6
I've mostly skipped over conversations about that, and now apparently it's moved to Discord anyway
-
mgrandi
Yeah, they are the experts there so that helped
-
mgrandi
so we are trying to scrape a list of content-ids (aka SKU identifiers) for each region in store.playstation.com
-
mgrandi
with each SKU we can create the full URL, so `UP3252-PCSE01475_00-JANDUSOFT0000001` for the en-us region becomes
store.playstation.com/en-us/product…475_00-JANDUSOFT0000001?smcid=psapp
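the mapping is basically just string formatting, something like this (rough sketch, region hardcoded to en-us):
```python
# rough sketch: turn a content-id into a store URL (assumes the /product/ path seen above)
def product_url(content_id: str, region: str = "en-us") -> str:
    return f"https://store.playstation.com/{region}/product/{content_id}?smcid=psapp"

print(product_url("UP3252-PCSE01475_00-JANDUSOFT0000001"))
```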
-
mgrandi
so we will just have a bunch of URLs we would like downloaded, and they say that the 'deadline' is the 28th at some time
-
mgrandi
so i think we have done the 'hard' part but now we just need the download
-
OrIdow6
How many? And are there any restrictions or potential problems? I remember age gates and IP address blocks being mentioned
-
mgrandi
yeah, they seem to IP ban people temporarily, but my script that i've been running to hit the JSON API seems to be fine with some sleeps
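the script is nothing fancy, basically this shape (sketch only, the API path here is a placeholder, not the real endpoint):
```python
# sleep between requests to stay under whatever triggers the temporary IP bans
import time
import requests

def fetch_all(content_ids, delay=2.0):
    results = {}
    for cid in content_ids:
        resp = requests.get(f"https://store.playstation.com/placeholder-json-api/{cid}")
        resp.raise_for_status()
        results[cid] = resp.json()
        time.sleep(delay)  # the polite pause
    return results
```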
-
mgrandi
is wget-at smart enough to de-duplicate a URL if it's a page requisite
-
mgrandi
or is wpull what i should be using? @JAA you might know since you know more about wpull
-
JAA
mgrandi: wpull definitely dedupes page requisites, and I think wget-at does as well.
-
mgrandi
i know it does, but there are these CSS files that are pretty big, and i don't want to waste time downloading them repeatedly
-
mgrandi
so i was asking if wpull or wget-at are smart enough to realize it has downloaded that URL already and skip it
-
JAA
If it's the same URL, it's only retrieved once.
-
mgrandi
wget-at has '--dedup-file-agnostic' but i dunno if that is for the list of files given to it as an input or also for page-requisites
-
JAA
You mean --warc-dedup-url-agnostic?
-
JAA
That is for deduping within the WARC, i.e. write revisit records instead of responses when they have the same content, even if the URLs differ.
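The second copy ends up in the WARC as something like this (abridged illustration of a revisit record, not literal wget-at output):
```
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://example.com/b/style.css
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:...
```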
-
JAA
But every URL is only retrieved once. Certainly in wpull, 99.9 % sure in wget-at.
-
mgrandi
ok
-
mgrandi
cool, i'll leave those in then
-
mgrandi
i shouldn't have to craft a regex to ignore files like the JS and whatnot
-
OrIdow6
Yes, wget/wget-lua/wget-at should only get them once
-
OrIdow6
Be aware that crawling recursively or getting prerequisites does require it to parse the page (though I wouldn't be too surprised if it parsed it anyway)
-
OrIdow6
So if you have a million to do in an hour or something, that should be off
-
JAA
A million in an hour with wget/wpull? Yeah, good luck.
-
OrIdow6
Yeah, I was being hyperbolic, but you see what I mean
-
JAA
Yeah
-
JAA
Not entirely sure, but I don't think you can disable parsing.
-
JAA
One of the major reasons why I wrote qwarc.
-
mgrandi
the pages are pretty small, a full download of all of the assets is like 7 MB, and like 6 MB is just the stupid fonts/javascript stuff that i was asking about
-
JAA
That's not the point. The HTML parsing takes quite a significant amount of CPU time.
-
JAA
If you're highly rate-limited anyway, it probably won't matter though.
-
JAA
If you use wpull, make sure to select the libxml2 parser rather than the default pure-Python html5lib.
-
OrIdow6
Putting some printfs into wget, doesn't look like it parses it if you don't have page-requisites
-
OrIdow6
And/or recursive etc.
-
OrIdow6
But yeah, in any case, singular wget is not the best for speed over many small pages
-
JAA
get_urls_html is what invokes the parsing and extraction.
-
JAA
I believe wpull always parses though.
-
mgrandi
i can confirm that wget-at is downloading the same url over and over
-
OrIdow6
It shouldn't
-
OrIdow6
And I doubt it is
-
OrIdow6
Everything else being normal
-
mgrandi
well i just ran a grep and got several copies of the same url
-
mgrandi
unless it's downloading it and not storing it?
-
OrIdow6
Want to upload your logs?
-
mgrandi
yeah i have them, give me a bit
-
mgrandi
wget-at apparently loves memory
-
OrIdow6
yesyes
-
OrIdow6
*yes
-
mgrandi
i hope it's not like an infinite memory leak
-
OrIdow6
mgrandi: What options are you running this with? Something strange is happening here
-
mgrandi
i based it off the warrior projects
-
mgrandi
eeh thats not auto opening, let me gist it
-
mgrandi
i think the logs i checked didn't have --reject-regex, i added that because it was downloading the big CSS/font files every time
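(the pattern was roughly `--reject-regex '\.(css|woff2?|ttf|js)([?#]|$)'`, something along those lines anyway)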
-
OrIdow6
Hm, looks good, unless I'm skipping over something obvious
-
mgrandi
haha oh my god, even with 2 GB they are still dying within 10 minutes
-
mgrandi
i don't understand how i ran 4 workers with some of these warrior projects on 1 GB of RAM and this is causing it to run out of memory
-
OrIdow6
Well, I am successfully able to duplicate it here, so I'll try looking in a bit more
-
OrIdow6
I wonder if you've happened upon some way to mess up the queue, and that's why it's using so much memory as well
-
mgrandi
"--truncate-output is a Wget+Lua option. It makes --output-document into a temporary file option by downloading to the file, extract the URLs, and then set the temporary file to 0 bytes."
-
mgrandi
uhhhhhhhh
-
mgrandi
is it like keeping that whole file in memory if that is not specified? that was the only switch missing when i compared with
github.com/ArchiveTeam/tencent-weibo-grab/blob/master/pipeline.py
-
OrIdow6
No, they'll be on disk
-
OrIdow6
By the way, in case I have to drop out here, the thing that seems to be happening is that page requisites aren't being removed from the queue, so each normal page fetch fetches its own unique requisites as well as all past requisites (including ones that don't correctly belong to it)
-
mgrandi
so that is why the memory just keeps going up until it gets oom-killed?
-
OrIdow6
OK, so I've found out that whatever it is, it only seems to happen when you do --output-document - compare `wget --page-requisites --no-verbose "archiveteam.org/index.php?title=Angelfire" "archiveteam.org/index.php?title=ArchiveBot" --output-document=ot` with the same without --output-document
-
OrIdow6
Maybe this is some wget thing I don't know about
-
OrIdow6
Works with vanilla wget too, by the way
-
OrIdow6
If it's not intended... there are some obvious causes for why that could happen
-
OrIdow6
I wonder if it extracts URLs from the copy of the output document on disk, instead of in memory, and as it builds up, it parses & extracts everything concatenated there; consistent with the behavior of --output-document=/dev/null
-
OrIdow6
This is getting into #archiveteam-dev material
-
mgrandi
yeah. Either way, i think --truncate-output seems to have worked? it's not blowing up in memory anymore
-
OrIdow6
I think it should, if that's true
-
OrIdow6
Yes, it does
-
mgrandi
that is a pretty major bug or caveat lol
-
OrIdow6
Actually, is it now getting the page requisites only once? mgrandi
-
OrIdow6
For my test command, even with --truncate-output, it gets the ArchiveTeam logo on every page
-
OrIdow6
Even though they're no longer building up
-
OrIdow6
Almost convinced at this point that this is just because I don't have enough experience with wget
-
mgrandi
i can test in a bit
-
OrIdow6
Thing that works seems to do neither --output-document nor --truncate-output, but just --delete-after
-
mgrandi
basically, this has just reinforced my desire to get `wpull` working lol
-
OrIdow6
Haha
-
OrIdow6
If it's time-sensitive, might be better to use wget (if you aren't getting requisites multiple times, or if my fix works), however rickety it is
-
mgrandi
yeah
-
mgrandi
thats what i'm doing
-
mgrandi
and yeah, i think it's still getting duplicate URLs, i see this 'verifiedbyvisa.png' file repeatedly
-
JAA
mgrandi: Are you feeding in a list of URLs or recursing from one page? If the former, that might explain why you get duplicates with wget. I think it builds a new tree for each root URL and only doesn't retrieve duplicates within that. wpull, on the other hand, dedupes globally.
-
mgrandi
a list of URLs
-
mgrandi
thats what i figured yeah
-
mgrandi
still kinda mad over the --please-don't-leak-memory switch i apparently forgot >.>
-
JAA
Heh, yeah.
-
mgrandi
anyway, the playstation store "last minute project" is going fairly well all things considered, given the very last minute nature of it:
ethercalc.net/bq3ga1r7w59q
-
thuban
what even is the distinction between wpull and wget-lua/wget-at these days? all i can remember is wpull having phantomjs options, but god knows how much longer that'll work for anything
-
mgrandi
they are the same, pretty much
-
mgrandi
oh i thought you meant the diff between wget-lua/wget-at
-
thuban
i would also like to know that
-
mgrandi
for that, i think it's two names for the same project
-
mgrandi
but my understanding is that wget is 24-year-old software that not many people know the internals of, and lua support was bolted onto it, and it has... various errata
-
JAA
phuzion: wpull has --database, --warc-append, and a bunch of other things as well. wget-at has Lua hooks, ZSTD support, and soon other features.
-
JAA
Er
-
JAA
thuban: ^
-
» phuzion grumbles
-
phuzion
whatcha wakin me up for JAA?
-
phuzion
just messing
-
JAA
Oh no, I've awakened the Sheeple. :-|
-
mgrandi
but wpull being python means much faster turnaround time in theory, easier to run (no or little compilation needed), etc etc
-
mgrandi
zstd support shouldn't be that hard to add to wpull, as well as the same hooks that wget-at has
-
JAA
Yeah, unfortunately wpull is just barely holding together and working. It desperately needs a serious cleanup and partial rewrite.
-
JAA
wpull already has more hooks than wget-at I believe.
-
mgrandi
well luckily you have me, a python-starved engineer who isn't getting any python love since i started working at MS (although they do use python a lot, just not the project i'm working on)
-
JAA
:-)
-
JAA
How much do you hate asyncio yet?
-
thuban
i guess my real question is, do we have a good reason for maintaining two tools that do similar-but-slightly-different things, and if not, is there some idea of what to focus on in the future
-
mgrandi
wget-at is basically...not maintained at all? lol
-
thuban
oh, i had the impression that a fair amount of stuff had been done recently
-
thuban
but if not that is... also true of wpull, afaik
-
JAA
Yes, there has.
-
mgrandi
well, 8 commits this year, more than i expected actually
-
JAA
Work is ongoing at the moment as well.
-
mgrandi
wget-at has the advantage of being the de facto tool for warrior projects at the moment
-
JAA
One *major* advantage of wpull is --database. Not that relevant for small crawls, but for AB jobs with millions and millions of URLs in an item, it'd be infeasible to store the entire URL table in memory.
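I.e. something like `wpull --database crawl.db --input-file urls.txt --page-requisites ...` and the whole URL table lives in SQLite on disk instead of in RAM.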
-
mgrandi
but it's also the single reason why the warrior docker image can't be used for said warrior projects, 'cause you need to compile libzstd and it's too old a version of ubuntu and a a a a a
-
JAA
And adding that to wget-at would be a magnificent PITA.
-
mgrandi
yes, sqlite is a painfully underused technology
-
thuban
lol, where even is wget-at?
-
JAA
If wpull were more stable and reliable, we might use it in the DPoS projects. Alas, it isn't.
-
mgrandi
but yeah, the exact thing i was mentioning earlier: wget-at can't deduplicate URLs if using an input URL list, while wpull can save that URL in the database and be like "i already downloaded this" and tada
-
JAA
Whereas wget-at is rock-solid mostly.
-
mgrandi
and i don't hate asyncio, i'm using it for another project at the moment
-
thuban
oh, i was confused by the name change :(
-
JAA
Yeah, I think we wanted to rename the repo actually.
-
mgrandi
it's good at what it's meant for, and for one thing i had to make a ProcessExecutor to not deadlock the ExecutionLoop or whatever, because i'm calling into it using `ctypes` and therefore am unable to call PYTHON_THREAD_SAFE_1 or whatever
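(i mean concurrent.futures.ProcessPoolExecutor plus run_in_executor; toy version of the pattern, with a sleep standing in for my ctypes call:)
```python
# toy version: push a GIL-holding call off the event loop into another process
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def slow_c_call(n):
    time.sleep(1)  # stand-in for a long C call that never releases the GIL
    return n * 2

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # work runs in separate processes, so the event loop keeps running
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, slow_c_call, n) for n in range(4))
        )
    print(results)  # [0, 2, 4, 6]

if __name__ == "__main__":
    asyncio.run(main())
```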
-
JAA
mgrandi: Then I guess you haven't worked enough with it yet. :-P
-
JAA
Some things are really unintuitive and messy.
-
mgrandi
oh the api is garbage, and the documentation is terrible lol
-
JAA
Especially network-related stuff.
-
mgrandi
ah, my stuff is all local for now
-
JAA
Ah, yeah, for that it's pretty great.
-
mgrandi
like, i think it works well enough once you realize you just make Tasks and then asyncio.gather() on them, but then it has all these terms that really don't matter
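like the whole useful surface for me is basically (trivial example):
```python
import asyncio

async def work(i):
    await asyncio.sleep(0.1)  # stand-in for real IO
    return i * i

async def main():
    tasks = [asyncio.create_task(work(i)) for i in range(3)]
    print(await asyncio.gather(*tasks))  # [0, 1, 4]

asyncio.run(main())
```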
-
mgrandi
network stuff should be its cup of tea too, since python can be free from blocking if it's only IO-bound
-
mgrandi
but that requires using something that can call PYTHON_THREAD_SAFE_1 in c code or whatever
-
thuban
is the problem with database support in wget-at just that the c sqlite api is low-level and annoying, or is there some more fundamental issue?
-
mgrandi
i think it's more the fact that wget is a 24-year-old piece of software that people bolted lua support onto
-
thuban
age is not, in and of itself, an architectural issue :P
-
mgrandi
and then to add sqlite support to that, but then how does that work, can you access the C APIs in lua? oh god
-
JAA
Yeah, you have to change a lot of the core data structures.
-
thuban
mm, makes sense
-
mgrandi
no, but it's more an issue with C code where it's using like... probably C89, and you get no nice features of modernish C
-
JAA
--database was one of the main reasons why wpull was originally developed as a drop-in replacement for wget, I believe.
-
JAA
I.e. it was deemed easier to reimplement the whole thing than add SQLite support into wget.
-
mgrandi
i'm not a fan of lua or C, and then trying to make it work when you are calling into lua and somehow having that work is pretty bonkers
-
JAA
Fun fact: the wget devs also don't like wget anymore, and a complete rewrite is in progress.
-
thuban
dang
-
mgrandi
i guess no one would notice; standard wget for downloading a single file has been pretty solid
-
mgrandi
I WANT A RSYNC REWRITE HOLY HELL
-
JAA
Oh yeah, it works great unless you push it to its limits.
-
JAA
Just don't try to change its behaviour, or you'll start to hate yourself.
-
mgrandi
please gaze upon the work i have to do to get rsync to work in windows:
-
JAA
The source code's a hot mess.
-
mgrandi
.\rsync --progress -args -e="..\..\cygnative1.2\cygnative plink" mgrandi@IPHERE:"/home/mgrandi/something" /cygdrive/c/Users/mgrandi/whyohgod
-
JAA
'in windows'
-
JAA
Found your problem.
-
» anelki twitches looking at that
-
mgrandi
rsync is also not great everywhere, there is no librsync, so you have to subprocess it and then consume standard out like a barbarian
-
kiska
I found your problem, you're not using WSL :P
-
JAA
Yeah, that one's been bothering me for years.
-
mgrandi
@kiska i cannot get rsync to work with SSH though, WSL doesn't have an ssh daemon or i can't get it to connect to the windows one
-
kiska
Sure it does :D
-
mgrandi
and then, i was consuming the output and it failed for a coworker, i looked, the homebrew version of it has a patch that added an extra character that broke my regex and i cried
-
thuban
i once had to develop in windows.
-
anelki
i once tried installing ruby on a friend's windows machine
-
mgrandi
i have no idea why ruby is so bad on windows
-
mgrandi
apparently its an ongoing thing, i guess python is just blessed with really solid windows support
-
mgrandi
@kiska i can't even add an ssh key, i just get "Could not open a connection to your authentication agent."
-
kiska
OOF?
-
JAA
Try using a proper OS?
-
JAA
:-)
-
thuban
the guy in charge wanted to create an application for a piece of ten-year-old hardware from a company that no longer existed. there were linux drivers, but when i tried them they bundled an outdated version of qt and installed it right over mine, breaking all kinds of things. having seen dll hell i probably would have kept going anyway after i got things cleaned up, but i never did get them to work
-
kiska
I mean works for me :D
-
thuban
later i discovered that the person who had been lead programmer before me had fixed all the "undefined reference" errors by typing #define. have i mentioned this application served a completely nonexistent market?
-
thuban
ask me whether i ever got that degree
-
anelki
did you?
-
thuban
i did not
-
» anelki is allergic to coin
-
mgrandi
i don't quite understand filecoin, like, just use a merkle tree
-
mgrandi
instead it is some thing to incentivize 'storing' of data by playing numberwang and getting funbucks or something, meh
-
mgrandi
i guess it's slightly better that it's a curated list, but i still feel like it could be done without the whole 'proof of work blockchain' funbux
-
JAA
The Filecoin team sucks though.
-
purplebot
List of websites excluded from the Wayback Machine edited by M.Barry (+29, + www.cyberciti.biz) just now --
archiveteam.org/?diff=45726&oldid=45668
-
mgrandi
i have not heard anything about them
-
anelki
i'd be curious about any insights you have there JAA
-
arkiver
JAA: yeah we should rename the repo
-
JAA
anelki: Well, for starters, they totally fucked up the launch and alienated virtually all the storage operators on the network. The issues go back years though, but I don't remember the details. Anyway, further discussion on Filecoin in -ot please.
-
arkiver
major update coming up later is proper FTP archiving support in Wget-AT
-
anelki
ope, sorry
-
JAA
arkiver: Will anything break if we rename it? E.g. build process, Docker images, etc. GitHub should redirect to the new name I think, so I guess it should be fine.
-
JAA
Perhaps we should throw the full repo into #gitgud before the rename also just in case.
-
arkiver
JAA: depends on if we change the URL
-
arkiver
I would like to change the actual location of the repo (the URL)
-
arkiver
definitely :)
-
JAA
Yeah, renaming the repo will change the URL.
-
arkiver
Wget-AT was actually one of my test cases for developing the github project
-
arkiver
it's been saved completely a few times :P
-
JAA
Heh, nice.
-
JAA
I'll throw it in.
-
purplebot
Deathwatch edited by JustAnotherArchivist (+151, /* 2020 */ Add Fast.io) 1 minute ago --
archiveteam.org/?diff=45728&oldid=45706
-
purplebot
Fast.io created by JustAnotherArchivist (+558, Basic page) just now --
archiveteam.org/?diff=45729&oldid=0
-
purplebot
Nagi created by JustAnotherArchivist (+1185, Basic page) just now --
archiveteam.org/?diff=45731&oldid=0
-
purplebot
Docker Hub created by JustAnotherArchivist (+1158, Basic page) just now --
archiveteam.org/?diff=45732&oldid=0
-
purplebot
NAVERまとめ created by JustAnotherArchivist (+593, Very basic page) just now --
archiveteam.org/?diff=45734&oldid=0
-
purplebot
NAVER Matome created by JustAnotherArchivist (+28, Redirected page to [[NAVERまとめ]]) just now --
archiveteam.org/?diff=45735&oldid=0
-
purplebot
Deathwatch edited by JustAnotherArchivist (+4, /* 2020 */ Link to Nagi page), JustAnotherArchivist (-29, /* 2020 */ Link to NAVERまとめ page) 19 minutes ago --
archiveteam.org/?diff=45733&oldid=45728
-
OrIdow6
JAA: AFAICT, wget dedupes globally when you output normally (mirror the remote directory structure in your own filesystem), but not when you do --output-document=/dev/null
-
arkiver
Wget-AT writes revisit records
-
OrIdow6
Oh, I see what you mean
-
OrIdow6
arkiver: We're talking about deduping in the URL queue, not in the warc output
-
OrIdow6
JAA: Never mind, see what you mean
-
mgrandi
Is it just me or are there way more high-profile things on deathwatch from today till the end of 2020 than normal
-
mgrandi
Chrome extensions, xda forums, playstation store, flash everything, yahoo groups, twitch sings clips D:
-
JAA
Pretty normal towards the end of the year.
-
JAA
In the last two months of 2019, we also had Apple, Google (twice), Yahoo (twice), and Intel in there.
-
JAA
Do we have any idea yet what we could do about the UK-owned .eu domains?
-
JAA
That will be a couple hundred thousand websites gone. :-|
-
OrIdow6
Two stages to that - identification and crawling
-
OrIdow6
Identification can (for lack of something better) be done by the various heuristics people can give
-
JAA
Yeah, the identification is the difficult part.
-
OrIdow6
*have given
-
OrIdow6
Crawling may need resources (and it's much more like a "wide crawl" than what AT usually does), but is straightforward
-
OrIdow6
I think it'll just have to be an educated guess
-
OrIdow6
At identification
-
JAA
Yeah
-
OrIdow6
Here's a general approach - first gather many features on all the .eu websites available, then try to figure out which ones are likely to be good indicators of being in the UK
-
OrIdow6
For the second step, e.g. if, of all those sites that list physical addresses, those that use a certain hosting provider (or cloud region or whatever) always have UK addresses, then it's a good indicator
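To be concrete, step 2 could look something like this (pure sketch, the signals and weights are made up for illustration):
```python
# crude "probably UK-owned" score from a few cheap signals
import re
from typing import Optional

UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

def uk_score(page_html: str, whois_country: Optional[str]) -> float:
    score = 0.0
    if whois_country == "GB":
        score += 0.6  # registrant country straight from WHOIS
    if UK_POSTCODE.search(page_html):
        score += 0.3  # UK-style postcode in a contact address
    if "+44" in page_html:
        score += 0.1  # UK phone country code
    return score
```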
-
arkiver
HCross: did we have a list of all sites from some tld? I forgot