-
mgrandi
So, is there any way I can get help with some ps store scraping stuff
-
mgrandi
I'm gonna have a crap ton of URLs and I don't think it's feasible to set up a warrior project this late
-
OrIdow6
What are you trying to do, and what's your deadline?
-
OrIdow6
I've mostly skipped over conversations about that, and now apparently it's moved to Discord anyway
-
mgrandi
Yeah, they are the experts there so that helped
-
mgrandi
so we are trying to scrape a list of content-ids (aka SKU identifiers) for each region in store.playstation.com
-
mgrandi
with each SKU we can create the full URL, so `UP3252-PCSE01475_00-JANDUSOFT0000001` for the en-us region becomes
store.playstation.com/en-us/product…475_00-JANDUSOFT0000001?smcid=psapp
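the mapping is basically just string formatting, something like this (rough sketch, region hardcoded to en-us):
```python
# rough sketch: turn a content-id into a store URL (assumes the /product/ path seen above)
def product_url(content_id: str, region: str = "en-us") -> str:
    return f"https://store.playstation.com/{region}/product/{content_id}?smcid=psapp"

print(product_url("UP3252-PCSE01475_00-JANDUSOFT0000001"))
```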
-
mgrandi
so we will just have a bunch of URLs we would like downloaded, and they say that the 'deadline' is the 28th at some time
-
mgrandi
so i think we have done the 'hard' part but now we just need the download
-
OrIdow6
How many? And are there any restrictions or potential problems? I remember age gates and IP address blocks being mentioned
-
mgrandi
yeah, they seem to IP ban people temporarily, but my script that i've been running to hit the JSON API seems to be fine with some sleeps
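the script is nothing fancy, basically this shape (sketch only, the API path here is a placeholder, not the real endpoint):
```python
# sleep between requests to stay under whatever triggers the temporary IP bans
import time
import requests

def fetch_all(content_ids, delay=2.0):
    results = {}
    for cid in content_ids:
        resp = requests.get(f"https://store.playstation.com/placeholder-json-api/{cid}")
        resp.raise_for_status()
        results[cid] = resp.json()
        time.sleep(delay)  # the polite pause
    return results
```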
-
mgrandi
is wget-at smart enough to de-duplicate a URL if it's a page requisite
-
mgrandi
or is wpull what i should be using? @JAA you might know since you know more about wpull
-
JAA
mgrandi: wpull definitely dedupes page requisites, and I think wget-at does as well.
-
mgrandi
i know it does, but there are these CSS files that are pretty big, and i don't want to waste time downloading them repeatedly
-
mgrandi
so i was asking if wpull or wget-at are smart enough to realize it has downloaded that URL already and skip it
-
JAA
If it's the same URL, it's only retrieved once.
-
mgrandi
wget-at has '--dedup-file-agnostic' but i dunno if that is for the list of files given to it as an input or also for page-requisites
-
JAA
You mean --warc-dedup-url-agnostic?
-
JAA
That is for deduping within the WARC, i.e. write revisit records instead of responses when they have the same content, even if the URLs differ.
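The second copy ends up in the WARC as something like this (abridged illustration of a revisit record, not literal wget-at output):
```
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://example.com/b/style.css
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Payload-Digest: sha1:...
```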
-
JAA
But every URL is only retrieved once. Certainly in wpull, 99.9 % sure in wget-at.
-
mgrandi
ok
-
mgrandi
cool, i'll leave those in then
-
mgrandi
i shouldn't have to craft a regex to ignore files like the JS and whatnot
-
OrIdow6
Yes, wget/wget-lua/wget-at should only get them once
-
OrIdow6
Be aware that crawling recursively or getting prerequisites does require it to parse the page (though I wouldn't be too surprised if it parsed it anyway)
-
OrIdow6
So if you have a million to do in an hour or something, that should be off
-
JAA
A million in an hour with wget/wpull? Yeah, good luck.
-
OrIdow6
Yeah, I was being hyperbolic, but you see what I mean
-
JAA
Yeah
-
JAA
Not entirely sure, but I don't think you can disable parsing.
-
JAA
One of the major reasons why I wrote qwarc.
-
mgrandi
the pages are pretty small, a full download of all of the assets is like 7 MB, and like 6 MB is just the stupid fonts/javascript stuff that i was asking about
-
JAA
That's not the point. The HTML parsing takes quite a significant amount of CPU time.
-
JAA
If you're highly rate-limited anyway, it probably won't matter though.
-
JAA
If you use wpull, make sure to select the libxml2 parser rather than the default pure-Python html5lib.
-
OrIdow6
Putting some printfs into wget, doesn't look like it parses it if you don't have page-requisites
-
OrIdow6
And/or recursive etc.
-
OrIdow6
But yeah, in any case, singular wget is not the best for speed over many small pages
-
JAA
get_urls_html is what invokes the parsing and extraction.
-
JAA
I believe wpull always parses though.
-
mgrandi
i can confirm that wget-at is downloading the same url over and over
-
OrIdow6
It shouldn't
-
OrIdow6
And I doubt it is
-
OrIdow6
Everything else being normal
-
mgrandi
well i just ran a grep and got several copies of the same url
-
mgrandi
unless it's downloading it and not storing it?
-
OrIdow6
Want to upload your logs?
-
mgrandi
yeah i have them, give me a bit
-
mgrandi
wget-at apparently loves memory
-
OrIdow6
yesyes
-
OrIdow6
*yes
-
mgrandi
i hope it's not like an infinite memory leak
-
OrIdow6
mgrandi: What options are you running this with? Something strange is happening here
-
mgrandi
i based it off the warrior projects
-
mgrandi
eeh thats not auto opening, let me gist it
-
mgrandi
i think the logs i checked didn't have --reject-regex, i added that because it was downloading the big CSS/font files every time
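(the pattern was roughly `--reject-regex '\.(css|woff2?|ttf|js)([?#]|$)'`, something along those lines anyway)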
-
OrIdow6
Hm, looks good, unless I'm skipping over something obvious
-
mgrandi
haha oh my god, even with 2 GB they are still dying within 10 minutes
-
mgrandi
i don't understand how i ran 4 workers with some of these warrior projects on 1 GB of RAM and this is causing it to run out of memory
-
OrIdow6
Well, I am successfully able to duplicate it here, so I'll try looking in a bit more
-
OrIdow6
I wonder if you've happened upon some way to mess up the queue, and that's why it's using so much memory as well
-
mgrandi
"--truncate-output is a Wget+Lua option. It makes --output-document into a temporary file option by downloading to the file, extract the URLs, and then set the temporary file to 0 bytes."
-
mgrandi
uhhhhhhhh
-
mgrandi
is it like keeping that whole file in memory if that is not specified? that was the only switch missing when i compared with
github.com/ArchiveTeam/tencent-weibo-grab/blob/master/pipeline.py
-
OrIdow6
No, they'll be on disk
-
OrIdow6
By the way, in case I have to drop out here, the thing that seems to be happening is that page requisites aren't being removed from the queue, so each normal page fetch fetches its own unique requisites as well as all past requisites (including ones that don't correctly belong to it)
-
mgrandi
so that is why the memory just keeps going up until it gets oom-killed?
-
OrIdow6
OK, so I've found out that whatever it is, it only seems to happen when you do --output-document - compare `wget --page-requisites --no-verbose "archiveteam.org/index.php?title=Angelfire" "archiveteam.org/index.php?title=ArchiveBot" --output-document=ot` with the same without --output-document
-
OrIdow6
Maybe this is some wget thing I don't know about
-
OrIdow6
Works with vanilla wget too, by the way
-
OrIdow6
If it's not intended... there are some obvious causes for why that could happen
-
OrIdow6
I wonder if it extracts URLs from the copy of the output document on disk, instead of in memory, and as it builds up, it parses & extracts everything concatenated there; consistent with the behavior of --output-document=/dev/null
-
OrIdow6
This is getting into #archiveteam-dev material
-
mgrandi
yeah. Either way, i think --truncate-output seems to have worked? it's not blowing up in memory anymore
-
OrIdow6
I think it should, if that's true
-
OrIdow6
Yes, it does
-
mgrandi
that is a pretty major bug or caveat lol
-
OrIdow6
Actually, is it now getting the page requisites only once? mgrandi
-
OrIdow6
For my test command, even with --truncate-output, it gets the ArchiveTeam logo on every page
-
OrIdow6
Even though they're no longer building up
-
OrIdow6
Almost convinced at this point that this is just because I don't have enough experience with wget
-
mgrandi
i can test in a bit
-
OrIdow6
Thing that works seems to do neither --output-document nor --truncate-output, but just --delete-after
-
mgrandi
basically, this has just reinforced my desire to get `wpull` working lol
-
OrIdow6
Haha
-
OrIdow6
If it's time-sensitive, might be better to use wget (if you aren't getting requisites multiple times, or if my fix works), however rickety it is
-
mgrandi
yeah
-
mgrandi
thats what i'm doing
-
mgrandi
and yeah, i think it's still getting duplicate URLs, i see this 'verifiedbyvisa.png' file repeatedly
-
JAA
mgrandi: Are you feeding in a list of URLs or recursing from one page? If the former, that might explain why you get duplicates with wget. I think it builds a new tree for each root URL and only doesn't retrieve duplicates within that. wpull, on the other hand, dedupes globally.
-
mgrandi
a list of URLs
-
mgrandi
thats what i figured yeah
-
mgrandi
still kinda mad over the --please-don't-leak-memory switch i apparently forgot >.>
-
JAA
Heh, yeah.
-
mgrandi
anyway, the playstation store "last minute project" is going fairly well all things considered, given the very last minute nature of it:
ethercalc.net/bq3ga1r7w59q
-
thuban
what even is the distinction between wpull and wget-lua/wget-at these days? all i can remember is wpull having phantomjs options, but god knows how much longer that'll work for anything
-
mgrandi
they are the same, pretty much
-
mgrandi
oh i thought you meant the diff between wget-lua/wget-at
-
thuban
i would also like to know that
-
mgrandi
for that, i think it's two names for the same project
-
mgrandi
but my understanding is that wget is 24-year-old software that not many people know the internals of, and lua support was bolted onto it, and it has... various errata
-
JAA
phuzion: wpull has --database, --warc-append, and a bunch of other things as well. wget-at has Lua hooks, ZSTD support, and soon other features.
-
JAA
Er
-
JAA
thuban: ^
-
» phuzion grumbles
-
phuzion
whatcha wakin me up for JAA?
-
phuzion
just messing
-
JAA
Oh no, I've awakened the Sheeple. :-|
-
mgrandi
but wpull being python means much faster turnaround time in theory, easier to run (no or little compilation needed), etc etc
-
mgrandi
zstd support shouldn't be that hard to add to wpull, as well as the same hooks that wget-at has
-
JAA
Yeah, unfortunately wpull is just barely holding together and working. It desperately needs a serious cleanup and partial rewrite.
-
JAA
wpull already has more hooks than wget-at I believe.
-
mgrandi
well luckily you have me, a python-starved engineer who isn't getting any python love since i started working at MS (although they do use python a lot, just not the project i'm working on)
-
JAA
:-)
-
JAA
How much do you hate asyncio yet?
-
thuban
i guess my real question is, do we have a good reason for maintaining two tools that do similar-but-slightly-different things, and if not, is there some idea of what to focus on in the future
-
mgrandi
wget-at is basically...not maintained at all? lol
-
thuban
oh, i had the impression that a fair amount of stuff had been done recently
-
thuban
but if not that is... also true of wpull, afaik
-
JAA
Yes, there has.
-
mgrandi
well, 8 commits this year, more than i expected actually
-
JAA
Work is ongoing at the moment as well.
-
mgrandi
wget-at has the advantage of being the de facto tool for warrior projects at the moment
-
JAA
One *major* advantage of wpull is --database. Not that relevant for small crawls, but for AB jobs with millions and millions of URLs in an item, it'd be infeasible to store the entire URL table in memory.
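I.e. something like `wpull --database crawl.db --input-file urls.txt --page-requisites ...` and the whole URL table lives in SQLite on disk instead of in RAM.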
-
mgrandi
but it's also the single reason why the warrior docker image can't be used for said warrior projects, 'cause you need to compile libzstd and it's too old a version of ubuntu and a a a a a
-
JAA
And adding that to wget-at would be a magnificent PITA.
-
mgrandi
yes, sqlite is a painfully underused technology
-
thuban
lol, where even is wget-at?
-
JAA
If wpull were more stable and reliable, we might use it in the DPoS projects. Alas, it isn't.
-
mgrandi
but yeah, the exact thing i was mentioning earlier: wget-at can't deduplicate URLs if using an input URL list, while wpull can save that URL in the database and be like "i already downloaded this" and tada
-
JAA
Whereas wget-at is rock-solid mostly.
-
mgrandi
and i don't hate asyncio, i'm using it for another project at the moment
-
thuban
oh, i was confused by the name change :(
-
JAA
Yeah, I think we wanted to rename the repo actually.
-
mgrandi
it's good at what it's meant for, and for one thing i had to make a ProcessExecutor to not deadlock the ExecutionLoop or whatever, because i'm calling into it using `ctypes` and therefore am unable to call PYTHON_THREAD_SAFE_1 or whatever
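(i mean concurrent.futures.ProcessPoolExecutor plus run_in_executor; toy version of the pattern, with a sleep standing in for my ctypes call:)
```python
# toy version: push a GIL-holding call off the event loop into another process
import asyncio
import time
from concurrent.futures import ProcessPoolExecutor

def slow_c_call(n):
    time.sleep(1)  # stand-in for a long C call that never releases the GIL
    return n * 2

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # work runs in separate processes, so the event loop keeps running
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, slow_c_call, n) for n in range(4))
        )
    print(results)  # [0, 2, 4, 6]

if __name__ == "__main__":
    asyncio.run(main())
```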
-
JAA
mgrandi: Then I guess you haven't worked enough with it yet. :-P
-
JAA
Some things are really unintuitive and messy.
-
mgrandi
oh the api is garbage, and the documentation is terrible lol
-
JAA
Especially network-related stuff.
-
mgrandi
ah, my stuff is all local for now
-
JAA
Ah, yeah, for that it's pretty great.
-
mgrandi
like, i think it works well enough once you realize you just make Tasks and then asyncio.gather() on them, but then it has all these terms that really don't matter
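like the whole useful surface for me is basically (trivial example):
```python
import asyncio

async def work(i):
    await asyncio.sleep(0.1)  # stand-in for real IO
    return i * i

async def main():
    tasks = [asyncio.create_task(work(i)) for i in range(3)]
    print(await asyncio.gather(*tasks))  # [0, 1, 4]

asyncio.run(main())
```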
-
mgrandi
network stuff should be its cup of tea too, since python can be free from blocking if it's only IO-bound
-
mgrandi
but that requires using something that can call PYTHON_THREAD_SAFE_1 in c code or whatever
-
thuban
is the problem with database support in wget-at just that the c sqlite api is low-level and annoying, or is there some more fundamental issue?
-
mgrandi
i think it's more the fact that wget is a 24-year-old piece of software that people bolted lua support onto
-
thuban
age is not, in and of itself, an architectural issue :P
-
mgrandi
and then to add sqlite support to that, but then how does that work, can you access the C APIs in lua? oh god
-
JAA
Yeah, you have to change a lot of the core data structures.
-
thuban
mm, makes sense
-
mgrandi
no, but it's more an issue with C code where it's using like... probably C89, and you get no nice features of modernish C
-
JAA
--database was one of the main reasons why wpull was originally developed as a drop-in replacement for wget, I believe.
-
JAA
I.e. it was deemed easier to reimplement the whole thing than add SQLite support into wget.
-
mgrandi
i'm not a fan of lua or C, and then trying to make it work when you are calling into lua and somehow having that work is pretty bonkers
-
JAA
Fun fact: the wget devs also don't like wget anymore, and a complete rewrite is in progress.
-
thuban
dang
-
mgrandi
i guess no one would notice; standard wget for downloading a single file has been pretty solid
-
mgrandi
I WANT A RSYNC REWRITE HOLY HELL
-
JAA
Oh yeah, it works great unless you push it to its limits.
-
JAA
Just don't try to change its behaviour, or you'll start to hate yourself.
-
mgrandi
please gaze upon the work i have to do to get rsync to work in windows:
-
JAA
The source code's a hot mess.
-
mgrandi
.\rsync --progress -args -e="..\..\cygnative1.2\cygnative plink" mgrandi@IPHERE:"/home/mgrandi/something" /cygdrive/c/Users/mgrandi/whyohgod
-
JAA
'in windows'
-
JAA
Found your problem.
-
» anelki twitches looking at that
-
mgrandi
rsync is also not great everywhere, there is no librsync, so you have to subprocess it and then consume standard out like a barbarian
-
kiska
I found your problem, you're not using WSL :P
-
JAA
Yeah, that one's been bothering me for years.
-
mgrandi
@kiska i cannot get rsync to work with SSH though, WSL doesn't have an ssh daemon or i can't get it to connect to the windows one
-
kiska
Sure it does :D
-
mgrandi
and then, i was consuming the output and it failed for a coworker, i looked, the homebrew version of it has a patch that added an extra character that broke my regex and i cried
-
thuban
i once had to develop in windows.
-
anelki
i once tried installing ruby on a friend's windows machine
-
mgrandi
i have no idea why ruby is so bad on windows
-
mgrandi
apparently its an ongoing thing, i guess python is just blessed with really solid windows support
-
mgrandi
@kiska i can't even add an ssh key, i just get "Could not open a connection to your authentication agent."
-
kiska
OOF?
-
JAA
Try using a proper OS?
-
JAA
:-)
-
thuban
the guy in charge wanted to create an application for a piece of ten-year-old hardware from a company that no longer existed. there were linux drivers, but when i tried them they bundled an outdated version of qt and installed it right over mine, breaking all kinds of things. having seen dll hell i probably would have kept going anyway after i got things cleaned up, but i never did get them to work
-
kiska
I mean works for me :D
-
thuban
later i discovered that the person who had been lead programmer before me had fixed all the "undefined reference" errors by typing #define. have i mentioned this application served a completely nonexistent market?
-
thuban
ask me whether i ever got that degree
-
anelki
did you?
-
thuban
i did not
-
» anelki is allergic to coin
-
mgrandi
i don't quite understand filecoin, like, just use a merkle tree
-
mgrandi
instead it is some thing to incentivize 'storing' of data by playing numberwang and getting funbucks or something, meh
-
mgrandi
i guess it's slightly better that it's a curated list, but i still feel like it could be done without the whole 'proof of work blockchain' funbux
-
JAA
The Filecoin team sucks though.
-
purplebot
List of websites excluded from the Wayback Machine edited by M.Barry (+29, + www.cyberciti.biz) just now --
archiveteam.org/?diff=45726&oldid=45668
-
mgrandi
i have not heard anything about them
-
anelki
i'd be curious about any insights you have there JAA
-
arkiver
JAA: yeah we should rename the repo
-
JAA
anelki: Well, for starters, they totally fucked up the launch and alienated virtually all the storage operators on the network. The issues go back years though, but I don't remember the details. Anyway, further discussion on Filecoin in -ot please.
-
arkiver
major update coming up later is proper FTP archiving support in Wget-AT
-
anelki
ope, sorry
-
JAA
arkiver: Will anything break if we rename it? E.g. build process, Docker images, etc. GitHub should redirect to the new name I think, so I guess it should be fine.
-
JAA
Perhaps we should throw the full repo into #gitgud before the rename also just in case.
-
arkiver
JAA: depends on if we change the URL
-
arkiver
I would like to change the actual location of the repo (the URL)
-
arkiver
definitely :)
-
JAA
Yeah, renaming the repo will change the URL.
-
arkiver
Wget-AT was actually one of my test cases for developing the github project
-
arkiver
it's been saved completely a few times :P
-
JAA
Heh, nice.
-
JAA
I'll throw it in.
-
purplebot
Deathwatch edited by JustAnotherArchivist (+151, /* 2020 */ Add Fast.io) 1 minute ago --
archiveteam.org/?diff=45728&oldid=45706
-
purplebot
Fast.io created by JustAnotherArchivist (+558, Basic page) just now --
archiveteam.org/?diff=45729&oldid=0
-
purplebot
Nagi created by JustAnotherArchivist (+1185, Basic page) just now --
archiveteam.org/?diff=45731&oldid=0
-
purplebot
Docker Hub created by JustAnotherArchivist (+1158, Basic page) just now --
archiveteam.org/?diff=45732&oldid=0
-
purplebot
NAVERまとめ created by JustAnotherArchivist (+593, Very basic page) just now --
archiveteam.org/?diff=45734&oldid=0
-
purplebot
NAVER Matome created by JustAnotherArchivist (+28, Redirected page to [[NAVERまとめ]]) just now --
archiveteam.org/?diff=45735&oldid=0
-
purplebot
Deathwatch edited by JustAnotherArchivist (+4, /* 2020 */ Link to Nagi page), JustAnotherArchivist (-29, /* 2020 */ Link to NAVERまとめ page) 19 minutes ago --
archiveteam.org/?diff=45733&oldid=45728
-
OrIdow6
JAA: AFAICT, wget dedupes globally when you output normally (mirror the remote directory structure in your own filesystem), but not when you do --output-document=/dev/null
-
arkiver
Wget-AT writes revisit records
-
OrIdow6
Oh, I see what you mean
-
OrIdow6
arkiver: We're talking about deduping in the URL queue, not in the warc output
-
OrIdow6
JAA: Never mind, see what you mean
-
mgrandi
Is it just me or are there way more high-profile things on deathwatch from today till the end of 2020 than normal
-
mgrandi
Chrome extensions, xda forums, playstation store, flash everything, yahoo groups, twitch sings clips D:
-
JAA
Pretty normal towards the end of the year.
-
JAA
In the last two months of 2019, we also had Apple, Google (twice), Yahoo (twice), and Intel in there.
-
JAA
Do we have any idea yet what we could do about the UK-owned .eu domains?
-
JAA
That will be a couple hundred thousand websites gone. :-|
-
OrIdow6
Two stages to that - identification and crawling
-
OrIdow6
Identification can (for lack of something better) be done by the various heuristics people can give
-
JAA
Yeah, the identification is the difficult part.
-
OrIdow6
*have given
-
OrIdow6
Crawling may need resources (and it's much more like a "wide crawl" than what AT usually does), but is straightforward
-
OrIdow6
I think it'll just have to be an educated guess
-
OrIdow6
At identification
-
JAA
Yeah
-
OrIdow6
Here's a general approach - first gather many features on all the .eu websites available, then try to figure out which ones are likely to be good indicators of being in the UK
-
OrIdow6
For the second step, e.g. if, of all those sites that list physical addresses, those that use a certain hosting provider (or cloud region or whatever) always have UK addresses, then it's a good indicator
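To be concrete, step 2 could look something like this (pure sketch, the signals and weights are made up for illustration):
```python
# crude "probably UK-owned" score from a few cheap signals
import re
from typing import Optional

UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

def uk_score(page_html: str, whois_country: Optional[str]) -> float:
    score = 0.0
    if whois_country == "GB":
        score += 0.6  # registrant country straight from WHOIS
    if UK_POSTCODE.search(page_html):
        score += 0.3  # UK-style postcode in a contact address
    if "+44" in page_html:
        score += 0.1  # UK phone country code
    return score
```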
-
arkiver
HCross: did we have a list of all sites from some tld? I forgot