00:22:11 So, is there any way I can get help with some PS Store scraping stuff
00:22:43 I'm gonna have a crap ton of URLs and I don't think it's feasible to set up a warrior project this late
00:23:35 What are you trying to do, and what's your deadline?
00:24:08 I've mostly skipped over conversations about that, and now apparently it's moved to Discord anyway
00:24:34 Yeah, they are the experts there so that helped
00:25:36 so we are trying to scrape a list of content-ids (aka SKU identifiers) for each region in store.playstation.com
00:28:00 with each SKU we can create the full URL, so `UP3252-PCSE01475_00-JANDUSOFT0000001` for the en-us region becomes https://store.playstation.com/en-us/product/UP3252-PCSE01475_00-JANDUSOFT0000001?smcid=psapp
00:28:28 so we will just be having a bunch of URLs we would like downloaded, and they say that the 'deadline' is the 28th at some time
00:30:32 so i think we have done the 'hard' part but now we just need the download
00:41:14 How many? And are there any restrictions or potential problems? I remember age gates and IP address blocks being mentioned
00:52:43 yeah, they seem to IP ban people temporarily, but my script that I've been running to hit the JSON API seems to be fine with some sleeps
01:17:22 is wget-at smart enough to de-duplicate a URL if it's a page requisite
01:17:56 or is wpull what i should be using? @JAA you might know since you know more about wpull
02:23:56 mgrandi: wpull definitely dedupes page requisites, and I think wget-at does as well.
02:24:17 i know it does, but there are these CSS files that are pretty big, and i don't want to waste time downloading them repeatedly
02:24:30 so i was asking if wpull or wget-at are smart enough to realize they have downloaded that URL already and skip it
02:24:40 If it's the same URL, it's only retrieved once.
02:24:55 wget-at has '--dedup-file-agnostic' but i dunno if that is for the list of files given to it as input or also for page requisites
02:25:47 You mean --warc-dedup-url-agnostic?
02:26:11 That is for deduping within the WARC, i.e. write revisit records instead of responses when they have the same content, even if the URLs differ.
02:26:33 But every URL is only retrieved once. Certainly in wpull, 99.9 % sure in wget-at.
02:28:18 ok
02:28:24 cool, i'll leave those in then
02:28:44 i shouldn't have to craft a regex to ignore files like the JS and whatnot
02:36:36 Yes, wget/wget-lua/wget-at should only get them once
02:37:28 Be aware that crawling recursively or getting prerequisites does require it to parse the page (though I wouldn't be too surprised if it parsed it anyway)
02:37:51 So if you have a million to do in an hour or something, that should be off
02:38:56 A million in an hour with wget/wpull? Yeah, good luck.
02:41:11 Yeah, I was being hyperbolic, but you see what I mean
02:41:32 Yeah
02:41:55 Not entirely sure, but I don't think you can disable parsing.
02:42:15 One of the major reasons why I wrote qwarc.
02:45:25 the pages are pretty small, a full download of all of the assets is like 7 MB, and like 6 MB is just the stupid fonts/javascript stuff that i was asking about
02:46:09 That's not the point. The HTML parsing takes a quite significant amount of CPU time.
02:46:28 If you're highly rate-limited anyway, it probably won't matter though.
02:47:10 If you use wpull, make sure to select the libxml2 parser rather than the default pure-Python html5lib.
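(For context, a minimal sketch of the URL construction and rate-limited fetching described above. The URL template follows the en-us example given in the log; the region list, SKU list, delay value, and `product_url` helper are illustrative assumptions, not the actual project script.)

```python
# Sketch only: build store.playstation.com product URLs from scraped
# content-ids/SKUs per region, and fetch them with conservative sleeps
# because of the temporary IP bans mentioned in the discussion.
import time
import urllib.request

REGIONS = ["en-us", "es-es"]                              # hypothetical region list
SKUS = ["UP3252-PCSE01475_00-JANDUSOFT0000001"]           # content-ids from the JSON API
DELAY_SECONDS = 5                                         # guess; tune to what the store tolerates

def product_url(region: str, sku: str) -> str:
    """Turn a region + content-id into a full PS Store product URL."""
    return f"https://store.playstation.com/{region}/product/{sku}?smcid=psapp"

for region in REGIONS:
    for sku in SKUS:
        url = product_url(region, sku)
        with urllib.request.urlopen(url) as resp:
            print(url, resp.status, len(resp.read()))
        time.sleep(DELAY_SECONDS)                         # be gentle; the store rate-limits
```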
03:00:07 Putting some printfs into wget, doesn't look like it parses it if you don't have page-requisites
03:00:23 And/or recursive etc.
03:00:56 But yeah, in any case, singular wget is not the best for speed over many small pages
03:02:25 Looks like you're right: https://github.com/ArchiveTeam/wget-lua/blob/09942221e1e550e6c8516e6193c602a338bdd9f4/src/recur.c#L485-L493
03:02:35 get_urls_html is what invokes the parsing and extraction.
03:04:12 I believe wpull always parses though.
05:41:05 i can confirm that wget-at is downloading the same URL over and over
05:48:21 It shouldn't
05:52:48 And I doubt it is
05:53:29 Everything else being normal
06:24:54 well i just ran a grep and got several copies of the same URL
06:25:02 unless it's downloading it and not storing it?
06:30:22 Want to upload your logs?
06:46:18 yeah i have them, give me a bit
06:51:12 wget-at apparently loves memory
07:00:15 yesyes
07:00:23 *yes
07:05:36 i hope it's not like an infinite memory leak
07:07:47 https://gist.github.com/mgrandi/e8a35077ae944c79cad28601970b4e59 @OrIdow6 , an example of a URL that keeps repeating is https://store.playstation.com/assets/vendor-e01bcb167174f8baf0eb82c68f7e3a62.js
07:13:20 mgrandi: What options are you running this with? Something strange is happening here
07:14:14 i based it off the warrior projects
07:14:28 http://161.35.231.94/wget_at_args_es-es.sh
07:14:46 eeh that's not auto opening, let me gist it
07:15:51 https://gist.github.com/mgrandi/3c2200c6435c33b21a18fd32cc1eb871
07:18:59 i think the logs i checked didn't have --reject-regex, i added that because it was downloading the big CSS/font files every time
07:20:24 Hm, looks good, unless I'm skipping over something obvious
07:25:30 haha oh my god, even with 2 GB they are still dying within 10 minutes
07:26:34 i don't understand how i ran 4 workers with some of these warrior projects on 1 GB of RAM and this is causing it to run out of memory
07:34:18 Well, I am successfully able to duplicate it here, so I'll try looking in a bit more
07:34:57 I wonder if you've happened upon some way to mess up the queue, and that's why it's using so much memory as well
07:36:20 "--truncate-output is a Wget+Lua option. It makes --output-document into a temporary file option by downloading to the file, extract the URLs, and then set the temporary file to 0 bytes."
07:36:23 uhhhhhhhh
07:37:02 is it like keeping that whole file in memory if that is not specified? that was the only switch missing when i compared with https://github.com/ArchiveTeam/tencent-weibo-grab/blob/master/pipeline.py
07:39:40 No, they'll be on disk
07:43:41 By the way, in case I have to drop out here, the thing that seems to be happening is that page requisites aren't being removed from the queue, so each normal page fetch fetches its own unique requisites as well as all past requisites (including ones that aren't actually its own)
07:45:51 so that is why the memory just keeps going up until it gets OOM-killed?
07:46:54 i built a fresh copy of wget-at like today: https://gist.github.com/mgrandi/785d98d07a5b8da23e8370c66c078fb3
08:01:30 OK, so I've found out that whatever it is, it only seems to happen when you do --output-document - compare `wget --page-requisites --no-verbose "https://www.archiveteam.org/index.php?title=Angelfire" "https://www.archiveteam.org/index.php?title=ArchiveBot" --output-document=ot` with the same without --output-document
08:01:45 Maybe this is some wget thing I don't know about
08:01:53 Works with vanilla wget too, by the way
08:02:44 If it's not intended... there are some obvious causes for why that could happen
08:07:25 I wonder if it extracts URLs from the copy of the output document on disk, instead of in memory, and as it builds up, it parses & extracts everything concatenated to there; that would be consistent with the behavior of --output-document=/dev/null
08:07:36 This is getting into #archiveteam-dev material
08:08:29 yeah. Either way, i think --truncate-output seems to have worked? it's not blowing up in memory anymore
08:09:22 I think it should, if that's true
08:09:44 Yes, it does
08:09:51 that is a pretty major bug or caveat lol
08:11:38 Actually, is it now getting the page requisites only once? mgrandi
08:12:07 For my test command, even with --truncate-output, it gets the ArchiveTeam logo on every page
08:12:27 Even though they're no longer building up
08:13:31 Almost convinced at this point that this is just because I don't have enough experience with wget
08:13:31 i can test in a bit
08:16:29 The thing that works seems to be using neither --output-document nor --truncate-output, but just --delete-after
08:19:26 basically, this has just reinforced my desire to get `wpull` working lol
08:19:46 Haha
08:21:31 If it's time-sensitive, might be better to use wget (if you aren't getting requisites multiple times, or if my fix works), however rickety it is
08:31:20 yeah
08:31:23 that's what i'm doing
08:33:48 and yeah, i think it's still getting duplicate URLs, i see this 'verifiedbyvisa.png' file repeatedly
13:12:36 mgrandi: Are you feeding in a list of URLs or recursing from one page? If the former, that might explain why you get duplicates with wget. I think it builds a new tree for each root URL and only doesn't retrieve duplicates within that. wpull, on the other hand, dedupes globally.
13:12:50 a list of URLs
13:13:01 that's what i figured yeah
13:13:15 still kinda mad over the --please-don't-leak-memory switch i apparently forgot >.>
13:13:53 Heh, yeah.
13:16:08 anyway, the playstation store "last minute project" is going fairly well all things considered, given the very last minute nature of it, https://ethercalc.net/bq3ga1r7w59q
13:16:33 what even is the distinction between wpull and wget-lua/wget-at these days? all i can remember is wpull having phantomjs options, but god knows how much longer that'll work for anything
13:16:53 they are the same, pretty much
13:17:04 oh i thought you meant the diff between wget-lua/wget-at
13:17:27 i would also like to know that
13:17:53 for that, i think it's two names for the same project
13:18:28 but my understanding is that wget is 24 year old software that not many people know the internals of, lua support was bolted onto it, and it has... various errata
13:18:39 phuzion: wpull has --database, --warc-append, and a bunch of other things as well. wget-at has Lua hooks, ZSTD support, and soon other features.
13:18:42 Er
13:18:44 thuban: ^
13:18:57 * phuzion grumbles
13:19:09 whatcha wakin me up for JAA?
13:19:17 just messing
13:19:28 Oh no, I've awoken the Sheeple. :-|
13:19:31 but wpull being python means much faster turnaround time in theory, easier to run (no or little compilation needed), etc etc
13:20:13 zstd support shouldn't be that hard to add to wpull, as well as the same amount of hooks that wget-at has
13:20:22 Yeah, unfortunately wpull is just barely holding together and working. It desperately needs a serious cleanup and partial rewrite.
13:20:36 wpull already has more hooks than wget-at I believe.
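(A rough sketch of the global-dedup idea discussed above, i.e. why wpull's --database avoids refetching a requisite seen on every product page. This is not wpull's actual schema or code, just an illustration of keeping a seen-URL table in on-disk SQLite instead of in memory; the table layout and helper names are assumptions.)

```python
# Illustration of global URL dedup backed by SQLite: any URL, whether it comes
# from the input list or is discovered as a page requisite, is queued at most
# once across the whole crawl, and the table lives on disk rather than in RAM.
import sqlite3

db = sqlite3.connect("crawl-urls.db")
db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, fetched INTEGER DEFAULT 0)")

def enqueue(url: str) -> bool:
    """Insert the URL if it is new; return True if it still needs fetching."""
    cur = db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1          # 0 means this URL was already seen

def mark_fetched(url: str) -> None:
    db.execute("UPDATE urls SET fetched = 1 WHERE url = ?", (url,))
    db.commit()

# A requisite referenced by thousands of product pages only gets queued once:
big_js = "https://store.playstation.com/assets/vendor-e01bcb167174f8baf0eb82c68f7e3a62.js"
for url in [big_js] * 3:
    if enqueue(url):
        # ... download it here ...
        mark_fetched(url)
```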
13:21:10 well luckily you have me, a python-starved engineer who isn't getting any python love since i started working at MS (although they do use python a lot, just not the project i'm working on)
13:21:42 :-)
13:21:48 How much do you hate asyncio yet?
13:21:50 i guess my real question is, do we have a good reason for maintaining two tools that do similar-but-slightly-different things, and if not, is there some idea of what to focus on in the future
13:22:03 wget-at is basically... not maintained at all? lol
13:22:30 oh, i had the impression that a fair amount of stuff had been done recently
13:22:40 but if not, that is... also true of wpull, afaik
13:22:49 Yes, there has.
13:22:51 well, 8 commits this year, more than i expected actually
13:23:02 Work is ongoing at the moment as well.
13:23:32 wget-at has the advantage of being the de facto tool for warrior projects at the moment
13:23:50 One *major* advantage of wpull is --database. Not that relevant for small crawls, but for AB jobs with millions and millions of URLs in an item, it'd be infeasible to store the entire URL table in memory.
13:24:07 but it's also the single reason why the warrior docker image can't be used for said warrior projects, cause you need to compile libzstd and it's too old a version of ubuntu and a a a a a
13:24:11 And adding that to wget-at would be a magnificent PITA.
13:24:31 yes, sqlite is a painfully underused technology
13:25:48 lol, where even is wget-at?
13:25:49 If wpull were more stable and reliable, we might use it in the DPoS projects. Alas, it isn't.
13:25:49 but yeah, the exact thing i was mentioning earlier: wget-at can't deduplicate URLs if using an input URL list, while wpull can save that URL in the database and be like "i already downloaded this" and tada
13:25:57 Whereas wget-at is rock-solid mostly.
13:26:03 thuban: https://github.com/ArchiveTeam/wget-lua
13:26:04 https://github.com/ArchiveTeam/wget-lua
13:26:21 and i don't hate asyncio, i'm using it for another project at the moment
13:26:27 oh, i was confused by the name change :(
13:26:56 Yeah, I think we wanted to rename the repo actually.
13:27:09 it's good at what it's meant for, and for one thing i had to make a ProcessExecutor to not deadlock the ExecutionLoop or whatever, because i'm calling into it using `ctypes` and therefore am unable to call PYTHON_THREAD_SAFE_1 or whatever
13:27:11 mgrandi: Then I guess you haven't worked enough with it yet. :-P
13:27:43 Some things are really unintuitive and messy.
13:27:52 oh the api is garbage, and the documentation is terrible lol
13:27:55 Especially network-related stuff.
13:28:03 ah, my stuff is all local for now
13:28:18 Ah, yeah, for that it's pretty great.
13:28:32 like, i think it works well enough once you realize you just make Tasks and then asyncio.gather() on them, but then it has all these terms that really don't matter
13:28:54 network stuff should be its cup of tea too, since python can be free from blocking if it's only IO bound
13:29:10 but that requires using something that can call PYTHON_THREAD_SAFE_1 in c code or whatever
13:29:18 is the problem with database support in wget-at just that the c sqlite api is low-level and annoying, or is there some more fundamental issue?
13:29:47 i think it's more the fact that wget is a 24 year old piece of software, onto which people bolted lua support
13:30:13 age is not, in and of itself, an architectural issue :P
13:30:15 and then to add sqlite support to that, but then how does that work, can you access the C APIs in lua? oh god
13:30:37 Yeah, you have to change a lot of the core data structures.
13:30:47 mm, makes sense
13:30:47 no, but it's more an issue with C code where it's using like... probably C89, and you get no nice features of modernish C
13:30:59 --database was one of the main reasons why wpull was originally developed as a drop-in replacement for wget, I believe.
13:31:13 I.e. it was deemed easier to reimplement the whole thing than add SQLite support into wget.
13:31:52 i'm not a fan of lua or c, and then trying to make it work when you are calling into lua and somehow having that work is pretty bonkers
13:31:54 Fun fact: the wget devs also don't like wget anymore, and a complete rewrite is in progress.
13:32:03 dang
13:32:20 i guess no one would notice; standard wget for downloading a single file has been pretty solid
13:32:30 I WANT A RSYNC REWRITE HOLY HELL
13:32:56 Oh yeah, it works great unless you push it to its limits.
13:33:10 Just don't try to change its behaviour, or you'll start to hate yourself.
13:33:19 please gaze upon the work i have to do to get rsync to work in windows:
13:33:20 The source code's a hot mess.
13:33:40 .\rsync --progress -args -e="..\..\cygnative1.2\cygnative plink" mgrandi@IPHERE:"/home/mgrandi/something" /cygdrive/c/Users/mgrandi/whyohgod
13:33:51 'in windows'
13:33:53 Found your problem.
13:34:11 * anelki twitches looking at that
13:34:59 rsync is also not great everywhere, there is no librsync, so you have to subprocess it and then consume standard out like a barbarian
13:35:00 I found your problem, you're not using WSL :P
13:35:28 Yeah, that one's been bothering me for years.
13:35:31 @kiska i cannot get rsync to work with SSH though, WSL doesn't have an ssh daemon, or i can't get it to connect to the windows one
13:35:45 Sure it does :D
13:36:04 and then, i was consuming the output and it failed for a coworker; i looked, and the homebrew version of it has a patch that added an extra character that broke my regex, and i cried
13:36:49 i once had to develop in windows.
13:37:15 i once tried installing ruby on a friend's windows machine
13:37:39 i have no idea why ruby is so bad on windows
13:37:59 apparently it's an ongoing thing, i guess python is just blessed with really solid windows support
13:38:08 @kiska i can't even add an ssh key, i just get "Could not open a connection to your authentication agent."
13:38:38 OOF?
13:40:09 Try using a proper OS?
13:40:15 :-)
13:40:18 the guy in charge wanted to create an application for a piece of ten-year-old hardware from a company that no longer existed. there were linux drivers, but when i tried them they bundled an outdated version of qt and installed it right over mine, breaking all kinds of things. having seen dll hell i probably would have kept going anyway after i got things cleaned up, but i never did get them to work
13:40:27 I mean works for me :D
13:40:45 later i discovered that the person who had been lead programmer before me had fixed all the "undefined reference" errors by typing #define. have i mentioned this application served a completely nonexistent market?
13:40:52 ask me whether i ever got that degree
13:41:14 did you?
13:41:26 i did not
13:42:52 did i miss the chatter on this? https://blog.archive.org/2020/10/22/what-information-should-we-be-preserving-in-filecoin/
13:43:22 * anelki is allergic to coin
13:43:23 i don't quite understand filecoin, like, just use a merkle tree
13:43:56 instead it is some thing to incentivize 'storing' of data by playing numberwang and getting funbucks or something, meh
13:46:14 i guess it's slightly better that it's a curated list, but i still feel like it could be done without the whole 'proof of work blockchain' funbux
13:48:05 The Filecoin team sucks though.
13:49:19 -purplebot- List of websites excluded from the Wayback Machine edited by M.Barry (+29, + www.cyberciti.biz) just now -- https://www.archiveteam.org/?diff=45726&oldid=45668
13:51:11 i have not heard anything about them
14:13:03 i'd be curious about any insights you have there JAA
14:19:51 JAA: yeah we should rename the repo
14:20:04 anelki: Well, for starters, they totally fucked up the launch and alienated virtually all the storage operators on the network. The issues go back years though, but I don't remember the details. Anyway, further discussion on Filecoin in -ot please.
14:20:24 major update coming up later is proper FTP archiving support in Wget-AT
14:20:55 ope, sorry
14:21:28 arkiver: Will anything break if we rename it? E.g. build process, Docker images, etc. GitHub should redirect to the new name I think, so I guess it should be fine.
14:22:09 Perhaps we should throw the full repo into #gitgud before the rename also, just in case.
14:22:57 JAA: depends on if we change the URL
14:23:13 I would like to change the actual location of the repo (the URL)
14:23:20 definitely :)
14:23:41 Yeah, renaming the repo will change the URL.
14:23:44 Wget-AT was actually one of my test cases for developing the github project
14:23:49 it's been saved completely a few times :P
14:23:58 Heh, nice.
14:24:05 I'll throw it in.
15:59:18 -purplebot- Deathwatch edited by JustAnotherArchivist (+151, /* 2020 */ Add Fast.io) 1 minute ago -- https://www.archiveteam.org/?diff=45728&oldid=45706
16:02:19 -purplebot- Fast.io created by JustAnotherArchivist (+558, Basic page) just now -- https://www.archiveteam.org/?diff=45729&oldid=0
16:09:18 -purplebot- Nagi created by JustAnotherArchivist (+1185, Basic page) just now -- https://www.archiveteam.org/?diff=45731&oldid=0
16:19:18 -purplebot- Docker Hub created by JustAnotherArchivist (+1158, Basic page) just now -- https://www.archiveteam.org/?diff=45732&oldid=0
16:26:19 -purplebot- NAVERまとめ created by JustAnotherArchivist (+593, Very basic page) just now -- https://www.archiveteam.org/?diff=45734&oldid=0
16:27:19 -purplebot- NAVER Matome created by JustAnotherArchivist (+28, Redirected page to [[NAVERまとめ]]) just now -- https://www.archiveteam.org/?diff=45735&oldid=0
16:40:19 -purplebot- Deathwatch edited by JustAnotherArchivist (+4, /* 2020 */ Link to Nagi page), JustAnotherArchivist (-29, /* 2020 */ Link to NAVERまとめ page) 19 minutes ago -- https://www.archiveteam.org/?diff=45733&oldid=45728
17:36:56 JAA: AFAICT, wget dedupes globally when you output normally (mirroring the remote directory structure in your own filesystem), but not when you do --output-file=/dev/null
17:41:20 Wget-AT writes revisit records
17:41:57 Oh, I see what you mean
17:42:18 arkiver: We're talking about deduping in the URL queue, not in the WARC output
17:43:22 JAA: Never mind, I see what you mean
22:50:40 Is it just me, or are there way more high profile things on deathwatch from today till the end of 2020 than normal
22:52:19 Chrome extensions, xda forums, playstation store, flash everything, yahoo groups, twitch sings clips D:
22:53:49 Pretty normal towards the end of the year.
22:55:33 In the last two months of 2019, we also had Apple, Google (twice), Yahoo (twice), and Intel in there.
22:56:35 Do we have any idea yet what we could do about the UK-owned .eu domains?
22:57:07 That will be a couple hundred thousand websites gone. :-|
23:00:27 Two stages to that - identification and crawling
23:00:44 Identification can (for lack of something better) be done by the various heuristics people have given
23:00:46 Yeah, the identification is the difficult part.
23:02:00 Crawling may need resources (and it's much more like a "wide crawl" than what AT usually does), but is straightforward
23:02:14 I think it'll just have to be an educated guess
23:02:26 At identification
23:04:40 Yeah
23:09:00 Here's a general approach - first gather many features on all the .eu websites available, then try to figure out which ones are likely to be good indicators of being in the UK
23:09:49 For the second step, e.g. if, of all those sites that list physical addresses, those that use a certain hosting provider (or cloud region or whatever) always have UK addresses, then it's a good indicator
23:09:51 HCross: did we have a list of all sites from some tld? I forgot
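(A rough sketch of the feature-gathering step proposed above for identifying UK-owned .eu sites. Nothing here is an agreed-on method: the signals, regexes, and the placeholder domain list are illustrative assumptions; a real run would take the full .eu zone list and feed the collected features into whatever heuristics people settle on.)

```python
# Sketch: for each .eu domain, fetch the front page and record crude
# "looks British" signals (UK-style postcode, +44 phone number, .co.uk links).
import re
import urllib.request

# Very loose UK postcode pattern, e.g. "SW1A 1AA"; good enough for a heuristic.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")
UK_PHONE = re.compile(r"\+44[\s\d]{9,13}")

def uk_signals(domain: str) -> dict:
    """Return crude UK-indicator features for one .eu site's front page."""
    try:
        with urllib.request.urlopen(f"http://{domain}/", timeout=20) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return {"domain": domain, "error": True}
    return {
        "domain": domain,
        "uk_postcode": bool(UK_POSTCODE.search(html)),
        "uk_phone": bool(UK_PHONE.search(html)),
        "links_couk": ".co.uk" in html,
    }

for d in ["example.eu"]:   # placeholder; the real input would be a full .eu domain list
    print(uk_signals(d))
```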