00:22:11 So, is there any way I can get help with some PS Store scraping stuff
00:22:43 I'm gonna have a crap ton of URLs and I don't think it's feasible to set up a warrior project this late
00:23:35 What are you trying to do, and what's your deadline?
00:24:08 I've mostly skipped over conversations about that, and now apparently it's moved to Discord anyway
00:24:34 Yeah, they are the experts there so that helped
00:25:36 so we are trying to scrape a list of content-ids (aka SKU identifiers) for each region in store.playstation.com
00:28:00 with each SKU we can create the full URL, so `UP3252-PCSE01475_00-JANDUSOFT0000001` for the en-us region becomes https://store.playstation.com/en-us/product/UP3252-PCSE01475_00-JANDUSOFT0000001?smcid=psapp
00:28:28 so we will just be having a bunch of URLs we would like downloaded, and they say that the 'deadline' is the 28th at some time
00:30:32 so i think we have done the 'hard' part but now we just need the download
00:41:14 How many? And are there any restrictions or potential problems? I remember age gates and IP address blocks being mentioned
00:52:43 yeah, they seem to IP ban people temporarily, but my script that I've been running to hit the JSON API seems to be fine with some sleeps
01:17:22 is wget-at smart enough to de-duplicate a URL if it's a page requisite
01:17:56 or is wpull what i should be using? @JAA you might know since you know more about wpull
02:23:56 mgrandi: wpull definitely dedupes page requisites, and I think wget-at does as well.
02:24:17 i know it does, but there are these CSS files that are pretty big, and i don't want to waste time downloading them repeatedly
02:24:30 so i was asking if wpull or wget-at are smart enough to realize they have downloaded that URL already and skip it
02:24:40 If it's the same URL, it's only retrieved once.
02:24:55 wget-at has '--dedup-file-agnostic' but i dunno if that is for the list of files given to it as input or also for page requisites
02:25:47 You mean --warc-dedup-url-agnostic?
02:26:11 That is for deduping within the WARC, i.e. write revisit records instead of responses when they have the same content, even if the URLs differ.
02:26:33 But every URL is only retrieved once. Certainly in wpull, 99.9 % sure in wget-at.
02:28:18 ok
02:28:24 cool, i'll leave those in then
02:28:44 i shouldn't have to craft a regex to ignore files like the JS and whatnot
02:36:36 Yes, wget/wget-lua/wget-at should only get them once
02:37:28 Be aware that crawling recursively or getting prerequisites does require it to parse the page (though I wouldn't be too surprised if it parsed it anyway)
02:37:51 So if you have a million to do in an hour or something, that should be off
02:38:56 A million in an hour with wget/wpull? Yeah, good luck.
02:41:11 Yeah, I was being hyperbolic, but you see what I mean
02:41:32 Yeah
02:41:55 Not entirely sure, but I don't think you can disable parsing.
02:42:15 One of the major reasons why I wrote qwarc.
02:45:25 the pages are pretty small, a full download of all of the assets is like 7 MB, and like 6 MB is just the stupid fonts/javascript stuff that i was asking about
02:46:09 That's not the point. The HTML parsing takes a quite significant amount of CPU time.
02:46:28 If you're highly rate-limited anyway, it probably won't matter though.
02:47:10 If you use wpull, make sure to select the libxml2 parser rather than the default pure-Python html5lib.
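(For context, a minimal sketch of the URL construction and rate-limited fetching described above. The URL template follows the en-us example given in the log; the region list, SKU list, delay value, and `product_url` helper are illustrative assumptions, not the actual project script.)

```python
# Sketch only: build store.playstation.com product URLs from scraped
# content-ids/SKUs per region, and fetch them with conservative sleeps
# because of the temporary IP bans mentioned in the discussion.
import time
import urllib.request

REGIONS = ["en-us", "es-es"]                              # hypothetical region list
SKUS = ["UP3252-PCSE01475_00-JANDUSOFT0000001"]           # content-ids from the JSON API
DELAY_SECONDS = 5                                         # guess; tune to what the store tolerates

def product_url(region: str, sku: str) -> str:
    """Turn a region + content-id into a full PS Store product URL."""
    return f"https://store.playstation.com/{region}/product/{sku}?smcid=psapp"

for region in REGIONS:
    for sku in SKUS:
        url = product_url(region, sku)
        with urllib.request.urlopen(url) as resp:
            print(url, resp.status, len(resp.read()))
        time.sleep(DELAY_SECONDS)                         # be gentle; the store rate-limits
```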
03:00:07 Putting some printfs into wget, doesn't look like it parses it if you don't have page-requisites
03:00:23 And/or recursive etc.
03:00:56 But yeah, in any case, singular wget is not the best for speed over many small pages
03:02:25 Looks like you're right: https://github.com/ArchiveTeam/wget-lua/blob/09942221e1e550e6c8516e6193c602a338bdd9f4/src/recur.c#L485-L493
03:02:35 get_urls_html is what invokes the parsing and extraction.
03:04:12 I believe wpull always parses though.
05:41:05 i can confirm that wget-at is downloading the same URL over and over
05:48:21 It shouldn't
05:52:48 And I doubt it is
05:53:29 Everything else being normal
06:24:54 well i just ran a grep and got several copies of the same URL
06:25:02 unless it's downloading it and not storing it?
06:30:22 Want to upload your logs?
06:46:18 yeah i have them, give me a bit
06:51:12 wget-at apparently loves memory
07:00:15 yesyes
07:00:23 *yes
07:05:36 i hope it's not like an infinite memory leak
07:07:47 https://gist.github.com/mgrandi/e8a35077ae944c79cad28601970b4e59 @OrIdow6 , an example of a URL that keeps repeating is https://store.playstation.com/assets/vendor-e01bcb167174f8baf0eb82c68f7e3a62.js
07:13:20 mgrandi: What options are you running this with? Something strange is happening here
07:14:14 i based it off the warrior projects
07:14:28 http://161.35.231.94/wget_at_args_es-es.sh
07:14:46 eeh that's not auto opening, let me gist it
07:15:51 https://gist.github.com/mgrandi/3c2200c6435c33b21a18fd32cc1eb871
07:18:59 i think the logs i checked didn't have --reject-regex, i added that because it was downloading the big CSS/font files every time
07:20:24 Hm, looks good, unless I'm skipping over something obvious
07:25:30 haha oh my god, even with 2 GB they are still dying within 10 minutes
07:26:34 i don't understand how i ran 4 workers with some of these warrior projects on 1 GB of RAM and this is causing it to run out of memory
07:34:18 Well, I am successfully able to duplicate it here, so I'll try looking in a bit more
07:34:57 I wonder if you've happened upon some way to mess up the queue, and that's why it's using so much memory as well
07:36:20 "--truncate-output is a Wget+Lua option. It makes --output-document into a temporary file option by downloading to the file, extract the URLs, and then set the temporary file to 0 bytes."
07:36:23 uhhhhhhhh
07:37:02 is it like keeping that whole file in memory if that is not specified? that was the only switch missing when i compared with https://github.com/ArchiveTeam/tencent-weibo-grab/blob/master/pipeline.py
07:39:40 No, they'll be on disk
07:43:41 By the way, in case I have to drop out here, the thing that seems to be happening is that page requisites aren't being removed from the queue, so each normal page fetch fetches its own unique requisites as well as all past requisites (including ones that aren't actually its own)
07:45:51 so that is why the memory just keeps going up until it gets OOM-killed?
07:46:54 i built a fresh copy of wget-at like today: https://gist.github.com/mgrandi/785d98d07a5b8da23e8370c66c078fb3
08:01:30 OK, so I've found out that whatever it is, it only seems to happen when you do --output-document - compare `wget --page-requisites --no-verbose "https://www.archiveteam.org/index.php?title=Angelfire" "https://www.archiveteam.org/index.php?title=ArchiveBot" --output-document=ot` with the same without --output-document
08:01:45 Maybe this is some wget thing I don't know about
08:01:53 Works with vanilla wget too, by the way
08:02:44 If it's not intended... there are some obvious causes for why that could happen
08:07:25 I wonder if it extracts URLs from the copy of the output document on disk, instead of in memory, and as it builds up, it parses & extracts everything concatenated to there; that would be consistent with the behavior of --output-document=/dev/null
08:07:36 This is getting into #archiveteam-dev material
08:08:29 yeah. Either way, i think --truncate-output seems to have worked? it's not blowing up in memory anymore
08:09:22 I think it should, if that's true
08:09:44 Yes, it does
08:09:51 that is a pretty major bug or caveat lol
08:11:38 Actually, is it now getting the page requisites only once? mgrandi
08:12:07 For my test command, even with --truncate-output, it gets the ArchiveTeam logo on every page
08:12:27 Even though they're no longer building up
08:13:31 Almost convinced at this point that this is just because I don't have enough experience with wget
08:13:31 i can test in a bit
08:16:29 The thing that works seems to be using neither --output-document nor --truncate-output, but just --delete-after
08:19:26 basically, this has just reinforced my desire to get `wpull` working lol
08:19:46 Haha
08:21:31 If it's time-sensitive, might be better to use wget (if you aren't getting requisites multiple times, or if my fix works), however rickety it is
08:31:20 yeah
08:31:23 that's what i'm doing
08:33:48 and yeah, i think it's still getting duplicate URLs, i see this 'verifiedbyvisa.png' file repeatedly
13:12:36 mgrandi: Are you feeding in a list of URLs or recursing from one page? If the former, that might explain why you get duplicates with wget. I think it builds a new tree for each root URL and only doesn't retrieve duplicates within that. wpull, on the other hand, dedupes globally.
13:12:50 a list of URLs
13:13:01 that's what i figured yeah
13:13:15 still kinda mad over the --please-don't-leak-memory switch i apparently forgot >.>
13:13:53 Heh, yeah.
13:16:08 anyway, the playstation store "last minute project" is going fairly well all things considered, given the very last minute nature of it, https://ethercalc.net/bq3ga1r7w59q
13:16:33 what even is the distinction between wpull and wget-lua/wget-at these days? all i can remember is wpull having phantomjs options, but god knows how much longer that'll work for anything
13:16:53 they are the same, pretty much
13:17:04 oh i thought you meant the diff between wget-lua/wget-at
13:17:27 i would also like to know that
13:17:53 for that, i think it's two names for the same project
13:18:28 but my understanding is that wget is 24 year old software that not many people know the internals of, lua support was bolted onto it, and it has... various errata
13:18:39 phuzion: wpull has --database, --warc-append, and a bunch of other things as well. wget-at has Lua hooks, ZSTD support, and soon other features.
13:18:42 Er
13:18:44 thuban: ^
13:18:57 * phuzion grumbles
13:19:09 whatcha wakin me up for JAA?
13:19:17 just messing
13:19:28 Oh no, I've awoken the Sheeple. :-|
13:19:31 but wpull being python means much faster turnaround time in theory, easier to run (no or little compilation needed), etc etc
13:20:13 zstd support shouldn't be that hard to add to wpull, as well as the same amount of hooks that wget-at has
13:20:22 Yeah, unfortunately wpull is just barely holding together and working. It desperately needs a serious cleanup and partial rewrite.
13:20:36 wpull already has more hooks than wget-at I believe.
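(A rough sketch of the global-dedup idea discussed above, i.e. why wpull's --database avoids refetching a requisite seen on every product page. This is not wpull's actual schema or code, just an illustration of keeping a seen-URL table in on-disk SQLite instead of in memory; the table layout and helper names are assumptions.)

```python
# Illustration of global URL dedup backed by SQLite: any URL, whether it comes
# from the input list or is discovered as a page requisite, is queued at most
# once across the whole crawl, and the table lives on disk rather than in RAM.
import sqlite3

db = sqlite3.connect("crawl-urls.db")
db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, fetched INTEGER DEFAULT 0)")

def enqueue(url: str) -> bool:
    """Insert the URL if it is new; return True if it still needs fetching."""
    cur = db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1          # 0 means this URL was already seen

def mark_fetched(url: str) -> None:
    db.execute("UPDATE urls SET fetched = 1 WHERE url = ?", (url,))
    db.commit()

# A requisite referenced by thousands of product pages only gets queued once:
big_js = "https://store.playstation.com/assets/vendor-e01bcb167174f8baf0eb82c68f7e3a62.js"
for url in [big_js] * 3:
    if enqueue(url):
        # ... download it here ...
        mark_fetched(url)
```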
13:21:10 well luckily you have me, a python-starved engineer who isn't getting any python love since i started working at MS (although they do use python a lot, just not the project i'm working on)
13:21:42 :-)
13:21:48 How much do you hate asyncio yet?
13:21:50 i guess my real question is, do we have a good reason for maintaining two tools that do similar-but-slightly-different things, and if not, is there some idea of what to focus on in the future
13:22:03 wget-at is basically... not maintained at all? lol
13:22:30 oh, i had the impression that a fair amount of stuff had been done recently
13:22:40 but if not, that is... also true of wpull, afaik
13:22:49 Yes, there has.
13:22:51 well, 8 commits this year, more than i expected actually
13:23:02 Work is ongoing at the moment as well.
13:23:32 wget-at has the advantage of being the de facto tool for warrior projects at the moment
13:23:50 One *major* advantage of wpull is --database. Not that relevant for small crawls, but for AB jobs with millions and millions of URLs in an item, it'd be infeasible to store the entire URL table in memory.
13:24:07 but it's also the single reason why the warrior docker image can't be used for said warrior projects, cause you need to compile libzstd and it's too old a version of ubuntu and a a a a a
13:24:11 And adding that to wget-at would be a magnificent PITA.
13:24:31 yes, sqlite is a painfully underused technology
13:25:48 lol, where even is wget-at?
13:25:49 If wpull were more stable and reliable, we might use it in the DPoS projects. Alas, it isn't.
13:25:49 but yeah, the exact thing i was mentioning earlier: wget-at can't deduplicate URLs if using an input URL list, while wpull can save that URL in the database and be like "i already downloaded this" and tada
13:25:57 Whereas wget-at is rock-solid mostly.
13:26:03 thuban: https://github.com/ArchiveTeam/wget-lua
13:26:04 https://github.com/ArchiveTeam/wget-lua
13:26:21 and i don't hate asyncio, i'm using it for another project at the moment
13:26:27 oh, i was confused by the name change :(
13:26:56 Yeah, I think we wanted to rename the repo actually.
13:27:09 it's good at what it's meant for, and for one thing i had to make a ProcessExecutor to not deadlock the ExecutionLoop or whatever, because i'm calling into it using `ctypes` and therefore am unable to call PYTHON_THREAD_SAFE_1 or whatever
13:27:11 mgrandi: Then I guess you haven't worked enough with it yet. :-P
13:27:43 Some things are really unintuitive and messy.
13:27:52 oh the api is garbage, and the documentation is terrible lol
13:27:55 Especially network-related stuff.
13:28:03 ah, my stuff is all local for now
13:28:18 Ah, yeah, for that it's pretty great.
13:28:32 like, i think it works well enough once you realize you just make Tasks and then asyncio.gather() on them, but then it has all these terms that really don't matter
13:28:54 network stuff should be its cup of tea too, since python can be free from blocking if it's only IO bound
13:29:10 but that requires using something that can call PYTHON_THREAD_SAFE_1 in c code or whatever
13:29:18 is the problem with database support in wget-at just that the c sqlite api is low-level and annoying, or is there some more fundamental issue?
13:29:47 i think it's more the fact that wget is a 24 year old piece of software, onto which people bolted lua support
13:30:13 age is not, in and of itself, an architectural issue :P
13:30:15 and then to add sqlite support to that, but then how does that work, can you access the C APIs in lua? oh god
13:30:37 Yeah, you have to change a lot of the core data structures.
13:30:47 mm, makes sense
13:30:47 no, but it's more an issue with C code where it's using like... probably C89, and you get no nice features of modernish C
13:30:59 --database was one of the main reasons why wpull was originally developed as a drop-in replacement for wget, I believe.
13:31:13 I.e. it was deemed easier to reimplement the whole thing than add SQLite support into wget.
13:31:52 i'm not a fan of lua or c, and then trying to make it work when you are calling into lua and somehow having that work is pretty bonkers
13:31:54 Fun fact: the wget devs also don't like wget anymore, and a complete rewrite is in progress.
13:32:03 dang
13:32:20 i guess no one would notice; standard wget for downloading a single file has been pretty solid
13:32:30 I WANT A RSYNC REWRITE HOLY HELL
13:32:56 Oh yeah, it works great unless you push it to its limits.
13:33:10 Just don't try to change its behaviour, or you'll start to hate yourself.
13:33:19 please gaze upon the work i have to do to get rsync to work in windows:
13:33:20 The source code's a hot mess.
13:33:40 .\rsync --progress -args -e="..\..\cygnative1.2\cygnative plink" mgrandi@IPHERE:"/home/mgrandi/something" /cygdrive/c/Users/mgrandi/whyohgod
13:33:51 'in windows'
13:33:53 Found your problem.
13:34:11 * anelki twitches looking at that
13:34:59 rsync is also not great everywhere, there is no librsync, so you have to subprocess it and then consume standard out like a barbarian
13:35:00 I found your problem, you're not using WSL :P
13:35:28 Yeah, that one's been bothering me for years.
13:35:31 @kiska i cannot get rsync to work with SSH though, WSL doesn't have an ssh daemon, or i can't get it to connect to the windows one
13:35:45 Sure it does :D
13:36:04 and then, i was consuming the output and it failed for a coworker; i looked, and the homebrew version of it has a patch that added an extra character that broke my regex, and i cried
13:36:49 i once had to develop in windows.
13:37:15 i once tried installing ruby on a friend's windows machine
13:37:39 i have no idea why ruby is so bad on windows
13:37:59 apparently it's an ongoing thing, i guess python is just blessed with really solid windows support
13:38:08 @kiska i can't even add an ssh key, i just get "Could not open a connection to your authentication agent."
13:38:38 OOF?
13:40:09 Try using a proper OS?
13:40:15 :-)
13:40:18 the guy in charge wanted to create an application for a piece of ten-year-old hardware from a company that no longer existed. there were linux drivers, but when i tried them they bundled an outdated version of qt and installed it right over mine, breaking all kinds of things. having seen dll hell i probably would have kept going anyway after i got things cleaned up, but i never did get them to work
13:40:27 I mean works for me :D
13:40:45 later i discovered that the person who had been lead programmer before me had fixed all the "undefined reference" errors by typing #define. have i mentioned this application served a completely nonexistent market?
13:40:52 ask me whether i ever got that degree
13:41:14 did you?
13:41:26 i did not
13:42:52 did i miss the chatter on this? https://blog.archive.org/2020/10/22/what-information-should-we-be-preserving-in-filecoin/
13:43:22 * anelki is allergic to coin
13:43:23 i don't quite understand filecoin, like, just use a merkle tree
13:43:56 instead it is some thing to incentivize 'storing' of data by playing numberwang and getting funbucks or something, meh
13:46:14 i guess it's slightly better that it's a curated list, but i still feel like it could be done without the whole 'proof of work blockchain' funbux
13:48:05 The Filecoin team sucks though.
13:49:19 -purplebot- List of websites excluded from the Wayback Machine edited by M.Barry (+29, + www.cyberciti.biz) just now -- https://www.archiveteam.org/?diff=45726&oldid=45668
13:51:11 i have not heard anything about them
14:13:03 i'd be curious about any insights you have there JAA
14:19:51 JAA: yeah we should rename the repo
14:20:04 anelki: Well, for starters, they totally fucked up the launch and alienated virtually all the storage operators on the network. The issues go back years though, but I don't remember the details. Anyway, further discussion on Filecoin in -ot please.
14:20:24 major update coming up later is proper FTP archiving support in Wget-AT
14:20:55 ope, sorry
14:21:28 arkiver: Will anything break if we rename it? E.g. build process, Docker images, etc. GitHub should redirect to the new name I think, so I guess it should be fine.
14:22:09 Perhaps we should throw the full repo into #gitgud before the rename also, just in case.
14:22:57 JAA: depends on if we change the URL
14:23:13 I would like to change the actual location of the repo (the URL)
14:23:20 definitely :)
14:23:41 Yeah, renaming the repo will change the URL.
14:23:44 Wget-AT was actually one of my test cases for developing the github project
14:23:49 it's been saved completely a few times :P
14:23:58 Heh, nice.
14:24:05 I'll throw it in.
15:59:18 -purplebot- Deathwatch edited by JustAnotherArchivist (+151, /* 2020 */ Add Fast.io) 1 minute ago -- https://www.archiveteam.org/?diff=45728&oldid=45706
16:02:19 -purplebot- Fast.io created by JustAnotherArchivist (+558, Basic page) just now -- https://www.archiveteam.org/?diff=45729&oldid=0
16:09:18 -purplebot- Nagi created by JustAnotherArchivist (+1185, Basic page) just now -- https://www.archiveteam.org/?diff=45731&oldid=0
16:19:18 -purplebot- Docker Hub created by JustAnotherArchivist (+1158, Basic page) just now -- https://www.archiveteam.org/?diff=45732&oldid=0
16:26:19 -purplebot- NAVERまとめ created by JustAnotherArchivist (+593, Very basic page) just now -- https://www.archiveteam.org/?diff=45734&oldid=0
16:27:19 -purplebot- NAVER Matome created by JustAnotherArchivist (+28, Redirected page to [[NAVERまとめ]]) just now -- https://www.archiveteam.org/?diff=45735&oldid=0
16:40:19 -purplebot- Deathwatch edited by JustAnotherArchivist (+4, /* 2020 */ Link to Nagi page), JustAnotherArchivist (-29, /* 2020 */ Link to NAVERまとめ page) 19 minutes ago -- https://www.archiveteam.org/?diff=45733&oldid=45728
17:36:56 JAA: AFAICT, wget dedupes globally when you output normally (mirroring the remote directory structure in your own filesystem), but not when you do --output-file=/dev/null
17:41:20 Wget-AT writes revisit records
17:41:57 Oh, I see what you mean
17:42:18 arkiver: We're talking about deduping in the URL queue, not in the WARC output
17:43:22 JAA: Never mind, I see what you mean
22:50:40 Is it just me, or are there way more high profile things on deathwatch from today till the end of 2020 than normal
22:52:19 Chrome extensions, xda forums, playstation store, flash everything, yahoo groups, twitch sings clips D:
22:53:49 Pretty normal towards the end of the year.
22:55:33 In the last two months of 2019, we also had Apple, Google (twice), Yahoo (twice), and Intel in there.
22:56:35 Do we have any idea yet what we could do about the UK-owned .eu domains?
22:57:07 That will be a couple hundred thousand websites gone. :-|
23:00:27 Two stages to that - identification and crawling
23:00:44 Identification can (for lack of something better) be done by the various heuristics people have given
23:00:46 Yeah, the identification is the difficult part.
23:02:00 Crawling may need resources (and it's much more like a "wide crawl" than what AT usually does), but is straightforward
23:02:14 I think it'll just have to be an educated guess
23:02:26 At identification
23:04:40 Yeah
23:09:00 Here's a general approach - first gather many features on all the .eu websites available, then try to figure out which ones are likely to be good indicators of being in the UK
23:09:49 For the second step, e.g. if, of all those sites that list physical addresses, those that use a certain hosting provider (or cloud region or whatever) always have UK addresses, then it's a good indicator
23:09:51 HCross: did we have a list of all sites from some tld? I forgot
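(A rough sketch of the feature-gathering step proposed above for identifying UK-owned .eu sites. Nothing here is an agreed-on method: the signals, regexes, and the placeholder domain list are illustrative assumptions; a real run would take the full .eu zone list and feed the collected features into whatever heuristics people settle on.)

```python
# Sketch: for each .eu domain, fetch the front page and record crude
# "looks British" signals (UK-style postcode, +44 phone number, .co.uk links).
import re
import urllib.request

# Very loose UK postcode pattern, e.g. "SW1A 1AA"; good enough for a heuristic.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")
UK_PHONE = re.compile(r"\+44[\s\d]{9,13}")

def uk_signals(domain: str) -> dict:
    """Return crude UK-indicator features for one .eu site's front page."""
    try:
        with urllib.request.urlopen(f"http://{domain}/", timeout=20) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return {"domain": domain, "error": True}
    return {
        "domain": domain,
        "uk_postcode": bool(UK_POSTCODE.search(html)),
        "uk_phone": bool(UK_PHONE.search(html)),
        "links_couk": ".co.uk" in html,
    }

for d in ["example.eu"]:   # placeholder; the real input would be a full .eu domain list
    print(uk_signals(d))
```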