00:30:07 Everything accessible on the Knowledge Adventure CDN and present as of my initial listing on 2023-06-14 or the relisting about 6 hours ago should now be archived.
00:30:12 betamax, nicolas17: ^
03:51:38 FireonLive edited Current Projects (-10, move Tiki to recently finished): https://wiki.archiveteam.org/?diff=50050&oldid=50046
05:26:57 Visa to Acquire Pismo for US$ 1 billion in cash: https://www.pismo.io/blog/visa-to-acquire-pismo/
05:29:04 "Pismo will retain our founders and current management team. The transaction is subject to regulatory approvals and other customary closing conditions and is expected to close by the end of 2023." - website probably not super in danger, i guess
08:11:05 Is there any way to monitor the offload of the targets? I think someone was saying a few were getting full or close to it
08:46:34 flashfire42|m: nope, ideally targets run at near-full anyway to apply backpressure - if they were empty, that would just mean IA can accept more data and we're archiving too slow ;)
08:47:48 Heh, I mean yeah, but there are some projects currently paused because we were grabbing too much data for IA to keep up
08:49:30 yeah. not quite sure what the status there is. someone else would have to chime in on what is going to happen there, if anything
08:50:05 could be a matter of waiting it out until things slow down naturally, or there might be improvements on the IA/AT side so things can go faster
08:50:48 a lot of data though, so none of it is easy, I imagine
09:19:27 IA is a common bottleneck; the S3 upload "loading bays" are the bottleneck pretty often. AT can suckle data out faster than it can be ingested there
10:50:36 JAA: that's amazing, thanks so much!
10:51:20 Would you be able to share your relisting from a day or so ago? My friend is working with others to reverse engineer the server for the game, and having the full file listing would be very helpful
12:48:25 OrIdow6 edited Egloos (+649, Account of the grab): https://wiki.archiveteam.org/?diff=50051&oldid=50043
12:57:09 No reply from Wysp.ws
13:15:21 Are there archives of the leaderboards for past projects?
13:26:30 If you know the project name, you can use that in the normal tracker URL: https://tracker.archiveteam.org/[projectName]/. For example, the project for Enjin is done, but the leaderboard is still accessible: https://tracker.archiveteam.org/enjin/
14:32:47 Yts98 edited LINE BLOG (+139, Add link to data): https://wiki.archiveteam.org/?diff=50052&oldid=49955
15:01:50 Where is the repo for (at least the front end of) tracker.archiveteam.org?
15:09:37 https://github.com/ArchiveTeam/universal-tracker I think?
15:13:43 Really? I'd probably want to contribute some code, but it looks "dead"
15:16:56 Manu edited Deathwatch (+261, Stitcher will shut down end of August): https://wiki.archiveteam.org/?diff=50053&oldid=50047
15:17:56 Noxian edited Tumblr (+0, /* See also */ latest version of TumblThree): https://wiki.archiveteam.org/?diff=50054&oldid=49141
15:17:57 Hans5958 edited Egloos (-12, Little bit of rewording): https://wiki.archiveteam.org/?diff=50055&oldid=50051
15:17:58 Exorcism edited Tiki (+23): https://wiki.archiveteam.org/?diff=50056&oldid=50049
15:17:59 Exorcism uploaded File:Tiki logo.png: https://wiki.archiveteam.org/?title=File%3ATiki%20logo.png
15:18:57 Exorcism edited Deathwatch (+0): https://wiki.archiveteam.org/?diff=50058&oldid=50053
15:41:40 the egloos, tiki, and lineblog projects are done!
15:42:15 tracker front page is becoming less busy :P
15:43:15 arkiver: great!
now I want to propose a warrior project for Xuite :p https://github.com/yts98/xuite-grab
15:44:14 i read that as xtube, which is both incorrect and also long gone (and already done) :c
15:44:24 tiki was fun. my first top 10 finish :D
15:47:26 haha yeah, first one where i was near the top :p
15:49:03 Yts98 edited Current Projects (+0, Move LINE BLOG to recently finished): https://wiki.archiveteam.org/?diff=50059&oldid=50050
15:50:19 Just wanted to throw this out as a forum to archive: https://memoriesoffear.jcink.net
15:50:40 They did a number of translated games, one namely Toilet in Wonderland (which Vinny Vinesauce played on stream)
15:50:42 Hans5958: looks like that's the one, yeah
15:50:45 https://memoriesoffear.jcink.net/index.php?showtopic=56
15:53:04 Yts98 edited LINE BLOG (+1, Finish the project): https://wiki.archiveteam.org/?diff=50060&oldid=50052
15:53:09 i imagine everyone is quite busy with a lot of other things (including things outside of archiveteam), so it's not as high priority as other stuff
15:54:02 yts98: :D
15:54:17 fireonlive, do you mean that forum I linked, sorry, or are you replying to someone else?
15:54:32 rktk: oh sorry, replying to Hans5958
15:54:32 If there is a recommended way of scraping a forum like that, I have no issue doing it myself
15:54:40 ah ok fireonlive :)
15:54:42 :)
15:54:58 regarding the https://tracker.archiveteam.org codebase
16:09:46 rktk: Probably archivebot, but it's fairly full currently. That one should be pretty easy to run though since it's small
16:10:04 pokechu22, could I run an archivebot myself locally?
16:10:08 or should I just do a wget mirror
16:10:24 yts98: why JSObj?
16:10:47 ArchiveBot isn't designed to be run locally, https://github.com/ArchiveTeam/grab-site is the more usable equivalent
16:10:56 There's also a forum-dl project or something like that that might be usable
16:11:23 arkiver: to deal with JS objects embedded in the HTML.
16:11:32 wget's also fine, but wouldn't end up on web.archive.org (though anything a random person does probably wouldn't end up there)
16:11:51 Looks like they also have mediafire links, so those will need to be put into #mediaonfire
16:12:25 I found that simply replacing single quotes with double quotes may still cause errors
16:13:18 yts98: on the item types, can you please make them a bit more descriptive?
16:14:00 Looks like there are actually a lot of forums under jcink.net, so that's something to check later
16:15:12 yts98: looks pretty good!
16:17:07 pokechu22, yeah, this is just a random personal grab. and i could save to warc, mainly just as a means of throwing it on archive as an object, rather than into the web archive
16:17:19 pokechu22, yeah, definitely something worth looking at
16:18:39 I chose very short item type names because the wiki said "Because the Tracker uses Redis as its database, memory usage is a concern."
16:18:42 let's make a channel for xuite! i'm not sure if this word has a meaning; perhaps we can have a play on words in the language of this word
16:18:58 yts98: ah. well, lists are mostly offloaded, so not a huge concern now
16:19:35 arkiver: watch this video.
16:19:35 https://vlog.xuite.net/play/Qm9leW9BLTEzODg4Ni5mbHY=
16:19:43 There's a small website that I wish to regularly save a few pages from (usually 1-2 pages a day). The prompt to save the page would be an email notification from said site. I already have extracting the link sorted. Is there an API equivalent of https://web.archive.org/save ?
16:19:43 Saving the page is fairly time critical, as once items are sold the page is updated and information is removed.
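[Editor's note: for the Save Page Now question above (answered at 16:20:34 and 16:21:27 below with a link to the SPN2 document), here is a minimal sketch of what such an API call can look like, assuming an archive.org account with S3-style keys. The key placeholders, the helper name save_page, and the example target URL are illustrative only; parameter names follow the linked SPN2 document and are not confirmed anywhere in this log.]

```python
# Minimal sketch, not an official client: submit one URL to Save Page Now.
# ACCESS_KEY/SECRET_KEY are placeholders for archive.org S3-style keys
# (from https://archive.org/account/s3.php); "capture_all" follows the SPN2
# document linked at 16:21:27 below.
import requests

ACCESS_KEY = "YOUR_IA_S3_ACCESS_KEY"
SECRET_KEY = "YOUR_IA_S3_SECRET_KEY"

def save_page(url: str) -> dict:
    """POST a single URL to Save Page Now and return the JSON reply."""
    resp = requests.post(
        "https://web.archive.org/save",
        headers={
            "Accept": "application/json",
            "Authorization": f"LOW {ACCESS_KEY}:{SECRET_KEY}",
        },
        data={"url": url, "capture_all": "1"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # typically includes a job id that can be polled later

if __name__ == "__main__":
    # hypothetical example target; any page URL works
    print(save_page("https://www.stationroadsteam.com/"))
```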
16:20:03 rktk: I've started an archivebot job anyway, shouldn't take too long
16:20:06 Xuite's slogan is "My Xuite, So Sweet~"
16:20:11 hurray! pokechu22
16:20:15 yts98: i see some stuff there like TODOs on handling malformed JSON responses
16:20:33 someone should save digitalfaq before all the scam evidence is wiped away
16:20:34 threedeeitguy: Pretty sure web.archive.org/save can be treated as an API endpoint, I remember seeing some docs on that, one sec
16:20:41 digitalfaq?
16:21:19 those malformed responses should be caught in write_to_warc and then not be written to the WARC, and either be marked for retrying, or the item should be aborted. or, in rare cases, no write to the WARC and let it continue as usual, if this is an 'error' that is fine
16:21:20 arkiver: their API sometimes mixes cp950 with utf8
16:21:27 https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA/edit
16:21:46 right, i see. so the error is on our side, not on theirs?
16:21:52 yts98: ^
16:21:52 pokechu22, digitalfaq.com
16:22:23 What's the deal with scam evidence?
16:22:51 Looks like it was previously saved August 2022: https://archive.fart.website/archivebot/viewer/job/4ialw
16:23:05 err, no, those are small enough that saving it probably failed
16:23:38 arkiver: yes. the error is caused in JSON.lua.
16:25:13 pokechu22 thanks, I'll take a look. It may not be suitable anyway. I just tried a page and it's far from clean: https://web.archive.org/web/20230629161553/https://www.stationroadsteam.com/3-12-inch-gauge-union-pacific-big-boy-4-8-8-4-stock-code-11379/#
16:25:57 yts98: i see there is still a chance of 'bad data' getting into the WARC, for example I see a check on json["ok"] in get_urls. at this point the data is already in the WARC, which it should not be if there is an indication of an error
16:26:07 betamax: Yeah, everything will be on IA once the upload finishes.
16:26:29 so this json["ok"] check should be in write_to_warc, and then again either retried or the item aborted (or accepted in rare cases) if the error is there
16:26:44 there may be other checks in get_urls that should move to write_to_warc
16:29:10 arkiver: json["ok"] being false is not rare. It happens when an article is protected by a password, or a user did not activate one of the blog, album, or vlog services.
16:29:24 alright, good
16:30:15 and then I saw thousands of usernames discovered, but the API will respond with "no such user".
16:32:10 their username search API even returns illegal usernames, possibly manually altered by the moderator to deactivate some accounts
16:33:10 interesting
16:33:11 so
16:33:12 on images
16:33:20 photo.xuite.net, and such
16:33:52 can different items get to the same images? can they be duplicated between items? i see they are now generally always accepted for immediate archiving
16:36:46 I send some image URLs found in API responses of user items, but some of these images belong to an album, so the current script will grab them twice or more.
16:37:32 are the URLs for a single image unique?
16:38:11 as in, is it always 3.example.com/image.png, or can there also be 2.example.com/image.png, 3.example.com/image?format=png, etc.?
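[Editor's note: a rough, language-agnostic sketch (written in Python) of the flow arkiver describes at 16:21-16:29 above: validate a JSON API response before it is committed to the WARC, retry on malformed data, and accept the documented benign case of json["ok"] being false. In the actual warrior project this logic lives in the Lua write_to_warc callback; everything below, including fetch() and write_record(), is an illustrative stand-in, not project code.]

```python
# Illustrative sketch only (the real logic belongs in the Lua write_to_warc
# callback): check the response before writing it, retry bad data, and let
# the known-benign json["ok"] == false case through.
import json
import time

MAX_TRIES = 5

def process(url: str, fetch, write_record) -> None:
    for attempt in range(1, MAX_TRIES + 1):
        body = fetch(url)                 # stand-in for the real HTTP fetch
        try:
            obj = json.loads(body)
        except ValueError:
            time.sleep(attempt)           # malformed JSON: don't write, retry
            continue
        if isinstance(obj, dict) and "ok" in obj:
            # ok == false is common and expected here (password-protected
            # article, service never activated), so it is still written
            write_record(url, body)
            return
        time.sleep(attempt)               # unexpected structure: retry
    raise RuntimeError(f"item aborted after {MAX_TRIES} tries: {url}")
```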
16:40:16 I see the TODO about false positives. yes, this may produce false positives. but archiving is usually done with the thought of "better to discover too much than too little" - so if we are sure everything will be discovered with very strict rules, then that is fine
16:40:52 for photo.xuite.net, the image URLs are unique;
16:40:52 when images are embedded in blog articles, the service possibly generates another URL that accepts outlinks
16:41:04 but it is often good to keep the rules somewhat relaxed and allow for a possibility of false positives. we can eliminate these false positives if we find them, and that way perhaps extract/archive more than we initially were under the impression was actually there
16:41:36 yts98: "another URL that accepts outlinks" - for an image? what do you mean?
16:44:29 yts98: on the video URLs and load balancing. can video URLs for the same video be found in different items? as in, can there be duplicates? (same as what i asked for the photos)
16:45:17 if a certain video will _only_ be discovered from a single item, then good! and then let's get whatever load balancers they use; Wget-AT will prevent writing duplicate data while still preserving the URLs.
16:46:06 there will only be duplicate data downloaded on the side of the Warrior, but this extra data will be deduplicated away when written to the WARC. if xuite can handle it, then it's good to get this duplicate data.
16:46:40 because this is not only about pure data preservation, but also about URL preservation. we want to try and cover the entire range of possible URLs, so that those can be found through the Wayback Machine.
16:47:54 so, let's say we have 1.example.com/image.png and 2.example.com/image.png both pointing to the same image. if we download them _in the same Wget-AT session_, then they will be deduplicated, while both their URLs are preserved (yes, data will be downloaded twice)
16:48:48 if we have separate items for those two URLs to the same image, then it is likely that those separate items end up in different Wget-AT sessions and are not deduplicated, which wastes bytes
16:49:27 if we're talking about 1 TB or so of duplicated data, that is not a big problem. but if it turns into 10 TB or 100 TB of duplicated data, that is a problem
16:51:25 yts98: i see you store data in _data.txt, what is the use of this? we're actually not really using data.txt anymore. in the past data.txt was used to discover items, but nowadays we use backfeed for that.
16:51:41 there is nothing on the targets currently that will do anything with the _data.txt file.
16:52:03 I don't remember in which article I saw image URL formats other than 1.share.photo.xuite.net.
16:52:03 Separating images into new items is a reasonable approach. Let's handle them like cdn-obs in lineblog.
16:52:03 Video URLs may also be checked in user items. But they may expire if we backfeed them as items.
16:52:03 I thought WARC revisits could only be used on the same URL. So a WARC revisit applies to different URLs when the response body is identical.
16:52:45 yes, on the response body being identical
16:53:36 i see, on expiring video URLs. are the video URLs you get through a user item actually used for playback? or are they "just there" in some data blob, while actually only the video URL on the post page is used for playback?
16:54:43 on FlashVars rules - those are not known yet?
16:57:52 yts98: well, overall it looks pretty good, i'll be further checking this later!
16:59:13 the purpose of data.txt is to inspect the metadata not included in item names, including blog_id and every .
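[Editor's note: a conceptual sketch of the deduplication arkiver describes at 16:46-16:52 above: within one archiving session, two different URLs that return byte-identical bodies yield one full record plus a small revisit-style entry keyed on the payload digest. This only illustrates the idea; it is not wget-AT's actual implementation, and the record fields below are simplified.]

```python
# Conceptual sketch only: same-session dedup keyed on the payload digest,
# as with WARC "revisit" records. Not wget-AT's real code; fields simplified.
import base64
import hashlib
from datetime import datetime, timezone

seen = {}  # payload digest -> (URL of first capture, capture time)

def record(url: str, body: bytes) -> dict:
    digest = "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode()
    now = datetime.now(timezone.utc).isoformat()
    if digest in seen:
        first_url, first_date = seen[digest]
        # duplicate payload: preserve this URL, but point back at the first capture
        return {"type": "revisit", "url": url, "refers_to": first_url,
                "refers_to_date": first_date, "digest": digest}
    seen[digest] = (url, now)
    return {"type": "response", "url": url, "date": now,
            "digest": digest, "length": len(body)}

# both load-balancer URLs are preserved, but the body is stored only once
print(record("https://1.example.com/image.png", b"<image bytes>"))
print(record("https://2.example.com/image.png", b"<image bytes>"))
```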
16:59:13 I've discovered 5 types of FlashVars rules https://wiki.archiveteam.org/index.php/Xuite#Flash-based_creations , but I'm not sure if I missed more.
17:00:23 arkiver: thanks for taking a look! I learned a lot about archiving practices :)
17:00:37 good to hear :)
17:00:54 alright, i'm not sure yet about data.txt, will have a better look later!
17:01:37 (i only actually looked at the code - not the site yet)
17:07:38 a possible alternative to data.txt is to create a dummy backfeed that does not actually backfeed the items into the project.
17:08:19 that sounds better, yes
17:08:37 but i'm not sure if we actually need it, need to do some experiments as well
17:09:04 if there is something unexpected, can the item simply be aborted?
17:09:55 i see, for example, that when an a: item is queued, it is always written to the data.txt as well; that is not needed, i think?
17:22:49 gettyimages acquired unsplash back in 2021: https://unsplash.com/blog/unsplash-getty/ and looks like they’re jumping on the “oh fuck, AI is going to ruin us” bandwagon way too late https://twitter.com/sindresorhus/status/1674390882399801345
17:23:12 not sure what he means by “removed their free non-API endpoint” though
17:26:25 yts98: i see very explicit extraction of certain URLs, also from the HTML, line 1096 for example. i think this is already handled by the 'general' URL extraction happening at line 1966? if not, that might be a better place
17:27:07 Next AT project: archive everything that has a free API.
17:27:20 this is again coming from the point of "better to extract too much than too little" - if we only allow extraction of very specific URLs in very specific places, there is a great risk of missing something.
17:27:52 hmm
17:28:12 or, is this being extracted specifically here to have a certain referer be different than the current URLs we're working on?
17:28:41 in which case it would be good. later it'd be picked up in the 'general' extraction code, but not queued since it was queued before
17:28:52 current URL*
18:17:31 JAA: yeeeeah :|
18:17:52 🙃 🔫
18:18:23 they said AI/ML would destroy the internet
18:18:36 i just didn't think it would be in this way
20:30:41 tinaja.com looks kinda big, so I'm not going to put it into archivebot until we have a little bit more space
20:31:00 let's see
20:31:11 interesting site
20:31:38 seems to have a lot of pdfs, so it might be big
20:32:23 pokechu22: shall we put it in archivebot anyway?
20:32:32 I was about to ask what you look for to determine whether it looks big or not. At first glance I figured it looks like it's from the 90s, so small
20:33:05 Currently all the AB pipelines are full because hel3/hel4 are low on disk space, because of the general upload backlog, to my understanding
20:34:17 Probably we could still queue it though
20:39:16 actually those pdfs are not that big, so it might be something like 50-60 gigs at most
20:40:25 could be good to queue it, as you can just pause it in the event that there is no space, right?
20:41:21 Alright, queued it
20:41:47 It'll auto-pause when there's no space (< 5 GB I think)
20:42:46 LUL, it was already started apparently :P
21:00:39 pokechu22: general upload backlog to where?
21:00:47 is IA the bottleneck?
21:01:24 I think so?
21:01:31 JAA talked more about it I think
21:01:53 main thing is that if you look at http://archivebot.com/pipelines most machines are full
21:01:58 we need an "ArchiveBot talk" channel
21:02:20 arkiver: #down-the-tube and AB used the same rsync target. The former clogged it.
21:02:31 ah
21:02:42 JAA: how about that archivebot talk channel?
21:02:44 That comes up every few months or so. It'd be mostly a dead channel, probably.
21:03:09 i usually miss messages someone posts to me in #archivebot
21:03:10 oh well
21:03:25 warning to all ^ if I need to really notice the message, don't write to me in #archivebot
21:03:33 Make your client log highlights into a separate window. :-)
21:03:55 Relevant messages are at 03:47:37 UTC on June 29
21:04:23 This is what I've been using to check. Is this known as a good way to see if they're clogged? https://monitor.archive.org/weathermap/weathermap.html
21:04:57 I don't think the rsync targets would be on there, as they're archiveteam infrastructure, but I'm not 100% sure of that
21:05:14 the switchtc0-200paul has been in the red for around 30+ hours
21:05:17 JAA: That is the one thing from znc I would like to have on thelounge
21:05:26 JAA: that would be something i need to figure out, and I'm not doing that now
21:05:31 did someone say that archive.org had an issue with (or intentionally?) limited inbound speed?
21:05:38 vokunal|m: no, there can be many reasons
21:05:41 that was oof a while ago though
21:08:07 Oh, it was also mentioned that https://yarus.ru/ was shutting down shortly per https://yarus.ru/post/1989728469 - there's an AB job for it, but there's basically no chance it'll finish completely :|
21:10:55 ugh, it looks like that site's also JS-based, so AB's not going to get anything useful :| (and I think I pushed it too hard and am now getting 403s :|)
21:12:36 no wonder google translate did not work on it :P
21:15:41 Yeah, I was wondering why it wasn't working
21:18:44 Oh, and just found out The Lounge has a recent mentions feature
21:19:07 that's convenient
21:19:27 indeed! the @ symbol
21:24:53 pokechu22: checking
21:25:28 pokechu22: are you planning to pull tinaja.com through AB later?
21:25:46 It turns out it was already running in AB since yesterday
21:26:07 oof, just seeing yarus in my browser with that loading screen... oof oof
21:26:39 what
21:26:41 June 30?
21:26:49 not again
21:26:50 Several hours ago it was 18 hours
21:26:57 frankly I think it's not possible to get it done
21:27:01 It does have a complete sitemap though
21:27:23 they posted the message you linked today?
21:27:29 for a shutdown tomorrow?
21:28:09 nyuuzyou: ^
21:28:17 It seems like that's the case though
21:28:26 rewby: are you around?
21:28:36 i'm not sure if we can get a project up in time
21:28:45 but we might need a target for a shutdown tomorrow... announced today :(
21:29:08 I'll get you a target if you get a tracker proj and vars in... 30 mins
21:29:14 woah, sequential post IDs?
21:29:16 i like it
21:29:16 "У вас будет время сохранить весь свой контент" - "You will have time to save all your content." yeah, sure...
21:29:33 Sequential IDs and a full sitemap as far as I can tell
21:29:44 but on the other hand, javascript
21:29:57
21:30:02 i'm always skeptical about sitemaps
21:30:43 rewby|backup: alright
21:32:54 they seem to have a rate limit (on api. at least), returns a standard nginx 403
21:32:54 and now that's changed to another 403 page
21:34:20 imer: proper status code?
21:34:27 yep
21:34:30 403
21:34:39 good
21:34:51 here's the content of the non-nginx 403: https://transfer.archivete.am/hyaCY/2023-06-29_23-34-40_wmbgyH3GLo.txt
21:34:57 i've censored my ip with XXX
21:35:57 Archivebot is still getting 403s a while after con=6, d=0 (that wasn't using the API and in fact wasn't even trying to retrieve stuff from the API, though)
21:36:08 ok everyone gather around for a picture
21:36:17 an api actually used a proper http status code
21:36:19 block doesn't seem to be shared across domains, but obviously the site won't work
21:36:20 we need to remember this moment
21:37:37 interesting
21:37:42 i'll keep checking if I get unblocked
21:37:43 IDs are sequential with a huge sudden gap
21:38:07 response headers: https://transfer.archivete.am/mTei1/2023-06-29_23-37-36_Qp3eqSS4hN.png content-type is proper as well
21:38:35 no ipv6 (why do I even bother checking this)
21:39:18 one day you'll be rewarded
21:39:22 it's like finding a rare coin
21:39:30 the toyota yarus, https://en.wikipedia.org/wiki/Toyota_Yaris
21:39:32 lol
21:40:33 do we have a channel name yet? i'll throw #norus into the hat if not
21:40:52 nop
21:40:55 words i can arrange a sentence to
21:40:56 mine was #yaaaaaaaaasus but that's kinda gay
21:40:57 :p
21:41:01 imer: see what i wrote earlier ;)
21:41:02 also not punny enough
21:41:19 #norus it is
21:41:32 arkiver: you were in the tiki channel
21:41:32 :D
21:42:00 HEY EVERYONE! JAA is not in #norus , let's party there. no one tell JAA please!!
21:54:10 JustAnotherArchivist created ЯRUS (+194, Created page with "{{Infobox project | URL =…): https://wiki.archiveteam.org/?title=%D0%AFRUS
21:55:11 JustAnotherArchivist created Yarus.ru (+19, Redirected page to [[ЯRUS]]): https://wiki.archiveteam.org/?title=Yarus.ru
22:04:12 Pcr edited List of websites excluded from the Wayback Machine (+26, Add TH3D): https://wiki.archiveteam.org/?diff=50063&oldid=49985
22:07:17 :D
22:10:39 arkiver: wrt noise in #archivebot, if you use weechat, there are some filters at https://wiki.archiveteam.org/index.php/User:Switchnode