01:15:57 AntoninDelFabbro|m: which website, what are you trying to get? I tend to use different things for different purposes. e.g. googler/ddgr for site: search engine queries, curl/wget for downloads, pup for HTML parsing/querying, jq for JSON querying
02:19:39 Systwi uploaded File:Duck Hunt (World)-0--twitter-5.png (Mr. Peepers holding the Twitter bird,…): https://wiki.archiveteam.org/?title=File%3ADuck%20Hunt%20%28World%29-0--twitter-5.png
02:27:41 Systwi edited Twitter (+177, /* Vital Signs */ Added meme and serious caption.): https://wiki.archiveteam.org/?diff=50593&oldid=50591
02:35:29 systwi: 😃
02:39:43 Systwi edited Site exploration (+398, /* Twitter */ Mentioned Nitter and Twitter's…): https://wiki.archiveteam.org/?diff=50594&oldid=50492
02:40:02 fireonlive: :-D
04:54:05 wowturkey is down, not known whether it's temporary and will be restored, or permanent as in finally closed
04:54:48 let's leave the bot running in the hope it returns once more
04:55:02 the announced date was august 31
05:02:15 nicolas17: anything to save for this? https://www.volkerkrause.eu/2023/08/26/kde-jenkins-retirement-progress.html
05:03:50 pabs: I doubt it because most data there was ephemeral in the first place, e.g. there are projects that do a daily build and only the last 5 binaries are kept
05:04:04 and what about the phabricator?
05:04:23 phabricator is a bugtracker, that's significant
05:04:42 tickets may have good things
05:05:34 we'll probably turn it into static pages somehow
05:05:55 I'm not sure how easy it is to archive, I think there's like, JS-backed "load more comments" stuff?
05:05:57 * pabs recommends an AB job, then download the static files :)
05:06:30 I did a phabricator recently, apart from the large amount of ignores I think it worked ok
05:07:38 we'd have to do the same with missing wowturkey viewtopic pages with p=### links
05:08:06 corresponding t=####&start=### ones have already been crawled
05:08:27 at one point we considered moving issues from phabricator to gitlab and it was messy because tickets can have multiple tags/projects that they belong to, while gitlab issues belong to *one* project
05:09:15 so we would need to check case by case and make a list of "if a ticket has tag X and tag Y, put it in repo Y"
05:09:58 AB job then static seems better
05:10:18 well yeah, this was *early* in the gitlab move when a lot of tickets would still be active
05:11:38 by now I guess a lot was closed, or stopped mattering, or was still active and someone moved it manually
05:12:31 Ok, should I focus on webs or orange today? both have close cut-off dates. Or do I say the hell with both of them and continue with the aussie ISPs that have technically passed their shutdown date and are still up?
05:14:39 "September 1: wowTURKEY[IA•Wcite•.today•MemWeb], a large Turkish photo sharing forum[23]" september 1 → august 27
05:15:00 don't kill the archivebot crawler tho
05:15:12 it still crawls previously failed external links
05:17:18 if we had started this crawl one day before, we would have the full archive today...
05:17:45 Alas, those are the joys of web archival
05:17:50 things are lost every day, my friend
05:18:07 and it sucks. it does. but we do what we can
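As a rough illustration of the toolbox mentioned at 01:15:57, the pieces chain together roughly like this; example.com and the selectors are placeholders, not anything from this discussion:

  # extract all link targets from a page: curl fetches, pup queries the HTML
  curl -s 'https://example.com/' | pup 'a attr{href}' | sort -u
  # pull a field out of a JSON API response with jq
  curl -s 'https://example.com/api/items.json' | jq -r '.[].name'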
05:18:16 87% is better than nothing
05:18:38 (known item count is ~9.4M, we got 8,338,246)
05:23:32 https://www.science.org/content/article/government-seizure-nicaraguan-university-blow-science-researchers-say
05:23:49 300ms is too short for wowturkey, the server's own delay was about 500ms even when it was up
05:36:25 flashfire42: I should probably do an !a < list job on orange - webs might be better to focus on. On the other hand webs has the stupid calendars that make things a mess :|
05:37:10 orange has about 4 different subdomains and some of them don't even resolve for me but do for others. Webs is not a set-and-forget thing, which is really what I am aiming for, because the calendars are so fucking broken
06:43:31 Can I get a list of those different subdomains? (Lists of individual sites would be useful too but I have some ideas of how to get those once I know the starting points)
06:46:51 pagesperso-orange.fr
06:46:51 monsite-orange.fr
06:46:53 those are the main 2
06:47:06 images are hosted on cdn.woopic.com
06:56:56 a few individual sites: https://crawlyproject.digitaldragon.dev/cds/lists/fr/pagesperso-orange/
07:04:08 (ignore the .txt at the end of everything)
07:09:33 pabs: I want to download https://annuaire-pp.orange.fr/accueil, but thanks for your help! :D
07:11:42 a good option for that is to open it in your browser, open dev tools, click on all the things on the site, then save all the requests as a .har, and then AB all the URLs output by this shell one-liner:
07:11:44 for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done | sort -u
07:12:12 ah, better to open dev tools before loading the page, whoops
07:12:49 there are some browser-based crawler things on the wiki somewhere too
07:13:15 but they may not work if you need to interact with the site
07:20:28 Gold! I just woke up, but I'm impatient to try this asap! Thank you!
07:31:24 qyxojzh|m: wowturkey definitively down
08:04:18 JAA: wowturkey definitively down, as of 0400 UTC today
08:10:31 AntoninDelFabbro|m: that's a good way to capture data you can click through manually, but if the amount of navigation required is very large, i personally prefer to write a short script.
08:11:36 i have done so for annuaire-pp.orange.fr (and in the process, i believe, discovered more results than are shown in the browser) and will dump results tomorrow
08:20:56 Awesome! Haha, well you saved me a lot of time, thanks ;)
08:26:10 AntoninDelFabbro|m: you're welcome!
08:32:08 also, uh, can someone remind me what the status is on orange isp hosting in general? are we still just dumping stuff in archivebot? because there are tens of thousands of these
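Building on the 07:11:44 one-liner, a possible variant (an assumption about what one might want, not something from the channel) keeps only http(s) requests to the target host, ready to be queued as a list:

  for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done \
    | grep -E '^https?://' \
    | grep -F 'annuaire-pp.orange.fr' \
    | sort -u > annuaire_urls.txt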
08:34:19 (i was going to add 'and some of them require javascript', but based on my spot-checking they're all in the weird 'put everything in the html, but don't actually display it until the js loads' idiom, so i think archivebot would actually be fine in that respect)
08:47:26 qyxojzh|m: JAA: arkiver: one of wowturkey's former mods is about to ask the owner to buy and resurrect wowturkey.com
08:47:37 we might get a last-chance revival
08:58:11 thuban: um, everything into archivebot unless you design some scripts, because it's about a week away from going bye-bye, and we have like 3 or 4 warrior projects on the go and fuck-all ingestion to IA right now
09:44:11 Bzc6p edited Demotivalo.net (+37, /* Sister sites */ Update stati): https://wiki.archiveteam.org/?diff=50595&oldid=47826
11:25:15 wowturkey is down as of now
12:19:27 this site has 6 TB of FLACs for bluegrass music: https://bluegrassarchive.com/ (frameset for https://gdarchive.net/Public/Bluegrass/contents.htm)
12:20:21 would be nice to grab eventually, but seems a bit big for AB, especially with the current IA upload limits
12:32:44 PaulWise edited Bugzilla (+1073, more from BZ site…): https://wiki.archiveteam.org/?diff=50596&oldid=50588
12:33:44 PaulWise edited Bugzilla (-93, remove accidentally added done ones): https://wiki.archiveteam.org/?diff=50597&oldid=50596
12:43:45 PaulWise edited Deathwatch (+295, Eclipse Wiki shutdown): https://wiki.archiveteam.org/?diff=50598&oldid=50584
12:53:48 PaulWise edited Bugzilla (+0, Eclipse Bugzilla shutdown, AB in progress:…): https://wiki.archiveteam.org/?diff=50599&oldid=50597
13:34:56 Exorcism edited Orain (-6): https://wiki.archiveteam.org/?diff=50600&oldid=44412
13:46:06 pabs: it's maybe fine to put in AB when the current problems at IA are fixed
13:46:31 ok, wasn't sure if AB could handle that volume either
13:46:44 thanks
13:48:26 well, JAA is the expert on that
13:55:00 Exorcism edited Nupedia (+10): https://wiki.archiveteam.org/?diff=50601&oldid=28721
14:05:33 erkinalp: Ugh. Yeah, let's hope it's resurrected.
14:07:44 pabs, arkiver: AB doesn't care much about data size as long as there aren't huge files in it. The other limiting factor is number of URLs, but until you go over 100M, that's not usually a problem either.
14:28:10 JAA: seems no hope
14:51:51 why are archivebot downloads so slow?
14:52:21 Do you mean downloads of ArchiveBot data from the Internet Archive?
15:15:18 no, archivebot data downloads from archive.fart.website
15:15:40 i'm getting ISDN download speeds currently
15:16:02 it can't be due to my link speeds either
15:16:15 i happen to have 70 Mbps down, 10 Mbps up
15:16:30 The AB viewer is just an index of the data on IA.
15:16:38 The links go to IA.
15:17:04 And yeah, downloads from IA are notoriously slow, especially if you aren't near the Bay Area.
15:18:08 it isn't that slow normally
15:18:22 it was usually a few Mbps
15:18:40 i could get good DSL download speeds
15:18:47 not dial-up or ISDN speeds
15:21:04 It varies depending on IA load and which server the data is on.
17:27:41 IA downloads are now down to dial-up speeds
17:30:31 Not surprising. IA is pretty busy recently, and it slows various things to a crawl.
17:31:32 Even IA-internal things are slow. One particular item I was monitoring took over 6 days to move 43 GB around internally.
17:31:43 ooof
17:32:04 (Move it from S3 to item server, checksums, and mirror to backup server.)
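For the "tens of thousands of sites" problem above, one sketch is to split a seed list into chunks that can each be queued as an `!a <` job; the chunk size is arbitrary, and the curl upload assumes transfer.archivete.am accepts transfer.sh-style uploads:

  # split the big list into 5000-line chunks named orange_chunk_aa, _ab, ...
  split -l 5000 orange_seed_urls.txt orange_chunk_
  # upload each chunk (assumed transfer.sh-style PUT) and print the resulting URLs
  for f in orange_chunk_* ; do
      curl -s --upload-file "$f" "https://transfer.archivete.am/${f}.txt"
      echo
  done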
17:42:00 i'm planning on dusting off wikis-grab for the upcoming deletions of wikis
17:42:09 though those are also largely covered already by wikiteam dumps i believe
17:42:21 wikis-grab would be more of a general method of archiving wikis
17:43:00 a project is also coming for ZOWA
17:44:36 any idea for a channel for zowa.app?
17:45:05 flashfire42: do you know if we have the orange ISP hosting stuff fully covered with AB?
17:48:43 arkiver: have you seen #wikibot?
17:48:58 ah, you're already in that channel
17:49:26 pokechu22: yeah
17:49:42 i think it's good to have both dumps from there and from a project creating WARCs
17:49:46 I haven't seen wikis-grab before - does it try to do everything wikiteam does, or is it mostly focused on saving the current revision of every page?
17:49:53 Yeah, WARCs are good
17:50:03 current revision
17:50:21 it may be good to give the wikiteam dumps higher priority than the WARCs, since they're more complete
17:50:36 but after that we should attempt to create WARCs as well
17:51:03 Saving the current revision and maybe all pages on the history tab (but not the revisions themselves - just the history list for attribution) probably is enough for WARCs
17:51:36 yeah
17:51:52 Love it!
17:51:55 it's largely for URL preservation, so it's in the wayback machine and easily browsable
17:52:12 the dumps can be used to restore a wiki (right?), but for browsing, WARCs are better
17:52:14 JAA: :)
17:53:28 Yep, that's accurate.
17:53:44 The dumps are entirely unusable for the average person.
17:54:11 and of course outlinks for #// :)
17:55:16 "any idea for a channel for zowa..." <- zowch
17:55:57 Sorry arkiver, it's 4am here and you are lucky I snap awake for random reasons. Orange is far from complete in archivebot. I've been launching as many jobs as I can manage but it's like fighting a fire with a kid's water bucket. I'll get a sampling but not all of it
17:56:50 I will continue to throw in as much as possible during the next week, but we won't get it all, I can say that with certainty. Not unless we get a stay of execution for another month or 2
17:57:18 Hopefully that info is helpful; it is time for me to head back to sleep for another 2 hours.
17:58:22 I'll try to do an !a < list job for it too
17:58:33 The deadline for webs is sooner though
17:58:41 JAA: unless they have a WARC reader,
17:59:22 z-oww-a
18:00:49 #nowa
18:01:21 although I like the oww one better :D
18:08:49 Or perhaps some play on the content. What are some sounds you'd absolutely not want to hear in an ASMR video?
18:09:00 zowaah, zowie 🤷‍♂️
18:10:51 #zo🍽️
18:14:28 My terminal is sad about that last one.
18:17:13 Yeah, let's not :)
18:32:26 :)
18:53:53 arkiver, re orange isp hosting: AntoninDelFabbro|m posted a link to a page listing sites, and i have been enumerating them using its api
18:55:46 if my suspicion that supplying 0 as the category id retrieves all categories is correct, i expect to be able to enumerate 159832 sites (some fraction of which will be duplicates or inaccessible due to various oddnesses)
18:57:20 i don't think it's realistic to do this 'manually', but maybe some `!a <` jobs? individual sites are quite small as a rule
19:03:36 I'm really impressed and thankful
19:10:26 i for one vote for the emoji channel ;)
19:10:44 :3
19:11:29 JAA: wowturkey definitively dead, we can update the deadwatch now (death date: 2023-08-27, 0400Z)
19:11:58 s/deadwatch/deathwatch/
19:11:59 oh, is it confirmed by the owner?
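A sketch of the "current revision plus history list" idea from 17:51:03, assuming a stock MediaWiki with api.php and index.php at the wiki root; wiki.example.org is a placeholder, and real wikis differ in paths:

  wiki='https://wiki.example.org'
  cont=''
  while : ; do
      resp=$(curl -s "${wiki}/api.php?action=query&list=allpages&aplimit=500&format=json${cont:+&apcontinue=${cont}}")
      printf '%s' "$resp" | jq -r '.query.allpages[].title' | while IFS= read -r title; do
          enc=$(printf '%s' "$title" | jq -Rr '@uri')
          # one URL for the current revision, one for the history list (attribution)
          printf '%s\n' "${wiki}/index.php?title=${enc}" "${wiki}/index.php?title=${enc}&action=history"
      done
      cont=$(printf '%s' "$resp" | jq -r '(.continue.apcontinue // empty) | @uri')
      [ -n "$cont" ] || break
  done > wiki_seed_urls.txt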
19:12:25 owner not responding to any correspondence
19:12:33 no hope of coming back up again
19:12:53 the AB job has a few external links (~650 or so) pending
19:12:56 ah :(
19:14:11 to skip wowturkey.com without impacting the remaining ~650 external resources, i'd propose to temporarily map wowturkey.com to 0.0.0.0 ;[
19:14:15 :-(
19:14:44 on the bot's end, i mean
19:15:26 The AB job is paused, and the offsite URLs aren't in danger, so we can let it sit until the true deadline just in case it comes back.
19:15:50 oh, i thought it was looping over and over
19:15:57 good that it's paused
19:16:49 if it doesn't come back up until 23 september, then it's DaaD
19:17:15 (23 september is when the hosting expires, exactly 22 years from the website's start)
19:17:52 and the shutdown was exactly 20 years and 1 day from the first turkish-language post
19:18:21 wowturkey initially consisted of english threads
19:18:31 promoting turkey to outsiders
19:21:34 https://transfer.archivete.am/inline/eDzUk/monsite-orange.fr_seed_urls.txt - this is the smaller one of the two :|
19:22:00 (this also contains urls from monsite.orange.fr and monsite.wanadoo.fr, both of which give a page redirecting (but not a 3xx redirect) to monsite-orange.fr)
19:24:15 pokechu22: how was that list collected?
19:24:35 erkinalp: Well, to be precise, 'paused' here just means a very slow request rate (one request every five minutes in this case), not actually paused.
19:24:44 Also, not sure where you got that 650 number from.
19:25:02 Most of it was https://archive.org/developers/wayback-cdx-server.html (e.g. https://web.archive.org/cdx/search/cdx?url=pagesperso-orange.fr&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=1&resumeKey=fr%2Cpagesperso-orange%2Clignerolles-allier%29%2Fcartes_postales%2Fteillet%2520argenty%2Falbum%2Fslides%2Fle%2520tumulus.html+20141112051426) - I also mixed in a list from #webroasting a while back
19:25:16 There are about 8.3k offsite URLs in the remaining queue.
19:25:33 SrainUser's https://transfer.archivete.am/Y5Qsp/orange_isp_hosting_urls.txt which I think was scraped from the list the site gives, but I'm not 100% sure
19:26:01 and, yes, there's a fair bit of garbage on my list - easier to let it be attempted and fail than to try to filter it out
19:26:15 thuban: if you have a list of sites, please do post them!
19:26:27 arkiver: still processing, will do
19:26:55 I can deduplicate my list against anything you find and start a second job for whatever's missing
19:28:10 It looks like there's a pagespro-orange.fr in addition to a pagesperso-orange.fr, incidentally
19:28:46 yep
19:29:37 thuban: thank you
19:29:52 and in the meantime - all those queuing AB jobs for orange, please keep doing that
19:31:30 I'm currently doing an !a < list AB job for it - this is easier since it's one job for thousands of sites, but it's a bit buggy in that if the sites link to each other, it might not recurse properly. Still, it seems like the most practical way to do this
19:32:51 JAA: thanks for the number
19:33:24 erkinalp: we got a pretty serious chunk of it, i believe
19:34:50 89% of items saved
19:35:06 maybe more
19:35:48 (the website had 9.35M posts, according to their own stats)
19:35:50 that's good!
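A sketch of how the 19:25:02 CDX query can be paged to completion, assuming the documented behaviour that with showResumeKey the output ends with a blank line followed by the resume key; the domain and filenames are just examples:

  domain='pagesperso-orange.fr'
  resume=''
  : > cdx_urls.txt
  while : ; do
      curl -s "https://web.archive.org/cdx/search/cdx?url=${domain}&matchType=domain&collapse=urlkey&fl=original&limit=100000&showResumeKey=true${resume:+&resumeKey=${resume}}" > page.txt
      [ -s page.txt ] || break
      # result lines contain a scheme; keep only those
      grep -F '://' page.txt >> cdx_urls.txt
      # if more results remain, the output ends with a blank line and then the resume key
      if [ -z "$(tail -n 2 page.txt | head -n 1)" ]; then
          resume=$(tail -n 1 page.txt | jq -Rr '@uri')
      else
          break
      fi
  done
  sort -u cdx_urls.txt > "${domain}_urls.txt"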
19:36:06 :D
19:36:13 not sure the percentage is correct, but we got more than half i think
19:36:14 after scraping and reconstruction, i might actually get more posts
19:36:30 pokechu22: agreed re practicality (and i don't think sites linking to each other will be a problem--at worst, it won't get pages linked to by other sites but not their host site's homepage)
19:37:06 (and at best it will work fine, although there seems to be some confusion about this https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323962 ... https://hackint.logs.kiska.pw/archiveteam-bs/20220725#c323982)
19:38:46 i see that you have a job running for monsite-orange.fr_seed_urls.txt; are you going to start another for orange_isp_hosting_urls.txt? (or has that already been done?)
19:43:10 I'm working on building my own list for the orange one based on orange_isp_hosting_urls.txt, but I'm not going to run orange_isp_hosting_urls.txt directly
19:44:17 ok, cool
19:46:13 erkinalp: Those stats are not right. You'd have to analyse the WARC data to tell how much we covered. The number of URLs retrieved is not really correlated to that in a meaningful way.
19:46:25 We fetched offsite URLs, we fetched rating.php, and so on.
19:47:51 A coarser estimate would be possible by analysing just the log file and retrieving how many topic IDs appear there, but later pages could be missing, so that's still only a rough estimate.
19:49:45 JAA: yeah, that was what i was referring to by "after .. reconstruction, i might actually get more posts"
19:50:14 Or less. We don't know how far it got through the forum pagination.
19:52:08 thankfully wowturkey's viewtopic.php page size is fixed at 10, and ttum.php page size is fixed at 100
19:53:52 and both were configured in a manner to link to the most recent pages of each respective topic
20:44:57 Hi, how can I help with the efforts? Is a v6-space-holder useful anyhow?
20:48:13 IIRC there's at least one project currently active that uses a ton of v6 ips
20:48:20 I forget which one, imer wrote the code for it
20:48:23 Or rather
20:48:30 imer wrote the deployment code that makes it do lots of v6
20:48:50 Well, I can technically use a whole /39, so ready to help out
20:49:10 #deadcat but we're bandwidth limited there anyways (and they seem to rate limit per single ip or something like that)
20:49:12 #deadcat is the one
20:49:27 target bandwidth limited, that is
20:49:56 Also, I could see if I can hook up my spare /32 to some workers at some point
20:50:39 I'm just busy with targets
20:50:39 szczot3k: here's the aforementioned code/script as well: https://gist.github.com/imerr/614e534218a6b93be1a40b088dee885a
20:50:51 i heard #sweet supports ipv6 too but I don't know about their ratelimiting
20:51:27 there is none unless you go way too fast and then they will (it seems) manually block your ip
20:51:54 hah
20:51:56 Ok, got back through scrollback and it seems the consensus is to ignore webs for the moment and focus on orange? cc arkiver
20:52:49 also, glad to hear about wikis-grab!
20:55:10 i have some wikibot #// extraction almost ready but unsure about filtering
22:18:33 ok, orange.fr enumeration finished and spot-checks suggest that i got all the categories
22:18:38 processing the results now
22:20:11 malformed urls won't break archivebot, right? there are a few fun ones in here, like `usftennis2.monsite-orange.fr/index.html#="'> abcd ${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>` and `monsite.orange.fr@la-canaliere`
22:23:40 Right
22:23:56 the first one would just be treated as usftennis2.monsite-orange.fr/index.html because of the #
22:24:24 the second one would probably be treated as trying to log in as user monsite.orange.fr on site http://la-canaliere, which obviously won't work, but will fail in an acceptable way
22:24:35 FireonLive edited Deathwatch (+294, move wowTURKEY to dead (we should use that…): https://wiki.archiveteam.org/?diff=50602&oldid=50598
22:24:45 the main thing that breaks archivebot is FTP - there are a few other things that can cause problems but they aren't easy to control for
22:33:37 FireonLive edited Deathwatch (+2, fix url for 2028-Russia going to example.com…): https://wiki.archiveteam.org/?diff=50603&oldid=50602
22:33:41 (i was like example.com?!)
22:34:43 That mistake is so common.
22:35:07 I wish there was a way to make edits throw an error when a template isn't used correctly.
22:35:24 Probably possible with an extension or something ridiculous like that.
22:37:27 You could probably use an edit filter
22:37:48 er, for that one, probably the right thing to do is make it generate a big red message of anger instead of silently using example.com
22:38:20 We do have https://wiki.archiveteam.org/index.php/Category:Pages_with_broken_URLs for all uses of Template:URL where the URL is empty.
22:38:41 I just remembered that I added that at one point.
22:40:40 https://en.wikipedia.org/wiki/Module:Check_for_unknown_parameters exists, but I don't think Lua is enabled on the AT wiki
22:42:38 Pokechu22 edited Template:Url (+138, add visible warning about broken URLs): https://wiki.archiveteam.org/?diff=50604&oldid=49244
22:42:39 Pokechu22 edited Reddit (-1, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50605&oldid=49987
22:43:38 Pokechu22 edited Talk:Twitter (+2, fix incorrect {{URL}} usage): https://wiki.archiveteam.org/?diff=50606&oldid=49771
22:44:39 Pokechu22 created Category:Pages with broken URLs (+210, Created page with "Pages that use…): https://wiki.archiveteam.org/?title=Category%3APages%20with%20broken%20URLs
22:45:32 Good idea, thanks.
22:46:37 Should be good enough.
22:56:18 awesome ^_^
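For the Category:Pages_with_broken_URLs check mentioned at 22:38:20, the category's members can be listed through the standard MediaWiki API; the api.php path on the ArchiveTeam wiki is an assumption here, and continuation (cmcontinue) is skipped for brevity:

  curl -s 'https://wiki.archiveteam.org/api.php?action=query&list=categorymembers&cmtitle=Category:Pages_with_broken_URLs&cmlimit=500&format=json' \
    | jq -r '.query.categorymembers[].title'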