00:08:53 re mastodon, here is a command-line client that requires no JS and dumps info to the terminal https://github.com/jwilk/zygolophodon
00:09:15 probably could take the guts of it to make an archiver for mastodon
01:42:01 "transgender phenomenon" it's not a phenomenon lmao
01:42:09 is this stuff that's being archived? I feel like why not let that content die?
01:42:21 Want to have that debate now? It is -bs
01:42:49 Have to warn you, though, no matter what you say, I'm doing it anyway.
01:43:28 debate on transgender being a phenomenon or archiving content like that
01:43:34 Archiving content.
01:43:48 I feel it's important to have that stuff archived for future historians and the like. But I personally will not be going out of my way to archive it. But I agree it's important to do so.
01:44:56 Indeed, I didn't consider the historical research perspective
01:46:20 those who forget history are doomed to repeat it 🤷
01:47:26 I am now just scared at the absolute fucking wall of text sketchcow must be typing
01:53:20 "those who forget history are..." <- It's about what _to_ do and what _not_ to do =P
01:53:26 Gotta have both
02:05:34 thankfully i am going to bed but I am on znc so.
02:05:42 SketchCow, feel free to PM me if you want instead :P
02:44:13 we're not letting any content die
02:44:41 as with any site, if it is shutting down, an effort can be made in #archivebot or a custom project to archive that website
02:44:45 (or webpage/etc.)
03:27:28 Yts98 edited Mobile Phone Applications (+186, Clean up dead links, add APKPure): https://wiki.archiveteam.org/?diff=50714&oldid=48489
03:28:30 the Windows Subsystem for Android may come in handy for archival of android stuff
03:33:29 Yts98 edited Mobile Phone Applications (+50): https://wiki.archiveteam.org/?diff=50715&oldid=50714
03:54:33 Yts98 edited BlackBerry World (+596, Update project status): https://wiki.archiveteam.org/?diff=50716&oldid=46619
03:56:49 RIP
04:44:04 https://www.theverge.com/2023/9/1/23856029/gizmodo-shuts-down-spanish-language-site-ai-translations
04:44:12 "Gizmodo’s owner shuts down Spanish language site in favor of AI translations"
04:44:42 seems the site stays up but every article is Google-translated (poorly)
04:45:09 w t f
04:45:34 I should have said every *new* article
04:45:47 bad human translations are better because bad grammar can be *noticed*
04:47:03 the better the AI translation the harder it is to notice when it screws up :/
04:47:47 * pabs wonders whether to archive the site or not...
04:54:41 wat
04:55:09 pabs: might be worth grabbing the old articles i guess...
04:55:18 in case they just scrap the thing altogether in the future
05:08:30 https://torrentfreak.com/tv-museum-will-die-in-48-hours-unless-sony-retracts-youtube-copyright-strikes-230904/
05:10:25 was thrown in DTT
05:10:32 website failed in AB due to how overloaded/slow it is
05:11:19 https://old.reddit.com/r/DataHoarder/comments/169q4cp/ claims 'multiple copies have been made' as well
05:11:41 4900 videos, seems most are not too long though
05:13:21 hmm, the site has a ton of subdomains
05:14:09 pretty broken in a browser
05:14:29 my yahoo videos indexing seems to be going well
05:14:40 pabs: wonder if it's relying on www.* for styles/js?
05:14:58 a .tar download from IA has been going for 26 continuous hours without getting interrupted, amazing
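As a rough illustration of the 00:09 idea of building a Mastodon archiver around the same no-JS approach: a minimal sketch that pulls a status and its thread via Mastodon's public REST API. The instance and status ID are placeholders, and this is not derived from zygolophodon's actual code.

```python
#!/usr/bin/env python3
"""Minimal sketch of a no-JS Mastodon fetcher in the spirit of zygolophodon."""
import json
import sys

import requests

INSTANCE = "mastodon.social"       # assumption: any public instance works the same way
STATUS_ID = "110000000000000000"   # placeholder status ID

def fetch(path):
    # Public, unauthenticated REST endpoints; no JavaScript involved.
    r = requests.get(f"https://{INSTANCE}/api/v1/{path}", timeout=30)
    r.raise_for_status()
    return r.json()

def main():
    status = fetch(f"statuses/{STATUS_ID}")
    # /context returns the ancestors and descendants of the thread.
    context = fetch(f"statuses/{STATUS_ID}/context")
    # Dump everything as JSON; an archiver would write this (plus media URLs) to WARC/disk.
    json.dump({"status": status, "context": context}, sys.stdout, indent=2)

if __name__ == "__main__":
    main()
```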
05:15:04 www. seems broken indeed
05:15:47 Yts98 edited Duelyst (+40, Clean up formatting): https://wiki.archiveteam.org/?diff=50717&oldid=46610
05:15:59 response times for things from there seem to take multiple seconds at times
05:18:18 in AB I just got a 403 and then a timeout :(
05:24:34 hmm, in the browser all the subdomains are the same but different from the main site
05:47:27 https://nitter.net/BIS_records/status/1698920897036374117?s=20 https://www.macrumors.com/2023/09/05/apple-acquires-bis-records/
05:57:55 Yts98 edited URLTeam/Dead (-82, untiny.me (untiny.com) is an unshortner, not a…): https://wiki.archiveteam.org/?diff=50718&oldid=50695
06:43:05 Yts98 edited URLTeam (+597, Filter out dead table rows): https://wiki.archiveteam.org/?diff=50719&oldid=50694
06:48:03 yts98: nice work lately :)
06:50:36 ;)
06:54:19 ;o
09:02:31 https://www.theguardian.com/world/2023/sep/05/bharat-g20-invitation-fuels-rumours-india-may-change-name
13:58:42 I found this url shortener + pastebin site https://fars.ee/ that has a notice saying "All data may be erased without notifications". Maybe something that should be archived? IDs are 1 to 4 characters long, a-zA-Z0-9_- so it might be possible to just !ao < a list of all possible combinations.
16:19:49 download interrupted for 46GiB .tar.bz2 from yahoo videos :( at least the others seem to be still going well
16:22:01 nicolas17: Wget allows resuming downloads
16:22:14 and IA allows download a range of bytes
16:22:31 downloading*
16:22:48 arkiver: I'm using "curl | tar tv", but I'll switch to wget which I *think* will auto-resume on errors even when outputting to stdout
16:23:21 right, within the same Wget process it might
16:23:37 would be interesting to know
16:24:06 for this one I can't "wget -c" now because there's no file to resume from
16:24:56 I think I like the wget progress indicator better anyway :p
16:25:03 me toooooo
16:25:17 wish i could shove that into curl
16:27:04 there is "curl --progress-bar" to get a... bar, but it only shows % and no throughput or MiB or ETA
16:27:21 :(
16:33:12 Length: 484259014419 (451G) [application/octet-stream]
16:33:14 yikes
16:34:36 ooof lol
16:34:52 that’ll take a second
16:35:11 110.20M 907KB/s eta 6d 14h
16:35:34 * FireFly . o O ( if it takes a second that's some _very_ impressive downlink (and uplink) :p )
16:37:20 you could split it up into pieces and download them concurrently using byte ranges
16:38:27 arkiver: that would require 451GB of disk space ;) I'm piping into tar -tv
16:38:28 FireFly: :P
17:40:56 it could be interesting to make a tool that does multithreaded downloads of smaller chunks and outputs them to stdout as they finish
17:44:16 with some kind of container so they can be rearranged and reassembled afterward? (or synchronisation to make sure they're output in-order?)
17:44:36 yeah internal buffering and output in order
17:45:07 yeah could be interesting
17:45:57 Basically a reverse ia-upload-stream. It does exactly that, just in the other direction (reading from stdin, uploading in chunks in parallel, in order).
17:46:24 JAA: you mentioned chunked uploads to IA have drawbacks, right?
17:46:41 Yeah, the processing on IA's side is inefficient.
17:47:14 It copies the chunks to the backup server, then assembles them in a separate task and copies the assembled file over again.
17:47:54 And because that's always a separate task, snowballing doesn't work well for uploading multiple files to the same item.
17:48:10 Neither of these should affect chunked downloads, of course.
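A minimal sketch of the brute-force ID list suggested at 13:58, assuming the stated alphabet and 1-4 character length are accurate (that is 64 + 64^2 + 64^3 + 64^4 = 17,043,520 URLs; the output filename is arbitrary):

```python
#!/usr/bin/env python3
"""Enumerate all possible fars.ee IDs and write them as URLs, one per line,
suitable for feeding to !ao < in ArchiveBot."""
import itertools
import string

ALPHABET = string.ascii_letters + string.digits + "_-"  # 64 characters

with open("farsee-urls.txt", "w") as f:
    for length in range(1, 5):
        for combo in itertools.product(ALPHABET, repeat=length):
            f.write("https://fars.ee/" + "".join(combo) + "\n")
```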
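A sketch of the "reverse ia-upload-stream" idea from 17:40-17:45: parallel HTTP Range requests over fixed-size chunks, buffered and written to stdout strictly in order so the stream can be piped straight into something like `tar -tv`. The URL, chunk size, and worker count are placeholders, and it assumes the server honours Range requests and reports Content-Length (IA downloads generally do).

```python
#!/usr/bin/env python3
"""Chunked parallel downloader that emits the file to stdout in order."""
import sys
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://archive.org/download/some-item/some-file.tar"  # placeholder
CHUNK = 64 * 1024 * 1024   # 64 MiB per range request
WORKERS = 4

def fetch_range(rng):
    start, end = rng
    # Retry a couple of times so a dropped connection only costs one chunk.
    for attempt in range(3):
        try:
            r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"}, timeout=300)
            r.raise_for_status()
            return r.content
        except requests.RequestException:
            if attempt == 2:
                raise

def main():
    size = int(requests.head(URL, allow_redirects=True, timeout=60).headers["Content-Length"])
    ranges = [(o, min(o + CHUNK, size) - 1) for o in range(0, size, CHUNK)]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        # Download WORKERS ranges at a time and emit them in order; this bounds
        # memory to roughly WORKERS * CHUNK at the cost of a small stall between batches.
        for i in range(0, len(ranges), WORKERS):
            for chunk in pool.map(fetch_range, ranges[i:i + WORKERS]):
                sys.stdout.buffer.write(chunk)

if __name__ == "__main__":
    main()
```

Processing the ranges in batches is the simplest way to get the "internal buffering and output in order" behaviour without letting the workers run arbitrarily far ahead of the write position.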
17:51:49 hm
17:52:03 if I upload a file to an existing item, it goes to the same server as the other files, right?
17:52:39 Yes
17:53:14 An entire item is always on a single server (plus its mirror at the other facility). Even on a single disk in that server, I think.
17:53:21 Or well, single FS at least.
17:53:34 what happens if I upload multiple files, in parallel, targeting the same non-existent item name? will they end up on the same server? when is the server "assigned" to an item?
17:53:35 Maybe there's some RAID in place, no idea.
17:54:10 You can't. All but one upload will fail with some weird error message IIRC.
17:54:18 well that's good to know
17:54:37 I have uploaded multiple files in parallel to get better speed
17:54:42 which worked great
17:54:51 good to know I shouldn't do that for the *first* file...
17:55:15 Yeah, at least one upload has to be done individually; afterwards you can go parallel, even if the archive.php task for that first upload hasn't run yet, I believe.
17:55:37 But it's also worth mentioning that IA generally discourages parallel uploads to individual items.
17:57:04 why?
17:57:23 too much load on a single server?
17:59:21 I believe it has to do with task limits. I.e. you're more likely to run into rate limiting errors.
17:59:41 ah
17:59:51 that may be more relevant if it was hundreds of files I guess?
18:00:21 I was uploading like, 5 files, each of them >1GB
18:01:14 Yeah, or at least several dozen.
18:01:30 with IA's current ingestion problems I would probably upload one at a time and let it take as long as it wants tho... don't add to the problem ^^
18:01:44 Agreed
18:01:58 even if I have 3Mbps upstream
18:43:00 in theory, would it be possible to, say, take all (wiki)pages that have an infobox and pull attributes from them onto a page? e.g. 'every infobox's IRC channel'
18:43:29 Anything is possible!
18:43:54 Retrieving the page contents is easy enough. But then you have to parse MediaWiki syntax probably...
18:44:00 the eldritch horror of mediawiki :D
18:45:38 oh! i meant to ping you JAA - there's a page I can't quite edit: https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit
18:45:48 "Monday, Nov. 09, 2009"
18:46:01 no rushy :3
18:47:29 Yup, protected page.
18:47:42 ye, bad phrasing
18:47:53 could you please datetime-ify that for me =]
18:48:18 Ah
18:49:34 JustAnotherArchivist edited Main Page (+10, Datetimeify): https://wiki.archiveteam.org/?diff=50720&oldid=48497
18:49:50 :D thanks
22:42:04 sooo any news on IA ingestion?
22:42:37 Frame 6 of 6
22:42:55 x_x I know what you're referencing
22:43:18 I guess temp storage already has its hat on fire too
22:46:03 Most likely
22:46:10 nicolas17: there is internal progress on resolving it. it's not resolved yet
22:46:14 This is archiveteam what else do you expect
22:46:16 i don't have an ETA
22:46:34 i'm hoping within a month... but i don't know, i'm not the one handling this
22:46:37 that's fine
22:47:17 I'm not like "what's taking so long?!", more like "by any chance did I miss news while I was offline?"
22:47:43 do we have a month's worth of temp storage? x_x
22:48:39 Depends on if any urgent projects come up, probably. Without any, we're not far away with the throttled projects. But getting #shreddit up again would be good and would change the equation.
22:49:13 Looks like #zowch isn’t happening then…..
22:49:19 it is happening
22:49:44 #zowch ^
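Coming back to the 18:43 infobox question, a sketch of how that could work against the wiki's API: list the pages transcluding the infobox template, fetch their wikitext, and pull one parameter with mwparserfromhell. The template name ("Infobox project") and parameter name ("ircchannel") are guesses about the ArchiveTeam wiki and would need checking.

```python
#!/usr/bin/env python3
"""Pull one infobox parameter from every page that transcludes the infobox."""
import mwparserfromhell
import requests

API = "https://wiki.archiveteam.org/api.php"
TEMPLATE = "Infobox project"   # assumed infobox template name
PARAM = "ircchannel"           # assumed parameter holding the IRC channel

session = requests.Session()

def query(params):
    full = {"action": "query", "format": "json", "formatversion": 2, **params}
    return session.get(API, params=full, timeout=60).json()

def pages_with_infobox():
    cont = {}
    while True:
        data = query({"list": "embeddedin", "eititle": f"Template:{TEMPLATE}",
                      "einamespace": 0, "eilimit": "max", **cont})
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in data:
            return
        cont = data["continue"]

def irc_channel(title):
    data = query({"prop": "revisions", "rvprop": "content", "rvslots": "main",
                  "titles": title})
    wikitext = data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
    # Parse the wikitext and look for the infobox template's parameter.
    for tpl in mwparserfromhell.parse(wikitext).filter_templates():
        if tpl.name.matches(TEMPLATE) and tpl.has(PARAM):
            return str(tpl.get(PARAM).value).strip()

if __name__ == "__main__":
    for title in pages_with_infobox():
        channel = irc_channel(title)
        if channel:
            print(f"{title}: {channel}")
```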
22:50:15 nicolas17: completely understood! i had no negative (or necessarily positive) reading of your question
22:50:20 just a question, i answered :P
22:50:45 (reading back - may have come off as annoyed/harsh? not my intention)
22:50:47 a few days ago I complained because people saw #telegrab was idle and started adding items "to keep things busy"
22:50:52 flashfire42|m: ZOWA is small enough to not be an issue.
22:51:25 nicolas17: if you have large lists of channels, feel free to pass them to me
22:51:37 Yeah that was partially my fault and I apologise for that Nicolas17
22:51:40 best is in the format of channel:CHANNELNAME lines in a file, then I can queue it directly
22:52:44 flashfire42|m: yeah don't worry, you already apologized at the time
22:53:46 I was also a bit unsure if it's *really* such a big deal to add large channels like that
22:53:59 like *how* conservative should we be adding items? are we in "emergency mode - low capacity - only archive if it's truly at risk", or is that too extreme? :p
22:54:19 the rules were much more relaxed
22:54:39 then a ton of big news channels were put in, we added like 50 TB of youtube to IA for a few days
22:54:42 i noticed too late
22:55:15 arkiver: oh I was talking about telegram, a few days ago
22:55:31 and we got in trouble at IA. we're known as a trusted, responsible organisation, and dumping 100s of TBs of youtube content that will likely not be deleted any time soon into IA does not fit the "responsible" flag we have
22:55:39 oh
22:55:42 uh
22:57:02 Telegram items are a lot smaller, but there were people, including myself, throwing in "busywork" so to speak. Stuff that could be useful but also probably not super important. A few million crypto faucet items
22:57:26 I think nicolas17 is trying to ask what limits should be in place for that with the limited storage
22:57:50 I won't be adding any more except for the ones I scrape off the wiki because we are at like 40 million to do as of right now.
22:59:11 arkiver: I was like "is that channel actually important to archive? or are we adding stuff just to keep workers busy? I don't think we need to 'stay busy' while we have limited capacity"
22:59:20 but it's not really my place to judge that if I don't even know how much capacity we have or how long it will take for IA issues to resolve
22:59:36 i just need to do those checks for reddit and we can restart
23:00:03 also, if I ask "is that item actually important" it's probably a genuine question and not judging that it's not important, maybe it is :)
23:00:18 Because yeah, when we have the free flow, telegram is a free-for-all, but do we need to be more selective right now for that project?
23:01:47 my stats on telegram say: avg item 1.7MB, success rate 50.1% (completed ÷ dequeued), estimated data remaining in queue 34TB
23:02:22 not bad
23:02:50 We have around 180 TiB of remaining offload capacity currently.
23:02:58 it's very hard to give an "ETA for queue empty" because it seems we're hitting target "max connections (-1)" errors, so the speed goes up and down a lot
23:04:32 imgur has so many items failing that I estimate like 1TB left lol
23:08:21 rip the i.imgur.com refuge
23:13:25 I started a telegram worker using a ramdisk for data, and it's now on request 5357 with 104MB of data total /o\
23:18:18 active channels go brrrrrrrrrrrrr
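For what it's worth, the 23:01 estimate is consistent with the ~40 million to-do items mentioned at 22:57; a quick back-of-the-envelope check, assuming those are the right numbers to combine:

```python
# Sanity check of the "34TB remaining" estimate from 23:01:47.
todo_items = 40_000_000      # items remaining in the queue (from 22:57:50)
success_rate = 0.501         # completed / dequeued (from 23:01:47)
avg_item_mb = 1.7            # average item size in MB (from 23:01:47)

expected_data_tb = todo_items * success_rate * avg_item_mb / 1_000_000
print(f"~{expected_data_tb:.0f} TB")   # ~34 TB, matching the quoted estimate
```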