09:43:29 arkiver: I will deduplicate the list and enter what's new. Thank you
15:26:48 Sanqui: sounds good!
16:19:57 <@Sanqui> arkiver: 3482 new domains from your set of 13352
16:19:57 <@Sanqui> thank you very much!
16:19:57 <@Sanqui> (sweb.cz)
16:25:44 Sanqui edited Sweb.cz (+307): https://wiki.archiveteam.org/?diff=49176&oldid=49174
16:31:23 arkiver: that said, your domains include a lot of non-sweb URLs like http://www.synagoga-slatina.atlasweb.cz/
16:31:38 (not that atlasweb.cz shouldn't also be archived at some point :'D)
16:59:06 Sanqui: oops, sorry about that
16:59:26 can you filter those out, or should I?
16:59:31 too late
16:59:33 they're getting archived
16:59:38 fun :P
16:59:39 we should archive atlasweb.cz at some point anyway
17:09:53 Switchnode edited Deathwatch (+289, /* 2022 */ add blog.siol.net): https://wiki.archiveteam.org/?diff=49177&oldid=49171
17:14:08 Sanqui: for blog.siol.net reported by HCross I have a list here (may be incomplete) https://transfer.archivete.am/10qXvT/blog.siol.net.txt
17:14:12 1700 sites
17:14:17 I hope AB is enough for that?
17:14:51 yeah, for sweb.cz I put in 4000-domain batches, but that's also because half of them are typically already dead
17:15:16 does ArchiveBot !a < work without http:// prefixes?
17:15:21 also, do you know how many sweb.cz sites you had in your lists previously that were not in the list I created?
17:15:32 Sanqui: no, it needs http or https
17:15:41 OK, noted, I will handle it
17:15:41 I think
17:16:42 arkiver: I have bad news
17:16:44 it's all WordPress
17:16:49 hilariously butchered WordPress
17:17:40 arkiver: 3.5k domains from the 13k you provided were new; I previously archived 144k domains, which means you didn't know about 131k of them
17:18:04 (but some of them may never have appeared online -- because I also derived sweb.com/[username] to [username].sweb.cz)
17:18:31 Sanqui: oof
17:18:46 it was a very incomplete list then, good to know
17:18:50 wait, I may have miscalculated that (I'm bad at math)
17:18:51 ah
17:18:54 hmm
17:18:57 but you get the idea
17:19:20 HCross: I'm not an expert on AB - what are the consequences of that for AB?
17:19:22 I got my URLs from a Bing scrape, CDX, and mwlinkscrape (including Czech Wikipedia)
17:19:31 nice sources
17:21:09 I still want to scrape some WARCs for more domains, but that depends on whether I find the time
17:21:33 where did you source yours, arkiver?
17:49:00 does URLTeam also scrape t.co links?
17:52:02 👀
17:52:07 :p
18:22:49 I've said it before in #archivebot, but it easily gets swallowed by the noise there, so for visibility: stay away from !a < if at all possible. It has lots of pitfalls that may lead to either missing content or overzealous recursion, depending on the URL list, cross-links, and even retrieval order. This is why it isn't documented (and won't be until changes are made in that area).
18:23:11 Sanqui, arkiver: ^
18:23:45 oh dear
18:24:11 I've always had good experience with it, and I just did all of sweb with it. I can't think of an alternative way
18:26:20 It should be fine if there are no links between the sites appearing in the list (assuming it's all plain domains without paths; otherwise it gets more complicated). But if there are, various things can go wrong.
18:27:31 One alternative might be to use wget-at or wpull with --span-hosts (and no --span-hosts-allow) and a --domains filter, although that would skip external page requisites and outlinks. Or wpull with a custom accept_url hook that overrides its internal filtering.
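
A minimal sketch of the list preparation discussed above, assuming plain-text files with one domain or URL per line: deduplicate against previously archived domains, keep only sweb.cz hosts (dropping strays like atlasweb.cz), derive [username].sweb.cz candidates from sweb.com/[username] paths as Sanqui describes, and write the http:// prefix that !a < requires. The file names are placeholders, not anything from the log.

    #!/usr/bin/env python3
    # Sketch only: build an ArchiveBot batch for sweb.cz. Input/output file
    # names are placeholders; the sweb.com/[username] -> [username].sweb.cz
    # derivation follows what Sanqui describes in the log.
    from urllib.parse import urlsplit

    def parse(line):
        line = line.strip()
        if not line:
            return None
        if '//' not in line:              # accept bare domains as well as full URLs
            line = 'http://' + line
        return urlsplit(line)

    # Domains already archived in earlier batches (one per line).
    already = {u.hostname for l in open('archived_domains.txt')
               if (u := parse(l)) and u.hostname}

    candidates = set()
    for line in open('new_list.txt'):
        u = parse(line)
        if not u or not u.hostname:
            continue
        host = u.hostname.lower()
        if host == 'sweb.cz' or host.endswith('.sweb.cz'):
            candidates.add(host)          # drops strays like *.atlasweb.cz
        elif host.removeprefix('www.') == 'sweb.com':
            # Derive [username].sweb.cz from a sweb.com/[username] path.
            username = u.path.strip('/').split('/')[0]
            if username:
                candidates.add(username.lower() + '.sweb.cz')

    with open('ab_batch.txt', 'w') as out:
        for host in sorted(candidates - already):
            out.write(f'http://{host}\n')  # !a < needs the http(s):// prefix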
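
And a rough sketch of the custom accept_url hook JAA mentions, assuming the wpull 1.x --python-script hook interface (the one the ArchiveBot and grab-site forks of wpull descend from); the exact url_info/record_info keys should be checked against the wpull version in use before relying on this.

    # wpull_sweb_hook.py -- sketch only; run with:
    #   wpull --python-script wpull_sweb_hook.py ...
    # Overrides wpull's internal URL filtering so recursion stays on *.sweb.cz.
    # Assumes the wpull 1.x hook API: wpull_hook is injected by wpull at
    # runtime, url_info/record_info are dicts, and accept_url returns the
    # final verdict.

    def accept_url(url_info, record_info, verdict, reasons):
        host = (url_info.get('hostname') or '').lower()
        if host == 'sweb.cz' or host.endswith('.sweb.cz'):
            return True
        # For off-site URLs at the root level (e.g. seed URLs), keep wpull's
        # own decision, but never let them expand recursion further.
        if record_info.get('level') == 0:
            return verdict
        return False

    wpull_hook.callbacks.accept_url = accept_url  # noqa: F821 (injected global)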
18:28:08 Neither of these are *good* alternatives though.
18:29:48 JAA: I don't think I got these kinds of errors while going through my URL list of lacartoonerie
19:18:36 however, there seem to be database errors on some pages, and grab-site did not notice them (maybe they were 200s?)
19:30:17 schwarzkatz|m: Oof. What do those errors look like?
19:36:22 I already deleted the WARC locally, so I don't really know, sorry
19:41:57 since I can't even find it via the search on archive.org, here it is if you want to take a look: https://archive.org/details/forum.lacartoonerie.com-2022-11-11-24a72456-00000.warc
19:42:34 Ack, thanks.
20:10:22 Fidel edited List of websites excluded from the Wayback Machine (+21, Add pawoo.net, as it is now (2022-11-21)…): https://wiki.archiveteam.org/?diff=49178&oldid=49151
21:00:30 JAABot edited List of websites excluded from the Wayback Machine (+0): https://wiki.archiveteam.org/?diff=49179&oldid=49178
22:52:50 how is illegal material on archive.org handled? Or in site grabs in general?
22:54:11 the site owner wants to know before eventually helping us save their site
23:00:20 generally, if something causes a problem, the item is darked and no one ever sees it again
23:01:09 hm, so it gets reported through archive.org and they will handle it all?
23:04:47 the previous owner cannot be held responsible, right?
23:05:26 that was a dumb question, I'm sorry :/
23:05:39 "the previous owner" - you mean the original website?
23:05:44 yes
23:07:02 probably best if I just paste the owner's question here:
23:07:02 "There could also be illegal material on there, as I've had problems with it in the past. How will this be handled?"
23:07:12 if I collect evidence that someone broke the law and put it on archive.org, can they be held responsible for breaking the law?
23:07:17 perhaps
23:07:21 ivan, I thank you for helping me, I just don't want to say something wrong
23:08:36 if you can just archive in such a way as to not collect illegal material, that would be optimal
23:10:19 you can PM me and I'll tell you the odds of someone getting prosecuted
23:10:19 haha
23:10:30 that's probably impossible... we're looking at a file hoster here; how would you even determine if it is illegal via scripting :D
23:11:16 putting those on IA seems dubious
23:11:34 collect known-good links from forums and archive those
23:11:57 it's a pomf clone
23:12:28 well, it's way older than pomf
23:12:28 losing that amount of files would be insane
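
On ivan's suggestion of collecting known-good links from forums: one way, sketched under the assumption that the relevant forum pages are already saved locally as HTML, is to pull out only the anchors pointing at the file hoster. filehoster.example, forum.example, and the pages/ directory are all placeholders.

    #!/usr/bin/env python3
    # Sketch only: extract file-hoster links from saved forum HTML so that
    # just the publicly referenced files get archived. Names are placeholders.
    import pathlib
    from urllib.parse import urljoin, urlsplit
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    HOSTER = 'filehoster.example'
    found = set()

    for page in pathlib.Path('pages').glob('**/*.html'):
        soup = BeautifulSoup(page.read_text(errors='replace'), 'html.parser')
        for a in soup.find_all('a', href=True):
            url = urljoin('https://forum.example/', a['href'])  # resolve relative links
            host = urlsplit(url).hostname or ''
            if host == HOSTER or host.endswith('.' + HOSTER):
                found.add(url)

    with open('known_good_links.txt', 'w') as out:
        out.writelines(u + '\n' for u in sorted(found))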
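
And on the database errors schwarzkatz|m mentions earlier (pages that error out but return 200, which a crawler treats as success): a post-hoc scan of the WARC for error strings can find them. A sketch using the warcio library, pointed at the file from the archive.org item linked above; the marker strings are guesses and would need adjusting to whatever the forum software actually emits.

    #!/usr/bin/env python3
    # Sketch only: find responses that returned HTTP 200 but contain
    # database-error text. The marker strings are guesses.
    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    MARKERS = (b'database error', b'sql error', b'mysql_connect')

    with open('forum.lacartoonerie.com-2022-11-11-24a72456-00000.warc', 'rb') as f:
        for record in ArchiveIterator(f):
            if record.rec_type != 'response' or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != '200':
                continue
            body = record.content_stream().read().lower()
            if any(m in body for m in MARKERS):
                print(record.rec_headers.get_header('WARC-Target-URI'))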