-
Sanqui
arkiver: I will deduplicate the list and enter what's new. Thank you
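The deduplication step Sanqui describes boils down to a set difference. A minimal sketch (function and argument names are illustrative, not anything from the actual workflow):

```python
def new_domains(previous, incoming):
    """Return the entries of `incoming` not already present in `previous`,
    compared case-insensitively with surrounding whitespace stripped."""
    seen = {d.strip().lower() for d in previous}
    return sorted({d.strip().lower() for d in incoming} - seen)
```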
-
arkiver
Sanqui: sounds good!
-
Sanqui
arkiver: 3482 new domains from your set of 13352
-
Sanqui
thank you very much!
-
Sanqui
(sweb.cz)
-
Sanqui
arkiver: that said, your domains include a lot of non-sweb urls like
synagoga-slatina.atlasweb.cz
-
Sanqui
(not that atlasweb.cz shouldn't also be archived at some point :'D)
-
arkiver
Sanqui: oops, sorry about that
-
arkiver
can you filter those out, or should I?
-
Sanqui
too late
-
Sanqui
they're getting archived
-
arkiver
fun :P
-
Sanqui
we should archive atlasweb.cz at some point anyway
-
h2ibot
Switchnode edited Deathwatch (+289, /* 2022 */ add blog.siol.net):
wiki.archiveteam.org/?diff=49177&oldid=49171
-
arkiver
Sanqui: for blog.siol.net reported by HCross I have a list here (may be incomplete)
transfer.archivete.am/10qXvT/blog.siol.net.txt
-
arkiver
1700 sites
-
arkiver
I hope AB is enough for that?
-
Sanqui
yeah, for sweb.cz I put in 4000-domain batches, but that's also because half of them are typically already dead
-
Sanqui
does archivebot !a < work without http:// prefixes?
-
arkiver
also, do you know how many sweb.cz sites that you had in your lists previously that were not in the list I created?
-
arkiver
Sanqui: no, needs http or https
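Since `!a <` needs a scheme, prefixing the bare domains is a one-liner per line of input. A sketch of the kind of preprocessing Sanqui would need (helper name is hypothetical):

```python
def with_scheme(line):
    """Prepend http:// to a bare domain; leave lines that already
    carry a scheme untouched."""
    line = line.strip()
    if line.startswith(("http://", "https://")):
        return line
    return "http://" + line
```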
-
Sanqui
OK, noted, I will handle it
-
arkiver
i think
-
HCross
arkiver: i have bad news
-
HCross
it's all wordpress
-
HCross
hilariously butchered wordpress
-
Sanqui
arkiver: 3.5k domains from the 13k you provided were new. I previously archived 144k domains, which means 131k of them you didn't know about
-
Sanqui
(but some of them may never have appeared online -- because I also derived sweb.com/[username] to [username].sweb.cz)
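The derivation Sanqui mentions (turning path-style user URLs into subdomain hosts) might look like this sketch; it assumes the username is the first path segment, which may not hold for every URL in the source lists:

```python
from urllib.parse import urlparse

def derive_subdomain(url):
    """Map a path-style user URL such as http://sweb.cz/jarda to the
    subdomain form jarda.sweb.cz. Returns None when there is no path,
    since not every derived host ever existed online."""
    username = urlparse(url).path.strip("/").split("/")[0]
    return f"{username}.sweb.cz" if username else None
```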
-
arkiver
Sanqui: oof
-
arkiver
it was a very incomplete list then, good to know
-
Sanqui
wait I may have miscalculated that (I'm bad at math)
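For what it's worth, the figures quoted earlier work out to roughly 134k rather than 131k. A sketch using the rounded numbers from the conversation:

```python
provided   = 13_352   # domains arkiver shared
new        = 3_482    # of those, not previously archived by Sanqui
previously = 144_000  # domains Sanqui had archived before (rounded)

overlap = provided - new        # domains present in both lists
unknown = previously - overlap  # previously archived domains absent from arkiver's list
print(overlap, unknown)
```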
-
arkiver
ah
-
arkiver
hmm
-
Sanqui
but you get the idea
-
arkiver
HCross: I'm no expert on AB - what is the consequence of that for AB?
-
Sanqui
I got my urls from Bing scrape, CDX, and mwlinkscrape (including czech wikipedia)
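Of those sources, "CDX" refers to the Wayback Machine's CDX API. A query for every captured URL under a domain can be built like this (a sketch; parameter choices such as `collapse=urlkey` are one reasonable configuration, not necessarily what Sanqui used):

```python
from urllib.parse import urlencode

def cdx_query(domain):
    """Build a Wayback CDX API URL listing captures under `domain`."""
    params = {
        "url": domain,
        "matchType": "domain",  # include all subdomains of `domain`
        "fl": "original",       # return only the original-URL column
        "collapse": "urlkey",   # one row per canonicalized URL
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```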
-
arkiver
nice sources
-
Sanqui
I still want to scrape some warcs for more domains but depends on if I find the time
-
Sanqui
where did you source yours arkiver?
-
joepie91|m
does urlteam also scrape t.co links?
-
Peetz0r|m
👀
-
joepie91|m
:p
-
JAA
I've said it before in #archivebot but it easily gets swallowed by the noise there, so for visibility: stay away from !a < if at all possible. It has lots of pitfalls that may lead to either missing content or overzealous recursion, depending on the URL list, cross-links, and even retrieval order. This is why it isn't documented (and won't be until there are changes made in that area).
-
JAA
Sanqui, arkiver: ^
-
Sanqui
oh dear
-
Sanqui
I've always had good experience with it, and I just did all of sweb with it. I can't think of an alternative way
-
JAA
It should be fine if there are no links between the sites appearing in the list (assuming it's all plain domains without paths, else it gets more complicated). But if there are, various things can go wrong.
-
JAA
One alternative might be to use wget-at or wpull with --span-hosts (and no --span-hosts-allow) and a --domains filter, although that would skip external page requisites and outlinks. Or wpull with a custom accept_url hook that overrides its internal filtering.
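The domain filtering JAA describes comes down to a host check like the following. This is a sketch of the filtering logic only; it is not wpull's actual hook interface, and the function name is made up:

```python
from urllib.parse import urlparse

def accept_for_domains(url, allowed):
    """Accept a URL only if its host is one of the allowed domains or a
    subdomain of one. As noted above, this skips external page
    requisites and outlinks entirely."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)
```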
-
JAA
Neither of these are *good* alternatives though.
-
schwarzkatz|m
JAA: I don’t think I got these kinds of errors while going through my url list of lacartoonerie
-
schwarzkatz|m
however, there seem to be database errors on some pages, and grab-site did not notice them (maybe they were 200s?)
-
JAA
schwarzkatz|m: Oof. What do those errors look like?
-
schwarzkatz|m
I already deleted the warc locally, so I don't really know, sorry
-
schwarzkatz|m
since I can't even find it via the search on archive.org, here it is if you want to take a look
archive.org/details/forum.lacartoon….com-2022-11-11-24a72456-00000.warc
-
JAA
Ack, thanks.
-
h2ibot
Fidel edited List of websites excluded from the Wayback Machine (+21, Add pawoo.net, as it is now (2022-11-21)…):
wiki.archiveteam.org/?diff=49178&oldid=49151
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=49179&oldid=49178
-
schwarzkatz|m
how is illegal material on archive.org handled? Or in site grabs in general?
-
schwarzkatz|m
site owner wants to know before eventually helping us save their site
-
ivan
generally if something causes a problem the item is darked and no one ever sees it again
-
schwarzkatz|m
hm, so it gets reported through archive.org and they will handle it all?
-
schwarzkatz|m
the previous owner cannot be held responsible, right?
-
schwarzkatz|m
that was a dumb question, I'm sorry :/
-
ivan
"the previous owner" you mean the original website?
-
schwarzkatz|m
yes
-
schwarzkatz|m
probably best if I just paste the question of the owner here:
-
schwarzkatz|m
There could also be illegal material on there, as I've had problems with it in the past. How will this be handled?
-
ivan
if I collect evidence that someone broke the law and put it on archive.org, can they be held responsible for breaking the law?
-
ivan
perhaps
-
schwarzkatz|m
ivan, I thank you for helping me, I just don't want to say something wrong
-
ivan
if you can just archive in such a way to not collect illegal material that would be optimal
-
ivan
you can PM me and I'll tell you the odds of someone getting prosecuted
-
ivan
haha
-
schwarzkatz|m
that's probably impossible... we're looking at a file hoster here, how would you even determine if it is illegal via scripting :D
-
ivan
putting those on IA seems dubious
-
ivan
collect known-good links from forums and archive those
-
schwarzkatz|m
it's a pomf clone
-
schwarzkatz|m
well, it's way older than pomf
-
schwarzkatz|m
losing that amount of files would be insane