-
Sanqui
arkiver: I will deduplicate the list and enter what's new. Thank you
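The deduplication step Sanqui describes boils down to a set difference. A minimal sketch (function and argument names are illustrative, not anything from the actual workflow):

```python
def new_domains(previous, incoming):
    """Return the entries of `incoming` not already present in `previous`,
    compared case-insensitively with surrounding whitespace stripped."""
    seen = {d.strip().lower() for d in previous}
    return sorted({d.strip().lower() for d in incoming} - seen)
```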
-
arkiver
Sanqui: sounds good!
-
Sanqui
arkiver: 3482 new domains from your set of 13352
-
Sanqui
thank you very much!
-
Sanqui
(sweb.cz)
-
Sanqui
arkiver: that said, your domains include a lot of non-sweb urls like
synagoga-slatina.atlasweb.cz
-
Sanqui
(not that atlasweb.cz shouldn't also be archived at some point :'D)
-
arkiver
Sanqui: oops, sorry about that
-
arkiver
can you filter those out, or should I?
-
Sanqui
too late
-
Sanqui
they're getting archived
-
arkiver
fun :P
-
Sanqui
we should archive atlasweb.cz at some point anyway
-
h2ibot
Switchnode edited Deathwatch (+289, /* 2022 */ add blog.siol.net):
wiki.archiveteam.org/?diff=49177&oldid=49171
-
arkiver
Sanqui: for blog.siol.net reported by HCross I have a list here (may be incomplete)
transfer.archivete.am/10qXvT/blog.siol.net.txt
-
arkiver
1700 sites
-
arkiver
I hope AB is enough for that?
-
Sanqui
yeah, for sweb.cz I put in 4000-domain batches, but that's also because half of them are typically already dead
-
Sanqui
does archivebot !a < work without http:// prefixes?
-
arkiver
also, do you know how many sweb.cz sites that you had in your lists previously that were not in the list I created?
-
arkiver
Sanqui: no, needs http or https
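Since `!a <` needs a scheme, prefixing the bare domains is a one-liner per line of input. A sketch of the kind of preprocessing Sanqui would need (helper name is hypothetical):

```python
def with_scheme(line):
    """Prepend http:// to a bare domain; leave lines that already
    carry a scheme untouched."""
    line = line.strip()
    if line.startswith(("http://", "https://")):
        return line
    return "http://" + line
```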
-
Sanqui
OK, noted, I will handle it
-
arkiver
i think
-
HCross
arkiver: i have bad news
-
HCross
it's all wordpress
-
HCross
hilariously butchered wordpress
-
Sanqui
arkiver: 3.5k domains from the 13k you provided were new. I previously archived 144k domains, which means 131k of them you didn't know about
-
Sanqui
(but some of them may never have appeared online -- because I also derived sweb.com/[username] to [username].sweb.cz)
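The derivation Sanqui mentions (turning path-style user URLs into subdomain hosts) might look like this sketch; it assumes the username is the first path segment, which may not hold for every URL in the source lists:

```python
from urllib.parse import urlparse

def derive_subdomain(url):
    """Map a path-style user URL such as http://sweb.cz/jarda to the
    subdomain form jarda.sweb.cz. Returns None when there is no path,
    since not every derived host ever existed online."""
    username = urlparse(url).path.strip("/").split("/")[0]
    return f"{username}.sweb.cz" if username else None
```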
-
arkiver
Sanqui: oof
-
arkiver
it was a very incomplete list then, good to know
-
Sanqui
wait I may have miscalculated that (I'm bad at math)
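For what it's worth, the figures quoted earlier work out to roughly 134k rather than 131k. A sketch using the rounded numbers from the conversation:

```python
provided   = 13_352   # domains arkiver shared
new        = 3_482    # of those, not previously archived by Sanqui
previously = 144_000  # domains Sanqui had archived before (rounded)

overlap = provided - new        # domains present in both lists
unknown = previously - overlap  # previously archived domains absent from arkiver's list
print(overlap, unknown)
```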
-
arkiver
ah
-
arkiver
hmm
-
Sanqui
but you get the idea
-
arkiver
HCross: I'm no expert on AB - what is the consequence of that for AB?
-
Sanqui
I got my urls from Bing scrape, CDX, and mwlinkscrape (including czech wikipedia)
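Of those sources, "CDX" refers to the Wayback Machine's CDX API. A query for every captured URL under a domain can be built like this (a sketch; parameter choices such as `collapse=urlkey` are one reasonable configuration, not necessarily what Sanqui used):

```python
from urllib.parse import urlencode

def cdx_query(domain):
    """Build a Wayback CDX API URL listing captures under `domain`."""
    params = {
        "url": domain,
        "matchType": "domain",  # include all subdomains of `domain`
        "fl": "original",       # return only the original-URL column
        "collapse": "urlkey",   # one row per canonicalized URL
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```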
-
arkiver
nice sources
-
Sanqui
I still want to scrape some warcs for more domains but depends on if I find the time
-
Sanqui
where did you source yours arkiver?
-
joepie91|m
does urlteam also scrape t.co links?
-
Peetz0r|m
👀
-
joepie91|m
:p
-
JAA
I've said it before in #archivebot but it easily gets swallowed by the noise there, so for visibility: stay away from !a < if at all possible. It has lots of pitfalls that may lead to either missing content or overzealous recursion, depending on the URL list, cross-links, and even retrieval order. This is why it isn't documented (and won't be until there are changes made in that area).
-
JAA
Sanqui, arkiver: ^
-
Sanqui
oh dear
-
Sanqui
I've always had good experience with it, and I just did all of sweb with it. I can't think of an alternative way
-
JAA
It should be fine if there are no links between the sites appearing in the list (assuming it's all plain domains without paths, else it gets more complicated). But if there are, various things can go wrong.
-
JAA
One alternative might be to use wget-at or wpull with --span-hosts (and no --span-hosts-allow) and a --domains filter, although that would skip external page requisites and outlinks. Or wpull with a custom accept_url hook that overrides its internal filtering.
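The domain filtering JAA describes comes down to a host check like the following. This is a sketch of the filtering logic only; it is not wpull's actual hook interface, and the function name is made up:

```python
from urllib.parse import urlparse

def accept_for_domains(url, allowed):
    """Accept a URL only if its host is one of the allowed domains or a
    subdomain of one. As noted above, this skips external page
    requisites and outlinks entirely."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)
```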
-
JAA
Neither of these are *good* alternatives though.
-
schwarzkatz|m
JAA: I don’t think I got these kinds of errors while going through my url list of lacartoonerie
-
schwarzkatz|m
however, there seem to be database errors on some pages, and grab-site did not notice them (maybe they were 200s?)
-
JAA
schwarzkatz|m: Oof. What do those errors look like?
-
schwarzkatz|m
I already deleted the warc locally, so I don't really know, sorry
-
schwarzkatz|m
since I can't even find it via the search on archive.org, here it is if you want to take a look
archive.org/details/forum.lacartoon….com-2022-11-11-24a72456-00000.warc
-
JAA
Ack, thanks.
-
h2ibot
Fidel edited List of websites excluded from the Wayback Machine (+21, Add pawoo.net, as it is now (2022-11-21)…):
wiki.archiveteam.org/?diff=49178&oldid=49151
-
h2ibot
JAABot edited List of websites excluded from the Wayback Machine (+0):
wiki.archiveteam.org/?diff=49179&oldid=49178
-
schwarzkatz|m
how is illegal material on archive.org handled? Or in site grabs in general?
-
schwarzkatz|m
site owner wants to know before eventually helping us save their site
-
ivan
generally if something causes a problem the item is darked and no one ever sees it again
-
schwarzkatz|m
hm, so it gets reported through archive.org and they will handle it all?
-
schwarzkatz|m
the previous owner cannot be held responsible, right?
-
schwarzkatz|m
that was a dumb question, I'm sorry :/
-
ivan
"the previous owner" you mean the original website?
-
schwarzkatz|m
yes
-
schwarzkatz|m
probably best if I just paste the question of the owner here:
-
schwarzkatz|m
There could also be illegal material on there, as I've had problems with it in the past. How will this be handled?
-
ivan
if I collect evidence that someone broke the law and put it on archive.org, can they be held responsible for breaking the law?
-
ivan
perhaps
-
schwarzkatz|m
ivan, I thank you for helping me, I just don't want to say something wrong
-
ivan
if you can just archive in such a way to not collect illegal material that would be optimal
-
ivan
you can PM me and I'll tell you the odds of someone getting prosecuted
-
ivan
haha
-
schwarzkatz|m
that's probably impossible... we're looking at a file hoster here, how would you even determine if it is illegal via scripting :D
-
ivan
putting those on IA seems dubious
-
ivan
collect known-good links from forums and archive those
-
schwarzkatz|m
it's a pomf clone
-
schwarzkatz|m
well, it's way older than pomf
-
schwarzkatz|m
losing that amount of files would be insane