-
fireonlive
is there a way to stop grab-site and have it resume where it left off on a different server?
-
h2ibot
Tomodachi94 edited Prnt.sc (+28, Add category):
wiki.archiveteam.org/?diff=49853&oldid=49849
-
h2ibot
Tomodachi94 edited Roblox (+425, /* Group sales */ New section):
wiki.archiveteam.org/?diff=49854&oldid=48716
-
h2ibot
Tomodachi94 created Technic Platform (+851, Create page):
wiki.archiveteam.org/?title=Technic%20Platform
-
h2ibot
Tomodachi94 uploaded File:Technic Platform wordmark.webp (Wordmark of [[Technic Platform]]):
wiki.archiveteam.org/?title=File%3ATechnic%20Platform%20wordmark.webp
-
h2ibot
Tomodachi94 uploaded File:Technic Platform 2023-05-29 homepage.png (Homepage of [[Technic Platform]] on 2023-05-29):
wiki.archiveteam.org/?title=File%3A…latform%202023-05-29%20homepage.png
-
masterX244
damnit... a 302 to ignored content gets the 302 dropped from the WARC on grab-site, too...
-
pokechu22
That's weird since archivebot doesn't check ignores on redirect targets at all
-
pokechu22
(it does apply the no-parent rule to redirect targets, though I'm not sure if the redirect is dropped from the WARC in that case)
-
spirit
pokechu22: artdoxa has finished, right? :)
-
pokechu22
Yeah, it finished a while ago
-
spirit
awesome!
-
pokechu22
It did error on artdoxa.com/more?page=5753 (after successfully loading artdoxa.com/more?page=5752) and while it was going through those it was occasionally finding new users (but not new artworks I think). It seems like now artdoxa.com/more?page=5754 gives an error instead - this might just be it running out of pages or something, not sure.
-
spirit
it's totally fine if some pages were not archived. the contact initially did not think about archiving it at all so this is fantastic
-
spirit
thank you so much!
-
spirit
is there a way to see the list of warcs?
-
pokechu22
an example of a user it found is artdoxa.com/e-amsalk, who has no submitted artworks but did have favorites (which I guess it picked up from that list)
-
spirit
hm, i had estimated about 3 times that. is there a way to get a log of all urls contained in the warcs without downloading them?
-
JAA
spirit: The -meta.warc.gz contains the complete log of the job, including ignored URLs.
-
spirit
excellent
-
JAA
For more details, like MIME type, response size (in the general case), etc., you'd want each WARC's CDX.
-
pokechu22
The meta-warc is basically just a (gz-compressed) text file in that case; you can read it with zless
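
A minimal sketch of the same idea in Python, for counting the URLs mentioned in the log instead of paging through it with zless. It assumes only that the file is gzip-compressed text, as described above. The function name `urls_in_log` and the URL regex are illustrative, and no particular log line format is assumed.

```python
import gzip
import re

# Loose pattern for anything that looks like an http(s) URL (an assumption,
# not a format guaranteed by the log).
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def urls_in_log(path):
    """Return the set of distinct http(s) URLs found in a gzip-compressed text file."""
    found = set()
    # errors="replace" keeps the scan going if the log contains non-UTF-8 bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            found.update(URL_RE.findall(line))
    return found
```

Calling `len(urls_in_log(path))` on a downloaded meta-warc gives a rough URL count. Since the log also lists ignored URLs, and the pattern matches URL-like text anywhere in the file, the count is an upper bound on what was actually fetched.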
-
pokechu22
Looks like amazon urls do have a length in the meta-warc but the artdoxa urls don't (probably because they don't set a content-length header)
-
spirit
i'll just take a brief look at the number of URLs later, if that matches expectations roughly, then this is good enough
-
spirit
thanks again!
-
pokechu22
JAA: is there a case where the meta-warc wouldn't contain the MIME type? I see `Length: unspecified [text/html; charset=utf-8]` in it currently which seems to imply that it has one even when it doesn't have the length
-
pokechu22
It might be worth checking if every thumbnail also had a full-sized image saved and vice versa; that'd be a good heuristic to make sure everything's saved
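
A rough sketch of that cross-check in Python, assuming you already have two lists of URLs (thumbnails and full-sized images) and some way to map each URL to a shared image ID. The `image_id` helper below uses the file name without its extension, which is a hypothetical layout and not necessarily how artdoxa names its files.

```python
from posixpath import basename, splitext
from urllib.parse import urlsplit


def image_id(url):
    """Hypothetical key: file name without extension, e.g. '123' for .../thumb/123.jpg."""
    return splitext(basename(urlsplit(url).path))[0]


def cross_check(thumb_urls, full_urls, key=image_id):
    """Return (IDs with a thumbnail but no full-size image, IDs with the reverse)."""
    thumbs = {key(u) for u in thumb_urls}
    fulls = {key(u) for u in full_urls}
    return thumbs - fulls, fulls - thumbs
```

If both returned sets are empty, every thumbnail has a matching full-sized image and vice versa, under whatever ID mapping was supplied.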
-
JAA
pokechu22: Hmm, yeah, I guess the MIME type should always be there, true.
-
rewby|backup
Heyo, I know people've been trying to get me to fix target stuff. Can anyone give me a summary? (arkiver, datechnoman, JAA)
-
pokechu22
I guess also, if you estimated the expected total size by using the average size of recently uploaded images, it's probably the case that older images are generally smaller
-
pokechu22
though I'm not sure if that'd make a 400GB difference
-
spirit
pokechu22: yeah, that was my thinking too with the corresponding images
-
spirit
it was a *really* rough estimate :)
-
spirit
iirc i sampled ~10 full size images, might have been huge ones by random chance
-
JAA
rewby: The targets behind Reddit and #// are struggling to keep up with the increased load from historical Reddit data, and some targets on Imgur have been erroring consistently (though that might just be the SBs).
-
rewby|backup
JAA: Ack. First prio is dealing with #// and reddit. I need to shift some stuff around. I was handed back one of the target servers a day or two ago. I need to put it back online.
-
JAA
Lovely :-)
-
spirit
looks good. there should be ~130k artworks and ~125k full-size artworks were grabbed. nice!
-
icedice
Is #// Imgur brute force?
-
pokechu22
#// is a general URLs project, I think? Not sure of the details. The imgur brute force is discussed in #imgone to my understanding
-
JAA
Correct, #// mostly handles external URLs discovered in other projects, e.g. Reddit and, yes, Imgur (image descriptions etc.).
-
JAA
*Everything* Imgur-related is in #imgone and should stay there.
-
icedice
All right, gotcha