-
datechnoman
-
h2ibotdatechnoman: Registering isAJoDlg for '!a transfer.archivete.am/gsq2b/unique_pdfs_output.txt'
-
h2ibotdatechnoman: Skipped 4203 invalid URLs: transfer.archivete.am/4fZWf/unique_pdfs_output.txt.bad-urls.txt (isAJoDlg)
-
h2ibotdatechnoman: Fixed 1 unprintable URLs: transfer.archivete.am/nhadS/unique_pdfs_output.txt.not-printable.txt (isAJoDlg)
-
h2ibotdatechnoman: Skipped 1 very long URLs: transfer.archivete.am/X76xV/unique_pdfs_output.txt.skipped.txt (isAJoDlg)
-
h2ibotdatechnoman: Deduplicating and queuing 9401326 items. (isAJoDlg)
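(Note: the bot messages above describe a list-cleaning pass: invalid URLs are skipped, unprintable characters are fixed, very long URLs are skipped, and the rest is deduplicated and queued. The following is only a rough sketch of such a pass, not h2ibot's actual code. The function name `clean_url_list`, the `MAX_URL_LENGTH` cutoff, and the "http/https with a host" validity rule are all assumptions made for illustration.)

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048  # hypothetical cutoff; the bot's real limit is not stated


def clean_url_list(lines, max_len=MAX_URL_LENGTH):
    """Sort raw lines into queued, bad, fixed and too-long buckets (assumed rules)."""
    queued, bad, fixed, too_long = [], [], [], []
    seen = set()
    for raw in lines:
        url = raw.strip()
        if not url:
            continue
        # Drop unprintable characters and record that the line was repaired.
        printable = "".join(ch for ch in url if ch.isprintable())
        if printable != url:
            fixed.append(printable)
            url = printable
        # Skip anything longer than the (assumed) maximum length.
        if len(url) > max_len:
            too_long.append(url)
            continue
        # Treat a URL as valid only if it parses as http(s) with a host.
        try:
            parts = urlsplit(url)
        except ValueError:
            bad.append(url)
            continue
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
            continue
        # Deduplicate while preserving first-seen order.
        if url not in seen:
            seen.add(url)
            queued.append(url)
    return queued, bad, fixed, too_long
```

(Each returned list corresponds loosely to one of the bot's report lines; a list whose entries were all queued earlier would, as in this case, add nothing new after deduplication against the existing queue.)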
-
datechnomanWell, it turns out I'd already queued that one before, so nothing new :)
-
arkiverdatechnoman: kiska: do note that the URLs are only 'cleaner' in the sense that we no longer make requests to certain IP ranges
-
fireonlivearkiver: reminds me of github.com/robertdavidgraham/masscan/blob/master/data/exclude.conf
-
fireonliveaka the list of internal IPs + crazy people
-
arkiverfireonlive: interesting
-
arkiverfireonlive: we use this github.com/ArchiveTeam/wget-lua/blob/v1.21.3-at/src/host.c#L74-L138
-
fireonlivearkiver: ah :) just reserved
-
fireonlivearkiver: they regularly (or at least used to) scan the whole internet, so they got a bundle of fun 'abuse' complaints from people :/
-
fireonlivewell, anyone can use the software too
-
arkiverfireonlive: right
-
arkiverbut for example "#Janet is a UK research and education network!"
-
arkiveri think we would not want to exclude those addresses
-
arkiveralso not military ranges, in case they host files
-
fireonliveindeed
-
fireonliveimportant stuff to get :)
-
fireonlivei'm glad we just exclude the reserved stuff
-
arkiverindeed!
-
arkiverbut yeah we can always block more if needed
-
arkiverbetter block too little than too much in my opinion
-
fireonlive:)
-
fireonliveagreed
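(Note: the policy discussed above is to refuse requests only to reserved or internal address ranges, and to leave research networks, military ranges and the like reachable. The following is a minimal sketch of that idea using Python's standard `ipaddress` module. It is not a port of the list in wget-lua's host.c, whose exact ranges are not reproduced here; the function name `is_excluded_address` is hypothetical, and the set of flags checked is an assumption.)

```python
import ipaddress


def is_excluded_address(address):
    """Return True for addresses in special-purpose ranges (assumed policy)."""
    ip = ipaddress.ip_address(address)
    # Only special-purpose ranges are blocked; ordinary public ranges,
    # whoever operates them, stay reachable.
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


if __name__ == "__main__":
    for candidate in ("10.0.0.1", "127.0.0.1", "8.8.8.8"):
        print(candidate, is_excluded_address(candidate))
```

(This errs on the side of blocking too little: anything not flagged by the standard library as special-purpose is allowed, and further ranges could be added later if needed.)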
-
fireonlivehm, interesting urls-tor item i saw fly by on the tracker: custom:comment=special%2dinterest%2dfrom%2d<redacted>&random=202404&url=http%3a%2f%2f<redacted>
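(Note: the item name quoted above looks like a `custom:` prefix followed by a percent-encoded query string. The following sketch shows how such a name could be decoded with Python's `urllib.parse`. The layout is inferred from this single example only, the function name `parse_custom_item` is hypothetical, and the sample values below are placeholders rather than the redacted originals.)

```python
from urllib.parse import parse_qs


def parse_custom_item(item):
    """Split 'prefix:key=value&...' into the prefix and a dict of decoded fields."""
    prefix, _, query = item.partition(":")
    # parse_qs percent-decodes values (e.g. %2d -> '-', %3a%2f%2f -> '://').
    fields = {key: values[0] for key, values in parse_qs(query).items()}
    return prefix, fields


if __name__ == "__main__":
    # Placeholder item; not the real item seen on the tracker.
    sample = "custom:comment=special%2dinterest&random=202404&url=http%3a%2f%2fexample.com"
    print(parse_custom_item(sample))
```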
-
datechnomanarkiver those changes are excellent. Keeping us under the radar
-
datechnomanThe URL cleaner stuff, I mean