<arkiver> JAA: how many URLs is that list?
<arkiver> may be nice to queue yes
<JAA> arkiver: Don't remember exactly, need to try and find the list again, but I think it was a couple hundred million.
<arkiver> that's a lot
<arkiver> if you find it can you ping me with it? so I can glance over it before we put 100 million URLs in here
<JAA> Yeah, of course.
<arkiver> if these are all good outlinks (no other 'inlinks' we don't want here), it would be interesting to see how much is deduplicated away when queuing
<JAA> arkiver: Found it, and it's a giant mess, that's why I never pursued it further. There are 227M lines in the file, and they should all contain a URL, but there's also a lot of noise in them and it includes something like 60M truncated URLs due to faulty description extraction at the time.
<JAA> Noise meaning 'text:URL' and stuff like that.
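A cleanup pass for a list like this could look roughly like the sketch below. Everything in it (the regex, the `extract_url` helper, the dot-in-host truncation heuristic) is hypothetical, not the script actually used; it just illustrates pulling the URL out of 'text:URL' noise and dropping obvious fragments.

```python
import re

# Hypothetical cleanup for a noisy URL list: lines may carry
# 'text:URL' prefixes, and some URLs are truncated. Not the actual
# script used here -- just an illustration of the idea.
URL_RE = re.compile(r'https?://\S+')

def extract_url(line):
    """Return the first http(s) URL in a noisy line, or None."""
    m = URL_RE.search(line)
    if not m:
        return None
    url = m.group(0)
    # Crude truncation check: a host with no dot in it is almost
    # certainly a cut-off URL, so drop it.
    host = url.split('://', 1)[1].split('/', 1)[0]
    if '.' not in host:
        return None
    return url
```

The dot-in-host check would not catch URLs truncated mid-path, which is why a list like this stays messy even after a pass like this one.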
<arkiver> right :/
<JAA> We could reprocess the dislikes project data. Doesn't look like we were feeding back to here at the time.
<JAA> I don't feel like cleaning up those 227M lines of mess.
<arkiver> JAA: can you send me the 227-million-line file?
<JAA> arkiver: Yes, will PM you in a sec. I also have another list that I don't think was queued either and is also somewhat messy. Will send you that one as well.
<arkiver> thanks
<TheTechRobo> how big is this file? :P
<arkiver> GB
<h2ibot> arkiver: Skipped 30108 bad URLs: transfer.archivete.am/PUde1/youtube-urls.txt.bad-urls.txt
<h2ibot> arkiver: Skipped 46706 unprintable URLs: transfer.archivete.am/fnF7r/youtube-urls.txt.not-printable.txt
<h2ibot> arkiver: Deduplicating and queuing 17769410 items.
<JAA> :-)
<arkiver> looking into these unprintable URLs later, I think we can fix most of them
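What the fix-up actually does isn't shown in the log, but one plausible approach (purely a sketch; `fix_unprintable` is a made-up name, not the tracker's code) is to percent-encode anything outside printable ASCII:

```python
from urllib.parse import quote

def fix_unprintable(url):
    """Percent-encode characters outside printable ASCII.
    A guess at what 'fixing' an unprintable URL could mean,
    not the tracker's actual logic."""
    return ''.join(
        ch if ' ' < ch <= '~' else quote(ch, safe='')
        for ch in url
    )
```

`quote` UTF-8-encodes non-ASCII characters before percent-encoding, so e.g. a stray `é` becomes `%C3%A9` and an embedded space becomes `%20`.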
<h2ibot> arkiver: Skipped 46706 unprintable URLs: transfer.archivete.am/AlG1t/youtube…not-printable.txt.not-printable.txt
<h2ibot> arkiver: Deduplicating and queuing 0 items.
<h2ibot> arkiver: Deduplicated and queued 0 items.
<arkiver> right
<h2ibot> arkiver: Deduplicated and queued 17769410 items.
<arkiver> moving to :secondary
<arkiver> Ryz: are you happy :P ^
<Ryz> Yaaaaayyyyyyy~
<arkiver> JAA: surprisingly few duplicates
<arkiver> may be due to http:// often being used
<arkiver> http/https are treated as unique
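If the queue really does key on the exact URL string (an assumption; the tracker internals aren't shown here), then `http://` and `https://` variants of the same page both survive dedupe, as in this sketch:

```python
def dedupe(urls):
    """Order-preserving dedupe on the exact URL string. Since the
    scheme is part of the key, http:// and https:// variants of
    the same URL are both kept."""
    seen = set()
    out = []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out

# Both scheme variants survive:
# dedupe(['http://example.com/v', 'https://example.com/v',
#         'http://example.com/v'])
# -> ['http://example.com/v', 'https://example.com/v']
```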
<JAA> How many after dedupe?
<arkiver> no idea
<arkiver> but looking at the number after queuing, not that big of a difference
<JAA> Interesting
<h2ibot> arkiver: Skipped 95 bad URLs: transfer.archivete.am/NWevV/youtube….txt.not-printable.txt.bad-urls.txt
<h2ibot> arkiver: Fixed 46706 unprintable URLs: transfer.archivete.am/P1NIO/youtube…not-printable.txt.not-printable.txt
<h2ibot> arkiver: Deduplicating and queuing 46611 items.
<h2ibot> arkiver: Deduplicated and queued 46611 items.
<arkiver> JAA: looks like we have lots of shortened URLs, that may be a reason for the few duplicates
<JAA> Ah