02:48:07 JAA: how many URLs is that list?
02:48:11 may be nice to queue yes
02:52:30 arkiver: Don't remember exactly, need to try and find the list again, but I think it was a couple hundred million.
02:52:56 that's a lot
02:53:16 if you find it can you ping me with it? so I can glance over it before we put 100 million URLs in here
02:53:26 Yeah, of course.
02:53:54 if these are all good outlinks (no other 'inlinks' we don't want here), it would be interesting to see how much is deduplicated away when queuing
02:58:14 arkiver: Found it, and it's a giant mess, that's why I never pursued it further. There are 227M lines in the file, and they should all contain a URL, but there's also a lot of noise in them and it includes something like 60M truncated URLs due to faulty description extraction at the time.
02:58:39 Noise meaning 'text:URL' and stuff like that.
03:00:51 right :/
03:20:16 We could reprocess the dislikes project data. Doesn't look like we were feeding back to here at the time.
03:21:28 I don't feel like cleaning up those 227M lines of mess.
13:03:15 JAA: can you send me the 227 million lines file?
18:09:56 arkiver: Yes, will PM you in a sec. I also have another list that I don't think was queued either and is also somewhat messy. Will send you that one as well.
18:11:12 thanks
20:21:18 !a https://transfer.archivete.am/8mBAr/youtube-urls.txt
20:31:32 how big is this file? :P
20:31:46 GB
20:32:32 arkiver: Skipped 30108 bad URLs: https://transfer.archivete.am/PUde1/youtube-urls.txt.bad-urls.txt
20:32:35 arkiver: Skipped 46706 unprintable URLs: https://transfer.archivete.am/fnF7r/youtube-urls.txt.not-printable.txt
20:33:23 arkiver: Deduplicating and queuing 17769410 items.
20:33:36 :-)
20:34:07 looking later into these unprintable URLs, I think we can fix most of them
20:36:43 !a https://transfer.archivete.am/fnF7r/youtube-urls.txt.not-printable.txt
20:36:47 arkiver: Skipped 46706 unprintable URLs: https://transfer.archivete.am/AlG1t/youtube-urls.txt.not-printable.txt.not-printable.txt
20:36:48 arkiver: Deduplicating and queuing 0 items.
20:36:49 arkiver: Deduplicated and queued 0 items.
20:36:52 right
20:39:08 arkiver: Deduplicated and queued 17769410 items.
20:39:31 moving to :secondary
20:40:55 Ryz: are you happy :P ^
20:41:21 Yaaaaayyyyyyy~
20:41:50 JAA: surprisingly few duplicates
20:41:57 may be due to http:// often being used
20:42:07 http/https are treated as unique
20:44:09 How many after dedupe?
20:44:13 no idea
20:44:26 but looking at the number after queuing not that big of a difference
20:44:39 Interesting
21:01:08 !a https://transfer.archivete.am/fnF7r/youtube-urls.txt.not-printable.txt
21:01:13 arkiver: Skipped 95 bad URLs: https://transfer.archivete.am/NWevV/youtube-urls.txt.not-printable.txt.bad-urls.txt
21:01:16 arkiver: Fixed 46706 unprintable URLs: https://transfer.archivete.am/P1NIO/youtube-urls.txt.not-printable.txt.not-printable.txt
21:01:17 arkiver: Deduplicating and queuing 46611 items.
21:01:20 arkiver: Deduplicated and queued 46611 items.
21:13:44 JAA: looks like we have lots of shortened URLs, that may be a reason for the few duplicates
21:15:46 Ah
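
Editor's sketch, not the queuing bot's actual code: the log mentions noise like 'text:URL', unprintable URLs being skipped, and http/https being treated as unique during deduplication. The Python below illustrates those three ideas under stated assumptions. The regex, the printable-ASCII rule, and every function name here are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical pattern: take the first http(s) URL on a line, which drops
# noise such as a 'text:' prefix. Not the real bot's extraction logic.
URL_RE = re.compile(r"https?://\S+")

def extract_url(line):
    """Return the first URL found in a noisy line, or None."""
    match = URL_RE.search(line)
    return match.group(0) if match else None

def is_printable_ascii(url):
    """Assumed rule: every character must be printable ASCII (0x21-0x7E)."""
    return all(0x21 <= ord(ch) <= 0x7E for ch in url)

def dedupe(urls, ignore_scheme=False):
    """Deduplicate while keeping first-seen order.

    With ignore_scheme=False, http:// and https:// variants count as
    different items, as described in the log. With True, they collapse.
    """
    seen = set()
    result = []
    for url in urls:
        key = re.sub(r"^https?://", "", url) if ignore_scheme else url
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result

# Made-up sample lines, not taken from the real list.
lines = [
    "text:http://youtu.be/abc",
    "https://youtu.be/abc",
    "http://youtu.be/abc",
    "no url here",
]
urls = [u for u in map(extract_url, lines) if u and is_printable_ascii(u)]
print(len(dedupe(urls)))                      # scheme-sensitive count
print(len(dedupe(urls, ignore_scheme=True)))  # scheme-insensitive count
```

Comparing the two counts on a sample of the real list would show how much of the "surprisingly few duplicates" result comes from the http/https split. Shortened URLs would need resolving before they could be compared at all.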