00:08:09 kelly, if you're reading logs: i believe the item is archiveteam_urls_20201102030531_9c6f61f7/urls_20201102030531_9c6f61f7.megawarc.warc.gz
00:08:32 keep in mind that while the chances are low, page requisites might be in a different megawarc
00:08:50 (fwiw, i got that megawarc from headers; more specifically x-archive-src)
00:12:56 Page requisites go through the tracker again, don't they? So there's actually a decent chance they might be in a different WARC.
00:13:11 JAA: I thought just outlinks go through the tracker again
00:13:13 could be wrong though
00:13:43 We wouldn't dedupe requisites if they didn't.
00:13:49 right
00:14:07 so uh yeah you'll probably have to go through the page requisites manually :/
00:14:33 or at least check the warc and make sure everything's there, but that's easy to mess up
01:38:44 I have since collected some more open directory links from the Internet.
01:39:04 http://www.salixa.com/trh/ios/IPAs/
01:39:09 http://www.salixa.com/trh/ios/Deb/
01:39:16 http://www.salixa.com/trh/ios/System/
01:39:38 Old iOS hacking related open directories to AB
01:42:00 http://185.8.7.44/ an IP address filled with pictures of showers and other plumbing-related stuff, and I have tended to notice that IP address websites are WAY more volatile than their named counterparts
01:43:19 http://s1.bitdl.ir/ - More iOS apps in this link
01:43:50 http://te.censoft.com/download/ - additional iOS software
01:44:14 https://marcoalima.com/ - More iOS software
01:45:09 https://doc.downloadha.com/ - More iOS software, however it has ~200 GB of non-iOS zip files, maybe set an exclusion for those
01:45:48 https://dl.ievo.top/ - Various softwares
01:47:12 https://files.zmodo.com/ - Downloads OD of some sort of TV company?
01:48:15 https://carboncostume.com/wordpress/wp-content/uploads/ - Pictures of Halloween costumes
01:48:56 https://www.gwern.net/images/ - Memes and IT-related images, including AI ones
01:50:13 https://pierrepapierciseaux.net/.skynet/img/ - images for Windows 93 SkyNet; https://pierrepapierciseaux.net/.skynet/midi/ - ditto for MIDI; https://pierrepapierciseaux.net/.skynet/js/ - ditto for JS
01:51:05 https://smallake.kr/wp-content/uploads/ - Various academic papers
01:52:10 http://star-www.st-and.ac.uk/~kdh1/ada/ - University course about data
01:53:06 https://www.catsarecute.xyz/ - Random small OD site, however the strict rate limits may make the site harder for crawlers to archive
01:53:40 https://ftp.dexp.club/ - Another medium firmware site
01:55:12 https://pastebin.com/dKSmBM7x - The rest of the OD links I have gathered online, make sure to remove the dashes on two of them
01:56:03 http://142.11.229.92/ - IP address site, went down a few days ago
02:34:52 Why is this in the URLs project?
02:34:58 *URLs irc channel
02:35:02 better for archivebot
02:39:28 Unfortunately we cannot archive these links here, as we could potentially take them offline with a DDoS. As techrobo has said, the archivebot channel is much better suited :)
02:39:49 It will conduct a much slower (rate-limited) crawl of the sites
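
Regarding the x-archive-src lookup mentioned at 00:08:50: a minimal sketch in Python, assuming the header value has the `<item>/<file>` shape quoted at 00:08:09. The helper name `split_archive_src` is hypothetical, and the URL it builds assumes the usual archive.org `/download/<item>/<file>` layout. It only locates the one megawarc named in the header; as noted above, page requisites may still be in a different WARC.

```python
from typing import Tuple
from urllib.parse import quote


def split_archive_src(header_value: str) -> Tuple[str, str, str]:
    """Split an x-archive-src value into (item, file, download_url).

    Assumes the value looks like "<item identifier>/<file path>".
    """
    value = header_value.strip()
    # Everything before the first slash is the item; the rest is the file path.
    item, sep, filename = value.partition("/")
    if not sep or not item or not filename:
        raise ValueError(f"unexpected x-archive-src value: {header_value!r}")
    # Assumed archive.org download layout: /download/<item>/<file>
    url = f"https://archive.org/download/{quote(item)}/{quote(filename)}"
    return item, filename, url


src = (
    "archiveteam_urls_20201102030531_9c6f61f7/"
    "urls_20201102030531_9c6f61f7.megawarc.warc.gz"
)
item, filename, url = split_archive_src(src)
print(item)
print(filename)
print(url)
```

The header value itself would come from a Wayback Machine response for the captured URL; fetching it is left out here so the sketch runs without network access.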