00:45:01 More fun with TechnologyGuide: if the URL contains 'nonexistant', the server resets the connection: http://forum.notebookreview.com/nonexistant/
00:45:54 notebookreview is closing? :/
00:46:08 Yep
00:46:40 that sucks
00:46:49 My qwarc archive of the thread pages just finished a few minutes ago. Should be complete apart from the countless broken shit like the above.
00:47:17 what even.... 'temp' and 'nonexistant'
00:47:43 http://forum.notebookreview.com/threads/asus-v6j-everest-benchmarks.43048/ returns an empty page.
00:47:57 It's kind of hilarious just how broken these forums are.
00:50:31 http://forum.notebookreview.com/nessus/
00:50:36 ¯\_(ツ)_/¯
00:52:12 they have some terribly broken WAF paired with a 10-year-old corrupted forum database?
05:14:23 TechnologyGuide forum archive post counts: 1841025 from Brighthand (1880674 on homepage), 53604 from DigitalCameraReview (58092), 4180267 from NotebookReview (no global stats, about 9.35M from adding up the subforum numbers, but those don't include everything), 508834 from TabletPCReview (521529)
05:15:05 The NotebookReview discrepancy seems pretty bad, but no idea where it comes from. I didn't see any systematic problems in the data.
05:17:39 Adding up the 'Messages' numbers for the subforums gives 9347229 there. This doesn't include http://forum.notebookreview.com/forums/nbr-marketplace.18/ (which is shown as a link on the homepage instead of a forum entry).
06:53:25 looking for someone who helped to archive soup.io
07:09:48 stormy: Unless you're looking for someone to describe their personal experiences on the project, it's best just to ask your question
07:13:22 fair enough. so if I understood correctly, the archived data gets uploaded directly to archive.org and nothing else. what I'm trying to find out is: did the crawler get around the "Content Warning" pages, and, once on archive.org, how do I get around the content warning pages?
07:20:02 Do you have an example?
07:20:43 sure: https://web.archive.org/web/20191201125421/http://einefragevonstil.soup.io/
07:21:27 I have enough experience in web crawling to know how it can work on the crawling side, but not with archive.org...
07:33:54 So I don't see anything to indicate that the project got those
07:34:21 where can I find a list of soup.io hosts that the project did cover?
07:35:30 I do not think any such list exists
07:35:50 Well, I expect it was deleted since that project was 2 years ago
07:36:18 Also, that site was "covered" by the project, I just don't see anything getting around the content warning
07:37:20 If you do want a list I can give you some quick info on how to generate it
07:37:55 that'd be great, thanks
07:41:24 stormy: https://web.archive.org/web/*/http://example.com/* will list all URLs
07:41:29 Use https://archive.org/services/docs/api/internetarchive/cli.html to download all the .os.cdx.gz files in collection:archiveteam:soup.io, then decompress them, extract the URLs, and then the domains
07:42:13 It would be better to tell us what you're trying to do, though, since asking for what the project covered is fairly specific
07:44:07 thanks, will try both of these later. what I'm trying to do is get an offline mirror of some soups that I remember. anything that still exists. I was hoping that it would exist in other forms than on archive.org, but I guess I'll have to make do with that.
07:44:30 gotta run for now, back in an hour.
08:51:03 OrIdow6: are you sure that collection:archiveteam:soup.io is the correct identifier? It gives me a query error: https://archive.org/advancedsearch.php?q=collection%3Aarchiveteam%3Asoup.io&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback&save=yes
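The wildcard listing suggested at 07:41:24 can also be done programmatically through the Wayback Machine's CDX search API. Below is a minimal Python sketch, using the soup host from 07:20:43 as the example; note that it lists every capture the Wayback Machine holds for that host, not only the ones uploaded by the soup.io project, and the field/parameter choices are just one reasonable selection.

import requests

CDX_ENDPOINT = 'https://web.archive.org/cdx/search/cdx'

def list_captured_urls(host):
    # matchType=host returns captures for every path under the hostname;
    # collapse=urlkey keeps one row per distinct URL.
    params = {
        'url': host,
        'matchType': 'host',
        'output': 'json',
        'fl': 'original,timestamp',
        'collapse': 'urlkey',
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
    # With output=json, the first row is the field-name header; the rest are captures.
    return [original for original, timestamp in rows[1:]]

if __name__ == '__main__':
    for url in list_captured_urls('einefragevonstil.soup.io'):
        print(url)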
08:58:23 Usually you just search for a file from the project in the UI and you can find the collection that way
08:58:54 Also, if you're worried the data only lives on the IA, you can just list the items using the CLI (and collection name) and just download them all
08:59:33 Usually you can get most of it to replay by loading the warcs into pywb3. Sometimes you need to find some additional warcs with static site data like javascript and css files.
08:59:38 The IA has them somewhere usually
09:01:56 I'm still having trouble getting an item list from a url prefix, either with the CLI or the web
09:02:31 You're right, I said it wrong, that should be archiveteam_soupio, not archiveteam:soupio
09:09:12 thanks, that's getting me somewhere
09:52:32 this company is allegedly shutting down https://cyberninjas.com/ https://edition.cnn.com/2022/01/07/politics/cyber-ninjas-shutting-down-arizona/index.html
10:27:25 Anyone know how to request specific IDs with the API at https://www.thiswebsitewillselfdestruct.com/api/get_letter
10:27:52 Currently all IDs are randomized and random messages are displayed
10:51:27 ee
15:29:27 Hey JAA. If you don't mind me asking, how goes the archiving effort for the TechnologyGuide sites?
15:36:10 Thanks JAA!
15:36:12 https://archive.org/details/technologyguide_forums_20220125
18:04:18 Apparently Roblox is banning all users with YT or any other reference to off-site platforms, be ready to see some 404s
18:04:26 *some
18:07:36 https://www.youtube.com/watch?v=9DBb6_aVS4M
20:00:44 JustAnotherArchivist edited TechnologyGuide (+53): https://wiki.archiveteam.org/?diff=48220&oldid=48214
20:00:45 Gridkr edited Coronavirus/Affected companies (+401): https://wiki.archiveteam.org/?diff=48221&oldid=45063
20:00:48 anyone archive all of the CIA World Factbook? https://www.cia.gov/the-world-factbook/
20:41:22 Making sure a copy is grabbed now, duce1337
20:43:05 ok
21:33:31 OrIdow6: I'm currently very busy with my day job, so I haven't had the time to spend on the Ukrainian archive
21:34:19 But I've got some more lists of URLs that I need to clean before they can be added to the wiki
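Circling back to the soup.io discussion above: tying the .os.cdx.gz approach from 07:41:29 together with the corrected collection identifier from 09:02:31, here is a rough sketch using the internetarchive Python library instead of the ia CLI. It assumes the usual ' CDX N b a m s k r M S V g' index layout with the original URL in the third column; check the header line of the downloaded files if the layout differs.

import gzip
import os
from urllib.parse import urlsplit

import internetarchive as ia

DESTDIR = 'soupio_cdx'  # local directory for the downloaded index files

def download_cdx_files():
    # Enumerate the collection's items and fetch only the .os.cdx.gz index files.
    for result in ia.search_items('collection:archiveteam_soupio'):
        ia.download(result['identifier'], glob_pattern='*.os.cdx.gz',
                    destdir=DESTDIR, ignore_existing=True)

def covered_hosts():
    # Assumption: the original URL is the third space-separated field of each CDX line.
    hosts = set()
    for root, _dirs, files in os.walk(DESTDIR):
        for name in files:
            if not name.endswith('.os.cdx.gz'):
                continue
            with gzip.open(os.path.join(root, name), 'rt', errors='replace') as f:
                for line in f:
                    if line.startswith(' CDX'):  # header line
                        continue
                    fields = line.split(' ')
                    if len(fields) > 2:
                        hosts.add(urlsplit(fields[2]).hostname)
    return hosts

if __name__ == '__main__':
    download_cdx_files()
    for host in sorted(h for h in covered_hosts() if h):
        print(host)

The same search_items()/download() calls without the glob pattern would pull down the full WARCs, which can then be loaded into pywb for local replay as described at 08:59:33.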