-
JAA
More fun with TechnologyGuide: if the URL contains 'nonexistant', the server resets the connection:
forum.notebookreview.com/nonexistant
-
Frogging101
notebookreview is closing? :/
-
JAA
Yep
-
Frogging101
that sucks
-
JAA
My qwarc archive of the thread pages just finished a few minutes ago. Should be complete apart from the countless broken shit like the above.
-
Jake
what even.... 'temp' and 'nonexistant'
-
JAA
It's kind of hilarious just how broken these forums are.
-
JAA
¯\_(ツ)_/¯
-
Jake
they have some terribly broken WAF paired with a 10 year old corrupted forum database?
-
JAA
TechnologyGuide forum archive post counts: 1841025 from Brighthand (1880674 on homepage), 53604 from DigitalCameraReview (58092), 4180267 from NotebookReview (no global stats, about 9.35M from adding up the subforum numbers, but those don't include everything), 508834 from TabletPCReview (521529)
-
JAA
The NotebookReview discrepancy seems pretty bad, but no idea where it comes from. I didn't see any systematic problems in the data.
-
JAA
Adding up the 'Messages' numbers for the subforums gives 9347229 there. This doesn't include
forum.notebookreview.com/forums/nbr-marketplace.18 (which is shown as a link on the homepage instead of a forum entry).
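To put the discrepancy in proportion, the counts quoted above can be turned into coverage percentages. This is only a rough sketch: the NotebookReview reference figure is the subforum sum of 9347229, not an official global total, and the function name `coverage_percent` is made up for illustration.

```python
# Post counts quoted in the chat: (archived, reported by the site).
# For NotebookReview the "reported" value is the sum of the subforum
# 'Messages' numbers, which is not an official global statistic.
COUNTS = {
    "Brighthand": (1841025, 1880674),
    "DigitalCameraReview": (53604, 58092),
    "NotebookReview": (4180267, 9347229),
    "TabletPCReview": (508834, 521529),
}


def coverage_percent(archived: int, reported: int) -> float:
    """Return the archived post count as a percentage of the reported count."""
    return 100.0 * archived / reported


if __name__ == "__main__":
    for site, (archived, reported) in COUNTS.items():
        print(f"{site}: {coverage_percent(archived, reported):.1f}% of reported posts")
```

On these numbers, three of the four forums come out above 90%, while NotebookReview lands at under half of the subforum sum.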
-
stormy
looking for someone who helped to archive soup.io
-
OrIdow6
stormy: Unless you're looking for someone to describe their personal experiences on the project, it's best to just ask your question
-
stormy
fair enough. so if I understood correctly, the archived data gets uploaded directly to archive.org and nothing else. what I'm trying to find out is: did the crawler get around the "Content Warning" pages, and, once on archive.org, how do I get around the content warning pages.
-
OrIdow6
Do you have an example?
-
stormy
I have enough experience in web crawling to know how it can work on the crawling side, but not with archive.org...
-
OrIdow6
So I don't see anything to indicate that the project got those
-
stormy
where can I find a list of soup.io hosts that the project did cover?
-
OrIdow6
I do not think any such list exists
-
OrIdow6
Well, I expect it was deleted since that project was 2 years ago
-
OrIdow6
Also, that site was "covered" by the project, I just don't see anything getting around the content warning
-
OrIdow6
If you do want a list I can give you some quick info on how to generate it
-
stormy
that'd be great, thanks
-
OrIdow6
Use
archive.org/services/docs/api/internetarchive/cli.html to download all the .os.cdx.gz files in collection:archiveteam:soup.io, then decompress them, extract the URLs, and then the domains
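The "decompress, extract the URLs, then the domains" part could look roughly like the sketch below, once the .os.cdx.gz files are on disk. It assumes the files are space-separated CDX with a header line such as ` CDX N b a m s k r M S V g`, where `a` marks the original URL column; if the header is missing it falls back to the third column. The function names are illustrative, not part of any tool mentioned here.

```python
import gzip
from urllib.parse import urlsplit


def hosts_from_cdx_lines(lines):
    """Collect the set of hostnames found in an iterable of CDX text lines."""
    url_index = 2  # assumption: original URL is the third field if no header is seen
    hosts = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.lstrip().startswith("CDX "):
            # Header line: the letter 'a' names the original-URL column.
            fields = line.split()[1:]
            if "a" in fields:
                url_index = fields.index("a")
            continue
        parts = line.split(" ")
        if len(parts) <= url_index:
            continue  # malformed or truncated record
        try:
            host = urlsplit(parts[url_index]).hostname
        except ValueError:
            continue  # unparseable URL, skip it
        if host:
            hosts.add(host)
    return hosts


def hosts_from_cdx_file(path):
    """Read one gzip-compressed CDX file and return the hostnames it mentions."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return hosts_from_cdx_lines(fh)
```

Calling `hosts_from_cdx_file` on each downloaded file and taking the union of the results would give the list of hosts the project touched.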
-
OrIdow6
It would be better to tell us what you're trying to do, though, since asking for what the project covered is fairly specific
-
stormy
thanks, will try both of these later. what I'm trying to do is get an offline mirror of some soups that I remember. anything that still exists. I was hoping that it would exist in other forms than on archive.org, but I guess I'll have to make do with that.
-
stormy
gotta run for now, back in an hour.
-
stormy
OrIdow6 are you sure that collection:archiveteam:soup.io is the correct identifier? since it gives me a query error
archive.org/advancedsearch.php?q=co…put=json&callback=callback&save=yes
-
rewby
Usually you just search for a file from the project in the UI and you can find the collection that way
-
rewby
Also, if you're worried the data only lives on the ia, you can just list the items using the cli (and collection name) and just download them all
-
rewby
Usually you can get most of it to replay by loading the warcs into pywb3. Sometimes you need to find some additional warcs with static site data like javascript and css files.
-
rewby
The IA has them somewhere usually
-
stormy
I'm still having trouble getting an item list from a url prefix, either with cli or web
-
OrIdow6
You're right, I said it wrong: that should be archiveteam_soupio, not archiveteam:soupio
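With the corrected identifier, a query against the advancedsearch endpoint stormy was using could be built roughly as follows. This is a sketch under assumptions: it uses the `q`, `fl[]`, `rows`, `page` and `output` parameters and expects a JSON body shaped like `{"response": {"docs": [...]}}`; the helper names are made up for illustration.

```python
from urllib.parse import urlencode


def build_search_url(collection: str, rows: int = 100, page: int = 1) -> str:
    """Build an advancedsearch.php URL listing item identifiers in a collection."""
    params = [
        ("q", f"collection:{collection}"),
        ("fl[]", "identifier"),  # only return the identifier field
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def identifiers_from_response(payload: dict) -> list:
    """Pull item identifiers out of a decoded JSON search response."""
    docs = payload.get("response", {}).get("docs", [])
    return [doc["identifier"] for doc in docs if "identifier" in doc]


if __name__ == "__main__":
    print(build_search_url("archiveteam_soupio"))
```

Fetching that URL, decoding the JSON and passing it to `identifiers_from_response` should give the item names to download, paging until no more docs come back.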
-
stormy
thanks, that's getting me somewhere
-
IDK
Anyone know how to request specific IDs with the API of
thiswebsitewillselfdestruct.com/api/get_letter
-
IDK
Currently all IDs are randomized and random messages are displayed
-
mateoooo
ee
-
Hifihedgehog
Hey JAA. If you don't mind me asking, how goes the archiving effort for the TechnologyGuide sites?
-
Hifihedgehog
Thanks JAA!
-
IDK
Apparently Roblox is banning all users with YT or any other reference to off-site platforms, be ready to see some 404s
-
IDK
*some
-
h2ibot
JustAnotherArchivist edited TechnologyGuide (+53):
wiki.archiveteam.org/?diff=48220&oldid=48214
-
h2ibot
Gridkr edited Coronavirus/Affected companies (+401):
wiki.archiveteam.org/?diff=48221&oldid=45063
-
duce1337
anyone archive all of the CIA World Factbook?
cia.gov/the-world-factbook
-
AK
Making sure a copy is grabbed now duce1337
-
duce1337
ok
-
wessel1512
OrIdow6 I'm currently very busy with my day job so I haven't had the time to spend on the Ukrainian archive
-
wessel1512
But I've got some more lists of URLs that I need to clean before they can be added to the wiki