#archiveteam-bs

00:45

JAA

More fun with TechnologyGuide: if the URL contains 'nonexistant', the server resets the connection: forum.notebookreview.com/nonexistant
00:45

Frogging101

notebookreview is closing? :/
00:46

JAA

Yep
00:46

Frogging101

that sucks
00:46

JAA

My qwarc archive of the thread pages just finished a few minutes ago. Should be complete apart from the countless broken shit like the above.
00:47

Jake

what even.... 'temp' and 'nonexistant'
00:47

JAA

forum.notebookreview.com/threads/asus-v6j-everest-benchmarks.43048 returns an empty page.
00:47

JAA

It's kind of hilarious just how broken these forums are.
00:50

JAA

forum.notebookreview.com/nessus
00:50

JAA

¯\_(ツ)_/¯
00:52

Jake

they have some terribly broken WAF paired with a 10 year old corrupted forum database?
05:14

JAA

TechnologyGuide forum archive post counts: 1841025 from Brighthand (1880674 on homepage), 53604 from DigitalCameraReview (58092), 4180267 from NotebookReview (no global stats, about 9.35M from adding up the subforum numbers, but those don't include everything), 508834 from TabletPCReview (521529)
05:15

JAA

The NotebookReview discrepancy seems pretty bad, but no idea where it comes from. I didn't see any systematic problems in the data.
05:17

JAA

Adding up the 'Messages' numbers for the subforums gives 9347229 there. This doesn't include forum.notebookreview.com/forums/nbr-marketplace.18 (which is shown as a link on the homepage instead of a forum entry).
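
Editor's note: the coverage implied by the post counts above can be checked with a few lines of Python. This is only a sketch. The figures are copied from JAA's messages, the NotebookReview reference value is the subforum sum (which JAA notes is incomplete), and the names `counts` and `coverage` are made up for illustration.

```python
# (archived posts, posts reported by the site), as quoted in the log above.
counts = {
    "Brighthand": (1841025, 1880674),
    "DigitalCameraReview": (53604, 58092),
    "NotebookReview": (4180267, 9347229),  # reference = sum of subforum 'Messages'
    "TabletPCReview": (508834, 521529),
}


def coverage(archived, reported):
    """Return the archived post count as a percentage of the reported total."""
    return 100.0 * archived / reported


for site, (archived, reported) in counts.items():
    pct = coverage(archived, reported)
    print(f"{site}: {pct:.1f}% archived, {reported - archived} posts unaccounted for")
```

On these numbers, three of the forums come out above 90% while NotebookReview sits below half, which is the discrepancy being discussed.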
06:53

stormy

looking for someone who helped to archive soup.io
07:09

OrIdow6

stormy: Unless you're looking for someone to describe their personal experiences on the project, it's best just to ask your question
07:13

stormy

fair enough. so if I understood correctly, the archived data gets uploaded directly to archive.org and nothing else. what I'm trying to find out is: did the crawler get around the "Content Warning" pages, and, once on archive.org, how do I get around the content warning pages.
07:20

OrIdow6

Do you have an example?
07:20

stormy

sure: web.archive.org/web/20191201125421/http://einefragevonstil.soup.io
07:21

stormy

I have enough experience in web crawling to know how it can work on the crawling side, but not with archive.org...
07:33

OrIdow6

So I don't see anything to indicate that the project got those
07:34

stormy

where can I find a list of soup.io hosts that the project did cover?
07:35

OrIdow6

I do not think any such list exists
07:35

OrIdow6

Well, I expect it was deleted since that project was 2 years ago
07:36

OrIdow6

Also, that site was "covered" by the project, I just don't see anything getting around the content warning
07:37

OrIdow6

If you do want a list I can give you some quick info on how to generate it
07:37

stormy

that'd be great, thanks
07:41

spirit

stormy: web.archive.org/web/*/http://example.com/* will list all URLs
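
Editor's note: the same kind of prefix listing can also be requested in machine-readable form. The sketch below only builds a query URL. It assumes the Wayback Machine's CDX server endpoint and its `url`, `matchType`, `fl`, `collapse`, `output` and `limit` parameters, none of which are mentioned in the log. The function name `cdx_prefix_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server (not named in the log).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_prefix_query(url_prefix, limit=None):
    """Build a CDX query URL listing captured URLs that start with url_prefix."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",  # match everything under the prefix
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # one row per distinct URL
        "output": "json",
    }
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


# Example with the soup from stormy's message; fetching the URL is left to the reader.
print(cdx_prefix_query("einefragevonstil.soup.io/", limit=50))
```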
07:41

OrIdow6

Use archive.org/services/docs/api/internetarchive/cli.html to download all the .os.cdx.gz files in collection:archiveteam:soup.io , then decompress them, extract the URLs, and then the domains
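
Editor's note: the "decompress, extract the URLs, then the domains" part of that recipe could look roughly like this in Python. It is a sketch under assumptions: the .os.cdx.gz files are already downloaded, they are space-separated CDX text with an optional ` CDX N b a ...` header whose `a` column holds the original URL, and the function names are invented for this example.

```python
import gzip
from urllib.parse import urlsplit


def hosts_from_cdx_lines(lines):
    """Collect the set of hostnames found in an iterable of CDX text lines."""
    url_index = 2  # assumed default position of the original-URL field ("a")
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("CDX"):
            # Header such as "CDX N b a m s k r M S V g": locate the "a" column.
            fields = line.split()[1:]
            if "a" in fields:
                url_index = fields.index("a")
            continue
        parts = line.split()
        if len(parts) <= url_index:
            continue  # malformed or truncated record
        try:
            host = urlsplit(parts[url_index]).hostname
        except ValueError:
            continue  # unparsable URL
        if host:
            hosts.add(host)
    return hosts


def hosts_from_cdx_files(paths):
    """Decompress each .cdx.gz file in paths and merge the hostnames found."""
    hosts = set()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            hosts |= hosts_from_cdx_lines(fh)
    return hosts


# Tiny in-memory demo with made-up records in the assumed layout.
demo = [
    " CDX N b a m s k r M S V g",
    "io,soup,einefragevonstil)/ 20191201125421 http://einefragevonstil.soup.io/ text/html 200 AAAA - - 1234 5678 demo.warc.gz",
    "io,soup,asset)/a.css 20191201125422 http://asset.soup.io/a.css text/css 200 BBBB - - 12 34 demo.warc.gz",
]
print(sorted(hosts_from_cdx_lines(demo)))
```

With real data, `hosts_from_cdx_files` would be called on the list of downloaded file paths, and the resulting set is the list of hosts the project touched.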
07:42

OrIdow6

It would be better to tell us what you're trying to do, though, since asking for what the project covered is fairly specific
07:44

stormy

thanks, will try both of these later. what I'm trying to do is get an offline mirror of some soups that I remember. anything that still exists. I was hoping that it would exist in other forms than on archive.org, but I guess I'll have to make do with that.
07:44

stormy

gotta run for now, back in an hour.
08:51

stormy

OrIdow6 are you sure that collection:archiveteam:soup.io is the correct identifier? since it gives me a query error archive.org/advancedsearch.php?q=co…put=json&callback=callback&save=yes
08:58

rewby

Usually you just search for a file from the project in the UI and you can find the collection that way
08:58

rewby

Also, if you're worried the data only lives on the ia, you can just list the items using the cli (and collection name) and just download them all
08:59

rewby

Usually you can get most of it to replay by loading the warcs into pywb3. Sometimes you need to find some additional warcs with static site data like javascript and css files.
08:59

rewby

The IA has them somewhere usually
09:01

stormy

I'm still having trouble getting an item list from a url prefix, either with cli or web
09:02

OrIdow6

You're right, I said it wrong, that should be archiveteam_soupio, not archiveteam:soupio
09:09

stormy

thanks, that's getting me somewhere
09:52

pabs

this company is allegedly shutting down cyberninjas.com edition.cnn.com/2022/01/07/politics…as-shutting-down-arizona/index.html
10:27

IDK

Anyone know how to request specific IDs with the API of thiswebsitewillselfdestruct.com/api/get_letter
10:27

IDK

Currently all IDs are randomized and random messages are displayed
10:51

mateoooo

ee
15:29

Hifihedgehog

Hey JAA. If you don't mind me asking, how goes the archiving effort for the TechnologyGuide sites?
15:36

Hifihedgehog

Thanks JAA!
15:36

Hifihedgehog

archive.org/details/technologyguide_forums_20220125
18:04

IDK

Apparently roblox is banning all users with YT or any other reference to off-site platforms, be ready to see some 404s
18:04

IDK

*some
18:07

IDK

youtube.com/watch?v=9DBb6_aVS4M
20:00

h2ibot

JustAnotherArchivist edited TechnologyGuide (+53): wiki.archiveteam.org/?diff=48220&oldid=48214
20:00

h2ibot

Gridkr edited Coronavirus/Affected companies (+401): wiki.archiveteam.org/?diff=48221&oldid=45063
20:00

duce1337

anyone archive all of the CIA world factbook? cia.gov/the-world-factbook
20:41

AK

Making sure a copy is grabbed now duce1337
20:43

duce1337

ok
21:33

wessel1512

OrIdow6 I'm currently very busy with my day job so I haven't had the time to spend on the Ukrainian archive
21:34

wessel1512

But I got some more lists of URLs that I need to clean before they can be added to the wiki
