-
JAA
Thanks
-
JAA
VerifiedJ: Submitted to AB.
-
VerifiedJ
@JAA: thanks
-
JAA
Also threw in the /json URLs. That should at least fetch some of the map data.
-
Jake
halo.bungie.net is (finally) shutting down February 9th. Coverage looks pretty good (last run in AB 2014), but might be worth another run?
windowscentral.com/bungie-original-…website-being-taken-down-february-9
-
JAA
Whew, blast from the past.
-
arkiver
yeah
-
JAA
Ah yeah, I archived their forums a while ago.
-
JAA
That's why it seemed so familiar. :-)
-
JAA
I'll throw it into AB.
-
JAA
Anyone want to save ~124 million Halo screenshots?
-
AK
Is #findelmondo dead? 9.75m left but I'm the only one in the room haha
-
JAA
That's because it's #findelmundo with a u.
-
Jake
Thanks JAA! :)
-
flashfire42
JAA man this is like a throwback to one of the first projects I took part in
-
flashfire42
the halo match records
-
flashfire42
because there was a big argument over why the fuck we were doing it
-
JAA
Yeah indeed, I just remembered those as well.
-
AK
BlameGithub :P
-
flashfire42
that reminds me is astrid still around?
-
hook54321
no
-
flashfire42
oh
-
JAA
124 million files, mostly screenshots of a couple hundred KiB.
-
JAA
Well, 124 million IDs, no idea whether they all exist.
-
AK
So I take it that AB farms the work out to dedis? Or does it take the urls and throw them into the urls queue?
-
JAA
There are also videos, which appear to actually be game data saves that were then rerendered by the server. Rendering is broken, and the raw data can't be downloaded. :-/
-
Jake
124 million... wow, did we get all of that back in 2014?
-
JAA
Jake: Possibly. Certainly not through AB, but there was also a DPoS project it seems.
-
JAA
Yeah, looks like that was for these files. :-)
-
arkiver
didn't we already get most of this?
-
arkiver
yeah
-
AK
Endomondo can probably be removed off the tracker homepage now, same with Flash I'm guessing?
-
JAA
Looks like there are large gaps though.
-
flashfire42
JAA we agreed to grab the first million and last million last time
-
flashfire42
something like that
-
JAA
flashfire42: That was something else.
-
flashfire42
Oh ok
-
flashfire42
My bad ignore me
-
JAA
First/last/random million was game stats. These are the shared files.
-
arkiver
no endomondo cannot be removed yet
-
arkiver
there's some API not archived yet
-
JAA
Oh actually, those nine-digit files are all not from the project.
-
arkiver
I'm fine with setting up a project for halo if we figure out what we don't have yet
-
JAA
Or the WBM APIs aren't working correctly again, or something went wrong on the indexing, or whatever.
-
arkiver
indexing problems seem plausible
-
AK
arkiver: ahh didn't realise. The tracker is only giving profile stuff at the moment which does seem to be dead
-
JAA
github.com/ArchiveTeam/halo-items/tree/master/halo3file/ADDED suggests that only up to 120 million might've been attempted.
-
JAA
So I guess we might need a project for the IDs above 120 million.
-
JAA
Specifically, 120000100 and up if I'm reading the code and -items correctly.
-
AK
Is it worth cancelling the AB job if it's potentially gonna be done by us anyway?
-
AK
Not sure how AB hands out its tasks and whether it actually affects anything to have lots of jobs running
-
JAA
AB does recursive crawls, a distributed project would be much more specific. It makes sense to do another recursive crawl.
-
AK
Ahh I see
-
arkiver
lol
-
arkiver
early days
-
arkiver
missed a , there
-
JAA
I wonder what the lone file outside of ADDED in -items is about.
-
arkiver
no idea
-
JAA
Looks like those were indeed not grabbed.
-
JAA
At least based on WBM prefix search.
-
arkiver
0-99999?
-
JAA
Yeah
-
arkiver
so let's update the project and bring it back to life :P
-
arkiver
with all above 120 million
-
arkiver
and those 0-99999
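
The two gaps named above (IDs 0-99999, and 120000100 and up) can be sketched as batched ID ranges. This is only an illustration in Python, not the project's actual item format: the upper bound is an assumption taken from the rough "~124 million" figure mentioned earlier, and `batch_ranges`, `ASSUMED_MAX_ID` and `BATCH_SIZE` are hypothetical names.

```python
# Hypothetical sketch: split the ID ranges believed to be missing into batches.
# ASSUMED_MAX_ID comes from the approximate "~124 million" figure; the real
# highest ID is unknown. BATCH_SIZE is an arbitrary illustrative value.
ASSUMED_MAX_ID = 124_000_000
BATCH_SIZE = 100

# Inclusive ranges discussed in the chat: 0-99999 and 120000100 and up.
MISSING_RANGES = [(0, 99_999), (120_000_100, ASSUMED_MAX_ID)]


def batch_ranges(start, stop, size):
    """Yield inclusive (first, last) ID pairs covering start..stop in chunks of size."""
    for first in range(start, stop + 1, size):
        yield first, min(first + size - 1, stop)


if __name__ == "__main__":
    total = sum(
        1
        for start, stop in MISSING_RANGES
        for _ in batch_ranges(start, stop, BATCH_SIZE)
    )
    print(f"{total} batches of up to {BATCH_SIZE} IDs each")
```

Whether every ID in these ranges actually exists is not known, so a real run would have to tolerate missing files.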
-
JAA
Sounds good :-)
-
JAA
Let's reopen #yolohalo ?
-
arkiver
yep
-
purplebot
Deathwatch edited by JustAnotherArchivist (+207, /* 2021 */ Add Halo) just now --
archiveteam.org/?diff=46174&oldid=46170
-
purplebot
Halo edited by JustAnotherArchivist (+1872, Complete overhaul, 2018 additions, …) just now --
archiveteam.org/?diff=46175&oldid=28888
-
JAA
images.nga.gov 'will no longer be available as of January 1, 2021' my arse.
-
JAA
yano: I found a few more things on PirateBox. Also, they stated that the tracker might go down earlier than the rest of the stuff in
forum.piratebox.cc/read.php?9,23070, so would be best to grab all of that and upload to IA as soon as possible. If you could dump the .torrent file for each of the magnet links, that would also be useful for preservation purposes.
-
purplebot
Current Projects edited by Wickedplayer494 (-318, Go home everyone) just now --
archiveteam.org/?diff=46176&oldid=46156
-
mgrandi
i finished up this:
github.com/mgrandi/archive_pogchamp_emote which will help me archive the daily PogChamp emote
-
mgrandi
i should probably write something to help generate urls for all the twitch emotes, plus the information about them (streamer / shortcode ), since the
archiveteam.org/index.php?title=Twitch.tv page says the last run didn't have those
-
purplebot
Current Projects edited by Wickedplayer494 (+168, Halo's back, bitches) 20 minutes ago --
archiveteam.org/?diff=46177&oldid=46176
-
DaxServer
Hi. I made some modifications to the warrior Docker build and forked some grabs. Unfortunately, I wasn't aware that unofficial versions are discouraged. I am wondering how I can contribute my changes to upstream.
-
DaxServer
In the meantime, I have stopped my instance of warrior
-
DaxServer
The main issue I wanted to resolve with my version of the Dockerfile is wget-at. I see that there is a Warrior Extras Installer. Since it mostly just installs the dependencies, I wanted to move that into the warrior container itself
-
DaxServer
Thus I built wget-at (and zstd from Github) and forked some grabs to make them use the common wget-at. For ex:
ArchiveTeam/domains-grab #3
-
yano
JAA: i'm using qbittorrent; i'll check but i don't think i have the .torrent files anymore, can the IA not ingest magnet links for torrents?
-
yano
oh nice, it looks like i am saving them
-
yano
i didn't know where to upload them so i put them there
-
yano
🤷
-
yano
lol, it creates a bittorrent of the bittorrent files lol
-
kiska
torrent ception?
-
JAA
We need to go deeper.
-
yano
hehe
-
arkiver
yano: yes IA can do magnet links
-
yano
i like qbittorrent because when you add a torrent of torrent files it asks if you want to start download on the subsequent torrents
-
arkiver
make a .torrent with the magnet: URI as content
-
yano
arkiver: oh, i could have avoided uploading the .torrent to IA then
-
arkiver
upload to IA and let is derive
-
yano
oh well
-
arkiver
it*
-
yano
anyways, the torrents are now on IA
-
JAA
Are the old versions still seeded?
-
yano
some of them are
-
yano
i got them in my seedbox trying to find them on dht
-
yano
also, yay my first upload to IA
-
arkiver
congrats
-
arkiver
many more to follow :)
-
yano
hehe :D
-
yano
i mean, i've uploaded through AT, but this is my first direct upload :D
-
JAA
Yay :-)
-
atphoenix
DaxServer, please stay around. I'm sure one of the ops will get back to you. Most people run the docker image workers these days. Instructions are at
archiveteam.org/index.php?title=Run…g_Archive_Team_Projects_with_Docker
-
atphoenix
and thanks for stopping by :)
-
DaxServer
Thanks :)
-
arkiver
DaxServer: people should not edit our code and run it 'in production'
-
arkiver
in this case some Lua dependencies have not been installed on the warrior - which causes issues
-
arkiver
we might have indeed missed data due to the edits
-
DaxServer
Can you please mark all of my uploads for a redo? So sorry about the data loss
-
SketchTheCow
Someone is uploading all of thingiverse to IA
-
AK
Damn how big is that going to be?
-
Kaz
I thought we already did that
-
purplebot
Webzdarma edited by Sanqui (+25, /* ArchiveBot jobs */ job 20) just now --
archiveteam.org/?diff=46178&oldid=46093
-
JAA
There was a project in 2015 it seems.
-
wessel1512
The whole Dutch parliament just handed in their resignation
-
kiska
RIP
-
JAA
lol, I didn't even know that was possible. Rather than dissolving parliament, calling a new general election, etc.
-
wessel1512
and the netherlands is without a parliament now till mid March 2021
-
jut
Does the government also go?
-
wessel1512
not sure
-
AK
Most of their MPs have been in for what seems like ages
-
AK
Or "seniority"
-
wessel1512
just happened 2 hours ago
-
AK
Yeah it just seems like most of them have been in for ages, tho tbh I expect the numbers are similar for UK mps
-
JAA
Uh, from what I'm reading, it's the government that resigned, not parliament?
-
jut
yeah that seems more sane
-
JAA
Yeah
-
wessel1512
we have a general election for parliament every 4 years
-
rewby
Election's in a few months anyway. I do love the videos of Rutte going to the king on his bike. Because of course he did that.
-
wessel1512
not sure tho what this means for the upcoming elections as most of the candidates are the same as the current Government
-
JAA
I mean, the cabinet isn't elected anyway, right?
-
rewby
Not directly no
-
rewby
I just realised we're in -bs. Maybe we should move this to -ot?
-
JAA
Yeah
-
SketchTheCow
So, I am being informed that we only grabbed the Halo 3 stats from the Halo Forerunner Project.
-
SketchTheCow
Now, this is going to lead to some sackings
-
SketchTheCow
But before then, can someone fire up a raspberry pi and get the rest of halo
-
JAA
Only Halo 3 files*, not stats, as far as I can see.
-
JAA
Screenshots and stuff
-
JAA
I updated the wiki page last night with my understanding of what was covered.
-
JAA
And yes, #yolohalo is back in business.
-
Sanqui
JAA: enjoying bing-scrape. do you have a little thing for URL derivation (i.e. for
example.com/a/b.html, derive
example.com/a and
example.com/)?
-
Sanqui
if not I may want to contribute it
-
Sanqui
just dunno how to call it
-
Sanqui
suffix-stripping-derivation
-
JAA
Sanqui: I have written something for that before, yeah.
-
Sanqui
I've written it too but it's nowhere specific. I'm bad at keeping my oneliners organized
-
JAA
Same
-
JAA
Sanqui: awk -F/ 'BEGIN { OFS="/" } /\/$/ { --NF } { for (i = NF; i > 3; --i) { --NF; print $0 OFS; } }'
-
Sanqui
frith, you awklord
-
Sanqui
cheers
-
JAA
Doesn't print the original URL, add an extra `{ print }` before `/\/$/` if you want that.
-
JAA
And just in case you intend to use this for AB !a <, well, might not work as intended.
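
For readers who don't speak awk, here is a rough Python sketch of the same derivation: for each URL, emit every parent directory URL (with a trailing slash), deepest first, without the original URL. It is not a line-for-line port of the one-liner: `parent_urls` is a hypothetical name, and unlike the awk version this sketch drops the query string and fragment before deriving parents.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Return the parent directory URLs of url, deepest first, each ending in '/'."""
    parts = urlsplit(url)
    path = parts.path
    # Treat ".../a/" like ".../a" so a trailing slash does not add a level.
    if path.endswith("/"):
        path = path[:-1]
    # Absolute paths start with "/", so the first segment is the empty string.
    segments = path.split("/")
    parents = []
    while len(segments) > 1:
        segments.pop()
        parent_path = "/".join(segments) + "/"
        # Query and fragment are intentionally left empty (assumption of this sketch).
        parents.append(urlunsplit((parts.scheme, parts.netloc, parent_path, "", "")))
    return parents


if __name__ == "__main__":
    for parent in parent_urls("https://example.com/a/b.html"):
        print(parent)
```

For `https://example.com/a/b.html` this yields `https://example.com/a/` and then `https://example.com/`, which matches the example in the question.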
-
JAA
Sanqui: Actually, hold on, writing a better version. :-)
-
Sanqui
ah, ain't it beautiful when things get done with minimal effort from my side
-
Sanqui
thanks a bunch!!
-
JAA
:-)
-
AK
What's the lua regex for all the instagram login pages? (
instagram.com/accounts/login/?next=/reel/CJ6Yx9KJEP9 As in it can have anything after the /login)
-
AK
^https?://www%.instagram%.com/accounts/login.-$ Would that do it? (I'm brand new to Lua matching)
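
As a sanity check, the pattern above can be translated into a Python regular expression and tried against the example URL. This is a sketch, not Lua: `%.` becomes `\.` and the lazy `.-` becomes `.*`, which is equivalent here because the pattern is anchored with `$`. `LOGIN_RE` and `is_login_url` are hypothetical names.

```python
import re

# Python translation of the Lua pattern
#   ^https?://www%.instagram%.com/accounts/login.-$
# Lua's "%." (literal dot) becomes "\." and ".-" becomes ".*".
LOGIN_RE = re.compile(r"^https?://www\.instagram\.com/accounts/login.*$")


def is_login_url(url):
    """Return True if url is an Instagram login URL with anything after /login."""
    return LOGIN_RE.match(url) is not None


if __name__ == "__main__":
    print(is_login_url(
        "https://www.instagram.com/accounts/login/?next=/reel/CJ6Yx9KJEP9"
    ))
```

Note that because nothing delimits the end of `login`, both versions also match a path such as `/accounts/loginfoo`; whether that matters depends on the site.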
-
atphoenix
I think we should increase our efforts to save online forums (i.e. #msgbored ) . I feel that user-generated content everywhere is under threat, and forums, especially old forums, are pretty much just that. Content written in them years ago may be viewed in different lights now.
lauren.vortex.com/2021/01/15/moderating-ugc
-
Sanqui
+1
-
Sanqui
I'm working on some czech forums but yeah
-
JAA
++
-
JAA
BugTraq archives are shutting down at the end of the month:
securityfocus.com/archive/1/542247/30/0/threaded (Thanks, gb in #urlteam)
-
gb
So, I report here as well that BugTraq's archives will be shut down on January 31st, 2021. See
securityfocus.com/archive/1/542247/30/0/threaded . I wasn't able to find existing mirrors.
-
JAA
Since it's a mailing list, there are definitely mirrors. E.g.
seclists.org/bugtraq
-
JAA
But we should archive the original site anyway.
-
gb
Ah you're right, I checked quite poorly then
-
gb
The archive at securityfocus.com is actually much more limited than the seclists.org one; it only goes back to 2002 (
securityfocus.com/cgi-bin/index.cgi…ID=1&mode=threaded&expand_all=false)
-
JAA
Interesting.
-
JAA
I've started an ArchiveBot job for
securityfocus.com. We'll see how that goes.
-
JAA
It's one of those sites with an absolutely disgusting URL structure.
-
gb
Wow that was fast, thanks!
-
purplebot
Deathwatch edited by JustAnotherArchivist (+310, /* 2021 */ Add BugTraq) just now --
archiveteam.org/?diff=46180&oldid=46174
-
purplebot
99.se edited by Flashfire42 (+42) just now --
archiveteam.org/?diff=46181&oldid=37553
-
gb
They also have unauthenticated "Post message" links which you should be wary about, although theoretically the list stopped accepting posts since 2020/02
-
JAA
There's also this vulnerability database, not just the mailing list:
securityfocus.com/bid
-
JAA
I guess most of that information is in the mailing list archives anyway, but still nice to have.
-
purplebot
Comcast Personal Web Pages edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46183&oldid=28777
-
purplebot
Hackpad edited by Flashfire42 (+0) just now --
archiveteam.org/?diff=46184&oldid=29345
-
gb
Ah right, with "archive" they probably mean that too! That one goes back a lot further, to (!) 1980:
securityfocus.com/cgi-bin/index.cgi…y_list&vendor=&version=&title=&CVE=
-
gb
So starting from the last pages of that vulnerability archive is probably a good idea, at those times there weren't CVE or stuff yet so it's a little more likely that there aren't references elsewhere
-
JAA
True. In principle, the recursive crawl should get to those, but only towards the end.
-
JAA
The dates on the early bugs seem sketchy though.
-
JAA
securityfocus.com/bid/2053 lists Debian 2.3 for example, which obviously didn't exist in 1980.
-
JAA
'This vulnerability was first announced by vort-fu <vort⊙wn> on December 5, 2000.'
-
JAA
Heh, just 20 years off.
-
purplebot
99.se edited by JustAnotherArchivist (+98) 24 minutes ago --
archiveteam.org/?diff=46182&oldid=46181
-
gb
Yeah, it did seem fishy... "Bugtraq was created on November 5, 1993"
-
gb
But maybe that database started earlier anyway
-
atphoenix
adding to the user-generated content comment...if there are major changes to Section 230...that could lead to a whole wave of site closures. Not unlike fallout from GDPR of sites feeling they might be noncompliant.
-
gb
Ok I have to leave, thank you and keep up the good work. At some point hopefully I'll manage to lend a hand. Bye
-
Billy549
this'd be good to make a warrior for? :p
-
Billy549
I see someone's already added it to wiki, good
-
Craigle
Billy549 #yolohalo
-
Billy549
ty for the channel name
-
thuban
JAA: from the Halo wiki page: "The forums were archived in full shortly before the shutdown at the end of June 2018." i take it this means something other than the still-operational
carnage.bungie.org/haloforum/halo.forum.pl ?
-
thuban
(not now at the halo. subdomain, but used to be)
-
thuban
oh god nvm, they're bungie.org and always have been.
-
Billy549
yeah bungie.org isn't bungie
-
thuban
shows what i think of the ~official~ sites :P
-
OrIdow6
I'll try to finalize the (i.e. remove all the SmackJeeves references from) So-Net U+ script soon, think that might be a relatively slow site
-
OrIdow6
(Closes the 28th)
-
arkiver
OrIdow6: what is that one about?
-
OrIdow6
arkiver: It's a Japanese personal webpage host that had its heyday in (I'd say) around 2006
-
arkiver
that's awesome
-
arkiver
looking at some sites, pretty nice
-
arkiver
well ping me when you have something, will be a nice project
-
OrIdow6
Ok
-
OrIdow6
For now I think I might actually try to do something about CrowdMap
-
OrIdow6
In the 1.5 hours I have here
-
OrIdow6
But that's small enough that even I should have enough capacity
-
arkiver
got a list of everything?
-
arkiver
getting item lists is sometimes the biggest problem
-
OrIdow6
VerifiedJ made a scrape
-
OrIdow6
Or are you talking about So-Net U+?
-
arkiver
both
-
EggplantN
OrIdow6: you aiming to have a project up tonight?!
-
EggplantN
if so please do let me know ASAP
-
OrIdow6
EggplantN: Next few days, maybe, tonight (my tonight, your morning) if everything else in the day takes me a very low amount of time
-
EggplantN
kk if you let me know ahead of time to get a target up we can blitz through it like SJ
-
OrIdow6
arkiver: I haven't checked yet that the CrowdMap scrape is comprehensive, but I think it may be; you're right that So-Net U+ will probably benefit from something more extensive than the WBM CDX for discovery
-
arkiver
how was the crowdmap scrape done?
-
arkiver
will check with so-net u as well
-
OrIdow6
EggplantN: Will do, was nice working with you last time, and I don't think this will be a stressfully huge amount of data
-
EggplantN
that's fine by me, just holla and i'll be helpful
-
OrIdow6
arkiver: To my knowledge everything that's happened to Crowdmap thus far is that someone (don't remember who) ran some simple AB jobs
-
arkiver
contacting someone at IA now who can read japanese
-
arkiver
about so-net u
-
arkiver
OrIdow6: or do we have someone already? ^
-
arkiver
i want to ask if he sees any contact info that we could use for this
-
arkiver
maybe can help with a japanese email as well
-
OrIdow6
arkiver: I don't speak Japanese, if that's what you're asking; there's someone who does who comes on sporadically who told me about this
-
arkiver
alright
-
arkiver
will contact
-
OrIdow6
"taka", last online January 2 UTC
-
OrIdow6
arkiver: Ok, would be nice to have a complete list
-
OrIdow6
Well, online the 10th, but didn't say anything
-
JAA
thuban: It's the forums that used to be at
halo.bungie.net/Forums/default.aspx (until 2018).
-
JAA
OrIdow6, arkiver: Re Crowdmap, the AB job for the JSONs failed on a few very slow responses. Not sure if the reports pages were grabbed (links are in the JSONs), and there is some data on those that isn't in the JSONs.
-
JAA
thuban: I wonder if we should grab those forums you linked though. I can't imagine that they'll stay around a lot longer.
-
thuban
i wondered that myself. (that said, they're _so_ old that i wonder whether the lindy effect kicks in--they've been going since 1999)
-
thuban
i did some scripting to extract urls from the very similar marathon story forums, when we threw into archivebot last year
-
thuban
don't remember the details but i think grabbing posts from this one would not be too hard, will do when i get a chance. should probably hit the oni forums as well
-
JAA
Unfortunately, the archived posts are on
library.bungie.org instead.
-
thuban
s/when/which/
-
JAA
Which then links to yet another domain, forums.bungie.org.
-
JAA
'The HBO Forum Archive is maintained with WebBBS 4.33.' - 'THE PERL SCRIPTS ARE NO LONGER BEEING SUPPORTED'
-
JAA
:-)
-
JAA
thisisfine.png
-
atphoenix
I would put very little trust in digital lindy effects. Political pressure on Section 230 could change matters overnight.
-
JAA
Can't quickly find a version history, but WebBBS 5.0 dates back to before 2002.
-
thuban
JAA: i wouldn't worry about the domains if we just feed the bot individual post urls like last time
-
JAA
I'm worried about keeping the archives accessible and browsable.
-
JAA
But yeah, for the content itself, you're right.
-
thuban
individual post pages link to the other posts in their thread, and we could include search result pages to get all-posts indexes (think i did that last time too). would be awkward to find a specific post if you didn't have the url, but usable
-
JAA
WebBBS 4.33 is from mid-2000 and has a vulnerability that's been known since mid-2002. lol
-
thuban
please don't pwn the forums before we've saved them :(
-
JAA
I've thrown carnage and library into AB.
-
JAA
carnage actually returns a list of all (non-archived) posts on the homepage when using the AB UA. That's handy.
-
JAA
There are also a bunch of other old forums hosted on carnage. Might be worth digging into sometime.