-
pabs
AntoninDelFabbro|m: which website, what are you trying to get? I tend to use different things for different purposes. for eg: googler/ddgr for site: search engine queries. curl/wget for downloads. pup for HTML parsing/querying. jq for JSON querying
-
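A minimal sketch of how those tools slot together (the URLs, query, and JSON shape are placeholders, and flags may need adjusting):
ddgr --np 'site:example.org annuaire'                              # search-engine site: query from the terminal
curl -sL 'https://example.org/page.html' | pup 'a attr{href}'      # parse the HTML and pull out every link
curl -s 'https://example.org/data.json' | jq -r '.items[].url'     # query a (hypothetically shaped) JSON response for URLs
-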
h2ibot
Systwi uploaded
File:Duck Hunt (World)-0--twitter-5.png (Mr. Peepers holding the Twitter bird,…):
wiki.archiveteam.org/?title=File%3A…Hunt%20%28World%29-0--twitter-5.png
-
h2ibot
Systwi edited Twitter (+177, /* Vital Signs */ Added meme and serious caption.):
wiki.archiveteam.org/?diff=50593&oldid=50591
-
fireonlive
systwi: 😃
-
h2ibot
Systwi edited Site exploration (+398, /* Twitter */ Mentioned Nitter and Twitter's…):
wiki.archiveteam.org/?diff=50594&oldid=50492
-
systwi
fireonlive: :-D
-
erkinalp
wowturkey is down; not known whether it's temporary and will be restored again, or permanent as in finally closed
-
erkinalp
let's leave the bot running in hope it returns once more
-
erkinalp
the announced date was august 31
-
pabs
-
nicolas17
pabs: I doubt it because most data there was ephemeral in the first place, eg. there's projects that do a daily build and only the last 5 binaries are kept
-
pabs
and what about the phabricator?
-
erkinalp
phabricator is a bugtracker, that's significant
-
erkinalp
tickets may have good things
-
nicolas17
we'll probably turn it into static pages somehow
-
nicolas17
I'm not sure how easy it is to archive, I think there's like, JS-backed "load more comments" stuff?
-
» pabs recommends an AB job, then download the static files :)
-
pabs
I did a phabricator recently, apart from the large amount of ignores I think it worked ok
-
erkinalp
we'd have to do the same with missing wowturkey viewtopic pages with p=### links
-
erkinalp
corresponding t=####&start=### ones have already been crawled
-
nicolas17
at one point we considered moving issues from phabricator to gitlab and it was messy because tickets can have multiple tags/projects that they belong to, while gitlab issues belong to *one* project
-
nicolas17
so we would need to check case by case and make a list of "if a ticket has tag X and tag Y, put it in repo Y"
-
pabs
AB job then static seems better
-
nicolas17
well yeah, this was *early* in the gitlab move when a lot of tickets would still be active
-
nicolas17
by now I guess a lot was closed, or stopped mattering, or was still active and someone moved it manually
-
flashfire42
Ok should I focus on webs or orange today? both have close cut off dates. Or do I say the hell with the both of them and continue with the aussie ISPs that have technically passed their shutdown date and are still up?
-
erkinalp
"September 1: wowTURKEY[IA•Wcite•.today•MemWeb], a large Turkish photo sharing forum[23]" september 1 → august 27
-
erkinalp
don't kill the archivebot crawler tho
-
erkinalp
it still crawls previously failed external links
-
erkinalp
if we had started this crawl one day before, we would have the full archive today...
-
flashfire42
Alas that is the joys of web archival
-
flashfire42
things are lost every day my friend
-
flashfire42
and it sucks. it does. but we do what we can
-
erkinalp
87% is better than nothing
-
erkinalp
(known item count is ~9.4M, we got 8,338,246)
-
pabs
-
erkinalp
300ms is too short for wowturkey, the server's own delay was about 500ms even when it was up
-
pokechu22
flashfire42: I should probably do an !a < list job on orange - webs might be better to focus on. On the other hand webs has the stupid calendars that make things a mess :|
-
flashfire42
orange has about 4 different subdomains and some of them don't even resolve for me but do for others. Webs is not a set-and-forget thing, which is really what I am aiming for, because the calendars are so fucking broken
-
pokechu22
Can I get a list of those different subdomains? (Lists of individual sites would be useful too but I have some ideas of how to get those once I know the starting points)
-
flashfire42
pagesperso-orange.fr
-
flashfire42
monsite-orange.fr
-
flashfire42
those are the main 2
-
flashfire42
images are hosted on cdn.woopic.com
-
DigitalDragons
-
DigitalDragons
(ignore the .txt at the end of everything)
-
AntoninDelFabbro|m
pabs: I want to download
annuaire-pp.orange.fr/accueil, but thanks for your help! :D
-
pabs
a good option for that is to open it in your browser, open dev tools, click on all the things on the site, then save all the requests as a .har, and then AB all the URLs output by this shell oneliner:
-
pabs
for f in *.har ; do jq -r '.log.entries[].request.url' < "$f" ; done | sort -u
-
pabs
ah, better open dev tools before loading the page, woops
-
pabs
there are some browser based crawler things on the wiki somewhere too
-
pabs
but they may not work if you need to interact with the site
-
AntoninDelFabbro|m
Gold! I just woke up, but I'm impatient to try this asap! Thank you!
-
erkinalp
qyxojzh|m: wowturkey definitively down
-
erkinalp
JAA: wowturkey definitively down, as of 0400UTC today
-
thuban
AntoninDelFabbro|m: that's a good way to capture data you can click through manually, but if the amount of navigation required is very large, i personally prefer to write a short script.
-
thuban
i have done so for annuaire-pp.orange.fr (and in the process, i believe, discovered more results than are shown in the browser) and will dump results tomorrow
-
AntoninDelFabbro|m
Awesome! Haha, well you saved me a lot of time, thanks ;)
-
thuban
AntoninDelFabbro|m: you're welcome!
-
thuban
also, uh, can someone remind me what the status is on orange isp hosting in general? are we still just dumping stuff in archivebot? because there are tens of thousands of these
-
thuban
(i was going to add 'and some of them require javascript', but based on my spot-checking they're all in the weird 'put everything in the html, but don't actually display it until the js loads' idiom, so i think archivebot would actually be fine in that respect)
-
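A quick way to repeat that spot-check, assuming some text that's visible on the rendered page makes a good marker (the URL and marker string are placeholders):
curl -s 'http://example.monsite-orange.fr/index.html' | grep -c 'text visible on the rendered page'
# a non-zero count means the content is already in the raw HTML, so ArchiveBot will capture it
-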
erkinalp
qyxojzh|m: JAA: arkiver: one of wowturkey's former mods is about to ask the owner to buy and resurrect wowturkey.com
-
erkinalp
we might get a last chance revive
-
flashfire42
thuban um everything into archivebot unless you design some scripts, because it's about a week away from going bye-bye, and we have like 3 or 4 warrior projects on the go and fuck all ingestion to IA right now
-
h2ibot
Bzc6p edited Demotivalo.net (+37, /* Sister sites */ Update stati):
wiki.archiveteam.org/?diff=50595&oldid=47826
-
erkinalp
wowturkey is down as of now
-
pabs
this site has 6TB of FLACs for bluegrass music:
bluegrassarchive.com (frameset for
gdarchive.net/Public/Bluegrass/contents.htm)
-
pabs
would be nice to grab eventually, but seems a bit big for AB, especially with the current IA upload limits
-
h2ibot
PaulWise edited Bugzilla (+1073, more from BZ site…):
wiki.archiveteam.org/?diff=50596&oldid=50588
-
h2ibot
PaulWise edited Bugzilla (-93, remove accidentally added done ones):
wiki.archiveteam.org/?diff=50597&oldid=50596
-
h2ibot
PaulWise edited Deathwatch (+295, Eclipse Wiki shutdown):
wiki.archiveteam.org/?diff=50598&oldid=50584
-
h2ibot
PaulWise edited Bugzilla (+0, Eclipse Bugzilla shutdown, AB in progress:…):
wiki.archiveteam.org/?diff=50599&oldid=50597
-
h2ibot
-
arkiver
pabs: it's maybe fine to put in AB when the current problems at IA are fixed
-
pabs
ok, wasn't sure if AB could handle that volume either
-
pabs
thanks
-
arkiver
well JAA is the expert on that
-
h2ibot
-
JAA
erkinalp: Ugh. Yeah, let's hope it's resurrected.
-
JAA
pabs, arkiver: AB doesn't care much about data size as long as there aren't huge files in it. The other limiting factor is number of URLs, but until you go over 100M, that's not usually a problem either.
-
erkinalp
JAA: seems no hope
-
erkinalp
why are archivebot downloads so slow?
-
JAA
Do you mean downloads of ArchiveBot data from the Internet Archive?
-
erkinalp
no, archivebot data downloads from archive.fart.website
-
erkinalp
i'm currently getting ISDN download speeds
-
erkinalp
it can't be due to my link speeds either
-
erkinalp
i happen to have 70mbps down, 10mbps up
-
JAA
The AB viewer is just an index of the data on IA.
-
JAA
The links go to IA.
-
JAA
And yeah, downloads from IA are notoriously slow, especially if you aren't near the Bay Area.
-
erkinalp
it isn't that slow normally
-
erkinalp
it was usually a few mbps
-
erkinalp
i could get good DSL download speeds
-
erkinalp
not dialup or isdn speeds
-
JAA
It varies depending on IA load and which server the data is on.
-
erkinalp
IA downloads are now down to dialup speeds
-
JAA
Not surprising. IA is pretty busy recently, and it slows various things to a crawl.
-
JAA
Even IA-internal things are slow. One particular item I was monitoring took over 6 days to move 43 GB around internally.
-
fireonlive
ooof
-
JAA
(Move it from S3 to item server, checksums, and mirror to backup server.)
-
arkiver
i'm planning on dusting off wikis-grab for the upcoming deletions of wikis
-
arkiver
though those are also largely covered already by wikiteam dumps i believe
-
arkiver
the wikis-grab would be more of a general method for archiving wikis
-
arkiver
project is also coming for ZOWA
-
arkiver
any idea for a channel for zowa.app ?
-
arkiver
flashfire42: do you know if we have the orange ISP hosting stuff fully covered with AB?
-
pokechu22
arkiver: have you seen #wikibot?
-
pokechu22
ah, you're already in that channel
-
arkiver
pokechu22: yeah
-
arkiver
i think it's good to have both dumps from there and from a project creating WARCs
-
pokechu22
I haven't seen wikis-grab before - does it try to do everything wikiteam does, or is it mostly focused on saving the current revision of every page?
-
pokechu22
Yeah, WARCs are good
-
arkiver
current revision
-
arkiver
may be good to give the wikiteam dumps higher priority than the WARCs, since they're more complete
-
arkiver
but after that we should attempt to create WARCs as well
-
pokechu22
Saving the current revision and maybe all pages on the history tab (but not the revisions themselves - just the history list for attribution) probably is enough for WARCs
-
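A minimal sketch of what that URL list could look like for one MediaWiki wiki (the base URL, titles file, and history page size are assumptions; titles are assumed to be URL-encoded already):
WIKI='https://wiki.example.org/index.php'
while read -r title; do
  printf '%s?title=%s\n' "$WIKI" "$title"                            # current revision of the page
  printf '%s?title=%s&action=history&limit=500\n' "$WIKI" "$title"   # history list, for attribution
done < titles.txt
-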
arkiver
yeah
-
JAA
Love it!
-
arkiver
it's largely for URL preservation, so it's in the wayback machine and easily browsable
-
arkiver
the dumps can be used to restore a wiki (right?), but for browsing, WARCs are better
-
arkiver
JAA: :)
-
JAA
Yep, that's accurate.
-
JAA
The dumps are entirely unusable for the average person.
-
arkiver
and of course outlinks for #// :)
-
nstrom|m
<arkiver> "any idea for a channel for zowa..." <- zowch
-
flashfire42|m
Sorry arkiver, it's 4am here and you are lucky I snap awake for random reasons. Orange is far from complete in archivebot. I've been launching as many jobs as I can manage but it's like fighting a fire with a kid's water bucket. I'll get a sampling but not all of it
-
flashfire42|m
I will continue to throw in as much as possible during the next week but we won't get it all, I can say that with certainty. Not unless we get a stay of execution for another month or 2
-
flashfire42|m
Hopefully that info is helpful; it is time for me to head back to sleep for another 2 hours.
-
pokechu22
I'll try to do an !a < list job for it too
-
pokechu22
The deadline for webs is sooner though
-
erkinalp
JAA: unless they have a WARC reader,
-
JAA
z-oww-a
-
imer
#nowa
-
imer
although I like the oww one better :D
-
JAA
Or perhaps some play on the content. What are some sounds you'd absolutely not want to hear in an ASMR video?
-
nstrom|m
zowaah, zowie 🤷♂️
-
imer
#zo🍽️
-
JAA
My terminal is sad about that last one.
-
imer
Yeah, lets not :)
-
arkiver
:)
-
thuban
arkiver, re orange isp hosting: AntoninDelFabbro|m posted a link to a page listing sites, and i have been enumerating them using its api
-
thuban
if my suspicion that supplying 0 as the category id retrieves all categories is correct, i expect to be able to enumerate 159832 sites (some fraction of which will be duplicates or inaccessible due to various oddnesses)
-
thuban
i don't think it's realistic to do this 'manually', but maybe some `!a <` jobs? individual sites are quite small as a rule
-
AntoninDelFabbro|m
I'm really impressed and thankful
-
fireonlive
i for one vote for emoji channel ;)
-
fireonlive
:3
-
erkinalp
JAA: wowturkey definitively dead, we can update the deadwatch now (death date: 2023-08-27,0400Z)
-
erkinalp
s/deadwatch/deathwatch/
-
fireonlive
oh it is confirmed by owner?
-
erkinalp
owner not responding to any correspondence
-
erkinalp
no hope of coming back up again
-
erkinalp
the AB job has a few external links (~650 or so) pending
-
fireonlive
ah :(
-
erkinalp
to skip wowturkey.com without impacting the remaining ~650 external resources, i'd propose to temporarily map wowturkey.com to 0.0.0.0 ;[
-
JAA
:-(
-
erkinalp
on the bot's end i mean
-
JAA
The AB job is paused, and the offsite URLs aren't in danger, so we can let it sit until the true deadline just in case it comes back.
-
erkinalp
oh, i thought it was looping over and over
-
erkinalp
good that it's paused
-
erkinalp
if it doesn't come back up by 23 september, then it's DaaD
-
erkinalp
(23 september is when the hosting expires, exactly 22 years from the website's start)
-
erkinalp
and the shutdown was exactly 20 years and 1 day from the first turkish language post
-
erkinalp
wowturkey initially consisted of english threads
-
erkinalp
promoting turkey to outsiders
-
pokechu22
-
pokechu22
(this also contains urls from monsite.orange.fr and monsite.wanadoo.fr, both of which give a page redirecting (but not a 3xx redirect) to monsite-orange.fr)
-
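A rough way to see how that non-3xx redirect works (the URL is a placeholder, and whether it's a meta refresh or JavaScript is an assumption to check):
curl -s 'http://monsite.orange.fr/somesite' | grep -Eio '<meta[^>]*refresh[^>]*>|location\.(href|replace)[^;]*'
-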
thuban
pokechu22: how was that list collected?
-
JAA
erkinalp: Well, to be precise, 'paused' here just means a very slow request rate (one request every five minutes in this case), not actually paused.
-
JAA
Also, not sure where you got that 650 number from.
-
pokechu22
-
pokechu22
mixed in a list from #webroasting a while back
-
JAA
There are about 8.3k offsite URLs in the remaining queue.
-
pokechu22
SrainUser's
transfer.archivete.am/Y5Qsp/orange_isp_hosting_urls.txt which I think was scraped from the list the site gives but I'm not 100% sure
-
pokechu22
and, yes, there's a fair bit of garbage on my list - easier to let it be attempted and fail than to try to filter it out
-
arkiver
thuban: if you have a list of sites, please do post them!
-
thuban
arkiver: still processing, will do
-
pokechu22
I can deduplicate my list against anything you find and start a second job for whatever's missing
-
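A minimal sketch of that deduplication, assuming both lists are plain one-URL-per-line files (the file names are placeholders):
sort -u my_orange_list.txt > a.sorted
sort -u thubans_orange_list.txt > b.sorted
comm -13 a.sorted b.sorted > missing_from_my_list.txt   # URLs only present in thuban's list
-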
pokechu22
It looks like there's a pagespro-orange.fr in addition to a pagesperso-orange.fr incidentally
-
thuban
yep
-
arkiver
thuban: thank you
-
arkiver
and in the meantime - all those queuing AB jobs for orange, please keep doing that
-
pokechu22
I'm currently doing an !a < list AB job for it - this is easier since it's one job for thousands of sites, but it's a bit buggy in that if the sites link to each other, it might not recurse properly. Still, it seems like the most practical way to do this
-
erkinalp
JAA: thanks for the number
-
arkiver
erkinalp: we got a pretty serious chunk of it i believe
-
erkinalp
89% of items saved
-
erkinalp
maybe more
-
erkinalp
(the website had 9.35M posts, according to their own stats)
-
arkiver
that's good!
-
fireonlive
:D
-
arkiver
not sure the percentage is correct, but we got more than half i think
-
erkinalp
after scraping and reconstruction, i might actually get more posts
-
thuban
pokechu22: agreed re practicality (and i don't think sites linking to each other will be a problem--at worst, it won't get pages that are only linked from other sites and not from their own site's homepage)
-
thuban
-
thuban
i see that you have a job running for monsite-orange.fr_seed_urls.txt; are you going to start another for orange_isp_hosting_urls.txt? (or has that already been done?)
-
pokechu22
I'm working on building my own list for the orange one based on orange_isp_hosting_urls.txt but I'm not going to run orange_isp_hosting_urls.txt directly
-
thuban
ok, cool
-
JAA
erkinalp: Those stats are not right. You'd have to analyse the WARC data to tell how much we covered. The number of URLs retrieved is not really correlated to that in a meaningful way.
-
JAA
We fetched offsite URLs, we fetched rating.php, and so on.
-
JAA
A coarser estimate would be possible by analysing just the log file and retrieving how many topic IDs appear there, but later pages could be missing, so that's still only a rough estimate.
-
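A rough sketch of that coarser estimate, assuming the job log records the fetched URLs and topics use phpBB-style viewtopic.php?t= IDs (the log filename is a placeholder):
zcat wowturkey_job.log.gz | grep -Eo 'viewtopic\.php\?t=[0-9]+' | sort -u | wc -l   # distinct topic IDs seen in the log
-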
erkinalp
JAA: yeah that was what i was referring to by "after .. reconstruction, i might actually get more posts"
-
JAA
Or less. We don't know how far it got through the forum pagination.
-
erkinalp
thankfully wowturkey's viewtopic.php page size is fixed at 10, and ttum.php page size is fixed at 100
-
erkinalp
and both were configured in a manner to link to the most recent pages of each respective topic
-
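A minimal sketch of expanding one topic into its start= pages under those assumptions (the topic ID, post count, and /forum/ path are placeholders; 10 posts per page as stated above):
t=123456; posts=87
for ((start=0; start<posts; start+=10)); do
  echo "http://wowturkey.com/forum/viewtopic.php?t=${t}&start=${start}"
done
-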
szczot3k
Hi, how can I help with the efforts? Is a v6-space-holder useful at all?
-
rewby
IIRC there's at least one project currently active that uses a ton of v6 ips
-
rewby
I forget which one, imer wrote the code for it
-
rewby
Or rather
-
rewby
imer wrote the deployment code that makes it do lots of v6
-
szczot3k
Well, I can technically use a whole /39, so ready to help out
-
imer
#deadcat but we're bandwidth limited there anyways (and they seem to rate limit per single ip or something like that)
-
DigitalDragons
#deadcat is the one
-
imer
target bandwidth limited that is
-
rewby
Also, I could see if I can hook up my spare /32 to some workers at some point
-
rewby
I'm just busy with targets
-
imer
szczot3k: here's the aforementioned code/script as well:
gist.github.com/imerr/614e534218a6b93be1a40b088dee885a
-
DigitalDragons
i heard #sweet supports ipv6 too but I don't know about their ratelimiting
-
imer
there is none unless you go way too fast and then they will (it seems) manually block your ip
-
DigitalDragons
hah
-
flashfire42
Ok got back through scrollback and it seems the consensus is to ignore webs for the moment and focus on orange? cc arkiver
-
DigitalDragons
also, glad to hear about wikis-grab!
-
DigitalDragons
i have some wikibot #// extraction almost ready but unsure about filtering
-
thuban
ok, orange.fr enumeration finished and spot-checks suggest that i got all the categories
-
thuban
processing the results now
-
thuban
malformed urls won't break archivebot, right? there are a few fun ones in here, like `usftennis2.monsite-orange.fr/index.html#="'><h1>abcd</h1>${{7*7}}${7*7}%{7+7}[[7*7]]@(1+2)<%= 7*7 %>` and `monsite.orange.fr@la-canaliere`
-
pokechu22
Right
-
pokechu22
the first one would just be treated as usftennis2.monsite-orange.fr/index.html because of the #
-
pokechu22
the second one would probably be treated as trying to log in as user monsite.orange.fr on site
la-canaliere, which obviously won't work, but will fail in an acceptable way
-
h2ibot
FireonLive edited Deathwatch (+294, move wowTURKEY to dead (we should use that…):
wiki.archiveteam.org/?diff=50602&oldid=50598
-
pokechu22
the main thing that breaks archivebot is FTP - there are a few other things that can cause problems but they aren't easy to control for
-
h2ibot
FireonLive edited Deathwatch (+2, fix url for 2028-Russia going to example.com…):
wiki.archiveteam.org/?diff=50603&oldid=50602
-
fireonlive
(i was like example.com?!)
-
JAA
That mistake is so common.
-
JAA
I wish there was a way to make edits throw an error when a template isn't used correctly.
-
JAA
Probably possible with an extension or something ridiculous like that.
-
pokechu22
You could probably use an editfilter
-
pokechu22
er, for that one, probably the right thing to do is make it generate a big red message of anger instead of silently using example.com
-
JAA
We do have
wiki.archiveteam.org/index.php/Category:Pages_with_broken_URLs for all uses of Template:URL where the URL is empty.
-
JAA
I just remembered that I added that at one point.
-
pokechu22
-
h2ibot
Pokechu22 edited Template:Url (+138, add visible warning about broken URLs):
wiki.archiveteam.org/?diff=50604&oldid=49244
-
h2ibot
Pokechu22 edited Reddit (-1, fix incorrect {{URL}} usage):
wiki.archiveteam.org/?diff=50605&oldid=49987
-
h2ibot
Pokechu22 edited Talk:Twitter (+2, fix incorrect {{URL}} usage):
wiki.archiveteam.org/?diff=50606&oldid=49771
-
h2ibot
Pokechu22 created Category:Pages with broken URLs (+210, Created page with "Pages that use…):
wiki.archiveteam.org/?title=Category%3APages%20with%20broken%20URLs
-
JAA
Good idea, thanks.
-
JAA
Should be good enough.
-
fireonlive
awesome ^_^