-
JAA
I've been battling with SAP's Q&A site that'll be taken down very soon. Their server is very broken. I kind of expected that, given it's SAP, but it's impressive just how bad it is.
-
JAA
We got decent coverage from ArchiveBot already, but that can't have gotten everything. Among the things missed will be pagination of answers and attachments in answers or comments, I believe.
-
JAA
-
JAA
The good news is there don't seem to be any significant rate limits.
-
pabs
were they missed due to JS or?
-
JAA
Yeah
-
JAA
XHR that returns HTML in JSON in JSON for some data, and so on.
-
nicolas17
yo. dawg.
-
JAA
Real pleasure to work with.
-
pabs
fuuuugly
-
pabs
-
JAA
The server also returns truncated responses and extra data after completed responses.
-
nicolas17
are they all called image.png?
-
JAA
They are not.
-
pabs
-
JAA
Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again.
-
JAA
I'll try my best, but this is a shitshow.
-
JAA
Oh yeah, I'm seeing that 404 thing again right now.
-
JAA
I'll probably get a lot of false 404s.
-
JAA
It seems to only affect IDs that aren't questions anyway, so maybe that's 'fine', but yeah.
-
pabs
hmm, on that page above, the attachments are just hrefs
-
pabs
no JS needed...
-
nicolas17
I loaded that answer with NoScript enabled
-
nicolas17
I could see the inline image and the link
-
JAA
Only for attachments in the question, not in answers.
-
pabs
ah
-
JAA
Nor in comments on answers.
-
JAA
The comments are where that JSON in JSON happens.
-
pabs
the answer attachments are in JSON in the HTML it seems
-
pabs
-
JAA
Yeah right, there's another layer of HTML around it for the first page of answers, too.
-
JAA
So HTML in JSON in JSON in HTML :-)
-
nicolas17
/o\
-
pabs
| grep -oE '/storage/[^\"]+'
-
JAA
Anyway, I've got all these things figured out already, the remaining problem is their server not cooperating at the HTTP level.
-
pabs
can just grep the WARC of the answer HTML :)
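A rough Python equivalent of that grep, as a sketch only. The sample string is made up for illustration (it is not a real response from the site), and `find_storage_paths` is a hypothetical helper name.

```python
import re

# Same pattern as the grep above: "/storage/" followed by everything
# up to the next backslash or double quote.
STORAGE_PATH = re.compile(r'/storage/[^\\"]+')


def find_storage_paths(text: str) -> list[str]:
    """Return every /storage/... path found in the text, in order of appearance."""
    return STORAGE_PATH.findall(text)


# Hypothetical sample: JSON embedded in HTML, so the quotes are backslash-escaped.
sample = '<script>var data = "{\\"url\\":\\"/storage/attachments/1-example.png\\"}";</script>'
print(find_storage_paths(sample))
```

Stopping at a backslash as well as a quote matters here because the paths sit inside escaped JSON strings.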
-
pabs
ah, what's the issue there?
-
JAA
I also liked this one:
-
JAA
> const simplifiedQuestionView = JSON.parse("true");
-
nicolas17
<JAA> Earlier, I saw 404s on URLs that had been returning 302s just before. Shortly after writing about it in #archivebot, it started returning 302s again.
-
JAA
No, the truncated responses and extra data after responses. The latter is probably a wrong Content-Length header.
-
pabs
how are you detecting that?
-
JAA
By getting JSON and HTTP parser errors.
-
pabs
oh geez...
-
fireonlive
<nicolas17> are they all called image.png? < that's my power move
-
pabs
does retrying the failing ones help?
-
JAA
As I wrote in #archivebot earlier, it takes some effort to fuck up this badly.
-
JAA
-
JAA
:-)
-
nicolas17
to err is human
-
nicolas17
to really fuck things up you need a computer
-
pabs
yeah, just saw that...
-
pabs
we don't need the login page though :)
-
JAA
Yeah, that's a bug in my code. Already fixed that.
-
fireonlive
lord lol
-
JAA
The failures are responses with (presumably) extra data at the end.
-
JAA
> * Excess found in a non pipelined read: excess = 97, size = 3632, maxdownload = 3632, bytecount = 0
-
JAA
That's what curl emits.
-
JAA
And the response ends at a random point in the middle of the HTML.
-
JAA
That's why I suspect they're sending the wrong Content-Length.
-
pabs
is it one particular IP address that is bad or all three of them?
-
» fireonlive sits back and wonders how
-
JAA
I think all are affected, but I'm not currently logging that information on the errors, so I can't confirm.
-
JAA
Highly unlikely it didn't hit all IPs on 721 attempts though.
-
pabs
and does --http1.1 help?
-
pabs
nope, it doesn't
-
JAA
WARC can only store HTTP/1.1 anyway.
-
pabs
ah
-
fireonlive
modern web--
-
eggdrop
[karma] 'modern web' now has -1 karma!
-
JAA
Oh yeah, there's a PROTOCOL_ERROR on HTTP/2 as well, right. :-)
-
pabs
s/-1/-Inf/
-
JAA
SAP--
-
eggdrop
[karma] 'SAP' now has -1 karma!
-
fireonlive
SAP--
-
eggdrop
[karma] 'SAP' now has -2 karma!
-
pabs
plain http redirects to TLS too, so no way to check that
-
pabs
love it how even when you get the right Content-Length, the HTML is still broken, no </body></html>
-
pabs
doesn't even close the <form> tag
-
fireonlive
full, solid, enterprise software
-
JAA
Confirmed I see the excess data error on all three IPs.
-
JAA
One item was looping on the login retries, had to just ^C it after almost 10k attempts.
-
JAA
Yep, 10k retries, all with that excess data stuff, apparently.
-
JAA
> ClientResponseError("400, message='invalid constant string'",)
-
JAA
That's how it shows up on qwarc.
-
fireonlive
wow
-
pabs
how does SPN do?
-
pabs
it's interesting, here I don't get the extra data every time, but have once or twice
-
JAA
It might be that the excess data is always there and it's a matter of timing whether your HTTP client considers the transaction done or actually emits an error.
-
JAA
Or how are you testing exactly?
-
pabs
curl -v ...
-
JAA
Yeah, same
-
pabs
I almost always see this tho: * Connection #0 to host answers.sap.com left intact
-
pabs
so I guess that means the server is always not closing the connection, and you're right with the timing comment
-
nicolas17
I think that means "not closing the connection since we may reuse it in the next request"
-
pabs
ah
-
nicolas17
* Connection #0 to host google.com left intact
-
fireonlive
<resists joke>
-
nicolas17
if you do two URLs (same hostname) in the same curl command:
-
nicolas17
* Connection #0 to host google.com left intact
-
nicolas17
* Re-using existing connection! (#0) with host google.com
-
JAA
Oh
-
JAA
lol
-
JAA
It seems to break at a hidden <input> containing a CSRF token!
-
fireonlive
ooh!
-
fireonlive
is there a token? or is it trying to make one and just catastrophically failing
-
JAA
Usually, the response ends with the submit button.
-
JAA
Immediately after that button would be the hidden token.
-
pabs
so the token is the extra data?
-
fireonlive
ah!
-
JAA
But I only get that (and the closing </form></body></html>) sometimes.
-
nicolas17
JAA: is this on truncated responses or on extra-data responses?
-
pabs
so something is choking depending on what token gets generated?
-
pabs
is it only the login page that gets this extra data?
-
nicolas17
server asking the overkill HSM to encrypt and sign the CSRF token and failing (?)
-
JAA
-
JAA
`printf '%s\r\n' 'GET /users/login.html HTTP/1.1' 'User-Agent: curl/7.38.0' 'Host: answers.sap.com' 'Accept: */*' '' | openssl s_client -connect answers.sap.com:443 -ign_eof`
-
JAA
There's an extra 96 bytes at the end, which is exactly that CSRF input tag plus the closing tags.
-
JAA
Those are not accounted for in the Content-Length.
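One way to count those bytes, sketched under the assumption that the raw bytes of a single non-chunked HTTP/1.1 response have already been captured (for example from the `openssl s_client` output with the TLS chatter stripped). `content_length_mismatch` is a hypothetical helper, and the sample response is invented.

```python
def content_length_mismatch(raw: bytes) -> int:
    """Return len(body) minus the declared Content-Length.

    Positive means extra data after the declared body; negative means truncation.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    declared = None
    # Skip the status line, then look for the Content-Length header.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None:
        raise ValueError("no Content-Length header")
    return len(body) - declared


# Invented example: the header claims 5 bytes, but 12 bytes of body follow.
raw = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello</form>"
print(content_length_mismatch(raw))
```

This only makes sense for responses that carry Content-Length; chunked responses need a different check.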
-
nicolas17
tried twice, once the response ended at </html>, next time the response ended at value="Submit" />
-
nicolas17
I didn't count bytes to see what the content-length included :P
-
pabs
does --ignore-content-length help? :)
-
JAA
pabs: Not only the login page, I saw the same error elsewhere as well, all on 404 pages.
-
JAA
Maybe it generates a token there as well for a 'report an error' thing.
-
JAA
nicolas17: Extra data, I haven't looked at the truncation yet.
-
JAA
The extra data makes qwarc very sad. The truncation just causes a small number of items to crash.
-
pabs
with --ignore-content-length I still see the truncation sometimes
-
JAA
Yeah, try the OpenSSL command instead, it'll stall after the submit button randomly.
-
pabs
but --ignore-content-length fixes lack of body
-
pabs
some broken ass hit
-
JAA
I can't ignore the Content-Length in qwarc anyway; that happens deep in a C library for HTTP parsing.
-
pabs
qwarc doesn't use curl?
-
JAA
No
-
JAA
aiohttp
-
JAA
I'm not aware of anyone having implemented WARC into curl.
-
fireonlive
nulldata you have competition
-
JAA
Anyway, taking a bit of a break, then trying to fix the remaining issues and getting it properly started.
-
nulldata
:O
-
JAA
First very rough size estimate is ~300 GiB.
-
nullpeta
Hi. The Japanese edition of Slashdot (srad.jp) will be shut down on 2024/01/31. Can anyone help archive it?
-
» pabs looks
-
nullpeta
-
pabs
lots of subdomains...
-
pabs
looks like the article subdomains are just categories, so subdomain stories are on the main domain too
-
pabs
comments are very JS-y like slashdot
-
nullpeta
According to the closure notice, OSDN.net (a Japanese GitHub-like site) may also be closed.
-
pabs
fuck
-
fireonlive
oh no :(
-
pabs
nullpeta: started a job for srad.jp, see archivebot.com to watch it run
-
pabs
osdn.net was ultra-broken a while back, wonder how it is now
-
nullpeta
pabs: Thank you very much!
-
pabs
don't think I will do article subdomains, that will be duplication I think
-
pabs
not sure how to deal with comments either
-
nullpeta
pabs: Subdomains are just categories, so they should be accessible from the main domain. If a story has over 50 comments, the comments beyond the first 50 are loaded by JS later. ;(
-
nullpeta
To get all comments, we need to click the "すべてのコメントを取得" (Get all comments) button manually.
-
fireonlive
masterX244: do you know if there's a final/hard cutoff date set yet for discordcdn urls expiring/the signature&etc parameters being mandatory?
-
pabs
nullpeta: yeah, that's a POST request, which isn't archivable
-
h2ibot
Switchnode edited Deathwatch (+239, /* 2024 */ add srad.jp):
wiki.archiveteam.org/?diff=51539&oldid=51537
-
pabs
hmm, the site is timing out for me now :(
-
pabs
and also in AB
-
pabs
!d 6p7aqfevk41es3iuyywvw68a7 1800000 1800000
-
pabs
nullpeta: re comments, they look enumerable: srad.jp/comment/4597211
-
thuban
pabs: wrong channel
-
pabs
yeah :)
-
fireonlive
site is down for me as well
-
fireonlive
oh there it goes
-
fireonlive
just very slow
-
nulldata
Yeah occasional 500s
-
fireonlive
the quote at the bottom was "人生unstable -- あるハッカー" which google converts to "Life is unstable -- a hacker"
-
fireonlive
suiting :)
-
nullpeta
pabs: So comments are archivable via srad.jp/comment/* URIs? Good!
-
nulldata
Looks like OSDN's magazine site has been broken at least since November of last year.
osdn.net/mag
-
pabs
sadly SWH won't be able to save OSDN git/hg/svn repos due to the domains having expired certs
-
pabs
posted about that to #swh (Libera) and #codearchiver
-
pabs
and escalated within SWH
-
fireonlive
🤞
-
nullpeta
srad.jp is running on a very old system (Perl-based?). I guess it is not strong enough to handle high load.;(
-
fireonlive
Server: Apache/1.3.42 (Debian) mod_gzip/1.3.26.1a mod_perl/1.31
-
fireonlive
that sounds quite old
-
fireonlive
X-Fry: It's all there, in the macaroni.
-
fireonlive
X-Powered-By: Slash 2.005001
-
fireonlive
??
-
JAA
Whew, that's ancient, yeah. Apache 1.x was the old thing ~20 years ago.
-
fireonlive
oh wow yeah not even apache2
-
JAA
1.3.42 is from early 2010.
-
JAA
At least it's the last 1.x version, but yeah.
-
fireonlive
ah ye was just looking that up
-
fireonlive
surprised it lasted that long
-
fireonlive
mod_perl 1.0: Version 1.31 - May 11, 2009 (also the current version of 1.0);(2.0: mod_perl 2.0: Version 2.0.13 - October 21, 2023)
-
arkiver
proactive archiving is not always something we can do due to size.
-
fireonlive
arkiver: referring to furaffinity?
-
fireonlive
(also, hi :3)
-
JAA
More SAP fuckery: on some URLs, I sometimes get a 302 to the login page and sometimes a 200.
-
JAA
-
JAA
I wonder whether I should retry when I get a login redirect.
-
fireonlive
hmm, seems to be a blank post
-
fireonlive
but... probably :/
-
nullpeta
I found some OSDN-related docs which may help with archiving. shujisado is the former CEO of OSDN.
gist.github.com/shujisado/2864e2475567fbbad8f8bacdb290d48a
-
fireonlive
has a comment from a minute ago too, very nice
-
pabs
ah they have CVS too :(
-
JAA
So the incomplete SAP responses are indeed truncated JSON. It simply 'forgets' to send the final }}}.
-
JAA
But not with wrong Content-Length as in the other cases. This is chunked TE, and it sends the terminating zero-length chunk.
-
JAA
I'd be surprised if I wasn't getting incomplete HTML as well.
-
JAA
I'll retry if there is no </html> or if the JSON doesn't parse.
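That retry rule could look roughly like this. It is a sketch, not the actual qwarc code: `needs_retry` and its `is_json` flag are hypothetical, and how the caller decides whether a response should be JSON is left open.

```python
import json


def needs_retry(body: bytes, is_json: bool) -> bool:
    """Return True if the body looks truncated and the request should be retried."""
    text = body.decode("utf-8", errors="replace")
    if is_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            # e.g. the server 'forgot' the final }}}
            return True
        return False
    # HTML: treat a missing closing tag as truncation.
    return "</html>" not in text.lower()


print(needs_retry(b'{"a": {"b": {"c": 1', is_json=True))
print(needs_retry(b"<html><body>ok</body></html>", is_json=False))
```

Since the truncated responses arrive with a valid terminating chunk, a content check like this is needed; the transport layer alone reports nothing wrong.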
-
JAA
THIS IS SUCH FUN!
-
fireonlive
(╯°□°)╯︵ ┻━┻
-
arkiver
fireonlive: somewhat
-
fireonlive
ah ok :)
-
h2ibot
Pokechu22 edited ISP Hosting (+447, LaCoocan):
wiki.archiveteam.org/?diff=51540&oldid=51360
-
h2ibot
Pokechu22 edited Deathwatch (+454, /* 2024 */ domain@nifty, not sure what actions…):
wiki.archiveteam.org/?diff=51541&oldid=51539
-
JAA
lol, nearly every response I get is truncated...
-
fireonlive
x_x
-
fireonlive
𝓺𝓾𝓪𝓵𝓲𝓽𝔂
-
JAA
I had a bug in the check, but I was indeed getting lots of truncated responses, especially on 404s.
-
JAA
Yeah, I won't retry on 404s. It's just too ridiculous.
-
JAA
80+% of 404s get truncated.
-
fireonlive
oof yeah..
-
JAA
15% of requests generate a warning, almost all of them about extra data after the response.
-
JAA
Around 2k warnings per minute now. I think I'm on panel 7 or 8 of <this_is_fine.png>.
-
nullpeta
The former CEO of OSDN confirmed that OSDN.net will also close at the end of January.
nitter.net/shujisado/status/1749300822691958969
-
JAA
If you don't want to math, I'm doing around 200 req/s.
-
JAA
ETA if that holds up: 33 hours
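For anyone who does want to math, a small sketch of the arithmetic. The inputs are the rounded figures from the messages above (2k warnings per minute, 15% of requests, 200 req/s, 33 hours); the implied number of remaining requests is derived from those, not something stated in the channel.

```python
def implied_request_rate(warnings_per_minute: float, warning_fraction: float) -> float:
    """Requests per second implied by a warning rate and the share of requests that warn."""
    return warnings_per_minute / warning_fraction / 60


def requests_remaining(req_per_second: float, eta_hours: float) -> float:
    """Number of requests implied by a steady rate and an ETA."""
    return req_per_second * eta_hours * 3600


rate = implied_request_rate(2000, 0.15)
print(f"implied rate: {rate:.0f} req/s")
print(f"implied remaining: {requests_remaining(200, 33):,.0f} requests")
```

The first figure lands a bit above 200 req/s, which is consistent with "around 200" given the rounded inputs.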
-
fireonlive
pabs: ^
-
pabs
thanks. all the git/hg repos are being saved by SWH, and #codearchiver after SWH is done
-
pabs
the svn and CVS repos I'm not sure about how to find them
-
pabs
and the site is in AB but times out a fucking lot
-
pabs
the osdn_mirror_contents_url.md gist is interesting, but it looks like none of the mirrors allow enumeration of projects
-
h2ibot
Switchnode edited Deathwatch (+67, /* 2024 */ cleanup):
wiki.archiveteam.org/?diff=51542&oldid=51541
-
nullpeta
For srad.jp, srad.jp/journal/* and srad.jp/submission/* are also enumerable URIs. Could these be used as crawl seeds?
-
pabs
enumerable things can't be used as crawl seeds, you can only enumerate and save them
-
pabs
at least with archivebot right now
-
pabs
also, it looks like the main job is finding ~user/journal/1111 URLs, but not /journal/1111 URLs
-
pabs
nullpeta: ^
-
pabs
arkiver JAA - I think we should save all of dotsrc.dl.osdn.net/osdn (alias of mirrors.dotsrc.org/osdn) because OSDN is going down and osdn.dl.osdn.net is not enumerable
-
nullpeta
pabs: Thanks. So that means some non-AB crawls are needed to save all the comments, etc.... Hmmm.
-
pabs
(the rest of the site is mirrors of other stuff)
-
pabs
nullpeta: no, the enumeration is saving all the comments, see 7veb8mluv16mtw6ezn2jqmfbv in AB
-
pabs
the issue is that the enumeration won't save anything else except the comments
-
pabs
the main job is saving everything found from the front page
-
pabs
that is 6p7aqfevk41es3iuyywvw68a7
-
pabs
-
pabs
hmm, maybe I could use dotsrc.dl.osdn.net/osdn to enumerate and then translate osdn.dl.osdn.net URLs
-
nullpeta
pabs: I see, thanks for the explanation.
-
JAA
SAP slowed down massively a bit ago, and now I got banned. We'll see how long it lasts. The 403s arrived *very* fast though. I peaked at almost 1k req/s. lol
-
nullpeta
srad.jp/story/24 This page has links to each month's page for 2024, its child pages have links to each day's page for that month, and each day's page has all the stories submitted for that day. Each year's page also has a link to the previous year's page. Hope this helps crawl all the stories.
-
c3manu
nullpeta: the www.goodsmile.info server responds fairly slowly, so AB might not be able to grab everything in time. but i see for now the website (and its suspension announcement) are still up, so fingers crossed they don't just relaunch it immediately
-
nullpeta
c3manu: Thank you for trying to archive goodsmile.info. According to the official website, the update has been postponed. As of now, the re-launch date has not been disclosed.
-
ScenarioPlanet
Maybe Spore backups should get their own collection to be mentioned on
wiki.archiveteam.org/index.php/Spore ?
-
Pedrosso
pabs: Did you convert the other lists into the static subdomain too?
-
Pedrosso
ScenarioPlanet: I think you're right. The wiki should note the AB job on spore.com as well as these new lists
-
pabs
Pedrosso: I only saw one spore job running in the AB dashboard
-
pabs
so I only replaced that one
-
Pedrosso
Alright, I'll update the other 5 lists (2 more png lists, 3 xml lists) to be static.spore.com
-
pabs
thanks
-
Pedrosso
The wiki should also mention said subdomain that I failed to recognize
-
ScenarioPlanet
It does
-
ScenarioPlanet
> Should be requested on static subdomain which uses CDN.
-
Pedrosso
If you're quoting from discord, I didn't notice that
-
Pedrosso
I overlooked it due to having /static/ in the url
-
Pedrosso
Thank you for the correction, this'll be much faster :)
-
Pedrosso
ooh
-
Pedrosso
You're quoting from the wiki nvm, they did say it on the discord server as well though
-
ScenarioPlanet
-
Pedrosso
What's that about postcards?
-
ScenarioPlanet
Postcards are an exception here
-
ScenarioPlanet
Their web pages (/view/postcard/) use www.
-
Pedrosso
I see
-
Pedrosso
Spore randomly throws .jpgs into the mix
-
Pedrosso
> Note that all the non-ASCII symbols will be replaced with ? in the response.
-
Pedrosso
Does this apply when viewing the item as well?
-
ScenarioPlanet
Not viewing, only REST requests
-
Pedrosso
Well, that's a shame.
-
Pedrosso
So, what of the REST service? If we'd be able to grab creation data (like creator, subscribers, comments) via those we'd be able to find all users and grab even more from there; notably sporecasts.
-
Pedrosso
How would REST deal with an AB job?
-
ScenarioPlanet
Mostly fine, if that's 1000ms
-
Pedrosso
shrug better than nothing
-
ScenarioPlanet
So we are not doing DDoS
-
Pedrosso
Wouldn't DDoS require c>1 ?
-
ScenarioPlanet
Yes, that's why it should be 1 or 2-3 at most
-
Pedrosso
I see
-
Pedrosso
DWR Interface, has anything more been done/researched regarding that since the wiki was last updated?
-
ScenarioPlanet
Also, we must not use REST as the only option to preserve creation metadata
-
ScenarioPlanet
ATOM and especially DWR too
-
Pedrosso
hm?
-
ScenarioPlanet
I could ask Kade to join Archiveteam's IRC. They know a lot about the DWR stuff.
-
Pedrosso
Yeah. But what did you mean just now?
-
ScenarioPlanet
We need to save ATOM responses too (see wiki page, "ATOM Feeds")
-
Pedrosso
So what you meant is that we should not solely save one? I agree
-
ScenarioPlanet
Sure
-
Pedrosso
Does DWR have something special, other than unicode?
-
ScenarioPlanet
Yes, POST requests
-
Pedrosso
I mean information-wise
-
Pedrosso
Something the others lack information-wise
-
ScenarioPlanet
It also holds adventure leaderboards, captain stats data and more
-
Pedrosso
Oooh, yeah that's good
-
ScenarioPlanet
Basically everything you can see on spore.com/sporepedia and can't find in any of the mentioned endpoints' responses
-
Pedrosso
I'd say yeah, ask Kade to join
-
ScenarioPlanet
Done
-
TempleOfGoo88
Hello. I've been getting an error several times when trying to upload something. Can anyone help?
-
TempleOfGoo88
This is the error I'm getting:
-
TempleOfGoo88
<?xml version='1.0' encoding='UTF-8'?><Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><Resource>Your upload of goo-goo-dolls-all-that-you-are-live-at-the-late-late-show-19.12.2011 from username templeofgoo⊙gc appears to be spam. If you believe this is a mistake, contact info⊙ao and include this entire
-
TempleOfGoo88
message in your email.</Resource><RequestId>6ee4aeaf-231a-4a93-bbb1-ace0055e9b4a</RequestId></Error>
-
c3manu
TempleOfGoo88: i've had good results with following the instructions in the error message.
-
TempleOfGoo88
Well, the error message just says to send an email
-
TempleOfGoo88
Which I did. I'll just wait and see
-
c3manu
you shouldn't have to wait longer than 24h in my experience.
-
TempleOfGoo88
(y)
-
c3manu
"archiveteam" and "archive.org" are separate entities, so bothering us wouldn't have helped either ;)
-
TempleOfGoo88
Oh ok. thanks
-
c3manu
np :)
-
fireonlive
"Terraform Labs files for bankruptcy: Terraform Labs, the company behind the Terra blockchain, has filed for bankruptcy. Its flagship product, the Terra stablecoin and associated LUNA token, failed spectacularly in May 2022."
web3isgoinggreat.com/single/terraform-labs-files-for-bankruptcy
-
fireonlive
(via #web3)
-
nulldata
Was just about to ask if someone could throw terra.money and medium.com/terra-money into AB :P
-
fireonlive
:P
-
fireonlive
there's also "CFTC files complaint against Debiex platform for using "romance scam tactics" to steal $2.3 million web3isgoinggreat.com/single/debiex-cftc-complaint" - which lists their domains; but I can only find debiex.com/wap that still works; everything else they seem to have 404'd
-
kuro68k
Hi guys. I came here because I heard about srag.jp going offline at the end of the month. It's not listed in Warrior but I'd like to help archive it if I can. Is it possible to contribute to that job, ideally via Warrior?
-
nulldata
kuro68k - you mean srad.jp ?
-
nulldata
So far we're able to grab it with #ArchiveBot , so it's not a Warrior project
-
nulldata
You can check the progress on
archivebot.com/?showNicks=1
-
fireonlive
threw in terra's two urls there (medium i had to tack on `archive` on the end)
-
JAA
Still banned from SAP
-
JAA
All this effort for that...
-
kuro68k
Yes, srad.jp, sorry typo
-
kuro68k
So does ArchiveBot mean you don't need any help?
-
fireonlive
:|
-
kuro68k
I worry about slashdot.org too. It's a hard site to archive properly, in a way that preserves all the links and conversations. I tried archiving it once, the results were not stellar.
-
arkiver
JAA: do we need anything Warrior for OSDN?
-
arkiver
-
arkiver
looking further into it as well
-
JAA
arkiver: Most of the discussion has been happening in #codearchiver.
-
arkiver
ah
-
JAA
I don't know what we can do about SAP Q&A.
-
JAA
I suspect the ban is manual.
-
JAA
arkiver: Do you want to try a DPoS? It'd have to start *very* soon. They intend to finish their migration (which I'm sure will lose data given it's SAP) on Wed.
-
JAA
There is a lot of weirdness in this one. Incomplete responses, wrong Content-Length headers, etc. I intend to document it all on the wiki.
-
ScenarioPlanet
-
ScenarioPlanet
(that's like 30% of everything)
-
arkiver
ScenarioPlanet: what is this?
-
ScenarioPlanet
arkiver: fullsized images lists (wordpress+drupal+google+wix)
-
fireonlive
so we don't just have the thumbnails from the AB jobs
-
fireonlive
AIUI
-
beastbg8
Hello. I would like to bring something to your attention. I hope I'm in the right place. In a month's time from today, the largest local video portal in Bulgaria circa 2006, VBOX7 (very similar to the Hungarian "videa.hu"), is about to "hide" all user-uploaded content, which according to their devs is over 14M videos. They already did that last
-
beastbg8
week, but opened the gates for a final time after social media outrage. It contains a lot of rare, nowhere-to-be-found media, specifically concerning Bulgaria, and quite a lot of otherwise "lost" media (foreign movies, TV series) that survives only there, either with a dub or not. Is there something that can be done in such a narrow time frame?
-
beastbg8
Currently these videos can only be accessed with a Bulgarian IP. Only "partnered" videos (the only content that will remain on the site starting 22 February 2024) can be watched from abroad.
-
JAA
-
JAA
The georestriction is going to be a pain.
-
h2ibot
JustAnotherArchivist edited Deathwatch (+428, /* 2024 */ Add Vbox7):
wiki.archiveteam.org/?diff=51543&oldid=51542
-
nulldata
Example of a working video in the US:
vbox7.com/play:2e08276728 -> (player loads this which links to a mpd file, the URLs of which seem to be static - or at least didn't change when accessing via a different connection and computer)
vbox7.com/aj/player/item/options?vid=2e08276728
-
nulldata
-
beastbg8
Currently the yt-dlp's extractor for VBOX7 is broken (only extracts mpds from some videos) but a friend fixed it awhile ago. Providing it with their consent. Please take a look.
pastebin.com/raw/300v4NwC (vbox7.py)
-
nulldata
No geo restriction on the video server itself it seems - if you have a Bulgaria connection to grab the URL from the API response it'll download no problem on a US connection
-
beastbg8
yep
-
nulldata
The API gives a mpd file, but you can change the extension to m3u8 and play with VLC.
edge211.vbox7.com/sl/1iswCSTN-zXpz-…33600/b3/b312ac1a7a/b312ac1a7a.m3u8
-
nicolas17
oh joy they're using byte ranges
-
nicolas17
so it's not 3 million tiny files with 5 seconds of video each
-
JAA
And DASH and HLS both reference the same MP4 file as well!
-
nulldata
Oh yeah I should've looked harder - the mpd file specifies the mp4 files in the BaseURL node.
edge211.vbox7.com/sl/1iswCSTN-zXpz-…1a7a/b312ac1a7a_480_track1_dash.mp4
-
nulldata
Question becomes - is there a nice way to enumerate all valid video ids lol
-
JAA
424k requests per second could bruteforce the trillion possible [0-9a-f]{10} IDs in 30 days. That's not going to happen but less unreasonable than I expected.
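The back-of-the-envelope figure checks out; a minimal sketch of the calculation, assuming IDs really are exactly ten lowercase hex digits as the [0-9a-f]{10} pattern suggests:

```python
ID_LENGTH = 10          # characters per video ID
ALPHABET_SIZE = 16      # [0-9a-f]
DAYS = 30

total_ids = ALPHABET_SIZE ** ID_LENGTH      # 16^10, about 1.1 trillion
seconds = DAYS * 24 * 60 * 60
required_rate = total_ids / seconds         # requests per second to cover every ID

print(f"{total_ids:,} possible IDs")
print(f"about {required_rate / 1000:.0f}k req/s for {DAYS} days")
```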
-
nicolas17
and how do you go from video ID to mpd URL?
-
nicolas17
hmm weird
-
nicolas17
you posted an mpd/m3u8 for b312ac1a7a
-
nicolas17
but vbox7.com/ajax/video/nextvideo.php?vid=b312ac1a7a returns a direct .mp4 instead of an mpd manifest
-
griz
old ones are mostly mp4
-
griz
some are only flv
-
nulldata
nicolas17 - I think the only way would be finding someone with a legit connection, or VPN, dedicated to grabbing the URLs from the API to feed to a tracker. The grabbers could be on any connection.
-
nicolas17
but where did you even get that mpd if the API returns mp4?
-
nulldata
That's because of the geoblock - if you access via Bulgaria it returns the mpd link
-
h2ibot
Pokechu22 edited List of website hosts (+471, XFree / Thin Cloud for Free):
wiki.archiveteam.org/?diff=51544&oldid=51508
-
nicolas17
oh
-
nicolas17
I didn't realize that .mp4 was the "not available" placeholder >_>
-
griz
direct mp4 over https is fastest route.. then HLS & MPD
-
JAA
I mean, we don't need to refetch the video for HLS and MPD since it's all the same MP4 file behind it.
-
ThreeHM
I'm able to bypass the georestriction by adding an X-Forwarded-For header with a Bulgarian IP to the API request
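A sketch of what such a request might look like, built but not sent. It assumes the API request in question is the player options URL quoted earlier and that it is served over https; the IP below is a placeholder from a documentation range, not a Bulgarian address, and `build_options_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder from the TEST-NET-3 documentation range; NOT a Bulgarian address.
PLACEHOLDER_IP = "203.0.113.10"


def build_options_request(video_id: str, forwarded_ip: str = PLACEHOLDER_IP) -> Request:
    """Build (without sending) a player options request carrying X-Forwarded-For."""
    url = "https://vbox7.com/aj/player/item/options?" + urlencode({"vid": video_id})
    return Request(url, headers={"X-Forwarded-For": forwarded_ip})


req = build_options_request("2e08276728")
print(req.full_url)
print(req.header_items())
```

Passing the result to `urllib.request.urlopen` would actually send it; that step is left out here on purpose.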
-
project10
lol
-
fireonlive
😏
-
ThreeHM
Even works if I use the IP that's behind vbox7.com
-
ThreeHM
lol
-
Barto
someone screwed their reverse proxy config
-
fireonlive
excellent
-
griz
X-Forwarded-For is also used in the yt-dlp script, that's how it works
-
Barto
:-) I wonder if the more 'recent' Forwarded header works
-
beastbg8
made an article on the wiki
-
kiska
When warrior go brrr? :D
-
kiska
Seems like that'll be patched out soon
-
beastbg8
i doubt they're checking their code base too myopically
-
beastbg8
there's a large hole in their subtitle writing functionality, where videos can be obtained even if hidden, but it needs a user account
-
JAA
Nice
-
JAA
So how do we find video IDs?
-
kiska
Bruteforce?
-
beastbg8
-
JAA
There's a trillion of them, kiska.
-
JAA
16^10
-
kiska
Seems more doable than something like youtube :D
-
griz
the android app is interesting too, mobile api is very extensive
-
kiska
I suppose we could do discovery if they have "recommended" on the side
-
beastbg8
<div class="container">
-
JAA
We're not going to do 425k req/s for a month though. lol
-
JAA
But yeah, as I wrote above, I would've expected it to be worse.
-
kiska
I'd doubt they let us do that for a month :D
-
beastbg8
one way is through obtaining all user accounts through search keywords and scraping their channels
-
beastbg8
but that's kinda impractical
-
ThreeHM
They have recommendations, but we'd have to check if that gives you geoblocked videos
-
griz
-
kiska
Do they have a sitemap :D
-
griz
-
griz
random links from mobile app
-
beastbg8
also the subrec method (which will likely stay post feb 22) does not support .flv videos
-
beastbg8
it can be used as a last resort
-
beastbg8
btw hidden videos still retain their metadata
-
beastbg8
assuming you have the URL
-
beastbg8
vbox7.com/play:4a81cafe81 here is one such video. no playback endpoints but title, thumbnail, comments etc are all there
-
beastbg8
all videos were like this yesterday
-
nicolas17
crawl recommended videos recursively
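A minimal sketch of that recursive crawl as a breadth-first search. `get_recommended` is a hypothetical callable standing in for whatever fetches a video's recommendations (nothing here talks to the site), and the toy graph exists only to show the traversal.

```python
from collections import deque
from typing import Callable, Iterable


def discover_ids(seeds: Iterable[str],
                 get_recommended: Callable[[str], Iterable[str]]) -> set[str]:
    """Follow recommendations breadth-first from the seeds; return every ID seen."""
    seeds = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        video_id = queue.popleft()
        for rec in get_recommended(video_id):
            if rec not in seen:     # visit each ID only once
                seen.add(rec)
                queue.append(rec)
    return seen


# Toy recommendation graph with made-up IDs.
toy_graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
found = discover_ids(["a"], lambda vid: toy_graph.get(vid, []))
print(sorted(found))
```

This only finds videos reachable from the seeds, so anything never recommended stays undiscovered.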
-
nstrom|m
user pages also seem to have all the videos a user posted, so that probably helps w discovery a bit too. eg. vbox7.com/user:wochit_news has 626 pages of videos
-
nstrom|m
-
griz
-
griz
though api bugs after about 1000 results
-
griz
-
beastbg8
if videos in subrec mode give 404 error, renaming the extension from .mp4 (they're always passed as .mp4) to .flv might work, though not always
-
h2ibot
Missaustraliana edited Deathwatch (+4, Add 'was ' to Studio 10 as the time has passed.):
wiki.archiveteam.org/?diff=51545&oldid=51543
-
missaustraliana
yuh\