"TheEndless" wrote:
I suspect they expanded support, in general, in 5.4 to include additional encodings, as well as Unicode (per the observation with the stick here)
Yeah, the thing is - it is tricky.
First, there is no "Unicode encoding" as such. In Unicode each character is represented by a "code point", a number in the range 0 - 0x10FFFF. That means we would need at least 21 bits per character, which is unwieldy (it does not fit neatly into octets) - so instead Unicode text is stored in some encoding, like UTF-8, which for all practical purposes is THE way to store/transfer Unicode text. Each Unicode character gets stored in 1, 2, 3 or 4 bytes: the ASCII characters \0 - \x7F stay exactly the same single byte in UTF-8, and all other characters become sequences of 2 (or more) bytes >= \x80.

Consequently, if I give you a file where all bytes are < \x80, it is valid both as UTF-8 and as ASCII/ANSI. If there are bytes > \x7F though, it may be either UTF-8 or a 1-byte ISO-8859-X encoding (where X, and hence the language, is unknown)... or something else entirely.
So I am curious how they try to guess the encoding of the SRT. There is one cheap way - more of a hack, really - which is to check whether the file starts with \xEF \xBB \xBF (the UTF-8 BOM) and, if so, treat it as UTF-8. The problem is that if it does not start with a BOM, it can still be either UTF-8 or ISO/ANSI.
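For what it's worth, here is a rough Python sketch of the kind of guessing I have in mind: the BOM check, plus a strict UTF-8 trial decode as a fallback heuristic. This is just my guess at one possible approach, not a claim about what the firmware actually does, and the file name is made up.

```python
# A sketch of one possible detection approach (illustration only,
# no claim this is what the firmware does).

def guess_srt_encoding(data: bytes) -> str:
    # Cheap hack: a UTF-8 BOM at the start is a dead giveaway.
    if data.startswith(b"\xEF\xBB\xBF"):
        return "utf-8-sig"
    # No BOM: try a strict UTF-8 decode. Valid UTF-8 multi-byte
    # sequences rarely occur by accident in single-byte ISO/ANSI text.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Bytes >= 0x80 that don't form valid UTF-8 sequences:
        # some single-byte encoding, but which ISO-8859-X it is
        # cannot be determined from the bytes alone.
        return "iso-8859-?"

with open("subtitles.srt", "rb") as f:   # hypothetical file name
    print(guess_srt_encoding(f.read()))
```

The trial-decode fallback works because text in a single-byte encoding that happens to contain bytes >= \x80 almost never forms valid UTF-8 multi-byte sequences by accident - but it is still a heuristic, not a proof, and it says nothing about which ISO-8859-X variant you actually have.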