ocean.text.convert.Utf

Fast Unicode transcoders. These are particularly sensitive to minor changes on 32-bit x86 devices, because the register set of those devices is so small. Beware of subtle changes, which might extend the execution time by as much as 200%. Because of this sensitivity, three of the six transcoders might read past the end of the input by one, two, or three bytes before stopping. Note that support for streaming adds a 15% overhead to the dchar => char conversion, but has little effect on the others.

These routines were tuned on an Intel P4; other devices may work more efficiently with a slightly different approach, though this is likely to be reasonably optimal on AMD x86 CPUs as well. These algorithms would benefit significantly from the extra registers available on AMD64. On a 3GHz P4, the dchar/char conversions take around 2500ns to process an array of 1000 ASCII elements. Invoking the memory manager doubles that time for 1000 elements, and quadruples it for arrays of 100 elements. Memory allocation can slow down notably in a multi-threaded environment, so avoid it where possible.

Surrogate pairs are handled in a non-optimal fashion when transcoding between utf16 and utf8. Such cases are considered to be boundary conditions for this module.

There are three common cases where the input may be incomplete: the 'widening' conversions utf8 => utf16, utf8 => utf32, and utf16 => utf32. An edge case is utf16 => utf8, if surrogate pairs are present. Such cases will throw an exception unless streaming mode is enabled; in streaming mode, an additional integer is returned indicating how many elements of the input have been consumed. In all cases, a correct slice of the output is returned.
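The streaming contract described above can be illustrated with a small standalone sketch. It is not this module's implementation: decodeAvailable and sequenceLength are hypothetical names, only the utf8 => utf32 direction is shown, an out parameter stands in for the size_t* used by the real signatures, and continuation bytes are not validated. What it shows is the shape of the result: a slice of fully decoded output plus a count of input elements consumed, so the caller can carry the unconsumed tail into the next call.

```d
// Hypothetical helper, not part of ocean.text.convert.Utf: length of the
// UTF-8 sequence introduced by lead byte b, or 0 if b cannot start one.
size_t sequenceLength(char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 0;
}

// Decodes every complete sequence in input and reports, through ate, how
// many input chars were consumed. An incomplete trailing sequence is left
// unconsumed instead of raising an error (the "streaming" behaviour).
dchar[] decodeAvailable(const(char)[] input, out size_t ate)
{
    dchar[] output;
    size_t i = 0;
    while (i < input.length)
    {
        immutable n = sequenceLength(input[i]);
        if (n == 0)
            throw new Exception("invalid UTF-8 lead byte");
        if (i + n > input.length)
            break; // incomplete tail: leave it for the next chunk

        uint c = input[i];
        if (n > 1)
            c &= 0xFF >> (n + 1); // keep only the payload bits of the lead byte
        foreach (k; 1 .. n)
            c = (c << 6) | (input[i + k] & 0x3F);
        output ~= cast(dchar) c;
        i += n;
    }
    ate = i;
    return output;
}
```

A caller would typically prepend input[ate .. $] to the next chunk before calling again.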

For details on Unicode processing see:

Members

Functions

cropLeft
T[] cropLeft(T[] s)

Adjust the content such that no partial encodings exist on the left side of the provided text.

cropRight
T[] cropRight(T[] s)

Adjust the content such that no partial encodings exist on the right side of the provided text.
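As an illustration of what the two cropping functions above are described as doing, here is a minimal sketch for utf8 input only. The names cropLeftUtf8 and cropRightUtf8 are hypothetical, and the code is an assumption about the behaviour, not a copy of the module's source: the left crop skips leading continuation bytes, and the right crop drops a trailing sequence whose lead byte promises more bytes than are present.

```d
// Hypothetical UTF-8-only stand-in for cropLeft: skip continuation bytes
// (10xxxxxx) at the start, which belong to a sequence that began earlier.
const(char)[] cropLeftUtf8(const(char)[] s)
{
    size_t i = 0;
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return s[i .. $];
}

// Hypothetical UTF-8-only stand-in for cropRight: if the last sequence is
// truncated, return the text without it; otherwise return s unchanged.
const(char)[] cropRightUtf8(const(char)[] s)
{
    // Scan back over trailing continuation bytes to the last lead byte.
    size_t j = s.length;
    while (j > 0 && (s[j - 1] & 0xC0) == 0x80)
        --j;
    if (j == 0)
        return s; // empty, or no lead byte at all: nothing sensible to crop
    --j; // index of the last lead (or ASCII) byte

    immutable b = s[j];
    // Bytes the sequence needs; invalid lead bytes are treated as 4 here.
    immutable size_t need = b < 0x80 ? 1
        : (b & 0xE0) == 0xC0 ? 2
        : (b & 0xF0) == 0xE0 ? 3
        : 4;
    return (j + need > s.length) ? s[0 .. j] : s;
}
```

Both functions return slices of the input and never allocate, which suits text read in fixed-size chunks.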

decode
dchar decode(cstring src, size_t ate)

Decodes a single dchar from the given src text, and indicates how many chars were consumed from src to do so.

decode
dchar decode(const(wchar)[] src, size_t ate)

Decodes a single dchar from the given src text, and indicates how many wchars were consumed from src to do so.
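The following standalone sketch shows the idea behind decoding one code point from utf16 and reporting how many wchars were consumed. decodeUtf16 is a hypothetical name and the error handling is an assumption; the module's own decode may differ in both respects.

```d
// Hypothetical sketch: decode one code point from the front of src and
// report through ate how many wchars it occupied (1, or 2 for a pair).
dchar decodeUtf16(const(wchar)[] src, out size_t ate)
{
    if (src.length == 0)
        throw new Exception("empty input");

    uint hi = src[0];
    if (hi >= 0xD800 && hi <= 0xDBFF)
    {
        // High surrogate: a low surrogate must follow.
        if (src.length < 2)
            throw new Exception("incomplete surrogate pair");
        uint lo = src[1];
        if (lo < 0xDC00 || lo > 0xDFFF)
            throw new Exception("unpaired high surrogate");
        ate = 2;
        return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
    }

    // Anything else is taken as a single-unit code point in this sketch.
    ate = 1;
    return cast(dchar) hi;
}
```

For example, decoding "\U0001F600"w yields the code point 0x1F600 with ate set to 2.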

encode
mstring encode(mstring dst, dchar c)

Encode a dchar into the provided dst array, and return a slice of it representing the encoding.

encode
wchar[] encode(wchar[] dst, dchar c)

Encode a dchar into the provided dst array, and return a slice of it representing the encoding.
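A minimal sketch of the "encode into a caller-supplied buffer and return a slice" pattern, for the utf8 case. encodeUtf8 is a hypothetical name, dst is assumed to hold at least four chars, and surrogate code points are not rejected here; the module's encode may behave differently on those points.

```d
// Hypothetical sketch: write the UTF-8 encoding of c into dst and return
// the slice of dst that was filled. dst must have room for 4 chars.
char[] encodeUtf8(char[] dst, dchar c)
{
    if (c < 0x80)
    {
        dst[0] = cast(char) c;
        return dst[0 .. 1];
    }
    if (c < 0x800)
    {
        dst[0] = cast(char)(0xC0 | (c >> 6));
        dst[1] = cast(char)(0x80 | (c & 0x3F));
        return dst[0 .. 2];
    }
    if (c < 0x10000)
    {
        dst[0] = cast(char)(0xE0 | (c >> 12));
        dst[1] = cast(char)(0x80 | ((c >> 6) & 0x3F));
        dst[2] = cast(char)(0x80 | (c & 0x3F));
        return dst[0 .. 3];
    }
    if (c <= 0x10FFFF)
    {
        dst[0] = cast(char)(0xF0 | (c >> 18));
        dst[1] = cast(char)(0x80 | ((c >> 12) & 0x3F));
        dst[2] = cast(char)(0x80 | ((c >> 6) & 0x3F));
        dst[3] = cast(char)(0x80 | (c & 0x3F));
        return dst[0 .. 4];
    }
    throw new Exception("code point above 0x10FFFF");
}
```

Because the result is a slice of dst, no allocation takes place; the caller can reuse the same buffer for every character.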

fromString16
const(char)[] fromString16(const(wchar)[] s, char[] dst)

Convert from a wchar[] into the type of the dst provided.

fromString16
const(wchar)[] fromString16(const(wchar)[] s, wchar[] dst)

Undocumented in source. Be warned that the author may not have intended to support it.

fromString16
const(dchar)[] fromString16(const(wchar)[] s, dchar[] dst)

Undocumented in source. Be warned that the author may not have intended to support it.

fromString32
const(char)[] fromString32(const(dchar)[] s, char[] dst)

Convert from a dchar[] into the type of the dst provided.

fromString32
const(wchar)[] fromString32(const(dchar)[] s, wchar[] dst)

Undocumented in source. Be warned that the author may not have intended to support it.

fromString32
const(dchar)[] fromString32(const(dchar)[] s, dchar[] dst)

Undocumented in source. Be warned that the author may not have intended to support it.

fromString8
const(T)[] fromString8(cstring s, T[] dst)

Convert from a char[] into the type of the dst provided.

isValid
bool isValid(dchar c)

Is the given character valid?
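The description above does not spell out what "valid" means. One common definition, shown here purely as an assumption, is that the value is a Unicode scalar value: no greater than 0x10FFFF and outside the surrogate range 0xD800 to 0xDFFF. isScalarValue is a hypothetical name and may not match the criterion isValid actually applies.

```d
// Hypothetical sketch of one plausible validity test: accept Unicode
// scalar values only (no surrogates, nothing above 0x10FFFF).
bool isScalarValue(dchar c)
{
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}
```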

main
void main()

toString
const(char)[] toString(const(char)[] src, char[] dst, size_t* ate)

Symmetric calls for equivalent types; these return the provided input with no conversion.

toString
const(wchar)[] toString(const(wchar)[] src, wchar[] dst, size_t* ate)

Undocumented in source. Be warned that the author may not have intended to support it.

toString
const(dchar)[] toString(const(dchar)[] src, dchar[] dst, size_t* ate)

Undocumented in source. Be warned that the author may not have intended to support it.

toString
void toString(const(char)[] input, size_t delegate(cstring) dg)
void toString(const(wchar)[] input, size_t delegate(cstring) dg)
void toString(const(dchar)[] input, size_t delegate(cstring) dg)

Encode a string of characters into a UTF-8 string, providing one character at a time to the delegate.
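The delegate-based form can be pictured with the following sketch, which handles dchar input only and uses Phobos' std.utf.encode for the per-character encoding instead of this module's own routines. eachEncoded is a hypothetical name. The meaning of the delegate's size_t return value is not stated above, so this sketch simply ignores it.

```d
import std.utf : encode;

// Hypothetical sketch: encode each code point of input as UTF-8 and hand
// the resulting 1 to 4 chars to dg, one character per call.
void eachEncoded(const(dchar)[] input,
                 scope size_t delegate(const(char)[]) dg)
{
    char[4] buf;
    foreach (dchar c; input)
    {
        immutable n = encode(buf, c); // number of chars written to buf
        dg(buf[0 .. n]);              // return value ignored in this sketch
    }
}
```

Since buf is reused on every iteration, a delegate that wants to keep the data must copy it before returning.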

toString
mstring toString(const(wchar)[] input, mstring output, size_t* ate)

Encode Utf8, using at most 4 bytes per character (five & six byte variations are not supported).

toString
mstring toString(const(dchar)[] input, mstring output, size_t* ate)

Encode Utf8, using at most 4 bytes per character (five & six byte variations are not supported). Throws an exception where the input dchar is greater than 0x10ffff.

toString16
wchar[] toString16(cstring input, wchar[] output, size_t* ate)

Decode Utf8 produced by the above toString() method.

toString16
wchar[] toString16(const(dchar)[] input, wchar[] output, size_t* ate)

Encode Utf16, using at most two wchars (a surrogate pair) per character. Throws an exception where the input dchar is greater than 0x10ffff.
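For reference, the utf16 encoding of a single dchar can be sketched as follows. encodeUtf16 is a hypothetical name, dst is assumed to hold at least two wchars, and the code illustrates the encoding form itself, not the tuned loop used by this module.

```d
// Hypothetical sketch: write the UTF-16 encoding of c into dst and return
// the slice of dst that was filled (one wchar, or two for a surrogate pair).
wchar[] encodeUtf16(wchar[] dst, dchar c)
{
    if (c > 0x10FFFF)
        throw new Exception("code point above 0x10FFFF");
    if (c < 0x10000)
    {
        dst[0] = cast(wchar) c;
        return dst[0 .. 1];
    }
    // Split the 20 bits above 0x10000 across a high and a low surrogate.
    immutable uint v = c - 0x10000;
    dst[0] = cast(wchar)(0xD800 | (v >> 10));
    dst[1] = cast(wchar)(0xDC00 | (v & 0x3FF));
    return dst[0 .. 2];
}
```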

toString32
dchar[] toString32(const(char)[] input, dchar[] output, size_t* ate)

Decode Utf8 produced by the above toString() method.

toString32
dchar[] toString32(const(wchar)[] input, dchar[] output, size_t* ate)

Decode Utf16 produced by the above toString16() method.

Meta

License

Tango Dual License: 3-Clause BSD License / Academic Free License v3.0. See LICENSE_TANGO.txt for details.

Version

Initial release: Oct 2004

Authors

Kris