Description
Returns a substring of str that respects Unicode character boundaries.
License
Apache License, Version 2.0
Parameter
Parameter | Description |
---|---|
str | the original String |
begin | the beginning index, inclusive |
end | the ending index, exclusive |
Exception
Exception | Condition |
---|---|
IndexOutOfBoundsException | if begin is negative, end is larger than the length of str, or begin is larger than end |
Return
the specified substring, with its boundaries adjusted where necessary so that no Unicode surrogate pair is split
Declaration
private static String unicodePreservingSubstring(String str, int begin, int end)
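To make the boundary adjustment concrete, here is a minimal sketch comparing the method with plain `String.substring`. The class name `SurrogateBoundaryDemo` is hypothetical, and the two helpers are copies of the logic documented on this page, made package-private so that the demo is self-contained.

```java
// Hypothetical demo class; the helpers mirror the logic documented on this page.
class SurrogateBoundaryDemo {

    // Moves an index that points at the low half of a valid surrogate pair back by one.
    static int unicodePreservingIndex(String str, int index) {
        if (index > 0 && index < str.length()
                && Character.isHighSurrogate(str.charAt(index - 1))
                && Character.isLowSurrogate(str.charAt(index))) {
            return index - 1;
        }
        return index;
    }

    static String unicodePreservingSubstring(String str, int begin, int end) {
        return str.substring(unicodePreservingIndex(str, begin), unicodePreservingIndex(str, end));
    }

    public static void main(String[] args) {
        // "a", then U+1F600 stored as a high/low surrogate pair, then "b":
        // four UTF-16 code units in total.
        String s = "a\uD83D\uDE00b";

        // Plain substring cuts the pair and keeps a lone high surrogate.
        String plain = s.substring(0, 2);
        System.out.println(plain.length()); // prints 2

        // Here end = 2 points at the low surrogate, so it is moved back to 1.
        String safe = unicodePreservingSubstring(s, 0, 2);
        System.out.println(safe); // prints a

        // Here begin = 2 is moved back to 1, so the result has 3 code units, not end - begin = 2.
        System.out.println(unicodePreservingSubstring(s, 2, 4).length()); // prints 3
    }
}
```

The last call shows why the result length can differ from `end - begin`: an index that lands inside a pair is always moved toward the start of the string, never forward.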
Method Source Code
//package com.java2s;
// Licensed under the Apache License, Version 2.0 (the "License");
public class Main {
    /**
     * Returns a substring of {@code str} that respects Unicode character
     * boundaries.
     *
     * <p>The string will never be split between a [high, low] surrogate pair,
     * as defined by {@link Character#isHighSurrogate} and
     * {@link Character#isLowSurrogate}.
     *
     * <p>If {@code begin} or {@code end} points at the low surrogate of such
     * a pair, that index is offset by -1.
     *
     * <p>This behavior guarantees that
     * {@code str.equals(unicodePreservingSubstring(str, 0, n) +
     * unicodePreservingSubstring(str, n, str.length()))} is
     * true for every {@code n} from 0 to {@code str.length()}.
     *
     * <p>This means that, unlike {@link String#substring(int, int)}, the length
     * of the returned substring may differ from {@code end - begin}.
     *
     * @param str the original String
     * @param begin the beginning index, inclusive
     * @param end the ending index, exclusive
     * @return the specified substring, possibly adjusted in order to not
     *         split Unicode surrogate pairs
     * @throws IndexOutOfBoundsException if {@code begin} is negative,
     *         or {@code end} is larger than the length of {@code str}, or
     *         {@code begin} is larger than {@code end}
     */
    private static String unicodePreservingSubstring(String str, int begin, int end) {
        return str.substring(unicodePreservingIndex(str, begin), unicodePreservingIndex(str, end));
    }

    /**
     * Normalizes {@code index} such that it respects Unicode character
     * boundaries in {@code str}.
     *
     * <p>If {@code index} is the low surrogate of a Unicode character,
     * the method returns {@code index - 1}. Otherwise, {@code index} is
     * returned.
     *
     * <p>If {@code index} falls in an invalid surrogate pair
     * (e.g. consecutive low surrogates, consecutive high surrogates), or
     * if it is not a valid index into {@code str}, the original value of
     * {@code index} is returned.
     *
     * @param str the String
     * @param index the index to be normalized
     * @return a normalized index that does not split a Unicode character
     */
    private static int unicodePreservingIndex(String str, int index) {
        // Only interior indices can sit between the two halves of a pair.
        if (index > 0 && index < str.length()) {
            if (Character.isHighSurrogate(str.charAt(index - 1)) && Character.isLowSurrogate(str.charAt(index))) {
                return index - 1;
            }
        }
        return index;
    }
}
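The Javadoc above states that splitting a string at any index and re-joining the two halves reproduces the original. A rough way to check that claim is sketched below; the class `SplitInvariantDemo` and the helper `rejoinsAtEverySplit` are hypothetical names introduced for illustration, and the two substring helpers are package-private copies of the source above.

```java
// Hypothetical demo class used only to exercise the documented invariant.
class SplitInvariantDemo {

    static int unicodePreservingIndex(String str, int index) {
        if (index > 0 && index < str.length()
                && Character.isHighSurrogate(str.charAt(index - 1))
                && Character.isLowSurrogate(str.charAt(index))) {
            return index - 1;
        }
        return index;
    }

    static String unicodePreservingSubstring(String str, int begin, int end) {
        return str.substring(unicodePreservingIndex(str, begin), unicodePreservingIndex(str, end));
    }

    // Returns true if, for every split index n in [0, length], head + tail equals str.
    static boolean rejoinsAtEverySplit(String str) {
        for (int n = 0; n <= str.length(); n++) {
            String head = unicodePreservingSubstring(str, 0, n);
            String tail = unicodePreservingSubstring(str, n, str.length());
            if (!str.equals(head + tail)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A well-formed string containing one surrogate pair.
        System.out.println(rejoinsAtEverySplit("a\uD83D\uDE00b")); // prints true

        // A malformed string: a lone high surrogate followed by a valid pair.
        System.out.println(rejoinsAtEverySplit("\uD83D\uD83D\uDE00")); // prints true
    }
}
```

The invariant holds because both halves pass the same `n` through `unicodePreservingIndex`, so they always agree on the adjusted split point, while the outer indices 0 and `str.length()` are never adjusted.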
Related
- unicodeConvert(String str)
- unicodeCount(String sStr)
- unicodeEncode(String s)
- unicodeHTMLEscape(final String s)
- unicodePreservingIndex(String str, int index)
- unicodeToChar(char[] unicode)
- unicodeToHTMLUnicodeEntity(final String text)
- unicodeTrim(String s)