Java String.split: The Escape-Character Pitfall When Splitting Strings

Demon.Lee, September 29, 2020

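Problem

While developing over the past couple of days, I ran into a pitfall with string splitting in Java. Consider the following code: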

  @Slf4j
  public class StringTest {

    private static final String SPLIT_SYMBOL = "|";

    /**
     * Incorrect example: "|" is passed to split() unescaped
     */
    @Test
    public void testSplit() {
        String str = "abc" + SPLIT_SYMBOL + "test123" + SPLIT_SYMBOL + "你好world";
        String[] strArr = str.split(SPLIT_SYMBOL);
        log.info("strArr.len: {}", strArr.length);
        for (String s : strArr) {
            log.info("strArr element: {}", s);
        }
    }
  }

The output is:

09:21:55.453 [main] INFO com.practice.learn.devtraps.StringTest - strArr.len: 19
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: a
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: b
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: c
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: |
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: t
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: e
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: s
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: t
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 1
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 2
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 3
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: |
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 你
09:21:55.454 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 好
09:21:55.455 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: w
09:21:55.455 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: o
09:21:55.455 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: r
09:21:55.455 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: l
09:21:55.455 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: d

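Analysis

After some digging I found the cause: String.split(String regex) treats its argument as a regular expression, and in a regular expression "|" is the OR (alternation) operator. A bare "|" is an alternation between two empty patterns, so it matches the empty string at every position, and the input is cut between every pair of characters. That is exactly the 19 one-character elements shown above. (I have not yet traced every detail through the JDK source; I will add more once I have studied regular expressions and debugged the Java source code.) The relevant part of the String class is as follows (JDK version: jdk-11.0.7):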

public final class String implements java.io.Serializable, Comparable<String>, CharSequence {
    ...
    public String[] split(String regex) {
        return split(regex, 0);
    }
    ...
    ...
    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
         (1)one-char String and this character is not one of the
            RegEx's meta characters ".$|()[{^?*+\\", or
         (2)two-char String and the first char is the backslash and
            the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.length() == 1 &&
             ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            ...
            ...
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }
    ...
}
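
To make the two code paths concrete, here is a small sketch comparing a delimiter that qualifies for the fast path with metacharacters that fall through to Pattern.compile. The class name SplitPathsDemo, the helper show and the sample strings are made up for illustration and are not part of the original test. The results in the comments assume Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;

// Hypothetical demo class; names and sample strings are illustrative only.
final class SplitPathsDemo {

    // Splits the input with String.split and renders the resulting array as text.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // "," is not a regex metacharacter: fast path, literal split.
        System.out.println(show("a,b,c", ","));    // [a, b, c]

        // "|" alone matches the empty string at every position,
        // so the input is cut between every character.
        System.out.println(show("a|b", "|"));      // [a, |, b]

        // "." matches any character, so every piece is empty and all
        // of them are dropped as trailing empty strings.
        System.out.println(show("a.b.c", "."));    // []

        // Backslash followed by a metacharacter is the two-char fast path.
        System.out.println(show("a|b", "\\|"));    // [a, b]
        System.out.println(show("a.b.c", "\\."));  // [a, b, c]
    }
}
```

The "." case is arguably the nastier one, since it silently returns an empty array instead of visibly wrong elements.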

Solution

The fix is simple: escape the special character, i.e. "|" --> "\\|". The code becomes:

  @Slf4j
  public class StringTest {

    private static final String SPLIT_SYMBOL = "|";
    private static final String SPLIT_SYMBOL_WITH_ESCAPE_CHARACTER = "\\" + SPLIT_SYMBOL;

    /**
     * Splitting on characters such as '|' requires an escape character
     */
    @Test
    public void testSplit() {
        String str = "abc" + SPLIT_SYMBOL + "test123" + SPLIT_SYMBOL + "你好world";
        String[] strArr = str.split(SPLIT_SYMBOL_WITH_ESCAPE_CHARACTER);
        log.info("strArr.len: {}", strArr.length);
        for (String s : strArr) {
            log.info("strArr element: {}", s);
        }
    }
  }

The output is:

09:43:21.060 [main] INFO com.practice.learn.devtraps.StringTest - strArr.len: 3
09:43:21.060 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: abc
09:43:21.060 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: test123
09:43:21.060 [main] INFO com.practice.learn.devtraps.StringTest - strArr element: 你好world

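Going a Bit Deeper

Adding "\\" by hand is not a great idea, though, because I can never remember which characters need escaping and which do not. It is better to let the JDK wrap the delimiter and make that decision itself, and the JDK developers thought of this long ago: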

public final class Pattern implements java.io.Serializable
{
  ...
  ...
  public static String quote(String s) {
     int slashEIndex = s.indexOf("\\E");
     if (slashEIndex == -1)
         return "\\Q" + s + "\\E";

     int lenHint = s.length();
     lenHint = (lenHint < Integer.MAX_VALUE - 8 - lenHint) ?
             (lenHint << 1) : (Integer.MAX_VALUE - 8);

     StringBuilder sb = new StringBuilder(lenHint);
     sb.append("\\Q");
     int current = 0;
     do {
         sb.append(s, current, slashEIndex)
                 .append("\\E\\\\E\\Q");
         current = slashEIndex + 2;
     } while ((slashEIndex = s.indexOf("\\E", current)) != -1);

     return sb.append(s, current, s.length())
             .append("\\E")
             .toString();
  }
  ...
  ...
}
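
As a quick illustration of what the method above returns, the following sketch prints the quoted form of two strings. In a regular expression, \Q ... \E marks a region that is taken literally; the loop in quote only matters when the input itself contains \E. The class name QuoteDemo and the helper matchesLiterally are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or the original post.
final class QuoteDemo {

    // True only if the candidate is exactly the literal text, character for character.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.matches(Pattern.quote(literal), candidate);
    }

    public static void main(String[] args) {
        // No "\E" in the input: the text is simply wrapped in \Q ... \E.
        System.out.println(Pattern.quote("|"));      // \Q|\E

        // Input contains "\E": the literal region is closed, an escaped
        // backslash plus 'E' is emitted, and a new region is opened.
        System.out.println(Pattern.quote("a\\Eb"));  // \Qa\E\\E\Qb\E

        // The quoted pattern matches the original text and nothing else.
        System.out.println(matchesLiterally("a\\Eb", "a\\Eb")); // true
        System.out.println(matchesLiterally("a.c", "abc"));     // false
    }
}
```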

All we need to do is call Pattern.quote to wrap the regex. For the test code above, the change is trivial:

private static final String SPLIT_SYMBOL_WITH_ESCAPE_CHARACTER = Pattern.quote(SPLIT_SYMBOL);

The test then runs without any problems and produces the same three elements as the manually escaped version.
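
One way to avoid repeating the Pattern.quote call at every call site is a tiny helper. The following is a minimal sketch, assuming the delimiter is always meant as plain text and is never empty; the class name LiteralSplit and the method splitLiteral are hypothetical names, not JDK APIs. It also shows that quoting works for multi-character delimiters made up entirely of metacharacters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
final class LiteralSplit {

    // Treats the delimiter as plain text, never as a regular expression.
    // Assumes a non-empty delimiter.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitLiteral("abc|test123|hello", "|"))); // [abc, test123, hello]
        System.out.println(Arrays.toString(splitLiteral("k1.*v1.*v2", ".*")));       // [k1, v1, v2]
        System.out.println(Arrays.toString(splitLiteral("a||b||c", "||")));          // [a, b, c]
    }
}
```

Note that a quoted delimiter is longer than two characters, so it does not qualify for the fast path in String.split and goes through Pattern.compile on each call.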

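Conclusion

  1. When splitting strings in Java, always keep one thing in mind: the argument is a regular expression.
  2. Wrap the delimiter with Pattern.quote(String s) whenever it is meant as literal text.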