总的来说,敏感词过滤就是词库匹配,你定义一个词库,里面有很多敏感词,匹配到了就说明这个词是敏感词。
所以最简单的办法就是建立一个list,先把所有的敏感词读进这个list,然后再利用list的contains方法,就可以判断某一句话中是否有敏感词,如果有就弹个提示,告诉用户语句中有敏感词,禁止用户发送,但是如果须要把把敏感词屏蔽掉(比如用” * “号代替)这个时候contains方法就不行了,得自己写算法判断敏感词所在的位置并屏蔽掉,实现起来并不那么简单。
下面有个简单的办法,大体的思路就是构造词库的时候改变一下思路,不再是一个敏感词一组了,而是将敏感词变成一棵树,用Map集合来存放,举个例子:有个敏感词叫翻墙,那么key就是翻字(具体会转成char类型),value是一个节点对象,它有一个指向下一个字的引用,用代码描述大概是这样{翻=对象类{key=墙,下一个对象{},是否是结尾}}
,类似的结构很常见,值得一提的是,用一个标志位来表示这棵树是否到头了,这样做有个很明显的好处,比如有两个敏感词,一个叫法轮功,一个叫法轮大法,那么这两个敏感词会存放在同一课树中,key都是法,后面是轮,完了有两个分支,一个叫功,一个叫大法,每一个字前面都有一个标志位来标记结束,功和法字的标志位都是true,因为是两个词的结尾位置。主要就在于构建这个敏感词库,有了它查询敏感词就很简单了,下面是实现代码,词库添加和搜索敏感词都是毫秒级别的,生产上完全可用。
敏感词树的节点
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
| public class SensitiveWordNode {
private char key;
private Map<Character,SensitiveWordNode> nextNodes;
private boolean end;
public SensitiveWordNode (char key) { this.key = key; nextNodes = new HashMap<>(); end = false; }
public SensitiveWordNode getNextNode (char key) { return nextNodes.get(key); }
public void putNextNode (SensitiveWordNode node) { nextNodes.put (node.getKey (), node); }
public char getKey () { return key; }
public void setKey (char key) { this.key = key; }
public Map getNextNodes () { return nextNodes; }
public void setNextNodes (Map nextNodes) { this.nextNodes = nextNodes; }
public boolean isEnd () { return end; }
public void setEnd (boolean end) { this.end = end; }
@Override public String toString() { return "SensitiveWordNode{" + "key=" + key + ", nextNodes=" + nextNodes + ", end=" + end + '}'; } }
|
敏感词树
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163
| public class SensitiveWordTree {
private static final Logger LOG = Logger.getLogger("SensitiveWordTree"); private static SensitiveWordNode root = null;
private static final String ENCODING = "gbk";
private static final String filePath = "D:/1.txt";
private static Set readSensitiveWords() { Set keyWords = new HashSet(); BufferedReader reader = null; try { reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), ENCODING)); String line; while ((line = reader.readLine()) != null) { keyWords.add(line.trim()); } } catch (UnsupportedEncodingException e) { LOG.info("敏感词库编码错误!"); } catch (FileNotFoundException e) { LOG.info("敏感词库不存在!"); } catch (IOException e) { LOG.info("读取敏感词库失败!"); } finally { if (reader != null) { try { reader.close(); } catch (IOException e) { LOG.info("读取敏感词库失败!"); } } } return keyWords; }
private static void init() { Set<String> keyWords = readSensitiveWords(); root = new SensitiveWordNode(' '); for (String keyWord : keyWords) { createSensitiveWordNode(keyWord); } }
private static void createSensitiveWordNode(String keyWord) { if (root == null) { LOG.info("根节点不存在!"); return; } SensitiveWordNode nowNode = root; for (Character c : keyWord.toCharArray()) { SensitiveWordNode nextNode = nowNode.getNextNode(c); if (nextNode == null) { nextNode = new SensitiveWordNode(c); nowNode.putNextNode(nextNode); } nowNode = nextNode; } nowNode.setEnd(true); }
private static String censorWords(String text) { if (root == null) { init(); } StringBuilder sensitiveWords = new StringBuilder(); StringBuilder temp_sensitiveWords = new StringBuilder(); char[] text_to_char = text.toCharArray(); SensitiveWordNode sensitiveWordNode = root; SensitiveWordNode this_sensitiveWordNode = null; boolean flag; for (int start = 0; start < text_to_char.length; start++) { SensitiveWordNode temp_sensitiveWordNode = sensitiveWordNode.getNextNode(text_to_char[start]);
if (temp_sensitiveWordNode != null || this_sensitiveWordNode != null && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null && this_sensitiveWordNode.getNextNode(text_to_char[start]).getKey() == text_to_char[start]) { flag = true;
} else { flag = false; temp_sensitiveWords = new StringBuilder(); this_sensitiveWordNode = null; }
if (flag) {
if (this_sensitiveWordNode != null && this_sensitiveWordNode.getNextNode(text_to_char[start]) == null) { if (this_sensitiveWordNode.isEnd()) { sensitiveWords.append(temp_sensitiveWords + ","); this_sensitiveWordNode = null; } temp_sensitiveWords = new StringBuilder(); }
if (temp_sensitiveWordNode != null && temp_sensitiveWordNode.isEnd() || this_sensitiveWordNode != null && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null && this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes() != null && this_sensitiveWordNode.getNextNode(text_to_char[start]).isEnd() && (start == text_to_char.length - 1 || this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes().size() == 0)) { temp_sensitiveWords.append(text_to_char[start] + ","); sensitiveWords.append(temp_sensitiveWords); temp_sensitiveWords = new StringBuilder(); this_sensitiveWordNode = null; } else { temp_sensitiveWords.append(text_to_char[start]); if (this_sensitiveWordNode == null) { this_sensitiveWordNode = temp_sensitiveWordNode; } if (this_sensitiveWordNode != null && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null && this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes().size() != 0) { this_sensitiveWordNode = this_sensitiveWordNode.getNextNode(text_to_char[start]); } }
}
} return sensitiveWords.length() > 0 ? sensitiveWords.subSequence(0, sensitiveWords.length() - 1).toString() : "未匹配到敏感词"; }
public static void main(String[] args) { long start_time = System.currentTimeMillis(); init(); System.out.println(System.currentTimeMillis() - start_time); System.out.println(root); start_time = System.currentTimeMillis(); String list = censorWords("语文数,学数学,学英英语语文文"); System.out.println(System.currentTimeMillis() - start_time); System.out.println(list); }
}
|
下面这种算法的核心跟上面类似,只不过加入了符号识别和大小写以及全角半角识别,识别率更好一些。
敏感词节点
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
| public class WordNode {
private int value;
private List<WordNode> subNodes;
private boolean isLast;
public WordNode(int value) { this.value = value; }
public WordNode(int value, boolean isLast) { this.value = value; this.isLast = isLast; }
private WordNode addSubNode(final WordNode subNode) { if (subNodes == null) subNodes = new LinkedList<WordNode>(); subNodes.add(subNode); return subNode; }
public WordNode addIfNoExist(final int value, final boolean isLast) { if (subNodes == null) { return addSubNode(new WordNode(value, isLast)); } for (WordNode subNode : subNodes) { if (subNode.value == value) { if (!subNode.isLast && isLast) subNode.isLast = true; return subNode; } } return addSubNode(new WordNode(value, isLast)); }
public WordNode querySub(final int value) { if (subNodes == null) { return null; } for (WordNode subNode : subNodes) { if (subNode.value == value) return subNode; } return null; }
public boolean isLast() { return isLast; }
public void setLast(boolean isLast) { this.isLast = isLast; }
@Override public int hashCode() { return value; }
}
|
敏感词树
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213
| public class WordFilter {
private static final FilterSet set = new FilterSet(); private static final Map<Integer, WordNode> nodes = new HashMap<Integer, WordNode>(1024, 1); private static final Set<Integer> stopwdSet = new HashSet<>(); private static final char SIGN = '*';
static { try { long a = System.nanoTime(); init(); a = System.nanoTime() - a; System.out.println("加载时间 : " + a + "ns"); System.out.println("加载时间 : " + a / 1000000 + "ms"); } catch (Exception e) { throw new RuntimeException("初始化过滤器失败"); } }
private static void init() { addSensitiveWord(readWordFromFile("wd.txt")); addStopWord(readWordFromFile("stopwd.txt")); }
private static List<String> readWordFromFile(String path) { List<String> words; BufferedReader br = null; try { br = new BufferedReader(new InputStreamReader(WordFilter.class.getClassLoader().getResourceAsStream(path),"UTF-8")); words = new ArrayList<String>(1200); for (String buf = ""; (buf = br.readLine()) != null;) { if (buf == null || buf.trim().equals("")) continue; words.add(buf); } } catch (Exception e) { throw new RuntimeException(e); } finally { try { if (br != null) br.close(); } catch (IOException e) { } } return words; }
private static void addStopWord(final List<String> words) { if (words != null && words.size() > 0) { char[] chs; for (String curr : words) { chs = curr.toCharArray(); for (char c : chs) { stopwdSet.add(charConvert(c)); } } } }
private static void addSensitiveWord(final List<String> words) { if (words != null && words.size() > 0) { char[] chs; int fchar; int lastIndex; WordNode fnode; for (String curr : words) { chs = curr.toCharArray(); fchar = charConvert(chs[0]); if (!set.contains(fchar)) { set.add(fchar); fnode = new WordNode(fchar, chs.length == 1); nodes.put(fchar, fnode); } else { fnode = nodes.get(fchar); if (!fnode.isLast() && chs.length == 1) fnode.setLast(true); } lastIndex = chs.length - 1; for (int i = 1; i < chs.length; i++) { fnode = fnode.addIfNoExist(charConvert(chs[i]), i == lastIndex); } } } }
public static final String doFilter(final String src) { char[] chs = src.toCharArray(); int length = chs.length; int currc; int k; WordNode node; for (int i = 0; i < length; i++) { currc = charConvert(chs[i]); if (!set.contains(currc)) { continue; } node = nodes.get(currc); if (node == null) continue; boolean couldMark = false; int markNum = -1; if (node.isLast()) { couldMark = true; markNum = 0; } k = i; for (; ++k < length;) { int temp = charConvert(chs[k]); if (stopwdSet.contains(temp)) continue; node = node.querySub(temp); if (node == null) break; if (node.isLast()) { couldMark = true; markNum = k - i; } } if (couldMark) { for (k = 0; k <= markNum; k++) { chs[k + i] = SIGN; } i = i + markNum; } }
return new String(chs); }
public static final boolean isContains(final String src) { char[] chs = src.toCharArray(); int length = chs.length; int currc; int k; WordNode node; for (int i = 0; i < length; i++) { currc = charConvert(chs[i]); if (!set.contains(currc)) { continue; } node = nodes.get(currc); if (node == null) continue; boolean couldMark = false; if (node.isLast()) { couldMark = true; } k = i; for (; ++k < length;) { int temp = charConvert(chs[k]); if (stopwdSet.contains(temp)) continue; node = node.querySub(temp); if (node == null) break; if (node.isLast()) { couldMark = true; } } if (couldMark) { return true; } }
return false; }
private static int charConvert(char src) { int r = BCConvert.qj2bj(src); return (r >= 'A' && r <= 'Z') ? r + 32 : r; }
public static void main(String args[]) { System.out.println(doFilter("日%%##你你妈")); }
}
|
FilterSet,枚举了0~65535的所有char是否是某个敏感词开头的状态
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97
| public class FilterSet{
private final long[] elements;
public FilterSet() { elements = new long[1 + (65535 >>> 6)]; } public void add(final int no) { elements[no >>> 6] |= (1L << (no & 63)); } public void add(final int... no) { for(int currNo : no) elements[currNo >>> 6] |= (1L << (currNo & 63)); } public void remove(final int no) { elements[no >>> 6] &= ~(1L << (no & 63)); }
public boolean addAndNotify(final int no) { int eWordNum = no >>> 6; long oldElements = elements[eWordNum]; elements[eWordNum] |= (1L << (no & 63)); boolean result = elements[eWordNum] != oldElements;
return result; }
public boolean removeAndNotify(final int no) { int eWordNum = no >>> 6; long oldElements = elements[eWordNum]; elements[eWordNum] &= ~(1L << (no & 63)); boolean result = elements[eWordNum] != oldElements; return result; } public boolean contains(final int no) { return (elements[no >>> 6] & (1L << (no & 63))) != 0; }
public boolean containsAll(final int... no) { if(no.length==0) return true; for(int currNo : no) if((elements[currNo >>> 6] & (1L << (currNo & 63))) == 0) return false; return true; }
public boolean containsAll_ueslessWay(final int... no) { long[] elements = new long[this.elements.length]; for(int currNo : no){ elements[currNo >>> 6] |= (1L << (currNo & 63)); } for (int i = 0; i < elements.length; i++) if ((elements[i] & ~this.elements[i]) != 0) return false; return true; }
public int size() { int size = 0; for (long element : elements) size += Long.bitCount(element); return size; } public static void main(String[] args) { FilterSet oi = new FilterSet(); System.out.println(oi.elements.length); }
}
|
全角/半角转换
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
| public class BCConvert {
static final char DBC_CHAR_START = 33;
static final char DBC_CHAR_END = 126;
static final char SBC_CHAR_START = 65281;
static final char SBC_CHAR_END = 65374;
static final int CONVERT_STEP = 65248;
static final char SBC_SPACE = 12288;
static final char DBC_SPACE = ' ';
public static String bj2qj(String src) { if (src == null) { return src; } StringBuilder buf = new StringBuilder(src.length()); char[] ca = src.toCharArray(); for (int i = 0; i < ca.length; i++) { if (ca[i] == DBC_SPACE) { buf.append(SBC_SPACE); } else if ((ca[i] >= DBC_CHAR_START) && (ca[i] <= DBC_CHAR_END)) { buf.append((char) (ca[i] + CONVERT_STEP)); } else { buf.append(ca[i]); } } return buf.toString(); }
public static int bj2qj(char src) { int r = src; if (src == DBC_SPACE) { src = SBC_SPACE; } else if ((src >= DBC_CHAR_START) && (src <= DBC_CHAR_END)) { r = src + CONVERT_STEP; } return r; }
public static String qj2bj(String src) { if (src == null) { return src; } StringBuilder buf = new StringBuilder(src.length()); char[] ca = src.toCharArray(); for (int i = 0; i < src.length(); i++) { if (ca[i] >= SBC_CHAR_START && ca[i] <= SBC_CHAR_END) { buf.append((char) (ca[i] - CONVERT_STEP)); } else if (ca[i] == SBC_SPACE) { buf.append(DBC_SPACE); } else { buf.append(ca[i]); } } return buf.toString(); }
public static int qj2bj(char src) { int r = src; if (src >= SBC_CHAR_START && src <= SBC_CHAR_END) { r = src - CONVERT_STEP; } else if (src == SBC_SPACE) { r = DBC_SPACE; } return r; }
}
|
以上两种写法的代码我都托管在了这个地址上,当然最好还是自己动手写一写,这样更便于理解。