In essence, sensitive-word filtering is lexicon matching: you define a lexicon of banned words, and any text that matches an entry contains a sensitive word.
The simplest approach is to load all the sensitive words into a List and use its contains method to check whether a sentence includes one; if it does, pop up a warning telling the user the message contains a sensitive word and block it from being sent. But if you need to mask the word instead (replacing it with "*", say), contains is no longer enough: you have to write your own logic to locate each occurrence and blank it out, which is not as simple as it sounds.

Here is a simple way to do it. The core idea is to change how the lexicon is built: instead of storing each sensitive word as a flat entry, organize the words into a tree backed by a Map. For example, take the sensitive word 翻墙: the key is the character 翻 (stored as a char), and the value is a node object holding a reference to the next character; described in pseudo-structure it looks like {翻 = Node{key=墙, next={}, isEnd}}. This kind of structure is common (it is a trie); the detail worth highlighting is the flag on each node marking whether a word ends there. The benefit shows with words that share a prefix. Given two sensitive words, 法轮功 and 法轮大法, both live in the same tree: the key for both is 法, followed by 轮, which then branches into 功 and 大法. Every character carries the end-of-word flag; it is true on 功 and on the final 法, because those characters end their respective words. Building the lexicon tree is the hard part; once it exists, lookups are trivial. The implementation follows. Both inserting into the lexicon and searching for sensitive words run in milliseconds, which is entirely usable in production.
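The trie idea described above can be sketched in a few lines. This is a minimal illustration of the concept, not the implementation below; the class and method names here are made up for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: each node maps the next character to a child
// node and carries an end-of-word flag, so 法轮功 and 法轮大法 share
// the 法 -> 轮 prefix and branch afterwards.
class TrieSketch {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    // Insert one sensitive word, one character per level.
    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        }
        cur.end = true; // mark the last character as a word boundary
    }

    // Return true if any substring of text is a stored word.
    public boolean containsSensitive(String text) {
        for (int i = 0; i < text.length(); i++) {
            Node cur = root;
            for (int j = i; j < text.length(); j++) {
                cur = cur.next.get(text.charAt(j));
                if (cur == null) break;   // prefix dead-ends, restart at i+1
                if (cur.end) return true; // a complete word ends here
            }
        }
        return false;
    }
}
```

The outer loop tries every starting position, and the inner loop walks the tree character by character, so a shared prefix like 法轮 is only stored once no matter how many words hang off it.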

The sensitive-word tree node

import java.util.HashMap;
import java.util.Map;

public class SensitiveWordNode {

    // The character this node represents
    private char key;

    // Child nodes, keyed by character
    private Map<Character, SensitiveWordNode> nextNodes;

    // Whether a sensitive word ends at this node
    private boolean end;

    public SensitiveWordNode(char key) {
        this.key = key;
        nextNodes = new HashMap<>();
        end = false;
    }

    public SensitiveWordNode getNextNode(char key) {
        return nextNodes.get(key);
    }

    public void putNextNode(SensitiveWordNode node) {
        nextNodes.put(node.getKey(), node);
    }

    public char getKey() {
        return key;
    }

    public void setKey(char key) {
        this.key = key;
    }

    public Map<Character, SensitiveWordNode> getNextNodes() {
        return nextNodes;
    }

    public void setNextNodes(Map<Character, SensitiveWordNode> nextNodes) {
        this.nextNodes = nextNodes;
    }

    public boolean isEnd() {
        return end;
    }

    public void setEnd(boolean end) {
        this.end = end;
    }

    @Override
    public String toString() {
        return "SensitiveWordNode{" +
                "key=" + key +
                ", nextNodes=" + nextNodes +
                ", end=" + end +
                '}';
    }
}

The sensitive-word tree

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

public class SensitiveWordTree {

    // Logger
    private static final Logger LOG = Logger.getLogger("SensitiveWordTree");
    // Root node
    private static SensitiveWordNode root = null;
    // Lexicon file encoding
    // private static final String ENCODING = "utf-8";
    private static final String ENCODING = "gbk";
    // Lexicon file location
    // private static final String filePath = "/intellijProjects/resource/1.txt";
    private static final String filePath = "D:/1.txt";

    /**
     * Read the sensitive-word lexicon, one word per line.
     *
     * @return the set of words
     */
    private static Set<String> readSensitiveWords() {
        Set<String> keyWords = new HashSet<>();
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), ENCODING));
            String line;
            while ((line = reader.readLine()) != null) {
                keyWords.add(line.trim());
            }
        } catch (UnsupportedEncodingException e) {
            LOG.info("Bad encoding in sensitive-word lexicon!");
        } catch (FileNotFoundException e) {
            LOG.info("Sensitive-word lexicon not found!");
        } catch (IOException e) {
            LOG.info("Failed to read sensitive-word lexicon!");
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    LOG.info("Failed to close sensitive-word lexicon!");
                }
            }
        }
        return keyWords;
    }

    /**
     * Initialize the lexicon tree.
     */
    private static void init() {
        // Read the lexicon
        Set<String> keyWords = readSensitiveWords();
        // Initialize the root node
        root = new SensitiveWordNode(' ');
        // Insert every sensitive word
        for (String keyWord : keyWords) {
            createSensitiveWordNode(keyWord);
        }
    }

    /**
     * Insert one sensitive word into the tree.
     *
     * @param keyWord the word to insert
     */
    private static void createSensitiveWordNode(String keyWord) {
        if (root == null) {
            LOG.info("Root node does not exist!");
            return;
        }
        SensitiveWordNode nowNode = root;
        for (Character c : keyWord.toCharArray()) {
            SensitiveWordNode nextNode = nowNode.getNextNode(c);
            if (nextNode == null) {
                nextNode = new SensitiveWordNode(c);
                nowNode.putNextNode(nextNode);
            }
            nowNode = nextNode;
        }
        nowNode.setEnd(true);
    }

    /**
     * Scan the text for sensitive words.
     *
     * @return every sensitive word found, comma-separated
     */
    private static String censorWords(String text) {
        if (root == null) {
            init();
        }
        StringBuilder sensitiveWords = new StringBuilder();
        StringBuilder temp_sensitiveWords = new StringBuilder();
        char[] text_to_char = text.toCharArray();
        SensitiveWordNode sensitiveWordNode = root;
        SensitiveWordNode this_sensitiveWordNode = null;
        boolean flag;
        for (int start = 0; start < text_to_char.length; start++) {
            SensitiveWordNode temp_sensitiveWordNode = sensitiveWordNode.getNextNode(text_to_char[start]);

            if (temp_sensitiveWordNode != null || this_sensitiveWordNode != null
                    && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null
                    && this_sensitiveWordNode.getNextNode(text_to_char[start]).getKey() == text_to_char[start]) {
                flag = true;
            } else {
                flag = false;
                temp_sensitiveWords = new StringBuilder();
                this_sensitiveWordNode = null;
            }

            if (flag) {

                if (this_sensitiveWordNode != null
                        && this_sensitiveWordNode.getNextNode(text_to_char[start]) == null) {
                    if (this_sensitiveWordNode.isEnd()) {
                        sensitiveWords.append(temp_sensitiveWords).append(",");
                        this_sensitiveWordNode = null;
                    }
                    temp_sensitiveWords = new StringBuilder();
                }

                if (temp_sensitiveWordNode != null && temp_sensitiveWordNode.isEnd() || this_sensitiveWordNode != null
                        && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null
                        && this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes() != null
                        && this_sensitiveWordNode.getNextNode(text_to_char[start]).isEnd()
                        && (start == text_to_char.length - 1
                                || this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes().size() == 0)) {
                    temp_sensitiveWords.append(text_to_char[start]).append(",");
                    sensitiveWords.append(temp_sensitiveWords);
                    temp_sensitiveWords = new StringBuilder();
                    this_sensitiveWordNode = null;
                } else {
                    temp_sensitiveWords.append(text_to_char[start]);
                    if (this_sensitiveWordNode == null) {
                        this_sensitiveWordNode = temp_sensitiveWordNode;
                    }
                    if (this_sensitiveWordNode != null
                            && this_sensitiveWordNode.getNextNode(text_to_char[start]) != null
                            && this_sensitiveWordNode.getNextNode(text_to_char[start]).getNextNodes().size() != 0) {
                        this_sensitiveWordNode = this_sensitiveWordNode.getNextNode(text_to_char[start]);
                    }
                }

            }

        }
        return sensitiveWords.length() > 0
                ? sensitiveWords.subSequence(0, sensitiveWords.length() - 1).toString()
                : "no sensitive words matched";
    }

    public static void main(String[] args) {
        long start_time = System.currentTimeMillis();
        init();
        System.out.println(System.currentTimeMillis() - start_time);
        System.out.println(root);
        start_time = System.currentTimeMillis();
        String list = censorWords("语文数,学数学,学英英语语文文");
        System.out.println(System.currentTimeMillis() - start_time);
        System.out.println(list);
    }

}
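As noted in the introduction, masking matches with "*" needs the match positions, and with the tree in place that becomes a straightforward extension: remember where the longest match starting at each position ends, then overwrite that span. A minimal standalone sketch of this idea (the names here are illustrative, not part of the code above):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: scan the text with a trie and overwrite the longest
// match starting at each position with '*'.
class MaskSketch {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    public void add(String word) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        cur.end = true;
    }

    public String mask(String text) {
        char[] chs = text.toCharArray();
        for (int i = 0; i < chs.length; i++) {
            Node cur = root;
            int matchEnd = -1; // index just past the longest match so far
            for (int j = i; j < chs.length; j++) {
                cur = cur.next.get(chs[j]);
                if (cur == null) break;
                if (cur.end) matchEnd = j + 1; // longest match wins
            }
            if (matchEnd > i) {
                for (int k = i; k < matchEnd; k++) chs[k] = '*';
                i = matchEnd - 1; // skip past the masked word
            }
        }
        return new String(chs);
    }
}
```

For example, with 法轮功 in the lexicon, `mask("我反对法轮功组织")` yields `我反对***组织`.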



The next implementation shares the same core idea, but adds handling for interspersed symbols, letter case, and full-width/half-width variants, so its hit rate is better.

Sensitive-word node

import java.util.LinkedList;
import java.util.List;

public class WordNode {

    private int value; // the (converted) character this node represents

    private List<WordNode> subNodes; // child nodes

    private boolean isLast; // defaults to false

    public WordNode(int value) {
        this.value = value;
    }

    public WordNode(int value, boolean isLast) {
        this.value = value;
        this.isLast = isLast;
    }

    /**
     * @param subNode the node to add
     * @return the subNode that was passed in
     */
    private WordNode addSubNode(final WordNode subNode) {
        if (subNodes == null)
            subNodes = new LinkedList<WordNode>();
        subNodes.add(subNode);
        return subNode;
    }

    /**
     * Return the matching child if it exists; otherwise create, add and return it.
     *
     * @param value the character to look up
     */
    public WordNode addIfNoExist(final int value, final boolean isLast) {
        if (subNodes == null) {
            return addSubNode(new WordNode(value, isLast));
        }
        for (WordNode subNode : subNodes) {
            if (subNode.value == value) {
                if (!subNode.isLast && isLast)
                    subNode.isLast = true;
                return subNode;
            }
        }
        return addSubNode(new WordNode(value, isLast));
    }

    public WordNode querySub(final int value) {
        if (subNodes == null) {
            return null;
        }
        for (WordNode subNode : subNodes) {
            if (subNode.value == value)
                return subNode;
        }
        return null;
    }

    public boolean isLast() {
        return isLast;
    }

    public void setLast(boolean isLast) {
        this.isLast = isLast;
    }

    @Override
    public int hashCode() {
        return value;
    }

}

Sensitive-word tree

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WordFilter {

    private static final FilterSet set = new FilterSet(); // first characters of all words
    private static final Map<Integer, WordNode> nodes = new HashMap<Integer, WordNode>(1024, 1); // first-character nodes
    private static final Set<Integer> stopwdSet = new HashSet<>(); // stop characters, skipped while matching
    private static final char SIGN = '*'; // replacement character for matched words

    static {
        try {
            long a = System.nanoTime();
            init();
            a = System.nanoTime() - a;
            System.out.println("load time: " + a + "ns");
            System.out.println("load time: " + a / 1000000 + "ms");
        } catch (Exception e) {
            throw new RuntimeException("failed to initialize the filter");
        }
    }

    private static void init() {
        // load the sensitive words and the stop words
        addSensitiveWord(readWordFromFile("wd.txt"));
        addStopWord(readWordFromFile("stopwd.txt"));
    }

    /**
     * Read a word list from a classpath resource, one word per line.
     *
     * @param path resource path
     * @return the words
     */
    private static List<String> readWordFromFile(String path) {
        List<String> words;
        BufferedReader br = null;
        try {
            br = new BufferedReader(new InputStreamReader(WordFilter.class.getClassLoader().getResourceAsStream(path), "UTF-8"));
            words = new ArrayList<String>(1200);
            for (String buf; (buf = br.readLine()) != null;) {
                if (buf.trim().equals(""))
                    continue;
                words.add(buf);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            try {
                if (br != null)
                    br.close();
            } catch (IOException e) {
            }
        }
        return words;
    }

    /**
     * Register stop characters (noise characters ignored during matching).
     *
     * @param words the stop-word list
     */
    private static void addStopWord(final List<String> words) {
        if (words != null && words.size() > 0) {
            char[] chs;
            for (String curr : words) {
                chs = curr.toCharArray();
                for (char c : chs) {
                    stopwdSet.add(charConvert(c));
                }
            }
        }
    }

    /**
     * Build the DFA nodes from the word list.
     *
     * @param words the sensitive-word list
     */
    private static void addSensitiveWord(final List<String> words) {
        if (words != null && words.size() > 0) {
            char[] chs;
            int fchar;
            int lastIndex;
            WordNode fnode; // node for the first character
            for (String curr : words) {
                chs = curr.toCharArray();
                fchar = charConvert(chs[0]);
                if (!set.contains(fchar)) { // first character not seen before
                    set.add(fchar); // mark it (add is idempotent anyway)
                    fnode = new WordNode(fchar, chs.length == 1);
                    nodes.put(fchar, fnode);
                } else {
                    fnode = nodes.get(fchar);
                    if (!fnode.isLast() && chs.length == 1)
                        fnode.setLast(true);
                }
                lastIndex = chs.length - 1;
                for (int i = 1; i < chs.length; i++) {
                    fnode = fnode.addIfNoExist(charConvert(chs[i]), i == lastIndex);
                }
            }
        }
    }

    /**
     * Filter: replace every sensitive word in src with the mask character.
     *
     * @param src the input text
     * @return the masked text
     */
    public static final String doFilter(final String src) {
        char[] chs = src.toCharArray();
        int length = chs.length;
        int currc;
        int k;
        WordNode node;
        for (int i = 0; i < length; i++) {
            currc = charConvert(chs[i]);
            if (!set.contains(currc)) {
                continue;
            }
            node = nodes.get(currc); // e.g. the node for 日
            if (node == null) // should never happen, defensive check
                continue;
            boolean couldMark = false;
            int markNum = -1;
            if (node.isLast()) { // single-character match (e.g. 日)
                couldMark = true;
                markNum = 0;
            }
            // keep matching (e.g. 日你 / 日你妹), longest match wins
            k = i;
            for (; ++k < length;) {
                int temp = charConvert(chs[k]);
                if (stopwdSet.contains(temp))
                    continue;
                node = node.querySub(temp);
                if (node == null) // no further match
                    break;
                if (node.isLast()) {
                    couldMark = true;
                    markNum = k - i;
                }
            }
            if (couldMark) {
                for (k = 0; k <= markNum; k++) {
                    chs[k + i] = SIGN;
                }
                i = i + markNum;
            }
        }

        return new String(chs);
    }

    /**
     * Check whether src contains any sensitive word.
     *
     * @param src the input text
     */
    public static final boolean isContains(final String src) {
        char[] chs = src.toCharArray();
        int length = chs.length;
        int currc;
        int k;
        WordNode node;
        for (int i = 0; i < length; i++) {
            currc = charConvert(chs[i]);
            if (!set.contains(currc)) {
                continue;
            }
            node = nodes.get(currc); // e.g. the node for 日
            if (node == null) // should never happen, defensive check
                continue;
            boolean couldMark = false;
            if (node.isLast()) { // single-character match (e.g. 日)
                couldMark = true;
            }
            // keep matching (e.g. 日你 / 日你妹), longest match wins
            k = i;
            for (; ++k < length;) {
                int temp = charConvert(chs[k]);
                if (stopwdSet.contains(temp))
                    continue;
                node = node.querySub(temp);
                if (node == null) // no further match
                    break;
                if (node.isLast()) {
                    couldMark = true;
                }
            }
            if (couldMark) {
                return true;
            }
        }

        return false;
    }

    /**
     * Convert upper case to lower case and full-width to half-width.
     *
     * @param src the character to normalize
     */
    private static int charConvert(char src) {
        int r = BCConvert.qj2bj(src);
        return (r >= 'A' && r <= 'Z') ? r + 32 : r;
    }

    public static void main(String[] args) {
        System.out.println(doFilter("日%%##你你妈"));
    }

}

FilterSet, a bit set recording, for every char value 0–65535, whether it is the first character of some sensitive word
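The bit arithmetic FilterSet relies on is simple: the upper bits of a code point pick a long in an array, and the lower 6 bits pick a bit inside that long, so 65536 membership flags fit in 1024 longs (8 KB). A standalone sketch of the same trick:

```java
// Sketch of FilterSet's bit indexing: for a code point `no`,
// `no >>> 6` selects one of 1024 longs and `no & 63` selects a
// bit inside it.
class BitFlagSketch {
    private final long[] words = new long[1 + (65535 >>> 6)]; // 1024 longs

    public void add(int no) {
        words[no >>> 6] |= 1L << (no & 63); // set the bit
    }

    public boolean contains(int no) {
        return (words[no >>> 6] & (1L << (no & 63))) != 0; // test the bit
    }
}
```

This is why the first-character check in doFilter is a constant-time array-and-mask operation rather than a hash lookup.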

public class FilterSet {

    private final long[] elements;

    public FilterSet() {
        elements = new long[1 + (65535 >>> 6)];
    }

    public void add(final int no) {
        elements[no >>> 6] |= (1L << (no & 63));
    }

    public void add(final int... no) {
        for (int currNo : no)
            elements[currNo >>> 6] |= (1L << (currNo & 63));
    }

    public void remove(final int no) {
        elements[no >>> 6] &= ~(1L << (no & 63));
    }

    /**
     * @param no the value to add
     * @return true: newly added; false: was already present
     */
    public boolean addAndNotify(final int no) {
        int eWordNum = no >>> 6;
        long oldElements = elements[eWordNum];
        elements[eWordNum] |= (1L << (no & 63));
        return elements[eWordNum] != oldElements;
    }

    /**
     * @param no the value to remove
     * @return true: removed; false: was not present
     */
    public boolean removeAndNotify(final int no) {
        int eWordNum = no >>> 6;
        long oldElements = elements[eWordNum];
        elements[eWordNum] &= ~(1L << (no & 63));
        return elements[eWordNum] != oldElements;
    }

    public boolean contains(final int no) {
        return (elements[no >>> 6] & (1L << (no & 63))) != 0;
    }

    public boolean containsAll(final int... no) {
        if (no.length == 0)
            return true;
        for (int currNo : no)
            if ((elements[currNo >>> 6] & (1L << (currNo & 63))) == 0)
                return false;
        return true;
    }

    /**
     * No better than calling contains in a loop.
     *
     * @param no the values to test
     */
    public boolean containsAll_ueslessWay(final int... no) {
        long[] elements = new long[this.elements.length];
        for (int currNo : no) {
            elements[currNo >>> 6] |= (1L << (currNo & 63));
        } // by this point it already costs about as much as looping contains

        for (int i = 0; i < elements.length; i++)
            if ((elements[i] & ~this.elements[i]) != 0)
                return false;
        return true;
    }

    /**
     * size is not maintained incrementally; it is recomputed on each call.
     */
    public int size() {
        int size = 0;
        for (long element : elements)
            size += Long.bitCount(element);
        return size;
    }

    public static void main(String[] args) {
        FilterSet oi = new FilterSet();
        System.out.println(oi.elements.length);
    }

}

Full-width/half-width conversion

public class BCConvert {

    /**
     * Visible ASCII characters start at '!', decimal 33.
     */
    static final char DBC_CHAR_START = 33; // half-width '!'

    /**
     * Visible ASCII characters end at '~', decimal 126.
     */
    static final char DBC_CHAR_END = 126; // half-width '~'

    /**
     * The full-width counterparts of visible ASCII start at 65281.
     */
    static final char SBC_CHAR_START = 65281; // full-width '!'

    /**
     * The full-width counterparts of visible ASCII end at 65374.
     */
    static final char SBC_CHAR_END = 65374; // full-width '~'

    /**
     * Fixed offset between visible ASCII characters (except the space) and their full-width counterparts.
     */
    static final int CONVERT_STEP = 65248; // full/half-width offset

    /**
     * The full-width space does not follow the fixed ASCII offset and must be handled separately.
     */
    static final char SBC_SPACE = 12288; // full-width space, 12288

    /**
     * The half-width space, decimal 32 in ASCII.
     */
    static final char DBC_SPACE = ' '; // half-width space

    /**
     * <PRE>
     * Half-width -> full-width conversion.
     * Only the space and characters between '!' and '~' are converted; everything else is left alone.
     * </PRE>
     */
    public static String bj2qj(String src) {
        if (src == null) {
            return src;
        }
        StringBuilder buf = new StringBuilder(src.length());
        char[] ca = src.toCharArray();
        for (int i = 0; i < ca.length; i++) {
            if (ca[i] == DBC_SPACE) { // the half-width space becomes the full-width space
                buf.append(SBC_SPACE);
            } else if ((ca[i] >= DBC_CHAR_START) && (ca[i] <= DBC_CHAR_END)) { // visible character between '!' and '~'
                buf.append((char) (ca[i] + CONVERT_STEP));
            } else { // leave everything outside the visible ASCII range untouched
                buf.append(ca[i]);
            }
        }
        return buf.toString();
    }

    /**
     * Half-width to full-width for a single character.
     *
     * @param src the character to convert
     */
    public static int bj2qj(char src) {
        int r = src;
        if (src == DBC_SPACE) { // the half-width space becomes the full-width space
            r = SBC_SPACE;
        } else if ((src >= DBC_CHAR_START) && (src <= DBC_CHAR_END)) { // visible character between '!' and '~'
            r = src + CONVERT_STEP;
        }
        return r;
    }

    /**
     * <PRE>
     * Full-width -> half-width conversion.
     * Only the full-width space and full-width '!' to '~' are converted; everything else is left alone.
     * </PRE>
     */
    public static String qj2bj(String src) {
        if (src == null) {
            return src;
        }
        StringBuilder buf = new StringBuilder(src.length());
        char[] ca = src.toCharArray();
        for (int i = 0; i < src.length(); i++) {
            if (ca[i] >= SBC_CHAR_START && ca[i] <= SBC_CHAR_END) { // within full-width '!' to '~'
                buf.append((char) (ca[i] - CONVERT_STEP));
            } else if (ca[i] == SBC_SPACE) { // the full-width space
                buf.append(DBC_SPACE);
            } else { // leave everything outside those ranges untouched
                buf.append(ca[i]);
            }
        }
        return buf.toString();
    }

    /**
     * Full-width to half-width for a single character.
     *
     * @param src the character to convert
     */
    public static int qj2bj(char src) {
        int r = src;
        if (src >= SBC_CHAR_START && src <= SBC_CHAR_END) { // within full-width '!' to '~'
            r = src - CONVERT_STEP;
        } else if (src == SBC_SPACE) { // the full-width space
            r = DBC_SPACE;
        }
        return r;
    }

}
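The constant 65248 (0xFEE0) works because Unicode places the full-width forms of the visible ASCII characters (U+FF01 to U+FF5E) at a fixed offset above their half-width originals ('!' to '~'), so a single addition or subtraction converts between them; only the space pair breaks the pattern. A tiny standalone illustration of the same arithmetic (the class name here is made up):

```java
// Sketch: full-width visible characters (U+FF01..U+FF5E) sit exactly
// 0xFEE0 (65248) above their half-width ASCII counterparts ('!'..'~').
class WidthSketch {
    static final int OFFSET = 65248; // 0xFEE0

    public static char toHalfWidth(char c) {
        if (c == '\u3000') return ' ';          // the full-width space is the one exception
        if (c >= '\uFF01' && c <= '\uFF5E')     // full-width '!' .. '~'
            return (char) (c - OFFSET);
        return c;                               // everything else unchanged
    }
}
```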

I have hosted the code for both approaches at this address; even so, it is best to write it yourself, which makes it much easier to understand.