-
Publication No.: US09280747B1
Publication date: 2016-03-08
Application No.: US14928784
Filing date: 2015-10-30
Applicant: SAS Institute Inc.
Inventor: Ning Jin, James Allen Cox
CPC classification number: G06N7/005, G06F17/3053, G06F17/30705, G06N99/005, H04L51/04, H04L51/063
Abstract: Electronic communications can be normalized using feature sets. For example, an electronic representation of a noncanonical communication can be received, and multiple candidate canonical versions of the noncanonical communication can be determined. A first feature set representative of the noncanonical communication can be determined by splitting the noncanonical communication into at least one n-gram and at least one k-skip-n-gram. Multiple comparison feature sets can be determined by splitting multiple terms in training data into respective comparison feature sets. Multiple Jaccard index values can be determined using the first feature set and the multiple comparison feature sets. A subset of the multiple terms in the training data in which an associated Jaccard index value exceeds a threshold can be selected. The subset of the multiple terms can be included in the multiple candidate canonical versions. A normalized version of the noncanonical communication can be selected from the multiple candidate canonical versions.
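The candidate-selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names (`feature_set`, `jaccard`, `candidates`), the use of character bigrams, the `a|b` notation for skip-grams, the choice to count only gaps of 1 to `k` skipped characters, and the threshold of 0.3 are all assumptions made for the example.

```python
def feature_set(term, n=2, k=1):
    """Split a term into character n-grams plus k-skip-bigrams.

    Assumption: a skip-gram pairs two characters separated by 1..k
    skipped characters and is written "a|b" to keep it distinct from
    the contiguous n-gram "ab".
    """
    feats = {term[i:i + n] for i in range(len(term) - n + 1)}
    for i in range(len(term)):
        for gap in range(1, k + 1):
            j = i + gap + 1
            if j < len(term):
                feats.add(f"{term[i]}|{term[j]}")
    return feats


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidates(noncanonical, vocabulary, threshold=0.3):
    """Return vocabulary terms whose Jaccard index exceeds the threshold,
    ordered from most to least similar."""
    query = feature_set(noncanonical)
    scored = [(jaccard(query, feature_set(term)), term) for term in vocabulary]
    return [term for score, term in sorted(scored, reverse=True) if score > threshold]


if __name__ == "__main__":
    # Hypothetical training vocabulary, for illustration only.
    print(candidates("tomorow", ["tomorrow", "today"]))
```

Choosing the final normalized version from the candidate list is a separate step that the abstract does not detail, so it is left out of the sketch.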
-
Publication No.: US10762390B2
Publication date: 2020-09-01
Application No.: US15952833
Filing date: 2018-04-13
Applicant: SAS Institute Inc.
Inventor: Aysu Ezen Can, Ning Jin, Ethem F. Can, Xiangqian Hu, Saratendu Sethi
IPC: G06K9/62, G06N3/04, G06K9/46, G06N3/08, G06F40/279
Abstract: Machine-learning models and behavior can be visualized. For example, a machine-learning model can be taught using a teaching dataset. A test input can then be provided to the machine-learning model to determine a baseline confidence-score of the machine-learning model. Next, weights for elements in the teaching dataset can be determined. An analysis dataset can be generated that includes a subset of the elements that have corresponding weights above a predefined threshold. For each overlapping element in both the analysis dataset and the test input, (i) a modified version of the test input can be generated that excludes the overlapping element, and (ii) the modified version of the test input can be provided to the machine-learning model to determine an effect of the overlapping element on the baseline confidence-score. A graphical user interface can be generated that visually depicts the test input and various elements' effects on the baseline confidence-score.
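The leave-one-out analysis in this abstract might look roughly like the sketch below. It is only an illustration under stated assumptions: `toy_confidence` is a hypothetical stand-in for a trained model, the element weights are supplied as a plain dictionary (the abstract does not say how they are computed), and the names `element_effects` and `weight_threshold` are invented for the example.

```python
import math


def toy_confidence(tokens, weights):
    """Hypothetical stand-in for a trained model: the logistic function
    applied to the sum of the per-token weights."""
    score = sum(weights.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def element_effects(test_tokens, weights, predict, weight_threshold=0.5):
    """Measure how removing each high-weight element changes the confidence.

    Returns the baseline confidence and a dict mapping each overlapping
    element to (baseline - confidence without that element).
    """
    baseline = predict(test_tokens)
    # Analysis dataset: elements whose weight is above the threshold.
    analysis = {tok for tok, w in weights.items() if w > weight_threshold}
    effects = {}
    for tok in dict.fromkeys(test_tokens):  # unique tokens, input order kept
        if tok in analysis:
            modified = [t for t in test_tokens if t != tok]
            effects[tok] = baseline - predict(modified)
    return baseline, effects


if __name__ == "__main__":
    # Hypothetical weights and test input, for illustration only.
    weights = {"great": 2.0, "movie": 0.1, "boring": -1.5}
    predict = lambda tokens: toy_confidence(tokens, weights)
    baseline, effects = element_effects(["a", "great", "movie"], weights, predict)
    print(baseline, effects)
```

A positive effect means the element raised the baseline confidence score; a graphical interface such as the one the abstract mentions could, for example, shade each element of the test input by this value.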
-