-
Publication number: US11200514B1
Publication date: 2021-12-14
Application number: US17342825
Application date: 2021-06-09
Applicant: SAS Institute Inc.
IPC: G06N20/00
Abstract: Unclassified observations are classified. Similarity values are computed for each unclassified observation and for each target variable value. A confidence value is computed for each unclassified observation using the similarity values. A high-confidence threshold value and a low-confidence threshold value are computed from the confidence values. For each observation, when the confidence value is greater than the high-confidence threshold value, the observation is added to a training dataset and, when the confidence value is greater than the low-confidence threshold value and less than the high-confidence threshold value, the observation is added to the training dataset based on a comparison between a random value drawn from a uniform distribution and an inclusion percentage value. A classification model is trained with the training dataset and classified observations. The trained classification model is executed with the unclassified observations to determine a label assignment.
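Note: the inclusion rule in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `select_pseudo_labeled` and its parameters are hypothetical, the two thresholds are taken as inputs because the abstract does not say how they are computed from the confidence values, and the direction of the comparison (include when the uniform draw is below the inclusion percentage) is an assumption.

```python
import random


def select_pseudo_labeled(confidences, high_threshold, low_threshold,
                          inclusion_pct, rng=None):
    """Return indices of observations to add to the training dataset.

    Hypothetical helper: thresholds are supplied by the caller, and
    inclusion_pct is expressed as a fraction in [0, 1].
    """
    rng = rng or random.Random()
    selected = []
    for index, confidence in enumerate(confidences):
        if confidence > high_threshold:
            # High-confidence observations are always included.
            selected.append(index)
        elif low_threshold < confidence < high_threshold:
            # Mid-confidence observations are included at random.
            # Assumption: include when the uniform draw is below inclusion_pct.
            if rng.random() < inclusion_pct:
                selected.append(index)
    return selected
```

With `inclusion_pct=0.0` only the high-confidence observations are selected; with `inclusion_pct=1.0` every observation above the low threshold (other than one exactly at the high threshold) is selected.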
-
Publication number: US11010691B1
Publication date: 2021-05-18
Application number: US17093917
Application date: 2020-11-10
Applicant: SAS Institute Inc.
Inventor: Xu Chen, Jorge Manuel Gomes da Silva, Brett Alan Wujek
Abstract: Data is classified using semi-supervised data. A decomposition is performed to define a first decomposition matrix that includes first eigenvectors of a weight matrix, a second decomposition matrix that includes second eigenvectors of a transpose of the weight matrix, and a diagonal matrix that includes eigenvalues of the first eigenvectors. Eigenvectors are selected from the first eigenvectors to define a reduced decomposition matrix. A linear transformation matrix is computed as a function of the first decomposition matrix, the reduced decomposition matrix, the diagonal matrix, and a penalty matrix. When a rank of the linear transformation matrix is less than a number of rows of the penalty matrix, a classification matrix is computed by updating a gradient of a cost function. When the rank of the linear transformation matrix is equal to the number of rows of the penalty matrix, the classification matrix is computed using a dual formulation.
-
Publication number: US20180053071A1
Publication date: 2018-02-22
Application number: US15686863
Application date: 2017-08-25
Applicant: SAS Institute Inc.
CPC classification number: G06K9/6264, G06F9/48, G06K9/6259, G06K9/627, G06K9/66, G06N3/0454, G06N5/003, G06N99/005
Abstract: A computing device predicts occurrence of an event or classifies an object using distributed unlabeled data. Supervised data that includes a labeled subset of a plurality of observation vectors is identified. A total number of threads that will perform labeling of an unlabeled subset of the plurality of observation vectors is determined. The identified supervised data is uploaded to each thread of the total number of threads. Unlabeled observation vectors are randomly selected from the unlabeled subset of the plurality of observation vectors to allocate to each thread of the total number of threads. The randomly selected, unlabeled observation vectors are uploaded to each thread of the total number of threads based on the allocation. The value of the target variable for each observation vector of the unlabeled subset of the plurality of observation vectors is determined based on a converged classification matrix and output to a labeled dataset.
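Note: the random allocation step in this abstract can be sketched in Python as follows. The function name `allocate_to_threads` is hypothetical. The sketch assumes that each unlabeled observation goes to exactly one thread and that the thread shares are as equal as possible; the abstract states neither. Uploading the supervised data to every thread is not shown.

```python
import random


def allocate_to_threads(unlabeled_ids, num_threads, rng=None):
    """Randomly split unlabeled observation ids across threads.

    Hypothetical helper. Assumption: the allocation is a disjoint
    partition with near-equal shares.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    rng = rng or random.Random()
    shuffled = list(unlabeled_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled ids round-robin so each id lands in one thread.
    return [shuffled[t::num_threads] for t in range(num_threads)]
```

Each returned list would then be uploaded to its thread together with the full set of labeled observations.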
-
Publication number: US09792562B1
Publication date: 2017-10-17
Application number: US15335530
Application date: 2016-10-27
Applicant: SAS Institute Inc.
CPC classification number: G06N99/005, G06N5/003, G06N7/005
Abstract: A computing device predicts occurrence of an event or classifies an object using semi-supervised data. A label set defines permissible values for a target variable. A value of the permissible values is defined for a subset of observation vectors. A predefined number of times, a distance matrix is computed that defines a distance value between pairs of observation vectors using a distance function and a converged classification matrix; a number of observation vectors is selected that have minimum values for the distance value; a label is requested and a response is received for each of the selected observation vectors; the value of the target variable is updated for each of the selected observation vectors with the received response; and the value of the target variable is determined again by recomputing the converged classification matrix. The value of the target variable for each observation vector is output to a second dataset.
-
Publication number: US11379685B2
Publication date: 2022-07-05
Application number: US17386706
Application date: 2021-07-28
Applicant: SAS Institute Inc.
Inventor: Xu Chen
Abstract: A computing device classifies unclassified observations. A first batch of unclassified observation vectors and a first batch of classified observation vectors are selected. A prior regularization error value and a decoder reconstruction error value are computed. A first batch of noise observation vectors is generated. An evidence lower bound (ELBO) value is computed. A gradient of an encoder neural network model is computed, and the ELBO value is updated. A decoder neural network model and an encoder neural network model are updated. The decoder neural network model is trained. The target variable value is determined for each observation vector of the unclassified observation vectors based on an output of the trained decoder neural network model. The target variable value is output.
-
Publication number: US10929762B1
Publication date: 2021-02-23
Application number: US16940501
Application date: 2020-07-28
Applicant: SAS Institute Inc.
Inventor: Xu Chen, Brett Alan Wujek
Abstract: Data is classified using corrected semi-supervised data. Cluster centers are defined for unclassified observations. A class is determined for each cluster. A distance value is computed between a classified observation and each cluster center. When the class of the classified observation is not the class determined for the cluster center having a minimum distance, a first distance value is selected as the minimum distance, a second distance value is selected as the distance value computed to the cluster center having the class of the classified observation, a ratio value is computed between the second distance value and the first distance value, and the class of the classified observation is changed to the class determined for the cluster center having the minimum distance value when the computed ratio value satisfies a label correction threshold. A classification matrix is defined using corrected observations to determine the class for the unclassified observations.
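Note: the label correction rule in this abstract can be sketched in Python as follows. The function name `correct_label` is hypothetical. The sketch makes four assumptions that the abstract does not state: distances are Euclidean, the threshold is "satisfied" when the ratio is greater than or equal to it, the nearest matching center is used when several clusters share the observation's class, and the label is left unchanged when no cluster carries that class.

```python
import math


def correct_label(observation, label, centers, center_classes, threshold):
    """Return the (possibly corrected) class of one classified observation.

    Hypothetical helper; the Euclidean distance and the >= comparison
    are assumptions.
    """
    distances = [math.dist(observation, center) for center in centers]
    nearest = min(range(len(centers)), key=distances.__getitem__)
    if center_classes[nearest] == label:
        return label  # Label agrees with the nearest cluster: keep it.

    first = distances[nearest]  # Minimum distance.
    same_class = [d for d, cls in zip(distances, center_classes)
                  if cls == label]
    if not same_class:
        return label  # Assumption: no cluster has this class, keep label.
    second = min(same_class)  # Distance to the cluster with the same class.

    ratio = math.inf if first == 0 else second / first
    return center_classes[nearest] if ratio >= threshold else label
```

For example, with centers `(0, 0)` of class `"a"` and `(10, 0)` of class `"b"`, an observation at `(1, 0)` labeled `"b"` has a ratio of 9 and is relabeled `"a"` under a threshold of 2, while one at `(4, 0)` has a ratio of 1.5 and keeps its label.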
-
Publication number: US10832174B1
Publication date: 2020-11-10
Application number: US16816382
Application date: 2020-03-12
Applicant: SAS Institute Inc.
Inventor: Xu Chen, Brett Alan Wujek
Abstract: Data is classified using automatically selected hyperparameter values. (A) A first loss value is determined based on a converged classification matrix. (B) Each observation vector is assigned to a cluster using a clustering algorithm based on the converged classification matrix. (C) A predefined number of observation vectors is selected from each cluster. (D) Classified observation vectors and unclassified observation vectors are updated based on the selections in (C) and (A) is repeated. (E) An entropy loss value is determined, wherein (A) to (E) are repeated for a plurality of different values of a kernel parameter value and a batch size value. (F) A second loss value is determined based on the converged classification matrix, a label matrix defined from the converged classification matrix, and a weight value. (L) (A) to (F) are repeated with a plurality of different values of the weight value until convergence is satisfied.
-
Publication number: US20200151603A1
Publication date: 2020-05-14
Application number: US16706912
Application date: 2019-12-09
Applicant: SAS Institute Inc.
Inventor: Xu Chen
Abstract: A computing device predicts occurrence of an event or classifies an object using distributed unlabeled data. A Laplacian matrix is computed using a kernel function. A predefined number of eigenvectors is selected from a decomposed Laplacian matrix to define a decomposition matrix. A gradient value is computed as a function of the defined decomposition matrix, a plurality of sparse coefficients, and a label matrix, a value of each coefficient of the plurality of sparse coefficients is updated based on the computed gradient value, and the computations are repeated until a convergence parameter value indicates the plurality of sparse coefficients have converged. A classification matrix is defined using the plurality of sparse coefficients to determine the target variable value for each observation vector of the plurality of unclassified observation vectors. The target variable value for each observation vector of the plurality of unclassified observation vectors is output.
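Note: the first step of this abstract, computing a Laplacian matrix with a kernel function, can be sketched in Python as follows. The abstract names neither the kernel nor the Laplacian variant, so the Gaussian kernel, the `bandwidth` parameter, and the unnormalized form L = D - W are all assumptions, and the function names are hypothetical. The eigenvector selection and sparse coefficient updates are not shown.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) similarity between two observation vectors."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * bandwidth ** 2))


def graph_laplacian(points, bandwidth=1.0):
    """Return the unnormalized graph Laplacian L = D - W as nested lists.

    Hypothetical helper. W holds pairwise kernel values with a zero
    diagonal, and D is the diagonal matrix of the row sums of W.
    """
    n = len(points)
    weights = [
        [0.0 if i == j else gaussian_kernel(points[i], points[j], bandwidth)
         for j in range(n)]
        for i in range(n)
    ]
    return [
        [sum(weights[i]) if i == j else -weights[i][j] for j in range(n)]
        for i in range(n)
    ]
```

Every row of the resulting matrix sums to zero and the matrix is symmetric, which is what a subsequent eigendecomposition would rely on.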
-
Publication number: US20190050368A1
Publication date: 2019-02-14
Application number: US16162794
Application date: 2018-10-17
Applicant: SAS Institute Inc.
Inventor: Xu Chen, Saratendu Sethi
CPC classification number: G06N20/00, G06F17/16, G06K9/6223, G06K9/6267, G06N5/003, G06N7/005, G06N20/10
Abstract: A computing device automatically classifies an observation vector. A label set defines permissible values for a target variable. Supervised data includes a labeled subset that has one of the permissible values. A converged classification matrix is computed based on the supervised data and an unlabeled subset using a prior class distribution matrix that includes a row for each observation vector. Each column is associated with a single permissible value of the label set. A cell value in each column is a likelihood that each associated permissible value of the label set occurs based on prior class distribution information. The value of the target variable is selected using the converged classification matrix. A weighted classification label distribution matrix is computed from the converged classification matrix. The value of the target variable for each observation vector of the plurality of observation vectors is output to a labeled dataset.
-
Publication number: US20210287116A1
Publication date: 2021-09-16
Application number: US17178798
Application date: 2021-02-18
Applicant: SAS Institute Inc.
Inventor: Xu Chen, Jorge Manuel Gomes da Silva, Brett Alan Wujek
Abstract: Data is classified using semi-supervised data. Sparse coefficients are computed using a decomposition of a Laplacian matrix. (B) Updated parameter values are computed for a dimensionality reduction method using the sparse coefficients, the Laplacian matrix, and a plurality of observation vectors. The updated parameter values include a robust estimator of a decomposition matrix determined from the decomposition of the Laplacian matrix. (B) is repeated until a convergence parameter value indicates the updated parameter values for the dimensionality reduction method have converged. A classification matrix is defined using the sparse coefficients and the robust estimator of the decomposition of the Laplacian matrix. The target variable value is determined for each observation vector based on the classification matrix. The target variable value is output for each observation vector of the plurality of unclassified observation vectors and is defined to represent a label for a respective unclassified observation vector.
-