Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for redacting data from a document collection generated for a set of documents that include personal information. The redaction of the data is based in part on a comparison of the document collection to a set of personal documents of users for which the users have provided explicit approval to use in the processing of the document collection.
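The abstract does not say how the comparison drives redaction. As one possible reading, the sketch below masks any token in the collection that appears in the consented personal documents of fewer than `min_users` distinct users, on the assumption that rarely shared tokens are more likely to be personal. The function name, the token-level granularity, the `min_users` threshold, and the mask string are all hypothetical.

```python
from collections import defaultdict


def redact_collection(collection, consented_docs, min_users=2, mask="[REDACTED]"):
    """Mask tokens that too few consenting users share (assumed criterion).

    collection: list of documents, each a list of tokens.
    consented_docs: dict mapping user id -> list of that user's documents
        (each a list of tokens) approved for use in this processing.
    """
    # Record which distinct users have each token in their approved documents.
    users_per_token = defaultdict(set)
    for user_id, docs in consented_docs.items():
        for doc in docs:
            for token in doc:
                users_per_token[token.lower()].add(user_id)

    redacted = []
    for doc in collection:
        redacted.append([
            tok if len(users_per_token.get(tok.lower(), ())) >= min_users else mask
            for tok in doc
        ])
    return redacted
```

With two consenting users whose documents both contain "your order" but different names, the shared words survive and the name is masked.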
Abstract:
Methods and apparatus related to identifying one or more messages sent by a user, identifying two or more contacts that are associated with one or more of the messages, determining a strength of relationship score between identified contacts, and utilizing the strength of relationship scores to provide additional information related to the contacts. A strength of relationship score between a contact and one or more other contacts may be determined based on one or more properties of one or more of the messages. In some implementations, contacts groups may be determined based on the strength of relationship scores. In some implementations, contacts groups may be utilized to disambiguate references to contacts in messages. In some implementations, contacts groups may be utilized to provide suggestions to the user of additional contacts of a contacts group that includes the indicated recipient contact of a message.
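A minimal sketch of one way such scores and groups might be computed. It assumes the only message property used is co-occurrence of contacts as recipients of the same message, scores each pair with a Jaccard ratio, and forms groups by linking pairs whose score meets a threshold. The function names, the Jaccard choice, and the threshold are illustrative assumptions, not taken from the abstract.

```python
from collections import Counter
from itertools import combinations


def relationship_scores(messages):
    """messages: iterable of recipient lists, one per message sent by the user."""
    pair_counts = Counter()
    contact_counts = Counter()
    for recipients in messages:
        unique = sorted(set(recipients))
        contact_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    scores = {}
    for (a, b), together in pair_counts.items():
        # Jaccard ratio: messages naming both contacts / messages naming either.
        either = contact_counts[a] + contact_counts[b] - together
        scores[(a, b)] = together / either
    return scores


def contact_groups(scores, threshold=0.5):
    """Link contacts whose pairwise score meets the threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        find(a)
        find(b)
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for contact in list(parent):
        groups.setdefault(find(contact), set()).add(contact)
    return [group for group in groups.values() if len(group) > 1]
```

A suggestion feature could then look up the group containing the indicated recipient and offer the remaining members of that group.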
Abstract:
Methods, apparatus, systems, and computer-readable media are provided for selecting pattern matching segments suitable for electronic communication clustering. A set of pattern matching segments may be identified that match at least one of a corpus of electronic communication addresses. A measure of coverage of each of the set of pattern matching segments across the corpus of electronic communication addresses may be determined. A score associated with each pattern matching segment may be determined based on the measure of coverage and one or more measures of flexibility associated with each of the set of pattern matching segments. One or more of the pattern matching segments may be selected based on the determined scores. A corpus of electronic communications may then be grouped into a plurality of clusters based on a comparison of the one or more selected pattern matching segments to electronic communication addresses associated with the corpus of electronic communications.
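The scoring and clustering steps can be sketched as follows. This is an illustration under stated assumptions: pattern matching segments are represented as shell-style globs over sender addresses, coverage is the fraction of addresses a pattern matches, flexibility is the share of the pattern made up of `*` wildcards, and the score is coverage minus a weighted flexibility penalty. None of these specifics come from the abstract, and all function names are hypothetical.

```python
from fnmatch import fnmatchcase


def score_segments(patterns, addresses, penalty=0.5):
    """Score each glob pattern by coverage, penalised by its flexibility."""
    scores = {}
    for pattern in patterns:
        coverage = sum(fnmatchcase(a, pattern) for a in addresses) / len(addresses)
        flexibility = pattern.count("*") / len(pattern)  # assumed measure
        scores[pattern] = coverage - penalty * flexibility
    return scores


def select_segments(scores, top_k=1):
    """Keep the top_k highest-scoring patterns."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def cluster_by_segment(communications, selected):
    """communications: iterable of (address, body) pairs.

    Each communication joins the cluster of the first selected pattern its
    address matches; unmatched communications are collected under None.
    """
    clusters = {}
    for address, body in communications:
        key = next((p for p in selected if fnmatchcase(address, p)), None)
        clusters.setdefault(key, []).append(body)
    return clusters
```

The penalty keeps a catch-all pattern such as `*` from winning on coverage alone, which is the role the abstract assigns to the flexibility measures.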
Abstract:
Methods and apparatus related to providing additional information related to a vague term in a message. For example, in some implementations, one or more messages sent by a sender and received by one or more recipients may be identified, a vague term in the message may be identified, a user-restricted database may be identified that is associated with the sender or a recipient, and additional information related to the vague term may be determined from the user-restricted database. A vague term is a term which may have multiple meanings and that can be clarified with additional information. In some implementations, user-restricted databases may include additional information that is associated with the user that may be utilized to replace the vague term with a clarified term. In some implementations, a user-restricted database may be utilized to identify additional information in another database that may be utilized to clarify the vague term.
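As a rough sketch of the lookup described above, the code below treats the user-restricted database as a plain dictionary keyed by lowercased vague terms and annotates each resolvable term in place. The function name, the dictionary representation, and the parenthetical annotation format are assumptions made for illustration only.

```python
import re


def clarify_message(text, vague_terms, user_db):
    """Annotate vague terms with details from a user-restricted database.

    vague_terms: terms already identified as vague in the message.
    user_db: dict mapping a lowercased vague term to clarifying text,
        standing in for a database tied to the sender or a recipient.
    """
    if not vague_terms:
        return text

    # Try longer terms first so "the office" wins over a shorter overlap.
    ordered = sorted(vague_terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    def substitute(match):
        term = match.group(0)
        resolved = user_db.get(term.lower())
        # Leave the term untouched when the database cannot clarify it.
        return f"{term} ({resolved})" if resolved else term

    return pattern.sub(substitute, text)
```

Replacing the term outright, rather than annotating it, would only require returning `resolved` from `substitute`.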
Abstract:
Methods and apparatus related to determining feature scores for message features. An electronic message associated with at least one user and associated with an event may be identified. A likelihood that the at least one user interacted with the event may be identified. One or more message features of the electronic message may be determined. Based on the likelihood that the at least one user interacted with the event, a feature score may be associated with a given message feature of the one or more message features, where the feature score is indicative of a likelihood that the at least one user will interact with another event associated with another message having the given message feature. The feature score may be associated with the given message feature.
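One simple way to turn interaction likelihoods into feature scores is to average, for each feature, the likelihoods of the messages that carry it. The sketch below does exactly that. The averaging rule, the function name, and the string encoding of features are assumptions, since the abstract does not specify how the score is derived.

```python
from collections import defaultdict


def feature_scores(observations):
    """Average interaction likelihood per message feature.

    observations: iterable of (features, likelihood) pairs, where features is
        a collection of feature names for one message and likelihood is the
        estimated probability that the user interacted with the message's
        associated event.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for features, likelihood in observations:
        for feature in set(features):
            totals[feature] += likelihood
            counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}
```

A new message could then be scored by combining the scores of its features, for example by taking their mean, to estimate whether the user is likely to interact with its event.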
Abstract:
Methods, apparatus, and computer-readable media are provided for analyzing a cluster of communications, such as B2C emails, to generate a template for the cluster that defines transient segments and fixed segments of the cluster of communications. More particularly, methods, apparatus, and computer-readable media are provided for generating and/or applying a trained structured machine learning model for a generated template that can be used to determine, for one or more transient segments of subsequent communications, a corresponding probability that a given semantic label is the correct semantic label for extracted content of the transient segment(s).
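The template step can be sketched under a few simplifying assumptions: every communication in the cluster is already split into the same number of positional segments, and a segment position counts as fixed when one text occupies it in at least `min_share` of the cluster. The label probabilities are estimated here by relative frequency over labelled examples, which is a deliberately simple stand-in for the trained structured machine learning model the abstract describes. All names and thresholds are hypothetical.

```python
from collections import Counter


def build_template(cluster, min_share=0.8):
    """cluster: list of communications, each a list of positional segments."""
    template = []
    for segments_at_position in zip(*cluster):
        text, count = Counter(segments_at_position).most_common(1)[0]
        if count / len(cluster) >= min_share:
            template.append(("fixed", text))
        else:
            template.append(("transient", None))
    return template


def label_probabilities(labeled_examples):
    """labeled_examples: iterable of (segment_index, semantic_label) pairs.

    Returns, per transient segment index, the relative frequency of each
    label (a stand-in for a trained model's predicted probability).
    """
    per_position = {}
    for index, label in labeled_examples:
        per_position.setdefault(index, Counter())[label] += 1
    return {
        index: {label: n / sum(counts.values()) for label, n in counts.items()}
        for index, counts in per_position.items()
    }
```

For a cluster of shipping notices, the greeting and boilerplate positions come out as fixed, while the name and order number positions come out as transient.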
Abstract:
Methods, apparatus, systems, and computer-readable media are provided for generating and applying data extraction templates. In various implementations, a corpus of plain text communications such as emails may be grouped into clusters based on one or more similarities between the plain text communications. One or more segments of communications of a particular cluster may be classified as transient based on textual pattern matching. One or more other segments of the communications of the particular cluster may be classified as transient based on various criteria. One or more transient segments may be assigned a generic and/or specific semantic data type and/or a confidentiality designation based on various signals. A data extraction template may be generated to extract, from subsequent plain text communications, content associated with transient (and in some cases, non-confidential) segments.
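A minimal sketch of pattern-based classification and extraction for plain text. It assumes each line of a communication is one segment, uses a small hypothetical table of regular expressions to mark a line as transient and to assign it a semantic type and a confidentiality flag, and builds the template from a single sample communication. The pattern table, the type names, and the function names are all illustrative assumptions.

```python
import re

# Hypothetical pattern table: (semantic type, confidential?, regex).
PATTERNS = [
    ("date", False, re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("price", False, re.compile(r"\$\d+(?:\.\d{2})?")),
    ("card_number", True, re.compile(r"\d{4}(?: \d{4}){3}")),
]


def classify_segment(segment):
    """Return (semantic_type, confidential) if the segment looks transient."""
    for semantic_type, confidential, regex in PATTERNS:
        if regex.fullmatch(segment.strip()):
            return semantic_type, confidential
    return None


def build_extraction_template(sample_lines):
    """Map line index -> (semantic_type, confidential) for transient lines."""
    template = {}
    for index, line in enumerate(sample_lines):
        classification = classify_segment(line)
        if classification is not None:
            template[index] = classification
    return template


def extract(template, lines, include_confidential=False):
    """Pull transient content from a later communication of the same cluster."""
    extracted = {}
    for index, (semantic_type, confidential) in template.items():
        if index < len(lines) and (include_confidential or not confidential):
            extracted[semantic_type] = lines[index].strip()
    return extracted
```

By default `extract` skips segments flagged confidential, which mirrors the abstract's note that extraction may be limited to non-confidential transient segments.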