Mining “Concept Embeddings” from Open-Source Data to Classify Previously Unseen Log Messages

Abstract: Given the verbosity with which modern software produces logs, it is useful to have many dimensions for filtering when looking for specific content. Groups of log messages often relate to a general software concept (e.g., security, resource utilization, or database access), and it can be useful to examine these messages as a group or to pinpoint messages that fall within the intersection of two or more concepts. Here we describe our approach to classifying previously unseen log messages into these software concept categories. To handle the large domain-specific vocabulary used in log messages, we augmented the “continuous bag of words” (CBOW) embedding training process with an additional semi-supervised training step in which we create a “concept vector”. This vector of concept terms was produced by interrogating the initial embedding and manually filtering out terms that were out-of-concept or ambiguous. The concept vector is then used as the seed for a second “concept embedding” in which terms that associate strongly with each concept vector co-localize. This technique minimizes the amount of manual example labeling required to train our classifiers while enabling them to correctly classify log messages containing terms unknown to the concept vector, the labeled training set, or even the model’s human creator. For example, the log message “There was an error getting a DBCP datasource.” is correctly classified as a database message because of the term “DBCP”. Our embeddings were trained on open-source data sets, including content from Stack Exchange, RFC documents, and published sample software logs.
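To make the classification step concrete, the following is a minimal sketch of how a curated concept vector can classify a message containing an unseen term. All vectors, seed terms, and the `classify` helper below are invented for illustration; in the actual approach the vectors would come from a CBOW model trained on open-source corpora (Stack Exchange, RFCs, sample logs), and the seed lists would be the manually filtered concept terms.

```python
import math

# Toy embedding vectors, invented purely for illustration. In practice these
# would come from a CBOW embedding trained on open-source corpora.
embedding = {
    "database":   [0.90, 0.10, 0.00],
    "datasource": [0.80, 0.20, 0.10],
    "dbcp":       [0.85, 0.15, 0.05],
    "login":      [0.10, 0.90, 0.00],
    "password":   [0.05, 0.95, 0.10],
    "error":      [0.40, 0.40, 0.40],
}

# Manually curated seed terms per concept (the semi-supervised filtering step).
concept_seeds = {
    "database": ["database", "datasource"],
    "security": ["login", "password"],
}

def mean_vector(terms):
    """Average the embedding vectors of a concept's curated seed terms."""
    dims = len(next(iter(embedding.values())))
    return [sum(embedding[t][i] for t in terms) / len(terms) for i in range(dims)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

concept_vecs = {c: mean_vector(ts) for c, ts in concept_seeds.items()}

def classify(message, threshold=0.9):
    """Score a message against each concept by the best cosine similarity
    between any in-vocabulary token and that concept's vector."""
    tokens = [t.strip(".,:;!?") for t in message.lower().split()]
    tokens = [t for t in tokens if t in embedding]
    scores = {c: max((cosine(embedding[t], v) for t in tokens), default=0.0)
              for c, v in concept_vecs.items()}
    return {c: s for c, s in scores.items() if s >= threshold}

print(classify("There was an error getting a DBCP datasource."))
```

Because “dbcp” sits close to the database seed terms in the embedding space, the message is tagged as a database message even though “DBCP” never appears in the seed list or any labeled example.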

Bio: Coming Soon!