Is document classification too complicated? MIT and IBM work together to solve this challenge

Even the best text classification and recommendation algorithms can be hindered once a data set grows large enough. To deliver faster and more accurate classification than most existing methods, a team at the MIT-IBM Watson AI Lab and MIT's Geometric Data Processing Group designed a technique that combines popular AI tools such as word embeddings and optimal transport.

They argue that this approach can sift through millions of possibilities while taking into account a person's historical preferences, or the preferences of a group of people.

Justin Solomon, an assistant professor at the Massachusetts Institute of Technology and lead author of the study, said in a statement that there is an enormous amount of text on the Internet, and anything that helps cut through that material is extremely useful.

To do this, Justin Solomon and his colleagues use an algorithm to summarize a text collection into topics based on words that commonly occur together. It then models each document as a distribution over 5 to 15 of its most important topics, ranked by how important each topic is to that document.
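To make that first step concrete, here is a minimal sketch in Python using gensim's LDA as a stand-in topic model; the toy corpus, the parameter values, and the choice of gensim are illustrative assumptions, not the team's exact implementation.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy pre-tokenized corpus standing in for a real text collection.
documents = [
    ["whale", "ship", "sea", "captain"],
    ["love", "heart", "marriage", "letter"],
    ["war", "soldier", "battle", "general"],
]

dictionary = corpora.Dictionary(documents)
bows = [dictionary.doc2bow(doc) for doc in documents]

# Summarize the collection into topics based on co-occurring words.
lda = LdaModel(bows, num_topics=3, id2word=dictionary, random_state=0)

# Represent each document as a distribution over its most important
# topics (the article says 5 to 15; this toy corpus supports fewer).
for bow in bows:
    top_topics = sorted(lda.get_document_topics(bow),
                        key=lambda t: t[1], reverse=True)[:5]
    print(top_topics)  # [(topic_id, weight), ...] ranked by importance
```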

In addition, word embeddings (vector representations of words) make the similarity between words quantifiable, while optimal transport computes the most efficient way to move a set of objects (or data points) between multiple destinations. Together, embeddings make it possible to take advantage of optimal transport twice: first to compare the topics within a collection with one another, and then to measure how much the topics of two documents overlap.
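The following sketch illustrates those two uses of optimal transport with the POT library; the random word vectors, the two toy topics, and the document-topic weights are assumptions for illustration, not the paper's data or code.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Word embeddings: one vector per word in a small shared vocabulary.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 50))  # 6 words, 50-dim vectors

# Two topics, each a probability distribution over the vocabulary.
topic_a = np.array([0.4, 0.3, 0.2, 0.1, 0.0, 0.0])
topic_b = np.array([0.0, 0.1, 0.1, 0.2, 0.3, 0.3])

# First transport: compare topics by moving probability mass between
# words, with embedding distances as the ground cost.
word_cost = ot.dist(embeddings, embeddings)  # pairwise distances
topic_distance = ot.emd2(topic_a, topic_b, word_cost)

# Second transport: compare two documents by moving mass between their
# topic distributions, with topic-to-topic distances as the ground cost.
doc_1 = np.array([0.7, 0.3])  # weights over [topic_a, topic_b]
doc_2 = np.array([0.2, 0.8])
topic_cost = np.array([[0.0, topic_distance],
                       [topic_distance, 0.0]])
doc_distance = ot.emd2(doc_1, doc_2, topic_cost)
print(topic_distance, doc_distance)
```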

The researchers say this method is particularly effective when scanning a large number of books and documents. In an evaluation on 1,720 books from the Project Gutenberg dataset, the algorithm successfully compared all the books in about one second, nearly 800 times faster than the next-best method.

In addition, the algorithm classifies documents more accurately than other methods, for example grouping books in the Gutenberg dataset by author, or Amazon product reviews by department. At the same time, the algorithm provides a list of topics explaining why a given document was recommended, which makes the results easy to understand.
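As a hedged sketch of how such a document distance could drive classification, the snippet below feeds a precomputed pairwise distance matrix into a standard nearest-neighbor classifier; the hott_distance helper is a hypothetical placeholder for the two-level transport distance sketched above, not the authors' code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def hott_distance(i, j):
    """Hypothetical document distance, e.g. the two-level optimal
    transport sketched earlier; replaced here by a dummy value."""
    return abs(i - j)

n_docs = 10
labels = np.array([0] * 5 + [1] * 5)  # e.g. author or department
dist = np.array([[hott_distance(i, j) for j in range(n_docs)]
                 for i in range(n_docs)])

# metric="precomputed" lets kNN consume a pairwise distance matrix.
knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
knn.fit(dist, labels)
print(knn.predict(dist[:2]))  # classify the first two documents
```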

However, the researchers are not satisfied with the current state of the technique. They plan to develop end-to-end training that jointly optimizes the embeddings, the topic model, and the optimal transport, rather than optimizing each component separately as the current implementation does. On the application side, they also want to apply the method to larger data sets and explore applications to image or 3D data.

Summing up the paper, Justin Solomon says that their algorithm seems to capture differences the same way a person would when comparing two documents: first breaking each document down into easy-to-understand concepts, and then comparing the concepts.

Going into more detail, Justin Solomon says:

Word embeddings provide global semantic information about language, while the topic model provides topics and topic distributions specific to the corpus. Empirically, these factors combine to deliver excellent performance on a variety of metric-based tasks.
