
Predictive Coding with Three Similarity Measurements


Machine-learning searches work only as well as their statistical model. Sometimes, no number of iterations can fix a statistical model with limited flexibility. That’s why being able to customize the statistical model to meet the challenges of a particular data set is so important for success.

Details and Benefits

Cavo Legal has built its semantic predictive coding on Latent Semantic Indexing (LSI) and has added significant capabilities to create what we now call Enhanced Latent Semantic Indexing (ELSI) (patent pending). We have added three customizable similarity measurements to the statistical model so that it finds documents similar to a reference set more accurately and more quickly.

Traditional LSI uses the average cosine similarity between the query vector and the document vectors in the seed set. This approach has a serious drawback: it gives a lower score to a document that is highly similar to one of the seed documents but dissimilar to most of the others. In addition, some applications may require ranking documents by the similarity of their metadata as well as by content-based similarity, regardless of the reliability of the metadata in your particular data collection. Cavo Legal enhances LSI to make it work in both of these cases. Cavo eD provides three different similarity metrics that contribute to the score of a document given an input seed document set.

  1. Average Seed Similarity: This is the average cosine similarity of the input document with all seed documents.
  2. Best Seed Similarity: This is the similarity of the input document with the seed document closest to the given document in the reduced dimensional space. Put another way, this is the highest similarity of the document with any document in the seed set.
  3. Metadata Similarity: This is the average metadata similarity of the document with the documents in the seed set. Metadata similarity is based on the following fields, chosen for their high reliability:
    • Date
    • Date Falloff
    • Date within Days
    • Email Address
    • Title
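
The first two metrics can be illustrated with a minimal Python sketch. It is not Cavo eD's implementation: the function names, the toy vectors, and the weighted `combined_score` are all hypothetical, and the source does not say how the three metrics are actually combined or how metadata similarity is computed. The sketch assumes documents are already represented as non-zero vectors in the reduced-dimensional space and treats metadata similarity as a number supplied by the caller.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_seed_similarity(doc, seeds):
    """Mean cosine similarity of the document with every seed document."""
    return sum(cosine(doc, s) for s in seeds) / len(seeds)


def best_seed_similarity(doc, seeds):
    """Highest cosine similarity of the document with any seed document."""
    return max(cosine(doc, s) for s in seeds)


def combined_score(avg_sim, best_sim, meta_sim, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical weighted sum of the three metrics.

    The weighting scheme is an assumption made for illustration only.
    """
    w_avg, w_best, w_meta = weights
    return w_avg * avg_sim + w_best * best_sim + w_meta * meta_sim


# Toy seed set: three mutually dissimilar seed documents.
seeds = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# The candidate matches the first seed exactly and nothing else.
doc = [1.0, 0.0, 0.0]

avg = average_seed_similarity(doc, seeds)
best = best_seed_similarity(doc, seeds)
# meta_sim=0.5 is a placeholder value, not a computed metadata similarity.
score = combined_score(avg, best, meta_sim=0.5, weights=(0.2, 0.5, 0.3))
```

In this toy case the candidate is identical to one seed document, yet its average seed similarity is only about one third, while its best seed similarity is 1.0. This is the drawback of average-only scoring described above, and it shows why a best-seed metric helps when the seed set covers several unrelated topics.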