# Semantic Similarity

## NLP Embeddings
- Computers cannot process natural language text directly; they operate on numbers.
- Embeddings are numeric vector representations of text that capture its context and meaning, so that texts with similar meanings map to nearby vectors.
- For a detailed overview of state-of-the-art NLP embedding techniques, see my Medium blog post: [Semantic Textual Similarity](https://towardsdatascience.com/semantic-textual-similarity-83b3ca4a840e?sk=8389935eda3449a172a5905b53150d30)
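
To make "nearby vectors" concrete, here is a minimal sketch that compares embeddings with cosine similarity. The 4-dimensional vectors are made up for illustration (a real model would produce them, usually with hundreds of dimensions), and `cosine_similarity` and `toy_embeddings` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Assumption: hand-written toy vectors standing in for real model output.
toy_embeddings = {
    "A man is playing a guitar.": [0.9, 0.1, 0.3, 0.0],
    "Someone strums a guitar.": [0.8, 0.2, 0.4, 0.1],
    "The stock market fell today.": [0.0, 0.9, 0.1, 0.7],
}

query = "A man is playing a guitar."
for sentence, vector in toy_embeddings.items():
    score = cosine_similarity(toy_embeddings[query], vector)
    print(f"{score:.3f}  {sentence}")
```

With these toy values, the two guitar sentences score much closer to each other than either does to the stock market sentence, which is the property semantic search relies on.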

### SimCSE: Simple Contrastive Learning of Sentence Embeddings
One major obstacle to developing domain-specific sentence embeddings for semantic similarity search is the need to curate a large amount of pairwise labeled data. To alleviate this problem, the SimCSE authors propose an **unsupervised** technique that relies on plain **dropout** and still achieves strong results.

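Unsupervised SimCSE takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. The same sentence is encoded twice with different dropout masks; the two resulting embeddings form a positive pair, and the other sentences in the batch serve as negatives.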
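
The objective can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: dropout is applied directly to fixed toy feature vectors instead of inside a Transformer encoder, nothing is trained, and `dropout`, `cosine`, `unsup_simcse_loss`, and `toy_batch` are hypothetical names. It is not the authors' implementation. The defaults `p=0.1` and `temperature=0.05` follow the values reported in the paper.

```python
import math
import random


def dropout(vector, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in vector]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def unsup_simcse_loss(batch, p=0.1, temperature=0.05, seed=0):
    """In-batch contrastive loss over two dropout views of each input."""
    rng = random.Random(seed)
    view_a = [dropout(v, p, rng) for v in batch]  # first pass, one set of masks
    view_b = [dropout(v, p, rng) for v in batch]  # second pass, different masks
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Cross-entropy where index i (the same sentence) is the correct class.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(batch)


# Assumption: toy feature vectors standing in for encoder activations.
toy_batch = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [0.2, 0.0, 1.0, 0.0, 0.0, 1.0],
]
print(f"loss: {unsup_simcse_loss(toy_batch):.4f}")
```

In actual training, this loss would be backpropagated through the encoder so that the two dropout views of a sentence end up closer to each other than to the other sentences in the batch.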

✅  Supports both unsupervised and supervised learning of sentence embeddings

✅  State-of-the-art results for both unsupervised and supervised sentence embeddings

✅  Several SOTA pre-trained models provided out of the box

🚀  GitHub: https://github.com/princeton-nlp/SimCSE

📖  Paper: https://arxiv.org/abs/2104.08821v3


![](images/similarity/simcse_1.png)

![](images/similarity/simcse_2.png)

## Fuzzy String Matching

In [1]:

    # Generate random texts
    from faker import Faker

    fake = Faker()
    fake.seed_instance(0)  # fixed seed so the generated text is reproducible
    fake_text = fake.text(max_nb_chars=200_000).split("\n")
    print(fake_text[:3])

Output:

    ['American whole magazine truth stop whose. On traditional measure example sense peace. Would mouth relate own chair.', 'Together range line beyond. First policy daughter need kind miss.', 'Trouble behavior style report size personal partner. During foot that course nothing draw.']


In [2]:

    # Imports
    from rapidfuzz import process as rapidfuzz_process
    from thefuzz import process as thefuzz_process

In [3]:

    %%time
    # extractOne returns only the best match as (choice, score)
    thefuzz_process.extractOne("stock ball organization", choices=fake_text)

Output:

    CPU times: user 475 ms, sys: 46.3 ms, total: 522 ms
    Wall time: 257 ms

    ('Security stock ball organization recognize civil. Pm her then nothing increase.',
     90)

In [4]:

    %%time
    # extractOne returns only the best match as (choice, score, index)
    rapidfuzz_process.extractOne("stock ball organization", choices=fake_text)

Output:

    CPU times: user 5.52 ms, sys: 170 µs, total: 5.69 ms
    Wall time: 5.7 ms

    ('Security stock ball organization recognize civil. Pm her then nothing increase.',
     90.0,
     10)
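
To illustrate what a best-match lookup does under the hood, here is a rough standard-library sketch. It is a stand-in, not either library's algorithm: it scores with `difflib.SequenceMatcher.ratio()`, so its scores will not match the 90 reported above, and `extract_one` is a hypothetical helper that only mirrors the `(choice, score, index)` shape of the rapidfuzz result.

```python
from difflib import SequenceMatcher


def extract_one(query, choices):
    """Return (best_choice, score_0_to_100, index), or None for no choices."""
    best = None
    for index, choice in enumerate(choices):
        # Case-insensitive similarity in [0, 1], scaled to 0-100.
        score = 100.0 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        if best is None or score > best[1]:
            best = (choice, score, index)
    return best


choices = [
    "Together range line beyond.",
    "Security stock ball organization recognize civil.",
    "Trouble behavior style report size personal partner.",
]
match, score, index = extract_one("stock ball organization", choices)
print(match, round(score, 1), index)
```

The loop compares the query against every choice, which is why scorer speed dominates the total runtime on large collections like the one timed above.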