Text Pre-Processing

Textual data, especially text scraped from the internet, can be incredibly messy and full of inconsistencies.
Text pre-processing techniques aim to clean up the text data and prepare it for machine learning.

ftfy: fixes text for you

One of the most frustrating parts of working with large amounts of real-world text is identifying and fixing “noise” in the text, such as this:

🙅 Noisy text: "The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows."

🙆 Actual text: "The Mona Lisa doesn't have eyebrows."

👉 Enter “ftfy”, a nifty little Python library by Elia Lake that fixes Unicode text that’s broken in various ways. Take a look at the image below for some examples!

Fixes multiple layers of mojibake (encoding mix-ups), including mangled “curly quotes”, by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else.
Decodes HTML entities that appear outside of HTML.
Strongly avoids false positives (it should never change correctly decoded text into something else).
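
To make the idea behind these points concrete, here is a minimal sketch that uses only the Python standard library. It is not how ftfy is implemented: `undo_mojibake` is a hypothetical helper written for this post, and it assumes the only mix-up is UTF-8 bytes that were wrongly decoded as Windows-1252. ftfy's own heuristics cover far more cases.

```python
import html


def undo_mojibake(text: str, max_layers: int = 5) -> str:
    """Peel off layers of 'UTF-8 bytes misread as Windows-1252'.

    Hypothetical helper for illustration only; it is not part of ftfy.
    """
    for _ in range(max_layers):
        try:
            # Recover the original bytes, then decode them as UTF-8,
            # which is how they were meant to be decoded.
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # The text does not look like misdecoded UTF-8, so leave it alone.
            # This is a crude way of avoiding false positives.
            break
        if repaired == text:
            # Nothing changed (e.g. plain ASCII), so there is nothing to fix.
            break
        text = repaired
    return text


clean = "The Mona Lisa doesn’t have eyebrows."

# Simulate two rounds of the encoding mix-up.
broken = clean
for _ in range(2):
    broken = broken.encode("utf-8").decode("cp1252")

print(broken)                 # the noisy version
print(undo_mojibake(broken))  # the repaired version
print(undo_mojibake("café"))  # correctly decoded text is left unchanged

# HTML entities that leaked into plain text can be decoded with the stdlib.
print(html.unescape("fish &amp; chips &lt;3"))
```

In practice you would not write this yourself: ftfy's main entry point is `ftfy.fix_text`, which takes a string and returns the repaired string.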

🌟 GitHub: https://github.com/rspeer/python-ftfy
