Text Pre-Processing
Textual data, especially text scraped from the internet, can be incredibly messy and full of inconsistencies.
Text pre-processing techniques aim to clean up this data and prepare it for machine learning.
ftfy: fixes text for you
One of the most frustrating parts of working with large amounts of real-world text is identifying and fixing “noise” in the text, such as this:
🙅 Noisy text: "The Mona Lisa doesn’t have eyebrows."
🙆 Actual text: "The Mona Lisa doesn't have eyebrows."
👉 Enter “ftfy”, a nifty little Python library by Elia Lake that fixes Unicode text that’s broken in various ways. Take a look at the image below for some examples!
Fixes multiple layers of mojibake (encoding mix-ups), including garbled “curly quotes”, by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else.
Decodes HTML entities that appear outside of HTML.
Strongly avoids false positives: it should never change correctly decoded text into something else.
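To make the first two behaviours above concrete, here is a minimal sketch using only the Python standard library. It is not how ftfy is implemented: the helper names `make_mojibake` and `undo_mojibake` are made up for illustration, and the sketch assumes the mix-up is always UTF-8 decoded as Windows-1252. The last few lines call `ftfy.fix_text`, the library's main entry point, if ftfy happens to be installed.

```python
import html


def make_mojibake(text: str) -> str:
    """Simulate the mix-up: encode as UTF-8, then wrongly decode as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_mojibake(text: str) -> str:
    """Peel off layers of the mix-up until the text stops changing (hypothetical helper)."""
    while True:
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text  # the text does not look like this kind of mojibake
        if repaired == text:
            return text  # nothing left to repair
        text = repaired


clean = "The Mona Lisa doesn\u2019t have eyebrows."  # \u2019 is a curly apostrophe
noisy = make_mojibake(clean)        # one layer of mojibake
very_noisy = make_mojibake(noisy)   # two layers of mojibake

print(noisy)                               # The Mona Lisa doesnâ€™t have eyebrows.
print(undo_mojibake(very_noisy) == clean)  # True

# HTML entities that leaked into plain text can be decoded with the stdlib.
print(html.unescape("Caf&eacute; &amp; bar &lt;3"))  # Café & bar <3

# The real thing: ftfy handles these cases (and many more) in a single call.
try:
    import ftfy  # third-party: pip install ftfy

    print(ftfy.fix_text(very_noisy))
except ImportError:
    print("ftfy is not installed; run `pip install ftfy` to try it")
```

The naive round trip above will happily "repair" any text that can be re-encoded and decoded, so it can mangle text that was already correct. Avoiding those false positives is exactly what ftfy's heuristics are for, which is why the library is the better choice on real data.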
🌟 GitHub: https://github.com/rspeer/python-ftfy