Text Pre-Processing

  • Textual data, especially data scraped from the internet, can be incredibly messy and full of inconsistencies

  • Text pre-processing techniques aim to clean up the text data and prepare it for Machine Learning

ftfy: fixes text for you

One of the most frustrating parts of working with large amounts of real-world text is identifying and fixing “noise” in the text, such as this:

🙅 Noisy text: "The Mona Lisa doesn’t have eyebrows."

🙆 Actual text: "The Mona Lisa doesn't have eyebrows."

👉 Enter “ftfy”, a nifty little Python library by Elia Lake that fixes Unicode text that’s broken in various ways. Take a look at the image below for some examples!

  • Fix multiple layers of mojibake (encoding mix-ups) including “curly quotes”, by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else.

  • Decode HTML entities that appear outside of HTML

  • Avoid false positives: ftfy should never change correctly decoded text into something else
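
The fixes listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not how ftfy is implemented: ftfy detects broken patterns automatically, whereas this sketch assumes the text was wrongly decoded as Windows-1252. The helper names `make_mojibake` and `undo_mojibake` are hypothetical and chosen for illustration. The final lines call `ftfy.fix_text`, the library's main entry point, and run only if ftfy is installed.

```python
import html

# Assumption for this sketch: UTF-8 text was wrongly decoded as Windows-1252.
WRONG_ENCODING = "windows-1252"


def make_mojibake(text: str) -> str:
    """Simulate the encoding mix-up: UTF-8 bytes read as Windows-1252."""
    return text.encode("utf-8").decode(WRONG_ENCODING)


def undo_mojibake(text: str) -> str:
    """Reverse one layer of the mix-up (hypothetical helper, not ftfy's API)."""
    return text.encode(WRONG_ENCODING).decode("utf-8")


clean = "The Mona Lisa doesn\u2019t have eyebrows."  # \u2019 is a curly apostrophe
one_layer = make_mojibake(clean)       # the apostrophe becomes three junk characters
two_layers = make_mojibake(one_layer)  # mojibake applied on top of mojibake

# Each layer has to be undone separately, in reverse order.
restored = undo_mojibake(undo_mojibake(two_layers))

# HTML entities that leaked into plain text can be decoded with the stdlib.
decoded = html.unescape("&lt;3 &amp; &quot;quotes&quot;")

print(one_layer)
print(restored == clean)
print(decoded)

# With ftfy installed (pip install ftfy), fix_text handles these cases
# without being told which encoding went wrong.
try:
    import ftfy
except ImportError:
    ftfy = None

if ftfy is not None:
    print(ftfy.fix_text(two_layers))
    print(ftfy.fix_text("&lt;3"))
```

The manual round trip only works when the wrong encoding and the number of layers are known in advance. Guessing both reliably, without damaging text that was already correct, is the part ftfy automates.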

🌟 GitHub: https://github.com/rspeer/python-ftfy