Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Table of Contents
stylenone

General tips

Open and re-save CSV using LibreOffice Calc

...

Notepad++ has an "Edit > Remove empty lines" function that can make this easy (and will also remove empty lines between populated lines).

Finding characters with invalid encoding

Characters with invalid encoding (i.e. that are not UTF-8) should be removed or replaced with valid UTF-8 characters.

LibreOffice Calc and Notepad++ both support regular expression find/replace. Other tools may as well.

The best way to find non-UTF-encoded characters via a regular expression search depends on what flavor/implementation of regular expression syntax is used by an application. Search the web for something like “how to find non-utf8 characters in {yourapplication}”.

The following regular expression (source) works as a Find value in LibreOffice Calc (with the Regular expressions option checked):

Code Block
\A([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*\z

However:

  • It returns a list of the cells containing a non-UTF-8 character, not a list of the individual non-UTF-8 characters used in the cells.

  • You could opt to replace any matching characters with a space or some other character, but then you’d lose whatever the matching characters are supposed to be, and the ability to replace all instances of a given character with an appropriate UTF-8 character.