User Manual: CSV Importer: Dealing with invalid CSVs

General tips

Open and re-save CSV using LibreOffice Calc

If you created and saved the invalid CSV using LibreOffice Calc, following the instructions for opening and saving files safely as CSV, this tip probably will not help.

As a dedicated tabular data creation and editing application, LibreOffice Calc is able to do fairly sophisticated automatic detection and remediation of different CSV dialects and other common structural issues with the format.

Try:

Open your CSV as a plain text file

Sometimes it can be useful to see exactly what the underlying data looks like, without any parsing or formatting applied.

A CSV file is just a specially formatted text file, so you can use plain text editors like Notepad or many others, to open and examine your files. If you make any changes to the file in the text editor, make sure you save the file with UTF-8 character encoding.

Notepad++ is a free plain text editor for Windows with several built-in editing functions that can streamline specific CSV file fixes. There is also a CSVLint plugin that you can enable, that provides more options for checking and validating CSV files within Notepad++.

Specific issues

The number of tools and techniques for working with CSV data are mind-boggling. We cannot specify exactly how you should fix an issue, because of things like:

  • You may or may not be allowed to install new applications on your machine

  • Applications and techniques available may vary by what operating system you use

  • Some people are are open to using the many powerful command line tools available, while others require point-and-click graphical user interface (GUI) applications.

  • There are some neat “upload your CSV to this web app and it will validate/format/analyze it for you” sites, but (a) we’ve seen many such sites come and go over the years; (b) we cannot vet the availability, trustworthiness, and efficacy of such sites; and (c) you may or may not be comfortable uploading your data to external web sites.

We recommend using a search engine to find exact solutions that will work for you. Your query should include at least “csv”, the issue you are facing (e.g. “find invalid encoding”), and your operating system.

The sections below try to be general, but do assume a preference for GUI applications. We note some features of Notepad++ because it is free, has been around for a very long time, and has some features that directly address some of the common issues we see with CSV files.

Finding and removing blank lines/rows

While working with your CSV as a spreadsheet

Tips for finding blank lines:

  • Select the objectNumber (or other required value for the record type) heading. Do Ctrl-DownArrow to jump to the next unpopulated value in the column. Do you see any populated rows below where your selection landed? If so, you have blank rows in between data rows and should remove the blank rows. If not, try doing Ctrl-DownArrow one more time. If there is any row populated lower in the file, you will have jumped to it.

Tips for removing blank lines:

  • If you have blank rows in between data rows, sort your CSV by a header for a required field.

  • Most spreadsheet applications show you lots of blank rows at the end of your file because you may want to enter more data in those rows. Some spreadsheet applications treat any row you ever put your cursor in as a “data row” even if it’s at the very bottom of the file and and all the cells are empty. They will output a bunch of empty rows at the end of the CSV file. To protect against this:

    • Sort on a required-value column heading

    • Use Ctrl-DownArrow to go to the end of your actual data

    • Click to the left of the first cell in the first blank row underneath your data (this should select the entire row)

    • Use Ctrl-DownArrow again to select all blank rows

    • Delete the selected rows (consult your application’s manual if you don’t know how)

    • Save the CSV

While working with your CSV as a plain text file

General tip:

  • Many plain text editors have a function/setting to show invisible/hidden characters like line breaks. Consult your application’s manual for how to toggle this.

Finding blank lines:

  • If you scroll/jump to the very end of the file, where does your cursor land? Delete blanks until your cursor is at the end of the last populated data row.

Notepad++ has an "Edit > Remove empty lines" function that can make this easy (and will also remove empty lines between populated lines).

Finding characters with invalid encoding

Characters with invalid encoding (i.e. that are not UTF-8) should be removed or replaced with valid UTF-8 characters.

LibreOffice Calc and Notepad++ both support regular expression find/replace. Other tools may as well.

The best way to find non-UTF-encoded characters via a regular expression search depends on what flavor/implementation of regular expression syntax is used by an application. Search the web for something like “how to find non-utf8 characters in {yourapplication}”.

The following regular expression (source) works as a Find value in LibreOffice Calc (with the Regular expressions option checked):

\A([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*\z

However:

  • It returns a list of the cells containing a non-UTF-8 character, not a list of the individual non-UTF-8 characters used in the cells.

  • You could opt to replace any matching characters with a space or some other character, but then you’d lose whatever the matching characters are supposed to be, and the ability to replace all instances of a given character with an appropriate UTF-8 character.