When working with flat files, encoding needs to be factored in right away to avoid issues down the line. UTF-8 (or UTF-16) is the de facto encoding that you hope to get. If the encoding is different, pay attention to how you load the file into R.
Let’s take the example of a file encoded as Windows-1252. Its content is displayed below using Notepad++. The editor does a pretty good job of figuring out the encoding of the file. The encoding is displayed in the status bar, while the Encoding menu lets you change the selected character set.
Beware of the default encoding
I work on Windows, and the Windows-1252 encoding is native to the platform:
Loading the CSV file on Windows with the utils package appears to be a breeze:
However, after moving the code to a Linux environment, I got the following error:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 3
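This kind of failure can be reproduced with a small sketch. The sample file and its contents are hypothetical stand-ins, not the file discussed above, and the call deliberately omits any encoding, so the outcome depends on the native encoding of the session.

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
# "\u00e9" is the letter e with an acute accent.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

# No encoding given: the bytes are interpreted in the native encoding.
outcome <- tryCatch(
  read.csv2(csv_path),
  error = function(e) conditionMessage(e)
)

# In a UTF-8 session this is expected to be an error message,
# in a Windows-1252 session a data frame.
print(outcome)
```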
On Linux environments, the locale is usually UTF-8:
[1] "LC_CTYPE=fr_FR.UTF-8;LC_NUMERIC=C;LC_TIME=C.UTF-8;LC_COLLATE=C.UTF-8;LC_MONETARY=C.UTF-8;LC_MESSAGES=C.UTF-8;LC_PAPER=fr_FR.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_FR.UTF-8;LC_IDENTIFICATION=C"
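To compare platforms, it can help to look only at the character-type part of the locale. The sketch below uses base R functions; the values it prints depend entirely on the machine it runs on.

```r
# LC_CTYPE is the locale category that governs character encoding.
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports, among other things, whether the session is UTF-8.
info <- l10n_info()
print(info[["UTF-8"]])
```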
The file encoding therefore needs to be stated explicitly to ensure portability:
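A minimal sketch of an explicit call, using the fileEncoding argument of read.csv2. The sample file is a hypothetical stand-in, and the result assumes the session can represent the accented characters (a UTF-8 or Windows-1252 locale, for instance).

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

# Declare the encoding of the file so it is re-encoded while being read.
people <- read.csv2(csv_path, fileEncoding = "Windows-1252")
str(people)
```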
Note that the following code is equivalent:
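One equivalent form in base R, sketched here with a hypothetical sample file, is to pass a connection that carries the encoding instead of using fileEncoding. read.csv2 opens and closes the connection itself when it is handed over unopened.

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

# The encoding is attached to the connection rather than to read.csv2.
people_con <- read.csv2(file(csv_path, encoding = "Windows-1252"))

# Same result as with the fileEncoding argument.
people_arg <- read.csv2(csv_path, fileEncoding = "Windows-1252")
print(identical(people_con, people_arg))
```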
read.csv2 uses the native encoding by default to load the CSV file.
If the default encoding varies from platform to platform, your code may not work unless you specify the encoding you expect. For reproducible results, you may also want to define the encoding used by default in your R session.
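As a sketch of the session-level approach, the encoding option can be inspected and changed with options(). The sample file is hypothetical, and the option is restored afterwards because it also affects other functions that open files, such as source().

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

# "native.enc" unless the option has been changed.
print(getOption("encoding"))

# Change the default for file connections, read, then restore the old value.
old_options <- options(encoding = "Windows-1252")
people <- read.csv2(csv_path)
options(old_options)

print(names(people))
```

Passing the encoding on each call remains the more explicit choice; a session-wide default is easy to forget when the code moves to another machine.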
What about the readr package?
The readr package is becoming a favorite among the R community. By default, UTF-8 encoding is assumed (see readr::default_locale()), leading to issues:
The locale is UTF-8 by default:
The encoding needs to be specified using the locale parameter:
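A minimal sketch with readr, assuming the package is installed; the sample file is again a hypothetical stand-in. The encoding is wrapped in readr::locale() and handed to the locale argument of read_csv2.

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

if (requireNamespace("readr", quietly = TRUE)) {
  # The encoding readr assumes when nothing is specified.
  print(readr::default_locale()$encoding)

  # Declare the real encoding of the file through the locale argument.
  people <- suppressMessages(
    readr::read_csv2(csv_path, locale = readr::locale(encoding = "Windows-1252"))
  )
  print(names(people))
}
```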
Checklist
Let’s recap:
- Do you use an encoding other than UTF-8?
- If so, does the file contain only plain ASCII characters? Does it contain extended ASCII characters (such as é, õ)? Does it contain non-extended characters such as Þ?
- Verify the encoding using external tools (such as Notepad++ if on Windows)
- Alternatively, use guess_encoding (from the readr package).
- In which environment do you expect to run the code? Windows? Linux? What is the locale of these systems?
  - Check the locale: Sys.getlocale()
  - Check the supported encodings: iconvlist()
  - [Configure](http://withr.me/configure-character-encoding-for-r-under-linux-and-windows/) character encoding if required.
- Which package will you use to load the file?
  - readr vs. utils?
  - What is the default encoding used by these packages? (getOption('encoding') vs. readr::default_locale())
- Finally, always specify the encoding being used to ensure greater portability of your code.
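
Some of the checks above can be sketched in a few lines. The sample file is hypothetical, readr is assumed to be installed for the guessing step, and guess_encoding is only a heuristic, so its answer should be treated as a hint rather than a verdict, especially on short files.

```r
# Hypothetical sample: a two-column CSV written as Windows-1252 bytes.
csv_path <- tempfile(fileext = ".csv")
sample_text <- "Pr\u00e9nom;Ville\nZo\u00e9;Orl\u00e9ans\n"
writeBin(iconv(sample_text, "UTF-8", "Windows-1252", toRaw = TRUE)[[1]], csv_path)

# Is the encoding known to this platform? Names vary in case across systems.
is_supported <- "WINDOWS-1252" %in% toupper(iconvlist())
print(is_supported)

# Ask readr for its best guesses, with a confidence score for each.
if (requireNamespace("readr", quietly = TRUE)) {
  guesses <- readr::guess_encoding(csv_path)
  print(guesses)
}
```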