RSS

The Only Class You Need for CSV Files

23 Nov

A “Comma Separated Value” file is a pretty simple and straightforward concept.  You have N values on a line, each separated by a comma.  Obviously you have to do something special if the value has a comma or a CR character in it, and that is to put them in quotes, and that means you have to handle quote characters specially.  CSV are so common that many times have gone to look for a standard library to read them.

What I find in standard libraries are things that are so complicated, that in order to use the library you have to write a lot of code … actually almost as much code as you need to parse the CSV in the first place.  At the bottom I list a few such offering and why they are so wasteful.  Before that, I will put my entire implementation here.  A class so short and concise that I can put the entire thing in this blog post.  Here it is:

CSVHelper.java

import java.io.LineNumberReader;
import java.io.Writer;
import java.util.List;
import java.util.Vector;
public class CSVHelper
{
    public static void writeLine(Writer w, List<String> values) 
        throws Exception
    {
        boolean firstVal = true;
        for (String val : values)  {
            if (!firstVal) {
                w.write(",");
            }
            w.write(""");
            for (int i=0; i<val.length(); i++) {
                char ch = val.charAt(i);
                if (ch=='"') {
                    w.write(""");  //extra quote
                }
                w.write(ch);
            }
            w.write(""");
            firstVal = false;
        }
        w.write("n");
    }

    /**
    * Returns a null when the input stream is empty
    */
    public static List<String> parseLine(Reader r) throws Exception {
        int ch = r.read();
        while (ch == 'r') {
            ch = r.read();
        }
        if (ch<0) {
            return null;
        }
        Vector<String> store = new Vector<String>();
        StringBuffer curVal = new StringBuffer();
        boolean inquotes = false;
        boolean started = false;
        while (ch>=0) {
            if (inquotes) {
                started=true;
                if (ch == '"') {
                    inquotes = false;
                }
                else {
                    curVal.append((char)ch);
                }
            }
            else {
                if (ch == '"') {
                    inquotes = true;
                    if (started) {
   // if this is the second quote in a value, add a quote
   // this is for the double quote in the middle of a value
                        curVal.append('"');
                    }
                }
                else if (ch == ',') {
                    store.add(curVal.toString());
                    curVal = new StringBuffer();
                    started = false;
                }
                else if (ch == 'r') {
                    //ignore LF characters
                }
                else if (ch == 'n') {
                    //end of a line, break out
                    break;
                }
                else {
                    curVal.append((char)ch);
                }
            }
            ch = r.read();
        }
        store.add(curVal.toString());
        return store;
    }
}

Here are a couple of things that you might want to notice about this implementation:

  • The class has two static methods, one to read a line of a CSV file, and one to write a line of a CSV file. Nothing to set up, nothing to configure, nothing to go wrong.
  • It handles each character only once.  No temporary strings, and no concatenating of strings.
  • It makes a single pass through the stream.  There is no pre-fetching, no pushing back, no scanning ahead for length, and then re-parsing.  There is no breaking the string initially (prescan) into chunks that are then parsed separately.
  • This single pass approach is fast and it does not use any excessive memory, nor does it copy the string values more than once.
  • Everything is in characters, which means you can use any encoding into bytes that you like.  These methods never see the bytes.
  • You can parse from any stream source: file, http request, ftp, nntp, whatever. You only need to create a Reader to that source. No need to reinvent that.
  • You can write to out to any output destination, just supply the Writer class.
  • It focuses just on the parsing of the value, and not interpretation of the values.  I have found that once the line is parsed into a List of Strings, it is pretty easy to loop over this and do whatever interpretation is needed of the values.  Generally, that loop to interpret the values is the simplest, most concise implementation.  When I have used classes that automatically do further conversions, I find I nearly always have to write a loop anyway to do special things on the values beyond data types.  If I have to write the loop anyway to patch things up, I might as well convert from strings at the same time.
  • It does not tell you the size and shape of the file. for example it does not tell you the number of rows or columns.  You simply read each row and you get a number of values, telling you the number of columns.  A properly formed CSV file will have the same number of columns on each row, but this approach simply gives you the values that are there on the row.
  • There is no special handling for the first row.  Often the first row contains names of the columns.  The code reading the file can easily handle the first row differently.
  • This has no dependencies on other non-standard classes.  It uses that standard Java stream classes.  It does not invent new protocols or patterns of call, instead leveraging the standard Java streams as they were intended to be used.
  • It does not handle comments.  Why would anyone have comments in a CSV file?  If you need it, comments could be ignored with about 3 additional lines of code.  Since I have not encountered that need, I did not put it in.  (YAGNI)
  • I used Vector because I am more comfortable with it, but a purist might substitute ArrayList for performance reasons.  That is why the interface uses the more generic “List” interface.
  • On output it puts quotes around all values.  This was easier than figuring out whether the value needed it or not, and some might view this as wasteful, adding two extra characters for every values that does not need it.  For me, processing speed was more important than space on the disk, so I would rather avoid the overhead and complexity of figuring out whether it is needed, even if there is a small cost on output size.
  • About 50 lines of code, makes it smaller than the interface files of most competition.

This makes it so easy to write a CSV file.  I write a loop over my data set.  For each row, I collect all the values (converted to strings) into a List of Strings.   Then one method call writes that line out.  Repeat until finished. Generally this is all I need.  If I want a header with the column names, I do that just before looping over the data.  So easy.

Reading is similarly easy.  Open the file, and start pulling lines out as Lists of Strings.  It is easy then to iterate through that list, and plug the value into whatever internal data structure I am using.  If it needs to be converted to Integer or Long, I do it in that loop without difficulty.  This is the step of interpreting the data values, and generally I don’t need a CSV class to do that.

Most importantly, this one class can be reused anywhere without bringing any extra dependencies.  A single class, is a single file.  It uses only standard Java classes, so I don’t need to include any extra exotic libraries into my program.

Overly Complex Implementations

  • Commons CSV– This Apache project to support CSV files.
    • Large: The source involves 11 files, almost 2,500 lines of code — and I am only including the implementation, not the testing, etc.  Read through the code and compare to above, there is a significant amount of overhead at all points needed for things that are essentially never needed.
    • Multi-pass: It has its own ExtendedBufferedReader which allows for marking and rolling the stream back, which can only mean it that requires multiple passes over the stream.  Clearly this is not reading in a single pass, and likely to be inefficient.
    • Configurable: You can change all of the possible delimiter: semicolons instead of commas, and other characters instead of quotes.  This is all fully generalized so that all possible combinations can work.  Actually I doubt all combinations work, but there is a significant overhead that is simply not needed.  It does handle a lot of “CSV Dialects” as it puts it.  I propose a better way: if you have a non-standard CSV file that you need to read, then just edit the above source, replacing the one comma that occurs with a semicolon character (or whatever change you need) and make a class that parses that exotic dialect.
    • This is clearly a case of YAGNI.  Using it requires learning more interface methods, and more configuration variables that the above implementation has total lines.  Don’t implement capability until you actually need it.  Rather than creating an incredibly complex, 11-class implementation to read a simple file, I keep the simple case simple, and only if you really have the need for a special case, do you then have to create extra classes.
  • Super CSV- Exactly what it says “super” it is far more than you will ever need.
    • Large: A whopping 78 classes, and I did not count the number of lines.
    • Unnecessary: The author of this package felt that operating on a list of string is too plebeian  and so SuperCSV offers automatic conversions to/from other data types.  You create a big array of “CellProcessor” objects, which act to convert the data do functions like range checking etc.  I simply have never needed my CSV file reader to do this kind of data conversion.  There are good libraries for this sort of thing which can not be used and I certainly don’t need to re-write all my data handling into little class objects that are convenient so that this file conversion tool can easily do it. This is just far more complicated than I need — especially if there is an error.   And he had not considered the situation that a value in the row determines the format of another value.
    • Attempts to create a complete “database” like and object serialization like functionality, including automatic conversions to/from JavaBean objects. I have plenty of way to convert to and from Strings, and don’t need a file reader to do this for me, and if ti did, I would not be able to use my other methods.
    • If you want automatic conversions to more complicated structures, this will do it, but you will probably have to design your software so that this CSV reader is part of the central core, and not just an adjunct ability to import and export CSV files.
  • Skife CSV – Much more reasonable an approach that the above overbloated libraries.  Returns an array of strings for each line of a CSV, so it is quite simple and well written.  I like the clean style, but it still complicates the job by offering so many different ways to do things.
    • Large: it is 10 classes, about 900 lines.
    • Configurable: once agaoin you can change the delimiters, and this add overhead.
    • Bloated: It (optionally) opens and closes files for you and provides many functions which any reasonable programmer should know how to do.  these “convenience” methods just complicate the interface.  There are two main modes: read a lin at a time, and a callback approach that calls your method on each line.  All the call back does is save me from writing a three line loop — the extra complication of extra interfaces and methods is not worth it.
  • jcsv – Simple CSV – a pretty good straight forward library.   I like the orientation of being simple and easy to use.  Still, there is a class called “ColumnJoiner” which unnecessarily walk through the data a few times, converting strings into other strings with “replace” method, and adding quotes to strings before finally combining them together into a string with commas between them.  It is so much more efficient (faster, less memory) just to do this at the time you write the strings out in a single pass. There are about 36 classes in total, including about 10 “DataProcessors” which allow you to do automatic conversions of strings into other values … all you have to do is put together a complicated structure with all the right processors in the right place.  Unnecessary frills like the method that reads all the lines of a file and returns as a big list of the lines. A laudable attempt, but somewhere along the way the authors lost sight of the idea of keeping things simple.
  • OpenCSV (au.com.bytecode.opencsv) – less bloated than some, this implementation still has 12 classes & interfaces, with some 78 constructors and methods.  Half of this is to support automatic conversion to/from Java beans, which I still feel is hardly worth the effort.
    • uses standard Reader/Writer classes in a standard way: a strong plus for this!
    • The class CSVReader can be used almost by itself: constructor specifies the various delimiter values, and there are three key methods: get a single record, get all records as a List, and get a standard iterator.

Simple Implementation, that are Wrong

  • Blogspot: A Java CSV File Reader – This implementation is just plain wrong.  It uses the Java string “split” function to break the line on commas, but that does not work when the values have commas within the quote.  There are a lot of implementation that make this basic mistake.
  • Reading a *csv file – This implementation is wrong as well, it uses StringTokenizer to break the string on commas.  That does not work when there are commas in the values, between the quotes.
  • How To Export Data To CSV File – Java – This implementation is simple, but it completely ignores the issues with commas and quotes in the data.

How to Use It

Here are an examples of how you would use this to read a file:

public static List<MyClass> readData() throws Exception {
    List<MyClass> collection = new Vector<MyClass>();
    File fileTemplate = new File( <<path to your file >>);
    FileInputStream fis = new FileInputStream(fileTemplate);
    Reader fr = new InputStreamReader(fis, "UTF-8");

    List<String> values = CSVHelper.parseLine(fr);
    while (values!=null) {
        collection.add( MyClass.constructFromStrings(values) );
        values = CSVHelper.parseLine(fr);
    }
    lnr.close();
    return collection;
}

Points to note about this:

  • You open the stream yourself.  Often this will be from a file, but it could also be a web resource, a document management system, or many other possible sources.  There are many standard ways to locate the source of your CVS file, and you are free to use all of them.
  • In this case it uses a FileInputStream, and then converts that to a Reader with the “UTF-8” character encoding.  You may find that you need to support other encodings, and you have complete flexibility to do so using the standard Java classes for this.
  • Then, get each “row” of the CSV file as a record in a loop.  In this case there is no header row so I treat every row the same.  Just loop until you get a null back.
  • In this case I want to construct an object instance for each row, and it calls a static method that takes the values and initializes an object from a list of string objects.  Once again you have complete flexibility here on how you use the values.  If some of your object member use other data types, like Integer, you would have to convert from string to int in that routine, but once again there are standard Java mechanisms to do that.
  • If the file format is specified externally by someone else, then you will have to figure out how to map those column values into your object for storing them.  You may decide to ignore some columns.  This is easy to do when you have a list of string, each string represents a column of the CSV file.

Here is an example of how to output a collection of objects to a CSV file:

public static void saveData(List<MyClass> myData) throws Exception {
    File csvFile = new File(<<path to write to>>);
    FileOutputStream fos = new FileOutputStream(csvFile);
    Writer fw = new OutputStreamWriter(fos, "UTF-8");
    for (MyClass oneDatum : myData) {
        List<String> rowValues = oneDatum.getValues();
        CSVHelper.writeLine(fw, rowValues);
    }
    fw.flush();
    fw.close();
}

Things to note about this:

  • Again, you have complete freedom of where you write to.  In this case, it is writing directly to a file, but you could also write to the output stream of a web server (generating CSV files on demand) or to a memory buffer.  There are plenty of standard Java mechanisms to support all the cases you need.
  • I generally create the Writer using UTF-8 encoding, but you may have other encodings you need to use, and that is done in the Writer constructor.
  • Then loop over your data, converting the object to a list of string values, and then writing those strings to the file.  Once again, I have hidden this in a method on the class which walks through the members of the class that you want to persist, and adds a string to list for each such member.  You go through the members in a particular order when writing the file, and the same order when reading the file.  When the external CSV file is specified as a standard (e.g. column 5 must be a date of a particular format) you have do some coding to figure out how to get the exact required value into the required column.  But once you get the conversion and the order worked out, this method writes it reliably as a CSV file row.
  • Don’t forget to flush and close the file

Here is a sample class that fills in the two functions used above:

class MyClass {
    String     name;
    int        age;
    float      shoeSize;
    CustomType astrologicalSign;

    public static MyClass constructFromStrings(List<String> values) {

        String _name      = values.get(0);
        int    _age       = Integer.parseInt(values.get(1));
        float  _shoeSize  = Float.parseFloat(values.get(2));
        CustomType _aSign = new CustomType(values.get(3));
        return new MyClass(_name, _age, _shoeSize, _aSign);
    }

    public List<String> getValues() {
        Vector<String> values = new Vector<String>();
        values.add(name);
        values.add(Integer.toString(age));
        values.add(Float.toString(shoeSize));
        values.add(astrologicalSign.toString());
        return values;   
    }
}

Things to note:

  • These methods represent (implement) the mapping from the CSV columns to the member data.  CSVHelper does not attempt to automate this mapping.  You, as a programmer, have to write this code to do the mapping each time.  That may sound like a bother, but I have generally found that it is easier to write this method, than it is to write the configuration data required to get the bigger packages to do this.  But, if you want this automated, then feel free to use the other, more full featured, approaches.
  • Each class does not necessarily have only a single mapping to CSV files.  For instance, there may be a particular CSV file that you have to import from a foreign source that has columns in a particular order, and you may also have your own CSV file for exchanging information internally.  You can have as many of these mapping methods as you need.  There is no requirement for a one-to-one mapping between class members and CSV columns.
  • This is a simple case where a four column CSV goes to four class members of simple type.  In general I usually have a more complicated mapping problem, sometimes involving multiple classes.  Sometimes a value, like customer id, needs to be looked up in a separate collection, and that object attached.  Sometimes I am not simply returning an object instance, but instead updating or otherwise manipulating global collections.  Again, all of these are possible:  CSVHelper simply gives you column values from the CSV file, and you do whatever you need with them.
Advertisements
 
21 Comments

Posted by on November 23, 2012 in Uncategorized

 

21 responses to “The Only Class You Need for CSV Files

  1. Slamet Nurhadi

    December 5, 2012 at 9:11 pm

    thanks, how to modified data from database to csv file,. if a field ‘status’ content {0,1,2} and then in csv file containing field ‘status’ {active, success, fail},. whether can use if statement in java ??

     
    • kswenson

      December 6, 2012 at 9:27 am

      thanks for comment. Not sure what the question is.

       
      • Slamet Nurhadi

        December 7, 2012 at 11:08 pm

        hehe. I already found the my problem, before,.i confuse change a field table befero parse onto in scv format…i used to if ‘statement’

         
  2. Anon

    January 21, 2013 at 1:33 pm

    Haven’t read article yet, but noticed important Typos: “CVSHelper” & “CVS” several times.

    No need to publish comment obviously. Thanks for taking the time to share your knowledge with us.

     
    • kswenson

      January 21, 2013 at 3:23 pm

      Oh no! Thanks so much for pointing that out. My fingers tend to type CVS automatically because of long years of use of the source code versioning system. I fixed them. Thanks again.

       
  3. Vic Wolk

    January 24, 2013 at 8:42 pm

    Ok, I am really green to this (ok, noob is the word). I have a CSV file separated by semicolons and I need to read it and put whatever information I want into a database. Where do I use that crazy good implementation of yours? I don’t see it reading any file or being used at all in any real situations. I think you’ve wasted too much time and objectiveness writting about other people’s work. Feels incomplete to the nooby me, you see. Any lights?

     
    • kswenson

      January 28, 2013 at 5:53 pm

      Yes, I need to add some sample code. You are right, I got kind of caught up in showing how complicated the others are. My point, of course, is that some of the API interfaces are longer than this complete class. If you have to learn so much, you might as well just learn the parser itself, which is better for maintenance and debugging, not to mention code size and complexity. I will soon add example of how to use it.

       
  4. Cinder

    February 8, 2013 at 3:20 pm

    One thing that I’ve noticed that is lacking – IF your source csv file has newlines within a particular value, then your code will gobble the newlines and cat the entries together. I replaced the readline section with a plain old .read() call, and for inquotes false :

    else if (ch==’,’ || ch==’\n’) {
    store.add(curVal.toString());
    curVal = new StringBuffer();
    }

    This makes sure that if the last value in one ‘line’ is not followed by a comma, then it won’t be combined with the first value in the next set inadvertently.

    Otherwise, really solid logic!

     
    • kswenson

      February 8, 2013 at 3:26 pm

      Thanks for that comment — I had been worried about that situation. I know that some CSV generators put CR within the quotes. Your suggestion is a big improvement. I have updated the source listing to do this.

       
  5. sam

    February 11, 2013 at 5:38 pm

    Please add main method or full code to use it.

     
    • kswenson

      February 12, 2013 at 9:35 am

      OK, I have added that now.

       
  6. Semper Why

    February 26, 2013 at 1:48 pm

    Did you list Commons CSV as both an overly complicated and as a reasonable implementation? Or am I missing something?

     
    • kswenson

      February 26, 2013 at 10:57 pm

      As you know there are often multiple versions floating around, and given minimal time I can spend on this, it is sometimes hard to say what the definitive version is. Commons CSV is a lot more complicated than my implementation. But given how overly complex most of the other are, this one is relatively clean and shows promise. If you must use a well known library, this is probably the one I would recommend. However, it still requires many more classes and complexity that really required.

       
      • Semper Why

        February 28, 2013 at 9:02 am

        Fair enough. In any case, thanks for the resource! It quite nicely solved a pesky issue I was having.

         
  7. SomeOne

    March 23, 2013 at 3:57 am

    There is a bug in your code.
    Try parsing the line: """A,B"""
    The correct result should be

    "A,B"

    but you code gives

    A,B"

     
    • kswenson

      March 23, 2013 at 3:39 pm

      Right you are! The condition curVal.length()>0 is not the right test, because it fails to notice when the doubled quote comes as the first character of the value. I have replaced this with a boolean variable to track whether this is the first quote or not. Code is updated, and now it handles this case correctly.

       
  8. MC

    September 21, 2014 at 3:56 am

    Very helpful! Thank you!

     
  9. Ivan

    April 9, 2015 at 11:28 pm

    There is a bug in your code. If value is not quoted, but contains quotes (for example: ) quotes will be absent in the parse result. I think the correct way to parse in this case is to use FSM, not the discrete flags.

     
    • kswenson

      April 10, 2015 at 7:00 am

      According to RFC 4180, a double-quote character (“) is allowed in a field only when that field is quoted and the quote character itself is doubled. The quote character is not allowed in a non-quoted field. See http://tools.ietf.org/html/rfc4180

      The behavior is well defined at least. If it encounters data like this:

      aaa, my “big” day, ccc

      Here are the values you would get:

      1: aaa
      2: my “big day
      3: ccc

      To be specific, every other quote (the second, fourth, etc) will be suppressed.

      If the quotes are doubled like this:

      aaa, my “”big”” day, ccc

      then you will get:

      1: aaa
      2: my “big” day
      3: ccc

      These examples are where quotes match. Further malformed input might consider having a single double-quote charcter (or some other odd number of them) like this:

      aaa, my “big day, ccc

      In which case, of course, the quote would appear to guard a comma, and the parsed values would be:

      1: aaa
      2: my “big day, ccc (and the next line up to the quote)

      Just keep in mind: this input is not legal CSV files, thus interpretation of the file is unknown, and I would not want to distort this parser to handle these illegal cases. Whatever would be done would slow the parser down for all of the legal cases. If you are indeed faced with handling these malformed CSV files, then you probably need to resort to a much slower parser based on how you decide what those interior quotes were intended to mean.

       
  10. Code Swiper

    February 18, 2016 at 7:26 am

    First, great tidy code. very useful for my specific task. It looks like the code rendering is removing some backslashes in parseLine and writeLine. Probably something on the server. I tried Chrome and IE, no difference. It would be great to have a link to download.

     

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: