|
|
| Line 1: |
Line 1: |
| {{Short description|Text file formats for tabular data}} | | {{Short description |Text format for tabular data using a comma between fields}} |
| {{Infobox file format | | {{Infobox file format |
| | name = Comma-separated values | | | name = Comma-separated values |
| | icon = CsvDelimited001.svg | | | icon = CsvDelimited001.svg |
| | icon_size = 128px
| |
| | logo =
| |
| | screenshot = Example CSV file.png
| |
| | caption = A simple CSV file listing three people and the companies they work for
| |
| | _noextcode = on | | | _noextcode = on |
| | extension = {{code|.csv}} | | | extension = {{code|.csv}} |
| Line 26: |
Line 22: |
| | url = | | | url = |
| }} | | }} |
| '''Comma-separated values''' ('''CSV''') is a [[text file]] format that uses [[comma]]s to separate values, and [[newline]]s to separate records. A CSV file stores [[Table (information)|tabular]] data (numbers and text) in [[plain text]], where each line of the file typically represents one data [[record (computer science)|record]]. Each record consists of the same number of [[field (computer science)|fields]], and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.<ref>{{Cite web |date= |title=CSV Comma Separated Value File Format - How To - Creativyst - Explored,Designed,Delivered.(sm) |url=https://www.creativyst.com/Doc/Articles/CSV/CSV01.shtml |url-status=live |archive-url=https://web.archive.org/web/20210401084212/http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm |archive-date=1 April 2021 |access-date=22 August 2023 |website=Creativyst Software}}</ref>
| |
|
| |
|
| The CSV file format is one type of [[Delimiter-separated values|delimiter-separated file format]].<ref>{{Cite book|url=https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm|title=IBM DB2 Administration Guide|publisher=IBM|archive-date=2016-12-13|access-date=2016-12-13|archive-url=https://web.archive.org/web/20161213014111/https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm|url-status=live}}</ref> Delimiters frequently used include the comma, [[tab-separated values|tab]], space, and semicolon. Delimiter-separated files are often given a ".csv" [[filename extension|extension]] even when the field separator is not a comma. Many applications or libraries that consume or produce CSV files have options to specify an alternative delimiter.<ref>{{Cite web |title=Which are the available formats |url=https://wikis.ec.europa.eu/display/EUROSTATHELP/Which+are+the+available+formats |url-status=live |archive-url=https://web.archive.org/web/20230726193851/https://wikis.ec.europa.eu/display/EUROSTATHELP/Which+are+the+available+formats |archive-date=26 July 2023 |access-date=22 August 2023 |website=Eurostat}}</ref>
| | '''Comma-separated values''' ('''CSV''') is a [[plain text]] [[data]] [[content format |format]] for storing [[tabular data]] where the [[field (computer science)|fields]] (values) of a [[record (computer science)|record]] are separated by a [[comma]] and each record is a line (i.e. [[newline]] separated). CSV is commonly-used in [[software]] that generally deals with tabular data such as a [[database]] or a [[spreadsheet]].<ref>{{Cite web |title=Import or export text (.txt or .csv) files |url=https://support.microsoft.com/en-au/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba |access-date=2023-08-16 |website=Microsoft Support }}</ref> Benefits cited for using CSV include simplicity of use and [[human-readable |human readability]].<ref>{{Cite web |title=What is a CSV file: A comprehensive guide |url=https://flatfile.com/blog/what-is-a-csv-file-guide-to-uses-and-benefits/ |access-date=2024-10-28 |website=flatfile.com |language=en}}</ref> CSV is a form of [[delimiter-separated values]]. A '''CSV file''' is a file that contains CSV-formatted data. |
|
| |
|
| The lack of adherence to the CSV standard RFC 4180 necessitates the support for a variety of CSV formats in data input software. Despite this drawback, CSV remains widespread in data applications and is widely supported by a variety of software, including common spreadsheet applications such as [[Microsoft Excel]].<ref>{{Cite web |title=Import or export text (.txt or .csv) files |url=https://support.microsoft.com/en-au/office/import-or-export-text-txt-or-csv-files-5250ac4c-663c-47ce-937b-339e391393ba |access-date=2023-08-16 |website=Microsoft Support }}</ref> Benefits cited in favor of CSV include human readability and the simplicity of the format.<ref>{{Cite web |title=What is a CSV file: A comprehensive guide |url=https://flatfile.com/blog/what-is-a-csv-file-guide-to-uses-and-benefits/ |access-date=2024-10-28 |website=flatfile.com |language=en}}</ref>
| | CSV is not limited to a particular [[character encoding]].<ref name="rfc4180"/> It works just as well with [[Unicode]] (i.e. [[UTF-8]] or [[UTF-16]]) as with [[ASCII]] {{endash}} although particular [[computer program |programs]] may have limitations. Unlike many proprietary data formats, CSV data normally survives naïve translation from one character set to another. CSV does not, however, provide a way to indicate the character encoding, so that must be communicated separately. |
| | |
| ==Applications==
| |
| CSV is a common [[data exchange]] format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data<ref>{{cite web|url=https://datahub.io/docs/data-packages/csv|title=CSV - Comma Separated Values|access-date=2017-12-02|archive-date=2021-03-07|archive-url=https://web.archive.org/web/20210307221025/https://datahub.io/docs/data-packages/csv|url-status=live}}</ref><ref>{{cite web | url=http://www.csvreader.com/csv_format.php | title=CSV Files | access-date=June 4, 2014 | archive-date=April 30, 2021 | archive-url=https://web.archive.org/web/20210430225021/https://www.csvreader.com/csv_format.php | url-status=live }}</ref> between programs that natively operate on incompatible (often [[proprietary format|proprietary]] or undocumented) formats.<ref name="rfc4180">{{cite IETF |title=Common Format and MIME Type for CSV Files |last=Shafranovich |first=Y. |publisher=[[Internet Engineering Task Force|IETF]] |date=October 2005 |rfc=4180 |page=1 }}</ref> For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a [[spreadsheet]] that uses a completely different format. Most [[Database|database programs]] can export data as CSV. Most spreadsheet programs can read CSV data, allowing CSV to be used as an intermediate format when transferring data from a database to a spreadsheet. Every major [[E-commerce|ecommerce]] platform provides support for exporting data as a CSV file.<ref>{{Cite web |title=CSV Supported Ecommerce Platforms |url=https://www.rfmcalc.com/supported-platforms/ |access-date=2025-03-09 | website=RFM Calc }}</ref>
| |
| | |
| CSV is also used for storing data. Common data science tools such as [[Pandas (software)|Pandas]] include the option to export data to CSV for long-term storage.<ref>{{Cite web |title=pandas.DataFrame.to_csv — pandas 2.0.3 documentation |url=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html |access-date=2023-08-16 |website=pandas.pydata.org}}</ref> Benefits of CSV for data storage include the simplicity of CSV makes parsing and creating CSV files easy to implement and fast compared to other data formats, human readability making editing or fixing data simpler,<ref>{{Cite web |date=2021-09-15 |title=CSV Format: History, Advantages and Why It Is Still Popular |url=https://bytescout.com/blog/csv-format-history-advantages.html |access-date=2023-08-16 |website=ByteScout |language=en-US}}</ref> and high compressibility leading to smaller data files.<ref>{{Cite web |date=2020-07-23 |title=Comparison of different file formats in Big Data |url=https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/ |access-date=2023-08-16 |website=www.adaltas.com |language=en}}</ref> Alternatively, CSV does not support more complex data relations and makes no distinction between null and empty values, and in applications where these features are needed other formats are preferred.
| |
| | |
| More than 200 local, regional, and national data portals, such as those of the [[UK government]] and the [[European Commission]], use CSV files with standardized [[Database catalog|data catalogs]].<ref>{{Cite book |last1=Mahmud |first1=S M Hasan |last2=Hossin |first2=Md Altab |last3=Jahan |first3=Hosney |last4=Noori |first4=Sheak Rashed Haider |last5=Bhuiyan |first5=Touhid |chapter=CSV-ANNOTATE: Generate annotated tables from CSV file |title=2018 International Conference on Artificial Intelligence and Big Data (ICAIBD) |chapter-url=https://ieeexplore.ieee.org/document/8396169 |publisher=IEEE |publication-date=2018 |pages=71–75 |doi=10.1109/ICAIBD.2018.8396169 |isbn=978-1-5386-6987-7}}</ref>
| |
| | |
| ==Specification==
| |
| {{IETF RFC|4180}} proposes a [[Specification (technical standard)|specification]] for the CSV format; however, actual practice often does not follow the RFC and the term "CSV" might refer to any file that:<ref name="rfc4180"/><ref>{{cite web | url=http://edoceo.com/utilitas/csv-file-format | title=Comma Separated Values (CSV) Standard File Format | publisher=Edoceo, Inc | access-date=June 4, 2014 | archive-date=July 14, 2020 | archive-url=https://web.archive.org/web/20200714050729/http://edoceo.com/utilitas/csv-file-format | url-status=live }}</ref>
| |
| | |
| # is [[plain text]] using a character encoding such as [[ASCII]], various [[Unicode]] character encodings (e.g. [[UTF-8]]), [[EBCDIC]], or [[Shift JIS]],
| |
| # consists of records (typically one record per line),
| |
| # with the records divided into [[field (computer science)|field]]s separated by a comma,
| |
| # where every record has the same sequence of fields.
| |
| | |
| Within these general constraints, many variations are in use. Therefore, without additional information (such as whether RFC 4180 is honored), a file claimed simply to be in "CSV" format is not fully specified.
| |
|
| |
|
| == History == | | == History == |
| Comma-separated values is a data format that predates [[personal computer]]s by more than a decade: the [[IBM]] [[Fortran]] (level H extended) compiler under [[OS/360]] supported CSV in 1972.<ref>{{Citation |title=IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information |edition=first |date=July 1972 |id=GC28-6884-0 |url=http://bitsavers.trailing-edge.com/pdf/ibm/370/fortran/GC28-6884-0_IBM_FORTRAN_Program_Products_for_OS_and_CMS_General_Information_Jul72.pdf |page=17 |access-date=February 5, 2016 |quote=For users familiar with the predecessor FORTRAN IV G and H processors, these are the major new language capabilities |archive-date=March 4, 2016 |archive-url=https://web.archive.org/web/20160304113723/http://bitsavers.trailing-edge.com/pdf/ibm/370/fortran/GC28-6884-0_IBM_FORTRAN_Program_Products_for_OS_and_CMS_General_Information_Jul72.pdf |url-status=live }}</ref> List-directed ("free form") input/output was defined in [[FORTRAN 77]], approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces.<ref>{{Citation |title=Fortran 77 Language Reference |chapter=List-Directed I/O |chapter-url=http://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc5/index.html |publisher=Oracle |access-date=2012-10-26 |archive-date=2021-02-26 |archive-url=https://web.archive.org/web/20210226163504/https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc5/index.html |url-status=live }}</ref>
| | The CSV format predates [[personal computer]]s by more than a decade. The [[IBM]] [[Fortran]] (level H extended) compiler under [[OS/360]] supported list-directed ("free-form") input/output, with commas between values, in 1972.<ref>{{Citation |title=IBM FORTRAN Program Products for OS and the CMS Component of VM/370 General Information |edition=first |date=July 1972 |id=GC28-6884-0 |url=http://bitsavers.trailing-edge.com/pdf/ibm/370/fortran/GC28-6884-0_IBM_FORTRAN_Program_Products_for_OS_and_CMS_General_Information_Jul72.pdf |page=17 |access-date=February 5, 2016 |quote=For users familiar with the predecessor FORTRAN IV G and H processors, these are the major new language capabilities |archive-date=March 4, 2016 |archive-url=https://web.archive.org/web/20160304113723/http://bitsavers.trailing-edge.com/pdf/ibm/370/fortran/GC28-6884-0_IBM_FORTRAN_Program_Products_for_OS_and_CMS_General_Information_Jul72.pdf |url-status=live }}</ref> List-directed input/output was defined in [[FORTRAN 77]], approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces.<ref>{{Citation |title=Fortran 77 Language Reference |chapter=List-Directed I/O |chapter-url=http://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc5/index.html |publisher=Oracle |access-date=2012-10-26 |archive-date=2021-02-26 |archive-url=https://web.archive.org/web/20210226163504/https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnc5/index.html |url-status=live }}</ref> |
|
| |
|
| The term "comma-separated value" and the "CSV" abbreviation were in use by 1983.<ref>{{cite web |url=https://archive.org/stream/Infoworld-1983-09-12#page/n39/mode/2up/search/%22comma-separated+value%22 |title=SuperCalc², spreadsheet package for IBM, CP/M |access-date=December 11, 2017}}</ref> The manual for the [[Osborne Computer Corporation|Osborne]] Executive computer, which bundled the [[SuperCalc]] spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas, but the manual does not specify a convention for embedding quotation marks within quoted strings.<ref>{{cite web |url=https://archive.org/stream/bitsavers_osborneexeutiveRef1983_16634758/3F00186-00_ExecutiveRef_1983#page/n143/mode/2up/search/%22comma-separated+value%22 |title=Comma-Separated-Value Format File Structure |year=1983 |access-date=December 11, 2017}}</ref> | | The term "comma-separated value" and the "CSV" abbreviation were in use by 1983.<ref>{{cite web |url=https://archive.org/stream/Infoworld-1983-09-12#page/n39/mode/2up/search/%22comma-separated+value%22 |title=SuperCalc², spreadsheet package for IBM, CP/M |access-date=December 11, 2017}}</ref> The manual for the [[Osborne Computer Corporation|Osborne]] Executive computer, which bundled the [[SuperCalc]] spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas.<ref>{{cite web |url=https://archive.org/stream/bitsavers_osborneexeutiveRef1983_16634758/3F00186-00_ExecutiveRef_1983#page/n143/mode/2up/search/%22comma-separated+value%22 |title=Comma-Separated-Value Format File Structure |year=1983 |access-date=December 11, 2017}}</ref> |
|
| |
|
| Comma-separated value lists are easier to type (for example into [[punched card]]s) than fixed-column-aligned data, and they were less prone to producing incorrect results if a value was punched one column off from its intended location. | | Comma-separated value lists are easier to type (for example into [[punched card]]s) than fixed-column-aligned data, and they were less prone to producing incorrect results if a value was punched one column off from its intended location. |
|
| |
|
| Comma separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as [[endianness|byte-order]] and [[word size]]. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication.<ref>{{cite web |url=http://www.digitalpreservation.gov/formats/fdd/fdd000323.shtml |title=CSV, Comma Separated Values (RFC 4180) |access-date=June 4, 2014}}</ref> | | Comma separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as [[endianness|byte-order]] and [[word size]]. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication.<ref>{{cite web |url=https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml |title=CSV, Comma Separated Values (RFC 4180) |website=[[Library of Congress]] |access-date=September 22, 2025}}</ref> |
|
| |
|
| The main standardization initiative—transforming "''[[de facto]]'' fuzzy definition" into a more precise and ''[[de jure]]'' one—was in 2005, with {{IETF RFC|4180}}, defining CSV as a [[MIME content type|MIME Content Type]].<ref>{{cite IETF |rfc=4180 |title=Common Format and MIME Type for Comma-Separated Values (CSV) Files |access-date=December 22, 2020}}</ref> Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation.<ref>See [http://www.w3.org/TR/sparql11-results-csv-tsv/ sparql11-results-csv-tsv], the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.</ref> | | The main standardization initiative—transforming "''[[de facto]]'' fuzzy definition" into a more precise and ''[[de jure]]'' one—was in 2005, with {{IETF RFC|4180}}, defining CSV as a [[MIME content type|MIME Content Type]].<ref>{{cite IETF |rfc=4180 |title=Common Format and MIME Type for Comma-Separated Values (CSV) Files |access-date=December 22, 2020}}</ref> Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation.<ref>See [http://www.w3.org/TR/sparql11-results-csv-tsv/ sparql11-results-csv-tsv], the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.</ref> |
| Line 64: |
Line 42: |
| In 2015 [[W3C]], in an attempt to enhance CSV with [[Semantic Web|formal semantics]], publicized the first ''drafts of recommendations'' for CSV metadata standards, which began as ''recommendations'' in December of the same year.<ref name="w3c model for tabular data">{{cite web |url=https://www.w3.org/TR/tabular-data-model/ |title=Model for Tabular Data and Metadata on the Web |date=17 December 2015 |access-date=March 23, 2016}} (W3C Recommendation)</ref> | | In 2015 [[W3C]], in an attempt to enhance CSV with [[Semantic Web|formal semantics]], publicized the first ''drafts of recommendations'' for CSV metadata standards, which began as ''recommendations'' in December of the same year.<ref name="w3c model for tabular data">{{cite web |url=https://www.w3.org/TR/tabular-data-model/ |title=Model for Tabular Data and Metadata on the Web |date=17 December 2015 |access-date=March 23, 2016}} (W3C Recommendation)</ref> |
|
| |
|
| ==General functionality== | | == Specification == |
| CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single [[relation (database)|relation]] in a [[relational database]], or to data (though not calculations) in a typical spreadsheet. | | Casually, CSV refers to data that is [[plain text]] and consists of one record per line where each line has the same sequence of fields separated by a comma.<ref name="rfc4180"/><ref>{{cite web | url=http://edoceo.com/utilitas/csv-file-format | title=Comma Separated Values (CSV) Standard File Format | publisher=Edoceo, Inc | access-date=June 4, 2014 | archive-date=July 14, 2020 | archive-url=https://web.archive.org/web/20200714050729/http://edoceo.com/utilitas/csv-file-format | url-status=live }}</ref><ref>*{{Citation |last= Creativyst |year= 2010 |title= How To: The Comma Separated Value (CSV) File Format |publisher= creativyst.com |access-date= May 24, 2010 |url= http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm |archive-date= April 4, 2021 |archive-url= https://web.archive.org/web/20210404114001/http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm |url-status= live }}</ref> The format is more formally described in the 2005 technical standard {{IETF RFC|4180}} which codifies the CSV format and defines the [[MIME type]] <code>text/csv</code> for the handling of text-based fields. Among its requirements: |
|
| |
|
| The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms. | | * A line is terminated per MS-DOS-style: carriage return and line feed (CR/LF) sequence |
| | * A line terminator is optional for the last line |
| | * The data can start with a header record but with no way to test whether the first line is, in fact, a header, care is required when importing |
| | * Each record should contain the same number of fields |
| | * A field containing a comma, double quote or line terminator character should be enclosed in double quotes |
| | * Any field may be enclosed in double quotes |
| | * If a field is enclosed in double quotes, then a double quote embedded in the field must be represented by a sequence of two double quotes |
|
| |
|
| CSV is a [[Delimiter-separated values|delimited]] text file that uses a [[Comma (punctuation)|comma]] to separate values (many implementations of CSV import/export tools allow other separators to be used; for example, the use of a "Sep=^" row as the first row in the *.csv file will cause [[Microsoft Excel|Excel]] to open the file expecting [[Caret (punctuation)|caret]] "^" to be the separator instead of comma ","). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " ([[double quote]]) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes,<ref>*{{Citation |last= Creativyst |year= 2010 |title= How To: The Comma Separated Value (CSV) File Format |publisher= creativyst.com |access-date= May 24, 2010 |url= http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm |archive-date= April 4, 2021 |archive-url= https://web.archive.org/web/20210404114001/http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm |url-status= live }}</ref> or by prefixing a double quote with an [[escape character]] such as a [[backslash]] (for example in [[Sybase]] Central). | | Common challenges with CSV include: |
| | * Programs may not support line terminator characters within a field even when properly quoted |
| | * Programs may confuse a header line with data or interpret the first data line as a header |
| | * Double quotes in a field may not be parsed correctly |
|
| |
|
| CSV formats are not limited to a particular [[character set]].<ref name="rfc4180"/> They work just as well with [[Unicode]] character sets (such as [[UTF-8]] or [[UTF-16]]) as with ASCII (although particular programs that support CSV may have their own limitations). CSV files normally will even survive naïve translation from one character set to another (unlike nearly all proprietary data formats). CSV does not, however, provide any way to indicate what character set is in use, so that must be communicated separately, or determined at the receiving end (if possible).
| | In 2011, [[Open Knowledge Foundation]] (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. Tabular Data package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata. (CSV lacks any type information to distinguish the string <code>1</code> from the number 1.)<ref>{{cite web |title=Tabular Data Package |url=https://frictionlessdata.io/specs/tabular-data-package/ |website=Frictionless Data Specs}}</ref> The Frictionless Data Initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.<ref>{{cite web |title=CSV Dialect |url=https://frictionlessdata.io/specs/csv-dialect/ |website=Frictionless Data Specs}}</ref> |
|
| |
|
| Databases that include multiple [[relation (database)|relation]]s cannot be exported as a single CSV file{{citation needed|date=September 2014}}. Similarly, CSV cannot naturally represent [[hierarchical]] or [[object-oriented]] data. This is because every CSV record is expected to have the same structure. CSV is therefore rarely appropriate for [[documents]] created with [[HTML]], [[XML]], or other markup or word-processing technologies.
| | In 2013, the [[World Wide Web Consortium|W3C]] "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats.<ref>{{cite web|url=http://www.w3.org/2013/csvw/wiki/Main_Page|title=CSV on the Web Working Group|year=2013|publisher=[[World Wide Web Consortium|W3C]] CSV WG|access-date=2015-04-22}}</ref> The working group completed its work in February 2016 and is officially closed in March 2016 with the release of a set of documents and W3C recommendations<ref>{{cite web |url=https://github.com/w3c/csvw |title=CSV on the Web Repository |website=[[GitHub]] }} (on GitHub)</ref> for modeling "Tabular Data",<ref name="w3c model for tabular data" /> and enhancing CSV with [[metadata]] and [[Semantic Web|semantics]]. While the [[well-formed element|well-formedness]] of CSV data can readily be checked, testing validity and canonical form is less well developed, relative to more precise data formats, such as [[XML]] and [[SQL]], which offer richer types and rules-based validation.<ref>{{cite web|url=https://www.csvpath.org/topics/validation/schemas-or-rules|title=Rules Or Schemas|year=2024|publisher=CsvPath Project|access-date=2025-02-13}}</ref> |
|
| |
|
| [[Statistical database]]s in various fields often have a generally relation-like structure, but with some repeatable groups of fields. For example, health databases such as the [[Demographic and Health Survey]] typically repeat some questions for each child of a given parent (perhaps up to a fixed maximum number of children). [[Statistical analysis]] systems often include utilities that can "rotate" such data; for example, a "parent" record that includes information about five children can be split into five separate records, each containing (a) the information on one child, and (b) a copy of all the non-child-specific information. CSV can represent either the "vertical" or "horizontal" form of such data. | | == Applications == |
| | CSV is commonly used for [[data exchange]] and is widely supported by data-oriented [[application software |applications]]. It is often used to move tabular data between programs that natively operate on incompatible data {{endash}} often in formats that are [[proprietary format|proprietary]] or undocumented.<ref name="rfc4180">{{cite IETF |title=Common Format and MIME Type for CSV Files |last=Shafranovich |first=Y. |publisher=[[Internet Engineering Task Force|IETF]] |date=October 2005 |rfc=4180 |page=1 }}</ref><ref>{{cite web|url=https://datahub.io/docs/data-packages/csv|title=CSV - Comma Separated Values|access-date=2017-12-02|archive-date=2021-03-07|archive-url=https://web.archive.org/web/20210307221025/https://datahub.io/docs/data-packages/csv|url-status=dead}}</ref><ref>{{cite web | url=http://www.csvreader.com/csv_format.php | title=CSV Files | access-date=June 4, 2014 | archive-date=April 30, 2021 | archive-url=https://web.archive.org/web/20210430225021/https://www.csvreader.com/csv_format.php | url-status=live }}</ref> A common scenario is moving data from a database to a spreadsheet which, in general, use completely different formats. Most database systems can export as CSV and most spreadsheet programs can import CSV-formatted data, leveraging CSV as an intermediate format. Every major [[E-commerce |ecommerce]] platform provides support for exporting data as a CSV file.<ref>{{Cite web |title=CSV Supported Ecommerce Platforms |url=https://www.rfmcalc.com/supported-platforms/ |access-date=2025-03-09 | website=RFM Calc }}</ref> |
|
| |
|
| In a relational database, similar issues are readily handled by creating a separate relation for each such group, and connecting "child" records to the related "parent" records using a [[foreign key]] (such as an ID number or name for the parent). In markup languages such as XML, such groups are typically enclosed within a parent element and repeated as necessary (for example, multiple <code><child></code> nodes within a single <code><parent></code> node). With CSV there is no widely accepted single-file solution.
| | CSV is also used for storing data. Common data science tools such as [[Pandas (software)|Pandas]] include the option to export data to CSV for long-term storage.<ref>{{Cite web |title=pandas.DataFrame.to_csv — pandas 2.0.3 documentation |url=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html |access-date=2023-08-16 |website=pandas.pydata.org}}</ref> Benefits of CSV for data storage include the simplicity of CSV makes parsing and creating CSV files easy to implement and fast compared to other data formats, human readability making editing or fixing data simpler,<ref>{{Cite web |date=2021-09-15 |title=CSV Format: History, Advantages and Why It Is Still Popular |url=https://bytescout.com/blog/csv-format-history-advantages.html |access-date=2023-08-16 |website=ByteScout |language=en-US}}</ref> and high compressibility leading to smaller data files.<ref>{{Cite web |date=2020-07-23 |title=Comparison of different file formats in Big Data |url=https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/ |access-date=2023-08-16 |website=www.adaltas.com |language=en}}</ref> Alternatively, CSV does not support more complex data relations and makes no distinction between null and empty values, and in applications where these features are needed other formats are preferred. |
| | |
| ==Standardization==
| |
| The name "CSV" indicates the use of the comma to separate data fields. Nevertheless, the term "CSV" is widely used to refer to a large family of formats that differ in many ways. Some implementations allow or require single or double quotation marks around some or all fields; and some reserve the first record as a header containing a list of field names. The character set being used is undefined: some applications require a Unicode [[byte order mark]] (BOM) to enforce Unicode interpretation (sometimes even a UTF-8 BOM).<ref name="rfc4180"/> Files that use the tab character instead of comma can be more precisely referred to as "TSV" for tab-separated values.
| |
| | |
| Other implementation differences include the handling of more commonplace field separators (such as space or semicolon) and newline characters inside text fields. One more subtlety is the interpretation of a blank line: it can equally be the result of writing a record of zero fields, or a record of one field of zero length; thus decoding it is ambiguous.
| |
| | |
| === RFC <nowiki/>4180 and MIME standards === <!-- given markup prevents magic linking -->
| |
| The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the [[MIME type]] "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:
| |
| * MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
| |
| * An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
| |
| * Each record ''should'' contain the same number of comma-separated fields.
| |
| * Any field ''may'' be quoted (with double quotes).
| |
| * Fields containing a line-break, double-quote or commas ''should'' be quoted. (If they are not, the file will likely be impossible to process correctly.)
| |
| * ''If'' double-quotes are used to enclose fields, then a double-quote in a field ''must'' be represented by two double-quote characters.
| |
| | |
| The format can be processed by most programs that claim to read CSV files. The exceptions are ''(a)'' programs may not support line-breaks within quoted fields, ''(b)'' programs may confuse the optional header with data or interpret the first data line as an optional header, and ''(c)'' double-quotes in a field may not be parsed correctly automatically.
| |
| | |
| === OKF frictionless tabular data package ===
| |
| | |
| In 2011 [[Open Knowledge Foundation]] (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. Tabular Data package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1).<ref>{{cite web |title=Tabular Data Package |url=https://frictionlessdata.io/specs/tabular-data-package/ |website=Frictionless Data Specs}}</ref>
| |
| | |
| The Frictionless Data Initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.<ref>{{cite web |title=CSV Dialect |url=https://frictionlessdata.io/specs/csv-dialect/ |website=Frictionless Data Specs}}</ref>
| |
| | |
| === W3C tabular data standard ===
| |
| In 2013 the [[World Wide Web Consortium|W3C]] "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats.<ref>{{cite web|url=http://www.w3.org/2013/csvw/wiki/Main_Page|title=CSV on the Web Working Group|year=2013|publisher=[[World Wide Web Consortium|W3C]] CSV WG|access-date=2015-04-22}}</ref> The working group completed its work in February 2016 and is officially closed in March 2016 with the release of a set of documents and W3C recommendations<ref>{{cite web |url=https://github.com/w3c/csvw |title=CSV on the Web Repository}} (on GitHub)</ref>
| |
| for modeling "Tabular Data",<ref name="w3c model for tabular data" /> and enhancing CSV with [[metadata]] and [[Semantic Web|semantics]].
| |
| | |
| While the [[well-formed element|well-formedness]] of CSV data can readily checked, testing validity and canonical form is less well developed, relative to more precise data formats, such as [[XML]] and [[SQL]], which offer richer types and rules-based validation.<ref>{{cite web|url=https://www.csvpath.org/topics/validation/schemas-or-rules|title=Rules Or Schemas|year=2024|publisher=CsvPath Project|access-date=2025-02-13}}</ref>
| |
| | |
| ==Basic rules==
| |
| Many informal documents exist that describe "CSV" formats.
| |
| [[IETF]] RFC 4180 (summarized above) defines the format for the "text/csv" [[MIME type]] registered with the [[Internet Assigned Numbers Authority|IANA]].
| |
| | |
| Rules typical of these and other "CSV" specifications and implementations are as follows:
| |
| {{unordered list
| |
| | CSV is a [[delimited]] data format that has [[field (computer science)|fields/columns]] separated by the [[Comma (punctuation)|comma]] [[grapheme|character]] and [[row (database)|records/rows]] terminated by newlines.
| |
| | A CSV file does not require a specific [[character encoding]], [[byte order]], or line terminator format (some software do not support all line-end variations).
| |
| | A record ends at a line terminator. However, line terminators can be embedded as data within fields, so software must recognize quoted line-separators (see below) in order to correctly assemble an entire record from perhaps multiple lines.
| |
| | All records should have the same number of fields, in the same order.
| |
| | Data within fields is interpreted as a sequence of [[character (computing)|character]]s, not as a sequence of bits or bytes (see RFC 2046, section 4.1). For example, the numeric quantity 65535 may be represented as the 5 ASCII characters "65535" (or perhaps other forms such as "0xFFFF", "000065535.000E+00", etc.); but not as a sequence of 2 bytes intended to be treated as a single binary integer rather than as two characters (e.g. the numbers 11264–11519 have a comma as their high order byte: <syntaxhighlight lang="perl" inline>ord(',')*256..ord(',')*256+255</syntaxhighlight>). If this "plain text" convention is not followed, then the CSV file no longer contains sufficient information to interpret it correctly, the CSV file will not likely survive transmission across differing computer architectures, and will not conform to the ''text/csv'' MIME type.
| |
| | Adjacent fields must be separated by a single comma. However, "CSV" formats vary greatly in this choice of separator character. In particular, in [[Locale (computer software)|locale]]s where the comma is used as a decimal separator, a semicolon, [[tab key|tab character]], or other character is used instead.
| |
| <pre>1997,Ford,E350</pre>
| |
| | Any field ''may'' be ''quoted'' (that is, enclosed within double-quote characters), while some fields ''must'' be quoted, as specified in the following rules and examples:
| |
| <pre>"1997","Ford","E350"</pre>
| |
| | Fields with embedded commas or double-quote characters must be quoted.
| |
| <pre>1997,Ford,E350,"Super, luxurious truck"</pre>
| |
| | Each of the embedded double-quote characters must be represented by a pair of double-quote characters.
| |
| <pre>1997,Ford,E350,"Super, ""luxurious"" truck"</pre>
| |
| | Fields with embedded line breaks must be quoted (however, many CSV implementations do not support embedded line breaks).
| |
| <pre>
| |
| 1997,Ford,E350,"Go get one now
| |
| they are going fast"
| |
| </pre>
| |
| | In some CSV implementations{{which|date=September 2017}}, leading and trailing spaces and tabs are trimmed (ignored). Such trimming is forbidden by RFC 4180, which states "Spaces are considered part of a field and should not be ignored."
| |
| <pre>
| |
| 1997, Ford, E350
| |
| not same as
| |
| 1997,Ford,E350
| |
| </pre>
| |
| | According to RFC 4180, spaces outside quotes in a field are not allowed{{failed verification|date=January 2024}}; however, the RFC also says that "Spaces are considered part of a field and should not be ignored." and "Implementers should 'be conservative in what you do, be liberal in what you accept from others' (RFC 793, section 2.10) when processing CSV files." | |
| <!-- DO NOT CORRECT THIS INTENTIONAL ERROR IN THE EXAMPLE! -->
| |
| <pre>1997, "Ford" ,E350</pre>
| |
| <!-- DO NOT CORRECT THIS INTENTIONAL ERROR IN THE EXAMPLE! -->
| |
| | In CSV implementations that do trim leading or trailing spaces, fields with such spaces as meaningful data must be quoted.
| |
| <pre>1997,Ford,E350," Super luxurious truck "</pre> | |
| | Double quote processing need only apply if the field starts with a double quote. Note, however, that double quotes are not allowed in unquoted fields according to RFC 4180.<!-- rule for non-escaped text: %x20-21 / %x23-2B / %x2D-7E (double quotes are %x22, which is explicitly omitted. --> | |
| <pre>
| |
| Los Angeles,34°03′N,118°15′W
| |
| New York City,40°42′46″N,74°00′21″W
| |
| Paris,48°51′24″N,2°21′03″E
| |
| </pre>
| |
| | The first record may be a "header", which contains column names in each of the fields (there is no reliable way to tell whether a file does this or not; however, it is uncommon to use characters other than letters, digits, and underscores in such column names). | |
| <pre>
| |
| Year,Make,Model
| |
| 1997,Ford,E350
| |
| 2000,Mercury,Cougar
| |
| </pre>
| |
| }}
| |
| | |
| === JavaScript code ===
| |
| <syntaxhighlight lang="javascript">
| |
| function readLine(str) {
| |
| let smallArray = [];
| |
| let buf = "";
| |
| let inQuote = false;
| |
| for (let i = 0; i < str.length; i++) {
| |
| const ch = str[i];
| |
| | |
| if (ch === '"') {
| |
| if (inQuote) {
| |
| // If we're in a quote and see another quote, check if it's an escaped quote
| |
| if (i + 1 < str.length && str[i + 1] === '"') {
| |
| buf += '"'; // Add a single double quote to the buffer
| |
| i++; // Skip the next character as it's part of the escape sequence
| |
| } else {
| |
| inQuote = false; // End of a quoted field
| |
| }
| |
| } else {
| |
| // If not in a quote and see a quote, start a quoted field
| |
| inQuote = true;
| |
| }
| |
| } else if (ch === ',' && !inQuote) {
| |
| // If it's a comma and not inside a quote, it's a field separator
| |
| smallArray.push(buf);
| |
| buf = ""; // Reset buffer for the next field
| |
| } else {
| |
| // Add the character to the buffer
| |
| buf += ch;
| |
| }
| |
| }
| |
| smallArray.push(buf); // Add the last field to the array (small array)
| |
| return smallArray;
| |
| }
| |
| </syntaxhighlight>
| |
| | |
| <syntaxhighlight lang="javascript">
| |
| function processCsvString(str) {
| |
| const bigArray = []; // Big array with each line (the small array returned by the readLine function) as an element.
| |
| | |
| // Split the entire CSV string into an array of lines
| |
| const lines = str.split(/\r?\n/);
| |
| | |
| // Loop through each line
| |
| for (let i = 0; i < lines.length; i++) {
| |
| const line = lines[i];
| |
| | |
| // Trim any leading or trailing whitespace from the line
| |
| if (line.trim() === '') {
| |
| continue; // Skip empty lines
| |
| }
| |
| | |
| // Call your readLine function for each line to get the small array
| |
| const smallArray = readLine(line);
| |
| | |
| // Add the small array to the big array
| |
| bigArray.push(smallArray);
| |
| }
| |
| | |
| return bigArray; // Return the big array containing all parsed lines
| |
| }
| |
| | |
| // --- Test with dataset 1 --- | |
| console.log("--- Testing with Dataset 1 ---");
| |
| const content_1 = `Year,Make,Model
| |
| 1997,Ford,E350
| |
| 2000,Mercury,Cougar
| |
| 1997,Ford,E350,"Super, luxurious truck"
| |
| 1997,Ford,E350,"Super, ""luxurious"" truck"
| |
| "1997", "Ford" ,E350
| |
| `;
| |
| | |
| const parsed_1 = processCsvString(content_1);
| |
| console.log(parsed_1);
| |
| | |
| /* Expected Output for Dataset 1:
| |
| [
| |
| [ 'Year', 'Make', 'Model' ],
| |
| [ '1997', 'Ford', 'E350' ],
| |
| [ '2000', 'Mercury', 'Cougar' ],
| |
| [ '1997', 'Ford', 'E350', 'Super, luxurious truck' ],
| |
| [ '1997', 'Ford', 'E350', 'Super, "luxurious" truck' ],
| |
| [ '1997', ' Ford ', 'E350' ]
| |
| ]
| |
| */
| |
| | |
| | |
| // --- Test with dataset 2 ---
| |
| console.log("\n--- Testing with Dataset 2 ---");
| |
| const content_2 = `Year,Make,Model,Description,Price
| |
| 1997,Ford,E350,"ac, abs, moon",3000.00
| |
| 1999,Chevy,"Venture ""Extended Edition""","",4900.00
| |
| 1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
| |
| 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
| |
| `;
| |
| | |
| const parsed_2 = processCsvString(content_2);
| |
| console.log(parsed_2);
| |
| | |
| /* Expected Output for Dataset 2:
| |
| [
| |
| [ 'Year', 'Make', 'Model', 'Description', 'Price' ],
| |
| [ '1997', 'Ford', 'E350', 'ac, abs, moon', '3000.00' ],
| |
| [ '1999', 'Chevy', 'Venture "Extended Edition"', '', '4900.00' ],
| |
| [ '1999', 'Chevy', 'Venture "Extended Edition, Very Large"', '', '5000.00' ],
| |
| [ '1996', 'Jeep', 'Grand Cherokee', 'MUST SELL! air, moon roof, loaded', '4799.00' ]
| |
| ]
| |
| */
| |
| </syntaxhighlight> | |
| | |
| Most CSV (Comma Separated Values) files typically employ a convention where the first line serves as a header row. This row contains human-readable attribute names or column labels that describe the data in each respective column below it. For instance, in a CSV file detailing car information, the first line might read <code>Year,Make,Model,Price</code>, clearly indicating the content of the subsequent data entries.
| |
| | |
| This practice directly facilitates the interpretation and processing of the data, especially when importing it into analytical tools or programming environments. When such a CSV file is read, these attribute names from the header row are commonly utilized as keys to access the corresponding data values in each subsequent row.
| |
| | |
| This structural approach bears a strong resemblance to the Pandas DataFrame, a fundamental data structure in the Python programming language's Pandas library, widely used for data manipulation and analysis.
| |
| | |
| <syntaxhighlight lang="javascript">
| |
| | |
| function processCsvStringAsKeyValues(csvString) {
| |
| const lines = csvString.split(/\r?\n/);
| |
| const result = [];
| |
| let headers = []; // To store the column names from the first row
| |
| | |
| for (let i = 0; i < lines.length; i++) {
| |
| const line = lines[i];
| |
| | |
| if (line.trim() === '') {
| |
| continue; // Skip empty lines
| |
| }
| |
| | |
| const fields = readLine(line);
| |
| | |
| if (i === 0) {
| |
| // First line is the header
| |
| headers = fields;
| |
| } else {
| |
| // Subsequent lines are data rows
| |
| const rowObject = {};
| |
| for (let j = 0; j < headers.length; j++) {
| |
| // Assign each field to its corresponding header key
| |
| // Use a default empty string if a value is missing (e.g., if a row has fewer columns)
| |
| rowObject[headers[j]] = fields[j] !== undefined ? fields[j] : '';
| |
| }
| |
| result.push(rowObject);
| |
| }
| |
| }
| |
| | |
| return result; // Return the array of objects
| |
| }
| |
| | |
| // --- Test with dataset 1 ---
| |
| console.log("--- Testing with Dataset 1 (Key-Value Pairs) ---");
| |
| const content_1 = `Year,Make,Model
| |
| 1997,Ford,E350
| |
| 2000,Mercury,Cougar
| |
| 1997,Ford,E350,"Super, luxurious truck"
| |
| 1997,Ford,E350,"Super, ""luxurious"" truck"
| |
| "1997", "Ford" ,E350
| |
| `;
| |
| | |
| const parsed_1 = processCsvStringAsKeyValues(content_1);
| |
| console.log("Parsed Dataset 1 (Key-Value):");
| |
| console.log(parsed_1);
| |
| | |
| /* Expected Output for Dataset 1:
| |
| [
| |
| { Year: '1997', Make: 'Ford', Model: 'E350' },
| |
| { Year: '2000', Make: 'Mercury', Model: 'Cougar' },
| |
| { Year: '1997', Make: 'Ford', Model: 'E350', Description: 'Super, luxurious truck' }, // Note: Description key might be an issue if header only has 3 cols
| |
| { Year: '1997', Make: 'Ford', Model: 'E350', Description: 'Super, "luxurious" truck' },
| |
| { Year: '1997', Make: ' Ford ', Model: 'E350' }
| |
| ]
| |
| */
| |
| // A note on the above expected output: For the rows like "Super, luxurious truck", if the header only
| |
| // has "Year,Make,Model", then the extra field will not have a corresponding key unless you handle
| |
| // it explicitly or ensure your headers match your data structure.
| |
| // For the purpose of this solution, we assume the number of fields in data rows
| |
| // matches the number of headers. If a data row has more fields than headers,
| |
| // the extra fields will be ignored. If it has fewer, the corresponding values will be `''`.
| |
| | |
| | |
| // --- Test with dataset 2 ---
| |
| console.log("\n--- Testing with Dataset 2 (Key-Value Pairs) ---");
| |
| const content_2 = `Year,Make,Model,Description,Price
| |
| 1997,Ford,E350,"ac, abs, moon",3000.00
| |
| 1999,Chevy,"Venture ""Extended Edition""","",4900.00
| |
| 1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
| |
| 1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof, loaded",4799.00
| |
| `;
| |
| | |
| const parsed_2 = processCsvStringAsKeyValues(content_2);
| |
| console.log("Parsed Dataset 2 (Key-Value):");
| |
| console.log(parsed_2);
| |
| | |
| /* Expected Output for Dataset 2:
| |
| [
| |
| { Year: '1997', Make: 'Ford', Model: 'E350', Description: 'ac, abs, moon', Price: '3000.00' },
| |
| { Year: '1999', Make: 'Chevy', Model: 'Venture "Extended Edition"', Description: '', Price: '4900.00' },
| |
| { Year: '1999', Make: 'Chevy', Model: 'Venture "Extended Edition, Very Large"', Description: '', Price: '5000.00' },
| |
| { Year: '1996', Make: 'Jeep', Model: 'Grand Cherokee', Description: 'MUST SELL! air, moon roof, loaded', Price: '4799.00' }
| |
| ]
| |
| */
| |
| </syntaxhighlight>
| |
| | |
| ==Example==
| |
| {| class="wikitable"
| |
| |-
| |
| !Year||Make||Model||Description||Price
| |
| |-
| |
| |1997||Ford||E350||ac, abs, moon||3000.00
| |
| |-
| |
| |1999||Chevy||Venture "Extended Edition"|| ||4900.00
| |
| |-
| |
| |1999||Chevy||Venture "Extended Edition, Very Large"|| ||5000.00
| |
| |-
| |
| |1996||Jeep||Grand Cherokee||MUST SELL!<br>air, moon roof, loaded||4799.00
| |
| |}
| |
| | |
| The above table of data may be represented in CSV format as follows:
| |
| | |
| Year,Make,Model,Description,Price
| |
| 1997,Ford,E350,"ac, abs, moon",3000.00
| |
| 1999,Chevy,"Venture ""Extended Edition""","",4900.00
| |
| 1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
| |
| 1996,Jeep,Grand Cherokee,"MUST SELL!
| |
| air, moon roof, loaded",4799.00
| |
| | |
| Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):
| |
| | |
| Year,Make,Model,Length
| |
| 1997,Ford,E350,2.35
| |
| 2000,Mercury,Cougar,2.38
| |
| | |
| Example of an analogous European CSV/[[delimiter-separated values|DSV]] file (where the decimal separator is a comma and the value separator is a semicolon):
| |
|
| |
|
| Year;Make;Model;Length
| | More than 200 local, regional, and national data portals, such as those of the [[UK government]] and the [[European Commission]], use CSV files with standardized [[Database catalog|data catalogs]].<ref>{{Cite conference |last1=Mahmud |first1=S M Hasan |last2=Hossin |first2=Md Altab |last3=Jahan |first3=Hosney |last4=Noori |first4=Sheak Rashed Haider |last5=Bhuiyan |first5=Touhid |title=CSV-ANNOTATE: Generate annotated tables from CSV file |conference=2018 International Conference on Artificial Intelligence and Big Data (ICAIBD) |publisher=IEEE |publication-date=2018 |pages=71–75 |doi=10.1109/ICAIBD.2018.8396169 |isbn=978-1-5386-6987-7}}</ref> |
| 1997;Ford;E350;2,35
| |
| 2000;Mercury;Cougar;2,38
| |
|
| |
|
| The latter format is not RFC 4180 compliant.<ref>{{harvtxt|Shafranovich|2005}} states, "Within the header and each record, there may be one or more fields, separated by commas."</ref> Compliance could be achieved by the use of a comma instead of a semicolon as a separator and by quoting all numbers that have a decimal mark.
| | Some applications use CSV as a [[Data exchange|data interchange format]] to enhance its [[interoperability]], exporting and importing CSV. Others use CSV as an internal format. CSV is supported by almost all spreadsheets and database management systems. |
|
| |
|
| ==Application support==
| | [[Spreadsheet]]s including Apple [[Numbers (spreadsheet)|Numbers]], [[LibreOffice Calc]], and [[Apache OpenOffice]] Calc. support reading CSV files. [[Microsoft Excel]] also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., {{as of|lc=y|2019}} Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and separator is not enforced to be the comma). [[LibreOffice Calc]] CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming. |
|
| |
|
| Some applications use CSV as a [[Data exchange|data interchange format]] to enhance its [[interoperability]], exporting and importing CSV. Others use CSV as an ''internal format''.
| | Various [[relational databases]] support saving query results to a CSV file. [[PostgreSQL]] provides the <code>COPY</code> command, which allows for both saving and loading data to and from a file. {{code|2=postgres|COPY (SELECT * FROM articles) TO '/home/wikipedia/file.csv' (FORMAT csv)}} saves the content of a table <code>articles</code> to a file called <code>/home/wikipedia/file.csv</code>.<ref>{{cite web|url=https://www.postgresql.org/docs/current/sql-copy.html |title=Documentation: 14: COPY |publisher=PostgreSQL |accessdate=2024-05-12}}</ref> Some relational databases, when using standard SQL, offer ''foreign-data wrapper'' (FDW). For example, PostgreSQL offers the {{code|2=postgres|CREATE FOREIGN TABLE}}<ref>{{cite web|url=https://www.postgresql.org/docs/current/postgres-fdw.html |title=Documentation: 14: F.35. postgres_fdw |publisher=PostgreSQL |date=2022-02-10 |accessdate=2022-03-04}}</ref> and {{code|2=postgres|CREATE EXTENSION file_fdw}}<ref>{{cite web|url=https://www.postgresql.org/docs/current/file-fdw.html |title=Documentation: 14: F.14. file_fdw |publisher=PostgreSQL |date=2022-02-10 |accessdate=2022-03-04}}</ref> commands to configure any variant of CSV. Databases like [[Apache Hive]] offer the option to express CSV or [[gzip|.csv.gz]] as an internal table format. |
|
| |
|
| As a data interchange format: the CSV file format is supported by almost all spreadsheets and database management systems,
| | Programs that work with CSV may have limits on the maximum number of rows CSV files can have. Examples include Microsoft Excel (1,048,576 rows), Apple Numbers (1,000,000 rows), Google Sheets (10,000,000 cells), and OpenOffice and LibreOffice (1,048,576 rows).<ref>{{cite web | url=https://www.import2.com/csv/understanding-csv-and-row-limits | title=Understanding CSV and row limits | access-date=Feb 28, 2021 | archive-date=January 15, 2021 | archive-url=https://web.archive.org/web/20210115142749/https://www.import2.com/csv/understanding-csv-and-row-limits | url-status=dead }}</ref> |
| * [[Spreadsheet]]s including Apple [[Numbers (spreadsheet)|Numbers]], [[LibreOffice Calc]], and [[Apache OpenOffice]] Calc. [[Microsoft Excel]] also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., {{as of|lc=y|2019}} Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and separator is not enforced to be the comma). [[LibreOffice Calc]] CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
| |
| * Various [[Relational databases]] support saving query results to a CSV file. [[PostgreSQL]] provides the <code>COPY</code> command, which allows for both saving and loading data to and from a file. {{code|2=postgres|COPY (SELECT * FROM articles) TO '/home/wikipedia/file.csv' (FORMAT csv)}} saves the content of a table <code>articles</code> to a file called <code>/home/wikipedia/file.csv</code>.<ref>{{cite web|url=https://www.postgresql.org/docs/current/sql-copy.html |title=Documentation: 14: COPY |publisher=PostgreSQL |accessdate=2024-05-12}}</ref>
| |
| * Many utility programs on [[Unix]]-style systems (such as [[Cut (Unix)|cut]], [[Paste (Unix)|paste]], [[Join (Unix)|join]], [[Sort (Unix)|sort]], [[uniq]], [[AWK|awk]]) can split files on a comma delimiter, and can therefore process simple CSV files. However, this method does not correctly handle commas or new lines within quoted strings, hence it is better to use tools like csvkit or Miller.
| |
|
| |
|
| As (main or optional) internal representation. Can be [[Native and foreign format|native or foreign]], but differ from interchange format ("export/import only") because it is not necessary to create a copy in another format:
| | == See also == |
| * Some [[Spreadsheet]]s including [[LibreOffice Calc]] offers this option, without enforcing user to adopt another format.
| |
| * Some relational databases, when using standard SQL, offer ''foreign-data wrapper'' (FDW). For example, PostgreSQL offers the {{code|2=postgres|CREATE FOREIGN TABLE}}<ref>{{cite web|url=https://www.postgresql.org/docs/current/postgres-fdw.html |title=Documentation: 14: F.35. postgres_fdw |publisher=PostgreSQL |date=2022-02-10 |accessdate=2022-03-04}}</ref> and {{code|2=postgres|CREATE EXTENSION file_fdw}}<ref>{{cite web|url=https://www.postgresql.org/docs/current/file-fdw.html |title=Documentation: 14: F.14. file_fdw |publisher=PostgreSQL |date=2022-02-10 |accessdate=2022-03-04}}</ref> commands to configure any variant of CSV.
| |
| * Databases like [[Apache Hive]] offer the option to express CSV or [[gzip|.csv.gz]] as an internal table format.
| |
| * The [[emacs]] editor can operate on CSV files using csv-nav mode.<ref>{{Cite web|url=https://www.emacswiki.org/emacs/CsvNav|title=EmacsWiki: Csv Nav|website=www.emacswiki.org}}</ref>
| |
|
| |
|
| CSV format is supported by libraries available for many [[programming language]]s. Most provide some way to specify the field delimiter, [[decimal separator]], character encoding, quoting conventions, date format, etc.
| | * {{anl |Comparison of data-serialization formats}} |
| | * [[Delimiter collision]] {{endash}} Formatting challenge caused by data that contains a delimiter |
| | * {{anl |Flat-file database}} |
| | * {{anl |Tab-separated values}} |
|
| |
|
| ===Software and row limits=== | | == References == |
| Programs that work with CSV may have limits on the maximum number of rows CSV files can have.
| |
| Below is a list of common software and its limitations:<ref>{{cite web | url=https://www.import2.com/csv/understanding-csv-and-row-limits | title=Understanding CSV and row limits | access-date=Feb 28, 2021 | archive-date=January 15, 2021 | archive-url=https://web.archive.org/web/20210115142749/https://www.import2.com/csv/understanding-csv-and-row-limits | url-status=live }}</ref>
| |
| * Microsoft Excel: 1,048,576 row limit;
| |
| * Microsoft PowerShell, no row or cell limit. (Memory Limited)
| |
| * Apple Numbers: 1,000,000 row limit;
| |
| * Google Sheets: 10,000,000 cell limit (the product of columns and rows);
| |
| * OpenOffice and LibreOffice: 1,048,576 row limit;
| |
| * Sourcetable:<ref>{{cite web |url=https://sourcetable.com |title=large data spreadsheet |website=Sourcetable Inc. |date=2024 |access-date=2024-11-14}}</ref> no row limit. (Spreadsheet-database hybrid);
| |
| * Text Editors (such as [[WordPad]], [[TextEdit]], [[Vim (text editor)|Vim]], etc.): no row or cell limit;
| |
| * Databases (COPY command and FDW): no row or cell limit.
| |
|
| |
|
| ==See also==
| | {{refs}} |
| * [[Tab-separated values]]
| |
| * [[Comparison of data-serialization formats]]
| |
| * [[Delimiter-separated values]]
| |
| * [[Delimiter collision]]
| |
| * [[Flat-file database]]
| |
| * [[Simple Data Format]]
| |
| * [[Substitute character]], [[Null character]], [[General Punctuation#Block|invisible comma U+2063]]
| |
|
| |
|
| ==References== | | == Further reading == |
| {{Reflist|30em}}
| |
|
| |
|
| ==Further reading==
| |
| * {{cite web |title=IBM DB2 Administration Guide - LOAD, IMPORT, and EXPORT File Formats |publisher=[[IBM]] |url=https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm |access-date=2016-12-12 |url-status=live |archive-url=https://web.archive.org/web/20161213014111/https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm |archive-date=2016-12-13}} (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.) | | * {{cite web |title=IBM DB2 Administration Guide - LOAD, IMPORT, and EXPORT File Formats |publisher=[[IBM]] |url=https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm |access-date=2016-12-12 |url-status=live |archive-url=https://web.archive.org/web/20161213014111/https://www.columbia.edu/sec/acis/db2/db2d0/db2d053.htm |archive-date=2016-12-13}} (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.) |
|
| |
|
| {{Data Exchange}} | | {{Data Exchange}} |
|
| |
|
| [[Category:Delimiter-separated format]] | | [[Category:Delimiter-separated formats]] |
| [[Category:Open formats]] | | [[Category:Open formats]] |
| [[Category:Spreadsheet file formats]] | | [[Category:Spreadsheet file formats]] |
Template:Short description
Script error: No such module "Infobox".Template:Template otherScript error: No such module "Check for unknown parameters".
Comma-separated values (CSV) is a plain text data format for storing tabular data where the fields (values) of a record are separated by a comma and each record is a line (i.e. newline separated). CSV is commonly-used in software that generally deals with tabular data such as a database or a spreadsheet.[1] Benefits cited for using CSV include simplicity of use and human readability.[2] CSV is a form of delimiter-separated values. A CSV file is a file that contains CSV-formatted data.
CSV is not limited to a particular character encoding.[3] It works just as well with Unicode (i.e. UTF-8 or UTF-16) as with ASCII
- REDIRECT Template:En dash
Template:R protected although particular programs may have limitations. Unlike many proprietary data formats, CSV data normally survives naïve translation from one character set to another. CSV does not, however, provide a way to indicate the character encoding, so that must be communicated separately.
History
The CSV format predates personal computers by more than a decade. The IBM Fortran (level H extended) compiler under OS/360 supported list-directed ("free-form") input/output, with commas between values, in 1972.[4] List-directed input/output was defined in FORTRAN 77, approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces.[5]
The term "comma-separated value" and the "CSV" abbreviation were in use by 1983.[6] The manual for the Osborne Executive computer, which bundled the SuperCalc spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas.[7]
Comma-separated value lists are easier to type (for example into punched cards) than fixed-column-aligned data, and they were less prone to producing incorrect results if a value was punched one column off from its intended location.
Comma separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as byte-order and word size. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication.[8]
The main standardization initiative—transforming "de facto fuzzy definition" into a more precise and de jure one—was in 2005, with Template:IETF RFC, defining CSV as a MIME Content Type.[9] Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation.[10]
In 2014 IETF published Template:IETF RFC describing the application of URI fragments to CSV documents. RFC 7111 specifies how row, column, and cell ranges can be selected from a CSV document using position indexes.[11]
In 2015 W3C, in an attempt to enhance CSV with formal semantics, publicized the first drafts of recommendations for CSV metadata standards, which began as recommendations in December of the same year.[12]
Specification
Casually, CSV refers to data that is plain text and consists of one record per line where each line has the same sequence of fields separated by a comma.[3][13][14] The format is more formally described in the 2005 technical standard Template:IETF RFC which codifies the CSV format and defines the MIME type text/csv for the handling of text-based fields. Among its requirements:
- A line is terminated per MS-DOS-style: carriage return and line feed (CR/LF) sequence
- A line terminator is optional for the last line
- The data can start with a header record but with no way to test whether the first line is, in fact, a header, care is required when importing
- Each record should contain the same number of fields
- A field containing a comma, double quote or line terminator character should be enclosed in double quotes
- Any field may be enclosed in double quotes
- If a field is enclosed in double quotes, then a double quote embedded in the field must be represented by a sequence of two double quotes
Common challenges with CSV include:
- Programs may not support line terminator characters within a field even when properly quoted
- Programs may confuse a header line with data or interpret the first data line as a header
- Double quotes in a field may not be parsed correctly
In 2011, Open Knowledge Foundation (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. Tabular Data package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata. (CSV lacks any type information to distinguish the string 1 from the number 1.)[15] The Frictionless Data Initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.[16]
In 2013, the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats.[17] The working group completed its work in February 2016 and is officially closed in March 2016 with the release of a set of documents and W3C recommendations[18] for modeling "Tabular Data",[12] and enhancing CSV with metadata and semantics. While the well-formedness of CSV data can readily be checked, testing validity and canonical form is less well developed, relative to more precise data formats, such as XML and SQL, which offer richer types and rules-based validation.[19]
Applications
CSV is commonly used for data exchange and is widely supported by data-oriented applications. It is often used to move tabular data between programs that natively operate on incompatible data
- REDIRECT Template:En dash
Template:R protected often in formats that are proprietary or undocumented.[3][20][21] A common scenario is moving data from a database to a spreadsheet which, in general, use completely different formats. Most database systems can export as CSV and most spreadsheet programs can import CSV-formatted data, leveraging CSV as an intermediate format. Every major ecommerce platform provides support for exporting data as a CSV file.[22]
CSV is also used for storing data. Common data science tools such as Pandas include the option to export data to CSV for long-term storage.[23] Benefits of CSV for data storage include the simplicity of CSV makes parsing and creating CSV files easy to implement and fast compared to other data formats, human readability making editing or fixing data simpler,[24] and high compressibility leading to smaller data files.[25] Alternatively, CSV does not support more complex data relations and makes no distinction between null and empty values, and in applications where these features are needed other formats are preferred.
More than 200 local, regional, and national data portals, such as those of the UK government and the European Commission, use CSV files with standardized data catalogs.[26]
Some applications use CSV as a data interchange format to enhance its interoperability, exporting and importing CSV. Others use CSV as an internal format. CSV is supported by almost all spreadsheets and database management systems.
Spreadsheets including Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc. support reading CSV files. Microsoft Excel also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., since 2019[update]Template:Dated maintenance category (articles)Script error: No such module "Check for unknown parameters". Excel still cannot export CSV files in the commonly used UTF-8 character encoding, and separator is not enforced to be the comma). LibreOffice Calc CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
Various relational databases support saving query results to a CSV file. PostgreSQL provides the COPY command, which allows for both saving and loading data to and from a file. COPY (SELECT * FROM articles) TO '/home/wikipedia/file.csv' (FORMAT csv) saves the content of a table articles to a file called /home/wikipedia/file.csv.[27] Some relational databases, when using standard SQL, offer foreign-data wrapper (FDW). For example, PostgreSQL offers the CREATE FOREIGN TABLE[28] and CREATE EXTENSION file_fdw[29] commands to configure any variant of CSV. Databases like Apache Hive offer the option to express CSV or .csv.gz as an internal table format.
Programs that work with CSV may have limits on the maximum number of rows CSV files can have. Examples include Microsoft Excel (1,048,576 rows), Apple Numbers (1,000,000 rows), Google Sheets (10,000,000 cells), and OpenOffice and LibreOffice (1,048,576 rows).[30]
See also
- REDIRECT Template:En dash
Template:R protected Formatting challenge caused by data that contains a delimiter
References
<templatestyles src="Reflist/styles.css" />
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b c Template:Cite IETF
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Template:Cite IETF
- ↑ See sparql11-results-csv-tsv, the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.
- ↑ Template:Cite IETF
- ↑ a b Script error: No such module "citation/CS1". (W3C Recommendation)
- ↑ Script error: No such module "citation/CS1".
- ↑ *Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1". (on GitHub)
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
Script error: No such module "Check for unknown parameters".
Further reading
- Script error: No such module "citation/CS1". (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.)
Template:Data Exchange