Pandas read_csv() - Reads CSV and delimited files in Pandas • datagy (2023)

In this tutorial, you'll learn how to use the pandas read_csv() function to read CSV (or other delimited) files into DataFrames. CSV files are a ubiquitous file format that you'll encounter regardless of the industry you work in. Being able to read them effectively into pandas DataFrames is an important skill for any pandas user.

By the end of this tutorial you will have learned the following:

  • How to use the pandas read_csv() function
  • How to customize CSV file reading by specifying columns, headers, data types and more
  • How to limit the number of rows pandas reads
  • And much more

Understand Pandas' read_csv() function

The pandas read_csv() function is one of the most commonly used functions in pandas. The function provides a ton of functionality. In this tutorial, we'll cover the most important parameters of the function, which give you significant flexibility. In fact, you'll get a comprehensive overview of the pandas read_csv() function.

Take a look at the function signature below to get a sense of the many different parameters available:

import pandas as pd

pd.read_csv(
    filepath_or_buffer, *, sep=',', delimiter=None, header='infer',
    names=_NoDefault.no_default, index_col=None, usecols=None, squeeze=None,
    prefix=_NoDefault.no_default, mangle_dupe_cols=True, dtype=None,
    engine=None, converters=None, true_values=None, false_values=None,
    skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None,
    na_values=None, keep_default_na=True, na_filter=True, verbose=False,
    skip_blank_lines=True, parse_dates=None, infer_datetime_format=False,
    keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True,
    iterator=False, chunksize=None, compression='infer', thousands=None,
    decimal='.', lineterminator=None, quotechar='"', quoting=0,
    doublequote=True, escapechar=None, comment=None, encoding=None,
    encoding_errors='strict', dialect=None, error_bad_lines=None,
    warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False,
    low_memory=True, memory_map=False, float_precision=None,
    storage_options=None
)

As mentioned above, you won't learn about all of these parameters. However, you will learn about the most important ones, including:

  • filepath_or_buffer= provides a string representing the path to the file, including local files, URLs, and URL schemes (e.g. for S3 storage)
  • sep= and delimiter= use a string to specify which character(s) delimit the file
  • header= specifies the row number(s) to use as column names and can be used to indicate that there is no header in the file (with None)
  • names= is used to provide a list of column names, either when column headers are not provided or when you want to override them
  • usecols= is used to specify which columns to read in by passing a list of column labels
  • skiprows= and skipfooter= can specify a number of rows to skip at the top or bottom (and the skiprows= parameter can even accept a callable)
  • parse_dates= accepts a list of columns to parse as dates

The list above covers the most common parameters, which provide most of the functionality you need to read CSV files in pandas.

How to read a CSV file with pandas

To read a CSV file in pandas, you can use the read_csv() function and simply pass in the path to the file. In fact, the only required parameter of the pandas read_csv() function is the path to the CSV file. Let's look at an example CSV file:

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

We can save this data to a file called sample1.csv. To read this CSV file with pandas, we can simply pass the path to this file into our function call. Let's see what this looks like:

# How to read a CSV file with Pandas
import pandas as pd

df = pd.read_csv('sample1.csv')
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo

We can see how easy it was to read this CSV file with pandas. Of course, it helped that the CSV was clean and well-structured. In the next few sections, you'll learn how to read CSV files that aren't as neatly structured.

There are a few more things to note here:

  1. Pandas read the first row as the column labels of the dataset,
  2. Pandas assumed the file was comma-delimited, and
  3. The index was created using a range index.
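The three observations above can be checked directly. The snippet below is a minimal sketch that uses an in-memory string (via io.StringIO) as a stand-in for sample1.csv, so it runs without a file on disk; the variable names are only illustrative.

```python
import io

import pandas as pd

# In-memory stand-in for sample1.csv (an assumption, so no file is needed)
csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

df = pd.read_csv(io.StringIO(csv_text))

# 1. The first row became the column labels
print(list(df.columns))

# 2. The default comma delimiter split each row into four fields
print(df.shape)

# 3. No index column was specified, so pandas created a RangeIndex
print(type(df.index))
```

read_csv() accepts any file-like object as well as a path, which is why the StringIO buffer works here.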

Now let's explore how to use a custom delimiter when reading CSV files.

How to use a custom delimiter in pandas read_csv()

To use a custom delimiter when reading CSV files in pandas, you can use either the sep= or the delimiter= argument. By default, this is set to sep=',', which means that pandas assumes the file is comma-delimited.

Let's take a look at another dataset, which we have saved as sample2.csv:

Name;Age;Location;Company
Nik;34;Toronto;datagy
Kate;33;New York City;Apple
Joe;40;Frankfurt;Siemens
Nancy;23;Tokyo;Nintendo

The dataset above is the same dataset we worked with previously. However, the values are now separated by semicolons instead of commas. Since this differs from the default, we now need to pass the delimiter into the function explicitly, as shown below:

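# How to read a CSV file in Pandas with a custom delimiter
import pandas as pd

df = pd.read_csv('sample2.csv', sep=';')
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo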

We can see that by specifying the delimiter pandas was able to read the file correctly. Since delimiters can vary widely, it's good to know how to handle these cases.

Similarly, if your data were tab-delimited, you could use sep='\t'.
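As a quick illustration of the tab-delimited case, here is a small sketch that reads the same kind of data from an in-memory string. The data and variable names are made up for the example. It also shows what happens when the delimiter is left at its default: every line ends up in a single column.

```python
import io

import pandas as pd

# Hypothetical tab-delimited data, held in memory instead of a file
tsv_text = (
    "Name\tAge\tLocation\tCompany\n"
    "Nik\t34\tToronto\tdatagy\n"
    "Kate\t33\tNew York City\tApple\n"
)

# Passing sep='\t' splits each line on tabs
df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tab.shape)

# With the default sep=',', no commas are found, so each line is one field
df_wrong = pd.read_csv(io.StringIO(tsv_text))
print(df_wrong.shape)
```

A DataFrame with a single, oddly named column is usually a good hint that the wrong delimiter was used.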

How to specify a header in pandas read_csv()

By default, pandas infers whether or not to read a header row. This behavior can be controlled with the header= parameter, which accepts the following values:

  • an integer representing the row to read,
  • a list of integers representing the rows to read,
  • None if there is no header, and
  • 'infer', which attempts to infer the header from the data.

So far, pandas has inferred that the header of the dataset was in row 0. However, take a look at the dataset shown below, which we have saved as sample3.csv:

Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

We can see that it is the same dataset, but without a header row. In these cases, we need to pass in the column names explicitly. Let's see what reading this file looks like:

# Specify a header in a CSV file
import pandas as pd

cols = ['Name', 'Age', 'City', 'Company']
df = pd.read_csv('sample3.csv', header=None, names=cols)
print(df.head())

# Returns:
#     Name  Age           City   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo

With our block of code above, we have actually accomplished two things:

  1. We instructed Pandas not to read a line from the CSV file as a header, and
  2. We passed custom column names to the DataFrame
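To see how header= and names= interact, the sketch below covers two related cases using in-memory data. In the first, header=None is used without names=, so pandas falls back to integer column labels. In the second, a file that does have a header is read with header=0 together with names=, which replaces the existing labels. The lowercase column names are purely illustrative.

```python
import io

import pandas as pd

# Case 1: no header row and no names= gives integer column labels
headerless = "Nik,34,Toronto,datagy\nKate,33,New York City,Apple\n"
df_default_names = pd.read_csv(io.StringIO(headerless), header=None)
print(list(df_default_names.columns))

# Case 2: the file has a header, but we want our own labels.
# header=0 tells pandas that row 0 is a header to be replaced by names=.
with_header = "Name,Age,Location,Company\nNik,34,Toronto,datagy\n"
df_renamed = pd.read_csv(
    io.StringIO(with_header),
    header=0,
    names=["name", "age", "city", "company"],
)
print(list(df_renamed.columns))
print(len(df_renamed))
```

Without header=0 in the second case, the original header row would be read as a row of data.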

Now let's look at how we can skip rows using the pandas read_csv() function.

Pandas offers significant flexibility in skipping rows when reading CSV files, including:

  1. Skipping a specified number of rows from the top,
  2. Skipping specific rows by passing a list of row numbers,
  3. Skipping rows using a callable, and
  4. Skipping rows from the bottom

Let's see how this works:

Skipping lines when reading a CSV file in pandas

In some cases, reporting tools include rows of information about a report, such as a title. We can skip these by providing a single number of rows or a list of rows to skip. Take a look at our sample dataset, which we'll refer to as sample4a.csv:

Sample Report

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

We can see that we want to skip the first two rows of the file. To do this, we can simply pass in skiprows=2, as shown below:

# Skipping rows when reading a CSV file
import pandas as pd

df = pd.read_csv('sample4a.csv', skiprows=2)
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo

We can see that pandas simply skipped over the first two rows of the file. This allowed us to avoid reading rows that are not part of the actual dataset.
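Skipping a list of rows works in much the same way. The sketch below uses hypothetical in-memory data with a one-line title and passes skiprows=[0, 2]. The numbers refer to 0-based line numbers in the file, so this skips the title line and the first data row while keeping the header on line 1.

```python
import io

import pandas as pd

# Hypothetical report: line 0 is a title, line 1 is the header
report_text = """Sample Report
Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# Skip line 0 (the title) and line 2 (the row for Nik)
df_skipped = pd.read_csv(io.StringIO(report_text), skiprows=[0, 2])
print(df_skipped)
```

Because the header sits on line 1 and is not in the list, pandas still uses it for the column labels.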

Using a callable (function) to skip rows in pandas read_csv()

Pandas also allows you to pass in a callable, which lets you skip rows that meet a condition. At first glance, this may seem confusing. However, the callable can be used to read, for example, only every second or fifth row. Let's look at how we can read only every other row of our dataset (using the previous sample1.csv):

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

To read only every other row, you can pass the following lambda callable into the skiprows= parameter:

# Skip rows when reading a CSV file
import pandas as pd

df = pd.read_csv('sample1.csv', skiprows=lambda x: x % 2)
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0   Kate   33  New York City     Apple
# 1  Nancy   23          Tokyo  Nintendo

In the code block above, we passed in the lambda function lambda x: x % 2. The function receives each row number and checks whether there is a remainder from the modulo operation. If there is, the value is truthy, which means the row is skipped. Row 0 (the header) has no remainder, so it is kept.
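Because the callable receives the row number and skips a row whenever it returns a truthy value, it helps to compare two conditions side by side. The sketch below is only an illustration on in-memory data: the first call repeats the lambda from the article, and the second uses a hypothetical variation that always keeps the header and then keeps the odd-numbered rows instead.

```python
import io

import pandas as pd

csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# Rows 1 and 3 give a remainder of 1 (truthy), so they are skipped
every_other = pd.read_csv(io.StringIO(csv_text), skiprows=lambda x: x % 2)
print(every_other["Name"].tolist())

# Variation: never skip row 0 (the header), skip the even-numbered rows after it
keep_first = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda x: x > 0 and x % 2 == 0,
)
print(keep_first["Name"].tolist())
```

Writing the condition as an explicit comparison, as in the second call, makes it easier to see which rows are being dropped.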

Skip footers when reading a CSV file with pandas

Similarly, pandas allows you to skip rows in the footer of a dataset. This can be useful if the reporting software adds lines describing things like the date the report was run.

Take a look at the dataset below, which we've saved as sample4b.csv:

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

Date of execution: 05/05/2023

To remove the bottom two rows, we can pass in skipfooter=2, as shown below:

# Skip the footer when reading a CSV file
import pandas as pd

df = pd.read_csv('sample4b.csv', skipfooter=2, engine='python')
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo

In the code block above, we passed two arguments:

  1. skipfooter=2, which indicates that the bottom two rows should be skipped, and
  2. engine='python', which specifies the engine we want to use to read the data. Although not strictly necessary, pandas will raise a ParserWarning otherwise.

In the following section, you'll learn how to read only a number of rows with the pandas read_csv() function.
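Here is a small, self-contained sketch of the same idea. It assumes a simpler, hypothetical file in which a single footer line directly follows the data, so skipfooter=1 is enough. The data is built in memory so the example runs without a file.

```python
import io

import pandas as pd

# Hypothetical report with one footer line directly after the data
lines = [
    "Name,Age,Location,Company",
    "Nik,34,Toronto,datagy",
    "Kate,33,New York City,Apple",
    "Joe,40,Frankfurt,Siemens",
    "Nancy,23,Tokyo,Nintendo",
    "Date of execution: 05/05/2023",
]
report_text = "\n".join(lines)

# skipfooter is only supported by the Python engine, so we request it explicitly
df_no_footer = pd.read_csv(
    io.StringIO(report_text),
    skipfooter=1,
    engine="python",
)
print(df_no_footer)
```

The Python engine is slower than the default C engine, so for very large files it may be worth cleaning the footer off in another way.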

How to read only a certain number of rows in pandas read_csv()

When working with large datasets, it can be helpful to read only a certain number of rows. This can be useful when working with datasets that are too large to hold in memory, or when you simply want to take a look at a portion of the data.

To read only a number of rows, you can use nrows=, which accepts an integer. Let's continue using our original dataset, sample1.csv:

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

In the code block below, we use the nrows= parameter to read only 2 of the rows:

# Read only a number of rows in Pandas
import pandas as pd

df = pd.read_csv('sample1.csv', nrows=2)
print(df.head())

# Returns:
#    Name  Age       Location Company
# 0   Nik   34        Toronto  datagy
# 1  Kate   33  New York City   Apple

In the code block above, we specified that we only wanted to read two rows. This prevents you from loading more data into memory than you need.
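The nrows= parameter can also be combined with skiprows= to read a window of rows from the middle of a file. The sketch below is one way this might be done: it skips the first two data rows (lines 1 and 2, leaving the header on line 0 intact) and then reads the next two. The data is held in memory for the sake of the example.

```python
import io

import pandas as pd

csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# Skip lines 1 and 2 (Nik and Kate), keep the header, then read two rows
df_window = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, 3),
    nrows=2,
)
print(df_window)
```

Starting the range at 1 rather than 0 is what keeps the header row in place.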

The following section shows you how to read only a few columns in a CSV file.

How to read only some columns in pandas read_csv()

Pandas also allows you to read only specific columns when loading a dataset. In particular, the function allows you to specify columns using two different types of values passed into the usecols= parameter:

  1. A list of column labels or
  2. A callable (function)

In most cases, you'll end up passing in a list of column labels. When using a callable, the callable is evaluated against each column name, and only the columns for which it returns True are read.

Let's see how we can pass in a list of column labels to read only a few columns in pandas. For this, we'll use our original sample1.csv file, as shown below:

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

Now let's see how we can use the usecols= parameter to read only a subset of columns:

# Read only a number of columns in Pandas
import pandas as pd

df = pd.read_csv('sample1.csv', usecols=['Name', 'Age'])
print(df.head())

# Returns:
#     Name  Age
# 0    Nik   34
# 1   Kate   33
# 2    Joe   40
# 3  Nancy   23

We can see in the code block above that we used the usecols= parameter to pass in a list of column labels. This allowed us to read only a few columns from the dataset. It's important to note that we can also pass in a list of column positions. To replicate the example above, we could also use usecols=[0, 1].

Another important note is that the order of these values does not matter. Using usecols=[0, 1] yields the same dataset as usecols=[1, 0].
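The callable form of usecols= and the note about ordering can both be illustrated with a short sketch. The example below uses in-memory data. The lambda, which keeps only columns named 'Name' or 'Company', is just one possible condition.

```python
import io

import pandas as pd

csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# The callable is evaluated against each column name
df_callable = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda col: col in {"Name", "Company"},
)
print(list(df_callable.columns))

# The order of the positions passed to usecols= is ignored
df_a = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])
df_b = pd.read_csv(io.StringIO(csv_text), usecols=[1, 0])
print(df_a.equals(df_b))

# To change the column order, reorder the columns after reading
df_reordered = df_a[["Age", "Name"]]
print(list(df_reordered.columns))
```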

How to specify an index column in pandas read_csv()

To specify an index column when reading a CSV file in pandas, you can pass the following into the index_col= parameter:

  1. A column label or position (integer),
  2. A list of column labels or positions,
  3. False, which forces pandas to not use the first column as the index.

Let's see how we can use our sample1.csv file and read the Name column as the index:

# Set an index column when reading a CSV file
import pandas as pd

df = pd.read_csv('sample1.csv', index_col='Name')
print(df.head())

# Returns:
#        Age       Location   Company
# Name
# Nik     34        Toronto    datagy
# Kate    33  New York City     Apple
# Joe     40      Frankfurt   Siemens
# Nancy   23          Tokyo  Nintendo

We can see that we passed the Name column into the index_col= parameter. This allowed us to read that column as the index of the resulting DataFrame.
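One practical benefit of setting an index column is label-based lookup. The sketch below, again using in-memory data, reads the Name column as the index by label and by position, and then looks up a value with .loc.

```python
import io

import pandas as pd

csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# Index by column label
df_label = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# Index by column position (0 is the first column)
df_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_label.equals(df_position))

# With names as the index, rows can be selected by label
print(df_label.loc["Kate", "Age"])
```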

How to parse dates in pandas read_csv()

When it comes to reading columns as dates, pandas again offers significant flexibility. The parse_dates= parameter gives you a number of different options for parsing dates:

  1. A boolean, indicating whether the index column should be parsed as a date,
  2. A list of integers or column labels, where each column is parsed as a separate date column,
  3. A list of lists, where the columns in each inner list are combined and parsed as a single date column, and
  4. A dictionary of {'column_name': ['list', 'of', 'individual', 'columns']}, where the key represents the name of the resulting column.

First, let's look at a simple example where we have a date stored in a column named 'Date', as shown in sample5.csv:

Name,Year,Month,Day,Date
Nik,2022,5,5,"2022-05-05"
Kate,2023,6,6,"2023-06-06"
Joe,2024,7,7,"2024-07-07"
Nancy,2025,8,8,"2025-08-08"

To read the Date column as a date, you can pass the label in a list into the parse_dates= parameter, as shown below:

# Parse dates when reading CSV files in Pandas
import pandas as pd

df = pd.read_csv('sample5.csv', parse_dates=['Date'])
print(df.head())

# Returns:
#     Name  Year  Month  Day       Date
# 0    Nik  2022      5    5 2022-05-05
# 1   Kate  2023      6    6 2023-06-06
# 2    Joe  2024      7    7 2024-07-07
# 3  Nancy  2025      8    8 2025-08-08

We can see that the resulting DataFrame read the Date column correctly. We also have three columns representing the year, month and day. We could pass in a list of lists containing those columns. However, pandas would name the resulting column 'Year_Month_Day', which isn't great.

Instead, let's pass a dictionary labeling the column as shown below:

# Parse dates when reading CSV files in Pandas
import pandas as pd

df = pd.read_csv('sample5.csv', parse_dates={'Other Date': ['Year', 'Month', 'Day']})
print(df.head())

# Returns:
#   Other Date   Name        Date
# 0 2022-05-05    Nik  2022-05-05
# 1 2023-06-06   Kate  2023-06-06
# 2 2024-07-07    Joe  2024-07-07
# 3 2025-08-08  Nancy  2025-08-08

In the code block above, we passed in parse_dates={'Other Date': ['Year', 'Month', 'Day']}, where the key represents the resulting column label and the value represents the columns to read in.

If you're working with different date formats, it's best to simply read in the data first. Then, you can use the pd.to_datetime() function to correctly format the column.
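The read-first, convert-later approach might look like the sketch below. The data is hypothetical: it includes a 'Joined' column in day/month/year format (an assumption made for this example, not part of sample5.csv), which is converted using an explicit format string. The same function can also assemble a date from separate year, month and day columns.

```python
import io

import pandas as pd

# Hypothetical data: 'Joined' uses a day/month/year format
csv_text = """Name,Year,Month,Day,Joined
Nik,2022,5,5,25/12/2022
Kate,2023,6,6,13/06/2023
"""

# Read everything as-is first; 'Joined' comes in as plain strings
df = pd.read_csv(io.StringIO(csv_text))

# Convert the string column using an explicit format
df["Joined"] = pd.to_datetime(df["Joined"], format="%d/%m/%Y")

# Assemble a date from separate columns; to_datetime looks for
# columns named year, month and day, so we lowercase the labels first
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
df["Other Date"] = pd.to_datetime(parts)

print(df[["Name", "Joined", "Other Date"]])
```

Passing format= explicitly avoids any guessing about whether the day or the month comes first.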

How to specify data types in pandas read_csv()

In most cases, pandas can correctly infer the data types of your columns. However, specifying the data types can significantly speed up reading the dataset and help correct any incorrect assumptions. To specify a data type when reading a CSV file with pandas, you can use the dtype= parameter.

Let's see how we can specify the data types of our original dataset, sample1.csv, as shown below:

Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo

To do this, we can pass in a dictionary of column labels and their associated data types, as shown below:

# Specifying data types with Pandas read_csv()
import pandas as pd

df = pd.read_csv('sample1.csv', dtype={'Name': str, 'Age': int, 'Location': str, 'Company': str})
print(df.head())

# Returns:
#     Name  Age       Location   Company
# 0    Nik   34        Toronto    datagy
# 1   Kate   33  New York City     Apple
# 2    Joe   40      Frankfurt   Siemens
# 3  Nancy   23          Tokyo  Nintendo

The sample dataset we worked with above had data types that were easy to infer. However, the power of this parameter comes into play when you want to reduce the memory footprint of a dataset by specifying smaller data types, such as np.int32.
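To get a feel for the effect of a smaller data type, the sketch below reads the same in-memory data twice, once with the inferred types and once with the Age column set to np.int32, and compares the memory used by that column. On a dataset this small the saving is tiny, but the same approach applies to larger files.

```python
import io

import numpy as np
import pandas as pd

csv_text = """Name,Age,Location,Company
Nik,34,Toronto,datagy
Kate,33,New York City,Apple
Joe,40,Frankfurt,Siemens
Nancy,23,Tokyo,Nintendo
"""

# Let pandas infer the types
df_default = pd.read_csv(io.StringIO(csv_text))

# Request a smaller integer type for the Age column only
df_small = pd.read_csv(io.StringIO(csv_text), dtype={"Age": np.int32})

print(df_default["Age"].dtype, df_small["Age"].dtype)

# Compare the bytes used by the column's values (index excluded)
default_bytes = df_default["Age"].memory_usage(index=False)
small_bytes = df_small["Age"].memory_usage(index=False)
print(default_bytes, small_bytes)
```

Only the columns named in the dictionary are affected; the remaining columns are still inferred as usual.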

Conclusion

In this tutorial, you learned how to use the pandas read_csv() function to read CSV files (or other delimited files). The function offers a tremendous amount of flexibility when reading files. For example, you can use the function to specify delimiters, set index columns, parse dates, and more.

Article information

Author: Greg O'Connell

Last Updated: 03/20/2023
