palavido

Standarizing Excel File Headers via Python for Use in ArcGIS

Blog Post created by palavido on Jul 18, 2014

If your organization is anything like ours, there are probably hidden silos of legacy data hiding everywhere. In the past, a lot of users in various departments have relied on excel for tracking various types of information. I recently worked on a project that required taking information from 900+ excel files and importing the data into a single geodatabase table. Luckily for me, the excel files were all set up in a consistent format....unluckily for me that format consisted of mutli-row headers with some headers consisting of merged cells. To make things even more complex, each of the 900+ excel files each had 4 worksheets, each representing a specific geographic area (in this case quarter sections).

 

I knew in order to import the data into ArcGIS, I would need single row (non merged cell) headers. I didn't want to have to go through all 3600+ worksheets manually and reformat the worksheets. That lead me to dive in and see if I could come up with a somewhat quick and dirty solution using python. After some fumbling around, and much to my surprise, I ended up with a working solution. I thought it might be helpful if I shared my experience in case anyone else finds it useful.

 

When I first started looking into Python as a solution, I came to realize that I would need to use some additional python packages that aren't included by default with the ArcGIS installation of python. I found the easiest way to get these packages and install them was to use a python tool called "pip". You can download and install pip using the links and instructions found here https://pip.pypa.io/en/latest/installing.html. Below is a quick overview of how I installed it:

 

  1. Download/Save "get-pip.py" (see the link above). I placed the file in the directory that ArcGIS installed Python in.
  2. Open a command prompt and navigate to the Python install directory
  3. Execute the "get-pip.py" script by typing "python get-pip.py" pipinstall.png

After doing that, I determined that I would need to install two additional python packages, "xlrd" which allows you to read excel files and "xlwt" which allows you to write excel files. This was easily done using pip that was installed earlier. Below is the process I used to install xlrd and xlwt:

 

  1. In the command prompt still open from before, navigate to the "scripts" directory in the Python installetion folder. This is where pip is installed.
  2. Type "pip install xlrd" to install xlrd. Follow the same process to install xlwt (i.e. type "pip install xlwt")xlrd.png

Now that xlrd and xlwt are installed, we can look at our excel files and start writing our python code. In our case, we had excel files that had headers on the first 3 rows of the excel spreadsheet. These headers consisted of different combinations of merged cells(see blue and yellow highlights below). The actual data did not begin until the 4th row (see red highlight below).excel_original.png

So here is where we begin with the python code. The first step was to import the xlrd, xlwt, and datetime modules:

 

 

import xlrd, xlwt, datetime

 

 

 

The next step is to tell python which excel file we will be opening for reading and to designate a new output file for the reformatted version. In this case, I simply had it use the same name but with "_GIS" appended to the file name:

 

# Original Excel File Location
file_location = r"N:\GIS\Projects\RightOfWay\SourceIn\20140707_Indices\030618.xls"

# Cleaned Up Excel File Name
new_file_location = file_location[:-4] + "_GIS.xls"

# Open the existing workbook for reading
workbook = xlrd.open_workbook(file_location)

 

 

Now, we need to tell python which worksheet we will be reformatting. The xlrd module counts the worksheets starting with the number 0, so if you want the first worksheet, you would use 0 to reference it.

 

# Open the worksheet from the existing file
sheet = workbook.sheet_by_index(0)
sheet_name = sheet.name

 

We will need to get the number of rows in the existing worksheet. We need this later in order to iterate over the rows and rewrite them to a new file:

 

# Get the Number of Rows

sheet_rows = sheet.nrows

 

It is now time to create the new excel file and worksheet that we will be writing the reformatted data to. In this example, I create the new excel file and add a worksheet simply called "GIS":

 

# Create the new workbook
new_workbook = xlwt.Workbook()

# Add a blank sheet to the workbook
new_sheet = new_workbook.add_sheet("GIS", cell_overwrite_ok=True)

 

We need to add the headers to the worksheet as we want them to appear in ArcGIS. remember, xlrd and xlwt start their numbering at 0, so if we want to indicate the first row and first column, it would be 0,0. The second column would be 0,1 and so on.

 

# Add the Header Row to the new sheet

new_sheet.write(0,0,'DocumentNumber')

new_sheet.write(0,1,'AcquisitionPurpose')

new_sheet.write(0,2,'LegalDescription')

new_sheet.write(0,3,'GrantorOrGrantee')

new_sheet.write(0,4,'Conveyance')

new_sheet.write(0,5,'Acquisition')

new_sheet.write(0,6,'Deed')

new_sheet.write(0,7,'Quitclaim')

new_sheet.write(0,8,'Easement')

new_sheet.write(0,9,'Agreement')

new_sheet.write(0,10,'Lease')

new_sheet.write(0,11,'License')

new_sheet.write(0,12,'Condemnation')

new_sheet.write(0,13,'Permit')

new_sheet.write(0,14,'Other')

new_sheet.write(0,15,'APN')

new_sheet.write(0,16,'DocumentDate')

new_sheet.write(0,17,'RecordingDate')

new_sheet.write(0,18,'InstrumentNumber')

new_sheet.write(0,19,'ReferenceAndRemarks')

 

Now we need to iterate over the rows in the existing spreadsheet and write that to the new spreadsheet. If you recall, our data starts on row 4 of the spreadsheet. Since xlrd starts its numbering at 0, we actually need to tell the code to start at 3. You'll see that in the range statement below. You'll also notice that the code is checking if a cell contains a date as well. This is necessary in order for the dates to be written correctly. Excel stores dates behind the scenes as floating numbers but we don't want them to display as such in the output excel file.

 

#Copy rows from existing sheets

for rows in range (3, sheet_rows):

    data = [sheet.cell_value(rows, col) for col in range(sheet.ncols)]

    for index, value in enumerate(data):

        # Check to see if the cell contains a date, if so, format the output cell as a date

        if sheet.cell_type(rows,index) == 3:

            style = xlwt.easyxf(num_format_str='YYYY-MM-DD')

            value_date = datetime.datetime(*xlrd.xldate_as_tuple(value, workbook.datemode))

            new_sheet.write(rows-2, index, value_date, style)

        else:

            new_sheet.write(rows-2, index, value)

 

The very last step is to save the new workbook.

 

#Save the workbook

new_workbook.save(new_file_location)

 

The new excel file has now been reformatted to have single row headers and can easily be digested into ArcGIS.

 

excel_reformatted.pngThis example shows how to do a single worksheet in a single excel file. It can fairly easily be modified to go through an entire directory of excel files and multiple worksheets (which I have done and can share if anyone is interested), however I thought it was important to keep this example basic to illustrate the reformatting concept. Below is the same code explained above all put together.

 

import xlrd, xlwt, datetime
# Original Excel File Location
file_location = r"N:\GIS\Projects\RightOfWay\SourceIn\20140707_Indices\030618.xls"
# Cleaned Up Excel File Name
new_file_location = file_location[:-4] + "_GIS.xls"
# Open the existing workbook for reading
workbook = xlrd.open_workbook(file_location)
# Open the worksheet from the existing file
sheet = workbook.sheet_by_index(0)
sheet_name = sheet.name
# Get the Number of Rows
sheet_rows = sheet.nrows
# Create the new workbook
new_workbook = xlwt.Workbook()
# Add a blank sheet to the workbook
new_sheet = new_workbook.add_sheet("GIS", cell_overwrite_ok=True)
# Add the Header Row to the new sheet
new_sheet.write(0,0,'DocumentNumber')
new_sheet.write(0,1,'AcquisitionPurpose')
new_sheet.write(0,2,'LegalDescription')
new_sheet.write(0,3,'GrantorOrGrantee')
new_sheet.write(0,4,'Conveyance')
new_sheet.write(0,5,'Acquisition')
new_sheet.write(0,6,'Deed')
new_sheet.write(0,7,'Quitclaim')
new_sheet.write(0,8,'Easement')
new_sheet.write(0,9,'Agreement')
new_sheet.write(0,10,'Lease')
new_sheet.write(0,11,'License')
new_sheet.write(0,12,'Condemnation')
new_sheet.write(0,13,'Permit')
new_sheet.write(0,14,'Other')
new_sheet.write(0,15,'APN')
new_sheet.write(0,16,'DocumentDate')
new_sheet.write(0,17,'RecordingDate')
new_sheet.write(0,18,'InstrumentNumber')
new_sheet.write(0,19,'ReferenceAndRemarks')
#Copy rows from existing sheets
for rows in range (3, sheet_rows):
    data = [sheet.cell_value(rows, col) for col in range(sheet.ncols)]
    for index, value in enumerate(data):
        # Check to see if the cell contains a date, if so, format the output cell as a date
        if sheet.cell_type(rows,index) == 3:
            style = xlwt.easyxf(num_format_str='YYYY-MM-DD')
            value_date = datetime.datetime(*xlrd.xldate_as_tuple(value, workbook.datemode))
            new_sheet.write(rows-2, index, value_date, style)
        else:
            new_sheet.write(rows-2, index, value)
#Save the workbook
new_workbook.save(new_file_location)


Outcomes