Parsing Fixed width .dat file with Python

09-12-2011 01:28 PM
AlexSmith2
Deactivated User
I have data in the form of a .dat file (really just a text file).  The problem is that the data are multi-line, with each line having an independent fixed width.  Also, there are no headers.  For example:

10012345678FIXEDWIDTHDATA
11012345678FIXEDWIDTHBUTLARGERTHAN
12012345678FIXEDWIDTH
10012345678AFIXEDWIDTHDATA
11012345678AFIXEDWIDTHBUTLARGERTHAN
12012345678AFIXEDWIDTH

The good news is that I have a cheat sheet with FIELDNAME, SIZE, TYPE (e.g. NUM or CHAR) and START POSITION.  The first three digits are the RECORDTYPE (e.g. 100, 110 or 120).  The 12345678(A) is the PARCELNUM.  That is where the fixed-width similarities end.

I am new to Python and have been struggling with this for a few days now.  I have managed to open the file, read it into a list, sort the list, and write out a new file:

# Read mode opens a file for reading only.
DataFileIn = open(r"D:\Path\st4206001.dat", "r")  # raw string so backslashes in the Windows path are not treated as escapes
# Read all the lines into a list.
DataList = DataFileIn.readlines()
DataList.sort()
DataFileIn.close()
DataTextOut = open(r'D:\Path\Data.txt', 'w')
DataTextOut.writelines(DataList)  # Write a sequence of strings to a file
DataTextOut.close()


This is where I need some direction.  My goal is to sort the list and output a file for each RECORDTYPE.  It would be nice to add the HEADERS to the files before writing them.  I was looking into using the re module to do the sorting (perhaps match) but, again, I am new to Python. 

My hope is that someone out there has a strategy I could follow (i.e.  suggest modules and python tricks).  I just need to be pointed in the right direction.
7 Replies
DanPatterson_Retired
MVP Emeritus
Alex
I am sure you can figure out the rest from this verbose coding example:

'''
ParsingDataDemo.py

A demo file to parse data which is quasi-fixed width

File and script must reside in the same folder...fix this if you want

'''
import sys, os

data_path = (os.path.dirname(sys.argv[0]) + "/").replace("\\","/")  #can be skipped if you follow 
data_file = data_path + "ParsingDataDemoData.txt"    #fix this or better still create a tool
#data_file should be sys.argv[1] which allows a user to select a file in a folder
a_file = open(data_file)
data = a_file.readlines()
for a_line in data:
  record_type = a_line[:3]
  parcel_num = a_line[3:11]
  the_rest = a_line[11:]
  print record_type, parcel_num, the_rest
StacyRendall1
Frequent Contributor
Hi Alex,

There are a number of ways you can do this; but I would recommend:

  1. learning a bit about the possible string operations in Python

  2. looking into Python dictionaries


Check out this page from the Python docs: Built-in Types.

The page is pretty daunting, but the string and dictionary bits will help shed some light. I have modified your code to read each line separately, extract the record type (as it is a known width), then write the record type and other info to a Python dictionary using the record type as a 'key'. The other entries are added as list items under the key. I.e. (viewing it as a hierarchy):
'120' # record type - the 'key'
  L ['12345678AFIXEDWIDTH', '12345678FIXEDWIDTH'] # other info, stored within a list
'110'
  L etc., etc.


Then I iterate through the keys, creating an output file for each one, writing its data to it, closing it, and moving on to the next key.

Here is the code:
# Read mode opens a file for reading only.
DataFileIn = open("input.txt", "r")
# Read all the lines into a list.
DataList = DataFileIn.readlines()
DataFileIn.close()

DataDict = {}

for item in DataList:  # iterate over the rows - each item is the string of data
    RECORDTYPE = item[0:3]  # get characters 0 to 3 of the string (first 3 digits)
    item_ = item.strip('\n')  # get rid of new line characters at the ends (if they are there - does nothing if not)
    try:
        DataDict[RECORDTYPE].append(item_[3:])  # try to append the rest to the dictionary sub-list as a list item
    except KeyError:
        DataDict[RECORDTYPE] = [item_[3:]]  # if this is the first time this record type has appeared, add it as a new list

for key in DataDict:  # for every record type
    DataTextOut = open('output_%s.txt' % key, 'w')  # i.e. output_120.txt
    for item in DataDict[key]:  # for each line in the list
        DataTextOut.write(item + '\n')  # write the data, then add a new line
    DataTextOut.close()  # close this particular file
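
You also mentioned wanting headers in the output files. If you know the field names for each record type from your cheat sheet, you could write a header line right after opening each file. For example, replacing the second loop above (the field names below are just placeholders - substitute the names from your cheat sheet):

# hypothetical header text for each record type
headers = {'100': 'RECORDTYPE,PARCELNUM,FIELD3,FIELD4',
           '110': 'RECORDTYPE,PARCELNUM,FIELD3',
           '120': 'RECORDTYPE,PARCELNUM,FIELD3'}

for key in DataDict:
    DataTextOut = open('output_%s.txt' % key, 'w')
    DataTextOut.write(headers.get(key, '') + '\n')  # header line goes in first
    for item in DataDict[key]:
        DataTextOut.write(item + '\n')
    DataTextOut.close()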
AlexSmith2
Deactivated User
Stacy and Dan:

Thanks for the tips.  I have spent the last few days trying to learn my way through this.  Your references really helped speed things up.  I combined the snippets of code to look like this:


#test.txt=
#10012345678ABCDEF123abc
#11012345678ABC12345abcd
#12012345678111111abcdef
#10012345678AABCDEF123abc
#11012345678AABC12345abcd
#12012345678A111111abcdef
testtxt = open('D:/Python_Tests/test.txt', 'r')
testlist = testtxt.readlines()
testtxt.close()
dict_ = {}
for item in testlist:
    RECORDTYPE = item[0:3]
    item_ = item.strip('\n')
    try:
        dict_[RECORDTYPE].append(item_[3:])
    except KeyError:
        dict_[RECORDTYPE] = [item_[3:]]
for key in dict_:
    textout = open('D:/Python_Tests/textout_%s.txt' % key, 'w')
    for item in dict_[key]:
        if key == '100':  # I added the if and elif to format each record type separately
            textout.write('100'+','+item[:9]+','+item[9:18]+','+item[18:21]+','+item[21:23]+'\n')
        elif key == '110':
            textout.write('110'+','+item[:9]+','+item[9:12]+','+item[12:18]+'\n')
        elif key == '120':
            textout.write('120'+','+item[:9]+','+item[9:12]+','+item[12:]+'\n')
    textout.close()

This works and gives me three separate "csv" files that I can eventually turn into dBASE tables.  The problem is that the "parcel" field should always have a width of nine, even if the parcel number is only 8 characters in length (accounting for the "split" letters).

I tried using ljust(), but the text output does not seem to respond.

for key in dict_:
    textout = open('D:/Python_Tests/textout_%s.txt' % key, 'w')
    for item in dict_[key]:
        if key == '100':
            textout.write('100'+','+item.ljust(5)+'n'+'\n')
        elif key == '110':
            textout.write('100'+','+item[:9]+','+item[9:18]+','+item[18:21]+','+item[21:23]+'\n')
        elif key == '120':
            textout.write('100'+','+item[:9]+','+item[9:18]+','+item[18:21]+','+item[21:23]+'\n')
    textout.close()

Am I missing some syntax?  Is there a more efficient way to do this?  Would str.format() work within the textout.write()?  Thanks for all the help.
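
For reference, here is a minimal sketch of both padding approaches, assuming the parcel slice (item[:9]) is the field that needs to be nine characters wide. Note that ljust() only adds padding when the string is shorter than the requested width, so ljust(5) has no visible effect on a string that is already longer than five characters:

parcel = '12345678'  # hypothetical parcel number that is missing its split letter

# str.ljust pads on the right out to the requested width
padded_1 = parcel.ljust(9)           # '12345678 '

# str.format can do the same with a field-width specifier (Python 2.6+)
padded_2 = '{0:<9}'.format(parcel)   # '12345678 ' - '<9' means left-aligned, minimum width 9

print(repr(padded_1))
print(repr(padded_2))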
StacyRendall1
Frequent Contributor
Hi Alex, good work!

Unfortunately I don't understand what your problem is... Can you clarify with some examples? I.e. if the input line is:
12012345678ASOMEOTHERSTUFF

the output should be:
file: 120
# parcelnum, text
12345678A,SOMEOTHERSTUFF

but at the moment it is doing:
file: 120
# parcelnum, text
1234567,8ASOMEOTHERSTUFF

or whatever is actually going on / whatever you want to happen...
AlexSmith2
Deactivated User
The main problem was me. My test.txt file contained the correct numbers but I neglected to insert the spaces after the parcel number:

test.txt=
10012345678ABCDEF123abc
11012345678ABC12345abcd
12012345678111111abcdef
10012345678AABCDEF123abc
11012345678AABC12345abcd
12012345678A111111abcdef

The data should have contained the spaces/padding to account for the split number in the parcel data:

10012345678 ABCDEF123abc
11012345678 ABC12345abcd
12012345678 111111abcdef
10012345678AABCDEF123abc
11012345678AABC12345abcd
12012345678A111111abcdef

It is hard to parse a "fixed width" file with variable column widths.  Once I corrected the test file, everything worked fine.  The code now looks like this:


testtxt = open('D:/Python_Tests/test.txt', 'r')
testlist = testtxt.readlines()
testtxt.close()
dict_ = {}

for item in testlist:
    RECORDTYPE = item[0:3]
    item_ = item.strip('\n')
    try:
        dict_[RECORDTYPE].append(item_[3:])
    except KeyError:
        dict_[RECORDTYPE] = [item_[3:]]

for key in dict_:
    textout = open('D:/Python_Tests/textout_%s.txt' % key, 'w')
    for item in dict_[key]:
        if key == '100':
            textout.write('100'+','+item[:9]+','+item[9:10]+','+item[10:14]+','+item[14:18]+'\n')
        elif key == '110':
            textout.write('110'+','+item[:9]+','+item[9:18]+','+item[18:35]+'\n')
        elif key == '120':
            textout.write('120'+','+item[:9]+','+item[13:613]+','+'\n')
    textout.close()



The output for each record type is parsed and commas are added as delimiters:

Record type 100:
100,12345678 ,ABCDEF,123,abc
100,12345678A,ABCDEF,123,abc

Record type 110:
110,12345678 ,ABC,12345,abcd
110,12345678A,ABC,12345,abcd

Record type 120:
120,12345678 ,111111,abcdef
120,12345678A,111111,abcdef

I also found some code that removes the '\n' if it is there and keeps the trailing white space.

testtxt = open('D:/Python_Tests/test.txt', 'r')
testlist = testtxt.readlines()
testtxt.close()
dict_ = {}

def chomp(s):
    return s[:-1] if s.endswith('\n') else s  # keeps trailing whitespace

for item in testlist:
    RECORDTYPE = item[0:3]
    item_ = chomp(item)  # used in place of xx.strip('\n')
    try:
        dict_[RECORDTYPE].append(item_[3:])
    except KeyError:
        dict_[RECORDTYPE] = [item_[3:]]

for key in dict_:
    textout = open('D:/Python_Tests/textout_%s.txt' % key, 'w')
    for item in dict_[key]:
        if key == '100':
            textout.write('100'+','+item[:9]+','+item[9:15]+','+item[15:18]+','+item[18:]+'\n')
        elif key == '110':
            textout.write('110'+','+item[:9]+','+item[9:12]+','+item[12:17]+','+item[17:]+'\n')
        elif key == '120':
            textout.write('120'+','+item[:9]+','+item[9:15]+','+item[15:]+'\n')
    textout.close()


Now I just need to work on an iterator that will process all the files in a directory the same way.  Thanks for the help.  Learning just this little bit of Python makes data processing much easier.
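
The plan is something like this with the glob module (the folder path and the .dat extension here are placeholders - they would need to match the real data):

import glob
import os

folder = 'D:/Python_Tests'  # hypothetical folder holding the .dat files

for dat_path in glob.glob(os.path.join(folder, '*.dat')):
    # 'st4206001' from 'D:/Python_Tests/st4206001.dat', for naming the outputs
    base_name = os.path.splitext(os.path.basename(dat_path))[0]
    dat_file = open(dat_path, 'r')
    dat_lines = dat_file.readlines()
    dat_file.close()
    # ...build dict_ from dat_lines and write the per-record-type files as above,
    # e.g. open('%s/%s_textout_%s.txt' % (folder, base_name, key), 'w') for each key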
curtvprice
MVP Alum
Just a little note - you can strip whitespace from the right side of a string using the string method rstrip():

>>> "a\n".rstrip()
'a'
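
If you want to drop only the newline and keep any other trailing whitespace (like the chomp() function above does), rstrip() also accepts the characters to strip as an argument:

>>> "a  \n".rstrip('\n')
'a  '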
StacyRendall1
Frequent Contributor
Great!

However, you shouldn't have to edit all your input data to make it fit around your program. If there is a consistent logic to the letter that follows the parcel number, and the letters after that, or your data is consistently ordered, you can program something around that...

In reality, all you need to do to program something is be able to write some rules down on a piece of paper, then work through them in your head with your data. At the moment we have:
get first three digits
get next nine digits
if first three digits == '120' get next four digits # and so on...

All programming should start with something like this; you have to clearly know what you are trying to do and what you want to get out. All computers do is take the rules you give them and apply them lots of times.

Now, for your situation, the easiest thing might be if your input data is ordered (but with this kind of stuff, there are lots of options if it isn't) - i.e. it always goes parcel num 12345678, then 12345678A is after that (doesn't have to be immediately, just after it somewhere), and 12345678B is somewhere after that; or if the sort command does this for you... Then when adding to the dictionary you can note which one is the first, then A, B, C and so on (because you know 12345678 has already come up, you can search in the correct place for the sub letter).

Only bother with this if your real dataset is large, or you will need to run it lots - it's a tradeoff between more work coding now and more work later with editing all your data...
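
As a rough sketch of that kind of rule, the snippet below pads the parcel field to nine characters by only treating the character after the eight digits as a split letter when the plain parcel number has already been seen for that record type. This is only a heuristic under those ordering assumptions (plain parcel first, each record type/parcel combination appearing plain only once), not the only way to do it:

seen = set()  # (record type, plain parcel number) combinations seen so far

def split_parcel(line):
    # line is one raw, unpadded record, e.g. '12012345678A111111abcdef'
    record_type = line[:3]
    parcel = line[3:11]      # the eight-digit parcel number
    rest = line[11:]
    if (record_type, parcel) in seen and rest[:1].isalpha():
        # the plain parcel already came up for this record type,
        # so treat the letter that follows as the split letter
        parcel = parcel + rest[0]
        rest = rest[1:]
    else:
        # first appearance - remember it and pad the parcel field out to nine characters
        seen.add((record_type, parcel))
        parcel = parcel + ' '
    return record_type, parcel, rest

# called in file order:
# split_parcel('12012345678111111abcdef')   -> ('120', '12345678 ', '111111abcdef')
# split_parcel('12012345678A111111abcdef')  -> ('120', '12345678A', '111111abcdef')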