Bulk Insert a fixed width file

scottdg · July 29, 2015, 2:40pm

We have a vendor that sends us a fixed width text file every day that needs to be imported to our database in 3 different tables. I am trying to import all of the data to a staging table and then plan on merging/inserting select data from the staging table to the 3 tables. The file has 77 columns of data and 20,000+ records. I created an XML format file which I sampled below:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
  <FIELD ID="RetNo" xsi:type="CharFixed" LENGTH="6"/>
  <FIELD ID="TrName" xsi:type="CharFixed" LENGTH="30"/>
  <FIELD ID="Add1" xsi:type="CharFixed" LENGTH="30"/>
</RECORD>
<ROW>
  <COLUMN SOURCE="RetNo" NAME="RETAIL_NUMBER" xsi:type="SQLCHAR" LENGTH="6"/>
  <COLUMN SOURCE="TrName" NAME="TR_NAME" xsi:type="SQLCHAR" LENGTH="30"/>
  <COLUMN SOURCE="Add1" NAME="ADDRESS_1" xsi:type="SQLCHAR" LENGTH="30"/>
</ROW>
</BCPFORMAT>

The data file is a fixed width file with no column delimiters or row delimiters that I can tell. When I run the following insert statement I get the error below it.

BULK INSERT myStagingTable
   FROM '\\...\...\...\myDataSource.txt' 
   WITH (	
			FORMATFILE = '\\...\...\...\myFormatFile.xml',
			ERRORFILE = '\\...\...\...\errorlog.log'
		);

Here is the error:

Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".

Any suggestions would be greatly appreciated. I would like to just ask the vendor for a different format but that would likely take too long.

Kristen · July 29, 2015, 6:58pm

Can you try importing the file with BCP first? That might help bring to light any problems with the file.

IME these things are always very fragile ... one rogue character and it all goes to pot! I've had users open the file to "look at it" and then save it instead of cancelling, and in so doing they have added a end-of-file marker to the original file which then broke the import ...

JamesK · July 29, 2015, 7:42pm

If there are no row-delimiters, I don't know of a way to import the data. If you open the file in notepad, do you see a single long line? You can open it in a better text editor such as TextPad or Notepad++ and examine what the row delimiters are.

When you received a file generated in a linux system, it may not have the same row delimiters as what windows expects. Sometimes it is \n, sometimes it is \r and so on. So examine the file using TextPad or NotePad++ and find what the row-delimiter is. Then specify that in the bulk insert command using the ROWTERMINATOR = option.

scottdg · July 29, 2015, 8:00pm

@Kristen - Is there a way to initiate that as part of a stored procedure? That is eventually what I am hoping to accomplish.

scottdg · July 29, 2015, 8:01pm

@JamesK - Thanks - I will give that a try.

ScottPletcher · July 29, 2015, 8:22pm

You could also import it as one giant CLOB and then parse it apart in SQL itself. May not be the single best performing option, but for anything less than 1M rows I think it should still be fast enough, provided you use a very efficient splitter on the bulk text.

Kristen · July 29, 2015, 10:00pm

You can run BCP via command line from Sproc, but that may not be what you want. I was really thinking of debugging the process using BCP and then, using the knowledge from whatever problems you resolve, you build the BULK INSERT syntax.

For me I prefer bulk import from a command line because:

I can check that the file exists, and perform other tests on it (zero length files are not uncommon IME becuase of a Disk Full somewhere between Source Server and my SQL Server). BCP has the ability to trap errors to a file (both the command line output (using redirection) and also a specific command line parameter to cause BCP to log errors to a file.

We also tend to MOVE completed files to an Archive folder, and rename "the last 10 archive folders" so we have a short history of files - so when the user, eventually, screams because they found something wrong we can go back and see what actually happened.

Using BULK INSERT within SQL is much more like programming with both hands tied behind your back!

scottdg · July 30, 2015, 5:17pm

I opened the file with notepad++ and see that the row terminator is a line feed. I added

ROWTERMINATOR = '\r''

to my bulk insert and now I get an error that says:

The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.

scottdg · July 30, 2015, 5:59pm

disregard previous post- I made some changes to my script and took out the format file path and for got to put it back.

I am still getting the same

Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.

JamesK · July 30, 2015, 6:05pm

My bad. Even though you had it there in I missed that you are using a format file. (Note to self: read posts more carefully before spewing out replies).

Change your format file to indicate the nature of the third column and specify the row terminator. Like shown below:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
  <FIELD ID="RetNo" xsi:type="CharFixed" LENGTH="6"/>
  <FIELD ID="TrName" xsi:type="CharFixed" LENGTH="30"/>
  <FIELD ID="Add1" xsi:type="CharTerm" TERMINATOR="\r" LENGTH="30"/>
</RECORD>
<ROW>
  <COLUMN SOURCE="RetNo" NAME="RETAIL_NUMBER" xsi:type="SQLCHAR" LENGTH="6"/>
  <COLUMN SOURCE="TrName" NAME="TR_NAME" xsi:type="SQLCHAR" LENGTH="30"/>
  <COLUMN SOURCE="Add1" NAME="ADDRESS_1" xsi:type="SQLCHAR" LENGTH="30"/>
</ROW>
</BCPFORMAT>

The key thing here is that via the third line in the RECORD section ( ), you are indicating that it is CharTerm (character terminated) rather than CharFixed (fixed length) and the TERMINATOR="\r" is telling it that that terminator is a \r.

PS: This is all theory, I don't have a file with \r terminator to test. So....

scottdg · July 30, 2015, 6:20pm

That's some good stuff James thanks. It did help get rid of that error. I thought I tried this once before but I must have had the syntax incorrect.

The only problem is now I really am getting the error:

Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 77. Verify that the field terminator and row terminator are specified correctly.

Column 77 is the last column in my data file. Every row in my file shows that there is a LF terminator except for the very last row. That doesn't have any terminator at all.

EDIT: I just tried a different data file and the last line of data still does not have a terminator but it does have a blank line after it.

JamesK · July 30, 2015, 7:06pm

I don't really know a good solution for this problem. I have three thoughts:

Ask the vendor who is providing you with the file to give the proper format file - i.e., every row should be terminated by the specified row terminator. If they are willing to do that I would also press them for giving the file as (pipe) delimited files rather than fixed width files. With fixed width files, if even one character is off, then your data integrity is lost.
Decide that you can live without the last character. If that is a possibility change the format file RECORD section to consider the row-terminator as another column. So it would be like this:

FIELD ID="RetNo" xsi:type="CharFixed" LENGTH="6"
FIELD ID="TrName" xsi:type="CharFixed" LENGTH="30"
FIELD ID="Add1" xsi:type="CharFixed" LENGTH="30"
FIELD ID="RowTerm" xsi:type="CharFixed" LENGTH="1"

My thinking is that it will read the row-terminators as another column. Then, it would find that the last row is incomplete, and will discard it.

Manually add a to append the carriage return. This can be one step in your import process - for example, by creating a file with nothing but a \r character (CHAR(13), not the string \r) and appending that to your source file.

copy /B OriginalSourceFile.txt + CRFile.txt NewSourceFile.txt

The CRFile.txt has the single character for CHAR(13). After doing this, you would then import the NewSourceFile.txt using the original format file.

JamesK · July 30, 2015, 7:11pm

By the way, if you can use SSIS, this would probably be trivial/painless to do.

scottdg · July 30, 2015, 7:34pm

I am going to give it a try.

Did you mean painless using a pipe delimited file or should I be okay even with the fixed width file?

Thanks again for all of your help.

JamesK · July 30, 2015, 7:45pm

SSIS is able to handle fixed width as well. I am not sure about the incomplete last line though. My experience with SSIS is at the basic level, so I just don't know.

Another possibility would be for you to bulk insert the file into a staging table as a single column and then split the columns in script. Even in that case, I don't know how to load the incomplete line.

INSERT INTO FinalTable
SELECT
	SUBSTRING(strCol, 1,6) AS RETAIL_NUMBER,
	SUBSTRING(strCol, 7,30) AS TR_NAME,
	SUBSTRING(strCol, 37,30) AS ADDRESS_1
FROM
	StagingTable

Kristen · July 31, 2015, 6:25am

Yuck! (Sorry ) We have regular web users putting "|" in their email address (clearly deliberately) ... I would avoid any delimiter that could possibly occur in the data. To me that might leave TAB (even that finds its way into data when folk Cut&Paste from somewhere else ...) or CSV (done properly it will Quote columns that contain commas, and double-quote quotes within columns that are quoted because they contain commas ...

Its issues like this why I do this from command line with BCP. I can work around any problems like this pre-processing the file (or just checking its validity), but that apart I would do it Scott's way:

Don't know about you, but I validate incoming data before using it. We use a staging table - one column for each column in the incoming data (whether we use it or not) and then an ErrorNo INT and ErrorMsg VARCHAR column. We validate the rows and add a message, and error number, for anything that fails the test - e.g. a date is invalid. We can then either abort the further processing, or only process "clean" rows - either way we can report any errors to the User easily (I assume this is a batch process, so user is not present and thus the need to report errors to the User "later")

The validation process (which could include splitting) could also log an error on rows that are not the correct length.

scottdg · July 31, 2015, 12:09pm

Thank you Kristen and Scott also. It looks like a have a couple options. I am going to work on this today and will update you on which approach I end up using.

Moseng · July 31, 2015, 2:26pm

Have you given a try?

JamesK · July 31, 2015, 2:31pm

(See edit below): No, I meant a pipe-delimited file instead of fixed length columns (even if you actual data is/has to be fixed length.

On the other hand, depending on the nature of your business, pipe might not be the best column separator, as @Kristen pointed out. I am in the financial sector, and most of our vendors and customers provide either comma-delimited files with double-quote as text qualifier, or pipe as the column separator.

EDIT: I had misread your question. With SSIS, it would be relatively painless with a fixed width or pipe-delimited file. WIth pipe (or any other delimiter), if there happens to be data elements that contain the delimiter character, they MUST be wrapped in a text-qualifier (such as double-quotes). If not, even SSIS cannot import the file properly.

Kristen · July 31, 2015, 2:38pm

Its commonplace, as you say. I'm fussy about such things ("Can go wrong = will go wrong") ...

... and to my mind begs the question how we ever wound up with Comma Delimited, or Pipe for that matter, as the standard (in particular as CSV means different things to different people, particular with reference to quoting-columns and then double-quoting-quotes ...)

How nice it would have been if, from the outset, something more robust had become the defacto standard. A leading size-byte following by that-many-characters (which also allows binary data of course). Reserve 255 as being following by a two-byte size instructor, reserve 255,255,255 as being followed by a 3-byte size instructor ....

I suppose Little & Big-Endian machines would make a Horlicks of it and if that had become the Standard then everyone would be cursing its lack of Endian-awareness