Hundreds of XML files need importing

Hello SQLTeam..

I have hundreds of XML files that I'm attempting to import into SQL Server...

My first attempt was to use OPENROWSET ... but getting error about undeclared prefix

xsi seems to be the "undeclared prefix" .. Is there another way to import without needing to modify each XML file ?

Example of my first attempt..

CREATE TABLE XMLwithOpenXML
(
Id INT IDENTITY PRIMARY KEY,
XMLData XML,
LoadedDateTime DATETIME
)

INSERT INTO XMLwithOpenXML(XMLData, LoadedDateTime)
SELECT CONVERT(XML, BulkColumn) AS BulkColumn, GETDATE() 
FROM OPENROWSET(BULK 'G:\temp\com_export_371_emp_002.xml', SINGLE_BLOB) AS x

ERROR:

	Execution trace:: INSERT INTO XMLwithOpenXML(XMLData, LoadedDateTime)
		SELECT CONVERT(XML, BulkColumn) AS BulkColumn, GETDATE() 
		FROM OPENROWSET(BULK 'G:\temp\com_export_371_emp_002.xml', SINGLE_BLOB) AS x Executed on SPID 57: 
	Error:: Lookup Error - SQL Server Database Error: XML parsing: line 13, character 33, undeclared prefix
	Execution trace:: Commit on SPID 57
<?xml version="1.0" encoding="UTF-8" standalone="no"?> 608854 371 John Doe J Doe,John J 2011-03-15T11:15:43.421495 2011-03-15 Principal Engineer 99 10379 5336 W123456 3359 -351 1 467 111-22-3333 1952-03-21 623169 false 7bB/Mu6hCFCgQ1234l2DvNr2aVSH+0SJS3BHLwqs2v4= eJ1pKaJG1L21234l3XXqJo/tSW+1Ll0Ndyyc6X6UX3I= 2cc66567-657d-4d01-a6a2-5d9f625e408f

Probably better to import the files as nvarchar(max) instead of XML. You can then do some string manipulation to address any XML parsing issues. Typically the <?xml ....> directives are unnecessary and can be removed with a STUFF/SUBSTRING expression, then the remaining data will parse as XML properly.

On further reading, the xsi is a namespace prefix, and if it's missing/undeclared in the XML stream then you might need to remove that prefix from the data. Again, easier to do if you import as nvarchar(max).

You'd have to modify the column data type in the table you're importing into, and probably change the OPENROWSET to use SINGLE_NCLOB or SINGLE_CLOB. Also note that the encoding may say UTF-8 but the file might have unicode characters that don't translate correctly under UTF-8, so again you'll probably want to remove the <?xml> directive.

Thanks Robert... but I'm not sure I understand importing files as nvarchar... there are multiple "emp" XML elements in each file... I could see if each file only contained 1 "emp" , but not sure if this is possible with multiple emps in each file..

You're second suggestion seems to suggest the need to modify each file which I was hoping not to do ...

Is there any other way to import this XML data

Sorry, I wasn't clear in the second part. You won't be modifying the files. You'll import the XML from the files pretty much exactly like you're doing now.

You'd modify your import table:

CREATE TABLE XMLwithOpenXML
(
Id INT IDENTITY PRIMARY KEY,
XMLData nvarchar(max),
LoadedDateTime DATETIME
)

And the OPENROWSET:

INSERT INTO XMLwithOpenXML(XMLData, LoadedDateTime)
SELECT BulkColumn, GETDATE()
FROM OPENROWSET(BULK 'G:\temp\com_export_371_emp_002.xml', SINGLE_NCLOB) AS x

That will avoid the parsing errors you're getting from SQL Server trying to convert the contents to XML. You should be able to import all the files this way.

You would next query your table for rows that have invalid XML:

SELECT * FROM XMLwithOpenXML WHERE TRY_CONVERT(xml, XMLData) IS NULL

If you want to simply try fixing the xsi: prefix issue:

UPDATE XMLwithOpenXML SET XMLData=REPLACE(XMLData,'xsi:','');

You can then try using CONVERT() to convert the contents to XML.

None of this will require you to modify the original files.