c# - BULK INSERT of Apple EPF Unicode Data Into SQL Server 2012 -
i'm attempting build application import apple epf feed data instance of sql server 2012. data supplied in flat files, use char(1) field delimiter , char(2) plus linefeed (char(10)) row delimiter. simplified 3-field row follows:
field1value[char(1)]field2value[char(1)]field3value[char(2)][linefeed] field1value[char(1)]field2value[char(1)]field3value[char(2)][linefeed]
the files encoded in utf-8 , include many different languages, , it's important of data preserved down-encoding ascii not option. however, sql server not support bulk insert of utf-8.
i'm writing c# command line application, , pre-processing each file using following method, strips comment lines final newline @ end of file, , converts file utf-8 utf-16:
public void prepareappleepffile(string datafilepath, string cleanedfilepath) { if (file.exists(cleanedfilepath)) { return; } // skip processing if file exists using (streamreader reader = new streamreader(datafilepath, encoding.utf8)) { using (streamwriter writer = new streamwriter(cleanedfilepath, false, encoding.unicode)) { string line; bool firstline = true; while (!reader.endofstream) { line = reader.readline(); if (line.length > 0 && line.substring(0, 1) != "#") // skip empty , commented lines { // done way avoid adding trailing newline, breaks bulk insert if (!firstline) { writer.write("\n"); } writer.write(line); firstline = false; } } } } }
once file prepared, use following bulk insert statement:
bulk insert dbo.application 'c:\itunes\data\application.cleaned' ( tablock, datafiletype = 'widechar', errorfile = 'c:\itunes\logs\application.log', formatfile = 'c:\itunes\formatdefinitions\application.xml' )
which makes use of following xml format file:
<?xml version="1.0"?> <bcpformat xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance"> <record> <field id="1" xsi:type="charterm" terminator="\x01\x00"/> <field id="2" xsi:type="charterm" terminator="\x01\x00"/> <field id="3" xsi:type="charterm" terminator="\x01\x00"/> <field id="4" xsi:type="charterm" terminator="\x01\x00"/> <field id="5" xsi:type="charterm" terminator="\x01\x00"/> <field id="6" xsi:type="charterm" terminator="\x01\x00"/> <field id="7" xsi:type="charterm" terminator="\x01\x00"/> <field id="8" xsi:type="charterm" terminator="\x01\x00"/> <field id="9" xsi:type="charterm" terminator="\x01\x00"/> <field id="10" xsi:type="charterm" terminator="\x01\x00"/> <field id="11" xsi:type="charterm" terminator="\x01\x00"/> <field id="12" xsi:type="charterm" terminator="\x01\x00"/> <field id="13" xsi:type="charterm" terminator="\x01\x00"/> <field id="14" xsi:type="charterm" terminator="\x01\x00"/> <field id="15" xsi:type="charterm" terminator="\x01\x00"/> <field id="16" xsi:type="charterm" terminator="\x01\x00"/> <field id="17" xsi:type="charterm" terminator="\x02"/> </record> <row> <column source="2" name="application_id" xsi:type="sqlnvarchar"/> <column source="3" name="title" xsi:type="sqlnvarchar"/> <column source="4" name="recommended_age" xsi:type="sqlnvarchar"/> <column source="5" name="artist_name" xsi:type="sqlnvarchar"/> <column source="6" name="seller_name" xsi:type="sqlnvarchar"/> <column source="7" name="company_url" xsi:type="sqlnvarchar"/> <column source="8" name="support_url" xsi:type="sqlnvarchar"/> <column source="9" name="view_url" xsi:type="sqlnvarchar"/> <column source="10" name="artwork_url_large" xsi:type="sqlnvarchar"/> <column source="11" name="artwork_url_small" xsi:type="sqlnvarchar"/> <column source="12" name="itunes_release_date" xsi:type="sqlnvarchar"/> <column source="13" name="copyright" xsi:type="sqlnvarchar"/> <column source="14" name="description" xsi:type="sqlntext"/> <column source="15" name="version" xsi:type="sqlnvarchar"/> <column source="16" name="itunes_version" xsi:type="sqlnvarchar"/> <column source="17" name="download_size" xsi:type="sqlnvarchar"/> <column source="1" name="export_date" xsi:type="sqlnvarchar"/> </row> </bcpformat>
the above scheme seems almost-working, data inserted table (which uses nvarchar/ntext columns) returned char(0)'s following each character. if value in column should have been "bob", "b[char(0)]o[char(0)]b[char(0)]" [char(0)] of course displays standard empty-square shown non-printing characters.
this image shows example of how data ends looking
the problem further illustrated following query:
select top 1 datalength(title) datalength_of_title, datalength(cast(title varchar(1000))) datalength_of_varchar_cast_title dbo.application
for row 7-character title, above query returns value of 28 datalength_of_title , 14 datalength_of_varchar_cast_title. these values double should be. bulk insert seems attempting re-encode unicode data, resulting in double-encoded data each character followed 3 char(0)s instead of one.
another, less-serious issue writing data file out in utf-16 causes bom added beginning of file, , bom sequence ends getting inserted part of first field value in first row of file.
i've tried using various values codepage parameter of bulk insert command none of them have improved situation, , ones did have effect caused bulk insert fail. example, using codepage = 1200 (which code page utf-16) causes following error:
msg 4864, level 16, state 1, line 1 bulk load data conversion error (type mismatch or invalid character specified codepage) row 1, column 1 (export_date).
i figured out problem while reviewing docs xml format file at:
http://msdn.microsoft.com/en-us/library/ms187833.aspx
the problem was using xsi:type="charterm" in elements when should have been using xsi:type="ncharterm".
Comments
Post a Comment