SQLTeam.com | Weblogs | Forums

Convert Tagged data into a table


#1

Dear SQL-Experts,

We have research-publications/Article data from Thomson Reuters, in raw "Tagged Format", where the data is in two columns the first column (2-characters) is the TAG, then and then the data.

Example:
Key Value
UT 000208756900001
TI The Religious Identity of the Arabic Language and the Affair of the Lead Books of the Sacromonte of Granada
AB This article deals, in the first place, with the religious identity of the Arabic language as defined by the ongoing debate, in 16th-17th century Spain, about its identification with Islam. Many new Christians of Muslim origin (Moriscos) tried to break this identification in an effort to salvage part of their culture, and specially the language, by separating it from Islam. I will argue that the Morisco forgery known as the Lead Books of the Sacromonte in Granada-an Arabic Evangile dictated by the Virgin Mary to Arabic disciples who came to Spain with the Apostle Saint James-was part of this effort. When the Lead Books were taken to the Vatican to be informed, they were studied by Maronite scholars who decided that they were written in "Muslim Arabic" and therefore could not be authentic Christian texts. The Maronites were engaged in creating and consolidating their own version of Christian Arabic to define and legitimise their own position inside the Roman world. The second part of the essay adresses the theological considerations and the defence of different cultural identities which are implied in these different versions of Arabic.
AU Garcia-Arenal, M
RO Author, Reprint Author
LN Garcia-Arenal
AF Mercedes


UT: is the Article-ID-Number
TI: Article Title
AB: Abstract

AU: is the Article Author.
RO: Role of the person in AU.
LN: Author Last name
AF: first Name
and so on....

What is the best way to convert this TAGGED data into two tables:
Table#1: with Article/Publication data with UT, TI, AB ..etc as column headers
Table#2: with AU, RO, LN, AF, ...ect as column headers and a foreign key to table#1

Thank you so much,
Vamshi.


#2

Assuming the only linebreaks are followed by a 2-character-tag-code (i.e. no linebreaks WITHIN the Abstract etc.) then I would SPLIT the text block on LINEBREAK (using a Splitter Function - Google for "jeff moden sql splitter function") and then use the first two characters as the "tag", and character-4 onwards as the "value"

You would then need to SELECT, from the split-data, using multiple JOINs to get each of the columns you need for your Table#1 and Table#2 ... or you could leave all the data in a TAG/Value table table comprising [Article-ID-Number], [Tag], [Value] columns, and "simulate" Table#1 and Table#2 with VIEWs onto the underlying TAG/Value table. This would have the benefit that additional TAGs will automatically be added over time, and you only need to adjust your VIEW for any new ones that you want to make available as "pseudo columns"


#3

Thank you Kristen.
Right now the data is already parsed and imported into a table with two Columns: Key & Value.

Now I am trying to convert (Pivot) this TAGGED data into two tables:
Table#1: with Article/Publication data with UT, TI, AB ..etc as column headers
Table#2: with AU, RO, LN, AF, ...ect as column headers and a foreign key to table#1

I did this below pivot-query that is generating the data/columns to be Inserted into Table#1.
But it also includes columns that should go into table#2 concatenated with '+' using GROUP_CONCAT_D() function.

select [UT], [T2], [T3 ], [T9], [AR], [SL], [TI], [RL], [RW], [RY], [AU], [RO], [LN], [AF], [AS], [AD], [AA], [EM], [AG], [DT], [BP], [EP], [PG], [LA], [MA], [DE], [ID], [RP], [RA], [NF], [NC], [ND], [NN], [NY], [NP], [NU], [NZ], [EA], [C1], [CN], [GT], [GB], [GO], [GN], [GX], [AV], [AB], [NR], [CP], [/A], [/Y], [/W], [/N], [/C], [EC], [CR], [R9], [RS], [/V], [/P], [IB], [C2 ], [C4 ], [C5 ], [CT ], [SP ], [CL ], [C3 ], [IE ], [EX], [/I]
from
(select DBO.GROUP_CONCAT_D([Value],'+') AS [Value], [key] from [ISI_Annual_2011_IN3O110113] where [KEY]<>'--' and ID>=@minItemID and ID <=@maxItemID GROUP BY [KEY] ) d
pivot
(
max([Value]) for [key] in ([UT], [T2], [T3 ], [T9], [AR], [SL], [TI], [RL], [RW], [RY], [AU], [RO], [LN], [AF], [AS], [AD], [AA], [EM], [AG], [DT], [BP], [EP], [PG], [LA], [MA], [DE], [ID], [RP], [RA], [NF], [NC], [ND], [NN], [NY], [NP], [NU], [NZ], [EA], [C1], [CN], [GT], [GB], [GO], [GN], [GX], [AV], [AB], [NR], [CP], [/A], [/Y], [/W], [/N], [/C], [EC], [CR], [R9], [RS], [/V], [/P], [IB], [C2 ], [C4 ], [C5 ], [CT ], [SP ], [CL ], [C3 ], [IE ], [EX], [/I])
) piv

is there a way to separate out the columns that go in Table#2?

Thank you so much,
Vamshi.**


#4

Sorry, missed that bit in your O/P :frowning:

I don't use PIVOT much, so someone more familiar with that may have a smart answer.

A two-column table won't be any help (that I can think of), it needs to be a three-column table with an ID column (the same value for all rows that came from the same original Article record. Personally I would use the UT value for that, but you could use an IDENTITY of some sort

A brute-force route would be:

INSERT INTO Table#1
(
    [UT], [TI], [AB], ...
)
SELECT UT.Value, TI.Value, AB.Value, ...
FROM dbo.YourThreeColumnTable AS UT
     LEFT OUTER JOIN dbo.YourThreeColumnTable AS TI
         ON TI.ID = UT.ID
        AND TI.Key = 'TI'
     LEFT OUTER JOIN dbo.YourThreeColumnTable  AS AB
         ON AB.ID = UT.ID
        AND AB.Key = 'AB'
    ....
WHERE TI.Key = 'UT'

#5

Then you're pretty much dead in the water because 1) there is no common key between the rows that make up a single article and 2) you cannot rely on any supposed natural order of rows to do a data smear with unless the data is stored externally and you're using BULK INSERT or BCP to import the data into a table that has an IDENTITY column AND you've guaranteed a single threaded import.

Given the current conditions of the data, you could certainly make something that appears to work but it will eventually fail because the order of the rows is not guaranteed.


#6

I didn't know that ... :frowning:

I wish SQL had a "strict" mode that would allow alerts for things like that. "Non repeatable order due to absence of ORDER BY" ... that sort of thing.