Welcome, Guest Login

Support Center

Creating an XML Data Source

Last Updated: Dec 16, 2015 11:49AM EST

Overview
XML Data Sources provide access to a huge body of content. RSS is a subset of XML and is therefore applicable to everything described here. When you first create an XML data source, all you need to do is enter the URL that serves the XML. If this URL can be reached and is valid XML, when you tab out of the URL field, the EachScape server will attempt to populate the Data Descriptor. Note that this functionality is only available to you when creating a new XML Data Source and not when editing an existing one.

The data descriptor, at a minimum must start with a record command, followed by at least one field command. It's good form to end the whole thing with an end construct that pairs up with the record. The system is forgiving and will overlook it if you omit that final end. Here's a rundown of what these each do. Note that older commands, specifically gathercontent and attr can still be used but their use is discouraged.

Creating A Data Descriptor

Let's assume for this discussion that you're working with this bit of XML.

<?xml version="1.0"?>
<?xml-stylesheet href="/css/rss20.xsl" type="text/xsl"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" 
  xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0">
  <channel>
    <title>NYT &gt; Home Page</title>
    <link>http://www.nytimes.com/pages/index.html?partner=rss</link>
    <atom:link rel="self" type="application/rss+xml" href="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml"/>
    <description/>
 
    <language>en-us</language>
    <copyright>Copyright 2010  The New York Times Company</copyright>
    <lastBuildDate>Thu, 13 May 2010 15:00:12 GMT </lastBuildDate>
    <image>
      <title>NYT &gt; Home Page</title>
      <url>http://graphics.nytimes.com/images/section/NytSectionHeader.gif</url>
 
      <link>http://www.nytimes.com/pages/index.html?partner=rss</link>
    </image>
    <item>
      <title>Cuomo Is Said to Question Banks&#x2019; Influence on Ratings</title>
      <link>http://feeds.nytimes.com/click.phdo?i=024f9536d9e26ad364f9a4914a74ec26</link>
      <guid isPermaLink="false">http://www.nytimes.com/2010/05/13/business/13street.html</guid>
      <media:content
      url="http://graphics8.nytimes.com/images/2010/05/13/business/13street_CA0/13street_CA0-thumbStandard.jpg" 
      medium="image" height="75" width="75"/>
      <media:description>Andrew Cuomo, the attorney general of New York, 
      sent subpoenas to eight Wall Street banks late Wednesday.</media:description>
      <media:credit>Chang W. Lee/The New York Times</media:credit>
      <description>The New York attorney general is said to be scrutinizing eight 
      banks that may have provided misleading information to rating agencies to inflate
      the grades of securities</description>
 
      <dc:creator>By LOUISE STORY</dc:creator>
      <pubDate>Thu, 13 May 2010 12:16:39 GMT</pubDate>
      <category domain="http://www.nytimes.com/namespaces/des">Banks and Banking</category>
      <category domain="http://www.nytimes.com/namespaces/des">Ratings and Rating Systems</category>
      <category domain="http://www.nytimes.com/namespaces/nyt_geo">Wall Street (NYC)</category>
      <category domain="http://www.nytimes.com/namespaces/des">Mortgages</category>
    </item>
    <item>
      <title>F.B.I. Conducts Raids in Times Square Bomb Case and Takes Several People Into Custody</title>
      <link>http://feeds.nytimes.com/click.phdo?i=5a0819c4880c6026e6131f4bcff68080</link>
      <guid isPermaLink="false">http://www.nytimes.com/2010/05/14/nyregion/14terror.html</guid>
      <description>Several people are taken into custody, but the authorities say there is no &#x201C;immediate
      threat&#x201D; to the public.</description>
 
      <dc:creator>By WILLIAM K. RASHBAUM</dc:creator>
      <pubDate>Thu, 13 May 2010 14:50:29 GMT</pubDate>
      <category domain="http://www.nytimes.com/namespaces/des">Terrorism</category>
      <category domain="http://www.nytimes.com/namespaces/des">Search and Seizure</category>
      <category domain="http://www.nytimes.com/namespaces/nyt_org_all">Federal Bureau of Investigation</category>
      <category domain="http://www.nytimes.com/namespaces/nyt_per">Shahzad, Faisal</category>
    </item>
    <item>
      <title>The New Poor: The Economy Shifts, Leaving Some Behind</title>
      <link>http://feeds.nytimes.com/click.phdo?i=57953a3deb2fd93df5e7878a62e683ce</link>
      <guid isPermaLink="false">http://www.nytimes.com/2010/05/13/business/economy/13obsolete.html</guid>
      <media:content url="http://graphics8.nytimes.com/images/2010/05/business/obsolete_CA0-thumbStandard.jpg" 
      medium="image" height="75" width="75"/>
      <media:description>Cynthia Norton, an administrative assistant in Jacksonville, Fla., has not found comparable work 
      since being laid off. </media:description>
      <media:credit>Lori Moffett for The New York Times</media:credit>
      <description>The economy gave employers a chance to do what they would have done anyway: 
      dismiss people in certain fields.</description>
 
      <dc:creator>By CATHERINE RAMPELL</dc:creator>
      <pubDate>Thu, 13 May 2010 07:10:19 GMT</pubDate>
      <category domain="http://www.nytimes.com/namespaces/des">Layoffs and Job Reductions</category>
      <category domain="http://www.nytimes.com/namespaces/mdes">Recession and Depression</category>
      <category domain="http://www.nytimes.com/namespaces/mdes">Economic Conditions and Trends</category>
      <category domain="http://www.nytimes.com/namespaces/mdes">Unemployment</category>
    </item>
  </channel>
</rss>

A data descriptor is composed of lines of commands, each serving to describe how to extract the data from the XML. The basic commands are record and field. The older commands gathercontent and attr are described at the bottom of this page; their use is discouraged. Note that some of the optional features vary according to the command you are using. A table below summarizes these.

Record Command

record tells the EachScape Builder what node (e.g., <item>) is the source of your records. As you can see, the above XML contains only 3 <item> nodes, so the resulting data source will have 3 items in it if you have the “record item” command at the start of the Descriptor.

  record item

If instead you said “record category” you'd get many more records, but all they could contain is the contents in or enclosed by the <category> nodes. Generally you want the highest level node that repeats, but the decision of what you record is defined depends on the data and your application.

Note that you can also give a fully-qualified XPATH, like /rss/channel/item if the tag <item> is abiguous or you want to be more specific about the source of the records.

  record /rss/channel/item

Field Command

The field command a single XPATH expression that is used to identify each field in the generated record. If you specify a path, it must be below the item in the record, and it is relative to the node defined in the record construct. Paths like tag/@id would retrieve the value of the id attribute in this: <tag id=“123”>, with the ”@” representing an attribute path.

  record //rss/channel/item
    field category
    field title
    field link
    field media:description name=caption
    field media:content/@url
    field media:content/@height
    field media:content/@width
  end

By adding the option name=caption above, the default name of “description” is overridden with the name “caption” for the column name. This same technique can be applied to the record construct to control the name of the generated table.

End Command

The *end* command added just closes up the record line. If you omit it, the system will fill it in for you, but it's good form to try to remember it.

Subrecords Command

Another directive, subrecords deals with the situation where a node contains multiple instances of another node that illustrates a many-to-one relationship That's a mouthful, and for now it won't be explained here. But just remember there are escape hatches for dealing with more complex XML and you should feel free to ask for help if you can't figure out how to do what you want. The Data Source handling is like a Swiss Army Knife, with lots of strange but useful gadgets… ask for help if you're not sure how it works.

Note that for each subrecords command, you must supply a matching end command. The system will not try to resolve this omission for you.

With Command

The “with” directive allows you to follow an embedded URL to another XML document and subsequently use that document in other commands. The with command takes a URL path and must also have a name= parameter specified. That name is used thereafter to reference that document. For example, in the following statements, the with command instructs the system to take the channel/article/url path contents from the source document, follow that url and create an instance of a document that can be referenced as “rss” thereafter. In the subsequent field and content commands, the use of source=rss means that the XPATH expressions refer to the “rss” document, not the original document. <code> record channel/article

with url name=rss
field //rss/channel/item[1]/title source=rss
content //rss/channel/item[1]/title source=rss

end </code>

Note that in some cases the most common use of this may be to pick up subrecords from another document, like this

record //channel/article
  with gallery_url name=gallery
  subrecords //rss/channel/item source=gallery
    field media:content@url
  end
end

NOTE THAT IN THIS CASE, THAT ONCE THE source= IS ATTACHED TO THE subrecords DIRECTIVE, EVERYTHING INSIDE IT REFERENCES THAT CONTENT. If you want to references the source that was used outside the subrecords directive, you can say “source=..” and refer to the XPATH context that was in use when the subrecords directive was processed. This only applies when the subrecords directive uses the source= option.

Constant Command

The constant command lets you add another data field that will contain the same field in every record. This feature is used infrequently, but in fact, if you ever have a Merged Data Source created from a series of RSS feeds, you might want this feature.

Data Descriptor Options

All commands except the end command can take options. The options are of the form x=y, where x can be one of the items below. The most-commonly used ones are listed first, followed by an alphabetical list of the remaining options.

Option Applies To Function
download= field, content, attr download=true creates a local file by following URL. download=blob is similar, but puts content into local database blob. download=hosted causes the data to be hosted by EachScape. download=original is the same as download=true, but preserves the original file name.*
index= field, content, attr Creates a full-text index that includes this field: index=true
name= gather, content, attr, record, field, subrecords, with, constant Overrides default name of record or field
pattern= field accepts a regular expression. The regular expression must contain a set a parentheses. The part of the expression that matches what is in the parentheses will be saved as the value; the rest will be discarded. Note that if you leave out the parentheses or have a regular expression that matches nothing, the field may be empty.
strip= field strip=links will remove any HTML links while preserving the link text
type= field, content, attr type=timestamp reformats dates, type=markup handles ill-formed XML that embeds HTML without escaping or CDATA.
Used for Image Post-Processing, these are only meaningful when download= is present. (Ask an EachScape employee for assistance.)
size= field, content, attr Modify the dimensions of the downloaded image. More information on this can be found here http://www.imagemagick.org/Magick++/Geometry.html
quality= field, content, attr Reduces the quality (and thus the size) of the image file. if used, must be a number between 1 and 100 inclusive. For JPEG files, 1 selects the poorest quality (and smallest file) while 100 produces the best quality (and largest file). Generally you don't ever need a number between 80-90 as large values may actually increase the file size without producing better quality. For PNG files, the numbers 1 to 100 are also used, and regulates the time spent attempting to compress the image. PNG files do not lose image quality as this value is changed, merely the resulting size may vary. The quality value does not apply for any other file type.
Used for Video Post-Processing*, these are only meaningful when download= is present.
rotate= field, content, attr rotate=90left rotates the video 90° counter-clockwise, rotate=90right rotates the video 90° clockwise. Note that these options require video re-encoding.
Options that are used infrequently
hostedid= field, content, attr Ask an EachScape employee for assistance with this option.
source= field, content, field, subrecord Refer to a name created by the “with” directive
split= field, content, attr split=, would cause the data value to be split at each comma, creating a tag field. Any non-blank character can be used.
The following are deprecated. To perform these functions you are encouraged to use XPATH selectors that include conditional selection.
ignore= content Ask an EachScape employee for assistance with this option.
ifAttr= content, attr Ask an EachScape employee for assistance with this option.

* Currently only available on tabular data sources.

Permanent Unique Key

With data sources, you may want to allow users of the app to bookmark data. This is only possible if the data source records contain an assigned Permanent Unique Key. We call them a "PUK" for short. There are two really important words in that name: Permanent means it will never change once assigned and Unique means just there will never be a duplicate value assigned. Simple enough, you'd think, but it's where a lot of data sources get tripped up. If you know what the PUK<nowiki> is for your data source, you can easily define it in the field for the Permanent Unique Key. Go back into the data source definition and add it in the box labeled Permanent Unique Keys. Note that you can have one <nowiki>PUK for each record. Most data sources contain only one record, but it is possible to have multiple records.

Deprecated Commands

The following commands are available, but their use is discouraged.

Gather Command

gather tells the EachScape Builder what node (e.g., <item>) is the source of your records. As you can see by looking it over, it contains only 3 <item> nodes, so the resulting data source will have 3 items in it if you have the “gather item” command at the start of the Descriptor.

  gather item

If instead you said “gather category” you'd get many more records, but all they could contain is the contents in or enclosed by the <category> nodes. Generally you want the highest level node that repeats, but the decision of what you gather is dependent on the data and your application. For the rest of this discussion, let's assume I say “gather item” and want to build up 3 item records.

Content Command

content tells the EachScape Builder to extract what's inside the referenced node. The content command takes an XPATH expression. If you just give a node name or collection of node names like “a/b/c” then they are assumed to be immediately subordinate to the node referenced by the gather command. If, the XPATH starts with a period, you can reference anything in the structure of the document, but again, it is relative to the node referenced by the gather directive.

Let's say I want to get at the category underneath the item, all I need to add is “content category”.

  gather item
    content category

We indent “content category” and other lines after gather just to clarify the structure, but it's not necessary. The content nodes look like this,

  <category domain="http://www.nytimes.com/namespaces/des">Banks and Banking</category>

so if I say “content category” I'll pick up the “Banks and Banking” value. (“domain=” is an attribute; we'd use a different directive to pick up that value. More on that, below.)

Let's pick up a few more items from the XML, by expanding the descriptor to look like this:

  gather item
    content category
    content title
    content link
    content media:description name=caption

Note that I added the words name=caption after the last line. This tells the system to create a field named “caption” rather than the default name, which would be “description”.

Attr Command

attr instructs the system to pick up an attribute value.

The attr command has two formats. It can either take one or two arguments: usually you see the attr command with two arguments. The first one is an XPATH expression (see the description above) to navigate to the desired node from the gather command's node and the second argument is the name of the attribute itself. If there is only one argument, it means that you're trying to pick up an attribute on the node referenced by the gather command itself.

For example, this bit of XML has the URL of an image, as well as its dimensions.

      <media:content
      url="http://graphics8.nytimes.com/images/2010/05/13/business/13street_CA0/13street_CA0-thumbStandard.jpg" 
      medium="image" height="75" width="75"/>

It's easy to extract them by expanding the Descriptor to look like this.

  gather item
    content category
    content title
    content link
    content media:description name=caption
    attr media:content url
    attr media:content height
    attr media:content width
  end

If you inserted a line like this “attr id” it would attempt to pick up the “id” attribute's value directly from the node that the gather command referenced. In this example, we don't have any attributes on that node, so it can't be used here.

Help us improve! Rate this article:

Yes I found this article helpful

Ask a Question   

support@eachscape.com
http://assets0.desk.com/
false
eachscape
Loading
seconds ago
a minute ago
minutes ago
an hour ago
hours ago
a day ago
days ago
about
false
Invalid characters found
/customer/en/portal/articles/autocomplete