XML Data Sources provide access to a huge body of content. RSS is a subset of XML and is therefore applicable to everything described here. When you first create an XML data source, all you need to do is enter the URL that serves the XML. If this URL can be reached and is valid XML, when you tab out of the URL field, the EachScape server will attempt to populate the Data Descriptor. Note that this functionality is only available to you when creating a new XML Data Source and not when editing an existing one.
The data descriptor, at a minimum must start with a record command, followed by at least one field command. It's good form to end the whole thing with an end construct that pairs up with the record. The system is forgiving and will overlook it if you omit that final end. Here's a rundown of what these each do. Note that older commands, specifically gather, content and attr can still be used but their use is discouraged.
Creating A Data Descriptor
Let's assume for this discussion that you're working with this bit of XML.
<?xml version="1.0"?> <?xml-stylesheet href="/css/rss20.xsl" type="text/xsl"?> <rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:nyt="http://www.nytimes.com/namespaces/rss/2.0" version="2.0"> <channel> <title>NYT > Home Page</title> <link>http://www.nytimes.com/pages/index.html?partner=rss</link> <atom:link rel="self" type="application/rss+xml" href="http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml"/> <description/> <language>en-us</language> <copyright>Copyright 2010 The New York Times Company</copyright> <lastBuildDate>Thu, 13 May 2010 15:00:12 GMT </lastBuildDate> <image> <title>NYT > Home Page</title> <url>http://graphics.nytimes.com/images/section/NytSectionHeader.gif</url> <link>http://www.nytimes.com/pages/index.html?partner=rss</link> </image> <item> <title>Cuomo Is Said to Question Banks’ Influence on Ratings</title> <link>http://feeds.nytimes.com/click.phdo?i=024f9536d9e26ad364f9a4914a74ec26</link> <guid isPermaLink="false">http://www.nytimes.com/2010/05/13/business/13street.html</guid> <media:content url="http://graphics8.nytimes.com/images/2010/05/13/business/13street_CA0/13street_CA0-thumbStandard.jpg" medium="image" height="75" width="75"/> <media:description>Andrew Cuomo, the attorney general of New York, sent subpoenas to eight Wall Street banks late Wednesday.</media:description> <media:credit>Chang W. Lee/The New York Times</media:credit> <description>The New York attorney general is said to be scrutinizing eight banks that may have provided misleading information to rating agencies to inflate the grades of securities</description> <dc:creator>By LOUISE STORY</dc:creator> <pubDate>Thu, 13 May 2010 12:16:39 GMT</pubDate> <category domain="http://www.nytimes.com/namespaces/des">Banks and Banking</category> <category domain="http://www.nytimes.com/namespaces/des">Ratings and Rating Systems</category> <category domain="http://www.nytimes.com/namespaces/nyt_geo">Wall Street (NYC)</category> <category domain="http://www.nytimes.com/namespaces/des">Mortgages</category> </item> <item> <title>F.B.I. Conducts Raids in Times Square Bomb Case and Takes Several People Into Custody</title> <link>http://feeds.nytimes.com/click.phdo?i=5a0819c4880c6026e6131f4bcff68080</link> <guid isPermaLink="false">http://www.nytimes.com/2010/05/14/nyregion/14terror.html</guid> <description>Several people are taken into custody, but the authorities say there is no “immediate threat” to the public.</description> <dc:creator>By WILLIAM K. RASHBAUM</dc:creator> <pubDate>Thu, 13 May 2010 14:50:29 GMT</pubDate> <category domain="http://www.nytimes.com/namespaces/des">Terrorism</category> <category domain="http://www.nytimes.com/namespaces/des">Search and Seizure</category> <category domain="http://www.nytimes.com/namespaces/nyt_org_all">Federal Bureau of Investigation</category> <category domain="http://www.nytimes.com/namespaces/nyt_per">Shahzad, Faisal</category> </item> <item> <title>The New Poor: The Economy Shifts, Leaving Some Behind</title> <link>http://feeds.nytimes.com/click.phdo?i=57953a3deb2fd93df5e7878a62e683ce</link> <guid isPermaLink="false">http://www.nytimes.com/2010/05/13/business/economy/13obsolete.html</guid> <media:content url="http://graphics8.nytimes.com/images/2010/05/business/obsolete_CA0-thumbStandard.jpg" medium="image" height="75" width="75"/> <media:description>Cynthia Norton, an administrative assistant in Jacksonville, Fla., has not found comparable work since being laid off. </media:description> <media:credit>Lori Moffett for The New York Times</media:credit> <description>The economy gave employers a chance to do what they would have done anyway: dismiss people in certain fields.</description> <dc:creator>By CATHERINE RAMPELL</dc:creator> <pubDate>Thu, 13 May 2010 07:10:19 GMT</pubDate> <category domain="http://www.nytimes.com/namespaces/des">Layoffs and Job Reductions</category> <category domain="http://www.nytimes.com/namespaces/mdes">Recession and Depression</category> <category domain="http://www.nytimes.com/namespaces/mdes">Economic Conditions and Trends</category> <category domain="http://www.nytimes.com/namespaces/mdes">Unemployment</category> </item> </channel> </rss>
A data descriptor is composed of lines of commands, each serving to describe how to extract the data from the XML. The basic commands are record and field. The older commands gather, content and attr are described at the bottom of this page; their use is discouraged. Note that some of the optional features vary according to the command you are using. A table below summarizes these.
record tells the EachScape Builder what node (e.g., <item>) is the source of your records. As you can see, the above XML contains only 3 <item> nodes, so the resulting data source will have 3 items in it if you have the “record item” command at the start of the Descriptor.
If instead you said “record category” you'd get many more records, but all they could contain is the contents in or enclosed by the <category> nodes. Generally you want the highest level node that repeats, but the decision of what you record is defined depends on the data and your application.
Note that you can also give a fully-qualified XPATH, like /rss/channel/item if the tag <item> is abiguous or you want to be more specific about the source of the records.
The field command a single XPATH expression that is used to identify each field in the generated record. If you specify a path, it must be below the item in the record, and it is relative to the node defined in the record construct. Paths like tag/@id would retrieve the value of the id attribute in this: <tag id=“123”>, with the ”@” representing an attribute path.
record //rss/channel/item field category field title field link field media:description name=caption field media:content/@url field media:content/@height field media:content/@width end
By adding the option name=caption above, the default name of “description” is overridden with the name “caption” for the column name. This same technique can be applied to the record construct to control the name of the generated table.
The *end* command added just closes up the record line. If you omit it, the system will fill it in for you, but it's good form to try to remember it.
Another directive, subrecords deals with the situation where a node contains multiple instances of another node that illustrates a many-to-one relationship That's a mouthful, and for now it won't be explained here. But just remember there are escape hatches for dealing with more complex XML and you should feel free to ask for help if you can't figure out how to do what you want. The Data Source handling is like a Swiss Army Knife, with lots of strange but useful gadgets… ask for help if you're not sure how it works.
Note that for each subrecords command, you must supply a matching end command. The system will not try to resolve this omission for you.
The “with” directive allows you to follow an embedded URL to another XML document and subsequently use that document in other commands. The with command takes a URL path and must also have a name= parameter specified. That name is used thereafter to reference that document. For example, in the following statements, the with command instructs the system to take the channel/article/url path contents from the source document, follow that url and create an instance of a document that can be referenced as “rss” thereafter. In the subsequent field and content commands, the use of source=rss means that the XPATH expressions refer to the “rss” document, not the original document. <code> record channel/article
with url name=rss field //rss/channel/item/title source=rss content //rss/channel/item/title source=rss
Note that in some cases the most common use of this may be to pick up subrecords from another document, like this
record //channel/article with gallery_url name=gallery subrecords //rss/channel/item source=gallery field media:content@url end end
NOTE THAT IN THIS CASE, THAT ONCE THE source= IS ATTACHED TO THE subrecords DIRECTIVE, EVERYTHING INSIDE IT REFERENCES THAT CONTENT. If you want to references the source that was used outside the subrecords directive, you can say “source=..” and refer to the XPATH context that was in use when the subrecords directive was processed. This only applies when the subrecords directive uses the source= option.
The constant command lets you add another data field that will contain the same field in every record. This feature is used infrequently, but in fact, if you ever have a Merged Data Source created from a series of RSS feeds, you might want this feature.
Data Descriptor Options
All commands except the end command can take options. The options are of the form x=y, where x can be one of the items below. The most-commonly used ones are listed first, followed by an alphabetical list of the remaining options.
|download=||field, content, attr||download=true creates a local file by following URL. download=blob is similar, but puts content into local database blob. download=hosted causes the data to be hosted by EachScape. download=original is the same as download=true, but preserves the original file name.*|
|index=||field, content, attr||Creates a full-text index that includes this field: index=true|
|name=||gather, content, attr, record, field, subrecords, with, constant||Overrides default name of record or field|
|pattern=||field||accepts a regular expression. The regular expression must contain a set a parentheses. The part of the expression that matches what is in the parentheses will be saved as the value; the rest will be discarded. Note that if you leave out the parentheses or have a regular expression that matches nothing, the field may be empty.|
|strip=||field||strip=links will remove any HTML links while preserving the link text|
|type=||field, content, attr||type=timestamp reformats dates, type=markup handles ill-formed XML that embeds HTML without escaping or CDATA.|
|Used for Image Post-Processing, these are only meaningful when download= is present. (Ask an EachScape employee for assistance.)|
|size=||field, content, attr||Modify the dimensions of the downloaded image. More information on this can be found here https://www.imagemagick.org/Magick++/Geometry.html|
|quality=||field, content, attr||Reduces the quality (and thus the size) of the image file. if used, must be a number between 1 and 100 inclusive. For JPEG files, 1 selects the poorest quality (and smallest file) while 100 produces the best quality (and largest file). Generally you don't ever need a number between 80-90 as large values may actually increase the file size without producing better quality. For PNG files, the numbers 1 to 100 are also used, and regulates the time spent attempting to compress the image. PNG files do not lose image quality as this value is changed, merely the resulting size may vary. The quality value does not apply for any other file type.|
|Used for Video Post-Processing*, these are only meaningful when download= is present.|
|rotate=||field, content, attr||rotate=90left rotates the video 90° counter-clockwise, rotate=90right rotates the video 90° clockwise. Note that these options require video re-encoding.|
|Options that are used infrequently|
|hostedid=||field, content, attr||Ask an EachScape employee for assistance with this option.|
|source=||field, content, field, subrecord||Refer to a name created by the “with” directive|
|split=||field, content, attr||split=, would cause the data value to be split at each comma, creating a tag field. Any non-blank character can be used.|
|The following are deprecated. To perform these functions you are encouraged to use XPATH selectors that include conditional selection.|
|ignore=||content||Ask an EachScape employee for assistance with this option.|
|ifAttr=||content, attr||Ask an EachScape employee for assistance with this option.|
* Currently only available on tabular data sources.
Permanent Unique Key
With data sources, you may want to allow users of the app to bookmark data. This is only possible if the data source records contain an assigned Permanent Unique Key. We call them a "PUK" for short. There are two really important words in that name: Permanent means it will never change once assigned and Unique means just there will never be a duplicate value assigned. Simple enough, you'd think, but it's where a lot of data sources get tripped up. If you know what the PUK<nowiki> is for your data source, you can easily define it in the field for the Permanent Unique Key. Go back into the data source definition and add it in the box labeled Permanent Unique Keys. Note that you can have one <nowiki>PUK for each record. Most data sources contain only one record, but it is possible to have multiple records.
The following commands are available, but their use is discouraged.
gather tells the EachScape Builder what node (e.g., <item>) is the source of your records. As you can see by looking it over, it contains only 3 <item> nodes, so the resulting data source will have 3 items in it if you have the “gather item” command at the start of the Descriptor.
If instead you said “gather category” you'd get many more records, but all they could contain is the contents in or enclosed by the <category> nodes. Generally you want the highest level node that repeats, but the decision of what you gather is dependent on the data and your application. For the rest of this discussion, let's assume I say “gather item” and want to build up 3 item records.
content tells the EachScape Builder to extract what's inside the referenced node. The content command takes an XPATH expression. If you just give a node name or collection of node names like “a/b/c” then they are assumed to be immediately subordinate to the node referenced by the gather command. If, the XPATH starts with a period, you can reference anything in the structure of the document, but again, it is relative to the node referenced by the gather directive.
Let's say I want to get at the category underneath the item, all I need to add is “content category”.
gather item content category
We indent “content category” and other lines after gather just to clarify the structure, but it's not necessary. The content nodes look like this,
<category domain="http://www.nytimes.com/namespaces/des">Banks and Banking</category>
so if I say “content category” I'll pick up the “Banks and Banking” value. (“domain=” is an attribute; we'd use a different directive to pick up that value. More on that, below.)
Let's pick up a few more items from the XML, by expanding the descriptor to look like this:
gather item content category content title content link content media:description name=caption
Note that I added the words name=caption after the last line. This tells the system to create a field named “caption” rather than the default name, which would be “description”.
attr instructs the system to pick up an attribute value.
The attr command has two formats. It can either take one or two arguments: usually you see the attr command with two arguments. The first one is an XPATH expression (see the description above) to navigate to the desired node from the gather command's node and the second argument is the name of the attribute itself. If there is only one argument, it means that you're trying to pick up an attribute on the node referenced by the gather command itself.
For example, this bit of XML has the URL of an image, as well as its dimensions.
<media:content url="http://graphics8.nytimes.com/images/2010/05/13/business/13street_CA0/13street_CA0-thumbStandard.jpg" medium="image" height="75" width="75"/>
It's easy to extract them by expanding the Descriptor to look like this.
gather item content category content title content link content media:description name=caption attr media:content url attr media:content height attr media:content width end
If you inserted a line like this “attr id” it would attempt to pick up the “id” attribute's value directly from the node that the gather command referenced. In this example, we don't have any attributes on that node, so it can't be used here.