Media Sources specify how Yioop should handle news feeds, podcasts, and trending value sites. The Add Media Source form lets you add new media sources. What this form looks like depends on the choice made in the Type dropdown. Below we describe the form for each possible type:
An RSS media source can be used to add an RSS or Atom feed (it auto-detects which kind) to the list of feeds which are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question. The Category field should usually be left at news. If you want to specify additional categories such as weather or sports, you typically want to create a mix that searches the default index with the keyword media:your_category injected, and then make a new subsearch with that mix. This will allow your new category to show up on the Tools/More/Other Searches page.
An HTML media source is a web page that has feed articles, like an RSS page, that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you provide XPath information. For example,
Name: Cape Breton Post
URL: http://www.capebretonpost.com/News/Local-1968
Language: English
Category: news
Channel: //div[contains(@class, "channel")]
Item: //article
Title: //a
Description: //div[contains(@class, "dek")]
Link: //a
The Channel field specifies the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, and so on. Link extracts the href attribute of that same //a.
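The nesting of Channel, Item, and the per-item paths can be sketched in Python. This is only an illustration of the relative-path idea, not Yioop's implementation: the page markup is a made-up, well-formed stand-in, and because the standard library's ElementTree does not support contains(), the sketch assumes exact class matches instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a scraped news page.
PAGE = """
<html><body>
  <div class="channel">
    <article>
      <a href="/news/one">First headline</a>
      <div class="dek">Summary of the first story.</div>
    </article>
    <article>
      <a href="/news/two">Second headline</a>
      <div class="dek">Summary of the second story.</div>
    </article>
  </div>
</body></html>
"""

def extract_items(page):
    root = ET.fromstring(page.strip())
    # Channel: the element that encloses all the news items.
    channel = root.find(".//div[@class='channel']")
    items = []
    # Item: evaluated relative to the channel element.
    for article in channel.findall(".//article"):
        # Title and Link both use //a, relative to the item.
        anchor = article.find(".//a")
        dek = article.find(".//div[@class='dek']")
        items.append({
            "title": anchor.text,
            "link": anchor.get("href"),  # Link takes the href attribute
            "description": dek.text,
        })
    return items

items = extract_items(PAGE)
```

Each path is evaluated against the node found by the previous one, which is why the same //a can serve for both Title and Link.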
A JSON media source is used to scrape feed articles from JSON data, such as that provided by a website's API. To handle a JSON media source you provide the same information as with an HTML media source. Internally, Yioop converts all JSON sources to XML before processing. The root object maps to /html/body. A property foo of the root object gets mapped to a tag <foo>. Array elements are mapped to a sequence of elements enclosed in <item> tags. The process is applied recursively until the JSON object is completely converted to an XML page. Once this is done, the XPaths that a user provides are used to extract the feed items in the same way that HTML feeds are extracted. As an example, Yioop search results and discussion groups can be output as JSON. To take Yioop's news feed and use it as a JSON media source in your search engine, you could use the settings:
Name: Yioop News
URL: https://www.yioop.com/s/news?f=json
Language: English
Category: news
Channel: //channel
Item: //item
Title: //title
Description: //description
Link: //link
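The JSON-to-XML mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Yioop's code: the function names and the sample JSON (including its stories property) are hypothetical, and scalar values are simply stored as element text.

```python
import json
import xml.etree.ElementTree as ET

def json_to_element(value, parent):
    """Recursively attach a decoded JSON value beneath parent."""
    if isinstance(value, dict):
        # A property foo becomes a <foo> child element.
        for key, child in value.items():
            json_to_element(child, ET.SubElement(parent, key))
    elif isinstance(value, list):
        # Each array element is wrapped in its own <item> element.
        for child in value:
            json_to_element(child, ET.SubElement(parent, "item"))
    elif value is not None:
        # Assumption: scalars are stored as the text of their element.
        parent.text = str(value)

def json_to_page(json_text):
    """Map the root JSON object to /html/body and return the <html> root."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    json_to_element(json.loads(json_text), body)
    return html

# Hypothetical feed data; the property names are illustrative only.
SAMPLE = json.dumps({"channel": {"stories": [
    {"title": "Story A", "link": "https://example.com/a",
     "description": "First story"},
    {"title": "Story B", "link": "https://example.com/b",
     "description": "Second story"},
]}})

page = json_to_page(SAMPLE)
channel = page.find(".//channel")
titles = [item.find(".//title").text for item in channel.findall(".//item")]
```

After conversion, the same Channel, Item, Title, Description, and Link paths used for HTML sources apply to the generated tree.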
A Regex media source is a source of feed articles presented in some kind of non-tag-based text format. For example, the US National Weather Service has a text-based page for weather forecasts of major US cities at
http://forecast.weather.gov/product.php?site=NWS&issuedby=04&product=SCS&format=txt&version=1&glossary=0
Changing the 04 above to 03, 02, or 01 varies the group of cities. Most of the data on this page appears in a pre tag as text. Channel in this case would be a regex whose first capture group corresponds to the contents of this pre tag. We might want to get one item per line from the pre tag, as that would correspond to the weather for one city. The Item Separator is a regex used to split the results of the Channel operation into items. Finally, Title, Description, and Link are regexes, each with one capture group, used to get these respective feed item components out of an item produced by the splitting process above. Hence, a reasonable choice of values for the weather service page might be:
Name: National Weather Service 04
URL: http://forecast.weather.gov/product.php?site=NWS&issuedby=04&product=SCS&format=txt&version=1&glossary=0
Language: English
Category: weather
Channel: /<pre(?:.+?)>([^<]+)/m
Item: /\n/
Title: /^(.+?)\s\s\s+/
Description: /\s\s\s+(.+?)$/
Link: http://www.weather.gov/
Notice in the above that the Link element is http://www.weather.gov/. If a feed doesn't provide links for individual items, you can always provide a link to some fixed site by directly entering a URL in the Link field.
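The channel, split, and per-item steps of a Regex source can be sketched with Python's re module. The patterns are the ones from the example settings with the PHP-style delimiters removed. The page text is a made-up stand-in for the weather service output, and skipping lines that do not match is an assumption of this sketch.

```python
import re

# Hypothetical stand-in for the downloaded page.
PAGE = """<html><body><pre class="glossaryProduct">
ALBANY NY      PTCLDY    27  10
BOSTON MA      SUNNY     35  20
</pre></body></html>"""

CHANNEL = re.compile(r"<pre(?:.+?)>([^<]+)", re.M)
ITEM_SEPARATOR = re.compile(r"\n")
TITLE = re.compile(r"^(.+?)\s\s\s+")
DESCRIPTION = re.compile(r"\s\s\s+(.+?)$")
LINK = "http://www.weather.gov/"  # fixed URL rather than a regex

def extract_items(page):
    # Channel: the first capture group holds the contents of the pre tag.
    channel = CHANNEL.search(page)
    if channel is None:
        return []
    items = []
    # Item Separator: split the channel text into one item per line.
    for line in ITEM_SEPARATOR.split(channel.group(1)):
        title = TITLE.search(line)
        description = DESCRIPTION.search(line)
        if title and description:  # assumption: skip non-matching lines
            items.append({
                "title": title.group(1),
                "description": description.group(1),
                "link": LINK,
            })
    return items

items = extract_items(PAGE)
```

The title is everything before the first run of three or more whitespace characters, and the description is everything after it.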
Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
https://feeds.wired.com/wired/index
//description/div[contains(@class, "rss_thumbnail")]/img/@src
A Feed Podcast source is an RSS or Atom source where each item contains a link to a podcast or video podcast. For example,
http://feed.cnet.com/feed/podcast/all/hd.xml
The Alternative Link Tag field gives the XPath, within a feed item, to the link for the audio or video file. For the CNet example, this is:
enclosure
If it is blank, the default link tag is used. When the media updater job runs, it checks if any items in the feed are new. If so, it downloads them to the wiki resource folder of the wiki page provided in the Wiki Destination field. This page is given in the format GroupName@PageName. If you give just PageName, the Public group is assumed. The Expires field controls how long a feed item is kept before it is deleted.
For example, if we wanted to download the popular Ted talk podcasts into the Ted subfolder of the resource folder of the News and Podcasts wiki page of the Library group, where podcasts expire after one month, we could do:
Name: Ted
URL: https://pa.tedcdn.com/feeds/talks.rss
Language: English
Expires: One Month
Alternative Link Tag: enclosure
Wiki Destination: Library@News and Podcasts/Ted/%Y-%m-%d %F
Notice the destination string has "%Y-%m-%d %F" in it. This portion of the destination gives the format of the filename to use when storing a downloaded podcast file. It says to name the file as the current year, hyphen, month, hyphen, day, space, then the filename as given in the URL. %F stands for the filename; the other % modifiers can be standard date formatting instructions.
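How a Wiki Destination string breaks down, and how its filename pattern expands, can be sketched as below. This is a hedged illustration, not Yioop's implementation. Both function names are hypothetical. The sketch assumes the first path segment after the group is the page name, that the filename is the last path segment of the media URL, and that %F is handled separately from the date codes.

```python
from datetime import datetime

def parse_wiki_destination(destination, default_group="Public"):
    """Split 'GroupName@PageName/sub/pattern' into its three parts."""
    group, sep, rest = destination.partition("@")
    if not sep:
        # Only a page name was given, so the Public group is assumed.
        group, rest = default_group, destination
    # Assumption: the first path segment is the wiki page name.
    page, _, file_pattern = rest.partition("/")
    return group, page, file_pattern

def expand_file_pattern(pattern, media_url, when):
    """Expand the date codes and %F (filename from the URL) in pattern."""
    filename = media_url.rstrip("/").rsplit("/", 1)[-1]
    # Format the pieces around %F separately so %F is never passed to
    # strftime, where it has a different, platform-dependent meaning.
    parts = [when.strftime(part) for part in pattern.split("%F")]
    return filename.join(parts)

group, page, pattern = parse_wiki_destination(
    "Library@News and Podcasts/Ted/%Y-%m-%d %F")
stored_as = expand_file_pattern(
    pattern, "https://example.com/media/talk.mp4", datetime(2024, 3, 5))
```

With the hypothetical URL and date above, the file would be stored under the Ted subfolder with the date prefixed to the original filename.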
Yioop supports the downloading of single video or audio file sources, as well as more complicated stream sources such as m3u8 streams.
A Scrape Podcast source is like a Feed Podcast source, but where one has an HTML or XML page which has a periodically updated link to a video or audio source. For example, it might be an evening news web site. The URL field should be the page with the periodically updated link. The Aux Url XPaths field, if not blank, should be a sequence of XPaths or regexes, one per line. The first line will be applied to the page to obtain a next url to download. The next line's XPath or regex is applied to this file, and so on. The final url generated should be to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, Download XPath should be the XPath of the url of the video or audio file to download.
If a regex is used rather than an XPath, then the first capture group of the regex should give the url. A regex can be followed by json| to indicate the first capture group should be converted to a JSON object. The json| is then followed by a |-separated path through the sub-objects of this object to a url. As an example, consider the following, which, at some point, could download the Daily News Scrape Podcast to a wiki group:
Type: Scrape Podcast
Name: Daily News Podcast
URL: https://www.somenetwork.com/daily-news
Language: English
Aux Url XPaths:
/(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/
/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl
Download XPath: //video[contains(@height,'540')]
Wiki Destination: My Private Group@Podcasts/%Y-%m-%d.mp4
The initial page to be downloaded will be: https://www.somenetwork.com/daily-news. On this page, we will use the first Aux Url XPaths line to find a string in the page that matches /(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/. The content matched between the parentheses is the first capture group and will be the next url to download. So, for example, one might get a url:
https://cdn.somenetwork.com/daily-news/video/daily-safghdsjfg
This url is then downloaded and a string matching the pattern /window\.\_\_data\s*\=\s*([^\n]+\}\;)/ is found. The capture group portion of this string, which is what matches ([^\n]+\}\;), is then converted to a JSON object because of the json| in the Aux Url XPath. From this JSON object, we look at the video field, then its current subfield, then that subfield's 0 subfield, and finally the publicUrl field. This is the url we download next. Lastly, the Download XPath is used to get the final video link from this downloaded page.
Once this video is downloaded, it is stored in the resource folder of the Podcasts page of the My Private Group wiki group, in a file with a name in the format %Y-%m-%d.mp4.
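Applying one regex line from Aux Url XPaths, including the optional json| path, might look like the sketch below. It works on in-memory strings rather than real downloads. The function name apply_regex_step and both sample texts are hypothetical. Trimming the trailing semicolon before JSON decoding is an assumption of this sketch, not documented Yioop behavior.

```python
import json
import re

def apply_regex_step(step, text):
    """Apply a '/pattern/' or '/pattern/json|a|b' step to text."""
    # Everything after the last "/" is the optional json| suffix.
    body, _, suffix = step.rpartition("/")
    pattern = body[1:]  # drop the leading "/" delimiter
    match = re.search(pattern, text)
    if match is None:
        return None
    value = match.group(1)  # the first capture group
    if not suffix.startswith("json"):
        return value
    # Assumption: a trailing ";" is trimmed so the capture parses as JSON.
    data = json.loads(value.rstrip().rstrip(";"))
    # Walk the |-separated path; numeric keys index into arrays.
    for key in suffix.split("|")[1:]:
        data = data[int(key)] if isinstance(data, list) else data[key]
    return data

STEP_ONE = r'/(https\:\/\/cdn.somenetwork.com\/daily-news\/video\/daily-[^\"]+)\"/'
STEP_TWO = r'/window\.\_\_data\s*\=\s*([^\n]+\}\;)/json|video|current|0|publicUrl'

# Hypothetical page contents standing in for the two downloads.
first_page = '<a href="https://cdn.somenetwork.com/daily-news/video/daily-safghdsjfg">Watch</a>'
second_page = ('<script>window.__data = {"video": {"current": '
               '[{"publicUrl": "https://cdn.somenetwork.com/v/123"}]}};\n</script>')

next_url = apply_regex_step(STEP_ONE, first_page)
final_page_url = apply_regex_step(STEP_TWO, second_page)
```

In a real run, each returned url would be downloaded and handed to the next step, with the Download XPath applied to the last page fetched.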
A Trending Value source is a value on a web page that one would like to track using Yioop's trending search mechanism. The Name field is the name to use for the trending value. The URL field should be the page with the periodically updated value. Category should be the trends category (a collection of trending values) one would like to track this value with. Group Within Category is the default name of the key that will be associated with the value found on this page. Trend Value Regex is a regular expression to match against the contents downloaded from the URL. If it matches and the expression has one capture group, then that capture group will be used as the value for a particular download time. If it has two or more capture groups, the first two capture groups are used to give a key name, value pair for a particular download time. As an example,
Name: Yioop Ticker
URL: https://my-great-stock-quotes/yioop
Language: English
Category: stocks
Group Within Category: Yioop Price
Trend Value Regex: /Yioop\:\s+(\d+\.\d+)/
Here there is only one capture group, (\d+\.\d+), so searching on trending:stocks, one would see all the hourly, weekly, etc. values for the trending values with that category. One such row would be Yioop Price, whose values would be computed from the numbers extracted by this regex's (\d+\.\d+) capture group.
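The one-group versus two-group rule for Trend Value Regex can be sketched as follows. The function name and sample page text are hypothetical. Returning nothing for a pattern with no capture groups is an assumption, since that case is not described above.

```python
import re

def extract_trend_value(pattern, page_text, default_key):
    """Return a (key, value) pair for one download, or None if no match."""
    match = re.search(pattern, page_text)
    if match is None:
        return None
    groups = match.groups()
    if len(groups) == 1:
        # One capture group: use the Group Within Category name as the key.
        return default_key, groups[0]
    if len(groups) >= 2:
        # Two or more groups: the first two give the key and the value.
        return groups[0], groups[1]
    return None  # assumption: a pattern without capture groups yields nothing

single = extract_trend_value(
    r"Yioop\:\s+(\d+\.\d+)", "Quotes today. Yioop: 12.34 USD", "Yioop Price")
paired = extract_trend_value(
    r"(\w+)\:\s+(\d+\.\d+)", "ACME: 5.50", "Yioop Price")
```

The first call pairs the extracted number with the default key, while the second takes both the key and the value from the page itself.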