Historical Version of Media_Sources from 2021-01-05T07:19:49-08:00.
page_type=standard
page_alias=
page_border=solid-border
toc=true
title=
author=
robots=
description=
alternative_path=
page_header=
page_footer=
sort=aname
END_HEAD_VARS
Media Sources are used to specify how Yioop should handle video and news sites.
A Video source is used to specify where to find the thumbnail of a video given the URL of the video on a website. Yioop uses this when displaying search results that contain the video link, so that it can show the thumbnail. For example, if the URL value is
http://www.youtube.com/watch?v={}
and the Thumb value is
http://i1.ytimg.com/vi/{}/default.jpg,
this tells Yioop that if a search result contains a link such as
https://www.youtube.com/watch?v=dQw4w9WgXcQ
then the thumbnail can be found at
http://i1.ytimg.com/vi/dQw4w9WgXcQ/default.jpg
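The substitution described above can be sketched in a few lines of Python. This is only an illustration, not Yioop's own code: the function name thumbnail_for is hypothetical, and the sketch assumes that the scheme (http versus https) is ignored when matching and that the value standing in for {} runs up to the next &, #, / or ? character.

```python
import re

def strip_scheme(url):
    """Drop a leading http:// or https:// (assumed irrelevant for matching)."""
    return re.sub(r"^https?://", "", url)

def thumbnail_for(video_url, url_pattern, thumb_pattern):
    """Return the thumbnail URL for video_url, or None if it does not match."""
    prefix, suffix = strip_scheme(url_pattern).split("{}", 1)
    # Turn the Url value into a regex in which {} captures the video id.
    regex = re.escape(prefix) + r"([^&#/?]+)" + re.escape(suffix)
    match = re.match(regex, strip_scheme(video_url))
    if match is None:
        return None
    # Substitute the captured id into the Thumb value.
    return thumb_pattern.replace("{}", match.group(1))

print(thumbnail_for(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "http://www.youtube.com/watch?v={}",
    "http://i1.ytimg.com/vi/{}/default.jpg",
))
```

With the values from the example, this prints http://i1.ytimg.com/vi/dQw4w9WgXcQ/default.jpg.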
An RSS media source can be used to add an RSS or Atom feed (it auto-detects which kind) to the list of feeds which are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question. The Category field should usually be left at news. If you want to specify additional categories such as weather or sports, you typically want to create a mix that searches the default index with the keyword media:your_category injected, and then make a new subsearch with that mix.
This will allow your new category to show up on the Tools/More/Other Searches page.
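The text does not say how Yioop tells RSS from Atom. One common approach, sketched here under that assumption, is to look at the root element of the downloaded document: RSS 2.0 feeds have an rss root, while Atom feeds have a feed root in the Atom namespace. The function name feed_kind is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds, in ElementTree's {uri}tag notation.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

def feed_kind(xml_text):
    """Guess the feed type from the root element of the document."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "rss"
    if root.tag == ATOM_NS + "feed":
        return "atom"
    return "unknown"

print(feed_kind('<rss version="2.0"><channel/></rss>'))
print(feed_kind('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
```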
An HTML media source is a web page that has feed articles, like an RSS page, that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you provide XPath expressions for the different parts of an item. For example,
Name: Cape Breton Post
URL: http://www.capebretonpost.com/News/Local-1968
Language: English
Category: news
Channel: //div[contains(@class, "channel")]
Item: //article
Title: //a
Description: //div[contains(@class, "dek")]
Link: //a
The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, and so on. Link extracts the href attribute of that same //a.
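The nesting of these paths (Channel, then Item relative to Channel, then Title, Description, and Link relative to each Item) can be illustrated with Python's standard library. This is a sketch under several assumptions: the sample page is made up and well-formed, the function name extract_items is hypothetical, and because xml.etree.ElementTree supports only a subset of XPath, an exact class match stands in for the contains() predicates used above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page; real pages would need an HTML parser.
PAGE = """<html><body>
<div class="channel">
  <article><a href="/news/1">First story</a><div class="dek">Summary one</div></article>
  <article><a href="/news/2">Second story</a><div class="dek">Summary two</div></article>
</div>
</body></html>"""

def extract_items(page):
    """Apply the Channel, Item, Title, Description and Link paths in turn."""
    root = ET.fromstring(page)
    # Channel: the tag enclosing all news items.
    channel = root.find(".//div[@class='channel']")
    items = []
    # Item: evaluated relative to the channel.
    for article in channel.findall(".//article"):
        # Title, Description and Link: evaluated relative to one item.
        anchor = article.find(".//a")
        dek = article.find(".//div[@class='dek']")
        items.append({
            "title": anchor.text,
            "description": dek.text,
            "link": anchor.get("href"),  # Link uses the href of the same //a
        })
    return items

for item in extract_items(PAGE):
    print(item)
```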
A JSON media source is used to scrape feed articles from JSON data, as may be provided by a website's API. To handle a JSON media source you provide the same information as with an HTML media source. Internally, Yioop converts all JSON sources to XML before processing. The root object maps to /html/body.
A property foo of the root object would get mapped to a tag <foo>. Array elements are mapped to a sequence of elements enclosed in <item> tags. The process is applied recursively until the JSON object is completely converted to an XML page. Once this is done, the XPaths that a user provides are used to extract the feed items in the same way that HTML feeds are extracted. As an example, Yioop search results and discussion groups can be output as JSON. To take Yioop's news feed and use it as a JSON media source in your search engine, you could use the settings:
Name: Yioop News
URL: https://www.yioop.com/s/news?f=json
Language: English
Category: news
Channel: //channel
Item: //item
Title: //title
Description: //description
Link: //link
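The JSON-to-XML mapping described above can be sketched as a short recursive function. This is not Yioop's implementation: the function names and the sample JSON (including its stories property) are hypothetical, and the sketch assumes every property name is a valid XML tag name.

```python
import json
import xml.etree.ElementTree as ET

def fill(parent, value):
    """Recursively attach a decoded JSON value beneath an XML element."""
    if isinstance(value, dict):
        # A property foo becomes a <foo> child element.
        for key, child in value.items():
            fill(ET.SubElement(parent, key), child)
    elif isinstance(value, list):
        # Each array element is wrapped in its own <item> element.
        for child in value:
            fill(ET.SubElement(parent, "item"), child)
    elif value is not None:
        parent.text = str(value)

def json_to_xml(json_text):
    """Map the root JSON object to /html/body and convert its contents."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    fill(body, json.loads(json_text))
    return html

# Made-up sample data, not actual Yioop output.
SAMPLE = json.dumps({"channel": {"title": "Example Feed", "stories": [
    {"title": "A", "description": "First", "link": "http://example.com/a"},
    {"title": "B", "description": "Second", "link": "http://example.com/b"},
]}})

root = json_to_xml(SAMPLE)
channel = root.find(".//channel")
titles = [item.find(".//title").text for item in channel.findall(".//item")]
print(ET.tostring(root, encoding="unicode"))
print(titles)
```

After the conversion, the same Channel, Item, and Title paths used for HTML sources select the feed items from the generated tree.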
A Regex media source is a source of feed articles presented in some kind of non-tag-based text format.
For example, the US National Weather Service has a text-based page for weather forecasts of major US cities
at
http://forecast.weather.gov/product.php?site=NWS&
issuedby=04&product=SCS&format=txt&
version=1&glossary=0
Changing the 04 above to 03, 02, or 01 varies the group of cities. Most of the data on this page appears in a pre tag as text.
Channel in this case would be a regex whose first capture group corresponds to the contents of this pre tag. We might want to get one item per line from the pre tag, as each line corresponds to the weather for one city. The Item Separator is a regex used to split the results of the Channel operation into items. Finally, Title, Description, and Link are regexes, each with one capture group, used to get these respective feed item components out of an item produced by the splitting process above. Hence, a reasonable choice of values for the weather service page might be:
Name: National Weather Service 04
URL: http://forecast.weather.gov/product.php?
site=NWS&issuedby=04&product=SCS&format=txt&
version=1&glossary=0
Language: English
Category: weather
Channel: /<pre(?:.+?)>([^<]+)/m
Item: /\n/
Title: /^(.+?)\s\s\s+/
Description: /\s\s\s+(.+?)$/
Link: http://www.weather.gov/
Notice in the above that the Link element is http://www.weather.gov/. If you have a feed
that doesn't provide links for individual items, you can always provide a link to some
fixed site by directly entering a URL in the Link field.
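The Channel, Item Separator, Title, and Description steps can be traced with Python's re module. This is a sketch, not Yioop's code: the regexes are the ones from the settings above with the PHP-style delimiters removed, the item separator is assumed to be a newline (one item per line), the sample text is made up rather than real weather service output, and the function name extract_items is hypothetical.

```python
import re

CHANNEL = re.compile(r"<pre(?:.+?)>([^<]+)", re.M)
ITEM_SEPARATOR = re.compile(r"\n")  # assumed: one item per line
TITLE = re.compile(r"^(.+?)\s\s\s+")
DESCRIPTION = re.compile(r"\s\s\s+(.+?)$")
LINK = "http://www.weather.gov/"  # fixed link used for every item

# Made-up sample page in the spirit of the forecast text.
PAGE = '<pre class="x">\nALBANY   SUNNY 70\nBOSTON   RAIN 55\n</pre>'

def extract_items(page):
    """Run Channel, split into items, then pull Title and Description."""
    channel = CHANNEL.search(page)
    if channel is None:
        return []
    items = []
    for chunk in ITEM_SEPARATOR.split(channel.group(1)):
        title = TITLE.search(chunk)
        description = DESCRIPTION.search(chunk)
        # Skip blank or malformed lines that lack either component.
        if title and description:
            items.append({
                "title": title.group(1),
                "description": description.group(1),
                "link": LINK,
            })
    return items

for item in extract_items(PAGE):
    print(item)
```

For the sample text this yields two items, titled ALBANY and BOSTON, each linking to the fixed URL.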
Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give the path, relative to the root of the whole file, to a thumbnail for the news source. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
http://feeds.wired.com/wired/index
//description/div[contains(@class,
"rss_thumbnail")]/img/@src