This Search Engine

Historical Version of Media_Sources from 2019-01-08T12:58:17-08:00.


Media Sources are used to specify how Yioop should handle news feeds and podcast sites.

An RSS media source can be used to add an RSS or Atom feed (it auto-detects which kind) to the list of feeds which are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question. The Category field can usually be left at news. If you want to specify additional categories such as weather or sports, you typically want to create a mix that searches the default index with the keyword media:your_category injected, and then make a new subsearch with that mix. This will allow your new category to show up on the Tools/More/Other Searches page.

An HTML media source is a web page that has feed articles like an RSS page that you want the Media Updater to scrape on an hourly basis. To say where the news items appear in the HTML page, you supply a set of XPath expressions. For example,

 Name: Cape Breton Post
 URL: http://www.capebretonpost.com/News/Local-1968
 Language: English
 Category: news
 Channel: //div[contains(@class, "channel")]
 Item: //article
 Title: //a
 Description: //div[contains(@class, "dek")]
 Link: //a

The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, //div[contains(@class, "dek")] gets the description, and so on. Link extracts the href attribute of that same //a.
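
The nesting of Channel, Item, and the per-item fields can be sketched in Python. This is only an illustration of the idea, not Yioop's implementation: the page markup and the function name extract_items are invented for the example, and because Python's standard-library ElementTree supports only a subset of XPath, contains(@class, ...) is replaced here by an exact class match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a scraped page (not real site markup).
PAGE = """
<html><body>
  <div class="channel">
    <article>
      <a href="/news/story-1">First headline</a>
      <div class="dek">Summary of the first story.</div>
    </article>
    <article>
      <a href="/news/story-2">Second headline</a>
      <div class="dek">Summary of the second story.</div>
    </article>
  </div>
</body></html>
"""

def extract_items(page):
    root = ET.fromstring(page.strip())
    # Channel: the tag that encloses all the news items.
    channel = root.find(".//div[@class='channel']")
    items = []
    # Item: evaluated relative to the channel.
    for article in channel.findall(".//article"):
        # Title and Link both use //a, relative to the item.
        anchor = article.find(".//a")
        dek = article.find(".//div[@class='dek']")
        items.append({
            "title": anchor.text,
            "description": dek.text,
            "link": anchor.get("href"),
        })
    return items

ITEMS = extract_items(PAGE)
```

Each path is applied relative to the node found by the previous step, which is why the short //a is enough to pick out one item's title.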

A JSON media source is used to scrape feed articles from JSON data, as may be provided by a website's API. To handle a JSON media source you provide the same information as with an HTML media source. Internally, Yioop converts all JSON sources to XML before processing. The root object maps to /html/body. A property foo of the root object is mapped to a tag <foo>. Array elements are mapped to a sequence of elements enclosed in <item> tags. The process is applied recursively until the JSON object is completely converted to an XML page. Once this is done, the XPaths that a user provides are used to extract the feed items in the same way that HTML feeds are extracted. As an example, Yioop search results and discussion groups can be output as JSON. To take Yioop's news feed and use it as a JSON media source in your search engine, you could use the settings:

 Name: Yioop News
 URL: https://www.yioop.com/s/news?f=json
 Language: English
 Category: news
 Channel: //channel
 Item: //item
 Title: //title
 Description: //description
 Link: //link
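
The JSON-to-XML mapping described above can be approximated with a short Python sketch. It is an assumption-laden illustration rather than Yioop's actual converter: the function names and the sample JSON are made up, the sample is not the real output of the Yioop news feed, and property names are assumed to be valid XML tag names.

```python
import json
import xml.etree.ElementTree as ET

def add_value(parent, value):
    """Recursively append a JSON value under an XML element."""
    if isinstance(value, dict):
        # A property foo becomes a child tag <foo>.
        for key, child in value.items():
            add_value(ET.SubElement(parent, key), child)
    elif isinstance(value, list):
        # Each array element is enclosed in its own <item> tag.
        for element in value:
            add_value(ET.SubElement(parent, "item"), element)
    elif value is not None:
        parent.text = str(value)

def json_to_xml(text):
    # The root JSON object maps to /html/body.
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    add_value(body, json.loads(text))
    return html

# Hypothetical feed data, invented for this example.
SAMPLE = json.dumps({
    "channel": {
        "name": "Example feed",
        "stories": [
            {"title": "First story", "description": "One.",
             "link": "http://example.com/1"},
            {"title": "Second story", "description": "Two.",
             "link": "http://example.com/2"},
        ],
    }
})

ROOT = json_to_xml(SAMPLE)
CHANNEL = ROOT.find(".//channel")
TITLES = [item.findtext(".//title") for item in CHANNEL.findall(".//item")]
LINKS = [item.findtext(".//link") for item in CHANNEL.findall(".//item")]
```

After the conversion, the same Channel, Item, Title, Description, and Link paths used for HTML sources apply unchanged.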

A Regex media source is a source of feed articles presented in some kind of non-tag based text format. For example, the US National Weather Service has a text-based page for weather forecasts of major US cities at

 http://forecast.weather.gov/product.php?site=NWS&
  issuedby=04&product=SCS&format=txt&
  version=1&glossary=0

Changing the 04 above to 03, 02, or 01 varies the group of cities. Most of the data on this page appears in a pre tag as text. Channel in this case would be a regex whose first capture group corresponds to the contents of this pre tag. We might want to get one item per line from the pre tag, as each line corresponds to the weather for one city. The Item Separator is a regex used to split the results of the Channel operation into items. Finally, Title, Description, and Link are regexes, each with one capture group, used to get these respective feed item components out of an item produced by the splitting process above. Hence, a reasonable choice of values for the weather service page might be:

 Name: National Weather Service 04
 URL: http://forecast.weather.gov/product.php?
  site=NWS&issuedby=04&product=SCS&format=txt&
  version=1&glossary=0
 Language: English
 Category: weather
 Channel: /<pre(?:.+?)>([^<]+)/m
 Item: /\n/
 Title: /^(.+?)\s\s\s+/
 Description: /\s\s\s+(.+?)$/
 Link: http://www.weather.gov/

Notice in the above that the Link element is http://www.weather.gov/. If a feed doesn't provide links for individual items, you can always provide a link to some fixed site by entering a URL directly in the Link field.
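
To see how the Channel, Item Separator, Title, and Description regexes cooperate, here is a minimal Python sketch that applies the patterns from the settings above to a small piece of text. The sample page content and the function name extract_items are invented for illustration, and Python's re module stands in for whatever regex engine Yioop uses internally.

```python
import re

# Patterns taken from the settings above, written in Python syntax.
CHANNEL = re.compile(r"<pre(?:.+?)>([^<]+)", re.M)
ITEM_SEPARATOR = re.compile(r"\n")
TITLE = re.compile(r"^(.+?)\s\s\s+")
DESCRIPTION = re.compile(r"\s\s\s+(.+?)$")
FIXED_LINK = "http://www.weather.gov/"

# Hypothetical page text; the real forecast layout may differ.
PAGE = (
    '<html><body><pre class="forecast">\n'
    "ALBANY NY      SUNNY 75/50\n"
    "BOSTON MA      RAIN 68/55\n"
    "</pre></body></html>"
)

def extract_items(page):
    channel = CHANNEL.search(page)
    if channel is None:
        return []
    items = []
    # Split the captured pre contents into one candidate item per line.
    for line in ITEM_SEPARATOR.split(channel.group(1)):
        title = TITLE.search(line)
        description = DESCRIPTION.search(line)
        # Blank or malformed lines match neither pattern and are skipped.
        if title and description:
            items.append({
                "title": title.group(1),
                "description": description.group(1),
                "link": FIXED_LINK,
            })
    return items

ITEMS = extract_items(PAGE)
```

Note that the Channel pattern expects at least one character between <pre and the closing >, so it matches a pre tag that carries attributes.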

Not all feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not provide such thumbnails, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:

  http://feeds.wired.com/wired/index
 //description/div[contains(@class,
    "rss_thumbnail")]/img/@src

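The effect of removing the escaping before applying an Image XPath can be illustrated with a small Python sketch. The feed item below is invented, the function name thumbnail_url is hypothetical, and the sketch assumes the escaped HTML is well-formed enough to parse as XML. ElementTree's limited XPath support means an exact class match and a separate attribute lookup stand in for contains(...) and /@src.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS item whose description holds escaped HTML.
ITEM = """
<item>
  <title>Example story</title>
  <description>&lt;div class="rss_thumbnail"&gt;&lt;img src="http://example.com/thumb.jpg" /&gt;&lt;/div&gt;Story text.</description>
</item>
"""

def thumbnail_url(item_xml):
    item = ET.fromstring(item_xml.strip())
    # Parsing the item already turns &lt;div ...&gt; back into <div ...>.
    unescaped = item.findtext("description", default="")
    # Re-parse the unescaped markup under a description root tag so that
    # a path such as description/div/img can be followed into it.
    description = ET.fromstring("<description>" + unescaped + "</description>")
    image = description.find("./div[@class='rss_thumbnail']/img")
    return image.get("src") if image is not None else None

THUMBNAIL = thumbnail_url(ITEM)
```
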
A Feed Podcast source is an RSS or Atom source where each item contains a link to a podcast or video podcast. For example,

 http://feed.cnet.com/feed/podcast/all/hd.xml

The Alternative Link Tag field gives the XPath, within the feed item, of the link to the audio or video file. For the CNet example, this is:

 enclosure

If it is blank, the default link tag is used. When the Media Updater job runs, it checks whether any items in the feed are new. If so, it downloads them to the wiki resource folder of the wiki page provided in the Wiki Destination field. This page is given in the format GroupName@PageName. If you give just PageName, the Public group is assumed. The Expires field controls how long a feed item is kept before it is deleted. Yioop supports the downloading of single video or audio file sources, as well as more complicated stream sources such as m3u8 streams.
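
The GroupName@PageName convention for the Wiki Destination field can be made concrete with a short Python sketch. The function name parse_wiki_destination and the sample group and page names are hypothetical; only the format and the Public default come from the description above.

```python
def parse_wiki_destination(destination, default_group="Public"):
    """Split a Wiki Destination value into (group, page)."""
    group, separator, page = destination.partition("@")
    if not separator:
        # No @ present: the whole value is the page name.
        return default_group, destination
    return group, page

WITH_GROUP = parse_wiki_destination("News@Podcasts")
WITHOUT_GROUP = parse_wiki_destination("Podcasts")
```

Here WITH_GROUP is ("News", "Podcasts"), while WITHOUT_GROUP falls back to ("Public", "Podcasts").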

A Scrape Podcast source is like a Feed Podcast source, but for the case where one has an HTML or XML page with a periodically updated link to a video or audio source. For example, it might be an evening news web site. The URL field should be the page with the periodically updated link. The Aux Url XPath field, if not blank, should be an XPath on this page to the HTML or XML page that contains the media source for that day. Finally, on the page for the given day, Download XPath should be the XPath of the URL of the video or audio file to download.
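
The two-step lookup can be sketched in Python as follows. Everything here is a stand-in: the pages, URLs, and function names are invented, fetch reads from an in-memory dictionary instead of the network, and since ElementTree cannot select attributes with /@href, the attribute name is passed separately from the element path.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pages keyed by URL, used in place of real downloads.
PAGES = {
    "http://example.com/evening-news":
        '<html><body><a class="latest" href="/episodes/today.html">'
        "Tonight</a></body></html>",
    "http://example.com/episodes/today.html":
        '<html><body><video><source src="/media/today.mp4" /></video>'
        "</body></html>",
}

def fetch(url):
    return PAGES[url]

def first_attribute(page, element_path, attribute):
    element = ET.fromstring(page).find(element_path)
    return None if element is None else element.get(attribute)

def find_media_url(url, aux_path, download_path):
    page_url = url
    if aux_path:
        # Step 1: follow the periodically updated link to the day's page.
        href = first_attribute(fetch(page_url), aux_path, "href")
        page_url = urljoin(page_url, href)
    # Step 2: on the day's page, locate the media file to download.
    src = first_attribute(fetch(page_url), download_path, "src")
    return urljoin(page_url, src)

MEDIA_URL = find_media_url(
    "http://example.com/evening-news",
    ".//a[@class='latest']",
    ".//video/source",
)
```

When the auxiliary path is left blank, the first step is skipped and the download path is applied directly to the page named in the URL field.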