How to Extract Only the Content from a Web Page
Have you ever visited a web page and had to take a moment to figure out where the content was, because the page was so heavily loaded with everything else? With so many websites, each with its own design, you may simply want to read a page’s content without having to deal with the extras (navigation, ads, social features…).
The excellent folks at Arc90 have come up with a solution: the Readability bookmarklet. This easy-to-use bookmarklet extracts the main content from a web page and displays it in a simple yet pretty way. You can even customize the style, size and margins to make your reading as enjoyable as possible. The bookmarklet uses a generic algorithm that works on most pages that actually have content. It is not 100% accurate, but Arc90 claims a success rate of over 99%. Try it yourself on this page by clicking here!
Besides improving the reading experience, this bookmarklet has other great uses. First, websites do not always provide printer-friendly versions of their pages. With Readability, you get a clutter-free article ready to be printed. There is even a “Print” button. Also, if you use Evernote with the Web Clipper, try running Readability on a page before clipping it. You will end up clipping only the article, which is most likely what you wanted in the first place!
Using the Readability Algorithm in Your Applications
You can also put Readability to work in your own applications when you need to extract the content of web pages. Some nice folks have ported the algorithm to other languages. See Nirmal Patel’s Python port here, Keyvan Minoukadeh’s PHP port here and Immortal’s C# port here.
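To give a rough feel for how this kind of extraction can work, here is a minimal Python sketch that uses only the standard library. It is not the Readability algorithm, nor any of the ports mentioned above. It is a much simpler heuristic in the same spirit: each paragraph gives points to its nearest enclosing container, and the container with the most points wins. The names (ParagraphScorer, extract_main_text, SKIP_TAGS, CONTAINER_TAGS) and the scoring weights are made up for illustration, and the sketch assumes reasonably well-formed HTML with closed paragraph tags.

```python
from html.parser import HTMLParser

# Hypothetical tag sets chosen for this sketch, not taken from Readability.
# Tags whose contents are never treated as article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Tags treated as candidate containers for the article body.
CONTAINER_TAGS = {"div", "article", "section", "td", "body", "main"}


class ParagraphScorer(HTMLParser):
    """Collects <p> text and credits it to the nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.candidates = []   # one dict per container seen so far
        self.containers = []   # stack of indexes into self.candidates
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.in_paragraph = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTAINER_TAGS:
            self.candidates.append({"score": 0, "paragraphs": []})
            self.containers.append(len(self.candidates) - 1)
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINER_TAGS and self.containers:
            self.containers.pop()
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.buffer).split())
            if text and self.containers:
                candidate = self.candidates[self.containers[-1]]
                # Arbitrary weights: longer, comma-rich paragraphs
                # look more like prose than like menu entries.
                candidate["score"] += 1 + text.count(",") + min(len(text) // 100, 3)
                candidate["paragraphs"].append(text)

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.buffer.append(data)


def extract_main_text(html):
    """Return the paragraphs of the highest-scoring container, or ""."""
    parser = ParagraphScorer()
    parser.feed(html)
    parser.close()
    if not parser.candidates:
        return ""
    best = max(parser.candidates, key=lambda c: c["score"])
    return "\n\n".join(best["paragraphs"])
```

Calling extract_main_text(html) on the source of a typical blog post should return the article’s paragraphs separated by blank lines, while short menu or footer paragraphs living in other containers are left out. Real pages are messier than this sketch allows for, which is exactly why the ports above are the better choice for anything serious.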