via python pip and installing a python egg is not that easy – YouTube.
twitter 1.8.0 : Python Package Index
June 29, 2012
twitter 1.8.0
An API and command-line toolset for Twitter (twitter.com)
Python Twitter Tools
The Minimalist Twitter API for Python is a Python API for Twitter, everyone’s favorite Web 2.0 Facebook-style status updater for people on the go.
Also included are a twitter command-line tool for getting your friends’ tweets and setting your own tweet from the safety and security of your favorite shell, and an IRC bot that can announce Twitter updates to an IRC channel.
For more information, after installing the twitter package:
import the twitter package and run help() on it
run twitter -h for command-line tool help
twitter – The Command-Line Tool
The command-line tool lets you do some awesome things:
view your tweets, recent replies, and tweets in lists
view the public timeline
follow and unfollow (leave) friends
various output formats for tweet information
The bottom line: type twitter, receive tweets.
How to Install Python Easy_install for use with Siri Server – YouTube
June 28, 2012
Installing Python from the basics.
Hadoop – Wikipedia, the free encyclopedia
June 28, 2012
Platform.
SQL or NoSQL technologies such as Hadoop or Cassandra. We do use some less-than-conventional storage technologies such as CouchDB and Redis.
A strong recommendation is that you master the fundamentals and prove out your thesis in a slightly less complex environment first before migrating to an inherently more complex distributed system, and then be ready to make major adjustments to your algorithms to make them performant once data access is no longer local. A good option to investigate if you want to go this route is Dumbo. Stay tuned to this book’s Twitter account (@SocialWebMining) for extended examples that involve Dumbo.
MySQL, NoSQL, Hadoop or Cassandra, CouchDB and Redis
NoSQL
In computing, NoSQL is a class of database management system identified by its non-adherence to the widely used relational database management system (RDBMS) model:
- It does not use SQL as its query language.
  NoSQL database systems rose alongside major internet companies, such as Google, Amazon, and Facebook, which had significantly different challenges in dealing with huge quantities of data that the traditional RDBMS solutions could not cope with. NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema. Data is partitioned among different machines (for performance reasons and size limitations), so JOIN operations are not usable and ACID guarantees are not given.
- It may not give full ACID guarantees.
  Usually only eventual consistency is guaranteed, or transactions are limited to single data items. This means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system.
- It has a distributed, fault-tolerant architecture.
  Several NoSQL systems employ a distributed architecture, with the data held in a redundant manner on several servers. In this way, the system can easily scale out by adding more servers, and failure of a server can be tolerated. This type of database typically scales horizontally and is used for managing large amounts of data when performance and real-time behavior are more important than consistency (for example, indexing a large number of documents, serving pages on high-traffic websites, and delivering streaming media).
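The partitioning and redundancy described in the list above can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the class name `TinyCluster`, the node count, and the replica count are hypothetical, and the scheme (hash a key to pick a primary node, copy it to the next node) is one simple choice, not how any particular NoSQL product works.

```python
import hashlib


class TinyCluster:
    """Toy key-value cluster: hash partitioning plus simple replication."""

    def __init__(self, node_count=3, replicas=2):
        # Each node is an in-memory dict (hypothetical stand-in for a server).
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash picks the primary node; replicas go to the next nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        primary = int(digest, 16) % len(self.nodes)
        return [(primary + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every live node responsible for the key.
        for index in self._owners(key):
            if self.alive[index]:
                self.nodes[index][key] = value

    def get(self, key):
        # Read from the first live replica that holds the key.
        for index in self._owners(key):
            if self.alive[index] and key in self.nodes[index]:
                return self.nodes[index][key]
        return None

    def fail(self, index):
        # Mark a node as unreachable.
        self.alive[index] = False


cluster = TinyCluster()
cluster.put("tweet:1", "hello world")
primary = cluster._owners("tweet:1")[0]
cluster.fail(primary)              # simulate losing one server
print(cluster.get("tweet:1"))      # served by the surviving replica
```

Because each key lives on two of the three nodes, a single failure does not lose data, and adding nodes spreads the keys further. There are no joins and no multi-key transactions here, which mirrors the trade-off the list describes.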
NoSQL database systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run-time flexibility compared to full SQL systems is compensated by significant gains in scalability and performance for certain data models.
In short, NoSQL database management systems are useful when working with a huge quantity of data whose nature does not require a relational model. The data could be structured, but that is of minimal importance; what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Examples include storing millions of key-value pairs in one or a few associative arrays, or storing millions of data records. This is particularly useful for statistical or real-time analyses of growing lists of elements (such as Twitter posts or the Internet server logs from a big group of users).
Hadoop
Apache Hadoop
Developer: Apache Software Foundation
General information
Latest stable release: 1.0.0 (December 27, 2011; 5 months ago)
Type: Distributed file system
Written in: Java
Operating system: Cross-platform
Platform: Java
License: Apache License 2.0
Current status: Active
Languages: English
From the Spanish-language article:
Apache Hadoop is a software framework that supports distributed applications under a free license.[1] It allows applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s papers on MapReduce and the Google File System (GFS).
Hadoop is a top-level Apache project that is being built and used by a global community of contributors,[2] using the Java programming language. Yahoo! has been the largest contributor to the project,[3] and uses Hadoop extensively in its business.[4]
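The MapReduce model that inspired Hadoop can be sketched in a few lines of plain Python. This is not the Hadoop API (Hadoop jobs are normally written in Java); the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and everything runs in one process, so it only shows the shape of the computation.

```python
from collections import defaultdict


def map_phase(document):
    # Emit one (word, 1) pair per word, as a mapper would.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group values by key; a real framework does this across machines.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the counts collected for each word.
    return {key: sum(values) for key, values in grouped.items()}


documents = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts["hadoop"], word_counts["data"])
```

Each mapper call looks at one document only, which is what lets a framework spread the work over thousands of nodes.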
via Hadoop – Wikipedia, the free encyclopedia.
CASSANDRA
Welcome to Apache Cassandra
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Redis
Redis is an in-memory database engine based on hash-table (key, value) storage, which can optionally be used as a durable or persistent database. It is written in ANSI C by Salvatore Sanfilippo, who is sponsored by VMware,[1][2] and is released under the BSD license, so it is considered open source software.
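The idea of an in-memory key-value store with optional persistence can be sketched as follows. This is not Redis and does not use a Redis client; `MemoryStore` and its JSON snapshot file are hypothetical names chosen only to show the concept.

```python
import json
import os
import tempfile


class MemoryStore:
    """Toy in-memory key-value store with optional snapshot persistence."""

    def __init__(self, snapshot_path=None):
        self.data = {}
        self.snapshot_path = snapshot_path
        # Reload a previous snapshot if one exists.
        if snapshot_path and os.path.exists(snapshot_path):
            with open(snapshot_path, "r", encoding="utf-8") as handle:
                self.data = json.load(handle)

    def set(self, key, value):
        self.data[key] = value

    def get(self, key, default=None):
        return self.data.get(key, default)

    def save(self):
        # Persistence is optional: without a path this is a no-op.
        if self.snapshot_path:
            with open(self.snapshot_path, "w", encoding="utf-8") as handle:
                json.dump(self.data, handle)


path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
store = MemoryStore(path)
store.set("user:1", "alice")
store.save()

restored = MemoryStore(path)       # simulates a restart
print(restored.get("user:1"))
```

All reads and writes hit the dictionary in memory; the disk is touched only when a snapshot is saved or loaded.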
COUCHDB
Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being «a database that completely embraces the web».[1] It is a NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.[1] One of its distinguishing features is easy replication. CouchDB was first released in 2005 and later became an Apache project in 2008.
CouchDB is used in certain applications for Android like «SpreadLyrics», applications for Facebook like «Will you Kissme» or «Birthday Greeting Cards», and websites like «Friendpaste»
Meebo, for their social platform (web and applications)
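Because CouchDB stores JSON and is driven over HTTP, saving a document amounts to a PUT request with a JSON body. The sketch below only builds such a request with Python's standard library and does not send it. The host and port (a local server on CouchDB's default port 5984), the database name `tweets`, and the document id `tweet-1` are assumptions for illustration.

```python
import json
import urllib.request

# Assumed values: a local CouchDB on its default port, a database named
# "tweets", and a document id "tweet-1". None of these come from the text.
base_url = "http://localhost:5984"
document = {"user": "alice", "text": "hello world"}

request = urllib.request.Request(
    url=base_url + "/tweets/tweet-1",
    data=json.dumps(document).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)

# Nothing is sent here; urllib.request.urlopen(request) would perform the call.
print(request.get_method(), request.full_url)
```

Any language with an HTTP client and a JSON encoder can talk to the database the same way, which is the sense in which it "embraces the web".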
Language processing : Python Package
June 28, 2012
collective.classification 0.1b2
Content classification/clustering through language processing
Introduction
collective.classification aims to provide a set of tools for automatic document classification. Currently it makes use of the Natural Language Toolkit and features a trainable document classifier based on Part Of Speech (POS) tagging, heavily influenced by topia.termextract. This product is mostly intended to be used for experimentation and development. Currently English and Dutch are supported.
What is this all about?
It’s mostly about having fun! The package is in a very early experimental stage and eagerly awaits contributions. You will get a good understanding of what works or not by looking at the tests. You might also be able to do some useful things with it:
1) Term extraction can be performed to provide quick insight on what a document is about.
2) On a large site with a lot of content and tags (or subjects in the Plone lingo) it might be difficult to assign tags to new content. In this case, a trained classifier could provide useful suggestions to an editor responsible for tagging content.
3) Similar documents can be found based on term similarity.
4) Clustering can help you organize unclassified content into groups.
How does it work?
At the moment the following types of utilities exist:
POS taggers, utilities for classifying words in a document as Parts Of Speech. Two are provided at the moment, a Penn TreeBank tagger and a trigram tagger. Both can be trained on a language other than English, which is what we do here.
Term extractors, utilities responsible for extracting the important terms from some document. The extractor we use here assumes that in a document only nouns matter, and uses a POS tagger to find those used most often in a document. For details please look at the code and the tests.
Content classifiers, utilities that can tag content in predefined categories. Here, a naive Bayes classifier is used. Basically, the classifier looks at already tagged content, performs term extraction and trains itself using the terms and tags as an input. Then, for new content, the classifier will provide suggestions for tags according to the extracted terms of the content.
Utilities that find similar content based on the extracted terms.
Clusterers, utilities that without prior knowledge of content classification can group content into groups according to feature similarity. At the moment NLTK’s k-means clusterer is used.
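The classifier workflow described above (extract terms from tagged content, train, then suggest a tag for new content) can be sketched in plain Python. This is not the collective.classification or NLTK API: `extract_terms` is a crude hypothetical stand-in for POS-based term extraction, and `NaiveBayesTagger` is a minimal naive Bayes classifier with add-one smoothing, written only for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_terms(text):
    # Crude stand-in for term extraction: lowercase words longer than 3 letters.
    return [word for word in text.lower().split() if len(word) > 3]


class NaiveBayesTagger:
    def __init__(self):
        self.tag_counts = Counter()               # documents seen per tag
        self.term_counts = defaultdict(Counter)   # term frequencies per tag
        self.vocabulary = set()

    def train(self, text, tag):
        terms = extract_terms(text)
        self.tag_counts[tag] += 1
        self.term_counts[tag].update(terms)
        self.vocabulary.update(terms)

    def suggest(self, text):
        terms = extract_terms(text)
        total_docs = sum(self.tag_counts.values())
        scores = {}
        for tag, doc_count in self.tag_counts.items():
            # Log prior plus log likelihoods with add-one (Laplace) smoothing.
            score = math.log(doc_count / total_docs)
            total_terms = sum(self.term_counts[tag].values())
            for term in terms:
                count = self.term_counts[tag][term] + 1
                score += math.log(count / (total_terms + len(self.vocabulary)))
            scores[tag] = score
        return max(scores, key=scores.get)


tagger = NaiveBayesTagger()
tagger.train("python package install tutorial", "programming")
tagger.train("python classifier training code", "programming")
tagger.train("football match final score", "sports")
print(tagger.suggest("install python code"))
```

With more training content per tag, the same scoring can rank several tags instead of returning only the best one, which is closer to offering suggestions to an editor.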
Download — NetworkX 1.6 documentation
June 28, 2012
Download
Source and binary releases
http://cheeseshop.python.org/pypi/networkx/
http://networkx.lanl.gov/download/networkx/
Mercurial source code repository
Anonymous
hg clone http://networkx.lanl.gov/hg/networkx
Authenticated
hg clone https://networkx.lanl.gov/hg/networkx
Documentation
http://networkx.lanl.gov/networkx_reference.pdf
http://networkx.lanl.gov/networkx_tutorial.pdf
HTML in zip file
A beginners tutorial on Social Network Analysis – (Part 1)
June 28, 2012
by NIHARJYOTI SARANGI posted on MARCH 4, 2012
Social Network Analysis (SNA) refers to the methods used for analyzing social networks, or interconnections among individuals. The individuals are taken as “nodes” and are connected to each other based on their interconnections, which may be of various types (friendship, co-authorship, kinship, sexual relations, financial exchange, common interest, etc.). SNA uses various techniques from Graph Theory, Game Theory and several other fields to study, explain and predict the network.
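The node-and-connection model can be made concrete with a short sketch in plain Python. The names and friendship ties below are invented for illustration, and the measure shown (degree centrality) is just one simple SNA technique; NetworkX provides the same measure as `degree_centrality`, but the sketch avoids depending on it.

```python
from collections import defaultdict

# Hypothetical friendship ties; each pair is an undirected edge.
ties = [("ana", "ben"), ("ana", "carl"), ("ana", "dina"), ("ben", "carl")]

graph = defaultdict(set)
for left, right in ties:
    graph[left].add(right)
    graph[right].add(left)

# Degree centrality: the share of the other nodes each node is tied to.
others = len(graph) - 1
centrality = {node: len(neighbors) / others for node, neighbors in graph.items()}

most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

Here one individual is tied to everyone else, so that node gets the highest score; on real data, such scores are a first hint at who holds a network together.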
Tools used for this tutorial: NetworkX
Programming Language: Python
Getting the tools:
NetworkX is a Python-based package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. If you are on a Linux distribution like Ubuntu, chances are it will be in your package manager. Otherwise, you can download and install the binary or compile it from source.
Matplotlib is a set of plotting tools for Python. You can download and install it from a package manager of your choice, or install it from source. It takes care of advanced 2D plotting for Python. We will use it to plot our network.
via A beginners tutorial on Social Network Analysis – (Part 1) » The Super User.
Tools: Python / NetworkX / Matplotlib
Data extraction 1
June 28, 2012
On the subject of web data extraction, I will write several posts in which I keep track of the information on the activities carried out.
- Content management tools. Bookmarklet / Blog. Visualkm / YouTube. luisyepez13 /
- Tools for extraction and analysis: Python / RubyGems / NetworkX / and others
Datamining Twitter: Part 1 Creating a Database – Twitter Research.
Datamining Twitter
posted May 13, 2010 1:42 PM by Thomas Plotkowiak [ updated Mar 2, 2012 7:11 AM ]
In this short tutorial you will learn how to collect tweets using Ruby and only two gems.
It is part of a series in which I will show you what fantastic things you can do with Twitter these days, if you love mining data :)
The first gem I would like to introduce is sequel. It is a lightweight ORM layer that lets you interface with a couple of databases in Ruby without pain. It works great with MySQL or SQLite. We will use SQLite today.
I have been using MySQL in combination with Rails and the nice ActiveRecord ORM, but for most tasks it is a bit too bulky. One problem with SQLite, though, is that it does not provide multitasking capabilities. But we will bump into that later…
To get started, visit http://sequel.rubyforge.org/ and have a look at the examples. They are pretty straightforward. I can also recommend the cheat sheet at http://sequel.rubyforge.org/rdoc/files/doc/cheat_sheet_rdoc.html
Tools: SEQUEL – RubyForge
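The tutorial itself uses Ruby and the sequel gem. For readers following this blog's Python tooling, the same idea (keeping collected tweets in SQLite) can be sketched with Python's built-in `sqlite3` module instead. The table layout and the sample rows are assumptions for illustration, not the tutorial's schema.

```python
import sqlite3

# An in-memory database keeps the example self-contained; a file path would
# make the data persistent.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, user TEXT, text TEXT)"
)

# Hypothetical sample rows standing in for tweets fetched from Twitter.
sample = [(1, "alice", "mining data is fun"), (2, "bob", "hello sqlite")]
connection.executemany("INSERT INTO tweets VALUES (?, ?, ?)", sample)
connection.commit()

# Parameterized query: find the tweets that mention "data".
rows = connection.execute(
    "SELECT user, text FROM tweets WHERE text LIKE ?", ("%data%",)
).fetchall()
print(rows)
```

Using the tweet id as the primary key means a tweet collected twice cannot be stored twice, which is convenient when a collector polls the same timeline repeatedly.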
How to Extract Only the Content from a Web Page – olussier.net
October 5, 2010
How to Extract Only the Content from a Web Page
Have you ever visited a web page and actually had to take a moment to figure out where the content was because the page was so heavily loaded with non-content stuff? With the growing number of websites, with different designs, one may wish to simply read the page’s content without having to deal with all the extra stuff (navigation, ads, social features…).
The excellent folks at Arc90 have come up with a solution: the Readability bookmarklet. This easy-to-use bookmarklet extracts the main content from a web page and displays it in a simple yet pretty way. You can even customize the style, size and margins to make your reading as enjoyable as possible. The bookmarklet uses a generic algorithm that works on most pages that actually have content. While it is not 100% accurate, they do claim a success rate over 99%.
Here’s a short video that shows how simple and effective it is:
Besides improving the reading experience, there are other great uses to this bookmarklet. First, websites do not always provide printer-friendly versions of their pages. With Readability, you get a clutter-free article ready to be printed. There even is a “Print” button. Also, if you use Evernote with the Web Clipper, you should try using Readability on a page before clipping it. You will end up clipping only the article, which is more likely what you wanted to do!
Using the Readability Algorithm in Your Applications
You can even use the power of Readability if you need to extract web pages’ content in your applications. Some nice folks have ported the algorithm to other languages: see Nirmal Patel’s Python port, Keyvan Minoukadeh’s PHP port and Immortal’s C# port.
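To give a feel for what content extraction involves, here is a deliberately simplified Python sketch. It is not Arc90's Readability algorithm and not any of the ports mentioned above; it assumes, much more crudely, that the `<div>` containing the most paragraph text is the article. The class name `ParagraphScorer` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphScorer(HTMLParser):
    """Collects <p> text per enclosing <div> so the divs can be compared."""

    def __init__(self):
        super().__init__()
        self.div_stack = []      # ids of currently open <div> elements
        self.in_paragraph = False
        self.text_by_div = {}    # div id -> list of paragraph strings

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_stack.append(dict(attrs).get("id", "anonymous"))
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_paragraph and self.div_stack and text:
            self.text_by_div.setdefault(self.div_stack[-1], []).append(text)


def extract_content(html):
    scorer = ParagraphScorer()
    scorer.feed(html)
    # The div holding the most paragraph text is assumed to be the article.
    lengths = {div: sum(map(len, parts)) for div, parts in scorer.text_by_div.items()}
    best = max(lengths, key=lengths.get)
    return " ".join(scorer.text_by_div[best])


page = """
<div id="nav"><p>Home</p><p>About</p></div>
<div id="article"><p>Readability strips clutter from pages.</p>
<p>It keeps only the main text.</p></div>
<div id="ads"><p>Buy now!</p></div>
"""
print(extract_content(page))
```

Navigation and ad blocks tend to hold short fragments while the article holds long paragraphs, which is why even this crude length score separates them on a simple page.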
via How to Extract Only the Content from a Web Page – olussier.net.
Readability – Installation Video for Firefox, Safari & Chrome from Arc90 on Vimeo.
Online Ontology Visualisation: RDFa
October 5, 2010
jOWL status update
I packaged the latest development version of jOWL into a 0.5 release, available at Google Code. jOWL is an AJAX/JavaScript extension to jQuery that I am developing. The jOWL library parses and reasons with OWL-DL documents. Supported browsers for this release are Internet Explorer 7 and Firefox 2 & 3.
This release is accompanied by several new and, in my humble opinion, impressive demos. These make use of the new functionalities that have been incorporated so far. Below are some important highlights.