python pip and installing a python egg is not that easy – YouTube

June 29, 2012

via python pip and installing a python egg is not that easy – YouTube.


twitter 1.8.0 : Python Package Index

June 29, 2012

twitter 1.8.0

An API and command-line toolset for Twitter (twitter.com)


Python Twitter Tools

The Minimalist Twitter API for Python is a Python API for Twitter, everyone’s favorite Web 2.0 Facebook-style status updater for people on the go.

Also included are a twitter command-line tool for getting your friends’ tweets and setting your own tweet from the safety and security of your favorite shell, and an IRC bot that can announce Twitter updates to an IRC channel.

For more information, after installing the twitter package:

import the twitter package and run help() on it

run twitter -h for command-line tool help

twitter – The Command-Line Tool

The command-line tool lets you do some awesome things:

view your tweets, recent replies, and tweets in lists

view the public timeline

follow and unfollow (leave) friends

various output formats for tweet information

The bottom line: type twitter, receive tweets.

via twitter 1.8.0 : Python Package Index.


How to Install Python Easy_install for use with Siri Server – YouTube

June 28, 2012

Installing Python from the basics.

 

How to Install Python Easy_install for use with Siri Server – YouTube.


Hadoop – Wikipedia, la enciclopedia libre

June 28, 2012

Platform.

SQL or NoSQL technologies such as Hadoop or Cassandra. We do use some less-than-conventional storage technologies such as CouchDB and Redis.

A strong recommendation is that you master the fundamentals and prove out your thesis in a slightly less complex environment first before migrating to an inherently more complex distributed system, and then be ready to make major adjustments to your algorithms to make them performant once data access is no longer local. A good option to investigate if you want to go this route is Dumbo. Stay tuned to this book’s Twitter account (@SocialWebMining) for extended examples that involve Dumbo.

MySQL, NoSQL, Hadoop or Cassandra, CouchDB and Redis

 

NoSQL

From Wikipedia, the free encyclopedia

In computing, NoSQL is a class of database management system identified by its non-adherence to the widely used relational database management system (RDBMS) model:

  • It does not use SQL as its query language
NoSQL database systems rose alongside major internet companies, such as Google, Amazon, and Facebook, which had significantly different challenges in dealing with huge quantities of data that the traditional RDBMS solutions could not cope with. NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema. Data is partitioned among different machines (for performance reasons and size limitations), so JOIN operations are not usable and ACID guarantees are not given.
  • It may not give full ACID guarantees
Usually only eventual consistency is guaranteed or transactions limited to single data items. This means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system.
  • It has a distributed, fault-tolerant architecture
Several NoSQL systems employ a distributed architecture, with the data held in a redundant manner on several servers. In this way, the system can easily scale out by adding more servers, and failure of a server can be tolerated. This type of database typically scales horizontally and is used for managing large amounts of data when performance and real-time behavior are more important than consistency (such as indexing a large number of documents, serving pages on high-traffic websites, and delivering streaming media).

NoSQL database systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run time flexibility compared to full SQL systems is compensated by significant gains in scalability and performance for certain data models.

In short, NoSQL database management systems are useful when working with a huge quantity of data whose nature does not require a relational model for the data structure. The data could be structured, but that is of minimal importance; what really matters is the ability to store and retrieve great quantities of data, not the relationships between the elements. Examples include storing millions of key-value pairs in one or a few associative arrays, or storing millions of data records. This is particularly useful for statistical or real-time analyses of a growing list of elements (such as Twitter posts or the Internet server logs from a big group of users).
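As a minimal illustration of the key-value model described above, the sketch below keeps records in a plain Python dictionary (an associative array) and supports only the two operations such stores are optimized for: append and retrieve. The class name KeyValueStore and its methods are hypothetical, chosen for illustration; real NoSQL systems add persistence, partitioning and replication on top of this idea.

```python
from collections import defaultdict


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical; for illustration only)."""

    def __init__(self):
        # Each key maps to a list so that new records can be appended.
        self._data = defaultdict(list)

    def append(self, key, value):
        """Add one record under the given key."""
        self._data[key].append(value)

    def get(self, key):
        """Return all records stored under the key (empty list if none)."""
        return list(self._data.get(key, []))


store = KeyValueStore()
store.append("user:alice", "first post")
store.append("user:alice", "second post")
store.append("user:bob", "hello")

print(store.get("user:alice"))
print(store.get("user:carol"))
```

Note that nothing in the store relates one record to another: lookups go through the key only, which is the trade-off the excerpt describes.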

 

Hadoop

Apache Hadoop

Developer: Apache Software Foundation
http://hadoop.apache.org/

General information

Latest stable release: 1.0.0 (December 27, 2011; 5 months ago)
Type: Distributed file system
Written in: Java
Operating system: Cross-platform
Platform: Java
License: Apache License 2.0
Current status: Active
Languages: English


Apache Hadoop is a software framework that supports distributed applications under a free license.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's papers on MapReduce and the Google File System (GFS).

Hadoop is a top-level Apache project that is being built and used by a global community of contributors,[2] using the Java programming language. Yahoo! has been the largest contributor to the project,[3] and uses Hadoop extensively in its business.[4]
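The MapReduce model that inspired Hadoop can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps using a word count; it does not use Hadoop or Dumbo, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["Hadoop stores data", "Hadoop processes data"]

# In a real cluster each document could be mapped on a different node.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

Because each call to map_phase depends only on its own document, the map step can be spread across many machines, which is the property a distributed framework exploits.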

via Hadoop – Wikipedia, la enciclopedia libre.

 

CASSANDRA

Welcome to Apache Cassandra

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

http://cassandra.apache.org/

 

Redis

Redis is an in-memory database engine based on hash-table (key, value) storage, which can optionally be used as a durable or persistent database. It is written in ANSI C by Salvatore Sanfilippo, who is sponsored by VMware,[1][2] and is released under the BSD license, so it is considered open-source software.

 

 

COUCHDB

Apache CouchDB, commonly referred to as CouchDB, is an open source database that focuses on ease of use and on being «a database that completely embraces the web».[1] It is a NoSQL database that uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.[1] One of its distinguishing features is easy replication. CouchDB was first released in 2005 and later became an Apache project in 2008.

CouchDB is used in certain applications for Android like «SpreadLyrics», in applications for Facebook like «Will you Kissme» or «Birthday Greeting Cards», and in websites like «Friendpaste».

Meebo, for their social platform (web and applications)

http://en.wikipedia.org/wiki/CouchDB


Language processing : Python Package

June 28, 2012

collective.classification 0.1b2

Content classification/clustering through language processing


Introduction

collective.classification aims to provide a set of tools for automatic document classification. Currently it makes use of the Natural Language Toolkit and features a trainable document classifier based on Part Of Speech (POS) tagging, heavily influenced by topia.termextract. This product is mostly intended to be used for experimentation and development. Currently English and Dutch are supported.

What is this all about?

It’s mostly about having fun! The package is in a very early experimental stage and eagerly awaits contributions. You will get a good understanding of what works and what does not by looking at the tests. You might also be able to do some useful things with it:

1) Term extraction can be performed to provide quick insight into what a document is about.
2) On a large site with a lot of content and tags (or subjects in the Plone lingo) it might be difficult to assign tags to new content. In this case, a trained classifier could provide useful suggestions to an editor responsible for tagging content.
3) Similar documents can be found based on term similarity.
4) Clustering can help you organize unclassified content into groups.

How does it work?

At the moment the following types of utilities exist:

POS taggers, utilities for classifying words in a document as Parts Of Speech. Two are provided at the moment, a Penn TreeBank tagger and a trigram tagger. Both can be trained on a language other than English, which is what we do here.

Term extractors, utilities responsible for extracting the important terms from some document. The extractor we use here assumes that only nouns matter in a document and uses a POS tagger to find the ones used most often. For details please look at the code and the tests.

Content classifiers, utilities that can tag content in predefined categories. Here, a naive Bayes classifier is used. Basically, the classifier looks at already tagged content, performs term extraction and trains itself using the terms and tags as an input. Then, for new content, the classifier will provide suggestions for tags according to the extracted terms of the content.

Utilities that find similar content based on the extracted terms.

Clusterers, utilities that without prior knowledge of content classification can group content into groups according to feature similarity. At the moment NLTK’s k-means clusterer is used.
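The content classifier described above can be illustrated with a small naive Bayes sketch. This is not the collective.classification or NLTK implementation: it is a standard-library approximation with add-one smoothing, and the class name NaiveBayesTagger, its methods and the sample terms are all hypothetical. It assumes term extraction has already produced a list of terms per document.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTagger:
    """Toy multinomial naive Bayes over extracted terms (illustrative only)."""

    def __init__(self):
        self.tag_counts = Counter()              # documents seen per tag
        self.term_counts = defaultdict(Counter)  # term frequencies per tag
        self.vocabulary = set()

    def train(self, terms, tag):
        """Learn from one already-tagged document, given as a list of terms."""
        self.tag_counts[tag] += 1
        self.term_counts[tag].update(terms)
        self.vocabulary.update(terms)

    def suggest(self, terms):
        """Return the most probable tag for a new document's terms."""
        total_docs = sum(self.tag_counts.values())
        best_tag, best_score = None, float("-inf")
        for tag, doc_count in self.tag_counts.items():
            # Log prior plus log likelihoods, with add-one (Laplace) smoothing.
            score = math.log(doc_count / total_docs)
            denominator = sum(self.term_counts[tag].values()) + len(self.vocabulary)
            for term in terms:
                score += math.log((self.term_counts[tag][term] + 1) / denominator)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


tagger = NaiveBayesTagger()
tagger.train(["python", "package", "install"], "programming")
tagger.train(["python", "code", "library"], "programming")
tagger.train(["football", "match", "goal"], "sports")
tagger.train(["team", "goal", "league"], "sports")

print(tagger.suggest(["python", "library"]))
print(tagger.suggest(["goal", "team"]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a term never seen under a tag from zeroing out that tag's score.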

via collective.classification 0.1b2 : Python Package Index.


Download — NetworkX 1.6 documentation

June 28, 2012

Download

Source and binary releases

http://cheeseshop.python.org/pypi/networkx/

http://networkx.lanl.gov/download/networkx/

Mercurial source code repository

Anonymous

hg clone http://networkx.lanl.gov/hg/networkx

Authenticated

hg clone https://networkx.lanl.gov/hg/networkx

Documentation

PDF

http://networkx.lanl.gov/networkx_reference.pdf

http://networkx.lanl.gov/networkx_tutorial.pdf

HTML in zip file

http://networkx.lanl.gov/networkx-documentation.zip

via Download — NetworkX 1.6 documentation.


A beginners tutorial on Social Network Analysis – (Part 1) » The Super User

June 28, 2012

A beginners tutorial on Social Network Analysis – (Part 1)

by NIHARJYOTI SARANGI posted on MARCH 4, 2012

Social Network Analysis (SNA) refers to the methods used for analyzing social networks, or interconnections among individuals. The individuals are taken as “nodes” and are connected to each other based on their interconnections, which may be of various types (friendship, co-authorship, kinship, sexual relations, financial exchange, common interest, etc.). SNA uses various techniques from graph theory, game theory and several other fields to study, explain and predict the network.
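To make the node-and-connection idea concrete, here is a minimal sketch that uses only the Python standard library rather than the NetworkX API. The names and friendship ties are invented for illustration, and degree centrality is just one simple measure among the many that SNA uses.

```python
from collections import defaultdict

# Hypothetical friendship ties; each pair is one undirected edge.
friendships = [
    ("Ana", "Luis"),
    ("Ana", "Marta"),
    ("Ana", "Pedro"),
    ("Luis", "Marta"),
]

# Adjacency structure: each node maps to the set of its neighbours.
network = defaultdict(set)
for person_a, person_b in friendships:
    network[person_a].add(person_b)
    network[person_b].add(person_a)


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    others = max(len(graph) - 1, 1)  # avoid dividing by zero for a single node
    return {node: len(neighbours) / others for node, neighbours in graph.items()}


centrality = degree_centrality(network)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```

A graph library such as NetworkX offers the same kind of structure and measures ready-made, along with plotting support through Matplotlib.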

Tools Used for this tutorial: networkX

Programming Language: Python

Getting the tools:

NetworkX is a Python-based package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. If you are on a Linux distribution like Ubuntu, chances are it will be in your package manager. Otherwise, you can download and install the binary or even compile it from source.

Matplotlib is a set of plotting tools for Python. You can download and install it from a package manager of your choice, or install it from source. It takes care of advanced 2D plotting for Python. We will use it to plot our network.

via A beginners tutorial on Social Network Analysis – (Part 1) » The Super User.

 

Tools: Python / NetworkX / Matplotlib


Data Extraction 1

June 28, 2012

On the topic of web data extraction, I will write several posts to keep track of the information about the activities carried out.
  • Content management tools: Bookmarklet / Blog: Visualkm / YouTube: luisyepez13
  • Tools for extraction and analysis: Python / RubyGems / NetworkX / and others


Datamining Twitter: Part 1 Creating a Database – Twitter Research.

Datamining Twitter

posted May 13, 2010 1:42 PM by Thomas Plotkowiak   [ updated Mar 2, 2012 7:11 AM ]

In this short tutorial you will learn how to collect tweets using Ruby and only two gems.
It is part of a series where I will show you what fantastic things you can do with Twitter these days, if you love mining data :)

The first gem I would like to introduce is sequel. It is a lightweight ORM layer that allows you to interface with a number of databases in Ruby without pain. It works great with MySQL or SQLite. We will use SQLite today.

I have been using MySQL in combination with Rails and the nice ActiveRecord ORM, but for most tasks it is a bit too bulky. The problem with SQLite, though, can be that it does not provide multitasking capabilities. But we will bump into that later…

To get you started, visit http://sequel.rubyforge.org/ and have a look at the examples. They are pretty straightforward. I can also recommend the cheat sheet at: http://sequel.rubyforge.org/rdoc/files/doc/cheat_sheet_rdoc.html
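The tutorial itself uses Ruby and the sequel gem. To stay with Python, the main language of these notes, the sketch below shows the same idea with Python's built-in sqlite3 module instead: create a table and store collected tweets in SQLite. The table layout and the sample records are assumptions made for illustration, and no tweets are actually fetched.

```python
import sqlite3

# Hypothetical sample records standing in for tweets fetched from Twitter.
sample_tweets = [
    (1, "alice", "Mining data is fun"),
    (2, "bob", "Trying out SQLite"),
    (2, "bob", "Trying out SQLite"),  # duplicate, as repeated polling may return
]

# An in-memory database keeps the sketch self-contained; pass a file path
# instead to persist the data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
)

# INSERT OR IGNORE skips rows whose primary key is already stored.
connection.executemany(
    "INSERT OR IGNORE INTO tweets (id, author, body) VALUES (?, ?, ?)",
    sample_tweets,
)
connection.commit()

stored = connection.execute(
    "SELECT id, author, body FROM tweets ORDER BY id"
).fetchall()
print(stored)
```

Using the tweet id as the primary key means that collecting the same tweet twice does not create a second row.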

Tools: Sequel – RubyForge

How to Extract Only the Content from a Web Page – olussier.net

October 5, 2010

How to Extract Only the Content from a Web Page

Have you ever visited a web page and actually had to take a moment to figure out where the content was because the page was so heavily loaded with non-content stuff? With the growing number of websites, with different designs, one may wish to simply read the page’s content without having to deal with all the extra stuff (navigation, ads, social features…).

The excellent folks at Arc90 have come up with a solution: the Readability bookmarklet. This easy-to-use bookmarklet extracts the main content from a web page and displays it in a simple yet pretty way. You can even customize the style, size and margins to make your reading as enjoyable as possible. The bookmarklet uses a generic algorithm that works on most pages that actually have content. While it is not 100% accurate, they do claim a success rate over 99%. Try it yourself on this page by clicking here!

Here’s a short video that shows how simple and effective it is:

Besides improving the reading experience, there are other great uses for this bookmarklet. First, websites do not always provide printer-friendly versions of their pages. With Readability, you get a clutter-free article ready to be printed. There is even a “Print” button. Also, if you use Evernote with the Web Clipper, you should try using Readability on a page before clipping it. You will end up clipping only the article, which is more likely what you wanted to do!

Using the Readability Algorithm in Your Applications

You can even use the power of Readability if you need to extract web pages’ content in your applications. Some nice folks have ported the algorithm to other languages: see Nirmal Patel’s Python port, Keyvan Minoukadeh’s PHP port and Immortal’s C# port.
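To give a feel for what content extraction involves, here is a much simplified heuristic written with Python's standard html.parser module. It is not Arc90's Readability algorithm or any of the ports mentioned above: it only assumes that the container holding the most paragraph text is the article. The names ContentScorer and extract_main_content, and the sample page, are hypothetical.

```python
from html.parser import HTMLParser

CONTAINERS = {"div", "article", "section", "td", "body"}


class ContentScorer(HTMLParser):
    """Credit the text of each <p> to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.stack = []        # indexes of currently open containers
        self.blocks = []       # one list of paragraph texts per container
        self.paragraph = None  # text pieces of the <p> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self.paragraph = []

    def handle_endtag(self, tag):
        if tag == "p" and self.paragraph is not None:
            text = " ".join("".join(self.paragraph).split())
            if text and self.stack:
                self.blocks[self.stack[-1]].append(text)
            self.paragraph = None
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.paragraph is not None:
            self.paragraph.append(data)


def extract_main_content(html):
    """Return the paragraphs of the container holding the most paragraph text."""
    scorer = ContentScorer()
    scorer.feed(html)
    best = max(scorer.blocks, key=lambda ps: sum(len(p) for p in ps), default=[])
    return "\n\n".join(best)


page = """
<html><body>
<div id="nav"><p>Home</p><p>About</p></div>
<div id="post">
  <p>Readability extracts the main content from a web page.</p>
  <p>It removes navigation, ads and other clutter around the article.</p>
</div>
<div id="footer"><p>Copyright notice</p></div>
</body></html>
"""
print(extract_main_content(page))
```

Real pages are far messier than this sample, so a production extractor needs more signals (link density, class names, nesting), which is what the full algorithm and its ports provide.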

vía How to Extract Only the Content from a Web Page – olussier.net.


Readability – Installation Video for Firefox, Safari & Chrome from Arc90 on Vimeo.


Online Ontology Visualisation: RDFa

October 5, 2010

jOWL status update

I packaged the latest development version of jOWL into a 0.5 release, available at Google Code. jOWL is an AJAX/javascript extension to jQuery that I am developing. The jOWL library parses and reasons with OWL-DL documents. Supported browsers for this release are Internet Explorer 7 and Firefox 2 & 3.

This release is accompanied by several new and, in my humble opinion, impressive demos. These make use of the new functionalities that have been incorporated so far. Below are some important highlights.

via Online Ontology Visualisation: RDFa.