## Resources for machine learning beginners

- A series of IPython notebooks that demonstrate scikit-learn
- An online course on machine learning by Andrew Ng
- A course on statistical learning by Trevor Hastie and Rob Tibshirani
- Machine Learning in Action

## Resources for the theory behind big data computation

Books from the Foundations and Trends in Theoretical Computer Science (TCS) series:

- Data Streams / S. Muthukrishnan 2005
- Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches / Graham Cormode, Minos Garofalakis, Peter J. Haas, Chris Jermaine 2011
- Algorithms and Data Structures for External Memory / Jeffrey Scott Vitter 2008

A textbook used in a course given at Stanford:

## Tensors

- See the Tensors page for a reading list.

## Linear systems and lossy compression

- A course on dynamic linear systems / Stephen Boyd, Stanford
- Time Series Analysis and Its Applications (with R examples) / Robert H. Shumway, David S. Stoffer
- ANOVA, regression, and logistic regression
- Sigma-Delta Modulation
- DPCM: Lossy Predictive Coding

## NLP

- Demonstration of Socher's recursive NN for sentiment analysis
- etcML - machine learning for classifying tweets.
- Noah Smith's work on NLP for Twitter (I think this is where we got the Twitter parser, though maybe not the latest version). Look under "Twitter Word Clusters" to see an interesting clustering of words into some 1000 clusters.
- SemEval: a yearly competition on semantic analysis. This year there is a track for sentiment analysis of tweets.

## Teaching

## Student Diaries

## Mouse Brain Atlas

## Hadoop Project

## Statistical models for network communication

## Automatic Cameraman project (2014)

## Projects for master's students

- Collaborative Tweet Filtering
- Analysis of energy feeds
- CAIDA internet analysis
- U.S. Census (currently down, possibly because of the government shutdown; I have a recent snapshot of the data on Gordon)
- Weather history
- Data.gov (currently down)
- World Bank data
- California Data
- AWS Public Dataset from Amazon
- Pointers from K. Claffy in CAIDA: [1] and [2] both provide public BGP data at a variety of granularities including raw updates.
- The Yelp Dataset Challenge: a dataset from the Phoenix area.

### Tools:

- IPython
- Pandas
- Google Refine
- Data Wrangler

## IPython notebook collections

