There are many different types of “big data” available on the Web and Deep Web, ranging from numeric data to natural language text. Data can come in many formats, from structured to semi-structured to unstructured. Examples of structured data include columnar sales reports and survey data. RSS feeds, written in XML, are examples of semi-structured data. Unstructured natural language data includes blog posts, books, and anything written freeform in a human language.
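To make the semi-structured case concrete, here is a minimal sketch that pulls titles and links out of an RSS feed using Python's standard library. The feed content and the `rss_items` helper are made up for illustration; a real feed would be downloaded rather than hard-coded.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS document (hypothetical content, for illustration only).
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in rss_items(RSS_SAMPLE):
    print(title, link)
```

The tags give the data enough structure to address each field by name, even though the text inside each field is freeform.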
Data can be gathered from the Web by crawling and scraping webpages. Crawling consists of gathering all of the links on a page, then following those links and gathering all of the links on the pages they lead to, and so on. Scraping means extracting the content you care about from a page’s HTML. The page on which this blog post appears is an example of a webpage that can be scraped. I’ll go into detail about how to scrape a webpage in a later post, but to get an idea of the kind of data that you’ll be dealing with, right-click on this page and select “View page source” from the dropdown menu.
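The crawling loop described above can be sketched in a few lines of Python. This is only an illustration, not a production crawler: the names `LinkCollector`, `extract_links`, `crawl`, and `FAKE_SITE` are my own, and the `fetch` argument is a stand-in for whatever you use to download a page (here it reads from a small in-memory dictionary so the example runs without a network connection).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link URLs found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` is a caller-supplied function url -> html."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(fetch(url), url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site standing in for the real Web.
FAKE_SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

print(crawl("http://example.com/", lambda url: FAKE_SITE.get(url, "")))
```

The `seen` set and the `max_pages` cap are what keep the “and so on” from running forever; a real crawler would also need to respect robots.txt and rate limits.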
Unlike the Web, the Deep Web consists largely of dynamically generated webpages, which can’t be crawled because they don’t exist until they’re generated by a user query. To gather a large dataset from the Deep Web, you need to programmatically query databases.
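Programmatic querying usually starts with generating the query URLs themselves. The sketch below assumes a hypothetical search endpoint at `http://example.com/search` that accepts `q` and `page` parameters; a real database will have its own URL and parameter names, so treat these as placeholders.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; substitute the one for the database you are querying.
BASE_URL = "http://example.com/search"


def build_query_urls(terms, page_count=2):
    """Build one URL per (search term, results page) combination."""
    urls = []
    for term in terms:
        for page in range(1, page_count + 1):
            # urlencode escapes spaces and special characters in the term.
            urls.append(BASE_URL + "?" + urlencode({"q": term, "page": page}))
    return urls


for url in build_query_urls(["big data", "data mining"]):
    print(url)
```

Each URL would then be requested in turn, and each response is a page that did not exist until the query asked for it.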
Once you’ve got a dataset, you’ll need to perform some type of analysis to find meaningful patterns in the data. If, for example, you’ve gathered a few million tweets, you’ll need to hammer this dataset down into a smaller set of representations or abstractions. In this case you’ll want to use natural language processing or visualization methods to render the data useful. There are a variety of data mining techniques, most of which fall into the domain of machine learning and statistical computing.
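One of the simplest ways to hammer a pile of tweets down into something smaller is to count which words appear most often. The sketch below is a deliberately crude example: the tweets are invented, the stopword list is a tiny hand-picked sample, and `top_terms` is a name I made up. Real natural language processing would use a proper tokenizer and a much fuller stopword list.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "is", "to", "and", "of", "i", "my", "this"}


def top_terms(tweets, n=2):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", tweet.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Invented tweets, for illustration only.
TWEETS = [
    "I love this coffee",
    "Coffee is the best",
    "My coffee went cold, love it anyway",
]

print(top_terms(TWEETS))
```

Three tweets have been reduced to a couple of (word, count) pairs; the same idea scales to millions of tweets, which is where the patterns start to become meaningful.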
The final step in the data mining process is to use the data to make predictions or to create visualizations. This is sometimes accomplished during the analysis step, but at other times it is useful to create further visualizations to communicate your findings, whether to the decision makers who will be using your work to inform their brand strategy or to a general audience.