
大数据挖掘外文翻译文献

文献信息:

文献标题:A Study of Data Mining with Big Data(大数据挖掘研究)

国外作者:VH Shastri,V Sreeprada

文献出处:《International Journal of Emerging Trends and Technology in Computer Science》,2016,38(2):99-103

字数统计:英文2291单词,12196字符;中文3868汉字

外文文献:

A Study of Data Mining with Big Data

Abstract Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to describe data sets whose size is larger than that of a typical database. Big Data introduces unique computational and statistical challenges, and it is at present expanding in most domains of engineering and science. Because of the volume, variety and velocity of such data, data mining helps to extract useful information from these huge data sets. This article presents the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and

unstructured.

1. Introduction

Big Data refers to the enormous amounts of structured and unstructured data that overflow an organization. If this data is properly used, it can lead to meaningful information. Big Data includes large quantities of data which require a lot of processing in real time. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.

Big Data has the 3 V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second; it describes data at rest and is also known as the scale characteristic. Velocity is the speed at which the data is generated and must be handled; the data generated from social media is an example of high-speed data. Variety means that different types of data can be involved, such as audio, video or documents; the data can be numerals, images, time series, arrays and so on.

Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.

Big Data is expanding in all domains of science and engineering, including the physical, biological and biomedical sciences.

2. BIG DATA with DATA MINING

Generally, Big Data refers to a collection of large volumes of data generated from various sources such as the Internet, social media, business organizations and sensors. We can extract useful information from it with the help of data mining, which is a technique for discovering patterns, as well as descriptive and understandable models, from large-scale data.

Volume is the size of the data, which runs to terabytes and petabytes and beyond. This scale and growth in size make it difficult to store and analyse the data using traditional tools, and Big Data techniques have to mine large amounts of data within a predefined period of time. Traditional database systems were designed to address small amounts of data that were structured and consistent, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.

Big Data mining refers to the activity of going through big data sets to look for relevant information. To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted in case of node failure. It runs MapReduce for distributed data processing and works with both structured and unstructured data.
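To make the MapReduce model that Hadoop implements concrete, the following is a minimal sketch in plain Python that simulates the map, shuffle and reduce phases of a word count in a single process. It only illustrates the programming model; it is not Hadoop code, and the function names are ours.

from itertools import groupby
from operator import itemgetter

# Map phase: each record (a line of text) is turned into (key, value) pairs.
def map_words(line):
    for word in line.lower().split():
        yield (word, 1)

# Reduce phase: all values that share a key are combined into one result.
def reduce_counts(word, counts):
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle/sort: group intermediate pairs by key, as Hadoop does between phases.
    intermediate = sorted(
        (pair for line in lines for pair in map_words(line)),
        key=itemgetter(0),
    )
    return [
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(intermediate, key=itemgetter(0))
    ]

if __name__ == "__main__":
    sample = ["big data needs data mining", "data mining finds patterns"]
    print(mapreduce(sample))
    # [('big', 1), ('data', 3), ('finds', 1), ('mining', 2), ('needs', 1), ('patterns', 1)]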

3. BIG DATA characteristics - HACE THEOREM

We have a large volume of heterogeneous data, there exist complex relationships among the data, and we need to discover useful information from this voluminous data.

Let us imagine a scenario in which blind men are asked to draw an elephant. From the information each blind man collects, one may take the trunk for a wall, a leg for a tree, the body for a wall and the tail for a rope. The blind men can exchange information with each other.

Figure 1: Blind men and the giant elephant

Some of the characteristics of Big Data include:

1. Huge data with heterogeneous and diverse sources: One of the fundamental characteristics of Big Data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world a single human being is represented by name, age, gender, family history and so on, while X-ray and CT scan images and videos are also used. Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.

2. Autonomous sources with distributed and decentralized control: the sources are autonomous, i.e., automatically generated; each generates information without any centralized control. We can compare this with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

3. Complex and evolving relationships: as the size of the data becomes very large, the relationships that exist within it also become large and complex. In the early stages, when the data is small, there is little complexity in the relationships among the data; data generated from social media and other sources, however, have complex relationships.

4. TOOLS: OPEN SOURCE REVOLUTION

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining there are many open source initiatives. The most popular of them are:

Apache Mahout: Scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: open source programming language and software environment designed

for statistical computing and visualization. R was designed by Ross Ihaka and

Robert Gentleman at the University of Auckland, New Zealand beginning in

1993 and is used for statistical analysis of very large data sets.

MOA: Stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.

SAMOA: It is a new upcoming software project for distributed stream

mining that will combine S4 and Storm with MOA.

Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets and, via parallel learning, can exceed the throughput of any single machine's network interface when doing linear learning.

5. DATA MINING for BIG DATA

Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining contains several algorithms which fall into four categories:

1. Association Rule
2. Clustering
3. Classification
4. Regression

Association is used to search for relationships between variables, for example in searching for frequently visited items; in short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure to a known structure. Regression finds a function to model the data.

The different data mining algorithms are:

Category        | Algorithm
Association     | Apriori, FP growth
Clustering      | K-Means, Expectation Maximization
Classification  | Decision trees, SVM
Regression      | Multivariate linear regression

Table 1. Classification of Algorithms
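As a small illustration of the clustering row of Table 1, the sketch below runs K-Means on synthetic two-dimensional data, assuming NumPy and scikit-learn are available; the data and parameter values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of 2-D points around different centres.
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2)),
])

# K-Means partitions the points into k clusters by minimising
# the within-cluster sum of squared distances to the centroids.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.cluster_centers_)   # approximately [0, 0] and [5, 5]
print(model.labels_[:5])        # cluster index assigned to the first points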

Data mining algorithms can be converted into MapReduce algorithms that run on a parallel computing basis.
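As a rough illustration of such a conversion, the sketch below recasts one K-Means iteration in map/reduce form: each mapper emits partial coordinate sums per cluster for its partition of the data, and a reducer merges them into new centroids. It is a single-process Python simulation with made-up data, not Hadoop code.

from collections import defaultdict

def map_partition(points, centroids):
    """Map: assign each point to its nearest centroid and emit partial sums."""
    partial = defaultdict(lambda: [0.0, 0.0, 0])   # cluster -> [sum_x, sum_y, count]
    for x, y in points:
        cluster = min(
            range(len(centroids)),
            key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2,
        )
        s = partial[cluster]
        s[0] += x
        s[1] += y
        s[2] += 1
    return partial

def reduce_centroids(partials):
    """Reduce: merge partial sums from all mappers and compute new centroids."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for partial in partials:
        for cluster, (sx, sy, n) in partial.items():
            totals[cluster][0] += sx
            totals[cluster][1] += sy
            totals[cluster][2] += n
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in totals.items()}

centroids = [(0.0, 0.0), (5.0, 5.0)]
partitions = [[(0.1, 0.2), (4.9, 5.1)], [(0.0, -0.1), (5.2, 4.8)]]
partials = [map_partition(p, centroids) for p in partitions]   # mappers would run in parallel
print(reduce_centroids(partials))   # {0: (0.05, 0.05), 1: (5.05, 4.95)}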

Big Data | Data Mining
It is everything in the world now. | It is the old Big Data.
Size of the data is larger. | Size of the data is smaller.
Involves storage and processing of large data sets. | Interesting patterns can be found.
Big Data is the term for large data sets. | Data mining refers to the activity of going through big data sets to look for relevant information.
Big Data is the asset. | Data mining is the handler which provides beneficial results.
"Big Data" varies depending on the capabilities of the organization managing the data set, and on the capabilities of the applications that are traditionally used to process and analyse the data. | Data mining refers to an operation that involves a relatively sophisticated search.

Table 2. Differences between Data Mining and Big Data

6. CHALLENGES in BIG DATA

Meeting the challenges posed by Big Data is difficult. The volume is increasing every day. The velocity is increased by Internet-connected devices. The variety is also expanding, while organizations' capability to capture and process the data is limited.

The following are the challenges faced when handling Big Data:

1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization

The challenges of Big Data mining can be divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes:

1. Information sharing and Data Privacy.
2. Domain and Application Knowledge.

The third tier includes:

3. Local learning and model fusion for multiple information sources.
4. Mining from sparse, uncertain and incomplete data.
5. Mining complex and dynamic data.

Figure 2: Phases of Big Data Challenges

Generally, mining data from different data sources is tedious because the data is so large. Big Data is stored at different places, so collecting that data is a tedious task, and applying basic data mining algorithms to it becomes an obstacle. Next, we need to consider the privacy of the data. The third issue concerns the mining algorithms: when we apply data mining algorithms to these subsets of data, the results may not be very accurate.

7. Forecast of the future

There are some challenges that researchers and practitioners will have to deal with during the coming years:

Analytics Architecture: It is not yet clear how the optimal architecture of an analytics system should be designed to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer, and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general and extensible, allows ad hoc queries, requires minimal maintenance, and is debuggable.
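A minimal conceptual sketch of the Lambda architecture's query path is given below, assuming that a batch view (recomputed over all historic data) and a speed-layer view (covering only the data that arrived since the last batch run) are merged at query time. The names and figures are illustrative, not an actual framework API.

# Batch view: e.g. page-view counts recomputed nightly over all historic data (Hadoop).
batch_view = {"page_a": 10_000, "page_b": 7_500}
# Speed-layer view: increments accumulated since the last batch run (e.g. Storm).
realtime_view = {"page_a": 42, "page_c": 3}

def query(page: str) -> int:
    """Serving layer: answer = batch result + real-time delta."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))   # 10042: historic count plus recent increments
print(query("page_c"))   # 3: only seen since the last batch recomputation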

Statistical significance: It is important to achieve significant statistical

results, and not be fooled by randomness. As Efron explains in his book about

Large Scale Inference, it is easy to go wrong with huge data sets and thousands

of questions to answer at once.
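The point can be illustrated numerically. The sketch below, which assumes NumPy and SciPy are installed, runs thousands of two-sample tests on pure noise: even though there is no real effect anywhere, roughly 5% of the "questions" come out significant at p < 0.05, which is why corrections such as Bonferroni's are needed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_questions = 10_000
false_positives = 0
for _ in range(n_questions):
    a = rng.normal(size=30)          # group A: noise only
    b = rng.normal(size=30)          # group B: noise only, same distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives)               # around 500 spurious "discoveries"
print(0.05 / n_questions)            # Bonferroni-corrected threshold: 5e-06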

Distributed mining: Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.

Time evolving data: Data may be evolving over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.

Compression: When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we don't lose anything, and sampling, where we choose the data that is more representative. Using compression, we may take more time and less space, so we can consider it as a transformation from time to space. Using sampling, we are losing information, but the gains in space may be in orders of magnitude. For example, Feldman et al. use core sets to reduce the complexity of Big Data problems. Core sets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
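As a minimal sketch of the sampling approach, the snippet below implements reservoir sampling, which keeps a fixed-size uniform sample of a stream in one pass so the full data never has to be stored. It is a generic illustration and not the core-set construction of Feldman et al.

import random

def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))   # 5 representative points from a million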

Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, as for example the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".

Hidden Big Data: Large quantities of useful data are getting lost because new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

8. CONCLUSION

The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming a new area for scientific data research and for business applications.

Data mining techniques can be applied to Big Data to acquire useful information from large data sets. Used together, they help to obtain a useful picture of the data.

Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations.

中文译文:

大数据挖掘研究

摘要 数据已经成为各个经济、行业、组织、企业、职能和个人的重要组成部分。大数据是用于识别大型数据集的一个术语,通常其大小比典型的数据库要大。大数据引入了独特的计算和统计挑战。在工程和科学的大部分领域,大数据目前都有延伸。由于大数据的数量之多、速度之快、种类之繁,所以可以使用数据挖掘,有助于从庞大的数据集中提取有用的数据。本文介绍了HACE定理,它描述了大数据革命的特征,并从数据挖掘角度提出了一个大数据处理模型。

关键词:大数据,数据挖掘,HACE定理,结构化和非结构化。

一、简介

大数据指的是大量的结构化数据和非结构化数据,这些数据遍布了整个组织。如果这些数据被正确使用,将会产生有意义的信息。大数据包括大量的数据,需要大量的实时处理。它提供了两个空间,一个用于发现新价值,并从隐藏的价值中了解深入的知识,另一个用于有效管理数据。数据库是一个与数据相关的逻辑上有组织的集合,可以方便地管理、更新和访问。数据挖掘是从数据库或其他存储库中存储的大量数据中发现有趣的知识(如关联、模式、更改、异常和重要结构)的过程。

大数据包括3V的特征。它们是大量(volume)、高速(velocity)和多样(variety)。大量意味着每秒生成的数据量。数据是静态的,它的规模特征也是众所周知的。高速是数据生成的速度。大数据应该有高速数据,社交媒体产生的数据就是一个例子。多样意味着可以采取不同类型的数据,例如音频、视频或文档。它可以是数字、图像、时间序列、数组等。

数据挖掘从不同的角度分析数据,并将其汇总为有用的信息,可用于商业解决方案和预测未来趋势。数据挖掘(DM)也称为数据库中的知识发现(KDD),或者知识发现和数据挖掘,是为关联规则等模式自动搜索大量数据的过程。它应用了统计学、信息检索、机器学习和模式识别等方面的许多计算技术。数据挖掘仅在短时间内从数据库中提取所需的模式。根据要挖掘的模式类型,可以将数据挖掘任务分为汇总、分类、聚类、关联和趋势分析。

在包括物理、生物和生物医学等科学和工程领域在内的所有领域,大数据都有延伸。

二、大数据挖掘

一般而言,大数据是指大量数据的集合,这些数据来自互联网、社交媒体、商业组织、传感器等各种来源。我们可以借助数据挖掘技术来提取一些有用的信息。这是一种从大量数据中发现模式以及描述性、可理解的模型的技术。

容量是指数据的规模,可达TB、PB级甚至更大。规模和容量的增加使得传统的工具难以存储和分析。在预定的时间段内,应该使用大数据技术挖掘大量数据。传统的数据库系统旨在处理少量的结构化和一致性的数据,而大数据包括各种数据,如地理空间数据、音频、视频、非结构化文本等。

大数据挖掘是指通过大数据集来查找相关信息的活动。为了快速处理不同来源的大量数据,使用了Hadoop。Hadoop是一个免费的基于Java的编程框架,支持在分布式计算环境中处理大型数据集。其分布式文件系统支持节点之间的快速数据传输,并允许系统在发生节点故障时不中断运行。它使用MapReduce进行分布式数据处理,适用于结构化和非结构化数据。

三、大数据特征——HACE定理

我们有大量的异构数据。数据之间存在复杂的关系。我们需要从这些庞大的数据中发现有用的信息。

让我们想象一个场景:几位盲人被要求画一头大象。根据各自收集到的信息,盲人们可能会把象鼻当作墙、把腿当作树、把身体当作墙、把尾巴当作绳子。盲人们可以相互交换信息。

图1:盲人和大象

其中的一些特征包括:

1.具有异构及不同来源的海量数据:大数据的基本特征之一是由异构和多样的维度表示的大量数据。例如,在生物医学世界中,个人用姓名、年龄、性别、家族病史等来表示,同时还会使用X射线和CT扫描的图像和视频。异构是指同一个体的不同表现形式,多样是指用各种特征来表示单一信息。

2.具有分布式和非集中式控制的自治:来源是自治的,即自动生成;它在没有任何集中控制的情况下生成信息。我们可以将它与万维网(WWW)进行比较,其中每台服务器都提供一定数量的信息,而不依赖于其他服务器。

3.复杂且不断演化的关系:随着数据量变得无限大,存在的关系也很大。在早期阶段,当数据很小时,数据之间的关系并不复杂。社交媒体和其他来源生成的数据具有复杂的关系。

四、工具:开放源码革命

Facebook、雅虎、Twitter、LinkedIn等大公司受益于开源项目,并为之做出贡献。在大数据挖掘中,有许多开源计划。其中最受欢迎的是:

Apache Mahout:主要基于Hadoop的可扩展机器学习和数据挖掘开源软件。它实现了广泛的机器学习和数据挖掘算法:聚类、分类、协同过滤和频繁模式挖掘。

R:为统计计算和可视化设计的开源编程语言和软件环境。R是由在新西兰奥克兰大学的Ross Ihaka和Robert Gentleman在1993年开始设计的,用于统计分析超大型数据集。

MOA:流数据挖掘开源软件,可以实时进行数据挖掘。它实现了分类、回归、聚类、频繁项集挖掘和频繁图挖掘等算法。它始于新西兰怀卡托大学机器学习小组的一个项目,该小组以WEKA软件著称。streams框架为使用简单的基于XML的定义来定义和运行流处理过程提供了环境,并能够使用MOA、Android和Storm。

SAMOA:这是一个新的即将推出的分布式流挖掘软件项目,它将S4和Storm与MOA结合在一起。

Vowpal Wabbit:始于雅虎研究院(Yahoo! Research)并在微软研究院继续进行的开源项目,旨在设计一种快速、可扩展且实用的学习算法。VW能够从海量特征(terafeature)数据集中学习。在进行线性学习时,通过并行学习,它的吞吐量可以超过任何单机网络接口。

五、大数据的数据挖掘

数据挖掘是通过分析不同来源的数据从而发现有用的信息的过程。数据挖掘包含多种算法,分为4类。他们是:

1.关联规则

2.聚类

3.分类

4.回归

关联用于搜索变量之间的关系。它用于搜索经常访问的项目。总而言之,它建立了对象之间的关系。聚类发现数据中的组和结构。分类处理将未知结构关联到已知结构。回归找到一个函数来模拟数据。

不同的数据挖掘算法有:

类别   | 算法
关联   | Apriori、FP growth
聚类   | K-Means、期望最大化
分类   | 决策树、SVM
回归   | 多元线性回归

表1. 算法的分类

数据挖掘算法可以转化为基于并行计算的MapReduce算法。

大数据 | 数据挖掘
现在它是世界上的一切。 | 它是旧的大数据。
数据的规模较大。 | 数据的规模较小。
涉及大型数据集的存储和处理。 | 可以从中发现有趣的模式。
大数据是用于大型数据集的术语。 | 数据挖掘是指通过大数据集寻找相关信息的活动。
大数据是资产。 | 数据挖掘是提供有益结果的处理程序。
"大数据"取决于管理数据集的组织的能力,以及传统上用于处理和分析数据的应用程序的功能。 | 数据挖掘指的是涉及相对复杂的搜索操作的活动。

表2. 大数据和数据挖掘的不同之处

六、大数据挑战

面对大数据的挑战很困难。数据量每天都在增加,联网设备使数据产生的速度不断提高,数据的种类也在不断扩大,而组织采集和处理数据的能力是有限的。

以下是处理大数据时面临的挑战:

1.数据采集和存储

2.数据传输

3.数据管理

4.数据分析

5.数据可视化

大数据挖掘面临的挑战可以分为3层。

第一层是数据挖掘算法的设置。第二层包括:

1. 信息共享和数据隐私。
2. 领域和应用知识。

第三层包括:

3. 多个信息源的局部学习和模型融合。
4. 从稀疏、不确定和不完全的数据中挖掘。
5. 挖掘复杂和动态数据。

图2:大数据挑战的阶段

由于数据量较大,通常从不同数据源挖掘数据是很繁琐的。大数据存储在不同的地方,采集这些数据将是一项繁琐的任务,应用基本的数据挖掘算法将成为其障碍。接下来我们需要考虑数据的隐私。第三种情况是挖掘算法。当我们将数据挖掘算法应用于这些数据子集时,结果可能不那么准确。

七、未来预测

在未来几年,研究人员和从业者将不得不应对以下一些挑战:

分析架构:目前尚不清楚,能够同时处理历史数据和实时数据的最优分析系统架构应该是什么样的。Nathan Marz提出的Lambda架构是一个有趣的方案。Lambda架构将问题分解为三层——批处理层(batch layer)、服务层(serving layer)和速度层(speed layer),以解决在任意数据上实时计算任意函数的问题。它在同一个系统中结合了用于批处理层的Hadoop和用于速度层的Storm。该系统的特性是:稳健且容错、可扩展、通用、可扩充、支持即席查询、维护成本低,并且可调试。

统计显著性:获得具有统计显著性的结果、而不是被随机性所愚弄,这一点非常重要。正如Efron在其关于大规模推断(Large Scale Inference)的著作中所解释的那样,面对海量数据集以及需要同时回答的数千个问题,很容易得出错误的结论。

分布式挖掘:许多数据挖掘技术并不容易并行化。要获得某些方法的分布式版本,还需要大量的研究,并通过实践和理论分析来提出新的方法。

随时间演化的数据:数据可能随时间而演化,因此大数据挖掘技术应当能够适应变化,并在某些情况下首先检测到变化。例如,数据流挖掘领域就有非常强大的技术来完成这项任务。

压缩:处理大数据时,存储所需的空间非常重要。主要有两种方法:不丢失任何信息的压缩,以及选择更具代表性数据的采样。使用压缩,我们可能花费更多的时间而占用更少的空间,因此可以将其视为从时间到空间的转换;使用采样,我们会丢失信息,但在空间上的收益可能达到数量级。例如,Feldman等人使用核心集(core sets)来降低大数据问题的复杂性。核心集是可以证明能够针对给定问题近似原始数据的小数据集,利用merge-reduce,这些小数据集可以用于并行地求解困难的机器学习问题。

可视化:大数据分析的一项主要任务是如何将结果可视化。由于数据量巨大,很难找到用户友好的可视化方式。我们需要新的技术和框架来讲述和展示故事,例如《The Human Face of Big Data》一书中精美的照片、信息图和文章。

隐藏的大数据:大量有用的数据正在丢失,因为新数据大多是未加标记的、基于文件的非结构化数据。2012年IDC关于大数据的研究表明,如果对数据进行标记和分析,2012年23%(643EB)的数字世界将对大数据有用。但是,目前只有3%的潜在有用数据被标记,甚至更少被分析。

八、结论

由于社交网站、搜索和检索引擎、媒体共享网站、股票交易网站、新闻来源等,数据量呈指数级增长。大数据正在成为科学数据研究和商业应用的新领域。

数据挖掘技术可以应用于大数据,从大型数据集中获取有用的信息。将二者结合使用,可以从数据中获得有用的整体图景。

基于Hadoop的MapReduce和HDFS等大数据分析工具可以为组织提供帮助。

