Big Data Mining: Translated Foreign Literature

Document information:
Title: A Study of Data Mining with Big Data
Authors: V. H. Shastri, V. Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
Word count: 2,291 English words (12,196 characters); Chinese translation: 3,868 characters

English original:
A Study of Data Mining with Big Data
Abstract: Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify data sets whose size is larger than what typical databases can handle. Big Data introduces unique computational and statistical challenges, and it is at present expanding in most domains of engineering and science. Because of its volume, variety and velocity, data mining is needed to extract useful information from such huge data sets. This article presents the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.
Keywords: Big Data, data mining, HACE theorem, structured and unstructured data.
1. Introduction
Big Data refers to the enormous amounts of structured and unstructured data that flood an organization. If this data is used properly, it can yield meaningful information. Big Data consists of very large volumes of data that require a great deal of real-time processing. It provides room to discover new value, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data that can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.
Big Data is characterized by the "3 Vs": volume, velocity and variety. Volume refers to the amount of data generated every second; it describes the scale of data at rest. Velocity is the speed at which data is generated and must be handled; data generated from social media is an example of high-velocity data. Variety means that many different types of data are involved, such as audio, video or documents; the data can consist of numerals, images, time series, arrays and so on.
Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis.
Big Data is expanding in all domains of science and engineering, including the physical, biological and biomedical sciences.
2. Big Data with Data Mining
Generally, Big Data refers to a collection of large volumes of data generated from various sources such as the internet, social media, business organizations and sensors. With the help of data mining we can extract useful information from it: data mining is a technique for discovering patterns, as well as descriptive, understandable models, from large-scale data.
Volume refers to the size of the data, which can reach terabytes and petabytes and beyond. This scale, and its continuing growth, make the data difficult to store and analyse using traditional tools. Big Data techniques should be able to mine these large amounts of data within a predefined period of time. Traditional database systems were designed to handle small amounts of structured, consistent data, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on.
Big Data mining refers to the activity of going through big data sets to look for relevant information. Hadoop is used to process large volumes of data from different sources quickly. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails. It runs MapReduce for distributed data processing and works with both structured and unstructured data.
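To make the MapReduce model that Hadoop implements concrete, the following minimal Python sketch simulates the map, shuffle and reduce phases on a toy word-count problem. It only illustrates the programming model on a single machine; it is not Hadoop's actual Java API, and the sample documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(mapped_pairs):
    # Shuffle: group all emitted values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values emitted for one key.
    return key, sum(values)

if __name__ == "__main__":
    splits = ["big data needs data mining",
              "data mining finds patterns in big data"]
    mapped = [pair for split in splits for pair in map_phase(split)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'big': 2, 'data': 4, 'needs': 1, 'mining': 2, ...}
```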
3. Big Data Characteristics: the HACE Theorem
We have a large volume of heterogeneous data, and complex relationships exist among the data. We need to discover useful information from this voluminous data.
Imagine a scenario in which several blind men are asked to size up an elephant. Depending on the part each one touches, one may take the trunk for a wall, a leg for a tree, the body for a wall and the tail for a rope. The blind men can exchange information with each other to piece together a fuller picture.
Figure 1: Blind men and the giant elephant
The characteristics captured by the HACE theorem include:
1. Huge data with heterogeneous and diverse sources: one of the fundamental characteristics of Big Data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world a single human being is represented by name, age, gender, family history and so on, while X-ray and CT scans contribute images and videos. Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.
2. Autonomous sources with distributed and decentralized control: the sources are autonomous, i.e., they generate information automatically and without any centralized control. This can be compared with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.
3. Complex and evolving relationships: as the size of the data grows extremely large, so do the relationships within it. In the early stages, when the data is small, the relationships among the data are simple; data generated from social media and other sources exhibits complex and evolving relationships.
4. Tools: the Open Source Revolution
Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to open source projects. In Big Data mining there are many open source initiatives; the most popular of them are:
Apache Mahout: scalable machine learning and data mining open source software built mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.
R: an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.
MOA: stream data mining open source software that performs data mining in real time. It has implementations of classification, regression, clustering, frequent itemset mining and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android and Storm.
SAMOA: a new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
Vowpal Wabbit: an open source project started at Yahoo! Research and continued at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets and, via parallel learning, can exceed the throughput of any single machine's network interface when doing linear learning.
5. Data Mining for Big Data
Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining algorithms fall into four categories:
1. Association rules
2. Clustering
3. Classification
4. Regression
Association is used to search for relationships between variables; it is applied, for example, in finding items that are frequently accessed together. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with assigning an unknown instance to a known structure. Regression finds a function to model the data.
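As a concrete illustration of the association category, the following minimal Python sketch counts frequent item pairs over a handful of hypothetical shopping baskets. It is a toy, single-machine version of frequent itemset mining; real systems would use Apriori or FP-growth on far larger data.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    # Count how often each pair of items appears together, then keep the
    # pairs whose co-occurrence count reaches the minimum support.
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
print(frequent_pairs(baskets, min_support=3))
# {('bread', 'butter'): 3, ('bread', 'milk'): 3}
```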
The different data mining algorithms are:
Category: Algorithms
Association: Apriori, FP-growth
Clustering: K-Means, Expectation Maximization
Classification: Decision trees, SVM
Regression: Multivariate linear regression

Table 1. Classification of algorithms
Data mining algorithms can be converted into MapReduce algorithms so that they run on a parallel computing basis.
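As a sketch of what such a conversion can look like, the following Python code expresses one iteration of K-Means as a map step (assign each point to its nearest centroid) and a reduce step (recompute each centroid as the mean of its assigned points). It runs in a single process purely for illustration; on Hadoop the same two functions would be distributed across nodes, and the sample points are hypothetical.

```python
from collections import defaultdict

def map_assign(point, centroids):
    # Map: emit (index of nearest centroid, point).
    nearest = min(range(len(centroids)),
                  key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return nearest, point

def reduce_update(points):
    # Reduce: the new centroid is the mean of the points assigned to it.
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def kmeans_iteration(data, centroids):
    groups = defaultdict(list)
    for point in data:                      # map phase
        idx, p = map_assign(point, centroids)
        groups[idx].append(p)
    return [reduce_update(pts)              # reduce phase, one call per centroid
            for _, pts in sorted(groups.items())]

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(kmeans_iteration(data, centroids=[(0.0, 0.0), (10.0, 10.0)]))
# [(1.25, 1.5), (8.5, 8.75)]
```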
Big Data | Data Mining
It is everything in the world now. | It is the old Big Data.
The size of the data is larger. | The size of the data is smaller.
Involves the storage and processing of large data sets. | Interesting patterns can be found in the data.
Big Data is the term for large data sets. | Data mining refers to the activity of going through big data sets to look for relevant information.
Big Data is the asset. | Data mining is the handler that provides beneficial results.
"Big Data" varies depending on the capabilities of the organization managing the data set, and on the capabilities of the applications traditionally used to process and analyse the data. | Data mining refers to operations that involve relatively sophisticated search.

Table 2. Differences between Big Data and Data Mining
6. Challenges in Big Data
Meeting the challenges of Big Data is difficult. The volume is increasing every day, the velocity is being driven up by internet-connected devices, the variety keeps expanding, and organizations' capability to capture and process the data is limited.
The following are the main challenges in handling Big Data:
1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization
The challenges of Big Data mining can be divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes:
1. Information sharing and data privacy.
2. Domain and application knowledge.
The third tier includes:
3. Local learning and model fusion for multiple information sources.
4. Mining from sparse, uncertain and incomplete data.
5. Mining complex and dynamic data.
Figure 2: Phases of Big Data Challenges
Generally, mining data from different data sources is tedious because the data is so large. Big Data is stored in different places, so collecting it is a tedious task, and applying basic data mining algorithms to it becomes an obstacle. Next, we need to consider the privacy of the data. The third issue concerns the mining algorithms themselves: when data mining algorithms are applied to these subsets of the data, the results may not be very accurate.
7. Forecast of the Future
There are some challenges that researchers and practitioners will have to deal with during the coming years:
Analytics architecture: it is not yet clear what an optimal architecture for analytics systems should look like in order to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general and extensible; it allows ad hoc queries, requires minimal maintenance, and is debuggable.
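The following minimal Python sketch shows the core idea of the Lambda architecture described above: a query is answered by merging a precomputed (complete but stale) batch view with a speed-layer view that covers only the events that arrived after the last batch run. The page names and counts are hypothetical, and this single-process toy merely stands in for Hadoop (batch) and Storm (speed).

```python
from collections import Counter

batch_view = Counter({"page_a": 1000, "page_b": 750})  # output of the batch layer (e.g. a Hadoop job)
speed_view = Counter()                                   # incrementally updated by the speed layer (e.g. Storm)

def speed_layer_update(event):
    # Process one real-time event that the last batch run has not seen yet.
    speed_view[event["page"]] += 1

def serving_layer_query(page):
    # The serving layer answers queries by merging both views.
    return batch_view[page] + speed_view[page]

for event in [{"page": "page_a"}, {"page": "page_a"}, {"page": "page_c"}]:
    speed_layer_update(event)

print(serving_layer_query("page_a"))  # 1002
print(serving_layer_query("page_c"))  # 1
```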
Statistical significance: It is important to achieve significant statistical
results, and not be fooled by randomness. As Efron explains in his book about
Large Scale Inference, it is easy to go wrong with huge data sets and thousands
of questions to answer at once.
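As a hedged illustration of how to avoid being fooled by randomness when thousands of questions are asked of the same data at once, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to a set of made-up p-values. This is a standard multiple-testing correction, not something prescribed by the paper itself.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices sorted by p-value
    cutoff = -1
    for rank, idx in enumerate(order, start=1):
        # Largest rank whose p-value is below the stepped-up threshold wins.
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    return sorted(order[:cutoff]) if cutoff > 0 else []

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals, fdr=0.05))  # indices of tests still significant: [0, 1]
```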
Distributed mining: many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research with practical and theoretical analysis is needed to provide new methods.
Time evolving data: data may be evolving over time, so it is important that Big Data mining techniques are able to adapt and, in some cases, to detect change first. For example, the data stream mining field has very powerful techniques for this task.
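As a minimal sketch of such change detection, the Python class below compares the mean of a recent window of a stream against a reference window and flags drift when they diverge. Real stream mining systems (for example those in MOA) use more principled detectors such as ADWIN, so treat this as an illustration only; the stream and thresholds are made up.

```python
from collections import deque

class SimpleDriftDetector:
    """Toy drift detector: flag a change when the recent window's mean drifts
    away from the mean of an initial reference window by more than a threshold."""

    def __init__(self, window=50, threshold=2.0):
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def add(self, x):
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(x)       # still filling the reference window
            return False
        self.recent.append(x)
        if len(self.recent) < self.recent.maxlen:
            return False
        ref_mean = sum(self.reference) / len(self.reference)
        rec_mean = sum(self.recent) / len(self.recent)
        return abs(rec_mean - ref_mean) > self.threshold  # True means drift detected

detector = SimpleDriftDetector(window=20, threshold=1.0)
stream = [0.0] * 40 + [3.0] * 40          # the data distribution shifts halfway through
drift_points = [i for i, x in enumerate(stream) if detector.add(x)]
print(drift_points[:1])                    # first index at which drift is flagged
```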
Compression: when dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression we may take more time and less space, so we can consider it a transformation from time to space. Using sampling we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems. Coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.
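To make the sampling approach concrete, the sketch below keeps a uniform random sample of fixed size from a stream whose length is unknown in advance (classic reservoir sampling). It only illustrates trading information for space; it is not the coreset construction of Feldman et al.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)         # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)          # item i enters the sample with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))  # 5 items drawn uniformly from a million
```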
Visualization: a main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, as, for example, the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".
Hidden Big Data: large quantities of useful data are being lost, since new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would be useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
8. Conclusion
The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new frontier for scientific data research and for business applications.
Data mining techniques can be applied to Big Data to acquire useful information from large datasets; used together, they help build a useful picture from the data.
Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations in this effort.