Building a Sentence Parse Tree with Stanford Parser and the OpenNLP Shallow Parser - Linux大棚

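For a recent project I needed to parse the sentences in a given text: use the POS tags and constituent information to find the dependencies between words and phrases, then build a parse tree for each sentence from those dependencies. This requires Stanford Parser and the Shallow Parser in OpenNLP. Both parsers are implemented in Java, are called through an API, and produce a syntactic analysis of a given sentence. Below I summarize what each parser does and how to call it from a Java program.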

1 Shallow Parser

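The Shallow Parser identifies the phrases in a sentence, including noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), adverb phrases (ADVP), and so on. A sample program follows.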

package edu.pku.yangliu.nlp.pdt;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.HashMap;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

/**
 * A shallow parser based on OpenNLP.
 * @author yangliu
 * @blog
 * @mail yang.liu@pku.edu.cn
 */
public class ShallowParser {

    private static ShallowParser instance = null;
    private static POSModel model;
    private static ChunkerModel cModel;

    // Singleton pattern: both models are loaded once, on first use.
    public static ShallowParser getInstance() throws InvalidFormatException, IOException {
        if (ShallowParser.instance == null) {
            POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
            InputStream is = new FileInputStream("en-chunker.bin");
            try {
                ChunkerModel cModel = new ChunkerModel(is);
                ShallowParser.instance = new ShallowParser(model, cModel);
            } finally {
                is.close(); // the model is fully read by now
            }
        }
        return ShallowParser.instance;
    }

    public ShallowParser(POSModel model, ChunkerModel cModel) {
        ShallowParser.model = model;
        ShallowParser.cModel = cModel;
    }

    /**
     * Chunks a sentence and returns a map from word index (starting at 1)
     * to phrase label: <wordIndex, phraseLabel>.
     * Notice: there should be " " before and after ",", "(", ")" etc.,
     * because the sentence is split on whitespace.
     * @param input The input sentence
     * @return HashMap<Integer,String>
     */
    public HashMap<Integer, String> chunk(String input) throws IOException {
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(input));
        perfMon.start();
        String line;
        String[] whitespaceTokenizerLine = null;
        String[] tags = null;
        // POS-tag each line; only the last line read is chunked below.
        while ((line = lineStream.read()) != null) {
            whitespaceTokenizerLine = WhitespaceTokenizer.INSTANCE.tokenize(line);
            tags = tagger.tag(whitespaceTokenizerLine);
            POSSample posTags = new POSSample(whitespaceTokenizerLine, tags);
            System.out.println(posTags.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();

        HashMap<Integer, String> phraseLablesMap = new HashMap<Integer, String>();
        if (whitespaceTokenizerLine == null) {
            return phraseLablesMap; // empty input: nothing to chunk
        }

        // Chunker: returns one tag per token, e.g. "B-NP", "I-NP" or "O".
        ChunkerME chunkerME = new ChunkerME(cModel);
        String[] result = chunkerME.chunk(whitespaceTokenizerLine, tags);
        int wordCount = 1;
        int phLableCount = 0;
        for (String phLable : result) {
            // "O" marks a token outside any phrase, such as the final punctuation.
            if (phLable.equals("O")) phLable += "-Punctuation";
            // A "B-" tag starts a new phrase, so the phrase counter advances.
            if (phLable.split("-")[0].equals("B")) phLableCount++;
            phLable = phLable.split("-")[1] + phLableCount;
            //if(phLable.equals("ADJP")) phLable = "NP"; //Notice: ADJP included in NP
            //if(phLable.equals("ADVP")) phLable = "VP"; //Notice: ADVP included in VP
            System.out.println(wordCount + ":" + phLable);
            phraseLablesMap.put(wordCount, phLable);
            wordCount++;
        }
        //Span[] span = chunkerME.chunkAsSpans(whitespaceTokenizerLine, tags);
        //for (Span phLable : span)
        //    System.out.println(phLable.toString());
        return phraseLablesMap;
    }

    /**
     * Just for testing.
     * @param args unused
     */
    public static void main(String[] args) throws IOException {
        //Notice: There should be " " BEFORE and after ",", "(", ")" etc.
        String input = "We really enjoyed using the Canon PowerShot SD500 .";
        //String input = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
        ShallowParser swParser = ShallowParser.getInstance();
        swParser.chunk(input);
    }
}

Make sure the paths to the POS model and the chunker model are set correctly. The data files for both models can be downloaded from the official OpenNLP website.

Output

Loading POS Tagger model ... done (1.563s)
Average: 9.3 sent/s 
Total: 1 sent
Runtime: 0.107s
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4

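The results show that the Shallow Parser first prints the POS tags, then identifies two noun phrases (NP1 and NP4), one verb phrase (VP3), and one adverb phrase (ADVP2) in the sentence.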
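The numbered labels above come from the chunker's per-token tags. The following is a minimal sketch of that numbering step on its own, without OpenNLP, plus a grouping step that collects the words of each phrase. The class `PhraseLabels` and its methods are hypothetical helpers, and the chunk tags in `main` are assumed: they are inferred from the printed labels, not taken from actual chunker output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the ShallowParser class.
class PhraseLabels {

    // Turns per-token chunk tags ("B-NP", "I-NP", "O") into numbered labels
    // such as "NP1". The counter advances at every "B-" tag, as in chunk().
    static Map<Integer, String> numberLabels(String[] chunkTags) {
        Map<Integer, String> labels = new LinkedHashMap<>();
        int phraseCount = 0;
        for (int i = 0; i < chunkTags.length; i++) {
            String tag = chunkTags[i];
            if (tag.equals("O")) {
                tag = "O-Punctuation"; // token outside any phrase
            }
            String[] parts = tag.split("-", 2);
            if (parts[0].equals("B")) {
                phraseCount++;
            }
            labels.put(i + 1, parts[1] + phraseCount); // word indices start at 1
        }
        return labels;
    }

    // Groups the words that share a numbered label, keeping sentence order.
    static Map<String, List<String>> groupPhrases(String[] words, Map<Integer, String> labels) {
        Map<String, List<String>> phrases = new LinkedHashMap<>();
        for (int i = 0; i < words.length; i++) {
            String label = labels.get(i + 1);
            phrases.computeIfAbsent(label, k -> new ArrayList<>()).add(words[i]);
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"We", "really", "enjoyed", "using", "the", "Canon", "PowerShot", "SD500", "."};
        // Assumed tags, reconstructed from the labels printed above.
        String[] tags = {"B-NP", "B-ADVP", "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        Map<Integer, String> labels = numberLabels(tags);
        // Prints {NP1=[We], ADVP2=[really], VP3=[enjoyed, using],
        //         NP4=[the, Canon, PowerShot, SD500], Punctuation4=[.]} on one line
        System.out.println(groupPhrases(words, labels));
    }
}
```

Because the counter only advances on a "B-" tag, a token tagged "O" inherits the number of the preceding phrase, which is why the final period is labeled Punctuation4.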

2 Stanford Parser

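Stanford Parser finds the dependency relations between the words of a sentence and outputs them in Stanford Dependencies format, in forms such as a directed graph or a tree. Sample code follows.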

package edu.pku.yangliu.nlp.pdt;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import opennlp.tools.util.InvalidFormatException;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
// Package used by the release this code was written against; newer releases
// provide TokenizerFactory in edu.stanford.nlp.process instead.
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

/**
 * Parse sentences based on Stanford Parser.
 * @author yangliu
 * @blog
 * @mail yang.liu@pku.edu.cn
 */
public class StanfordParser {

    private static StanfordParser instance = null;
    private static LexicalizedParser lp;

    // Singleton pattern: the grammar is loaded once, on first use.
    public static StanfordParser getInstance() {
        if (StanfordParser.instance == null) {
            LexicalizedParser lp = LexicalizedParser.loadModel(
                    "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
                    "-retainTmpSubcategories");
            StanfordParser.instance = new StanfordParser(lp);
        }
        return StanfordParser.instance;
    }

    public StanfordParser(LexicalizedParser lp) {
        StanfordParser.lp = lp;
    }

    /**
     * Parse the sentences in a file and print each parse tree
     * followed by its typed dependencies.
     * @param SentFilename The input file
     */
    public void DPFromFile(String SentFilename) {
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        for (List<HasWord> sentence : new DocumentPreprocessor(SentFilename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            // Copy into a list instead of casting the returned collection.
            List<TypedDependency> tdl =
                    new ArrayList<TypedDependency>(gs.typedDependenciesCollapsedTree());
            System.out.println(tdl);
            System.out.println();
        }
    }

    /**
     * Parse a sentence given as a String.
     * @param sent The input sentence
     * @return List<TypedDependency> The list of typed dependencies
     */
    public List<TypedDependency> DPFromString(String sent) {
        TokenizerFactory<CoreLabel> tokenizerFactory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
        List<CoreLabel> rawWords =
                tokenizerFactory.getTokenizer(new StringReader(sent)).tokenize();
        Tree parse = lp.apply(rawWords);
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        // Use the collapsed-tree dependencies, so that dependencies which
        // do not preserve the tree structure are omitted.
        return new ArrayList<TypedDependency>(gs.typedDependenciesCollapsedTree());
    }
}

The main function is as follows.

/**
 * Just for testing.
 * @param args unused
 * @throws IOException
 * @throws InvalidFormatException
 */
public static void main(String[] args) throws InvalidFormatException, IOException {
    //Notice: There should be " " BEFORE and after ",", "(", ")" etc.
    String sent = "We really enjoyed using the Canon PowerShot SD500 .";
    //String sent = "Bell , based in Los Angeles , makes and distributes electronic , computer and building products .";
    //String sent = "It has an exterior design that combines form and function more elegantly than any point-and-shoot we've ever tested . ";
    //String sent = "A Digic II-powered image-processing system enables the SD500 to snap a limitless stream of 7-megapixel photos at a respectable clip , its start-up time is tops in its class , and it delivers decent photos when compared to its competition . ";
    //String sent = "I've had it for about a month and it is simply the best point-and-shoot your money can buy . ";

    // Step 1: typed dependencies from Stanford Parser.
    StanfordParser sdParser = StanfordParser.getInstance();
    List<TypedDependency> tdl = sdParser.DPFromString(sent);
    for (TypedDependency oneTdl : tdl) {
        System.out.println(oneTdl);
    }

    // Step 2: phrase labels from the shallow parser.
    ShallowParser swParser = ShallowParser.getInstance();
    HashMap<Integer, String> phraseLablesMap = swParser.chunk(sent);

    // Step 3: combine both into the tree and print it.
    WDTree wdtree = new WDTree();
    WDTreeNode root = wdtree.bulidWDTreeFromList(tdl, phraseLablesMap);
    wdtree.printWDTree(root);
}

The output below shows the dependencies between words, the POS tags, and the sentence parse tree.

Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [2.1 sec].
nsubj(enjoyed-3, We-1)
advmod(enjoyed-3, really-2)
root(ROOT-0, enjoyed-3)
xcomp(enjoyed-3, using-4)
det(SD500-8, the-5)
nn(SD500-8, Canon-6)
nn(SD500-8, PowerShot-7)
dobj(using-4, SD500-8)
Loading POS Tagger model ... done (1.492s)
We_PRP really_RB enjoyed_VBD using_VBG the_DT Canon_NNP PowerShot_NNP SD500_NNP ._.
Average: 200.0 sent/s 
Total: 1 sent
Runtime: 0.0050s
1:NP1
2:ADVP2
3:VP3
4:VP3
5:NP4
6:NP4
7:NP4
8:NP4
9:Punctuation4
children of ROOT-0_ (phLable:null):
enjoyed-3_  rel:root phLable:VP3   
children of enjoyed-3_ (phLable:VP3):
We-1_  rel:nsubj phLable:NP1   really-2_  rel:advmod phLable:ADVP2   using-4_  rel:xcomp phLable:VP3   
children of using-4_ (phLable:VP3):
SD500-8_  rel:dobj phLable:NP4   
children of SD500-8_ (phLable:NP4):
the-5_  rel:det phLable:NP4   Canon-6_  rel:nn phLable:NP4   PowerShot-7_  rel:nn phLable:NP4
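The WDTree and WDTreeNode classes used in the main function are not listed in this article. As a rough sketch of the idea only, the code below builds a comparable tree from the printed dependency lines and the phrase label map. The class `DepTreeSketch`, its `Node` type, and the indented output format are all hypothetical, and the input is parsed from strings rather than from `TypedDependency` objects so that the example runs without either library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for WDTree/WDTreeNode, which the article does not show.
class DepTreeSketch {

    static class Node {
        final String word;     // e.g. "enjoyed-3"
        final String rel;      // relation to the parent, null for ROOT
        final String phLabel;  // phrase label from the shallow parser, may be null
        final List<Node> children = new ArrayList<>();

        Node(String word, String rel, String phLabel) {
            this.word = word;
            this.rel = rel;
            this.phLabel = phLabel;
        }
    }

    // Matches lines such as "nsubj(enjoyed-3, We-1)".
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    static Node build(List<String> deps, Map<Integer, String> phLabels) {
        Map<Integer, Node> nodes = new HashMap<>();
        Node root = new Node("ROOT-0", null, null);
        nodes.put(0, root);
        List<int[]> edges = new ArrayList<>();
        // First pass: create one node per dependent word.
        for (String dep : deps) {
            Matcher m = DEP.matcher(dep);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unexpected dependency: " + dep);
            }
            int gov = Integer.parseInt(m.group(3));
            int child = Integer.parseInt(m.group(5));
            nodes.put(child, new Node(m.group(4) + "-" + child, m.group(1), phLabels.get(child)));
            edges.add(new int[] {gov, child});
        }
        // Second pass: attach every dependent to its governor.
        for (int[] edge : edges) {
            Node gov = nodes.get(edge[0]);
            if (gov == null) {
                throw new IllegalArgumentException("No node for governor index " + edge[0]);
            }
            gov.children.add(nodes.get(edge[1]));
        }
        return root;
    }

    // Writes the tree depth-first, two spaces of indentation per level.
    static void render(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.word);
        if (node.rel != null) {
            out.append(" rel:").append(node.rel);
        }
        out.append(" phLabel:").append(node.phLabel).append('\n');
        for (Node child : node.children) {
            render(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> deps = List.of(
                "nsubj(enjoyed-3, We-1)", "advmod(enjoyed-3, really-2)",
                "root(ROOT-0, enjoyed-3)", "xcomp(enjoyed-3, using-4)",
                "det(SD500-8, the-5)", "nn(SD500-8, Canon-6)",
                "nn(SD500-8, PowerShot-7)", "dobj(using-4, SD500-8)");
        Map<Integer, String> phLabels = new HashMap<>();
        String[] labels = {"NP1", "ADVP2", "VP3", "VP3", "NP4", "NP4", "NP4", "NP4", "Punctuation4"};
        for (int i = 0; i < labels.length; i++) {
            phLabels.put(i + 1, labels[i]);
        }
        StringBuilder out = new StringBuilder();
        render(build(deps, phLabels), 0, out);
        System.out.print(out);
    }
}
```

The two passes are needed because a governor can appear in the list after its dependents, as `using-4` does here. The final period has no dependency entry, so it does not appear in the tree, which matches the output above.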

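Copyright notice: This article, titled "Building a Sentence Parse Tree with Stanford Parser and the OpenNLP Shallow Parser", was contributed by a site user, and the views expressed are the author's own. To repost it, contact the author and cite the source: http://www.roclinux.cn/b/1687265360a82840.html. This site only provides information storage; it does not own the content and assumes no related legal liability. If you find content on this site suspected of plagiarism, infringement, or violation of laws or regulations, it will be removed promptly once verified.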
