
Getting Started with spaCy

Industrial-Strength Natural Language Processing ^ _ ^

Reference

spacy official website

Introduction

spaCy is an industrial-strength natural language processing (NLP) library.

Quick Start

# pip install -U spacy
# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Executing python -m spacy download en_core_web_sm might fail because of network connection errors. Another way to get a language model is to download the package from GitHub and then install it offline with pip install <package>.

For example, to install and use “zh_core_web_sm-3.2.0”:

  1. Download zh_core_web_sm-3.2.0.tar.gz from https://github.com/explosion/spacy-models/releases/.
  2. Copy the package to the download folder under the project directory.
  • In VS Code, dragging the package from a local directory to the remote server directory is convenient when the package is small.
  • If the package is large, scp /path/filename username@servername:/path is more reliable, because transferring a large file through VS Code might cause an accidental loss of connection.
  3. Install it offline: pip install zh_core_web_sm-3.2.0.tar.gz
  4. Then you can load the language model in your Python script: nlp = spacy.load("zh_core_web_sm")
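If you want to check that spaCy's Chinese language support works before a trained model is installed, a blank pipeline is a quick sanity check (a sketch; it only tokenizes, with no tagger or NER, and by default segments character by character):

```python
import spacy

# A blank Chinese pipeline needs no downloaded model; the default
# segmenter splits character by character (word-level segmentation
# requires a trained model or the optional jieba/pkuseg integrations).
nlp = spacy.blank("zh")
doc = nlp("自然语言处理")
print([token.text for token in doc])
```

Once zh_core_web_sm is installed, spacy.load("zh_core_web_sm") gives you the full pipeline instead.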

Features

Support for 64+ languages

64 trained pipelines for 19 languages

Pretrained transformers

Multi-task learning with pretrained transformers like BERT

Pretrained word vectors

State-of-the-art speed

Production-ready training system

Linguistically-motivated tokenization

Components for many NLP tasks

Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more.
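As a small illustration of assembling components, the snippet below builds a blank English pipeline with only the rule-based sentencizer added (a sketch; a trained pipeline like en_core_web_sm bundles the tagger, parser, NER and more for you):

```python
import spacy

# Start from a blank English pipeline (tokenizer only) and add the
# rule-based sentence segmentation component.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("spaCy ships many components. You add only what you need.")
print([sent.text for sent in doc.sents])
```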

Custom components

Easily extensible with custom components and attributes.
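A minimal sketch of a custom component and attribute (the component name count_tokens and the extension attribute token_count are our own choices here, not spaCy built-ins):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register a custom attribute on Doc (the name "token_count" is ours).
Doc.set_extension("token_count", default=0, force=True)

# Register the component under a name of our choosing.
@Language.component("count_tokens")
def count_tokens(doc):
    doc._.token_count = len(doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("count_tokens")
doc = nlp("Custom components run on every document.")
print(doc._.token_count)
```

Components added this way run automatically whenever the pipeline processes text, so derived attributes travel with the Doc object.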

Custom models

Support for custom models in PyTorch, TensorFlow and other frameworks.

Built-in visualizers for syntax and NER
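For example, displaCy can render entities either from a processed Doc or from plain dicts in "manual" mode, which needs no trained model (the entity spans below are hand-labelled for illustration):

```python
from spacy import displacy

# Manual mode: render hand-labelled entities without running a model.
example = {
    "text": "Sebastian Thrun worked on self-driving cars at Google.",
    "ents": [
        {"start": 0, "end": 15, "label": "PERSON"},
        {"start": 47, "end": 53, "label": "ORG"},
    ],
}
html = displacy.render([example], style="ent", manual=True, page=False)
print(html[:60])
```

The returned HTML string can be embedded in a page, or displacy.serve can host it locally in a browser.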

Easy model packaging, deployment and workflow management

Robust, rigorously evaluated accuracy