python - Can spaCy's entities (using EntityRuler) be made time-dependent? - TagMerge

Can spaCy's entities (using EntityRuler) be made time-dependent?

Asked 1 year ago
5 answers

You can't match against doc properties with patterns.

I would just use a component to post-process things here. Something that iterates over the doc and for any entity adds the ID by checking the year.

If your matches are all literal strings, that should be pretty easy. If you have more complicated patterns and resolving the match back to the pattern isn't easy, I would use intermediate IDs that you can then resolve to your year.
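For example, here is a minimal sketch of that idea, assuming spaCy 3.x; the ID_BY_YEAR table, the resolve_ids helper, the "year" extension and the pattern ID "acme_corp" are made up for illustration:

import spacy
from spacy.tokens import Doc, Span

# Hypothetical lookup: (intermediate pattern ID, year) -> the ID that was valid that year
ID_BY_YEAR = {
    ("acme_corp", 2020): "ACME-OLD-001",
    ("acme_corp", 2021): "ACME-NEW-002",
}

Doc.set_extension("year", default=None)          # document metadata supplied by the caller
Span.set_extension("resolved_id", default=None)  # where the time-dependent ID ends up

def resolve_ids(doc):
    # Post-process: map each ruler match's intermediate ID to the year-specific ID
    for ent in doc.ents:
        ent._.resolved_id = ID_BY_YEAR.get((ent.ent_id_, doc._.year))
    return doc

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Acme Corp", "id": "acme_corp"}])

doc = nlp("Acme Corp filed a report.")
doc._.year = 2021
resolve_ids(doc)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.ent_id_, "->", ent._.resolved_id)
# Acme Corp ORG acme_corp -> ACME-NEW-002

If the year is known before processing, the same function could instead be registered as a pipeline component (e.g. with @Language.component) placed after the entity_ruler.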

Source: link


I have installed spaCy as below. When I go to Jupyter Notebook and run nlp = spacy.load('en_core_web_sm'), I get the error below:
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-4-b472bef03043> in <module>()
      1 # Import spaCy and load the language library
      2 import spacy
----> 3 nlp = spacy.load('en_core_web_sm')
      4 
      5 # Create a Doc object

C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder\lib\site-packages\spacy\__init__.py in load(name, **overrides)
     13     if depr_path not in (True, False, None):
     14         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15     return util.load_model(name, **overrides)
     16 
     17 

C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder\lib\site-packages\spacy\util.py in load_model(name, **overrides)
    117     elif hasattr(name, 'exists'):  # Path or Path-like to model data
    118         return load_model_from_path(name, **overrides)
--> 119     raise IOError(Errors.E050.format(name=name))
    120 
    121 

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
How I installed spaCy:
(C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder) C:\Users\nikhizzz>conda install -c conda-forge spacy
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder:

The following NEW packages will be INSTALLED:

    blas:           1.0-mkl
    cymem:          1.31.2-py35h6538335_0    conda-forge
    dill:           0.2.8.2-py35_0           conda-forge
    msgpack-numpy:  0.4.4.2-py_0             conda-forge
    murmurhash:     0.28.0-py35h6538335_1000 conda-forge
    plac:           0.9.6-py_1               conda-forge
    preshed:        1.0.0-py35h6538335_0     conda-forge
    pyreadline:     2.1-py35_1000            conda-forge
    regex:          2017.11.09-py35_0        conda-forge
    spacy:          2.0.12-py35h830ac7b_0    conda-forge
    termcolor:      1.1.0-py_2               conda-forge
    thinc:          6.10.3-py35h830ac7b_2    conda-forge
    tqdm:           4.29.1-py_0              conda-forge
    ujson:          1.35-py35hfa6e2cd_1001   conda-forge

The following packages will be UPDATED:

    msgpack-python: 0.4.8-py35_0                         --> 0.5.6-py35he980bc4_3 conda-forge

The following packages will be DOWNGRADED:

    freetype:       2.7-vc14_2               conda-forge --> 2.5.5-vc14_2

Proceed ([y]/n)? y

blas-1.0-mkl.t 100% |###############################| Time: 0:00:00   0.00  B/s
cymem-1.31.2-p 100% |###############################| Time: 0:00:00   1.65 MB/s
msgpack-python 100% |###############################| Time: 0:00:00   5.37 MB/s
murmurhash-0.2 100% |###############################| Time: 0:00:00   1.49 MB/s
plac-0.9.6-py_ 100% |###############################| Time: 0:00:00   0.00  B/s
pyreadline-2.1 100% |###############################| Time: 0:00:00   4.62 MB/s
regex-2017.11. 100% |###############################| Time: 0:00:00   3.31 MB/s
termcolor-1.1. 100% |###############################| Time: 0:00:00 187.81 kB/s
tqdm-4.29.1-py 100% |###############################| Time: 0:00:00   2.51 MB/s
ujson-1.35-py3 100% |###############################| Time: 0:00:00   1.66 MB/s
dill-0.2.8.2-p 100% |###############################| Time: 0:00:00   4.34 MB/s
msgpack-numpy- 100% |###############################| Time: 0:00:00   0.00  B/s
preshed-1.0.0- 100% |###############################| Time: 0:00:00   0.00  B/s
thinc-6.10.3-p 100% |###############################| Time: 0:00:00   5.49 MB/s
spacy-2.0.12-p 100% |###############################| Time: 0:00:10   7.42 MB/s

(C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder) C:\Users\nikhizzz>python -V
Python 3.5.3 :: Anaconda custom (64-bit)

(C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder) C:\Users\nikhizzz>python -m spacy download en
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |################################| 37.4MB ...
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder\lib\site-packages\en_core_web_sm
    -->
    C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')


(C:\Users\nikhizzz\AppData\Local\conda\conda\envs\tensorflowspyder) C:\Users\nikhizzz>
Initially I downloaded two English packages using the following statements in the Anaconda prompt:
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_sm
But I kept getting a linkage error, and finally running the command below established the link and solved the error:
python -m spacy download en
The below worked for me:
import en_core_web_sm

nlp = en_core_web_sm.load()
Go to the path where it is downloaded, e.g.
C:\Users\name\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.2.0
Paste it in:
nlp = spacy.load(r'C:\Users\name\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.2.0')
Download the model (change the name according to the size of the model)
!python -m spacy download en_core_web_lg
Test
import spacy
nlp = spacy.load("en_core_web_lg")
Then write the following code:
import en_core_web_sm
nlp = en_core_web_sm.load()
In your Anaconda Prompt, run the command (drop the leading ! if you are not inside a notebook):
!python -m spacy download en
After running the above command, you should be able to execute the below in your Jupyter notebook:
spacy.load('en_core_web_sm')
import spacy

nlp = spacy.load('/opt/anaconda3/envs/NLPENV/lib/python3.7/site-packages/en_core_web_sm/en_core_web_sm-2.3.1')
pip install spacy==2.3.5
python -m spacy download en_core_web_sm
python -m spacy download en
from chatterbot import ChatBot
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
ChatBot("hello")
Download the best-matching version of a specific model for your spaCy installation:
python -m spacy download en_core_web_sm
Or pip install the .tar.gz archive from a path or URL:
pip install /Users/you/en_core_web_sm-2.2.0.tar.gz
or
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
If you're not sure, try running the code below:
nlp = spacy.load('en_core_web_sm')
environment.yml example
name: root
channels:
  - defaults
  - conda-forge
  - anaconda
dependencies:
  - python=3.8.3
  - pip
  - spacy=2.3.2
  - scikit-learn=0.23.2
  - pip:
    - https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz#egg=en_core_web_sm
A simple solution for this, which I saw on spacy.io:
from spacy.lang.en import English
nlp=English()
Open Anaconda Navigator. Click on any IDE. Run the code:
!pip install -U spacy download en_core_web_sm
Loading the module using a different syntax worked for me.
import en_core_web_sm
nlp = en_core_web_sm.load()
You should then be able to run the following:
import spacy
nlp = spacy.load("en_core_web_sm")
Open command prompt or terminal and execute the below code:
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
The complete code will be:
python -m spacy download en_core_web_sm

import en_core_web_sm

nlp = en_core_web_sm.load()
This works with colab:
!python -m spacy download en
import en_core_web_sm
nlp = en_core_web_sm.load()
Or for the medium:
import en_core_web_md
nlp = en_core_web_md.load()
I then loaded the model using the explicit path and it worked from within PyCharm (note the path used goes all the way to en_core_web_lg-3.0.0; you will get an error if you do not use the folder with the config.cfg file):
nlpObject = spacy.load('/path_to_models/en_core_web_lg-3.0.0/en_core_web_lg/en_core_web_lg-3.0.0')
Run the Anaconda command prompt with admin privileges (important!), then run the commands below:
pip install -U --user spacy
python -m spacy download en
Try the commands below for verification:
import spacy
spacy.load('en')
Open Command Prompt as Administrator, go to C:\>, and activate your conda environment (if you work in a specific one):
c:\>activate <conda environment name>
(conda environment name)c:\>python -m spacy download en
Return to Jupyter Notebook and you can load the language library:
nlp = en_core_web_sm.load()
Open a terminal from Anaconda or activate the Anaconda env, then run this:
pip3 install /Users/yourpath/Downloads/en_core_web_sm-3.1.0.tar.gz;
or
pip install /Users/yourpath/Downloads/en_core_web_sm-3.1.0.tar.gz;
Run this in the OS console:
python -m spacy download en
python -m spacy link en_core_web_sm en_core_web_sm
Then run this in a Python console or in your Python IDE:
import spacy
spacy.load('en_core_web_sm')
Don't run !python -m spacy download en_core_web_lg from inside Jupyter. Do this instead:
import spacy.cli
spacy.cli.download("en_core_web_lg")

Source: link


After installation you typically want to download a trained pipeline. For more info and available packages, see the models directory.
python -m spacy download en_core_web_sm
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
pip install -U pip setuptools wheel
pip install -U spacy
When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy
Example
pip install spacy[lookups,transformers]
Thanks to our great community, we’ve been able to re-add conda support. You can also install spaCy via conda-forge:
conda install -c conda-forge spacy
spaCy also provides a validate command, which lets you verify that all installed pipeline packages are compatible with your spaCy version. If incompatible packages are found, tips and installation instructions are printed. It’s recommended to run the command with python -m to make sure you’re executing the correct version of spaCy.
pip install -U spacy
python -m spacy validate
spaCy can be installed on GPU by specifying spacy[cuda], spacy[cuda90], spacy[cuda91], spacy[cuda92], spacy[cuda100], spacy[cuda101], spacy[cuda102], spacy[cuda110], spacy[cuda111] or spacy[cuda112]. If you know your cuda version, using the more explicit specifier allows cupy to be installed via wheel, saving some compilation time. The specifiers should install cupy.
pip install -U spacy[cuda92]
Once you have a GPU-enabled installation, the best way to activate it is to call spacy.prefer_gpu() or spacy.require_gpu() somewhere in your script before any pipelines have been loaded. require_gpu will raise an error if no GPU is available.
import spacy

spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")
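For completeness, a small sketch of the difference between the two calls on a CPU-only machine (the printed error text is illustrative, not exact):

import spacy

print(spacy.prefer_gpu())        # False on a CPU-only machine; spaCy silently stays on CPU

try:
    spacy.require_gpu()          # raises if no GPU (or cupy) is available
except Exception as err:
    print("GPU not available:", err)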
The other way to install spaCy is to clone its GitHub repository and build it from source. That is the common way if you want to make changes to the code base. You’ll need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip and git installed. The compiler part is the trickiest. How to do that depends on your system. See notes on Ubuntu, macOS / OS X and Windows for details.
python -m pip install -U pip setuptools wheel # install/update build tools
git clone https://github.com/explosion/spaCy  # clone spaCy
cd spaCy                                      # navigate into dir
python -m venv .env                           # create environment in .env
source .env/bin/activate                      # activate virtual env
pip install -r requirements.txt               # install requirements
pip install --no-build-isolation --editable . # compile and install spaCy
To install with extras:
pip install --no-build-isolation --editable .[lookups,cuda102]
Install in editable mode. Changes to .py files will be reflected as soon as the files are saved, but edits to Cython files (.pxd, .pyx) will require the pip install or python setup.py build_ext command below to be run again. Before installing in editable mode, be sure you have removed any previous installs with pip uninstall spacy, which you may need to run multiple times to remove all traces of earlier installs.
pip install -r requirements.txt
pip install --no-build-isolation --editable .
Build in parallel using N CPUs to speed up compilation and then install in editable mode:
pip install -r requirements.txt
python setup.py build_ext --inplace -j N
python setup.py develop
To use a .pex file, just replace python with the path to the file when you execute your code or CLI commands. This is equivalent to running Python in a virtual environment with spaCy installed.
./spacy.pex my_script.py
./spacy.pex -m spacy info
git clone https://github.com/explosion/spaCy
cd spaCy
make
Alternatively, you can find out where spaCy is installed and run pytest on that directory. Don’t forget to also install the test utilities via spaCy’s requirements.txt:
python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))"
pip install -r path/to/requirements.txt
python -m pytest --pyargs spacy
Calling pytest on the spaCy directory will run only the basic tests. The flag --slow is optional and enables additional tests that take longer.
python -m pip install -U pytest               # update pytest
python -m pytest --pyargs spacy               # basic tests
python -m pytest --pyargs spacy --slow        # basic and slow tests
No compatible model found
No compatible package found for [lang] (spaCy vX.X.X).
Import error: No module named spacy
Import Error: No module named spacy
Import error: No module named [name]
ImportError: No module named 'en_core_web_sm'
Command not found: spacy
command not found: spacy
'module' object has no attribute 'load'
AttributeError: 'module' object has no attribute 'load'
Unhashable type: 'list'
TypeError: unhashable type: 'list'
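A hedged diagnostic sketch tying the errors above together: check that spaCy itself imports, then that the model package is present, before calling spacy.load (en_core_web_sm is just the example model used throughout this page):

import importlib.util

try:
    import spacy                 # ImportError here -> spaCy isn't installed in this environment
except ImportError:
    raise SystemExit("Install spaCy first, e.g. pip install -U spacy")

if importlib.util.find_spec("en_core_web_sm") is None:
    # Matches the "No module named 'en_core_web_sm'" / E050 case above
    print("Model not installed; run: python -m spacy download en_core_web_sm")
else:
    nlp = spacy.load("en_core_web_sm")
    print("Loaded", nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])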

Source: link


At this point, our text has already been tokenized, but spaCy stores tokenized text as a doc, and we’d like to look at it in list form, so we’ll create a for loop that iterates through our doc, adding each word token it finds in our text string to a list called token_list so that we can take a better look at how words are tokenized.
# Word tokenization
from spacy.lang.en import English

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = English()

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" Object is used to create documents with linguistic annotations.
my_doc = nlp(text)

# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
print(token_list)
['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']
In the code below, spaCy tokenizes the text and creates a Doc object. This Doc object uses our preprocessing pipeline’s components (tagger, parser and entity recognizer) to break the text down into components. From this pipeline we can extract any component, but here we’re going to access sentence tokens using the sentencizer component.
# sentence tokenization

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = English()

# Create the pipeline 'sentencizer' component
sbd = nlp.create_pipe('sentencizer')   

# Add the component to the pipeline
nlp.add_pipe(sbd)

text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
["When learning data science, you shouldn't get discouraged!", "\nChallenges and setbacks aren't failures, they're just part of the journey.", "You've got this!"]
Let’s take a look at the stopwords spaCy includes by default. We’ll import spaCy and assign the stopwords in its English-language model to a variable called spacy_stopwords so that we can take a look.
#Stop words
#importing stop words from English language.
import spacy
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

#Printing the total number of stop words:
print('Number of stop words: %d' % len(spacy_stopwords))

#Printing the first twenty stop words:
print('First twenty stop words: %s' % list(spacy_stopwords)[:20])
Number of stop words: 312
First twenty stop words: ['was', 'various', 'fifty', "'s", 'used', 'once', 'because', 'himself', 'can', 'name', 'many', 'seems', 'others', 'something', 'anyhow', 'nowhere', 'serious', 'forty', 'he', 'now']
Instead, we’ll create an empty list called filtered_sent and then iterate through our doc variable to look at each tokenized word from our source text. spaCy includes a bunch of helpful token attributes, and we’ll use one of them called is_stop to identify words that aren’t in the stopword list and then append them to our filtered_sent list.
from spacy.lang.en.stop_words import STOP_WORDS

#Implementation of stop words:
filtered_sent=[]

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# filtering stop words
for word in doc:
    if word.is_stop==False:
        filtered_sent.append(word)
print("Filtered Sentence:",filtered_sent)
Filtered Sentence: [learning, data, science, ,, discouraged, !,
, Challenges, setbacks, failures, ,, journey, ., got, !]
Since spaCy includes a built-in way to break a word down into its lemma, we can simply use that for lemmatization. In the following very simple example, we’ll use .lemma_ to produce the lemma for each word we’re analyzing.
# Implementing lemmatization
lem = nlp("run runs running runner")
# finding lemma for each word
for word in lem:
    print(word.text,word.lemma_)
run run
runs run
running run
runner runner
(Note the u in u"All is well that ends well." signifies that the string is a Unicode string.)
# POS tagging

# importing the model en_core_web_sm of English for vocabulary, syntax & entities
import en_core_web_sm

# load en_core_web_sm of English for vocabulary, syntax & entities
nlp = en_core_web_sm.load()

#  "nlp" Object is used to create documents with linguistic annotations.
docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text,word.pos_)
All DET
is VERB
well ADV
that DET
ends VERB
well ADV
. PUNCT
Let’s try out some entity detection using a few paragraphs from this recent article in the Washington Post. We’ll use .label_ to grab a label for each entity that’s detected in the text, and then we’ll take a look at these entities in a more visual format using spaCy’s displaCy visualizer.
#for visualization of Entity detection importing displacy from spacy:

from spacy import displacy

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities=[(i, i.label_, i.label) for i in nytimes.ents]
entities
[(New York City, 'GPE', 384),
 (Tuesday, 'DATE', 391),
 (At least 285, 'CARDINAL', 397),
 (September, 'DATE', 391),
 (Brooklyn, 'GPE', 384),
 (Williamsburg, 'GPE', 384),
 (four, 'CARDINAL', 397),
 (Bill de Blasio, 'PERSON', 380),
 (Tuesday, 'DATE', 391),
 (Orthodox Jews, 'NORP', 381),
 (6 months old, 'DATE', 391),
 (up to $1,000, 'MONEY', 394)]
Using displaCy we can also visualize our input text, with each identified entity highlighted by color and labeled. We’ll use style = "ent" to tell displaCy that we want to visualize entities here.
displacy.render(nytimes, style = "ent",jupyter = True)
Doing this is quite complicated, but thankfully spaCy will take care of the work for us! Below, let’s give spaCy another short sentence pulled from the news headlines. Then we’ll use another spaCy feature called noun_chunks, which breaks the input down into nouns and the words describing them, and iterate through each chunk in our source text, identifying the word, its root, its dependency identification, and which chunk it belongs to.
docp = nlp (" In pursuit of a wall, President Trump ran into one.")

for chunk in docp.noun_chunks:
   print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
docp = nlp (" In pursuit of a wall, President Trump ran into one.") for chunk in docp.noun_chunks: print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
pursuit pursuit pobj In
a wall wall pobj of
President Trump Trump nsubj ran
This output can be a little bit difficult to follow, but since we’ve already imported the displaCy visualizer, we can use that to view a dependency diagram using style = "dep" that’s much easier to understand:
displacy.render(docp, style="dep", jupyter= True)
Using spaCy’s en_core_web_sm model, let’s take a look at the length of a vector for a single word, and what that vector looks like using .vector and .shape.
import en_core_web_sm
nlp = en_core_web_sm.load()
mango = nlp(u'mango')
print(mango.vector.shape)
print(mango.vector)
(96,)
[ 1.0466383  -1.5323697  -0.72177905 -2.4700649  -0.2715162   1.1589639
  1.7113379  -0.31615403 -2.0978343   1.837553    1.4681302   2.728043
 -2.3457408  -5.17184    -4.6110015  -0.21236466 -0.3029521   4.220028
 -0.6813917   2.4016762  -1.9546705  -0.85086954  1.2456163   1.5107994
  0.4684736   3.1612053   0.15542296  2.0598564   3.780035    4.6110964
  0.6375268  -1.078107   -0.96647096 -1.3939928  -0.56914186  0.51434743
  2.3150034  -0.93199825 -2.7970662  -0.8540115  -3.4250052   4.2857723
  2.5058174  -2.2150877   0.7860181   3.496335   -0.62606215 -2.0213525
 -4.47421     1.6821622  -6.0789204   0.22800982 -0.36950028 -4.5340714
 -1.7978683  -2.080299    4.125556    3.1852438  -3.286446    1.0892276
  1.017115    1.2736416  -0.10613725  3.5102775   1.1902348   0.05483437
 -0.06298041  0.8280688   0.05514218  0.94817173 -0.49377063  1.1512338
 -0.81374085 -1.6104267   1.8233354  -2.278403   -2.1321895   0.3029334
 -1.4510616  -1.0584296  -3.5698352  -0.13046083 -0.2668339   1.7826645
  0.4639858  -0.8389523  -0.02689964  2.316218    5.8155413  -0.45935947
  4.368636    1.6603007  -3.1823301  -1.4959551  -0.5229269   1.3637555 ]
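Note that the (96,) shape above comes from the small model, which ships without static word vectors (the value is derived from the model’s context-sensitive tensors). For real word vectors, the medium or large packages are the usual choice; a short sketch, assuming en_core_web_md is installed:

import spacy

nlp = spacy.load("en_core_web_md")            # md/lg English packages include static word vectors
mango = nlp(u'mango')[0]
print(mango.has_vector, mango.vector.shape)   # True (300,) for the md/lg English models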
We’ll start by importing the libraries we’ll need for this task. We’ve already imported spaCy, but we’ll also want pandas and scikit-learn to help with our analysis.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
Let’s start by reading the data into a pandas dataframe and then using the built-in functions of pandas to help us take a closer look at our data.
# Loading TSV file
df_amazon = pd.read_csv ("datasets/amazon_alexa.tsv", sep="\t")
# Top 5 records
df_amazon.head()
   rating       date        variation                                  verified_reviews  feedback
0       5  31-Jul-18  Charcoal Fabric                                     Love my Echo!         1
1       5  31-Jul-18  Charcoal Fabric                                         Loved it!         1
2       4  31-Jul-18    Walnut Finish   Sometimes while playing a game, you can answer…        1
3       5  31-Jul-18  Charcoal Fabric   I have had a lot of fun with this thing. My 4 …        1
4       5  31-Jul-18  Charcoal Fabric                                             Music         1
# shape of dataframe
df_amazon.shape
(3150, 5)
# View data information
df_amazon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating              3150 non-null int64
date                3150 non-null object
variation           3150 non-null object
verified_reviews    3150 non-null object
feedback            3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.1+ KB
# Feedback Value count
df_amazon.feedback.value_counts()
1    2893
0     257
Name: feedback, dtype: int64
Then, we’ll create a spacy_tokenizer() function that accepts a sentence as input and processes the sentence into tokens, performing lemmatization, lowercasing, and removing stop words. This is similar to what we did in the examples earlier in this tutorial, but now we’re putting it all together into a single function for preprocessing each user review we’re analyzing.
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# Create our list of punctuation marks
punctuations = string.punctuation

# Create our list of stopwords
nlp = spacy.load('en')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Load English tokenizer, tagger, parser, NER and word vectors
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens
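A quick usage check of the function above; the exact lemmas depend on the spaCy version (this tutorial relies on the 2.x lookup lemmatizer), so treat the output as approximate:

print(spacy_tokenizer("Challenges and setbacks aren't failures, they're just part of the journey."))
# roughly: ['challenge', 'setback', 'failure', 'journey']  (stop words and punctuation removed)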
To further clean our text data, we’ll also want to create a custom transformer for removing initial and end spaces and converting text into lower case. Here, we will create a custom predictors class which inherits the TransformerMixin class. This class overrides the transform, fit and get_params methods. We’ll also create a clean_text() function that removes spaces and converts text into lowercase.
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):     
    # Removing spaces and converting text into lowercase
    return text.strip().lower()
N-grams are combinations of adjacent words in a given text, where n is the number of words included in each token. For example, in the sentence “Who will win the football world cup in 2022?” unigrams would be a sequence of single words such as “who”, “will”, “win” and so on. Bigrams would be a sequence of 2 contiguous words such as “who will”, “will win”, and so on. So the ngram_range parameter we’ll use in the code below sets the lower and upper bounds of our n-grams (we’ll be using unigrams). Then we’ll assign the n-grams to bow_vector.
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
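As a quick aside (not part of the tutorial’s pipeline), this is one way to see what a larger ngram_range produces, using scikit-learn’s default tokenizer on the sentence from the example above:

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=(1, 2))   # keep both unigrams and bigrams
demo.fit(["Who will win the football world cup in 2022?"])
print(sorted(demo.vocabulary_))              # includes 'who', 'will', ... plus 'who will', 'will win', ...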
Of course, we don’t have to calculate that by hand! We can generate TF-IDF automatically using scikit-learn’s TfidfVectorizer. Again, we’ll tell it to use the custom tokenizer that we built with spaCy, and then we’ll assign the result to the variable tfidf_vector.
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
Conveniently, scikit-learn gives us a built-in function for doing this: train_test_split(). We just need to tell it the feature set we want it to split (X), the labels we want it to test against (ylabels), and the size we want to use for the test set (represented as a percentage in decimal form).
from sklearn.model_selection import train_test_split

X = df_amazon['verified_reviews'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
Once this pipeline is built, we’ll fit the pipeline components using fit().
# Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)
Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x00000254DA6F8940>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
      ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
The documentation links above offer more details and more precise definitions of each term, but the bottom line is that all three metrics are measured from 0 to 1, where 1 is predicting everything completely correctly. Therefore, the closer our model’s scores are to 1, the better.
from sklearn import metrics
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
Logistic Regression Accuracy: 0.9417989417989417
Logistic Regression Precision: 0.9528508771929824
Logistic Regression Recall: 0.9863791146424518

Source: link


import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Get busy living or get busy dying.")

print(f"{'text':{8}} {'POS':{6}} {'TAG':{6}} {'Dep':{6}} {'POS explained':{20}} {'tag explained'} ")
for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')
[Out] :
text     POS    TAG    Dep    POS explained        tag explained 
Get      AUX    VB     ROOT   auxiliary            verb, base form
busy     ADJ    JJ     amod   adjective            adjective
living   NOUN   NN     dobj   noun                 noun, singular or mass
or       CCONJ  CC     cc     coordinating conjunction conjunction, coordinating
get      AUX    VB     conj   auxiliary            verb, base form
busy     ADJ    JJ     acomp  adjective            adjective
dying    VERB   VBG    xcomp  verb                 verb, gerund or present participle
.        PUNCT  .      punct  punctuation          punctuation mark, sentence closer
In [2]
import spacy

nlp = spacy.load("en_core_web_sm")
tag_lst = nlp.pipe_labels['tagger']

print(len(tag_lst))
print(tag_lst)
[Out] :
50
['$', "''", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'XX', '_SP', '``']
import spacy

nlp = spacy.load("en_core_web_sm")
print("Pipeline:", nlp.pipe_names)
doc = nlp("I was heading towards North.")
for token in doc:  
    print(token.text)
    print(token.morph)   ## Printing all the morphological features.
    print(token.morph.get("Number"))   ## Printing a particular type of morphological 
                                       ## features such as Number(Singular or plural).
    print(token.morph.to_dict())       ## Printing the morphological features in dictionary format.
    print('\n\n')
Pipeline: ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
I
Case=Nom|Number=Sing|Person=1|PronType=Prs
['Sing']
{'Case': 'Nom', 'Number': 'Sing', 'Person': '1', 'PronType': 'Prs'}

was
Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
['Sing']
{'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', 'Tense': 'Past', 'VerbForm': 'Fin'}

heading
Aspect=Prog|Tense=Pres|VerbForm=Part
[]
{'Aspect': 'Prog', 'Tense': 'Pres', 'VerbForm': 'Part'}

towards
[]
{}

North
NounType=Prop|Number=Sing
['Sing']
{'NounType': 'Prop', 'Number': 'Sing'}

.
PunctType=Peri
[]
{'PunctType': 'Peri'}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Don't be afraid to give up the good to go for the great")

# Counting the frequencies of different POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
print(POS_counts)

for k,v in sorted(POS_counts.items()):
    print(f'{k:{4}}. {doc.vocab[k].text:{5}}: {v}')
[Out] :
{87: 2, 94: 3, 84: 2, 100: 2, 85: 2, 90: 2, 92: 1}
  84. ADJ  : 2
  85. ADP  : 2
  87. AUX  : 2
  90. DET  : 2
  92. NOUN : 1
  94. PART : 3
 100. VERB : 2
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

# Counting the frequencies of different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

print(TAG_counts)
for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')
[Out] :
{15267657372422890137: 2, 10554686591937588953: 3, 15308085513773655218: 3, 17109001835818727656: 1, 1292078113972184607: 1, 74: 1, 12646065887601541794: 1}
74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You only live once, but if you do it right, once is enough.")
displacy.serve(doc, style="dep")
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper")
options = {'distance': 110, 'compact': 'True', 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

displacy.serve(doc, style="dep",options=options)
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
text = "Life is a beautiful journey that is meant to be embraced to the fullest every day.However, that doesn’t mean you always wake up ready to seize the day, and sometimes need a reminder that life is a great gift."
doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")

Source: link
