Python re Library: Introduction and Practice

Introduction to the re Library

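Common regular expression operators

Operator  Description                                            Example
.         Any single character (except a newline by default)
[]        Character set: the allowed values for one character    [abc] matches a, b, or c; [a-z] matches one character from a to z
[^]       Negated set: the excluded values for one character     [^abc] matches one character other than a, b, or c
*         Previous character repeated 0 or more times            abc* matches ab, abc, abcc, abccc, ...
+         Previous character repeated 1 or more times            abc+ matches abc, abcc, abccc, ...
?         Previous character repeated 0 or 1 times               abc? matches ab, abc
|         Either the left or the right expression                abc|def matches abc, def
{m}       Previous character repeated exactly m times            ab{2}c matches abbc
{m,n}     Previous character repeated m to n times (inclusive)   ab{1,2}c matches abc, abbc
^         Start of the string                                    ^abc matches abc at the start of a string
$         End of the string                                      abc$ matches abc at the end of a string
()        Grouping; often combined with | inside the group       (abc) matches abc; (abc|def) matches abc, def
\d        A digit; equivalent to [0-9] for ASCII text
\w        A word character; equivalent to [A-Za-z0-9_] for ASCII text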
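The repetition and set operators in the table can be checked directly with re.fullmatch, which succeeds only when the whole string matches. The following is a minimal sketch; the test strings and the helper name check are chosen here for illustration and are not part of the original notes.

```python
import re

# Each entry: (pattern, strings that should match in full, strings that should not)
cases = [
    (r"abc*", ["ab", "abc", "abccc"], ["a", "abcd"]),
    (r"abc+", ["abc", "abcc"], ["ab"]),
    (r"abc?", ["ab", "abc"], ["abcc"]),
    (r"ab{2}c", ["abbc"], ["abc", "abbbc"]),
    (r"ab{1,2}c", ["abc", "abbc"], ["ac", "abbbc"]),
    (r"[^abc]", ["d", "1"], ["a", "b"]),
    (r"abc|def", ["abc", "def"], ["abcdef"]),
]

def check(pattern, good, bad):
    """Return True if every string in good matches fully and none in bad does."""
    return (all(re.fullmatch(pattern, s) for s in good)
            and not any(re.fullmatch(pattern, s) for s in bad))

results = {pattern: check(pattern, good, bad) for pattern, good, bad in cases}
for pattern, ok in results.items():
    print(pattern, "OK" if ok else "MISMATCH")
```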

Regular expression syntax examples

Pattern                        Matching strings
P(Y|YT|YTH|YTHO)?N             'PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON'
PYTHON+                        'PYTHON', 'PYTHONN', 'PYTHONNN', ...
PY[TH]ON                       'PYTON', 'PYHON'
PY[^TH]?ON                     'PYON', 'PYaON', 'PYbON', 'PYcON', ...
PY{0,3}N                       'PN', 'PYN', 'PYYN', 'PYYYN'
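As a quick check of two of these patterns, the sketch below filters a list of candidate strings with fullmatch. The candidate lists are made up for the illustration, and the bounded repetition is written as {0,3}, the form Python's re module accepts for "zero to three times".

```python
import re

# Optional group: P, then at most one of Y / YT / YTH / YTHO, then N
optional_group = re.compile(r"P(Y|YT|YTH|YTHO)?N")
candidates = ["PN", "PYN", "PYTN", "PYTHN", "PYTHON", "PYTHOON", "PTN"]
matched = [s for s in candidates if optional_group.fullmatch(s)]
print(matched)  # ['PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON']

# Bounded repetition: Y may appear 0 to 3 times
bounded = re.compile(r"PY{0,3}N")
bounded_matched = [s for s in ["PN", "PYN", "PYYYN", "PYYYYN"] if bounded.fullmatch(s)]
print(bounded_matched)  # ['PN', 'PYN', 'PYYYN']
```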

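Classic regular expressions

Pattern                         Matches
^[A-Za-z]+$                     A string made up only of the 26 letters
^[A-Za-z0-9]+$                  A string made up only of letters and digits
^-?\d+$                         A string in integer form
^[0-9]*[1-9][0-9]*$             A string in positive-integer form
[1-9]\d{5}                      A postal code in mainland China, 6 digits
[\u4e00-\u9fa5]                 A Chinese character
\d{3}-\d{8}|\d{4}-\d{7}         A domestic phone number, e.g. 010-68913536

A regular expression matching an IPv4 address (each part is 0 to 255, and the separating dot must be escaped as \.):

(([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])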
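The IP address pattern is easier to read when the repeated 0 to 255 part is built once and reused. The sketch below does that and wraps the pattern in a small validator; the names OCTET, IP_PATTERN and is_ipv4 are introduced here for illustration only. Note the escaped dot: with a bare . the separator would match any character.

```python
import re

# One part of the address: 0-99, 100-199, 200-249, or 250-255
OCTET = r"([1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"
# Three "part + literal dot" groups followed by a final part
IP_PATTERN = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

def is_ipv4(text):
    """Return True if the whole string is a dotted IPv4 address."""
    return IP_PATTERN.fullmatch(text) is not None

for candidate in ["192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3", "192x168y0z1"]:
    print(candidate, is_ipv4(candidate))
```

Only the first two candidates are accepted. The last one is rejected because the dots are escaped.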

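Representing regular expressions as strings

The re library normally uses raw strings to represent regular expressions, written as r'text'. A raw string does not treat the backslash as an escape character.
Examples: r'[1-9]\d{5}'       r'\d{3}-\d{8}|\d{4}-\d{7}'

The re library also accepts ordinary strings, but every backslash must then be doubled, which is more cumbersome.
Examples: '[1-9]\\d{5}'       '\\d{3}-\\d{8}|\\d{4}-\\d{7}'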
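The two spellings produce exactly the same string object contents, which a short comparison makes visible. This is only an illustrative sketch; the variable names are arbitrary.

```python
import re

raw = r"[1-9]\d{5}"        # raw string: the backslash is kept as written
escaped = "[1-9]\\d{5}"    # ordinary string: the backslash must be doubled
same = raw == escaped
print(same)                        # True
print(len(r"\n"), len("\n"))       # 2 1  (two characters versus one newline)

# Both spellings therefore behave identically as patterns
raw_result = re.search(raw, "BIT 100081").group(0)
escaped_result = re.search(escaped, "BIT 100081").group(0)
print(raw_result, escaped_result)  # 100081 100081
```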

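Main functions of the re library

Function           Description
re.search()        Searches a string for the first location that matches the regular expression; returns a match object, or None if nothing matches
re.match()         Matches the regular expression only at the beginning of a string; returns a match object, or None if nothing matches
re.findall()       Searches a string and returns all matching substrings as a list
re.split()         Splits a string at every match of the regular expression; returns a list
re.finditer()      Searches a string and returns an iterator that yields one match object per match
re.sub()           Replaces every substring that matches the regular expression; returns the resulting string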
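To compare the six functions side by side, the sketch below applies each of them to the same text with the postal-code pattern used elsewhere in these notes. The sample text and the replacement string ":zipcode" are chosen for illustration.

```python
import re

text = "BIT 100081 TSU 100084"
pattern = r"[1-9]\d{5}"

first = re.search(pattern, text)        # first match anywhere in the text
at_start = re.match(pattern, text)      # None, because the text starts with "BIT"
all_codes = re.findall(pattern, text)   # every matching substring
parts = re.split(pattern, text)         # the text between the matches
spans = [m.span() for m in re.finditer(pattern, text)]  # one match object per match
masked = re.sub(pattern, ":zipcode", text)               # replace every match

print(first.group(0))   # 100081
print(at_start)         # None
print(all_codes)        # ['100081', '100084']
print(parts)            # ['BIT ', ' TSU ', '']
print(spans)            # [(4, 10), (15, 21)]
print(masked)           # BIT :zipcode TSU :zipcode
```

Because re.search and re.match return None when nothing matches, the result should be tested before calling .group() on it.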

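For example:
re.search(pattern, string, flags=0)
    ∙ pattern : the regular expression, as a string or raw string
    ∙ string : the string to search
    ∙ flags : control flags applied when the regular expression is used

Common flags                       Description
re.I           re.IGNORECASE: ignores case, so [A-Z] also matches lowercase letters
re.M           re.MULTILINE: lets ^ match at the start of every line of the string, not only at the start of the string
re.S           re.DOTALL: lets . match every character including a newline; by default . matches everything except a newline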
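The effect of the three flags shows up most clearly on a multi-line string. The following sketch uses a made-up three-line text; flags can be combined with the | operator.

```python
import re

text = "Python\npython\nPYTHON"

# re.I: the lowercase pattern matches all three spellings
case_insensitive = re.findall(r"python", text, flags=re.I)

# re.M: ^ anchors at the start of every line instead of only the first
line_starts = re.findall(r"^p\w+", text, flags=re.M | re.I)
only_first_line = re.findall(r"^p\w+", text, flags=re.I)

# re.S: . also matches the newline between the first two lines
dot_default = re.search(r"Python.python", text)
dot_all = re.search(r"Python.python", text, flags=re.S)

print(case_insensitive)   # ['Python', 'python', 'PYTHON']
print(line_starts)        # ['Python', 'python', 'PYTHON']
print(only_first_line)    # ['Python']
print(dot_default)        # None
print(repr(dot_all.group(0)))  # 'Python\npython'
```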

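An equivalent way to use the re library

regex = re.compile(pattern, flags=0)
    ∙ pattern : the regular expression, as a string or raw string
    ∙ flags : control flags applied when the regular expression is used
regex = re.compile(r'[1-9]\d{5}')

Functional usage: a one-off operation
rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

Object-oriented usage: compile once, then reuse the pattern object
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')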
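The object-oriented form is most convenient when the same pattern is applied to many strings. A minimal sketch, with a made-up list of input lines:

```python
import re

# Compile once, outside the loop
zip_code = re.compile(r"[1-9]\d{5}")

lines = ["BIT 100081", "no code here", "TSU 100084"]
found = []
for line in lines:
    m = zip_code.search(line)   # same behavior as re.search(pattern, line)
    if m:                       # search() returns None when nothing matches
        found.append(m.group(0))

print(found)  # ['100081', '100084']
```

The compiled object offers the same operations as the module-level functions (search, match, findall, split, finditer, sub) without the pattern argument.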

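The Match object

A Match object is the result of one match and carries detailed information about it

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))
type(match)
<class 're.Match'>    (Python 3.7 and later; older versions show <class '_sre.SRE_Match'>)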

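Attributes of the Match object

Attribute                 Description
.string                   The text that was searched
.re                       The pattern object (regular expression) used for the match
.pos                      The position in the text where the search started
.endpos                   The position in the text where the search ended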

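Methods of the Match object

Method                    Description
.group(0)                 Returns the matched string
.start()                  Start position of the match in the original string
.end()                    End position of the match in the original string (exclusive)
.span()                   Returns (.start(), .end())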
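The attributes and methods listed above can be inspected on a single match. The sketch below reuses the 'BIT 100081' example from these notes and collects the values in a dictionary named info, a name chosen here for illustration.

```python
import re

m = re.search(r"[1-9]\d{5}", "BIT 100081")

info = {
    "string": m.string,        # the text that was searched
    "pattern": m.re.pattern,   # .re is the pattern object; .pattern is its source
    "pos": m.pos,              # search start position
    "endpos": m.endpos,        # search end position (length of the text here)
    "group": m.group(0),       # the matched substring
    "start": m.start(),
    "end": m.end(),
    "span": m.span(),
}
for key, value in info.items():
    print(key, "=", repr(value))

# The span can be used to slice the original text
assert m.string[m.start():m.end()] == m.group(0)
```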

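By default the re library matches greedily: each repetition operator consumes as much text as possible, so the longest matching substring is returned

Minimal-match (non-greedy) operators

Operator                  Description
*?                        Previous character repeated 0 or more times, minimal match
+?                        Previous character repeated 1 or more times, minimal match
??                        Previous character repeated 0 or 1 times, minimal match
{m,n}?                    Previous character repeated m to n times (inclusive), minimal match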
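The difference between greedy and minimal matching is easiest to see when several substrings of different lengths could match. A short sketch with a made-up string that contains four possible end points:

```python
import re

text = "PYANBNCNDN"

# Greedy: .* runs to the last N in the text
greedy = re.search(r"PY.*N", text).group(0)
# Minimal: .*? stops at the first N that lets the pattern succeed
minimal = re.search(r"PY.*?N", text).group(0)

print(greedy)   # PYANBNCNDN
print(minimal)  # PYAN
```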

The re Library in Practice

Targeted crawler for comparing Taobao product prices

import requests
import re

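def getHTMLText(url):
    """Fetch a page and return its text, or an empty string on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Decode with the encoding guessed from the page content
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Catch only network and HTTP errors instead of using a bare except
        return ""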

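def parsePage(ilt, html):
    """Extract price and title values from the page and append them to ilt."""
    # Capture groups return the quoted values directly, so eval() is not
    # needed on text that comes from the network
    plt = re.findall(r'"view_price":"([\d.]*)"', html)
    tlt = re.findall(r'"raw_title":"(.*?)"', html)
    # zip() stops at the shorter list, so a missing title cannot raise IndexError
    for price, title in zip(plt, tlt):
        ilt.append([price, title])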

def printGoodsList(ilt):
    """Print the collected goods as a numbered table."""
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

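def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 2       # number of result pages to fetch
    start_url = 'http://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        # The s parameter is the item offset; the code assumes 44 items per page
        url = start_url + '&s=' + str(44*i)
        # getHTMLText returns "" on failure, which simply yields no items
        html = getHTMLText(url)
        parsePage(infoList, html)
    printGoodsList(infoList)

if __name__ == '__main__':
    main()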
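The extraction step can be tried without any network access. The sketch below runs two findall calls with capture groups on a hand-written sample; the sample text, its product names and prices are invented for the illustration and only imitate the "raw_title" and "view_price" fields that the crawler searches for. Capture groups are shown here as one way to obtain the values without the surrounding quotes.

```python
import re

# Hypothetical fragment imitating the fields targeted by the crawler
sample = ('{"raw_title":"Canvas backpack","view_price":"59.00"},'
          '{"raw_title":"Travel bag","view_price":"128.50"}')

# With one capture group, findall returns only the captured text
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)   # minimal match stops at the first closing quote

goods = list(zip(prices, titles))
for number, (price, title) in enumerate(goods, start=1):
    print("{:4}\t{:8}\t{:16}".format(number, price, title))
```

The minimal match .*? matters for the title: a greedy .* would run on to the last quotation mark in the text.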

Targeted crawler for stock data

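import requests
from bs4 import BeautifulSoup
import re

def getHTMLText(url, code="utf-8"):
    """Fetch a page decoded with the given encoding; return "" on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Set the encoding directly instead of guessing it from the content
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""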
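def getStockList(lst, stockURL):
    """Collect stock codes (sh or sz followed by six digits) from the list page."""
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        # Links without an href, or without a stock code in it, are skipped
        m = re.search(r"s[hz]\d{6}", a.attrs.get('href', ''))
        if m:
            lst.append(m.group(0))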
def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock page and append its fields to fpath, one dict per line."""
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'Stock name': name.text.split()[0]})

            # Each <dt> holds a field name and the matching <dd> its value
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for key, val in zip(keyList, valueList):
                infoDict[key.text] = val.text

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # A page with an unexpected layout is skipped
            continue
        finally:
            # Runs on success, on failure, and on the early continue,
            # so the progress counter always advances
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'http://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

if __name__ == '__main__':
    main()
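The only regular expression in this crawler is the stock-code pattern applied to each link. It can be tested in isolation, as in the sketch below; the href values are hypothetical examples written for this illustration, not links taken from the real page.

```python
import re

# Hypothetical link targets; only the first two contain a stock code
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]

# "s", then "h" or "z", then exactly six digits
code_pattern = re.compile(r"s[hz]\d{6}")

codes = []
for href in hrefs:
    m = code_pattern.search(href)
    if m:                       # links without a code are skipped
        codes.append(m.group(0))

print(codes)  # ['sh600000', 'sz000001']
```

Testing the result of search() takes the place of indexing findall(...)[0] inside a try block, and it skips the same links.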