pymars 0.10.0


pip install pymars

  Latest version

Released: Jan 10, 2023

Meta
Author: Qin Xuye
Maintainer: Qin Xuye

Classifiers

Operating System
  • OS Independent

Programming Language
  • Python
  • Python :: 3
  • Python :: 3.7
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: Implementation :: CPython

Topic
  • Software Development :: Libraries

Mars is a tensor-based unified framework for large-scale data computation that scales NumPy, pandas, scikit-learn and many other libraries.

Documentation, Chinese documentation

Installation

Mars is easy to install with pip:

pip install pymars

Installation for Developers

When you want to contribute code to Mars, you can follow the instructions below to install Mars for development:

git clone https://github.com/mars-project/mars.git
cd mars
pip install -e ".[dev]"

More details about installing Mars can be found in the installation section of the Mars documentation.

Architecture Overview

Architecture diagram: https://raw.githubusercontent.com/mars-project/mars/master/docs/source/images/architecture.png

Getting Started

Start a new runtime locally via:

>>> import mars
>>> mars.new_session()

Or connect to a Mars cluster that is already initialized:

>>> import mars
>>> mars.new_session('http://<web_ip>:<ui_port>')
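Once a session exists, deferred Mars objects are computed with .execute(). A minimal end-to-end sketch, assuming a local session:

>>> import mars
>>> import mars.tensor as mt
>>> mars.new_session()          # create a local default session
>>> t = mt.ones((3, 3))         # a deferred Mars tensor
>>> t.sum().execute()           # triggers the actual computation
9.0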

Mars Tensor

Mars tensor provides a familiar interface like NumPy.

NumPy:

import numpy as np
N = 200_000_000
a = np.random.uniform(-1, 1, size=(N, 2))
print((np.linalg.norm(a, axis=1) < 1)
      .sum() * 4 / N)

3.14174502
CPU times: user 11.6 s, sys: 8.22 s, total: 19.9 s
Wall time: 22.5 s

Mars tensor:

import mars.tensor as mt
N = 200_000_000
a = mt.random.uniform(-1, 1, size=(N, 2))
print(((mt.linalg.norm(a, axis=1) < 1)
       .sum() * 4 / N).execute())

3.14161908
CPU times: user 966 ms, sys: 544 ms, total: 1.51 s
Wall time: 3.77 s

Mars can leverage multiple cores, even on a laptop, and can be even faster in a distributed setting.

Mars DataFrame

Mars DataFrame provides a familiar interface like pandas.

pandas:

import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.rand(100000000, 4),
    columns=list('abcd'))
print(df.sum())

CPU times: user 10.9 s, sys: 2.69 s, total: 13.6 s
Wall time: 11 s

Mars DataFrame:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(
    mt.random.rand(100000000, 4),
    columns=list('abcd'))
print(df.sum().execute())

CPU times: user 1.21 s, sys: 212 ms, total: 1.42 s
Wall time: 2.75 s

Mars Learn

Mars learn provides a familiar interface like scikit-learn.

scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

Mars learn:

from mars.learn.datasets import make_blobs
from mars.learn.decomposition import PCA
X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

Mars learn also integrates with many libraries, including TensorFlow, PyTorch, XGBoost, LightGBM, joblib and statsmodels.
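As an illustrative sketch of one such integration, distributed XGBoost training can be driven from Mars-native data. This assumes the mars.learn.contrib.xgboost module and its scikit-learn-style XGBClassifier wrapper; check the Mars docs for the exact API of this release:

import mars.tensor as mt
from mars.learn.contrib.xgboost import XGBClassifier  # assumed contrib wrapper

# Mars-native training data; chunks are distributed automatically
X = mt.random.rand(100_000, 4, chunk_size=25_000)
y = (mt.random.rand(100_000, chunk_size=25_000) > 0.5).astype(int)

clf = XGBClassifier(n_estimators=10)
clf.fit(X, y)                    # trains XGBoost over the Mars chunks
print(clf.predict(X).execute())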

Mars remote

Mars remote allows users to execute functions in parallel.

Vanilla function calls:

import numpy as np

def calc_chunk(n, i):
    rs = np.random.RandomState(i)
    a = rs.uniform(-1, 1, size=(n, 2))
    d = np.linalg.norm(a, axis=1)
    return (d < 1).sum()

def calc_pi(fs, N):
    return sum(fs) * 4 / N

N = 200_000_000
n = 10_000_000

fs = [calc_chunk(n, i)
      for i in range(N // n)]
pi = calc_pi(fs, N)
print(pi)

3.1416312
CPU times: user 32.2 s, sys: 4.86 s, total: 37.1 s
Wall time: 12.4 s

Mars remote:

import numpy as np
import mars.remote as mr

def calc_chunk(n, i):
    rs = np.random.RandomState(i)
    a = rs.uniform(-1, 1, size=(n, 2))
    d = np.linalg.norm(a, axis=1)
    return (d < 1).sum()

def calc_pi(fs, N):
    return sum(fs) * 4 / N

N = 200_000_000
n = 10_000_000

fs = [mr.spawn(calc_chunk, args=(n, i))
      for i in range(N // n)]
pi = mr.spawn(calc_pi, args=(fs, N))
print(pi.execute().fetch())

3.1416312
CPU times: user 616 ms, sys: 307 ms, total: 923 ms
Wall time: 3.99 s

DASK on Mars

Refer to DASK on Mars for more information.
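As a brief sketch of what this looks like, based on the Dask on Mars documentation (mars.contrib.dask.mars_scheduler is assumed to be available in this release):

import dask
from mars.contrib.dask import mars_scheduler

def inc(x):
    return x + 1

# Build an ordinary Dask task graph...
task = dask.delayed(inc)(1)
# ...then execute it on Mars by passing the Mars scheduler
print(task.compute(scheduler=mars_scheduler))  # -> 2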

Eager Mode

Mars supports an eager mode, which makes it friendly for development and easy to debug.

Users can enable eager mode through options; set the options at the beginning of the program or console session:

>>> from mars.config import options
>>> options.eager_mode = True

Or use a context:

>>> from mars.config import option_context
>>> with option_context() as options:
...     options.eager_mode = True
...     # the eager mode is on only for the with statement
...     ...

When eager mode is on, tensors, DataFrames, etc. are executed immediately in the default session as soon as they are created.

>>> import mars.tensor as mt
>>> import mars.dataframe as md
>>> from mars.config import options
>>> options.eager_mode = True
>>> t = mt.arange(6).reshape((2, 3))
>>> t
array([[0, 1, 2],
       [3, 4, 5]])
>>> df = md.DataFrame(t)
>>> df.sum()
0    3
1    5
2    7
dtype: int64

Mars on Ray

Mars also has deep integration with Ray: it can run on Ray efficiently and interact with the large ecosystem of machine learning and distributed systems built on top of Ray.

Start a new Mars on Ray runtime locally via:

import mars
mars.new_session(backend='ray')
# Perform compute

Interact with Ray Dataset:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(
    mt.random.rand(10_000_000, 4),
    columns=list('abcd'))
# Convert a Mars DataFrame to a Ray Dataset
ds = md.to_ray_dataset(df)
print(ds.schema(), ds.count())
ds.filter(lambda row: row["a"] > 0.5).show(5)
# Convert a Ray Dataset back to a Mars DataFrame
df2 = md.read_ray_dataset(ds)
print(df2.head(5).execute())

Refer to Mars on Ray for more information.

Easy to scale in and scale out

Mars can scale in to a single machine, and scale out to a cluster with thousands of machines. It’s fairly simple to migrate from a single machine to a cluster to process more data or gain better performance.

Bare Metal Deployment

Mars is easy to scale out to a cluster: start the different components of the Mars distributed runtime on different machines in the cluster.

One node can be selected as the supervisor, which also integrates a web service, leaving the other nodes as workers. The supervisor can be started with the following command:

mars-supervisor -h <host_name> -p <supervisor_port> -w <web_port>

Workers can be started with the following command:

mars-worker -h <host_name> -p <worker_port> -s <supervisor_endpoint>

After all Mars processes are started, users can run:

>>> import mars
>>> sess = mars.new_session('http://<web_ip>:<ui_port>')
>>> # perform computation

Kubernetes Deployment

Refer to Run on Kubernetes for more information.
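A minimal sketch of programmatic deployment, following the Mars docs (mars.deploy.kubernetes.new_cluster and the kubeconfig-based client are assumed; treat the exact signature as an assumption):

from kubernetes import config
from mars.deploy.kubernetes import new_cluster

# Build a Kubernetes API client from the local kubeconfig,
# then launch Mars supervisors and workers inside the cluster
cluster = new_cluster(config.new_client_from_config())

# Computations can then be submitted through the cluster's session,
# as in the local getting-started example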

Yarn Deployment

Refer to Run on Yarn for more information.
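A rough sketch under stated assumptions (mars.deploy.yarn.new_cluster and its environment argument follow the Mars docs; the JAVA_HOME and conda paths below are placeholders):

import os
from mars.deploy.yarn import new_cluster

# JAVA_HOME must point at the JDK used by Hadoop (placeholder path)
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk'
# Mars ships a Python environment to the YARN containers; a conda
# environment is assumed here (placeholder value)
cluster = new_cluster(environment='conda://path/to/env')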

Getting involved

Thank you in advance for your contributions!

Release history

0.10.0 Jan 10, 2023
0.10.0a1 Jun 12, 2022
0.9.0 Jun 12, 2022
0.9.0rc3 May 07, 2022
0.9.0rc2 Apr 09, 2022
0.9.0rc1 Mar 22, 2022
0.9.0b2 Mar 08, 2022
0.9.0b1 Feb 21, 2022
0.9.0a2 Feb 03, 2022
0.9.0a1 Dec 16, 2021
0.8.7 May 10, 2022
0.8.6 May 07, 2022
0.8.5 Apr 09, 2022
0.8.4 Mar 23, 2022
0.8.3 Mar 08, 2022
0.8.2 Feb 21, 2022
0.8.1 Feb 03, 2022
0.8.0 Dec 16, 2021
0.8.0rc1 Oct 23, 2021
0.8.0b2 Oct 09, 2021
0.8.0b1 Sep 21, 2021
0.8.0a3 Sep 05, 2021
0.8.0a2 Aug 23, 2021
0.8.0a1 Aug 06, 2021
0.7.5 Oct 23, 2021
0.7.4 Oct 09, 2021
0.7.3 Sep 21, 2021
0.7.2.post1 Sep 10, 2021
0.7.2 Sep 05, 2021
0.7.1 Aug 23, 2021
0.7.0 Aug 06, 2021
0.7.0rc2 Jul 09, 2021
0.7.0rc1 Jun 26, 2021
0.7.0b2 Jun 08, 2021
0.7.0b1 Apr 11, 2021
0.7.0a8 Mar 31, 2021
0.7.0a7 Mar 08, 2021
0.7.0a6 Feb 25, 2021
0.7.0a5 Feb 01, 2021
0.7.0a4 Jan 17, 2021
0.7.0a3 Jan 03, 2021
0.7.0a2 Dec 20, 2020
0.7.0a1 Dec 05, 2020
0.6.11 Jul 09, 2021
0.6.10 Jun 25, 2021
0.6.9 Jun 08, 2021
0.6.8 Apr 11, 2021
0.6.7 Mar 22, 2021
0.6.6 Mar 07, 2021
0.6.5 Feb 22, 2021
0.6.4 Jan 30, 2021
0.6.3 Jan 17, 2021
0.6.2 Jan 03, 2021
0.6.1 Dec 20, 2020
0.6.0 Dec 05, 2020
0.6.0rc1 Nov 24, 2020
0.6.0b2 Nov 07, 2020
0.6.0b1 Oct 24, 2020
0.6.0a3 Sep 30, 2020
0.6.0a2 Sep 14, 2020
0.6.0a1 Aug 29, 2020
0.5.5 Nov 24, 2020
0.5.4 Nov 07, 2020
0.5.3 Oct 24, 2020
0.5.2 Sep 30, 2020
0.5.1 Sep 14, 2020
0.5.0 Aug 29, 2020
0.5.0rc1 Aug 15, 2020
0.5.0b3 Aug 03, 2020
0.5.0b2 Jul 27, 2020
0.5.0b1 Jul 11, 2020
0.5.0a3 Jun 25, 2020
0.5.0a2 May 29, 2020
0.5.0a1 May 23, 2020
0.4.7 Aug 29, 2020
0.4.6 Aug 16, 2020
0.4.5 Aug 05, 2020
0.4.4 Jul 29, 2020
0.4.3 Jul 12, 2020
0.4.2 Jun 22, 2020
0.4.1 May 29, 2020
0.4.0 May 23, 2020
0.4.0rc1 Apr 25, 2020
0.4.0b2 Mar 30, 2020
0.4.0b1 Mar 20, 2020
0.4.0a2 Feb 22, 2020
0.4.0a1 Jan 19, 2020
0.3.4 Apr 24, 2020
0.3.3 Mar 31, 2020
0.3.2 Mar 21, 2020
0.3.1 Feb 22, 2020
0.3.0 Jan 17, 2020
0.3.0rc1 Dec 15, 2019
0.3.0b2 Nov 15, 2019
0.3.0b1 Oct 19, 2019
0.3.0a2 Sep 21, 2019
0.3.0a1 Aug 24, 2019
0.2.4 Dec 14, 2019
0.2.3 Nov 15, 2019
0.2.2 Oct 19, 2019
0.2.1 Sep 21, 2019
0.2.0 Aug 24, 2019
0.2.0rc1 Jul 29, 2019
0.2.0b2 Jul 05, 2019
0.2.0b1 May 31, 2019
0.2.0a3 Apr 23, 2019
0.2.0a2 Mar 01, 2019
0.2.0a1 Jan 24, 2019
0.1.5 Jul 24, 2019
0.1.4 Jul 06, 2019
0.1.3 May 31, 2019
0.1.2 Apr 20, 2019
0.1.1 Mar 01, 2019
0.1.0 Jan 24, 2019
0.1.0b1 Dec 29, 2018
0.1.0a2 Dec 21, 2018
0.1.0a1 Dec 19, 2018

Wheel compatibility matrix

Platforms with published wheels (for CPython 3.7, 3.8, 3.9 and 3.10):

  • macosx_10_9_universal2
  • macosx_10_9_x86_64
  • manylinux1_x86_64
  • manylinux2014_aarch64
  • manylinux2014_x86_64
  • manylinux_2_17_aarch64
  • manylinux_2_17_x86_64
  • manylinux_2_5_x86_64
  • win32
  • win_amd64

Files in release

Dependencies:
  • numpy (>=1.14.0)
  • pandas (>=1.0.0)
  • scipy (>=1.0.0)
  • scikit-learn (>=0.20)
  • numexpr (>=2.6.4)
  • cloudpickle (>=1.5.0)
  • pyyaml (>=5.1)
  • psutil (>=5.9.0)
  • tornado (>=6.0)
  • sqlalchemy (>=1.2.0)
  • defusedxml (>=0.5.0)
  • tqdm (>=4.1.0)
  • pickle5
  • shared-memory38 (>=0.1.1)
  • uvloop (>=0.14.0)