  1. Mar 25, 2016
      [SPARK-14014][SQL] Integrate session catalog (attempt #2) · 20ddf5fd
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
      
      ## How was this patch tested?
      
      See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11938 from andrewor14/session-catalog-again.
  2. Mar 24, 2016
  3. Mar 23, 2016
      [SPARK-14014][SQL] Replace existing catalog with SessionCatalog · 5dfc0197
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
      
      As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
      - SPARK-14013: Properly implement temporary functions in `SessionCatalog`
      - SPARK-13879: Decide which DDL/DML commands to support natively in Spark
      - SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
      - SPARK-?????: Merge SQL/HiveContext
      
      ## How was this patch tested?
      
      This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #11836 from andrewor14/use-session-catalog.
  4. Mar 08, 2016
      [SPARK-13593] [SQL] improve the `createDataFrame` to accept data type string and verify the data · d57daf1f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR improves the `createDataFrame` method to also accept a data type string, so users can convert a Python RDD to a DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`.
      It also supports flat schemas, so users can convert an RDD of ints to a DataFrame directly; we automatically wrap each int in a row for them.
      If a schema is given, we now check whether the real data matches it, and throw an error if it doesn't.
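The schema-string path described above can be sketched in plain Python (no Spark required); `parse_schema`, `verify_row`, and the type map below are hypothetical illustrations, not PySpark internals:

```python
# Parse a schema string like "a: int, b: string" into (name, type) pairs,
# then verify each row against the declared schema.
_TYPE_MAP = {"int": int, "string": str, "double": float}

def parse_schema(schema_str):
    fields = []
    for part in schema_str.split(","):
        name, type_name = (s.strip() for s in part.split(":"))
        fields.append((name, _TYPE_MAP[type_name]))
    return fields

def verify_row(row, fields):
    # Throw an error if the real data does not match the given schema.
    if len(row) != len(fields):
        raise TypeError("field count mismatch: %r" % (row,))
    for value, (name, py_type) in zip(row, fields):
        if value is not None and not isinstance(value, py_type):
            raise TypeError("field %s: expected %s, got %r"
                            % (name, py_type.__name__, value))
```

With the real API the same idea surfaces as `rdd.toDF("a: int, b: string")`, with verification failing on mismatched rows.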
      
      ## How was this patch tested?
      
      new tests in `test.py` and doc test in `types.py`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11444 from cloud-fan/pyrdd.
  5. Mar 02, 2016
  6. Feb 24, 2016
      [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code · a60f9128
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass so many parameters around in many places. This PR abstracts the Python function along with its context, to simplify some PySpark code and make the logic clearer.
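As a rough sketch of the idea in plain Python (the class and field names here are illustrative, not Spark's actual `PythonFunction`): bundle the function and its context into one object so call sites pass a single parameter:

```python
# Bundle a Python function with its execution context so only one object
# needs to be passed around instead of many loose parameters.
from collections import namedtuple

PythonFunction = namedtuple(
    "PythonFunction",
    ["command", "env_vars", "python_includes", "python_exec"])

def make_python_function(func, env_vars=None, includes=(), exec_path="python"):
    return PythonFunction(func, dict(env_vars or {}), list(includes), exec_path)
```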
      
      ## How was this patch tested?
      
      by existing unit tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11342 from cloud-fan/python-clean.
  7. Feb 21, 2016
      [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we used `prettyString` as column names when possible, and the resulting column names could sometimes be weird. Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
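A minimal Python sketch of the hierarchy described above (class shapes are illustrative; Spark's actual implementation is in Scala):

```python
# Expressions expose a user-facing sql() representation; NonSQLExpression
# marks those with no SQL form and may return an arbitrary string instead.
class Expression:
    def sql(self):
        raise NotImplementedError

class Attribute(Expression):
    def __init__(self, name):
        self.name = name
    def sql(self):
        return self.name

class And(Expression):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def sql(self):
        # "a && b" renders as "a AND b", matching the table above.
        return "%s AND %s" % (self.left.sql(), self.right.sql())

class NonSQLExpression(Expression):
    def sql(self):
        # Arbitrary user-facing representation, not valid SQL.
        return "<%s>" % type(self).__name__
```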
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
  8. Jan 24, 2016
      [SPARK-12120][PYSPARK] Improve exception message when failing to initialize HiveContext in PySpark · e789b1d2
      Jeff Zhang authored
      
      davies, mind to review?
      
      This is the error message after this PR:
      
      ```
      15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
      /Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
        warnings.warn("You must build Spark with Hive. "
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
          return DataFrameReader(self)
        File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
          self._jreader = sqlContext._ssql_ctx.read()
        File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
          raise e
      py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
      : java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
      	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
      	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
      	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
      	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
      	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
      	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
      	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
      	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
      	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
      	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
      	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
      	at py4j.Gateway.invoke(Gateway.java:214)
      	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
      	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
      	at py4j.GatewayConnection.run(GatewayConnection.java:209)
      	at java.lang.Thread.run(Thread.java:745)
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10126 from zjffdu/SPARK-12120.
  9. Jan 04, 2016
  10. Dec 30, 2015
      [SPARK-12300] [SQL] [PYSPARK] fix schema inferance on local collections · d1ca634d
      Holden Karau authored
      Current schema inference for local Python collections halts as soon as it finds a schema with no NullTypes left. This differs from specifying a sampling ratio of 1.0 on a distributed collection, and could result in incomplete schema information.
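The intent of the fix can be sketched in plain Python (the helper name and dict-based row representation are hypothetical): merge type information across every row instead of stopping at the first fully-resolved one:

```python
# Merge inferred field types across all rows of a local collection, so a
# field that is None in early rows can still be resolved by a later row.
def infer_field_types(rows):
    merged = {}
    for row in rows:
        for key, value in row.items():
            if value is None:
                merged.setdefault(key, None)   # unresolved (NullType)
            elif merged.get(key) is None:
                merged[key] = type(value)      # resolve, even on later rows
    return merged
```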
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
  11. Nov 26, 2015
  12. Nov 25, 2015
  13. Nov 12, 2015
  14. Nov 02, 2015
      [SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema · f92f334c
      Jason White authored
      When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
      
      Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
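The lazy alternative suggested above can be sketched as a plain Python generator (illustrative, not PySpark's code): nothing is validated, and no job runs, until rows are actually consumed:

```python
# Validate rows against the schema arity lazily, at consumption time,
# instead of eagerly calling .take(10) and forcing DAG execution.
def verified_rows(rows, num_fields):
    for row in rows:
        if len(row) != num_fields:
            raise ValueError("row %r does not match schema arity %d"
                             % (row, num_fields))
        yield row
```

Constructing the generator does no work; verification happens only as downstream code pulls rows.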
      
      marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #9392 from JasonMWhite/createDataFrame_without_take.
  15. Oct 19, 2015
  16. Sep 08, 2015
  17. Aug 13, 2015
  18. Jul 30, 2015
      [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      This also lets us create a Python UDT without having a Scala counterpart, which is important for Python users.
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
  19. Jul 20, 2015
      [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type · 9f913c4f
      Davies Liu authored
      This PR also removes the duplicated code between `registerFunction` and `UserDefinedFunction`.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7450 from davies/fix_return_type and squashes the following commits:
      
      e80bf9f [Davies Liu] remove debugging code
      f94b1f6 [Davies Liu] fix mima
      8f9c58b [Davies Liu] convert returned object from UDF into internal type
  20. Jul 09, 2015
      [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of... · c9e2ef52
      Davies Liu authored
      [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of serialization for Python DataFrame
      
      This PR fixes the long-standing serialization issue between Python RDDs and DataFrames by switching to a customized Pickler for `InternalRow`, enabling customized unpickling (type conversion, especially for UDTs). Now we can support UDTs for UDFs, cc mengxr.
      
      There is no generated `Row` anymore.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7301 from davies/sql_ser and squashes the following commits:
      
      81bef71 [Davies Liu] address comments
      e9217bd [Davies Liu] add regression tests
      db34167 [Davies Liu] Refactor of serialization for Python DataFrame
  21. Jun 30, 2015
      [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe... · b6e76edf
      x1- authored
      [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
      
      Because the implicit names in `pandas.columns` are ints, while the `StructField` JSON expects `String`, the `pandas.columns` values should be converted to `String`.
      
      ### issue
      
      * [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)
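The fix amounts to stringifying the default integer labels before building the `StructField` JSON; a minimal sketch, with a plain list standing in for `pandas.columns`:

```python
# A Pandas DataFrame created without explicit column names gets integer
# labels (0, 1, ...); convert them to strings before schema construction.
def stringify_columns(columns):
    return [str(c) for c in columns]
```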
      
      Author: x1- <viva008@gmail.com>
      
      Closes #7124 from x1-/SPARK-8535 and squashes the following commits:
      
      d68fd38 [x1-] modify unit-test using pandas.
      ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.
      [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API · 58ee2a2e
      Davies Liu authored
      Capture the `AnalysisException` in SQL, hide the long Java stack trace, and show only the error message.
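A hedged sketch of the mechanism (the helper below is illustrative, not PySpark's actual `utils.py`): map a Java-side exception class name onto a concise Python exception, keeping only the message and dropping the Java stack trace:

```python
# Translate a Java-side AnalysisException into a short Python exception.
class AnalysisException(Exception):
    pass

def capture_sql_exception(java_class_name, java_message):
    if java_class_name.endswith("AnalysisException"):
        # Keep only the first line of the Java message for the user.
        raise AnalysisException(java_message.split("\n")[0])
    raise RuntimeError(java_message)
```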
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7135 from davies/ananylis and squashes the following commits:
      
      dad7ae7 [Davies Liu] add comment
      ec0c0e8 [Davies Liu] Update utils.py
      cdd7edd [Davies Liu] add doc
      7b044c2 [Davies Liu] fix python 3
      f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
  22. Jun 29, 2015
      [SPARK-8070] [SQL] [PYSPARK] avoid spark jobs in createDataFrame · afae9766
      Davies Liu authored
      Avoid the unnecessary Spark jobs when inferring the schema from a list.
      
      cc yhuai mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6606 from davies/improve_create and squashes the following commits:
      
      a5928bf [Davies Liu] Update MimaExcludes.scala
      62da911 [Davies Liu] fix mima
      bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create
      eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      8d9292d [Davies Liu] Update context.py
      eb24531 [Davies Liu] Update context.py
      c969997 [Davies Liu] bug fix
      d5a8ab0 [Davies Liu] fix tests
      8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
      6ea5925 [Davies Liu] address comments
      6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
  23. Jun 22, 2015
      [SPARK-8104] [SQL] auto alias expressions in analyzer · da7bbb94
      Wenchen Fan authored
      Currently we auto-alias expressions in the parser. However, during the parser phase we don't have enough information to pick the right alias. For example, a `Generator` that produces more than one kind of element needs `MultiAlias`, and `ExtractValue` doesn't need an `Alias` if it's in the middle of an `ExtractValue` chain.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6647 from cloud-fan/alias and squashes the following commits:
      
      552eba4 [Wenchen Fan] fix python
      5b5786d [Wenchen Fan] fix agg
      73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
      4cfd23c [Wenchen Fan] fix order by
      d18f401 [Wenchen Fan] refine
      9f07359 [Wenchen Fan] address comments
      39c1aef [Wenchen Fan] small fix
      33640ec [Wenchen Fan] auto alias expressions in analyzer
  24. Jun 03, 2015
  25. May 23, 2015
      [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates · efe3bfdf
      Davies Liu authored
      1. `ntile` should take an integer as a parameter.
      2. Added Python API (based on #6364).
      3. Updated documentation of various DataFrame Python functions.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6374 from rxin/window-final and squashes the following commits:
      
      69004c7 [Reynold Xin] Style fix.
      288cea9 [Reynold Xin] Update documentaiton.
      7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
      66092b4 [Davies Liu] update docs
      ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
      ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
      8936ade [Davies Liu] fix maxint in python 3
      2649358 [Davies Liu] update docs
      778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
  26. May 21, 2015
      [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
  27. May 19, 2015
      [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python · 4de74d26
      Davies Liu authored
      cc rxin, please take a quick look, I'm working on tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6238 from davies/readwrite and squashes the following commits:
      
      c7200eb [Davies Liu] update tests
      9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      f0c5a04 [Davies Liu] use sqlContext.read.load
      5f68bc8 [Davies Liu] update tests
      6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      bcc6668 [Davies Liu] add reader amd writer API in Python
  28. May 18, 2015
      [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR reverts #5404 and instead passes the driver's Python version into the JVM; the worker checks it before deserializing the closure, so running a different major version of Python is detected instead of failing obscurely.
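The check can be sketched like this (function names are illustrative; the real check lives in the PySpark worker):

```python
# The driver records its Python version; the worker compares major versions
# before deserializing the closure and fails fast on a mismatch.
import sys

def driver_python_version():
    return "%d.%d" % sys.version_info[:2]

def check_worker_version(driver_version):
    worker_major = str(sys.version_info[0])
    if driver_version.split(".")[0] != worker_major:
        raise RuntimeError(
            "Python in worker has different major version (%s) than driver (%s)"
            % (worker_major, driver_version))
```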
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
  29. Apr 21, 2015
      [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression · ab9128fb
      Davies Liu authored
      This PR enables `auto_convert` in `JavaGateway`, so we can register a converter for given types, for example, `date` and `datetime`.
      
      There are two bugs related to `auto_convert`, see [1] and [2]; we work around them in this PR.
      
      [1]  https://github.com/bartdag/py4j/issues/160
      [2] https://github.com/bartdag/py4j/issues/161
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5570 from davies/py4j_date and squashes the following commits:
      
      eb4fa53 [Davies Liu] fix tests in python 3
      d17d634 [Davies Liu] rollback changes in mllib
      2e7566d [Davies Liu] convert tuple into ArrayList
      ceb3779 [Davies Liu] Update rdd.py
      3c373f3 [Davies Liu] support date and datetime by auto_convert
      cb094ff [Davies Liu] enable auto convert
  30. Apr 20, 2015
  31. Apr 16, 2015
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  32. Apr 08, 2015
  33. Mar 31, 2015
      [Doc] Improve Python DataFrame documentation · 305abe1e
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:
      
      1841b60 [Reynold Xin] Lint.
      f2007f1 [Reynold Xin] functions and types.
      bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
      ac1d4c0 [Reynold Xin] Bug fix.
      b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
      608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.
  34. Mar 30, 2015
      [SPARK-6603] [PySpark] [SQL] add SQLContext.udf and deprecate inferSchema() and applySchema · f76d2e55
      Davies Liu authored
      This PR creates an alias for `registerFunction` as `udf.register`, to be consistent with the Scala API.
      
      It also deprecates `inferSchema()` and `applySchema()`, showing a warning for them.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5273 from davies/udf and squashes the following commits:
      
      476e947 [Davies Liu] address comments
      c096fdb [Davies Liu] add SQLContext.udf and deprecate inferSchema() and applySchema
  35. Feb 27, 2015
      [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType · e0e64ba4
      Davies Liu authored
      The `__eq__` of `DataType` is not correct and the class cache is not used correctly (a created class cannot be found by its `DataType`), so lots of classes get created (saved in `_cached_cls`) and never released.
      
      Also, all instances of the same `DataType` have the same hash code, so many objects in a dict end up sharing one hash bucket, and accessing such a dict becomes very slow (depending on the CPython implementation).
      
      This PR also improves the performance of `inferSchema` (avoiding the unnecessary converters for objects).
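The hash problem can be sketched in a few lines (class shape is illustrative, not PySpark's actual `DataType`): when `__hash__` derives from the type's actual identity rather than a shared constant, dict lookups stay fast:

```python
# If every DataType instance hashed to the same constant, dict buckets would
# collapse and lookups degrade to linear scans. Deriving __eq__ and __hash__
# from the type's real identity keeps dict access O(1) on average.
class DataType:
    def __init__(self, name, params=()):
        self.name, self.params = name, tuple(params)
    def __eq__(self, other):
        return (type(other) is type(self)
                and (self.name, self.params) == (other.name, other.params))
    def __hash__(self):
        return hash((self.name, self.params))  # not one constant per class
```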
      
      cc pwendell  JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4808 from davies/leak and squashes the following commits:
      
      6a322a4 [Davies Liu] tests refactor
      3da44fc [Davies Liu] fix __eq__ of Singleton
      534ac90 [Davies Liu] add more checks
      46999dc [Davies Liu] fix tests
      d9ae973 [Davies Liu] fix memory leak in sql