  1. Mar 31, 2015
    • [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python. · b80a030e
      Reynold Xin authored
      To maintain consistency with the Scala API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5284 from rxin/df-na-alias and squashes the following commits:
      
      19f46b7 [Reynold Xin] Show DataFrameNaFunctions in docs.
      6618118 [Reynold Xin] [SPARK-6623][SQL] Alias DataFrame.na.drop and DataFrame.na.fill in Python.
  2. Mar 30, 2015
    • [SPARK-6119][SQL] DataFrame support for missing data handling · b8ff2bc6
      Reynold Xin authored
      This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
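
To make the drop and fill semantics concrete, here is a small pure-Python model over rows stored as dicts, with `None` standing in for a missing value. The helper names `drop_na` and `fill_na` and their parameters are hypothetical; they only mirror ideas named in the squashed commits ("how", column subsets, fill with a map). This is a sketch, not the Spark implementation, and it ignores column types.

```python
def drop_na(rows, how="any", subset=None):
    """Drop rows with missing values (None).

    how="any" drops a row if any inspected column is None;
    how="all" drops it only if every inspected column is None.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        missing = [row[c] is None for c in cols]
        drop = any(missing) if how == "any" else all(missing)
        if not drop:
            kept.append(row)
    return kept


def fill_na(rows, value, subset=None):
    """Replace None values. `value` is a scalar or a {column: value} map."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, cell in row.items():
            if cell is not None:
                continue
            if isinstance(value, dict):
                if col in value:          # only columns named in the map
                    new_row[col] = value[col]
            elif subset is None or col in subset:
                new_row[col] = value
        filled.append(new_row)
    return filled


rows = [
    {"name": "Alice", "age": 10},
    {"name": "Bob", "age": None},
    {"name": None, "age": None},
]
```

With these rows, `drop_na(rows)` keeps only Alice, while `drop_na(rows, how="all")` also keeps Bob because only one of his columns is missing.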
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5274 from rxin/df-missing-value and squashes the following commits:
      
      4ee1b98 [Reynold Xin] Improve error reporting in Python.
      33a330c [Reynold Xin] Remove replace for now.
      bc4fdbb [Reynold Xin] Added documentation for replace.
      d56f5a5 [Reynold Xin] Added replace for Scala/Java.
      2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
      914a374 [Reynold Xin] fill with map.
      185c67e [Reynold Xin] Allow specifying column subsets in fill.
      749eb47 [Reynold Xin] fillna
      249b94e [Reynold Xin] Removing undefined functions.
      6a73c68 [Reynold Xin] Missing file.
      67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
  3. Mar 29, 2015
    • [DOC] Improvements to Python docs. · 5eef00d0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5238 from rxin/pyspark-docs and squashes the following commits:
      
      c285951 [Reynold Xin] Reset deprecation warning.
      8c1031e [Reynold Xin] inferSchema
      dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
  4. Mar 26, 2015
    • [SPARK-6117] [SQL] Improvements to DataFrame.describe() · 784fcd53
      Reynold Xin authored
      1. Slight modifications to the code to make it more readable.
      2. Added Python implementation.
      3. Updated the documentation to state that we don't guarantee the output schema for this function and it should only be used for exploratory data analysis.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5201 from rxin/df-describe and squashes the following commits:
      
      25a7834 [Reynold Xin] Reset run-tests.
      6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
    • [SPARK-6536] [PySpark] Column.inSet() in Python · f5358029
      Davies Liu authored
      ```
      >>> df[df.name.inSet("Bob", "Mike")].collect()
      [Row(age=5, name=u'Bob')]
      >>> df[df.age.inSet([1, 2, 3])].collect()
      [Row(age=2, name=u'Alice')]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5190 from davies/in and squashes the following commits:
      
      6b73a47 [Davies Liu] Column.inSet() in Python
  5. Mar 17, 2015
  6. Mar 14, 2015
    • [SPARK-6210] [SQL] use prettyString as column name in agg() · b38e073f
      Davies Liu authored
      Use prettyString instead of toString() (which includes the expression ID) as the column name in agg().
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5006 from davies/prettystring and squashes the following commits:
      
      cb1fdcf [Davies Liu] use prettyString as column name in agg()
  7. Mar 09, 2015
    • [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() · 8767565c
      Davies Liu authored
      Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until the Python GC kicks in. This causes a memory leak in collect(), which may consume a lot of memory in the JVM.
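
The effect of such a cycle can be reproduced in plain CPython. In the sketch below, `Owner` and `Member` are hypothetical stand-ins for the JavaObject/JavaMember pair (no Py4J is involved): once two objects point at each other, dropping the last outside reference does not free them, and they linger until the cyclic garbage collector runs.

```python
import gc
import weakref


class Owner:
    """Stands in for an object that holds a reference to its member."""
    def __init__(self):
        self.member = Member(self)    # owner -> member


class Member:
    """Stands in for a member that points back at its owner."""
    def __init__(self, owner):
        self.owner = owner            # member -> owner closes the cycle


gc.disable()                          # keep the cyclic collector from running on its own
obj = Owner()
probe = weakref.ref(obj)              # lets us observe the object without keeping it alive

del obj                               # the last outside reference is gone
alive_before_gc = probe() is not None # the cycle still keeps the object alive

gc.collect()                          # the cyclic garbage collector reclaims the pair
alive_after_gc = probe() is not None

gc.enable()
```

Here `alive_before_gc` is true and `alive_after_gc` is false, which is the window in which whatever the object holds on to stays pinned.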
      
      This PR changes the way collected data is sent back to Python, from a local file to a socket. This avoids any disk I/O during collect and also avoids any referrers to the Java object in Python.
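
As a rough illustration of moving data over a socket instead of a temporary file, the sketch below streams pickled items through a loopback connection inside one Python process. The function names `serve_once` and `collect_via_socket` are hypothetical, and the real change sends data from the JVM to Python; this only shows the general pattern under that simplifying assumption.

```python
import pickle
import socket
import threading


def serve_once(server_sock, items):
    """Accept one connection and stream the pickled items to it."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("wb") as out:
        for item in items:
            pickle.dump(item, out)


def collect_via_socket(items):
    """Send `items` through a loopback socket and read them back."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    sender = threading.Thread(target=serve_once, args=(server, items))
    sender.start()

    received = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        with client.makefile("rb") as stream:
            while True:
                try:
                    received.append(pickle.load(stream))
                except EOFError:       # sender closed the connection
                    break
    sender.join()
    server.close()
    return received
```

Nothing is written to disk here: the receiver reads records until the sender closes its end of the connection.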
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4923 from davies/fix_collect and squashes the following commits:
      
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
  8. Feb 27, 2015
    • [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType · e0e64ba4
      Davies Liu authored
      The __eq__ of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, many classes are created (saved in _cached_cls) and never released.
      
      Also, all instances of the same DataType have the same hash code, so a dict can end up holding many objects with the same hash code. This behaves like a hash attack and makes access to the dict very slow (depending on the CPython implementation).
      
      This PR also improves the performance of inferSchema (by avoiding the unnecessary object converter).
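
The interaction between equality, hashing, and a dict-based cache can be shown with a small pure-Python sketch. `BrokenType`, `FixedType`, and `build_cache` are hypothetical names chosen for illustration, and the exact form of the original bug is an assumption; the point is only that a key type whose equal-looking instances never match makes the cache grow on every lookup, while value-based `__eq__` with a consistent `__hash__` keeps it at one entry.

```python
class BrokenType:
    """Hypothetical type: all instances hash alike and never compare equal."""
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 1                      # every instance lands in the same bucket

    def __eq__(self, other):
        return self is other          # equal-looking types never match


class FixedType:
    """Hypothetical type with value-based equality and a matching hash."""
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.__class__.__name__, self.name))


def build_cache(type_cls, lookups):
    """Mimic a class cache keyed by data type; return it after `lookups` uses."""
    cache = {}
    for _ in range(lookups):
        key = type_cls("struct<a:int>")   # a fresh but logically identical key
        if key not in cache:
            cache[key] = object()         # stands in for a generated class
    return cache
```

With 100 lookups, `build_cache(BrokenType, 100)` ends up with 100 entries that all collide in one bucket, whereas `build_cache(FixedType, 100)` holds a single entry.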
      
      cc pwendell  JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4808 from davies/leak and squashes the following commits:
      
      6a322a4 [Davies Liu] tests refactor
      3da44fc [Davies Liu] fix __eq__ of Singleton
      534ac90 [Davies Liu] add more checks
      46999dc [Davies Liu] fix tests
      d9ae973 [Davies Liu] fix memory leak in sql
  9. Feb 26, 2015
    • [SPARK-6007][SQL] Add numRows param in DataFrame.show() · 23586575
      Jacky Li authored
      It is useful to let the user decide the number of rows to show in DataFrame.show().
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4767 from jackylk/show and squashes the following commits:
      
      a0e0f4b [Jacky Li] fix testcase
      7cdbe91 [Jacky Li] modify according to comment
      bb54537 [Jacky Li] for Java compatibility
      d7acc18 [Jacky Li] modify according to comments
      981be52 [Jacky Li] add numRows param in DataFrame.show()
  10. Feb 24, 2015
    • [SPARK-5994] [SQL] Python DataFrame documentation fixes · d641fbb3
      Davies Liu authored
      select empty should NOT be the same as select. make sure selectExpr is behaving the same.
      join param documentation
      link to source doesn't work in jekyll generated file
      cross reference of columns (i.e. enabling linking)
      show(): move df example before df.show()
      move tests in SQLContext out of docstring otherwise doc is too long
      Column.desc and .asc don't have any documentation
      in documentation, sort functions.*
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4756 from davies/df_docs and squashes the following commits:
      
      f30502c [Davies Liu] fix doc
      32f0d46 [Davies Liu] fix DataFrame docs
    • [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python. · fba11c2f
      Reynold Xin authored
      Also added desc/asc function for constructing sorting expressions more conveniently. And added a small fix to lift alias out of cast expression.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4752 from rxin/SPARK-5985 and squashes the following commits:
      
      aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
      047ad03 [Reynold Xin] Lift alias out of cast.
      c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
  11. Feb 19, 2015
    • [SPARK-5904][SQL] DataFrame API fixes. · 8ca3418e
      Reynold Xin authored
      1. Column is no longer a DataFrame to simplify class hierarchy.
      2. Don't use varargs on abstract methods (see Scala compiler bug SI-9013).
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4686 from rxin/SPARK-5904 and squashes the following commits:
      
      fd9b199 [Reynold Xin] Fixed Python tests.
      df25cef [Reynold Xin] Non final.
      5221530 [Reynold Xin] [SPARK-5904][SQL] DataFrame API fixes.
  12. Feb 18, 2015
    • [SPARK-5722] [SQL] [PySpark] infer int as LongType · aa8f10e8
      Davies Liu authored
      `int` is 64-bit on 64-bit machines (very common now), so we should infer it as LongType in Spark SQL.
      
      Also, LongType in SQL will come back as `int`.
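
A short sketch of why a 32-bit inference would be too narrow: a perfectly ordinary Python `int` can exceed the 32-bit signed range while still fitting in 64 bits. The helper `infer_int_type` and the overflow check are hypothetical, written only to illustrate the rule described above; they are not taken from the Spark source.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def fits(value, low, high):
    """Return True if value lies in the inclusive range [low, high]."""
    return low <= value <= high


def infer_int_type(value):
    """Hypothetical helper: name the SQL integer type chosen for a Python int."""
    if not fits(value, INT64_MIN, INT64_MAX):
        raise ValueError("value does not fit in 64 bits: %d" % value)
    return "LongType"          # always the 64-bit type, never the 32-bit one


big = 2**31                    # one past the largest 32-bit signed integer
```

The value `big` fails the 32-bit range check but is still inferred as `LongType`, so it would survive a round trip without truncation.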
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4666 from davies/long and squashes the following commits:
      
      6bc6cc4 [Davies Liu] infer int as LongType
    • [SPARK-5878] fix DataFrame.repartition() in Python · c1b6fa98
      Davies Liu authored
      Also add tests for distinct()
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4667 from davies/repartition and squashes the following commits:
      
      79059fd [Davies Liu] add test
      cb4915e [Davies Liu] fix repartition
  13. Feb 17, 2015
    • [SPARK-5871] output explain in Python · 3df85dcc
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4658 from davies/explain and squashes the following commits:
      
      db87ea2 [Davies Liu] output explain in Python
    • [SPARK-5859] [PySpark] [SQL] fix DataFrame Python API · d8adefef
      Davies Liu authored
      1. added explain()
      2. add isLocal()
      3. do not call show() in __repr__
      4. add foreach() and foreachPartition()
      5. add distinct()
      6. fix functions.col()/column()/lit()
      7. fix unit tests in sql/functions.py
      8. fix unicode in showString()
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4645 from davies/df6 and squashes the following commits:
      
      6b46a2c [Davies Liu] fix DataFrame Python API
  14. Feb 16, 2015
    • [SPARK-5799][SQL] Compute aggregation function on specified numeric columns · 5c78be7a
      Liang-Chi Hsieh authored
      Compute aggregation function on specified numeric columns. For example:
      
          val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")).toDataFrame("key", "value1", "value2", "rest")
          df.groupBy("key").min("value2")
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4592 from viirya/specific_cols_agg and squashes the following commits:
      
      9446896 [Liang-Chi Hsieh] For comments.
      314c4cd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg
      353fad7 [Liang-Chi Hsieh] For python unit tests.
      54ed0c4 [Liang-Chi Hsieh] Address comments.
      b079e6b [Liang-Chi Hsieh] Remove duplicate codes.
      55100fb [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg
      880c2ac [Liang-Chi Hsieh] Fix Python style checks.
      4c63a01 [Liang-Chi Hsieh] Fix pyspark.
      b1a24fc [Liang-Chi Hsieh] Address comments.
      2592f29 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into specific_cols_agg
      27069c3 [Liang-Chi Hsieh] Combine functions and add varargs annotation.
      371a3f7 [Liang-Chi Hsieh] Compute aggregation function on specified numeric columns.
  15. Feb 14, 2015
    • [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames · e98dfe62
      Reynold Xin authored
      - The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
      - toDataFrame -> toDF
      - Dsl -> functions
      - implicits moved into SQLContext.implicits
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      
      Python changes:
      - toDataFrame -> toDF
      - Dsl -> functions package
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      - add toDF functions to RDD on SQLContext init
      - add flatMap to DataFrame
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4556 from rxin/SPARK-5752 and squashes the following commits:
      
      5ef9910 [Reynold Xin] More fix
      61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
      ff5832c [Reynold Xin] Fix python
      749c675 [Reynold Xin] count(*) fixes.
      5806df0 [Reynold Xin] Fix build break again.
      d941f3d [Reynold Xin] Fixed explode compilation break.
      fe1267a [Davies Liu] flatMap
      c4afb8e [Reynold Xin] style
      d9de47f [Davies Liu] add comment
      b783994 [Davies Liu] add comment for toDF
      e2154e5 [Davies Liu] schema() -> schema
      3a1004f [Davies Liu] Dsl -> functions, toDF()
      fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
      97dd47c [Davies Liu] fix mistake
      6168f74 [Davies Liu] fix test
      1fc0199 [Davies Liu] fix test
      a075cd5 [Davies Liu] clean up, toPandas
      663d314 [Davies Liu] add test for agg('*')
      9e214d5 [Reynold Xin] count(*) fixes.
      1ed7136 [Reynold Xin] Fix build break again.
      921b2e3 [Reynold Xin] Fixed explode compilation break.
      14698d4 [Davies Liu] flatMap
      ba3e12d [Reynold Xin] style
      d08c92d [Davies Liu] add comment
      5c8b524 [Davies Liu] add comment for toDF
      a4e5e66 [Davies Liu] schema() -> schema
      d377fc9 [Davies Liu] Dsl -> functions, toDF()
      6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
  16. Feb 12, 2015
    • [SQL] Move SaveMode to SQL package. · c025a468
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4542 from yhuai/moveSaveMode and squashes the following commits:
      
      65a4425 [Yin Huai] Move SaveMode to sql package.
  17. Feb 11, 2015
    • [SPARK-5677] [SPARK-5734] [SQL] [PySpark] Python DataFrame API remaining tasks · b694eb9c
      Davies Liu authored
      1. DataFrame.renameColumn
      
      2. DataFrame.show() and __repr__
      
      3. Use simpleString() rather than jsonValue in DataFrame.dtypes
      
      4. createDataFrame from local Python data, including pandas.DataFrame
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4528 from davies/df3 and squashes the following commits:
      
      014acea [Davies Liu] fix typo
      6ba526e [Davies Liu] fix tests
      46f5f95 [Davies Liu] address comments
      6cbc154 [Davies Liu] dataframe.show() and improve dtypes
      6f94f25 [Davies Liu] create DataFrame from local Python data
  18. Feb 10, 2015
    • [SPARK-5658][SQL] Finalize DDL and write support APIs · aaf50d05
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-5658
      
      Author: Yin Huai <yhuai@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits:
      
      f3a96f7 [Yin Huai] davies's comments.
      225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests.
      2306f93 [Yin Huai] Style.
      2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      537e28f [Yin Huai] Correctly clean up temp data.
      ae4649e [Yin Huai] Fix Python test.
      609129c [Yin Huai] Doc format.
      92b6659 [Yin Huai] Python doc and other minor updates.
      cbc717f [Yin Huai] Rename dataSourceName to source.
      d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master.
      22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      d91ecb8 [Yin Huai] Fix test.
      4c76d78 [Yin Huai] Simplify APIs.
      3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      0832ce4 [Yin Huai] Fix test.
      98e7cdb [Yin Huai] Python style.
      2bf44ef [Yin Huai] Python APIs.
      c204967 [Yin Huai] Format
      a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      9ff97d8 [Yin Huai] Add SaveMode to saveAsTable.
      9b6e570 [Yin Huai] Update doc.
      c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      99950a2 [Yin Huai] Use Java enum for SaveMode.
      4679665 [Yin Huai] Remove duplicate rule.
      77d89dc [Yin Huai] Update doc.
      e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs.
      cf5703d [Yin Huai] Add checkAnswer to Java tests.
      7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      6dfd386 [Yin Huai] Add java test.
      f2f33ef [Yin Huai] Fix test.
      e702386 [Yin Huai] Apache header.
      b1e9b1b [Yin Huai] Format.
      ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      af9e9b3 [Yin Huai] DDL and write support API followup.
      2a6213a [Yin Huai] Update API names.
      e6a0b77 [Yin Huai] Update test.
      43bae01 [Yin Huai] Remove createTable from HiveContext.
      5ffc372 [Yin Huai] Add more load APIs to SQLContext.
      5390743 [Yin Huai] Add more save APIs to DataFrame.
    • [SQL] Add toString to DataFrame/Column · de80b1ba
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4436 from marmbrus/dfToString and squashes the following commits:
      
      8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString
      b72a81b [Michael Armbrust] add toString
  19. Feb 09, 2015
    • [SPARK-5469] restructure pyspark.sql into multiple files · 08488c17
      Davies Liu authored
      All the DataTypes moved into pyspark.sql.types
      
      The changes can be tracked by `--find-copies-harder -M25`
      ```
      davies@localhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
      2       5       python/docs/pyspark.ml.rst
      0       3       python/docs/pyspark.mllib.rst
      10      2       python/docs/pyspark.sql.rst
      1       1       python/pyspark/mllib/linalg.py
      21      14      python/pyspark/{mllib => sql}/__init__.py
      14      2108    python/pyspark/{sql.py => sql/context.py}
      10      1772    python/pyspark/{sql.py => sql/dataframe.py}
      7       6       python/pyspark/{sql_tests.py => sql/tests.py}
      8       1465    python/pyspark/{sql.py => sql/types.py}
      4       2       python/run-tests
      1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
      ```
      
      Also use `git blame -C -C python/pyspark/sql/context.py` to track the history.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4479 from davies/sql and squashes the following commits:
      
      1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
      2b2b983 [Davies Liu] restructure pyspark.sql