  1. Apr 01, 2015
    • [SPARK-6553] [pyspark] Support functools.partial as UDF · 98f72dfc
      ksonj authored
      Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used.
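The reason for the change can be seen in plain Python, without Spark: `functools.partial` objects and instances of callable classes have no `__name__` attribute, while every object has a `__repr__`. The following is a minimal sketch of that behavior only; `add`, `Doubler`, and `udf_label` are hypothetical names for illustration and are not part of PySpark.

```python
import functools


def add(x, y):
    return x + y


class Doubler(object):
    """A callable object; its instances have no __name__ attribute."""

    def __call__(self, x):
        return 2 * x


def udf_label(f):
    # Hypothetical helper: label any callable via __repr__, which always exists,
    # instead of __name__, which partials and callable instances lack.
    return f.__repr__()


add_one = functools.partial(add, 1)

print(hasattr(add, "__name__"))        # True
print(hasattr(add_one, "__name__"))    # False
print(hasattr(Doubler(), "__name__"))  # False

# All three kinds of callable can be labeled the same way.
for f in (add, add_one, Doubler()):
    print(udf_label(f))
```

Reading `f.__name__` on `add_one` or `Doubler()` would raise `AttributeError`, which is why a label based on `__repr__` works for all types of callables.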
      
      Author: ksonj <kson@siberie.de>
      
      Closes #5206 from ksonj/partials and squashes the following commits:
      
      ea66f3d [ksonj] Inserted blank lines for PEP8 compliance
      d81b02b [ksonj] added tests for udf with partial function and callable object
      2c76100 [ksonj] Makes UDFs work with all types of callables
      b814a12 [ksonj] support functools.partial as udf
  2. Mar 30, 2015
    • [SPARK-6119][SQL] DataFrame support for missing data handling · 67c885e3
      Reynold Xin authored
      
      This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
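As a rough illustration of what drop and fill mean for missing data, here is a plain-Python sketch over a list of dicts. It is not Spark code and not the implementation from this pull request: the helper functions below are hypothetical, they only model the assumed `how` ("any" or "all"), column-subset, and fill-with-map behavior, and they ignore details such as column types.

```python
def dropna(rows, how="any", subset=None):
    """Drop rows containing None.

    how="any" drops a row if any inspected column is None;
    how="all" drops it only if every inspected column is None.
    """
    if how not in ("any", "all"):
        raise ValueError("how ('" + how + "') should be 'any' or 'all'")
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        missing = [row[c] is None for c in cols]
        drop = any(missing) if how == "any" else all(missing)
        if not drop:
            kept.append(row)
    return kept


def fillna(rows, value, subset=None):
    """Replace None; value is a scalar or a {column: replacement} dict."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, current in row.items():
            if current is not None:
                continue
            if isinstance(value, dict):
                # Fill with a map: each column gets its own replacement.
                if col in value:
                    new_row[col] = value[col]
            elif subset is None or col in subset:
                new_row[col] = value
        filled.append(new_row)
    return filled


rows = [
    {"age": 10, "name": "Alice"},
    {"age": None, "name": "Bob"},
    {"age": None, "name": None},
]
print(dropna(rows))             # keeps only the fully populated row
print(dropna(rows, how="all"))  # drops only the row that is entirely None
print(fillna(rows, {"age": 0, "name": "unknown"}))
```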
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5274 from rxin/df-missing-value and squashes the following commits:
      
      4ee1b98 [Reynold Xin] Improve error reporting in Python.
      33a330c [Reynold Xin] Remove replace for now.
      bc4fdbb [Reynold Xin] Added documentation for replace.
      d56f5a5 [Reynold Xin] Added replace for Scala/Java.
      2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
      914a374 [Reynold Xin] fill with map.
      185c67e [Reynold Xin] Allow specifying column subsets in fill.
      749eb47 [Reynold Xin] fillna
      249b94e [Reynold Xin] Removing undefined functions.
      6a73c68 [Reynold Xin] Missing file.
      67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
      
      (cherry picked from commit b8ff2bc6)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  3. Feb 27, 2015
    • [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType · 49f2187a
      Davies Liu authored
      
      The `__eq__` of DataType is not correct, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, lots of classes are created (saved in `_cached_cls`) and never released.
      
      Also, all instances of the same DataType have the same hash code, so there can be many objects in a dict with the same hash code. This ends up like a hash attack: access to the dict becomes very slow (depending on the implementation of CPython).
      
      This PR also improves the performance of inferSchema (it avoids the unnecessary conversion of objects).
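The cache leak described above can be reproduced with a toy example in plain Python. This is only a sketch of the failure mode, not the PySpark code: `BrokenType`, `FixedType`, and `cached_class` are hypothetical stand-ins, and the real `DataType` classes and `_cached_cls` differ in detail.

```python
class BrokenType(object):
    """Toy type: equivalent instances never compare equal, and all share one hash."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return self is other  # two equivalent types look different

    def __hash__(self):
        return 1  # every instance collides in the dict


class FixedType(object):
    """Toy type with value-based equality and a matching hash."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.__class__.__name__, self.name))


def cached_class(cache, data_type):
    """Return the class cached for data_type, creating one on a miss."""
    if data_type not in cache:
        cache[data_type] = type("Row_" + data_type.name, (object,), {})
    return cache[data_type]


broken_cache, fixed_cache = {}, {}
for _ in range(100):
    cached_class(broken_cache, BrokenType("int"))
    cached_class(fixed_cache, FixedType("int"))

print(len(broken_cache))  # 100: every lookup misses, so the cache only grows
print(len(fixed_cache))   # 1: equivalent types share one cached class
```

With the broken version, every lookup also has to compare against all colliding keys already in the dict, which is the slow access pattern the message refers to.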
      
      cc pwendell  JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4808 from davies/leak and squashes the following commits:
      
      6a322a4 [Davies Liu] tests refactor
      3da44fc [Davies Liu] fix __eq__ of Singleton
      534ac90 [Davies Liu] add more checks
      46999dc [Davies Liu] fix tests
      d9ae973 [Davies Liu] fix memory leak in sql
      
      (cherry picked from commit e0e64ba4)
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
  4. Feb 24, 2015
    • [SPARK-5994] [SQL] Python DataFrame documentation fixes · 5c421e03
      Davies Liu authored
      
      - select empty should NOT be the same as select; make sure selectExpr behaves the same way
      - join param documentation
      - link to source doesn't work in the Jekyll-generated file
      - cross reference of columns (i.e. enabling linking)
      - show(): move df example before df.show()
      - move tests in SQLContext out of the docstring, otherwise the doc is too long
      - Column.desc and .asc don't have any documentation
      - in documentation, sort functions.*
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4756 from davies/df_docs and squashes the following commits:
      
      f30502c [Davies Liu] fix doc
      32f0d46 [Davies Liu] fix DataFrame docs
      
      (cherry picked from commit d641fbb3)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
  5. Feb 20, 2015
  6. Feb 18, 2015
  7. Feb 17, 2015
  8. Feb 14, 2015
    • [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames · ba91bf5f
      Reynold Xin authored
      
      - The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
      - toDataFrame -> toDF
      - Dsl -> functions
      - implicits moved into SQLContext.implicits
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      
      Python changes:
      - toDataFrame -> toDF
      - Dsl -> functions package
      - addColumn -> withColumn
      - renameColumn -> withColumnRenamed
      - add toDF functions to RDD on SQLContext init
      - add flatMap to DataFrame
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4556 from rxin/SPARK-5752 and squashes the following commits:
      
      5ef9910 [Reynold Xin] More fix
      61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
      ff5832c [Reynold Xin] Fix python
      749c675 [Reynold Xin] count(*) fixes.
      5806df0 [Reynold Xin] Fix build break again.
      d941f3d [Reynold Xin] Fixed explode compilation break.
      fe1267a [Davies Liu] flatMap
      c4afb8e [Reynold Xin] style
      d9de47f [Davies Liu] add comment
      b783994 [Davies Liu] add comment for toDF
      e2154e5 [Davies Liu] schema() -> schema
      3a1004f [Davies Liu] Dsl -> functions, toDF()
      fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
      97dd47c [Davies Liu] fix mistake
      6168f74 [Davies Liu] fix test
      1fc0199 [Davies Liu] fix test
      a075cd5 [Davies Liu] clean up, toPandas
      663d314 [Davies Liu] add test for agg('*')
      9e214d5 [Reynold Xin] count(*) fixes.
      1ed7136 [Reynold Xin] Fix build break again.
      921b2e3 [Reynold Xin] Fixed explode compilation break.
      14698d4 [Davies Liu] flatMap
      ba3e12d [Reynold Xin] style
      d08c92d [Davies Liu] add comment
      5c8b524 [Davies Liu] add comment for toDF
      a4e5e66 [Davies Liu] schema() -> schema
      d377fc9 [Davies Liu] Dsl -> functions, toDF()
      6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
      807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
      
      (cherry picked from commit e98dfe62)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  9. Feb 11, 2015
    • [SPARK-5677] [SPARK-5734] [SQL] [PySpark] Python DataFrame API remaining tasks · d66aae21
      Davies Liu authored
      
      1. DataFrame.renameColumn
      
      2. DataFrame.show() and `__repr__`
      
      3. Use simpleString() rather than jsonValue in DataFrame.dtypes
      
      4. createDataFrame from local Python data, including pandas.DataFrame
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4528 from davies/df3 and squashes the following commits:
      
      014acea [Davies Liu] fix typo
      6ba526e [Davies Liu] fix tests
      46f5f95 [Davies Liu] address comments
      6cbc154 [Davies Liu] dataframe.show() and improve dtypes
      6f94f25 [Davies Liu] create DataFrame from local Python data
      
      (cherry picked from commit b694eb9c)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  10. Feb 10, 2015
    • [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns · 1056c5b1
      Davies Liu authored
      
      Deprecate inferSchema() and applySchema(); use createDataFrame() instead, which can take an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4498 from davies/create and squashes the following commits:
      
      08469c1 [Davies Liu] remove Scala/Java API for now
      c80a7a9 [Davies Liu] fix hive test
      d1bd8f2 [Davies Liu] cleanup applySchema
      9526e97 [Davies Liu] createDataFrame from RDD with columns
      
      (cherry picked from commit ea602840)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
    • [SPARK-5658][SQL] Finalize DDL and write support APIs · a21090eb
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-5658
      
      
      
      Author: Yin Huai <yhuai@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits:
      
      f3a96f7 [Yin Huai] davies's comments.
      225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests.
      2306f93 [Yin Huai] Style.
      2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      537e28f [Yin Huai] Correctly clean up temp data.
      ae4649e [Yin Huai] Fix Python test.
      609129c [Yin Huai] Doc format.
      92b6659 [Yin Huai] Python doc and other minor updates.
      cbc717f [Yin Huai] Rename dataSourceName to source.
      d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master.
      22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      d91ecb8 [Yin Huai] Fix test.
      4c76d78 [Yin Huai] Simplify APIs.
      3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      0832ce4 [Yin Huai] Fix test.
      98e7cdb [Yin Huai] Python style.
      2bf44ef [Yin Huai] Python APIs.
      c204967 [Yin Huai] Format
      a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      9ff97d8 [Yin Huai] Add SaveMode to saveAsTable.
      9b6e570 [Yin Huai] Update doc.
      c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      99950a2 [Yin Huai] Use Java enum for SaveMode.
      4679665 [Yin Huai] Remove duplicate rule.
      77d89dc [Yin Huai] Update doc.
      e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs.
      cf5703d [Yin Huai] Add checkAnswer to Java tests.
      7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      6dfd386 [Yin Huai] Add java test.
      f2f33ef [Yin Huai] Fix test.
      e702386 [Yin Huai] Apache header.
      b1e9b1b [Yin Huai] Format.
      ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      af9e9b3 [Yin Huai] DDL and write support API followup.
      2a6213a [Yin Huai] Update API names.
      e6a0b77 [Yin Huai] Update test.
      43bae01 [Yin Huai] Remove createTable from HiveContext.
      5ffc372 [Yin Huai] Add more load APIs to SQLContext.
      5390743 [Yin Huai] Add more save APIs to DataFrame.
      
      (cherry picked from commit aaf50d05)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
  11. Feb 09, 2015
    • [SPARK-5469] restructure pyspark.sql into multiple files · f0562b42
      Davies Liu authored
      
      All the DataTypes moved into pyspark.sql.types
      
      The changes can be tracked by `--find-copies-harder -M25`
      ```
      davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
      2       5       python/docs/pyspark.ml.rst
      0       3       python/docs/pyspark.mllib.rst
      10      2       python/docs/pyspark.sql.rst
      1       1       python/pyspark/mllib/linalg.py
      21      14      python/pyspark/{mllib => sql}/__init__.py
      14      2108    python/pyspark/{sql.py => sql/context.py}
      10      1772    python/pyspark/{sql.py => sql/dataframe.py}
      7       6       python/pyspark/{sql_tests.py => sql/tests.py}
      8       1465    python/pyspark/{sql.py => sql/types.py}
      4       2       python/run-tests
      1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
      ```
      
      Also use `git blame -C -C python/pyspark/sql/context.py` to track the history.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4479 from davies/sql and squashes the following commits:
      
      1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
      2b2b983 [Davies Liu] restructure pyspark.sql
      
      (cherry picked from commit 08488c17)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
  12. Feb 03, 2015
    • [SPARK-5554] [SQL] [PySpark] add more tests for DataFrame Python API · 4640623b
      Davies Liu authored
      
      Add more tests and docs for DataFrame Python API, improve test coverage, fix bugs.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4331 from davies/fix_df and squashes the following commits:
      
      dd9919f [Davies Liu] fix tests
      467332c [Davies Liu] support string in cast()
      83c92fe [Davies Liu] address comments
      c052f6f [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
      8dd19a9 [Davies Liu] fix tests in python 2.6
      35ccb9f [Davies Liu] fix build
      78ebcfa [Davies Liu] add sql_test.py in run_tests
      9ab78b4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
      6040ba7 [Davies Liu] fix docs
      3ab2661 [Davies Liu] add more tests for DataFrame
      
      (cherry picked from commit 068c0e2e)
      Signed-off-by: Reynold Xin <rxin@databricks.com>